How Valuable Is Company-Specific Data Compared to Multi-Company Data for Software Cost Estimation?

Isabella Wieczorek
Fraunhofer Institute for Experimental Software Engineering
67663 Kaiserslautern, Germany
+49 6301 707 255
[email protected]

Abstract
The objective of this paper is to investigate the pertinent question of whether multi-organizational data is valuable for software project cost estimation. Local, company-specific data is widely believed to provide a better basis for accurate estimates. On the other hand, multi-organizational databases provide an opportunity for fast data accumulation and shared information benefits. This paper therefore trades off the potential advantages and drawbacks of using local data as compared to multi-organizational data. Motivated by the results of previous investigations, we further analyzed a large cost database from Finland that collects standard cost factors and includes information on six individual companies, each of which provided data for more than ten projects. This information was used to compare the accuracy of company-specific (local) and company-external (global) cost models. Consistent with some of the previous investigations in this area, the general trends in the results are surprising: company-specific models do not seem to yield better results than the company-external models. Our results are based on applying two standard statistical estimation methods (OLS regression, analysis of variance) and Analogy-based estimation.

Melanie Ruhe
University of Kaiserslautern
Department of Computer Science
67661 Kaiserslautern, Germany
[email protected]

1. Introduction
Delivering a software product on time, within budget, and to an agreed level of quality is a critical concern for software organizations. Accurate estimates are crucial for better planning, monitoring, and control. Research has therefore focused extensively on estimation methods from a variety of fields [8]. Data-driven cost estimation models can be based on company-specific (local) project data or on multi-organizational (global) data that is external to a company. Company-specific, homogeneous data is widely believed to form a good basis for accurate cost estimates, as it allows for a number of advantages, such as better control of the data collection process, better understanding of the measurement methods and attributes used to characterize the projects, the possibility to define tailored cost drivers, and higher data homogeneity [6]. On the other hand, practitioners often lack explicit cost and resource data collected from past projects. This is because local data collection is an expensive, time-consuming investment that needs to be justified. Moreover, company-specific data collection accumulates data only at a slow pace, as most organizations complete a limited number of projects every year. Multi-organizational databases, in which organizations share project data collected in a consistent manner, can address these issues. Unfortunately, they have problems of their own: collecting consistent data may turn out to be difficult, and trends may differ significantly across organizations. Their big advantage, however, is that their administrators offer a standardized channel of data collection. The data can accumulate rapidly, and the methods of data collection, analysis, and distribution are cost effective. Further advantages are the use of such data for benchmarking purposes or as a basis for calibration in order to derive a company-specific model. If estimates derived from multi-organizational data are comparable in accuracy to estimates derived from company-specific data, then there are benefits to be drawn

from multi-organizational data repositories. There are also important implications for project managers, since projects can then be compared in the light of industry-wide practices. Project data shared across organizations could offer commercial value and support longer-term considerations for effort estimation. Software organizations have to make strategic decisions on whether sanitized multi-organizational databases are worth developing. The pertinent question, therefore, is whether multi-organizational data could provide a means to better exploit data for cost estimation, especially for companies without locally collected data available. This question is tackled within the framework of this paper. We investigate the feasibility of multi-organizational databases and weigh whether to start collecting local data or to contribute to multi-company data. In recent years it has become possible to gain access to large public-domain data sets and to use these data for cost estimation. Recent publications have investigated the feasibility of using such large multi-company data sets for cost estimation [4, 5, 13, 14]. Since the results have been contradictory so far, the current study further investigates this issue using information on six individual companies that provided data for the so-called Laturi database. The rest of the paper is structured as follows: Section 2 discusses related work. Section 3 describes the data set and the data preparation. The design of this study, the estimation methods applied, and the evaluation criteria are presented in Section 4. Section 5 summarizes and discusses the results of the analysis. Finally, Section 6 presents the conclusions and a discussion of practical implications.

2. Related Work
Recently, several pieces of research have been completed that investigate the feasibility of global and local data collection for cost estimation. They address, among other things, the question: what are the benefits and drawbacks of using organization-specific data as compared to a multi-organization data set? These studies mainly comprised two distinct application domains, namely business applications and applications in the space and military domain. The first study [4] was based on the so-called Laturi database, which included 206 business software projects from 26 companies in Finland. This data contains information that allows for the identification of projects from individual companies. Using the company that provided the largest proportion of projects (63 projects) and the rest of the database, this study emulated two situations: (1) a company has only external data available

to build a cost model, and (2) a company has its own data available to build a cost estimation model. The estimation models were built applying a cross-section of commonly used estimation methods, such as ordinary least squares regression [22], analysis of variance [15], regression trees [3], and Analogy [23]. The results revealed that there was no significant advantage in using local, company-specific data to build estimates over using multi-organizational data external to that specific company. The second study [5] was a replication of the first study using the European Space Agency (ESA) data set, which includes 166 mainly space and military projects. Again, a company that provided the largest proportion of the data (29 projects) and the rest of the database were used to address the issue of local vs. global software cost data collection. Consistent with the first study, there was no obvious advantage in using local data to build cost estimates over using global data external to that specific company. A third study in this area applied two modeling methods to the International Software Benchmarking Standards Group (ISBSG, Release 5) data and data from an Australian company (Megatec) [13]. Estimates derived from the ISBSG data were compared with estimates derived from the Megatec data. At the time of this study, Megatec did not contribute to the ISBSG repository. Thus, the variables collected for Megatec had to be matched with the project attributes collected within ISBSG. This resulted in 145 usable projects from ISBSG and 19 projects from Megatec. The two modeling methods applied were ordinary least squares regression and different variants of Analogy. The results show that significantly more accurate estimates could be built based on the Megatec data than on the multi-company data from ISBSG. This was mainly due to the higher homogeneity and higher productivity of Megatec compared to the considered ISBSG projects.
A fourth study was based on 324 projects of the ISBSG database (Release 6) [14]. This study was a further replication of the two previous studies on the ESA and Laturi data [4, 5]. To tackle the issue of the benefits of local data collection, it was possible to identify information on one company that provided 14 projects to this database. This fourth investigation showed significantly more accurate results when using company-specific data. The main reason for the more accurate company-specific results was the large difference in average productivity and effort for the one company compared to the rest of the data repository. This result is in contrast to the ESA and Laturi studies [4, 5] but consistent with the ISBSG-Megatec study [13].

To summarize these results, Table 1 lists the four studies together with their main characteristics and outcomes.

Table 1: Summary of Previous Work

Study         Database        Application Domains                   Countries                Whole       One company  Significant difference of
                                                                                                                      local vs. global models?
Study 1 [4]   Laturi          management and information systems    Europe                   206         63           No
Study 2 [5]   ESA             mainly aerospace, military, industry  Europe                   166         28           No
Study 3 [13]  ISBSG, Megatec  mixed                                 ISBSG: worldwide,        164=145+19  19           Yes
                                                                    Megatec: Australia
Study 4 [14]  ISBSG           mixed                                 Worldwide                324         14           Yes

Contradictory trends can be observed in the results. The studies in which a large proportion of the whole data came from one single company failed to show any obvious benefits of local data collection. It should be noted that the data sets used in those studies came from European companies and collected typical, standard cost drivers [4, 5]. Moreover, the data collection followed rigorous quality assurance procedures, and the projects belonged to similar, relatively consistent application domains. On the other hand, the cases in which a single company provided only a very small proportion of, or even no, data to the whole repository showed significant benefits of local data collection [13, 14]. Moreover, these data were collected from companies all over the world, and no rigorous quality assurance procedures were followed to validate the data provided. Rather, quality levels are attached indicating the perceived quality of the data. In addition, the project attributes collected are not typical cost drivers, but rather standard project characteristics [12]. The current study further explores the issue addressed in the previous studies and discusses the results obtained here in the light of the previously obtained results. We utilize all the information on individual companies collected in the Laturi database [4]. We compare the accuracy of cost models for six companies that each provided more than ten projects. These models are then compared to models that are based on multi-organizational data external to each individual company.

3. Data Set Description
The Laturi database is one of the largest public experience databases in Europe. The Laturi project started

in 1990 in close cooperation with 16 companies in Finland (STTF). The goal of the project has been to develop a family of methods for advanced estimation needs. Four goals have been pursued: (1) the improvement of existing sizing methods and models; (2) the development of new productivity factors and analysis approaches that work better in the European context than the well-known COCOMO model; (3) the improvement of the estimation and planning processes in the contributing companies; and (4) the motivation of the companies to start measurement programs with some core data that should be comparable with the cross-industrial data collected in the shared database. Software Technology Transfer Finland Ltd. was established in 1995. It maintains the database and has developed a tool (Experience Pro) that supports a variety of methods for cost estimation using the Laturi database, for example software sizing, effort and schedule estimation, and benchmarking. Companies buy the Experience tool and pay an annual maintenance fee. In return, they receive the tool incorporating the database, new versions of the software, and updated data sets. Companies can add their own data to the tool and are also given an incentive to donate their data to the shared database through a reduction of the maintenance fee for each project contributed. The validity and comparability of the data are assured, as all companies collect data using the same tool and the value of every variable is precisely defined. Moreover, companies providing data are individually contacted in order to verify and check their submissions. At the time of our analysis, the database consisted of 206 projects from 26 different companies. The projects are mainly business applications in the banking (38%), insurance (27%), wholesale/retail (9%), public administration (7%), and manufacturing (19%) sectors. The whole database comprises 144,517 Function Points in total.
Six companies provided data for more than ten projects. One company provided a large proportion (one third) of the whole database. Together, the six companies provided data on 119 projects in total. The other 87 projects are spread over 20 companies. Using this information, we were able to address important issues regarding the usefulness of local vs. global data collection. Figure 1 illustrates the distribution of projects by companies.

Figure 1. Laturi Data Set by Company (distribution of the 206 projects over Companies 1-6 and the other companies)

The database includes detailed function point data, actual effort and schedules, as well as numerous project attributes. The system size is measured in so-called Experience Function Points, a variant of Albrecht's Function Point measure [19]. Productivity is measured as Function Points per person-hour. The database collects 15 standard influential factors on effort that are, similar to the COCOMO factors, measured on an ordinal scale. The variables considered in this analysis are presented in Table 2.

Table 2. Variables used in this study

Abbreviation      Definition
Effort (ph)       Work effort spent for development, from planning to installation/release and user training
System Size (FP)  System size measured in unadjusted Function Points
APP               Primary intended use of the application (Customer Service, Management Information, Office Information, Process Control, Automation, Network Processing, Transaction Processing, Production Control, On-Line Service)
BRA               Type of organization
HAR               Target platform (Networked, Mainframe, PC, Mini computer, Combined)
F01 - F15         15 productivity factors
Prod              Productivity = FP/ph

The 15 productivity factors are:
F01  Customer Participation (how actively the client takes part in the development work)
F02  Development Environment (capacity of the tool resources and equipment during the project)
F03  Staff Availability (the availability of the software personnel during the project)
F04  Level and Use of Standards (the quality of the existing standards applied in the project)
F05  Level and Use of Methods
F06  Level and Use of Tools
F07  Logical Complexity of the Software (computing, I/O needs, and user interface requirements)
F08  Requirements Volatility
F09  Quality Requirements
F10  Efficiency Requirements
F11  Installation Requirements
F12  Analysis Skills of Staff
F13  Application Experience of Staff
F14  Tools Skills of Staff
F15  Project and Team Skills of Staff

Table 3 summarizes the descriptive statistics for the variables organization type, effort, system size, and productivity. The table shows the results for the whole database and for the six main contributing companies.

Table 3. Laturi Data Set Profile

                      Whole DB  Comp 1  Comp 2  Comp 3  Comp 4  Comp 5  Comp 6
No. Projects             206      63      13      12      11      10      10

Organization type (whole DB): Bank 79, Insurance 56, Manufacturing 38, Wholesale/Retail 19, Public Administration 14; all 63 projects of Company 1 come from the banking sector.

Effort (ph)
  Min                    250     583     480     780     918     592    1330
  Mean                  6645    8110    2426   14546   10505    5220   10922
  Median                3816    5100    1979   15800    6290    4182    8649
  Max                  63694   63694    6030   24788   51100   17745   26670
  StdDev                8684   10454    1862    7414   14183    5170    7575

System Size (FP)
  Min                     33      48     219     189     129     137     176
  Mean                   695     671     593    1215     693     528     804
  Median                 474     387     370    1138     546     422     707
  Max                   3634    3634    1613    2155    2105    1619    1364
  StdDev                 676     777     494     610     581     441     358

Productivity (FP/ph)
  Min                  0.026   0.026   0.150   0.052   0.039   0.033   0.043
  Mean                 0.177   0.108   0.267   0.102   0.091   0.171   0.093
  Median               0.128   0.092   0.236   0.099   0.091   0.106   0.080
  Max                  1.349   0.529   0.542   0.242   0.152   0.679   0.173
  StdDev               0.165   0.079   0.113   0.054   0.039   0.188   0.041

The projects from company 2 consumed, on average, considerably less effort than the rest of the database; these projects were also the most productive in terms of system size per effort. Projects from company 3, in contrast, exhibit a much higher average effort consumption than the rest of the database, which goes together with a large average system size for this company's projects. For all six companies, the target hardware platform was mainly mainframe. No specific primary intended use of application was targeted by any of the individual companies.

4. Research Method This section introduces the concepts of the methods applied, briefly describes the design of the study, and provides the criteria used to evaluate the prediction accuracy.

4.1. Modeling Methods Applied
Three of the previous studies proposed the application of a comprehensive set of cost estimation methods [4, 5, 14]. Since for five of the six companies we only had data on up to 13 projects, we did not apply set reduction methods such as regression trees (CART) [3]. We instead concentrated on a subset of commonly used methods. The methods applied were different variants of ordinary least squares regression (OLS) [22], stepwise analysis of variance (stepwise ANOVA) [15], and Analogy-based estimation [23]. We abandoned the application of robust regression to the individual companies, since no outlying observations could be identified for these companies [21]. Nevertheless, we applied robust regression to build models based on multi-organizational data external to each individual company. This was done consistently with the procedure described in [14]. OLS Regression OLS is the modeling technique most commonly applied to software cost estimation [8]. Consistent with the previous Laturi study [4], we applied multivariate regression analysis, fitting the data to an exponential model specification. Exponential relationships have also been modeled in many other investigations [4, 6]. We performed logarithmic transformations of the ratio-scaled variables and generated dummy variables [22] for the nominal-scaled variables. Ordinal-scaled variables were treated as if they were measured on an interval scale; this has been shown to be reasonable in practice [25]. A mixed stepwise procedure was applied (probability to enter/leave the model = 0.05) to determine the variables having a significant impact on effort.
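To illustrate the exponential model form, the following is a minimal sketch of a log-transformed OLS fit. The project figures are made up for the example (they are not Laturi data), and a univariate fit stands in for the full stepwise multivariate procedure:

```python
import numpy as np

# Hypothetical project data (illustrative only, not from the Laturi database):
# effort in person-hours, size in unadjusted Function Points.
effort = np.array([3200.0, 5400.0, 1250.0, 8900.0, 2100.0, 4700.0])
size = np.array([420.0, 610.0, 180.0, 950.0, 260.0, 540.0])

# The exponential model Effort = a * Size^b becomes linear after a
# log transform: ln(Effort) = ln(a) + b * ln(Size).
X = np.column_stack([np.ones(len(size)), np.log(size)])
y = np.log(effort)

# Ordinary least squares fit of the log-log model.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ln_a, b = coef

def predict_effort(fp):
    """Back-transform the log-linear prediction to person-hours."""
    return float(np.exp(ln_a + b * np.log(fp)))

print(round(float(b), 2))             # fitted size exponent
print(round(predict_effort(500.0)))   # predicted effort for a 500 FP project
```

Dummy variables for nominal factors and the stepwise variable selection would extend the design matrix X in the same fashion.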

For the subsets with fewer than 15 projects (i.e., for five of the six companies), we performed a univariate regression analysis using system size as the independent variable. Stepwise Analysis of Variance Stepwise analysis of variance (ANOVA) combines a variety of techniques to analyze the variance of unbalanced data [15]. It applies ANOVA to categorical variables and OLS regression to continuous variables in a stepwise manner. The stepwise procedure includes one independent variable (the most significant) at a time in a linear model. Its effect is removed by transforming the dependent variable into a residual variable. Then the impact of each remaining independent variable on the residual is assessed to identify the next variable to include in the model. The analysis is repeated until all significant variables are found. We transformed the continuous variables by applying their natural logarithms to ensure normally distributed dependent variables. Consistent with related, previous studies [4, 5, 14], we used effort as the dependent variable. A second variant uses productivity as the dependent variable, which is in line with [15]. In this paper, we only present the best results obtained; these were obtained when using effort as the dependent variable. Analogy-based estimation Analogy is a common problem-solving technique. Its potential for software cost estimation has been evaluated and confirmed in many studies [2, 18, 23, 27]. In software effort estimation, Analogy-based estimation involves the comparison of a new (target) project with completed (source) projects. The basic idea is to identify the source projects that are most similar to the new project. The major issues are (1) to select relevant project attributes (in our case cost drivers), (2) to define an appropriate similarity function, and (3) to decide upon the number of similar source projects (analogues) to consider for estimation.
Some strategies exist to determine relevant project attributes [10, 23]. Applying one of these strategies to our study led to selecting only system size as a relevant attribute. However, to ensure a focused selection of similar projects, it is necessary to take more variables into account. Therefore, we used the variables identified as significant by the stepwise ANOVA procedure as the relevant attributes for Analogy. Consistent with [4, 5, 14], we applied a simple measure proposed by Shepperd [17] in the Angel tool: the unweighted Euclidean distance using variables normalized between 0 and 1.

For effort prediction, one may consider one or more source projects. However, some studies report no significant differences in accuracy when using different numbers of analogues [4, 23]. Another recent study [1] concluded that the best choice for the number of analogues was one project: it showed that an increase in the number of analogues causes an increase in the mean relative error, demonstrated by creating empirical distributions of relative error values for an increasing number of analogues on a particular data set. For the sake of simplicity and based on the results of these studies, our predictions were based on the most similar project. We applied three different variants of Analogy based on the suggestions in [4, 23, 27]. In this paper, we only present the best results obtained; these were obtained when using productivity as the dependent variable and then applying linear size adjustment to the effort prediction, as suggested by [27]. The linear size adjustment addresses the differences in size between the target (estimated) and source (most similar) project. The equation below defines how the predicted effort was adjusted:

Effort_ESTIMATED = (FP_ESTIMATED / FP_SOURCE) x Effort_SOURCE
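The nearest-neighbour selection and the size adjustment can be sketched as follows; the source projects and attribute values are illustrative, not the actual Laturi variables:

```python
import numpy as np

# Hypothetical source projects: normalized attribute values (between 0
# and 1), plus actual size (FP) and effort (person-hours) per project.
attrs = np.array([
    [0.20, 0.10],
    [0.80, 0.90],
    [0.50, 0.40],
])
size = np.array([300.0, 900.0, 500.0])
effort = np.array([2400.0, 8100.0, 4300.0])

def estimate_by_analogy(target_attrs, target_size):
    """1-nearest-neighbour estimate with linear size adjustment."""
    # Unweighted Euclidean distance over the normalized attributes.
    dists = np.sqrt(((attrs - target_attrs) ** 2).sum(axis=1))
    i = int(np.argmin(dists))  # most similar source project
    # Linear size adjustment: scale the source effort by the ratio
    # of target size to source size.
    return float(effort[i] * target_size / size[i])

# A target project whose attributes are closest to the third source project:
print(estimate_by_analogy(np.array([0.45, 0.45]), 550.0))  # ~ 4300 * 550/500
```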

4.2. Study Design
Data sets are split in various ways to fulfill specific research objectives [e.g., 2, 10, 18, 20, 27]. This section describes how the data used for the performed case studies was partitioned in order to achieve the intended objective. Using a particular data set to build a cost estimation model, and then computing the accuracy of the model using the same data set, leads to an optimistic accuracy evaluation [28]. To avoid this, we applied a procedure known as cross-validation. The basic idea is to use different subsets for model building (training sets) and model evaluation (test sets) [11]. We want to show the trade-off between the possibly higher prediction accuracy from company-specific data and the application of available multi-organizational data. Therefore, we compare the performance of estimation methods in two contexts: (a) when models are built using locally collected, company-specific data, and (b) when models are built using multi-organizational data external to a specific company. To determine the accuracy of local, company-specific cost estimation models, we used the projects coming from each of the six individual companies that provided at least ten projects. For each company's data, we applied a leave-one-out strategy or randomly divided the company's data into partitions of almost equal size. For each test set, we used the remaining projects as the training set. This situation is represented in Figure 2. The overall accuracy was aggregated across the test sets. Calculating accuracy in this manner indicates the accuracy to be expected if an organization builds a model using its own data set and then uses that model to predict the cost of a new project.

Figure 2. Building local, company-specific cost models
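The leave-one-out evaluation of a single company's data can be sketched as follows, using a univariate log-log model as a stand-in for the actual estimation methods and made-up project data:

```python
import numpy as np

def leave_one_out_mre(size, effort):
    """Leave-one-out cross-validation of a univariate log-log model,
    returning one MRE value per held-out project."""
    n = len(size)
    mres = []
    for i in range(n):
        train = np.arange(n) != i  # all projects except project i
        X = np.column_stack([np.ones(train.sum()), np.log(size[train])])
        coef, *_ = np.linalg.lstsq(X, np.log(effort[train]), rcond=None)
        pred = np.exp(coef[0] + coef[1] * np.log(size[i]))
        mres.append(abs(effort[i] - pred) / effort[i])
    return mres

# Hypothetical single-company data set (illustrative only).
size = np.array([180.0, 260.0, 420.0, 540.0, 610.0, 950.0])
effort = np.array([1250.0, 2100.0, 3200.0, 4700.0, 5400.0, 8900.0])
mres = leave_one_out_mre(size, effort)
print(round(float(np.mean(mres)), 3))  # aggregated mean MRE across test sets
```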

In the second situation, a company has only external data available to build an estimation model for projects within the company. The projects from each of the six companies were used in turn as test samples, and predictions were made for each of the six companies. Accordingly, for each company, we used the whole data set minus the projects of the selected holdout company as a training set. Note that the additional chunk of 87 projects from 20 companies was never used as a test sample. This resulted in six different training/test set combinations, as depicted in Figure 3. Calculating accuracy in this manner allows for accuracy comparisons of company-external cost models. It indicates the accuracy of using external, multi-organizational data for building a cost estimation model and then testing it on an organization's projects.

Figure 3. Building global, company-external cost models
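The company-external partitioning can be sketched as a leave-one-company-out split; the company labels and project figures below are illustrative:

```python
import numpy as np

# Hypothetical repository: one company label per project, plus size and
# effort (the modeling step itself is elided). Label 9 stands for the
# pool of other companies that is only ever used for training.
company = np.array([1, 1, 1, 2, 2, 3, 3, 3, 9, 9])
size    = np.array([300., 420., 510., 180., 260., 700., 950., 610., 330., 480.])
effort  = np.array([2400., 3200., 4100., 1250., 2100., 6100., 8900., 5400., 2700., 3900.])

def company_external_split(held_out):
    """Train on all projects from other companies, test on one company.
    Mirrors the design above: each contributing company is held out in turn,
    while the 'other companies' pool is never used as a test sample."""
    test = company == held_out
    train = ~test
    return (size[train], effort[train]), (size[test], effort[test])

(train_x, train_y), (test_x, test_y) = company_external_split(2)
print(len(train_x), len(test_x))  # 8 training projects, 2 test projects
```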


4.3. Evaluation Criteria
The evaluation of the cost estimation models was done using the following common criteria [9]. The magnitude of relative error (MRE), the relative deviation of the estimate from the actual effort of a project, is defined as:

MRE = |Effort_ACTUAL - Effort_ESTIMATED| / Effort_ACTUAL

The MRE is calculated for each project in the data sets. Either the mean MRE or the median MRE aggregates the multiple observations. In addition, we used the prediction level Pred. This measure is often used in the literature and gives the proportion of observations within a given level of accuracy:

Pred(l) = k / N

where N is the total number of observations and k is the number of observations with an MRE less than or equal to l. A common value for l is 0.25, which is used in this study as well. Pred(0.25) gives the percentage of projects that were predicted with an MRE of 0.25 or less. Conte et al. [9] suggest as acceptable thresholds a mean MRE of less than 0.25 and a Pred(0.25) of at least 0.75. In general, the accuracy of an estimation technique increases with Pred(0.25) and decreases with the MRE and the mean MRE. For testing the statistical significance between paired samples we used the Wilcoxon matched-pairs test, a non-parametric analogue to the t-test [24].
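The two criteria can be computed as follows; the actual and estimated efforts are illustrative numbers:

```python
def mre(actual, estimated):
    """Magnitude of relative error for one project."""
    return abs(actual - estimated) / actual

def pred(mres, l=0.25):
    """Proportion of projects with an MRE less than or equal to l."""
    return sum(1 for m in mres if m <= l) / len(mres)

# Illustrative actual vs. estimated efforts (person-hours).
actuals   = [1000.0, 2000.0, 4000.0, 8000.0]
estimates = [ 900.0, 2600.0, 4200.0, 5600.0]
mres = [mre(a, e) for a, e in zip(actuals, estimates)]

print([round(m, 2) for m in mres])  # [0.1, 0.3, 0.05, 0.3]
print(pred(mres))                   # 0.5 -> two of four within 25%
```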

5. Analysis Results
The following presents the results of applying the previously described estimation methods to the Laturi data set. Table 4 summarizes the prediction accuracy when applying each estimation method to each of the companies using its own, local data. We provide the mean and median MRE values, as well as the results for Pred(0.25).

Table 4. Accuracy Results - local models

Comp.  MRE/Pred(0.25)  OLS    ANOVA  Analogy
1      Mean            0.67   0.62   0.64
       Median          0.40   0.43   0.39
       Pred(0.25)      0.22   0.33   0.33
2      Mean            0.34   0.34   0.27
       Median          0.20   0.23   0.20
       Pred(0.25)      0.69   0.54   0.69
3      Mean            0.54   0.53   0.52
       Median          0.37   0.37   0.22
       Pred(0.25)      0.33   0.33   0.58
4      Mean            0.49   0.49   0.34
       Median          0.55   0.55   0.25
       Pred(0.25)      0.27   0.27   0.55
5      Mean            1.10   1.36   0.54
       Median          0.68   0.99   0.32
       Pred(0.25)      0.20   0.10   0.40
6      Mean            0.39   0.39   0.30
       Median          0.26   0.26   0.27
       Pred(0.25)      0.50   0.50   0.50

It can be observed that the statistical methods obtained very poor accuracy results for company 5. Calculating the corresponding R2 values, the variance explained by the local statistical models for company 5 was also very low: only 2% and 16% of the variance in the data was explained when applying OLS and ANOVA, respectively. In addition, here the differences in accuracy are significant across the three methods. Apart from this, the prediction accuracy was similar across the three methods applied. Consistently, all three methods obtained their most accurate results when using company 2 data. This was confirmed when calculating the R2 values and the Pred(0.25) values for this company. The variables modeled by the three methods are depicted in Table 5. For each company and each method, we list the set of independent variables. We performed an n-fold cross-validation to generate the local models for each of the companies (Section 4); therefore, we list all the variables that appeared to be significant in any of the train/test set runs.

Table 5. Variables modeled - local models

Company  OLS                                         ANOVA                        Analogy
1        FP, F01, F03, F04, F08, F11, F14, F15       FP, APP, F04, F08, F09, F14  FP, APP, F08, F11, F14
2        FP, HAR, F01, F03, F04, F05, F09, F10, F12  FP                           FP, APP, HAR, F01, F03, F04, F05, F09, F10, F12
3        FP                                          FP, F04, F06, F07, F10       FP, F06, F11
4        FP                                          FP                           FP, F04, F07, F10
5        FP                                          FP, F02, F05, F08, F11       FP, F02, F05, F08, F10, F11
6        FP                                          FP                           FP, F12, F14

System size (variable FP) was identified as the most important cost driver by all methods and for all companies. In addition, installation requirements (variable F11) and efficiency requirements (variable F10) had a significant impact on cost for four of the six companies. Table 6 summarizes the accuracy of the effort predictions for each company when using the rest of the database for model building. The mean and median MRE, and the values for Pred(0.25), are provided.

Table 6. Accuracy Results - global models

Comp.  MRE/Pred(0.25)  OLS    ANOVA  Analogy
1      Mean            0.57   0.54   0.59
       Median          0.47   0.48   0.46
       Pred(0.25)      0.24   0.40   0.33
2      Mean            0.43   0.20   0.62
       Median          0.31   0.13   0.58
       Pred(0.25)      0.39   0.85   0.15
3      Mean            0.49   0.50   0.48
       Median          0.35   0.43   0.32
       Pred(0.25)      0.42   0.25   0.33
4      Mean            0.34   0.35   0.37
       Median          0.30   0.41   0.33
       Pred(0.25)      0.45   0.46   0.27
5      Mean            0.84   1.31   0.68
       Median          0.52   0.52   0.31
       Pred(0.25)      0.30   0.10   0.50
6      Mean            0.35   0.42   0.34
       Median          0.36   0.30   0.33
       Pred(0.25)      0.30   0.20   0.30

Consistent with the local model results, the statistical methods obtained relatively poor accuracy for company 5. However, for this company, Analogy and OLS regression obtained significantly better predictions than the ANOVA models. In general, applying robust regression to company-external data produced results similar to OLS regression; these results are therefore not presented here. The most accurate results with the statistical methods were obtained using data external to company 2: ANOVA achieved highly accurate results for company 2, with a Pred(0.25) of 85%. Analogy obtained the worst results for company 2 and was significantly outperformed by the two other methods. The variables modeled are listed in Table 7. In general, there is a large overlap among the variables identified as significant when using company-external data for model building. Besides system size (variable FP), organization type (BRA), target platform (HAR), and the level and usage of methods and/or requirements volatility (variable F05) had a significant impact on cost for all six training sets.

Table 7: Variables modeled - global models

Train-set for company   OLS                                 ANOVA and Analogy
1                       FP, BRA, HAR, F08, F09              FP, BRA, HAR, F05, F08
2                       FP, APP, BRA, HAR, F08, F07, F11    FP, BRA, HAR, F05
3                       FP, BRA, HAR, F04, F08, F09         FP, BRA, HAR, F05, F08
4                       FP, BRA, HAR, F08, F09              FP, BRA, HAR, F05, F08
5                       FP, BRA, HAR, F08, F09              FP, BRA, HAR, F05, F08
6                       FP, BRA, HAR, F05, F07, F08         FP, BRA, HAR, F05
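The OLS models in Table 7 regress effort on size and further cost drivers. As a minimal sketch of the size-based core of such a model, the following fits a log-log specification (common in this literature) by ordinary least squares; the data points and the single-variable specification are illustrative only:

```python
import math

# Minimal OLS fit of log(effort) = b0 + b1*log(FP), the typical
# size-based specification; data points are invented for illustration.
fp     = [100, 250, 400, 800, 1200]        # function points
effort = [900, 2600, 4500, 10000, 16000]   # person-hours

x = [math.log(v) for v in fp]
y = [math.log(v) for v in effort]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Closed-form simple-regression slope and intercept
b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
      / sum((xi - mx) ** 2 for xi in x))
b0 = my - b1 * mx

def predict(size_fp):
    # Back-transform from log scale to obtain an effort prediction
    return math.exp(b0 + b1 * math.log(size_fp))

print(round(b1, 2))          # slope: economies/diseconomies of scale
print(round(predict(500)))   # predicted effort for a 500-FP project
```

A slope near 1 indicates roughly linear scaling of effort with size; the study's actual models additionally include categorical drivers such as BRA and HAR, which this sketch omits.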

Table 8 shows the p-values obtained from comparing local and global models for each of the three methods applied. The "(+)" and "(-)" indicate significantly better performance (at the 5% significance level) for local models and global models, respectively.

Table 8. Local versus global cost models

Company   OLS     ANOVA      Analogy
1         0.279   0.524      0.962
2         0.575   0.046(-)   0.011(+)
3         0.875   0.937      0.695
4         0.075   0.016(-)   0.722
5         0.059   0.798      0.959
6         0.575   0.878      0.284
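The p-values in Table 8 compare the per-project error distributions of local and global models. As an illustration of one such matched-pair test, the sketch below runs a paired sign-flip permutation test on invented MRE values (not the test or the data from the study):

```python
import random

# Paired permutation (sign-flip) test on per-project MRE differences,
# one way to obtain p-values like those in Table 8. MRE values invented.
local_mre  = [0.20, 0.45, 0.10, 0.60, 0.35, 0.25, 0.50, 0.15]
global_mre = [0.30, 0.40, 0.25, 0.55, 0.50, 0.45, 0.60, 0.30]

diffs = [l - g for l, g in zip(local_mre, global_mre)]
observed = abs(sum(diffs) / len(diffs))

random.seed(0)
count = 0
n_perm = 10000
for _ in range(n_perm):
    # Under H0 the sign of each paired difference is arbitrary
    perm = [d * random.choice((-1, 1)) for d in diffs]
    if abs(sum(perm) / len(perm)) >= observed:
        count += 1

p_value = count / n_perm
print(round(p_value, 3))
```

A small p-value would indicate a systematic accuracy difference between the two models; values like most of those in Table 8 (well above 0.05) indicate no such difference.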

Surprisingly, the general trend does not indicate any significant differences between local and global estimation models. Except for company 2 when applying Analogy, none of the companies obtained significantly better results using their own local data. In two cases, even better results were obtained from the global cost models (the global ANOVA models applied to companies 2 and 4).

In general, one reason for this result is that the data of each individual company is not very homogeneous. The distributions of productivity, effort, and system size are widely spread and comparable to those of the whole database (Table 3). Moreover, the statistical methods (OLS and ANOVA) are better suited to situations where a substantial amount of data is available. For five of the six companies, however, we only had up to 13 projects available, so the predictions were highly affected by each single project in the company. The relatively poor results for company 5, for example, are caused by one extreme outlying prediction for both methods.

It should be noted that the OLS models using local company data were based on only one independent variable, namely system size, for five of the six companies (Table 5), whereas the company-external OLS models were obtained using many more independent variables (Table 7). Having redone the analysis for the global OLS models using size as the only independent variable, we obtained different results: under these conditions, most of the local OLS models significantly outperformed the global models.

Locally derived Analogy models predicted significantly more accurately for company 2. On average, company 2 was twice as productive as the projects across the whole database, and Analogy did not seem able to adjust for this using the adaptation rule.
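The adaptation rule mentioned above can be illustrated as follows; this sketch uses a simple size-proportional adjustment of the nearest analogue's effort, a common form of such a rule (the study's exact rule and data are not reproduced here, and the historical projects below are invented):

```python
# Sketch of analogy-based estimation with a size-based adaptation rule
# (a common choice; the exact rule used in the study may differ).
# Historical projects are invented: (function points, effort).
history = [(120, 1100), (300, 2900), (450, 4800), (900, 10500)]

def estimate_by_analogy(new_fp):
    # 1. Retrieve the most similar project (here: closest in size)
    fp, effort = min(history, key=lambda p: abs(p[0] - new_fp))
    # 2. Adapt its effort proportionally to the size ratio
    return effort * new_fp / fp

print(round(estimate_by_analogy(400)))  # -> 4267
```

Because this rule adjusts only for size, a company that is systematically more productive than its analogues (such as company 2) keeps being estimated at the database's productivity level, which is consistent with the observation above.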

Even though each company represents a distinct organization type, this does not seem to be the driving factor for differences in the prediction accuracy of the global models, because the company-external models take this into account, having identified the organization type (BRA) as one significant cost driver (Table 7). We performed an additional analysis, carefully matching the organization types of our company-external training sets to the organization type of each individual company. The results showed that slightly better or the same prediction accuracy could be obtained. Another point to consider is that none of the six companies could collect their data within, say, one year. For company 4, for example, the data collection for the 11 projects took about four years; similar durations are observable for the other individual companies. From a practical perspective, it is not feasible to wait this long for the first usable application of a cost prediction model. When trading off the pace of local data collection against the prediction accuracy of global models, it was highly beneficial for all the investigated companies to contribute to the Laturi database.

6. Conclusions

This paper investigated the benefits and drawbacks of using company-specific, local data for building software cost estimation models versus company-external, global data that comes from many companies. Consistent with two previous studies, we found that, in general, local models developed using company-specific data do not perform significantly better than global models. On the other hand, this result contradicts two other previously reported studies. The main conditions under which these results were derived should be kept in mind: the rigorous quality assurance of the data collection procedures, the relatively consistent application domain of the whole database, the relative heterogeneity of the data within each individual company, and the collection of standard cost-drivers. The following three main conclusions can be drawn from the results obtained.

(1) Application of appropriate estimation methods - Two of the modeling methods applied in this study (OLS and ANOVA) are not fully suitable for deriving estimates from a small amount of data. This might be one reason why global models derived from OLS and ANOVA performed with an accuracy comparable to local models. For deriving predictions from around ten projects, estimation methods other than purely data-driven ones are more appropriate, such as subjective effort estimation [26] or combinations of expert judgment and project data [7, 16]. This implication can be drawn from all the previous studies' results since, in general, the MRE values for local models were mostly far from satisfactory [9].

(2) Homogeneity of project data - The homogeneity of each company's data was not higher than that of the whole database. This implies that a company should not rely on building its own data-driven model if the underlying data is relatively heterogeneous. One simple way to get a notion of the underlying homogeneity is to examine indicators such as the standard deviation of the distributions of the underlying data. On the other hand, institutions that collect and maintain multi-organizational data repositories should carefully select which project attributes to store and which types of projects to consider for such a database. A multi-organizational database might not be beneficial to cost estimation for individual companies, as evidenced in [13, 14]. However, project data that is more focused on specific application domains might bring advantages for individual companies from similar domains. If a large amount of multi-organizational, heterogeneous data has already been collected, as in the case of the ISBSG data [12], techniques such as cluster analysis could be considered to identify the most appropriate subset for building global models.

(3) Collection of cost factors - The collected standard cost factors did not explain a large amount of the variability in the local data. This is one indication that the locally developed models performed relatively poorly because very important factors might be missing from the data set. One strategy for a company could be to start data collection with the standard factors and, in addition, collect factors specific to the organization. Thus, the benefits of the global data could be utilized while organizational characteristics are also considered.
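The simple homogeneity check suggested in conclusion (2) can be sketched as follows, using the coefficient of variation of productivity as the spread indicator; all values are invented:

```python
import statistics

# Sketch of a simple homogeneity check: compare the spread of
# productivity (FP per person-hour) within one company to that of
# the whole database. All numbers are invented for illustration.
company_prod  = [0.08, 0.35, 0.12, 0.50, 0.20, 0.41, 0.15]
database_prod = [0.05, 0.10, 0.18, 0.25, 0.33, 0.40, 0.48, 0.55]

def cv(values):
    # Coefficient of variation: standard deviation relative to the
    # mean, which makes spreads comparable across samples
    return statistics.stdev(values) / statistics.mean(values)

print(round(cv(company_prod), 2))
print(round(cv(database_prod), 2))
```

If the company's coefficient of variation is not clearly lower than the database's, as in this invented example, the local data offers little homogeneity advantage over the multi-organizational data.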
An important issue not covered by our study should be considered in future work and puts our results into perspective: usually, companies know their own project data and can thus remove projects that are not suitable. Moreover, a company may know the reasons for outlying projects in its database. This knowledge has a large impact on how a new project is estimated, for example when assessing its similarity to past projects. We can conclude from this study that the benefits for a company of contributing to the Laturi database were potentially high for cost estimation. Our study could not prove the benefits of local data collection in this specific context of application. Rather, this supports the view that local data collection is only highly beneficial under certain circumstances, such as when defining tailored cost-drivers [7] and including expert opinion [7, 16].

7. References

[1] L. Angelis, I. Stamelos. A Simulation Tool for Efficient Analogy Based Cost Estimation. Empirical Software Engineering, 5, 2000, pp. 35-68.
[2] R. Bisio, F. Malabocchia. Cost Estimation of Software Projects through Case Based Reasoning. In: Proceedings of the International Conference on Case-Based Reasoning, 1995, pp. 11-22.
[3] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
[4] L.C. Briand, K. El Emam, K. Maxwell, D. Surmann, I. Wieczorek. An Assessment and Comparison of Common Software Cost Estimation Models. In: Proceedings of the 21st International Conference on Software Engineering, ICSE 99, Los Angeles, USA, 1999, pp. 313-322.
[5] L.C. Briand, T. Langley, I. Wieczorek. A Replicated Assessment of Common Software Cost Estimation Techniques. In: Proceedings of the 22nd International Conference on Software Engineering, ICSE 2000, Limerick, 2000, pp. 377-386.
[6] B.W. Boehm. Software Engineering Economics. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[7] L.C. Briand, K. El Emam, F. Bomarius. A Hybrid Method for Software Cost Estimation, Benchmarking, and Risk Assessment. In: Proceedings of the 20th International Conference on Software Engineering, ICSE 98, pp. 390-399.
[8] L.C. Briand, I. Wieczorek. Software Resource Estimation in Software Engineering. Accepted for publication in: Encyclopedia of Software Engineering. John Wiley & Sons.
[9] S.D. Conte, H.E. Dunsmore, V.Y. Shen. Software Engineering Metrics and Models. The Benjamin/Cummings Publishing Company, Inc., 1986.
[10] G.R. Finnie, G.E. Wittig. A Comparison of Software Effort Estimation Techniques: Using Function Points with Neural Networks, Case Based Reasoning and Regression Models. Journal of Systems and Software, 39, 1997, pp. 281-289.
[11] W. Hayes. Statistics. Fifth Edition. Harcourt Brace College Publishers, 1994.
[12] Software Project Estimation: A Workbook for Macro-Estimation of Software Development Effort and Duration. ISBSG, 1999.
[13] R. Jeffery, M. Ruhe, I. Wieczorek. A Comparative Study of Two Software Development Cost Modeling Techniques using Multi-organizational and Company-specific Data. Information and Software Technology, 42, 2000, pp. 1009-1016.
[14] R. Jeffery, M. Ruhe, I. Wieczorek. Using Public Domain Metrics to Estimate Software Development Effort. In: Proceedings of the METRICS 2001 Symposium, London, pp. 16-27.
[15] B. Kitchenham. A Procedure for Analyzing Unbalanced Data Sets. IEEE Transactions on Software Engineering, 24, 4, April 1998, pp. 278-301.
[16] S. Chulani, B. Boehm, B. Steece. Bayesian Analysis of Empirical Software Engineering Cost Models. IEEE Transactions on Software Engineering, 25, 4, April 1999, pp. 573-583.
[17].
[18] T. Mukhopadhyay, S.S. Vicinanza, M.J. Prietula. Examining the Feasibility of a Case-based Reasoning Model for Software Effort Estimation. MIS Quarterly, June 1992, pp. 155-171.
[19] A.J. Albrecht, J.E. Gaffney. Software Function, Source Lines of Code and Development Effort Prediction: A Software Science Validation. IEEE Transactions on Software Engineering, 9, 6, November 1983, pp. 639-648.
[20] M. Jørgensen. Experience with the Accuracy of Software Maintenance Task Effort Prediction Models. IEEE Transactions on Software Engineering, 21, 8, August 1995, pp. 674-681.
[21] P.J. Rousseeuw, A.M. Leroy. Robust Regression and Outlier Detection. John Wiley & Sons, 1987.
[22] L. Schroeder, D. Sjoquist, P. Stephan. Understanding Regression Analysis: An Introductory Guide. No. 57 in Series: Quantitative Applications in the Social Sciences, Sage Publications, Newbury Park, CA, USA, 1986.
[23] M. Shepperd, C. Schofield. Estimating Software Project Effort Using Analogies. IEEE Transactions on Software Engineering, 23, 12, November 1997, pp. 736-743.
[24] D. Sheskin. Handbook of Parametric and Non-parametric Procedures. CRC Press, 1997.
[25] P. Spector. Ratings of Equal and Unequal Response Choice Intervals. The Journal of Social Psychology, 112, 1980, pp. 119-155.
[26] R.T. Hughes. Expert Judgement as an Estimating Method. Information and Software Technology, 38, 2, 1996, pp. 67-75.
[27] F. Walkerden, R. Jeffery. An Empirical Study of Analogy-based Software Effort Estimation. Empirical Software Engineering, 4, 2, June 1999, pp. 135-158.
[28] S. Weiss, C. Kulikowski. Computer Systems that Learn. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1991.