Do Adaptation Rules Improve Web Cost Estimation?

Emilia Mendes
The University of Auckland
Private Bag 92019, Auckland, NZ
++64 09 3737 599 ext. 6137
[email protected]

Steve Counsell
The University of London
Mallet Street, London WC1E 7HX, UK
++44 0207 631 6723
[email protected]

Nile Mosley
MxM Technology
P.O. Box 3139, Shortland Street, Auckland, NZ
++64 09 372 3211
[email protected]

ABSTRACT
Analogy-based estimation has, over the last 15 years, and particularly over the last 7 years, emerged as a promising approach, with accuracy comparable to, or in some studies better than, algorithmic methods. In addition, it is potentially easier to understand and apply; these two important factors can contribute to the successful adoption of estimation methods within Web development companies. We believe, therefore, that analogy-based estimation should be examined further. This paper compares several methods of analogy-based effort estimation. In particular, it investigates the use of adaptation rules as a contributing factor to better estimation accuracy. Two datasets are used in the analysis; results show that the best predictions are obtained for the dataset that, first, presents a continuous "cost" function, translated as a strong linear relationship between size and effort, and, second, is more "unspoiled" in terms of outliers and collinearity. Only one of the two types of adaptation rules employed generated good predictions.

Categories and Subject Descriptors D.2 [Software Engineering]: D.2.8 Metrics – complexity measures, process metrics, product metrics, software science

General Terms Measurement, Experimentation.

Keywords Web effort prediction, Web hypermedia, case-based reasoning, Web hypermedia metrics, prediction models.

1. INTRODUCTION
Most approaches to software cost estimation have focused on expert opinion and algorithmic models. Expert opinion is widely used in industry, despite the fact that the means of deriving an estimate are not explicit and are therefore difficult to repeat. Expert opinion, although difficult to quantify, can be an effective estimation tool on its own or as an adjusting factor for algorithmic models [15].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. HT’03, August 26–30, 2003, Nottingham, United Kingdom. Copyright 2003 ACM 1-58113-704-4/03/0008…$5.00.

Algorithmic models attempt to represent the relationship between effort and one or more project features, where the main feature is usually some notion of software size (e.g. the number of lines of source code, number of Web pages, or number of graphics). These models require calibration to local circumstances and can be difficult to adapt and use where data is limited or incomplete and expertise in statistical techniques is scarce. Examples of such models are the COCOMO model [3] and the SLIM model [37].

Analogy-based estimation has, over the last 15 years, and particularly over the last 7 years, emerged as a promising approach, with accuracy comparable to, or in some studies better than, algorithmic methods. In addition, it is potentially easier to understand and apply; these are two important factors for the successful adoption of estimation methods within Web development companies. We believe, therefore, that analogy-based estimation should be examined further.

This paper investigates the use of adaptation rules as a contributing factor to better estimation accuracy by comparing a number of methods of analogy-based effort estimation. Two datasets (DS1 and DS2), with very different characteristics (outliers, collinearity), are used in the analysis. DS1 presents a continuous "cost" function, translated as a strong linear relationship between size and effort; DS2 presents a discontinuous "cost" function, with no linear or log-linear relationship between size and effort. In addition, DS1 is more "unspoiled" than DS2, reflected in its absence of outliers and smaller collinearity. Results show that the best predictions are obtained for DS1, suggesting that the type of "cost" function has a strong influence on prediction accuracy, to some extent more so than dataset characteristics (outliers, collinearity). In addition, only one of the two types of adaptation rules employed generated good predictions.

Section 2 reviews the process of estimation by analogy and previous research in the area relative to Web engineering. Section 3 describes the CBR parameters employed in this study. The datasets used are presented in Section 4, and a comparison of CBR approaches is detailed in Section 5; conclusions follow in Section 6.

2. BACKGROUND

2.1 Introduction to CBR
Estimation by analogy is an application of a case-based reasoning approach. Case-based reasoning (CBR) is a form of analogical reasoning where the cases stored in the case base and the target case are instances of the same category [48]. In the context of our investigation, a 'case' is a Web project (new or finished). As such, an effort estimate for a new Web project is obtained by searching for one or more similar cases, each representing information about a finished Web project. For the remainder of this paper the term CBR will be used instead of estimation by analogy. The rationale for CBR is the use of historical information from completed projects with known effort. It involves [2]:

• The characterisation of a new active project p, for which an estimate is required, using attributes (features) common to the completed projects stored in the case base. In our context most features represent size measures that have a bearing on effort. Feature values are normally standardised (between 0 and 1) so that they have the same degree of influence on the results.
• The use of this characterisation as a basis for finding similar (analogous) completed projects, for which effort is known. This can be achieved by measuring the Euclidean distance between two projects, based on the values of k features for those projects. Although numerous techniques are available to measure similarity, nearest-neighbour algorithms [33] using the unweighted Euclidean distance have been the most widely used in software and Web engineering.
• The generation of a predicted value of effort for project p based on the effort of the completed projects that are similar to p. The number of similar projects considered generally depends on the size of the dataset; for small datasets typical numbers are the 1, 2 and 3 closest neighbours (cases). The estimated effort is often taken to be the effort value of the closest neighbour, or the mean of the effort values of two or more cases. In software and Web engineering a common choice is the nearest neighbour or the mean of 2 or 3 cases.
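As an illustration of these three steps, a minimal sketch is given below. It is not the CBR tool used in this study; the case base, feature names and values are hypothetical, features are scaled by their observed ranges, similarity is the unweighted Euclidean distance, and the estimate is the effort of the closest case or the mean of the k closest cases.

```python
# A minimal sketch of CBR effort estimation: scale features, rank
# finished projects by Euclidean distance to the target, and average
# the effort of the k nearest cases. All project data is hypothetical.
import math

case_base = [  # finished projects: feature values plus known effort
    {"pages": 50, "media": 20, "effort": 50.0},
    {"pages": 80, "media": 35, "effort": 90.0},
    {"pages": 30, "media": 10, "effort": 25.0},
]
target = {"pages": 40, "media": 18}  # new project, effort unknown
features = ["pages", "media"]

# Scale each feature by its observed range so all features have the
# same degree of influence on the distance (cf. Section 3.3).
ranges = {f: max(c[f] for c in case_base) - min(c[f] for c in case_base)
          for f in features}

def distance(a, b):
    """Unweighted Euclidean distance over range-scaled features."""
    return math.sqrt(sum(((a[f] - b[f]) / ranges[f]) ** 2 for f in features))

def estimate_effort(target, k=2):
    """Mean effort of the k most similar finished projects."""
    ranked = sorted(case_base, key=lambda c: distance(c, target))
    return sum(c["effort"] for c in ranked[:k]) / k

print(estimate_effort(target, k=1))  # nearest neighbour
print(estimate_effort(target, k=2))  # mean of the 2 closest cases
```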

2.2 Advantages of CBR Potential advantages of using CBR for effort estimation are as follows [47]:

2.2.1 CBR has the potential to alleviate problems with calibration
Effort estimates can be inaccurate whenever an algorithmic model, derived using past project data from one company, is used to generate estimates for projects belonging to a different company [22,17-19]. Inaccuracy occurs whenever the relationship between effort and its associated factors differs broadly from company to company. However, CBR can give reasonably accurate estimates provided the set of past projects for a given company contains at least one project similar to the target project (which may belong to a different company).

2.2.2 CBR can be valuable where the domain is complex and difficult to model
Many features (a Web application's size, a Web developer's experience, etc.) can potentially influence a Web project's development effort. However, identifying these features and understanding how they interact with each other is difficult. For example, there is no standard as to which size measures should be employed to measure a Web application's size. Proposals have been made [11,28,40], but to what extent the proposed size measures can be collected early enough in the development cycle to be useful as effort predictors is still an open research question. In addition, to develop an algorithmic model it is necessary to determine which features can predict effort, where the number of features influences how much historical data is necessary to develop that model. Unfortunately, historical data is often in short supply, and for companies that develop Web projects it is even more difficult to obtain, given that:

• Web projects have short schedules and a fluidic scope [35].
• A Web project's primary goal is to bring quality applications to market as quickly as possible, varying from a few weeks [35] to 6 months [39].
• The processes employed are in general heuristic, although some organisations are starting to look into the use of agile methods [1].

CBR has the potential for successful use without a clear model of the relationship between effort and other project features; rather than assuming a general relationship between effort and other project features applicable to all projects, it relies predominantly on selecting a past project that is similar to the target project.

2.2.3 The basis for an estimate can be easily understood
Algorithmic models use a historical dataset to derive cost models where effort is the output, size measures and cost drivers are the inputs, and an empirical relation connects input and output. Therefore, an algorithmic model generated using data from a given company A can only be applied to a different company B if the relation between effort and its inputs embodied in the model also applies to company B's estimation problem [47]. CBR, on the other hand, bases estimates on concrete past cases, a familiar mode of human problem solving [48]. According to [23], a large number of managers and developers are comfortable estimating in this manner.

2.2.4 CBR Challenges
The accuracy of estimates generated using CBR relies upon three broad factors [47]: the availability of suitable similar cases, the soundness of the strategy used for selecting them, and the way in which differences between the most similar cases and the target case are taken into account when deriving an estimate. These issues are discussed in Section 3. There is no recipe for applying each of these factors; hence empirical investigation is necessary to help identify which combination of factors works best given the characteristics of the past and target projects, and of a particular dataset.

This paper's contribution is an investigation into the use of adaptation rules for Web hypermedia development effort estimation using a comprehensive set of CBR techniques. We have focused on two types of adaptation rule in which estimated effort is in some way adjusted according to the estimated size of the new Web hypermedia application. A Web hypermedia application [9] is a non-conventional application characterised by the authoring of information using nodes (chunks of information), links (relations between nodes), anchors and access structures (for navigation), and by its delivery over the Web. Technologies commonly used for developing such applications are HTML, JavaScript and multimedia. In addition, typical developers are writers, artists and organisations that wish to publish information on the Web and/or CD-ROM without the need to know programming languages such as Java. These applications have potential in areas such as software engineering [13], literature [46], education [32] and training [38].

2.3 Related Work
There are relatively few examples in the literature of studies that use CBR as an estimation technique for Web hypermedia applications [27-29,31].

Mendes et al. [27] (1st study) describe a case study involving the development of 76 Web hypermedia applications structured according to the Cognitive Flexibility Theory (CFT) [44] principles, in which length size and complexity size measures were collected. The measures obtained were page count, connectivity, compactness [5], stratum [5] and reused page count. The original dataset was split into four homogeneous Web project datasets of sizes 22, 19, 15 and 14 respectively. Several prediction models were generated for each dataset using three cost modelling techniques: multiple linear regression, stepwise regression and case-based reasoning. Their predictive power was compared using the Mean Magnitude of Relative Error (MMRE) and the Median Magnitude of Relative Error (MdMRE). Results showed that the best predictions were obtained using case-based reasoning for all four datasets. The limitations of this study were: i) some of the measures used were highly subjective, which may have influenced the validity of the results; ii) only one CBR technique was applied, in which similarity between cases was measured using the unweighted Euclidean distance and estimated effort was calculated using the closest case and the mean of 2 and 3 cases.

Mendes et al. [30] (2nd study) present a case study in which 37 Web hypermedia applications were used (DS1). These were also structured according to the CFT [44] principles, and the Web hypermedia measures collected were organised into size, effort and confounding factors. The study compares the prediction accuracy of three CBR techniques to estimate the effort of developing Web hypermedia applications. They also compare the best CBR technique against three prediction models commonly used in software engineering: multiple linear regression, stepwise regression and regression trees. Prediction accuracy was measured using several accuracy measures; multiple linear regression and stepwise regression presented the best results for that dataset. The limitation of this study is that it did not use adaptation rules when applying the CBR techniques.

Mendes et al. [31] (3rd study) use two Web hypermedia datasets. One is the same as that used in [30]; the other has 25 cases (DS2). They apply to both datasets the same three CBR techniques employed in [30]. Regarding DS2, pairs of subjects developed each application. The size measures collected were the same as those used in [30], except for Reused Media Count and Reused Program Count. Prediction accuracy was measured using MMRE, MdMRE and Pred(25). DS1 presented better predictions than DS2. The limitation of this study is that it did not use adaptation rules when applying the CBR techniques.

The work presented in this paper uses the same datasets employed in the 3rd study. Although they are not industrial datasets, they exhibit size measures and characteristics similar to those found in industry [29]. We use a wider range of CBR techniques than any previous work to investigate CBR techniques further, looking also at the use of two types of adaptation rules. Few publications in software engineering, and none in Web engineering, have looked at the use of adaptation rules [14,36,44]; we therefore believe this paper makes a relevant contribution to both fields.

Results are compared using MMRE and Pred(25) as measures of prediction accuracy. MMRE was chosen for two reasons: first, MMRE is the de facto standard evaluation criterion for assessing the accuracy of software prediction models [6]; second, a recent empirical investigation did not find evidence to discourage the use of MRE and MMRE [45]. Prediction at level l, also known as Pred(l), is another commonly used indicator. It measures the percentage of estimates that are within l% of the actual values. Suggestions have been made [10] that l should be set at 25% and that a good prediction system should offer this accuracy level 75% of the time. We also set l = 25%. MMRE is calculated as:

MMRE = \frac{1}{n} \sum_{i=1}^{n} \frac{|Effort_{actual_i} - Effort_{estimated_i}|}{Effort_{actual_i}}    (1)

where i represents a project, n the number of projects, Effort_{actual_i} the actual effort for project i, and Effort_{estimated_i} the estimated effort for project i.

3. CBR PARAMETERS
When using CBR, the three broad factors mentioned in Section 2.2.4 are composed of six parameters [21]:

• Feature Subset Selection
• Similarity Measure
• Scaling
• Number of Cases
• Case Adaptation
• Adaptation Rules

Each parameter can in turn be split into further detail for use by a CBR tool, allowing for several combinations of those parameters. Each parameter is described below, where we also indicate our choice and the motivation for it within this study. All CBR results were obtained using CBR-Works [41], a commercially available CBR tool.

3.1 Feature Subset Selection Feature subset selection involves determining the optimum subset of features that give the most accurate estimation. Some existing CBR tools, e.g. ANGEL [43], offer this functionality optionally by applying a brute force algorithm, searching for all possible feature subsets. CBR-Works does not offer such functionality; for each estimated effort, we used all features in order to retrieve the most similar cases.
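For illustration only, a brute-force feature subset search of the kind described above could look like the sketch below; the scoring function is a placeholder (in practice it would be, e.g., a jack-knifed MMRE for a CBR predictor restricted to the subset) and the usage values are hypothetical.

```python
# Sketch of brute-force feature subset selection: evaluate every
# non-empty subset of the features and keep the subset whose
# predictions score best (lowest error).
from itertools import combinations

def best_feature_subset(features, score):
    """score(subset) -> error of a predictor using only that subset
    (placeholder for, e.g., jack-knifed MMRE)."""
    best, best_err = None, float("inf")
    for r in range(1, len(features) + 1):
        for subset in combinations(features, r):
            err = score(subset)
            if err < best_err:
                best, best_err = subset, err
    return best

# Hypothetical usage with DS1's size measures and a dummy score:
features = ["PaC", "MeC", "PRC", "RMC", "RPC", "COD", "TPC"]
print(best_feature_subset(features, score=lambda s: abs(len(s) - 3)))
```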

3.2 Similarity Measure
A similarity measure quantifies the level of similarity between cases. Several similarity measures have been proposed in the software engineering literature; those described and used in this study are the unweighted Euclidean distance, the weighted Euclidean distance and the maximum distance. Readers are referred to [2] for details of other similarity measures. The motivation for using the unweighted Euclidean (UE) and maximum (MX) distances is that they have previously been used with encouraging results in software and Web engineering cost estimation studies (UE: [43,27]; MX: [2]) and are applicable to quantitative variables, suiting our needs. The weighted Euclidean distance was also chosen as it seemed reasonable to give different weights to our size measures (features) to reflect the importance of each, rather than expect all size measures to have the same influence on effort. Our datasets have seven and five size measures respectively (Section 4.1), representing different facets of size. Each similarity measure used is described below.

3.2.1 Unweighted Euclidean Distance
The unweighted Euclidean distance measures the Euclidean (straight-line) distance d between the points (x_0, y_0) and (x_1, y_1), given by the formula:

d = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}    (2)

This measure has a geometrical meaning as the distance between two points in n-dimensional Euclidean space [2], where the number of features employed determines the number of dimensions. Figure 1 illustrates this distance by representing co-ordinates in E2 for two size measures (page count and page complexity).

Figure 1. Unweighted Euclidean distance using two size measures

3.2.2 Weighted Euclidean Distance
The weighted Euclidean distance is used when feature vectors are given weights that reflect the relative importance of each feature. The weighted Euclidean distance d between the points (x_0, y_0) and (x_1, y_1) is given by the formula:

d = \sqrt{w_x (x_0 - x_1)^2 + w_y (y_0 - y_1)^2}    (3)

where w_x and w_y are the weights of x and y respectively. Weights were calculated using two separate approaches:
1. On both datasets we attributed weight = 2 to the size measures PaC, MeC and RMC (DS1 only) and weight = 1 to the remaining size measures. The choice of weights was based on expert opinion and on the existing literature [11].
2. We measured the linear association between the size measures and effort using a two-tailed Pearson's correlation for DS1 and Spearman's rho, a non-parametric test equivalent to Pearson's, for DS2. The choice of test was based on the characteristics of the measures employed: whenever measures presented a normal, or very close to normal, distribution of values, we used the parametric test (Pearson's); otherwise Spearman's rho was chosen. For DS2, some correlations were negative. Given that CBR-Works does not accept negative values as weights, we added 1.0 to all correlations, making all values positive and thus suitable for use. All the results presented in this paper take this adjustment into account.

3.2.3 Maximum Distance
The maximum measure computes the highest feature similarity, which defines the most similar project. For two points (x_0, y_0) and (x_1, y_1), the maximum measure d is given by the formula:

d = \max((x_0 - x_1)^2, (y_0 - y_1)^2)    (4)

This distance effectively reduces the similarity measure to a single feature, although the feature in question may differ for each retrieval occurrence. In other words, although we used several size measures, for a given "new" project p the closest project in the case base will be the one that has a size measure whose value is the most similar to the corresponding measure for project p.

3.3 Scaling
Scaling (or standardisation) represents the transformation of attribute values according to a defined rule, such that all attributes have the same degree of influence and the method is immune to the choice of units [2]. One possible solution is to assign the value zero to the minimum observed value and the value one to the maximum observed value [21]. We scaled all features used in this study by dividing each feature value by that feature's range.

3.4 Number of Cases
The number of cases refers to the number of most similar projects that will be used to generate the estimate. According to Angelis and Stamelos [2], when small datasets are used it is reasonable to consider only a small number of cases. Several studies in software engineering have restricted their analysis to the closest case (k = 1) [7,8,19,33]. However, we decided to use the 1, 2 and 3 most similar cases, similarly to [2,18,27,30,31,43].

3.5 Case Adaptation
Once the most similar case(s) has/have been selected, the next step is to decide how to generate the estimate for the "new" project p. Choices of case adaptation techniques presented in the software engineering literature vary from the nearest neighbour [7,19], the mean of the closest cases [43] and the median [2] to the inverse distance weighted mean and the inverse rank weighted mean [21], to illustrate just a few. In the Web engineering literature, the adaptations used to date are the nearest neighbour, the mean of the closest cases [27], and the inverse rank weighted mean [30,31].

We opted for the mean, the median and the inverse rank weighted mean. Each adaptation, and the motivation for its use, is explained below:

• Mean: the average of k cases, when k > 1. The mean is a typical measure of central tendency used in the software/Web engineering literature. It treats all cases as being equally influential on the outcome.
• Median: the median of k cases, when k > 2. Another measure of central tendency, and a more robust statistic as the number of most similar cases increases [2]. Although this measure did not present encouraging results when used by Angelis and Stamelos [2], measured using MMRE and Pred(25), we wanted to observe how it would behave on our datasets.
• Inverse rank weighted mean: allows higher-ranked cases to have more influence than lower-ranked ones. If we use three cases, for example, the closest case (CC) has weight 3, the second closest (SC) weight 2 and the last one (LC) weight 1, and the estimate is calculated as (3CC + 2SC + LC)/6. It seemed reasonable to us to allow higher-ranked cases to have more influence than lower-ranked ones, so we decided to use this adaptation as well.
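To make these parameter choices concrete, the sketch below implements the three similarity measures (Equations 2-4) and the three case adaptations; it assumes feature vectors already scaled as in Section 3.3, and all weights and values shown are hypothetical.

```python
# Sketch of the similarity measures (unweighted/weighted Euclidean,
# maximum) and the case adaptations (mean, median, inverse rank
# weighted mean). Feature vectors are assumed already scaled.
import math
import statistics

def unweighted_euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_euclidean(a, b, weights):
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def maximum_distance(a, b):
    # Similarity is driven by the single most differing feature.
    return max((x - y) ** 2 for x, y in zip(a, b))

def inverse_rank_weighted_mean(efforts):
    # efforts ordered from closest to farthest case; for three cases
    # this computes (3*CC + 2*SC + 1*LC) / 6.
    k = len(efforts)
    weights = range(k, 0, -1)
    return sum(w * e for w, e in zip(weights, efforts)) / sum(weights)

# Hypothetical scaled feature vectors and subjective weights:
new_proj, old_proj = [0.4, 0.7], [0.5, 0.6]
print(unweighted_euclidean(new_proj, old_proj))
print(weighted_euclidean(new_proj, old_proj, [2.0, 1.0]))
print(maximum_distance(new_proj, old_proj))

# Hypothetical efforts of the three closest cases, nearest first:
print(statistics.mean([40.0, 50.0, 65.0]))             # mean adaptation
print(statistics.median([40.0, 50.0, 65.0]))           # median adaptation
print(inverse_rank_weighted_mean([40.0, 50.0, 65.0]))  # IRWM adaptation
```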

3.6 Adaptation Rules
Adaptation rules are used to adapt the estimated effort, according to a given criterion, such that it reflects the characteristics of the target project more closely. For example, in the context of effort prediction, the estimated effort to develop an application would be adapted such that it also takes into consideration the application's size values. The adaptation rules employed in this study are based on the linear size adjustment to the estimated effort [47]. For Walkerden and Jeffery [47], once the most similar finished project in the case base has been retrieved, its effort value is adjusted to estimate effort for the target project. A linear extrapolation is performed along the dimension of a single measure: a size measure strongly correlated with effort. The linear size adjustment is represented as follows:

Effort_{target} = \frac{Effort_{finished\,project}}{Size_{finished\,project}} \times Size_{target}    (5)

We will use an example to illustrate linear size adjustment. Suppose a client requests the development of a new Web hypermedia application using 40 Web pages. The closest finished Web project in the case base used 50 Web pages, and its total effort was 50 person hours, i.e. 1 person hour per Web page. We could use for the new Web project the same effort spent developing its most similar finished project, i.e. 50 person hours. However, the estimated effort can be adapted to reflect more closely the estimated size (number of Web pages) requested by the client. This is achieved by dividing actual effort by actual size and then multiplying the result by the estimated size, which gives an estimated effort of 40 person hours.

Unfortunately, in the context of Web projects, size is not measured using a single size measure, as there are several size measures that can be used as effort predictors [28,40]. In that case, the linear size adjustment has to be applied to each size measure and the estimated efforts generated are then averaged to obtain a single estimated effort:

E_{est.P} = \frac{1}{x} \sum_{q=1}^{x} \frac{E_{act} \times S_{est_q}}{S_{act_q}}    (6)

where:
E_{est.P} is the total effort estimated for the new Web project P;
E_{act} is the total effort for a finished Web project obtained from the case base;
S_{est_q} is the estimated value for size measure q, which is obtained from the client;
S_{act_q} is the actual value for size measure q, for a finished Web project obtained from the case base;
x is the number of size measures characterising a Web project.

Equation 6 presents the first type of adaptation rule employed, named "adaptation without weights". When using this adaptation, all size measures contribute equally towards total estimated effort, indicated by the use of a simple average. To apply different weights indicating the strength of the relationship between a size measure and effort, we instead use the following equation:

E_{est.P} = \frac{1}{\sum_{q=1}^{x} w_q} \sum_{q=1}^{x} w_q \frac{E_{act} \times S_{est_q}}{S_{act_q}}    (7)

where w_q represents the weight w attributed to size measure q. Equation 7 presents the second type of adaptation rule we employed. We applied both types of adaptation rules in this study to determine whether different adaptations give different prediction accuracy, using the same weights as those presented in Section 3.2 for the weighted Euclidean distance. Finally, as we used up to three closest Web projects, we added another equation to account for that:

E_{total\,est} = \frac{1}{r} \sum_{j=1}^{r} E_{est.P_j}    (8)

Here r is the number of closest Web projects obtained from the case base, with minimum and maximum values of 1 and 3 respectively, and E_{est.P_j} is the adapted effort estimate obtained from the j-th closest project.
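A transcription of Equations 5-7, with Equation 8's averaging over the closest cases, is sketched below. The 40/50-page usage line follows the worked example above; the remaining size values and weights are hypothetical.

```python
# Sketch of the linear size adjustment adaptation rules.
# Equation 6: unweighted average of per-measure adjusted efforts.
# Equation 7: weighted average using per-measure weights w_q.
def adapted_effort(e_act, s_act, s_est, weights=None):
    """e_act: effort of the retrieved case; s_act/s_est: lists of
    actual (case) and estimated (target) values per size measure."""
    if weights is None:
        weights = [1.0] * len(s_act)  # Equation 6 as a special case
    adjusted = [w * e_act * se / sa
                for w, sa, se in zip(weights, s_act, s_est)]
    return sum(adjusted) / sum(weights)

# Worked example from the text: the closest case has 50 pages and
# took 50 person hours; the client requests 40 pages.
print(adapted_effort(50.0, [50.0], [40.0]))  # -> 40.0 (Equation 5)

# Two size measures with hypothetical values and weights (Equation 7):
print(adapted_effort(50.0, [50.0, 20.0], [40.0, 30.0], weights=[2.0, 1.0]))

# Equation 8: simple average over the r closest cases' adapted efforts.
per_case = [40.0, 48.0, 55.0]
print(sum(per_case) / len(per_case))
```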

4. DATASETS EMPLOYED

4.1 Description
The analysis presented in this paper was based on two datasets containing information about Web hypermedia applications developed by Computer Science Honours and postgraduate students attending a Hypermedia and Multimedia Systems course at the University of Auckland.

The first dataset (DS1, 34 applications) was obtained from a case study (CS1) consisting of the design and authoring, by each student, of Web hypermedia applications aimed at teaching a chosen topic, structured according to the CFT principles [44], using a minimum of 50 pages. Each Web hypermedia application provided 46 pieces of data [29], from which we identified 8 attributes, shown in Table 1, to characterise a Web hypermedia application and its development process.

The second dataset (DS2, 25 applications) was obtained from another case study (CS2) consisting of the design and authoring, by pairs of students, of Web hypermedia applications structured using an adaptation of the Unified Modelling Language [4], with a minimum of 25 pages. Each Web hypermedia application provided 42 pieces of data, from which we identified 6 attributes, shown in Table 2, to characterise a Web hypermedia application and its development process.

The criteria used to select the attributes were [11,12]: i) practical relevance for Web hypermedia developers; ii) measures that are easy to learn and cheap to collect; iii) counting rules that are simple and consistent. Tables 1 and 2 show the attributes that form the basis for our data analysis. Total effort is our dependent/response variable; the remaining attributes are our independent/predictor variables. All attributes were measured on a ratio scale. Tables 3 and 4 outline the properties of the datasets used. DS1 originally had 37 observations, of which three were outliers whose total effort was unrealistic compared to duration; these outliers were removed, leaving 34 observations. Collinearity represents the number of statistically significant correlations with other independent variables out of the total number of independent variables [19]. Summary statistics for all the variables in DS1 and DS2 have been omitted due to space restrictions.

Table 1 - Size and Complexity Metrics for DS1
Metric                        Description
Page Count (PaC)              Total number of html or shtml files
Media Count (MeC)             Total number of original media files
Program Count (PRC)           Total number of JavaScript files and Java applets
Reused Media Count (RMC)      Total number of reused/modified media files
Reused Program Count (RPC)    Total number of reused/modified programs
Connectivity Density (COD)    Average number of internal links per page
Total Page Complexity (TPC)   Average number of different types of media per page
Total Effort (TE)             Effort in person hours to design and author the application

Table 2 - Size and Complexity Metrics for DS2
Metric                        Description
Page Count (PaC)              Total number of html files
Media Count (MeC)             Total number of original media files
Program Length (PRL)          Total number of statements used in either JavaScript or Cascading Style Sheets
Connectivity Density (COD)    Average number of links, internal or external, per page
Total Page Complexity (TPC)   Average number of different types of media per page
Total Effort (TE)             Effort in person hours to design and author the application

Table 3 - Properties of DS1
Number of Cases   Features   Categorical Features   Outliers   Collinearity
34                8          0                      0          2/7

Table 4 - Properties of DS2
Number of Cases   Features   Categorical Features   Outliers   Collinearity
25                6          0                      2          3/6

For both case studies, two questionnaires were used to collect data. The first asked subjects to rate their Web hypermedia authoring experience using five scales, from no experience (zero) to very good experience (four). The second was used to measure characteristics of the Web hypermedia applications developed (suggested measures) and the effort involved in designing and authoring those applications. On both questionnaires we described each scale type in depth to avoid misunderstanding, and members of the research group checked both questionnaires for ambiguous questions, unusual tasks, and the number of questions and definitions in the Appendix.

To reduce learning effects in both case studies, subjects were given coursework prior to designing and authoring the Web hypermedia applications. This consisted of creating a simple Web hypermedia application and loading the application onto a Web server. In addition, to measure possible factors that could influence the validity of the results, we also asked subjects about the main structure (backbone) of their applications (sequence, hierarchy or network), their authoring experience before and after developing the applications, and the type of tool used to author/design the Web pages (WYSIWYG (What You See Is What You Get), close to WYSIWYG, or text-based). Finally, CS1 subjects received training on the Cognitive Flexibility Theory authoring principles (approximately 150 minutes) and CS2 subjects received training on the UML's adapted version (approximately 120 minutes). The adapted version consisted of Use Case Diagrams, Class Diagrams and Transition Diagrams.

4.2 Validity of Both Case Studies
Here we present our comments on the validity of both case studies:

• The measures collected, except for effort, experience, structure and tool, were all objective, quantifiable, and re-measured by one of the authors using the applications developed.
• The scales used to measure experience, structure and tool were described in detail in both questionnaires.
• Subjects' authoring and design experiences were mostly scaled 'little' or 'average', with a low difference between skill levels. Consequently, the original datasets were left intact.
• To reduce maturation effects, i.e. learning effects caused by subjects learning as an evaluation proceeds, subjects had to develop a small Web hypermedia application prior to developing the application measured. They also received training in the CFT principles, for the first case study, and in the UML's adapted version, for the second case study.
• The majority of applications used a hierarchical structure. Notepad and FirstPage were the two tools most frequently used in CS1, and FirstPage was the tool most frequently used in CS2. Notepad is a simple text editor while FirstPage is freeware offering button-embedded HTML tags. Although they differ with respect to the functionality offered, data analysis using DS1 revealed that the corresponding effort was similar, suggesting that confounding effects from the tools were controlled.
• As the subjects who participated in the case studies were either Computer Science Honours or postgraduate students, it is likely that they present skill sets similar to those of Web hypermedia professionals at the start of their careers.
• For both case studies, subjects were given forms, similar to those used in the Personal Software Process (PSP) [16], to enter effort data as development proceeded. Unfortunately, we were unable to guarantee that effort data had been rigorously recorded, and therefore we cannot claim that the effort data in either dataset is unbiased. However, we believe that this problem does not affect the analysis undertaken in this paper.
• Although the datasets employed in this study are student-based, they exhibit size measures and characteristics similar to those of industry-based Web hypermedia applications [29], and therefore do not hamper the results.

5. COMPARISON OF CBR TECHNIQUES
To compare the CBR techniques we used the jack-knife method (also known as leave-one-out cross-validation), a useful mechanism for validating the error of the prediction procedure employed [2]. For each dataset we repeated the steps described below once per project, i.e. 34 times for DS1 and 25 times for DS2. All projects in both datasets were completed projects for which actual effort was known.

Step 1: Project i (where i varies from 1 to 34 for DS1, and from 1 to 25 for DS2) is removed from the case base and treated as a new project for the purpose of the estimation procedure.
Step 2: The remaining 33 (or 24) projects are kept in the case base and used in the estimation process.
Step 3: The CBR tool finds the most similar cases, looking for projects whose feature values are similar to those of project i.
Step 4: Project i is added back to the case base.

For each cycle we calculated the MRE; a minimal sketch of this procedure is given below.
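The sketch below illustrates the jack-knife procedure under simplifying assumptions: the predictor is a plain 1-nearest-neighbour on a single size feature (not the CBR-Works configurations used in this study), and all project values are hypothetical.

```python
# Sketch of the jack-knife (leave-one-out) procedure: each project is
# withheld in turn, estimated from the remaining projects, and its
# MRE recorded. The predictor here is a 1-nearest-neighbour on one
# size feature; all values are hypothetical.
def jackknife(projects):
    mres = []
    for i, target in enumerate(projects):
        case_base = projects[:i] + projects[i + 1:]            # Steps 1-2
        closest = min(case_base,
                      key=lambda c: abs(c["size"] - target["size"]))  # Step 3
        estimated = closest["effort"]
        mres.append(abs(target["effort"] - estimated) / target["effort"])
    return sum(mres) / len(mres)  # MMRE over all cycles

projects = [{"size": 50, "effort": 50.0}, {"size": 80, "effort": 85.0},
            {"size": 30, "effort": 28.0}, {"size": 55, "effort": 60.0}]
print(jackknife(projects))
```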

The results in Tables 5 and 6 were obtained by considering four similarity measures (unweighted Euclidean (UE), weighted Euclidean using subjective weights (WESub), weighted Euclidean using correlation-based weights (WECorr) and Maximum (MX)), three choices for the number of cases (1, 2 and 3), three choices for case adaptation (mean, inverse rank weighted mean and median) and three choices covering the two types of adaptation rules (adaptation using no size weights, adaptation using subjective size weights and adaptation using correlation-based size weights).

The statistical significance of the results was based on absolute residuals, tested using the Wilcoxon Rank Sum Test, since the data was naturally paired. The confidence limit was set at α = 0.05. Non-parametric tests were employed since the absolute residuals for all the models used in this study were not normally distributed, as confirmed by the Kolmogorov-Smirnov test. Shaded areas in both tables (along the MMRE column) indicate the model(s) that presented statistically significantly better predictions than the others. Sometimes the shaded area includes two models across the same row, when their predictions are not statistically different from each other. Conversely, no shaded areas across a row mean that there was no best model in that group. The legend used in both tables is as follows:

UE – Unweighted Euclidean
WESub – Weighted Euclidean based on subjective weights
WECorr – Weighted Euclidean based on correlation coefficients (Pearson's for DS1 and Spearman's for DS2)
MX – Maximum
K – number of cases
CC – Closest Case
IRWM – Inverse Rank Weighted Mean

Figure 2 - Legend for Tables 5 and 6

The first interesting observation is the difference in accuracy between the two datasets, based on both MMRE and Pred(25) (Tables 5 and 6). If we take the suggestion [10] that a good prediction system should present an MMRE of no more than 25% and a Pred(25) of at least 75%, we can conclude that DS2, despite presenting absolute residuals for "No Adaptation" statistically better than those for all other groups, showed a poor accuracy level for all distances used. In contrast, DS1 in general presented good prediction accuracy, where the best predictions were obtained with no adaptation and with adaptation using no weights. There are several differences between DS1 and DS2 that might have influenced the results:

• The range of data for DS1 is smaller than that for DS2, suggesting that DS1 may be more homogeneous than DS2.
• The "cost" function presented in DS1 is continuous, revealed in hyperplanes, which form the basis for useful predictions [39], and in a strong linear relationship between size measures and effort. The "cost" function presented in DS2, on the other hand, is discontinuous, suggesting that DS2 has heterogeneous data.
• Linear size adjustment on a size measure adjusts the prediction assuming similar productivity in the matched cases. However, when similar cases have broadly varying productivity, errors in estimation can result [23]. That may help to partially explain why adaptation rules did not improve the results for DS2.
• DS1 is more "unspoiled" than DS2, if we consider dataset characteristics such as variables with normal distributions, no outliers and no collinearity [42]. DS1 has several variables with close to normal distributions, no outliers, and two of seven variables presenting collinearity. DS2 has no variables with distributions close to normal, has outliers, and three of six variables present collinearity. Clearly, DS2 is more "spoiled" than DS1.
• DS1 is larger than DS2. Previous work [42] shows that CBR always benefited from larger datasets and that extra cases were most valuable where there was a discontinuous "cost" function and/or the datasets were "spoiled" (outliers, or outliers plus collinearity).

Surprisingly, on both datasets, applying adaptation rules using weights did not give better predictions. This result suggests that a simple average of the intermediate efforts obtained for each size measure is better than weighting these intermediate efforts. Using the original datasets to carry out the calculations, rather than the normalised versions, may have also contributed to these results, as the range of values amongst size measures was sometimes very different.

Except for the weighted Euclidean distance based on subjective weights (WESub), the best predictions for DS1 were obtained without the use of adaptation rules and by applying adaptation rules without weights. Regarding WESub, the adaptation without weights did not give results as good as those obtained using no adaptation, suggesting that the size measures for the three closest cases in each retrieval episode were predominantly not as homogeneous as their corresponding effort.

The statistical significance comparing all distances across the column “no adaptation” (Table 5) for DS1 revealed that, except for the mean of 3 cases, all results obtained using the maximum distance were the worst, confirmed statistically. A boxplot of absolute residuals for the remaining three distances (Figure 3) illustrates residuals that do not differ much from each other.

Table 5 - Results for DS1 (each cell: MMRE / Pred(25))

Distance  K  Adaptation   No Adaptation   No Weights   Subjective Weights   Pearson Weights
UE        1  CC           12 / 100        23 / 100     22 / 72              22 / 72
UE        2  Mean         15 / 100        32 / 100     32 / 60              34 / 64
UE        2  IRWM         13 / 100        28 / 100     28 / 64              29 / 68
UE        3  Mean         14 / 100        30 / 100     29 / 52              31 / 60
UE        3  IRWM         13 / 100        28 / 100     28 / 64              29 / 68
UE        3  Median       14 / 100        21 / 100     23 / 68              24 / 64
WESub     1  CC           10 / 100        21 / 100     24 / 68              30 / 68
WESub     2  Mean         13 / 100        32 / 100     34 / 56              38 / 48
WESub     2  IRWM         12 / 100        26 / 100     30 / 64              34 / 60
WESub     3  Mean         13 / 100        31 / 100     32 / 56              36 / 44
WESub     3  IRWM         12 / 100        27 / 100     31 / 64              34 / 56
WESub     3  Median       14 / 100        23 / 100     20 / 68              25 / 52
WECorr    1  CC           11 / 100        24 / 100     21 / 72              24 / 72
WECorr    2  Mean         14 / 100        33 / 100     42 / 52              42 / 52
WECorr    2  IRWM         12 / 100        28 / 100     34 / 68              34 / 68
WECorr    3  Mean         13 / 100        32 / 100     37 / 52              38 / 56
WECorr    3  IRWM         12 / 100        28 / 100     35 / 60              36 / 60
WECorr    3  Median       15 / 100        24 / 100     24 / 64              27 / 56
MX        1  CC           32 / 100        20 / 100     20 / 56              21 / 56
MX        2  Mean         23 / 100        17 / 100     16 / 80              19 / 72
MX        2  IRWM         25 / 100        18 / 100     17 / 76              19 / 64
MX        3  Mean         25 / 100        14 / 100     21 / 88              22 / 88
MX        3  IRWM         23 / 100        15 / 100     17 / 84              19 / 84
MX        3  Median       31 / 100        16 / 100     14 / 88              17 / 80

Table 6 - Results for DS2 (each cell: MMRE / Pred(25))

Distance  K  Adaptation   No Adaptation   No Weights   Subjective Weights   Spearman Weights
UE        1  CC           83 / 68         87 / 28      94 / 24              112 / 24
UE        2  Mean         75 / 60         87 / 24      94 / 20              114 / 12
UE        2  IRWM         78 / 72         85 / 28      92 / 20              112 / 16
UE        3  Mean         74 / 56         74 / 20      79 / 20              96 / 16
UE        3  IRWM         77 / 64         79 / 24      86 / 20              104 / 12
UE        3  Median       83 / 60         56 / 32      58 / 28              65 / 12
WESub     1  CC           66 / 0          66 / 32      65 / 24              65 / 28
WESub     2  Mean         58 / 0          87 / 24      55 / 36              57 / 40
WESub     2  IRWM         58 / 0          85 / 28      56 / 36              57 / 44
WESub     3  Mean         58 / 0          74 / 20      61 / 36              57 / 28
WESub     3  IRWM         57 / 0          79 / 24      57 / 36              56 / 36
WESub     3  Median       60 / 0          56 / 32      57 / 28              59 / 20
WECorr    1  CC           158 / 20        252 / 20     229 / 12             242 / 16
WECorr    2  Mean         195 / 4         197 / 12     183 / 16             191 / 16
WECorr    2  IRWM         134 / 4         214 / 8      198 / 12             207 / 8
WECorr    3  Mean         118 / 0         174 / 8      162 / 16             169 / 12
WECorr    3  IRWM         125 / 0         191 / 16     177 / 20             185 / 16
WECorr    3  Median       118 / 12        152 / 4      152 / 16             151 / 12
MX        1  CC           64 / 36         156 / 24     170 / 16             193 / 28
MX        2  Mean         67 / 12         133 / 12     136 / 16             146 / 20
MX        2  IRWM         86 / 16         138 / 24     145 / 16             160 / 8
MX        3  Mean         64 / 8          143 / 32     141 / 32             136 / 20
MX        3  IRWM         86 / 12         140 / 20     141 / 24             146 / 24
MX        3  Median       88 / 8          120 / 16     117 / 16             115 / 12

However, a closer look revealed that the weighted Euclidean distance using weights based on Pearson's correlation coefficients, with the inverse rank weighted mean for three cases, is a good candidate for the best predictor for DS1 overall. Results for all distances across the column "adaptation no weights" (Table 5) for DS1 revealed that, except for the mean of three cases for the maximum distance, there were no statistically significant differences. Therefore, the best prediction in that column would be the maximum distance using the mean of three cases.







Figure 3. Boxplots of absolute residuals for column "no adaptation" for DS1

The statistical significance comparing all distances across the column “no adaptation” for DS2 (see Table 6) revealed that all results obtained using the unweighted Euclidean distance were the best, confirmed statistically. A boxplot of absolute residuals for the unweighted Euclidean distance illustrates residuals that do not differ very much from each other. A closer look does not reveal any promising candidates for best predictor. It would not make sense to compare both datasets looking for the one that gives the best prediction as DS1 clearly presented better estimations than DS2. Nevertheless, this does not mean that datasets that do not present a continuous “cost” function should be discarded. A possible approach to deal with DS2 might be to partition the data into smaller, more homogeneous datasets.

6. CONCLUSIONS
In this paper we compared several methods of CBR-based effort estimation. In particular, we investigated the use of adaptation rules as a contributing factor to better estimation accuracy. Two datasets were employed in the analysis, and the results showed that the best predictions were obtained for the dataset that presented a continuous "cost" function, reflected in a strong linear relationship between size and effort, and that was more "unspoiled" (no outliers, small collinearity). We believe the lessons to be learnt from our results are as follows:

• There seems to be a strong relationship between the nature of the "cost" function and prediction accuracy. Since our results corroborate previous work [42] (which used different datasets and size measures), we believe this indicates that such a relationship does indeed exist. For datasets that have a discontinuous "cost" function, the use of CBR adaptation rules does not seem to improve accuracy; for those with a continuous "cost" function, the absence of adaptation rules does not seem to hinder good prediction accuracy.
• Linear size adjustment adjusts effort assuming similar productivity in the matched cases. However, when similar cases have broadly varying productivity, errors in estimation can result [23]. That may help to partially explain why adaptation rules did not improve the results for DS2.
• The use of adaptation rules with size weights did not improve accuracy on either of the datasets employed. This may be a consequence of using very different size measures with varying ranges.

In addition, previous studies in the software engineering literature have shown that feature subset selection (FSS) prior to building case-based prediction systems can greatly improve their accuracy [24,25]. Unfortunately, as the CBR tool employed does not offer such functionality, we were unable to compare results using no FSS to those using FSS. That will be the aim of our future work.

Finally, other issues such as the choice of size measures also need further investigation, as they may influence the type of “cost” function a dataset presents and even the results for adaptation rules.

7. REFERENCES
[1] Ambler, S.W. Lessons in Agility from Internet-Based Development, IEEE Software, (March-April 2002), 66-73, 2002.
[2] Angelis, L., and Stamelos, I. A Simulation Tool for Efficient Analogy Based Cost Estimation, Empirical Software Engineering, 5, 35-68, 2000.
[3] Boehm, B. Software Engineering Economics, Prentice-Hall, Englewood Cliffs, N.J., 1981.
[4] Booch, G., Rumbaugh, J., and Jacobson, I. The Unified Modelling Language User Guide, Addison-Wesley, 1998.
[5] Botafogo, R., Rivlin, A.E., and Shneiderman, B. Structural Analysis of Hypertexts: Identifying Hierarchies and Useful Metrics, ACM TOIS, 10, 2, 143-179, 1992.
[6] Briand, L., and Wieczorek, I. Resource Modeling in Software Engineering, in Marciniak, J. (Ed.), Encyclopedia of Software Engineering, second edition, Wiley, 2002.
[7] Briand, L.C., El-Emam, K., Surmann, D., Wieczorek, I., and Maxwell, K.D. An Assessment and Comparison of Common Cost Estimation Modeling Techniques, in Proceedings ICSE 1999, Los Angeles, USA, 313-322, 1999.
[8] Briand, L.C., Langley, T., and Wieczorek, I. A Replicated Assessment and Comparison of Common Software Cost Modeling Techniques, in Proceedings ICSE 2000, Limerick, Ireland, 377-386, 2000.
[9] Christodoulou, S.P., Zafiris, P.A., and Papatheodorou, T.S. WWW2000: The Developer's View and a Practitioner's Approach to Web Engineering, in Proceedings 2nd ICSE Workshop on Web Engineering, 75-92, 2000.
[10] Conte, S., Dunsmore, H., and Shen, V. Software Engineering Metrics and Models, Benjamin/Cummings, Menlo Park, California, 1986.
[11] Cowderoy, A.J.C. Measures of Size and Complexity for Web-site Content, in Proceedings Combined 11th ESCOM Conference and 3rd SCOPE Conference on Software Product Quality, Munich, Germany, 423-431, 2000.
[12] Cowderoy, A.J.C., Donaldson, A.J.M., and Jenkins, J.O. A Metrics Framework for Multimedia Creation, in Proceedings 5th IEEE International Software Metrics Symposium, Maryland, USA, 1998.
[13] Fielding, R.T., and Taylor, R.N. Principled Design of the Modern Web Architecture, in Proceedings ICSE 2000, ACM, New York, NY, USA, 407-416, 2000.
[14] Finnie, G.R., Wittig, G.E., and Desharnais, J.-M. Estimating Software Development Effort with Case-Based Reasoning, in Proceedings 2nd International Conference on Case-Based Reasoning, 1997.
[15] Gray, R., MacDonell, S.G., and Shepperd, M.J. Factors Systematically Associated with Errors in Subjective Estimates of Software Development Effort: the Stability of Expert Judgement, in Proceedings IEEE 6th Metrics Symposium, 1999.
[16] Humphrey, W.S. A Discipline for Software Engineering, SEI Series in Software Engineering, Addison-Wesley, 1995.
[17] Jeffery, D.R., and Low, G.C. Calibrating Estimation Tools for Software Development, Software Engineering Journal, 5, 4, 215-221, 1990.
[18] Jeffery, R., Ruhe, M., and Wieczorek, I. A Comparative Study of Two Software Development Cost Modelling Techniques using Multi-organizational and Company-specific Data, Information and Software Technology, 42, 1009-1016, 2000.
[19] Jeffery, R., Ruhe, M., and Wieczorek, I. Using Public Domain Metrics to Estimate Software Development Effort, in Proceedings IEEE 7th Metrics Symposium, London, UK, 16-27, 2001.
[20] Kadoda, G., Cartwright, M., and Shepperd, M.J. Issues on the Effective Use of CBR Technology for Software Project Prediction, in Proceedings 4th International Conference on Case-Based Reasoning, Vancouver, Canada, July/August, 276-290, 2001.
[21] Kadoda, G., Cartwright, M., Chen, L., and Shepperd, M.J. Experiences Using Case-Based Reasoning to Predict Software Project Effort, in Proceedings EASE 2000 Conference, Keele, UK, 2000.
[22] Kemerer, C.F. An Empirical Validation of Software Cost Estimation Models, Communications of the ACM, 30, 5, 416-429, 1987.
[23] Kirsopp, C., Mendes, E., Premraj, R., and Shepperd, M. An Empirical Analysis of Adaptation Techniques for Case-Based Prediction, submitted paper, 2003.
[24] Kirsopp, C., Shepperd, M., and Hart, J. Search Heuristics, Case-Based Reasoning and Software Project Effort Prediction, in Proceedings Genetic and Evolutionary Computation Conference, New York, USA, July 9-13, 2002.
[25] Kirsopp, C., and Shepperd, M. Case and Feature Subset Selection in Case-Based Software Project Effort Prediction, Research and Development in Intelligent Systems XIX, Springer-Verlag, 2002.
[26] Lederer, A., and Prasad, J. Information Systems Software Cost Estimating: A Current Assessment, Journal of Information Technology, 8, 22-33, 1993.
[27] Mendes, E., Counsell, S., and Mosley, N. Measurement and Effort Prediction of Web Applications, in Proceedings 2nd ICSE Workshop on Web Engineering, June, Limerick, Ireland, 2000.
[28] Mendes, E., Mosley, N., and Counsell, S. Web Metrics – Estimating Design and Authoring Effort, IEEE Multimedia, Special Issue on Web Engineering, (January-March 2001), 50-57, 2001.
[29] Mendes, E., Mosley, N., and Counsell, S. Investigating Early Web Size Measures for Web Cost Estimation, in Proceedings EASE Conference, accepted for publication, 2003.
[30] Mendes, E., Mosley, N., and Watson, I. A Comparison of Case-Based Reasoning Approaches to Web Hypermedia Project Cost Estimation, in Proceedings 11th International World-Wide Web Conference, Hawaii, 2002.
[31] Mendes, E., Watson, I., Triggs, C., Mosley, N., and Counsell, S. A Comparative Study of Cost Estimation Models for Web Hypermedia Applications, Empirical Software Engineering, accepted for publication, 2002.
[32] Michau, F., Gentil, S., and Barrault, M. Expected Benefits of Web-based Learning for Engineering Education: Examples in Control Engineering, European Journal of Engineering Education, 26, 2, (June 2001), 151-168, 2001.
[33] Myrtveit, I., and Stensrud, E. A Controlled Experiment to Assess the Benefits of Estimating with Analogy and Regression Models, IEEE Transactions on Software Engineering, 25, 4, (July-August 1999), 510-525, 1999.
[34] Okamoto, S., and Satoh, K. An Average-case Analysis of the k-Nearest Neighbour Classifier, in Veloso, M., and Aamodt, A. (Eds.), CBR Research and Development, Lecture Notes in Artificial Intelligence 1010, Springer-Verlag, 1995.
[35] Pressman, R.S. What a Tangled Web We Weave, IEEE Software, (January-February 2000), 18-21, 2000.
[36] Prietula, M.J., Vincinanza, S.S., and Mukhopadhyay, T. Software Effort Estimation with a Case-Based Reasoner, Journal of Experimental & Theoretical Artificial Intelligence, 8, 341-363, 1996.
[37] Putnam, L.H. A General Empirical Solution to the Macro Sizing and Estimating Problem, IEEE Transactions on Software Engineering, SE-4, 4, 345-361, 1978.
[38] Ranwez, S., Leidig, T., and Crampes, M. Formalization to Improve Lifelong Learning, Journal of Interactive Learning Research, 11, 3-4, 389-409, 2000.
[39] Reifer, D.J. Ten Deadly Risks in Internet and Intranet Software Development, IEEE Software, (March-April 2002), 12-14, 2002.
[40] Reifer, D.J. Web Development: Estimating Quick-to-Market Software, IEEE Software, (November-December 2000), 57-64, 2000.
[41] Schulz, S. CBR-Works – A State-of-the-Art Shell for Case-Based Application Building, in Proceedings of the German Workshop on Case-Based Reasoning, 1995.
[42] Shepperd, M., and Kadoda, G. Comparing Software Prediction Techniques Using Simulation, IEEE Transactions on Software Engineering, 27, 11, (November 2001), 1014-1022, 2001.
[43] Shepperd, M.J., and Schofield, C. Estimating Software Project Effort Using Analogies, IEEE Transactions on Software Engineering, 23, 11, 736-743, 1997.
[44] Spiro, R.J., Feltovich, P.J., Jacobson, M.J., and Coulson, R.L. Cognitive Flexibility, Constructivism, and Hypertext: Random Access Instruction for Advanced Knowledge Acquisition in Ill-Structured Domains, in Steffe, L., and Gale, J. (Eds.), Constructivism, Erlbaum, Hillsdale, N.J., 1995.
[45] Stensrud, E., Foss, T., Kitchenham, B., and Myrtveit, I. A Further Empirical Investigation of the Relationship between MRE and Project Size, Empirical Software Engineering, accepted for publication, 2002.
[46] Tosca, S.P. The Lyrical Quality of Hypertext Links, in Proceedings 10th ACM Hypertext Conference, ACM, 217-218, 1999.
[47] Walkerden, F., and Jeffery, R. An Empirical Study of Analogy-based Software Effort Estimation, Empirical Software Engineering, 4, 2, (June 1999), 135-158, 1999.
[48] Watson, I. Applying Case-Based Reasoning: Techniques for Enterprise Systems, Morgan Kaufmann, San Francisco, USA, 1997.