Software Cost Estimation Models Using Radial Basis Function Neural

Software Cost Estimation Models Using Radial Basis Function Neural Networks Ali Idri1, Azeddine Zahi2, Emilia Mendes3, and Abdelali Zakrani1 1

Department of Software Engineering, ENSIAS, Mohamed V University Rabat, Morocco [email protected], [email protected] 2 Department of Computer Science FST, Sidi Mohamed Ben Abdellah Universit Fez, Morocco [email protected] 3 Computer Science Department, The University of Auckland, Private Bag 92019 Auckland, New Zealand [email protected]

Abstract. Radial Basis Function Neural Networks (RBFN) have been recently studied due to their qualification as an universal function approximation. This paper investigates the use of RBF neural networks for software cost estimation. The focus of this study is on the design of these networks, especially their middle layer composed of receptive fields, using two clustering techniques: the Cmeans and the APC-III algorithms. A comparison between a RBFN using C-means and a RBFN using APC-III, in terms of estimates accuracy, is hence presented. This study uses the COCOMO’81 dataset and data on Web applications from the Tukutuku database. Keywords: Software effort estimation, Neural Networks, predictive accuracy, Radial basis function neural networks.

1 Introduction Estimating software development effort remains a complex problem, and one which continues to attract considerable research attention. Improving the accuracy of the effort estimation models available to project managers would facilitate more effective control of time and budgets during software development. In order to make accurate estimates and avoid gross mis-estimations, several cost estimation techniques have been proposed, grouped into two major categories: (1) parametric models, which are derived from the statistical or numerical analysis of historical projects data [2], and (2) non-parametric models, which are based on a set of artificial intelligence techniques such as artificial neural networks, analogy-based reasoning, regression trees, genetic algorithms and rule-based induction [3,8,19,20,21]. In this paper, we focus on non-parametric cost estimation models based on artificial neural networks, and in particular Radial Basis Function Neural Networks. Based on biological receptive fields, Moody and Darken proposed a network architecture referred to as the Radial Basis Function Network (RBFN), which employs J.J. Cuadrado-Gallego et al. (Eds.): IWSM-Mensura 2007, LNCS 4895, pp. 21–31, 2008. © Springer-Verlag Berlin Heidelberg 2008

22

A. Idri et al.

local receptive fields to perform function mappings [16]. RBFN can be used for a wide range of applications, primarily because it can approximate any regular function [17]. A RBFN is a three-layer feedforward network consisting of one input layer, one middle-layer and an output layer. Fig. 1 illustrates a possible RBFN architecture configured for web development effort. The RBFN generates output (effort) by propagating the initial inputs (cost drivers) through the middle-layer to the final output layer. Each input neuron corresponds to a component of an input vector. The middle layer contains M neurons, plus, eventually, one bias neuron. Each input neuron is fully connected to the middle-layer neurons. The activation function of each middle neuron is usually the Gaussian function:

f ( x) = e

(−

x −ci

σ i2

2

)

(1)

where ci and σi are the center and the width of the ith middle neuron respectively. . denotes the Euclidean distance. Hence, each ith hidden neuron has its own receptive field in the input space, a region centered on ci with size proportional to σi.. The Gaussian function decreases rapidly if the width σi is small, and slowly if it is large. The output layer consists of one output neuron that computes the development effort as a linear weighted sum of the middle layer outputs as follows:

Effort =

M

∑

y jβ j

with

yj =e

(−

x−c j

σ 2j

2

)

(2)

j =1

The use of an RBFN to estimate software development effort requires the determination of the middle-layer parameters (receptive fields) and the weights βj. The choice of the receptive fields, especially their distribution in the input space is often critical to the successful performance of an RBF network [18]. In general, there are two primary sources of this knowledge:

• •

The use of some clustering techniques to analyze and find clusters in the training data. The results of this grouping are used to establish prototypes of the receptive fields. The use of existing empirical domain knowledge to form the set of receptive fields. Along this line emerge some fuzzy models; in particular, some interesting equivalence between RBF networks and Fuzzy Rule-Based systems [11].

The aim of this study is to discuss the preprocessing phase of the RBFN networks in designing software cost estimation as it occurs through data clustering techniques. More especially, we use two clustering techniques: 1) the APC-III algorithm developed by Hwang and Bang [6], and 2) the best-known clustering method that is the Cmeans algorithm. In an earlier work [10], we have empirically evaluated these two clustering techniques, when designing RBF neural networks for software cost estimation, using the COCOMO’81 dataset. The RBFN designed with the C-means algorithm performed better, in terms of cost estimates accuracy, than the RBFN designed with the APC-III algorithm [10]. To confirm those results, this paper replicates that study using a dataset of 53 Web projects from the Tukutuku database [14].

Software Cost Estimation Models Using Radial Basis Function Neural Networks

Input layer

Hidden layer ci, Vi

Teamexp

Output

y1

Devteam

Ei Effort

y2

Webpages

23

Img

y3

Anim

Fig. 1. An example of Radial Basis Function Network architecture for Web development effort

This paper is structured as follows. Section 2 briefly describes how the two clustering algorithms APC-III and C-means, can be used in the design of an RBF neural network. In Section 3, the empirical results obtained using an RBFN on the COCOMO’81 and the Tukutuku datasets are discussed and compared in terms of prediction accuracy. Conclusions and an overview of future work are presented in Section 5.

2 Clustering Techniques for RBF Neural Networks Clustering techniques have been successfully used in many application domains, including biology, medicine, economics, and patterns recognition [1,4,5,12,13]. These techniques can be grouped into major categories: Hierarchical or Partitional [5]. In this paper, we focus only on the Partitional clustering algorithm since it is used more frequently than other clustering algorithms in pattern recognition fields. Generally, Partitional clustering algorithms suppose that the data set can be well represented by finite prototypes. Partitional clustering is also referred to as an objective functionbased clustering algorithm. Clustering has been often used as a pre-processing phase in the design of RBF neural networks, where its primary aim is to set up an initial distribution of the receptive fields (hidden neurons) across the input variables space of the input variables. In particular, this implies a location of the modal values of these fields (e.g., the modes of the Gaussian functions). In this work, we use and compare the C-means and the APC-III clustering algorithms to determine the receptive fields of an RBF network for software cost estimation [9,10]. These two clustering techniques are briefly presented next. APC-III is a one-pass clustering algorithm with a constant radius R0 defined as follows: R0 = α

1 N

N

( Pi − P j ∑ min i≠ j

)

(3)

i =1

Where N is the number of historical software projects and α is a predetermined constant. R0 expresses the radius of each cluster and consequently it controls the number of the clusters provided. APC-III generates many clusters if R0 is small and few clusters if it is large. The outline of the APC-III algorithm can be stated as follows:

24

A. Idri et al.

1- Initially, the number of clusters is set at 1; the center of this cluster C1 is the first software project in the dataset, say P1 2- Iterate from i=2 to N (N is the number of historical software projects) a. For j=1 to c (c is the number of clusters) • Compute dij (dij is the Euclidean distance of Pi and cj, cj is the center of cluster Cj) • If dij 1 ⎪ x j ∈Ci max σ k if card (Ci ) = 1 ⎪k / card ( Ck ) >1 ⎩

σi = ⎨

(6)

⎧max d ( x j , ci ) if card (Ci ) > 1 ⎪ x j ∈Ci min σ k if card (C i ) = 1 ⎪k / card (Ck ) >1 ⎩

σi = ⎨

(7)

where card(Ci) is the cardinality of cluster Ci. Concerning the weights βj, we may set each βj to the associated effort of the center of the jth neuron. However, this technique is not optimal and does not take into account the overlapping that may exist between receptive fields of the hidden layer. Thus, we use the Delta rule to derive the values of βj.

3 Data Description In this work, we evaluate empirically these two clustering techniques when designing RBF neural networks for software cost estimation based on two different databases: The COCOMO’81 dataset and data web from Tukutuku database. The COCOMO’81 dataset contains 252 software projects which are mostly scientific applications developed by Fortran [2,10]; Each project is described by 13 attributes (see Table 1) : the software size measured in KDSI (Kilo Delivered Source Instructions) and the remaining 12 attributes are measured on a scale composed of six linguistic values: ‘very low’, ‘low’, ‘nominal’, ‘high’, ‘very high’ and ‘extra high’. These 12 attributes are related to the software development environment such as the experience of the personnel involved in the software project, the method used in the development and the time and storage constraints imposed on the software. The Tukutuku dataset contains 53 Web projects [14,15]. Each Web application is described using 9 numerical attributes such as: the number of html or shtml files used, the number of media files and team experience (see Table 2). However, each project volunteered to the Tukutuku database was initially characterized using more than 9 software attributes, but some of them were grouped together. For example, we grouped together the following three attributes: number of new Web pages developed by the team, number of Web pages provided by the customer and the number of Web pages developed by a third party (outsourced) in one attribute reflecting the total number of Web pages in the application (Webpages).

26

A. Idri et al. Table 1. Software attributes for COCOMO’81

Software attribute

SIZE DATA TIME STOR VIRTMIN VIRT MAJ TURN ACAP AEXP PCAP VEXP LEXP SCED

Description

Software Size Database Size Execution Time Constraint Main Storage Constraint Virtual Machine Volatility Virtual Machine Volatility Computer Turnaround Analyst Capability Applications Experience Programmer Capability Virtual Machine Experience Programming Language Experience Required Development Table 2. Software attributes for the web dataset

Attribute Teamexp

Description Average number of years of experience the team has on Web development

Devteam

Number of people who worked on the software project

Webpages

Number of web pages in the application

Textpages

Number of text pages in the application(text page has 600 words)

Img

Number of images in the application

Anim

Number of animations in the application

Audio/video

Number of audio/video files in the application

Tot-high

Number of high effort features in the application

Tot-nhigh

Number of low effort features in the application

4 Empirical Results The following section presents and discusses the results obtained when applying the RBFN to the COCOMO’81 and the Tukutuku datasets. The calculations were made using two software prototypes developed using the C programming language under a Microsoft Windows PC environment. The first software prototype implements the APC-III and the C-means clustering algorithms, providing both the clusters and their centers from the database. The second software prototype implements a cost estimation model based on an RBFN architecture in which the middle-layer parameters are determined by means of the first software prototype. The accuracy of the estimates generated by the RBFN is evaluated by means of the Mean Magnitude of Relative Error MMRE and the Prediction level Pred defined as follows:

Software Cost Estimation Models Using Radial Basis Function Neural Networks

MMRE =

1 N

N

Effortactual,i − Effortestimated ,i

i =1

Effortactual,i

∑

Pred(p) =

× 100

k N

27

(8) (9)

where N is the total number of observations, k is the number of observations with an MRE less than or equal to p. A common value for p is 25. To decide on the number of hidden units, we have conducted several experiments with both the C-means and the APC-III algorithms. These experiments use the full dataset for training. Choosing the ‘best’ classification to determine the number of hidden neurons and their centers is a complex task. For software cost estimation, we suggest that the ‘best’ classification is the one that satisfies the following criteria:

It provides coherent clusters, i.e. the software projects of a given cluster have satisfactory degrees of similarity; It improves the accuracy of the RBFN.

where N is the total number of observations, k is the number of observations with an MRE less than or equal to p. A common value for p is 25. 4.1 RBFN with C-Means Algorithm To measure the coherence of clusters in the case of the C-means algorithm, one of the two criteria was used: the objective function J given in Equation 5 or the Dunn’s index defined by the following Equation:

⎛ ⎛ d (C , C ) ⎞ ⎞ ⎜ i j ⎜ ⎟⎟ D1 = min ⎜ min ⎜ ⎟ 1≤i ≤ c⎜ i +1≤ j ≤ c −1⎜ max (d (C k )) ⎟ ⎟⎟ ⎝ 1≤ k ≤c ⎠⎠ ⎝ d (C i , C j ) =

min

xi ∈Ci , x j ∈C j

d ( xi , x j )

(10)

d (C k ) = max d ( x j , x k ) xi , x j ∈Ck

where d(Ci,Cj) is the distance between clusters Ci, and Cj (intercluster distance); and d(Ck) is the intracluster distance of cluster Ck. The Dunn’s index (D1) expresses the idea of identifying clusters that are compact and well separated. The main goal of the measure (D1) is to maximize the intercluster distances and minimize the intra-cluster distances. Therefore, the number of cluster that maximizes D1 is taken as the optimal number of the clusters. Several experiments were conducted with an RBFN, each time using one of the classifications generated by the C-means algorithm. These experiments used the full set of historical projects for training and testing. For each number of clusters (c), the RBFN used the two C-means classifications that respectively minimise J or maximise D1. For instance, when fixing c to 34 for the Tukutuku dataset, the two chosen classifications were respectively those for which J is equal to 9299.78 and D1 is equal to 7.14. Fig. 2-a and Fig 2-b show the relationship between the accuracy of the RBFN on the Tukutuku and the COCOMO’81 datasets, measured in terms of Pred(0.25) and

28

A. Idri et al.

MMRE, and the used classifications (number of clusters) minimizing the objective function J or maximizing the D1 index. It can be noticed for both datasets that the accuracy of the RBFN when using the C-means classification that minimizes J (Pred_J and MMRE_J) is superior to the accuracy obtained when using the classification maximizing D1 (Pred_D1 and MMRE_D1). Fig. 2-a (resp. Figure 2-b) shows only the results of experiments for the Tukutuku dataset (resp. the COCOMO’81 dataset) when the number of clusters is higher than 34 (resp. is higher than 120) because the evaluated accuracy of the RBFN is acceptable (the common values used in the literature are Pred(25)>= 70 and MMRE