Adjusted Case-Based Software Effort Estimation Using Bees Optimization Algorithm

Abstract. Case-Based Reasoning (CBR) has attracted considerable interest from researchers for solving non-trivial or ill-defined problems such as those encountered by project managers, including support for software project management in prediction and lessons learned. Software effort estimation is a key factor in successful software project management. In particular, the use of CBR for effort estimation has been favored over regression and other machine learning techniques due to its performance in generating reliable estimates. However, CBR is subject to a variety of design options that have a strong impact on prediction accuracy. Selecting the CBR adjustment method and deciding on the number of analogies are two such important decisions for generating accurate and reliable estimates. This paper proposes a new method that adjusts the retrieved projects' efforts and finds the optimal number of analogies using the Bees optimization algorithm. The Bees Algorithm is used to search for the best number of analogies and the feature coefficient values that reduce estimation errors. The results obtained are promising, and the proposed method could form a useful extension to the Case-based effort prediction model.

Keywords: Case-Based Reasoning, Software Effort Estimation, Bees Algorithm

1 Introduction

Software effort estimation has been one of the most challenging tasks in software engineering and has attracted considerable interest from both industry and the research community [1][2][15][18]. It remains popular within the research community and has become an active research topic in software engineering. Software effort estimation is very important at the early stages of software development, for project bidding and feasibility studies, and during software development, for resource allocation, risk evaluation and progress monitoring [11]. Machine Learning (ML) based prediction algorithms have been the most investigated models for the problem of software effort estimation [12][15]. Amongst them, Case-Based Reasoning (CBR) has received considerable attention because of its outstanding prediction performance when different data types are used [17]. CBR is a knowledge management technology based on the premise that history almost repeats itself, so problem solving can be based upon retrieval by similarity. CBR has been widely used for many software engineering problems, solving non-trivial or ill-defined problems such as those encountered by project managers, including support for software project management in prediction and lessons learned [2][3]. Figure 1 illustrates the process of CBR in software effort estimation. The case base consists of several cases described by a set of features that is divided into two parts: a problem description and a solution. The problem description consists of a set of features that may be continuous or categorical, whereas the solution is always described by the effort needed to accomplish the software project. Software effort datasets are characteristically noisy, and CBR methods are more capable of handling noisy datasets than regression-based models [21].
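The case-base structure and retrieval-by-similarity process described above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the feature values and efforts are hypothetical, the features are kept all-continuous for simplicity, and Euclidean distance is used as the similarity measure (as this paper does later).

```python
import math

# A case base as described above: each case pairs a problem description
# (a feature vector) with its solution (the actual effort).
# All values are hypothetical illustrations.
case_base = [
    {"features": [120.0, 5.0, 0.8], "effort": 2040.0},
    {"features": [300.0, 9.0, 1.2], "effort": 6800.0},
    {"features": [130.0, 6.0, 0.9], "effort": 2300.0},
]

def euclidean_distance(a, b):
    """Smaller distance means higher similarity between two projects."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(target_features, cases, k):
    """Retrieve the k historical cases most similar to the target project."""
    return sorted(cases,
                  key=lambda c: euclidean_distance(target_features,
                                                   c["features"]))[:k]

# Retrieval by similarity: the two nearest neighbours of a new project.
analogies = retrieve([125.0, 5.0, 0.8], case_base, k=2)
```

The retrieved analogies' efforts then serve as the raw material for prediction; how they are adjusted and combined is the subject of Section 3.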
But, like other ML methods, the performance of CBR is dataset dependent, with a large space of configuration possibilities and design options for each individual dataset [22]. It is therefore not surprising to see contradictory results and different performance figures [6].

Fig. 1. Process of CBR for software effort estimation (Shepperd & Kadoda, 2001)

The use of CBR in software cost estimation involves a number of significant, interrelated design options that should be carefully determined by the project manager when developing the model [2][12][22]. Such design options have a strong impact on the accuracy and reliability of CBR, including: characteristics of the dataset, case selection, feature selection, similarity measurement, case adjustment and the number of analogies [13][14]. The adjustment method is one of the most influential parameters of CBR, as it fits the case in hand and minimizes the variation between a new case and the retrieved cases. The use of adjustment requires some parameters to be set, such as the number of analogies (i.e. similar retrieved solutions) and the method used to make the adjustment. Our claim is that we can avoid sticking to a fixed best-performing number of analogies that changes from dataset to dataset. We use an alternative method to calibrate CBR by optimizing the feature similarity coefficients and the number of analogies. This paper employs the Bees Algorithm (BA) to optimize the number of analogies (K) and the weights used to adjust feature similarity values between the new case and the K analogies. The Bees Algorithm is a population-based search algorithm first proposed by [16] in 2005. The algorithm mimics the food foraging behavior of swarms of honey bees. In its basic version, the algorithm performs a kind of neighborhood search combined with random search and can be used for optimization [16]. The present paper investigates the improvement in effort estimation accuracy in CBR when BA is adopted to find the optimal number of analogies as well as the coefficient values used to adjust retrieved project efforts. To the best of our knowledge, BA has not previously been used in software effort estimation, and this paper shows its potential to provide more accurate results. The rest of this paper is organized as follows: Section 2 gives an overview of Case-based effort estimation and adjustment methods. Section 3 presents the proposed adjustment method. Section 4 presents the methodology of this study. Section 5 introduces threats to validity.
Section 6 presents the results of the empirical validation. Finally, Section 7 summarizes our work and outlines future studies.

2 Background

Case-based effort estimation makes a prediction for a new project by retrieving previously completed successful projects that have been encountered and remembered as historical projects [2][3][7][14][18][22]. Although CBR generates successful performance figures on certain datasets, it still suffers from local tuning problems when applied in another setting [10]. Local tuning requires finding the appropriate K analogies that fit the adjustment procedure and reflect the dataset characteristics, a process that is a challenge in its own right [21][22]. In the literature, various methods have been used to determine the best number of analogies, such as nearest neighbor algorithms like k-NN [13], and expert guidance or goal-based preference [22]. Idri et al. [8] suggested using all projects that fall within a certain similarity threshold. This approach could ignore some useful cases that might contribute better estimates when the similarity between selected and unselected cases is negligible. On the other hand, several researchers suggested using a fixed number of analogies (K = 1, 2, etc.), which is somewhat simpler than the first approach but depends heavily on the estimator's intuition [6][13][14][20]. Another study focusing on analogy selection in the context of CBR was conducted by Li et al. [11]. In their study they performed rigorous trials on actual and artificial datasets and observed the effect of various K values. Concerning the adjustment procedure, various methods have been developed for effort adjustment, such as the un-weighted mean [13], weighted mean [14], and median [12] of the closest efforts. However, these adjustment methods perform global tuning and are applied directly to the retrieved efforts without capturing the actual differences between the target project and the retrieved projects. Mendes et al. [14] carried out several studies to check the impact of case adjustment rules on the prediction accuracy of analogy estimation. Their adjustment rules considered the use of linear extrapolation along the dimension of all continuous features that are strongly correlated with effort. On the other hand, Li J. [11] proposed another adjustment approach (called AQUA) using the similarity degrees of all K analogies as weight drivers to indicate the strength of the relationship between a similarity metric and effort. Jorgensen et al. [9][19] investigated the use of the 'Regression Towards the Mean' (RTM) method, based on the adjusted productivity of the new project and the productivity of the closest analogies. This method is more suitable when the selected analogues are extreme and coherent groups are available. They indicated that adjusted estimation using the RTM method follows the same estimation procedure conducted by experts. Chiu & Huang [7] investigated the use of Genetic Algorithms (GA) based on project distance to optimize the similarity between the target project and its closest projects. More recently, Li et al.
[12] proposed an alternative approach for analogy-based software cost estimation using a non-linear method (neural networks). The method is suitable for complex non-uniform datasets, as it has the ability to learn the difference between the target project and the top similar projects. However, we believe that reflecting on the dataset before applying different algorithms under multiple settings is more significant. Even this is not enough, because the selection of K analogies is not only dataset dependent but also adjustment-method dependent. We can therefore conclude that finding the optimal K analogies is an optimization problem that depends on the choice of adjustment method. In this study we propose making use of BA to address this challenge. The proposed approach is described in the next section.

3 The Proposed Method (CBR+)

CBR adjustment is a technique used to derive a new solution by minimizing the differences between the retrieved analogues and the target project (for which the estimation is required) [12]. It is an important step in Case-based effort prediction, as it reflects the structure of the analogies on the target case. This technique can be represented mathematically as a function that captures the differences between the problem description of the target project and the descriptions of its analogies, in an attempt to generate a more reasonable solution. In Case-based effort prediction this function can be formulated as depicted in Eq. 1:

Effort(Pt) = F(Pt, P1, P2, ..., PK)    (1)
where Pt is the target project and P1 to PK are the top K projects most similar to the target. The similarity degree is assessed using Euclidean distance. F is the adjustment function used to capture the differences between Pt and the top similar projects, and then convert these differences into a change in the effort value. The adjustment function used in this study is illustrated in equations 2 and 3, where wj is the optimization coefficient. The proposed CBR method (hereafter called CBR+) exploits the search capability of BA to overcome the local tuning problem of effort adjustment. More specifically, the task is to search for appropriate weights (wj) and K values such that the performance measure is minimized.

Effort(pti) = Effort(pi) + ∑j=1..M wj × (ftj − fij)    (2)

Effort(pt) = (1/K) ∑i=1..K Effort(pti)    (3)
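Equations 2 and 3 can be read as: each retrieved analogy's effort is shifted by the weighted sum of its feature differences from the target, and the adjusted efforts are then averaged over the K analogies. The following small Python sketch illustrates this; all feature values, efforts and weights are hypothetical, and the weights wj stand for the coefficients that the Bees Algorithm later optimizes.

```python
def adjust_effort(target_features, analogy, weights):
    """Eq. 2: shift the analogy's effort by the weighted feature differences."""
    delta = sum(w * (ft - fi)
                for w, ft, fi in zip(weights, target_features,
                                     analogy["features"]))
    return analogy["effort"] + delta

def estimate(target_features, analogies, weights):
    """Eq. 3: average the adjusted efforts over the K retrieved analogies."""
    adjusted = [adjust_effort(target_features, a, weights) for a in analogies]
    return sum(adjusted) / len(adjusted)

# Hypothetical target project with two features and K = 2 analogies.
analogies = [
    {"features": [10.0, 2.0], "effort": 100.0},
    {"features": [14.0, 4.0], "effort": 160.0},
]
est = estimate([12.0, 3.0], analogies, weights=[5.0, 10.0])
# adjusted efforts: 100 + 20 = 120 and 160 - 20 = 140, so est = 130.0
```

Note how the sign of each feature difference matters: the first analogy is "smaller" than the target and is adjusted upward, while the second is "larger" and is adjusted downward.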

where M is the number of features, ftj is the jth feature value of the target project, and fij is the jth feature value of the analogy project Pi. Before starting, the BA parameters must be carefully set [16]. These parameters are: the problem size (Q), the number of scout bees (n), the number of sites selected out of the n visited sites (s), the number of best sites out of the s selected sites (e), the number of bees recruited for the best e sites (nep), the number of bees recruited for the other selected sites (nsp), the number of other bees (osp), and the initial size of the patches (ngh), where a patch includes a site and its neighborhood. In addition, a stopping criterion is required, which in our study is to minimize the MMRE performance measure. The algorithm starts with an initial population of n scout bees. Each bee represents a potential solution as a set of K analogy coefficient values. The scout bees are placed randomly in the initial search space. The fitness computation is carried out, using leave-one-out cross validation, for each site visited by a scout bee, by calculating the Mean Magnitude of Relative Error (MMRE). This step is essential for colony communication, which shows the direction in which a flower patch will be found, its distance from the hive, and its fitness [16]. This information helps the colony to send its bees to flower patches precisely, without using guides or maps. Then the best sites visited by the fittest bees are selected for neighborhood search. The area of neighborhood search is determined by identifying the radius of the search area around the best site, which is considered the key operation

of BA. The algorithm continues searching in the neighborhood of the selected sites, recruiting more bees to search near the best sites, which may contain promising solutions. The bees can be chosen directly according to the fitnesses associated with the sites they are visiting; alternatively, the fitness values are used to determine the probability of a bee being selected. The fittest bee from each patch is then selected to form the next bee population. Our aim here is to reduce the number of points to be explored. Finally, the remaining bees are assigned to search randomly for new potential solutions. These steps are repeated until the stopping criterion is met or the number of iterations is exhausted. At the end of each iteration, the colony's new population has two parts: the fittest representatives from the patches, and the bees that were sent out randomly. The pseudo code of BA is shown in Figure 2.

Input: Q, n, s, e, ngh, nep, nsp
Output: BestBee
Population ← InitializePopulation(n, Q)
while (¬StopCondition())
    MMRE ← EvaluatePopulation(Population)
    BestBee ← GetBestSolution(Population)
    NextGeneration ← ∅
    ngh ← ngh × PatchDecreaseFactor
    Sitesbest ← SelectBestSites(Population, s)
    for (Sitei ∈ Sitesbest)
        if (i ≤ e)
            RecruitedBees ← nep      // more bees for the e elite sites
        else
            RecruitedBees ← nsp
        Neighborhood ← CreateNeighborhood(Sitei, ngh)
        NextGeneration ← NextGeneration ∪ {FittestBee(Neighborhood, RecruitedBees)}
    end for
    RemainingBees ← CreateScoutBees(n − s)
    Population ← NextGeneration ∪ RemainingBees
end while
return BestBee

Fig. 2. Pseudo code of the Bees Algorithm
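The search loop described above can be sketched in Python as follows. This is a much-simplified illustration, not the authors' implementation: it assumes the fitness described in the text (leave-one-out MMRE over the case base, with the adjustment of Eqs. 2 and 3), collapses the elite/non-elite recruitment split into one recruitment count per selected site, omits patch shrinking, and uses hypothetical default parameter values. All function names here are our own.

```python
import random

def loo_mmre(dataset, k, weights):
    """Leave-one-out fitness: Mean Magnitude of Relative Error when each
    project is estimated from its k nearest analogies, adjusted as in
    Eqs. 2-3. Assumes all recorded efforts are positive."""
    errors = []
    for i, target in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]
        analogies = sorted(
            rest,
            key=lambda c: sum((a - b) ** 2
                              for a, b in zip(target["features"],
                                              c["features"])))[:k]
        est = sum(c["effort"] + sum(w * (ft - fi)
                                    for w, ft, fi in zip(weights,
                                                         target["features"],
                                                         c["features"]))
                  for c in analogies) / len(analogies)
        errors.append(abs(target["effort"] - est) / target["effort"])
    return sum(errors) / len(errors)

def bees_search(dataset, n=10, s=3, nep=5, ngh=0.25, iterations=20, seed=1):
    """Bees-Algorithm-style search for (k, weights) minimizing loo_mmre.
    A bee is one candidate solution: a number of analogies k plus one
    adjustment weight per feature."""
    rng = random.Random(seed)
    m = len(dataset[0]["features"])
    max_k = len(dataset) - 1

    def scout():  # random placement in the search space
        return (rng.randint(1, max_k),
                [rng.uniform(-1.0, 1.0) for _ in range(m)])

    population = [scout() for _ in range(n)]
    best = min(population, key=lambda b: loo_mmre(dataset, *b))
    for _ in range(iterations):
        population.sort(key=lambda b: loo_mmre(dataset, *b))
        next_gen = []
        for k, w in population[:s]:  # neighborhood search around best sites
            patch = [(k, [wi + rng.uniform(-ngh, ngh) for wi in w])
                     for _ in range(nep)] + [(k, w)]
            next_gen.append(min(patch, key=lambda b: loo_mmre(dataset, *b)))
        next_gen += [scout() for _ in range(n - s)]  # remaining bees scout
        population = next_gen
        best = min(population + [best], key=lambda b: loo_mmre(dataset, *b))
    return best
```

As in the pseudo code, each new generation combines the fittest bee from each selected patch with freshly placed random scouts, and the best (k, weights) pair seen so far is retained as the answer.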