Multi-Objective Optimization by CBR GA-Optimizer for Module-Order Modeling Taghi M. Khoshgoftaar, Yudong Xiao, and Kehan Gao Florida Atlantic University, Boca Raton, Florida USA
Abstract

In the case when resources allocated for software quality improvement are limited or unknown, an estimation of the relative rank-order of modules based on a quality factor, such as the number of faults, is of practical importance to the software quality assurance team. This is because improvements can be targeted toward a set of the most faulty modules according to resource availability. A module-order model (MOM) can be used to determine the relative rank-order of modules. A MOM usually ranks the modules according to the predicted number of faults obtained from an underlying quantitative prediction technique, such as multiple linear regression or case-based reasoning. In this paper we propose a computational intelligence-based method for targeting the performance behavior of MOM(s). The method maximizes the number of faults accounted for by the given percentage of modules enhanced. A new modeling tool called the CBR GA-optimizer is developed through a synergy of genetic algorithms (GA) and case-based reasoning (CBR). The tool automatically finds the best CBR fault prediction models according to a project-specific objective function.

1 Introduction

Software quality classification models that identify software modules as fault-prone or not fault-prone [1, 6, 8, 11] have been used to direct enhancement resources toward the low-quality modules. The degree of software quality improvement effort is dependent on the availability of reliability enhancement resources. Software quality classification models require that the individual quality-based groups be defined prior to modeling, usually via a threshold on the number of faults expected. However, in software engineering practice the software management team often cannot choose an appropriate quality threshold at the time of modeling. Therefore, a prediction of the rank-order of modules from the most to the least faulty is more practical. Based on such a predicted rank-order, the software quality management team can target a set of the most faulty modules for enhancement as per the available resources. A module-order model (MOM) is a technique that estimates the rank-order of modules according to a quantitative quality factor, such as the number of faults. A MOM is constructed through an underlying quantitative prediction model, such as multiple linear regression [7] or case-based reasoning [8]. Previous works associated with MOM(s) mainly concentrate on how to improve the accuracy of the quantitative prediction model by minimizing the average, relative, or mean square errors [7]. However, it is the predicted rankings of the program modules that affect the behavior of MOM(s), and not the predicted value of the quality factor.

In this paper, we propose a method that directly targets the performance behavior of MOM(s). More specifically, for a given number of modules enhanced, we are interested in maximizing the number of faults accounted for by the prediction models. A genetic algorithm is ideally suited (in conjunction with an underlying prediction model) for such a direct optimization. Existing software quality prediction techniques, such as multiple linear regression and case-based reasoning, cannot achieve such a direct optimization, because the optimization objective is often highly discontinuous with multiple minima or maxima. By combining genetic algorithms (GA) [2, 10, 13] with case-based reasoning (CBR), a CBR GA-optimizer tool is developed. The underlying quantitative model used is the one obtained by CBR, while the GA-optimizer automatically finds the best CBR models according to a given objective function. The developed tool can be used to solve the multi-objective optimization problem [4, 14] related to CBR. Instead of minimizing a quantitative error, such as the average absolute error (AAE), for the underlying prediction model, we directly maximize the MOM performances for a given set of cutoff percentiles, i.e., the percentages of modules enhanced for the given data set.

The proposed methodology is validated through a case study of a full-scale industrial software system. Four performance goals were chosen for optimization purposes: maximizing the MOM performances when 5%, 10%, 20%, and 30% of the total number of modules are enhanced. This selection is based on our discussion with, and the inputs provided by, the software management team of the system under consideration. The four performance goals are combined into an objective function by empirically determining the appropriate weights. To justify the use of a sophisticated software quality modeling method such as the one proposed, we compared the MOM performances based on the GA-CBR technique with those based on ordering by the lines-of-code (LOC) metric. It is shown that the proposed GA-CBR-based MOM generally performed better than ordering by LOC. To our knowledge, this is the first study to use GA to implement a performance optimization for building MOM(s) based on CBR.

The remainder of this paper continues with the next section, which presents the details of the module-order modeling technique. This is followed by a section that presents the case-based reasoning and CBR GA-optimizer modeling methods, and a section that discusses our case study of a wireless configuration software system. Finally, the conclusion and suggestions for future work are presented in the last section.
2 Module-Order Modeling

A module-order model (MOM) is used to predict the relative rank-order, and hence software quality, of each program module based on a set of product and process metrics. The primary advantage of using a MOM over a software quality classification model is that it enables project managers to enhance as many modules, beginning with the most faulty in the ranked list, as the available resources allow. Usually, a MOM is calibrated according to the following three steps: (1) build an underlying quantitative software quality prediction model, such as a software fault prediction model; (2) rank the program modules according to a quality measure predicted by the underlying model; (3) evaluate the accuracy of the predicted ranking.

Initially, a quantitative software quality prediction model is calibrated to predict the dependent variable, which in our studies is the number of faults associated with a program module. The software fault prediction modeling technique used in our study is CBR. For a given quantitative model, the number of faults, $F_i$, in module $i$ is a function $f$ of its software measurements, the vector $\mathbf{x}_i$. Let $\hat{F}_i$ be the estimate of $F_i$ by a fitted model, $\hat{f}$. In module-order modeling, the predicted values of the dependent variable obtained by $\hat{f}$ are only used to obtain the (predicted) relative order of each program module.

Let $p_i$ be the percentile rank of observation $i$ in a perfect ranking of the modules according to $F_i$. Let $\hat{p}_i$ be the percentile rank of observation $i$ in the predicted ranking according to $\hat{F}_i$. The following steps illustrate the evaluation procedure for a module-order model. Given a prediction model and a data set of modules indexed by $i$:

1. Management will choose to enhance modules in a priority order, beginning with the most faulty. Determine a range of percentiles that covers management's options for the last module that will be enhanced, based on the schedule and resources allocated for reliability enhancement. Choose a set of representative cutoff percentiles, $c$, from that range.

2. For each $c$, determine the number of faults accounted for by the modules above the percentile $c$. This is done for both the perfect and the predicted ranking of the modules: $G(c)$ is the number of faults accounted for by the modules that are ranked (perfect ranking) above the percentile $c$, and $\hat{G}(c)$ is the number of faults accounted for by the modules that are predicted as falling above the percentile $c$. Therefore, a higher value of $c$ corresponds to the more faulty modules.

$$G(c) = \sum_{i:\, p_i \geq c} F_i \qquad (1)$$

$$\hat{G}(c) = \sum_{i:\, \hat{p}_i \geq c} F_i \qquad (2)$$

3. Calculate the performance of the MOM, $\phi(c)$, which indicates how closely the faults accounted for by the model ranking match those of the perfect module ranking:

$$\phi(c) = \frac{\hat{G}(c)}{G(c)} \qquad (3)$$

$\phi(c)$ is a number between 0 and 1. It is desired that $\phi(c)$ be close to 100% (or 1) for the cutoff percentiles of interest.

4. After evaluating the performance of a MOM, it is ready for use on a currently under-development similar project or a subsequent release. Determine the predicted ranking by ordering the modules in the current data set according to $\hat{F}_i$, and subsequently compute the respective $\phi(c)$ values.
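The evaluation procedure above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation; the function name and array-based interface are assumptions made for the example.

```python
import numpy as np

def mom_performance(actual_faults, predicted_faults, cutoffs):
    """For each cutoff percentile c, compute G(c), G_hat(c), and
    phi(c) = G_hat(c) / G(c), following equations (1)-(3)."""
    actual = np.asarray(actual_faults, dtype=float)
    predicted = np.asarray(predicted_faults, dtype=float)
    n = len(actual)
    perfect_order = np.argsort(-actual)       # most faulty first (perfect ranking)
    predicted_order = np.argsort(-predicted)  # ranking by the fitted model
    results = {}
    for c in cutoffs:
        top = int(round((1.0 - c) * n))       # modules above percentile c
        g = actual[perfect_order[:top]].sum()        # G(c), eq. (1)
        g_hat = actual[predicted_order[:top]].sum()  # G_hat(c), eq. (2)
        results[c] = (g, g_hat, g_hat / g if g > 0 else 1.0)  # phi(c), eq. (3)
    return results
```

For instance, with actual faults [10, 5, 3, 1] and predicted values [9, 1, 4, 0], the cutoff c = 0.5 selects the top two modules: the perfect ranking accounts for 15 faults and the predicted ranking for 13, giving phi(0.5) of roughly 0.867.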
3 Case-Based Reasoning

A CBR system [9] attempts to find a solution to a new problem based on previous experiences, represented by a case-base or case library. A solution algorithm uses a similarity function to measure the relationship between the new
problem and each case in the case-base, and finally retrieves relevant case(s) and determines a solution to the new problem. A CBR system, therefore, consists of three major components: a case-base, a similarity function, and a solution algorithm. Information related to past cases is stored in a case-base, which is often the training data set. A case is composed of a set of independent variables and a dependent variable, which in our study is the number of faults. Using the cases in the case-base, a model is trained and is then applied to a test data set, which contains information related to program modules of a similar project. In order to retrieve the case(s) in the case-base that are most similar to the new problem, a similarity function is used. A similarity function measures the distance between the new problem and all the cases in the case-base. Modules with the smallest distances are considered similar and designated as the nearest neighbors ($N_k$) [3]. Commonly used similarity functions include the City Block, Euclidean, and Mahalanobis distances [8]. We use the latter, because it explicitly accounts for correlation among the independent variables. Let $d_{ij}$ be the distance from the new case (or module) under investigation, $\mathbf{x}_i$, to each of the cases in the case-base, $\mathbf{x}_j$. The Mahalanobis distance is given by:

$$d_{ij} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'\, S^{-1}\, (\mathbf{x}_i - \mathbf{x}_j)} \qquad (4)$$

where prime ($'$) represents a transpose, $S$ is the variance-covariance matrix of the independent variables over the entire case-base, and $S^{-1}$ is its inverse.

By using a solution algorithm, we can estimate the number of faults of the new case under investigation. Let $n_k$ be the number of nearest neighbors that are used to obtain the solution to the new problem. The prediction of the dependent variable (number of faults) of the target module, $\hat{F}_i$, can be calculated by a weighted average of the dependent variables accounted for by the nearest neighbors. In this case study, an inverse-distance weight was used in the weighted average. Since a smaller distance implies a better match, we weight each case $j$ in the nearest-neighbor set, $N_k$, by a normalized inverse distance, $w_j$. The prediction of the dependent variable of the target module is then given by:

$$\hat{F}_i = \sum_{j \in N_k} w_j F_j, \quad \text{where} \quad w_j = \frac{1/d_{ij}}{\sum_{l \in N_k} 1/d_{il}} \qquad (5)$$
For the given similarity function and solution algorithm used by our CBR-based fault prediction model, the number of nearest neighbors, $n_k$, is the only adjustable modeling parameter.
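As a concrete sketch, the Mahalanobis similarity function and the inverse-distance solution algorithm can be combined as follows. The function name and data layout are illustrative assumptions, not the authors' tool.

```python
import numpy as np

def cbr_predict(x_new, cases_X, cases_y, n_k):
    """Estimate the number of faults of a new module as the
    inverse-distance-weighted average of its n_k nearest neighbors
    under the Mahalanobis distance (cf. equations (4) and (5))."""
    X = np.asarray(cases_X, dtype=float)
    y = np.asarray(cases_y, dtype=float)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse variance-covariance
    diffs = X - np.asarray(x_new, dtype=float)
    # Mahalanobis distance from the new case to every case in the case-base
    quad = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)
    d = np.sqrt(np.maximum(quad, 0.0))        # guard against tiny negatives
    nn = np.argsort(d)[:n_k]                  # the nearest-neighbor set N_k
    w = 1.0 / np.maximum(d[nn], 1e-12)        # inverse-distance weights
    w /= w.sum()                              # normalize
    return float(np.dot(w, y[nn]))
```

If the new case coincides with a stored case, its distance is (numerically) zero and the prediction collapses to that case's fault count, which is the behavior one would expect from inverse-distance weighting.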
4 CBR GA-Optimizer

The problem of finding the best model in a CBR system can be considered an optimization problem. Generally
speaking, optimization refers to finding the best solution(s) in some specific search space $\Omega$, according to a given objective function $f_{obj}$. Because the function $f_{obj}: \Omega \rightarrow \mathbb{R}$ (where $\mathbb{R}$ is the set of real numbers) is usually discontinuous and the search space may be very large, traditional mathematical methods often cannot be used to solve such an optimization problem. The genetic algorithm (GA) offers an interesting and natural approach to solving such a problem [5]. A GA starts from a set of initial solutions and uses biologically inspired evolution mechanisms to derive new and possibly better solutions. It starts from an initial population $P_0$ and generates a sequence of populations $P_1, \ldots, P_n$ by using three types of operations within the population: crossover, mutation, and reproduction. The elements of the population are called chromosomes, and the fitness of each chromosome is measured by a fitness function. Each chromosome consists of a set of genes. For each generation, the algorithm selects some of the chromosomes and applies the crossover (for pairs), mutation (for singles), or reproduction operations to them, with some given probabilities. Crossover mixes genes and mutation randomly changes some genes; each pair of chromosomes creates a new pair. Each generation inherits some chromosomes from the last generation and accepts some newly created chromosomes according to a given probability. The fitter chromosomes have a greater chance of being inherited into the next generation. The algorithm stops when a certain criterion is satisfied or a pre-defined number of generations is reached. We developed a new tool, named "CBR GA-optimizer", by using a GA-engine that searches for the best models yielded by the CBR-solver. The CBR GA-optimizer consists of two major components: the GA-engine and the CBR-solver. The GA-engine creates the population of chromosomes and implements the evolution process as described above.
It sends each chromosome to the CBR-solver and receives the objective function value from the CBR-solver as feedback for the chromosome evaluation. The CBR-solver receives a chromosome from the GA-engine, carries out the complete CBR process, and builds a CBR model. The CBR model calculates the objective function value and sends it back to the GA-engine.

For a given system, finding the best performance, i.e., $\phi(c)$, of a MOM for a set of cutoff percentiles of interest is a multi-objective optimization issue. In our case study, maximizing the $\phi(c)$ values at the 95%, 90%, 80%, and 70% percentiles was desired. These four performance goals are combined to obtain the objective function that is to be optimized by the CBR GA-optimizer. The objective function is given by:

$$f_{obj} = w_{0.95}\,\phi(0.95) + w_{0.90}\,\phi(0.90) + w_{0.80}\,\phi(0.80) + w_{0.70}\,\phi(0.70) \qquad (6)$$

where $w_{0.95}$, $w_{0.90}$, $w_{0.80}$, and $w_{0.70}$ represent the weights of the performances at the respective cutoff percentiles of interest. The weights can be determined according to the importance of each individual objective in the context of the optimization problem under consideration. The general form of the optimization issue related to MOM can be presented as follows: for a given fit data set, test data set, similarity function, and objective function $f_{obj}$, find some solution(s), $s$, in the search space, $\Omega$, that maximize $f_{obj}$. As mentioned earlier, in the context of this paper the number of nearest neighbors is the only parameter that can affect the performance of the underlying CBR-based quantitative prediction model. Therefore, the search space includes only one parameter, $n_k$. By using the CBR GA-optimizer, the GA-engine automatically searches for the best model created by the CBR-solver, according to the optimization problem shown in objective function (6).
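To make the search concrete, here is a minimal sketch of a GA over the single integer parameter $n_k$, maximizing a weighted objective in the form of (6). The operator choices (elitist reproduction, averaging crossover, small integer mutation) are illustrative assumptions; the paper does not specify the actual encoding and operators at this level of detail.

```python
import random

def ga_search(evaluate, k_min, k_max, pop_size=20, generations=50,
              p_mut=0.08, seed=0):
    """Maximize evaluate(n_k) over integers in [k_min, k_max] with a
    tiny GA: keep the fitter half (reproduction), average pairs of
    survivors (crossover), and occasionally perturb a child (mutation)."""
    rng = random.Random(seed)
    pop = [rng.randint(k_min, k_max) for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=evaluate, reverse=True)[:pop_size // 2]
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = (a + b) // 2                      # crossover: blend parents
            if rng.random() < p_mut:                  # mutation: small step
                child = min(k_max, max(k_min, child + rng.randint(-3, 3)))
            children.append(child)
        pop = elite + children
    return max(pop, key=evaluate)

def objective(phi, weights):
    """Weighted sum of MOM performances, in the form of eq. (6)."""
    return sum(weights[c] * phi[c] for c in weights)
```

Here `evaluate(n_k)` would build the CBR model with $n_k$ nearest neighbors, compute the $\phi(c)$ values, and return `objective(phi, weights)`; `ga_search` then plays the role of the GA-engine.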
5 Empirical Case Study

5.1 System Description
This case study (denoted as WLTS) involves data collection efforts from initial releases of two large Windows-based embedded systems used primarily for customizing the configuration of wireless telecommunications products. The two C++ applications provide similar functionalities and contain common source code. Hence, both systems are studied simultaneously. The main difference between them is the type of wireless product that each supports. Both systems consist of a large number of source code files, and each system contained more than a million lines of code. Software metrics were obtained by observing the configuration management systems and the problem reporting systems of the applications. The problem reporting system tracked and recorded problem statuses. The fault data represents the faults discovered during system tests. Upon preprocessing and cleaning the collected data, i.e., removal of incomplete data points, 1,211 modules remained. A large majority of the modules were observed to have no faults, and the remaining modules had at least one fault. An impartial data splitting was performed on the data set in order to obtain the fit (807 modules) and test (404 modules) data sets. To avoid biased results due to a lucky data split, the original data set was randomly split three times to obtain three pairs of fit and test data sets. However, due to space considerations we only present the results for one data split, i.e., Split 1. The five software metrics used for reliability modeling in this case study are: B_LOC, the number of lines of code for the source file version prior to the coding phase, i.e., auto-generated code; S_LOC, the number of lines of code for the source file version delivered to system tests; B_COM, the number of lines of commented code for the source file version prior to the coding phase, i.e., auto-generated code;
S_COM, the number of lines of commented code for the source file; and INSP, the number of times the source file was inspected prior to system tests. The collection and use of these metrics for modeling purposes were dependent on their availability and the available data collection tools. The product metrics indicate the number of lines of source code prior to the coding phase and just before system tests. The inspection metric, INSP, was obtained from the problem reporting systems of the two embedded applications.
5.2 CBR GA-Optimizer Methodology

In the GA-engine, some parameters associated with the GA were set as follows: (1) reproduction rate = 0.5; (2) crossover probability = 0.9; (3) mutation probability = 0.08; (4) number of generations = 3000; (5) size of population = 200; (6) number of runs = 50. The optimization of the GA parameters is beyond the scope of this study; however, it is part of our future research. At the end of each run, the two best models are selected. Hence, at the end of all 50 runs there were 100 candidate models, among which we selected the best model, i.e., the one with the highest value for objective function (6). In the CBR-solver, an $n$-fold cross-validation (also commonly known as the leave-one-out technique) was implemented on the fit data set to train the underlying quantitative (fault) prediction model. It is an iterative process such that during each iteration, one of the $n$ observations in the fit data set is used as the test data and the other $n - 1$ are used to train or build the model. The Mahalanobis distance was used as the similarity function and the inverse-distance weighted average was used as the solution algorithm. The GA-engine initially creates a genome, i.e., a number of nearest neighbors, $n_k$, and sends it to the CBR-solver to build a corresponding module-order model. The CBR-solver returns the value of the multi-objective function to the GA-engine. Then the evolution process runs until the termination condition is met. The GA-optimizer finally outputs some "best" MOM(s) that maximize objective function (6), which consists of a weighted sum of the performances at the cutoff percentiles of interest, i.e., $\phi(0.95)$, $\phi(0.90)$, $\phi(0.80)$, and $\phi(0.70)$. In the context of the multi-objective optimization for MOM(s), one of the key issues is to assign suitable weights in the objective function (6).
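The leave-one-out procedure can be sketched as follows. The function name and the pluggable `predict` callable are illustrative assumptions; any fault predictor with the signature `predict(x_new, cases_X, cases_y, n_k)` can be dropped in.

```python
import numpy as np

def loocv(X, y, predict, n_k):
    """n-fold (leave-one-out) cross-validation: each observation in the
    fit data is predicted from the remaining n - 1 cases."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i               # hold out observation i
        preds[i] = predict(X[i], X[mask], y[mask], n_k)
    return preds                               # out-of-sample predictions
```

The held-out predictions can then be ranked to evaluate the MOM on the fit data, or summarized with a quantitative error such as the average absolute error, `np.mean(np.abs(preds - y))`.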
From a practical software engineering point of view, it is beneficial and cost-effective to begin reliability enhancements with the most faulty modules. Since a higher cutoff percentile value corresponds to the more faulty modules, we assigned the highest weight to the performance at $c = 0.95$. Subsequently, we assigned decreasing weights to the performances at the $c$ = 90%, 80%, and 70% percentiles. We considered ten different sets of weights for the performances at the four values. The weights for each set were such that $w_{0.95} \geq w_{0.90} \geq w_{0.80} \geq w_{0.70}$. Subsequent to analysis based on each weight set, it was observed that the performance of the models was not impacted by the different weight sets. The results presented in this paper are based on one such representative weight set. As future work, we shall consider using the GA-optimizer for directly optimizing the weights for each performance goal.
Table 1. Performances of Split 1

          |           Fit              |           Test
    c     |  G(c)   Ĝ(c)    φ(c)      |  G(c)   Ĝ(c)    φ(c)
  ------------------------------------------------------------
   0.95   |   558    487   87.28%     |   508    406   79.92%
   0.90   |   734    630   85.83%     |   609    551   90.48%
   0.85   |   859    734   85.45%     |   680    625   91.91%
   0.80   |   946    807   85.31%     |   731    670   91.66%
   0.75   |   998    877   87.88%     |   760    711   93.55%
   0.70   |  1038    932   89.79%     |   780    730   93.59%
   0.65   |  1071    973   90.85%     |   786    743   94.53%
5.3 Results and Analysis

5.3.1 MOM Calibrated by CBR GA-Optimizer

The performance of the best model for Split 1 is shown in Table 1.¹ The models were built using the five software metrics described earlier. The first column of the table lists the cutoff percentiles from 95% through 65%, in decrements of 5%. We present the MOM performance beyond the lowest cutoff percentile of interest to the management team. The table shows the performances for both the fit and test data sets. For each cutoff percentile, the $G(c)$, $\hat{G}(c)$, and $\phi(c)$ values are presented. For a given $c$, the number of faults is influenced by the way the original data set is split into the fit and test data sets. For example, the total numbers of faults in the fit and test data sets for Split 1 are 1071 and 786 faults, respectively. As shown in the last row of the table, these values equal the respective numbers of faults accounted for at $c = 0.65$. This implies that for Split 1, a perfect ranking at $c = 0.65$ will account for 100% of the faults in the fit and test data sets. This was also observed with the other two splits. We observe that for the cutoff percentiles of interest, although the performances at the higher percentiles in the multi-objective function were assigned larger weights, the final performances at the higher percentiles were not close to the respective objectives. This may be reflective of: (1) the underlying prediction model used for obtaining the predicted rankings of modules, and (2) the characteristics of the software metrics data [12]. The latter is especially evident in the software quality modeling of high-assurance systems (such as WLTS), in which the percentage of faulty modules is usually a very small fraction of the total number of modules. In the case of Split 1, the performance of the MOM on the test data set is generally better than that on the fit data set, with the exception of $c = 0.95$. This was also observed for Split 3.
However, for Split 2, the performance of MOM on the test data set was similar to that on the fit data set except for the cutoff percentiles from 90% to 80%. This implies that the performance of MOM is impacted by the way the original data set is randomly split into the fit and test data sets.
¹ The models for the other two splits are not shown due to space considerations; however, similar empirical results were obtained.
Table 2. Comparisons of Performances for Split 1 (Test)

    c     |  G(c)  |  φ(c), S_LOC  |  φ(c), GA-CBR  |  Difference
  ----------------------------------------------------------------
   0.95   |   508  |    82.68%     |     79.92%     |   -2.76%
   0.90   |   609  |    81.28%     |     90.48%     |    9.20%
   0.85   |   680  |    86.18%     |     91.91%     |    5.74%
   0.80   |   731  |    85.77%     |     91.66%     |    5.88%
   0.75   |   760  |    84.34%     |     93.55%     |    9.21%
   0.70   |   780  |    82.31%     |     93.59%     |   11.28%
   0.65   |   786  |    81.93%     |     94.53%     |   12.60%
5.3.2 Comparison with a Simple Method

The software quality assurance team of a given software project is often interested in knowing how well a given software quality model performs compared to a model based on a simple rule of thumb, such as software size. This is often needed from a practical point of view for justifying the use of a sophisticated method such as the CBR GA-optimizer. In order to evaluate the MOM built by the CBR GA-optimizer, we compare it with the performance obtained when the modules are ordered according to their LOC. The LOC metric is often used as a heuristic practice to detect and enhance problematic software modules. This comparison is only done for the test data sets, because the generalization performance of a software quality model is of more interest to a practitioner. In the case of the ranking based on LOC, the modules in the test data are ranked according to their LOC prior to the system release, i.e., S_LOC. The subsequent performance calculation is done using the procedure described earlier. The comparison between the CBR GA-optimizer model and the performance obtained by LOC-based ranking (the notations LOC and S_LOC are used interchangeably to imply an ordering based on lines of code) is shown in Table 2, which presents the performances for Split 1. It is observed that when the top 10% ($c = 0.90$) of the modules of the test data set are chosen for reliability enhancements, the MOM calibrated by the CBR GA-optimizer has over 90% effectiveness; in other words, this MOM can detect over 90% of the faults accounted for by the top 10% of the most faulty modules according to a perfect ranking.
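The size-based baseline can be reproduced with the same evaluation machinery: rank the test modules by S_LOC instead of predicted faults and compute φ(c) against the perfect ranking. A minimal sketch (the function name is assumed):

```python
import numpy as np

def phi_for_ranking(actual_faults, ranking_key, c):
    """phi(c) when modules are ordered by `ranking_key` (e.g. the
    S_LOC size metric) instead of by predicted faults."""
    actual = np.asarray(actual_faults, dtype=float)
    key = np.asarray(ranking_key, dtype=float)
    top = int(round((1.0 - c) * len(actual)))     # modules above percentile c
    g = np.sort(actual)[::-1][:top].sum()         # perfect ranking, G(c)
    g_hat = actual[np.argsort(-key)][:top].sum()  # key-based ranking, G_hat(c)
    return g_hat / g if g > 0 else 1.0
```

Passing the predicted fault counts as `ranking_key` gives the model's φ(c), so the same function supports both columns of the comparison in Table 2.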