International Journal of Information Technology & Decision Making
Vol. 10, No. 1 (2011) 187–206
© World Scientific Publishing Company
DOI: 10.1142/S0219622011004282
ENSEMBLE OF SOFTWARE DEFECT PREDICTORS: AN AHP-BASED EVALUATION METHOD
YI PENG, GANG KOU∗, GUOXUN WANG and WENSHUAI WU
School of Management and Economics
University of Electronic Science and Technology of China
Chengdu, P. R. China, 610054
∗[email protected]

YONG SHI
College of Information Science & Technology
University of Nebraska at Omaha, Omaha, NE 68182, USA
and
CAS Research Center on Fictitious Economy and Data Sciences
Beijing 100080, China
Classification algorithms that help to identify software defects or faults play a crucial role in software risk management. Experimental results have shown that ensembles of classifiers are often more accurate and more robust to the effects of noisy data, and achieve lower average error rates, than any of the constituent classifiers. However, inconsistencies exist across studies, and the performance of learning algorithms may vary with the performance measures used and with the circumstances of the experiment. Therefore, more research is needed to evaluate the performance of ensemble algorithms in software defect prediction. The goal of this paper is to assess the quality of ensemble methods in software defect prediction with the analytic hierarchy process (AHP), a multicriteria decision-making approach that prioritizes decision alternatives based on pairwise comparisons. Through the application of the AHP, this study experimentally compares the performance of several popular ensemble methods using 13 different performance metrics over 10 public-domain software defect datasets from the NASA Metrics Data Program (MDP) repository. The results indicate that ensemble methods can improve the classification results of software defect prediction in general, and that AdaBoost gives the best results. In addition, tree- and rule-based classifiers perform better in software defect prediction than the other types of classifiers included in the experiment. Among single classifiers, K-nearest-neighbor, C4.5, and Naïve Bayes tree ranked higher than the other classifiers.

Keywords: Ensemble; classification; software defect prediction; the analytic hierarchy process (AHP).
1. Introduction

Large and complex software systems have become an essential part of our society. Defects existing in software systems are prevalent and expensive. According to the
Research Triangle Institute (RTI), software defects cost the US economy billions of dollars annually, and more than a third of the costs associated with software defects could be avoided by improving software testing.1 As a useful software testing tool, software defect prediction can help detect software faults at an early stage, which facilitates efficient test resource allocation, improves software architecture design, and reduces the number of defective modules.2

Software defect prediction can be modeled as a two-group classification problem by categorizing software units as either fault-prone (fp) or nonfault-prone (nfp) using historical data. Researchers have developed many classification models for software defect prediction.2–15 Previous studies illustrate that ensemble methods, which combine classifiers using some mechanism, are superior to other methods in software defect prediction.2,16 However, other works indicate that classifiers' performances may vary with the performance measures used and with the circumstances of the experiment.17–20 Furthermore, there are many ways to construct ensembles of classifiers. How to select the most appropriate ensemble method for the software defect prediction problem has not been fully investigated.

The objective of this study is to evaluate the quality of ensemble methods for software defect prediction with the analytic hierarchy process (AHP) method. The AHP is a multicriteria decision-making approach that helps decision makers structure a decision problem based on pairwise comparisons and experts' judgments.21 Three popular ensemble methods (bagging, boosting, and stacking) are compared with 12 well-known classification methods using 13 performance measures over 10 public-domain datasets from the NASA Metrics Data Program (MDP) repository.22 The classification results are then analyzed using the AHP to determine the best classifier for the software defect prediction task.

The rest of this paper is organized as follows: Sections 2 and 3 describe the ensemble methods and the AHP, respectively; Section 4 explains the performance metrics, datasets, and methodology used in the experiment and analyzes the results; Section 5 summarizes the findings.

2. Ensemble Methods

Ensemble learning algorithms construct a set of classifiers and then combine the results of these classifiers using some mechanism to classify new data records.23 Experimental results have shown that ensembles are often more accurate and more robust to the effects of noisy data, and achieve lower average error rates, than any of the constituent classifiers.24–28

How to construct good ensembles of classifiers is one of the most active research areas in machine learning, and many methods for constructing ensembles have been proposed in the past two decades.29 Dietterich30 divides these methods into five groups: Bayesian voting, manipulating the training examples, manipulating the input features, manipulating the output targets, and injecting randomness. Several
comparative studies have been conducted to examine the effectiveness and performance of ensemble methods. The results of these studies indicate that bagging and boosting are very useful in improving the accuracy of certain classifiers,31 and that their performances vary with added classification noise.23 To investigate the capabilities of ensemble methods in software defect prediction, this study concentrates on three popular ensemble methods (i.e. bagging, boosting, and stacking) and compares their performances on public-domain software defect datasets.

2.1. Bagging

Bagging combines multiple outputs of a learning algorithm by taking a plurality vote to get an aggregated single prediction.32 The multiple outputs of the learning algorithm are generated by randomly sampling with replacement from the original training dataset and applying the predictor to each sample. Many experimental results show that bagging can improve accuracy substantially. The vital element in whether bagging will improve accuracy is the instability of the predictor.32 For an unstable predictor, a small change in the training dataset may cause large changes in predictions.33 For a stable predictor, however, bagging may slightly degrade the performance.32

Researchers have performed large empirical studies to investigate the capabilities of ensemble methods. For instance, Bauer and Kohavi31 compared bagging and boosting algorithms with a decision tree inducer and a Naïve Bayes inducer. They concluded that bagging reduces the variance of unstable methods and leads to significant reductions in mean-squared errors. Dietterich23 studied three ensemble methods (bagging, boosting, and randomization) using the decision tree algorithm C4.5 and pointed out that bagging is much better than boosting when there is substantial classification noise. In this study, bagging is generated by averaging probability estimates.34

2.2. Boosting

Similar to bagging, the boosting method also combines the different decisions of a learning algorithm to produce an aggregated prediction.24,35 In boosting, however, the weights of the training instances change in each iteration to force the learning algorithm to put more emphasis on instances that were previously predicted incorrectly and less emphasis on instances that were previously predicted correctly.30 Boosting often achieves more accurate results than bagging and other ensemble methods.23,29,31 However, boosting may overfit the data, and its performance deteriorates with classification noise.

This study evaluates a widely used boosting method, the AdaBoost algorithm. AdaBoost is the abbreviation for adaptive boosting, because the algorithm adjusts adaptively to the errors returned by classifiers from previous iterations.24,36 The algorithm assigns equal weight to each training instance at the beginning. It then builds a classifier by applying the learning algorithm to the training data. Weights of misclassified instances are increased, while weights of correctly classified instances are decreased. Thus, the new classifier concentrates more on incorrectly classified instances in each iteration.
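As a concrete, hedged illustration of the two schemes just described, the following sketch wraps bagging and AdaBoost around a single unstable base learner and compares them by 10-fold cross-validated AUC. It uses scikit-learn as a stand-in for the WEKA implementations used in this study; the synthetic dataset, estimator choices, and parameter values are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch (not the authors' WEKA setup): bagging and AdaBoost compared with a
# single decision tree by 10-fold cross-validated AUC on synthetic two-class data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for a fault-prone/nonfault-prone dataset.
X, y = make_classification(n_samples=500, n_features=20, weights=[0.8, 0.2], random_state=1)

single = DecisionTreeClassifier(random_state=1)               # an unstable predictor
bagged = BaggingClassifier(n_estimators=50, random_state=1)   # bags full decision trees by default
boosted = AdaBoostClassifier(n_estimators=50, random_state=1) # boosts shallow decision stumps by default

for name, model in [("single tree", single), ("bagging", bagged), ("AdaBoost", boosted)]:
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {auc.mean():.3f}")
```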
2.3. Stacking

Stacked generalization, often abbreviated as stacking, is a scheme for minimizing the generalization error rate of one or more learning algorithms.37 Unlike bagging and boosting, stacking can be applied to combine different types of learning algorithms. Each base learner, also called a "level-0" model, generates a class value for each instance. The predictions of the level-0 models are then fed into the level-1 model, which combines them to form a final prediction.34 Another ensemble method used in the experiment is voting, which is a simple average of multiple classifiers' probability estimates, as provided by WEKA.34

2.4. Selected classification models

As a powerful tool with numerous applications, classification has been studied extensively by several fields, such as machine learning, statistics, and data mining.38 Previous studies have shown that an ideal ensemble should consist of accurate and diverse classifiers.39,40 Therefore, this study selects 12 classifiers to build ensembles. They represent five categories of classifiers (i.e. trees, functions, Bayesian classifiers, lazy classifiers, and rules) and were implemented in WEKA. For the trees category, we chose classification and regression tree (CART), Naïve Bayes tree, and C4.5. The functions category includes linear logistic regression, radial basis function (RBF) network, sequential minimal optimization (SMO), and neural networks. The Bayesian classifiers include Bayesian network and Naïve Bayes. K-nearest-neighbor was chosen to represent lazy classifiers. For the rules category, decision table and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) rule induction were selected.

Classification and regression tree (CART) can predict both continuous and categorical dependent attributes by building regression trees and discrete classes, respectively.41 Naïve Bayes tree is an algorithm that combines the Naïve Bayes induction algorithm and decision trees to increase the scalability and interpretability of Naïve Bayes classifiers.42 C4.5 is a decision tree algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner.43 Linear logistic regression models the probability of occurrence of an event as a linear function of a set of predictor variables.44 A neural network is a collection of artificial neurons that learns relationships between inputs and outputs by adjusting the weights.28 The RBF network45 is an artificial neural network that uses radial basis functions as activation functions. The centers and widths of the hidden units are derived using k-means, and the outputs obtained from the hidden layer are combined using logistic regression.34 SMO is a sequential minimal optimization algorithm for training support vector machines (SVM).46,47
Bayesian network and Naïve Bayes both model probabilistic relationships between the predictor variables and the class variable. While the Naïve Bayes classifier48 estimates the class-conditional probability based on Bayes' theorem and can only represent simple distributions, a Bayesian network is a probabilistic graphical model that can represent conditional independencies between variables.49 K-nearest-neighbor50 classifies a given data instance based on learning by analogy; that is, it assigns an instance to the class of the closest training examples in the feature space. Decision table selects the best-performing attribute subsets using best-first search and uses cross-validation for evaluation.51 RIPPER52 is a sequential covering algorithm that extracts classification rules directly from the training data without generating a decision tree first.28

Stacking and voting each combine all 12 classifiers to generate one prediction. Since bagging and boosting are designed to combine multiple outputs of a single learning algorithm, they are applied to each of the 12 classifiers, which, together with stacking and voting, produces a total of 26 aggregated outputs.
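To show how such a heterogeneous combination is wired together, the sketch below builds a stacked model (level-0 learners feeding a level-1 logistic regression) and a soft-voting model from three of the classifier families listed above. It is a scikit-learn analogue, not the WEKA configuration used in the study; the chosen base learners, dataset, and parameters are illustrative assumptions.

```python
# Sketch of stacking and voting over heterogeneous base learners (scikit-learn analogue).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)

level0 = [
    ("tree", DecisionTreeClassifier(random_state=1)),  # stand-in for C4.5/J48
    ("knn", KNeighborsClassifier(n_neighbors=5)),      # stand-in for lazy.IBk
    ("nb", GaussianNB()),                              # stand-in for Naive Bayes
]

# Stacking: level-0 predictions are fed to a level-1 (meta) learner.
stack = StackingClassifier(estimators=level0, final_estimator=LogisticRegression())

# Voting: a simple average of the base learners' class-probability estimates.
vote = VotingClassifier(estimators=level0, voting="soft")

stack.fit(X, y)
vote.fit(X, y)
print(stack.predict(X[:5]), vote.predict(X[:5]))
```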
3. The Analytic Hierarchy Process (AHP)

The analytic hierarchy process is a multicriteria decision-making approach that helps decision makers structure a decision problem based on pairwise comparisons and experts' judgments.53,54 Saaty55 summarizes four major steps for the AHP. In the first step, decision makers define the problem and decompose it into a three-level hierarchy of interrelated decision elements: the goal of the decision, the criteria or factors that contribute to the solution, and the alternatives associated with the problem through the criteria.56 The middle level of criteria might be expanded to include subcriteria levels. After the hierarchy is established, the decision makers compare the criteria two by two using a fundamental scale in the second step. In the third step, these human judgments are converted to a matrix of relative priorities of decision elements at each level using the eigenvalue method. The fourth step calculates the composite or global priorities for each decision alternative to determine its rating.

The AHP has been applied to diverse decision problems, such as economics and planning, policies and allocation of resources, conflict resolution, arms control, material handling and purchasing, manpower selection and performance measurement, project selection, marketing, portfolio selection, model selection, politics, and the environment.57 Over the last 20 years, the AHP has been studied extensively and various variants of the AHP have been proposed.58–61

In this study, the decision problem is to select the best ensemble method for the task of software defect prediction. The first step of the AHP is to decompose the problem into a decision hierarchy. As shown in Fig. 1, the goal is to select an ensemble method that is superior to other ensemble methods over public-domain software defect datasets through the comparison of a set of performance measurements.
Fig. 1. An AHP hierarchy for the ensemble selection problem.
The criteria are performance measures for classifiers, such as overall accuracy, F-measure, area under the ROC curve (AUC), precision, recall, and the Kappa statistic. The decision alternatives are ensembles and individual classification methods, such as AdaBoost, bagging, stacking, C4.5, SMO, and Naïve Bayes. Individual classifiers are included as decision alternatives for the purpose of comparison.

In step 2, the input data for the hierarchy, which is a scale of numbers that indicates the preference of decision makers about the relative importance of the criteria, are collected. Saaty53 provides a fundamental scale for this purpose, which has been validated theoretically and practically. The scale ranges from 1 to 9 with increasing importance. The numbers 1, 3, 5, 7, and 9 represent equal, moderate, strong, very strong, and extreme importance, respectively, while 2, 4, 6, and 8 indicate intermediate values.

This study uses 13 measures to assess the capability of ensembles and individual classifiers. The matrix of pairwise comparisons of these measures is exhibited in Table 1. Previous work has shown that the AUC is the most informative and objective measurement of predictive accuracy62 and is an extremely important measure in software defect prediction. Therefore, it is assigned a value of 9. The F-measure, mean absolute error, and overall accuracy are very important measures, but less important than the AUC. The true positive rate (TPR), true negative rate (TNR), false positive rate (FPR), false negative rate (FNR), precision, recall, and Kappa statistic are strongly important classification measures that are less important than the F-measure, mean absolute error, and overall accuracy. Training and test time refer to the time needed to train and test a classification algorithm or ensemble method, respectively. They are useful measures in real-time software defect identification. Since this study is not aimed at real-time software defect identification, they are included only to measure the efficiency of ensemble methods and are given the lowest importance.

The third step of the AHP computes the principal eigenvector of the matrix to estimate the relative weights (or priorities) of the criteria. The estimated priorities are obtained through a two-step process: (1) raise the matrix to large powers (square it); (2) sum and normalize each row. This process is repeated until the difference between the sums of each row in two consecutive rounds is smaller than a prescribed value. The priorities, which provide the relative ranking of the performance measures, are shown in the rightmost column of Table 1.
Table 1. Pairwise comparisons of performance measures.

Measures     Acc   TPR   FPR   TNR   FNR   Precision  Recall  F-measure  AUC   Kappa  MAE   TrainTime  TestTime  Priority
Acc           1     3     3     3     3      3          3       1         1/3    3     1      7          7       0.1262
TPR          1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
FPR          1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
TNR          1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
FNR          1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
Precision    1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
Recall       1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
F-measure     1     3     3     3     3      3          3       1         1/3    3     1      7          7       0.1262
AUC           3     5     5     5     5      5          5       3          1     5     3      9          9       0.2513
Kappa        1/3    1     1     1     1      1          1      1/3        1/5    1    1/3     5          5       0.0491
MAE           1     3     3     3     3      3          3       1         1/3    3     1      7          7       0.1262
TrainTime    1/7   1/5   1/5   1/5   1/5    1/5        1/5     1/7        1/9   1/5   1/7     1          1       0.0133
TestTime     1/7   1/5   1/5   1/5   1/5    1/5        1/5     1/7        1/9   1/5   1/7     1          1       0.0133
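The squaring-and-row-normalization procedure described above can be sketched in a few lines. The following minimal Python/NumPy illustration applies it to a small 3×3 sub-matrix built from Table 1 entries (AUC vs. overall accuracy vs. training time); the helper name, tolerance, and the choice of sub-matrix are assumptions made here for illustration, and the study itself performed this step in Matlab 7.0.

```python
import numpy as np

def ahp_priorities(A, tol=1e-6, max_iter=50):
    """Estimate the priority vector of a pairwise-comparison matrix by repeated squaring."""
    A = np.asarray(A, dtype=float)
    prev = np.zeros(A.shape[0])
    for _ in range(max_iter):
        A = A @ A                          # raise the matrix to a higher power
        A = A / A.max()                    # rescale to avoid overflow (ratios unchanged)
        w = A.sum(axis=1)
        w = w / w.sum()                    # normalized row sums = current priority estimate
        if np.abs(w - prev).max() < tol:   # stop when two consecutive rounds agree
            return w
        prev = w
    return w

# Illustrative 3x3 sub-matrix from Table 1: AUC vs Acc = 3, AUC vs training time = 9,
# Acc vs training time = 7, plus the reciprocal entries.
A = [[1.0, 3.0, 9.0],
     [1/3, 1.0, 7.0],
     [1/9, 1/7, 1.0]]
print(ahp_priorities(A))   # approximate relative weights of AUC, Acc, training time
```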
After obtaining the priority vector of the criteria level, the AHP method moves to the lowest level in the hierarchy, which in this experiment consists of ensemble methods and classification algorithms. The pairwise comparisons at this level compare the learning algorithms with respect to each performance measure in the level immediately above. The matrices of comparisons of the learning algorithms with respect to the criteria and their priorities are analyzed and summarized in the experimental study section. The ratings for the learning algorithms are produced by aggregating the relative priorities of the decision elements.21

4. Experimental Study

The experiment is designed to compare a wide selection of ensemble methods and individual classifiers for software defect prediction based on the AHP method. As discussed in Sec. 3, the performance of ensemble methods and classification algorithms is evaluated using 13 measures over 10 public-domain software defect datasets. The following paragraphs define the performance measures and the datasets, describe the experimental design, and present the results.

4.1. Performance measures

There is an extensive number of performance measures for classification. These measures have been introduced for different applications and to evaluate different aspects of classification. Commonly used performance measures in software defect classification are accuracy, precision, recall, F-measure, AUC, and mean absolute error.63,64 Besides these popular measures, this work includes seven other major classification measures. The definitions of these measures are as follows.

• Overall accuracy: Accuracy is the percentage of correctly classified modules.28 It is one of the most widely used classification performance metrics.
  Overall accuracy = (TP + TN) / (TP + FP + FN + TN).
• True positive (TP): TP is the number of correctly classified fault-prone modules. The TP rate measures how well a classifier can recognize fault-prone modules. It is also called the sensitivity measure.
  True positive rate/Sensitivity = TP / (TP + FN).
• False positive (FP): FP is the number of nonfault-prone modules that are misclassified as fault-prone. The FP rate measures the percentage of nonfault-prone modules that were incorrectly classified.
  False positive rate = FP / (FP + TN).
• True negative (TN): TN is the number of correctly classified nonfault-prone modules. The TN rate measures how well a classifier can recognize nonfault-prone modules. It is also called the specificity measure.
  True negative rate/Specificity = TN / (TN + FP).
• False negative (FN): FN is the number of fault-prone modules that are misclassified as nonfault-prone. The FN rate measures the percentage of fault-prone modules that were incorrectly classified.
  False negative rate = FN / (FN + TP).
• Precision: This is the proportion of modules classified as fault-prone that actually are fault-prone.
  Precision = TP / (TP + FP).
• Recall: This is the percentage of fault-prone modules that are correctly classified.
  Recall = TP / (TP + FN).
• F-measure: This is the harmonic mean of precision and recall. The F-measure has been widely used in information retrieval.65
  F-measure = (2 × Precision × Recall) / (Precision + Recall).
• AUC: ROC stands for receiver operating characteristic, a curve that shows the tradeoff between the TP rate and the FP rate.28 The AUC, the area under the ROC curve, represents the accuracy of a classifier: the larger the area, the better the classifier.
• Kappa statistic (KapS): This is a classifier performance measure that estimates the similarity between the members of an ensemble in multiclassifier systems.66
  KapS = (P(A) − P(E)) / (1 − P(E)),
  where P(A) is the accuracy of the classifier and P(E) is the probability that agreement among classifiers is due to chance:
  P(E) = (1/m²) Σ_{k=1}^{c} ([Σ_{j=1}^{c} Σ_{i=1}^{m} f(i, k) C(i, j)] · [Σ_{j=1}^{c} Σ_{i=1}^{m} f(i, j) C(i, k)]),
  where m is the number of modules and c is the number of classes, f(i, j) is the actual probability of module i belonging to class j, and Σ_{i=1}^{m} f(i, j) is the number of modules of class j. Given a threshold θ, Cθ(i, j) is 1 if and only if j is the predicted class for module i obtained from P(i, j); otherwise it is 0.62
• Mean absolute error (MAE): This measures how much the predictions deviate from the true probabilities. P(i, j) is the estimated probability of module i belonging to class j, taking values in [0, 1].62
  MAE = (Σ_{j=1}^{c} Σ_{i=1}^{m} |f(i, j) − P(i, j)|) / (m · c).
• Training time: The computational time for training a classification algorithm or ensemble method.
• Test time: The computational time for testing a classification algorithm or ensemble method.
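For reference, the confusion-matrix-based measures above can be derived directly from the TP, FP, TN, and FN counts, as in the small helper below. This is an illustrative sketch rather than the evaluation code used in the study (the experiment relies on the metrics reported by WEKA), and the example counts are hypothetical; AUC, the Kappa statistic, and MAE additionally require probability estimates and are therefore not included here.

```python
# Derive the confusion-matrix-based measures defined above from raw counts.
def defect_prediction_measures(tp, fp, tn, fn):
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    tpr       = tp / (tp + fn)            # true positive rate / sensitivity
    fpr       = fp / (fp + tn)            # false positive rate
    tnr       = tn / (tn + fp)            # true negative rate / specificity
    fnr       = fn / (fn + tp)            # false negative rate
    precision = tp / (tp + fp)
    recall    = tpr
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy, "TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
        "precision": precision, "recall": recall, "F-measure": f_measure,
    }

# Hypothetical counts: 50 fault-prone modules correctly flagged, 20 false alarms,
# 400 nonfault-prone modules correctly cleared, 30 missed faults.
print(defect_prediction_measures(tp=50, fp=20, tn=400, fn=30))
```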
4.2. Data sources

The datasets used in this study are 10 public-domain software defect datasets provided by the NASA IV&V Facility Metrics Data Program (MDP) repository. Brief descriptions of these MDP datasets are provided on the NASA Web site:

• CM1: This dataset is from a science instrument written in C code with approximately 20 kilo-source lines of code (KLOC). It contains 505 modules.
• JM1: This dataset is from a real-time C project containing about 315 KLOC. There are 8 years of error data associated with the metrics, and the dataset has 2012 modules.
• KC3: This dataset is about the collection, processing, and delivery of satellite metadata. It is written in Java with 18 KLOC and has 458 modules.
• KC4: This dataset is a ground-based subscription server written in Perl, containing 25 KLOC with 125 modules.
• MC1: This dataset is about a combustion experiment designed to fly on the space shuttle, written in C and C++ code containing 63 KLOC. There are 23 526 modules.
• MW1: This dataset is about a zero-gravity experiment related to combustion, written in C code containing 8 KLOC, with 403 modules.
• PC1: This dataset is flight software from an earth-orbiting satellite that is no longer operational. It contains 40 KLOC of C code with 1107 modules.
• PC2: This dataset is a dynamic simulator for attitude control systems. It contains 26 KLOC of C code with 5589 modules.
• PC3: This dataset is flight software from an earth-orbiting satellite that is currently operational. It has 40 KLOC of C code with 1563 modules.
• PC4: This dataset is flight software from an earth-orbiting satellite that is currently operational. It has 36 KLOC of C code with 1458 modules.

Though these datasets have different sets of attributes, they share some common structures. Four attributes (i.e. module id, defect id, priority, and severity) that are important for the defect classification task exist in all 10 datasets. Module id is a unique identifier for every individual module. Defect id identifies the type of defect and is the dependent attribute in software defect classification. In two-class software defect classification, modules with a nonempty defect id are labeled as fp and modules with an empty defect id are labeled as nfp.

4.3. Experimental design

The experiment was carried out according to the following process:

Step 1. Prepare datasets: select relevant features.
Step 2. Train and test classification models on randomly sampled partitions (i.e. 10-fold cross-validation) using WEKA 3.7.34
Step 3. Collect the evaluation measures of the classification models using WEKA 3.7. These performance measures are the input data for the pairwise comparisons of each alternative (i.e. classification model) in the AHP.
Step 4. Construct a set of pairwise comparison matrices of the classification models with respect to each performance measure (criterion in the AHP) using Matlab 7.0.
Step 5. Multiply the relative rankings obtained from Step 4 by the priorities of the performance measures (Table 1) to get the overall priorities for each classification model.
END

As mentioned in Sec. 2, this study compares three ensemble methods (i.e. AdaBoost, bagging, and stacking) and a selection of 12 classifiers. These learning models are separated into three groups: AdaBoost, bagging, and others. Each of the AdaBoost and bagging groups has 12 algorithms, which are generated by applying AdaBoost and bagging to each of the 12 classifiers, respectively. The third group has 14 algorithms, including stacking, voting, and the 12 individual classifiers.

4.4. Experimental results

Table 2 summarizes the average classification results of all learning methods over the 10 datasets (test datasets) using 10-fold cross-validation. The classifier names in Table 2 follow the names used in WEKA 3.7. The 13 performance measures described in Sec. 4.1 are collected for each learning method. The classification algorithm producing the best result for a specific performance measure is highlighted in boldface; if more than one algorithm achieves the best result, all of them are highlighted.

From Table 2, we observe that the performances of some classifiers on certain measures are rather close. For instance, bagging of Naïve Bayes tree (trees.NBTree.Bagging) achieves the best AUC (0.9042), which is considered an extremely important measure in software defect prediction2,62; boosting of CART (trees.SimpleCart.Adaboost), bagging of C4.5 decision tree (trees.J48.Bagging), and bagging of decision table (rules.DecisionTable.Bagging) produce similar results in terms of the AUC.
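As a concrete, hedged illustration of Steps 2 and 3 of the experimental design (10-fold cross-validation and collection of the evaluation measures), the sketch below evaluates one candidate model with several of the measures defined in Sec. 4.1. scikit-learn and a synthetic dataset stand in for WEKA 3.7 and the MDP data; all names and parameter values are illustrative assumptions, not the study's configuration.

```python
# Illustrative stand-in for Steps 2-3: 10-fold cross-validation of one candidate model,
# collecting several performance measures in a single pass (scikit-learn, not WEKA).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

# Synthetic, imbalanced two-class data standing in for an MDP dataset.
X, y = make_classification(n_samples=1000, n_features=21, weights=[0.85, 0.15], random_state=7)

scoring = {"auc": "roc_auc", "accuracy": "accuracy", "f_measure": "f1",
           "precision": "precision", "recall": "recall"}

results = cross_validate(AdaBoostClassifier(n_estimators=50, random_state=7),
                         X, y, cv=10, scoring=scoring)
for name in scoring:
    print(f"{name}: {results['test_' + name].mean():.4f}")   # averaged over the 10 folds
```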
Table 2. Classification results of ensembles and classifiers.
(Algorithms/measures in percent; averages over the 10 test datasets under 10-fold cross-validation. Columns: Area Under ROC, Overall Accuracy, F-measure, Mean Absolute Error, Kappa Statistic, True Positive Rate, False Positive Rate, True Negative Rate, False Negative Rate, Precision, Recall, Train Time, Test Time. Rows: the 12 AdaBoost ensembles, the 12 bagging ensembles, the 12 individual classifiers, meta.Stacking, and meta.Vote.)
The second observation is that a classifier which obtains the best result for a given measure may perform poorly on other measures. For instance, Bayesian network (Bayes.BayesNet) has the best FP and TN rates, but performs poorly on F-measure, precision, and recall. The third observation is that no classifier yields the best results across all 13 measures, which is consistent with Challagulla et al.'s work.64

After the computation of the classification models over the software defect datasets, the next step is to conduct a set of pairwise comparisons. As Saaty21 pointed out, one should put decision elements into groups when their number is large. Since there are 38 classification models in the experiment, the pairwise comparisons were carried out in two stages. In the first stage, the classification models are grouped into AdaBoost, bagging, and others; the others group includes stacking, voting, and the 12 individual base classifiers. Pairwise comparisons are conducted to identify the relatively high-ranking classification models within each group. In the second stage, the top five ranking algorithms from each group are compared with each other to obtain a global ranking of classification algorithms for software defect prediction.
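To show how Step 5 combines these group-level results, the sketch below aggregates local priorities (one per performance measure) with the criteria weights from Table 1 into an overall rating. The criteria subset and all numeric values for the candidate models are invented placeholders for illustration, not values from the experiment.

```python
# Sketch of the AHP aggregation in Step 5: the overall priority of each model is the
# weighted sum of its local priorities under every criterion. Values are placeholders.
import numpy as np

# Criteria weights taken from the Priority column of Table 1 (a subset, renormalized):
# AUC, overall accuracy, F-measure, MAE, recall.
criteria_weights = np.array([0.2513, 0.1262, 0.1262, 0.1262, 0.0491])
criteria_weights = criteria_weights / criteria_weights.sum()

# local_priorities[i, k]: priority of candidate model i under criterion k, as produced
# by the pairwise comparison matrix for that criterion (hypothetical numbers).
local_priorities = np.array([
    [0.40, 0.35, 0.30, 0.25, 0.30],   # hypothetical model A
    [0.35, 0.40, 0.40, 0.45, 0.40],   # hypothetical model B
    [0.25, 0.25, 0.30, 0.30, 0.30],   # hypothetical model C
])

overall = local_priorities @ criteria_weights   # composite (global) priorities
ranking = np.argsort(-overall)                  # best model first
print(overall.round(4), ranking)
```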
Table 3. Priorities of AdaBoost classifiers (Group 1).

Algorithms                                   Priorities
trees.SimpleCart.Adaboost                    0.11092
trees.J48.Adaboost                           0.10994
trees.NBTree.Adaboost                        0.10025
rules.JRip.Adaboost                          0.09858
rules.DecisionTable.Adaboost                 0.09497
lazy.IBk.Adaboost                            0.09341
functions.MultilayerPerceptron.Adaboost      0.08028
functions.Logistic.Adaboost                  0.07428
bayes.BayesNet.Adaboost                      0.06831
functions.SMO.Adaboost                       0.06339
functions.RBFNetwork.Adaboost                0.05892
bayes.NaïveBayes.Adaboost                    0.04675

Table 4. Priorities of bagging classifiers (Group 2).

Algorithms                                   Priorities
trees.J48.Bagging                            0.11215
lazy.IBk.Bagging                             0.10731
trees.NBTree.Bagging                         0.10696
trees.SimpleCart.Bagging                     0.10545
rules.JRip.Bagging                           0.09810
functions.MultilayerPerceptron.Bagging       0.09752
rules.DecisionTable.Bagging                  0.09395
functions.Logistic.Bagging                   0.08071
functions.RBFNetwork.Bagging                 0.05279
bayes.BayesNet.Bagging                       0.05246
functions.SMO.Bagging                        0.05239
bayes.NaïveBayes.Bagging                     0.04020
Table 5. Priorities of stacking, voting, and individual classifiers (Group 3).

Algorithms                         Priorities
meta.Stacking                      0.09551
lazy.IBk                           0.09531
trees.J48                          0.08763
meta.Vote                          0.08494
trees.NBTree                       0.08475
trees.SimpleCart                   0.07873
functions.MultilayerPerceptron     0.07809
rules.JRip                         0.07737
rules.DecisionTable                0.07480
functions.Logistic                 0.07264
bayes.BayesNet                     0.04673
functions.SMO                      0.04445
functions.RBFNetwork               0.04271
bayes.NaïveBayes                   0.03633
The priorities of the classification methods within each group, obtained using pairwise comparisons, are summarized in Tables 3–5, respectively. The rightmost column of Table 3 reports the priorities of the 12 AdaBoost algorithms, each with a different base classifier. The priorities were calculated following the AHP method described in Sec. 3. Within the AdaBoost group, CART (trees.SimpleCart.Adaboost), C4.5 (trees.J48.Adaboost), and Naïve Bayes tree (trees.NBTree.Adaboost) are the top-ranked classifiers. C4.5 (trees.J48.Bagging), K-nearest-neighbor (lazy.IBk.Bagging), and Naïve Bayes tree (trees.NBTree.Bagging) are the top-ranked classifiers in the bagging group.
Table 6. Priorities of the top five classifiers from each group.

Algorithms                      Priorities
trees.SimpleCart.Adaboost       0.07529
trees.J48.Adaboost              0.07465
trees.J48.Bagging               0.07029
trees.NBTree.Adaboost           0.06840
lazy.IBk                        0.06768
meta.Stacking                   0.06744
rules.JRip.Adaboost             0.06724
trees.NBTree.Bagging            0.06707
lazy.IBk.Bagging                0.06684
trees.SimpleCart.Bagging        0.06629
rules.DecisionTable.Adaboost    0.06466
trees.J48                       0.06213
rules.JRip.Bagging              0.06157
trees.NBTree                    0.06026
meta.Vote                       0.06020
For the others group, stacking and K-nearest-neighbor (lazy.IBk) achieve the highest ranking, followed by C4.5 (trees.J48). The results presented in Tables 3–5 suggest that the C4.5 decision tree, Naïve Bayes tree, and K-nearest-neighbor are among the best performers for software defect detection.

Table 6 gives the results of the pairwise comparisons of the top five ranking classifiers selected from each of the three groups. Among the top five ranking classifiers in Table 6, three are AdaBoost algorithms and one is a bagging algorithm, which indicates that ensemble methods can certainly improve the performance of base classifiers for the task of software defect detection.
5. Conclusions

Though some previous studies have illustrated that ensemble methods can achieve satisfactory results in software defect prediction, inconsistencies exist across studies, and the performance of learning algorithms may vary with the performance measures used and with the circumstances of the experiment. Therefore, more research is needed to improve our understanding of the performance of ensemble algorithms in software defect prediction.

Recognizing that experimental results using different performance measures over different datasets may be inconsistent, this work introduced the AHP method, a multicriteria decision-making approach, to derive the priorities of ensemble algorithms for the task of software defect prediction. An experiment was designed to compare three popular ensemble methods (bagging, boosting, and stacking) and 12 well-known classification methods using 13 performance measures over 10 public-domain software defect datasets from the NASA Metrics Data Program (MDP) repository. The experimental results can be summarized in the following observations:

• Ensemble methods can improve the classification results for software defect prediction in general, and AdaBoost gives the best results.
• Tree- and rule-based classifiers perform better in software defect prediction than the other types of classifiers included in the experiment. Among single classifiers, K-nearest-neighbor (lazy.IBk), C4.5 (trees.J48), and Naïve Bayes tree (trees.NBTree) ranked higher than the other classifiers.
• Stacking and voting can improve classification results and provide relatively stable outcomes, but their results are not as good as those of AdaBoost and bagging.
• The ranking of algorithms may change in different settings of comparisons. For example, voting (meta.Vote) outranks Naïve Bayes tree (trees.NBTree) in group three (Table 5), while in the overall comparison Naïve Bayes tree ranks higher than voting (Table 6). This is due to the pairwise comparisons conducted by the AHP: when the set of alternative classifiers changes, the relative ranking of algorithms may change, especially when the difference between two classifiers is statistically significant.
Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful comments and the NASA MDP for providing the software defect datasets. This research has been partially supported by grants from the National Natural Science Foundation of China (Nos. 70901011, 70901015, and 70921061).

References

1. NIST Planning Report 02-3, The Economic Impacts of Inadequate Infrastructure for Software Testing (U.S. Department of Commerce's National Institute of Standards & Technology, 2002).
2. S. Lessmann, B. Baesens, C. Mues and S. Pietsch, Benchmarking classification models for software defect prediction: A proposed framework and novel findings, IEEE Transactions on Software Engineering 34(4) (2008) 485–496.
3. J. C. Munson and T. M. Khoshgoftaar, The detection of fault-prone programs, IEEE Transactions on Software Engineering 18(5) (1992) 423–433.
4. T. M. Khoshgoftaar, A. S. Pandya and D. L. Lanning, Application of neural networks for predicting faults, Annals of Software Engineering 1(1) (1995) 141–154.
5. T. M. Khoshgoftaar, E. B. Allen, J. P. Hudepohl and S. J. Aud, Application of neural networks to software quality modeling of a very large telecommunications system, IEEE Transactions on Neural Networks 8(4) (1997) 902–909.
6. T. M. Khoshgoftaar, E. B. Allen, W. D. Jones and J. P. Hudepohl, Classification-tree models of software-quality over multiple releases, IEEE Transactions on Reliability 49(1) (2000) 4–11.
7. T. M. Khoshgoftaar, E. B. Allen and J. Deng, Using regression trees to classify fault-prone software modules, IEEE Transactions on Reliability 51(4) (2002) 455–462.
8. T. M. Khoshgoftaar and N. Seliya, Analogy-based practical classification rules for software quality estimation, Empirical Software Engineering 8(4) (2003) 325–350.
9. T. Menzies, J. DiStefano, A. Orrego and R. Chapman, Assessing predictors of software defects, in Proceedings of Workshop Predictive Software Models (2004).
10. A. A. Porter and R. W. Selby, Evaluating techniques for generating metric-based classification trees, Journal of Systems and Software 12(3) (1990) 209–218.
11. K. El-Emam, S. Benlarbi, N. Goel and S. N. Rai, Comparing case-based reasoning classifiers for predicting high risk software components, Journal of Systems and Software 55(3) (2001) 301–310.
12. K. Ganesan, T. M. Khoshgoftaar and E. B. Allen, Case-based software quality prediction, International Journal of Software Engineering and Knowledge Engineering 10(2) (2000) 139–152.
13. K. O. Elish and M. O. Elish, Predicting defect-prone software modules using support vector machines, Journal of Systems and Software 81(5) (2008) 649–660.
14. Y. Peng, G. Kou, G. Wang, H. Wang and F. Ko, Empirical evaluation of classifiers for software risk management, International Journal of Information Technology and Decision Making 8(4) (2009) 749–768.
15. Y. Peng, G. Wang and H. Wang, User preferences based software defect detection algorithms selection using MCDM, Information Sciences (2010), doi: 10.1016/j.ins.2010.04.019.
16. L. Guo, Y. Ma, B. Cukic and H. Singh, Robust prediction of fault-proneness by random forests, in Proceedings of the 15th International Symposium on Software Reliability Engineering (2004).
17. N. E. Fenton and M. Neil, A critique of software defect prediction models, IEEE Transactions on Software Engineering 25(5) (1999) 675–689.
18. I. Myrtveit and E. Stensrud, A controlled experiment to assess the benefits of estimating with analogy and regression models, IEEE Transactions on Software Engineering 25(4) (1999) 510–525.
19. I. Myrtveit, E. Stensrud and M. Shepperd, Reliability and validity in comparative studies of software prediction models, IEEE Transactions on Software Engineering 31(5) (2005) 380–391.
20. M. Shepperd and G. Kadoda, Comparing software prediction techniques using simulation, IEEE Transactions on Software Engineering 27(11) (2001) 1014–1022.
21. T. L. Saaty, The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation (McGraw-Hill, Columbus, OH, 1980).
22. M. Chapman, P. Callis and W. Jackson, Metrics Data Program, NASA IV and V Facility, http://mdp.ivv.nasa.gov/, 2004.
23. T. G. Dietterich, An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization, Machine Learning 40(2) (2000) 139–157.
24. Y. Freund and R. E. Schapire, Experiments with a new boosting algorithm, in Proc. 13th International Conference on Machine Learning (Morgan Kaufmann, San Francisco, 1996), pp. 148–156.
25. T. G. Dietterich, Machine learning research: Four current directions, AI Magazine 18 (1997) 97–136.
26. K. M. Ting and Z. Zheng, A study of AdaBoost with Naïve Bayesian classifiers: Weakness and improvement, Computational Intelligence 19(2) (2003) 186–200.
27. T. Wilson, J. Wiebe and R. Hwa, Recognizing strong and weak opinion clauses, Computational Intelligence 22(2) (2006) 73–99.
28. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd edn. (Morgan Kaufmann, 2006).
29. D. Opitz and R. Maclin, Popular ensemble methods: An empirical study, Journal of Artificial Intelligence Research 11 (1999) 169–198.
30. T. G. Dietterich, Ensemble methods in machine learning, in J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems, Lecture Notes in Computer Science, Vol. 1857 (Springer-Verlag, New York, 2000b), pp. 1–15.
31. E. Bauer and R. Kohavi, An empirical comparison of voting classification algorithms: Bagging, boosting, and variants, Machine Learning 36(1/2) (1999) 105–139.
32. L. Breiman, Bagging predictors, Machine Learning 24(2) (1996) 123–140.
33. L. Breiman, Heuristics of instability in model selection, The Annals of Statistics 24(6) (1994) 2350–2383.
34. I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. (Morgan Kaufmann, San Francisco, 2005).
35. R. Schapire, The strength of weak learnability, Machine Learning 5(2) (1990) 197–227.
36. Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1997) 119–139.
37. D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259.
38. Y. Peng, G. Kou, Y. Shi and Z. Chen, A descriptive framework for the field of data mining and knowledge discovery, International Journal of Information Technology and Decision Making 7(4) (2008) 639–682.
39. L. Hansen and P. Salamon, Neural network ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 993–1001.
40. A. Krogh and J. Vedelsby, Neural network ensembles, cross validation, and active learning, in G. Tesauro, D. Touretzky and T. Leen (eds.), Advances in Neural Information Processing Systems, Vol. 7 (MIT Press, Cambridge, MA, 1995), pp. 231–238.
41. L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees (Wadsworth International Group, Belmont, California, 1984).
42. R. Kohavi, Scaling up the accuracy of Naïve Bayes classifiers: A decision tree hybrid, in E. Simoudis, J. W. Han and U. Fayyad (eds.), Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (AAAI Press, Menlo Park, CA, 1996), pp. 202–207.
43. J. R. Quinlan, C4.5: Programs for Machine Learning (Morgan Kaufmann, 1993).
44. S. Le Cessie and J. C. Houwelingen, Ridge estimators in logistic regression, Applied Statistics 41(1) (1992) 191–201.
45. C. M. Bishop, Neural Networks for Pattern Recognition (Oxford University Press, 1995).
46. J. C. Platt, Fast training of support vector machines using sequential minimal optimization, in B. Schölkopf, C. J. C. Burges and A. Smola (eds.), Advances in Kernel Methods-Support Vector Learning (MIT Press, 1998), pp. 185–208.
47. V. N. Vapnik, The Nature of Statistical Learning Theory (Springer, New York, USA, 1995).
48. P. Domingos and M. Pazzani, On the optimality of the simple Bayesian classifier under zero-one loss, Machine Learning 29(2–3) (1997) 103–130.
49. S. M. Weiss and C. A. Kulikowski, Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems (Morgan Kaufmann, 1991).
50. B. V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (IEEE Computer Society Press, 1991).
51. R. Kohavi, The power of decision tables, in N. Lavrac and S. Wrobel (eds.), Proceedings of the Eighth European Conference on Machine Learning (Springer-Verlag, Iraklion, Crete, Greece, 1995), pp. 174–189.
52. W. W. Cohen, Fast effective rule induction, in Proceedings of the Twelfth International Conference on Machine Learning (Morgan Kaufmann, 1995), pp. 115–123.
53. T. L. Saaty, How to make a decision: The analytic hierarchy process, European Journal of Operational Research 48 (1990) 9–26.
54. T. L. Saaty and M. Sagir, Extending the measurement of tangibles to intangibles, International Journal of Information Technology & Decision Making 8(1) (2009) 7–27.
55. T. L. Saaty, Decision making with the analytic hierarchy process, International Journal of Services Sciences 1(1) (2008) 83–98.
56. T. L. Saaty, A scaling method for priorities in hierarchical structures, Journal of Mathematical Psychology 15(3) (1977) 234–281.
57. F. Zahedi, The analytic hierarchy process — a survey of the method and its applications, Interfaces 16(4) (1986) 96–108.
58. W. Ho, Integrated analytic hierarchy process and its applications — A literature review, European Journal of Operational Research 186(1) (2008) 211–228.
59. K. Sugihara and H. Tanaka, Interval evaluations in the analytic hierarchy process by possibility analysis, Computational Intelligence 17(3) (2001) 567–579.
60. H. Li and L. Ma, Ranking decision alternatives by integrated DEA, AHP and Gower plot techniques, International Journal of Information Technology & Decision Making 7(2) (2008) 241–258.
61. D. K. Despotis and D. Derpanis, A min–max goal programming approach to priority derivation in AHP with interval judgments, International Journal of Information Technology & Decision Making 7(1) (2008) 175–182.
62. C. Ferri, J. Hernandez-Orallo and R. Modroiu, An experimental comparison of performance measures for classification, Pattern Recognition Letters (2009) 27–38.
63. C. Mair, G. Kadoda, M. Lefley, L. Phalp, K. Schofield, M. Shepperd and S. Webster, An investigation of machine learning based prediction systems, Journal of Systems Software 53(1) (2000) 23–29.
64. V. U. B. Challagulla, F. B. Bastani, I. Y. Raymond and A. Paul, Empirical assessment of machine learning based software defect prediction techniques, International Journal on Artificial Intelligence Tools 17(2) (2008) 389–400.
65. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval (Addison Wesley, 1999).
66. L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms (Wiley, 2004).