2009 21st IEEE International Conference on Tools with Artificial Intelligence
Exploring Software Quality Classification with a Wrapper-Based Feature Ranking Technique

Kehan Gao
Eastern Connecticut State University, Willimantic, Connecticut 06226
[email protected]
Taghi Khoshgoftaar & Amri Napolitano
Florida Atlantic University, Boca Raton, Florida 33431
[email protected],
[email protected]
Abstract

Feature selection is a process of selecting a subset of relevant features for building learning models. It is an important data preprocessing activity in software quality modeling and other data mining problems. Feature selection algorithms can be divided into two categories, feature ranking and feature subset selection. Feature ranking orders the features by a criterion, and a user selects some of the features that are appropriate for a given scenario. Feature subset selection techniques search the space of possible feature subsets and evaluate the suitability of each. This paper investigates performance metric based feature ranking techniques using the multilayer perceptron (MLP) learner with nine different performance metrics: Overall Accuracy (OA), Default F-Measure (DFM), Default Geometric Mean (DGM), Default Arithmetic Mean (DAM), Area Under ROC (AUC), Area Under PRC (PRC), Best F-Measure (BFM), Best Geometric Mean (BGM) and Best Arithmetic Mean (BAM). The goal of the paper is to study the effect of the different performance metrics on the feature ranking results, which in turn influence the classification performance. We assessed the performance of the classification models constructed on the selected feature subsets through an empirical case study carried out on six data sets from real-world software systems. The results demonstrate that AUC, PRC, BFM, BGM and BAM as performance metrics for feature ranking outperformed the other performance metrics, OA, DFM, DGM and DAM, unanimously across all the data sets and are therefore recommended based on this study. In addition, the performance of the classification models was maintained or even improved when over 85 percent of the features were eliminated from the original data sets.

Keywords: performance metric, feature ranking technique, software quality modeling

1. Introduction

High quality is one of the goals for any software product, especially high-assurance and mission-critical systems. Over the past couple of decades, software quality assurance teams have dedicated a great deal of effort to improving software quality. The methods that are often used include rigorous design and code review, extensive testing, and reengineering. In any circumstance, early detection of problematic components in a software system is critical so that software quality assurance efforts can be prioritized to target those problematic components (modules). Software quality models based on software metrics collected prior to software testing and operations are tools that can estimate faulty components in the software development process [6]. For example, software quality models can estimate quality-based classes, fault-prone (fp) and not fault-prone (nfp), for a given program module.

In recent years, a variety of techniques have been developed for improving the predictive accuracy of software quality models [12, 14, 17]. It has been shown in some studies that the performance of these models is influenced by two factors. One factor is the learner(s) used in the modeling process, while the other is the quality of the input data involved in the model building process. When conflicting and noisy data instances and/or irrelevant and redundant features are eliminated from the original data set, the predictive accuracy of software quality models can be improved [10, 12, 17]. In this study, we focused on feature selection by means of feature ranking techniques.

Feature selection, also known as attribute selection, is the process of selecting a subset of relevant features for building learning models. It is a frequently used activity in data preprocessing for data mining problems. Feature selection algorithms can be categorized into two groups, feature ranking and feature subset selection. Feature ranking orders the features by some criterion and a user selects some of the features that are appropriate for a given data set. Feature subset selection techniques search the space of possible
feature subsets and evaluate the suitability of each. In this paper, we investigate a performance metric based feature ranking technique using the multilayer perceptron (MLP) learner with nine different performance metrics: Overall Accuracy (OA), Default F-Measure (DFM), Default Geometric Mean (DGM), Default Arithmetic Mean (DAM), Area Under ROC (AUC), Area Under PRC (PRC), Best F-Measure (BFM), Best Geometric Mean (BGM) and Best Arithmetic Mean (BAM). Our goal is to discover the potential impact of the performance metrics on the feature rankings. We used the MLP learner with each of the nine performance metrics to generate nine feature ranking results for a given data set. Then, for each ranking, we selected a number of features appropriate for the given data set. The selected attributes were treated as the most relevant attributes for building classification models. To assess the performance of the classification models constructed on those selected attribute subsets, we used MLP again, but this time as a classifier to build classification models with the selected attributes and with all attributes, and compared them in terms of the AUC performance metric.

To our knowledge, very limited work has been done on such wrapper-based feature ranking techniques. In related literature [7], Forman used 12 filter-based ranking techniques (no learners were involved in feature ranking) in the domain of text classification, where all the attributes are in binary format, and three performance metrics (two forms of accuracy measures and F-measure) were used. Our study, in contrast, investigated data sets with numeric attributes. The experiments of this study were carried out on six data sets from real-world software systems. We also compared the classification performance on the smaller subsets of attributes with that on the original data sets. The experimental results demonstrate that AUC, PRC, BFM, BGM and BAM as performance metrics for feature ranking outperformed the other performance metrics, OA, DFM, DGM and DAM, unanimously across all the data sets and are therefore recommended based on this study. In addition, the classification accuracy of the models was maintained or even improved when over 85 percent of the features were removed from the original data sets. This would significantly reduce future software quality assurance (SQA) efforts on metrics collection, model construction, and model validation of similar systems. Also, it is easier for SQA team members to manage software quality with fewer software metrics.

The rest of the paper is organized as follows. Section 2 provides more detailed information about the attribute ranking technique and the nine performance metrics used in the study. The data sets used in the experiments are described in Section 3. Section 4 presents the experimental results. Related work is discussed in Section 5. Finally, the conclusion is summarized in Section 6.
2. Methodology

2.1. Feature Ranking Technique

Attribute selection is a process of reducing data dimensionality, and attribute selection algorithms can be categorized in several ways. Some references divide them into two groups, wrappers and filters [15]. Wrappers are algorithms that use feedback from a learning algorithm to determine which attribute(s) to use in building a classification or prediction model, whereas in filters the training data are analyzed using a method that does not require a learning algorithm to determine which attributes are most relevant. Attribute selection can also be divided into feature ranking and feature subset selection [9]. Feature ranking assesses attributes individually and ranks them according to their individual predictive power, while feature subset selection approaches select subsets of attributes that together have good predictive power.
Figure 1. Performance Metric Based Feature Ranking Technique

In this study, we used a performance metric based feature ranking technique. The overall structure is presented in Figure 1. The technique consists of two parts: a learner (classifier) and a performance metric. We used the multilayer perceptron (MLP) learner implemented in the WEKA machine learning tool [18] together with nine different performance metrics (discussed in the next section) for this feature ranking technique. Some related parameters of the MLP learner (a type of neural network) were set as follows. The 'hiddenLayers' parameter was set to '3' to define a network with one hidden layer containing three nodes, and the 'validationSetSize' parameter was set to '10' so that the classifier leaves 10% of the training data aside as a validation set to determine when to stop the iterative training process. Changes to the default parameter values in WEKA were made only when experimentation showed a general improvement in classifier performance across all data sets based on preliminary analysis.

In the feature ranking technique, every attribute in a given fit (training) data set was used individually in the model building process. For example, we used the first attribute of the fit data set as the single independent variable and the class attribute, fp or nfp, as the dependent variable. We used three-fold cross-validation to build the model and assessed its performance with the nine performance metrics. We then repeated this for the second attribute, third attribute, and so on. In total, we constructed n data sets, each with one independent attribute and the class attribute, where n is the number of independent attributes in the complete (original) data set. We ranked the attributes based on the different performance metrics and then selected the top log2 n attributes according to a given performance metric based ranking as the set of selected attributes. We selected the top log2 n features for two reasons: (1) in the related literature, there is no guidance on the number of features that should be selected when using a feature ranking technique; and (2) in one of our recent empirical studies [13], we showed that log2 n is an appropriate number of features when using WEKA to build random forest learners for binary classification in general and for imbalanced data sets in particular. Although we used a different learner in this study, a preliminary study showed that log2 n is still a good choice for various learners [8]. Moreover, a software engineering expert with more than 20 years of experience in software quality modeling agreed that log2 n selected attributes is appropriate for the data sets used in this study.

After the attribute selection, MLP was used again, but this time to build software quality classification models. The performances of the classification models were evaluated in terms of the AUC performance metric, which is commonly used in software engineering, data mining and machine learning. MLP was used in the case study because it is a well-known classifier in data mining and has been widely used in various classification problems; in addition, MLP itself does not possess the capability of attribute selection. Other learners besides MLP, such as naïve Bayes, support vector machines and instance-based learning, were also investigated in the study; however, due to space limitations, the results based on those learners are not presented in this paper. It is worthwhile to note that MLP played two roles in the study: a learner in the feature ranking process and a classifier in the program module classification process.
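The single-attribute wrapper ranking just described can be approximated with standard tools. The paper uses WEKA's MultilayerPerceptron; the following sketch instead uses scikit-learn's MLPClassifier as an illustrative stand-in, with one hidden layer of three nodes and 10% of the training data held out for early stopping (roughly analogous to hiddenLayers = 3 and validationSetSize = 10). The function names, AUC default scorer, and ceiling interpretation of log2 n are assumptions for illustration, not the authors' code.

```python
# Sketch of performance metric based feature ranking: build a one-attribute MLP
# per attribute, score it with 3-fold cross-validation, rank, keep top log2(n).
import math
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

def rank_attributes(X, y, scorer=roc_auc_score, n_folds=3):
    """Score each attribute individually and return indices sorted best-first."""
    scores = {}
    for j in range(X.shape[1]):
        clf = MLPClassifier(hidden_layer_sizes=(3,), early_stopping=True,
                            validation_fraction=0.1, max_iter=1000, random_state=1)
        # Cross-validated probability of the positive (fault-prone) class.
        prob = cross_val_predict(clf, X[:, [j]], y, cv=n_folds,
                                 method="predict_proba")[:, 1]
        scores[j] = scorer(y, prob)
    return sorted(scores, key=scores.get, reverse=True)

def select_top_log2(ranking, n_attributes):
    """Keep the top ceil(log2(n)) attributes (6 of 42, 4 of 13 in the paper)."""
    k = int(math.ceil(math.log2(n_attributes)))
    return ranking[:k]
```

Any of the nine metrics of Section 2.2 could be plugged in as the scorer; threshold-based metrics such as OA or DFM would first threshold the cross-validated probabilities.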
2.2. Performance Metrics

In a binary classification problem (positive and negative, where positive and negative refer to fault-prone and not fault-prone modules, respectively), there can be four possible outcomes of classifier prediction: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The resulting two-by-two confusion matrix is shown in Table 1.

Table 1. Confusion Matrix for Binary Classification

                        Correct Result
                        +       -
Obtained Result   +     TP      FP
                  -     FN      TN

The four values |TP|, |TN|, |FP| and |FN| provided by the confusion matrix form the basis for several other performance metrics that are well known and commonly used within the data mining and machine learning community, where |·| represents the number of instances in a given set. The four basic performance metrics, true positive rate (TPR), true negative rate (TNR), false positive rate (FPR) and false negative rate (FNR), are defined as follows.
TPR = |TP| / (|TP| + |FN|)
TNR = |TN| / (|FP| + |TN|)
FPR = |FP| / (|FP| + |TN|)
FNR = |FN| / (|TP| + |FN|)
These metrics are typically used in pairs when evaluating classifiers. For example, the false positive and true positive rates are often considered simultaneously. The difficulty with this approach when comparing two classifiers, however, is that inconclusive results are often obtained. In other words, classifier A may have a higher true positive rate but a lower true negative rate than classifier B. The need to evaluate classifier performance with a single metric has led to the proposal of numerous alternatives, as discussed below.

The Overall Accuracy (OA) provides a single value that ranges from 0 to 1. It can be calculated by the following equation:

OA = (|TP| + |TN|) / N

where N represents the total number of instances in a data set. While the overall accuracy allows for easier comparisons of model performance, it is often not considered to be a reliable performance metric, especially in the presence of class imbalance [1].

The F-Measure is a single-value metric that originated from the field of information retrieval [16]. It can be expressed by the following equation:

FM = (2 × |TP|) / (2 × |TP| + |FP| + |FN|)
The Default F-Measure (DFM) corresponds to a decision threshold value of 0.5, while the Best F-Measure (BFM) is the largest value of FM obtained when varying the threshold value
between 0 and 1. A perfect classifier yields an F-measure of 1, i.e., no misclassifications.

The Geometric Mean is a single-value performance measure that ranges from 0 to 1, and a perfect classifier provides a value of 1. GM can be calculated using the following formula:

GM = √(TPR × TNR)
Table 2. Software Metrics for LLTS Data Sets
Product Metrics
CALUNQ: Number of distinct procedure calls to others.
CAL2: Number of second and following calls to others. CAL2 = CAL - CALUNQ, where CAL is the total number of calls.
CNDNOT: Number of arcs that are not conditional arcs.
IFTH: Number of non-loop conditional arcs (i.e., if-then constructs).
LOP: Number of loop constructs.
CNDSPNSM: Total span of branches of conditional arcs. The unit of measure is arcs.
CNDSPNMX: Maximum span of branches of conditional arcs.
CTRNSTMX: Maximum control structure nesting.
KNT: Number of knots. A “knot” in a control flow graph is where arcs cross due to a violation of structured programming principles.
NDSINT: Number of internal nodes (i.e., not an entry, exit, or pending node).
NDSENT: Number of entry nodes.
NDSEXT: Number of exit nodes.
NDSPND: Number of pending nodes (i.e., dead code segments).
LGPATH: Base 2 logarithm of the number of independent paths.
FILINCUQ: Number of distinct include files.
LOC: Number of lines of code.
STMCTL: Number of control statements.
STMDEC: Number of declarative statements.
STMEXE: Number of executable statements.
VARGLBUS: Number of global variables used.
VARSPNSM: Total span of variables.
VARSPNMX: Maximum span of variables.
VARUSDUQ: Number of distinct variables used.
VARUSD2: Number of second and following uses of variables. VARUSD2 = VARUSD - VARUSDUQ, where VARUSD is the total number of variable uses.

Process Metrics
DES PR: Number of problems found by designers during development of the current release.
BETA PR: Number of problems found during beta testing of the current release.
DES FIX: Number of problems fixed that were found by designers in the prior release.
BETA FIX: Number of problems fixed that were found by beta testing in the prior release.
CUST FIX: Number of problems fixed that were found by customers in the prior release.
REQ UPD: Number of changes to the code due to new requirements.
TOT UPD: Total number of changes to the code for any reason.
REQ: Number of distinct requirements that caused changes to the module.
SRC GRO: Net increase in lines of code.
SRC MOD: Net new and changed lines of code.
UNQ DES: Number of different designers making changes.
VLO UPD: Number of updates to this module by designers who had 10 or less total updates in entire company career.
LO UPD: Number of updates to this module by designers who had between 11 and 20 total updates in entire company career.
UPD CAR: Number of updates that designers had in their company careers.

Execution Metrics
USAGE: Deployment percentage of the module.
RESCPU: Execution time (microseconds) of an average transaction on a system serving consumers.
BUSCPU: Execution time (microseconds) of an average transaction on a system serving businesses.
TANCPU: Execution time (microseconds) of an average transaction on a tandem system.
It is a useful performance measure since it tends to maximize the true positive rate and the true negative rate while keeping them relatively balanced. The decision threshold t = 0.5 is used for the Default Geometric Mean (DGM), while the Best Geometric Mean (BGM) is the maximum geometric mean obtained when varying the threshold between 0 and 1.

The Arithmetic Mean is defined like the geometric mean, but uses the arithmetic mean of the true positive rate and true negative rate instead: AM = (TPR + TNR)/2. It is also a single-value performance measure that ranges from 0 to 1. The threshold t = 0.5 is used for the Default Arithmetic Mean (DAM), while the Best Arithmetic Mean (BAM) is, like the BGM, the maximum arithmetic mean obtained when varying the threshold between 0 and 1.

The Area Under the ROC (receiver operating characteristic) curve (AUC) is a single-value measurement that originated from the field of signal detection. The value of the AUC ranges from 0 to 1. The ROC curve characterizes the trade-off between the true positive rate and the false positive rate [5]. A classifier that provides a large area under the curve is preferable to a classifier with a smaller area under the curve, and a perfect classifier provides an AUC equal to 1.

The Area Under the Precision-Recall Curve (PRC) is a single-value measure that originated from the area of information retrieval. The area under the PRC ranges from 0 to 1. The PRC diagram depicts the trade-off between recall and precision [4], and a perfect classifier results in an area under the PRC of 1. Recall and Precision are defined as follows:
Recall = |TP| / (|TP| + |FN|)
Precision = |TP| / (|TP| + |FP|)
In fact, Recall is the same as TPR.
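The default and "best" threshold-based metrics of this section can be computed directly from a classifier's predicted probabilities. The following is a minimal sketch under the assumption that y_true and y_pred are 0/1 NumPy arrays and that the threshold sweep uses 101 evenly spaced values; function names are illustrative, not from the paper.

```python
# Threshold-based metrics from Section 2.2: "Default" fixes the decision
# threshold at 0.5, "Best" takes the maximum over thresholds in [0, 1].
import numpy as np

def rates(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    fm = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    oa = (tp + tn) / len(y_true)
    return oa, fm, np.sqrt(tpr * tnr), (tpr + tnr) / 2  # OA, FM, GM, AM

def default_and_best(y_true, y_prob, n_thresholds=101):
    """Return (DFM, DGM, DAM) at t = 0.5 and (BFM, BGM, BAM) over a sweep."""
    _, dfm, dgm, dam = rates(y_true, (y_prob >= 0.5).astype(int))
    best = np.zeros(3)
    for t in np.linspace(0.0, 1.0, n_thresholds):
        _, fm, gm, am = rates(y_true, (y_prob >= t).astype(int))
        best = np.maximum(best, [fm, gm, am])
    return (dfm, dgm, dam), tuple(best)
```

The two threshold-free metrics, AUC and PRC, would come from library routines such as scikit-learn's roc_auc_score and a precision-recall curve integration rather than from a threshold sweep.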
3. Data Set Description

For this study, we conducted our experiments on six data sets. Four of them are from a very large legacy telecommunications software system (denoted LLTS) and the other two are from the NASA software projects CM1 and JM1. The two NASA software project data sets
were made available through the Metrics Data Program at NASA (http://mdp.ivv.nasa.gov/).

The LLTS software system was developed in a large organization by professional programmers using PROTEL, a proprietary high-level procedural language (similar to C). The system consists of four successive releases of new versions of the system, and each release comprised several million lines of code. The data collection effort used the Enhanced Measurement for Early Risk Assessment of Latent Defect (EMERALD) system [11]. A decision support system for software measurements and software quality modeling, EMERALD periodically measures the static attributes of the most recent version of the software code. We refer to these four releases as TC1, TC2, TC3 and TC4. Each set of associated source code files was considered a program module. The LLTS data sets consist of 42 software metrics, including 24 product metrics, 14 process metrics and 4 execution metrics, as shown in Table 2. The dependent variable is the class of the software module, fp or nfp. The fault-proneness is based on a selected threshold: modules with one or more faults were considered fp, and nfp otherwise. One can refer to our recent work [12] for a more detailed discussion of these data sets.

The NASA data sets include software measurement data and associated error (fault) data collected at the function, subroutine or method level. Hence, for these software systems, a function, subroutine or method is considered a software module, i.e., an instance in the data set. The CM1 project, written in C, is a science instrument system used for mission measurements. The JM1 project, also written in C, is a real-time ground system that uses simulations to generate predictions for missions. The fault data collected for these software systems represent faults detected during software development. Each module in the respective data sets was characterized by 21 software measurements, which included 13 basic software metrics, as shown in Table 3, and 8 derived Halstead metrics. The quality of the modules is described by their class labels, i.e., nfp and fp. A module was considered fp if it had one or more software faults, and nfp otherwise. The derived Halstead metrics were not used in our case studies.

Table 4 summarizes the numbers of fp and nfp modules and their percentages in each data set. The original JM1 NASA data contained some inconsistent modules, i.e., modules with identical software measurements but different class labels. After eliminating all such modules and those with missing values, we had 8850 modules left for JM1. From the table, we can observe that the distributions of the fp and nfp modules are imbalanced for each data set. For the NASA data sets, around 10 to 20 percent of the modules belong to the fp class, while for the LLTS data sets, less than 7 percent of the modules are in the fp class. In fact, the imbalanced distribution of the two classes (fp
Table 3. Software Metrics for NASA Data Sets

BRANCH COUNT: Branch count metric
LOC TOTAL: Total number of source code lines
LOC EXECUTABLE: Number of lines of executable code
LOC COMMENTS: Number of lines of comments
LOC BLANK: Number of blank lines
NUM OPERATORS: Total number of operators
NUM OPERANDS: Total number of operands
NUM UNIQUE OPERATORS: Number of unique operators
NUM UNIQUE OPERANDS: Number of unique operands
CYCLOMATIC COMPLEXITY: Cyclomatic complexity of a module
ESSENTIAL COMPLEXITY: Essential complexity of a module
DESIGN COMPLEXITY: Design complexity of a module
LOC CODE AND COMMENT: Number of lines with both code and comments
Table 4. Data Set Summary

Data Set   # of nfp Modules    # of fp Modules    Total Modules
TC1        3420 (93.72%)       229 (6.28%)        3649
TC2        3792 (95.25%)       189 (4.75%)        3981
TC3        3494 (98.67%)       47 (1.33%)         3541
TC4        3886 (97.69%)       92 (2.31%)         3978
CM1        457 (90.50%)        48 (9.50%)         505
JM1        7163 (80.94%)       1687 (19.06%)      8850
and nfp) often occurs in software systems, especially in high-assurance and mission-critical systems.
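The data preparation described above (one-fault labeling threshold, removal of missing values, and removal of inconsistent JM1 modules) could be reproduced along the following lines. This is a hedged sketch only: the column name "FAULTS", the CSV input, and the function name are assumptions, not the format of the original LLTS or NASA files.

```python
# Sketch of the data preparation: label modules fp/nfp with a one-fault
# threshold, drop rows with missing values, and remove inconsistent modules
# (identical measurements, different labels), as was done for JM1.
import pandas as pd

def prepare(path, fault_col="FAULTS"):
    df = pd.read_csv(path).dropna()
    df["class"] = (df[fault_col] >= 1).map({True: "fp", False: "nfp"})
    features = df.drop(columns=[fault_col])
    # Remove modules whose feature vectors appear with conflicting class labels.
    feature_cols = [c for c in features.columns if c != "class"]
    conflict = features.groupby(feature_cols)["class"].transform("nunique") > 1
    clean = features[~conflict]
    print(clean["class"].value_counts(normalize=True))  # class distribution, cf. Table 4
    return clean
```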
4. Experiments

As we mentioned previously, the experiments were performed to discover the impact of the different performance metrics on the feature ranking. In this study, we used MLP as the feature ranking learner. Each software attribute, together with the class attribute (fp or nfp), from a given data set was used to train an MLP classification model. For every round of feature ranking, we built 42 classification models (one for each attribute) for each LLTS data set and 13 classification models for each NASA data set. We evaluated the models with nine different performance metrics, OA, DFM, DGM, DAM, AUC, PRC, BFM, BGM and BAM, and ranked the attributes accordingly; an attribute that achieves a higher performance measurement is considered more relevant to the class attribute. We selected the top log2 n attributes for each ranking technique, where n is the number of independent attributes in the original data set. In other words, we selected six attributes for the four LLTS data sets and four attributes for the two NASA data sets. MLP was used again, but this time as a classifier to build a number of
Table 5. Performance Metric, AUC (values are mean and standard deviation over ten runs of five-fold cross-validation)

Metric   TC1               TC2               TC3               TC4               CM1               JM1
ALL      0.7876 (0.0101)   0.7910 (0.0244)   0.7646 (0.0434)   0.7374 (0.0299)   0.7928 (0.0160)   0.6991 (0.0060)
OA       0.7428 (0.0205)   0.7717 (0.0158)   0.7013 (0.0212)   0.7368 (0.0219)   0.7170 (0.0270)   0.6835 (0.0114)
DFM      0.7389 (0.0154)   0.7796 (0.0125)   0.7013 (0.0212)   0.7305 (0.0170)   0.7123 (0.0316)   0.6926 (0.0132)
DGM      0.7389 (0.0154)   0.7796 (0.0125)   0.7013 (0.0212)   0.7305 (0.0170)   0.7123 (0.0316)   0.6922 (0.0131)
DAM      0.7375 (0.0162)   0.7778 (0.0135)   0.7013 (0.0212)   0.7367 (0.0218)   0.7191 (0.0308)   0.6926 (0.0132)
AUC      0.7822 (0.0090)   0.7941 (0.0125)   0.8162 (0.0197)   0.7897 (0.0226)   0.7463 (0.0274)   0.6953 (0.0141)
PRC      0.7924 (0.0105)   0.8032 (0.0090)   0.8065 (0.0202)   0.7832 (0.0259)   0.7409 (0.0329)   0.6939 (0.0142)
BFM      0.7915 (0.0085)   0.8003 (0.0085)   0.8059 (0.0306)   0.7748 (0.0255)   0.7423 (0.0258)   0.6968 (0.0121)
BGM      0.7809 (0.0082)   0.7820 (0.0115)   0.8169 (0.0175)   0.7845 (0.0276)   0.7498 (0.0279)   0.6955 (0.0155)
BAM      0.7818 (0.0083)   0.7829 (0.0122)   0.8149 (0.0225)   0.7868 (0.0257)   0.7480 (0.0279)   0.6959 (0.0130)
classification models on the selected attribute subsets. The performances of the classification models were evaluated in terms of the AUC performance metric. We also applied the MLP classifier to the original data sets (42 attributes for the LLTS data sets and 13 attributes for the NASA data sets) and used the results as a baseline for comparison.

In the experiments, ten runs of five-fold cross-validation were performed. For each of the five folds, one fold is used as the test data while the other four are used as the training data. First, the feature ranking learner (MLP) was applied to the training data (four folds of the full data set) to obtain a ranking of the attributes. As mentioned earlier (in Section 2.1), three-fold cross-validation was used to build the single-attribute models during this process. Because the feature ranking technique is applied only to the training data, none of the instances from the test fold are involved in the attribute selection or in building the classification models with the selected attributes. Once the attributes are ranked and selected, they (together with the class attribute) form the final feature-selected training data. This final training data is used to build the classification model, and the resulting model is applied to the test fold. We used the WEKA tool to implement the feature ranking and model building process. In total, 32,100 classification models were constructed in these experiments.

The classification performance was evaluated in terms of AUC. All the results are reported in Table 5, which lists the mean value and standard deviation of AUC for every classification model constructed over the ten runs of five-fold cross-validation. 'ALL' represents the classification model built on the complete attribute set.
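The evaluation protocol described above, with attribute ranking performed only on the training folds of the outer cross-validation, is sketched below. It reuses the hypothetical rank_attributes and select_top_log2 helpers from the sketch in Section 2.1, and scikit-learn again stands in for WEKA; this is an illustration of the protocol, not the authors' implementation.

```python
# Sketch of ten runs of five-fold cross-validation with feature ranking applied
# only to the training folds (so no test instances influence attribute selection).
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score

def evaluate(X, y, scorer, runs=10, folds=5):
    aucs = []
    for run in range(runs):
        skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        for train_idx, test_idx in skf.split(X, y):
            X_tr, y_tr = X[train_idx], y[train_idx]
            X_te, y_te = X[test_idx], y[test_idx]
            ranking = rank_attributes(X_tr, y_tr, scorer)   # uses training data only
            selected = select_top_log2(ranking, X.shape[1])
            clf = MLPClassifier(hidden_layer_sizes=(3,), early_stopping=True,
                                validation_fraction=0.1, max_iter=1000, random_state=1)
            clf.fit(X_tr[:, selected], y_tr)
            prob = clf.predict_proba(X_te[:, selected])[:, 1]
            aucs.append(roc_auc_score(y_te, prob))
    return np.mean(aucs), np.std(aucs)
```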
From the table, we can observe the following facts:

1. For the four LLTS data sets, the classification accuracy of the MLP classifiers built on the attribute subsets (with 6 attributes) is comparable to, and sometimes even better than, that of the classifiers built on the complete attribute sets (with 42 attributes); the attribute subsets with better classification accuracy than the complete attribute sets can be identified in Table 5 by comparison with the ALL row.

2. For the two NASA data sets, the classification accuracy of the classifiers built on the attribute subsets (with 4 attributes) is similar to (for JM1) or slightly worse than (for CM1) that of the classifiers built on the complete attribute sets (with 13 attributes).
Table 6. Performance Rankings (best to worst)

TC1: PRC, BFM, AUC, BAM, BGM, OA, DFM, DGM, DAM
TC2: PRC, BFM, AUC, BAM, BGM, DFM, DGM, DAM, OA
TC3: BGM, AUC, BAM, PRC, BFM, DFM, DGM, DAM, OA
TC4: AUC, BAM, BGM, PRC, BFM, OA, DAM, DFM, DGM
CM1: BGM, BAM, AUC, BFM, PRC, DAM, OA, DFM, DGM
JM1: BFM, BAM, BGM, AUC, PRC, DFM, DAM, DGM, OA
We then compared the classification results for the different ranking techniques. To make the results easier to observe, we ranked the performance metrics (ranking techniques) in descending order (from best to worst), as shown in Table 6, and also plotted the results of Table 5 in Figure 2 (the AUC values on the complete data sets are excluded). From the table and the figure, we can see that the feature ranking technique always performed better when using the AUC, PRC, BFM, BGM and BAM performance metrics than when using the four remaining performance metrics (OA, DFM, DGM and DAM). This holds for all six data sets in this study.

We also conducted a two-way ANalysis Of VAriance (ANOVA) F test [2] on the performance metric AUC to examine whether the performance differences are significant. The test contains two factors: Factor A represents the results from the nine different performance metrics, while Factor B represents the results from the six different data sets. The null hypothesis for the ANOVA test is that all the group population means are the same; the alternate hypothesis is that at least one pair of means is different. The ANOVA results indicate that the alternate hypothesis is true, i.e., the classification performances, in terms of the AUC, were not the same for all groups in Factor A nor for all groups in Factor B (both p-values are close to zero). We further carried out a multiple comparison test [2] on Factor A with Tukey's honestly significant difference criterion, since this study mainly focuses on the effect of the nine different performance metrics on the feature ranking results (a sketch of this analysis appears after the two points below).

Figure 3. Multiple Comparisons

The multiple comparison result is shown in Figure 3. It displays each group mean as a symbol (◦) with an interval around the symbol (the 95% confidence interval). Two means are significantly different (p = 0.05) if their intervals are disjoint, and not significantly different if their intervals overlap. The results demonstrate the following two points.

• The performance metrics can be divided into two groups based on their performances. Group1 includes OA, DFM, DGM and DAM, and Group2 includes AUC, PRC, BFM, BGM and BAM. The feature ranking technique always performed significantly better when using the performance metrics in Group2 than when using the metrics in Group1.
• The feature ranking technique showed very similar performance when using the performance metrics from the same group.
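The two-way ANOVA and Tukey HSD comparison described above can be reproduced with standard statistical libraries. The sketch below assumes a long-format data frame named results with columns "auc", "metric" and "dataset", and a main-effects model; whether the original analysis included an interaction term is not stated in the paper.

```python
# Hedged sketch of the statistical analysis: two-way ANOVA on the AUC values
# (Factor A = performance metric, Factor B = data set), then Tukey's HSD
# multiple comparison on Factor A.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def analyze(results: pd.DataFrame):
    model = ols("auc ~ C(metric) + C(dataset)", data=results).fit()
    print(anova_lm(model, typ=2))                     # F tests for both factors
    tukey = pairwise_tukeyhsd(results["auc"], results["metric"], alpha=0.05)
    print(tukey.summary())                            # pairwise comparisons on Factor A
```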
In summary, when we use a wrapper-based feature ranking technique, the selection of the performance metric is critical: it may directly influence the feature ranking result, which, in turn, affects the performance of a classification model built on the selected feature subset. Since the data sets used in this study are from real-world software projects, the distributions of the two classes (fp and nfp) are imbalanced. Therefore, the performance metrics with a decision threshold value of 0.5 (such as OA, DFM, DGM and DAM) are not recommended; we should instead consider the performance metrics that achieve better performance (such as AUC, PRC, BFM, BGM and BAM). In addition, the experiments demonstrate that in most circumstances the performances of the classification models do not deteriorate when 70% to 85% of the attributes are eliminated from the original data sets.
Figure 2. Performance Comparisons (AUC, per data set, for each of the nine performance metrics)
5. Related Work

Feature selection, as an important activity in data preprocessing, has been extensively studied for many years in data mining and machine learning, with researchers attempting to improve existing algorithms or to develop new ones. Liu and Yu [15] provided a comprehensive survey of feature selection algorithms and presented an integrated approach to intelligent feature selection. Hall and Holmes [10] investigated six attribute selection techniques that produce ranked lists of attributes and applied them to 15 data sets. The classifiers used in their study were C4.5 and Naïve Bayes. The comparison showed no single best approach for all situations; overall, however, a wrapper approach was the best attribute selection scheme in terms of accuracy if speed of execution was not a concern. Otherwise, CFS (correlation-based feature selection), CNS (consistency-based subset evaluation) and RLF (ReliefF) were good overall performers.

Although feature selection has been widely applied in data mining problems, its applications in software quality and reliability engineering are limited. Rodriguez et al. [17] applied feature selection with three filter models and two wrapper models to five software engineering data sets. The results showed that the reduced data sets maintained the prediction capability with a lower number of attributes than the original data sets, and that the wrapper model was better than the filter model but more computationally expensive. Chen et al. [3] studied the application of feature selection using wrappers to the problem of cost estimation and also concluded that the reduced data set could improve the estimation. Our research group recently investigated four filter attribute selection techniques, Automatic Hybrid Search
(AHS), Rough Sets (RS), Kolmogorov-Smirnov (KS) and Probabilistic Search (PS) [8]. Three of them (AHS, RS and PS) belong to the category of feature subset selection, while the remaining one (KS) is in the category of feature ranking. The experimental results demonstrated that, among the four attribute selection approaches, our recently proposed KS method performed better than the other three techniques (AHS, PS and RS). The classification accuracy of the models built with some smaller subsets of attributes was comparable to that of the models built with the complete set of attributes.
6. Conclusion

In this paper, we presented performance metric based feature ranking techniques using nine different performance metrics. For a given data set, different performance metrics may lead to different rankings of the attributes, which in turn affects the selection of the attribute subsets and the performance of the classification models built on those subsets. The experiments were carried out on six data sets from real-world software projects. The experimental results suggest that some performance metrics, such as AUC, PRC, BFM, BGM and BAM, achieve better performance and should be considered when a feature ranking learner is applied to an imbalanced data set for binary classification. In addition, the empirical study shows that the performances of the classification models were maintained or even improved when over 85 percent of the features were eliminated from the original data sets. Future research may involve conducting more experiments, incorporating other techniques such as data sampling into the wrapper-based feature ranking techniques, investigating filter-based feature ranking techniques such as chi-squared, gain ratio and information gain, comparing the wrapper and filter techniques, and applying more data sets from other software projects.
References

[1] R. Arbel and L. Rokach. Classifier evaluation under limited resources. Pattern Recognition Letters, 27(14):1619–1631, 2006.
[2] M. L. Berenson, M. Goldstein, and D. Levine. Intermediate Statistical Methods and Applications: A Computer Package Approach. Prentice-Hall, Englewood Cliffs, NJ, 2nd edition, 1983.
[3] Z. Chen, T. Menzies, D. Port, and B. Boehm. Finding the right data for software cost modeling. IEEE Software, 22:38–46, 2005.
[4] J. Davis and M. Goadrich. The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240, Pittsburgh, Pennsylvania, 2006.
[5] T. Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, June 2006.
[6] N. E. Fenton and S. L. Pfleeger. Software Metrics: A Rigorous and Practical Approach. PWS Publishing Company: ITP, Boston, MA, 2nd edition, 1997.
[7] G. Forman. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3:1289–1305, March 2003.
[8] K. Gao, T. M. Khoshgoftaar, and H. Wang. An empirical investigation of filter attribute selection techniques for software quality classification. In Proceedings of the 2009 IEEE International Conference on Information Reuse and Integration, pages 272–277, Las Vegas, Nevada, USA, August 10-12, 2009. IEEE Computer Society.
[9] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, March 2003.
[10] M. A. Hall and G. Holmes. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering, 15(6):1437–1447, Nov/Dec 2003.
[11] J. P. Hudepohl, S. J. Aud, T. M. Khoshgoftaar, E. B. Allen, and J. Mayrand. EMERALD: Software metrics and models on the desktop. IEEE Software, 13(5):56–60, September 1996.
[12] T. M. Khoshgoftaar, L. A. Bullard, and K. Gao. Attribute selection using rough sets in software quality classification. International Journal of Reliability, Quality and Safety Engineering, 16(1):73–89, 2009.
[13] T. M. Khoshgoftaar, M. Golawala, and J. V. Hulse. An empirical study of learning from imbalanced data using random forest. In Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, volume 2, pages 310–317, Washington, DC, USA, 2007. IEEE Computer Society.
[14] T. M. Khoshgoftaar, Y. Xiao, and K. Gao. Assessment of a multi-strategy classifier for an embedded software system. In Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence, pages 651–658, Washington, DC, USA, November 13-15, 2006.
[15] H. Liu and L. Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering, 17(4):491–502, 2005.
[16] Y. Ma and B. Cukic. Adequate and precise evaluation of quality models in software engineering studies. In Proceedings of the Third International Workshop on Predictor Models in Software Engineering, Washington, DC, USA, 2007. IEEE Computer Society.
[17] D. Rodriguez, R. Ruiz, J. Cuadrado-Gallego, and J. Aguilar-Ruiz. Detecting fault modules applying feature selection to classifiers. In Proceedings of the 8th IEEE International Conference on Information Reuse and Integration, pages 667–672, Las Vegas, Nevada, August 13-15, 2007.
[18] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005.