ICCCT'10

A Clustering Algorithm for Software Fault Prediction

1 Deepinder Kaur, 2 Arashdeep Kaur, 3 Sunil Gulati, 4 Mehak Aggarwal
1 Student, CSE Deptt., LLRIET, Moga, [email protected]
2 Lecturer, CSE Deptt., Amity University, Noida, [email protected]
3 Senior Specialist, Backup team, HCL Comnet, Noida, [email protected]
4 Asst. Prof., CSE Deptt., LLRIET, Moga, [email protected]
Abstract— Software metrics are used to predict whether the modules of a software project are faulty or fault free. Timely prediction of faults, especially accuracy or computation faults, improves software quality and hence its reliability. Various distance measures can be applied to the traditional K-means clustering algorithm to predict faulty and fault-free modules. In this paper we propose K-Sorensen-means clustering, which uses the Sorensen distance to calculate cluster distance when predicting faults in software projects. The proposed algorithm is trained and tested using three datasets, JM1, PC1 and CM1, collected from the NASA MDP. From these three datasets, requirement metrics, static code metrics and alliance metrics (combining both requirement metrics and static code metrics) have been built, and K-Sorensen-means has been applied to all of them to predict results. The alliance metric model is found to be the best prediction model of the three. The results of K-Sorensen-means clustering are reported and the corresponding ROC curve is drawn. The results of K-Sorensen-means are then compared with those of K-Canberra-means clustering, which uses a different distance measure for evaluating cluster distance.

Keywords— Fault prediction; Clustering; K-means; Euclidean distance; Sorensen distance; Alliance metrics.
I. INTRODUCTION

Software metrics can be used for fault prediction in software projects. These metrics can be used in a clustering algorithm, which uses a distance measure to evaluate cluster distance. A software quality model is generally trained using software measurement and fault data obtained from a previous release or from similar projects [2]. Different metrics can be used to predict faults, such as McCabe's complexity, various code size measures, and Halstead complexity. Software with a larger number of faults is considered to be of poorer quality than software with few faults. The basic hypothesis of software quality prediction is that a module currently under development is likely to be fault prone if a module with similar product or process metrics in an earlier project (or release) was fault prone [3]. The traditional K-means clustering approach shows that the combination of requirement and static code metrics is a better predictor than requirement metrics or static code metrics alone [4]. Software quality depends upon software quality metrics, which can be quantitative or qualitative. Detecting software faults prior to system development may reduce software maintenance costs. Commonly used software quality estimation models classify program modules into one of two categories: fault prone and fault free. Early software fault prediction helps to improve software quality and to achieve high software reliability [5].

K-Canberra-means clustering uses the Canberra distance, i.e. the sum of a series of fractional differences between the coordinates of a pair of objects. This approach gives better and more accurate results than traditional K-means clustering [10]. This paper introduces a K-means clustering algorithm with a different distance function: the Sorensen distance (also known as the Bray-Curtis distance) is used with the K-means clustering algorithm, and the modified algorithm is referred to as the K-Sorensen-means clustering algorithm. Datasets have been collected from the NASA MDP (Metrics Data Program) [1]. Three faulty datasets are chosen: CM1 (a NASA spacecraft instrument), JM1 (a real-time ground system) and PC1 (an Earth satellite system). The datasets are analysed and refined. Results show that the combination of requirement and static code metrics is a better predictor than requirement metrics or static code metrics alone [4]. Therefore, the requirement and static code metrics are combined with the help of an inner join to form new metrics known as alliance metrics. The proposed approach is then applied to requirement metrics, static code metrics and alliance metrics for both distance functions. An ROC (receiver operating characteristic) curve is drawn to better assess prediction quality. Finally, the results of K-Sorensen-means clustering are collected and compared with those of K-Canberra-means clustering on the basis of the probability of detection and the probability of false alarms.

978-1-4244-9034-/10/$26.00 © 2010 IEEE
II. MEASUREMENT

A. Confusion Matrix

TABLE I: CONFUSION MATRIX

                            Experimental data
                            Fault       No Fault
Actual data    Fault        TP          FN
               No Fault     FP          TN
A table of confusion, also known as a confusion matrix, is a visualization tool that reports the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). TN represents fault-free modules correctly classified. FP refers to fault-free modules incorrectly labelled as faulty. FN corresponds to faulty modules incorrectly labelled as fault free. TP refers to modules correctly classified as faulty. The probability of detection (PD), probability of false alarms (PF), precision and accuracy have been used in this paper to evaluate the results [10]. The probability of detection (PD) [4] can be defined as the probability of correct classification of a module that has a fault.
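The measures above follow directly from the confusion-matrix counts. As a minimal sketch (the function name is ours; the example counts are the CM1 static-code values reported later in Table II):

```python
def prediction_measures(tp, fn, fp, tn):
    """Evaluation measures computed from confusion-matrix counts."""
    pd_ = tp / (tp + fn)                       # probability of detection
    pf_ = fp / (fp + tn)                       # probability of false alarms
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return pd_, pf_, precision, accuracy

# CM1 static-code counts from Table II: TP=42, FN=6, FP=232, TN=225
pd_, pf_, _, _ = prediction_measures(42, 6, 232, 225)
print(pd_, round(pf_, 5))  # 0.875 0.50766
```

These recovered values match the PD and PF entries published for that column.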
PD = TP / (TP + FN)    (1)

PF can be defined as the ratio of false positives to all non-defect modules:

PF = FP / (FP + TN)    (2)

The probability of false alarms should be low, which signifies the correct classification of modules as either faulty or fault free, whereas the probability of detection should be high, since then the chances of accurate classification of fault-prone modules are greater, which can increase software quality.

B. ROC curve

A receiver operating characteristic (ROC) curve is drawn by plotting the probability of detection (PD) against the probability of false alarms (PF). ROC curves can be helpful in assessing the accuracy of predictions. The ROC curve is divided into different regions, defined as follows. The risk incompatible region, with high PD and high PF, is beneficial for safety-critical systems, as identification of faults is more important than validating false alarms. The cost incompatible region has low PD and low PF; this region is beneficial for organizations with limited verification and validation budgets. The negative region, with low PD and high PF, is also preferred for some software projects. As PD decreases and PF increases, the probability that modules are classified incorrectly increases [3]. Generally the ROC curve has a concave shape that starts at the point (0, 0) and ends at (1, 1).

III. RELATED WORK

Clustering is a technique that divides data into two or more clusters according to some criterion. In this study the data are divided into two clusters depending on whether the modules are fault free or fault prone. A suitable algorithm is then chosen for clustering the software components into faulty/fault-free sets. Seliya N. and Khoshgoftaar T.M. investigated a semi-supervised learning approach for classifying data to improve software quality, rather than supervised or unsupervised learning only [2]. K-means is an iterative relocation clustering algorithm that partitions a dataset into K clusters using the standard Euclidean distance [8]. The K-Canberra-means clustering algorithm gives more accurate results than the traditional K-means clustering algorithm [8]. K-Canberra-means clustering starts by determining K, the number of clusters to be formed. Then K centroids are chosen randomly from the collected data. The distances of all data objects are calculated with the Canberra distance formula [8], and all objects are divided into K cluster groups by putting each object into the group whose centroid is at minimum distance. The cluster centroids are then recomputed. This process is repeated until the convergence criterion is met.
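The ROC regions described here can be expressed as a small helper. This is an illustrative sketch only: the function name and the 0.5 cut-off between "high" and "low" are our assumptions, not thresholds stated in the paper:

```python
def roc_region(pd_, pf_, cutoff=0.5):
    """Classify a (PD, PF) point into the ROC regions of Fig. 1.
    The 0.5 cutoff between 'high' and 'low' is an assumed value."""
    high_pd, high_pf = pd_ >= cutoff, pf_ >= cutoff
    if high_pd and high_pf:
        return "risk incompatible"      # useful for safety-critical systems
    if not high_pd and not high_pf:
        return "cost incompatible"      # useful for limited V&V budgets
    if not high_pd and high_pf:
        return "negative curve region"
    return "preferred (high PD, low PF)"
```

For example, the alliance-metrics point for CM1 under K-Sorensen-means (PD = 0.98795, PF = 0.96175) falls in the risk incompatible region.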
IV. DISTANCE FUNCTIONS

The distance measure is an important choice in clustering, as it determines how the similarity of two elements is calculated and thus influences the shape of the clusters. In this paper we introduce a K-means clustering algorithm with a new distance measure, the Sorensen distance.

C. Euclidean Distance

The Euclidean distance examines the root of the squared differences between the coordinates of a pair of objects. It is used as the standard or default distance for the K-means clustering algorithm.

Fig. 1. ROC curve, plotting PD (probability of detection) against PF (probability of false alarms) and showing the risk incompatible, cost incompatible and negative curve regions.
d(x, y) = sqrt( Σi (xi − yi)² )    (3)

D. Canberra Distance

The Canberra distance examines the sum of a series of fractional differences between the coordinates of a pair of objects. This distance is very sensitive to small changes when both coordinates are near zero [6].

d(x, y) = Σi |xi − yi| / (|xi| + |yi|)    (4)
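Equations (3) and (4) can be sketched directly. The function names are ours, and the Canberra implementation skips terms where both coordinates are zero, a common convention since the fraction is otherwise undefined:

```python
import math

def euclidean(x, y):
    # Eq. (3): root of the squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def canberra(x, y):
    # Eq. (4): sum of fractional differences; very sensitive when
    # both coordinates are near zero
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y)
               if abs(a) + abs(b) > 0)

print(euclidean((0, 0), (3, 4)))  # 5.0
```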
E. Sorensen Distance

The Sorensen distance is a normalization method that views the space as a grid, similar to the city block distance. The Sorensen distance has the nice property that if all coordinates are positive, its value lies between zero and one. The normalization is done by dividing the absolute difference by the summation:

d(x, y) = Σi |xi − yi| / Σi (xi + yi)    (5)

V. DATASETS

The NASA Metrics Data Program [1] repository datasets have been used in this approach to predict the quality of a software product. The MDP (Metrics Data Program) data repository is a database that stores problem data, product data and metrics data. The datasets used in this approach are CM1, JM1 and PC1; only these three projects in the MDP data repository include requirement metrics. For each project, the requirement metrics, static code metrics and alliance metrics are collected, refined and normalized. Requirement metrics, static code metrics and the combination of requirement and code metrics [2], named alliance metrics, have been used for modelling. JM1 is a real-time ground system that uses simulations to generate flight predictions; it has been developed in C++ and has 10,878 modules. PC1 is an Earth satellite system containing 1,107 modules, developed in C, and CM1 is a NASA spacecraft instrument, also developed in C, with 505 modules.

VI. PRESENT WORK

Fault prediction prior to software testing helps to improve software quality. Software quality can be measured with the help of software metrics; it measures how well the software is designed. To predict computation or accuracy faults in software projects, this paper uses a clustering approach similar to, but different from, K-Canberra-means clustering, referred to as K-Sorensen-means clustering. To apply this approach, three datasets, JM1, PC1 and CM1, are chosen out of the various datasets in the NASA MDP repository, as only these three projects contain requirement metrics. In the requirement metrics, 8 attributes are found to be useful, i.e. action, conditional, continuance, imperative, source and weak phrase. In the static code metrics, similarly, 22 attributes are useful, as the others do not affect the binary classification. We have joined the metrics available at the requirement stage and during coding to make a combined metric model named the alliance metrics model. A one-to-one inner join is applied to the requirement metrics and static code metrics on the basis of requirement_id. These metrics are then refined and normalized. The alliance metric model has already been shown in the literature to be the best prediction model [4].

F. K-Sorensen-means clustering algorithm

For a given set of observations (x1, x2, ..., xn), where each observation has d attributes, the K-Sorensen-means clustering algorithm attempts to partition the n observations into k sets (k < n), S = {S1, S2, ..., Sk}. It is a two-step iterative algorithm, where the first step is the assignment step and the second is the update step. First, the initial set of k means, m1(1), m2(1), ..., mk(1), is initialized to the first k observations.

Assignment: each observation is assigned to the cluster with the closest mean (measured with the Sorensen distance):

Si(t) = { xj : d(xj, mi(t)) ≤ d(xj, ml(t)) for all 1 ≤ l ≤ k }    (6)

Update: all means are recalculated as the centroids of the respective observations in each cluster:

mi(t+1) = (1 / |Si(t)|) Σ xj,  summed over xj ∈ Si(t)    (7)

The algorithm iterates between the above two steps (the iteration number is represented by t) and converges when the assignments no longer change. K-Sorensen-means is applied to the datasets and the results are collected. The PC1 dataset is used as training data, and JM1 and CM1 are used to test the model. The fault prediction model is first trained with a dataset without faulty data and then tested with the other datasets to check the accuracy and correctness of the model. The results of K-Sorensen-means and K-Canberra-means are compared, which shows that K-Sorensen-means is more accurate and likely to predict more faults than K-Canberra-means clustering. The corresponding ROC (receiver operating characteristic) curve has been drawn to show which fault prediction metric model lies in which region.
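The two-step procedure of Section F can be condensed into a short sketch. This is a simplified illustration, not the authors' MATLAB implementation: the function names are ours, the data are assumed to be normalized non-negative vectors (so the Sorensen denominator is positive), and the means are seeded with the first k observations as described above.

```python
def sorensen(x, y):
    # Eq. (5): absolute differences normalized by the sum of coordinate sums
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0

def k_sorensen_means(points, k, max_iter=100):
    """Two-step iterative algorithm of Section F: assignment (Eq. 6)
    and update (Eq. 7) alternate until assignments no longer change."""
    means = [list(p) for p in points[:k]]   # seed with first k observations
    assign = None
    for _ in range(max_iter):
        # Assignment step: nearest mean under the Sorensen distance
        new = [min(range(k), key=lambda i: sorensen(p, means[i]))
               for p in points]
        if new == assign:                   # converged: labels unchanged
            break
        assign = new
        # Update step: centroid of each non-empty cluster
        for i in range(k):
            members = [p for p, c in zip(points, assign) if c == i]
            if members:
                means[i] = [sum(col) / len(members) for col in zip(*members)]
    return assign, means

# Tiny example: two obvious groups of non-negative 2-D points
labels, _ = k_sorensen_means([(1, 1), (1.1, 0.9), (9, 9), (8.8, 9.2)], k=2)
```

On this toy input the two small points end up in one cluster and the two large points in the other, regardless of which cluster index each group receives.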
VII. RESULTS AND ANALYSIS

The proposed variant of K-means clustering, K-Sorensen-means clustering, has been implemented in MATLAB 2009a following the steps explained earlier. K-Sorensen-means has been used to predict fault-free and faulty modules. PC1 is taken as training data, while CM1 and JM1 are used as testing data.
TABLE II: RESULTS OF CM1 PROJECT USING K-CANBERRA-MEANS CLUSTERING

Evaluation measures   Requirement   Code      Join
TP                    6             42        56
TN                    17            225       102
FP                    3             232       81
FN                    63            6         27
PD                    0.08695       0.875     0.6747
PF                    0.15          0.50766   0.44262
Markers               A             B         C

TABLE V: RESULTS OF JM1 PROJECT USING K-SORENSEN-MEANS CLUSTERING

Evaluation measures   Requirement   Code      Join
TP                    13            1218      4
TN                    2             5526      1
FP                    18            3250      92
FN                    4             884       0
PD                    0.76471       0.57945   1
PF                    0.9           0.37033   0.98925
Markers               J             K         L
Fig. 3 depicts the results of the above tables diagrammatically. The ROC curve is a plot that compares the different points on the basis of the probability of detection and the probability of false alarms. The plotted points A, B, C, D, E and F are the results of the K-Canberra-means clustering algorithm; the remaining points depict the results of the K-Sorensen-means clustering algorithm. The basic hypothesis states that points with high PD and low PF are more accurate than others, since detecting faults without wrong classifications is what matters most for software fault prediction in project development.
Table II shows the results of the K-Canberra-means clustering algorithm with CM1 as testing data. Table III gives the results with PC1 as the training dataset and JM1 as the test dataset. Tables IV and V show the results of applying K-Sorensen-means clustering using CM1 and JM1 as testing data, respectively.

TABLE III: RESULTS OF JM1 PROJECT USING K-CANBERRA-MEANS CLUSTERING

Evaluation measures   Requirement   Code      Join
TP                    2             1302      1
TN                    17            5189      67
FP                    3             3587      26
FN                    15            800       3
PD                    0.11765       0.61941   0.25
PF                    0.15          0.40873   0.27957
Markers               D             E         F
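The PD and PF entries in these tables can be re-derived from the raw counts via Eqs. (1) and (2); for instance, recomputing the static-code column of Table III:

```python
# Static-code column of Table III (JM1, K-Canberra-means clustering)
tp, tn, fp, fn = 1302, 5189, 3587, 800
pd_ = tp / (tp + fn)   # Eq. (1): 1302 / 2102
pf_ = fp / (fp + tn)   # Eq. (2): 3587 / 8776
print(round(pd_, 5), round(pf_, 5))  # 0.61941 0.40873
```

The recomputed values agree with the published table entries.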
Fig. 3. Plotted ROC curve

The results show that with K-Canberra-means clustering the performance is good for the alliance metrics, but with K-Sorensen-means the results are more accurate and give better fault prediction, as shown in Fig. 3. The proposed model will thus cluster modules into fault free and fault prone more accurately than the K-Canberra-means clustering model, as it has a higher probability of detection (PD) and a lower probability of false alarms (PF). For greater accuracy, PD should be near 1 and PF near 0. For alliance metrics in K-Sorensen-means clustering, the PD values are 0.98795 and 1, which are high, and the PF
TABLE IV: RESULTS OF CM1 PROJECT USING K-SORENSEN-MEANS CLUSTERING

Evaluation measures   Requirement   Code      Join
TP                    37            40        82
TN                    8             254       7
FP                    12            203       176
FN                    32            8         1
PD                    0.53623       0.83333   0.98795
PF                    0.6           0.4442    0.96175
Markers               G             H         I
values are 0.96175 and 0.98925, as compared to the K-Canberra-means clustering PD values of 0.6747 and 0.25 and PF values of 0.44262 and 0.27957, respectively, for the CM1 and JM1 testing data. The markers A, D, G and J, for the requirement metrics, lie near the no-information region, which gives no useful information. Similarly, markers B, E, H and K, for the static code metrics, lie in a better region. But I and L, for the alliance metrics, are better predictors, as they lie near the risk incompatible and cost incompatible regions, in the high-PD, low-PF area. These result values are more accurate than C and F of K-Canberra-means clustering.

VIII. CONCLUSION

In this paper we analysed and compared the results of the K-Canberra-means clustering algorithm with those of the K-Sorensen-means clustering algorithm to predict fault-prone data at early stages of the lifecycle, combined with data available during coding. This can help project managers build projects with greater accuracy. K-Sorensen-means gives more accurate results than K-Canberra-means, which is also clear from the ROC curve, and thus it can be used to improve the quality and productivity of the product. These algorithms are used to achieve improved software accuracy and quality, thus improving the reliability of software projects. Even though the results are better than those of the algorithms in the literature, there is always room for improvement, so further investigation can be done by comparing other algorithms with different distance measures to find a better fault prediction model.

REFERENCES

[1] NASA IV&V Facility, Metrics Data Program. Available from http://MDP.ivv.nasa.gov/.
[2] Seliya N., Khoshgoftaar T.M. (2007), "Software quality with limited fault-proneness defect data: A semi-supervised learning perspective", published online, pp. 327-324.
[3] T. M. Khoshgoftaar, E. B. Allen, F. D. Ross, R. Munikoti, N. Goel, and A. Nandi, "Predicting Fault-Prone Modules with Case-Based Reasoning", Proc. of the Eighth International Symposium on Software Reliability Engineering (ISSRE'97), p. 27, Nov. 1997.
[4] Jiang Y., Cukic B. and Menzies T. (2007), "Fault Prediction Using Early Lifecycle Data", ISSRE 2007, the 18th IEEE Symposium on Software Reliability Engineering, IEEE Computer Society, Sweden, pp. 237-246.
[5] Kaur A., Sandhu S. Parvinder, Brar S. Amandeep (2009), "Early software fault prediction using real time defect data", 2009 Second International Conference on Machine Vision, pp. 243-245.
[6] Teknomo, Kardi. Similarity Measurement. Available from http://people.revoledu.com/kardi/tutorial/Similarity/
[7] G. Gan, C. Ma, J. Wu, "Data Clustering: Theory, Algorithms, and Applications", Society for Industrial and Applied Mathematics, Philadelphia, 2007.
[8] Deepinder Kaur, Arashdeep Kaur, "Fault Prediction using K-Canberra-Means Clustering", CNC 2010 [in press].
[9] Bray J. R., Curtis J. T., 1957. "An ordination of the upland forest communities of southern Wisconsin". Ecological Monographs, 27, 325-349.
[10] Jiang Y., Cukic B., Menzies T., "Cost curve evaluation of fault prediction models", Proceedings of the 2008 19th International Symposium on Software Reliability Engineering, 2008, pp. 197-206.
[11] Kaur Arashdeep, Brar S. Amanpreet & Sandhu P., "An Empirical Approach for Software Fault Prediction", ICIIS 2010 [in press].