Statistical Bug Localization by Supervised Clustering of Program Predicates

Farid Feyzi*, Saeed Parsa
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected],
[email protected]
Farid Feyzi received his M.S. degree in Software Engineering from Sharif University of Technology in 2012 and his Ph.D. in Software Engineering from Iran University of Science and Technology in 2018. His research focuses on developing statistical algorithms to improve software quality, with an emphasis on statistical fault localization and automated test data generation.
Saeed Parsa received his B.Sc. in Mathematics and Computer Science from Sharif University of Technology, Iran, and his M.Sc. and Ph.D. degrees in Computer Science from the University of Salford, England. He is an associate professor of Computer Science at the Iran University of Science and Technology. His research interests include software engineering, software testing and debugging, and algorithms.
Abstract

Since the majority of faults may be revealed through the joint effect of program predicates on one another, a new method for localizing complex bugs in programs is proposed in this article. The presented approach attempts to identify and select groups of interdependent predicates which altogether may affect the program failure. To find these groups, we suggest the use of a supervised algorithm based on penalized logistic regression analysis. To provide the failure context, faulty sub-paths are recognized as sequences of fault-relevant predicates. Estimating the grouping effect of program predicates on the failure helps programmers in the multiple-bug setting. Several case studies have been designed to evaluate the proposed approach on well-known test suites. The evaluations show that our method produces more precise results than prior fault localization techniques.

Keywords: Fault Localization, Statistical Debugging, Supervised Clustering, Grouping Effect
1. Introduction

In manual debugging, programmers attempt to find buggy statements by inspecting the execution status of the program at different points. The tasks developers frequently perform while debugging are described in [1]. Developers first determine the parts of the program involved in the failure, such as execution paths. They use their knowledge of the program to make a conjecture about potentially faulty statements. They set breakpoints at the potentially faulty statements and suspend the program execution to verify the program state. These tasks are difficult and time-consuming and can be sped up considerably by automated support for narrowing down the search domain of faulty statements. Fault localization is the task, in software debugging, of automatically identifying potentially faulty program statements. Bugs exist in most released software, and this necessitates fault localization to make software more robust and reliable. Due to the intricacy and inaccuracy of manual fault localization, a great amount of research has been carried out to develop automated techniques and tools that assist developers in finding bugs [2-9, 21-23]. Most of these approaches are coverage-based: they compute the suspiciousness of each component using the coverage of program executions. They assume that program spectra can approximate fault causality [10], and often rank program elements according to their presence (as well as absence) in failing and passing executions. The stronger the correlation between a program element and the presence (or absence) of the observed failures, the larger the degree of suspiciousness assigned to that element.

Despite the fact that the majority of program faults involve complex interactions between many program predicates and are revealed as a mutual effect of program predicates on each other, existing fault localization techniques disregard the intrinsic interdependent structure among program predicates when selecting the most failure-relevant statements. Therefore, they cannot detect specific bugs caused by undesired interactions between predicates, because they only consider predicates in isolation [8][9]. An important task, then, is to reveal groups of predicates which act together and collectively cause the program to fail. To tackle the search for groups of interdependent predicates, unsupervised clustering algorithms have been applied in previous fault localization studies [3]. All these methods cluster predicates according to unsupervised similarity measures computed from the predicate values, without regard to the variation of the program termination state. Our approach differs from these techniques in that its primary goal is to reveal predicate groups that are strongly correlated with program failure, rather than clusters made up of merely correlated predicates. Hence, we suggest supervised algorithms that group predicates by incorporating information from the program termination state. In this article, inspired by [11], we employ a strategy for supervised grouping that combines predicate selection and predicate grouping in a single step and is based on sequentially improving an empirical objective function that measures the groups' strength for explaining the program failure. The objectives of this study are listed below:

1- Achieving a high capability in finding bugs. Our studies show that a key point for achieving more precise results is to consider the combinatorial effect of predicates on the program termination status.

2- Finding multiple bugs in programs. The proposed method puts highly correlated predicates into a single cluster, in a supervised manner. Therefore, the faulty predicates corresponding to each individual fault are placed in a cluster separate from the clusters containing the faulty predicates of the other faults in the program.
The remainder of this paper is organized as follows. Section 2 presents related work on software fault localization. The proposed method is detailed in Section 3. The experiments and results are shown in Section 4. Finally, concluding remarks are presented in Section 5.
2. Related Works

A program spectrum details the execution information of a program from certain perspectives and can be used to track program behavior. Spectrum-based fault localization (SBFL) techniques use the program spectrum to indicate the entities most likely to be faulty. In this section, we briefly review previous work related to fault localization in general.

Recently, Feyzi et al. [9] proposed an approach called Inforence to isolate the suspicious code that likely contains faults. Inforence employs a feature selection method, based on mutual information, to identify those bug-related statements that may cause the program to fail. Inforence tries to identify and select groups of interdependent statements which altogether may affect the program failure. Le et al. [25] proposed a method named Savant that employs a learning-to-rank strategy, using likely invariants and suspiciousness scores as features, to rank methods based on their likelihood of being a root cause of a failure. In [24], Zhang et al. presented a lightweight technique that boosts spectrum-based fault localization by differentiating tests using the PageRank algorithm. Given the original program spectrum information, they used PageRank to recompute the spectrum information by considering the contributions of different tests. The recomputed spectrum information is then employed by traditional SBFL techniques to achieve more effective fault localization. Hong et al. [26] presented a method called MUSEUM, which localizes bugs in complex real-world multilingual programs in a language-agnostic manner through mutation analysis. They showed that the accuracy of fault localization for multilingual programs can be increased by adding new mutation operators relevant to FFI constraints. In [7], Feyzi et al. proposed a new combinatorial approach to fault localization, called FPA-FL. It considers both the dynamic and static attributes of programs to rank the statements according to their effects on the program termination state. They attempted to alleviate the bias introduced by the test data by taking into account the static structure of a program while modeling its dynamic behavior. The Ochiai similarity coefficient [4] has also been used in the context of software fault localization; it is based on the same heuristic as Tarantula [18]. According to the data reported in [4], Ochiai is more effective for fault localization than Tarantula. Wong et al. [5] propose a method called D-Star that modifies a similarity-based coefficient (the Kulczynski coefficient) to localize bugs better than Ochiai and 12 other similarity-based coefficients. Since it cannot be proved theoretically that other values of its exponent are always more effective for fault localization, without loss of generality, the exponent of D-Star is set to 2 for simplified calculation in our empirical study. In [6], Naish et al. propose two techniques, O and O^P. The technique O is designed for programs with a single bug, while O^P is better applied to programs with multiple bugs. Data from their experiments suggest that O and O^P are more effective than Tarantula and Ochiai for single-bug programs. A fault localization technique based on crosstab (cross-tabulation) analysis, which is used to study the relationship between two or more categorical variables, is proposed in [19]. Crosstabs constructed from two column-wise categorical variables (covered and not covered) and two row-wise categorical variables (passing and failing) are used to analyze the degree of association between the failing (or passing) execution result and the coverage of each statement, under the null hypothesis that they are independent. The degree of association is then used to determine the suspiciousness value for each statement. A unique aspect of this technique is that a well-defined statistical analysis is applied to locating faults in software. The impact of each additional failing (or passing) test case on locating program faults is investigated in [7]. The conclusion is that the contributions of the failing test cases diminish stepwise: the first failing test contributes more than or as much as the second failing test, which in turn contributes more than or as much as the third. This conclusion also holds for passing tests. The major difference between H3b and H3c and other techniques is that they emphasize that the contribution of each additional failing and passing test to the suspiciousness of each statement should be different, whereas the others assume an equal contribution for every failing and passing test. Comparisons among different SBFL techniques are frequently discussed in recent studies [5-6]. However, no technique has been shown to outperform all others under every scenario.
3. Proposed Method Overview

3.1 Preprocessing stage

In the preprocessing stage, the program to be debugged is executed several times to capture its runtime data. The runtime data contain the predicate values as well as the failing or passing termination status of the program. Given a program P together with a set of n test cases, tc = {tc_1, tc_2, …, tc_n}, the program is instrumented and then executed n times. Each test case tc_j includes a set of input parameters and the expected output, o_j, of P. If o_j is different from the actual output generated when P is executed with test case tc_j, it is labeled as a failing test case, tc_j^fail; otherwise, it is labeled as a passing test case, tc_j^pass. In this way, tc is split into two sets of failing and passing test cases, represented as tc^fail and tc^pass, respectively.

To determine the impact of each program predicate p_i on the program termination status, a piece of code is inserted before p_i to capture its value during the execution of the program. The process of inserting code into a program to capture its runtime behavior is called instrumentation [12]. Here, we consider two predicates for each simple Boolean expression appearing as a sub-clause condition affecting program control flow. For instance, for the condition (a > b) && (c >= 10), four predicates p_1 = (a > b), p_2 = !(a > b), p_3 = (c >= 10), and p_4 = !(c >= 10) are designed. Also, to keep track of function calls, one predicate, p_j^caller, is inserted before each call statement and one predicate, p_k^callee, is inserted at the beginning of each function. For each numerical value r returned by a function f, three predicates p_1^f = (r < 0), p_2^f = (r == 0), and p_3^f = (r > 0) are designed. Note that instrumentation does not alter the program logic: the inserted code changes neither the values of the predicates nor the program variables.
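To make the instrumentation concrete, the following is a minimal Python sketch, for illustration only; the recorder class and its method names are hypothetical stand-ins for the actual instrumentation tooling, not the authors' implementation. It counts how often the branch and return-value predicates described above are observed true during a single run:

```python
# Illustrative sketch of predicate counting for one execution; the class
# and method names are hypothetical, not the paper's actual tooling.
from collections import defaultdict

class PredicateRecorder:
    def __init__(self):
        self.counts = defaultdict(int)  # predicate name -> # times seen true

    def branch(self, name, value):
        """Record a simple Boolean sub-clause and its negation; returns the
        value so the call can wrap the original condition unchanged."""
        self.counts[name] += bool(value)
        self.counts["!" + name] += (not value)
        return value

    def returned(self, fname, r):
        """Record the three sign predicates for a numeric return value r."""
        self.counts[f"{fname}: r<0"] += (r < 0)
        self.counts[f"{fname}: r==0"] += (r == 0)
        self.counts[f"{fname}: r>0"] += (r > 0)
        return r

rec = PredicateRecorder()
a, b, c = 3, 1, 12
# Instrumented form of (a > b) && (c >= 10); like C's &&, Python's `and`
# short-circuits, so the second sub-clause is only recorded when reached.
if rec.branch("a>b", a > b) and rec.branch("c>=10", c >= 10):
    pass
print(dict(rec.counts))  # one row d_i of the data matrix for this run
```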
Executing an instrumented program P with the test data tc^fail and tc^pass results in a matrix, the so-called data matrix. Each row of the data matrix D is defined as (d_i, Y_i), i = 1, …, #Runs, where d_i = (d_i1, d_i2, …, d_im) is a vector containing the values of predicates p_1 to p_m, Y_i is the failing or passing termination status of the ith run, and #Runs is the number of executions. Therefore, d_ij is the number of times predicate p_j has been evaluated as true in the ith execution of P. Since the runtime behavior of a program generally depends on the input data and each test case is independent of the others, the predicate values corresponding to different test cases are statistically independent. For simplicity, the data matrix is standardized and converted into a normalized matrix X, called the predicate matrix. To normalize the data matrix D, each element d_ij is converted into x_ij ∈ X by centering it at the column mean ave(x_j) and scaling by the column standard deviation sdev(x_j):

x_ij ← (x_ij − ave(x_j)) / sdev(x_j),   for i = 1, …, n; j = 1, …, m
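As a small worked example, the normalization above is an ordinary column-wise z-score; a NumPy sketch with made-up counts:

```python
# Column-wise standardization of the data matrix D into the predicate
# matrix X (toy values; in practice, constant predicate columns with zero
# standard deviation would need to be handled or dropped).
import numpy as np

D = np.array([[2.0, 0.0, 5.0],    # rows: executions; columns: predicates
              [0.0, 1.0, 3.0],
              [4.0, 1.0, 7.0]])
X = (D - D.mean(axis=0)) / D.std(axis=0)  # x_ij = (d_ij - ave(d_j)) / sdev(d_j)
print(X.round(2))
```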
3.2 Probabilistic Model

Considering the fact that only a few predicate subsets determine nearly all of the variation of the program termination state, we model the conditional probability by

P[y = 1 | X] = f(X̃),   with X̃ = (x̃_1, x̃_2, …, x̃_q)   (1)

where f(·) is an unknown nonlinear function and the x̃_j are 'representative' values for q ≪ p unknown predicate groups ℊ_1, …, ℊ_q. We use the centroid as the representative group value:

x̃_j = (1 / |ℊ_j|) Σ_{p ∈ ℊ_j} x_p   (2)
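A short sketch of Eq. (2): each representative x̃_j is just the mean of the member predicate columns. The group memberships below are made up for illustration:

```python
# Group representatives as centroids (Eq. 2): x~_j is the mean of the
# standardized predicate columns belonging to group g_j.
import numpy as np

X = np.random.RandomState(0).randn(8, 6)  # toy predicate matrix: 8 runs, 6 predicates
groups = [[0, 2], [1, 4, 5]]              # q = 2 hypothetical predicate groups
X_tilde = np.column_stack([X[:, g].mean(axis=1) for g in groups])
print(X_tilde.shape)                      # (8, 2): one representative per group
```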
As mentioned before, we use the supervised clustering method proposed in [11]. The approach is based on a strategy that proceeds in a "cautious" forward way. We start from scratch and grow the groups incrementally, adding one predicate after another. Regularly recurring cleaning steps remove spurious predicates that were incorrectly added to the groups at earlier stages. We repeat the growth and pruning of a single group until it stabilizes and cannot be improved any further. Once a group is terminated, a new group is started and the composition of the former groups is left unchanged, although they can still affect the construction of the new group. All these grouping operations are based on an empirical objective function S, which measures the strength of the predicate groups for explaining the program termination status. We employ the ℓ2-penalized negative log-likelihood function:
S = −Σ_{i=1}^{n} [ y_i log Pr(y_i = 1 | x̃_i, β) + (1 − y_i) log Pr(y_i = 0 | x̃_i, β) ] + (nλ/2) β^T P β   (3)
based on the estimated conditional class probabilities Pr(y = 1 | X̃, β) from penalized logistic regression analysis. Here β is the parameter vector, λ is a tuning parameter that controls the penalization, and P is a penalty matrix. The binomial log-likelihood is an attractive choice as a grouping criterion, since it is the 'natural' goodness-of-fit measure for dichotomous problems. By computing the grouping criterion directly from multiple groups instead of single groups only, we obtain better interacting predicate groups that explain the program termination status as an ensemble. Technical issues concerning penalized logistic regression and full details about the grouping procedure are given in the next two sections.
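The criterion in Eq. (3) can be written down directly; the following is a minimal sketch that assumes an identity penalty matrix P, which may differ from the choice used in the paper:

```python
# A sketch of the grouping criterion S from Eq. (3): binomial negative
# log-likelihood plus the l2 penalty (n*lambda/2) * beta^T P beta.
# P defaults to the identity here, an assumption for illustration.
import numpy as np

def criterion_S(X_tilde, y, beta, lam, P=None):
    """X_tilde: (n, q) group representatives; y: 0/1 termination statuses;
    beta: (q+1,) coefficients including the intercept; lam: tuning parameter."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Z = np.column_stack([np.ones(n), X_tilde])  # prepend intercept column
    prob = 1.0 / (1.0 + np.exp(-Z @ beta))      # Pr(y_i = 1 | x~_i, beta)
    prob = np.clip(prob, 1e-12, 1 - 1e-12)      # guard the logarithms
    nll = -np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob))
    P = np.eye(len(beta)) if P is None else P
    return nll + 0.5 * n * lam * beta @ P @ beta
```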
3.3 Penalized Logistic Regression Analysis

In large programs, the number of predicates is much larger than the number of executions. This is known as the p ≫ n problem. Eilers et al. [13] as well as Zhu and Hastie [14] focus on the computational issues that arise from this high-dimensionality phenomenon and report improved results compared to non-penalized logistic regression. We use the penalized version as the estimator in conjunction with our grouping procedure.

6-(a)- IF not improved, i.e. S^{p*} > S^{old}: Do not accept the predicate, terminate the group, and continue with groups ℊ_1, …, ℊ_q and their averages. If q < q_final, increment q and return to step 3 to start a new group.
6-(b)- IF improved, i.e. S^{p*} < S^{old}: Accept the predicate and update the group, group average, and criterion value:

ℊ_q ← ℊ_q ∪ {p*},   x̃_q ← (|ℊ_q| · x̃_q + x_{p*}) / (|ℊ_q| + 1),   S^{old} ← S^{p*}   (8)
7- For each predicate p = 1, …, m̃ in group ℊ_q repeat: Leave groups ℊ_1, …, ℊ_{q−1} unchanged and build the temporary candidate group ℊ_q^p by excluding predicate p from group ℊ_q. Update the group average:

x̃_q^p = (1 / (|ℊ_q| − 1)) Σ_{p′ ∈ ℊ_q \ {p}} x_{p′}   (9)

Fit penalized logistic regression with predictor x̃^p = (1, x̃_1, …, x̃_q^p) to obtain the parameter vector β^p and the conditional probabilities p_λ^p(x̃^p). Compute the penalized negative log-likelihood S^p as in (5).

8- Identify the predicate p* = argmin_p S^p, whose exclusion minimizes the grouping criterion, and compare it to S^{old}.
9-(a)- IF not improved, i.e. S^{p*} > S^{old}: Do not delete the predicate; continue with groups ℊ_1, …, ℊ_q (note that ℊ_q was augmented in step 6) and their averages. Try to add another predicate by restarting at step 4.

9-(b)- IF improved, i.e. S^{p*} < S^{old}: Exclude predicate p* and update the group, group average, and criterion value:

ℊ_q ← ℊ_q \ {p*},   x̃_q ← x̃_q^{p*},   S^{old} ← S^{p*}   (10)
Now try to add another predicate by restarting at step 4.

In summary, the supervised algorithm is a one-step procedure for predicate selection, predicate grouping, and formation of new predictor variables by averaging the predicate values within a group. Predicate selection and grouping are done with a stepwise forward search, where we try all predicates and augment the group with the predicate that optimizes the criterion S from (5). After each forward search, we continue with a backward pruning step to root out predicates that were wrongly added to the group at earlier forward stages. Again, we try all predicates and decide on removal by optimizing the criterion S. Our grouping procedure is supervised, since all decisions are based on optimizing the criterion S, which measures the ability of the groups to explain the program termination status. The number of groups q_final can be set according to previous knowledge, or it can be chosen data-adaptively by cross-validation. The crucial difference between our algorithm and most other clustering algorithms is that we do not augment (or shorten) the cluster with the predicate that fits best (or least) into the current cluster in terms of an unsupervised similarity measure; instead, our strategy for supervised clustering adds (or removes) the predicate that most improves the differential expression of the current cluster. Thus, our clustering criterion S is a (possibly penalized) goodness-of-fit measure, used to find groups of predicates that separate the two classes of executions, pass and fail, as accurately as possible.
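Pulling steps 6 through 9 together, the grow-and-prune loop can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the authors' code: fit_and_score stands in for the penalized logistic regression fit of the criterion S (here via scikit-learn with an identity penalty matrix), and groups are kept disjoint, which is an assumption:

```python
# Forward-growth / backward-pruning supervised clustering, sketched after
# the description above. Hypothetical helper names; identity penalty matrix.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_and_score(X, y, groups, lam=1.0):
    """Fit l2-penalized logistic regression on group centroids; return S."""
    y = np.asarray(y, dtype=float)
    X_tilde = np.column_stack([X[:, g].mean(axis=1) for g in groups])
    clf = LogisticRegression(C=1.0 / lam).fit(X_tilde, y)
    p = np.clip(clf.predict_proba(X_tilde)[:, 1], 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + 0.5 * lam * float(np.sum(clf.coef_ ** 2))

def supervised_clustering(X, y, q_final=2, lam=1.0):
    """Grow q_final predicate groups, always optimizing the criterion S."""
    groups = []
    for _ in range(q_final):
        group, s_old = [], np.inf
        while True:
            used = {p for g in groups for p in g} | set(group)
            cands = [p for p in range(X.shape[1]) if p not in used]
            if not cands:
                break
            # forward step: try every remaining predicate
            scores = [fit_and_score(X, y, groups + [group + [p]], lam)
                      for p in cands]
            k = int(np.argmin(scores))
            if scores[k] >= s_old:            # 6-(a): no improvement, stop
                break
            group.append(cands[k]); s_old = scores[k]       # 6-(b): accept
            # backward pruning: try excluding each member (steps 7-9)
            while len(group) > 1:
                scores = [fit_and_score(
                              X, y, groups + [[m for m in group if m != p]], lam)
                          for p in group]
                k = int(np.argmin(scores))
                if scores[k] >= s_old:        # 9-(a): keep all members
                    break
                s_old = scores[k]; group.pop(k)             # 9-(b): remove
        groups.append(group)
    return groups
```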
4. Experimental Results

In this section, the effectiveness of the proposed technique is empirically evaluated. To this end, we compare the proposed method with several well-known SBFL techniques in the context of software fault localization. All experiments in this section are carried out on a 2.66 GHz Intel Core 2 Quad PC with 6 GB RAM, running Ubuntu 9.10 Desktop (i386). To perform clustering, we applied the WEKA machine learning and data mining tool [15], an open-source package that implements clustering algorithms. The penalized logistic regression model is constructed using the glmnet package [16].
4.1 Subject Programs

We have used the Siemens suite, Gzip, Grep, Sed, Make, and Space as our subject programs. The Siemens suite contains seven subject programs with 132 faulty versions, where each faulty version contains one manually seeded fault. Version 1.1.2 of the Gzip program was downloaded from the Software-artifact Infrastructure Repository (SIR). Also obtained from SIR were version 1.2 of Grep, version 3.76.1 of Make, and version 2.0 of Space. A brief description of the test suites is presented in Table 1.

Table 1. A brief description of the test suites used to evaluate the performance of the proposed method

Program        | Faulty versions (seeded/real faults) | # of lines | # of test cases
Print tokens   | 7 (all seeded)        | 472  | 4056
Print tokens2  | 10 (all seeded)       | 399  | 4071
Replace        | 32 (all seeded)       | 512  | 5542
Schedule       | 9 (all seeded)        | 292  | 2650
Schedule2      | 10 (all seeded)       | 301  | 2680
Tcas           | 41 (all seeded)       | 141  | 1578
Tot info       | 23 (all seeded)       | 440  | 1054
Gzip           | 55 (all seeded)       | 6K   | 217
Grep           | 17 (all seeded)       | 12K  | 809
Sed            | 17 (real and seeded)  | 12K  | 370
Make           | 31 (all seeded)       | 20K  | 793
Space          | 38 (real and seeded)  | 9K   | 13585
Three evaluation metrics are used in this paper: 1) the EXAM score, the percentage of statements that must be examined before the first faulty statement is reached; 2) the average number of statements examined, i.e., the average number of statements that must be examined with respect to a faulty version (of a subject program) to find the bug; and 3) the number of defective statements that appear in the top-5, top-10, and top-200 statements. We note that, for all techniques discussed herein, the same suspiciousness value may be assigned to multiple statements, which are therefore tied at the same position in the ranking. Hence, the results are provided at two levels of effectiveness, the "best" and the "worst": for the "best" effectiveness we assume the faulty statement is examined first among its ties, and for the "worst" effectiveness we assume it is examined last.
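For clarity, the EXAM score with its best/worst tie handling can be computed as in the following sketch (the statement identifiers are hypothetical):

```python
# EXAM score sketch: statements with the same suspiciousness are tied, and
# the faulty statement is examined first (best) or last (worst) among ties.
def exam_scores(susp, faulty):
    """susp: {statement: suspiciousness}; faulty: a faulty statement id.
    Returns (best %, worst %) of statements examined to reach the fault."""
    n = len(susp)
    s_f = susp[faulty]
    higher = sum(1 for s in susp.values() if s > s_f)
    tied = sum(1 for s in susp.values() if s == s_f)  # includes the fault
    best = (higher + 1) / n * 100
    worst = (higher + tied) / n * 100
    return best, worst

print(exam_scores({"s1": 0.9, "s2": 0.7, "s3": 0.7, "s4": 0.1}, "s3"))
# -> (50.0, 75.0): s3 is tied with s2 behind s1
```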
4.2 Case Study 1: finding the location of single bugs

In this study, we evaluate the performance of the proposed method in finding the location of single bugs and compare it to state-of-the-art fault localization techniques. In this regard, six distinguished fault localization techniques, namely Ochiai [4], D-Star [5], O [6], H3b [7], Learn-to-Rank [25], and Page-Rank [24], are compared to the proposed method in this experiment. Tables 2 and 3 present the average number of statements that need to be examined by each fault localization technique across each of the subject programs, for the best and worst cases, respectively. For example, the average number of statements examined by our proposed method with respect to all faulty versions of Gzip is 51.2 in the best case and 116.52 in the worst. Again, with respect to the Gzip program, we find that the second best technique is O, which can locate all the faults by requiring the examination of no more than 59.78 statements in the best case and 131.52 in the worst. For D-Star, the best is 60.82 and the worst 134.59. Similar observations apply to the other programs. In the case of Space, the proposed method detects the faults in the worst case by examining 69.41 statements on average, whereas O requires 72.89 statements.

Table 2. Average number of statements examined with respect to each faulty version (best case)

Technique        | Siemens | Gzip  | Grep   | Sed    | Space | Make
Proposed Method  | 12.21   | 51.2  | 122.94 | 62.32  | 31.9  | 145.84
D-Star           | 14.66   | 60.82 | 148.21 | 79.38  | 43.28 | 271.62
H3c              | 13.21   | 76.48 | 145.63 | 122.10 | 57.95 | 246.79
Ochiai           | 15.57   | 68.10 | 164.50 | 81.45  | 44.25 | 275.11
O                | 11.21   | 59.78 | 135.06 | 74.45  | 37.12 | 189.36
Learn-to-Rank    | 11.89   | 58.40 | 133.80 | 60.86  | 42.63 | 159.40
Page-Rank        | 10.92   | 62.71 | 138.76 | 64.25  | 35.97 | 168.26

Table 3. Average number of statements examined with respect to each faulty version (worst case)

Technique        | Siemens | Gzip   | Grep   | Sed    | Space | Make
Proposed Method  | 21.85   | 116.52 | 209.25 | 166.26 | 69.41 | 278.36
D-Star           | 24.69   | 134.59 | 225.16 | 191.51 | 82.81 | 452.78
H3c              | 29.46   | 147.23 | 236.19 | 228.70 | 92.56 | 425.71
Ochiai           | 27.62   | 136.90 | 255.06 | 198.05 | 74.89 | 454.04
O                | 31.72   | 131.52 | 215.63 | 183.05 | 72.89 | 368.29
Learn-to-Rank    | 22.35   | 119.74 | 218.47 | 184.52 | 81.70 | 311.87
Page-Rank        | 25.81   | 124.63 | 226.10 | 177.36 | 72.65 | 299.60
We now present the evaluation of the proposed method with respect to the EXAM score. Due to space limitations, we only provide figures for three programs: the Siemens suite, Gzip, and Sed. Figures for the other three programs (Grep, Make, and Space) are not included here; however, the conclusions drawn for the first three also apply to the remaining three. Figure 1 illustrates the EXAM score of the proposed method and the other peer techniques on the Siemens suite, Gzip, and Sed using six subplots. The x-axis represents the percentage of code (statements) examined, while the y-axis represents the number of faulty versions whose faults are located by examining an amount of code less than or equal to the corresponding value on the x-axis. For example, referring to part (a) of Figure 1, we find that by examining 10% of the code our method can locate 78% of the faults in the Siemens suite in the best case and 64% in the worst, whereas D-Star achieves 68% (best) and 52% (worst). In the case of the Gzip program, we observe that the effectiveness of our method is in general better than that of the peer techniques in both the best and worst cases.
Figure 1. EXAM score-based comparison on the subject programs: (a) Siemens suite, best case; (b) Siemens suite, worst case; (c) Gzip, best case; (d) Gzip, worst case; (e) Sed, best case; (f) Sed, worst case.

Table 4 compares our technique to the best SBFL techniques in terms of how often they report defective statements within the top-5, top-10, or top-200 statements. This is relevant because a recent study [27] showed that 98% of practitioners consider a fault localization technique useful only if it reports the defective statement(s) within the top-10 of the suspiciousness ranking. Another analysis [28] shows that automatic program repair systems perform best when they consider only the top-200 suspicious statements.

Table 4. Percentage of failures whose faulty statements appear within the top-5, top-10, and top-200 of the techniques' suspiciousness ranking

                 |   Best case scenario    |   Worst case scenario   |  Average case scenario
Technique        | Top-5 | Top-10 | Top-200 | Top-5 | Top-10 | Top-200 | Top-5 | Top-10 | Top-200
Proposed Method  | 35%   | 42%    | 91%     | 24%   | 27%    | 65%     | 26%   | 32%    | 76%
D-Star           | 30%   | 39%    | 82%     | 17%   | 23%    | 57%     | 18%   | 26%    | 69%
Ochiai           | 26%   | 39%    | 80%     | 16%   | 21%    | 55%     | 17%   | 23%    | 71%
H3               | 26%   | 36%    | 78%     | 16%   | 22%    | 53%     | 15%   | 25%    | 69%
O                | 28%   | 40%    | 84%     | 17%   | 22%    | 61%     | 18%   | 25%    | 72%
Learn-to-Rank    | 32%   | 40%    | 87%     | 19%   | 23%    | 64%     | 22%   | 29%    | 73%
Page-Rank        | 33%   | 39%    | 85%     | 20%   | 22%    | 65%     | 23%   | 27%    | 73%
The results reveal that our proposed method performs much better in terms of including the correct answer (the actual faulty statement) within the top-5 or top-10 statements of its output.
4.3 Case Study 2: finding the location of multiple bugs

Programs that contain multiple faults present new challenges for fault localization. Most published fault localization techniques target the problem of localizing a single fault in a program that contains only that fault; such single-bug algorithms are not efficient for finding multiple bugs. In this case study, the effectiveness of the proposed method is evaluated on programs with multiple bugs. A multiple-fault version is a faulty version X with k faults that is made by combining k faulty versions from a set {x_1, x_2, …, x_k}, where each bug i in X corresponds to the faulty version x_i. In practice, developers are aware of the number of failed test cases for their programs but not of whether a single fault or many faults caused those failures. Thus, developers usually target one fault at a time while debugging. Since there is more than one way to create a multi-bug version, using only one may lead to a biased conclusion. To overcome this problem, 30 distinct faulty versions with 2, 3, and 4 bugs were created for each of Gzip, Grep, and Sed. Also, a total of 120 multi-fault programs were created from combinations of the single-fault programs of the Siemens suite, ranging from faulty versions with 2 faults to versions with 5 faults. For programs with multiple faults, the authors of [17] define an evaluation metric, Expense, corresponding to the percentage of code that must be examined to locate the first fault, arguing that this is the fault that programmers will begin to fix. We note that the Expense score, though defined in a multi-fault setting, is the same as the EXAM score used in this paper. The Expense score is really part of a bigger process that involves locating and fixing all faults (that result in at least one test case failure) residing in the subject program. After the first fault has been located and fixed, the next step is to re-run the test cases to detect subsequent failures, whereupon the next fault is located and fixed. The process continues until failures are no longer observed, and we conclude (but are not guaranteed) that no more faults are present in the program. This process is referred to
as the one-fault-at-a-time approach; thus, the Expense score only assesses the fault localization effectiveness of the first iteration of the process. Table 5 gives the total number of statements that need to be examined to find the first fault across all 120 multi-fault versions of the Siemens suite. We observe that, in both the best and worst cases, our method is the most effective.

Table 5. Total number of statements examined for the 120 multi-fault versions of the Siemens suite

Technique        | Best case | Worst case
Proposed Method  | 1351      | 2214
D-Star           | 1491      | 2398
H3c              | 4620      | 5424
Learn-to-Rank    | 1589      | 2355
Ochiai           | 1561      | 2416
O                | 1438      | 3951
O^P              | 1899      | 2655
Page-Rank        | 2168      | 2588
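The one-fault-at-a-time process described above can be summarized as a simple loop; the sketch below is a protocol outline in which run_tests, localize, examine, and fix are hypothetical stand-ins for real tooling, not a concrete implementation:

```python
# Protocol-level sketch of one-fault-at-a-time debugging: localize the
# first fault, fix it, re-run the tests, and repeat until no failures remain.
def debug_one_fault_at_a_time(program, tests, run_tests, localize, examine, fix):
    """Returns the Expense score recorded at each iteration."""
    expenses = []
    while True:
        failures = run_tests(program, tests)
        if not failures:                     # no failures observed: stop
            break
        ranking = localize(program, tests, failures)  # suspiciousness ranking
        fault, expense = examine(ranking)    # % of code inspected to 1st fault
        expenses.append(expense)
        program = fix(program, fault)        # repair and iterate
    return expenses
```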
Tables 6 and 7 give the average number of statements that need to be examined to find the first bug, for the best and worst cases, respectively. For example, the average number of statements examined by our method for the three-bug version of Gzip is 54.26 in the best case and 84.48 in the worst, whereas 112.56 (best) and 155.25 (worst) statements need to be examined by Ochiai, and 108.51 (best) and 145.42 (worst) by D-Star.

Table 6. Average number of statements examined to locate the first bug (best case)

                 |          Gzip            |           Grep             |           Sed
Technique        | 2-bug  | 3-bug  | 4-bug  | 2-bug  | 3-bug  | 4-bug    | 2-bug  | 3-bug  | 4-bug
Proposed Method  | 58.41  | 54.26  | 44.51  | 121.25 | 151.32 | 225.16   | 71.51  | 64.42  | 87.69
D-Star           | 131.68 | 108.51 | 109.62 | 145.66 | 209.56 | 291.82   | 92.51  | 99.85  | 172.66
H3b              | 81.52  | 89.21  | 78.6   | 221.65 | 257.48 | 299.55   | 198.21 | 246.59 | 93.30
Ochiai           | 135.88 | 112.56 | 118.75 | 169.76 | 228.56 | 271.73   | 81.65  | 79.45  | 276.96
O                | 75.12  | 83.89  | 63.35  | 269.65 | 256.5  | 262.11   | 88.45  | 81.82  | 111.26
Learn-to-Rank    | 79.51  | 76.20  | 69.56  | 139.86 | 211.90 | 244.76   | 79.27  | 78.50  | 122.35
Page-Rank        | 95.60  | 97.13  | 75.20  | 155.47 | 284.45 | 282.70   | 95.44  | 114.60 | 135.82

Table 7. Average number of statements examined to locate the first bug (worst case)

                 |          Gzip              |           Grep             |           Sed
Technique        | 2-bug  | 3-bug   | 4-bug   | 2-bug  | 3-bug  | 4-bug    | 2-bug  | 3-bug  | 4-bug
Proposed Method  | 94.46  | 84.48   | 94.59   | 218.21 | 229.45 | 301.45   | 102.69 | 84.82  | 106.1
D-Star           | 160.49 | 145.42  | 192.5   | 255.67 | 281.32 | 326.76   | 144.62 | 121.62 | 194.60
H3b              | 139.26 | 130.88  | 165.51  | 371.26 | 378.38 | 341.62   | 268.21 | 428.12 | 119.30
Ochiai           | 175.88 | 155.25  | 225.48  | 246.78 | 361.26 | 378.96   | 311.65 | 109.45 | 376.96
O                | 1529.1 | 1668.40 | 1679.58 | 411.65 | 391.85 | 384.62   | 2156.5 | 2128.8 | 2346.8
Learn-to-Rank    | 131.77 | 118.13  | 120.73  | 265.82 | 235.94 | 334.21   | 109.50 | 116.23 | 146.72
Page-Rank        | 144.30 | 127.11  | 159.66  | 288.13 | 311.60 | 364.78   | 195.43 | 176.54 | 224.68
5. Conclusion

In this paper, a new automated fault localization algorithm is proposed. Fault localization algorithms are mainly evaluated by measuring the amount of code that must be manually examined around the reported fault-suspicious statements before the actual failure origin is located. Fault-suspicious statements can be pinpointed more accurately by analyzing the combinatorial effect of statements on a program's termination status. A main disadvantage of existing methods is that they often ignore predicates which have strong discriminatory power as a group but are weak individually. Our aim is to overcome this limitation and to demonstrate the importance of accounting for this effect in fault localization. To this end, we provide a new statistical method to identify groups of interdependent predicates which altogether may affect the program failure. To find these groups, we suggest the use of a supervised algorithm based on penalized logistic regression analysis. In practice, it is difficult to discover the interdependent groups of predicates from program execution data, and some groups may contain irrelevant predicates. As future work, we propose using relevance, interdependence, and redundancy analysis to find the best groups of interdependent predicates which collectively cause the program to fail.
References

[1] Pan H, Spafford EH. Heuristics for automatic localization of software faults. Technical Report SERC-TR-116-P, Purdue University, 1992.
[2] Feyzi F, Parsa S. A program slicing-based method for effective detection of coincidentally correct test cases. Computing, 2018. DOI: 10.1007/s00607-018-0591-z.
[3] Parsa S, Vahidi-Asl M, Asadi-Aghbolaghi M. Hierarchy-Debug: a scalable statistical technique for fault localization. Software Quality Journal, 2014; 22(3): 427-466.
[4] Abreu R, Zoeteweij P, Golsteijn R, Van Gemund AJ. A practical evaluation of spectrum-based fault localization. Journal of Systems and Software, 2009; 82(11): 1780-1792.
[5] Wong WE, Debroy V, Gao R, Li Y. The DStar method for effective software fault localization. IEEE Transactions on Reliability, 2014; 63(1): 290-308.
[6] Naish L, Lee HJ, Ramamohanarao K. A model for spectra-based software diagnosis. ACM Transactions on Software Engineering and Methodology, 2011; 20(3): 11.
[7] Feyzi F, Parsa S. FPA-FL: Incorporating static fault-proneness analysis into statistical fault localization. Journal of Systems and Software, 2018; 136: 39-58.
[8] Feyzi F, Nikravan E, Parsa S. FPA-Debug: Effective statistical fault localization considering fault-proneness analysis. arXiv preprint arXiv:1612.05780, 2016.
[9] Feyzi F, Parsa S. Inforence: Effective fault localization based on information-theoretic analysis and statistical causal inference. Frontiers of Computer Science, 2017. DOI: 10.1007/s11704-017-6512-z.
[10] Ju X, Jiang S, Chen X, Wang X, Zhang Y, Cao H. HSFal: Effective fault localization using hybrid spectrum of full slices and execution slices. Journal of Systems and Software, 2014; 90: 3-17.
[11] Dettling M, Bühlmann P. Finding predictive gene groups from microarray data. Journal of Multivariate Analysis, 2004; 90(1): 106-131.
[12] Liblit B, Aiken A, Zheng X, Jordan MI. Bug isolation via remote program sampling. In: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, 2003; 141-154.
[13] Eilers P, Boer J, Van Ommen G-J, Van Houwelingen H. Classification of microarray data with penalized logistic regression. In: Proceedings of SPIE: Progress in Biomedical Optics and Imaging, Vol. 2, 2001; 187-198.
[14] Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Technical Report, Department of Statistics, Stanford University, 2002.
[15] WEKA machine learning toolkit. http://www.cs.waikato.ac.nz/ml/weka/
[16] Zou H, Hastie T. Package 'elasticnet'. www.stat.umn.edu/~hzou
[17] Yu Y, Jones JA, Harrold MJ. An empirical study of the effects of test-suite reduction on fault localization. In: Proceedings of the 30th International Conference on Software Engineering, 2008; 201-210.
[18] Jones JA, Harrold MJ. Empirical evaluation of the Tarantula automatic fault-localization technique. In: Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, 2005; 273-282.
[19] Wong WE, Debroy V, Xu D. Towards better fault localization: A crosstab-based statistical approach. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 2012; 42(3): 378-396.
[20] Wong WE, Qi Y. BP neural network-based effective fault localization. International Journal of Software Engineering and Knowledge Engineering, 2009; 19(4): 573-597.
[21] Feyzi F, Parsa S. Bayes-TDG: effective test data generation using Bayesian belief network: toward failure-detection effectiveness and maximum coverage. IET Software, 2018; 12(3): 225-235.
[22] Feyzi F, Parsa S. Kernel-based detection of coincidentally correct test cases to improve fault localisation effectiveness. International Journal of Applied Pattern Recognition, 2018; 5(2): 119-136.
[23] Nikravan E, Feyzi F, Parsa S. Enhancing path-oriented test data generation using adaptive random testing techniques. In: Proceedings of the 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2015; 510-513.
[24] Zhang M, Li X, Zhang L, Khurshid S. Boosting spectrum-based fault localization using PageRank. In: Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2017; 261-272.
[25] Le TDB, Lo D, Le Goues C, Grunske L. A learning-to-rank based fault localization approach using likely invariants. In: Proceedings of the 25th International Symposium on Software Testing and Analysis, 2016; 177-188.
[26] Hong S, Lee B, Kwak T, Jeon Y, Ko B, Kim Y, Kim M. Mutation-based fault localization for real-world multilingual programs. In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2015; 464-475.
[27] Kochhar PS, Xia X, Lo D, Li S. Practitioners' expectations on automated fault localization. In: Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA), 2016; 165-176.
[28] Long F, Rinard M. An analysis of the search spaces for generate and validate patch generation systems. In: Proceedings of the 38th International Conference on Software Engineering, 2016; 702-713.