2013 IEEE International Conference on Software Maintenance

Efficient Automated Program Repair through Fault-Recorded Testing Prioritization

Yuhua Qi, Xiaoguang Mao∗ and Yan Lei
School of Computer, National University of Defense Technology, Changsha, China
{yuhua.qi, xgmao, yanlei}@nudt.edu.cn

Abstract—Most techniques for automated program repair use test cases to validate the effectiveness of the produced patches. The validation process can be time-consuming, especially when the object programs ship with either many test cases or some long-running test cases. To alleviate the cost of testing, we first introduce the regression test prioritization insight into the area of automated program repair, and present a novel prioritization technique called FRTP with the goal of reducing the number of test case executions in the repair process. Unlike most existing prioritization techniques, which require additional cost for gathering previous test execution information, FRTP iteratively extracts that information from the repair process itself, and thus incurs only a trivial performance loss. We also built a tool called TrpAutoRepair, which implements our FRTP technique and can automatically repair C programs. To evaluate TrpAutoRepair, we compared it with GenProg, a state-of-the-art tool for automated C program repair. The experiment on 5 subject programs with 16 real-life bugs provides evidence that TrpAutoRepair performs at least as well as GenProg in terms of success rate in most cases (15/16), and that TrpAutoRepair can significantly improve repair efficiency by reducing the number of test case executions needed to find a valid patch in the repair process.

Keywords-automated program repair; test case prioritization; efficiency; automated debugging

I. INTRODUCTION

Automated program repair is considered to be more difficult than bug finding [1], and plays an important part in providing fully automated debugging [2]. Most publications in the field focus on the problem of how to generate candidate patches for validation and deployment [3], [4], [5], [6]. When a candidate patch is produced, test cases, which serve as regression testing, are executed to confirm that the patch is valid in the sense that it does fix the bug and does not introduce unexpected side effects. Unfortunately, chances are that regression testing includes either a large number of test cases or some long-running test cases. For example, as reported by Rothermel et al. in [7], running one of their products against the entire test suite requires seven weeks. What is worse, automated program repair is in essence an exercise in trial and error, meaning that many invalid patches may have to be identified through testing before a valid patch is found, so that a significant amount of time is spent on testing [8, Fig. 8]. To address this issue, in this paper we optimize the patch validation process by maximizing early invalid-patch detection through prioritizing test cases, and thus speed up the whole repair process.

In the area of regression testing, test case prioritization techniques reorder test cases in order to maximize the rate of fault detection, a measure of how quickly faults are detected in the testing process [9]. The ultimate goal of prioritization is an ordering of test cases that reproduces the failures as soon as possible. In the context of automated program repair, test case prioritization can be redefined as scheduling test cases in an order that maximizes the rate of invalid-patch detection, a measure of how quickly invalid patches are detected in the validation testing process; a higher rate means less time spent on candidate patch validation and thus more efficient automated program repair. In the repair process, a candidate patch is considered invalid if some test case detects a fault in the patched program. In this sense, test case prioritization for automated program repair is exactly equivalent to that for regression testing, and thus we can adopt existing test case prioritization techniques to improve the rate of invalid-patch detection.

Most existing test case prioritization techniques, however, require some previous test execution information, such as code coverage and the estimated fault-exposing ability of each test case, to reorder test cases for subsequent runs; gathering that information can be difficult or time-consuming [7], [9], especially for large-scale programs. For example, techniques based on code coverage, which instrument the program to gather the statements covered by each test case, can be too expensive or even infeasible for large programs, because too much trace data (sometimes over a gigabyte) must be recorded; techniques based on fault-exposing ability most frequently approximate the fault-exposing-potential (FEP) of each test case by mutation analysis [7], [10], whose cost can be very high even for small programs. Hence, if we directly apply existing regression test prioritization techniques, the additional cost of gathering the information used to prioritize test cases is unavoidable before the repair process, and can be computationally expensive, especially for large programs.

∗ Corresponding author.


Note that although some techniques presented in existing papers [11] can prioritize test cases without coverage information, they rely on other static analyses, which may be time-consuming for complex programs, to guide the prioritization process.

In this paper, we first introduce the regression test prioritization insight into the area of automated program repair, and present fault-recorded test prioritization (FRTP), a novel technique that can effectively improve the rate of invalid-patch detection. Unlike most existing test case prioritization techniques, which require additional cost for gathering previous test execution information before the prioritization, FRTP requires no prioritization information before the repair process, but rather iteratively extracts that information from the repair process itself, which incurs trivial overhead and thus is very suitable in the context of automated program repair. For maximum applicability of FRTP, we built a tool called TrpAutoRepair, which implements our FRTP technique and can automatically repair faulty C programs. We then compared TrpAutoRepair with GenProg, a state-of-the-art tool for automated program repair, by running them to repair a set of large-scale faulty programs with regression test suites ranging in size from 53 to 4,986 test cases. Experimental results indicate that TrpAutoRepair clearly requires fewer test case executions, and thus has higher repair efficiency.

In summary, this paper makes the following main contributions:
• The first attempt to introduce the regression test prioritization insight into the area of automated program repair, and a description of a new prioritization technique called FRTP with the goal of reducing the number of test case executions in the repair process (Section III). FRTP requires no previous test execution information before the repair process, but rather iteratively extracts that information from the repair process itself.
• Experimental evaluation of TrpAutoRepair, which implements the FRTP technique (Section IV), in comparison with GenProg, a state-of-the-art tool for automated program repair, in the same experimental context (Section V). The experimental results show that TrpAutoRepair outperforms GenProg in both efficiency and effectiveness.

II. BACKGROUND

This section first describes some core information on automated program repair, and then presents the definition of test case prioritization.

A. Automated Program Repair

We limit the scope of automated program repair to automatically fixing faults in source code, not binary files or run-time patching. Although there is plenty of research on automated fault localization, removing most faults is still a task for developers. Hence, if we seek full automation of software engineering, automating the task of fixing faults is necessary.

Generally, automated program repair can be divided into three procedures: fault localization, patch generation and patch validation. When some failure occurs, the suspicious faulty code snippet causing the failure is identified by applying existing fault localization techniques such as Tarantula [12]. Once the faulty code snippet (most often a list of statements ranked by suspiciousness) is located, repair tools can produce many candidate patches by modifying that code snippet according to specified repair rules based on either evolutionary computation [3], [8], [6], [13] or code-based contracts [14], [15]. To check whether a candidate patch is valid, test suites are often used to assess the correctness of the patched program: the patched program is run against a set of test cases consisting of negative test cases (revealing the presence of the fault) and positive test cases (characterizing the normal behavior of the object program); a candidate patch is considered valid iff the patched program passes all the test cases. The time cost for patch validation scales with the time to run these test cases. Unfortunately, complex or critical programs most often ship with large, long-running test suites. Taking the php case in [3] for example, there are over 8,000 test cases, and completely executing them can take several minutes. To speed up the process of patch validation, one feasible solution is scheduling the test cases in an order that attempts to maximize early fault detection (if a candidate patch is really invalid), which improves the rate of invalid-patch detection; this is the problem we investigate in this paper.
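To make the validity check concrete, the following minimal Python sketch (ours, not code from any repair tool; passes() stands for an assumed test-harness hook) expresses the acceptance criterion just described:

def is_valid_patch(patched_program, negative_tests, positive_tests, passes):
    # A candidate patch is valid iff the patched program passes all
    # negative test cases (which previously failed) and all positive ones.
    # passes(program, test) -> bool is an assumed test-harness hook.
    all_tests = list(negative_tests) + list(positive_tests)
    return all(passes(patched_program, t) for t in all_tests)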

B. Test Case Prioritization

As described by Rothermel et al. in [7], the test case prioritization problem can be defined as follows:

Given: T, a test suite; PT, the set of permutations of T; f, a function from PT to the real numbers.

Problem: Find T′ ∈ PT such that (∀T″)(T″ ∈ PT)(T″ ≠ T′)[f(T′) ≥ f(T″)].

Here, PT represents the set of all possible permutations (i.e., orderings) of T, a specified test suite, and f is a function that assigns an award value to any ordering in PT. There are many possible goals of test case prioritization [7], such as increasing the rate of fault detection of a test suite or increasing the rate at which coverable code in the system under test is covered; f describes the goal quantitatively and thus represents different quantifications of these goals. In this paper, we focus on the goal of increasing the rate of fault detection of a test suite; that is, we reorder the test cases of a test suite to increase the likelihood of revealing faults earlier in the testing process.
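Since |PT| = n! for a suite of n test cases, the optimal ordering can be searched exhaustively only for toy suites; the following Python sketch (ours, not from [7]) illustrates the definition literally, with an award function f that rewards placing a fault-detecting test case early:

from itertools import permutations

def prioritize_exhaustive(test_suite, f):
    # Return the ordering T' in PT that maximizes the award function f
    # (feasible only for tiny suites, since |PT| = n!).
    return max(permutations(test_suite), key=f)

# Toy setting: fault_detected marks tests that expose the fault.
fault_detected = {"t1": False, "t2": True, "t3": False}

def award(ordering):
    # Higher award the earlier a fault-detecting test appears; 0 if none.
    for pos, t in enumerate(ordering):
        if fault_detected[t]:
            return len(ordering) - pos
    return 0

print(prioritize_exhaustive(["t1", "t2", "t3"], award))  # ('t2', 't1', 't3')

Practical techniques replace this exhaustive search with heuristics that approximate f from surrogate data such as coverage, which is exactly the information whose collection cost motivates FRTP.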


In addition, two different types of test case prioritization are described in [7]: general and version-specific. For general test case prioritization, the goal is to find a test case order T′ that is useful over a sequence of subsequent modified versions of a program P; in contrast, for version-specific test case prioritization, the goal is to find a test case order T′ that is very effective for one specific modified version of P. The main difference between the two is that the former can be effective on average over a succession of subsequent releases but may be less effective on one specific version, while the latter has the opposite property. Obviously, version-specific test case prioritization is in essence analogous to the process of patch validation in automated program repair described in Section II-A, which validates one candidate patch by running the corresponding patched program against specified test cases: in both, specified test cases are run on one specific version. However, directly adopting version-specific test case prioritization techniques is not suitable for automated program repair, because version-specific prioritization often suffers from the problem that the cost of prioritizing excessively delays the very regression testing activities it is meant to speed up. The prioritization cost can be very high for large-scale programs, and thus reduces efficiency.

C. Prioritization Information and Limitations

As described in Section II-B, most existing prioritization techniques require prioritization information gathered from previous test case executions. Following the descriptions in [7], [16], although there are several types of prioritization information, we focus on the two types most closely related to our work:

Code coverage. The intuition is that early maximization of code coverage is likely to increase the chance of early maximization of fault detection. Prioritization techniques based on code coverage order test cases by their total coverage of code components1, counting the number of components covered by each test case. If multiple test cases cover the same number of components, they are ordered randomly. In addition, for version-specific prioritization, test cases that cover the modified code area are considered more likely to reveal faults, and thus should be scheduled for early execution.

FEP. The intuition is that the ability of a test case to expose a fault depends not only on whether the test case covers the faulty area, but also on the probability that a fault in that area will cause a failure for that test case [17]. That is, code coverage is necessary but not sufficient as a prioritization criterion. Mutation analysis [18] is generally used to measure the fault-exposing-potential (FEP) of a test case.

Program mutation analysis produces many different mutant versions of the program by simple syntactic changes to the program source. For each test case, the award value, in terms of mutation score, is the ratio of mutants exposed by the test case to the total number of mutants [16]. FEP prioritization then sorts the test cases in descending order of that award value. However, for existing prioritization techniques based on either code coverage or FEP, gathering the prioritization information may be time-consuming. For example, mutation analysis used to evaluate FEP has traditionally been seen as computationally expensive, although many techniques try to reduce its cost [18]. In this paper, we address this issue with a novel technique called fault-recorded testing prioritization (FRTP).
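As an illustration of the two classic orderings just described, the following Python sketch (ours; the per-test mutant kill counts are given as inputs, since computing them is precisely the expensive mutation-analysis step) implements total-coverage prioritization with random tie-breaking and FEP-style prioritization by mutation score:

import random

def prioritize_by_total_coverage(coverage):
    # coverage: dict test_id -> set of covered components.
    # Sort by number of covered components; ties are broken randomly
    # by shuffling first (Python's sort is stable).
    tests = list(coverage)
    random.shuffle(tests)
    return sorted(tests, key=lambda t: len(coverage[t]), reverse=True)

def prioritize_by_fep(killed, total_mutants):
    # killed: dict test_id -> number of mutants the test exposes.
    # Mutation score = killed / total mutants; sort in descending order.
    score = {t: killed[t] / total_mutants for t in killed}
    return sorted(score, key=score.get, reverse=True)

coverage = {"t1": {"s1", "s2"}, "t2": {"s1", "s2", "s3"}, "t3": {"s4"}}
print(prioritize_by_total_coverage(coverage))              # t2 first
print(prioritize_by_fep({"t1": 5, "t2": 9, "t3": 1}, 10))  # t2, t1, t3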

III. THE FRTP TECHNIQUE

In this section we discuss our FRTP technique in detail. We first give an overview of the problem of interest and our insight for addressing it in Section III-A. Then, we discuss the insight and motivation behind the FRTP technique in Section III-B. Finally, we describe the details of the FRTP algorithm in Section III-C.

A. Overview

Recently, much research has focused on automated program repair, which attempts to repair a faulty program by modifying it at the source code level, under the assumption that the source code is accessible. Since there is no a priori guarantee that a given modification (i.e., candidate patch) actually repairs the defective program, the repair process is not deterministic, which means that many candidate patches will most probably be produced before a valid patch is found. As described in [8, Fig. 8], the overwhelming majority of the time cost is spent on patch validation, including the cost of test case executions and compilation. Unfortunately, test case executions during patch validation are time-consuming in most cases, and the problem becomes more serious for complex or critical programs, most of which ship with large, long-running test suites.

To address this issue and speed up the repair process, we present the FRTP technique, which introduces the test case prioritization insight into the patch validation process. However, most existing prioritization techniques need previous test execution information to prioritize test cases. If we directly adopted these techniques, the cost of gathering prioritization information might cancel out the benefits generated by the test case prioritization itself. For FRTP, the prioritization information is not specially gathered before the repair process; rather, it is obtained incrementally from the repair process. Hence, FRTP requires no previous test execution information before the repair process and incurs only a trivial cost for gathering prioritization information. A more detailed account of FRTP is presented in the following subsections.

1 Code components here may refer to statements, functions, branches, etc.


B. Prioritizing Test Cases for Automated Program Repair

Having investigated the repair process, we find that prioritization information, such as code coverage and FEP, can be extracted from the repair process itself at trivial cost. Suppose that the source code of a program P is constructed from a set of components C = {c1, c2, ..., ci, ..., cn}; when a fault is detected, the defective component set Csub ⊂ C causing the fault can be identified through existing fault localization techniques. Then, Csub is modified to repair the program, producing many candidate patches PT = {pt1, pt2, ..., pti, ...}. For each pt ∈ PT, the program P′, a mutant version of P updated by pt, is run against a specified set of test cases T, which is used to guarantee that pt does fix the fault and does not introduce new faults. If a test case t ∈ T detects a fault, then pt is killed by that test case and the next candidate patch is considered; if all test cases pass, the repair process terminates and outputs pt as the valid patch for the fault.

Given the above description, consider the ith candidate patch pti ∈ PT (supposing that no valid patch has been found before it): prior to pti, a total of i−1 candidate patches have been validated, and these patches are invalid in the sense that, for each of them, some test case detected a fault during patch validation (i.e., the patch was killed by that test case). We record the test cases that have detected faults prior to validating pti, producing the set of test cases Ti ⊆ T corresponding to pti. Obviously, Ti meets the constraint

Size(Ti) ≤ Size(PTi), PTi = {pt1, pt2, ..., pti−1},

where the Size function gives the number of elements of a set. The constraint indicates that one test case t ∈ Ti may kill more than one candidate patch, while each candidate patch pt ∈ PTi is killed by one and only one test case t ∈ Ti. The mapping relation between Ti and PTi is shown in Figure 1.
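A tiny sketch (ours; the patch and test names are illustrative) makes the constraint tangible: we record, for each killed candidate patch, the single test case that killed it, and derive Ti as the set of recorded killers:

# killed_by: candidate patch id -> the one test case that killed it.
killed_by = {"pt1": "t1", "pt2": "t2", "pt3": "t1", "pt4": "tf"}

PT_i = set(killed_by)          # the i-1 patches validated so far
T_i = set(killed_by.values())  # tests that have detected faults so far

assert len(T_i) <= len(PT_i)   # Size(Ti) <= Size(PTi)
print(T_i)                     # e.g. {'t1', 't2', 'tf'} (set order varies)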

[Figure 1. Mapping relation between Ti and PTi: each directed line indicates that the test case killed the corresponding candidate patch before the patch validation for pti.]

In Figure 1, since each t ∈ Ti has killed some pt ∈ PTi by detecting a fault, t is highly correlated with the defective components Csub (located by fault localization techniques), in the sense that t most probably covers the code of Csub, or at least covers code that depends heavily on Csub. Thus, compared to T \ Ti, the relative complement of Ti in T, the test cases of Ti should be given higher priority for early execution in the patch validation process for pti, which is consistent with version-specific prioritization techniques based on code coverage (in fact, for pti, the program P′ updated by pti can be regarded as one specific version of P). On the other hand, the prior patch validation process for each pt ∈ PTi can in essence be considered as mutation analysis performed before the validation of pti, because the program P′ updated by pt ∈ PTi is actually one mutant version of P obtained by modifying the program source code; the larger the value of i, the more accurate the results of this mutation analysis. Specifically, in Figure 1, compared to tf, both t1 and t2 have higher FEP in terms of killing more candidate patches, and thus should be executed earlier to reduce the time cost of exposing the fault if pti is actually an invalid patch, which is consistent with prioritization techniques based on FEP. In the following subsection, we describe in detail how the FRTP algorithm takes advantage of this prioritization information to prioritize test cases.

C. Algorithm for FRTP

Algorithm 1 shows our FRTP technique in detail. By iteratively recording the faults detected in the process of patch validation (hence "fault-recorded"), FRTP attempts to improve the rate of invalid-patch detection through prioritizing test cases. Before starting to repair the defective program P, we first reconstruct the test cases T so that T can map each test case t ∈ T to the corresponding number of candidate patches killed by t. Specifically, for each t ∈ T, FRTP constructs a tuple (t, index), where t is the original test case and index is a non-negative integer recording the current number of candidate patches killed by t; in fact, index indicates the FEP ability of t and is the main prioritization information. FRTP then reconstructs T from all these tuples and initializes the index value of each tuple (lines 1-3). Recall that the test cases T consist of two parts: negative test cases (revealing the presence of the fault and represented by n0) and positive test cases (characterizing the normal behavior of the object program and represented by {t1, t2, ..., tn}). Because of its natural ability to detect the fault even before the repair process, on line 3 FRTP initializes the index value of n0 with 1, rather than 0 as for the other test cases. To limit the search space for generating patches, on line 4 the function FaultLocalization preliminarily locates the defective component set Csub by analyzing the trace data from running the program P against the test cases T (a detailed account can be found in [12]).

In addition, for simplicity FRTP creates the state variable SuccessFlag on line 5, which records whether a valid patch has been found. The search begins by generating one patch pt and obtaining a mutant executable program P′ updated by pt. The function PatchGeneration produces one concrete patch pt by modifying Csub according to specific modification rules [3], [6], [14] (line 7 in Algorithm 1). Note that the patches produced by multiple calls to PatchGeneration differ, due to the application of a randomized algorithm (more details can be found in Section IV-B). Once pt is generated, the function Update is called to recompile the patched program into an executable program P′ (line 8). Lines 9-21 run P′ against T in order, and reorder the test cases of T according to the running results. For the ith test case of T (note that T has most probably already been prioritized due to prior invalid patches), line 11 calls the function GetTestcase to get the ith tuple (tindex, index) of T; then, the function PatchValidation runs P′ against tindex (line 12). If a fault is detected by tindex, FRTP records the fault by updating the tuple (tindex, index) to (tindex, index+1) in T (line 13), and reorders the execution order of the test cases in T by calling the function Prioritize (line 14). Prioritize sorts the test cases (in fact, the tuples of T) in descending order of the value of index; when multiple test cases have the same value of index, FRTP orders them randomly. If P′ successfully passes all the test cases T (line 16), FRTP considers that a valid patch has been found (line 17), and terminates the repair process immediately, outputting the valid patch (line 23).

Algorithm 1: The FRTP Algorithm
Input: Defective program P and test cases T
Output: One valid patch pt
 1  index ← 0;  // initialize the index value
 2  {n0, t1, t2, ..., tn} ← T;
 3  T ← {(n0, 1), (t1, index), (t2, index), ..., (tn, index)};
 4  Csub ← FaultLocalization(P, T);
 5  SuccessFlag ← false;
 6  repeat
 7      pt ← PatchGeneration(Csub);
 8      P′ ← Update(P, pt);
 9      for i ← 0 to n do
10          // check whether pt is valid
11          (tindex, index) ← GetTestcase(T, i);
12          if PatchValidation(P′, tindex) = true then
13              temp ← (tindex, index + 1);
14              T ← Prioritize(T, temp);
15              break;
16          else if i = n + 1 then
17              SuccessFlag ← true;
18          else
19              continue;
20          end
21      end
22  until SuccessFlag = true;
23  return pt;
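To complement the pseudocode, here is a compact executable sketch of the same loop in Python (ours; generate_patch, apply_patch and detects_fault stand for the repair tool's patch generator, recompiler and test harness, and fault localization is assumed to happen inside generate_patch):

import random

def frtp_repair(program, tests, detects_fault, generate_patch, apply_patch,
                max_patches=400):
    # Executable sketch of Algorithm 1. tests[0] is the negative test n0;
    # detects_fault(patched, t) -> True iff test t exposes a fault.
    # Lines 1-3: pair each test with a kill counter ('index'); n0 starts
    # at 1 because it has already exposed the original fault.
    index = {t: 0 for t in tests}
    index[tests[0]] = 1
    order = list(tests)
    for _ in range(max_patches):               # budget in place of 'repeat'
        patch = generate_patch()               # line 7: PatchGeneration
        patched = apply_patch(program, patch)  # line 8: Update (recompile)
        for t in order:                        # lines 9-21: run tests in order
            if detects_fault(patched, t):      # line 12: t kills the patch
                index[t] += 1                  # line 13: record the kill
                random.shuffle(order)          # random tie-breaking, then
                order.sort(key=index.get, reverse=True)  # line 14: Prioritize
                break
        else:                                  # line 16: no test failed
            return patch                       # line 23: valid patch found
    return None                                # budget exhausted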

IV. IMPLEMENTATION

In this section, we first describe GenProg [3], [8], a state-of-the-art tool for automated C program repair, on which we built TrpAutoRepair. Then, we present the implementation details of TrpAutoRepair.

A. GenProg

GenProg2 can fix bugs in deployed, legacy C programs without formal specifications. Based on the hypothesis that missing important functionality may occur at another position in the same program, GenProg attempts to repair defective programs with evolutionary computation [19]. GenProg uses a modified version of genetic programming to maintain a population of variants, each represented by a pair of an abstract syntax tree (AST) and a weighted path (the result of fault localization). In the process of patch generation, a candidate patch is generated using one of the genetic algorithm operations, mutation or crossover; the variant is the AST representing the program updated by the patch. Then, in the process of patch validation, testing is required: fitness is evaluated by executing test cases. If the patched program passes all the test cases, GenProg terminates the repair process and declares that a valid patch has been found; otherwise GenProg goes back to patch generation and produces another patch according to the prior fitness values. Although GenProg has successfully fixed many bugs in off-the-shelf programs (e.g., python, libtiff), it may suffer from time-consuming test case executions [3], a problem that becomes more serious for complex or critical programs, most of which ship with large, long-running test suites. What is worse, because GenProg takes the fitness values, which are evaluated by running a fixed number of test cases (10% of all test cases in [3] and [20]), as the metric that guides the search for optimal or near-optimal patches in the patch generation step, many test case executions are necessary even after a patch is already known to be invalid (i.e., some test case has already detected a fault in the patched program).


2 Available: http://dijkstra.cs.virginia.edu/genprog/
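For contrast with FRTP, the following sketch (ours, with illustrative weights; not GenProg's actual code) shows the fitness-style validation described above: every variant is scored by running all negative tests plus a fixed 10% sample of the positive suite, so test executions accrue even after a variant is already known to be invalid:

import random

def fitness(patched, neg_tests, pos_tests, passes,
            sample_rate=0.1, w_neg=2.0, w_pos=1.0):
    # Weighted count of passed tests over all negative tests and a 10%
    # sample of the positive tests; the weights are assumptions.
    k = max(1, int(sample_rate * len(pos_tests)))
    sample = random.sample(pos_tests, k)
    score = sum(w_neg for t in neg_tests if passes(patched, t))
    score += sum(w_pos for t in sample if passes(patched, t))
    return score  # each call costs len(neg_tests) + k test executions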

B. TrpAutoRepair

We implemented TrpAutoRepair on top of GenProg, in the OCaml language, by applying our FRTP technique. Specifically, the suspiciousness value sp of each statement s ∈ P is computed through a simple fault localization scheme identical to GenProg's: a statement never visited by any test case has an sp value of 0; a statement visited only by the negative test case, which reproduces the failure, has the high value of 1.0; and a statement visited by both positive and negative test cases is given the moderate value of 0.1. For each statement s, the corresponding probability pb of s being selected is computed by the formula pb = sp ∗ mute, where mute is the global mutation probability set at initialization. When generating a candidate patch, TrpAutoRepair randomly generates a probability value p ∈ [0, 1] for each statement s, and selects si for modification iff pbi ≥ pi; the repair rules used by TrpAutoRepair to modify the selected statements are also the same as GenProg's: copying a few lines of code from other parts of the program or modifying existing code (see [8]). In fact, TrpAutoRepair produces candidate patches in the same way as the random patch generation in the first generation of GenProg, but without fitness guidance and crossover in subsequent generations; that is, it mutates the code according to both the suspiciousness values and the global mutation rate specified before the repair process. For patch validation, candidate patches are validated in the way described in Algorithm 1.
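The selection step can be sketched as follows (ours, mirroring the formulas above; treating statements visited only by positive test cases as never selected is our assumption, since the text does not assign them an sp value):

import random

def suspiciousness(visited_by_negative, visited_by_positive):
    # Weights from the text: 1.0 if only the negative test visits the
    # statement, 0.1 if both kinds visit it, 0 if no test visits it.
    if visited_by_negative and visited_by_positive:
        return 0.1
    if visited_by_negative:
        return 1.0
    return 0.0  # never visited (or positive-only: assumed non-faulty)

def select_statements(statements, mute=0.01):
    # statements: list of (statement_id, sp). Statement s_i is selected
    # for modification iff pb_i = sp_i * mute >= p_i with p_i ~ U[0, 1],
    # i.e. with probability sp_i * mute per candidate patch.
    return [s for s, sp in statements if sp * mute >= random.random()]

stmts = [("s1", 1.0), ("s2", 0.1), ("s3", 0.0)]
print(select_statements(stmts))  # usually []; rarely ['s1'] or ['s2']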

V. EXPERIMENTAL EVALUATION

A. Research Questions

Our evaluation addresses the following research questions. RQ1 investigates the effectiveness of TrpAutoRepair; RQ2 investigates its efficiency, in terms of the rate of invalid-patch detection.
• RQ1: How does TrpAutoRepair compare with GenProg in terms of repair effectiveness?
• RQ2: Can TrpAutoRepair improve the rate of invalid-patch detection over GenProg? If so, how large is the improvement?

B. Subject Programs

In this experiment, we selected the subject programs used in the most recent work [3] on GenProg as our experimental benchmarks3; all are written in C and are real-world systems with real-life bugs from different domains. We excluded the fbc program and 5 versions of the libtiff program because we ran into compilation trouble with them. We also excluded the gzip and lighttpd programs, because too few positive test cases (no more than 10) are actually used in each, although more test cases were listed in [3]; with so few test cases, optimizing testing makes little sense. For php, which ships with over 4,000 test cases, several minutes are needed to validate even one patched program, resulting in a time-consuming repair process. Thus, a complete experimental evaluation on all the php versions would take too much time (see [3, Table II]); the authors of [3] used Amazon's EC2 cloud computing infrastructure with 10 trials in parallel for their experiment. Given the expensive testing computation, we randomly selected one faulty php version without any bias, although the php versions, shipping with many test cases, would give TrpAutoRepair more of an advantage. For gmp, python and wireshark, only one version per program was successfully repaired by GenProg in [3]. Table I describes our subject programs in detail: the LOC column lists the scale of each subject program in lines of code, and the last two columns give the number of positive test cases and the version information. Note that although 78 test cases are listed for libtiff in Table I, the concrete number of test cases used for each bug version differs slightly, because not all 78 test cases work for every version. In addition, we assigned one negative test case to each subject program to reproduce the bug.

3 https://church.cs.virginia.edu/genprog/archive/genprog-105-bugstarballs/

Table I
SUBJECT PROGRAMS

Program     LOC        Test Cases  Version
libtiff     77,000     78          bug-01209c9-aaf9eb3, bug-0860361d-1ba75257,
                                   bug-0fb6cf7-b4158fa, bug-10a4985-5362170,
                                   bug-4a24508-cc79c2b, bug-5b02179-3dfb33b,
                                   bug-6f9f4d7-73757f3, bug-829d8c4-036d7bb,
                                   bug-8f6338a-4c5a9ec, bug-90d136e4-4c66680f,
                                   bug-d39db2b-4cd598c, bug-ee2ce5b7-b5691a5a
gmp         145,000    146         bug-14166-14167
python      407,000    303         bug-69783-69784
php         1,046,000  4,986       bug-309892-309910
wireshark   2,814,000  53          bug-37112-37111

C. Experimental Setup

For the purposes of comparison, we separately ran TrpAutoRepair and GenProg to repair the 5 subject programs with 16 versions described in Table I. All the experimental parameters for GenProg in our experiment are the same as the settings in [3], as follows.


Table II
EXPERIMENTAL RESULTS BY TrpAutoRepair AND GenProg*

Program                        Approach        Mean    Median  Avg. Time(s)  Success  A-statistic  p-value
                                               NTCE    NTCE    per Repair    Rate     on NTCE
libtiff-bug-01209c9-aaf9eb3    TrpAutoRepair   79      78      18.744        100%     0.940400     0.000000
                               GenProg         122     109     25.764        100%
libtiff-bug-0860361d-1ba75257  TrpAutoRepair   63      57      269.149       100%     0.787165     0.000000
                               GenProg         177     119     439.535       97%
libtiff-bug-0fb6cf7-b4158fa    TrpAutoRepair   187     170     242.009       100%     0.813636     0.000000
                               GenProg         587     505     465.513       77%
libtiff-bug-10a4985-5362170    TrpAutoRepair   51      47      42.582        100%     0.834946     0.000000
                               GenProg         150     89      73.669        93%
libtiff-bug-4a24508-cc79c2b    TrpAutoRepair   66      64      55.937        100%     0.926400     0.000000
                               GenProg         115     94      88.147        100%
libtiff-bug-5b02179-3dfb33b    TrpAutoRepair   119     103     121.409       100%     0.826053     0.000000
                               GenProg         338     218     241.101       95%
libtiff-bug-6f9f4d7-73757f3    TrpAutoRepair   68      65      66.104        100%     0.922800     0.000000
                               GenProg         123     101     114.560       100%
libtiff-bug-829d8c4-036d7bb    TrpAutoRepair   154     148     233.548       88%      0.830168     0.000000
                               GenProg         540     554     539.218       73%
libtiff-bug-8f6338a-4c5a9ec    TrpAutoRepair   47      43      25.950        100%     0.867400     0.000000
                               GenProg         92      75      33.386        100%
libtiff-bug-90d136e4-4c66680f  TrpAutoRepair   100     93      82.677        100%     0.936600     0.000000
                               GenProg         349     225     132.086       100%
libtiff-bug-d39db2b-4cd598c    TrpAutoRepair   110     99      46.467        98%      0.883716     0.000000
                               GenProg         387     277     96.754        96%
libtiff-bug-ee2ce5b7-b5691a5a  TrpAutoRepair   104     94      99.444        100%     0.926650     0.000000
                               GenProg         354     238     131.874       100%
gmp-bug-14166-14167            TrpAutoRepair   312     301     606.511       58%      0.817529     0.000590
                               GenProg         663     530     472.828       12%
python-bug-69783-69784         TrpAutoRepair   434     408     452.185       37%      0.990991     0.000000
                               GenProg         2572    2318    1409.825      21%
php-bug-309892-309910          TrpAutoRepair   4997    4994    457.486       100%     1.000000     0.000000
                               GenProg         20837   16986   1440.835      97%
wireshark-bug-37112-37111      TrpAutoRepair   167     179     2159.907      8%       0.040625     0.000204
                               GenProg         630     536     1845.472      20%

* We separately ran TrpAutoRepair and GenProg 100 times on each of the 16 subject programs and only recorded the trials leading to a successful repair. Hence, the success rate is n% if there are n successful trials finding a valid patch; the statistics in this table are also computed from those n successful trials.

We limited the size of the population for each generation to 40, with a maximum of 10 generations per repair process; the global mutation rate mute was set to 0.01. In fact, for every generation except the first, another 40 candidate patches are generated by crossover; that is, for one concrete repair process, GenProg can iteratively produce no more than 40+80*10=840 candidate patches. For TrpAutoRepair, we likewise limited the population size to 40 per generation and allowed at most 10 generations per repair process; in each generation, 40 candidate patches are produced by random search (without crossover) in the same way as in the first generation. Hence, for a fair comparison, we considered that TrpAutoRepair (GenProg) failed to repair a subject program in one repair process if a valid patch was not found within 400 candidate patches.

All experiments ran on an Ubuntu 10.04 machine with a 2.33 GHz Intel quad-core CPU and 4 GB of memory. Since randomized algorithms are applied in both TrpAutoRepair and GenProg, we analyzed the experimental results statistically: for each program, we separately performed 100 trials of TrpAutoRepair and of GenProg.

D. Experimental Results

1) RQ1: How does TrpAutoRepair compare with GenProg in terms of repair effectiveness?
In this paper, we evaluate repair effectiveness by the success rate, which measures the difficulty of finding a valid patch.


[Figure 2. NTCE boxplots for the experiments: for each of the 16 subject programs, boxplots of the NTCE values of the successful trials of TrpAutoRepair and GenProg. Note that for python and php, whose NTCE values are far larger than those of the other programs, the NTCE values were scaled down linearly for ease of presentation.]

Intuitively, the higher the repair effectiveness of a repair tool, the stronger its ability to find a valid patch, and a stronger ability indicates a higher success rate of repairing a faulty program. Note that we separately ran TrpAutoRepair and GenProg 100 times on each of the 16 faulty programs; hence, the success rate is n% if there are n successful trials. For most programs (15 of 16 in Table II), TrpAutoRepair has a higher or equal success rate compared to GenProg. The higher success rate means that TrpAutoRepair offers a greater ability to find a valid patch under the same conditions. GenProg's relatively worse success rate indicates that the fitness function evaluation it uses does not always work well for guiding the search for optimal or near-optimal patches in the patch generation step, and sometimes (9/16) even misleads the search process. We are not surprised by these experimental results. In fact, Arcuri and Briand (in their ICSE 2011 paper [21, Page 3]) specifically raised concerns about the effectiveness of the genetic programming used by GenProg, and pointed out that genetic programming should be compared against random search to check its effectiveness. However, the most recent work on GenProg [3], published in ICSE 2012, still did not make this comparison. In this sense, our experiment supplements [3], and confirms the concern that genetic programming does not always work better than random search; in our experiment it even worsened the patch search process. A possible reason for GenProg's worse repair effectiveness can be traced to a paper [22] by the same authors: current fitness functions, including the one used by GenProg, are either overly simplistic or likely to exhibit "all-or-nothing" behavior, and thus are not well correlated with the true distance between an individual and the global optimum. Since imprecise fitness functions can mislead the search process, it should not be surprising that in our experimental evaluation GenProg has worse repair effectiveness on many programs when compared to TrpAutoRepair, which uses random search.

The only exception is wireshark, for which GenProg outperforms TrpAutoRepair. Having analyzed the repair process, we find that, compared to the other programs, the patched wireshark programs are more likely to fail to compile in the initial phase of the repair process. Thus, we suspect the reason for the exception is that the fitness function is good at eliminating bad patches (which fail to compile) and can produce more compilable candidate patches in the subsequent repair process.

2) RQ2: Can TrpAutoRepair improve the rate of invalid-patch detection over GenProg? If so, how large is the improvement?
To answer this question, we require a measure with which to assess and compare the rate of invalid-patch detection between TrpAutoRepair and GenProg. To quantify the goal of increasing a test suite's rate of fault detection, Rothermel et al. present APFD, a metric measuring the weighted average of the percentage of faults detected during the test case executions [7]. Although the APFD metric is generally used for evaluating the effectiveness of test case prioritization techniques [9], it is not suitable for evaluating the rate of invalid-patch detection in automated program repair, because the order of test cases in the repair process changes continually (recall line 14 in Algorithm 1), while APFD requires a fixed order before the evaluation. In our experiment, we assess the rate of invalid-patch detection by counting the Number of Test Case Executions (NTCE) within one successful repair process; the smaller the NTCE value, the higher the rate of invalid-patch detection. An overview of how the two tools affected the rate of invalid-patch detection can be gained from Figure 2, which provides boxplots of the NTCE values of each successful trial of TrpAutoRepair and GenProg on each subject program. Figure 2 shows that, for all 16 subject programs, TrpAutoRepair killed the invalid patches more rapidly (i.e., achieved the lower NTCE values for both "Mean" and "Median" in Table II) than GenProg, which justifies that our FRTP technique can improve the rate of invalid-patch detection.

187

Given that TrpAutoRepair improves the rate of invalid-patch detection over GenProg in Figure 2, we then assess the magnitude of the improvement by measuring effect size (scientific significance), in this case a difference in the ability to detect invalid patches. The nonparametric Vargha-Delaney A-test [23], which is recommended in [21] and was also used in [24], is used to evaluate the effect size in this experiment. In [23], Vargha and Delaney suggest that an A-statistic greater than 0.64 (or less than 0.36) indicates a "medium" effect size, and one greater than 0.71 (or less than 0.29) indicates a "large" effect size. Table II shows the effect-size A-statistic, which represents the significance of the difference between TrpAutoRepair and GenProg on NTCE, and the p-value (for the rank-sum test used alongside the A-statistic) for each subject program. Because the A-statistics of all 16 programs are either greater than 0.71 or less than 0.29, with low p-values (p < 0.05), it is reasonable to conclude that TrpAutoRepair significantly improves the rate of invalid-patch detection over GenProg in our experiment, which also justifies the advantage of our FRTP technique.
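For reference, the A-statistic has a direct counting form: the probability that a value drawn from one sample exceeds a value drawn from the other, with ties counted as one half. A minimal sketch (ours; the NTCE numbers are illustrative, not taken from Table II):

def vargha_delaney_a(xs, ys):
    # A = P(x > y) + 0.5 * P(x == y) for x in xs, y in ys.
    # 0.5 means no difference; > 0.71 or < 0.29 is a 'large' effect.
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)
    return wins / (len(xs) * len(ys))

genprog_ntce = [122, 109, 130, 118]   # illustrative samples only
trp_ntce = [79, 78, 81, 90]
print(vargha_delaney_a(genprog_ntce, trp_ntce))  # 1.0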

E. Threats to Validity

Like many studies, the main threat is the possibly poor generalization of the experimental results. Since we selected 16 subject programs, the results may support the conclusions drawn from our experiment with some bias. Although one effective way to minimize experimental bias is to increase the number of subject programs, experimenting on more programs requires more expensive computation resources. To reduce this cost, we plan to further optimize the repair process as we have done in earlier work [25], which will allow us to investigate a larger number of subject programs in the future.

Another issue is the appropriateness of the evaluation criteria in our experiment. We use the nonparametric A-test to evaluate the effect size according to the NTCE values. The reasonableness of this choice remains to be confirmed in future work, although NTCE intuitively approximates the rate of invalid-patch detection.

VI. RELATED WORK

Automated Program Repair. Many studies try to repair defective programs at the source code level in different ways. Guided by evolutionary computation, GenProg seeks to repair programs without any specifications. AutoFix-E [4] can generate patches for faulty programs automatically, but requires contracts. JAFF [6] can automatically fix bugs in Java programs using a co-evolutionary approach; the repair effectiveness of JAFF on real-world faulty software is still unknown. In addition, there is work [5] on automatically fixing php HTML generation errors via string constraint solving. Although all the above tools generate candidate patches according to different rules, they have at least one point in common: to validate these patches, they have to test the patched programs through regression testing. And none of these tools scales well to complex or critical programs shipping with either a large number of test cases or long-running test cases.

Test Case Prioritization Techniques. Techniques for test case prioritization try to improve the rate of fault detection by scheduling test cases for testing, and have been intensively studied in regression testing [9]. A family of test case prioritization techniques based on several different coverage criteria was presented in [7]. There is also much work focusing on different granularity levels, such as the function level [16], system models [26], and the block and method levels [27]. In addition, some research has evaluated traditional test case prioritization techniques in the context of time-aware test case prioritization [28]. More details and work on test case prioritization can be found in the survey [9].

VII. CONCLUSION AND FUTURE WORK

We present a test case prioritization technique called FRTP that assists the patch validation process by improving the rate of invalid-patch detection in the context of automated program repair. FRTP can effectively reduce the number of test case executions needed to expose invalid patches, and thus improves the overall repair efficiency, especially when the object program ships with either a large number of test cases or long-running test cases. Unlike most existing test case prioritization techniques, which require additional cost for gathering prioritization information before the prioritization, FRTP iteratively extracts that information by recording fault information during the repair process, which incurs only trivial cost. We have built the tool TrpAutoRepair, which implements our FRTP technique and can repair programs fully automatically.

We evaluated TrpAutoRepair against GenProg, a state-of-the-art tool for automated C program repair. In terms of repair effectiveness, TrpAutoRepair outperformed GenProg in most cases (15/16); in terms of repair efficiency, TrpAutoRepair improved the rate of invalid-patch detection (requiring far fewer test case executions to determine invalid patches) for all the subject programs in our experiment, and this improvement is statistically significant, with a "large" effect size for the A-statistic. Complete experimental results for this paper are available at:

http://sourceforge.net/projects/trpautorepair/files/

In the future we plan to further speed up the repair process by applying weak recompilation and precise fault localization techniques [25], [29] to TrpAutoRepair. We believe that this will make TrpAutoRepair more powerful, in the sense that less time will be needed to repair defective programs.

ACKNOWLEDGMENT

The authors thank W. Weimer et al. for their noteworthy work on GenProg, on which we built the TrpAutoRepair system. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61120106006, 91118007 and 91318301) and the National High Technology Research and Development Program of China (Grant Nos. 2011AA010106 and 2012AA011201).

REFERENCES

[1] M. Harman, "Automated patching techniques: the fix is in: technical perspective," Communications of the ACM, vol. 53, no. 5, pp. 108–108, 2010.
[2] A. Zeller, "Automated debugging: Are we close?" Computer, vol. 34, no. 11, pp. 26–31, 2001.
[3] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer, "A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each," in International Conference on Software Engineering (ICSE), 2012, pp. 3–13.
[4] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller, "Automated fixing of programs with contracts," in International Symposium on Software Testing and Analysis (ISSTA), 2010, pp. 61–72.
[5] H. Samimi, M. Schäfer, S. Artzi, T. Millstein, F. Tip, and L. Hendren, "Automated repair of HTML generation errors in PHP applications using string constraint solving," in International Conference on Software Engineering (ICSE), 2012, pp. 277–287.
[6] A. Arcuri, "Evolutionary repair of faulty software," Applied Soft Computing, vol. 11, no. 4, pp. 3494–3514, 2011.
[7] G. Rothermel, R. Untch, C. Chu, and M. Harrold, "Prioritizing test cases for regression testing," IEEE Transactions on Software Engineering (TSE), vol. 27, no. 10, pp. 929–948, Oct. 2001.
[8] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, "GenProg: A generic method for automatic software repair," IEEE Transactions on Software Engineering (TSE), vol. 38, no. 1, pp. 54–72, 2012.
[9] S. Yoo and M. Harman, "Regression testing minimization, selection and prioritization: a survey," Software Testing, Verification and Reliability, vol. 22, no. 4, pp. 67–120, 2012.
[10] J. Andrews, L. Briand, Y. Labiche, and A. Namin, "Using mutation analysis for assessing and comparing testing coverage criteria," IEEE Transactions on Software Engineering (TSE), vol. 32, no. 8, pp. 608–624, 2006.
[11] L. Zhang, J. Zhou, D. Hao, L. Zhang, and H. Mei, "Prioritizing JUnit test cases in absence of coverage information," in International Conference on Software Maintenance (ICSM), 2009, pp. 19–28.
[12] J. A. Jones and M. J. Harrold, "Empirical evaluation of the Tarantula automatic fault-localization technique," in International Conference on Automated Software Engineering (ASE), 2005, pp. 273–282.

[13] A. Arcuri, "On the automation of fixing software bugs," in International Conference on Software Engineering (ICSE), 2008, pp. 1003–1006.
[14] Y. Pei, Y. Wei, C. Furia, M. Nordio, and B. Meyer, "Code-based automated program fixing," in International Conference on Automated Software Engineering (ASE), 2011, pp. 392–395.
[15] Y. Wei, C. A. Furia, N. Kazmin, and B. Meyer, "Inferring better contracts," in International Conference on Software Engineering (ICSE), 2011, pp. 191–200.
[16] S. Elbaum, A. Malishevsky, and G. Rothermel, "Test case prioritization: a family of empirical studies," IEEE Transactions on Software Engineering (TSE), vol. 28, no. 2, pp. 159–182, Feb. 2002.
[17] J. Voas, "PIE: a dynamic failure-based technique," IEEE Transactions on Software Engineering (TSE), vol. 18, no. 8, pp. 717–727, Aug. 1992.
[18] Y. Jia and M. Harman, "An analysis and survey of the development of mutation testing," IEEE Transactions on Software Engineering (TSE), vol. 37, no. 5, pp. 649–678, 2011.
[19] W. Weimer, S. Forrest, C. Le Goues, and T. Nguyen, "Automatic program repair with evolutionary computation," Communications of the ACM, vol. 53, no. 5, pp. 109–116, 2010.
[20] C. Le Goues, W. Weimer, and S. Forrest, "Representations and operators for improving evolutionary software repair," in Genetic and Evolutionary Computation Conference (GECCO), 2012.
[21] A. Arcuri and L. Briand, "A practical guide for using statistical tests to assess randomized algorithms in software engineering," in International Conference on Software Engineering (ICSE), 2011, pp. 1–10.
[22] E. Fast, C. Le Goues, S. Forrest, and W. Weimer, "Designing better fitness functions for automated program repair," in Genetic and Evolutionary Computation Conference (GECCO), 2010, pp. 965–972.
[23] A. Vargha and H. D. Delaney, "A critique and improvement of the CL common language effect size statistics of McGraw and Wong," Journal of Educational and Behavioral Statistics, vol. 25, no. 2, pp. 101–132, 2000.
[24] S. Poulding and J. A. Clark, "Efficient software verification: Statistical testing using automated search," IEEE Transactions on Software Engineering (TSE), vol. 36, no. 6, pp. 763–777, Nov. 2010.
[25] Y. Qi, X. Mao, Y. Wen, Z. Dai, and B. Gu, "More efficient automatic repair of large-scale programs using weak recompilation," Science China Information Sciences, vol. 55, no. 12, pp. 2785–2799, 2012.
[26] B. Korel, G. Koutsogiannakis, and L. Tahat, "Application of system models in regression test suite prioritization," in International Conference on Software Maintenance (ICSM), 2008, pp. 247–256.
[27] H. Do, G. Rothermel, and A. Kinneer, "Empirical studies of test case prioritization in a JUnit testing environment," in International Symposium on Software Reliability Engineering (ISSRE), 2004, pp. 113–124.
[28] L. Zhang, S.-S. Hou, C. Guo, T. Xie, and H. Mei, "Time-aware test-case prioritization using integer linear programming," in International Symposium on Software Testing and Analysis (ISSTA), 2009, pp. 213–224.
[29] Y. Qi, X. Mao, Y. Lei, and C. Wang, "Using automated program repair for evaluating the effectiveness of fault localization techniques," in International Symposium on Software Testing and Analysis (ISSTA), 2013, pp. 191–201.
