Does Genetic Programming Work Well on Automated Program Repair?

2013 International Conference on Computational and Information Sciences

Yuhua Qi, Xiaoguang Mao, Yan Lei, Ziying Dai and Chengsong Wang
School of Computer, National University of Defense Technology, Changsha, China
{yuhua.qi, xgmao, yanlei}@nudt.edu.cn, {nudtdzy, jameschen186}@gmail.com

Abstract: Automated program repair has made important progress in the recent decade. One well-known repair tool is GenProg, which automates the patch generation process under the guidance of genetic programming. Although GenProg has successfully fixed many legacy faulty programs, the guidance effectiveness of the genetic programming used by GenProg has not even been justified through a comparison with a random search algorithm. In this paper we compare the guidance effectiveness of genetic programming and random search on program repair. The experimental results show that genetic programming does not perform better than random search in guiding the patch generation process.

Index Terms: Genetic programming; random search; automated program repair; automated debugging

978-0-7695-5004-6/13 $26.00 © 2013 IEEE. DOI 10.1109/ICCIS.2013.490

I. INTRODUCTION

In the past decade, much attention has been paid to automated program repair techniques [1][2][3][4][5][6], which facilitate the development of automated debugging [7]. Most publications in the field focus on the problem of how to generate candidate patches for validation and deployment [8], [9], [10], [11], [2]. One well-known tool, GenProg [8], [1], uses genetic programming to guide the patch generation process. Although experimental results show that GenProg can automatically repair many off-the-shelf legacy programs, there is no comparative evidence to justify the advantage of genetic programming over random search. In this paper we check, through an empirical experiment, whether genetic programming works better than random search on automated program repair.

Genetic programming is an extension of the traditional genetic algorithm in which each individual in the population is represented as a program variant (patched program). Each program variant is generated using one of the genetic algorithm operations: mutation or crossover. The acceptability of each variant is computed by a user-defined fitness function. The program variants that obtain high fitness scores are selected for the next round of evolution. The evolution process continues until a valid patch is found.

GenProg, a genetic method for automated software repair, can repair defects in deployed, legacy C programs without formal specifications [1][8]. Under the hypothesis that important functionality missing from one location in a program is likely to occur at another position in the same program, GenProg repairs a program by copying a few lines of code from other parts of the program or by modifying existing code, which results in many candidate patches. GenProg uses genetic programming to maintain the population of candidate patches and scores each patch by fitness evaluation. The final score (also called fitness) of each patch is a weighted sum of the positive and negative test cases that the patched program passes. A higher score indicates better acceptability of the corresponding patch. The genetic programming algorithm uses these scores to determine which patches are retained for the next generation.

The fitness evaluation, which computes the score of each patch, therefore plays an important role in the repair effectiveness of genetic programming. If the scores represent the true acceptability of the produced patches poorly, genetic programming will not guide the search process effectively and may even hamper the patch generation process. GenProg's fitness function evaluates a weighted sum of all test cases passed by a patched program. That is, the more test cases a patched program passes, the larger the score of the patch. This simplistic fitness evaluation, however, is likely to exhibit “all-or-nothing” behavior, and thus is not well correlated with the true distance between an individual and the global optimum. Since imprecise fitness functions can mislead the search process, we should experimentally compare genetic programming with random search to check whether genetic programming performs well on automated program repair.

In summary, this paper makes the following contributions:

• We are the first to doubt the effectiveness of genetic programming on automated program repair. Although GenProg, a genetic method for automated software repair, has demonstrated the ability to repair some faulty C programs, the advantage of the genetic programming used by GenProg over simple random search has still not been validated.

• We experimentally compare genetic programming and random search on repairing five faulty programs that shipped with real-world field failures. The results confirm our concern: genetic programming most often performs worse than random search.

II. GENETIC PROGRAMMING

As described earlier, genetic programming, a modified version of the genetic algorithm that has been applied to an impressive range of problems, has been used to solve the problem of automated program repair. Through the application of genetic algorithm operations, such as mutation and crossover, genetic programming creates and maintains the population of candidate patches.

For the fitness evaluation, there are two main approaches: formal specifications and test suites. Precise and complete formal specifications can guarantee an optimal acceptability computation for each patch, but they are rarely available despite recent advances in specification mining [12]. The alternative to formal specifications is to run the patched program against a test suite, which consists of negative test cases (revealing the presence of the fault) and positive test cases (characterizing the normal behavior of the target program); in fact, most existing techniques for automated program repair use test suites to check the effectiveness of candidate patches. Generally, genetic programming uses the weighted sum of the test cases passed by the patched program to represent the acceptability of the patch, and refers to this weighted sum as the fitness of the patch. That is, the acceptability of a patch is the dominant factor in its selection for the next round of evolution. Whether the fitness evaluation mechanism precisely reflects the acceptability of each patch therefore strongly influences the effectiveness of genetic programming.

GenProg, a state-of-the-art technique for automated program repair, uses genetic programming to guide the repair process and evaluates the fitness of each patch through testing. Although GenProg has successfully fixed many legacy programs, the effectiveness of genetic programming has still not been established through a comparison with other approaches, not even one as simple as random search.

In the following sections, we build a repair system in which a random search algorithm guides the patch generation process, and then compare the repair effectiveness of this system with that of GenProg.
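The weighted-sum fitness and the fitness-based selection described above can be sketched as follows. This is a minimal illustration and not GenProg's implementation: the weights W_POS and W_NEG, the function names, the toy variants, and the choice of simply keeping the top-scoring variants are all assumptions made for the example, since the text states only that the sum is weighted and that high-scoring variants are selected.

```python
from typing import List, Tuple

# Assumed weights; the text does not give concrete values.
W_POS = 1.0   # weight of each passed positive test case
W_NEG = 10.0  # weight of each passed negative test case


def fitness(passed_pos: int, passed_neg: int) -> float:
    """Weighted sum of the positive and negative test cases a variant passes."""
    return W_POS * passed_pos + W_NEG * passed_neg


def select_survivors(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Keep the k highest-scoring variants (an assumed selection scheme)."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical variants: (name, positive tests passed, negative tests passed).
variants = [("v1", 78, 0), ("v2", 78, 0), ("v3", 40, 0), ("v4", 78, 1)]
scored = [(name, fitness(pos, neg)) for name, pos, neg in variants]
survivors = select_survivors(scored, k=2)
```

In this toy population v1 and v2 receive identical scores even though they may differ in how close they are to a valid patch. That plateau is the “all-or-nothing” behavior the text is concerned about: the score gives the search nothing to distinguish such variants.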

III. RANDOM SEARCH

In contrast to genetic programming, we use a random search algorithm to guide the patch generation process: it keeps the mutation and crossover operators but eliminates the fitness evaluation mechanism. That is, in the patch generation process, candidate patches are produced according to both the repair rules and the fault information provided by the fault localization process. We implemented this approach in the tool RSRepair, a repair system that uses random search (instead of genetic programming) to guide the patch generation process. RSRepair is written in OCaml and is based on GenProg (footnote 1).

Specifically, for fault localization, our implementation uses the same approach as GenProg: the components (in terms of statements) visited only by the negative test cases are given more repair effort. For patch generation, since to our knowledge no existing fitness function has been shown to correlate well with the true distance between a candidate patch and a truly valid patch, RSRepair produces candidate patches by modifying the defective code with random search. The rules for code modification used by RSRepair are the same as those used by GenProg: copying a few lines of code from other parts of the program or modifying existing code (see [1]). For patch validation, we validate a candidate patch by regression testing. If the patched program fails any test case, the patch is considered invalid and the validation process terminates at once. The patch is considered valid iff all the test cases pass.

Footnote 1. Available: http://epr.adaptive.cs.unm.edu/software/setup.html

IV. EXPERIMENTAL EVALUATION

In this section, we compare the repair effectiveness of RSRepair and GenProg, and check whether the genetic programming used by GenProg guides the search more effectively than random search.

A. Subject Programs

We selected the programs used in the most recent work on GenProg [8] as the experimental benchmarks (footnote 2). All of them are written in C and are distinct real-world systems, from different domains, with real-life bugs. The repair process, however, is often time-consuming; a complete experimental evaluation on all the programs would take too much time (see [8, Table II]). In our experiment, we randomly selected five of the eight programs; for each selected program, we considered one bug version as the experimental subject. TABLE I describes our subject programs in detail. The LOC column lists the size of each subject program in lines of code, and the last two columns give the number of positive test cases and the version information. Note that we obtained these test cases by running the original test cases (provided by [8]) against the corresponding programs and eliminating those for which some failure occurred. We assigned one negative test case to each subject program to reproduce the bug.

Footnote 2. Available: https://church.cs.virginia.edu/genprog/archive/genprog-105-bugs-tarballs/

TABLE I
SUBJECT PROGRAMS

Program     LOC         Test Cases   Version
libtiff     77,000      78           bug-0860361d-1ba75257t
gmp         145,000     146          bug-14166-14167
python      407,000     303          bug-69783-69784
php         1,046,000   4,986        php-5.3.6
wireshark   2,814,000   63           bug-37112-37111

B. Experimental Setup

For the purposes of comparison, we separately ran RSRepair and GenProg to repair the five subject programs described in TABLE I. All the experimental parameters for GenProg in our experiment are the same as the settings in [8]: we limited the population size for each generation to 40, with a maximum of 10 generations for each repair process. That is, in one repair process, GenProg can produce no more than 40*10=400 candidate patches. For a fair comparison, we also limited RSRepair to a maximum of 400 candidate patches per repair process. We considered that RSRepair (GenProg) failed to repair a subject program in one repair process if no valid patch was found within 400 candidate patches. All the experiments ran on an Ubuntu 10.04 machine with a 2.33 GHz Intel quad-core CPU and 4 GB of memory. For RSRepair and GenProg, we separately performed 100 trials for each program.

TABLE II
REPAIR EFFECTIVENESS BY RSREPAIR AND GENPROG

Program     Approach   Avg. Test Cases   Avg. Time (s)   Success Rate
                       Per Repair        Per Repair
libtiff     RSRepair   60                269.149         100%
            GenProg    157               499.395         100%
gmp         RSRepair   311               606.511         58%
            GenProg    917               714.092         13%
python      RSRepair   434               452.185         37%
            GenProg    4,546             2,560.048       30%
php         RSRepair   5,014             607.018         100%
            GenProg    12,975            867.541         91%
wireshark   RSRepair   167               2,159.907       8%
            GenProg    1,033             3,086.134       35%

* We separately ran RSRepair and GenProg 100 times on each of the five subject programs and only recorded the trials leading to a successful repair. The "Avg. Test Cases Per Repair" and "Avg. Time (s) Per Repair" columns give the average number of test cases executed and the average wall-clock time (in seconds) for one repair process, respectively. The "Success Rate" column reports the fraction of trials that were successful.
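The early-terminating patch validation of Section III, the budget of 400 candidate patches per repair process, and the metrics reported in TABLE II can be sketched as follows. This is only a toy model under stated assumptions, not RSRepair itself: a "patch" is stood in for by an integer, a test case by a predicate, and the names validate, repair_trial, summarize and MAX_CANDIDATES are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

MAX_CANDIDATES = 400  # candidate-patch budget per repair process

Patch = int                     # toy stand-in for a patched program
Test = Callable[[Patch], bool]  # returns True when the test case passes


def validate(patch: Patch, tests: Sequence[Test]) -> Tuple[bool, int]:
    """Run the tests in order and stop at the first failure.

    Returns (is_valid, number_of_test_cases_executed).
    """
    executed = 0
    for test in tests:
        executed += 1
        if not test(patch):
            return False, executed
    return True, executed


def repair_trial(generate: Callable[[random.Random], Patch],
                 tests: Sequence[Test],
                 rng: random.Random) -> Tuple[bool, int]:
    """One repair process: try random candidates until one is valid
    or the budget is exhausted. No fitness score is computed."""
    total_executed = 0
    for _ in range(MAX_CANDIDATES):
        ok, executed = validate(generate(rng), tests)
        total_executed += executed
        if ok:
            return True, total_executed
    return False, total_executed


def summarize(trials: List[Tuple[bool, int]]) -> Tuple[float, Optional[float]]:
    """Success rate over all trials, and average test cases executed
    over the successful trials only (None if no trial succeeded)."""
    successful = [executed for ok, executed in trials if ok]
    success_rate = len(successful) / len(trials)
    avg_tests = sum(successful) / len(successful) if successful else None
    return success_rate, avg_tests


# Toy problem: a "patch" is valid when it is divisible by both 2 and 5.
tests = [lambda p: p % 2 == 0, lambda p: p % 5 == 0]
rng = random.Random(0)
trials = [repair_trial(lambda r: r.randrange(100), tests, rng)
          for _ in range(100)]
success_rate, avg_tests = summarize(trials)
```

Because validation stops at the first failing test case, an invalid candidate usually costs far fewer test executions than a full run of the suite, which is one plausible reason why the number of test cases executed per repair is a meaningful cost measure alongside wall-clock time.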

C. Experimental Results

TABLE II summarizes the experimental comparison between RSRepair and GenProg. For all five subject programs, RSRepair takes less time to repair the target program than GenProg; the average time per repair is roughly 15% (gmp) to 82% (python) lower. For the success rate within the maximum number of candidate patches (i.e., 400 in the experiment) generated per repair process, RSRepair performs better than GenProg in most cases: its success rate is higher on gmp, python and php, equal on libtiff, and lower only on wireshark. The worse success rate indicates that the fitness function evaluation used by GenProg does not always work well for guiding the search for optimal or near-optimal patches in the patch generation step and even, in most cases, misleads the search process.

We are not surprised by these experimental results. In fact, Andrea Arcuri and Lionel Briand (in their ICSE 2011 paper [13, Page 3]) have expressed their concern about the effectiveness of the genetic programming used by GenProg, and pointed out that genetic programming should be compared against random search to check its effectiveness. However, the most recent work on GenProg [8], published in ICSE 2012, still did not make this comparison. In this sense, our experiment supplements [8], and confirms the concern that genetic programming does not always work better than random search, and even worsens the patch search process in most cases in our experiment.

A possible reason for the worse repair effectiveness of GenProg can be traced to the paper [14], written by authors of GenProg: current fitness functions, including the one used by GenProg, are either overly simplistic or likely to exhibit “all-or-nothing” behavior, and thus are not well correlated with the true distance between an individual and the global optimum. Since imprecise fitness functions can mislead the search process, it should not be surprising that in our experimental evaluation GenProg has worse repair effectiveness on most programs than RSRepair, which uses random search.

The only exception is wireshark, for which GenProg outperforms RSRepair in success rate (35% versus 8%). Having analyzed the repair process, we find that for wireshark the patched programs are more likely to fail to compile in the initial phase of the repair process than for the other programs. Thus, we suspect that the reason for the exception is that the fitness function is good at eliminating bad patches (those that fail to compile) and so yields more compilable candidate patches in the subsequent repair process.

V. CONCLUSIONS

In this paper we check whether genetic programming performs better than random search on automated program repair, because, to our knowledge, there exists no perfect fitness evaluation mechanism that precisely computes the true distance between an individual patch and the global optimum patch. The experiment with RSRepair and GenProg, which use random search and genetic programming, respectively, on five legacy programs that shipped with real-life field failures presents evidence that genetic programming performs even worse than random search.

REFERENCES

[1] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “GenProg: A generic method for automatic software repair,” IEEE Transactions on Software Engineering (TSE), vol. 38, no. 1, pp. 54–72, 2012.
[2] A. Arcuri, “Evolutionary repair of faulty software,” Applied Soft Computing, vol. 11, no. 4, pp. 3494–3514, 2011.
[3] Y. Pei, Y. Wei, C. Furia, M. Nordio, and B. Meyer, “Code-based automated program fixing,” in International Conference on Automated Software Engineering (ASE), 2011, pp. 392–395.
[4] T. Ackling, B. Alexander, and I. Grunert, “Evolving patches for software repair,” in Genetic and Evolutionary Computation (GECCO), 2011, pp. 1427–1434.
[5] V. Debroy and W. Wong, “Using mutation to automatically suggest fixes for faulty programs,” in International Conference on Software Testing, Verification and Validation (ICST), 2010, pp. 65–74.
[6] G. Jin, L. Song, W. Zhang, S. Lu, and B. Liblit, “Automated atomicity-violation fixing,” in Programming Language Design and Implementation (PLDI), 2011, pp. 389–400.
[7] A. Zeller, “Automated debugging: Are we close?” Computer, vol. 34, no. 11, pp. 26–31, 2001.
[8] C. Le Goues, M. Dewey-Vogt, S. Forrest, and W. Weimer, “A systematic study of automated program repair: fixing 55 out of 105 bugs for $8 each,” in International Conference on Software Engineering (ICSE), 2012, pp. 3–13.
[9] Y. Wei, Y. Pei, C. A. Furia, L. S. Silva, S. Buchholz, B. Meyer, and A. Zeller, “Automated fixing of programs with contracts,” in International Symposium on Software Testing and Analysis (ISSTA), 2010, pp. 61–72.
[10] H. Samimi, M. Schäfer, S. Artzi, T. Millstein, F. Tip, and L. Hendren, “Automated repair of HTML generation errors in PHP applications using string constraint solving,” in International Conference on Software Engineering (ICSE), 2012, pp. 277–287.
[11] P. Liu and C. Zhang, “Axis: automatically fixing atomicity violations through solving control constraints,” in International Conference on Software Engineering (ICSE), 2012, pp. 299–309.
[12] C. Le Goues and W. Weimer, “Measuring code quality to improve specification mining,” IEEE Transactions on Software Engineering (TSE), vol. 38, no. 1, pp. 175–190, 2012.
[13] A. Arcuri and L. Briand, “A practical guide for using statistical tests to assess randomized algorithms in software engineering,” in International Conference on Software Engineering (ICSE), 2011, pp. 1–10.
[14] E. Fast, C. Le Goues, S. Forrest, and W. Weimer, “Designing better fitness functions for automated program repair,” in Genetic and Evolutionary Computation (GECCO), 2010, pp. 965–972.