Comparison of Unit-Level Automated Test Generation Tools

Shuang Wang and Jeff Offutt
Software Engineering, George Mason University, Fairfax, VA 22030, USA
{swangb,offutt}@gmu.edu

Abstract

Data from projects worldwide show that many software projects fail and that most are completed late or over budget. Unit testing is a simple but effective technique to improve software in terms of quality, flexibility, and time-to-market. A key idea of unit testing is that each piece of code needs its own tests, and the best person to design those tests is the developer who wrote the software. However, generating tests for each unit by hand is very expensive, possibly prohibitively so. Automatic test data generation is therefore essential to support unit testing, and as unit testing receives more attention, developers are using automated unit test data generation tools more often. However, developers have very little information about which tools are effective. This experiment compared three well-known, publicly accessible unit test data generation tools: JCrasher, TestGen4J, and JUB. We applied them to Java classes and evaluated them based on their mutation scores. As a comparison, we created two additional sets of tests for each class: one test set contained random values and the other contained values to satisfy edge coverage. Results showed that the automatic test data generation tools generated tests with almost the same mutation scores as the random tests.
1 Introduction An important goal of unit testing is to verify that each unit of software works properly. Unit testing allows many problems to be found early in software development. A comprehensive unit test suite that runs together with daily builds is essential to a successful software project [20]. As the computing field uses more agile processes, relies more on test driven development, and has higher reliability requirements for software, unit testing will continue to increase in importance. However, some developers still do not do much unit testing. One possible reason is they do not think the time and
effort will be worthwhile [2]. Another possible reason is that unit tests must be maintained, and maintenance for unit tests is often not budgeted. A third possibility is that developers may not know how to design and implement high quality unit tests; they are certainly not taught this crucial knowledge in most undergraduate computer science programs. Using automated unit test tools instead of manual testing can help with all three problems. Automated unit test tools can reduce the time and effort needed to design and implement unit tests, they can make it easier to maintain tests as the program changes, and they can encapsulate knowledge of how to design and implement high quality tests so that developers do not need to know as much. But an important question developers must answer is "which tool should I use?"

This empirical study looks at the most technically challenging part of unit testing, test data generation. We selected tools to empirically evaluate based on the following three factors: (1) the tool must automatically generate test values with little or no input from the tester, (2) the tool must test Java classes, and (3) the tool must be free and readily available (for example, through the web). We selected three well-known, publicly accessible automated tools (Section 2.1). The first is JCrasher [3], a random testing tool that tries to cause the class under test to "crash." The second is TestGen4J [11], whose primary focus is to exercise boundary value testing of arguments passed to methods. The third is JUB (JUnit test case Builder) [19], a framework based on the Builder pattern [7]. We used these tools to automatically generate tests for a collection of Java classes (Section 2.3). As a control, our second step was to manually generate two additional sets of tests: a set of purely random tests was generated for each class as a "minimal effort" comparison, and tests were generated to satisfy edge coverage on the control flow graph. Third, we seeded faults using the mutation analysis tool muJava [9, 10] (Section 2.4). MuJava is an automated class
mutation system that automatically generates mutants for Java classes, and evaluates test sets by calculating the number of mutants killed. Finally, we applied the tests to muJava (Section 2.4), and compared their mutation scores (the percentage of killed mutants). Results are given in Section 3 and discussed in Section 4. Related work is presented in Section 5, and Section 6 concludes the paper.
2 Experimental Design

This section describes the design of the experiment. First, each unit test data generation tool used is described, then the process used to manually generate additional tests is presented. Next, the Java classes used in the study are discussed and the muJava mutation testing tool is presented. The process used in conducting the experiment is then presented, followed by possible threats to validity.

2.1 Subjects: Unit Testing Tools

Table 1 summarizes the three tools examined in this study. Each tool is described in detail below.

JCrasher [3] is an automatic robustness testing tool for Java classes. JCrasher examines the type information of methods in Java classes and constructs code fragments that will create instances of different types to test the behavior of the public methods with random data. JCrasher explicitly attempts to detect bugs by causing the class under test to crash, that is, to throw an undeclared runtime exception. Although limited by the randomness of the input values, this approach has the advantage of being completely automatic: no inputs are required from the developer.

TestGen4J [11] automatically generates JUnit test cases from Java class files. Its primary focus is to perform boundary value testing of the arguments passed to methods. It uses rules, written in a user-configurable XML file, that define boundary conditions for the data types. The test code is separated from the test data with the help of JTestCase (http://jtestcase.sourceforge.net/).

JUB (JUnit test case Builder) [19] is a JUnit test case generator framework accompanied by a number of IDE-specific extensions. These extensions (tools, plug-ins, etc.) are invoked from within the IDE and must store the generated test case code inside the source code repository administered by the IDE.

Table 1. Automated Unit Testing Tools
Name       Version             Inputs       Interface
JCrasher   0.1.9 (2004)        Source File  Eclipse Plug-in
TestGen4J  0.1.4-alpha (2005)  Jar File     Command Line (Linux)
JUB        0.1.2 (2002)        Source File  Eclipse Plug-in

2.2 Additional Test Sets

As a control comparison, we generated two additional sets of tests for each class by hand, with some limited tool support. Testing with random values is widely considered the "weakest effort" testing strategy, and it seems natural to expect a unit test data generator tool to at least do better than random value generation. We wrote a special-purpose tool that generated random tests in two steps. For each test, the tool arbitrarily selected a method from the class to test (the number of methods in each class is given in Table 2), then randomly generated values for each parameter of that method. The tool did not parse the classes; the methods and parameters were hard-coded into tables in the tool. We created the same number of random tests for each subject class as the tool from Section 2.1 that created the most tests for that class. For all the subject classes, JCrasher generated the most tests, so the study used the same number of random tests as JCrasher tests.

We elected to use a test criterion as a second control. Formal test criteria are widely promoted by researchers and educators, but are only spottily used in industry [8]. We chose one of the weakest and most basic criteria: edge coverage on the control flow graphs. We created control flow graphs by hand for each method in each class, then designed inputs to cover each edge in the graphs.
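The paper does not include the random generator itself, so the following is a minimal illustrative sketch of the two-step process described above: pick a hard-coded method at random, then generate a random value for each of its parameters and emit the source of a muJava-style test method. The class name (BoundedStack), the method table, the parameter counts, and the output format are our own assumptions for illustration, not the authors' tool; in particular, the generated test here returns only a status string, a simplification of what a real harness would observe.

import java.util.Random;

/**
 * Sketch of the two-step random test generator from Section 2.2
 * (illustrative only; not the authors' actual tool).
 * Step 1: pick a method of the class under test at random.
 * Step 2: generate a random value for each of its parameters.
 */
public class RandomTestSketch {
    private static final Random RNG = new Random();

    // Hard-coded method table for a hypothetical class under test
    // (assumed names; the paper's tool hard-coded the real method tables).
    private static final String[] METHODS = { "push", "pop", "top", "isFull" };
    private static final int[] PARAM_COUNTS = { 1, 0, 0, 0 };

    public static void main(String[] args) {
        for (int t = 0; t < 5; t++) {
            System.out.println(generateTest(t));
            System.out.println();
        }
    }

    // Emits the source text of one muJava-style, String-returning test method.
    static String generateTest(int id) {
        int m = RNG.nextInt(METHODS.length);          // step 1: random method
        StringBuilder call = new StringBuilder(METHODS[m]).append('(');
        for (int p = 0; p < PARAM_COUNTS[m]; p++) {   // step 2: random argument values
            if (p > 0) call.append(", ");
            call.append(RNG.nextInt());
        }
        call.append(')');
        return "public String test" + id + "() {\n"
             + "    BoundedStack s = new BoundedStack();\n"
             + "    try { s." + call + "; return \"ok: " + call + "\"; }\n"
             + "    catch (Exception e) { return e.toString(); }\n"
             + "}";
    }
}

Hard-coding the method table mirrors the paper's statement that the tool did not parse the classes.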
2.3 Java Classes Tested

Table 2 lists the Java classes used in this experiment. BoundedStack is a small, fixed-size implementation of a stack from the Eclat tool's website [14]. Inventory is taken from the MuClipse [18] project, the Eclipse plug-in version of muJava. Node is a mutable set of Strings that is a small part of a publish/subscribe system; it was a sample solution from a graduate class at George Mason University. Queue is a mutable, bounded FIFO data structure of fixed size. Recipe is also taken from the MuClipse project; it is a JavaBean class that represents a real-world Recipe object. Twelve is another sample solution; it tries to combine three given integers with arithmetic operators to compute exactly twelve. VendingMachine is from Ammann and Offutt's book [1]; it models a simple vending machine for chocolate candy.

Table 2. Subject Classes Used
Name             LOC  Methods
BoundedStack     85   11
Inventory        67   11
Node             77   9
Queue            59   6
Recipe           74   15
TrashAndTakeOut  26   2
Twelve           94   1
VendingMachine   52   6
Total            534  61

2.4 MuJava

Our primary measurement of the test sets in this experiment is their ability to find faults. MuJava is used to seed faults (mutants) into the classes and to evaluate how many mutants each test set kills. MuJava [9, 10] is a mutation system for Java classes. It automatically generates mutants for both traditional mutation testing and class-level mutation testing, and it can test individual classes and packages of multiple classes. Tests are supplied by the users as sequences of method calls to the classes under test, encapsulated in methods in separate classes. MuJava creates mutants for Java classes according to 24 operators that include object-oriented operators; the method-level (traditional) mutants are based on the selective operator set of Offutt et al. [13]. After creating mutants, muJava allows the tester to enter and run tests, and it evaluates the mutation coverage of the tests. In muJava, the tests for the class under test are encoded in separate classes that make calls to methods in the class under test. Mutants are created and executed automatically.
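The mutation score used throughout this paper is the percentage of mutants killed. As a minimal sketch of that computation (our own simplification for illustration, not muJava's implementation), assume each String-returning test is run against the original class and against every mutant, and that a mutant is killed when at least one test's returned String differs between the two:

import java.util.List;

/**
 * Simplified sketch of mutation scoring (illustrative; not muJava's code).
 * Each test returns a String, as in muJava's test format; a mutant is
 * counted as killed if some test's String differs between the original
 * class and that mutant.
 */
public class MutationScoreSketch {

    /**
     * originalResults: each test's output on the unmutated class.
     * mutantResults:   one String[] per mutant, in the same test order.
     * Returns the mutation score as a percentage of mutants killed.
     */
    static double mutationScore(String[] originalResults, List<String[]> mutantResults) {
        int killed = 0;
        for (String[] mutant : mutantResults) {
            for (int i = 0; i < originalResults.length; i++) {
                if (!originalResults[i].equals(mutant[i])) { // differing output kills
                    killed++;
                    break;
                }
            }
        }
        return 100.0 * killed / mutantResults.size();
    }

    public static void main(String[] args) {
        String[] original = { "top=3", "isEmpty=false" };
        List<String[]> mutants = List.of(
            new String[] { "top=3", "isEmpty=false" },   // survives: same outputs
            new String[] { "top=0", "isEmpty=false" });  // killed: first test differs
        System.out.println(mutationScore(original, mutants) + "%");  // prints 50.0%
    }
}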
Table 3. Classes and Mutants
Classes          Traditional  Class  Total
BoundedStack     224          4      228
Inventory        101          50     151
Node             18           4      22
Queue            117          6      120
Recipe           101          26     127
TrashAndTakeOut  104          0      104
Twelve           234          0      234
VendingMachine   77           7      84
Total            976          97     1073
2.5 Experimental Conduct

Figure 1 illustrates the experimental process. The Java classes are represented by the leftmost box, P. Each of the three automated tools (JCrasher, TestGen4J, and JUB) was used to create a set of tests (Test Set JC, Test Set TG, and Test Set JUB). Then we manually created the random tests (Test Set Ran), followed by the edge coverage tests (Test Set EC). MuJava was designed and developed before JUnit and other widely used test harness tools, so it has its own syntactic requirements: each muJava test must be in a public method that returns a String that encapsulates the result of the test. Thus many of the tests from the three tools had to be modified to run with muJava; this modification is illustrated in Figure 2. Then muJava was used to generate mutants for each subject class and to run all five test sets against the mutants. This resulted in five mutation scores for each subject class (40 mutation scores in all). Because muJava separates scores for the traditional mutation operators from the class mutation operators, we also kept these scores separate.

2.6 Threats to Validity

As with any study that involves specific programs or classes, there is no guarantee that the classes used are "representative" of the general population. As yet, nobody has developed a general theory for how to choose representative classes for empirical studies, or for how many classes are needed. This general issue has a negative impact on our ability to apply statistical analysis tools and reduces the external validity of many software engineering empirical studies, including this one.

Another question is whether the three tools used are representative. We searched for free, publicly available tools that generate tests and were somewhat surprised at how few we found. We initially hoped to use an advanced tool called Agitar Test Runner; however, it is no longer available to academic users.

Another possible problem with internal validity may be in the manual steps that had to be applied. Tests from the three tools had to be translated to muJava tests. These changes were only to the structure of the test methods and did not affect the values, so it is unlikely that these translations affected the results. The random and edge coverage tests were generated by hand with some tool support; they were generated without knowledge of the mutants, so as to avoid any bias.
Figure 1. Experiment Implementation
[Diagram omitted in this text version. The subject classes P are given to JCrasher, TestGen4J, JUB, the manual random process, and the manual edge coverage process, which produce Test Set JC, Test Set TG, Test Set JUB, Test Set Ran, and Test Set EC; each test set is run against the muJava mutants to yield the JCrasher, TestGen4J, JUB, Random, and Edge Cover mutation scores.]

Tool-generated test (before conversion):

public void test242() throws Throwable {
    try {
        int i1 = 1;
        int i2 = 1;
        int i3 = -1;
        Twelve.twelve (i1, i2, i3);
    } catch (Exception e) {
        dispatchException (e);
    }
}

Converted muJava test (after conversion):

public String test242() throws Throwable {
    try {
        int i1 = 1;
        int i2 = 1;
        int i3 = -1;
        return Twelve.twelve (i1, i2, i3);
    } catch (Exception e) {
        return e.toString();
    }
}

Figure 2. Conversion to MuJava Tests
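The rewrite shown in Figure 2 is mechanical: change the return type to String, return the value computed in the try block, and return the exception's text in the catch block. The paper does not say whether the authors did this by hand or by script; the following is our own minimal sketch of the rule, handling only this example (the method names and dispatchException come from the figure, everything else is an assumption).

/**
 * Illustrative sketch of the Figure 2 rewrite (not the authors' tooling).
 * Converts a tool-generated void test into a muJava-style String test:
 *   public void testN()    ->  public String testN()
 *   Twelve.twelve(...)     ->  return Twelve.twelve(...)
 *   dispatchException(e);  ->  return e.toString();
 * Only this specific example is handled; a real converter would need to
 * inspect each test's final statement and its return type.
 */
public class MuJavaTestConverter {

    static String convert(String toolGeneratedTest) {
        return toolGeneratedTest
            .replaceAll("public\\s+void\\s+(test\\w+)", "public String $1")
            .replaceAll("(\\n\\s*)(Twelve\\.twelve\\s*\\()", "$1return $2")
            .replaceAll("dispatchException\\s*\\(e\\);", "return e.toString();");
    }

    public static void main(String[] args) {
        String before =
            "public void test242() throws Throwable {\n" +
            "    try {\n" +
            "        int i1 = 1; int i2 = 1; int i3 = -1;\n" +
            "        Twelve.twelve (i1, i2, i3);\n" +
            "    } catch (Exception e) { dispatchException (e); }\n" +
            "}";
        System.out.println(convert(before));
    }
}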
3 Results
The number of mutants for each class is shown in Table 3. MuJava generates traditional (method-level) and class mutants separately. The traditional operators focus on individual statements, and the class operators focus on connections between classes, especially in inheritance hierarchies. Node looks anomalous because of its small number of mutants: it has only 18 traditional mutants over 77 lines of code. However, those statements are distributed over nine, mostly very small, methods. Most methods are very short (four have only one statement), they use no arithmetic, shift, or logical operators, they have no assignments, and they have only a few decision statements. Thus, Node has few locations where muJava's mutation operators could be applied.

Results from running the five sets of tests under muJava are shown in Tables 4 through 7. Table 4 shows the number of tests in each test set and the mutation scores on the traditional mutants, in terms of the percentage of mutants killed. The "Total" row gives the sum of the tests for the subject classes and the mutation score across all subject classes. As can be seen, the random tests did better than the TestGen and JUB tests, but not as well as the JCrasher tests. The edge coverage tests, however, kill 24% more mutants than the strongest tool, with fewer tests. Table 5 shows the same data for the class level mutants. There is more diversity in the scores from the three tools, and JCrasher killed 12% more class level mutants than the random tests did. However, the edge coverage tests are still far stronger, killing 21% more mutants than the strongest performing tool (JCrasher). Table 6 combines the scores from Tables 4 and 5 for both traditional and class level mutants. Again, the JUB tests are the weakest; JCrasher, TestGen, and the random tests are fairly close; and the edge coverage tests kill far more mutants, 24% more than the JCrasher tests.

Table 7 summarizes the data for all subject classes for each set of tests. Because the number of tests diverged widely, we asked the question "how efficient is each test generation method?" To approximate efficiency, we computed the number of mutants killed per test. Not surprisingly, the edge coverage tests were at the high end with a score of 6.5 (they killed about 66% of the 1073 mutants, roughly 711, with only 109 tests, or about 6.5 mutants per test). TestGen generated the fewest tests and came out as the second most efficient, at 5.2.
Table 4. Traditional Mutants Killed
                 JCrasher          TestGen           JUB               Edge Coverage     Random
Classes          Tests  % Killed   Tests  % Killed   Tests  % Killed   Tests  % Killed   Tests  % Killed
BoundedStack     21     22.8%      10     23.7%      11     17.4%      17     46.9%      21     22.3%
Inventory        20     68.3%      10     60.4%      11     68.3%      19     60.4%      20     65.3%
Node             16     50.0%      16     50.0%      9      33.3%      14     61.1%      16     50.0%
Queue            9      39.3%      5      34.2%      9      41.0%      10     51.3%      9      15.4%
Recipe           29     78.2%      10     70.3%      15     35.6%      22     75.2%      29     57.4%
TrashAndTakeOut  13     64.4%      2      30.8%      2      30.8%      5      70.2%      13     59.6%
Twelve           27     35.0%      1      0.0%       1      0.0%       12     86.3%      27     35.0%
VendingMachine   14     10.4%      5      9.1%       6      9.1%       10     75.3%      14     10.4%
Total            149    42.1%      59     28.0%      64     24.3%      109    66.2%      149    36.2%

Table 5. Class Mutants Killed
                 JCrasher          TestGen           JUB               Edge Coverage     Random
Classes          Tests  % Killed   Tests  % Killed   Tests  % Killed   Tests  % Killed   Tests  % Killed
BoundedStack     21     25.0%      10     25.0%      11     25.0%      17     75.0%      21     25.0%
Inventory        20     44.0%      10     36.0%      11     36.0%      19     76.0%      20     40.0%
Node             16     25.0%      16     25.0%      9      25.0%      14     25.0%      16     25.0%
Queue            9      33.3%      5      33.3%      9      33.3%      10     33.3%      9      16.7%
Recipe           29     73.1%      10     53.8%      15     11.0%      22     80.8%      29     38.5%
TrashAndTakeOut  13     N/A        2      N/A        2      N/A        5      N/A        13     N/A
Twelve           27     N/A        1      N/A        1      N/A        12     N/A        27     N/A
VendingMachine   14     0.0%       5      0.0%       6      0.0%       10     0.0%       14     0.0%
Total            149    46.4%      59     37.1%      64     25.6%      109    67.0%      149    34.0%
(N/A: TrashAndTakeOut and Twelve have no class mutants; see Table 3.)

Table 6. Total Mutants Killed
                 JCrasher          TestGen           JUB               Edge Coverage     Random
Classes          Tests  % Killed   Tests  % Killed   Tests  % Killed   Tests  % Killed   Tests  % Killed
BoundedStack     21     22.8%      10     23.2%      11     17.5%      17     47.4%      21     22.4%
Inventory        20     60.3%      10     52.3%      11     57.6%      19     65.6%      20     57.0%
Node             16     45.5%      16     45.5%      9      31.8%      14     54.5%      16     45.5%
Queue            9      37.4%      5      32.5%      9      39.0%      10     48.8%      9      14.6%
Recipe           29     77.2%      10     66.9%      15     30.6%      22     76.4%      29     53.5%
TrashAndTakeOut  13     64.4%      2      30.8%      2      30.8%      5      70.2%      13     59.6%
Twelve           27     35.0%      1      0.0%       1      0.0%       12     86.3%      27     35.0%
VendingMachine   14     9.5%       5      8.3%       6      8.3%       10     69.0%      14     9.5%
Total            149    42.5%      59     28.8%      64     24.4%      109    66.3%      149    36.0%

Table 7. Summary Data
                        % Killed                        Efficiency
Tool           Tests    Traditional  Class    Total     (Killed / Tests)
JCrasher       149      42.1%        46.4%    42.5%     3.1
TestGen        59       28.0%        37.1%    28.8%     5.2
JUB            64       24.3%        25.6%    24.4%     4.1
Edge Coverage  109      66.2%        67.0%    66.3%     6.5
Random         149      36.2%        34.0%    36.0%     2.6

Figure 3. Total Percent Mutants Killed by Each Test Set
[Bar chart omitted in this text version: JCrasher 43%, TestGen 28%, JUB 24%, Edge Coverage 66%, and Random 36% of all mutants killed.]
JCrasher and the random tests were the least efficient; they generated a lot of tests without much obvious benefit, which places a burden on the developers who must evaluate the results of each test. Figure 3 illustrates the total percent of mutants killed by each test set in a bar chart. The difference between the edge coverage tests and the others is remarkable. Tables 8 and 9 give a more detailed look at the scores for each mutation operator. The mutation operators are described on the muJava website [10]. No mutants were generated for five traditional operators (AODS, SOR, LOR, LOD, or ASRS) or for most of the class operators (IHI, IHD, IOP, IOR, ISI, ISD, IPC, PNC, PMD, PPD, PCI, PCC, PCD, PRV, OMR, OMD, OAN, EOA, or EOC), so they are not shown. The edge coverage tests had the highest scores for all mutation operators. None of the test sets did particularly well on the arithmetic operator mutants (operators whose names start with the letter 'A').
4 Analysis and Discussion

We have anecdotal evidence from in-class exercises that when hand crafting tests to kill mutants, it is trivially easy to kill between 40% and 50% of the mutants. This anecdotal note is confirmed by these data, where random values achieved an average mutation score of 36%. It was quite distressing to find that the three tools did little better than random testing. JCrasher was slightly better (6.5% higher overall), TestGen was worse (a 7.2% lower mutation score), and JUB was even worse (an 11.6% lower mutation score).

An interesting observation from Table 6 is that the scores for VendingMachine are much lower for all sets of tests except edge coverage; the other four mutation scores are below 10%. The reason is probably the relative complexity of VendingMachine. It has several multi-clause predicates that determine most of its behavior:

(coin != 10 && coin != 25 && coin != 100)
(credit >= 90)
(credit < 90 || stock.size() == MAX)

MuJava creates dozens of mutants on these predicates, and the mostly random values created by the three tools in this study have a small chance of killing those mutants.

Another interesting observation from Table 6 is that the scores for BoundedStack were the second lowest for all the test sets except edge coverage (in which it was the lowest). A difference in that class is that only two of the eleven methods have parameters. The three testing tools depend largely on the method signature, so fewer parameters may mean weaker tests.

Another finding is that JCrasher got the highest mutation score among the three tools. We examined the tests generated by JCrasher and concluded that this is because JCrasher uses invalid values to attempt to "crash" the class, as shown in Figure 4. JUB only generates tests that use 0 for integers and null for general objects, and TestGen4J generates "normal" inputs, such as blanks and empty strings. JCrasher, of course, also created many more tests than the other two tools.

public void test18() throws Throwable {
    try {
        String s1 = "~!@$$%^&*()_+{}|[]';:/.,?`-=";
        Node n2 = new Node();
        n2.disallow (s1);
    } catch (Exception e) {
        dispatchException (e);
    }
}

Figure 4. JCrasher Test

The measure of efficiency in Table 7 is a bit biased with mutation, as many mutants are very easy to kill. It is quite common for the first few tests to kill a lot of mutants and subsequent tests to kill fewer mutants, leading to a sort of diminishing returns. We separated the data for traditional and class mutants in Tables 4 and 5. JCrasher's tests had a slightly higher mutation score on the class mutants, and TestGen's, JUB's, and the random tests were slightly lower. There was little difference in the mutation scores for the edge coverage tests. However, we are not able to draw any general conclusions from those data. We also looked for a correlation between the number of tests for each class and the mutation score. With the JCrasher tests and the random tests, the largest number of tests (for Recipe in both cases) led to the highest mutation scores. However, the other three test sets showed no such correlation, and we see no correlation with the smallest numbers of tests. In fact, edge coverage produced the fewest tests for class Twelve (12) but had the highest mutation score (86.3%).
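To make the VendingMachine observation concrete, consider one relational operator replacement (ROR) mutant of the second predicate listed above, (credit >= 90). The sketch below is our own illustration, not the paper's subject code or muJava's generated output: the mutant differs from the original only when credit is exactly 90, a state reached only through a sequence of valid coins, so mostly random or deliberately invalid inputs almost never kill it.

/**
 * Illustrative sketch (our own construction). One ROR mutant of the
 * dispensing predicate, (credit >= 90) mutated to (credit > 90), differs
 * from the original only when credit is exactly 90. Since only coins of
 * 10, 25, and 100 are accepted, reaching credit == 90 requires a specific
 * sequence of valid inputs, which random values rarely supply.
 */
public class PredicateMutantSketch {

    static boolean canDispenseOriginal(int credit) { return credit >= 90; }

    static boolean canDispenseMutant(int credit)   { return credit >  90; } // ROR mutant

    public static void main(String[] args) {
        int[] credits = { 0, 35, 90, 145 };  // only credit == 90 distinguishes the two
        for (int credit : credits) {
            boolean killed = canDispenseOriginal(credit) != canDispenseMutant(credit);
            System.out.println("credit=" + credit + " kills mutant: " + killed);
        }
    }
}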
5 Related Work

This research project was partially inspired by a paper by d'Amorim et al., which presented an empirical comparison of automated generation and classification techniques for object-oriented unit testing [4]. Their study compared pairs of test-generation techniques (based on random generation or symbolic execution) and test-classification techniques (based on uncaught exceptions or operational models). Specifically, they compared two tools that implement automated test generation: Eclat [15], which uses random generation, and their own tool, Symclat, which uses symbolic generation. The tools also provide test classification based on an operational model and an uncaught exception model. The results showed that the two tools are complementary in revealing faults.
In a similar study, this time of static analysis tools, Rutar et al. [17] compared five static analysis tools on five open source projects. Their results showed that none of the five tools strictly subsumes any of the others in terms of fault-finding capability, and they proposed a meta-tool to combine and correlate the abilities of the five tools. Wagner et al. [21] presented a case study that applied three static fault-finding tools, as well as code review and manual testing, to several industrial projects. Their study showed that the static tools predominantly found faults that were different from those found by manual testing, but only a subset of the faults found by reviews. They proposed a combination of these three types of techniques.

An early research tool that implemented automated unit testing was Godzilla, which was part of the Mothra mutation suite of tools [5]. Godzilla used symbolic evaluation to automatically generate tests to kill mutants, and a later version incorporated dynamic symbolic evaluation and a dynamic domain reduction procedure to generate tests [12]. A more recent tool that uses very similar techniques is the Daikon invariant detector [6]. It augments the kind of symbolic evaluation that Godzilla used with program invariants, an innovation that makes the test generation process more efficient and scalable. Daikon analyzes the values that a program computes when running and reports properties that were true over the observed executions. Eclat, a model-driven random testing tool, uses Daikon to dynamically infer an operational model consisting of a set of likely program invariants [15]. Eclat requires the classes to test plus an example of their use, such as a program that uses the classes or a small initial test suite. Because Eclat's results may depend on the initial seeds, it was not directly comparable with the other tools in this study.
Table 8. Mutation Scores per Traditional Mutation Operator
Operator  Mutants  JCrasher  TestGen  JUB   Edge Coverage  Random
AORB      56       32%       21%      30%   66%            29%
AORS      11       46%       27%      36%   55%            27%
AOIU      66       46%       32%      17%   79%            36%
AOIS      438      28%       24%      22%   53%            22%
AODU      1        100%      100%     100%  100%           100%
ROR       256      61%       25%      17%   79%            57%
COR       12       33%       25%      25%   58%            33%
COD       6        33%       33%      17%   50%            33%
COI       4        75%       75%      50%   75%            75%
LOI       126      53%       48%      44%   80%            48%
Total     976      42%       28%      24%   66%            36%

Table 9. Mutation Scores per Class Mutation Operator
Operator  Mutants  JCrasher  TestGen  JUB   Edge Coverage  Random
IOD       6        50%       50%      50%   50%            50%
JTI       20       95%       50%      25%   100%           55%
JTD       6        100%      100%     0%    100%           50%
JSI       13       0%        0%       0%    23%            0%
JSD       4        0%        0%       0%    50%            0%
JID       2        0%        0%       0%    0%             0%
JDC       6        83%       66%      83%   83%            67%
EAM       28       0%        0%       0%    57%            0%
EMM       12       100%      100%     100%  100%           100%
Total     97       46%       36%      26%   69%            34%
Another tool that is based on Daikon is Jov [22], which presents an operational violation approach for unit test generation and selection: a black-box approach that does not require specifications. The approach dynamically generates operational abstractions from executions of the existing unit test suite; these operational abstractions then guide test generation tools to generate tests that violate them. The approach selects the tests that violate operational abstractions for inspection, since these tests exercise new behavior that had not yet been exercised by the existing tests. Jov integrates Daikon with Parasoft Jtest [16], a commercial Java testing tool. Agitar Test Runner is a commercial test tool that was partially based on Daikon and Godzilla, but it was unfortunately not available for this study.
6 Conclusions

This paper compared three free, publicly accessible unit test tools on the basis of their fault-finding abilities. Faults were seeded into Java classes with an automated mutation tool, and the tools' tests were compared with hand-generated random tests and edge coverage tests. Our findings are that these tools generate tests that are very poor at detecting faults. This can be viewed as a depressing comment on the state of practice. As users' expectations for reliable software continue to grow, and as agile processes and test driven development continue to gain acceptance throughout the industry, unit testing is becoming increasingly important. Unfortunately, software developers have few choices in high quality test data generation tools. Whereas criteria-based testing has dominated the research community for more than two decades, industrial test data generators seldom, if ever, try to generate tests that satisfy test criteria. Tools that evaluate coverage are available, but they do not solve the hardest problem of generating test values. This study has led us to conclude that it is past time for criteria-based test data generation to migrate into tools that developers can use with minimal knowledge of software testing theory.

These tools were compared with only one test criterion, edge coverage on control flow graphs, which is widely known in the research community to be one of the simplest, cheapest, and least effective test criteria. Our anecdotal experience with manually killing mutants indicates that scores of around 40% are trivial to achieve and that 70% is fairly easy to achieve with a small amount of hand analysis of the class's structure. This observation is supported by this study, in which random values reached the 40% level and edge coverage tests reached the 70% level. However, mutation scores of 80% to 90% are often quite hard to reach with hand-generated tests. This should be possible with more stringent criteria such as prime paths, all-uses, or logic-based coverage.

We have also observed that software developers have few
educational opportunities to learn unit testing skills. Despite the fact that testing consumes more than half of the software industry’s resources, we are not aware of any universities in the USA that require undergraduate computer science students to take a software testing course. Very few universities do more than teach a lecture or two on testing in a general software engineering survey; the material that is presented is typically 20 years old. This study has led us to conclude that it is past time for universities to teach more knowledge of software testing to undergraduate computer science and software engineering students.
References

[1] Paul Ammann and Jeff Offutt. Introduction to Software Testing. Cambridge University Press, Cambridge, UK, 2008. ISBN 0-52188-038-1.
[2] AutomatedQA. TestComplete. Online, 2008. http://www.automatedqa.com/products/testcomplete/, last access December 2008.
[3] Christoph Csallner and Yannis Smaragdakis. JCrasher: An automatic robustness tester for Java. Software: Practice and Experience, 34:1025-1050, 2004.
[4] Marcelo d'Amorim, Carlos Pacheco, Tao Xie, Darko Marinov, and Michael D. Ernst. An empirical comparison of automated generation and classification techniques for object-oriented unit testing. In Proceedings of the 21st International Conference on Automated Software Engineering (ASE 2006), pages 59-68, Tokyo, Japan, September 2006. ACM / IEEE Computer Society Press.
[5] Richard A. DeMillo and Jeff Offutt. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering, 17(9):900-910, September 1991.
[6] Michael D. Ernst, Jake Cockrell, William G. Griswold, and David Notkin. Dynamically discovering likely program invariants to support program evolution. IEEE Transactions on Software Engineering, 27(2):99-123, February 2001.
[7] Erich Gamma, Richard Helm, Ralph Johnson, and John M. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995. ISBN 0-201-63361-2.
[8] Mats Grindal, Jeff Offutt, and Jonas Mellin. On the testing maturity of software producing organizations. In Testing: Academia & Industry Conference - Practice And Research Techniques (TAIC PART 2006), Windsor, UK, August 2006. IEEE Computer Society Press.
[9] Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. MuJava: An automated class mutation system. Software Testing, Verification, and Reliability, 15(2):97-133, June 2005.
[10] Yu-Seung Ma, Jeff Offutt, and Yong-Rae Kwon. muJava home page. Online, 2005. http://cs.gmu.edu/~offutt/mujava/, http://salmosa.kaist.ac.kr/LAB/MuJava/, last access December 2008.
[11] Manish Maratmu. TestGen4J. Online, 2005. http://developer.spikesource.com/wiki/index.php/Projects:testgen4j, last access December 2008.
[12] Jeff Offutt, Zhenyi Jin, and Jie Pan. The dynamic domain reduction approach to test data generation. Software: Practice and Experience, 29(2):167-193, January 1999.
[13] Jeff Offutt, Ammei Lee, Gregg Rothermel, Roland Untch, and Christian Zapf. An experimental determination of sufficient mutation operators. ACM Transactions on Software Engineering and Methodology, 5(2):99-118, April 1996.
[14] Carlos Pacheco and Michael D. Ernst. Eclat tutorial. Online. http://groups.csail.mit.edu/pag/eclat/manual/tutorial.php, last access December 2008.
[15] Carlos Pacheco and Michael D. Ernst. Eclat: Automatic generation and classification of test inputs. In 19th European Conference on Object-Oriented Programming (ECOOP 2005), Glasgow, Scotland, July 2005.
[16] Parasoft. Jtest. Online, 2008. http://www.parasoft.com/jsp/products/home.jsp?product=Jtest, last access December 2008.
[17] Nick Rutar, Christian B. Almazan, and Jeffrey S. Foster. A comparison of bug finding tools for Java. In Proceedings of the 15th International Symposium on Software Reliability Engineering, pages 245-256, Saint-Malo, Bretagne, France, November 2004. IEEE Computer Society Press.
[18] Ben Smith and Laurie Williams. Killing mutants with MuClipse. Online, 2008. http://agile.csc.ncsu.edu/SEMaterials/tutorials/muclipse/, last access December 2008.
[19] Mark Tyborowski. JUB (JUnit test case Builder). Online, 2002. http://jub.sourceforge.net/, last access December 2008.
[20] Sami Vaaraniemi. The benefits of automated unit testing. Online, 2003. http://www.codeproject.com/KB/architecture/onunittesting.aspx, last access December 2008.
[21] S. Wagner, J. Jurjens, C. Koller, and P. Trischberger. Comparing bug finding tools with reviews and tests. In 17th IFIP TC6/WG 6.1 International Conference on Testing of Communicating Systems, pages 40-55, May 2005.
[22] Tao Xie and David Notkin. Tool-assisted unit-test generation and selection based on operational abstractions. Automated Software Engineering Journal, 13(3):345-371, July 2006.