SOFTWARE TESTING, VERIFICATION AND RELIABILITY Softw. Test. Verif. Reliab. 2013; 23:375–403 Published online 27 March 2012 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/stvr.1469
Efficient mutation testing of multithreaded code
Milos Gligoric, Vilas Jagannath, Qingzhou Luo and Darko Marinov*,†
Siebel Center for Computer Science, University of Illinois, Urbana, IL, USA
SUMMARY
Mutation testing is a well-established method for measuring and improving the quality of test suites. A major cost of mutation testing is the time required to execute the test suite on all the mutants. This cost is even greater when the system under test is multithreaded: not only are test cases from the test suite executed on many mutants but also each test case is executed—or more precisely, explored—for multiple possible thread schedules. This paper introduces a general framework for efficient exploration that can reduce the time for mutation testing of multithreaded code. The paper presents five techniques (four optimizations and one heuristic) that are implemented in a tool called MuTMuT within the general framework. Evaluation of MuTMuT on mutation testing of 12 multithreaded programs shows that it can substantially reduce the time required for mutation testing of multithreaded code. Copyright © 2012 John Wiley & Sons, Ltd.
Received 27 September 2010; Revised 2 January 2012; Accepted 23 February 2012
KEY WORDS: mutation testing; multithreaded code
1. INTRODUCTION
Computing is undergoing a historic change as processor clock speeds stagnate while the number of computing cores increases. To extract better performance from the multiple cores, software developers need to write parallel code, either from scratch or by parallelizing existing sequential code. The currently dominant parallel programming model is that of shared memory, where multiple threads of computation read and write shared data objects. For example, Java provides support for threads in the language and libraries, with shared data residing on the heap. Multithreaded code is notoriously hard to get right, with common concurrency bugs including atomicity violations, data races, and deadlocks. Although new parallel programming models are emerging, developers need help right now to develop and test multithreaded code. Mutation testing [1–5] is a method for evaluating and improving the quality of a test suite. A mutation testing tool proceeds in two steps. First, during mutant generation, the tool generates a set of mutants by modifying the original code under test. The tool applies mutation operators that perform small syntactic modifications on the code under test. For example, a modification can replace a variable with a constant (of a compatible type), say amount with 0. Second, during mutant execution, the tool measures how many mutants a given test suite detects. The tool executes each mutant and the original code on the test cases from the test suite and compares the corresponding results. If a mutant generates a result different from the original code, the test input kills the mutant. Note that some of the mutants, although syntactically different from the original code, may be semantically equivalent to the original code, so there is no input that can kill them [4, 6, 7].
*Correspondence to: Darko Marinov, Siebel Center for Computer Science, University of Illinois, 201 North Goodwin Avenue, Urbana, IL 61801, USA.
†E-mail: [email protected]
376
M. GLIGORIC ET AL.
There is a large body of work on mutant generation and mutation operators, some even for multithreaded code. The literature contains mutation operators for several programming languages, including Java [6, 8–12]. The operators for Java include the traditional operators that modify statements and expressions, as well as object-oriented operators that modify classes, for example, field or method declarations. Additional operators for multithreaded code [13–15] can also modify thread or synchronization operations, for example, removing locks or barriers. However, there had been no work on efficient mutant execution for multithreaded code except for the prior ICST 2010 conference paper [16] by the authors. To understand the cost of mutant execution for multithreaded code, consider first test execution for multithreaded code. A multithreaded test is a piece of code that creates and executes two or more threads (and/or invokes code under test that itself creates and executes two or more threads) [17–22]. Because multithreaded code can have different behaviour for different thread schedules/ interleavings, a multithreaded test can produce different results for the same input. Thus, to check whether a multithreaded test can fail for some schedule, the test is typically run several times by using stress or random testing [23, 24] that repeatedly runs the test without much control over the scheduler or by using systematic exploration of possible schedules [20, 21, 25–35]. The most thorough exploration tries all possible schedules, but their number can be very high. Musuvathi and Qadeer proposed CHESS [20], a highly effective exploration heuristic that limits the number of schedules by bounding the number of preemptive context switches in thread schedules and yet finds many bugs. Regardless of what exploration is used, executing one multithreaded test on even one version of code can be expensive. 
It is even more expensive for mutant execution because the test is run on several (mutated) versions of the code. Automated test generation for multithreaded code is also very hard. For sequential code, there has been work on mutation-based test case generation [36–41], and there are several approaches to automating test generation, for example, random generation [42], dynamic symbolic execution [43], or search-based techniques [44]. The key problem for sequential code is selecting appropriate test input data. For multithreaded code, it is also necessary to determine which methods execute in which threads. Whereas there has been some work on generating test input values for multithreaded tests with fixed method sequences [45], there is currently only one published approach that attempts to automate generation of method sequences for multithreaded code [46], but it requires that users manually provide method invocations to be combined in multiple threads. For these reasons, all the experiments in this paper use only manually written tests for multithreaded code, and automated test generation is not discussed further. This paper makes several contributions:
Problem: It identifies the problem of mutation testing for multithreaded code, more specifically the need for efficient mutant execution for multithreaded code. Because multithreaded programming is becoming more widespread (as necessitated by performance requirements), testing multithreaded code has become more important.
Framework: It proposes a framework for efficient mutant execution for multithreaded code. The framework and the tool that implements it are called MuTMuT (from ‘MUtation Testing of MUltiThreaded code’). The key idea of MuTMuT is to reuse some information collected during the exploration/execution of a test on the original code under test to speed up the exploration/execution of the same test on the mutant code.
The simplest reuse is based on code coverage and is already implemented in some tools for mutation testing of sequential code [6, 12]. The idea is to execute a test on the original code under test and to determine which mutants the test reaches (i.e., what mutated code the test executes); then only the mutants that were reached need to be executed because the other mutants will be equivalent to the original code for this test. The same idea can be translated to multithreaded tests: if the original code does not reach a mutant during the entire exploration, then the mutant need not be explored. However, knowing only that a mutant was reached would require repeating the entire exploration for the mutant. MuTMuT can record more information—conceptually, which thread schedules reached a mutant—to prune away many more thread schedules from the exploration of a mutant.
Techniques: It discusses five specific techniques (four optimizations and one heuristic) that instantiate the general MuTMuT framework by using information that can be recorded and reused from
one exploration to another. The main trade-off is that collecting more information requires more time but can prune away more schedules during exploration. Four of the techniques are optimizations in the sense that they provide the same result as a naïve mutant execution (namely, they find the same set of killed mutants) but provide the result faster. One of the techniques is a heuristic in the sense that it can provide a different result (namely, it can find that a mutant is equivalent although it is not) but provides the result even faster.
Implementation: It presents an implementation of the MuTMuT framework and the five techniques in Java PathFinder (JPF), an open-source model-checking tool developed at NASA for verifying Java programs [26, 47]. JPF implements a backtrackable Java Virtual Machine (JVM) that provides control over non-deterministic choices, including thread schedules. The backtrackable JVM allows JPF to explore multithreaded tests for a wide range of Java programs. By default, JPF provides guarantees about all schedules, although it does not explicitly explore them all but instead uses sophisticated partial-order and symmetry reduction techniques [26, 48] to prune schedules/states that have equivalent behaviour. In addition, the MuTMuT experiments use a JPF extension that provides exploration up to a bounded number of context switches, as in CHESS [20, 29]. The implementation work also includes extending the Javalanche [6] mutation tool used to mutate the programs in the MuTMuT evaluation. Javalanche did not support multithreaded mutation operators [13–15], and there is no other publicly available mutation tool that supports such operators. Therefore, a part of the implementation was to extend Javalanche to support some multithreaded mutation operators.
The Javalanche authors accepted this extension, and the new multithreaded mutation operators are included in the publicly available Javalanche distribution. With this extension, it was possible to perform the evaluation by using both traditional sequential mutation operators and multithreaded mutation operators.
Evaluation: It describes an evaluation of MuTMuT on 12 small multithreaded programs and their associated tests. The evaluation uses both exhaustive exploration of all thread schedules and CHESS-like bounding of context switches. The results show that MuTMuT’s advanced optimization technique, StateSet, can speed up mutant execution by up to 48.91% over MuTMuT’s basic optimization technique, Basic. The results also show that MuTMuT’s heuristic, SomeState, can speed up mutant execution by up to 77.32% over Basic, and for the selected set of programs, SomeState killed all the mutants that Basic and StateSet killed.
The rest of this paper is organized as follows. Section 2 introduces a running example that is used to illustrate MuTMuT. Section 3 describes the general MuTMuT framework and five specific instantiations of that framework. Section 4 presents some details of the implementation of MuTMuT in JPF and also describes the Javalanche extension to support the application of multithreaded mutation operators. Section 5 presents an evaluation of MuTMuT. Section 6 reviews related work, and Section 7 concludes the paper.
2. EXAMPLE
This section illustrates how MuTMuT speeds up mutant execution on a simple example based on a bank simulation program from the benchmark for testing multithreaded programs [49]. Figure 1(a) and (c) shows the code under test and example tests, respectively. The Account class stores the balance for an account.
It has operations to create an account with some initial amount, withdraw money from the account (as long as the resulting balance is not negative), deposit more money into the account, look up the current balance, and transfer money from one account to another. Note that the operations use the Java keyword synchronized to indicate that the methods should be executed atomically. If they were not, using Account objects in multithreaded code could have data races and lead to incorrect balances.
2.1. Multithreaded tests
Figure 1(c) shows three example multithreaded tests for the Account class written in JUnit [50], a popular testing framework for Java programs (similar multithreaded tests are written
Figure 1. (a) Example code under test, (b) mutated version of the code under test, and (c) example multithreaded tests.
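The code of Figure 1(a) and 1(c) is not reproduced in this extracted text. The sketch below reconstructs it from the prose: an Account class whose operations are synchronized, and a plain-Java approximation of test1 (the JUnit harness is replaced by a helper method; method names, signatures, and the transfer body are assumptions based on the description).

```java
// Reconstruction of the described Account class; not the paper's exact code.
class Account {
    private double balance;

    Account(double initial) { balance = initial; }

    // Withdraw only if the resulting balance would not be negative.
    synchronized void withdraw(double amount) {
        if (balance >= amount) balance -= amount;
    }

    synchronized void deposit(double amount) { balance += amount; }

    synchronized double getBalance() { return balance; }

    // The third mutant in Figure 1(b) removes the inner synchronized block.
    synchronized void transfer(Account other, double amount) {
        if (balance >= amount) {
            balance -= amount;
            synchronized (other) { other.balance += amount; }
        }
    }
}

class AccountTest {
    // test1-like scenario: one withdraw and two deposits run concurrently;
    // the main thread then checks the balance. Depending on whether
    // withdraw(80.00) runs before or after enough deposits, the final
    // balance is 160.00 or 80.00.
    static double runTest1() {
        Account a = new Account(50.00);
        Thread t1 = new Thread(() -> a.withdraw(80.00));
        Thread t2 = new Thread(() -> a.deposit(20.00));
        Thread t3 = new Thread(() -> a.deposit(90.00));
        t1.start(); t2.start(); t3.start();      // initiate thread execution
        try { t1.join(); t2.join(); t3.join(); } // wait for thread completion
        catch (InterruptedException e) { throw new RuntimeException(e); }
        return a.getBalance();
    }
}
```

A single run exercises only one schedule; the point of the paper is that a tool such as JPF must explore all of them.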
for several Java codebases, for example, hundreds of tests are publicly available [51] for the java.util.concurrent package, which is included in the Java standard library [52]). The test1 method validates whether concurrently executing one withdraw operation and two deposit operations results in a correct balance amount. The test has three threads that perform the three operations and the main thread that makes the assertion on the account balance. Note that, because withdraw cannot result in a negative balance, the final amount in the test can differ depending on the schedule of the threads involved. This test uses the standard Java operations to initiate thread execution (start) and wait for thread completion (join). Similarly, test2 validates deposit and withdraw operations, and test3 validates the constructor and balance
(a careful reader may notice that the example tests are not very good as they do not even cover the transfer method).
2.2. Schedules and search space
Figure 2(a) shows the thread schedules and the state space for complete exploration of test1. Each node in the graph represents a state in this space; states are labelled S1–S10. The value inside the node shows the current balance in the account a from the test1 code. The edges represent transitions that the code makes as it progresses from one state to another. The edge labels show the method causing the transition, for example, w80 denotes withdraw(80.00). Note that the methods are atomic, so there are no thread interleavings within the methods. Note additionally the two highlighted points where the execution of w80 on the original Account code reaches Mutant1, as will be discussed later for the exploration of that mutant. The search space shown in Figure 2(a) is for a stateful, depth-first search [48]; each new state encountered in the search is compared with the previously encountered states, and if the new state matches a previously encountered state, the exploration stops. For example, executing withdraw(80.00) in state S9 leads to the previously encountered state S5. Note that the state comparison matches entire program states, including program counters for the threads, and not only the account balance. For instance, states S1 and S2 differ: although they have the same account balance, they have different program counters for the threads (specifically, thread t1 is at the beginning of the execution in S1, and it is at the end of the execution in S2). Further, note that this state comparison is commonly performed in model checking [48] and is not specific to the application in mutation testing; in particular, MuTMuT uses strong rather than weak mutation [4].
Although the example search shown is stateful and depth-first, it is important to point out that MuTMuT, which is built on top of JPF, also supports stateless search and non-depth-first search strategies available in JPF (breadth-first, random, CHESS, etc.). Different search approaches can take different amounts of time and can explore different parts of the search space. The number of states being explored and the number of transitions being executed [53] can be used as a first approximation for comparing two exploration times. For example, the state space in Figure 2(a) has 10 states and 13 transitions.
2.3. Mutants
Figure 1(b) shows three example mutants for the Account class. The first mutant replaces the operator += with -=, the second mutant replaces the variable amount with 0, and the third mutant modifies the transfer method by removing the synchronized keyword around the block other.balance += amount. The first two mutants are created by applying the traditional sequential operators ‘change arithmetic operation’ and ‘replace variable with constant’ [4, 5].
Figure 2. State space for exploring test1. (a) original exploration and (b) Mutant1 exploration.
The third mutant is created by applying the ‘remove synchronized block’ multithreaded mutation operator [14]. These mutants are shown as a mutant schemata [54], that is, one program that encodes all the mutants. By setting the appropriate value for the mutant id, the same program can execute either as the original code or as one of the mutants. This type of mutant generation was introduced to speed up compilation of mutants [54]: rather than generating (and compiling) one new program for each mutant, it generates (and compiles) only one meta-mutant that encodes all mutants. MuTMuT leverages this type of mutant to speed up exploration for multithreaded tests. Consider the exploration of the tests from Figure 1(c) on the mutant schemata from Figure 1(b). The goal of mutant execution is to establish which of the mutants get killed by the tests provided. A very naïve mutant execution would explore all the tests on all the mutants to find out which mutants get killed. An obvious improvement, used in practical mutation testing, is to stop executing a mutant as soon as one test kills it. An even better solution is to consider which mutants were reached during the exploration of the original code [6, 12] (and, as discussed later, where in the exploration the mutants were reached). Figure 2(b) shows the (entire) exploration of test1 on Mutant1. The same state labels are used as in the original exploration to make the comparison easier; if only the states in this state space were labelled consecutively, S9 would be S7 and S10 would be S8. The difference from the original exploration is in the two new edges that start from the highlighted points. These edges correspond to the transitions that reach the mutant in Mutant1 during the original exploration, namely where the withdraw method is called with balance greater than amount.
Note that in this example, the two new edges take the execution to a state that is encountered in the original exploration, but, in general, exploration of mutants could lead to new states that were not encountered in the original exploration. Also note that in this example, test1 does not kill Mutant1: although the exploration of test1 on Mutant1 differs from the original exploration, the exploration on Mutant1 produces the same results as the original exploration and does not fail the assertion (MuTMuT speeds up exploration for both killed and not killed mutants, as the results in Section 5 show). For the three example tests and mutants, explorations of both test1 and test2 reach both Mutant1 and Mutant2, exploration of test3 does not reach any mutant, and no test reaches Mutant3. Figure 3 shows a matrix indicating which tests reach which of the mutants. Because test3 reaches no mutant, it cannot kill any mutant, so it need not be explored for any mutant. Similarly, Mutant3 cannot be killed by any of the three tests, so it need not be explored for any test. The MuTMuT Basic technique uses such properties to speed up mutant execution. To infer these properties, it suffices to record whether a test reaches a mutant, effectively keeping a set of mutants reached during the original exploration for each test.
2.4. Efficient exploration
The key insight behind MuTMuT is that it is possible to record even more information from the original exploration to speed up the test exploration for the mutants. More specifically, a tool can record not only whether but also where a test reaches a mutant. Consider Figure 2(b) and the exploration of test1 for Mutant1. Rather than starting this exploration from the very initial state (S1), it suffices to re-execute the code from S1 to S6 and start the exploration from that state because that is the first state in the entire exploration that reaches this mutant.
In other words, it is not necessary to explore whether test1 can kill Mutant1 if the first executed operation is withdraw from S1 to S2 because no such execution reaches the mutant. In this particular example, starting the exploration from S6 avoids exploring 1 state (S2) and the 3 transitions (that lead to or start from S2).
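The mutant schemata encoding described in Section 2.3 can be sketched as follows. The text does not pin down which method each sequential mutant modifies, so here both example operators are applied to deposit purely for illustration; the field name currentlyExecutingMutantId follows the description in Section 3.2, and the mutant ids are assumptions.

```java
// Sketch of a meta-mutant that encodes several mutants in one program
// (after the mutant schemata idea [54]); not Javalanche's actual encoding.
public class MetaMutant {
    // 0 selects the original code; 1 and 2 select the two sequential mutants.
    public static int currentlyExecutingMutantId = 0;

    private double balance;

    public MetaMutant(double initial) { balance = initial; }

    public synchronized void deposit(double amount) {
        if (currentlyExecutingMutantId == 1) {
            balance -= amount;   // Mutant1: operator += replaced with -=
        } else if (currentlyExecutingMutantId == 2) {
            balance += 0;        // Mutant2: variable amount replaced with 0
        } else {
            balance += amount;   // original code
        }
    }

    public synchronized double getBalance() { return balance; }
}
```

Setting the id before a run selects which variant executes, so one compiled program stands in for the original code and all of its mutants.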
Figure 3. Mutants reached during test explorations.
Moreover, as discussed in detail in the next section, one could record even more information from the exploration of the original code—not only the first state in the entire exploration that reaches a mutant but also (i) the first state on each execution path/schedule that reaches a mutant or (ii) what mutants can be reached for explorations starting from each state—to prune away even more exploration. In the particular example of test1 and Mutant1, these two additional prunings would avoid exploring 5 states and 8 transitions, and 5 states and 9 transitions, respectively. Furthermore, even more exploration can be pruned if one allows some tests to not kill mutants. A new heuristic, called SomeState, avoids exploring 6 states and 11 transitions for the running example. Section 3.3 shows a visualization of the state spaces for these cases and discusses this further. It is not always beneficial to record more information: the additional time required to record more information may not be repaid by the savings in the exploration. For this reason, MuTMuT, in its StateSet technique, records only the first state on each execution path that reaches a mutant. As the experimental results in Section 5 show, this information helps MuTMuT to substantially reduce the exploration time with the StateSet technique, up to 48.91% over the Basic technique (which is already a significant improvement [6, 12] over a naïve exploration of all tests on all mutants).
2.5. CHESS
MuTMuT need not be used only for the exhaustive exploration of all schedules (which does not scale well with the number of threads and the length of execution for each thread). For example, CHESS [20, 29] is an effective heuristic that explores only schedules up to a bounded number of preemptive context switches, that is, context switches from a thread that is enabled (rather than being blocked or terminated, in which case the execution must switch to another thread).
For instance, consider the exploration for test2, which has two threads with two operations each, and let us label these operations as dt1 (deposit from thread 1), wt1, dt2, and wt2. These operations can have six
Figure 4. State spaces for exploring test1 on Mutant1 using various MuTMuT techniques: (a) FirstState, (b) StateSet, (c) AllStates, and (d) SomeState.
possible schedules: ⟨dt1, wt1, dt2, wt2⟩, ⟨dt1, dt2, wt1, wt2⟩, ⟨dt1, dt2, wt2, wt1⟩, ⟨dt2, wt2, dt1, wt1⟩, ⟨dt2, dt1, wt2, wt1⟩, and ⟨dt2, dt1, wt1, wt2⟩. However, two of the schedules, ⟨dt1, dt2, wt1, wt2⟩ and ⟨dt2, dt1, wt2, wt1⟩, require two preemptive context switches (in both cases, the last context switch is not preemptive because the current thread finishes execution). Running CHESS with bound 1 would prune these two schedules from the exploration. Whereas bound 1 is used for illustrative purposes, results from Microsoft show that CHESS with bound 2 can actually find many real bugs [20]. For the Account class from Section 5 and for the original program, CHESS with bound 2 explores tests 30× faster than the exhaustive exploration. Moreover, MuTMuT can speed up mutant execution even when CHESS is used; for Account and CHESS with bound 2, MuTMuT StateSet runs 13.37% faster than MuTMuT Basic, and MuTMuT SomeState runs 14.40% faster than MuTMuT Basic.
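The schedule counting above can be checked with a small sketch of preemption bounding. Operations are tagged with their thread id ("1" or "2"), and a context switch is preemptive exactly when the yielding thread still has operations left; the class and method names are illustrative, not CHESS's or JPF's API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of CHESS-style preemption bounding on the test2 schedules.
public class PreemptionBound {
    // All interleavings of two per-thread operation sequences.
    static List<List<String>> interleave(List<String> a, List<String> b) {
        List<List<String>> out = new ArrayList<>();
        if (a.isEmpty() && b.isEmpty()) {
            out.add(new ArrayList<>());
            return out;
        }
        if (!a.isEmpty())
            for (List<String> rest : interleave(a.subList(1, a.size()), b)) {
                rest.add(0, a.get(0));
                out.add(rest);
            }
        if (!b.isEmpty())
            for (List<String> rest : interleave(a, b.subList(1, b.size()))) {
                rest.add(0, b.get(0));
                out.add(rest);
            }
        return out;
    }

    // Count switches where the yielding thread was still enabled.
    static int preemptiveSwitches(List<String> schedule) {
        int preemptions = 0;
        int[] remaining = new int[3];                // remaining ops per thread id
        for (String op : schedule) remaining[threadOf(op)]++;
        for (int i = 0; i < schedule.size(); i++) {
            int t = threadOf(schedule.get(i));
            remaining[t]--;
            if (i + 1 < schedule.size()
                    && threadOf(schedule.get(i + 1)) != t
                    && remaining[t] > 0)             // switching away from an enabled thread
                preemptions++;
        }
        return preemptions;
    }

    static int threadOf(String op) { return op.charAt(op.length() - 1) - '0'; }

    // How many test2 schedules survive a given preemption bound.
    static long schedulesWithinBound(int bound) {
        return interleave(List.of("d1", "w1"), List.of("d2", "w2")).stream()
                .filter(s -> preemptiveSwitches(s) <= bound)
                .count();
    }
}
```

With bound 1, four of the six schedules remain, matching the two pruned schedules named in the text.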
3. FRAMEWORK
This section describes the general framework for mutant execution of multithreaded code. The general approach of the framework is to (i) first record various pieces of information during the exploration of a test on the original code under test and (ii) then use that information to speed up the exploration of the test on the mutants. The framework is instantiated into five specific techniques on the basis of the pieces of information that are recorded and reused. The key trade-off is that collecting more information is costlier but can result in pruning away more of the exploration. The section presents four optimization techniques that can prune exploration and yet guarantee to kill all mutants that should be killed. The section also presents one heuristic technique that prunes away even more of the exploration and can improve the time significantly, but this does not come for free: the heuristic can miss killing some mutants.
3.1. Traditional exploration
This section first describes how a traditional exploration searches through a state space [48] and then discusses how the MuTMuT techniques modify the traditional exploration. Figure 5 shows pseudo-code for the traditional exploration. It takes as input a test and reports whether it fails for some state/path or passes for all (explored) states. It maintains two sets: ToExplore is a set of states and transitions that still need to be explored, and Encountered is a set of states (typically their hash values [48]) that were already visited. It also maintains for each state a path (i.e., a sequence of transitions) that leads to that state. The initialization part populates these sets on the basis of the initial state for the test. The exploration part executes a loop as follows. While there is more to explore, the loop takes a state s and a transition t, executes t on s, checks whether the resulting state s' was visited, and if not, adds s' and its transitions to the ToExplore set (in general, the transitions from s' can be viewed as an unordered set, but a precise description of MuTMuT's FirstState technique requires viewing these transitions as an ordered list). If no violation is found, the test is reported as passing.
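The pseudo-code of Figure 5 is not reproduced in this extracted text; a minimal sketch of the stateful search loop follows. States are abstracted to integers and a successor function stands in for transition execution; the real JPF search also tracks paths and checks properties, which is omitted here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of a stateful depth-first exploration over an abstract state space.
public class Exploration {
    // Returns every state encountered during the search.
    static Set<Integer> explore(int initial, Function<Integer, List<Integer>> successors) {
        Deque<Integer> toExplore = new ArrayDeque<>(); // states still to expand
        Set<Integer> encountered = new HashSet<>();    // states already visited
        toExplore.push(initial);
        encountered.add(initial);
        while (!toExplore.isEmpty()) {
            int s = toExplore.pop();
            for (int next : successors.apply(s)) {     // execute each transition from s
                if (encountered.add(next)) {           // prune previously seen states
                    toExplore.push(next);
                }
            }
        }
        return encountered;
    }
}
```

On a diamond-shaped space the shared successor state is explored only once, which is exactly the matching of S9's successor against S5 described in Section 2.2.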
3.2. Common MuTMuT infrastructure
Figure 6 shows the pseudo-code for the mutant execution performed by MuTMuT. Given a test suite and a set of (live) mutants, MuTMuT computes the set of mutants killed by the test suite. For each test, MuTMuT first performs a slightly modified (as explained in the next paragraph) traditional exploration of the original code. Note that the ‘original code’ is actually mutated code (as shown in Figure 1(b)) with currentlyExecutingMutantId set to 0 to execute the original, unmutated statements. During this exploration, MuTMuT collects a set of reached (live) mutants and the additional information (depending on the specific MuTMuT technique being used, as described in Section 3.3). It then performs a special exploration (again depending on the specific technique) of the test on each of those mutants. If the test fails, the mutant is killed.
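Since the pseudo-code of Figure 6 is not reproduced in this text, the overall loop can be sketched as follows. The two functional parameters stand in for the modified exploration of the original code (returning the set of reached live mutants) and for specialExplore (returning whether the test fails on a mutant); the names and signatures are illustrative, not MuTMuT's actual interface.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;
import java.util.function.BiPredicate;

// Sketch of MuTMuT's top-level mutant-execution loop.
public class MutantExecution {
    static Set<Integer> killedMutants(
            List<String> tests,
            Set<Integer> mutants,
            BiFunction<String, Set<Integer>, Set<Integer>> exploreOriginal,
            BiPredicate<String, Integer> specialExplore) {
        Set<Integer> killed = new HashSet<>();
        for (String test : tests) {
            Set<Integer> live = new HashSet<>(mutants);
            live.removeAll(killed);                       // stop executing killed mutants
            Set<Integer> reached = exploreOriginal.apply(test, live);
            for (int m : reached)                         // only reached mutants can be killed
                if (live.contains(m) && specialExplore.test(test, m))
                    killed.add(m);
        }
        return killed;
    }
}
```

In the running example, test3 reaches no mutant and Mutant3 is reached by no test, so neither enters a special exploration.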
Figure 5. Pseudo-code for traditional state-space exploration.
Figure 6. High-level overview of MuTMuT’s mutant execution.
The change that MuTMuT makes to the traditional exploration is during the execution of transitions. MuTMuT first starts the traditional exploration of the original program; during this exploration, some of the executed transitions reach the mutation points in the meta-mutant. Recall the example in Figure 2(a) and the highlighted points that denote where the original exploration reaches the mutation points for Mutant1 (from S6 to S7 and from S10 to S8). Figure 7 shows the method that MuTMuT executes at the mutation points to determine whether to execute the original code or the mutant. Additionally, this method notifies an execution observer when the mutant is reached while executing the current transition on the current state. The various MuTMuT techniques then record various information at these points.
Figure 7. Pseudo-code that MuTMuT executes for mutation points.
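The pseudo-code of Figure 7 is not reproduced in this extracted text; the check it describes can be sketched as below. The meta-mutant calls executeMutant(id) at each mutation point: the call notifies an observer that the mutant was reached and then says whether the mutated or the original code should run. The observer here just records reached ids; field and method names are illustrative.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the mutation-point check and reachability notification.
public class MutationPoint {
    static int currentlyExecutingMutantId = 0;   // 0 selects the original code
    static Set<Integer> reachedMutants = new HashSet<>();

    // Called by the meta-mutant at the mutation point with the given id.
    static boolean executeMutant(int mutantId) {
        reachedMutants.add(mutantId);            // notifyMutantReached(mutantId)
        return currentlyExecutingMutantId == mutantId;
    }
}
```

During the exploration of the original code the check always returns false, yet reachability is still recorded, which is exactly the information the techniques of Section 3.3 build on.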
3.3. Specific MuTMuT techniques
The five MuTMuT techniques are four optimizations, called Basic, FirstState, StateSet, and AllStates, and one heuristic, called SomeState. First, Section 3.3.1 describes the additional information that the techniques collect (through notifyMutantReached calls), and then Section 3.3.2 describes how the techniques use this information during their special exploration (in specialExplore).
3.3.1. Additional information. Figure 8 summarizes the additional information collected during exploration of the original code. Figure 9 shows the pseudo-code for this collection. Basic does not require any additional information; all the information is already in ReachedSet. FirstState records the first state (more precisely, the path needed to restore that state) that reaches the mutant during the entire exploration. For test1 and Mutant1, this happens during the execution of w80 from the state S6. StateSet records a set of states (more precisely, the paths to those states) that reach the mutant. It records only the first state that reaches the mutant on a path; it does not record multiple states that reach the same mutant on one path, which can happen even on a single execution path (rather than in the entire exploration) when the mutant is in a loop or a recursive method. The rationale for recording only the first state on a path is as follows. When MuTMuT executes the mutant (after the original code) and reaches the mutation point for the first time, it executes the mutant rather than the original code. The mutant can potentially change the state such that a later occurrence of the mutation point on the same path may not be reached. For test1 and Mutant1, StateSet records S6 and S10. AllStates records the largest amount of information. It collects for each mutant a set of states that reach the mutant directly in one transition leading from that state. Moreover, it collects the entire state-space graph, that is, what states are successors of what other states.
AllStates then postprocesses these two pieces of information to compute for each state a set of mutants that can be reached transitively in one or more transitions leading from that state. Effectively, it computes for each state what mutants can be reached in an exploration starting from that state.
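The bookkeeping sketched in Figure 9 for Basic, FirstState, and StateSet can be illustrated in Java as follows; the class name, the notifyMutantReached signature, and the representation of a path as a list of transition labels are illustrative assumptions, not MuTMuT's actual API (note also that the implementation described in Section 4 avoids the explicit path comparison shown here):

```java
import java.util.*;

// Hypothetical sketch of the per-technique information collection.
class ReachInfo {
    // Basic: which mutants the original exploration reached at all.
    final Set<Integer> reachedSet = new HashSet<>();
    // FirstState: the first path (in the entire exploration) reaching each mutant.
    final Map<Integer, List<String>> firstPath = new HashMap<>();
    // StateSet: per mutant, the first reaching path for each explored path.
    final Map<Integer, Set<List<String>>> pathSet = new HashMap<>();

    void notifyMutantReached(int mutantId, List<String> currentPath) {
        reachedSet.add(mutantId);                                   // Basic
        firstPath.putIfAbsent(mutantId, List.copyOf(currentPath));  // FirstState
        // StateSet: record only the first reach on a path, i.e., skip a path
        // that merely extends an already-recorded path (a later reach of the
        // same mutant on the same path, e.g., inside a loop).
        Set<List<String>> paths =
            pathSet.computeIfAbsent(mutantId, k -> new HashSet<>());
        boolean extendsRecorded =
            paths.stream().anyMatch(p -> isPrefix(p, currentPath));
        if (!extendsRecorded) {
            paths.add(List.copyOf(currentPath));
        }
    }

    static boolean isPrefix(List<String> shorter, List<String> longer) {
        return longer.size() >= shorter.size()
            && longer.subList(0, shorter.size()).equals(shorter);
    }
}
```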
technique   | additional information collected
Basic       | boolean: did the test exploration reach the mutant?
FirstState  | at most one state: the first state in the entire exploration where the mutant was reached, if it was reached
StateSet    | a set of states: for each path that reached the mutant, the first state that reached the mutant
AllStates   | a set of states: all states that reached the mutant; also collects the entire state-space graph (shared for all mutants)
SomeState   | at most one state: some state where the mutant was reached, if it was reached
Figure 8. Information recorded for each (live) mutant and test exploration for the five techniques.
Copyright © 2012 John Wiley & Sons, Ltd.
Softw. Test. Verif. Reliab. 2013; 23:375–403 DOI: 10.1002/stvr
EFFICIENT MUTATION TESTING OF MULTITHREADED CODE
Figure 9. Pseudo-code for collecting additional information.
SomeState can record any state that reaches the mutant during the entire exploration. The experiments performed for this paper record the same information as FirstState, so for test1 and Mutant1, SomeState records the state S6, but in theory, one could (randomly) choose to record any other state.

3.3.2. Special exploration. The MuTMuT techniques differ in how they perform specialExplore. Basic performs exactly the same exploration as the traditional exploration shown in Figure 5, with the only difference that, during specialExplore, Basic does not explore the original code but a mutant, specifically one of the mutants from ReachedSet, as shown in the loop within lines 8–13 in Figure 6. Figure 10 shows the pseudo-code for the other four techniques.

FirstState has additional information about the first state that reached a mutant. Hence, for each mutant, FirstState's exploration first restores that state and its associated search environment and then performs a search from that point. Thus, FirstState explores the states reachable from the restored state and then continues the search to complete the rest of the exploration. Effectively, FirstState replaces the initialization part from Figure 5 as shown in Figure 10: rather than starting the search from the initial state with no path, FirstState starts the search from a state with a path previously recorded during the original exploration. While restoring that path, for each state s reached from a transition t on that path, FirstState adds to the search the transition t and all transitions from s that come after t. Figure 4(a) shows the states and transitions that FirstState explores for test1 and Mutant1. Compared with Figure 2(a), FirstState does not explore state S2 and the three transitions that lead to and from it. Note that because the pruned part of the state space does not reach the mutant, FirstState is still guaranteed to correctly determine whether the mutant is killed.
However, FirstState requires the original exploration and the mutant exploration to use the same search strategy.
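The strategy dependence can be made concrete with a minimal worklist exploration, in which the position from which the next item is picked determines depth-first versus breadth-first order; this is a self-contained sketch with hypothetical names, not MuTMuT's code:

```java
import java.util.*;
import java.util.function.Function;

// Minimal worklist-based state-space exploration. Picking the item added last
// gives depth-first search; picking the item added first gives breadth-first.
// Resuming from a restored state revisits the same states only if both the
// original and the mutant exploration pick items in the same order.
class Explorer {
    private final Deque<String> toExplore = new ArrayDeque<>();
    private final Set<String> encountered = new HashSet<>();
    private final boolean depthFirst;

    Explorer(boolean depthFirst) { this.depthFirst = depthFirst; }

    // successors(s) stands in for executing the enabled transitions from s.
    List<String> explore(String start, Function<String, List<String>> successors) {
        List<String> visitOrder = new ArrayList<>();
        toExplore.add(start);
        encountered.add(start);
        while (!toExplore.isEmpty()) {
            String s = depthFirst ? toExplore.removeLast() : toExplore.removeFirst();
            visitOrder.add(s);
            for (String next : successors.apply(s)) {
                if (encountered.add(next)) {
                    toExplore.add(next);
                }
            }
        }
        return visitOrder;
    }
}
```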
M. GLIGORIC ET AL.
Figure 10. Pseudo-code for state-space exploration with MuTMuT’s techniques.
Namely, the call pickOnePair in line 19 of the traditional exploration in Figure 5 allows for any search strategy: for example, picking the pair added last to ToExplore results in depth-first search, whereas picking the pair added first results in breadth-first search. FirstState also allows any strategy to be used, as long as the same strategy is used for both explorations; if different strategies are used,
FirstState could miss exploring a part of the state space and could thus incorrectly fail to kill a mutant. FirstState also requires that the original exploration and the mutant exploration return the list of transitions from a given state s in the same order.

StateSet has additional information about the first state that reached a mutant on every path that reached the mutant. Hence, for each mutant, StateSet's exploration restores these states one by one and explores the state space reachable from each restored state, much like FirstState, with one key difference: StateSet does not continue the search to complete the rest of the exploration. Figure 10 shows how StateSet initializes the search for each of the states. Figure 4(b) shows the states and transitions that StateSet explores for test1 and Mutant1. Compared with Figure 4(a), StateSet does not continue the exploration from S1 after finishing the exploration from S6, so StateSet avoids exploring the state S9 and the transitions that lead to and from that state. Like FirstState, StateSet allows various search strategies, but the same strategy has to be used for both the original and mutant explorations. Also like FirstState, StateSet is guaranteed to correctly determine whether the mutant is killed because the pruned part of the state space could not kill the mutant.

AllStates has additional information about all the states that could reach the mutant, not only directly in one transition but also transitively through a number of transitions. Like StateSet, AllStates finds all first states that reached a mutant on paths that reached the mutant, restores these states one by one, and then explores the state space reachable from each restored state, without continuing the search to complete the rest of the exploration.
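The extra pruning check that AllStates performs when deciding whether to explore a successor state can be sketched in Java; the ppai map (state to the set of mutant ids transitively reachable from it) and the method names are hypothetical representations of the post-processed information:

```java
import java.util.*;

// Sketch of AllStates' visited-check: a successor state is explored only if it
// has not been encountered AND either no reachability information was collected
// for it, or the currently executed mutant is still reachable from it.
class AllStatesCheck {
    static boolean shouldExplore(String nextState, Set<String> encountered,
                                 Map<String, Set<Integer>> ppai,
                                 int currentMutantId) {
        return !encountered.contains(nextState)
            && (!ppai.containsKey(nextState)              // no info: must explore
                || ppai.get(nextState).contains(currentMutantId)); // reachable
    }
}
```

For instance, with ppai recording that only Mutant2 is reachable from S3, the exploration of Mutant1 prunes the successor S3, mirroring the pruning of d20 described below.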
Moreover, AllStates modifies the condition in line 25 in Figure 5: the traditional condition checks only whether the next state s′ was visited, but AllStates also checks whether it has information about s′ (s′ could be a new state encountered for the mutant, for which AllStates could not have collected information during the original exploration) and, if it does have the information, whether the currently executed mutant can be reached transitively from s′. Specifically, line 25 becomes the following, where ppai is the post-processed ai (as described in Section 3.3.1):

if (s′ ∉ Encountered ∧ (¬ppai.containsKey(s′) ∨ currentlyExecutingMutantId ∈ ppai(s′)))

Figure 4(c) shows the exploration for AllStates. Compared with Figure 4(b), AllStates prunes the transition d20 from S3 because AllStates 'knows' that exploration from S3 cannot reach Mutant1. Like FirstState and StateSet, AllStates requires the same search strategy for the original and mutant explorations to correctly detect the mutants that are killed.

SomeState performs the same search as StateSet, although it records the same additional information as FirstState, that is, the path to the first state for each mutant (although, as mentioned earlier, SomeState could use any other state that reaches the mutant). Namely, SomeState does not continue the search to complete the rest of the exploration. Figure 4(d) shows the states and transitions that SomeState explores for test1 and Mutant1. Compared with Figure 4(b), SomeState prunes the state S10 because it remembers only the first state where the mutant is encountered and explores only from that state.

3.4. Properties of MuTMuT techniques

All four MuTMuT optimizations satisfy three desirable properties for (correct and) efficient mutant execution; the SomeState heuristic satisfies two of these properties.
To state the properties, recall that MuTMuT's inputs are a test suite and a set of mutants, MuTMuT's output is the set of mutants killed by the test suite, and MuTMuT explores a state space, visiting a number of states and executing a number of transitions.

Soundness: MuTMuT does not kill any mutant that should not be killed (i.e., one that passes for all execution paths of all tests in the given test suite). Each mutant that MuTMuT puts in the set of killed mutants indeed fails for some execution path of some test from the test suite.

Completeness: The MuTMuT optimizations kill all mutants that should be killed for the given exploration strategy and the given test suite. Each mutant that the MuTMuT optimizations do not put in the set of killed mutants indeed cannot fail for the execution paths that would be explored with a naïve mutant execution that tries all tests on all mutants. Note that we have to qualify this
statement on the basis of the exploration strategy because the strategy itself can prune some exploration paths; for example, CHESS prunes the paths where the number of context switches exceeds a given bound. It may be the case that an exhaustive exploration of all possible paths would kill the mutant for some test. The key is that the MuTMuT optimizations give the same result as a naïve mutant execution but much faster.

In contrast to the MuTMuT optimizations, the SomeState heuristic can fail to kill some mutants that would be killed by an entire exploration. Because SomeState does not continue the search (as FirstState does) and because it does not store all paths that lead to the mutant (as StateSet does), it can fail to kill a mutant that cannot be killed from any path starting from the first recorded state but can be killed starting from some other state. As an example, consider test1 and Mutant1. The mutant is not killed starting from the state S6, but SomeState would not explore other paths for that mutant. If the mutant had been killed from the state S10, SomeState would have failed to kill it. This heuristic is based on the observation that many mutants are killed by exploring any state that reaches the mutant. SomeState can significantly improve the exploration time, and the results in Section 5 show that, in the experiments, SomeState killed all mutants that should be killed.

Improvements: The MuTMuT techniques are definite improvements over the naïve exploration in terms of the number of states and transitions explored, but the fact that one technique T explores fewer states or transitions than another technique T′ does not necessarily mean that T has a lower execution time.
Specifically, Basic never explores more states or transitions than a naïve mutant execution, FirstState never explores more than Basic, StateSet never explores more than FirstState, and AllStates never explores more than StateSet. Additionally, SomeState never explores more than StateSet but could explore more than AllStates (e.g., if there is only one state where the mutant is reached and AllStates can prune some exploration from that state on the basis of 'knowing' the graph). However, the user of MuTMuT cares about the real time the tool takes to report killed mutants, not about the internals of the exploration. It can happen that a technique that explores less actually runs slower because it performs a much costlier collection of additional information in notifyMutantReached or a costlier lookup of this information in specialExplore. The experiments indeed showed this for AllStates and StateSet: although AllStates is better in theory, it easily runs out of memory while collecting the additional information and had longer mutant execution times than StateSet. The experiments also showed that FirstState and StateSet can be slower than Basic in some cases, but only by a few percent, while providing a much bigger speedup in other cases. For these reasons, StateSet is proposed as the best optimization technique for efficient exploration of multithreaded tests for mutation testing. When one can tolerate potentially not killing some mutants, SomeState provides an even bigger speedup.

4. IMPLEMENTATION

This section describes the modifications made to the JPF verification tool to support all five MuTMuT techniques. It also describes the extensions made to the Javalanche mutation tool to support multithreaded mutation operators.

4.1. MuTMuT techniques

MuTMuT is implemented in JPF, an open-source verification tool for Java programs [26, 47].
JPF provides an explicit-state model checker that can control the thread scheduler, which is necessary for the exploration of multithreaded tests. By default, JPF explores all possible schedules of such tests; this exploration mode is often called exhaustive exploration, although JPF does not explicitly enumerate all schedules but prunes equivalent schedules/states by using advanced partial-order reduction and state-comparison techniques [26, 48]. The current MuTMuT implementation also supports other exploration modes in JPF, such as bounding the number of thread context switches as in CHESS [20, 29].
All five MuTMuT techniques are implemented in JPF. The implementation closely follows the pseudo-code algorithms presented in Figures 6, 7, 9, and 10. The key differences are related to (i) the way JPF restores the states for exploration and (ii) performance improvements that are in the implementation but not shown in the pseudo-code algorithm in Figure 9.

The pseudo-code algorithm in Figure 5 (in particular, line 23) and the description in Section 3.3 assume that the exploration restores a state by re-executing a path (where a path is a sequence of transitions) on the code being explored. Although re-execution makes it easier to present the algorithms, JPF actually does not re-execute the code but stores/checkpoints a copy of the state when it is encountered and then restores the state by re-establishing it from the copy. MuTMuT piggybacks on the storing and restoring that JPF already performs, so the implementations of FirstState, StateSet, and SomeState do not have to explicitly collect the execution path as shown in notifyMutantReached in Figure 9. However, AllStates still has to collect the entire state-space graph, which makes it frequently run out of memory.

Moreover, MuTMuT's implementation leverages JPF's storing and restoring of states to remove the most expensive operation shown in notifyMutantReached in Figure 9, namely checking whether the current path is a suffix of some previous path. Performing such a check on a large set of paths would be costly but is not necessary. Instead, the first time that a mutant is reached on some path, the implementation remembers the state/path and also adds the mutantId to the set of mutants reached on that path. This set of mutants is part of the state manipulated by JPF, so it gets properly stored, restored, and backtracked.

It was also necessary to extend JPF to support 'misbehaving' mutants. These extensions are not particular to multithreaded tests and MuTMuT but are required for any mutation tool [6, 11, 12].
The problem is that, even if the tests terminate normally on the original code, some mutants may not terminate (e.g., when a mutation replaces the condition of a while loop with true), may run out of heap memory (because of loops or recursive calls that allocate objects), may overflow the stack (because of recursive calls), or may throw other exceptions. Possible solutions to this problem involve running each mutant in a separate JVM [12] or in separate threads [55], with an appropriate timeout. However, these solutions cannot be directly applied to JPF while allowing parts of one exploration to be reused for another exploration, because these explorations need to be in the same JVM.

Fortunately, JPF already provides support to check for cycles in state spaces and thus detects some infinite loops in executions. A challenge is that a 'misbehaving' mutant could have a very long execution of one transition: the transition could be infinite and never produce a new state on which a cycle check could be performed. JPF was therefore extended to also check for extremely long transitions by counting all executed bytecode instructions. The limit is set such that the number of bytecode instructions executed by a mutant (on any one path) cannot be larger than twice the number of bytecode instructions executed by the original program (on the longest path). It is common to limit the length of mutant execution on the basis of the original execution, typically by measuring real time [12, 55] or counting loop iterations [56]; real time would not be appropriate for JPF, because JPF performs exploration rather than execution, and counting all bytecode instructions rather than only back edges of loops does not increase overhead much because JPF is an interpreter and already has relatively high overhead for each instruction. A similar check was added for the number of allocated objects, to ensure that JPF does not run out of memory because of a 'misbehaving' mutant.
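The instruction-budget guard might look roughly as follows; the class and method names and the exception used are illustrative, as the actual check lives inside JPF's interpreter loop:

```java
// Sketch of the transition-length guard described above: a 'misbehaving'
// mutant is cut off once it executes more than twice the instruction count
// observed on the longest path of the original program.
class InstructionBudget {
    private final long limit;   // 2 x instructions on the original's longest path
    private long executed;

    InstructionBudget(long originalLongestPathInstructions) {
        this.limit = 2 * originalLongestPathInstructions;
    }

    // Called for every bytecode instruction the mutant executes on this path.
    void onInstruction() {
        if (++executed > limit) {
            throw new IllegalStateException("mutant exceeded instruction budget");
        }
    }

    // Called when the exploration backtracks and starts a new path.
    void resetForNewPath() {
        executed = 0;
    }
}
```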
Finally, MuTMuT adds code to catch all the exceptions that the mutants could throw and to gracefully recover from them. Note that JPF can easily detect deadlocks (no transition enabled) and recover from them (in its JVM).

4.2. Multithreaded mutation operators

The programs to be explored with MuTMuT are mutated with the Javalanche [6] mutation tool. Javalanche only supported traditional sequential mutation operators, whereas the goal was to apply both sequential and multithreaded mutation operators [13–15]. Therefore, Javalanche was extended so that it can automatically apply the following multithreaded mutation operators:

'Remove synchronized (from a) block'
'Replace notify with notifyAll and vice versa'
'Replace join with sleep'
'Replace lock with unlock and vice versa'
The first three operators target Java features that have existed since the first version of Java, whereas the last operator targets the newer features added in the java.util.concurrent package since Java 5 [51, 52]. These multithreaded mutation operators were chosen among those proposed for Java 5 [14] because they were the most amenable to being applied as mutant schemata at the bytecode level and the most applicable to the programs on which the evaluation was performed. However, because Javalanche performs mutations at the level of Java bytecode rather than Java source code, the implementation of these operators (especially 'remove synchronized block') was non-trivial.

Consider, for example, the source code for the transfer method shown in Figure 11(a). This code is similar to the code in Figure 1(a) from Section 2, but instead of the transfer method being declared synchronized, an explicit synchronized (this) block is added. The 'remove synchronized block' operator can be applied twice in this method, once on the outer synchronized (this) block and once on the inner synchronized (other) block.
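At the source level, the two 'remove synchronized block' mutants could be encoded as mutant schemata roughly as follows; the activeMutant switch and the helper method are hypothetical stand-ins, since the actual implementation applies the schemata at the bytecode level:

```java
// Source-level illustration of mutant schemata for 'remove synchronized block';
// activeMutant selects which synchronized block is disabled (0 = original code).
class Account {
    static int activeMutant = 0;   // hypothetical schema switch
    int balance;

    Account(int balance) { this.balance = balance; }

    void transfer(Account other, int amount) {
        if (activeMutant == 1) {            // mutant 1: outer block removed
            doTransfer(other, amount);
        } else {
            synchronized (this) {
                doTransfer(other, amount);
            }
        }
    }

    private void doTransfer(Account other, int amount) {
        balance -= amount;
        if (activeMutant == 2) {            // mutant 2: inner block removed
            other.balance += amount;
        } else {
            synchronized (other) {
                other.balance += amount;
            }
        }
    }
}
```

Single-threaded, every schema variant behaves the same; only a multithreaded test exploring interleavings can observe the missing synchronization.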
Figure 11. A transfer method with nested synchronized blocks: (a) source (original and mutated) and (b) bytecode (original).
Figure 11(a) shows these two mutants. Applying this mutation operator on Java source code would be relatively straightforward: it would require construction of the corresponding AST, identification of the synchronized blocks within the AST, and finally modification of the AST to construct the mutant schemata as shown in Figure 11(a). However, applying this mutation operator on Java bytecode is more complicated because of the lack of syntactic block structures.

Figure 11(b) shows the bytecode corresponding to the transfer method from Figure 11(a). The synchronized blocks are compiled into the monitorenter and monitorexit bytecode instructions: the outer synchronized (this) is compiled to the instructions at offsets 4, 46, and 54, whereas the inner synchronized (other) is compiled to the instructions at offsets 19, 32, and 40. Note that there are multiple monitorexit instructions corresponding to each monitorenter instruction: a synchronized block can have multiple exit points (some regular and some exceptional), and a monitorexit instruction has to be executed before any exit from the block to ensure correct execution. To apply the 'remove synchronized block' operator at the bytecode level, all the monitorexit instructions have to be mapped to their corresponding monitorenter instruction so that they can be enabled/disabled together. This mapping is non-trivial and, in general, requires the construction of a control-flow graph. The implementation presented here performs the mapping in a more lightweight manner, using a heuristic to detect the last monitorexit for a corresponding monitorenter. The heuristic is based on the observation that the last monitorexit is always followed by a pattern that includes an athrow instruction, as shown in lines 40–43 and 54–57 in Figure 11(b).
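Over a simplified, flat instruction list, the heuristic might be sketched as follows; real bytecode carries offsets and operands, and the small lookahead window used here is an illustrative assumption, not the exact pattern the implementation matches:

```java
import java.util.*;

// Illustration of the heuristic described above: the 'last' monitorexit of a
// synchronized block is the one followed, within the compiler-generated
// exception-handler pattern, by an athrow instruction.
class MonitorExitMapper {
    // Returns the indices of monitorexit instructions that are followed by an
    // athrow within a small lookahead window (before any other monitorexit).
    static List<Integer> lastMonitorExits(List<String> instructions) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < instructions.size(); i++) {
            if (!instructions.get(i).equals("monitorexit")) continue;
            for (int j = i + 1; j < Math.min(i + 4, instructions.size()); j++) {
                if (instructions.get(j).equals("athrow")) {
                    result.add(i);
                    break;
                }
                if (instructions.get(j).equals("monitorexit")) break;
            }
        }
        return result;
    }
}
```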
This is because the Java compiler always includes a finally block while compiling synchronized blocks, to ensure that the monitor is exited in the event of an exceptional exit from the synchronized block.

Because of a fault in the version of JPF used by MuTMuT, one multithreaded mutation operator had to be manually applied (or rather 'unapplied'). More specifically, some mutants created by the 'remove synchronized block' operator caused JPF to crash when wait or notify was executed outside a synchronized block. As a result, three of the subject programs used in the evaluation were manually mutated to filter out the mutants that caused JPF to crash. All other programs were fully automatically mutated using the extended Javalanche. The authors of Javalanche have accepted the implementation of all the mutation operators listed earlier and included it in the publicly available repository.

5. EVALUATION

The key research question for MuTMuT is whether it can reduce the running time for mutant execution. The evaluation considers this question for both exhaustive exploration and CHESS exploration, as illustrated in Section 2. Additionally, this section presents exploration statistics such as the number of states, the number of transitions, and the memory required for mutant execution.

All experiments use JPF version 4.1 with the default configuration settings (for search strategy, state comparison, etc.), except that the heuristic vm.por.sync_detection is set to false because it prevented execution of some of the tests by CHESS exploration. All experiments that measure real time were performed on an Intel Pentium 4 (with hyper-threading) 3.4 GHz desktop, running Linux 2.6.28-15 and Sun's JVM 1.6.0_10.

5.1. Experimental setup

Figures 12 and 13 present statistics about the programs used in the evaluation for exhaustive and CHESS exploration, respectively. Most of these programs were used in previous studies on testing multithreaded code.
Account, Airline, Allocation, BubbleSort, LinkedList, NoRollback, and Shop are taken from an existing benchmark for testing multithreaded programs [49]. ArrayList and HashSet are taken from another existing benchmark for testing
program        | LOC | # tests | # mutants | # killed | # reached | mutation score (%)
Account        | 143 |      13 |         4 |        4 |         4 | 100.00
Airline        |  67 |       9 |        31 |       18 |        31 |  58.06
Allocation     | 195 |       9 |        34 |        8 |        27 |  23.52
ArrayList      | 217 |       5 |         8 |        5 |         8 |  62.50
BubbleSort     |  40 |       6 |        32 |       28 |        32 |  87.50
Buffer         |  85 |       4 |        35 |       19 |        35 |  88.57
HashSet        | 437 |       4 |        18 |        7 |        18 |  38.88
LinkedList     | 225 |      16 |        23 |       14 |        18 |  60.86
NoRollback     |  50 |       4 |        66 |       51 |        66 |  77.27
Shop           | 113 |       3 |        58 |       35 |        41 |  60.34
SleepingBarber | 223 |       6 |        14 |       12 |        14 |  85.71
TreeBag        | 152 |       9 |        38 |       29 |        33 |  76.31

Figure 12. Basic statistics about the programs used in the evaluation for exhaustive exploration.

program        | LOC | # tests | # mutants | # killed | # reached | mutation score (%)
Account        | 143 |      26 |         5 |        5 |         5 | 100.00
Airline        |  67 |      18 |        36 |       19 |        36 |  52.77
Allocation     | 195 |      18 |        37 |        9 |        30 |  24.32
ArrayList      | 217 |      10 |        14 |        5 |        11 |  35.71
BubbleSort     |  40 |      12 |        33 |       29 |        33 |  87.87
Buffer         |  85 |       8 |        40 |       33 |        40 |  82.50
HashSet        | 437 |       8 |        16 |       10 |        13 |  62.50
LinkedList     | 225 |      32 |        25 |       20 |        23 |  80.00
NoRollback     |  50 |       8 |        68 |       57 |        68 |  83.82
Shop           | 113 |       8 |        63 |       40 |        48 |  63.49
SleepingBarber | 223 |      12 |        16 |       12 |        16 |  75.00
TreeBag        | 152 |      18 |        39 |       34 |        39 |  87.17
Figure 13. Basic statistics about the programs used in the evaluation for CHESS exploration.
multithreaded programs [45]. SleepingBarber is taken from the Software-artifact Infrastructure Repository [57, 58]. Buffer and TreeBag were written by the authors while getting more familiar with multithreaded code and testing, but these programs were not specifically tailored for the evaluation of MuTMuT. These are small programs, with the number of non-blank, non-comment lines of code ranging from 40 to 437. The programs do the following:

Account is similar to the example in Section 2.
Airline is a simplified system for selling flight tickets, where each flight has some number of seats, say N, and if more customers try to buy tickets at once, only the first N should be able to do so.
Allocation is a simple memory manager that can allocate and free blocks in a multithreaded environment.
ArrayList is the implementation of an array list provided in an old version of the Java SDK.
BubbleSort implements the bubble sort algorithm [59] using multiple threads.
Buffer implements a bounded buffer with standard putItem and getItem operations called by some number of producers and consumers in parallel.
HashSet is the implementation of a hash set provided in an old version of the Java SDK.
LinkedList is a linked list data structure that can be used in multithreaded code.
NoRollback simulates transactions where checking whether the transaction is possible and booking an item are performed in one method, whereas purchasing an item is performed in another method.
Shop is a simplified shop inventory management program where suppliers add items to the shop's inventory and customers buy items from that inventory.
SleepingBarber is a simulation of a barber shop with a number of barbers that provide haircuts to a number of customers.
TreeBag implements an unsorted binary tree data structure that can have nodes with repeated values; its method find(int value, int numOfThreads) uses multiple threads to count how many nodes have a given value.
Each program has a number of multithreaded tests, as shown in the column '# tests'. Several of these tests came with the programs, some directly written in JUnit as illustrated in Section 2 and some written in main methods that were converted into JUnit. A number of tests were also written by the authors of this paper and colleagues, trying to exercise interesting parts of the code and not necessarily to kill all the mutants (many tests were written even before the mutants were generated). Note that the number of tests in Figure 13 is twice the number of tests in Figure 12: the initial set of tests [16] was extended while performing the additional experiments described in Section 5.5. Whereas the extended set of tests was used in the explorations with CHESS, only the initial set of tests was used in exhaustive exploration because of its high exploration cost.

The column '# mutants' shows the total number of mutants. Because sequential mutation operators apply in more places in code than multithreaded mutation operators, more mutants are sequential (i.e., created by a sequential mutation operator, although the code remains multithreaded) than multithreaded (i.e., created by a multithreaded mutation operator). Each program has at least one multithreaded mutant. Because exhaustive exploration takes much more time than CHESS, exhaustive exploration is run only with the sequential mutants, whereas CHESS is run with all the mutants.

The columns '# killed' and '# reached' show the number of mutants that the tests kill and reach, respectively. In the experiments, CHESS killed all the (sequential) mutants that the exhaustive exploration killed (Figure 12), although it could have happened that CHESS does not kill all the mutants because of the pruning, as discussed in Section 3.4. The mutation score for the tests is shown in the last column, 'mutation score (%)'. Recall that the tests were not chosen or created with the goal of killing all the mutants.
It is important to mention that Javalanche [6] uses selective mutation [60] to optimize mutation testing; hence, the number of mutants for each subject program may look smaller than expected. The basic idea behind selective mutation is to select a small set of mutation operators that gives a good approximation of the results that would be obtained by using all mutation operators. Specifically, Javalanche implements five mutation operators for sequential code: replace numerical constants, negate jump condition, replace arithmetic operator, replace method calls, and remove method calls.
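As a small illustration of one of these operators, 'negate jump condition' flips the branch condition of a method; the example below is hypothetical and not taken from the subject programs:

```java
// Illustration of Javalanche's 'negate jump condition' operator: the mutant
// negates the guard of the original method.
class Withdraw {
    static int withdrawOriginal(int balance, int amount) {
        if (balance >= amount) return balance - amount;
        return balance;
    }

    static int withdrawMutant(int balance, int amount) {
        if (!(balance >= amount)) return balance - amount; // negated condition
        return balance;
    }
}
```

A test asserting that withdrawing 4 from a balance of 10 yields 6 kills this mutant, because the mutant refuses the valid withdrawal.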
5.2. Running time

Figures 14 and 15 show the running time of MuTMuT with the Basic, FirstState, StateSet, and SomeState techniques for exhaustive exploration and for CHESS with bound 2 (this bound was shown to find many real bugs [20]). The time is presented in the h:mm:ss format; a marker next to a time indicates that the experiment hit the time limit (2 h per test) or the memory limit (1 GB per test). The average speedup is computed as the geometric mean over all examples that did not run out of time or memory.

The average speedup that StateSet achieves over Basic is 34.79% for exhaustive exploration and 29.09% for CHESS, and the speedup goes up to 48.91% (for BubbleSort) for exhaustive exploration and up to 39.14% (for LinkedList) for CHESS. In some cases for Allocation, there is a small slowdown compared with Basic because StateSet records more information during the exploration of the original code, and this information does not pay off during the exploration of the mutants (Section 5.3 discusses the other exploration statistics that show why these results happen for Allocation). For comparison purposes, the time for FirstState is also shown, but note that this technique has a higher time than StateSet. Moreover, FirstState is harder to implement than StateSet, so the use of FirstState is not recommended for mutation testing of multithreaded code.

The SomeState heuristic provides even more speedup than StateSet, but recall that SomeState could fail to kill some mutants. The average speedup that SomeState achieves over Basic is 42.76% for exhaustive exploration and 37.50% for CHESS, and the speedup goes up to 77.32% (for NoRollback) for exhaustive exploration and up to 75.36% (for SleepingBarber) for CHESS.
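A sketch of such an average-speedup computation follows; because the exact formula is not spelled out in this section, the sketch geometrically averages per-program time ratios (optimized time over Basic time) and reports the result as a speedup percentage, using made-up inputs rather than the paper's data:

```java
// Illustrative geometric-mean speedup over per-program time ratios.
class Speedup {
    static double averageSpeedupPercent(double[] basicSeconds, double[] optSeconds) {
        double logSum = 0;
        for (int i = 0; i < basicSeconds.length; i++) {
            // Ratio < 1 means the optimization was faster on this program.
            logSum += Math.log(optSeconds[i] / basicSeconds[i]);
        }
        double geoMeanRatio = Math.exp(logSum / basicSeconds.length);
        return 100.0 * (1.0 - geoMeanRatio);
    }
}
```

The geometric mean of ratios is the usual choice for averaging relative performance, because it is insensitive to which configuration is treated as the baseline.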
program        | Basic    | FirstState | speedup (%) | StateSet | speedup (%) | SomeState | speedup (%)
Account        | †2:02:03 | 1:26:50    | n/a         | 1:28:36  | n/a         | 1:22:54   | n/a
Airline        | 1:25:58  | 1:18:57    | 8.16        | 1:01:04  | 28.96       | 0:56:08   | 34.70
Allocation     | 0:54:41  | 0:59:27    | -8.72       | 0:56:23  | -3.11       | 0:38:01   | 30.48
ArrayList      | †2:09:31 | 1:04:47    | n/a         | 0:58:28  | n/a         | 0:22:17   | n/a
BubbleSort     | 1:51:03  | 1:05:26    | 41.08       | 0:56:44  | 48.91       | 0:37:38   | 66.11
Buffer         | †2:10:57 | †2:08:49   | n/a         | †2:06:18 | n/a         | †2:03:27  | n/a
HashSet        | 0:20:42  | 0:15:59    | 22.79       | 0:14:57  | 27.78       | 0:09:05   | 56.12
LinkedList     | †6:57:33 | 5:27:01    | n/a         | †4:46:45 | n/a         | 4:30:20   | n/a
NoRollback     | 0:18:31  | 0:15:48    | 14.67       | 0:11:18  | 38.97       | 0:04:12   | 77.32
Shop           | 0:51:56  | 0:39:38    | 23.68       | 0:37:34  | 27.66       | 0:33:18   | 35.88
SleepingBarber | †2:26:30 | †2:22:17   | n/a         | †2:20:23 | n/a         | †2:10:22  | n/a
TreeBag        | 1:42:50  | 1:36:27    | 6.21        | 1:28:30  | 13.94       | 1:19:39   | 22.54
Averages       |          |            | 21.00       |          | 34.79       |           | 42.76

Figure 14. Overall mutant execution time for exhaustive exploration († marks a run that hit the time or memory limit).
program        | Basic    | FirstState | speedup (%) | StateSet | speedup (%) | SomeState | speedup (%)
Account        | 0:06:29  | 0:05:30    | 15.17       | 0:05:37  | 13.37       | 0:05:33   | 14.40
Airline        | *5:08:26 | *0:15:24   | n/a         | *0:18:05 | n/a         | *0:20:06  | n/a
Allocation     | 0:38:35  | 0:43:24    | -12.48      | 0:40:18  | -4.45       | 0:30:44   | 20.35
ArrayList      | 0:09:10  | 0:07:13    | 21.27       | 0:06:09  | 32.91       | 0:04:36   | 49.82
BubbleSort     | 1:12:15  | 0:55:35    | 23.07       | 0:54:44  | 24.24       | 0:46:53   | 35.11
Buffer         | 0:30:38  | 0:31:35    | -3.10       | 0:24:12  | 21.00       | 0:15:21   | 49.89
HashSet        | 0:08:44  | 0:07:30    | 14.12       | 0:06:07  | 29.96       | 0:04:58   | 43.13
LinkedList     | 3:17:08  | 2:10:08    | 33.99       | 1:59:58  | 39.14       | 1:56:32   | 40.89
NoRollback     | 0:07:37  | 0:05:53    | 22.76       | 0:05:05  | 33.26       | 0:03:28   | 54.49
Shop           | 0:31:27  | 0:28:06    | 10.65       | 0:25:11  | 19.93       | 0:21:55   | 30.31
SleepingBarber | 1:26:30  | 1:15:49    | 12.35       | 1:00:58  | 29.52       | 0:21:19   | 75.36
TreeBag        | 0:11:33  | 0:11:46    | -1.88       | 0:10:35  | 8.37        | 0:09:16   | 19.77
Average        |          |            | 23.90       |          | 29.09       |           | 37.50

Figure 15. Overall mutant execution time for CHESS exploration (†, time limit; *, memory limit).
Note that CHESS takes much less time than exhaustive exploration (although CHESS is run for all the mutants and twice as many tests) because CHESS prunes away many thread schedules. However, for both exhaustive exploration and CHESS, StateSet provides a speedup over Basic on average, and SomeState provides even more speedup. This shows that the advanced techniques in MuTMuT can indeed reduce the running time for mutant execution.

5.3. Other exploration statistics

Figures 16–18 show additional exploration statistics for this set of experiments. For each program, technique, and exploration type, Figures 16–18 show, respectively, the memory that JPF/MuTMuT takes for mutant execution, the number of states explored (nodes in the state-space graph, as discussed in Section 2), and the number of transitions executed (edges in the state-space graph). Because of the way that JPF allocates memory, the memory requirements that JPF reports do not always correlate with the size of the exploration: JPF sometimes uses more memory for an exploration that has fewer states and transitions than for one that has more. Indeed, FirstState, StateSet, and SomeState sometimes appear to take more memory than Basic, although they have smaller explorations. For the AllStates technique, JPF runs out of memory in almost all the cases because AllStates attempts to collect the entire state-space graph in addition to a set of states for each mutant (Figure 8). For this reason, the detailed results for AllStates are not shown.
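The two size metrics reported below, states (nodes of the state-space graph) and transitions (edges), can be made concrete with a toy stateful depth-first search. This is a generic sketch of the bookkeeping under simplifying assumptions (explicit integer states, a hard-coded successor map), not JPF's implementation:

```java
import java.util.*;

// Toy stateful exploration: each unique state is counted once (a node of the
// state-space graph), whereas every executed transition is counted (an edge),
// even when it leads to an already-visited state. Not JPF's implementation.
public class ExplorationStats {
    static int statesVisited;
    static int transitionsExecuted;

    static void explore(Map<Integer, List<Integer>> successors, int initial) {
        statesVisited = 0;
        transitionsExecuted = 0;
        Set<Integer> seen = new HashSet<>();
        Deque<Integer> worklist = new ArrayDeque<>();
        seen.add(initial);
        worklist.push(initial);
        while (!worklist.isEmpty()) {
            int state = worklist.pop();
            statesVisited++;
            for (int next : successors.getOrDefault(state, Collections.emptyList())) {
                transitionsExecuted++;        // every edge is executed
                if (seen.add(next)) {         // but each state is explored once
                    worklist.push(next);
                }
            }
        }
    }
}
```

On a diamond-shaped graph where two schedules reconverge in a shared state, the second transition into that state increments only the transition count, since the state itself is matched.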
               |                Exhaustive                 |                 CHESS(2)
program        | Basic | FirstState | StateSet | SomeState | Basic | FirstState | StateSet | SomeState
Account        | †72   | 66         | 107      | 66        | 53    | 51         | 56       | 55
Airline        | 90    | 96         | 180      | 79        | *496  | *572       | *578     | *574
Allocation     | 146   | 139        | 203      | 148       | 129   | 122        | 131      | 123
ArrayList      | †129  | 98         | 218      | 97        | 68    | 57         | 68       | 68
BubbleSort     | 104   | 84         | 252      | 70        | 93    | 87         | 138      | 87
Buffer         | †127  | †132       | †244     | †131      | 74    | 79         | 75       | 89
HashSet        | 103   | 112        | 148      | 104       | 79    | 75         | 73       | 75
LinkedList     | †109  | †223       | 98       | 103       | 77    | 64         | 88       | 63
NoRollback     | 88    | 70         | 166      | 66        | 54    | 58         | 54       | 50
Shop           | 161   | 142        | 291      | 145       | 100   | 80         | 113      | 92
SleepingBarber | †126  | †97        | †241     | †143      | 97    | 100        | 126      | 78
TreeBag        | 74    | 81         | 158      | 71        | 54    | 52         | 55       | 57

Figure 16. Memory requirements per test, in MB (†, time limit; *, memory limit).
               |                    Exhaustive                     |                  CHESS(2)
program        | Basic    | FirstState | StateSet | SomeState      | Basic    | FirstState | StateSet | SomeState
Account        | †513503  | 367125     | 367125   | 367125         | 31004    | 25219      | 25219    | 25219
Airline        | 895089   | 789406     | 616112   | 571288         | *2734556 | *227932    | *187887  | *172739
Allocation     | 458963   | 453872     | 446132   | 293977         | 244074   | 242869     | 233050   | 165741
ArrayList      | †2070281 | 970126     | 906605   | 420487         | 120405   | 98190      | 79583    | 52766
BubbleSort     | 1507344  | 863894     | 759978   | 499976         | 1231238  | 874567     | 801741   | 680225
Buffer         | †2135631 | †2260608   | †2056650 | †2290549       | 639721   | 635570     | 483291   | 306793
HashSet        | 449168   | 322647     | 305444   | 180173         | 139783   | 112016     | 91858    | 72531
LinkedList     | †1753552 | 1318045    | †1068602 | 1135189        | 1113581  | 678146     | 645499   | 626091
NoRollback     | 370310   | 302359     | 218306   | 82283          | 144943   | 126543     | 98753    | 66075
Shop           | 1669267  | 1208766    | 1179756  | 1078422        | 677408   | 517611     | 504425   | 450866
SleepingBarber | †1662245 | †1330396   | †1272306 | †1491645       | 1466788  | 1464167    | 1200069  | 342537
TreeBag        | 723352   | 643825     | 593691   | 556060         | 89035    | 78137      | 76034    | 68189

Figure 17. States statistics (†, time limit; *, memory limit).
               |                    Exhaustive                     |                  CHESS(2)
program        | Basic    | FirstState | StateSet | SomeState      | Basic    | FirstState | StateSet | SomeState
Account        | †2617972 | 1819939    | 1819939  | 1819939        | 43600    | 35864      | 35864    | 35864
Airline        | 2796829  | 2463716    | 1917392  | 1784449        | *4144431 | *302349    | *233598  | *209344
Allocation     | 1042249  | 1030680    | 1011860  | 647249         | 351012   | 349474     | 334044   | 247075
ArrayList      | †4918402 | 2333911    | 2178137  | 934390         | 142286   | 117189     | 93602    | 63334
BubbleSort     | 5043994  | 2843137    | 2441532  | 1580341        | 1586764  | 1142174    | 1037553  | 870930
Buffer         | †6879738 | †6418802   | †5799120 | †6517062       | 719087   | 707516     | 539742   | 348074
HashSet        | 1078141  | 769309     | 727360   | 391046         | 175731   | 141168     | 114911   | 92095
LinkedList     | †7320214 | 5585016    | †4444439 | 4752832        | 1448208  | 899280     | 851251   | 824478
NoRollback     | 1048496  | 856935     | 604675   | 211475         | 192944   | 167794     | 128392   | 86442
Shop           | 3803971  | 2618922    | 2552821  | 2283805        | 855232   | 667098     | 649960   | 587705
SleepingBarber | †6265140 | †4950257   | †4720424 | †5585714       | 1721192  | 1717912    | 1386039  | 398238
TreeBag        | 1989112  | 1770186    | 1610919  | 1520529        | 107867   | 94623      | 91909    | 82510

Figure 18. Transition statistics (†, time limit; *, memory limit).
In terms of the number of states and transitions, FirstState achieves smaller explorations than Basic, StateSet achieves smaller explorations than FirstState, and SomeState achieves smaller explorations than StateSet, as expected. The reduction in these numbers strongly correlates with the reduction in the overall mutant exploration time, so estimating the exploration time on the basis of the size of the exploration graph provides a good approximation [53]. There are several interesting observations about the exploration numbers for Account. Note that all techniques but Basic have the same numbers of states and transitions, both for exhaustive exploration and for CHESS. The reason is that, in these cases, all the mutants that get reached are killed on the very first path on which they are reached, which means that the pruning techniques have nothing to prune in the exploration. Because the mutants are killed quickly, one may wonder why the explorations still take a relatively long time. The reason is that the entire original exploration
is performed for all the tests because MuTMuT dynamically discovers, during the original exploration, which mutants can be reached. Moreover, note that for Account, the overall exploration time for StateSet is slightly larger than the time for FirstState for exhaustive exploration (Figure 14), although they explore exactly the same states and transitions. This time difference is partly due to the somewhat higher overhead of StateSet compared with FirstState and partly due to noise in the experiments. There are also several interesting observations about the exploration numbers for Allocation. Note that the numbers of states and transitions for FirstState and StateSet are only slightly smaller than the numbers for Basic. The reason is that mutants are reached very early in the exploration on almost all paths. Recall the example in Figure 2(a), and suppose that some mutant could be reached when executing all the transitions from S1 to S2, from S1 to S6, and from S1 to S9. Then, both FirstState and StateSet would explore exactly the same state space, and only SomeState would explore less because it would prune the part of the state space after the first point where the mutant is reached. For Allocation, the state spaces for FirstState and StateSet are very similar to the state space for Basic, so these advanced techniques actually incur some slowdown in the overall exploration time (Figures 14 and 15 show negative speedup) because they have a higher overhead than Basic.

5.4. Additional tests

Recall that the tests used for the experiments so far were not chosen or created with the goal of killing all the mutants. This section describes a small additional experiment that was performed to kill more mutants. The experiment focuses on the ArrayList and HashSet subject programs because their explorations were among the fastest (for both exhaustive exploration and CHESS) and because the sources of these subjects were mutated manually because of a JPF fault (Section 4.2), so it was easier to inspect the results and devise new tests to kill more mutants. ArrayList had five reached but not killed mutants, and HashSet had six reached but not killed mutants (Figure 13; note that this section discusses experiments performed before doubling the set of tests for CHESS; therefore, the number of killed mutants is shown in Figure 12). For each of these programs, additional tests are added to kill five more mutants. Specifically, five tests are added for ArrayList and three tests for HashSet (as some of the latter tests could kill more than one mutant). Figures 19–23 show the results of the exploration for these new tests only, that is, not for the existing tests for these two subjects. We can see yet again that StateSet provides speedup over Basic and FirstState, although the absolute numbers are not high. We can also see that SomeState provides even more speedup. However, for both ArrayList and HashSet, there was one mutant (out of the five) that was killed by StateSet (and thus also by FirstState and Basic) but was not killed by SomeState. Note that these new tests were not specifically designed to illustrate such cases, but they do show how SomeState can fail to kill a mutant. It remains future work to study when and how to use a heuristic that can occasionally fail to kill a mutant.

5.5. Varying test suite size and orderings of tests

The last set of experiments analyzes different test suite sizes and orderings of tests. The goal is to compare the advanced MuTMuT techniques with Basic in a wider range of scenarios. Effectively, these scenarios simulate how tests could be incrementally added to a test suite by using
program   | Basic   | FirstState | speedup (%) | StateSet | speedup (%) | SomeState | speedup (%)
ArrayList | 0:00:59 | 0:00:52    | 11.86       | 0:00:47  | 20.34       | 0:00:43   | 27.12
HashSet   | 0:01:24 | 0:01:04    | 23.81       | 0:00:54  | 35.71       | 0:00:56   | 33.33
Average   |         |            | 16.80       |          | 26.95       |           | 30.07

Figure 19. Overall mutant execution time for exhaustive exploration (for additional tests).
program   | Basic   | FirstState | speedup (%) | StateSet | speedup (%) | SomeState | speedup (%)
ArrayList | 0:00:43 | 0:00:40    | 6.98        | 0:00:40  | 6.98        | 0:00:32   | 25.58
HashSet   | 0:00:59 | 0:00:45    | 23.73       | 0:00:41  | 30.51       | 0:00:40   | 32.20
Average   |         |            | 12.87       |          | 14.59       |           | 28.70

Figure 20. Overall mutant execution time for CHESS exploration (for additional tests).
          |                Exhaustive                 |                 CHESS(2)
program   | Basic | FirstState | StateSet | SomeState | Basic | FirstState | StateSet | SomeState
ArrayList | 86    | 73         | 73       | 74        | 73    | 67         | 66       | 67
HashSet   | 99    | 76         | 88       | 88        | 76    | 87         | 76       | 88

Figure 21. Memory requirements per test, in MB (for additional tests).
          |                Exhaustive                 |                 CHESS(2)
program   | Basic | FirstState | StateSet | SomeState | Basic | FirstState | StateSet | SomeState
ArrayList | 5186  | 5430       | 3965     | 2683      | 4176  | 3940       | 3233     | 2336
HashSet   | 25045 | 15273      | 15266    | 15171     | 20334 | 12882      | 12875    | 12797

Figure 22. States statistics (for additional tests).
          |                Exhaustive                 |                 CHESS(2)
program   | Basic | FirstState | StateSet | SomeState | Basic | FirstState | StateSet | SomeState
ArrayList | 11963 | 12533      | 8964     | 5587      | 6206  | 5937       | 4675     | 3530
HashSet   | 60604 | 35172      | 35159    | 34990     | 30348 | 19663      | 19650    | 19532

Figure 23. Transition statistics (for additional tests).
manual or automated test generation (recall that the entire MuTMuT evaluation uses only manually written tests because of the lack of tools/techniques for automated generation of multithreaded tests, as mentioned in Section 1). Each subject program (except Airline and ArrayList, where Basic often runs out of memory) is explored with all the MuTMuT techniques (except AllStates) using three test suite sizes: 50%, 75%, or 100% of the tests in Figure 13. In addition, for each test suite size, there are 10 runs that use a random ordering of the tests in the test suite. Therefore, there is a total of 120 runs (4 techniques, 3 test suite sizes, 10 orderings) for each of the subject programs evaluated. All these runs take a substantial amount of real time on JPF, so they were performed on a departmental computational cluster rather than on one dedicated computer. Because the cluster has machines with various configurations that can also have different workloads, real-time measurements would be unreliable. Hence, these experiments do not compare real time but do compare the numbers of both states and transitions for different test suite sizes and test orderings. Figure 24 shows the boxplots for states/transitions. Each boxplot summarizes a distribution of 30 values (for 3 test suite sizes and 10 orderings); the boxplot shows the median, the upper and lower quartiles, the minimum (lower quartile minus 1.5 times the interquartile distance), the maximum (upper quartile plus 1.5 times the interquartile distance), and the outliers. Each value in the boxplot is the ratio of the number of states/transitions for Basic over the number of states/transitions for one of the advanced MuTMuT techniques. Hence, a value of 1.0 would indicate no improvement, and higher values indicate more improvement. The median for FirstState ranges from 1.0 to below 1.5, the median for StateSet ranges from 1.0 to above 1.5, and the median for SomeState ranges from 1.0 to over 4.0.
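The boxplot convention described above can be computed with a short sketch; the linear-interpolation quantile method is an assumption of this sketch, since statistics packages differ slightly in how they pick quartiles:

```java
import java.util.*;

// Sketch of the Figure 24 boxplot statistics: quartiles, median, and whisker
// bounds at 1.5 times the interquartile distance. The linear-interpolation
// quantile method is an assumption of this sketch.
public class BoxplotStats {
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    // Returns {min whisker, lower quartile, median, upper quartile, max whisker};
    // values outside the whiskers would be plotted as outliers.
    static double[] boxplot(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        double q1 = quantile(v, 0.25);
        double med = quantile(v, 0.50);
        double q3 = quantile(v, 0.75);
        double iqr = q3 - q1;
        return new double[] { q1 - 1.5 * iqr, q1, med, q3, q3 + 1.5 * iqr };
    }
}
```

For the ratios plotted in Figure 24, a whole box sitting above 1.0 means the advanced technique consistently explored a smaller state space than Basic.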
In other words, the more advanced techniques provide more improvement, as expected. Interestingly, the boxplots for states and transitions are quite similar, indicating that the relative improvements provided by MuTMuT over Basic are the same for both metrics, although the absolute values differ (see, for example, Figures 17 and 18). Whereas some subject programs have a relatively large
[Figure 24 boxplots omitted; panels: (a) FirstState - states; (b) FirstState - transitions; (c) StateSet - states; (d) StateSet - transitions; (e) SomeState - states; (f) SomeState - transitions.]
Figure 24. Ratio of #states and #transitions for FirstState, StateSet, and SomeState over #states and #transitions for Basic. Plots include values for three different test suite sizes and 10 orderings of tests per size.
variation (e.g., Buffer), most others have low variation, showing that the MuTMuT techniques provide relatively consistent improvement over Basic, independent of the test suite size and ordering of tests. Moreover, the reader may wonder whether the MuTMuT techniques provide improvement only in the cases where a test kills some mutants or also in the cases of redundant tests, where a test being currently
executed does not kill any live mutant (because previously executed tests already killed the mutants that the current test can kill; i.e., under a different ordering, the current test could have killed some mutants). It is common for large test suites to have such redundant tests. MuTMuT was extended to count the percentage of redundant tests: across all the runs whose results are plotted in Figure 24, 61% of the tests are redundant (and across programs, the percentage ranges from 35% to 83%). Therefore, the MuTMuT techniques provide improvement even for redundant tests. Indeed, recall test1 and Mutant1 in Section 2; although test1 does not kill Mutant1, MuTMuT still reduces the state space to be explored. The improvement that MuTMuT provides depends on where in the state space the mutant is reached and not on whether it is killed; as a consequence, MuTMuT provides improvement even for equivalent mutants that cannot be killed by any test.

5.6. Threats to validity

This section gives a brief overview of the main threats to validity. Internal threats: The experiments were performed on JPF version 4.1 using the default settings (except for disabling vm.por.sync_detection), with the memory and time bounds described earlier. Changing the settings or bounds may lead to different results and conclusions. Moreover, the authors are not aware of any bugs in JPF or in the implementation of the MuTMuT techniques that may have affected the results of the experiments. External threats: As with any experimental study, the selected set of programs may not be fully representative. The overhead of the MuTMuT techniques may differ for programs with different execution lengths, numbers of mutants, or types of mutants. However, all the programs used in the study were created independently of MuTMuT, many have been used in previous studies, and all but two were written by people other than the authors of this paper.
The test cases used for the programs also constitute a threat to generalizing the results. The exploration time and the overhead of the MuTMuT techniques could depend on various characteristics of the test cases, including the number of threads created in the tests and the parts of the code that the tests exercise. Also, the order in which tests are executed could affect the results because a test case may never be executed on a given mutant if an earlier test already kills that mutant. To address this threat, a number of test cases with diverse characteristics were used for each subject program, these tests were mostly created by colleagues not familiar with the details of our work, and additional experiments were performed to evaluate the effect of different test suite sizes and orderings of test execution. Another external threat is the set of mutation operators used. In the experiments, all programs were mutated with the extended Javalanche mutation tool. Javalanche uses selective mutation and therefore includes only a small set of mutation operators for sequential code. However, the results obtained with these mutation operators have previously been shown to be representative of the results that would be obtained with all mutation operators in a number of cases [60]. Construct threats: Different metrics can be used to report the results of an exploration, and the interpretation of the results may differ depending on the metrics used. Also, the platform used to execute the experiments can affect the real-time results because cache size, available memory, number of CPU cores, and similar variables can affect the results of an exploration. For the reported results, a standard set of metrics is used, which has also been used in numerous previous studies on exploration, for example, [28, 53, 61, 62]. All the experiments for which real time is reported were executed on the same machine, under the same workload.
Because JPF is used to perform the exploration and is itself a single-threaded application, the number of CPU cores should not have a high impact on the exploration. Conclusion threats: The set of experiments that considers different orderings of test cases uses random seeds to define the orderings. There is a threat that the selected random orderings are not representative of all possible orderings. Another threat is the size of a test suite because different sizes can lead to different numbers of redundant test cases. To alleviate this threat, the experiments include results for three test suite sizes for each subject program.
6. RELATED WORK

Mutation testing was introduced over three decades ago [1, 2], and since then, much work has been carried out on mutation testing for various languages, including Ada, C, Cobol, C#, Fortran, Java, and SQL. An overview of this work can be found in the survey by Jia and Harman [5]. The main focus of previous work has been on mutation operators and on effective and efficient mutant generation. In contrast, MuTMuT focuses on mutant execution, in particular for multithreaded tests. Note that mutant generation and execution are orthogonal; MuTMuT can use any mutant generation tool, and rather than implementing a new tool, Javalanche [6, 63] was used. Because MuTMuT is implemented for Java (although the ideas translate to any other language with multithreading), this paragraph describes two more mutation tools for Java. MuJava [11] focuses on operators specific to Java and was among the first mutation tools for Java [55]. MuJava supports two types of mutation operators: class-level operators and method-level operators. Jumble [12] was developed with the primary goal of efficient mutant execution in a real industrial environment. Jumble implements several heuristics that try to order tests to kill mutants faster, and it avoids running tests that would kill the same mutants. MuTMuT also monitors the exploration and executes the tests only for the mutants that are live and reached with the current test. MuTMuT currently does not implement heuristics for ordering tests, but such heuristics could further improve the performance of MuTMuT. The Jumble paper [12] briefly mentions that Jumble can work with multithreaded code, but the paper provides no details about the thread schedules explored. Mutation has been considered for concurrent code [13], including for Java [14]. Carver [13] describes mutation in a multithreaded environment.
Instead of executing all possible thread schedules, the approach is to execute multithreaded code deterministically by specifying a set of thread schedules and requiring that, for each schedule, the original code and the mutants give equivalent results. This is a weaker notion of mutant killing than requiring that the set of all possible results produced by a mutant be the same as the set of all possible results produced by the original code (even if the results for some schedule differ). Another issue with this approach is that the tester has to specify the set of thread schedules. Bradbury et al. [14] define concurrency mutation operators for Java, specifically for J2SE 5.0. These operators modify thread or synchronization operations, for example, removing locks. To evaluate MuTMuT, Javalanche was modified to automatically apply some such operators. Bradbury et al. [15] also use mutation testing to measure the effectiveness and efficiency of various testing and formal analysis tools, including JPF. In contrast, the goal of MuTMuT is to speed up mutant execution by using JPF. Although the currently dominant programming model for concurrent code is the shared-memory multithreaded model, the message-passing model has recently been gaining popularity because it avoids some bugs such as low-level dataraces. Mutation operators have recently been proposed for message-passing programs written using agents or actors [64, 65]. Because executing a test for a message-passing program also requires exploration [66], but over many possible message delivery schedules rather than thread schedules, mutation testing for message-passing programs can be as expensive as it is for multithreaded programs. The techniques proposed in MuTMuT could also be applied to make mutant execution for message-passing programs more efficient.
The authors recently developed a state-space exploration framework for message-passing programs [66] written using actors [67] and plan to extend it for mutation. Another issue in mutation testing, besides mutant generation and mutant execution, is finding equivalent mutants. Although determining program equivalence is undecidable in general, recent work [6, 7] shows how to rank mutants by the likelihood that they are non-equivalent to the original code. This ranking helps in prioritizing the inspection of live mutants to generate new test cases that kill those mutants. That work is orthogonal to MuTMuT, which focuses on efficient mutant execution; the projects also differ in focusing on sequential versus multithreaded code. Fleyshgakker and Weiss present efficient mutation analysis [68], which shares MuTMuT's goal of speeding up mutant execution. There is further similarity in matching states, as in the AllStates technique. However, the key difference is that their work targets the execution of sequential tests and so does not consider the issues that arise in the exploration of multithreaded tests.
The techniques for regression model checking (RMC) [69] and incremental state-space exploration (ISSE) [70] are conceptually similar to MuTMuT because RMC and ISSE speed up exploration by reusing some results. However, RMC and ISSE reuse exploration during code evolution, not during mutation testing. Additionally, ISSE works only for sequential code and not for multithreaded code. There has been a recent resurgence of interest in higher-order mutation testing [4, 71], which uses so-called higher-order mutation operators that perform multiple small syntactic changes, as opposed to the traditional first-order mutation operators that perform just one small syntactic change to create a mutant. Higher-order mutants can also be applied in the form of mutant schemata, where every small change that is part of a single higher-order mutant is guarded by the same check (i.e., the same mutant id). Therefore, MuTMuT would also support higher-order mutants because it would simply consider the first state reaching any of the changes belonging to a higher-order mutant as the first state reaching that mutant.

7. CONCLUSIONS

Testing multithreaded code is becoming more important as such code is developed and used more frequently. Mutation testing is a well-established testing approach, but one of its major costs is the time to execute a test suite on multiple mutants. This cost is exacerbated for multithreaded code, where each test needs to be executed for many thread schedules. This paper presented an efficient exploration framework, MuTMuT, that can significantly reduce the time for mutant execution of multithreaded code. The experiments show that the MuTMuT StateSet technique speeds up MuTMuT Basic by up to 48.91%, whereas the MuTMuT SomeState technique speeds up MuTMuT Basic by up to 77.32%.

ACKNOWLEDGEMENTS
The authors would like to thank Yaniv Eytani for providing many of the multithreaded programs used in the evaluation, Sandro Badame and Yu Lin for writing several tests for these programs, and David Schuler and Andreas Zeller for helping with Javalanche. This material is based upon work partially supported by the National Science Foundation under Grant Nos. CCF-0916893 and CCF-0746856, and by Intel and Microsoft under the Universal Parallel Computing Research Center. Milos Gligoric was supported by the Saburo Muroga fellowship.

REFERENCES

1. Hamlet RG. Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering Jul 1977; 3(4):279–290. 2. DeMillo RA, Lipton RJ, Sayward FG. Hints on test data selection: help for the practicing programmer. Computer 1978; 11(4):34–41. 3. Offutt J, Untch R. Mutation 2000: uniting the orthogonal. Proceedings of Mutation 2000: Mutation Testing in the Twentieth and the Twenty First Centuries, San Jose, CA, USA, 2000; 34–44. 4. Ammann P, Offutt J. Introduction to Software Testing. Cambridge University Press: New York, NY, USA, 2008. 5. Jia Y, Harman M. An analysis and survey of the development of mutation testing. Technical Report TR-09-06, King's College London, 2009. (To appear in IEEE Transactions on Software Engineering). 6. Schuler D, Dallmeier V, Zeller A. Efficient mutation testing by checking invariant violations. Proceedings of the 18th International Symposium on Software Testing and Analysis (ISSTA 2009), 2009; 69–80. 7. Schuler D, Zeller A. (Un-)covering equivalent mutants. Proceedings of the 3rd International Conference on Software Testing, Verification, and Validation (ICST 2010), Paris, France, 2010; 45–54. 8. Kim S, Clark J, McDermid J. The rigorous generation of Java mutation operators using HAZOP. 12th International Conference on Software & Systems Engineering and their Applications (ICSSEA 99), 1999; 9–10. 9. Kim S, Clark J, McDermid J.
Class mutation: mutation testing for object oriented programs. Proceedings of the Net.ObjectDays Conference on Object-Oriented Software Systems (OOSS 2000), 2000. 10. Ma YS, Kwon YR, Offutt J. Inter-class mutation operators for Java. Proceedings of the 13th International Symposium on Software Reliability Engineering (ISSRE 2002), 2002; 352–363. 11. Ma YS, Offutt J, Kwon YR. MuJava: a mutation system for Java. Proceedings of the 28th International Conference on Software Engineering (ICSE 2006), 2006; 827–830. 12. Irvine SA, Pavlinic T, Trigg L, Cleary JG, Inglis S, Utting M. Jumble Java byte code to measure the effectiveness of unit tests. Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007), 2007; 169–175.
13. Carver RH. Mutation-based testing of concurrent programs. Proceedings of the IEEE International Test Conference on Designing, Testing, and Diagnostics (ITC 1993), 1993; 845–853.
14. Bradbury JS, Cordy JR, Dingel J. Mutation operators for concurrent Java (J2SE 5.0). Proceedings of the Second Workshop on Mutation Analysis (MUTATION 2006), 2006; 83–92.
15. Bradbury JS, Cordy JR, Dingel J. Comparative assessment of testing and model checking using program mutation. Proceedings of the Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007), 2007; 210–222.
16. Gligoric M, Jagannath V, Marinov D. MuTMuT: efficient exploration for mutation testing of multithreaded code. Proceedings of the 3rd International Conference on Software Testing, Verification, and Validation (ICST 2010), Paris, France, 2010; 55–64.
17. Long B, Hoffman D, Strooper PA. Tool support for testing concurrent Java components. IEEE Transactions on Software Engineering 2003; 29(6):555–566.
18. Pugh W, Ayewah N. Unit testing concurrent software. Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2007), 2007; 513–516.
19. Dantas A, Brasileiro FV, Cirne W. Improving automated testing of multi-threaded software. Proceedings of the 1st International Conference on Software Testing, Verification, and Validation (ICST 2008), 2008; 521–524.
20. Musuvathi M, Qadeer S. Iterative context bounding for systematic testing of multithreaded programs. Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2007), 2007; 446–455.
21. Burckhardt S, Kothari P, Musuvathi M, Nagarakatte S. A randomized scheduler with probabilistic guarantees of finding bugs. Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2010), 2010; 167–178.
22. Jagannath V, Gligoric M, Jin D, Luo Q, Rosu G, Marinov D. Improved multithreaded unit testing. Proceedings of the 8th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2011), Szeged, Hungary, 2011; 223–233.
23. Bron A, Farchi E, Magid Y, Nir Y, Ur S. Applications of synchronization coverage. Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2005), 2005; 206–212.
24. Park S, Lu S, Zhou Y. CTrigger: exposing atomicity violation bugs from their hiding places. Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2009), 2009; 25–36.
25. Godefroid P. Model checking for programming languages using VeriSoft. Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL 1997), Paris, France, 1997; 174–186.
26. Visser W, Havelund K, Brat G, Park S. Model checking programs. Automated Software Engineering Journal 2003; 10:203–232.
27. Yang Y, Chen X, Gopalakrishnan G, Kirby RM. Efficient stateful dynamic partial order reduction. Proceedings of the 15th International SPIN Workshop on Model Checking Software (SPIN 2008), 2008; 288–305.
28. Holzmann GJ, Joshi R, Groce A. Tackling large verification problems with the swarm tool. Proceedings of the 15th International SPIN Workshop on Model Checking Software (SPIN 2008), 2008; 134–143.
29. Musuvathi M, Qadeer S, Ball T, Basler G, Nainar PA, Neamtiu I. Finding and reproducing heisenbugs in concurrent programs. Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI 2008), 2008; 267–280.
30. Rabinovitz I, Grumberg O. Bounded model checking of concurrent programs. Proceedings of the 17th International Conference on Computer Aided Verification (CAV 2005), 2005; 82–97.
31. Stoller SD. Model-checking multi-threaded distributed Java programs. Proceedings of the 7th International SPIN Workshop on SPIN Model Checking and Software Verification (SPIN 2000), 2000; 224–244.
32. Bruening D, Chapin J. Systematic testing of multithreaded programs. Technical Report, MIT/LCS Technical Memo LCS-TM-607, 2000.
33. Sen K. Race directed random testing of concurrent programs. Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2008), 2008; 11–21.
34. Musuvathi M, Park DYW, Chou A, Engler DR, Dill DL. CMC: a pragmatic approach to model checking real code. Proceedings of the 5th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2002), 2002; 75–88.
35. Vjukov DS. Relacy Race Detector, 2009. http://groups.google.com/group/relacy.
36. DeMillo RA, Offutt AJ. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering 1991; 17(9).
37. Ayari K, Bouktif S, Antoniol G. Automatic mutation test input data generation via ant colony. Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation (GECCO 2007), 2007; 1074–1081.
38. Fraser G, Zeller A. Mutation-driven generation of unit tests and oracles. Proceedings of the 19th International Symposium on Software Testing and Analysis (ISSTA 2010), 2010; 147–158.
39. Papadakis M, Malevris N, Kallia M. Towards automating the generation of mutation tests. Proceedings of the 5th Workshop on Automation of Software Test (AST 2010), 2010; 111–118.
40. Papadakis M, Malevris N. Automatic mutation test case generation via dynamic symbolic execution. Proceedings of the 21st International Symposium on Software Reliability Engineering (ISSRE 2010), 2010; 121–130.
41. Zhang L, Xie T, Zhang L, Tillmann N, de Halleux J, Mei H. Test generation via dynamic symbolic execution for mutation testing. Proceedings of the International Conference on Software Maintenance (ICSM 2010), 2010; 1–10.
42. Pacheco C, Lahiri SK, Ernst MD, Ball T. Feedback-directed random test generation. Proceedings of the 29th International Conference on Software Engineering (ICSE 2007), 2007.
43. Godefroid P, Klarlund N, Sen K. DART: directed automated random testing. Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2005), 2005; 213–223.
44. Gligoric M, Gvero T, Jagannath V, Khurshid S, Kuncak V, Marinov D. Test generation through programming in UDITA. Proceedings of the 32nd International Conference on Software Engineering (ICSE 2010), 2010; 225–234.
45. Sen K, Agha G. CUTE and jCUTE: concolic unit testing and explicit path model-checking tools. Proceedings of the 18th International Conference on Computer Aided Verification (CAV 2006), 2006; 419–423.
46. Burckhardt S, Dern C, Musuvathi M, Tan R. Line-up: a complete and automatic linearizability checker. Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2010), 2010; 330–340.
47. JPF home page. http://babelfish.arc.nasa.gov/trac/jpf/.
48. Clarke EM, Grumberg O, Peled DA. Model Checking. MIT Press: Cambridge, MA, USA, 1999.
49. Eytani Y, Havelund K, Stoller SD, Ur S. Towards a framework and a benchmark for testing tools for multi-threaded programs. Concurrency and Computation: Practice and Experience 2007; 19(3):267–279.
50. Gamma E, Beck K. JUnit, 2003. http://www.junit.org.
51. Java Specification Requests (JSR) 166: Concurrency Utilities, Test Code. http://gee.cs.oswego.edu/cgi-bin/viewcvs.cgi/jsr166/src/test/.
52. Sun Microsystems. Java 2 Platform Standard Edition 5.0 API Specification. http://java.sun.com/j2se/1.5.0/docs/api/.
53. Dwyer MB, Elbaum S, Person S, Purandare R. Parallel randomized state-space search. Proceedings of the 29th International Conference on Software Engineering (ICSE 2007), 2007; 3–12.
54. Untch RH, Offutt AJ, Harrold MJ. Mutation analysis using mutant schemata. Proceedings of the 1993 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 1993), 1993; 139–148.
55. Marinov D. Automatic testing of software with structurally complex inputs. PhD Thesis, MIT, 2004.
56. Langdon WB, Harman M, Jia Y. Efficient multi objective higher order mutation testing with genetic programming. Journal of Systems and Software 2010; 83:2416–2430.
57. Do H, Elbaum SG, Rothermel G. Supporting controlled experimentation with testing techniques: an infrastructure and its potential impact. Empirical Software Engineering 2005; 10(4):405–435.
58. Software-artifact Infrastructure Repository. SIR home page. http://sir.unl.edu/portal/index.html.
59. Cormen TH, Leiserson CE, Rivest RL. Introduction to Algorithms. MIT Press: Cambridge, MA, USA, 1990.
60. Offutt AJ, Lee A, Rothermel G, Untch RH, Zapf C. An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering and Methodology 1996; 5:99–118.
61. Yang J, Chen T, Wu M, Xu Z, Liu X, Lin H, Yang M, Long F, Zhang L, Zhou L. MODIST: transparent model checking of unmodified distributed systems. Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI 2009), 2009.
62. Yabandeh M, Knezevic N, Kostic D, Kuncak V. CrystalBall: predicting and preventing inconsistencies in deployed distributed systems. Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI 2009), 2009.
63. Schuler D, Zeller A. Javalanche: efficient mutation testing for Java. Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2009), 2009; 297–298.
64. Adra S, McMinn P. Mutation operators for agent-based models. Proceedings of the 3rd International Conference on Software Testing, Verification, and Validation Workshops (ICSTW 2010), 2010; 151–156.
65. Jagannath V, Gligoric M, Lauterburg S, Marinov D, Agha G. Mutation operators for actor systems. Proceedings of the 3rd International Conference on Software Testing, Verification, and Validation Workshops (ICSTW 2010), 2010; 157–162.
66. Lauterburg S, Dotta M, Marinov D, Agha G. A framework for state-space exploration of Java-based actor programs. Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering (ASE 2009), 2009; 468–479.
67. Agha G. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press: Cambridge, MA, USA, 1986.
68. Fleyshgakker VN, Weiss SN. Efficient mutation analysis: a new approach. Proceedings of the 1994 ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 1994), 1994; 185–195.
69. Yang G, Dwyer MB, Rothermel G. Regression model checking. Proceedings of the 25th IEEE International Conference on Software Maintenance (ICSM 2009), Edmonton, Alberta, Canada, 2009; 115–124.
70. Lauterburg S, Sobeih A, Viswanathan M, Marinov D. Incremental state-space exploration for programs with dynamically allocated data. Proceedings of the 30th International Conference on Software Engineering (ICSE 2008), 2008; 291–300.
71. Jia Y, Harman M. Higher order mutation testing. Information and Software Technology 2009; 51(10):1379–1393.