Software testing with code-based test generators: data ... - Springer Link

4 downloads 25318 Views 584KB Size Report
Aug 2, 2013 - Code-based test generators rely on the source code of the software under ..... of the test suites, while delivering test cases with good chances of.
Software Qual J (2014) 22:311–333 DOI 10.1007/s11219-013-9207-1

Software testing with code-based test generators: data and lessons learned from a case study with an industrial software component Pietro Braione • Giovanni Denaro • Andrea Mattavelli Mattia Vivanti • Ali Muhammad



Published online: 2 August 2013  Springer Science+Business Media New York 2013

Abstract Automatically generating effective test suites promises a significant impact on testing practice by promoting extensively tested software within reasonable effort and cost bounds. Code-based test generators rely on the source code of the software under test to identify test objectives and to steer the test case generation process accordingly. Currently, the most mature proposals on this topic come from the research on random testing, dynamic symbolic execution, and search-based testing. This paper studies the effectiveness of a set of state-of-the-research test generators on a family of industrial programs with nontrivial domain-specific peculiarities. These programs are part of a software component of a real-time and safety-critical control system and integrate in a control task specified in LabVIEW, a graphical language for designing embedded systems. The result of this study enhances the available body of knowledge on the strengths and weaknesses of test generators. The empirical data indicate that the test generators can truly expose subtle (previously unknown) bugs in the subject software and that there can be merit in using different types of test generation approaches in a complementary, even synergic fashion. Furthermore, our experiment pinpoints the support for floating point arithmetics and nonlinear

P. Braione (&)  G. Denaro Universita` degli Studi di Milano-Bicocca, viale Sarca 336, 20126 Milan, Italy e-mail: [email protected] G. Denaro e-mail: [email protected] A. Mattavelli  M. Vivanti Universita` della Svizzera Italiana, via Buffi 13, 6900 Lugano, Switzerland e-mail: [email protected] M. Vivanti e-mail: [email protected] A. Muhammad VTT Technical Research Centre of Finland, Tampere, Finland e-mail: [email protected]

123

312

Software Qual J (2014) 22:311–333

computations as a major milestone in the path to exploiting the full potential of the prototypes based on symbolic execution in industry. Keywords

Structural testing  Automatic test generation  Experimental study

1 Introduction Testing is an essential verification and validation activity, and the most popular way of assessing the quality of software in industry (Myers et al. 2004). As such, testing keeps on attracting increasing attention in the research and academic community. Testing challenges researchers and practitioners in many ways. A critical challenge is automating the generation of test suites to assist developers in achieving adequate testing and fostering deep explorations of the program state spaces beyond the limited scope of manual testing. Automatic test generation techniques and tools (also called test generators throughout this paper) promise significant impact on the testing practice by promoting extensively tested software within reasonable effort and cost bounds. In general, an automatic test generation technique works by pursuing a set of test objectives identified after some (machine-readable) description of the software under test. Model-based test generators refer to models formalized out of the software specification, and devise test cases to exercise the behaviors represented in the models. For example, for a software system specified as a state machine, a test generator may attempt to generate test cases that execute all the state transitions of the state-machine model. Code-based test generators identify test objectives on the source code of the software under test, and steer the test case generation process accordingly. For example, some test generators may attempt to execute all program statements or sample the domain of an input variable declared in the code. It is well known that model- and code-based test generators have complementary strengths and weaknesses that depend on the amount of system knowledge and implementation details represented in the reference software artifacts, respectively (Pezze` and Young 2007). Furthermore, the viability of the model-based approaches is subject to the availability of suitable models, which may not always exist or be designed for all systems. In the remainder of this paper, we limit the scope of our discussion to codebased test generators. Research on code-based test generators dates back to the seventies (King 1976; Miller and Spooner 1976) and over the years has produced a wide panorama of approaches. We can classify these approaches based on the high-level mechanisms that underly the test generation algorithms. Random test generators model the input space of the program by a set of random variables and generate test cases by extracting random samples accordingly. Systematic test generators derive the executability conditions of the program paths through some type of program analysis and solve these conditions to obtain test cases that follow such paths. Search-based test generators rely on fitness functions to model the ‘‘distance’’ of populations of test inputs from given testing objectives, for example covering all branches of the software under test, and search algorithms to find test suites that minimize such distance. This classification is not meant to be exhaustive; hybrid techniques have been proposed to overcome the shortcomings of each. In recent years, the steady research improvements on fundamental techniques and analyses for test generation has led to a proliferation of approaches in the software

123

Software Qual J (2014) 22:311–333

313

engineering literature, for example Csallner and Smaragdakis (2004), Tonella (2004), Godefroid et al. (2005), Sen et al. (2005), Cadar et al. (2006, 2008), Pacheco et al. (2007), Anand et al. (2007), Burnim and Sen (2008), Ciupa et al. (2008), Inkumsah and Xie (2008), Tillmann and de Halleux (2008), Xie et al. (2009), Lakhotia et al. (2010), Baluda et al. (2010), McMillan (2010). Many of these approaches are implemented as freely available research prototypes, and even integrated and distributed as parts of commercial environments for assisting software development. As automated test generators are slowly making their way from research to industrial practice, the question arises as to whether they are actually able to deliver the promised advantages. This stance is motivated by an analysis of the empirical studies available in the literature that have previously tried to assess the effectiveness of automated test generators, for example the empirical studies presented in Csallner and Smaragdakis (2004), Godefroid et al. (2005), Cadar et al. (2006, 2008), Pacheco et al. (2007), Majumdar and Sen (2007), Ciupa et al. (2008), Godefroid et al. (2008a), Burnim and Sen (2008), Tillmann and de Halleux (2008), Xie et al. (2009), Lakhotia et al. (2009), Xie et al. (2009), McMillan (2010). Some of these studies only consider student programs or general-purpose programming libraries, which are hardly representative of industrial software. Most studies compel test generators to expose only specific types of faults, for example program crashes or violations of assertions embedded in the code. In general, test oracles are an important issue when using code-based test generators, since the oracles must be provided independently of the test generator, and may affect fault detection, but there is very little evidence of the effectiveness of code-based test generators in exposing faults that do not manifest as crashes, or cannot be captured adequately (or economically) by assertions. Some studies assess the relative effectiveness of test generators based on code coverage indicators. While coverage is an interesting proxy measure of the adequacy of testing, so far it is unclear to what extent it correlates with the ability to elicit failures and reveal faults. In general, experimental data on the level of coverage accomplished through a testing approach provide weak feedback on the real effectiveness of the approach. Floating point computations are another source of frequently overlooked challenges. As an example, most systematic test generation approaches rely on constraint solvers that at the state of the art provide very limited (if any) reasoning on floating point variables and arithmetics. As another example, floating point divisions by zero do not by themselves determine the programs to crash, but rather produce special values (according the IEEE 754 floating point standard), which may propagate as silent failures and require a manual inspection of the program outputs to be detected. To the best of our knowledge, we are aware of very limited data on the extent to which the approaches to automatic test generation proposed in the literature can cope with programs that involve nontrivial portions of floating point inputs and computations. This paper contributes an empirical study that confronts a set of state-of-the-research test generators on a family of programs with nontrivial domain-specific peculiarities. 
The subject of study is a software component of a real-time and safety-critical control system that remotely operates maintenance robots within a test facility to validate the maintenance operations of the ITER nuclear fusion plant. The component is programmed in C, implements several nonlinear computations, and integrates within a time-dependent control task specified in LabVIEW, which is a graphical language used to design embedded systems. We consider four programs that are incremental versions from the version history of this component. The study pursues experimental data to answer research questions on (1) the effectiveness of the test generators in exposing failures of the subject programs, and (2) the

123

314

Software Qual J (2014) 22:311–333

relative strength and weakness of the different test generation approaches and research prototypes. We also gain experience with an ensemble of test generators designed on purpose for this study. Our study challenges the considered test generators in many ways. First, the subject programs accept as input floating point variables and involve floating point arithmetics to a large extent. Systematic test generators offer limited support in coping with floating point arithmetics. Rather than sidestepping the issue, we analyze the suitability of workarounds based on modeling floating point computations over the domain of integers. Second, we target domain-specific failures that are not specified as code assertions, since writing assertions is not a common practice in the referenced industrial domain (and probably in many other industrial domains), and in fact, our subject programs contain no assertions. During the study, we also discovered that, while failures where indeed happening throughout the execution of the generated test cases against the subject programs, no failure manifested as a crash of the program under test, and we could only rely on manual oracles in order to pinpoint any failure. In this respect, our study sheds some light on whether and how the test generators can effectively be used in this type of nonfacilitated context. As a whole, the results reported in this paper contribute novel empirical evidence that test generators can expose both unknown (and subtle) bugs in the industrial programs considered, confirming the potential of test generators, but also highlighting their current limitations and obstacles to applicability. We also show that there can be merit in combining different types of test generation approaches in complementary, even synergic fashion. The paper is organized as follows. Section 2 provides an overview of the literature and the test generation approaches related to our study. Section 3 outlines the research questions specifically addressed in our study. Section 4 presents the subject programs and the experimental setup. Section 5 discusses the results of the experiment in answering the research questions. Section 6 describes the threats to the validity of our results, and our strategies for mitigating the impact of these threats. Section 7 concludes the paper with final remarks.

2 Related work Current approaches to the automation of test generation stem from the research results in the fields of random testing, systematic test generation, and search-based software engineering. Random testing consists of randomly sampling the input space of the program, and running the program against the obtained random input. Random testing is possibly the least expensive technique, both to implement and to put into operation. Most random test generators do not require any detailed analysis of the program structure, quickly finding large numbers of test cases almost out of the box. Moreover, random testing can be used both for searching for faults and for measuring software quality (Pezze` and Young 2007). On the negative side, random testing is largely unable to elicit program behaviors associated with small or singular subsets of the inputs. This yields in many practical cases early saturation of random test case generators, low code coverage, and big, redundant, hard-tomanage test suites. The techniques proposed in the literature, for example by Csallner and Smaragdakis (2004), Pacheco et al. (2007), and Godefroid et al. (2008a), essentially differ in the sampling strategies adopted for excluding meaningless or redundant input

123

Software Qual J (2014) 22:311–333

315

combinations, and for increasing the likelihood of triggering failures, or covering yetuncovered code elements. Recent research considers dynamic sampling of input space sampling and feedback from test case execution, to control the distribution of tests over the input domain (Pacheco et al. 2007; Ciupa et al. 2008). Systematic test generation approaches (Ferguson and Korel 1996; Ball 2003; Cadar et al. 2008) consider a (not necessarily finite) abstract representation of the program state space, usually produced by analyzing the program under test statically, and build test cases to progressively cover every element in the structure. These approaches usually embody some variant of symbolic execution (King 1976), a program analysis technique that simulates the execution of programs over symbolic data. Symbolic execution computes path conditions, that is, the executability conditions of a given static control flow paths as a function of the program inputs. Test generators based on symbolic execution usually enumerate the static control flow paths in the program and solve the associated path conditions by invoking an external procedure, yielding test cases eliciting a specific static path. If a path condition has no solution, the associated static path is discarded as infeasible. The potential advantage of a systematic approach with respect to random testing is the ability of the former in eliciting behaviors that are not easily discovered by the latter. By directly analyzing the state space, rather than the input space, structural testing can elicit behaviors that are associated with very small input sets. On the downside, structural approaches are computationally more expensive than random ones, and their actual precision strongly depends on how easy it is to reason on the state space. This, in turn, essentially depends on the speed and the class of formulas that the external solver is able to solve (its theory). Another issue with systematic approaches is that they can get stuck in the exhaustive exploration of state space regions that do not contribute to increasing the coverage. As an example, the approaches based on symbolic execution easily diverge on programs with infinite static paths, for example, in the presence of loops. An important category of systematic approaches is that based on dynamic symbolic execution (Godefroid et al. 2005, 2008b; Sen et al. 2005; Majumdar and Sen 2007; Artzi et al. 2008; Burnim and Sen 2008; Tillmann and de Halleux 2008; Paˇsaˇreanu et al. 2011). Also known as concolic execution and directed random testing, dynamic symbolic execution is a hybrid approach integrating (static) symbolic execution and information from (dynamic) test execution. Dynamic symbolic execution assumes an initial test suite, possibly obtained by random testing or designed by test analysts. It executes symbolically the program along the feasible control flow paths exercised by the tests, trims the resulting path conditions at selected branches, negates the clause at the trim point, solves the resulting predicates to generate new test inputs along not-yet-covered paths, and feeds them back to the next iteration of the procedure. Dynamic symbolic execution combines the ability of symbolic execution to elicit behaviors associated with small sets of inputs and at the same time reduces by half the number of invocations to the external solver (Braione et al. 2012). 
Additionally, the test generators based on this approach exploit the generated test inputs to approximate path predicates not in the solver’s theory with solvable ones. As a shortcoming, dynamic symbolic execution is prone to diverging when applied to a program with infinite control flow paths, as it happens in the case of all the approaches based on the exhaustive exploration of an infinite abstraction of the state space. Introduced more than 30 years ago by Miller and Spooner (1976), the idea of searchbased testing has catalyzed the attention of researchers only in the last decade (Korel 1990; Sthamer 1996; Michael and McGraw 1998; Pargas et al. 1999; Michael et al. 2001; Tonella 2004; Lakhotia et al. 2010; Fraser and Arcuri 2011). Search-based approaches

123

316

Software Qual J (2014) 22:311–333

consider test case generation as an optimization problem, aimed at maximizing a suitable fitness measure, which reflects ‘‘how close’’ a test input is to a given coverage objective. For example, the technique originally proposed by Miller and Spooner (1976) aims at covering a given program path in the context of structural testing. The authors experimented with different fitness function definitions, all obtained by normalizing path conditions to the form c0  0 ^ c1  0 ^    ^ cn  0; and defined as continuous functions that assume positive values on the test inputs that make all the ci positive, for example the function minðc0 ; c1 ; . . .; cn Þ. Once defined a fitness measure, search-based approaches rely on some metaheuristic technique to search the input that maximizes fitness. Hill climbing, simulated annealing, and evolutionary (genetic) algorithms are the metaheuristics customarily used by search-based test generators. Search-based approaches are very flexible with respect to the adequacy criteria of choice. Structural coverage, execution time, and some classes of functional requirements are three examples of criteria to which search-based approaches have been applied. The only requirement is that a ‘‘distance’’ of test inputs from the adequate one can be quantified by some metric. On the other hand, the quality of the search phase strongly depends on the shape of the hypersurface determined by the fitness function, the so-called ‘‘fitness landscape.’’ Local minima and singularities in the fitness landscape are customary with current fitness functions, hindering convergence to the global optimum. While literature abounds in definitions of new approaches, the studies that compare them one against the other are mostly similar in their rationale, especially in how they measure the effectiveness of tools. Testing aims at maximizing the number of discovered faults for a given testing effort budget. As the total number of faults in a software system is not quantifiable a priori, managers resort to assessing the thoroughness of testing based on measurable but approximate adequacy criteria. The mainstream criteria are structural, and they prescribe counting the degree of coverage of given kinds of code elements, such as statements, branches, and modified condition/decisions. Consistently, most studies emphasize coverage as the measure of effectiveness of a test generator, rather than trying to measure its ability to disclose software faults (Burnim and Sen 2008; Lakhotia et al. 2010; Fraser and Arcuri 2012). Many techniques measure coverage of the generated tests while they operate, and exploit this information to steer further generation toward yet-uncovered targets. By their very nature, systematic and search-based approaches are easily extended with path selection heuristics and fitness functions aimed at maximizing the chance of increasing coverage and speeding up convergence to saturation (Korel 1990; Michael et al. 2001; Tonella 2004; Majumdar and Sen 2007; Burnim and Sen 2008; Godefroid et al. 2008b; Inkumsah and Xie 2008; Xie et al. 2009; Paˇsaˇreanu et al. 2011). Other approaches propose to complement test case generation with a formal analysis of the feasibility of the coverage targets based on abstraction refinement (Baluda et al. 2010; McMillan 2010). These approaches progressively exclude the unreachable code elements from the coverage targets by refining the model along the infeasible control flow paths discovered during test case generation. 
The distinctive feature of these techniques is their ability, in some cases, to prove in finite time that an infinite number of static paths is infeasible. This potentially yields better coverage of ‘‘deep’’ targets and more precise estimates of the feasible coverage attained by the generated suite.

3 Research questions The overall goal of our experimental activity is to evaluate the state of the research in the area of automated test generation, trying to answer whether and to what extent this research

123

Software Qual J (2014) 22:311–333

317

area can be regarded as mature for technology transfer to industry, or whether it is still far from this objective. To this end, we set up an experiment that confronts a representative set of research prototypes of automatic test generators, based on different principles and mechanisms, against a sample of industrial software with characteristics that are notoriously challenging for the technology that underlies that or other prototypes. The challenging characteristics of the subject software include inter-procedural structures, floating point arithmetics, and the unavailability of code assertions to be used as testing oracles. When experimenting with any given test generation tool, the least intrusive approach is to run the tool out of the box against the software in the shape delivered by the developers. Unfortunately, we had to face practical problems that prevented us from setting up an experiment according to such a direct approach. A major obstacle was the need for test oracles sufficient to evaluate test effectiveness. In fact, a code-based test generator can directly (read with no human intervention) unveil only program failures that manifest in the form of program crashes or violations of specification contracts implemented as assertions in the code. Unfortunately, our subject programs contained no assertions, and interviews with the developers confirmed that writing assertions is uncommon in their software process. No test case resulted in runtime exceptions or program crashes either. Even though deeper analysis of the test results (refer to Sect. 5) revealed that runtime problems were actually happening, such as floating point underflows and divisions by zero, the standard semantics of floating point operations handles these exceptional cases by returning special values, such as NaN (not a number) or Inf (infinity), that were being silently propagated by the subject programs. In light of this consideration, we had to embrace additional assumptions in order to make the experiment produce interpretable data. The least assumption for applying any of the selected test generators was relying on manual oracles; that is, we engaged domain experts in evaluating the outputs of the subject programs when executed against the generated test suites, aiming to identify the occurrence of failures different from program crashes. Having to rely on manual oracles introduced the need to generate test suites of manageable size. We addressed this need by instructing the test generators to retain only the test cases that increased branch coverage: We regarded this as an inexpensive method of controlling the size of the test suites, while delivering test cases with good chances of capturing behaviors not yet seen. It is, however, easy to think of other, possibly more effective, methods of achieving a similar goal, and thus, our choice introduces a threat to the internal validity of the experiment: We might experience ineffective test suites because of the test selection strategy based on branch coverage, rather than because of deficiencies of the test generators. In general, the worse the effectiveness of the generated test suites, the higher the potential impact of this threat on the validity of the conclusions that can be taken out of our results. Our experiment specifically addresses the following research questions: Q1. Are test generators augmented with branch coverage-based test selection strategies effective in exposing relevant bugs of our sample industrial software? 
This research question is a refined version of the one stated at the beginning of this section, under the assumption and the threats to validity of using the test selection strategy. Q2. What is the relative effectiveness of different test generators and test generation approaches?

123

318

Software Qual J (2014) 22:311–333

This question aims to evaluate the relative strengths of the available prototypes and their underlying methods. Our study also evaluates an ensemble of test generators that we designed on purpose for this experiment.

4 Experiment setup This section presents the design of our experiment. We describe the subject programs and provide the core domain knowledge needed to understand the results of our testing activity, introduce the test generators selected for the experiment, and illustrate the experimental procedure undertaken with each test generator. 4.1 Subject programs ITER is part of a series of experimental fusion reactors which are meant to investigate the feasibility of using nuclear fusion as a practical source of energy and demonstrate the maintainability of such plants (Keilhacker 1997; Shimomura 2004). Due to very specialized requirements, the maintenance operations of the ITER reactor demand the development and testing of several new technologies related to software, mechanics, electric, and control engineering. Many of these technologies are under investigation at Divertor Test Platform (DTP2) at VTT Technical Research Centre of Finland (Muhammad et al. 2007). DTP2 embeds a real-time and safety critical control system for remotely operated tools and manipulation devices to handle the reactor components for maintenance (Honda et al. 2002). The control system is implemented using C, LabVIEW, and IEC 61131 programming languages. The software component chosen for this study is part of the motion trajectory control system of the manipulation devices. The software is implemented in C. It provides an interface between the operator and the manipulator. The operator inputs the target position of the manipulator, along with the maximum velocity, initial velocity, maximum acceleration, and maximum deceleration, as physical constraints on the generated trajectory. As a result, the software plans the movement of the manipulator, interpolating a trajectory between two given points in n-dimensional space, where n is the number of physical joints in manipulator. It returns outputs in the form of smooth motions, so that the manipulator’s joints accelerate, move, and decelerate within the physical bounds until the target position is reached. This avoids the mechanical stress on the structure of the manipulator, ensuring its integrity and safety. It also keeps the desired output forces of the joints’ actuators in check. The correctness of such software plays a key role in the reliability of the control system of the ITER maintenance equipment. The software aims to produce the trajectories in such a way that all the joints start and finish their motion at the same time. This constraint is fulfilled by slowing down the motion of certain joints, and it is ensured that the acceleration and velocity constraints are not violated for any of the joints. The software ensures that all joints finish their motion at the same time by slowing down acceleration and velocities for certain joints. The component is designed to be compiled as a dynamic link library (DLL) to work with Matlab or LabVIEW. This experiment considers four incremental versions of the subject software. Code size ranges between 250 and 1,000 lines of code. The number of branches ranges between 36 and 74. All versions include 6 functions with maximum cyclomatic complexity equal to 11.

123

Software Qual J (2014) 22:311–333

319

Baseline version The baseline version is the main working implementation of the software, which can be compiled to run in LabVIEW real-time environment. This version was used to test the motion characteristics of a water hydraulic manipulator. Platform change (buggy) version The second version considered in the study is fundamentally a platform change of the baseline version. This version provides the same functionality, but is designed to compile as a DLL to work in the Matlab Simulink environment. It was implemented to simulate and plan the motions in the virtual environment before executing them on real manipulator, aiming to enhance the safety of operations. Platform change (fixed) version The third version considered in the study is a bug fix of the second one. In fact, the above Matlab version contains a particular bug causing the manipulator to violate the maximum velocity and acceleration limits. This bug remained in the software for several years before it was detected and fixed in this version. New implementation The fourth version considered in the study is a new, recently proposed implementation to obtain the same functionality, but to rectify unwanted behaviors in the previous implementations. The component has not been tested in a real environment yet, and thus, it is not yet known whether this new implementation provides the proper functionality. 4.2 Test generators Our experiment covers code-based test generators based on random testing, dynamic symbolic execution, and search-based testing. We relied on personal knowledge and experience to select a set of publicly available research prototypes, trying to have at least one representative for each of the above test generation approaches. A requirement for a test generator to be selected was the handling of programs written in the C programming language, because our subjects were C programs. We made an exception for the test generator Pex (Tillmann and de Halleux 2008) that handles programs in C# for .NET, in which case we managed to produce a C# version of the subject programs. Specifically, our experiment considers the following set of research prototypes: • CREST is an open-source tool that implements test case generations according to different possible strategies, including in particular random testing and dynamic symbolic execution (Burnim and Sen 2008); • Pex implements test case generations according to the dynamic symbolic execution approach, using a fair-choice path exploration strategy (Tillmann and de Halleux 2008); • AUSTIN implements test case generation based on a search-based algorithm targeted toward maximizing branch coverage (Lakhotia et al. 2010). The test generators that embrace dynamic symbolic execution progressively explore the program paths and rely on constraint solvers to find assignments of input variables that make those paths execute. Constraint solver technologies at the state of the art generally experience limitations with formulas that involve reasoning on floating point variables and arithmetics. Some constraint solvers, for example Yices (Dutertre and de Moura 2006), which is the one integrated in the implementation of dynamic symbolic execution provided by CREST, do not allow floating point variables as input on this basis. Some others, for example Z3 (De Moura and Bjørner 2008), which is the one integrated in the implementation of dynamic symbolic execution provided by Pex, provide heuristic, incomplete handling of floating point formulas.

123

320

Software Qual J (2014) 22:311–333

The industrial software considered in our study takes as input only floating point variables, and we could not derive any relevant information from directly experimenting CREST against this software. Thus, we searched for workarounds to empower test generators that do not natively support floating point inputs. We experimented with simulating the floating point arithmetics over suitably interpreted integer values, after integrating the subject programs with programming libraries that provide simulations of this type. This amounts to performing a code transformation that reshapes the subject programs to work with integer inputs, and applying the dynamic symbolic execution-based test generators on the programs transformed in this way. We did so by integrating the subject programs with either of two publicly available libraries that implement a fixed-point approximation of the floating point computations,1 and a simulation of the floating point IEEE 754 semantics over integer-typed inputs interpreted at the bit level,2 respectively. Using the fixed-point approximation did not work: The resulting loss of precision affected the correctness of the analysis to large extent, yielding several spurious executions and crashes of the programs that blocked the test generators from proceeding. Our experiment discards this approach as not viable on this basis. Reshaping the subject programs on top of the integer-based floating point simulation library did indeed yield analyzable programs, at the cost of a factor-10 increase in the counts of branches. (Note that, despite the code transformations, we will evaluate the generated test suites for effectiveness against the original subject programs, in order to achieve results comparable with the ones of the test generators that do not have to rely on any code transformation.) In the experiment, we apply the code transformation approach in combination with the test generators based on dynamic symbolic execution, as implemented by either CREST equipped with the Yices constraint solver or Pex equipped with the Z3 constraint solver. Since Z3 has support for floating point formulas, we instantiate two test generation experiments based on Pex, corresponding to applying Pex either after the code transformation or against the original code, aiming to compare between the two. Furthermore, the experiment investigates the behavior of an ensemble of test generators designed on purpose for this experiment. The ensemble (we give full details below) is a staged test generator that applies random testing and dynamic symbolic execution strategies (both as implemented by CREST) in subsequent stages, combining the results through all stages. In summary, our experiment considers the following test generators: • RND_CREST: a test generator based on random testing as implemented by CREST; • DSE_I_CREST: a test generator based on dynamic symbolic execution as implemented by CREST applied after code transformation; • DSE_I_PEX: a test generator based on dynamic symbolic execution as implemented by Pex applied after code transformation; • DSE_F_PEX: a test generator based on dynamic symbolic execution as implemented by Pex applied to the original code with floating point inputs; • SBT_AUSTIN: a test generator based on search-based testing as implemented by AUSTIN; • ENS_CREST: a test generator based on an ensemble of random testing and dynamic symbolic execution implemented on top of CREST applied after code transformation. 1

http://sourceforge.net/projects/fixedptc.

2

http://www.jhauser.us/arithmetic/SoftFloat.html.

123

Software Qual J (2014) 22:311–333

321

4.3 The ensemble test generator Here, we describe the ensemble of test generators built on top of CREST, which we refer to as ENS_CREST. The ensemble consists of a four-stage test generator that uses CREST either as a random test generator or as a dynamic symbolic executor configured according to multiple search strategies, and combines the results achieved throughout the test generation stages. ENS_CREST works as follows. In the first stage, the test generator runs CREST in dynamic symbolic execution mode so as to perform a depth-first traversal of the program paths. The test generator monitors the generated test cases against the program under test and retains only the test cases that result in increasing branch coverage over the previous ones. The process is continued up to saturation, defined as experiencing no coverage increase for a configurable (set to 10,000) budget of iterations. This definition of saturation also applies to the next stages of the test generator. In the second stage, the test generator runs CREST in random testing mode until saturation, using the test cases generated in the previous stage as seeds of the random search. Again, it retains only the test cases that result in an increase in coverage. In the third stage, the test generator pursues additional coverage with dynamic symbolic execution according to a particular search heuristics (available in CREST) that weights the selectable paths according to the distance from the not-yet-covered branches and takes into account the number of unsuccessful attempts to follow specific subpaths to not-yet-covered branches in previous iterations (Burnim and Sen 2008). In the fourth stage, that test generator runs CREST with dynamic symbolic execution augmented with a systematic coverage-target-driven search strategy that we implemented in previous work, referred to as ARC-B (Baluda et al. 2011). ARC-B incrementally computes the executability conditions of the not-yet-covered branches and integrates these conditions into the constraint solver queries issued by dynamic symbolic execution, to favor the identification of solutions (test cases) that can increase branch coverage. Before either the third or fourth stage, the test generator sets the coverage targets according to the branches that were not covered in the preceding stages and then runs until saturation, retaining the test cases that increase coverage. Offline, we have also verified that, for the programs considered in our study, the search strategies applied in the third and fourth stages of the ensemble test generator do not perform better than depth-first traversal when used alone out of the ensemble. This is why, experimenting with DSE_I_CREST, we only use the depth-first search strategy. However, the former strategies produced some new test cases when seeded with the test cases randomly generated by the random stage, which is why we included these strategies in the ensemble. We have also tried to anticipate the random stage before dynamic symbolic execution and different orderings of the dynamic symbolic execution stages, but achieved poorer results (not reported in the paper) than when using the above sequence of stages. 4.4 Experimental procedure We ran all the selected test generators against the subject programs. We ran all the test generators up to saturation, defined as experiencing no coverage increase for an arbitrary budget of 10,000 test generation attempts, or autonomous termination before this budget. 
Throughout the test generation process of each test generator, we only retained test cases that increment the branch coverage of the program under test over the already generated test cases. Generating test suites across the subject programs, we worked by the

123

322

Software Qual J (2014) 22:311–333

rule of test suite augmentation (Santelices et al. 2008); that is, we started the selection of new test cases after executing the test suite generated (by the given test generator) for the previous version, and retained only the test cases that increase coverage further. We compute the coverage indicators with Gcov.3 We executed all the generated test suites so as to collect failure data. We manually inspected the test outcomes by looking into the trajectories of the manipulator’s joints generated by the subject programs, with support by VTT experts for the analysis of the plots. The subject programs yield the trajectory data of the joints as sextuples of floating point values. Each sextuple represents the trajectory of a joint by the times (three values) up to which the joint has to accelerate, cruise at peak velocity, and decelerate, respectively, and the corresponding (other three values) acceleration, peak velocity, and deceleration in each phase. For each test case in the test suites, we collected and analyzed the trajectory data in two forms: the values of the sextuples yielded and the plots of the resulting movements of the joints and their velocities over time. We searched the sextuples for (unexpected) 0, N or Inf values, and the plots for unexpected or inconsistent shapes across the subject programs. All test suites and problem reports from our testing activity have been submitted to developers at VTT to collect the feedback of domain experts on the relevance of the test cases generated and the correctness of our observations. Offline, we tabulated the failure data of each test suite for comparison, by recognizing distinct failures that can be exposed by multiple test suite. At the end of this process, we had collected 7 distinct failures that we describe in detail in the next section and tracked each test suite to the exposed (distinct) failures. The data collected in our experiment relate to the stated research questions as follows: From the analysis of the failure data and feedback of the domain experts on these data, we conclude the effectiveness of the test generators (research question Q1). From the comparative evaluation of the failures detected by each test generator, we deduce the relative effectiveness of test generators and their underlying test generation approaches (research question Q2).

5 Results This section discusses the results of our experiment according to the research questions. We describe the failure data in detail and analyze the relative strength of the experienced test generators. 5.1 Effectiveness of automatically generated test suites The research question Q1 (Sect. 3) asks whether the existing test generators are able to produce test suites that expose relevant bugs in industrial software. Our experiment addresses this question by using a set of state-of-the-art test generators so as to derive test suites for a set of industrial subject programs. The selected test generators cover the major lines of research in the area, that is, random testing, search-based testing, and directed testing based on dynamic symbolic execution, and include an ensemble of these test generators that we have designed on purpose for this study. Executing the test suites generated by the test generators in our experiment revealed the failures summarized in Table 1. Other than revealing the known bug (failure F4) of in the 3

Gcov is part of the Gnu compiler collection (GCC).

123

Software Qual J (2014) 22:311–333

323

subject program platform change (buggy) version, these failures expose relevant and previously unknown problems. Below, we describe all the problems in detail. Running the test suites generated for the baseline version, that is, the reference LabVIEW version of the component under test, we observed failures F1, F2, and F3. Table 2 reports trajectory data (columns Output) yielded by the baseline program for the inputs (columns Input) of some test cases that unveiled a given failure (column Failure). The inputs include maximum and initial velocity, maximum acceleration, and maximum deceleration of the joints. Origin and destination positions are omitted for space reasons. The outputs are the sextuples of trajectory data. These data show that the program fails to handle very small input values (failure F1), and combinations of the input parameters that include all zero (failure F2) or some negative (failure F3) values of the maximum acceleration/deceleration of the joints. The failures display as unexpected 0 and Inf values in the outputs. Debugging revealed that failure F1 is due to floating point underflows in a multiplication that involves the small values, while failures F2 and F3 derive from divisions by zero, in turn caused by a program’s function that returns 0 for unexpected inputs. From VTT experts we learned that, although these inputs are hardly showing up (e.g., the program is currently never used with negative inputs), such (unknown) problems call for strengthening the robustness checks in the program so as to avoid future issues. Running the augmented test suites generated for the platform change versions, that is, the versions adapted from the baseline version to migrate from the LabVIEW to the Matlab platform, we observed failures F4 and F5, both exposed by augmentation test cases derived on version platform change (buggy). Figure 1 plots data from a test case in which the input Table 1 Failures detected by automatically generated test cases Failure ID

Description

Identified in

F1

Floating point imprecision with small input values: In the presence of very small input values the program computes bad output values, for example unexpected 0.0 or Inf values

Baseline

F2

No robustness with all zero accelerations: If both the values of the maximum acceleration and maximum deceleration are set to zero, the program computes bad output values, for example unexpected 0.0 or Inf values

Baseline

F3

No robustness with negative accelerations: If either the value of the maximum acceleration or maximum deceleration is a negative number, the program computes bad output values, for example unexpected 0.0 or Inf values

Baseline

F4

Wrong peak velocity in presence of quiet joints: If there are quiet joints (same origin and destination positions), the program will issue movements at up to double or triple the maximum velocity

Platform change (buggy)

F5

Quiet joints that move: If there are quiet joints other then the first one, the program will cause them to move

Platform change (buggy)

F6

Slowness due to single instant peak velocity: The program issues a smooth progressive increase in acceleration up to peak velocity and a smooth progressive deceleration from then on; this results in (unwanted) slower movements than when applying the maximum acceleration and deceleration at once

New implementation

F7

Unaccounted maximum deceleration: The program refers to the value of maximum acceleration to compute both acceleration and deceleration movements, possibly exceeding the physical limits of the device when the maximum deceleration is lower than the maximum acceleration

New implementation

123

324

Software Qual J (2014) 22:311–333

values for the origin and destination positions of joint 1 are exactly equal; that is, joint 1 is expected to be a quiet joint, a joint that does not move. The plot illustrates the movement of joint 2 in this case, as issued by buggy platform change version and the fixed platform change version, respectively. Due to the (recently fixed) bug, the former version clearly issues higher velocity than the latter one (failure F4). At the code level, the fault consists of a sequence of assignments that may double or triple the value of maximum velocity in the presence of quiet joints. The equality constraints to execute these assignments are the typical case in which directed testing based on dynamic symbolic execution overcomes random testing; the equality constraints are easy to solve from the symbolic path conditions, while the probability of randomly generating equal values is infinitesimal. In facts, the augmentation test cases that pinpoint this bug were identified by test generators based on dynamic symbolic execution to cover the equality constraints mistakenly inserted in the platform change (buggy) version. The augmented test suite uncovered another unknown failure (failure F5) in the programs, due to a division by zero that produces NaN in the trajectory data of quiet joints. The NaN value interferes with the conditional control structures, such that the program fails to update the position of the joint according to the trajectory. The observed outcome is that, if the quiet joint is not first in the list, its movement is tracked exactly equal to the joint that precedes it; that is, the program actually causes the quiet joint to move. Figure 2 illustrates this behavior with reference to a test case in which joint 3 is specified as a quiet joint, but its actual trajectory is different than expected. This bug has been confirmed and indicated as very important by VTT experts. Running the augmented test suites generated for the new implementation version, that is, the recently proposed re-implementation of the functionality of the baseline version, we observed failures F6 and F7. While overall the test cases highlighted the expected change of behavior of the new implementation with respect to the baseline program, that is, that the new implementation approximates to the gradual (rather than immediate) accelerations of the physical movements, the test also revealed problems with the new implementation. First, the new implementation computes incremental accelerations that always produce single instant peak velocity, and then slower movements than are physically possible (failure F6); Second, it does not account for the maximum deceleration if different from the maximum acceleration (failure F7), which may entail important practical issues with the physical limitations of the manipulator. These problems can easily be spotted in Fig. 3, which plots the velocity of a joint in a test case. Replicating the available test cases against a new version is typical regression testing, while we did not observe any notable behavior related to the test cases specifically computed with test suite augmentation for this version.

Table 2 Trajectory data computed by the baseline program for some generated test cases Input Vmax

Output Vini

amax

dmax

Accelerate for

Keep peak vel.

Decelerate

2

for

for

m/s

m/s 1.4e-45

m/s2

F1

1.4e-45

0.0

5.0

1.4e-45

0.0

5.0

Inf

0.0

1.4e-45

F2

5.0

0.0

0.0

0.0

0.0

-1e-4

Inf

-0.0

0.0

-1e-4

F3

5.0

0.0

2.0

-1.0

0.0

-2.0

Inf

-0.0

-0.0

1.0

Vmax = maximum velocity, Vini = initial velocity, amax = maximum acceleration, dmax = maximum deceleration

123

Software Qual J (2014) 22:311–333

325

Fig. 1 Movement of joint 2 when executing a test case

Fig. 2 Movement of joint 3 when executing a test case

We regard, however, as a very positive outcome the fact that the automatically generated test cases can produce informative (and readily available) data for a new version of the software that has not yet been tested in the field. In summary, the main finding of this experiment is that test generators can be effective in the considered industrial setting, where they have been able to unveil important bugs of the software under test. The generated test suites cumulatively exposed the 7 failures described above. The subject programs considered are part of a safety critical system and have been extensively used within a prototype deployment of that system. Nonetheless, only one of the failures was due to a bug known from the version history of the subject programs, while 6 out of the 7 failures exposed by our test suites were previously unknown. These failures pinpoint relevant robustness issues with unchecked implicit preconditions and possible floating point underflows, a subtle corner case fault in presence of quiet joints

123

326

Software Qual J (2014) 22:311–333

Fig. 3 Velocity of joint 2 when executing a test case

in the manipulator trajectory, and regression problems in a recently developed new version of the subject software. All test outcomes have been reviewed, discussed, and confirmed with domain experts. Overall, the generated test suites provide valuable feedback to developers and domain experts. 5.2 Relative strength of test generators Research question Q2 (Sect. 3) relates the effectiveness of the test generation methods considered in our experiment, under the hypothesis that some approach can contribute a bigger share of effective test cases. Table 3 shows which test suite reveals which failure, as well as reporting the size and the branch coverage of the test suites. Random testing (applied with RND_CREST) does not perform well in the context of this experiment. While missing failures F2, F4, and F5 indicates the known shortage of random testing with singular behaviors, such as behaviors that depend on two inputs to be equal, the missed detection of F6 and F7 is unexpected. Deeper data analysis revealed that the test suites generated by RND_CREST determine either infinitesimal movements of the joints or very long movements at infinitesimal velocity. Neither type of input suffices to generate plots (as for example the plot in Fig. 3) that expose the failures F6 and F7. The test generators based on dynamic symbolic execution confirm the suitability of this technology to reason about inputs that take specific paths in a program, as, for example, needed to reveal failures F4 and F5. However, the test suites generated with DSE_I_PEX warn that simulating floating point computations (as opposed to coping with those directly) may inhibit this ability. The datum related to failure F1 might also suggest that the test generators that simulate the floating point computations discover additional failures, but deeper analysis pinpointed this effect as fortuitous. When querying a constraint solver for nonzero values, a frequently returned solution is number 1 in the domain of integers, whose bit representation corresponds by chance to a very small floating point value that challenges the precision of the computation. Search-based testing (applied with SBT_AUSTIN) does not show clear advantages in the context of this experiment, incurring similar difficulties as random testing to elicit singular behaviors of the subject programs. However, while carrying out the experiment,

123

Software Qual J (2014) 22:311–333

327

Table 3 Failure data and statistics of the generated test suites

RND_CREST DSE_I_CREST DSE_I_PEX DSE_F_PEX SBT_AUSTIN ENS_CREST

Baseline version

Platform change (buggy)

Platform change (fixed)

New impl.

#BR

74

126

92

36

#TC

38

41

43

61

COV

80 %

64 %

78 %

78 %

#TC

9

21

21

11

COV

65 %

80 %

72 %

81 %

#TC

9

21

23

22

COV

72 %

72 %

78 %

83 %

#TC

8

9

9

8

COV

71 %

52 %

71 %

80 %

#TC

6

8

8

6

COV

81 %

65 %

76 %

83 %

#TC

20

32

32

23

COV

86 %

88 %

83 %

86 %

Failures identified

F1, F3 F1, F3–F7 F1, F3 F3–F7 F1, F3, F6, F7 F1–F7

#BR = number of branches, #TC = number of test cases, COV = branch coverage

we faced several limitations due to the selected prototype. For example, AUSTIN is not able to reason on inter-procedural control flows, and thus, we generated the test cases against inlined versions of the subject programs, possibly biasing the results. We believe that the result of this experiment is scarcely representative of the effectiveness of searchbased test generators, and we aim to refine our results by conducting further experiments with other search-based test generators in the future. Observing the complementary strengths and weaknesses of random testing and dynamic symbolic execution led us to design the ensemble test generator that we have described in Sect. 4.3 of this paper. In the context of this experiment, this test generator is the only one that reveals the complete set of failures; that is, it does not miss any failure reported by any other test generator. Interestingly, the test suites from the ensemble test generator provide consistently higher branch coverage than the ones from the other prototypes. Table 4 gives further detail on the test cases and branch coverage contributed by each stage of the ensemble test generator against the subject programs. For the baseline version, the ensemble test generator produced 20 test cases that cover 86 % of the branches. The concolic and random stages generated most test cases, 9 and 8, respectively, while the subsequent coverage-driven stages contributed only 3 test cases. For the version platform change (buggy), the test generator augmented the test suite of the baseline version with 12 additional test cases, resulting to coverage of 88 % of the branches. The concolic and random stage produced 9 and 3 additional test cases, respectively, while the coverage-driven stages were not able to improve the coverage any further. The 3 test cases from the random stage covered 3 additional branches of the floating point simulation library used to facilitate the test generator, but did not result in additional coverage of the original code. For the version platform change (fixed), the test generator did not produce any additional test case over the test suite of the previous (buggy) version. The final coverage is 83 % of the branches of this version. For the version new implementation, the test generator produced 3 additional test cases over the test suite of the baseline version, with a total coverage of 86 % of the branches. Again, the test case from the random stage did not result in additional coverage of the original code.


Overall, from the data of this experiment, we are not able to identify a clear winner among the approaches and prototypes that we experimented with. The data from the ensemble test generator indicate that there can be real merit in exploiting the synergy of different approaches to overcome each other's deficiencies.

6 Threats to validity

Here, we discuss the most important factors that may affect the internal and external validity of our study, and outline the strategies we used to reduce their impact.

The specific set of test generators covered in the experiment may threaten the internal validity of our results. The selected test generators can fail to generate some (possibly relevant) test cases because of bugs in their implementation, or may not be sufficiently representative of a class of test generation approaches. The impact of this threat is partially mitigated in the case of test generators based on dynamic symbolic execution, since we compared the results of two different test generators of this type, while we had only one representative each of the random and search-based approaches. We selected research prototypes that were publicly available and, to the best of our knowledge, can handle C programs, and we admittedly had little control over this threat. The selected test generators have been used in other studies reported in the literature and are actively maintained software projects, which we took as indicators of robustness. Furthermore, since our data testify to the ability of the test generators to produce effective results, a bias toward having produced weaker test suites than possible does not critically affect this conclusion, while we have already commented on the inconclusiveness of our data with respect to the relative strength of different test generators.

In the experiments, we selected test cases that produce increments in branch coverage, up to saturation. Different selection or halting criteria might have induced different test cases and then different results. As above, we remark that the introduced bias is pessimistic, either because we may have halted the generation process too soon, or because we may have dropped test cases that elicit some failure despite not increasing branch coverage. We can, therefore, assume that this threat has low impact on the results related to research question Q1, and might have an impact, which we are not currently able to quantify, on our inability to give a conclusive answer to research question Q2.

Table 4 Test cases and coverage per stage of the ensemble test generator

Stage                            | Baseline version | Platform change (buggy) | Platform change (fixed) | New impl.
                                 | #TC  COV         | #TC  COV                | #TC  COV                | #TC  COV
Previous version test suite      |   –   –          |  20  72 %               |  32  83 %               |  20  84 %
Concolic testing (stage 1)       |   9  70 %        |   9  88 %               |   –   –                 |   2  86 %
Random testing (stage 2)         |   8  79 %        |   3  88 %               |   –   –                 |   1  86 %
Coverage-driven CREST (stage 3)  |   1  82 %        |   –   –                 |   –   –                 |   –   –
Coverage-driven ARC-B (stage 4)  |   2  86 %        |   –   –                 |   –   –                 |   –   –
Total                            |  20  86 %        |  32  88 %               |  32  83 %               |  23  86 %

#TC = number of test cases, COV = branch coverage
Handling floating point calculations with the simulation libraries meant, when applied, analyzing a code transformation of the original subject programs, with an increase in the total number of branches of up to a factor of 10. This may threaten the comparability of the results between the test generators for which the code transformation is and is not applied, respectively. We addressed this threat by using the transformed code only to generate the test suites, while we collected the failure and coverage data on the original programs in all cases, thus fostering comparable results. The sketch at the end of this section illustrates the kind of transformation involved.

Our experiment analyzes a restricted number of versions of a software system. The features of the selected experimental subjects are representative of several other real-time control systems being developed at the VTT research center, but we are aware that the results of a single experiment cannot be directly generalized. More specifically, it is unclear whether the results obtained generalize across software of different size, written in different programming languages and for different application domains. We are currently planning to repeat our experiment with other subject programs developed at VTT, and we are contacting other industrial partners to collect further data.
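To clarify the nature of this transformation, here is a hypothetical before/after sketch in C. The library name, its API (the sf_* functions), and the clamp_step function are illustrative inventions of this presentation, not the actual library or code used in the study. In a real simulation library the sf_* operations are implemented entirely with integer arithmetic (explicit handling of sign, exponent, mantissa, rounding, and special cases), which is precisely what inflates the branch count of the transformed program; here they delegate to the hardware so that the sketch stays short and runnable.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t sf64;                  /* a double carried as its 64-bit pattern */

static sf64  sf_from_double(double d) { sf64 b;  memcpy(&b, &d, sizeof b); return b; }
static double sf_to_double(sf64 b)    { double d; memcpy(&d, &b, sizeof d); return d; }

/* Stubs: a real simulation library computes these with integer-only code. */
static sf64 sf_add(sf64 a, sf64 b) { return sf_from_double(sf_to_double(a) + sf_to_double(b)); }
static sf64 sf_mul(sf64 a, sf64 b) { return sf_from_double(sf_to_double(a) * sf_to_double(b)); }
static int  sf_lt (sf64 a, sf64 b) { return sf_to_double(a) < sf_to_double(b); }

/* Original code: straight-line floating point arithmetic. */
static double clamp_step_original(double v, double dt, double max_step) {
    double step = v * dt;
    return (step < max_step) ? step : max_step;
}

/* Transformed code: every operation becomes a call into the integer-domain
   library, so a symbolic executor without floating point support can still
   reason about the (integer) path conditions. */
static sf64 clamp_step_transformed(sf64 v, sf64 dt, sf64 max_step) {
    sf64 step = sf_mul(v, dt);
    return sf_lt(step, max_step) ? step : max_step;
}

int main(void) {
    double a = clamp_step_original(2.5, 0.01, 0.02);
    double b = sf_to_double(clamp_step_transformed(sf_from_double(2.5),
                                                   sf_from_double(0.01),
                                                   sf_from_double(0.02)));
    printf("original: %g, transformed: %g\n", a, b);   /* both print 0.02 */
    (void)sf_add;
    return 0;
}
```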

7 Conclusions

We have presented a study applying a set of test generators based on random testing, dynamic symbolic execution, search-based testing, and an ensemble of these techniques, respectively, against a family of subject programs developed at VTT. As a main result, this study augments the body of knowledge in the field by contributing empirical evidence that the test generators can be effective on industrial software, up to exposing bugs that had escaped detection during the testing of a prototype deployment of the safety-critical system of which the subject programs of the study form part. The bugs found in this study relate to unknown robustness issues with unchecked implicit preconditions and possible floating point underflows, corner-case behaviors on singular inputs, and unwanted inconsistencies between a re-implementation of a core algorithm and its baseline version.

The study also compares the relative strength of the test generators, but produces inconclusive results in this respect. Nonetheless, we have found that an ensemble test generator built on top of a multi-strategy test generator used in the study (namely, CREST) performed consistently better than all other test generators, exposing a larger set of failures and achieving higher branch coverage.

Our experiment confirms the difficulty of test generators based on dynamic symbolic execution in handling programs that rely heavily on floating point computations. Being able to effectively analyze programs that exploit nonlinear and floating point arithmetic was a strong requirement in our study, and this probably generalizes to many other relevant industrial domains. We experimented with a possible workaround based on reshaping the subject programs by means of a programming library that simulates the floating point computations (according to the standard IEEE 754) over the domain of integers, and then executing the test generators on the resulting programs, but we did not observe evidence that this approach can guarantee good results in general. Our experience indicates support for floating point arithmetic as an important milestone on the path to exploiting the full potential of the test generators based on dynamic symbolic execution in industry.

Another conclusion that can be drawn from our study is that test generators must be able to integrate with manual oracles, since addressing program crashes or uncaught runtime exceptions can be insufficient. Beyond testifying to the scarce use of code assertions in industrial software, our study provides evidence that even low-level violations, such as floating point divisions by zero, can result in silent failures.
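The following small C program, a hypothetical illustration and not an excerpt from the subject component, shows the kind of silent failure we refer to: under IEEE 754 semantics a floating point division by zero does not crash the program, it silently yields an infinity that may further degrade into a NaN, so only an explicit oracle, such as an assertion on an implicit precondition, notices that something went wrong.

```c
#include <stdio.h>
#include <math.h>

/* Hypothetical computation: average velocity over a time interval. */
static double average_velocity(double distance, double dt) {
    return distance / dt;              /* dt == 0.0 yields +inf, not a crash */
}

int main(void) {
    double v = average_velocity(1.0, 0.0);
    double scaled = v * 0.0;           /* inf * 0.0 yields NaN, still no crash */

    printf("v = %f, scaled = %f\n", v, scaled);

    /* A crash-only oracle misses this; an explicit, manually written oracle
       (or an assertion on an implicit precondition such as dt > 0) catches it. */
    if (!isfinite(v) || isnan(scaled))
        fprintf(stderr, "oracle: nonfinite intermediate result detected\n");
    return 0;
}
```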


We are now working with VTT and other industrial partners to replicate this study, aiming to generalize the current results.

Acknowledgments This work is partially supported by the European Community under the call FP7-ICT-2009-5, project PINCETTE 257647.


Author Biographies

Pietro Braione received the Dr. Eng. degree in computer science engineering in 2000 and the PhD degree in information technology in 2004, both from Politecnico di Milano (Italy). Since 2007, he has been a full-time researcher at the Università degli Studi di Milano-Bicocca. His research interests include formal software verification and analysis, and software testing.

Giovanni Denaro received his PhD degree in computer science from Politecnico di Milano (Italy). He is a researcher in software engineering at the University of Milano-Bicocca. His research interests include formal methods for software verification, software testing and analysis, development of distributed and service-oriented systems, and software metrics for process optimization.

Andrea Mattavelli is a PhD student in informatics at the University of Lugano. He received both his bachelor's and master's degrees from the University of Milano-Bicocca, Italy. His research interests include autonomic systems, self-healing systems, and software testing and analysis.


Mattia Vivanti has been a PhD student in the Software Testing and Analysis Research group of the Faculty of Informatics of the University of Lugano since 2011. He holds a bachelor's and a master's degree in informatics, both from the Università degli Studi di Milano-Bicocca. His research concerns structural testing, software analysis, and automatic test case generation.

Ali Muhammad is a senior scientist at the VTT Technical Research Centre of Finland. He has over 13 years of experience in implementing and coordinating several national and international projects in the fields of hydraulics, automation, control, and robotics. He obtained his PhD in control and robotics (2011) and his MSc in automation and control (2004) from Tampere University of Technology (TUT). He also holds a BSc in industrial electronics engineering (1998) from NED University of Engineering and Technology, Karachi. Since 2004, he has been doing research at DTP2 to study the remote handling and maintenance concepts of fusion power plants such as ITER. Dr. Muhammad has authored and co-authored over 25 publications. His scientific interests include the integration of dynamic simulation models with virtual reality, the application of these digital mock-ups during the entire plant life-cycle, and self-evolving and adapting algorithms in the field of control and automation.
