2010 Third International Conference on Software Testing, Verification and Validation
Industrial Scaled Automated Structural Testing with the Evolutionary Testing Tool

Tanja E.J. Vos‡, Arthur I. Baars‡, Felix F. Lindlar†, Peter M. Kruse∗, Andreas Windisch† and Joachim Wegener∗
∗ Berner & Mattner Systemtechnik GmbH, Berlin, Germany
† Daimler Center for Automotive IT Innovations, Technische Universität Berlin, Germany
‡ Department of Information Systems and Computation, Technical University of Valencia, Spain
Abstract—Evolutionary testing has been researched and promising results have been presented. However, evolutionary testing has remained predominantly a research activity, not practiced within industry. Although attempts have been made, such as Daimler's Evolutionary Structural Test (EST) prototype, until now no such tool has been suitable for industrial adoption. The team of the European project EvoTest (IST-33472) has been working from 2006 to 2009 to improve this situation. This paper describes the final version of the Evolutionary Testing Framework (ETF) resulting from the EvoTest project. In particular, we present the EvoTest Structural Testing tool for fully automatic structural testing, which has been demonstrated to be suitable for an industrial setting. The paper concentrates on how to use the tool and how to interpret its results. It starts by introducing the concepts of Evolutionary Testing in general and Structural Testing in particular. Subsequently, the ETF and the EvoTest Structural Testing tool built on top of it are described, concentrating on the usage, the architecture, and the remaining limitations of the tool. The paper concludes by describing the results of using the EvoTest Structural Testing tool in practice on real-world systems in an industrial setting.

Keywords—evolutionary computation, test automation, structural testing, industrial practice, test-data generation

I. INTRODUCTION

Software and systems testing is at the moment the most important and most widely used quality assurance technique applied in industry, and can take up more than 50% of development cost and time. Even though many test automation tools are currently available to aid test planning and control as well as test case execution and monitoring, all these tools share a similar passive philosophy towards test case design, selection of test data, and test evaluation: they leave these crucial, time-consuming and demanding activities to the human tester. This is not without reason; test case design and test evaluation are difficult to automate with the techniques available in current industrial practice, since the domain of possible inputs (potential test cases), even for a trivial program, is typically too large to be exhaustively explored. However, test case design is the activity that determines the quality and effectiveness of the whole testing process; the test cases determine the kind and scope of the test. The lack of automation of this important testing activity means that industry still spends a lot of effort and money in testing, and the quality of the resulting tests is sometimes low, since they fail to find important errors in the system.

In this work we concentrate on structural, or white-box, testing. Structural test case generation techniques analyse the control or data flow structures of the source code and use these to extract the information needed for test case generation. Given a desired testing criterion (e.g. path coverage, statement coverage, branch coverage), sensitization techniques are used to automatically generate test cases that meet the selected criterion. Sensitization (i.e. the problem of finding test cases for a program that make it execute a particular path, statement or branch) is, however, in general undecidable [1]. Existing work on Evolutionary Testing has shown that stochastic optimization and intelligent search techniques (like evolutionary algorithms) constitute a successful way to automatically generate effective test cases and to solve the underlying problem of sensitization. Various types of structural testing have received a lot of attention [2], [3], [4], [5], applying Evolutionary Testing to generate sets of test data that achieve a high coverage of the program structures. However, one shortcoming is that evolutionary testing has hardly been applied to real-world complex systems, and as such little is known about the scalability, applicability and usability of these techniques in an industrial setting. One of the objectives of the European project EvoTest (2006-2009) (IST-33472) [6] is to improve this situation by developing an industrial-scale Evolutionary Testing Framework (ETF) that provides general components and interfaces to facilitate the automatic generation, execution, monitoring and evaluation of effective test scenarios. In [7] we described a progress report on the ETF and listed a set of improvements that still needed to be implemented in the tool. This paper describes the final version of the Evolutionary Testing Framework (ETF) resulting from the EvoTest project, concentrating on how to use the ETF and interpret the results.

Outline: The structure of this paper is as follows. Section II introduces the concept of Evolutionary Testing: it starts with a brief introduction to evolutionary algorithms, and explains the application of such algorithms for testing in general and for structural testing specifically. Section III introduces the Evolutionary Testing Framework (ETF), a general framework to perform evolutionary testing
developed by the EvoTest project. Section IV introduces the EvoTest Structural Testing tool, which was also developed by the EvoTest project and is built on top of the ETF; this section explains the usage, the architecture, and the limitations of the tool. Section V presents the case studies that were performed to evaluate the scalability of the EvoTest Structural Testing tool in an industrial setting. Finally, Section VI concludes.

II. EVOLUTIONARY TESTING

The application of Evolutionary Algorithms to test data generation is often referred to as Evolutionary Testing (see, for example, [4], [8], [9]). The suitability of evolutionary algorithms for testing is based on their ability to produce effective solutions for complex and poorly understood search spaces with many dimensions. The dimensions of the search space are directly related to the number of input parameters of the System Under Test (SUT). A huge advantage of evolutionary algorithms (or any other meta-heuristic search technique) is that they can readily be applied for various testing objectives, and Evolutionary Testing has been successfully applied for both structural and functional testing objectives. In the remainder of this section we first present the basics of evolutionary algorithms, followed by an explanation of how such algorithms can be applied for testing in general and for structural testing specifically.

A. Evolutionary Algorithms

Evolutionary algorithms represent a class of adaptive search techniques and procedures based on the processes of natural genetics and Darwin's theory of biological evolution. They are characterized by an iterative procedure and work in parallel on a number of potential solutions, the population of individuals. Permissible solution values for the variables of the optimization problem are encoded in each individual. Figure 1 provides an overview of a typical evolutionary algorithm procedure. First, a population of guesses as to the solution of the problem is initialized, usually at random. Each individual within the population is evaluated by calculating its fitness; this usually results in a spectrum of solutions ranging in fitness from very poor to good. Pairs of individuals are selected from the population according to the predefined selection strategy and recombined in such a way as to produce new individuals, analogous to biological reproduction. Subsequently, mutation may be applied. The new individuals are evaluated, and based on the reinsertion strategy it is decided which individuals are fit enough to make it into the next iteration. The algorithm iterates until the optimum is achieved or another stopping condition is fulfilled.

Evolutionary algorithms are generic and can be applied to a wide variety of optimization problems. To specialize an evolutionary algorithm for a specific problem, one needs to define a problem-specific objective (fitness) function. The objective function compares and contrasts solutions of the search with respect to the search goal; using this information, the search is directed into potentially promising areas of the search space.

Figure 1. Workflow of a general Evolutionary Algorithm (initial population → evaluation → selection → recombination → mutation → reinsertion, repeated until a stopping condition holds).
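To make the workflow of Figure 1 concrete, the following is a minimal sketch of such a loop in C. It is illustrative only, not the ETF implementation: the population size, genome length, tournament selection, uniform crossover, mutation rate, and the external evaluate() function are assumptions chosen for brevity.

```c
#include <stdlib.h>

#define POP_SIZE   50   /* assumed population size */
#define GENOME_LEN 8    /* assumed number of encoded input variables */
#define MAX_GEN    100  /* stopping condition: generation limit */

typedef struct {
    double genes[GENOME_LEN];
    double fitness;          /* lower is better; 0 means optimum reached */
} Individual;

/* Problem-specific objective function, supplied per optimization task. */
extern double evaluate(const double genes[GENOME_LEN]);

static double rand01(void) { return rand() / (RAND_MAX + 1.0); }

/* Selection: tournament of two, keeping the fitter (smaller) individual. */
static const Individual *select_parent(const Individual pop[POP_SIZE]) {
    const Individual *a = &pop[rand() % POP_SIZE];
    const Individual *b = &pop[rand() % POP_SIZE];
    return a->fitness < b->fitness ? a : b;
}

/* One evolutionary run; pop[] must already be initialized and evaluated. */
void evolve(Individual pop[POP_SIZE]) {
    for (int gen = 0; gen < MAX_GEN; gen++) {
        Individual next[POP_SIZE];
        for (int i = 0; i < POP_SIZE; i++) {
            /* Recombination: uniform crossover of two selected parents. */
            const Individual *mum = select_parent(pop);
            const Individual *dad = select_parent(pop);
            for (int j = 0; j < GENOME_LEN; j++)
                next[i].genes[j] = (rand01() < 0.5 ? mum : dad)->genes[j];
            /* Mutation: small random perturbation with low probability. */
            for (int j = 0; j < GENOME_LEN; j++)
                if (rand01() < 0.05)
                    next[i].genes[j] += rand01() - 0.5;
            /* Evaluation of the new individual. */
            next[i].fitness = evaluate(next[i].genes);
        }
        /* Reinsertion: generational replacement (elitism omitted for brevity). */
        for (int i = 0; i < POP_SIZE; i++)
            pop[i] = next[i];
        for (int i = 0; i < POP_SIZE; i++)
            if (pop[i].fitness == 0.0) return;  /* optimum reached */
    }
}
```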
B. Evolutionary Testing

In order to automate software tests using evolutionary algorithms, the test aim itself must be transformed into an optimization task. Depending on which test aim is pursued, different objective (fitness) functions emerge for test data evaluation. If an appropriate fitness function can be defined for the test aim, and evolutionary computation is applied as the search technique, then the evolutionary test proceeds as follows. The initial set of test data is generated, usually at random. Instead of random initialization, one might also use test data obtained by a previous systematic test [10]; this way the evolutionary test benefits from the tester's knowledge of the SUT. After initialization, each individual within the population represents a test datum with which the SUT is executed. Each execution is monitored and the fitness value determined for the corresponding individual. Next, test data with high fitness values are selected with a higher probability than those with lower values, and are subjected to combination and mutation processes to generate new offspring test data. A new population of test data is formed by merging offspring and parent individuals according to the predefined survival procedures. From here on the process repeats itself, starting with selection, until the test objective is fulfilled or another given stopping condition is reached.
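The seeding of the initial population from a previous systematic test, mentioned above, might look as follows. This sketch reuses the declarations of the previous one; the seed format and the random input range are illustrative assumptions, not the ETF's.

```c
/* Seeded initialization: the first n_seeds individuals are copied from
 * existing test data, the rest are random (the range [-100, 100] is an
 * illustrative assumption about the input domain). */
void init_population(Individual pop[POP_SIZE],
                     const double seeds[][GENOME_LEN], int n_seeds) {
    for (int i = 0; i < POP_SIZE; i++) {
        for (int j = 0; j < GENOME_LEN; j++)
            pop[i].genes[j] = (i < n_seeds) ? seeds[i][j]
                                            : 200.0 * rand01() - 100.0;
        pop[i].fitness = evaluate(pop[i].genes);
    }
}
```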
C. Evolutionary Structural Testing

The goal of structural testing is to obtain a set of test inputs that covers the source code of the subject under test. There exist various coverage criteria, such as line coverage, statement coverage, modified condition/decision coverage, and path coverage. For safety-critical systems a certain level of coverage may be mandated by safety standards and regulations such as ISO 26262, IEC 61508 and DO-178B [11], [12], [13].

The first work applying Evolutionary Algorithms to generate structural test data is that of Xanthakis et al. [14]. The design of an appropriate objective function for the various coverage criteria has been a field of active research; for an extensive overview we refer to the survey by McMinn [15]. Early approaches by Roper [16] used the number of program structures covered by a test input as the objective value. Under this scheme the search tends to reward individuals that execute long paths through the test object, but no guidance is given for structures that are unlikely to be covered by chance, such as deeply nested structures or branch predicates that are only true when an input variable takes a specific value from a large domain. The approach by Wegener et al. [17] avoids this problem by taking the control-flow graph of the test object into account. Their (minimizing) objective function has the following basic form:

    obj = approach_level + branch_dist    (1)

The approach level is a natural number and the branch distance a real value scaled to the interval [0, 1]. The strategy by which the approach level and branch distance are computed varies according to the coverage criterion in question.

Figure 2 illustrates the process of evolutionary structural testing for achieving branch coverage. Within the Evaluation step (which evaluates the quality of each generation) the test object is executed using the test data induced by the individuals. Each execution results in an execution path. If the targeted branch is part of the execution path, the branch is covered and the fitness value is simply 0. However, when the target branch is missed, the critical predicate is determined, i.e. the point in the control-flow graph at which the execution path diverges away from the target branch. The approach level is defined as the distance between the critical predicate and the target branch. Subsequently, the branch distance is determined by computing the distance between the actual values of the variables at the critical predicate and the values needed to make the critical predicate flip in the direction of the target branch; a sketch of these computations is given below.

Figure 2. Workflow of evolutionary structural testing (the loop of Figure 1, with the Evaluation step decomposed into test execution and monitoring: individuals are translated into test data, the test object is executed, and the monitored results yield the objective value from branch distance and approach level).
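As a worked illustration of formula (1), the sketch below computes the objective value for one execution. The branch-distance formulation for a relational predicate and the normalisation into [0, 1) follow common practice in the search-based testing literature (cf. [15]); they are not necessarily the exact functions used inside the ETF.

```c
#include <math.h>

/* Branch distance for the predicate (a <= b): zero when the desired
 * outcome is already achieved, otherwise it grows with how far the data
 * is from flipping the predicate (the +1 offset distinguishes a
 * "just barely failed" predicate from a satisfied one). */
static double branch_dist_le(double a, double b) {
    return a <= b ? 0.0 : (a - b) + 1.0;
}

/* Scale a raw distance into [0, 1) so that it can never outweigh a
 * whole approach level. */
static double normalise(double dist) {
    return 1.0 - pow(1.001, -dist);
}

/* Objective value (minimised), as in equation (1): approach_level
 * counts the branching nodes between the critical predicate and the
 * target; the branch distance is measured at the critical predicate. */
double objective(int approach_level, double raw_branch_dist) {
    return (double)approach_level + normalise(raw_branch_dist);
}
```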
III. EVOLUTIONARY TESTING FRAMEWORK

To facilitate the development of evolutionary tests, the EvoTest project has implemented the Evolutionary Testing Framework (ETF). The framework is an extension of the Eclipse IDE [18]. It includes the evolutionary engine generator GUIDE [19] and a GUI component for tuning the parameters of the algorithm. An extension point for the search engine is provided, so that engines for other metaheuristic search techniques, such as hill climbing, simulated annealing and particle swarm optimization, can be plugged in easily. Furthermore, the framework provides various ways of visualizing the search progress. To customize the framework for a particular test aim, the following domain-specific components need to be supplied (for more details see the ETF user manual [20]); a sketch of how they fit together follows below:

1) Individual specification
2) Test driver
3) Objective function

The individual specification describes the structure of the individuals. The test driver provides the connection between the framework and the SUT: it converts the individuals from the search process into test data, executes the SUT using the test data, and monitors the output of the SUT. The monitoring results are passed back to the framework and used by the objective function to calculate the adequacy of the test data.
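The following sketch shows how these three components might fit together for a toy SUT. Every name here is illustrative and does not reflect the actual ETF interfaces.

```c
/* Toy SUT with two integer inputs (illustrative). */
extern int classify(int speed, int dist);

/* (1) Individual specification: the structure of one individual,
 * one field per input parameter of the SUT. */
typedef struct { int speed; int dist; } TestDatum;

/* Monitoring result recorded while the instrumented SUT executes. */
typedef struct { int branches[64]; int len; } Trace;

/* Hypothetical hooks provided by the instrumented code. */
extern void begin_monitoring(Trace *trace);
extern void end_monitoring(void);

/* (3) Objective function: rates the adequacy of a monitored execution
 * with respect to the current test goal (lower is better, 0 = covered). */
extern double adequacy(const Trace *trace, int goal_branch);

/* (2) Test driver: converts an individual into test data, executes the
 * SUT, monitors it, and passes the results to the objective function. */
double run_and_rate(const TestDatum *td, int goal_branch) {
    Trace trace = { .len = 0 };
    begin_monitoring(&trace);
    (void)classify(td->speed, td->dist);
    end_monitoring();
    return adequacy(&trace, goal_branch);
}
```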
Figure 3. Architecture of the EvoTest Structural Testing tool (grey: the Eclipse CDT workspace and the ETF; tool components: Test Goal Manager, SUT Preparator, Bounds Dialog, individual specification, fitness function and test driver; numbered arrows (1)-(8) indicate the flow of time).
IV. EVOTEST STRUCTURAL TESTING

For evolutionary structural testing the EvoTest team has developed a tool on top of the Evolutionary Testing Framework that enables the fully automatic generation of test inputs covering all branches of an ISO C99 function [21]. (If you are interested in trying out this one-click automated testing tool on your code, please contact one of the authors of this paper.) At the moment only condition/decision coverage is implemented as test objective. The EvoTest Structural Testing tool is implemented as an extension to Eclipse's C/C++ Development Tooling (CDT).

A. Usage

To use the tool, one first needs an Eclipse C project containing the source files of the SUT, with the compiler and linker preferences, such as the include and library directories, set up. Subsequently, a tester can simply select the function to test in the project's outline view to start a search that fully automatically generates test inputs for the chosen coverage criterion. At the end of the search the set of test inputs is written to a file, and coloured annotations are added to the source code of the function to show which branches have been covered successfully and which have not.

Optionally, a tester may specify bounds for the input variables of the function under test before starting the search. Both upper and lower bounds can be specified for each variable and record field; for example, variables that are used as boolean values can be assigned the range [0, 1], as illustrated below. Variables can even be switched off altogether, giving them a fixed value during the search. This way the search benefits from the tester's knowledge and, due to the reduction of the search space, higher-quality results can be expected in less time. The settings of the bounds can be saved and restored, allowing a tester to repeat a test reusing previously saved bounds. Additionally, variable dependency analysis can be performed: a static analysis that detects which variables influence the coverage of each branch. This information is used to deactivate variables that are not relevant for the search goal, reducing the search space even further and increasing the performance of the search process.

Finally, EvoTest Structural Testing can automatically tune the parameters of the evolutionary engine. The automated tuning gives good results and requires no experience in evolutionary computation. Alternatively, the tool allows a tester to specify those parameters manually; when the tester is experienced in evolutionary computation, a manually tuned engine is likely to perform even better than an automatically tuned one.
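To illustrate why bounds help, consider the following hypothetical function under test (it is not taken from the case studies). Both ignition and request are ints used as booleans, so a tester would bound them to [0, 1]; without that hint the search must discover two exact values in the full integer range before the nested branch even becomes reachable.

```c
/* Hypothetical function under test. */
int defroster_on(int ignition, int request, int temperature) {
    if (ignition == 1 && request == 1) {   /* flag-like inputs: bound to [0, 1] */
        if (temperature < 4)               /* plausible bound: e.g. [-40, 85] */
            return 1;                      /* switch the defroster on */
    }
    return 0;
}
```

With the flags bounded to [0, 1] and a physically plausible temperature range, the nested branch becomes reachable by chance within a handful of evaluations, whereas over the full int domains it is practically never hit at random.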
B. Architecture

Figure 3 provides an overview of the main components and workflow of the EvoTest Structural Testing tool. EvoTest Structural Testing extends the graphical user interface of Eclipse's CDT platform and uses the Evolutionary Testing Framework to perform the evolutionary searches. The tool is built on top of these two components, which are coloured grey in the figure to distinguish them from the tool's internal components. Each arrow in the figure is annotated with a number in parentheses indicating the flow of time.

Initially, the tester selects (1) the function for which he or she wants to generate a set of covering test inputs. The selected function is passed (2) to the SUT Preparator component, which analyses the source code of the function to be tested. The SUT Preparator yields three results (flow (3) in Figure 3):

• the function's interface
• instrumented source code of the function
• the function's control flow graph
The function's interface contains a list of the parameters and global variables used by the function. This list is displayed to the tester in the Bounds Dialog, allowing him or her to augment (4) the interface with bounds for each variable, either by specifying the bounds manually or by loading previous settings. The bounded interface is used to derive (5) the individual specification, which determines the shape of the individuals for the search process. The second result of the SUT Preparator is the instrumented code, which is used to instantiate the test driver component. The role of the test driver is to convert an individual into test input data, call the instrumented function, and record the execution path; a sketch of such instrumentation follows below. The third result of the SUT Preparator is the control flow graph of the function under test, which is used to instantiate the fitness function as explained in Section II-C. Furthermore, the control flow graph is used by the Test Goal Manager to extract a list of the goals (branches) to be covered. For each goal in turn the Test Goal Manager starts (6) an evolutionary search. During a search, test data are generated and evaluated until a test input is found that covers the current goal, or until a budget of fitness evaluations (10,000 by default) is exceeded. The evaluation of an individual test datum proceeds as follows. First, the instrumented function under test is executed by the test driver using the test datum as inputs. The test driver records the execution path, which is subsequently used by the fitness function, together with the control flow graph and the current goal, to compute the objective value, i.e. the approach level and branch distance [4]. As said before, the Test Goal Manager performs a search for every goal (branch). The test inputs found by the searches are aggregated (7). The aggregated test inputs are written (8) to a file in the Eclipse workspace, and coloured annotations are displayed (8) in the graphical user interface to inform the tester which branches were covered and which were not.
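As a sketch of what the generated instrumentation might look like (the actual code produced by the SUT Preparator is not shown in the paper), a probe can be inserted before every branching node to record both the execution path and the distance to flipping the predicate:

```c
#define MAX_NODES 128
static int    node_hit[MAX_NODES];   /* recorded execution path */
static double node_dist[MAX_NODES];  /* distance to flipping each predicate */

/* Probe inserted before every branching node; the real instrumentation
 * is generated automatically, this is hand-written to show the idea. */
static int probe(int node_id, int outcome, double dist_to_flip) {
    node_hit[node_id]  = outcome ? 1 : -1;
    node_dist[node_id] = dist_to_flip;
    return outcome;
}

/* Example: the predicate (speed > limit), instrumented as node 3. */
int brake_needed(int speed, int limit) {
    int taken = speed > limit;
    if (probe(3, taken,
              taken ? (double)(speed - limit)          /* distance to false */
                    : (double)(limit - speed + 1))) {  /* distance to true */
        return 1;  /* target branch */
    }
    return 0;
}
```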
C. Limitations

The tool still has some limitations that need to be addressed in future work. Most notable is the lack of support for dynamic data structures: the function under test cannot have inputs that are variable-size arrays or recursive data structures, such as lists, trees, or graphs. Fixed-size arrays, pointers to single values, and (non-recursive) structs, on the other hand, are supported, as illustrated below. At the moment the tool only implements decision/condition coverage; support for other coverage types, such as statement coverage, path coverage and MC/DC, could easily be added.
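This distinction can be illustrated with a few declarations. The names are made up; only the kinds of parameters reflect the tool's documented capabilities.

```c
#include <stddef.h>

/* Supported inputs: fixed-size arrays, pointers to single values,
 * and non-recursive structs. */
typedef struct { int gear; double torque[4]; } EngineState;  /* non-recursive */
int check_state(const EngineState *state, int *mode);        /* supported */

/* Not yet supported: recursive data structures and inputs whose size
 * is only known at run time. */
typedef struct Node { int value; struct Node *next; } Node;  /* recursive */
int list_sum(const Node *head);                              /* not supported */
int mean_of(const double *values, size_t n);                 /* not supported */
```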
V. THE ETF IN PRACTICE

A. Case study design

To the present day, none of the approaches for automated test design has reached a broad level of applicability in an industrial setting. Consequently, there is considerable need and interest for empirical studies of evolutionary testing techniques and their scalability on complex, realistic industrial systems. Hence, the research question to be addressed is: "Is evolutionary structural testing scalable to real-world complex industrial systems?" In order to answer this question, it is refined into propositions that can be evaluated through variables measured during the case studies.

P1 In comparison with random testing, the ETF is more effective and more efficient in finding test cases for real-world systems.
P2 Automated parameter tuning improves the efficiency, effectiveness and usability of the search.
P3 The amount of time, effort and knowledge necessary to configure and use the ETF makes it worthwhile to use within an industrial setting.

1) Case Study Design: The following steps for carrying out our case studies have been defined: (1) installation and configuration according to the ETF user manual [20]; during these activities, work diaries should be maintained, containing lists of the tasks (including date, time and description) performed to set up the ETF according to the user manual (e.g. installation, configuration, finding an appropriate set of parameters for the evolutionary engine); (2) running each search 30 times to ensure statistical meaning, and collecting the data listed below; (3) interviews about the general suitability and acceptability in the specific industrial setting. These interviews are informal and not intended for statistical tests, since, evidently, a sample consisting of Daimler and Berner & Mattner engineers is not representative of the population of interest. However, the interviews are still interesting to include, since their objective is to gain insight into the experiences the practitioners had when applying the evolutionary testing techniques and tools. These experiences are used to find bugs in the tool, areas that need improvement, or extensions necessary for the particular industrial setting.

2) Data Collection: The case studies are run by controlling the independent variables and measuring the effect on the dependent variables. The first independent variable comprises the C code of the industrial systems and the test goals, depending on the selected coverage criterion to be fulfilled. Another independent variable to be controlled is the set of evolutionary parameters used for setting up the evolutionary engine of the ETF. Therefore, we distinguish
Table I
SELECTED FUNCTIONS FOR STRUCTURAL TESTING

Function  Case Study  LOC   Branches  Nesting level  Number of input variables
1         CS1         406   148       13             25
2         CS1         864   505       12             66
3         CS4         453   156       10             43
4         CS3         235   48        5              27
5         CS2         175   72        9              20
6         CS1         583   194       8              79
7         CS1         2896  964       6              139
8         CS1         917   146       3              51
9         CS5         919   420       14             80
10        CS5         259   142       12             38
11        CS5         58    36        6              14
12        CS6         85    110       11             27
13        CS6         99    76        7              29
14        CS6         199   129       4              15
15        CS7         67    32        9              3
16        CS7         272   216       4              28
between three different parameter settings, defining three different versions of the ETF:

ETF Random: use of random search instead of evolutionary search, serving as a baseline for comparison.
ETF Manual: the evolutionary parameters are chosen based on the expertise of the tester.
ETF Automated: automated parameter tuning techniques are applied to choose the "best" set of parameters automatically.

The monitored dependent variables are: the number of test cases evaluated, the degree of structural code coverage reached, the progress of the fitness values, the time and effort needed to set up the ETF and to find an appropriate set of parameters for the evolutionary engine (in the case of ETF Manual), and the general qualitative usability and acceptability opinions within the industrial setting.

3) Plan for manipulating independent variables: For each test object, we run the experiments for the three versions of the ETF. The ETF performs searches on a per-branch basis; for all versions, the limit per branch is set to a maximum of 10,000 SUT executions.

4) Criteria for interpreting the findings: In order to assess P1 we evaluate the search results in comparison with random testing. The number of fitness evaluations is used as a measure of efficiency, with an improved search
being considered more efficient. The coverage reached is used as a surrogate for the effectiveness of the search. For P2, the progress of the fitness values, and thus the effectiveness and efficiency of the search, is used to interpret its validity. For P3 we use questionnaires, the work diaries, and the general results of P1 and P2 related to effectiveness and efficiency.

5) Threats to Validity: Internal validity is concerned with the interpretability of the findings and may be threatened by the testers' prior experience with evolutionary computation and evolutionary testing (P3). Threats to external validity may influence the extent to which conclusions can be generalized, such as the representativeness of the selected case study systems. The number of repetitions of the experiments necessary to gain statistical meaning may also take too long in an industrial setting.

B. Test objects

The test objects are selected C functions from seven real-world automotive systems developed by the industrial partners Daimler and Berner & Mattner. The Active Brake Assistant (CS1) reduces the risk of dangerous situations and collisions by influencing the momentum of the brake if the braking action effected by the driver is not sufficient or the distance to a preceding vehicle decreases. The task of the Rear Window Defroster (CS2) is to enable a clear view through the rear window by automatically removing fog and ice; to do so, a thermal application on the window increases the temperature, depending on the overall state of the vehicle (e.g., ignition active) and the driver's requests. The purpose of the Global Powertrain Engine Controller (CS3) is to control a particular class of vehicle engines. StartStop (CS4) aims at improving the efficiency of a vehicle's resource consumption by switching off the engine when it is not needed and restarting it as soon as it is requested. The Vehicle Access System (CS5) manages the handling of all doors installed in a vehicle, including the remote control access system and all available sensor signals provided by the individual doors. The purpose of the Lighting System (CS6) is to control a comfort light system, handling the blinker lights as well as all advanced light functions. The selected Brake Assistant System (CS7) contains a component that deals with the assessment of the current situation of the vehicle.

All these systems are composed of one or more C code modules, which contain several C functions. With respect to the control flow graphs of these functions, only functions containing branches are relevant to the structural test case studies. Several preprocessing steps were performed to make the EvoTest framework applicable to the test objects. From the remaining C functions a total of 16 functions was selected, to restrict the overall duration of the case studies. The final set of functions used for the studies, including several code measures such as the number of branches and the number of input parameters, is shown in Table I. To enhance the optimization process, the co-domain of the input parameters was specified manually in some cases; for example, the range of variables known to be of Boolean type was set to [0, 1].

Figure 4. Average coverage over 30 runs for every configuration (ETF_Random, ETF_Manual, ETF_Automated) and each of the 16 test objects; asterisks mark statistically significant differences (p ≤ 0.05). At the top of each bar, the maximum and minimum coverage is indicated.

C. Results

This section provides an answer to the underlying research question of how successfully EvoTest technologies can be applied to real systems in industrial practice, by evaluating the set of propositions refined above.

1) Proposition 1: Figure 4 shows the average code coverage achieved during all case studies, whereas Figure 5 illustrates the search efficiency, i.e. the average number of fitness function evaluations, both segmented by test object and by search configuration. ETF Random achieved good coverage results for most functions: for 12 of the 16 functions, a coverage greater than 90% was accomplished. One explanation is the high effort: a much higher number of test executions was carried out reaching, or attempting to reach, the test targets (coverage of each branch in the function's control flow graph). For each individual test goal a maximum of 100 generations was allowed; thus, the number of fitness function evaluations for ETF Random is higher than for the other search configurations. With the ETF Manual configuration, the results were good regarding both search effectiveness and efficiency: compared to ETF Random, the achieved coverage is higher for 14 functions, and the number of fitness function evaluations is markedly below the ETF Random values. The control flow graphs of two functions were covered completely in each of the thirty related ETF Manual runs. In summary, the ETF using evolutionary computation is more effective in generating test cases when applied to real-world systems for structural testing, i.e. it yields higher structural code coverage than random testing. The evolutionary techniques showed a solution-approaching behaviour.
After reaching the desired areas in the search space, the evolutionary engine tried to approach the optimum by generating slightly different solutions. This process led to finding optimal solutions in most cases, resulting in a higher degree of covered branches compared to random testing. In addition, the ETF is more efficient in generating test cases, i.e. it reaches a given degree of structural code coverage using a smaller number of fitness function evaluations than random testing. The evolutionary optimization techniques needed significantly less effort, quantified by the number of fitness function evaluations, to achieve even higher code coverage. To gain measures of statistical significance, t-tests have also been performed on the achieved code coverage. Figure 4 shows statistical significance of the improvements in coverage achieved by ETF Manual; in contrast, ETF Automated results in an increase in coverage in only one case. As seen in Figure 4, only for four functions was a full average code coverage (100%) accomplished. For the other functions, the cause of the lower coverage is not necessarily a weakness of the ETF; in fact, for some C functions higher coverage is impossible.

2) Proposition 2: Automated parameter tuning worked well in that it gives almost the same results as manual tuning, but relieves the tester from the difficult and tedious task of manually specifying the parameters of the evolutionary engine, hence contributing to usability and more general acceptability in industry. In terms of effectiveness, automated parameter tuning delivered better performance for six out of 16 functions; however, this improvement is only statistically significant for two functions. For one function the effectiveness was worse, though not statistically significantly. Therefore, in terms of the proposition, automated parameter tuning improves the effectiveness of the search (statistically significantly) for 13% of the sampled functions, and the effectiveness is at least as good as for manual tuning for all functions. In terms of efficiency, automated parameter tuning improved on manual tuning for 10 out of 16 functions; however, only five of the improvements were statistically significant. For five functions the efficiency decreased (statistically significantly) with automated parameter tuning compared with manual parameter tuning. Therefore, in terms of the proposition, automated parameter tuning improves the efficiency of the search (statistically significantly). The results are expected to be better if more parameter settings were involved in the automated tuning, but that would cause a much higher execution time; the execution time of ETF Automated was already much higher than that of ETF Manual.

3) Proposition 3: From studying the maintained work diaries, it becomes clear that after installation of the ETF, the amount of time and effort needed to configure the ETF in
order to apply it to real-world systems for structural testing is worthwhile within our industrial settings. For structural testing, basically no manual effort is required for the preparation of the system under test (C functions), since the ETF offers many features that automate the preparation of the SUT. If the user has detailed knowledge of the SUT, he or she might specify a good set of bounds for the function to improve the search. We feel that the time necessary to set up the bounds for each function is rather long for industrial usage. Although bounds are not mandatory, there are opportunities here for the ETF to support the user in setting appropriate bounds. For example, the ETF could detect when an integer variable is only tested for truth within a function and set its bounds automatically to [0, 1]. Another example would be for the ETF to automatically disable member variables of pointers that are not referenced within the function under test. With a few enhancements of this kind, the time necessary to set up bounds could be reduced.

It was possible to use the ETF to automatically search for interesting test data, and hardly any detailed knowledge of evolutionary computation is required to do so. However, the parameters of the evolutionary engine can be specified manually in order to improve the evolutionary optimization and, as a result, retrieve test data that achieves higher code coverage; for this task, detailed knowledge of evolutionary computation is required. In the ETF Manual phase of the evaluation, where the engine parameters had to be tuned manually to improve the performance of the search, it was possible to evaluate the performance of each configuration using the average coverage achieved and the average number of evaluations, without knowing much about the specific search configuration used. Simple heuristics, such as varying the size of the population or the probabilities of mutation and crossover, were sufficient to allow the tester to generate and evaluate new evolutionary engine parameter sets.

Since time is always a crucial factor in industrial practice, the time necessary for the complete test data generation process for a certain test object is of major concern. For evolutionary structural testing, being a dynamic test data generation approach, the execution time of an optimization run varies, depending on the one hand on the size and complexity of the C code under test, and thus the execution time of its compilation, and on the other hand on the selected optimization parameters, such as the stopping criterion. In our experience, execution times varied from a few minutes for the small test objects to a few hours for the complex ones (cf. Table I). These long execution times are, however, compensated by the fact that, in contrast to many standard testing techniques, no manual interaction is required during the test data generation process.
Figure 5. Average number of fitness function evaluations over 30 runs for every configuration (ETF_Random, ETF_Manual, ETF_Automated) and each of the 16 test objects; asterisks mark statistically significant differences (p ≤ 0.05). At the top of each bar, the maximum and minimum number of evaluations is indicated.
VI. CONCLUSION

In this paper, an introduction to evolutionary testing was given, together with an outline of evolutionary algorithms, and structural testing was described in slightly more detail. The Evolutionary Testing Framework (ETF) was briefly introduced. For the structural testing case studies, a total of 16 functions from seven case study systems was selected. All case studies were introduced and carried out, their results presented in detail, and an overall assessment given. This paper closes with a conclusion on our experience with evolutionary structural testing in our industrial setting.

We have shown that evolutionary structural testing in an industrial setting is worthwhile and profitable. For structural testing, hardly any detailed knowledge of evolutionary computation is required to search for interesting test data. In all cases, evolutionary testing is more effective and efficient than random testing. Automated parameter tuning did not improve the results of the search as much as expected.

There are still several open issues. Pointers to dynamic data structures (variable-size arrays, recursive types such as lists, trees, and graphs) are not supported yet; there is ongoing work on this problem (e.g. by Lakhotia [22] and Prutkina and Windisch [23]). Furthermore, functions as arguments to functions (i.e. function pointers) still do not work. A way and motivation for supporting volatile variables and multi-function instrumentation has been given in [7]. Search space reduction and hybrids had originally been on the agenda for evaluation with the EvoTest framework, but they are not yet included. Future work on automated bounds detection seems valuable, too.

In addition, minimizing the number of generated individuals is highly desirable. This might be done using an assumption about the potential of individuals in order to predict which individuals are likely, or unlikely, to cause any improvement; such a prediction could be based on information about similar individuals executed in earlier generations. Another field of improvement is reliable stopping conditions for tests: there is no confident criterion yet for when to terminate a search, as the current options only include stopping after a certain number of generations or after reaching a specific fitness value. For broader acceptance in an industrial setting, it is important to improve automated parameter tuning, as it offers a chance to enable inexperienced testers to use evolutionary testing and gain valuable results. Different strategies for seeding test data [24] and for parallelization of tests (e.g. using multiple or distributed test targets) are requested. In addition, a set of common development guidelines on when to use which optimization technique might further increase the industrial acceptance of evolutionary testing for both average and advanced users.

There exists a real need in the embedded systems industry for guidelines on which test techniques to use for different testing objectives, different levels of testing, or different phases of system development; on how these techniques contribute to the overall reliability and dependability of the embedded system; and on how efficient and usable their application is. To date, such guidelines do not exist for traditional testing techniques, and even less is known about evolutionary testing. Because of the diversity of the field, empirical studies are essential to lay the foundations for these guidelines and hence to integrate them into a general test and quality assurance strategy for embedded systems. A central repository of example embedded systems that can act as a benchmark for the evaluation of evolutionary testing techniques would be helpful as well.
REFERENCES

[1] B. Beizer, Software Testing Techniques. Van Nostrand Reinhold, 1990.
[2] B. Jones, H. Sthamer, and D. Eyres, "Automatic structural testing using genetic algorithms," The Software Engineering Journal, vol. 11, no. 3, pp. 299–306, 1996.
[3] N. Tracey, J. Clark, K. Mander, and J. McDermid, "An automated framework for structural test-data generation," in Proceedings of the 13th IEEE Conference on Automated Software Engineering, Hawaii, USA, 1998.
[4] J. Wegener, K. Buhr, and H. Pohlheim, "Automatic test data generation for structural testing of embedded software systems by evolutionary testing," in GECCO '02: Proceedings of the Genetic and Evolutionary Computation Conference. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002, pp. 1233–1240.
[5] A. Baresel, H. Pohlheim, and S. Sadeghipour, "Structural and functional sequence test of dynamic and state-based software with evolutionary algorithms," in Proceedings of the Genetic and Evolutionary Computation Conference, 2003, pp. 2428–2441.
[6] T. Vos, "EvoTest," http://www.evotest.eu, Sep. 2006, last accessed 2009-06-08.
[7] H. Gross, P. M. Kruse, J. Wegener, and T. Vos, "Evolutionary white-box software test with the EvoTest framework: A progress report," in ICSTW '09: Proceedings of the IEEE International Conference on Software Testing, Verification, and Validation Workshops. Washington, DC, USA: IEEE Computer Society, 2009, pp. 111–120.
[8] M. Harman, L. Hu, R. Hierons, A. Baresel, and H. Sthamer, "Improving evolutionary testing by flag removal," in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002). New York, USA: Morgan Kaufmann, 2002, pp. 1233–1240.
[9] O. Bühler and J. Wegener, "Automatic testing of an autonomous parking system using evolutionary computation," in Proceedings of the SAE 2004 World Congress, March 2004.
[10] J. Wegener, K. Grimm, M. Grochtmann, H. Sthamer, and B. Jones, "Systematic testing of real-time systems," in Proceedings of the Fourth European International Conference on Software Testing, Analysis & Review, Amsterdam, The Netherlands, 1996.
[11] "ISO/CD 26262: Road vehicles — functional safety," committee draft, work in progress, Sep. 2008.
[12] "IEC 61508-3:1998, functional safety of electrical/electronic/programmable electronic safety-related systems, part 3: Software requirements," 1998.
[13] RTCA, Inc., "DO-178B, software considerations in airborne systems and equipment certification," Jan. 1992.
[14] S. E. Xanthakis, C. C. Skourlas, and A. LeGall, "Application of genetic algorithms to software testing," in Proceedings of the 5th International Conference on Software Engineering and its Applications, 1992, pp. 625–636.
[15] P. McMinn, "Search-based software test data generation: A survey," Software Testing, Verification and Reliability, vol. 14, no. 2, pp. 105–156, 2004.
[16] M. Roper, "Computer aided software testing using genetic algorithms," in 10th International Software Quality Week, San Francisco, USA, 1997. [Online]. Available: http://eprints.cdlr.strath.ac.uk/2668/
[17] J. Wegener, A. Baresel, and H. Sthamer, "Evolutionary test environment for automatic structural testing," Information and Software Technology, vol. 43, no. 1, pp. 841–854, 2001.
[18] "Eclipse integrated development environment." [Online]. Available: http://www.eclipse.org/
[19] L. Da Costa and M. Schoenauer, "Bringing evolutionary computation to industrial applications with GUIDE," in GECCO '09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2009, pp. 1467–1474.
[20] "ETF user manual & cookbook," downloadable from http://www.evotest.eu, 2009.
[21] ISO, "The ANSI C standard (C99)," ISO/IEC, Tech. Rep. WG14 N1124, 1999. [Online]. Available: http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf
[22] K. Lakhotia, M. Harman, and P. McMinn, "Handling dynamic data structures in search based testing," in GECCO '08: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation. New York, NY, USA: ACM, 2008, pp. 1759–1766.
[23] M. Prutkina and A. Windisch, "Evolutionary structural testing of software with pointers," in ICSTW '08: Proceedings of the 2008 IEEE International Conference on Software Testing Verification and Validation Workshop. Washington, DC, USA: IEEE Computer Society, 2008, p. 231.
[24] A. Arcuri, D. R. White, J. Clark, and X. Yao, "Multi-objective improvement of software using co-evolution and smart seeding," in Proceedings of the 7th International Conference on Simulated Evolution And Learning (SEAL '08), ser. Lecture Notes in Computer Science, vol. 5361. Melbourne, Australia: Springer, Dec. 2008, pp. 61–70.