Neural Comput & Applic (2014) 25:199–216 DOI 10.1007/s00521-013-1474-z

ORIGINAL ARTICLE

Harmony search-based test data generation for branch coverage in software structural testing Chengying Mao

Received: 12 July 2012 / Accepted: 22 August 2013 / Published online: 11 September 2013. © Springer-Verlag London 2013

Abstract Test data generation is always a key task in the field of software testing. In recent years, meta-heuristic search techniques have been considered an effective way to assist test data generation in software structural testing: representative test cases with high coverage capability can be picked out from the program input space. Harmony search (HS) is a recently developed algorithm that has been vigorously applied to various optimization problems. In this paper, we apply the harmony search algorithm to generate test data satisfying branch coverage. At the preprocessing stage, the probes used for gathering coverage information are inserted into all branches via program static analysis. At the same time, the encoding and decoding between a test case and a harmony are also determined in advance. At the stage of test data searching, the subset of test data with stronger covering ability is stored in harmony memory. During the evolution process, one part of the test suite is selected and adjusted from the harmony memory, and the other part is randomly generated from the input space. Once a test suite is yielded after one round of search, its coverage is measured by the fitness function in our search algorithm. In our work, a new fitness function for branch coverage is constructed by comprehensively considering branch distance and branch weight. Here, the branch weight is determined by branch information in the program, that is, the nesting level of a specific branch and the predicate types in it. Subsequently, the computed coverage metric is used for updating the test suite in the next round of searching. In order to validate the effectiveness of our proposed method, eight well-known programs are used for experimental evaluation. Experimental results show that the coverage of the HS-based method is usually higher than that of other search algorithms, such as simulated annealing (SA) and genetic algorithm (GA). Meanwhile, HS demonstrates greater stability than SA and GA when varying the population size or performing repeated trials. That is to say, the music-inspired HS algorithm is more suitable for generating test data for branch coverage in software structural testing.

Keywords Test data generation · Harmony search · Fitness function · Branch coverage · Branch weight

C. Mao, School of Software and Communication Engineering, Jiangxi University of Finance and Economics, Nanchang 330013, China. e-mail: [email protected]

1 Introduction

With the wide application of information technology, software plays a crucial role in human society. However, due to the frequent occurrence of software vulnerabilities and failures, software dependability has attracted much attention from both software users and developers [1]. Although new methods for requirement analysis and software design have been proposed to improve software quality, software errors are still inevitable. Therefore, testing and maintenance remain necessary after software release. Indeed, it has been shown that testing in the later stages of the whole software life cycle is still the most effective method of software quality assurance (SQA) [2].

During the software testing process, test data directly influence the probability of finding the potential faults in software code. How to generate a test data set with high coverage and strong fault-revealing ability is a difficult problem, especially for structural testing. In software structural testing, a specific construct element, such as


statement, branch or path, is usually selected as the coverage criterion. Then, test data are produced to cover the set of such elements [3]. In general, structural testing methods can be classified into two types: deterministic methods and stochastic methods. The former include iterative relaxation [4], interval computation [5], symbolic execution [6, 7] and so on. Their main idea is to convert the coverage of construct elements into a constraint-solving problem, so as to generate test inputs satisfying a specific coverage criterion [8–10]. The difficulty of the traditional deterministic methods lies in the obstacle of handling special structures like arrays and pointers. Recently, an improved technique called dynamic symbolic execution has been proposed [11]. It can handle the above complex variable structures, but the constraint solving in the later stage usually causes a search space explosion. Although well-known constraint solvers like Z3 (http://z3.codeplex.com/) or Minion (http://minion.sourceforge.net/) have been widely utilized, constraint satisfaction is essentially an NP-hard problem in computer science.

The other kind of method is to sample test cases from the software input domain with the help of stochastic search algorithms. Then, the test data set is optimized according to the feedback coverage information; this is so-called search-based software testing (SBST) [12, 13]. At present, some meta-heuristic search (MHS) techniques, such as simulated annealing (SA) [14], ant colony optimization (ACO) [15] and genetic algorithm (GA) [16–18], have been adopted for generating test cases. According to experimental analyses, the above MHS algorithms can produce test data with good fault-revealing ability [19]. Harmony search (HS) [20, 21] is a newly emerging search algorithm in the field of evolutionary computing. In this paper, we attempt to use this algorithm to generate test data for software structural testing.
To the best of our knowledge, this is the first time the HS algorithm has been used to solve the structural test data generation problem. More specifically, the main contributions of the paper are as follows.

• A framework for applying HS to generate test data is presented. In the framework, we mainly discuss the details of how to combine the HS algorithm with the testing process.
• A new branch-based fitness function is constructed. More importantly, a method for measuring branch weight is also proposed.
• Some well-known programs are utilized for an empirical study, and the effectiveness of the HS-based method is validated.
• A comparative analysis between HS and other search algorithms is also performed. Experimental results show that HS outperforms the others both in terms of coverage and searching performance.

The paper is structured as follows. In the next section, we briefly introduce the search-based software testing problem and address related work. The basic harmony search algorithm is introduced in Sect. 3. In Sect. 4, the overall framework of HS-based test data generation and the implementation details are discussed. The experimental evaluation of the three search algorithms is then performed in Sect. 5, where four research questions are investigated. Finally, concluding remarks are given.

2 Background

2.1 Search-based software testing

During the whole software testing process, test data generation is a very important task [22]. In essence, it is an activity of sampling test cases from the input domain step by step so as to reveal the potential faults in a program. However, the position of a fault in a program is difficult to know in advance, so the appearance of a fault usually cannot be treated as a criterion for stopping the testing process. In practice, the following assumption has been widely accepted: the more program constructs are covered, the higher the probability of finding the hidden faults. Accordingly, the coverage of program constructs is usually chosen as an index to determine whether the testing activity is adequate or not [23]. As a consequence, the test data generation problem can be converted into a covering problem over program constructs. In order to realize full coverage of the target elements with a test data set of as small a size as possible, it is necessary to introduce effective search algorithms.

For a given program under test (PUT) P, suppose it contains n input variables x = (x1, x2, ..., xn). For each input variable xi, its input interval can be denoted Ri (1 ≤ i ≤ n); then the input domain of the whole program can be expressed as R = R1 × R2 × ... × Rn. The testing problem, in fact, is to select a subset of the input domain R according to a specified coverage criterion C, so as to realize full coverage of the corresponding construct elements. In general, the execution state of a program element (e.g., statement, branch or path) is determined by the program's input data. Hence, the satisfaction degree of criterion C can be modeled as a function f(X) of the input variables, where X is the set of input vectors, i.e., X = {x}. Consequently, test data generation can ultimately be transformed into an optimization problem.
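The optimization view above can be rendered as a minimal search loop. The sketch below is our own illustration (the names, the toy fitness and the pure random-sampling strategy are not from the paper); it only shows the shape of the problem: propose an input vector from R, score it with a coverage-based fitness f(X), and keep the best candidate until full coverage or the iteration budget is exhausted.

```cpp
#include <cstdlib>
#include <functional>
#include <vector>

// Sketch of the outer search loop (our own simplified rendering): a search
// algorithm proposes inputs, the instrumented program reports a coverage-based
// fitness, and the loop stops at full coverage or after maxGen rounds.
struct Result { std::vector<int> input; double fitness; };

Result searchLoop(const std::function<double(const std::vector<int>&)>& coverageFitness,
                  int numVars, int lo, int hi, int maxGen) {
    std::srand(42);                                  // fixed seed for reproducibility
    Result best{std::vector<int>(numVars, lo), -1.0};
    for (int gen = 0; gen < maxGen; ++gen) {
        std::vector<int> cand(numVars);              // here: pure random sampling;
        for (int& x : cand)                          // the paper plugs in SA/GA/HS instead
            x = lo + std::rand() % (hi - lo + 1);
        double fit = coverageFitness(cand);          // feedback from execution traces
        if (fit > best.fitness) best = {cand, fit};
        if (best.fitness >= 1.0) break;              // full coverage w.r.t. criterion C
    }
    return best;
}
```

In the paper's setting, `coverageFitness` stands for running the instrumented PUT and measuring branch coverage; here any callable scoring an input vector in [0, 1] will do.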


Fig. 1 The basic framework of search-based test data generation. (Diagram omitted in this extraction: the search algorithm generates a test suite; executing the program under test produces results, traces and coverage information; the fitness function, referring to the coverage criterion, uses this information to update the search algorithm.)

The basic procedure of search-based test data generation can be illustrated via Fig. 1 [24]. At the initial stage, lexical and syntactic analyses are performed on program P to build the control flow graph (CFG) and gather branch predicates. Then, probes are seeded into the program under the guidance of the CFG. Meanwhile, the fitness function f(X) can be constructed according to the branch predicates. Subsequently, an initial test data set is randomly generated by the search algorithm. The test driver individually picks inputs from this set to run program P. During the testing process, execution traces are gathered by the probes instrumented in the program. Based on this information, the coverage w.r.t. criterion C can be automatically calculated. In the next step, the search algorithm adjusts its searching direction according to the feedback coverage information. The searching process stops when the test data set realizes full coverage w.r.t. criterion C or the maximum number of iterations maxGen is reached. It is not hard to see that, apart from static analysis of program P, fitness function construction, search direction adjustment and the interaction between test execution and searching activity are the key points of the search-based test data generation problem. We will address these issues in the following sections.

2.2 Related work

Test data generation is the key problem of automated software testing [22], which has attracted extensive attention from researchers in the past decades. In general, test inputs for functional testing, such as random testing and boundary value analysis, are easily produced, but their fault exposure capability is not very high. On the contrary, structural testing can effectively reveal the potential defects in a program, and meta-heuristic search algorithms are usually utilized to generate test data satisfying given coverage criteria. At present, most

popular meta-heuristic search algorithms have been adopted to derive test input data. Here, we mainly consider the existing research most closely related to our work.

Genetic algorithm was first proposed by Holland [25] and showed a strong ability to obtain approximate solutions for complex problems such as NP-hard ones. In the 1990s, GA was adapted to generate test data. Sthamer [16], Jones et al. [17] and Pargas et al. [18] investigated the usage of GA for automated generation of test data. All of them adopted the control flow graph (CFG) as the formal model for the program under test and used GA to generate test inputs for branch coverage. Experiments on some small programs showed that GA could be up to several times better than random search. Michael et al. [26] presented an implementation of Korel's function minimization approach [27] to generate test data using GA. In addition, testability transformation [28] has been viewed as an effective way to reduce the difficulty and cost of test data generation, and GA was also used to tackle the flag problem in testability transformation [29].

Another well-known search algorithm is simulated annealing [30], which can solve complex optimization problems based on the idea of neighborhood search. Tracey et al. [14] proposed a framework to generate test data based on the simulated annealing algorithm. Their method can incorporate a number of testing criteria, for both functional and non-functional properties, but their experimental analysis is not sufficient. On the other hand, Cohen et al. [31] adopted the simulated annealing technique to generate test data for combinatorial testing; obviously, their method is mainly intended for functional testing.

In recent years, some other meta-heuristic search algorithms have also been adopted for solving the test data generation problem. Particle swarm optimization (PSO) [32], a swarm intelligence technique, has been applied to generate structural test data by Windisch et al. [33]. However, their experiments mainly focused on artificial test objects. Li et al. [34] adopted the ant colony optimization (ACO) algorithm to produce test sequences for state-based software testing, but their method belongs to function-oriented rather than structure-oriented testing; here, we mainly focus on test data generation for structural testing. In addition, Sagarna et al. [35] presented an approach to generate test data for object-oriented software by means of estimation of distribution algorithms (EDAs). They validated the effectiveness of their approach on three container classes, but our testing objects are mainly C programs at the current stage. The reasons for selecting SA and GA as comparative algorithms can be stated as follows: (1) SA and GA are two of the most classic algorithms in the field of evolutionary computing; (2) GA is the most popular algorithm for the test data generation problem [16–18, 36] in the existing research.


Thus, we mainly compare our HS-based method with them in the subsequent experimental evaluation section. Since HS exhibits a better effect on solving optimization problems than some other meta-heuristic search algorithms, it has been introduced to tackle problems from various application fields [21]. For example, HS has been used to solve some well-known engineering problems in the field of computer science, such as task assignment [37], data mining [38–40] and service computing [41]. However, its applications in software testing are very rare. Up to now, only the work of Rahman et al. [42] is worth paying attention to. In their work, the HS algorithm was adopted for constructing a variable-strength t-way strategy, and experimental results demonstrated that the HS-based method gave competitive results against most existing AI-based and pure computational counterparts. However, their work belongs to combinatorial testing, not program-oriented (a.k.a. structural) testing. In our previous work [24], the basic idea of applying HS to generate test data for branch coverage was proposed, but only preliminary results were reported. In this paper, in-depth research on the key techniques, such as the algorithm process, fitness function construction and branch weight estimation, is performed. Moreover, more detailed experiments, such as impact analysis on population size and statistical analysis on repeated trials, are also addressed. As far as we know, our work is the first attempt to utilize the HS algorithm to generate test data for software structural testing.

3 Harmony search algorithm

Diversification and convergence are two important factors in the effectiveness of a meta-heuristic search algorithm, but they are contradictory after all. In fact, HS is a swarm-based heuristic algorithm, so it can easily balance these two factors through swarm-intelligence strategies such as parallel evolution and an elitism-preserving policy. Moreover, due to its simplicity and easy implementation, HS has been applied to various optimization problems.

The harmony search algorithm was first developed by Geem et al. [20] in 2001. As a meta-heuristic optimization algorithm, it originally came from the analogy between music improvisation and the optimization process. It is inspired by the observation that the aim of music is to search for a perfect state of harmony; the effort to find harmony in music is analogous to finding optimality in an optimization process [43]. In general, a musical performance seeks a best state (fantastic harmony) determined by esthetic estimation, just as an optimization process seeks a best solution determined by objective function evaluation [20]. Esthetic estimation is done over the set of pitches sounded by the joined instruments, as objective function evaluation is done over the set of values of the composed variables. Another important commonality is that both find their best results through multiple iterations. The harmony search algorithm mainly consists of the following four steps:

1. Initialize the harmony memory (HM). HM is filled with as many randomly generated solution vectors as its size shm, and each harmony vector Xi = (x1, x2, ..., xn) contains n pitches, where 1 ≤ i ≤ shm. In other words, the total number of design variables (a.k.a. the design pool size) is n.
2. Improvise a new harmony. A new harmony vector is generated by the following three rules: HM consideration, pitch adjustment and totally random generation. In this step, two parameters determine the operation types: rhmc is the rate of choosing one value from the historical values stored in HM, and rpa is a probability known as the pitch adjustment rate. First, a random number rndi is generated between 0 and 1 for each vector Xi. If rndi is smaller than or equal to rhmc, the new vector X'i is chosen from the values stored in HM; otherwise, discrete random values are assigned to the variables in the vector. Subsequently, if a variable attains its value from harmony memory, it is further checked whether this value needs the additional pitch adjustment action, in line with parameter rpa.
3. Update the harmony memory. Once a new harmony vector is generated, its objective function value can be calculated. If its value is better than the worst harmony in the HM, the new harmony is included in the HM, and the current worst harmony is excluded from the HM simultaneously.
4. Check the stopping criterion. If the stopping criterion is satisfied, computation terminates. Otherwise, Steps (2) and (3) are repeated.
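The four steps can be sketched on a toy one-dimensional problem. Everything below (the objective, parameter values and names) is an illustrative assumption, not the paper's implementation; it maximizes f(x) = -(x - 3)^2 over [-10, 10].

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Minimal sketch of the four HS steps on a toy 1-D maximization problem.
// Returns the best harmony found after maxGen improvisations.
double harmonySearch(int shm, int maxGen, double rhmc, double rpa, double brange,
                     unsigned seed) {
    auto f = [](double x) { return -(x - 3.0) * (x - 3.0); };  // toy objective
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> dom(-10.0, 10.0), u01(0.0, 1.0), eps(-1.0, 1.0);
    std::uniform_int_distribution<int> pick(0, shm - 1);

    std::vector<double> hm(shm);                     // Step 1: initialize HM randomly
    for (double& h : hm) h = dom(rng);

    for (int gen = 0; gen < maxGen; ++gen) {
        double x;                                    // Step 2: improvise a new harmony
        if (u01(rng) <= rhmc) {
            x = hm[pick(rng)];                       // memory consideration
            if (u01(rng) <= rpa) x += brange * eps(rng);  // pitch adjustment
        } else {
            x = dom(rng);                            // totally random generation
        }
        auto worst = std::min_element(hm.begin(), hm.end(),
                                      [&](double a, double b) { return f(a) < f(b); });
        if (f(x) > f(*worst)) *worst = x;            // Step 3: replace the worst harmony
    }                                                // Step 4: stop after maxGen rounds

    return *std::max_element(hm.begin(), hm.end(),
                             [&](double a, double b) { return f(a) < f(b); });
}
```

With typical settings (e.g., shm = 5, rhmc = 0.9, rpa = 0.3), the returned harmony tends to close in on the optimum x = 3 as the number of improvisations grows.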

4 HS-based test data generation

4.1 Overall framework

As mentioned above, the difficulty of search-based test data generation lies in the cooperation between the testing process and the search algorithm. Here, the interaction between HS and test execution is shown in Fig. 2. In each iteration, the solutions are first decoded into test inputs. Then, they are seeded into the program under

Fig. 2 The illustration for the process of HS-based test data generation. (Diagram omitted in this extraction: the HS algorithm selects from harmony memory via the random generator and the pitch adjuster to generate a new solution set; the set is decoded into a test suite, executed dynamically against the instrumented PUT to produce coverage information, and evaluated via the fitness function w.r.t. the coverage criterion to update the harmony memory.)

test for collecting trace information via test execution. Based on the execution traces, the fitness of each test input and the overall coverage can be naturally calculated. Subsequently, the HS algorithm utilizes this information to update the population. The population of a new generation is produced through the following three kinds of operations.

1. Selection operation. The algorithm chooses some test cases from harmony memory with probability rhmc. This step steers the population toward the overall optimal direction.
2. Mutation operation. For the test cases selected from the preceding generation, the algorithm performs pitch adjustment on them with probability rpa. This action enhances the local search ability of the algorithm.
3. Compensation operation. In addition, test cases in proportion 1 - rhmc are randomly generated from the input space in order to maintain the diversity of the whole population.

Once a new generation is yielded, the values in HM will also be decoded into the inputs for testing. When the full coverage is realized or the max evolution generation maxGen is reached, the searching process will be stopped. Furthermore, the encoding of input data, algorithm implementation and fitness function construction are also the key points and will be addressed in the following subsections.
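One round of this population update can be sketched as follows, under our own simplified encoding (a test case as a vector of doubles in a box-shaped domain; all function names are illustrative, while the parameter names rhmc, rpa and brange follow the paper):

```cpp
#include <algorithm>
#include <random>
#include <vector>

using TestCase = std::vector<double>;

// One generation of the selection / mutation / compensation scheme:
// with probability rhmc, reuse an elite test case from harmony memory
// (mutating each variable with probability rpa); otherwise sample fresh.
std::vector<TestCase> nextGeneration(const std::vector<TestCase>& hm,
                                     int m, int n, double lo, double hi,
                                     double rhmc, double rpa, double brange,
                                     std::mt19937& rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0), dom(lo, hi), eps(-1.0, 1.0);
    std::uniform_int_distribution<size_t> pick(0, hm.size() - 1);
    std::vector<TestCase> pop;
    for (int j = 0; j < m; ++j) {
        TestCase tc(n);
        if (u01(rng) <= rhmc) {                 // selection: reuse an elite test case
            tc = hm[pick(rng)];
            for (double& x : tc)
                if (u01(rng) <= rpa) {          // mutation: pitch adjustment
                    x += brange * eps(rng);
                    x = std::min(hi, std::max(lo, x));   // keep inside the domain
                }
        } else {                                // compensation: fresh random sample
            for (double& x : tc) x = dom(rng);
        }
        pop.push_back(tc);
    }
    return pop;
}
```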

4.2 Algorithm details

For the testing problem mentioned in Sect. 2.1, suppose the test data set satisfying coverage criterion C is Stc = {tc1, tc2, ..., tcm}, where m is the number of test cases. Each test case is tci = (x1, x2, ..., xn), and f(tci) is the fitness of the test case. Meanwhile, the fitness of the whole test data set is denoted f(Stc); its concrete expression is given in the next subsection. Accordingly, the above test data generation problem can be converted into the optimization problem below. Simply stated, the search process is to produce a test suite Stc satisfying coverage criterion C as far as possible. Once Stc satisfies a criterion C such as branch coverage, the fitness of Stc [i.e., f(Stc)] reaches the optimal value.

   max f(Stc), where Stc is a test suite with a given size   (1)

After the problem is formally modeled, the HS algorithm can be introduced to solve it. The details are as follows.

1. Initialize parameters and HM. Different from traditional GA, the HS algorithm directly encodes the numeric values of the input variables into the pitches of a harmony. A harmony represents a test case, and harmony memory stores the subset of test cases with much stronger covering ability than the rest of the population. For a program with n input variables, the shm test cases with the strongest covering ability are organized as a harmony memory (HM) as below.

      HM = [ x1_1     x1_2     ...  x1_n
             x2_1     x2_2     ...  x2_n
             ...
             x{shm}_1 x{shm}_2 ...  x{shm}_n ]   (2)

   where (xk_1, xk_2, ..., xk_n) (1 ≤ k ≤ shm) represents an elite test case in the whole population. In general, shm varies in the range from 1 to 40. Some other parameters, such as rhmc, rpa, maxGen and the population size m, are also set in this step.

2. Generate new solutions. In order to generate a test suite Stc w.r.t. coverage criterion C, we adopt a parallel style to realize the population update. A new solution (i.e., test case) tcj(t+1) (here j ∈ [1, m]) in generation t+1 is produced as shown in formula (3).

      tcj(t+1) = { tcl(t) = (xl_1, xl_2, ..., xl_n),            if rndj ≤ rhmc
                 { randomly selected data from the input domain, otherwise     (3)

   where l is a random integer within [1, shm] and rndj is a random number from 0 to 1. If a solution is selected from HM, a local adjustment with probability rpa is further conducted on it. Here, we adopt the linear adjustment shown in formula (4).

      xnew = xold + brange · ε   (4)

   where xold is the existing pitch stored in HM, xnew is the new pitch after the pitch adjustment, brange is the pitch bandwidth, and ε is a random number drawn uniformly from [-1, 1]. In effect, this step produces a new pitch by adding a small random amount to an existing pitch.

3. Evaluate solutions. For a newly generated test case tci (1 ≤ i ≤ m), we evaluate it according to the fitness function. If its fitness is better than the worst harmony hworst in HM, i.e., f(tci) > f(hworst), the worst harmony is replaced with tci. When all solutions in the population are updated, the algorithm checks the termination condition. For the test data generation problem, the stopping rules are: (1) the test data set achieves full coverage w.r.t. criterion C, or (2) the number of iterations reaches maxGen.

4.3 Fitness function

In order to ensure that the search algorithm finds a test suite satisfying a specific coverage criterion, coverage


information plays an important role in adjusting the search direction. In software structural testing, construct elements such as statements, branches, paths and definition-use pairs can be treated as covered objects. In our study, we mainly consider the widely used branch coverage [44] as the search objective. That is to say, the approved test suite is the set that can traverse all branches in the program. Without loss of generality, the branch coverage information is collected via the probes seeded in the PUT.

In search-based test data generation, the search algorithm knows nothing about the problem except fitness information during the search process. Naturally, only fitness can reflect how good a harmony solution is in relation to the global optimum. In general, the optimal solution is attained by comparing the fitness value of each harmony, which is measured by decoding the harmony vector into an objective function. According to existing testing experience, branch coverage is the most cost-effective criterion in structural testing. In our work, we adopt the concept of branch distance to construct the fitness function, following Korel's and Tracey's work [14, 27]. The so-called branch distance is the deviation for a given branch predicate when the program variables are assigned input values. In Table 1, the branch distance functions for several kinds of predicates are listed in the third column. Here, the value k (k > 0) refers to a constant which is always added if the term is not true.

A branch distance function can be constructed for each branch predicate in the PUT according to Table 1. Then, the fitness function of the whole program can be defined by comprehensively considering the fitness of each branch. Suppose a program has s branches; the fitness of the whole program is calculated via formula (5).

   Fitness = 1 / (h + Σ_{i=1}^{s} wi · f(bchi))²   (5)

where f(bchi) is the branch distance function for the ith branch in the program, h is a small constant (set to 0.01 in our experiments), and wi is the corresponding weight of the ith branch. Obviously, Σ_{i=1}^{s} wi = 1. Generally speaking, each branch is assigned a different weight according to its reachable difficulty; the calculation method of branch weight is addressed in the next subsection. Since the covered state of a branch is determined by the test data set, the fitness in formula (5) indeed depends on the test suite Stc. As a consequence, the optimization objective f(Stc) in formula (1) is represented by this fitness in our algorithm implementation.

As mentioned above, our fitness function is designed according to the concept of branch distance.

Table 1 The branch functions for several kinds of branch predicates

No. | Predicate | Branch distance function f(bchi)
----|-----------|----------------------------------
1   | boolean   | if true then 0 else k
2   | ¬a        | negation is propagated over a
3   | a = b     | if abs(a - b) = 0 then 0 else abs(a - b) + k
4   | a ≠ b     | if abs(a - b) ≠ 0 then 0 else k
5   | a < b     | if a - b < 0 then 0 else abs(a - b) + k
6   | a ≤ b     | if a - b ≤ 0 then 0 else abs(a - b) + k
7   | a > b     | if b - a < 0 then 0 else abs(b - a) + k
8   | a ≥ b     | if b - a ≤ 0 then 0 else abs(b - a) + k
9   | a and b   | f(a) + f(b)
10  | a or b    | min(f(a), f(b))
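The entries of Table 1 translate directly into code. The helper names below are ours, and k is fixed at 1.0 purely for illustration; the relational cases return 0 when the predicate holds and the distance plus k otherwise, while the and/or cases combine the distances of their operands.

```cpp
#include <algorithm>
#include <cmath>

// Branch distance functions in the style of Table 1 (Tracey-style), with the
// constant k > 0 added whenever the predicate is not satisfied.
const double K = 1.0;

double distEq (double a, double b) { return std::fabs(a - b) == 0 ? 0.0 : std::fabs(a - b) + K; } // a = b
double distNeq(double a, double b) { return std::fabs(a - b) != 0 ? 0.0 : K; }                     // a != b
double distLt (double a, double b) { return a - b <  0 ? 0.0 : std::fabs(a - b) + K; }             // a < b
double distLe (double a, double b) { return a - b <= 0 ? 0.0 : std::fabs(a - b) + K; }             // a <= b
double distGt (double a, double b) { return b - a <  0 ? 0.0 : std::fabs(b - a) + K; }             // a > b
double distGe (double a, double b) { return b - a <= 0 ? 0.0 : std::fabs(b - a) + K; }             // a >= b
double distAnd(double fa, double fb) { return fa + fb; }                                           // a and b
double distOr (double fa, double fb) { return std::min(fa, fb); }                                  // a or b
```

For example, an input with a = 5, b = 7 has distance 3.0 (= |5 - 7| + k) from satisfying the branch `a == b`, so the search is pulled toward inputs that shrink that gap.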

Although other researchers' methods also follow this technical routine, the differences from them can be summarized as follows. The reachable difficulties of different branches are obviously not identical, so each branch has to be assigned a different weight in the fitness function. However, some existing methods need manual help to determine branch weights; in our proposed method, the branch weight is automatically calculated by analyzing the branch statements in the program. On the other hand, the existing methods in [12, 19, 45, 46] have also considered the reachability of branches, but the dependence relations between branches must be analyzed first. Generally speaking, dependence analysis causes considerable cost in the testing activity. By comparison, our approach only needs information such as predicate components and the nesting level of a branch, which can be obtained from programs more easily. In a word, the fitness form presented here is an effective and useful complement to the existing functions.

4.4 Branch weight

According to formula (5), the fitness compels the search algorithm to spend more effort generating test cases that cover the branches with higher weight. Therefore, it is very important to assign a precise weight to each branch in line with its reachable difficulty. Without loss of generality, the reachable difficulty of a branch is determined by two factors: nesting weight and predicate weight.

Generally speaking, a branch at a deeper nesting level is harder to reach. For branch bchi (1 ≤ i ≤ s), suppose its nesting level is nli, and let nlmax and nlmin be the maximum and minimum nesting levels over all branches. Then the nesting weight of branch bchi can be computed as follows.

   wn(bchi) = (nli - nlmin + 1) / (nlmax - nlmin + 1)   (6)

Furthermore, this weight can be normalized via formula (7).

   wn'(bchi) = wn(bchi) / Σ_{i=1}^{s} wn(bchi)   (7)

In general, the nesting level of a branch can be determined automatically by program static analysis. Once nli of the ith branch is obtained, its nesting weight is calculated according to formulas (6) and (7). It is not hard to see that a branch at a deeper nesting level has a greater weight.

Besides the nesting depth, the difficulty of satisfying a branch also depends on its predicate condition. According to the semantics of the predicate statement, we can roughly classify predicate conditions into the following four groups: equation, boolean expression, inequality and non-equation. In Table 2, we provide a reference weight model for all kinds of predicate conditions from the perspective of cognitive information [47]. This reference model mainly aims at the case of a branch predicate with only one condition. However, in practice a branch predicate may contain several conditions; this structural information can also be obtained automatically via static analysis of the program. Here, we assume branch predicate bchi (1 ≤ i ≤ s) contains u conditions. For each condition cj (1 ≤ j ≤ u), its reference weight wr(cj) is determined according to Table 2. Thus, the weight wp(bchi) of branch predicate bchi is assigned according to the rules shown in formula (8).

1. If predicate bchi is formed by combining u conditions with the and operation, its predicate weight is the square root of the sum of wr²(cj). In general, the number of conjunctions in a predicate is not more than four, so wp(bchi) will be less than 2.0 in most situations.
2. If the condition conjunction is or, the predicate weight is the minimum element in the set of wr(cj) (1 ≤ j ≤ u).

Table 2 The reference weights of predicate conditions

No.   Condition type    Reference weight
1     =                 0.9
2     <, ≤, >, ≥        0.6
3     Boolean           0.5
4     ≠                 0.2


Neural Comput & Applic (2014) 25:199–216

$$wp(bch_i) = \begin{cases} \sqrt{\sum_{j=1}^{u} w_r^2(c_j)}, & \text{if the conjunction is } and \\ \min_j \{ w_r(c_j) \}, & \text{if the conjunction is } or \end{cases} \quad (8)$$

Similarly, the predicate weight of each branch can also be normalized over all branches:

$$wp'(bch_i) = \frac{wp(bch_i)}{\sum_{i=1}^{s} wp(bch_i)} \quad (9)$$

For each branch (i.e., predicate) bch_i (1 ≤ i ≤ s), its corresponding weight w_i in formula (5) is the combination of wn'(bch_i) and wp'(bch_i):

$$w_i = \alpha \cdot wn'(bch_i) + (1 - \alpha) \cdot wp'(bch_i) \quad (10)$$

where α is an adjustment coefficient, set to 0.5 in our experiments.
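As a concrete illustration, the weighting scheme of formulas (6)–(10) can be sketched in code. The following is an illustrative Python sketch (the experiments in this paper were implemented in C++); the function names and the condition-type encoding are our own assumptions, while the reference weights follow Table 2 and α = 0.5 as above.

```python
import math

# Reference weights for single predicate conditions (Table 2).
COND_WEIGHT = {"==": 0.9, "<": 0.6, "<=": 0.6, ">": 0.6, ">=": 0.6,
               "bool": 0.5, "!=": 0.2}

def nesting_weights(levels):
    """Formulas (6)-(7): raw nesting weight, then normalized over all branches."""
    lo, hi = min(levels), max(levels)
    raw = [(nl - lo + 1) / (hi - lo + 1) for nl in levels]
    total = sum(raw)
    return [w / total for w in raw]

def predicate_weight(cond_types, conjunction):
    """Formula (8): combine the reference weights of a branch's conditions."""
    ws = [COND_WEIGHT[c] for c in cond_types]
    if conjunction == "and":
        return math.sqrt(sum(w * w for w in ws))
    return min(ws)  # conjunction == "or"

def branch_weights(levels, predicates, alpha=0.5):
    """Formulas (9)-(10): normalize predicate weights, blend with nesting weights."""
    wn = nesting_weights(levels)
    wp_raw = [predicate_weight(conds, conj) for conds, conj in predicates]
    total = sum(wp_raw)
    wp = [w / total for w in wp_raw]
    return [alpha * n + (1 - alpha) * p for n, p in zip(wn, wp)]

# Three hypothetical branches: nesting levels 1, 2, 2 and their conditions.
levels = [1, 2, 2]
predicates = [(["=="], "and"), (["<", "<="], "and"), (["!="], "or")]
w = branch_weights(levels, predicates)
print([round(x, 3) for x in w])
```

On these three hypothetical branches the weights sum to 1, and the deeply nested two-condition branch receives the largest weight, as the scheme intends.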

5 Experimental evaluation

5.1 Experimental setup

To validate the effectiveness of our proposed method for generating test data, we use eight programs for comparative analysis. They are well-known benchmark programs that have been widely adopted in other researchers' work; some are extracted from the book Numerical Recipes in C and are available online at http://www.nr.com/. The details of these programs are listed in Table 3. The first program, triangleType, receives three integers as inputs and decides what kind of triangle they represent: equilateral, isosceles, scalene or not a triangle [36, 48, 49]; we rewrote it according to this specification. Program calDay computes the day of the week, given a date as three integer arguments [36, 48, 49]. Program isValidDate, also implemented by us, checks whether a date formatted as (year, month, day) is valid. Program remainder calculates the remainder of the division of two integers [49]. Program computeTax calculates the US federal personal income tax; here we converted the Java program in [50] into C++. The sixth program, bessj, computes the Bessel function of general integer order [49]. Program printCalendar prints the standard calendar of a month in a specific year [50]. The last program, line, judges whether two rectangles overlap [49]. Considering the diversity of program structure, we selected benchmark programs covering sequence, branch and loop structures: all of the above programs contain sequence and branch structures, and most of them also contain loops.

Since SA and GA are two popular algorithms for the test data generation problem, we mainly compare our HS-based method against them in this section, performing a deep contrast analysis from several aspects. The parameter settings of the three algorithms are shown in Table 4; the parameters of each algorithm are the most appropriate settings for most programs under test in our experiments. Besides the special parameters of each search algorithm, the common parameters are set as follows: the maximum evolution generation (maxGen) is 100, and the number of repeated runs (g) is 1,000. The population size (m) is set to 30 for the first six programs and to 50 for the last two. The experiments were run on 32-bit MS Windows 7, on a Pentium 4 at 2.4 GHz with 2 GB memory. The algorithms were implemented in C++ on the MS Visual Studio .NET 2010 platform.

Table 3 The benchmark programs used for experimental analysis

Program        No. of arg   No. of branch   LOC   Description                                               Source
triangleType   3            5               31    Type classification for a triangle                        Ref. [36, 48, 49]
calDay         3            11              72    Calculate the day of the week                             Ref. [36, 48, 49]
isValidDate    3            16              59    Check whether a date is valid or not                      By authors
remainder      2            18              49    Calculate the remainder of an integer division            Ref. [49]
computeTax     2            11              61    Compute the federal personal income tax                   Ref. [50]
bessj          2            21              245   Bessel Jn function                                        Ref. [49]
printCalendar  2            33              187   Print calendar according to the input of year and month   Ref. [50]
line           8            36              92    Check if two rectangles overlap                           Ref. [49]


Table 4 The parameter settings of three algorithms

Algorithm   Parameter                      Value
SA          Initial temperature T0         1.00
            Cooling coefficient ca         0.95
GA          Selection strategy             Gambling roulette
            Crossover probability pc       0.90
            Mutation probability pm        0.05
HS          Size of harmony memory Shm     2–20
            Harmony choosing rate rhmc     0.75
            Pitch adjustment rate rpa      0.5

5.2 Evaluation metrics

In this section, we analyze the overall effect of HS-based test data generation. The effect is usually reflected both by the quality of the generated test data set and by the speed of data generation. Accordingly, we define the following four evaluation metrics, on which the effectiveness and efficiency of the three search algorithms are compared.

1. Average coverage (AC), i.e., the average branch coverage achieved over the g = 1,000 runs. Suppose the final branch coverage rate of the kth run (1 ≤ k ≤ g) is rbc_k, with 0 ≤ rbc_k ≤ 1. Then

$$AC = \frac{\sum_{k=1}^{g} rbc_k}{g} \times 100\,\% \quad (11)$$

2. Success rate (SR), i.e., the probability that all branches are covered by the generated test data. Let bbc_k ∈ {0, 1} be the flag of successful all-branch coverage in the kth run: bbc_k = 1 if the kth run achieves all-branch coverage, and bbc_k = 0 otherwise. Then

$$SR = \frac{\sum_{k=1}^{g} bbc_k}{g} \times 100\,\% \quad (12)$$

3. Average (convergence) generation (AG), i.e., the average evolutionary generation at which stable branch coverage is reached. If an algorithm's branch coverage no longer changes after a specific generation in the kth run, that generation is called the convergence generation gbc_k. If a run has not reached stable branch coverage by generation maxGen, we set gbc_k to maxGen. Then

$$AG = \frac{\sum_{k=1}^{g} gbc_k}{g} \quad (13)$$

4. Average time (AT), i.e., the average execution time (in milliseconds) for reaching stable branch coverage. Let tbc_k (1 ≤ k ≤ g) be the execution time corresponding to gbc_k. Then

$$AT = \frac{\sum_{k=1}^{g} tbc_k}{g} \quad (14)$$
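The four metric definitions can be computed from per-trial records roughly as below. This is an illustrative Python sketch (not the paper's C++ harness), and the record format is our own assumption.

```python
def evaluate(trials, max_gen=100):
    """Compute AC, SR, AG and AT (formulas (11)-(14)) from per-trial records.

    Each trial is a dict with: 'coverage' (final branch coverage rate rbc_k in
    [0, 1]), 'gen' (convergence generation gbc_k, or None if the run never
    stabilized) and 'time_ms' (execution time tbc_k up to convergence).
    """
    g = len(trials)
    ac = sum(t["coverage"] for t in trials) / g * 100          # (11)
    sr = sum(t["coverage"] == 1.0 for t in trials) / g * 100   # (12)
    # A run that never stabilizes is charged the full budget max_gen.  (13)
    ag = sum(t["gen"] if t["gen"] is not None else max_gen for t in trials) / g
    at = sum(t["time_ms"] for t in trials) / g                 # (14)
    return ac, sr, ag, at

# Three hypothetical trials: two reach full coverage, one never stabilizes.
trials = [
    {"coverage": 1.0, "gen": 12, "time_ms": 0.4},
    {"coverage": 1.0, "gen": 20, "time_ms": 0.6},
    {"coverage": 0.9, "gen": None, "time_ms": 2.0},
]
ac, sr, ag, at = evaluate(trials)
print(ac, sr, ag, at)
```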

According to the above definitions, higher values of AC (average coverage) and SR (success rate) indicate a better algorithm, whereas for the other two metrics (AG and AT) the opposite holds. In the following contrast analysis, we first compare SA, GA and HS comprehensively on the four metrics. Then, the trend of each metric versus generation or population size is discussed in detail. In addition, a statistical analysis of the coverage rates and convergence generations over the g = 1,000 runs is presented.

5.3 Overview on experimental results

The preceding sections discussed how to apply the harmony search algorithm to the test data generation problem. The most important remaining issue, however, is the effectiveness of our HS-based method, so we first answer the following research question from the perspective of comprehensive evaluation.

RQ1: Is the HS-based test data generation method more effective than the two classic search algorithms?

The experimental results of the three algorithms are shown in Tables 5 and 6. For metric AC, the results of HS are better than those of SA and GA for all eight programs. Apart from the last program, line, HS attains 100 % branch coverage, and even for line the AC value of HS is very close to 100 %. On average, HS outperforms SA by about 3.85 percentage points on metric AC and GA by about 4.00 points. On the other hand, SA and GA show no significant difference on AC: GA's results are higher than SA's for programs isValidDate, printCalendar and line.


Table 5 Comparative analysis on metric AC and SR

Program        Average coverage (%)      Success rate (%)
               SA      GA      HS        SA      GA      HS
triangleType   99.88   95.00   100       99.40   76.40   100
calDay         99.97   96.31   100       99.60   65.00   100
isValidDate    98.21   99.95   100       96.30   99.40   100
remainder      99.85   94.07   100       98.60   82.50   100
computeTax     94.44   91.51   100       88.50   62.80   100
bessj          99.45   98.61   100       97.40   94.90   100
printCalendar  94.31   95.06   100       20.10   61.60   100
line           82.86   97.43   99.76     62.30   86.90   98.70

Table 6 Comparative analysis on metric AG and AT

Program        Average generation        Average time (ms)
               SA      GA      HS        SA      GA      HS
triangleType   42.17   13.79   6.79      3.77    10.83   0.34
calDay         28.29   35.80   10.41     1.79    35.73   0.44
isValidDate    15.17   21.69   11.35     2.43    11.68   1.04
remainder      13.66   16.31   9.46      1.01    6.09    0.52
computeTax     19.44   43.00   17.54     1.14    18.28   0.53
bessj          21.13   24.85   13.60     6.10    8.89    2.65
printCalendar  53.60   42.03   15.57     35.38   35.48   5.11
line           21.95   29.62   27.05     11.00   47.65   5.99

For the remaining five programs, SA performs better than GA on metric AC. Regarding the success rate (SR), its values are highly consistent with those of AC: HS achieves full branch coverage with nearly 100 % probability, whereas the other two algorithms rarely realize full branch coverage, and SA's results are similar to GA's in most cases. One case in Table 5 deserves particular attention: although SA's AC reaches 94.31 % for program printCalendar, its SR is only 20.10 %. This phenomenon shows that SA is not very stable at realizing full branch coverage in some situations.

For the average generation (AG) metric in Table 6, HS outperforms the other two algorithms for most programs; only for program line is HS's AG higher than SA's, though it is still lower than GA's. Averaged over the eight programs, HS's AG is 14, whereas the corresponding values of SA and GA are 26.9 and 28.4, respectively. That is, the AG of HS is about half that of SA or GA. Based on this analysis, we can argue that HS achieves full


branch coverage with the fastest speed, while the convergence speeds of SA and GA are statistically comparable.

We further consider the execution time of each algorithm to generate test data. As shown in Table 6, the average execution time of the HS-based method is 2.08 ms, while the AT metrics of SA and GA are 7.83 ms and 21.83 ms, respectively. Thus, we can order the three algorithms as HS > SA > GA, where '>' denotes the "faster" relation. On the whole, the AT metrics of HS and SA are of the same order of magnitude, but GA's AT is about one order of magnitude higher, which means each iteration of HS or SA is much shorter than an iteration of GA during the evolution process. Based on the above analysis of the four evaluation metrics, we can answer the first research question: HS is more effective than SA and GA.

The experimental results also reveal a pattern: the change trend of SR is consistent with that of AC, and similarly for AT and AG. Therefore, for simplicity, we mainly address the details of metrics AC and AG in the rest of the paper and omit the discussion of SR and AT. Table 6 only provides an overall comparison of average convergence generation, so the following research question, from the viewpoint of the evolution process, should also be confirmed.

RQ2: Is HS's growth rate of branch coverage higher and steadier than those of the other two algorithms?

To investigate this question, we analyze the growth trend of branch coverage with respect to evolution generation for each search algorithm. The average branch coverage at generation t (0 ≤ t ≤ maxGen) is defined as

$$AC(t) = \frac{\sum_{k=1}^{g} rbc_k(t)}{g} \times 100\,\% \quad (15)$$

where rbc_k(t) is the branch coverage rate at generation t in the kth run. From AC(t), we can plot the growth trend of average coverage. The results of the three algorithms for the eight subject programs are exhibited in Fig. 3. The trend can be analyzed in three stages: (1) Within the first five generations, the AC growth of HS is not as good as SA's but better than GA's for most programs, though for a few programs (e.g., line) GA is the most prominent at enlarging coverage in the early stage of evolution. (2) From generation 5 to 40, HS's growth speed is the fastest of the three algorithms, and HS reaches 100 % branch coverage for most programs. In most cases SA's growth is faster than GA's, the exception being program triangleType. For the more complex

Fig. 3 Average coverage (AC) versus evolution generation (panels a–h: AC(t) curves of HS, GA and SA over 100 generations for the eight subject programs)
programs printCalendar and line, the performances of SA and GA are nearly identical. (3) After generation 40, HS keeps a steady coverage rate very close to 100 %, whereas the other two algorithms (i.e., SA and GA) reach a steady state only at generation 60 or later.

Two further phenomena should be noted. (1) As generation t increases, HS's AC(t) is very stable, while the stability of GA's coverage is unsatisfactory in some cases: as Fig. 3b, d show, GA's AC(t) fluctuates frequently, and SA is better than GA in this respect. (2) When the evolution generation exceeds 60, degradation of average coverage appears in SA for some programs such as isValidDate, computeTax and line; in contrast, GA and HS do not exhibit this problem in our experiments. According to the above experimental analysis, we can argue that the coverage growth trend of HS is sharper than those of the other two algorithms, and its growth process is also much more stable.
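The per-generation curve AC(t) of formula (15) can be sketched as follows; this is an illustrative Python sketch with an assumed data layout, in which a run that converges early is padded with its final stable coverage.

```python
def coverage_curve(runs, max_gen=100):
    """Formula (15): average branch coverage AC(t) over g runs at each generation t.

    runs[k] is the list of per-generation coverage rates rbc_k(t) for run k;
    a run that converges before max_gen is padded with its last (stable) value.
    """
    g = len(runs)
    curve = []
    for t in range(max_gen + 1):
        level = sum(r[t] if t < len(r) else r[-1] for r in runs) / g
        curve.append(level * 100)
    return curve

# Two toy runs: one reaches full coverage at generation 2, one plateaus at 0.8.
runs = [[0.5, 0.9, 1.0], [0.4, 0.8]]
acc = coverage_curve(runs, max_gen=4)
print([round(x, 2) for x in acc])
```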


5.4 The impact of population size

In general, the population size has an important influence on the effect of a search algorithm, so we investigate its impact on test data generation through the following research question.

RQ3: Is the HS algorithm more stable than the other two algorithms under changes of population size?

To answer this question, we repeat the experiments for all subject programs with different population sizes: for the first six programs the population size varies from 10 to 60, and for programs printCalendar and line from 30 to 80. All other parameters of the three algorithms remain unchanged in this analysis. As mentioned above, the success rates (SRs) of the eight programs

Fig. 4 Average coverage (AC) versus population size (panels a–h: AC of HS, GA and SA as the population size varies for the eight subject programs)

are very consistent with the average coverage (AC) metric, and the same holds for AT and AG. Therefore, we mainly consider the trends of AC and AG as the population size changes; the experimental results for these two metrics are shown in Figs. 4 and 5, respectively. For programs triangleType and calDay (see Fig. 4a, b), the AC of HS is very stable, staying at about 1.0 regardless of the population size. The AC of SA grows relatively slowly with population size, achieving 100 % coverage only once the population size reaches 30, while GA reaches full branch coverage only when the size is quite large. For programs isValidDate, computeTax and bessj (see Fig. 4c, e, f), the AC growth trends of the three algorithms versus population size are basically the same.

Fig. 5 Average (convergence) generation (AG) versus population size (panels a–h: AG of HS, GA and SA as the population size varies for the eight subject programs)

However, HS keeps steady with a smaller size (i.e., 25), while the other two algorithms attain full coverage only when the population size exceeds 40. For the relatively complex programs printCalendar and line (see Fig. 4g, h), HS keeps steady once the size reaches about 45; meanwhile, GA performs better than SA, but both reach full coverage only when the size exceeds 80. There is an exception for program remainder in Fig. 4d: HS's AC is lower than SA's when the size is below 20, but once the size exceeds this critical point, HS outperforms the other two algorithms. Based on the above analysis, we can say that the AC value of HS is nearly always higher than those of the other two algorithms, and that HS is more stable than SA and GA with respect to changes of population size.

On the other hand, we investigate the change trend of average generation versus population size (see Fig. 5). (1) For programs triangleType and printCalendar, HS's AG is always lower than those of the other two algorithms, and SA's value is usually higher than GA's. (2) For program calDay, HS's AG is still the lowest throughout the increase of population size; the difference is that GA's AG is slightly higher than SA's here. (3) For program line, HS's AG is very similar to GA's; when the population size is below 55, their values are higher than SA's, and otherwise the situation is the opposite. (4) For the remaining four programs, HS's AG lies between SA's and GA's when the population size is below 25; once the size exceeds this boundary, HS's AG becomes the lowest of the three. In summary, HS keeps the lowest AG value in most situations regardless of population size. Concerning the stability of HS's AG, we observe the following: (1) for the relatively simple programs, AG stays very stable once the population size surpasses 30.
(2) For the more complex programs, the critical point for reaching a stable AG is about 50. It is not difficult to see that HS can ensure a relatively fixed convergence generation if an appropriate population size is chosen according to the program's complexity. Based on the above analysis, we conclude that HS achieves the highest AC and the lowest AG in most cases, and that the corresponding values become very stable once the population size reaches a specific critical point.

5.5 Algorithm stability analysis on trials

In general, a good algorithm produces stable results across different trials. To validate HS in this respect, we collected the experimental results of 1,000 trials for comparative analysis and examine the following research question.

RQ4: Are the experimental results of the HS algorithm across different trials much more stable than those of the other two algorithms?

For the AC metric, the comparative results are listed in Table 7. Clearly, HS realizes 100 % branch coverage in every trial except for program line, and even for this counterexample the failed trials are very few (13 out of 1,000), far fewer than those of the other two algorithms. For the other two algorithms, the failed times of SA are generally lower than those of GA, though anomalous cases exist, such as programs isValidDate, printCalendar and line. Considering the worst AC record over all 1,000 trials, HS's worst AC is always at least as high as those of the other two algorithms, while SA and GA show no obvious difference: SA is better than GA for programs calDay, remainder, bessj and printCalendar, and otherwise GA is equal to or better than SA. According to the results in Table 7, we can argue that HS shows obvious stability in realizing full branch coverage across trials. SA and GA are both worse than HS; they show no significant difference in their worst AC records for each program, but SA is slightly better on the failed-times metric.

Table 7 Steady analysis on average coverage (trials g = 1,000)

Program        Failed times of full branch covering    The worst coverage (%)
               SA      GA      HS                      SA      GA      HS
triangleType   6       236     0                       60.00   60.00   100
calDay         4       350     0                       92.31   76.92   100
isValidDate    37      6       0                       50.00   93.75   100
remainder      14      175     0                       90.91   36.36   100
computeTax     115     372     0                       36.36   63.64   100
bessj          26      51      0                       78.95   42.11   100
printCalendar  799     384     0                       84.85   69.70   100
line           377     131     13                      43.75   68.75   68.75

Fig. 6 Convergence generation distribution in 1,000 trials of each algorithm (panels a–h: distributions of convergence generation for SA, GA and HS on the eight subject programs)

On the other hand, the AG distributions of the three algorithms for each subject program are illustrated in Fig. 6. According to the statistical results, the AG distribution range of HS is the shortest of the three algorithms, while GA's range is generally larger than SA's. The median of HS's AG over 1,000 trials is always the lowest for all programs. Accordingly, the median of SA is higher than

GA’s for program triangleType and printCalendar, and the opposite situation for program calDay and computeTax. Moreover, the medians of algorithm SA and GA are roughly identical for the rest four programs. The first quartile of HS’s AG is very steady and is basically within 5 generations. Meanwhile, the difference of the first quartile records of three algorithms is relatively small, just has obvious disparity for program triangleType, computeTax and printCalendar. On the contrary, the third quartile values of three algorithms have obvious difference. For nearly all subject programs, the third quartile of HS’s AG is always \20. For two relatively complex programs printCalendar and line, the third quartile of GA’s AG is lower than that of SA. Otherwise, GA’s value is higher than that of SA for the rest six programs. Thus, we can give response to RQ4 via the following observed laws: HS demonstrates the strongest stability for different trials in all three algorithms. Meanwhile, SA’s stability is better than GA’s for the relatively simple programs, and GA is better than SA for the complex programs. 5.6 Discussions At present, although we have performed comparative analysis on three search algorithms for eight subject programs, there are still some issues that need to be explained here and explored in ongoing work. Apart from the comprehensive evaluation of three search algorithms for test data generation problem, the impact analysis of algorithm parameters is also performed in our work. However, we mainly conducted the study on the impact from different values of common parameters (e.g., population size and stability) w.r.t. three algorithms. For the special parameter settings of a given algorithm, we choose the parameter values according to the following two facts: (a) We set the parameter around the value that is recommended in the public literatures about the given algorithm, such as pc = 0.9 and pm = 0.05 for GA algorithm. 
(b) More importantly, a relatively reasonable value was determined by considering the overall effect on all eight programs under test (PUTs). From the perspective of overall contrast, this style of selecting parameter values is feasible and rational when studying test data generation algorithms on a certain number of subject programs. On the other hand, a stricter experimental analysis should be conducted to observe the impact of different parameter values of each search algorithm on each PUT. Since the current parameter settings were determined by roughly varying the parameter values of each search algorithm over all subject programs, a more detailed experimental analysis should in principle draw similar conclusions, though a few counterexamples may appear. That is an important issue for ongoing research.

Here, let us reconsider the problem raised by two special structures, arrays and pointers. Our method can handle an array whether it appears in the program body or in the program interface. During test data generation, only the branch traversal information needs to be collected for further optimization; as a result, we are mainly concerned with branch predicates rather than data storage structures (e.g., arrays and pointers) in the code instrumentation phase. That is to say, the internal code constructs of a program have little to do with the search algorithm, so our method can successfully treat special structures located in the program body. When an array variable appears in the program interface, we can automatically treat it as several independent variables of the basic type. Unfortunately, it is hard to handle pointer variables in the program interface; that is an important direction for our next-step research.

It should also be pointed out that, although the benchmark programs in our experiments are widely used in studies of the test data generation problem [36, 45, 48, 49], we have to admit that such programs from academia are not as large as industrial programs. Due to the difficulty of automated static analysis of some special constructs in large-scale programs at our current research stage, we will overcome this technical limitation and validate our approach on industrial programs in ongoing research. At present, we mainly focus on comparative analysis under basic coverage criteria; further, other metrics, such as the number of revealed faults (i.e., mutation analysis), could also be used to evaluate test data generation methods.

6 Concluding remarks

Software quality has attracted much attention in both academia and industry, and software testing has long been validated as an effective way to improve it. Among the research topics in software testing, automated test data generation is viewed as one of the most challenging problems, and in recent years quite a few studies have confirmed that search-based test data generation is a rational way to address it. So-called search-based software testing is the use of a meta-heuristic optimizing search technique to automate or partially automate the test data generation task [13]. In this paper, our main contribution is to adapt a newly emerging meta-heuristic search algorithm, harmony search, to generate test data satisfying branch coverage. We first propose the overall framework for applying HS to the test data generation problem.


Then, the algorithm details are supplied. In search-based test data generation, the fitness function plays a critical role; here we adopt branch distance to construct the fitness function, while branch weights are determined automatically by static analysis of the program. To verify the effectiveness of our method, eight well-known programs are used for experimental evaluation. According to the experimental results, the test data produced by HS achieve much higher coverage and shorter convergence generations than the two classic search algorithms (i.e., SA and GA). In addition, HS shows stronger stability than SA and GA when the population size varies or experiments are repeated.

Research on automated test data generation remains an important topic in software testing. Although we provide a new approach to this problem, many open issues remain in this field. In ongoing research, we will attempt to construct more effective fitness functions for other kinds of coverage such as path coverage. Moreover, applying the HS algorithm to generate test inputs for revealing security problems, such as memory leaks and exception-handling faults, is also an interesting task.

Acknowledgments The author is grateful to the anonymous reviewers for providing detailed, thoughtful and constructive advice for improving the quality of the paper, and thanks Xinxin Yu for help with the experimental analysis. The author would also like to thank Prof. Francisco Chicano and Dr. Javier Ferrer at the University of Málaga in Spain for providing some of the benchmark programs used in the experiments. This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 60803046 and 61063013, the Natural Science Foundation of Jiangxi Province under Grant No. 2010GZS0044, the Science Foundation of Jiangxi Educational Committee under Grant No. GJJ10433, the Open Foundation of the State Key Laboratory of Software Engineering under Grant No. SKLSE2010-08-23, and the Program for Outstanding Young Academic Talent at Jiangxi University of Finance and Economics.
