On Test Case Distributions of Adaptive Random Testing∗

Tsong Yueh Chen, Fei-Ching Kuo†, Huai Liu
Faculty of ICT, Swinburne University of Technology, Australia
E-mail: {tchen, dkuo, hliu}@ict.swin.edu.au

Abstract
Adaptive Random Testing (ART) has recently been proposed as an approach to enhancing the fault-detection effectiveness of Random Testing (RT). The basic principle of ART is to enforce randomly selected test cases to be as evenly spread over the input domain as possible. Many ART methods have been proposed to evenly spread test cases in different ways, but no comparison has been made among these methods in terms of their test case distributions. In this paper, we conduct a comprehensive investigation into the test case distributions of various ART methods. Our work reveals many interesting aspects of the relationship between ART's performance and its test case distribution. Furthermore, it points out a new research direction for enhancing ART.

1. Introduction

Random Testing (RT), a fundamental software testing technique, simply generates test cases in a random manner from the set of all possible inputs, namely the input domain [10, 14]. RT has been successfully applied in industry to detect software failures [15, 16, 17]. It has been observed that for most programs, the failure-causing inputs (program inputs that can reveal failures) are clustered together [1, 2, 9]. Chen et al. [8] studied how to improve the fault-detection effectiveness of RT under such a situation. They proposed a novel approach, namely Adaptive Random Testing (ART), where test cases are not only randomly selected from the input domain, but also enforced to be as evenly spread over the input domain as possible. Since then, many ART methods have been proposed, such as Fixed-Sized-Candidate-Set ART (FSCS-ART) [8], Restricted RT (RRT) [4], ART through dynamic partitioning [5] and Lattice-based ART [12]. Different ART methods distribute their test cases in different ways. All previous studies on ART methods [4, 5, 8, 12] focused on the performance improvement that ART achieves over pure RT. No previous work has been conducted to compare ART methods with respect to their test case distributions, let alone to study the relationship between test case distributions and the performance of these methods. In this paper, we study four ART methods: FSCS-ART [8], RRT [4], and two versions of ART through dynamic partitioning, namely "by bisection" (BPRT) and "by random partitioning" (RPRT) [5]. We measure their fault-detection effectiveness, and examine their test case distributions using various metrics.

∗ This research project is supported by an Australian Research Council Discovery Grant (DP0557246).
† Corresponding author

2. The effectiveness of various ART methods

For ease of discussion, we introduce notations and concepts commonly used in this paper as follows.

• D denotes the input domain of N dimensions.
• dD denotes d dimensions, where d = 1, 2, ..., N.
• E denotes the set of already executed test cases.
• |D| and |E| denote the sizes of D and E, respectively.
• θ denotes the failure rate, that is, the ratio of the number of failure-causing inputs to the number of all possible inputs.

There are different notions of implementing ART, and different notions give rise to various ART methods. For the detailed algorithms of the ART methods, refer to [11]. The performance of ART methods is usually evaluated by the F-measure, the expected number of test cases required to detect the first failure. A testing method is considered more effective if it has a smaller F-measure.

As shown in [7], ART performs best when failure-causing inputs are well clustered into one single compact block (results are given in Experiment 1 of [7]). We followed the same experimental setting to study how various ART methods perform. It was expected that the data collected in this section could help us better understand the relationship between ART performance and even spreading of test cases.

The experimental results are summarized in Figure 1, where the x-axis denotes θ, and the y-axis denotes the ART F-ratio, defined as the ratio of the F-measure of ART (denoted by FART) to that of RT (denoted by FRT). The F-ratio measures the improvement of ART over RT. From these results, the following observations can be made.

• Almost all the experimental data show that these ART methods have larger F-measures in higher dimensional spaces for the same θ.
[Figure 1 plots omitted: F-ratio (FART/FRT, roughly 0.50 to 2.00) versus θ (from 1.00E+00 down to 1.00E-05) for FSCS-ART, RRT, BPRT and RPRT, in panels (a) 1D, (b) 2D, (c) 3D and (d) 4D.]

Figure 1. The effectiveness of various ART methods in different dimensions

• FSCS-ART and RRT have very similar performance trends. When θ is large, both methods perform worse than RT, and their F-ratios increase as θ decreases. When θ drops to a certain value v, their F-ratios start to decrease as θ decreases.
• BPRT and RPRT have the opposite performance trends to FSCS-ART and RRT. For large θ, the F-ratios of BPRT and RPRT are smaller than 1, and decrease as θ decreases. After θ drops to a specific value v, the F-ratios of BPRT and RPRT stop decreasing.
• There is a specific failure rate v such that RRT has a larger FART than FSCS-ART when θ > v, and a smaller FART than FSCS-ART when θ ≤ v.
• There is a specific failure rate v such that BPRT has a larger FART than RPRT when θ > v, and a smaller FART than RPRT when θ ≤ v.

Although all the above methods aim at evenly spreading test cases, this study shows that their fault-detection effectiveness varies. The differences are mainly due to their ways of distributing test cases. In this paper, the distribution of test cases generated by these ART methods will be measured.

3. Measurement of test case distribution

For ease of discussion, the following notations are introduced. Suppose p′ and p′′ are two elements of E. dist(p′, p′′) denotes the distance between p′ and p′′; and ñ(p, E\{p}) denotes the nearest neighbour of p in E\{p}. Without loss of generality, the range of values for each dimension of D is set to [0, 1), or simply D = [0, 1)^N.

In this study, three metrics were used to measure the test case distributions (the distribution of E in D). The following outlines the definitions of these three metrics.

• Discrepancy

  MDiscrepancy = max_{i=1,...,m} | |Ei|/|E| − |Di|/|D| |   (1)

  where D1, D2, ..., Dm denote m randomly defined subsets of D, with their corresponding sets of test cases being denoted by E1, E2, ..., Em, which are subsets of E. Note that m is set to 1000 in this paper. Intuitively, MDiscrepancy indicates whether regions have an equal density of points. E is considered reasonably equidistributed if MDiscrepancy is close to 0.

• Dispersion

  MDispersion = max_{i=1,...,|E|} dist(ei, ñ(ei, E\{ei}))   (2)

  where ei ∈ E. Intuitively, MDispersion indicates whether any point in E is surrounded by a very large empty spherical region. A small MDispersion indicates that E is reasonably equidistributed.

• The ratio of the number of test cases in the edge region of the input domain (Eedge) to the number of test cases in the central region of the input domain (Ecentre)

  MEdge:Centre = |Eedge| / |Ecentre|   (3)

  where Eedge and Ecentre denote two disjoint subsets of E located in Dedge and Dcentre, respectively, and E = Eedge ∪ Ecentre. Note that |Dcentre| = |D|/2 and Dedge = D\Dcentre. Therefore, Dcentre = [ 1/2 − (|D|/2^(N+1))^(1/N), 1/2 + (|D|/2^(N+1))^(1/N) )^N. Clearly, in order for MDiscrepancy to be small, MEdge:Centre should be close to 1; otherwise, different parts of D have different densities of points.

Discrepancy and dispersion are two commonly used metrics for measuring sample point equidistribution. More details of these two metrics can be found in [3]. The above three metrics were used to measure the test case distribution of a testing method from various perspectives. The space where a method generated points (test cases) was set to 1D, 2D, 3D or 4D. |E| was set to range from 100 to 10000. A sufficient amount of data was collected in order to obtain a reliable mean value of each metric within a 95% confidence level and a ±5% accuracy range. It is interesting to find out how the test cases of pure RT are distributed with respect to these metrics. As in all previous studies of ART, it was assumed that RT has a uniform distribution of test cases, which means that all test cases have an equal chance of being selected. Note that "uniform distribution" does not imply even spreading of test cases.
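To make the three metrics concrete, the following sketch (our own illustrative code, not the experimental implementation used in the paper; the function names, the candidate-set size k = 10, and the random sub-box construction are assumptions) generates points with FSCS-ART and computes the metrics over D = [0, 1)^N:

```python
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def fscs_art(n, dim, k=10, rng=random):
    """Generate n test cases in [0, 1)^dim with FSCS-ART: each new test
    case is the one of k random candidates whose nearest already
    executed test case is farthest away."""
    E = [tuple(rng.random() for _ in range(dim))]
    while len(E) < n:
        cands = [tuple(rng.random() for _ in range(dim)) for _ in range(k)]
        E.append(max(cands, key=lambda c: min(dist(c, e) for e in E)))
    return E

def discrepancy(E, dim, m=1000, rng=random):
    """Metric (1): over m random axis-aligned sub-boxes D_i, the largest
    gap between the fraction of points falling in D_i and the fraction
    of the domain volume that D_i occupies."""
    worst = 0.0
    for _ in range(m):
        lo = [rng.random() for _ in range(dim)]
        hi = [l + rng.random() * (1 - l) for l in lo]
        vol = 1.0
        for l, h in zip(lo, hi):
            vol *= h - l
        inside = sum(all(l <= x < h for x, l, h in zip(p, lo, hi)) for p in E)
        worst = max(worst, abs(inside / len(E) - vol))
    return worst

def dispersion(E):
    """Metric (2): the largest nearest-neighbour distance within E."""
    return max(min(dist(p, q) for j, q in enumerate(E) if j != i)
               for i, p in enumerate(E))

def edge_centre_ratio(E, dim):
    """Metric (3): |E_edge| / |E_centre|, where D_centre is the centred
    hypercube containing half the volume of the unit domain, so its
    side length s satisfies s**dim == 1/2."""
    s = 2 ** (-1 / dim)
    lo, hi = 0.5 - s / 2, 0.5 + s / 2
    centre = sum(all(lo <= x < hi for x in p) for p in E)
    return (len(E) - centre) / centre
```

For instance, comparing edge_centre_ratio(fscs_art(1000, 2), 2) with the same ratio for purely random points illustrates the edge bias of FSCS-ART discussed in the next section.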
4. Analysis of test case distributions

The ranges of MEdge:Centre for all testing methods are summarized in Figure 2, with the following observations.

• When N = 1, MEdge:Centre for all ART methods under study is close to 1.
• When N > 1, FSCS-ART and RRT tend to generate more test cases in Dedge than in Dcentre (or simply, FSCS-ART and RRT have an edge bias). Moreover, the edge bias is stronger for RRT than for FSCS-ART.
• When N > 1, RPRT allocates more test cases in Dcentre than in Dedge (or simply, RPRT has a centre bias). However, the centre bias of RPRT is much less significant than the edge bias of FSCS-ART or RRT.
• The edge bias (of FSCS-ART and RRT) and the centre bias (of RPRT) increase as N increases.
• RT and BPRT have neither edge bias nor centre bias.

Method      Metric    1D      2D      3D      4D
RT          Max       1.0315  1.0341  1.0208  1.0202
            Min       1.0000  0.9998  1.0005  0.9998
            Max−Min   0.0315  0.0343  0.0203  0.0204
FSCS-ART    Max       1.0103  1.1442  1.6105  2.0733
            Min       0.9997  1.0132  1.0863  1.2409
            Max−Min   0.0106  0.1310  0.5241  0.8323
RRT         Max       1.0114  1.2103  2.1943  4.7346
            Min       0.9995  1.0201  1.1358  1.3444
            Max−Min   0.0119  0.1901  1.0585  3.3902
BPRT        Max       1.0085  1.0456  1.0482  1.0250
            Min       0.9991  0.9975  0.9963  0.9979
            Max−Min   0.0094  0.0480  0.0519  0.0271
RPRT        Max       1.0001  0.9995  0.9923  0.9780
            Min       0.9812  0.9522  0.9236  0.9012
            Max−Min   0.0190  0.0473  0.0688  0.0768

Figure 2. Range of MEdge:Centre for each testing method and dimension, where |E| ≤ 10000

The ranges of MDiscrepancy for all testing methods are summarized in Figure 3, with the following observations.
• The impact of N on the MDiscrepancy of RT, FSCS-ART and RRT is different. As N increases, MDiscrepancy decreases for RT, but increases for FSCS-ART and RRT.
• The MDiscrepancy of RPRT and BPRT is largely independent of the dimensions under study.
• In general, BPRT has the smallest MDiscrepancy for all N.
• When N = 1, FSCS-ART, RRT and BPRT have almost identical values of MDiscrepancy.

Method      Metric    1D      2D      3D      4D
RT          Max       0.1056  0.1093  0.0925  0.0785
            Min       0.0105  0.0106  0.0092  0.0077
            Max−Min   0.0951  0.0987  0.0833  0.0709
FSCS-ART    Max       0.0437  0.0719  0.0876  0.0793
            Min       0.0040  0.0062  0.0145  0.0188
            Max−Min   0.0396  0.0657  0.0731  0.0605
RRT         Max       0.0401  0.0810  0.1279  0.1172
            Min       0.0035  0.0074  0.0227  0.0302
            Max−Min   0.0365  0.0736  0.1052  0.0871
BPRT        Max       0.0470  0.0674  0.0670  0.0609
            Min       0.0018  0.0019  0.0016  0.0013
            Max−Min   0.0452  0.0655  0.0654  0.0595
RPRT        Max       0.0610  0.0701  0.0696  0.0651
            Min       0.0056  0.0052  0.0054  0.0054
            Max−Min   0.0554  0.0649  0.0642  0.0597

Figure 3. Range of MDiscrepancy for each testing method and dimension, where |E| ≤ 10000

Clearly, measuring the density of points in two partitions (Dedge and Dcentre) captures only part of what MDiscrepancy measures (namely, the density of points in 1000 randomly defined partitions of D). Hence, a smaller |MEdge:Centre − 1| does not necessarily imply a smaller MDiscrepancy, but a smaller MDiscrepancy does imply a smaller |MEdge:Centre − 1|. This explains why the value of MEdge:Centre for RT is close to 1 for all N, while its MDiscrepancy is not the smallest.

The ranges of MDispersion for all testing methods are summarized in Figure 4, with the following observations.

• For N = 1, all ART methods have smaller MDispersion values than RT.
• For N > 1, RRT normally has the smallest MDispersion values, followed in ascending order by FSCS-ART, BPRT, RPRT and RT.

Method      Metric    1D      2D      3D      4D
RT          Max       0.0272  0.1488  0.2851  0.4138
            Min       0.0005  0.0189  0.0714  0.1461
            Max−Min   0.0267  0.1298  0.2137  0.2676
FSCS-ART    Max       0.0160  0.1281  0.2731  0.4139
            Min       0.0002  0.0141  0.0601  0.1299
            Max−Min   0.0158  0.1141  0.2129  0.2839
RRT         Max       0.0156  0.1253  0.2780  0.4350
            Min       0.0002  0.0132  0.0585  0.1274
            Max−Min   0.0154  0.1121  0.2195  0.3077
BPRT        Max       0.0182  0.1365  0.2797  0.4121
            Min       0.0002  0.0150  0.0649  0.1388
            Max−Min   0.0180  0.1215  0.2148  0.2733
RPRT        Max       0.0179  0.1335  0.2743  0.4020
            Min       0.0002  0.0161  0.0671  0.1414
            Max−Min   0.0177  0.1174  0.2071  0.2606

Figure 4. Range of MDispersion for each testing method and dimension, where |E| ≤ 10000

In Table 1, the testing methods are ranked according to their test case distribution metrics. For each metric, the method that best satisfies its definition is ranked 1, and the one that least satisfies it is ranked 5. When two methods satisfy the definition to more or less the same degree, they are given the same ranking.

            MEdge:Centre         MDiscrepancy         MDispersion
            (closer to 1         (closer to 0         (smaller is
            is better)           is better)           better)
Method      1D  2D  3D  4D       1D  2D  3D  4D       1D  2D  3D  4D
RRT         1   5§  5§  5§       2   4   5   5        1   1   1   1
FSCS-ART    1   4§  4§  4§       2   3   4   4        1   2   2   2
BPRT        1   1   1   1        1   1   1   1        1   3   3   3
RPRT        1   3†  3†  3†       4   2   2   2        1   4   4   4
RT          1   1   1   1        5   5   3   3        5   5   5   5

† MEdge:Centre is smaller than 1, i.e. the testing method has a centre bias.
§ MEdge:Centre is larger than 1, i.e. the testing method has an edge bias.

Table 1. Testing methods ranked according to test case distribution metrics

For the testing methods studied, in the 1D case it has been observed that RRT has the best performance, followed by FSCS-ART, BPRT, RPRT and then RT. However, in the 2D, 3D and 4D cases, the same performance ordering is observed only for small θ, and almost the reverse ordering for large θ. It has been shown in [11] that there exist some hidden factors (unrelated to how evenly a method
spreads test cases) which have a strong impact on the performance of ART. Only when θ is small enough will the performance of ART strongly depend on how evenly its test cases are spread. In order to fairly analyze the relationship between the test case distribution and the performance of ART methods, without being influenced by these external factors, the rest of the discussion is restricted to small failure rates.

Table 1 shows that, in terms of the MDispersion metric, RRT has the most even spreading of test cases, followed by FSCS-ART, BPRT and RPRT. In other words, the ranking according to the MDispersion metric is consistent with the ranking according to F-measures (data shown in Figure 1). It should be pointed out that even though MDiscrepancy and MDispersion are both commonly used metrics for measuring sample point equidistribution, in this study MDispersion appears to be the more appropriate of the two. Interestingly, among all the ART methods under study, the one with the largest MEdge:Centre (that is, RRT) has the smallest MDispersion, while the one with the smallest MEdge:Centre (that is, RPRT) has the largest MDispersion. As discussed before, MDispersion best reflects the ordering of testing methods with respect to their performance. We notice that pushing test cases away from one another (so that MEdge:Centre > 1) is not a bad approach to evenly spreading test cases, even though it may not be the best approach to achieving a truly even spread.
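The exclusion-based mechanism behind RRT, which realizes this "pushing away", can be sketched as follows. This is our own illustrative reading of RRT [4], not the authors' code; in particular, the parameter target_ratio (the nominal excluded volume as a multiple of the domain volume) and the spherical exclusion zones are assumptions of the sketch.

```python
import math
import random

def rrt(n, dim, target_ratio=1.5, rng=random):
    """Restricted Random Testing sketch: keep a spherical exclusion zone
    around every executed test case, sized so that the zones nominally
    cover target_ratio times the domain volume, and accept only random
    candidates that fall outside all zones. Assumes enough free space
    remains for candidate generation to terminate."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # volume of a dim-dimensional ball of radius 1
    unit_ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    E = [tuple(rng.random() for _ in range(dim))]
    while len(E) < n:
        # shrink the radius as |E| grows, keeping total nominal
        # excluded volume equal to target_ratio * |D| (|D| = 1 here)
        r = (target_ratio / (len(E) * unit_ball)) ** (1 / dim)
        while True:
            c = tuple(rng.random() for _ in range(dim))
            if all(dist(c, e) >= r for e in E):
                break
        E.append(c)
    return E
```

One intuition this sketch offers for the edge bias reported in Figure 2: an exclusion zone centred near the boundary partly falls outside the domain, so the region near the boundary is less heavily excluded than the interior, making boundary candidates more likely to be accepted.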
5. Conclusion

Previous studies [4, 5, 8] showed that the even spreading of test cases makes ART outperform RT. The concept of even spreading of test cases is simple but vague. In this paper, several metrics were used to measure the test case distributions of ART as well as RT, and the relevance and appropriateness of these metrics were also investigated. To the best of our knowledge, this is the first work analyzing the relationship between the test case distribution and the performance of an ART method. Recently, there has been some work on alleviating the edge bias of FSCS-ART [6, 13]. We shall continue this line of research, using the additional knowledge gained from this study to enhance the existing ART methods.
References

[1] P. E. Ammann and J. C. Knight. Data diversity: an approach to software fault tolerance. IEEE Transactions on Computers, 37(4):418–425, 1988.
[2] P. G. Bishop. The variation of software survival times for different operational input profiles. In Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, pages 98–107, 1993.
[3] M. S. Branicky, S. M. LaValle, K. Olson, and L. Yang. Quasi-randomized path planning. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation, pages 1481–1487, 2001.
[4] K. P. Chan, T. Y. Chen, and D. Towey. Restricted random testing: adaptive random testing by exclusion. Accepted to appear in International Journal of Software Engineering and Knowledge Engineering, 2006.
[5] T. Y. Chen, G. Eddy, R. G. Merkel, and P. K. Wong. Adaptive random testing through dynamic partitioning. In Proceedings of the 4th International Conference on Quality Software, pages 79–86, 2004.
[6] T. Y. Chen, F.-C. Kuo, and H. Liu. Enhancing adaptive random testing through partitioning by edge and centre. In Proceedings of the 18th Australian Software Engineering Conference, pages 265–273, 2007.
[7] T. Y. Chen, F.-C. Kuo, and Z. Q. Zhou. On favourable conditions for adaptive random testing. Accepted to appear in International Journal of Software Engineering and Knowledge Engineering.
[8] T. Y. Chen, H. Leung, and I. K. Mak. Adaptive random testing. In Proceedings of the 9th Asian Computing Science Conference, pages 320–329, 2004.
[9] G. B. Finelli. NASA software failure characterization experiments. Reliability Engineering and System Safety, 32(1–2):155–169, 1991.
[10] R. Hamlet. Random testing. In J. Marciniak, editor, Encyclopedia of Software Engineering. John Wiley & Sons, second edition, 2002.
[11] F.-C. Kuo. On adaptive random testing. PhD thesis, Faculty of Information and Communications Technologies, Swinburne University of Technology, 2006.
[12] J. Mayer. Lattice-based adaptive random testing. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pages 333–336, 2005.
[13] J. Mayer and C. Schneckenburger. Adaptive random testing with enlarged input domain. In Proceedings of the 6th International Conference on Quality Software, pages 251–258, 2006.
[14] G. J. Myers. The Art of Software Testing. Wiley, New York, second edition, 1979.
[15] K. Sen, D. Marinov, and G. Agha. CUTE: a concolic unit testing engine for C. In Proceedings of the 10th European Software Engineering Conference held jointly with the 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 263–272, 2005.
[16] D. Slutz. Massive stochastic testing of SQL. In Proceedings of the 24th International Conference on Very Large Databases, pages 618–622, 1998.
[17] T. Yoshikawa, K. Shimura, and T. Ozawa. Random program generator for Java JIT compiler test system. In Proceedings of the 3rd International Conference on Quality Software, pages 20–24, 2003.