On Test Case Distributions of Adaptive Random Testing∗

Tsong Yueh Chen, Fei-Ching Kuo†, Huai Liu
Faculty of ICT, Swinburne University of Technology, Australia
E-mail: {tchen, dkuo, hliu}@ict.swin.edu.au

Abstract
Adaptive Random Testing (ART) has recently been proposed as an approach to enhancing the fault-detection effectiveness of Random Testing (RT). The basic principle of ART is to enforce randomly selected test cases to be as evenly spread over the input domain as possible. Many ART methods have been proposed to evenly spread test cases in different ways, but no comparison has been made among these methods in terms of their test case distributions. In this paper, we conduct a comprehensive investigation into the test case distributions of various ART methods. Our work reveals many interesting aspects of the relationship between ART's performance and its test case distribution. Furthermore, it points out a new research direction for enhancing ART.

1. Introduction

Random Testing (RT), a fundamental software testing technique, simply generates test cases in a random manner from the set of all possible inputs, namely the input domain [10, 14]. RT has been successfully applied in industry to detect software failures [15, 16, 17]. It has been observed that for most programs, the failure-causing inputs (program inputs that can reveal failures) are clustered together [1, 2, 9]. Chen et al. [8] studied how to improve the fault-detection effectiveness of RT under such a situation. They proposed a novel approach, namely Adaptive Random Testing (ART), where test cases are not only randomly selected from the input domain, but also enforced to be as evenly spread over the input domain as possible. Since then, many ART methods have been proposed, such as Fixed-Sized-Candidate-Set ART (FSCS-ART) [8], Restricted RT (RRT) [4], ART through dynamic partitioning [5] and Lattice-based ART [12]. Different ART methods distribute their test cases in different ways. All previous studies on ART methods [4, 5, 8, 12] focused on the performance improvement that ART achieves over pure RT. No previous work has been conducted to compare ART methods with respect to their test case distributions, let alone to study the relationship between test case distributions and the performance of these methods. In this paper, we study four ART methods: FSCS-ART [8], RRT [4], and two versions of ART through dynamic partitioning, namely "by bisection" (BPRT) and "by random partitioning" (RPRT) [5]. We measure their fault-detection effectiveness, and examine their test case distributions using various metrics.

∗ This research project is supported by an Australian Research Council Discovery Grant (DP0557246).
† Corresponding author

2. The effectiveness of various ART methods

For ease of discussion, we introduce notations and concepts commonly used in this paper as follows.

• D denotes the input domain of N dimensions.
• dD denotes d dimensions, where d = 1, 2, ..., N.
• E denotes the set of already executed test cases.
• |D| and |E| denote the sizes of D and E, respectively.
• θ denotes the failure rate, that is, the ratio of the number of failure-causing inputs to the number of all possible inputs.

There are different notions of implementing ART, and different notions give rise to various ART methods. For the detailed algorithms of the ART methods, refer to [11]. The performance of ART methods is usually evaluated by the F-measure, the expected number of test cases required to detect the first failure. A testing method is considered more effective if it has a smaller F-measure.

As shown in [7], ART performs best when failure-causing inputs are well clustered into one single compact block (results are given in Experiment 1 of [7]). We followed the same experimental setting to study how various ART methods perform. It was expected that the data collected in this section could help us better understand the relationship between ART performance and even spreading of test cases.

The experimental results are summarized in Figure 1, where the x-axis denotes θ, and the y-axis denotes the ART F-ratio, defined as the ratio of the F-measure of ART (denoted by FART) to that of RT (denoted by FRT). The F-ratio measures the improvement of ART over RT. From these results, the following observations can be made.

• Almost all the experimental data show that these ART methods have larger F-measures in higher dimensional spaces for the same θ.
[Figure 1 plots omitted: F-ratio (FART/FRT, roughly 0.50 to 2.00) versus θ (from 1.00E+00 down to 1.00E-05) for FSCS-ART, RRT, BPRT and RPRT, in panels (a) 1D, (b) 2D, (c) 3D and (d) 4D.]

Figure 1. The effectiveness of various ART methods in different dimensions

• FSCS-ART and RRT have very similar performance trends. When θ is large, both methods perform worse than RT, and their F-ratios increase as θ decreases. When θ drops to a certain value v, their F-ratios start to decrease as θ decreases.
• BPRT and RPRT have the opposite performance trends to FSCS-ART and RRT. For large θ, the F-ratios of BPRT and RPRT are smaller than 1, and decrease as θ decreases. After θ drops to a specific value v, the F-ratios of BPRT and RPRT stop decreasing.
• There is a specific failure rate v such that RRT has a larger FART than FSCS-ART when θ > v, and a smaller FART than FSCS-ART when θ ≤ v.
• There is a specific failure rate v such that BPRT has a larger FART than RPRT when θ > v, and a smaller FART than RPRT when θ ≤ v.

Although all the above methods aim at evenly spreading test cases, this study shows that their fault-detection effectiveness varies. The differences are mainly due to their ways of distributing test cases. In this paper, the distribution of test cases generated by these ART methods will be measured.

3. Measurement of test case distribution

For ease of discussion, the following notations are introduced. Suppose p′ and p′′ are two elements of E. dist(p′, p′′) denotes the distance between p′ and p′′; and ñ(p, E\{p}) denotes the nearest neighbour of p in E\{p}. Without loss of generality, the range of values for each dimension of D is set to [0, 1), or simply D = [0, 1)^N.

In this study, three metrics were used to measure the test case distributions (the distribution of E in D). The following outlines the definitions of these three metrics.

• Discrepancy

  MDiscrepancy = max_{i=1,...,m} | |Ei|/|E| − |Di|/|D| |   (1)

  where D1, D2, ..., Dm denote m randomly defined subsets of D, with their corresponding sets of test cases being denoted by E1, E2, ..., Em, which are subsets of E. Note that m is set to 1000 in this paper. Intuitively, MDiscrepancy indicates whether regions have an equal density of points. E is considered reasonably equidistributed if MDiscrepancy is close to 0.

• Dispersion

  MDispersion = max_{i=1,...,|E|} dist(ei, ñ(ei, E\{ei}))   (2)

  where ei ∈ E. Intuitively, MDispersion indicates whether any point in E is surrounded by a very large empty spherical region. A small MDispersion indicates that E is reasonably equidistributed.

• The ratio of the number of test cases in the edge region of the input domain (Eedge) to the number of test cases in the central region of the input domain (Ecentre)

  MEdge:Centre = |Eedge| / |Ecentre|   (3)

  where Eedge and Ecentre denote two disjoint subsets of E located in Dedge and Dcentre, respectively, and E = Eedge ∪ Ecentre. Note that |Dcentre| = |D|/2 and Dedge = D\Dcentre. Therefore, Dcentre = [ 1/2 − (|D|/2^(N+1))^(1/N), 1/2 + (|D|/2^(N+1))^(1/N) )^N. Clearly, in order for MDiscrepancy to be small, MEdge:Centre should be close to 1; otherwise, different parts of D have different densities of points.

Discrepancy and dispersion are two commonly used metrics for measuring sample point equidistribution. More details of these two metrics can be found in [3]. The above three metrics were used to measure the test case distribution of a testing method from various perspectives. The space where a method generated points (test cases) was set to 1D, 2D, 3D or 4D. |E| was set to range from 100 to 10000. A sufficient amount of data was collected in order to obtain a reliable mean value of each metric within a 95% confidence level and a ±5% accuracy range. It is interesting to find out how the test cases of pure RT are distributed with respect to these metrics. As in all previous studies of ART, it was assumed that RT has a uniform distribution of test cases, which means that all test cases have an equal chance of being selected. Note that "uniform distribution" does not imply even spreading of test cases.
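To make the three metrics concrete, the following sketch (our own illustrative code, not the experimental implementation used in the paper; the function names, the candidate-set size k = 10, and the random sub-box construction are assumptions) generates points with FSCS-ART and computes the metrics over D = [0, 1)^N:

```python
import random

def dist(p, q):
    """Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def fscs_art(n, dim, k=10, rng=random):
    """Generate n test cases in [0, 1)^dim with FSCS-ART: each new test
    case is the one of k random candidates whose nearest already
    executed test case is farthest away."""
    E = [tuple(rng.random() for _ in range(dim))]
    while len(E) < n:
        cands = [tuple(rng.random() for _ in range(dim)) for _ in range(k)]
        E.append(max(cands, key=lambda c: min(dist(c, e) for e in E)))
    return E

def discrepancy(E, dim, m=1000, rng=random):
    """Metric (1): over m random axis-aligned sub-boxes D_i, the largest
    gap between the fraction of points falling in D_i and the fraction
    of the domain volume that D_i occupies."""
    worst = 0.0
    for _ in range(m):
        lo = [rng.random() for _ in range(dim)]
        hi = [l + rng.random() * (1 - l) for l in lo]
        vol = 1.0
        for l, h in zip(lo, hi):
            vol *= h - l
        inside = sum(all(l <= x < h for x, l, h in zip(p, lo, hi)) for p in E)
        worst = max(worst, abs(inside / len(E) - vol))
    return worst

def dispersion(E):
    """Metric (2): the largest nearest-neighbour distance within E."""
    return max(min(dist(p, q) for j, q in enumerate(E) if j != i)
               for i, p in enumerate(E))

def edge_centre_ratio(E, dim):
    """Metric (3): |E_edge| / |E_centre|, where D_centre is the centred
    hypercube containing half the volume of the unit domain, so its
    side length s satisfies s**dim == 1/2."""
    s = 2 ** (-1 / dim)
    lo, hi = 0.5 - s / 2, 0.5 + s / 2
    centre = sum(all(lo <= x < hi for x in p) for p in E)
    return (len(E) - centre) / centre
```

For instance, comparing edge_centre_ratio(fscs_art(1000, 2), 2) with the same ratio for purely random points illustrates the edge bias of FSCS-ART discussed in the next section.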
4. Analysis of test case distributions

The ranges of MEdge:Centre for all testing methods are summarized in Figure 2, with the following observations.

• When N = 1, MEdge:Centre for all ART methods under study is close to 1.
• When N > 1, FSCS-ART and RRT tend to generate more test cases in Dedge than in Dcentre (or simply, FSCS-ART and RRT have an edge bias). Moreover, the edge bias is stronger for RRT than for FSCS-ART.
• When N > 1, RPRT allocates more test cases in Dcentre than in Dedge (or simply, RPRT has a centre bias). However, the centre bias of RPRT is much less significant than the edge bias of FSCS-ART or RRT.
• The edge bias (of FSCS-ART and RRT) and the centre bias (of RPRT) increase as N increases.
• RT and BPRT have neither edge bias nor centre bias.

Method      Metric    1D      2D      3D      4D
RT          Max       1.0315  1.0341  1.0208  1.0202
            Min       1.0000  0.9998  1.0005  0.9998
            Max−Min   0.0315  0.0343  0.0203  0.0204
FSCS-ART    Max       1.0103  1.1442  1.6105  2.0733
            Min       0.9997  1.0132  1.0863  1.2409
            Max−Min   0.0106  0.1310  0.5241  0.8323
RRT         Max       1.0114  1.2103  2.1943  4.7346
            Min       0.9995  1.0201  1.1358  1.3444
            Max−Min   0.0119  0.1901  1.0585  3.3902
BPRT        Max       1.0085  1.0456  1.0482  1.0250
            Min       0.9991  0.9975  0.9963  0.9979
            Max−Min   0.0094  0.0480  0.0519  0.0271
RPRT        Max       1.0001  0.9995  0.9923  0.9780
            Min       0.9812  0.9522  0.9236  0.9012
            Max−Min   0.0190  0.0473  0.0688  0.0768

Figure 2. Range of MEdge:Centre for each testing method and dimension, where |E| ≤ 10000

The ranges of MDiscrepancy for all testing methods are summarized in Figure 3, with the following observations.
• The impact of N on the MDiscrepancy of RT, FSCS-ART and RRT is different. As N increases, MDiscrepancy decreases for RT, but increases for FSCS-ART and RRT.
• The MDiscrepancy of RPRT and BPRT is largely independent of the dimensions under study.
• In general, BPRT has the smallest MDiscrepancy for all N.
• When N = 1, FSCS-ART, RRT and BPRT have almost identical values of MDiscrepancy.

Method      Metric    1D      2D      3D      4D
RT          Max       0.1056  0.1093  0.0925  0.0785
            Min       0.0105  0.0106  0.0092  0.0077
            Max−Min   0.0951  0.0987  0.0833  0.0709
FSCS-ART    Max       0.0437  0.0719  0.0876  0.0793
            Min       0.0040  0.0062  0.0145  0.0188
            Max−Min   0.0396  0.0657  0.0731  0.0605
RRT         Max       0.0401  0.0810  0.1279  0.1172
            Min       0.0035  0.0074  0.0227  0.0302
            Max−Min   0.0365  0.0736  0.1052  0.0871
BPRT        Max       0.0470  0.0674  0.0670  0.0609
            Min       0.0018  0.0019  0.0016  0.0013
            Max−Min   0.0452  0.0655  0.0654  0.0595
RPRT        Max       0.0610  0.0701  0.0696  0.0651
            Min       0.0056  0.0052  0.0054  0.0054
            Max−Min   0.0554  0.0649  0.0642  0.0597

Figure 3. Range of MDiscrepancy for each testing method and dimension, where |E| ≤ 10000

Clearly, measuring the density of points in two partitions (Dedge and Dcentre) captures only part of what MDiscrepancy measures (namely, the density of points in 1000 randomly defined partitions of D). Hence, a smaller |MEdge:Centre − 1| does not necessarily imply a smaller MDiscrepancy, but a smaller MDiscrepancy does imply a smaller |MEdge:Centre − 1|. This explains why the value of MEdge:Centre for RT is close to 1 for all N, while its MDiscrepancy is not the smallest.

The ranges of MDispersion for all testing methods are summarized in Figure 4, with the following observations.

• For N = 1, all ART methods have smaller MDispersion values than RT.
• For N > 1, RRT normally has the smallest MDispersion values, followed in ascending order by FSCS-ART, BPRT, RPRT and RT.

Method      Metric    1D      2D      3D      4D
RT          Max       0.0272  0.1488  0.2851  0.4138
            Min       0.0005  0.0189  0.0714  0.1461
            Max−Min   0.0267  0.1298  0.2137  0.2676
FSCS-ART    Max       0.0160  0.1281  0.2731  0.4139
            Min       0.0002  0.0141  0.0601  0.1299
            Max−Min   0.0158  0.1141  0.2129  0.2839
RRT         Max       0.0156  0.1253  0.2780  0.4350
            Min       0.0002  0.0132  0.0585  0.1274
            Max−Min   0.0154  0.1121  0.2195  0.3077
BPRT        Max       0.0182  0.1365  0.2797  0.4121
            Min       0.0002  0.0150  0.0649  0.1388
            Max−Min   0.0180  0.1215  0.2148  0.2733
RPRT        Max       0.0179  0.1335  0.2743  0.4020
            Min       0.0002  0.0161  0.0671  0.1414
            Max−Min   0.0177  0.1174  0.2071  0.2606

Figure 4. Range of MDispersion for each testing method and dimension, where |E| ≤ 10000

In Table 1, the testing methods are ranked according to their test case distribution metrics. For each metric, the method that best satisfies its definition is ranked 1, and the one that least satisfies it is ranked 5. When two methods satisfy the definition to more or less the same degree, they are given the same ranking.

            MEdge:Centre         MDiscrepancy         MDispersion
            (closer to 1         (closer to 0         (smaller is
            is better)           is better)           better)
Method      1D  2D  3D  4D       1D  2D  3D  4D       1D  2D  3D  4D
RRT         1   5§  5§  5§       2   4   5   5        1   1   1   1
FSCS-ART    1   4§  4§  4§       2   3   4   4        1   2   2   2
BPRT        1   1   1   1        1   1   1   1        1   3   3   3
RPRT        1   3†  3†  3†       4   2   2   2        1   4   4   4
RT          1   1   1   1        5   5   3   3        5   5   5   5

† MEdge:Centre is smaller than 1, i.e. the testing method has a centre bias.
§ MEdge:Centre is larger than 1, i.e. the testing method has an edge bias.

Table 1. Testing methods ranked according to test case distribution metrics

For the testing methods studied, in the 1D case it has been observed that RRT has the best performance, followed by FSCS-ART, BPRT, RPRT and then RT. However, in the 2D, 3D and 4D cases, the same performance ordering is observed only for small θ, and almost the reverse ordering for large θ. It has been shown in [11] that there exist some hidden factors (unrelated to how evenly a method
spreads test cases) which have a strong impact on the performance of ART. Only when θ is small enough will the performance of ART strongly depend on how evenly its test cases are spread. In order to fairly analyze the relationship between the test case distribution and the performance of ART methods, without being influenced by these external factors, the rest of the discussion is restricted to small failure rates.

Table 1 shows that, in terms of the MDispersion metric, RRT has the most even spreading of test cases, followed by FSCS-ART, BPRT and RPRT. In other words, the ranking according to the MDispersion metric is consistent with the ranking according to F-measures (data shown in Figure 1). It should be pointed out that even though MDiscrepancy and MDispersion are both commonly used metrics for measuring sample point equidistribution, in this study MDispersion appears to be the more appropriate of the two. Interestingly, among all the ART methods under study, the one with the largest MEdge:Centre (that is, RRT) has the smallest MDispersion, while the one with the smallest MEdge:Centre (that is, RPRT) has the largest MDispersion. As discussed before, MDispersion best reflects the ordering of testing methods with respect to their performance. We notice that pushing test cases away from one another (so that MEdge:Centre > 1) is not a bad approach to evenly spreading test cases, even though it may not be the best approach to achieving a truly even spread.
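The exclusion-based mechanism behind RRT, which realizes this "pushing away", can be sketched as follows. This is our own illustrative reading of RRT [4], not the authors' code; in particular, the parameter target_ratio (the nominal excluded volume as a multiple of the domain volume) and the spherical exclusion zones are assumptions of the sketch.

```python
import math
import random

def rrt(n, dim, target_ratio=1.5, rng=random):
    """Restricted Random Testing sketch: keep a spherical exclusion zone
    around every executed test case, sized so that the zones nominally
    cover target_ratio times the domain volume, and accept only random
    candidates that fall outside all zones. Assumes enough free space
    remains for candidate generation to terminate."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    # volume of a dim-dimensional ball of radius 1
    unit_ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    E = [tuple(rng.random() for _ in range(dim))]
    while len(E) < n:
        # shrink the radius as |E| grows, keeping total nominal
        # excluded volume equal to target_ratio * |D| (|D| = 1 here)
        r = (target_ratio / (len(E) * unit_ball)) ** (1 / dim)
        while True:
            c = tuple(rng.random() for _ in range(dim))
            if all(dist(c, e) >= r for e in E):
                break
        E.append(c)
    return E
```

One intuition this sketch offers for the edge bias reported in Figure 2: an exclusion zone centred near the boundary partly falls outside the domain, so the region near the boundary is less heavily excluded than the interior, making boundary candidates more likely to be accepted.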
5. Conclusion

Previous studies [4, 5, 8] showed that the even spreading of test cases makes ART outperform RT. The concept of even spreading of test cases is simple but vague. In this paper, several metrics were used to measure the test case distributions of ART as well as RT, and the relevance and appropriateness of these metrics were also investigated. To the best of our knowledge, this is the first work analyzing the relationship between the test case distribution and the performance of an ART method. Recently, there has been some work on alleviating the edge bias of FSCS-ART [6, 13]. We shall continue this line of research, using the additional knowledge gained from this study to enhance the existing ART methods.
References

[1] P. E. Ammann and J. C. Knight. Data diversity: an approach to software fault tolerance. IEEE Transactions on Computers, 37(4):418–425, 1988.
[2] P. G. Bishop. The variation of software survival times for different operational input profiles. In Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, pages 98–107, 1993.
[3] M. S. Branicky, S. M. LaValle, K. Olson, and L. Yang. Quasi-randomized path planning. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation, pages 1481–1487, 2001.
[4] K. P. Chan, T. Y. Chen, and D. Towey. Restricted random testing: adaptive random testing by exclusion. Accepted to appear in International Journal of Software Engineering and Knowledge Engineering, 2006.
[5] T. Y. Chen, G. Eddy, R. G. Merkel, and P. K. Wong. Adaptive random testing through dynamic partitioning. In Proceedings of the 4th International Conference on Quality Software, pages 79–86, 2004.
[6] T. Y. Chen, F.-C. Kuo, and H. Liu. Enhancing adaptive random testing through partitioning by edge and centre. In Proceedings of the 18th Australian Software Engineering Conference, pages 265–273, 2007.
[7] T. Y. Chen, F.-C. Kuo, and Z. Q. Zhou. On favourable conditions for adaptive random testing. Accepted to appear in International Journal of Software Engineering and Knowledge Engineering.
[8] T. Y. Chen, H. Leung, and I. K. Mak. Adaptive random testing. In Proceedings of the 9th Asian Computing Science Conference, pages 320–329, 2004.
[9] G. B. Finelli. NASA software failure characterization experiments. Reliability Engineering and System Safety, 32(1–2):155–169, 1991.
[10] R. Hamlet. Random testing. In J. Marciniak, editor, Encyclopedia of Software Engineering. John Wiley & Sons, second edition, 2002.
[11] F.-C. Kuo. On adaptive random testing. PhD thesis, Faculty of Information and Communications Technologies, Swinburne University of Technology, 2006.
[12] J. Mayer. Lattice-based adaptive random testing. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering, pages 333–336, 2005.
[13] J. Mayer and C. Schneckenburger. Adaptive random testing with enlarged input domain. In Proceedings of the 6th International Conference on Quality Software, pages 251–258, 2006.
[14] G. J. Myers. The Art of Software Testing. Wiley, New York, second edition, 1979.
[15] K. Sen, D. Marinov, and G. Agha. CUTE: a concolic unit testing engine for C. In Proceedings of the 10th European Software Engineering Conference held jointly with the 13th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pages 263–272, 2005.
[16] D. Slutz. Massive stochastic testing of SQL. In Proceedings of the 24th International Conference on Very Large Databases, pages 618–622, 1998.
[17] T. Yoshikawa, K. Shimura, and T. Ozawa. Random program generator for Java JIT compiler test system. In Proceedings of the 3rd International Conference on Quality Software, pages 20–24, 2003.