The Journal of Systems and Software 53 (2000) 275–286

www.elsevier.com/locate/jss

Testing software to detect and reduce risk

Phyllis G. Frankl a,1, Elaine J. Weyuker b,*

a Computer Science Department, Polytechnic University, 6 Metrotech Center, Brooklyn, NY 11201, USA
b AT&T Labs – Research, Room E237, 180 Park Ave., Florham Park, NJ 07932, USA

Received 1 December 1999; accepted 1 December 1999

* Corresponding author. E-mail addresses: [email protected] (P.G. Frankl), [email protected] (E.J. Weyuker).
1 Supported in part by NSF Grant CCR-9870270.

Abstract

The risk of a piece of software is defined to be the expected cost due to operational failures of the software. Notions of the risk detected by a testing technique and the risk reduction due to a technique are introduced and are used to analytically compare the effectiveness of testing techniques. It is proved that if a certain relation holds between testing techniques A and B, then A is guaranteed to be at least as good as B at detecting and reducing risk, regardless of the particular faults in the program under test or their costs. These results can help practitioners choose an appropriate technique for testing software when risk reduction is the goal. © 2000 Elsevier Science Inc. All rights reserved.

Keywords: Fault detection; Program testing; Software testing; Software risk

1. Introduction

Software risk is usually defined to be the expected loss attributable to failures in a given piece of software (Boehm, 1989; Gutjahr, 1995; Hall, 1998; Leveson, 1995; Sherer, 1992). It is typically computed as the product of the probability of failures occurring and the expected loss attributable to such failures. An interesting question to consider is: ``What is the role of a test data selection or adequacy criterion in the assessment of risk?'' Should we be able to predict the risk associated with a software system based on having information about how the system was tested? Intuitively speaking, if a system has been comprehensively tested, there should be less risk associated with its use than if it has been only lightly tested. Thus, we would like to be able to compare testing strategies in a way that allows us to say that if a system has been tested using criterion C1, it is likely to have less risk associated with its use than if it has been tested using criterion C2.

Most previous evaluations of the effectiveness of software testing techniques have considered all failures to be equivalent to one another. They have employed such measures of test effectiveness as the likelihood of discovering at least one fault (i.e., the likelihood of at least one failure occurring), the expected number of failures that occur during test, the number of seeded faults discovered during test, and the mean time until the first failure, or between failures. In this context, a failure is defined to be any deviation between the actual output and the specified output for a given input. In practice, some failures may represent inconsequential deviations from the specification, while others are more severe, and some may even be catastrophic. Therefore, in evaluating the risk associated with a program, one must distinguish between different failures in accordance with their importance. To do so, we associate a cost with each failure.

Previous work by Weyuker (1996), Tsoukalas et al. (1993), Ntafos (1997) and Gutjahr (1995) has incorporated cost into the evaluation of testing techniques. Weyuker used cost or consequence of failure as the basis for an automatic test case generation algorithm, and to assess the reliability of the software that had been tested using this algorithm. Tsoukalas et al. and Ntafos analytically compared random testing and partition testing strategies when cost was taken into account, while Gutjahr derived a test distribution that would result in minimum variance for an unbiased estimator of risk.2

2 Gutjahr allowed the cost associated with a subdomain to be a random variable whose distribution was determined a priori; he also considered the special case for which each subdomain consists of a single element, as well as the more general partition testing situation.



In the work of Gutjahr, Tsoukalas et al. and Ntafos, the input domain is divided using some partition testing strategy, and a cost c_i is associated a priori with each subdomain. A failure of any element of the ith subdomain is assumed to cost c_i. It may be reasonable under some circumstances to associate costs of failures with subdomains, for example, if subdomains correspond to distinct functions that the software can perform, costs can be ascribed to the failure of each of these functions, and each input causes the execution of exactly one of the functions. However, there are many other situations in which this is not a realistic approximation of reality. In general, for the subdomains induced by the testing strategies commonly studied and in use, the consequences of failure for two different elements of the same subdomain may be very different. Furthermore, most subdomain testing strategies involve subdomains that intersect. For example, each input typically exercises many of the functions that have been identified in functional testing. In the above scheme, any subdomains that overlapped would have to have the same cost, or the input space would have to be further subdivided to ensure that this is the case.

In this paper we explore testing to evaluate software risk in a broader setting. In general, a piece of software will have some (possibly large) finite number of ``failure modes'', which may or may not be known in advance. These failure modes represent deviations from the specification that the system analysts or users view as being in some sense ``equivalent'' to one another. For example, any failure that results from misspelling a word in the output might be considered equivalent, while a failure that results in outputting the wrong numerical value could be considered to have considerably more severe consequences. Even here, differences in the magnitude of the numerical discrepancy might lead to different failure modes. It may or may not be possible to associate, a priori, a fixed cost or a cost as a function of input with each failure mode. For example, the cost of outputting the wrong numerical value might be fixed or might depend on how far the incorrect value is from the correct one. Weiss and Weyuker (1998) introduced a domain-based definition of software reliability that incorporated the discrepancy between specified and computed values of outputs. This work provided the motivation for the later work incorporating cost described by Gutjahr (1995) and Weyuker (1996).

Even if one could identify all of the failure modes of interest a priori and assign costs to them, this information could not be easily used to assign costs to input elements. To do so, it would be necessary to derive the set of inputs that result in a given failure mode. Since this is equivalent to being able to determine the set of inputs that compute a given value, and the question of whether or not a given input ever causes the program to halt can be reduced to it, this question is undecidable in the sense that there can be no algorithm to make this determination (Davis et al., 1994).

In spite of the lack of prior knowledge of failure modes, their costs, or which inputs correspond to which failure modes, it is nevertheless possible to learn something about risk from testing. In this paper, we will assume only that when a failure is observed during testing, the cost of that failure can be determined. This is a generalization of the well-known oracle assumption, which assumes there is a way to determine whether or not a test output agrees with the specified value. Here, we assume further that the oracle is able to determine the cost of the particular kind of failure. This might be a realistic assumption if, for example, a human was conducting the testing process and could provide this information. Note that we are assuming not only that the program is deterministic, so that a given input always succeeds or always fails, but also that whenever a given input causes a failure, it does so in the same way, with the same cost.3

Different goals of testing to assess risk can be distinguished:

· Testing to detect risk: In addition to counting the number of failures that occur during testing, one keeps track of the cost of those failures. Testing technique A will be considered more effective than testing technique B if the (expected) total cost of failures detected during test is higher for A than for B.

· Testing and debugging to reduce risk: It is further assumed that each failure that occurs leads to the correction of the fault that caused that failure, thereby reducing the risk associated with the corrected software. Testing technique A will be considered more effective than testing technique B if A reduces the risk more than B does, thus resulting in less risky software. The distinction between this goal and the preceding one is discussed in more detail in Section 4.

· Testing to estimate risk: In estimating software reliability, it is assumed that some faults will remain in the software. The goal is to estimate the probability that the software will fail after deployment (during some specified time). Previous work by Gutjahr and Tsoukalas et al. generalized this to the estimation of the expected cost due to failures after deployment, i.e., estimating risk. Here, we will say that testing technique A is better than testing technique B (for a given technique of estimating risk) if A provides more accurate estimates of risk than B.

3 This is not always a realistic assumption, as the cost of a particular failure mode may depend on the circumstances under which it occurs; for example, system down-time might be more acceptable at 2 a.m. on Sunday than during peak business hours, since the failure cost might involve the number of users impacted. In addition, it may depend on the types and frequencies of failures that precede it.


Risk is computed relative to a particular piece of software, its specification, its usage profile (which helps determine the probability of failure), and its operational environment (which affects the consequence of a particular failure). We are interested in comparing testing criteria according to their ability to detect and/or reduce risk, and would like to draw conclusions about the relative merits of the criteria that hold for all software, specifications, environments and usage profiles. Consequently, we should not expect to be able to say things of a numerical nature, such as ``Testing with criterion C1 will result in 50% more risk reduction than testing with criterion C2''. Instead, we must be content with conclusions that allow us to make statements of a relative nature, such as ``Testing with C1 will reduce risk at least as much as testing with C2''. We focus on this type of comparison of testing criteria in this paper. In particular, this paper compares various subdomain testing techniques to one another according to how effective they are at risk detection and risk reduction. We show that if a certain relation holds between criterion A and criterion B, then criterion A is guaranteed to be at least as good as criterion B at both detecting and reducing risk.

2. Background

In this section, we will provide needed concepts and terminology. Many of the definitions were introduced in Frankl and Weyuker (1993a) or Frankl and Weyuker (1993b), in which we investigated ways of comparing the fault-detecting ability of software test data adequacy criteria so that one can say in a concrete way that one testing method is better than another. We now continue this study by investigating how software risk can be incorporated into this comparison.

An important motivation of our earlier work was to improve upon the small amount of analytical work that had been done to assess the relative efficacy of different proposed testing methods. Prior to these papers, most comparisons were based on subsumption, where criterion C1 is said to subsume criterion C2 if for every program P, every test suite satisfying C1 also satisfies C2. As pointed out by several research groups, including Frankl and Weyuker (1993a), Frankl and Weyuker (1993b), Hamlet (1989), Weiss (1989) and Weyuker et al. (1991), subsumption is not necessarily the ideal way to compare testing strategies. One problem with subsumption is that it is sometimes possible to construct a test suite that satisfies C2 and detects the presence of a fault while another test suite that satisfies C1 does not detect it, even when criterion C1 subsumes criterion C2. The problem is that there are usually many different test suites satisfying a given test data adequacy criterion, and generally no real guidance on how to select a ``good'' one.

Therefore, rather than considering whether it is possible for an adequate test suite to select a test case that fails, in Frankl and Weyuker (1993a) and Frankl and Weyuker (1993b) we explored whether test suites generated to satisfy a given criterion are likely to include test cases that fail. For this reason, we used probabilistic ways of comparing test data adequacy criteria that assessed criteria based on the likelihood of selecting at least one test case that fails, or on the expected number of failures that will occur during testing, and in that way considered one criterion to be at least as good as another. That analysis was based on investigating how software testing criteria divide a program's input domain into subsets, or subdomains.

More precisely, in Frankl and Weyuker (1993a) we introduced the properly covers relation between criteria and showed it to be ``stronger than'' subsumption. We proved that if C1 properly covers C2, then when one test case is independently randomly selected from each subdomain using a uniform distribution, the probability that a test suite satisfying C1 contains at least one test case that fails is guaranteed to be greater than or equal to the corresponding probability for C2. We then proved in Frankl and Weyuker (1993b) that, under the same selection strategy, the expected number of failures detected by C1 is guaranteed to be greater than or equal to the expected number of failures detected by C2. Given that these results were proved using a model of testing that is a reasonable approximation of reality, these are powerful results, allowing us to make concrete what we mean when we say that criterion C1 is at least as good as criterion C2. We will provide formal definitions of the relevant relations in Section 2.3, as well as a formal statement of our primary earlier results. This will allow us to build upon these results in our study of relationships between testing strategies and software risk assessment.

We used these results as a way of guaranteeing that a criterion C1 was at least as good as criterion C2 for testing any program in a large class of programs. This was done independent of the particular faults occurring in the program. In addition, we showed that if C1 did not properly cover C2 for some program, then even if C2 was subsumed by C1, it was possible for C2 to be more likely than C1 to expose a fault in the program. In Frankl and Weyuker (1993b), we used the above results to investigate the relative failure-exposing ability of several well-known testing techniques. In this paper we will extend this investigation by considering what we can say precisely about the risk detected when a program is tested using criterion C1, compared to the risk detected if it is tested using criterion C2, given that C1 properly covers C2.


2.1. Terminology

A multi-set is a collection of objects. In contrast to a set, a multi-set may contain duplicates. Multi-sets will be delimited by curly braces, and set-theoretic operator symbols will be used to denote the corresponding multi-set operators. A multi-set S1 is a sub-multi-set of multi-set S2 provided there are at least as many copies of each element of S1 in S2 as there are in S1.

The set of possible inputs to a program is known as its input domain. We assume that the program computes a partial function on its domain, i.e., that on a given input, the program will always produce the same output or will always fail to terminate. This assumption does not hold for programs whose results depend on the state of their environment, but it can be made to hold by considering all relevant aspects of the environment to be components of the input domain. Although we place no bound on the input domain size, we do restrict attention to programs with finite input domains. We do not consider this to be an unrealistic restriction, since all programs run on machines with finite word sizes and with finite amounts of memory.

A test suite is a multi-set of test cases, each of which is an element of the input domain. We often speak of test suites rather than test sets because it is sometimes pragmatically useful to permit some duplication of test cases. A test data adequacy criterion is a relation C ⊆ Programs × Specifications × Test Suites used for determining whether program P has been ``thoroughly'' tested by test suite T relative to specification S. If C(P, S, T) holds, we say that T is adequate for testing P with respect to S according to C, or that T is C-adequate for P and S.

There are two primary uses for adequacy criteria: as a test suite evaluation method and as the basis for test case selection strategies. In the first case, the adequacy criterion is used to determine whether or not a test suite is sufficiently robust to consider the testing phase complete once all of the test cases in the suite have been run. In this case, the way the test suite is constructed is irrelevant and may be independent of the adequacy criterion, and all test case selection is completed before the adequacy criterion is applied. If the test suite does not satisfy the adequacy criterion, test case selection resumes and, after some additional test cases have been added, the adequacy criterion is again applied. In the second case, n_i test cases are selected to satisfy the ith requirement determined by the adequacy criterion. (Usually n_i = 1.) In this case the adequacy criterion is explicitly being used to help construct the test suite.

Consider, for example, the statement coverage adequacy criterion, which requires that each statement in the program be executed by some test case. In the first case, one would select test cases by some independent means, then check that every statement has been executed, adding more test cases if necessary. In the second approach, the inputs that exercise each statement would be determined, and one (or several) test case would be selected for each statement. The results in this paper are based on the second approach to using an adequacy criterion. In practice, some hybrid of the two approaches is often used.

We will focus on subdomain-based testing approaches, namely those that divide the input domain into subsets called subdomains, and then require the selection of one or more elements from each subdomain. The multi-set of subdomains for criterion C, program P and specification S will be denoted by SD_C(P, S). The input domain may be subdivided based on the program structure of the software under test (program-based testing), the structure or semantics of the specification (specification-based testing), or some combination of the two. Such strategies have sometimes been referred to as partition testing strategies, but since in practice such strategies often divide the input domain into overlapping subdomains, they do not form true partitions of the input domain in the mathematical sense.

The operational profile or operational distribution is a probability distribution that associates with each element of the input domain the probability of its occurrence when the software is operational in the field. Thus, if Q is the operational distribution of a program, Q(t) is the probability of input t occurring in a single execution of the program in its operational environment. An input t is said to be failure-causing for a given program P and specification S if the output produced by P on input t does not agree with the output specified by S. We will sometimes speak of a test suite T detecting a fault in a program. We will mean by this that there is at least one failure-causing input in T.

2.2. The model

In Frankl and Weyuker (1993a), our goal was to compare the fault-detecting ability of criteria, and hence we needed a well-defined and realistic model that was not biased in any one criterion's favor. We assumed that test suites were selected to satisfy a subdomain-based criterion C by first dividing the domain based on SD_C(P, S) = {D_1, ..., D_n}, and then for each subdomain D_i ∈ SD_C(P, S) randomly selecting an element of D_i. We let d_i = |D_i| denote the size of subdomain D_i, let m_i be the number of failure-causing inputs in D_i, and let

    M(C, P, S) = 1 - \prod_{i=1}^{n} (1 - m_i/d_i).

If one test case is independently randomly selected from each subdomain according to a uniform distribution, M is the probability that a test suite chosen using this test selection strategy will cause at least one failure to occur.


M has been widely used by a variety of researchers as the basis for the comparison of testing strategies (Duran and Ntafos, 1984; Frankl and Weyuker, 1993a; Hamlet and Taylor, 1990; Weyuker and Jeng, 1991).

We also defined a different measure of a criterion's fault-detecting ability in Frankl and Weyuker (1993b). We again let SD_C(P, S) = {D_1, ..., D_n}, assumed that one test case was independently randomly selected from each subdomain based on a uniform distribution, and defined the expected number of failures detected to be

    E(C, P, S) = \sum_{i=1}^{n} m_i/d_i.
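To make these two measures concrete, here is a minimal sketch that computes M and E for a small configuration of subdomains under uniform selection; the domain, the subdomains, and the set of failure-causing inputs are assumptions invented purely for the illustration.

    def M(subdomains, failing):
        # M(C, P, S) = 1 - prod_i (1 - m_i/d_i) under uniform selection.
        prob_no_failure = 1.0
        for D in subdomains:
            m = len(D & failing)            # failure-causing inputs in D_i
            prob_no_failure *= 1.0 - m / len(D)
        return 1.0 - prob_no_failure

    def E(subdomains, failing):
        # E(C, P, S) = sum_i m_i/d_i, the expected number of failures.
        return sum(len(D & failing) / len(D) for D in subdomains)

    # Hypothetical example: domain {0,...,7}; inputs 2 and 5 are failure-causing.
    failing = {2, 5}
    SD_C = [{0, 1, 2, 3}, {4, 5}, {6, 7}]
    print(M(SD_C, failing))   # 1 - (3/4)(1/2)(1) = 0.625
    print(E(SD_C, failing))   # 1/4 + 1/2 + 0 = 0.75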

2.3. Testing criteria relations

In our earlier work, we introduced several relations R that could hold among subdomain-based testing criteria, and asked whether R(C1, C2) necessarily implies that M(C1, P, S) ≥ M(C2, P, S) or that E(C1, P, S) ≥ E(C2, P, S). We first showed that it was possible for M(C1, P, S) to be less than M(C2, P, S), even though C1 subsumes C2. This result led to the introduction of a stronger comparison relation that we showed had the desired properties relative to both M(C, P, S) and E(C, P, S). The following definition appeared in Frankl and Weyuker (1993a):

Definition 1. Let C1 and C2 be criteria. C1 covers C2 for (P, S) if for every subdomain D ∈ SD_{C2}(P, S) there is a collection {D_1, ..., D_n} of subdomains belonging to SD_{C1}(P, S) such that D_1 ∪ ... ∪ D_n = D.

It was shown in Frankl and Weyuker (1993a) that although several well-known testing criteria are related by the covers relation, it was possible for M(C1, P, S) < M(C2, P, S) even though C1 covers C2 for (P, S). The reason that this could happen was that one subdomain of C1 could be used to cover two or more subdomains of C2. (For example, if SD_{C2}(P, S) contains two copies of a subdomain D while SD_{C1}(P, S) contains only one, C1 covers C2, but the single selection C1 makes from D cannot match two independent C2 selections from D.) This led to the introduction of the properly covers relation.

Definition 2. Let SD_{C1}(P, S) = {D^1_1, ..., D^1_m}, and let SD_{C2}(P, S) = {D^2_1, ..., D^2_n}. C1 properly covers C2 for (P, S) if there is a multi-set

    M = {D^1_{1,1}, \ldots, D^1_{1,k_1}, \ldots, D^1_{n,1}, \ldots, D^1_{n,k_n}}

such that M is a sub-multi-set of SD_{C1}(P, S) and

    D^2_1 = D^1_{1,1} \cup \ldots \cup D^1_{1,k_1},
        ...
    D^2_n = D^1_{n,1} \cup \ldots \cup D^1_{n,k_n}.


Informally, this says that each of C2's subdomains is ``covered'' by the union of some of the subdomains of C1, and that no C1 subdomain is used more often in covering the C2 subdomains than its number of occurrences in SD_{C1}(P, S). Note that it is not the number of subdomains alone that determines whether or not one criterion properly covers another. In Frankl and Weyuker (1993a) we proved that:

Theorem 1. If C1 properly covers C2 for program P and specification S, then M(C1, P, S) ≥ M(C2, P, S).

and in Frankl and Weyuker (1993b) we proved an analogous result:

Theorem 2. If C1 properly covers C2 for program P and specification S, then E(C1, P, S) ≥ E(C2, P, S).

The thrust of these two theorems was that one could guarantee that a given criterion was better at uncovering failures than another by showing that they were related by the properly covers relation.

2.4. Measuring detected risk

The risk of program P is the expected cost of failure of P in the field on a single input. Let c(t) denote the cost due to the deviation (if any) between the output produced by program P on input t, P(t), and the specified output S(t). Cost is actually a function of the program, its specification, and numerous environmental and social factors, as well as the input, but for brevity we denote it by c(t). Although, as argued above, it may be unrealistic to determine c(t) for all inputs t a priori, it is much more reasonable to assume that these costs can be determined for each test case after executing the test case and observing the failure mode (if any) that results. Our results do not depend on prior knowledge of the values of c(t).

Let Q(t) denote the operational distribution, i.e., the probability that input t is selected when P is operational in the field. Then

    R(P, S) = \sum_{t \in D} Q(t) c(t)

is the risk of program P. Equivalent definitions of risk have been used by other authors, including Gutjahr (1995), Tsoukalas et al. (1993) and Sherer (1992), and by Weyuker (1996) to weight the operational distribution when selecting test cases depending on both the frequency of occurrence and the consequence of failure. Note that this notion of risk is defined relative to a program and how it is to be used in the field, and is independent of the test selection method used. If P is equivalent to S, i.e., if P is correct, then c(t) is 0 for all elements of the domain, and hence the risk R(P, S) = 0. This is consistent with one's intuition.
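As a small illustration of this definition, the following sketch evaluates R(P, S) over a toy four-input domain; the operational distribution Q and the costs c are invented, and in practice c(t) would be observed during testing rather than known in advance.

    # Hypothetical operational distribution Q and failure costs c;
    # c(t) = 0 on inputs where the program meets its specification.
    Q = {"t1": 0.4, "t2": 0.3, "t3": 0.2, "t4": 0.1}
    c = {"t1": 0.0, "t2": 0.0, "t3": 50.0, "t4": 1000.0}

    # R(P, S) = sum over the domain of Q(t) * c(t).
    risk = sum(Q[t] * c[t] for t in Q)
    print(risk)   # 0.2*50 + 0.1*1000 = 110.0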


We next define the risk detected by a test suite T for program P, relative to specification S, to be

    DR(P, S, T) = \sum_{t \in T} c(t).

Here we have defined a notion that is independent of how the program will actually be used, but dependent on how it was tested. Of course, if test cases are selected based on the operational distribution, then the risk detected will give some sort of picture of the risk associated with the program.

The testing techniques we consider are probabilistic in nature. Consequently, we will compare techniques by comparing the expected value of DR(P, S, T), where T ranges over the test suites that could be selected by the given technique:

    EDR(C, P, S) = E(DR(P, S, T)) = \sum_{T} Prob(T) (\sum_{t \in T} c(t)),

where Prob(T) is the probability that test suite T will be selected by test selection technique C, relative to program P and specification S.

3. Comparing the risk detection ability of testing criteria

Theorems 1 and 2 give conditions under which one testing criterion is guaranteed to be more effective than another according to certain measures of effectiveness. In the remainder of the paper, analogous results are proved for measures of effectiveness that are related to risk. This is done in a somewhat more general setting, loosening the restrictions on how the subdomains are used to guide test data selection.

Theorems 1 and 2 assume that test cases are drawn from each subdomain according to a uniform distribution on that subdomain. Since risk is defined in terms of the operational distribution, the use of a uniform distribution when testing either to detect or to reduce risk may be misleading. Surely, if operational distribution data is available, one would expect the computed risk to be less accurate when a uniform distribution is used in lieu of the historical usage data. If there are inputs that lead to high-cost failures and that are more likely than average to occur in the field, testing with a uniform distribution may lull the tester into a false sense of security. Conversely, if there are inputs causing costly failures that are very unlikely to occur in the field, testing with a uniform distribution may lead the tester to think the software is more risky than it actually is.

Just as the traditional goal of subdomain testing is to uncover failures and eliminate the underlying faults that caused them, rather than to estimate reliability, the goal of subdomain testing in this context is to detect (and then reduce) risk, rather than to estimate risk. Nevertheless, it may be more informative to perform subdomain testing with a non-uniform distribution on each subdomain, in order to more closely approximate the operational distribution, and thereby detect and remove an amount of risk that is more closely related to the actual risk of the software. Similarly, if prior information is available about which inputs are likely to cause high-consequence failures, one might choose a distribution that gives higher weight to those inputs. Consequently, we consider subdomain testing with distributions on subdomains that are not necessarily uniform.

Let P be a program, S be a specification, D denote the input domain of P relative to S, and F denote the set of failure-causing inputs. Let C be a subdomain-based criterion, and let SD_C(P, S) = {D_1, ..., D_n} be the corresponding multi-set of subdomains. Let Pr_1, ..., Pr_n be probability distributions on D_1, ..., D_n, respectively. One can select a C-adequate test suite by independently randomly selecting one test case from each D_i according to Pr_i. It then follows that

    h_i = \sum_{t \in D_i \cap F} Pr_i(t)

is the probability that a test case selected from subdomain D_i will be a failure-causing input, and hence

    M(C, P, S, Pr_1, \ldots, Pr_n) = 1 - \prod_{i=1}^{n} (1 - h_i)

is the probability that a test suite selected according to this strategy will detect a fault by causing P to fail.

We begin by noting that Theorem 1 does not generalize to arbitrary test selection strategies of this nature.

Observation 1. Let P be an incorrect program for specification S and let C1 and C2 be subdomain-based criteria. Assume that 0 < M(C2, P, S) and that M(C1, P, S) < 1 (i.e., assume that at least one subdomain induced by C2 includes a failure-causing input and that every subdomain induced by C1 includes at least one non-failure-causing input). Then there exist probability distributions Pr^1_1, ..., Pr^1_n and Pr^2_1, ..., Pr^2_m such that M(C1, P, S, Pr^1_1, ..., Pr^1_n) = 0 and M(C2, P, S, Pr^2_1, ..., Pr^2_m) = 1.

To see this, simply let D_i be a C2 subdomain that contains a failure-causing input t, and let Pr^2_i select t with probability 1; any test suite selected according to this strategy is guaranteed to detect a fault. For each i, let Pr^1_i select a non-failure-causing input with probability 1; no test suite selected according to this strategy will detect a fault. Thus, if neither C nor C' is guaranteed to detect a fault and neither is guaranteed to never detect a fault, there are distributions for which C performs better than C' and distributions for which C' performs better than C. Analogous problems arise for the expected number of failures, expected risk detection, and expected risk reduction.

The problem occurs because when t belongs to subdomain D^1_i of C1 and to subdomain D^2_j of C2, there is not necessarily any relation between the probability that t is selected to represent D^1_i and the probability that t is selected to represent D^2_j. However, as we shall see below, in certain situations that arise naturally in practice, such a relation does exist.

Definition 3. Let Pr be a probability distribution on the input domain of program P with specification S, let SD_C(P, S) = {D_1, ..., D_n}, and let

    a_i = \sum_{t \in D_i} Pr(t).

Let

    Pr_{D_i}(t) = (1/a_i) Pr(t).

It is easy to verify that Pr_{D_i} is a probability distribution on D_i. We will call this distribution the inherited distribution of Pr on D_i. Note that Pr_{D_i}(t) is the conditional probability that test case t is selected, given that some test case in D_i is selected.

There are several approaches to selecting test cases according to an inherited distribution. If the structure of the subdomains is not too complicated, it may be possible to explicitly compute the probability densities of the subdomains (the a_i's) and use them to explicitly compute the distributions on each subdomain. Alternatively, if it is difficult to determine the a_i's a priori, one can select test cases according to the distribution on the entire domain, determine which subdomain(s) each test case lies in (by executing the program or by other means), associate each test case with one subdomain that contains it and that has not already been ``killed'', and discard remaining test cases. In order to assure that a part of a subdomain that intersects some other subdomain is not under-represented, subdomains can be grouped into batches of disjoint subdomains, and each batch can be treated separately. This approach results in the generation of extra test cases, but may be cost-effective in situations in which checking test results is more expensive than generating and executing tests.

Since these and similar approaches are reasonably close to the way testing criteria are used in practice, we believe inherited distributions offer a suitable generalization of previous studies limited to the uniform distribution. We shall prove a theorem analogous to Theorem 1 for expected detected risk when tests are selected using an inherited distribution.
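A minimal sketch of the first approach, in which the a_i's are computed explicitly; the domain-wide distribution Pr and the (overlapping) subdomains are invented for the illustration.

    # Hypothetical domain-wide distribution and overlapping subdomains.
    Pr = {"t1": 0.1, "t2": 0.2, "t3": 0.3, "t4": 0.4}
    subdomains = [{"t1", "t2"}, {"t2", "t3", "t4"}]

    def inherited(Pr, D):
        # Inherited distribution of Pr on D: Pr_D(t) = Pr(t)/a_i,
        # where a_i is the total probability Pr assigns to D.
        a = sum(Pr[t] for t in D)
        return {t: Pr[t] / a for t in D}

    for D in subdomains:
        print(inherited(Pr, D))
    # First subdomain:  a_1 = 0.3, so t1 -> 1/3, t2 -> 2/3.
    # Second subdomain: a_2 = 0.9, so t2 -> 2/9, t3 -> 1/3, t4 -> 4/9.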


The following is proved in Appendix A:

Lemma 1. If test suites are selected by independently randomly selecting one test case from each subdomain, using distribution Pr_i to select from subdomain D_i, then the expected risk detected during testing is

    EDR(C, P, S, Pr_1, \ldots, Pr_n) = \sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t) c(t).

Corollary 1. Let the expected number of failures be denoted by E(C, P, S, Pr_1, ..., Pr_n). Then:

    E(C, P, S, Pr_1, \ldots, Pr_n) = \sum_{i=1}^{n} h_i.

Proof. Treat all failure-causing inputs as having a cost of 1 and all non-failure-causing inputs as having a cost of 0. Then the expected number of failures is the expected cost,

    \sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t) c(t) = \sum_{i=1}^{n} \sum_{t \in D_i \cap F} Pr_i(t) = \sum_{i=1}^{n} h_i.  □
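The closed form in Lemma 1 is easy to check numerically. The sketch below computes EDR exactly and then approximates it by simulating the selection strategy; the subdomains, distributions, and costs are invented for the illustration.

    import random

    # Hypothetical per-subdomain distributions Pr_i and failure costs c(t).
    subdomains = [{"t1": 0.5, "t2": 0.5},       # Pr_1 on D_1
                  {"t2": 0.25, "t3": 0.75}]     # Pr_2 on D_2
    c = {"t1": 0.0, "t2": 10.0, "t3": 2.0}

    # Exact value: EDR = sum_i sum_{t in D_i} Pr_i(t) c(t).
    edr = sum(p * c[t] for Pr_i in subdomains for t, p in Pr_i.items())
    print(edr)   # 0.5*10 + 0.25*10 + 0.75*2 = 9.0

    # Monte Carlo check: draw one test case per subdomain and sum the costs.
    def draw(Pr_i):
        return random.choices(list(Pr_i), weights=list(Pr_i.values()))[0]

    trials = 100_000
    total = sum(sum(c[draw(Pr_i)] for Pr_i in subdomains) for _ in range(trials))
    print(total / trials)   # should be close to 9.0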

We now prove a result analogous to Theorem 1 for expected risk detection under inherited distributions. For testing with distributions inherited from Pr, denote EDR(C, P, S, Pr_{D_1}, ..., Pr_{D_n}) by EDR(C, P, S, Pr). Note that

    EDR(C, P, S, Pr) = \sum_{i=1}^{n} \sum_{t \in D_i} (Pr(t)/a_i) c(t).

In order to assess risk using subdomain testing strategies, we will be particularly interested in distributions inherited from the operational distribution.

Theorem 3. Let C1 properly cover C2 for program P with specification S, and let Pr be any probability distribution on the input domain D. Then

    EDR(C1, P, S, Pr) \ge EDR(C2, P, S, Pr).

Proof. Let SD_{C1}(P, S) = {D^1_1, ..., D^1_m}, let SD_{C2}(P, S) = {D^2_1, ..., D^2_n}, and let

    M = {D^1_{1,1}, \ldots, D^1_{1,k_1}, \ldots, D^1_{n,1}, \ldots, D^1_{n,k_n}}

be a multi-set such that M is a sub-multi-set of SD_{C1}(P, S) and

    D^2_1 = D^1_{1,1} \cup \ldots \cup D^1_{1,k_1},
        ...
    D^2_n = D^1_{n,1} \cup \ldots \cup D^1_{n,k_n}.

This is possible since C1 is assumed to properly cover C2 for program P and specification S. Let a^1_i denote \sum_{t \in D^1_i} Pr(t), let a_i denote \sum_{t \in D^2_i} Pr(t), and let a_{i,j} denote \sum_{t \in D^1_{i,j}} Pr(t). Then


    EDR(C1, P, S, Pr) = \sum_{i=1}^{m} \sum_{t \in D^1_i} (Pr(t)/a^1_i) c(t)                    (1)
        \ge \sum_{i=1}^{n} \sum_{j=1}^{k_i} \sum_{t \in D^1_{i,j}} (Pr(t)/a_{i,j}) c(t)         (2)
        \ge \sum_{i=1}^{n} \sum_{j=1}^{k_i} \sum_{t \in D^1_{i,j}} (Pr(t)/a_i) c(t)             (3)
        \ge \sum_{i=1}^{n} \sum_{t \in D^2_i} (Pr(t)/a_i) c(t)                                  (4)
        = EDR(C2, P, S, Pr).                                                                    (5)

The inequality in line (2) follows from the facts that the sum in line (1) is over all the subdomains in SD_{C1}(P, S), the sum in line (2) is over all the subdomains in M, and M is a sub-multi-set of SD_{C1}(P, S). The inequality in line (3) holds because for each i, j, D^1_{i,j} ⊆ D^2_i and therefore a_i ≥ a_{i,j}. The inequality in line (4) holds because the sum in line (3) represents k_i selections from each subdomain D^2_i, while the sum in line (4) represents one selection from each D^2_i. The equalities in lines (1) and (5) follow from the lemma.  □

Thus if C1 properly covers C2, it follows that testing with C1 (using the test selection strategy described above) is guaranteed to detect at least as much risk as testing with C2. This is important since it shows that the properly covers relation can be used in a natural way to compare testing strategies with respect to risk reduction, especially since it does not assume that test cases are selected from within subdomains using a uniform distribution.

In Frankl and Weyuker (1993b), we showed that the properly covers relation holds between various well-known testing criteria for a large class of programs. Each of these criteria was introduced and investigated in a number of earlier papers, and formal definitions can be found in Frankl and Weyuker (1993b). Applying the result of Theorem 3 to these pairs of criteria yields:

Corollary 2. For any program (in a particular large class of programs), any specification, and any probability distribution Pr on D,

    EDR(required-k-tuples+, P, S, Pr) \ge EDR(all-uses, P, S, Pr)
        \ge EDR(all-p-uses, P, S, Pr) \ge EDR(decision-coverage, P, S, Pr);

    EDR(ordered-context-coverage, P, S, Pr) \ge EDR(context-coverage, P, S, Pr)
        \ge EDR(decision-coverage, P, S, Pr);

    EDR(multiple-condition-coverage, P, S, Pr) \ge EDR(decision-coverage, P, S, Pr); and

    EDR(decision-condition-coverage, P, S, Pr) \ge EDR(decision-coverage, P, S, Pr).
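As a concrete, invented illustration of Theorem 3, the sketch below compares EDR for two toy criteria, where C1 properly covers C2 by splitting C2's single subdomain in two; all names and numbers are ours, not drawn from the paper.

    def edr(subdomains, Pr, c):
        # EDR(C, P, S, Pr) = sum_i sum_{t in D_i} (Pr(t)/a_i) c(t),
        # i.e., each subdomain is sampled with its inherited distribution.
        total = 0.0
        for D in subdomains:
            a = sum(Pr[t] for t in D)
            total += sum((Pr[t] / a) * c[t] for t in D)
        return total

    Pr = {"t1": 0.2, "t2": 0.3, "t3": 0.5}
    c = {"t1": 100.0, "t2": 0.0, "t3": 4.0}

    SD_C2 = [{"t1", "t2", "t3"}]        # a single subdomain
    SD_C1 = [{"t1"}, {"t2", "t3"}]      # properly covers C2: {t1} U {t2, t3}

    print(edr(SD_C2, Pr, c))   # 0.2*100 + 0.5*4 = 22.0
    print(edr(SD_C1, Pr, c))   # 100.0 + (0.5/0.8)*4 = 102.5 >= 22.0, as guaranteed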

4. Comparing the risk reduction ability of testing criteria

As mentioned above, when testing to uncover the presence of faults, the ultimate goal is not only to detect the faults, but to remove them, so as to reduce the likelihood that the software will fail in the field. Similarly, when testing to detect risk, the ultimate goal is to remove high-consequence faults that are likely to occur, so as to reduce the risk of the software. The measure investigated above, risk detected by a test suite, is related, but not identical, to the amount of risk reduction that will result when all the faults detected are fixed. There are several reasons why the detected risk is not necessarily equal to the risk reduction:

· The probability of detecting a given fault (by selecting a test case that will cause a failure during testing) may differ from the probability that that fault will result in failure in the field. This might happen due to differences between the probability distribution used in testing and the operational distribution.

· Several test cases may detect the same fault. This means that the cost of that fault will be counted several times in calculating detected risk, but its removal will only contribute once to reducing risk.

· The detected risk is calculated using n test cases, where n is the test suite size, whereas the risk reduction is the reduction in the expected cost due to failure on a single run of the program.

· In attempting to remove a fault, the programmer may introduce other faults, in which case the risk will not decrease as much as expected and may actually increase.

Although the risk reduction resulting from a test set is not directly measurable, under certain assumptions we can still say something about which testing criteria are better at reducing risk. Frankl et al. (1998) introduced the notion of a ``failure region'', a subset of the set of failure-causing inputs consisting of inputs that are related in the sense that the change made in response to detecting a failure of one of the test cases in the region will fix all of the test cases in the region. This intuition is predicated on the fact that a software change generally does not cause only the specific test case that prompted the change to behave differently. Typically many other elements of the domain will also be affected, hopefully by causing them to produce the correct output rather than an incorrect one. All of the inputs whose behavior is corrected by the code change that corrects the ``fault'' will together be thought of as a failure region. It is assumed that if any element in the failure region had been selected as a test case, it would have caused the person to make the same changes to the software.

As in Frankl et al. (1998), we assume that the set of failure-causing inputs can be divided into disjoint failure regions, F_1, ..., F_p, having the property that whenever a test case from F_i is executed, the person debugging the program makes a change that exactly removes F_i. This implies that this change causes all of the inputs in F_i to now execute correctly. This is a strong assumption, as the change made upon observing a failure may depend on many factors, including exactly which test case failed, how the tester selected that test case, and the prior experience or whims of the programmer debugging the program at that moment. We recognize that this is not necessarily true in reality, and that it is even possible that at different times the same person might make different changes to the software in response to the same test case failing. Nevertheless, we believe it is a useful assumption, approximating reality closely enough to give useful insights into the relative strengths and weaknesses of various testing techniques. In particular, since the assumption is applied uniformly in our analyses of the various testing criteria, it does not bias the results in favor of or against a particular criterion. Thus it seems reasonable for our present purposes (and those of Frankl et al. (1998)) of comparing the effectiveness of various testing techniques.

We will further assume that each failure region has a fixed cost. This will allow us to model the risk reduction obtained through testing. As above, we assume we can associate a cost of failure c(t) with each input t, but that these costs are not necessarily known a priori. The cost of a failure region F_ℓ is

    c(F_ℓ) = \sum_{t \in F_ℓ} c(t).

Let Q represent the operational distribution of the program. The probability that a failure region F_ℓ will be encountered on a single execution of the software in the field is

    Q(F_ℓ) = \sum_{t \in F_ℓ} Q(t).

Thus the risk attributed to failure region F_ℓ is

    R(F_ℓ) = \sum_{t \in F_ℓ} Q(t) c(t).

The risk reduction due to a test suite T is the difference between the risk of the original program and that of the program obtained by removing all of the failure regions discovered by T. This leads us to another useful measure of the effectiveness of a testing technique: the expected risk reduction when a test suite is selected using that technique. As above, since the testing techniques we are investigating are probabilistic, we consider the expected value over possible test suites selected using the technique. The expected risk reduction can be obtained by summing the risks due to the failure regions, weighted by the probability that the failure regions are detected (and hence removed) during the testing/debugging process.

Consider a subdomain testing technique with subdomains {D_1, ..., D_n}, where one test case is selected independently from each using probability distribution Pr_i on D_i. Consider failure region F_ℓ. The probability that F_ℓ is detected is equal to the probability that it is detected by a test case from at least one subdomain, which is 1 minus the probability that it is not detected by the tests selected from any of the subdomains:

    1 - \prod_{i=1}^{n} (1 - \sum_{t \in D_i \cap F_ℓ} Pr_i(t)).

Hence, the expected risk reduction for criterion C (with the above testing technique) for program P and specification S is

    RR(C, P, S, Pr_1, \ldots, Pr_n) = \sum_{ℓ=1}^{p} [1 - \prod_{i=1}^{n} (1 - \sum_{t \in D_i \cap F_ℓ} Pr_i(t))] c(F_ℓ) Q(F_ℓ).
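The RR formula can be evaluated directly once the failure regions are fixed. The sketch below does so for an invented example with two failure regions; the subdomain distributions, regions, profile Q, and costs c are all assumptions made for the illustration.

    def rr(subdomains, failure_regions, Q, c):
        # RR = sum over failure regions F of
        #   [1 - prod_i (1 - Pr_i(D_i intersect F))] * c(F) * Q(F).
        total = 0.0
        for F in failure_regions:
            miss = 1.0
            for Pr_i in subdomains:
                hit = sum(p for t, p in Pr_i.items() if t in F)
                miss *= 1.0 - hit
            cost = sum(c[t] for t in F)    # c(F), the cost of the region
            prob = sum(Q[t] for t in F)    # Q(F), chance of hitting F in the field
            total += (1.0 - miss) * cost * prob
        return total

    subdomains = [{"t1": 0.5, "t2": 0.5}, {"t2": 0.2, "t3": 0.8}]
    failure_regions = [{"t2"}, {"t3"}]
    Q = {"t1": 0.6, "t2": 0.3, "t3": 0.1}
    c = {"t1": 0.0, "t2": 20.0, "t3": 5.0}

    print(rr(subdomains, failure_regions, Q, c))
    # {t2}: detected with prob 1 - 0.5*0.8 = 0.6, contributing 0.6*20*0.3 = 3.6
    # {t3}: detected with prob 0.8, contributing 0.8*5*0.1 = 0.4; total 4.0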

As above, we will restrict attention to probability distributions on the subdomains that are inherited from a probability distribution on the entire domain. We will denote the expected risk reduction in such cases by RR(C, P, S, Pr). In the remainder of this section, we show that a result analogous to Theorem 1 holds for the expected risk reduction, and present a corollary analogous to Corollary 2.

Theorem 4. Let C1 properly cover C2 for program P with specification S, and let Pr be any probability distribution on the input domain D. Then

    RR(C1, P, S, Pr) \ge RR(C2, P, S, Pr).

Proof. Assume C1 properly covers C2 for program P and specification S. Let SD_{C2}(P, S) = {D^2_1, ..., D^2_n}, let SD_{C1}(P, S) = {D^1_1, ..., D^1_m}, and let

    M = {D^1_{1,1}, \ldots, D^1_{1,k_1}, \ldots, D^1_{n,1}, \ldots, D^1_{n,k_n}}

be a multi-set such that M is a sub-multi-set of SD_{C1}(P, S) and

    D^2_1 = D^1_{1,1} \cup \ldots \cup D^1_{1,k_1},
        ...
    D^2_n = D^1_{n,1} \cup \ldots \cup D^1_{n,k_n}.

Let

    f^ℓ_i = \sum_{t \in D^2_i \cap F_ℓ} Pr(t) / a_i


and let

    f^ℓ_{i,j} = \sum_{t \in D^1_{i,j} \cap F_ℓ} Pr(t) / a^1_{i,j}.

That is, f^ℓ_i denotes the probability that a test case selected from D^2_i detects failure region F_ℓ, and f^ℓ_{i,j} denotes the analogous probability for D^1_{i,j}. We want to show that

    \sum_{ℓ=1}^{p} [1 - \prod_{i=1}^{n} \prod_{j=1}^{k_i} (1 - f^ℓ_{i,j})] Q(F_ℓ) c(F_ℓ)
        \ge \sum_{ℓ=1}^{p} [1 - \prod_{i=1}^{n} (1 - f^ℓ_i)] Q(F_ℓ) c(F_ℓ),

i.e., that the risk reduction due to those C1 subdomains involved in the covering exceeds the risk reduction due to all of the C2 subdomains. The remaining C1 subdomains, if any, will only further increase the risk reduction ability of C1. It suffices to show that for all i, ℓ,

    \prod_{j=1}^{k_i} (1 - f^ℓ_{i,j}) \le (1 - f^ℓ_i).

In other words, we need to show that the probability of not detecting a given failure region with one test case drawn from D^2_i is greater than or equal to the probability of not detecting that failure region with one test case drawn from each of the C1 subdomains used to cover D^2_i. That is, we can consider the terms arising from each failure region separately; then, within each such term, we can consider separately the relative contributions to the products on the left- and right-hand sides from each D^2_i and its corresponding covering C1 subdomains.

Note that the expressions in this inequality are similar in form to those occurring in the M measure. However, there are some important differences to be noted. Whereas M considers the probability that at least one test case will detect any failure region, these expressions consider the probability that at least one test case will detect a particular failure region. Furthermore, whereas M assumed test cases were chosen according to the uniform distribution, here we assume they are chosen according to an arbitrary inherited distribution.

We can massage these expressions into exactly the form of the analogous expressions arising in the proof of Theorem 1 in Frankl and Weyuker (1993b). To simplify the notation, let D denote D^2_i and let D_j denote D^1_{i,j}. Let Pr(t_i) = x_i/y_i, where x_i and y_i are integers,4 and let d be the least common multiple of the y_i's. We can transform the input domain into a space with d points and emulate Pr with a uniform distribution on this space as follows. Set

    d_j = d \sum_{t_i \in D_j} x_i/y_i,

    m_j = d \sum_{t_i \in D_j \cap F_ℓ} x_i/y_i,

    m = d \sum_{t_i \in D \cap F_ℓ} x_i/y_i.

Then m/d is the probability that a test case selected from D will detect failure region F_ℓ, and m_j/d_j is the probability that a test case selected from D_j will detect it. It suffices to show that

    1 - \prod_{j} (1 - m_j/d_j) \ge m/d,

which follows from the proof of Theorem 1 in Frankl and Weyuker (1993b).  □

4 If any of the probabilities are irrational, they can be approximated closely enough by rationals so as to make the inequality hold.

We can again particularize this result to several well-known testing criteria.

Corollary 3. For any program (in a particular large class of programs), any specification, and any probability distribution Pr on D,

    RR(required-k-tuples+, P, S, Pr) \ge RR(all-uses, P, S, Pr)
        \ge RR(all-p-uses, P, S, Pr) \ge RR(decision-coverage, P, S, Pr);

    RR(ordered-context-coverage, P, S, Pr) \ge RR(context-coverage, P, S, Pr)
        \ge RR(decision-coverage, P, S, Pr);

    RR(multiple-condition-coverage, P, S, Pr) \ge RR(decision-coverage, P, S, Pr); and

    RR(decision-condition-coverage, P, S, Pr) \ge RR(decision-coverage, P, S, Pr).

5. Conclusion

We have extended the results of Frankl and Weyuker (1993a) and Frankl and Weyuker (1993b), which investigated concrete ways of comparing testing methods based on their fault-detecting ability. We have extended these results in two directions: considering more general test selection strategies, and considering measures of test effectiveness that are more directly related to the risk of the software under test than those used in the earlier work.


We began by generalizing the test selection strategies considered. In Frankl and Weyuker (1993a) and Frankl and Weyuker (1993b), all strategies were assumed to be subdomain-based, with test cases independently randomly selected from each subdomain using a uniform distribution on the subdomain. In this paper, we relaxed the requirement that all test case selection be based on a uniform distribution, and allowed the selection of test suites by independently randomly selecting one test case from each subdomain using a different probability distribution for each subdomain, requiring only that these distributions be related in the sense that they are all ``inherited'' from a common distribution on the whole domain.

We introduced two measures of software testing effectiveness related to software risk, expected detected risk and expected risk reduction, and investigated whether one could guarantee that one testing technique is better than another according to these measures. We showed that if C1 properly covers C2 for program P and specification S, and if test suites are selected by independently selecting one test case from each subdomain according to distributions inherited from a common distribution, then C1 is guaranteed to perform at least as well as C2 according to these risk-related measures. Note that the fact that C1 selects larger test suites than C2 is not enough to guarantee this, nor is the fact that C1 subsumes C2. We expect this to be a very useful result for determining which subdomain-based testing strategy to use when minimization of risk is a primary consideration for a project.

Acknowledgements

We are grateful to Sandro Morasca for making several interesting suggestions.


Appendix A

Lemma A.1. If test suites are selected by independently randomly selecting one test case from each subdomain, using distribution Pr_i to select from subdomain D_i, then the expected risk detected during testing is

    EDR(C, P, S, Pr_1, \ldots, Pr_n) = \sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t) c(t).

Proof. Recall that the expected detected risk is

    E(DR(P, S, T)) = \sum_{T} Prob(T) \sum_{t \in T} c(t).

Let s_i denote the size of D_i, and let {t_{i,1}, ..., t_{i,s_i}} denote the elements of D_i. With this test selection strategy, the test suites are of the form {t_{1,j_1}, ..., t_{n,j_n}}, where t_{i,j_i} is drawn from subdomain D_i with probability Pr_i(t_{i,j_i}). Since the selections from the subdomains are independent, the probability of selecting a given test suite is

    Prob({t_{1,j_1}, \ldots, t_{n,j_n}}) = \prod_{i=1}^{n} Pr_i(t_{i,j_i}).

The collection of all test suites is obtained by considering all possible combinations of one test case from each subdomain, so

    E(DR(P, S, T)) = \sum_{j_1=1}^{s_1} \cdots \sum_{j_n=1}^{s_n} (\prod_{i=1}^{n} Pr_i(t_{i,j_i})) (\sum_{k=1}^{n} c(t_{k,j_k}))
                   = \sum_{j_1=1}^{s_1} \cdots \sum_{j_n=1}^{s_n} \sum_{k=1}^{n} c(t_{k,j_k}) (\prod_{i=1}^{n} Pr_i(t_{i,j_i})).

We will now consider each particular test case separately, and its ``contribution'' to the expected detected risk, beginning with t_{1,1}, selected from subdomain D_1. This involves setting j_1 = 1 and k = 1. Collecting all the terms involving t_{1,1} yields

    X_{1,1} = \sum_{j_2=1}^{s_2} \cdots \sum_{j_n=1}^{s_n} c(t_{1,1}) Pr_1(t_{1,1}) \prod_{i=2}^{n} Pr_i(t_{i,j_i})
            = Pr_1(t_{1,1}) c(t_{1,1}) \sum_{j_2=1}^{s_2} \cdots \sum_{j_n=1}^{s_n} \prod_{i=2}^{n} Pr_i(t_{i,j_i})
            = Pr_1(t_{1,1}) c(t_{1,1}),

since the last sum of products represents the probability of selecting one test case from each of the remaining subdomains, which is equal to one. Each test case from each subdomain contributes a term analogous to X_{1,1}, so

    E(DR(P, S, T)) = \sum_{i=1}^{n} \sum_{j_i=1}^{s_i} X_{i,j_i}
                   = \sum_{i=1}^{n} \sum_{j_i=1}^{s_i} Pr_i(t_{i,j_i}) c(t_{i,j_i})
                   = \sum_{i=1}^{n} \sum_{t \in D_i} Pr_i(t) c(t).  □
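Because the domains considered here are finite, Lemma A.1 can also be verified by brute force: enumerate every possible test suite, weight its detected risk by its selection probability, and compare the result with the closed form. The concrete values below are invented for the illustration.

    from itertools import product
    from math import prod, isclose

    # Hypothetical subdomain distributions Pr_i and failure costs c(t).
    subdomains = [{"t1": 0.5, "t2": 0.5}, {"t2": 0.25, "t3": 0.75}]
    c = {"t1": 0.0, "t2": 10.0, "t3": 2.0}

    # Left side: sum over all test suites of Prob(T) * DR(P, S, T).
    lhs = 0.0
    for suite in product(*(Pr_i.items() for Pr_i in subdomains)):
        prob = prod(p for _, p in suite)    # independent selections
        dr = sum(c[t] for t, _ in suite)    # detected risk of this suite
        lhs += prob * dr

    # Right side: the closed form of Lemma A.1.
    rhs = sum(p * c[t] for Pr_i in subdomains for t, p in Pr_i.items())

    print(lhs, rhs, isclose(lhs, rhs))   # 9.0 9.0 True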

References

Boehm, B., 1989. Software risk management. In: Proceedings ESEC, Warwick, UK, September 1989, pp. 1–19.
Davis, M.D., Sigal, R., Weyuker, E.J., 1994. Computability, Complexity and Languages, second ed. Academic Press, New York.
Duran, J.W., Ntafos, S.C., 1984. An evaluation of random testing. IEEE Trans. Software Eng. SE-10 (7), 438–444.
Frankl, P.G., Hamlet, D., Littlewood, B., Strigini, L., 1998. Evaluating testing methods by delivered reliability. IEEE Trans. Software Eng. 24 (10), 586–601.


Frankl, P.G., Weyuker, E.J., 1993a. A formal analysis of the fault-detecting ability of testing methods. IEEE Trans. Software Eng., 202–213.
Frankl, P.G., Weyuker, E.J., 1993b. Provable improvements on branch testing. IEEE Trans. Software Eng. 19 (10), 962–975.
Gutjahr, W.J., 1995. Optimal test distributions for software failure cost estimation. IEEE Trans. Software Eng. 19 (10), 962–975.
Hall, E.M., 1998. Managing Software Systems Risk. Addison-Wesley, New York.
Hamlet, D., 1989. Theoretical comparison of testing methods. In: Proceedings ACM SIGSOFT Third Symposium on Software Testing, Analysis, and Verification. ACM Press, pp. 28–37.
Hamlet, D., Taylor, R., 1990. Partition testing does not inspire confidence. IEEE Trans. Software Eng. 16 (12), 1402–1411.
Leveson, N.G., 1995. Safeware: System Safety and Computers. Addison-Wesley, New York.
Ntafos, S.C., 1997. The cost of software failures. In: Proceedings IASTED Software Engineering Conference, pp. 53–57.
Sherer, S.A., 1992. Software Failure Risk. Plenum Press, New York.
Tsoukalas, M.Z., Duran, J.W., Ntafos, S.C., 1993. On some reliability estimation problems in random and partition testing. IEEE Trans. Software Eng. 19 (7), 687–697.
Weiss, S.N., 1989. Comparing test data adequacy criteria. Software Eng. Notes 14 (6), 42–49.
Weiss, S.N., Weyuker, E., 1998. An extended domain-based model of software reliability. IEEE Trans. Software Eng. SE-14 (10), 1512–1524.
Weyuker, E.J., 1996. Using failure cost information for testing and reliability assessment. ACM Trans. Software Eng. Meth. 5 (2), 87–98.
Weyuker, E.J., Jeng, B., 1991. Analyzing partition testing strategies. IEEE Trans. Software Eng. 17 (7), 703–711.
Weyuker, E.J., Weiss, S.N., Hamlet, D., 1991. Comparison of program testing strategies. In: Proceedings Fourth Symposium on Software Testing, Analysis, and Verification. ACM Press, pp. 1–10.
