
Fault Localization using Execution Slices and Dataflow Tests*

Hiralal Agrawal, Joseph R. Horgan, Saul London, and W. Eric Wong
{hira, jrh, saul, ewong}@bellcore.com


Abstract


Finding a fault in a program is a complex process which involves understanding the program's purpose, structure, semantics, and the relevant characteristics of failure-producing tests. We describe a tool which supports execution slicing and dicing based on test cases. We report the results of an experiment that uses heuristic techniques in fault localization.

Keywords: program slicing, fault detection, testing, debugging, block coverage, decision coverage, dataflow coverage

1 Introduction

The relationship between testing and debugging is an intimate one. Thorough testing requires an understanding not only of program requirements but also of the program implementation. To understand a program's implementation, the program's semantics and syntax must be understood. The tester, often the author of the program, exploits this understanding to design tests which are effective in eliciting program failures. Once the program fails, various debugging techniques and tools are employed to locate the bug. In this paper we describe a practical slicing tool for C language programs. We also describe an experiment that shows the usefulness of slicing in locating faults on a single complex C program. Although the experiment is limited, we believe that the tools and methods of debugging we use are widely applicable.

Weiser [9] originally conceived of the static program slice as an abstraction used by programmers in locating bugs. The literature of slicing is extensive and well surveyed in Tip [10]. Most simply, a static slice is the set of statements of a program which might affect the value of a particular output (or the value of a variable instance). A dynamic slice is the set of statements

Figure 1: Bugs, Slices, and Dices (the shaded region is the intersection of slices A and B)

of a program which do affect the value of the output on the execution of a particular input. Dynamic slicing, first proposed by Korel and Laski [4], was further explored by Agrawal and Horgan [2]. The use of dynamic slices in debugging was extensively investigated by Agrawal [1], where the notion of execution slices was defined. In the present paper an execution slice is the set of a program's basic blocks or a program's decisions executed by a test input. The set difference of two slices is known as a dice, a concept which first appears in Lyle and Weiser [8]. In this paper an execution dice is the set of basic blocks or decisions in one execution slice which do not appear in the other execution slice. A concept very close to our use of execution dicing for fault localization is suggested in Colofello and Cousins [3].

A strategy for fault localization using execution dices proceeds as follows. The fault resides in the slice

*Hiralal Agrawal, Joseph R. Horgan, Saul London, and W. Eric Wong are with Bell Communications Research, Morristown, NJ 07960.

1071-9458/95 $4.00 © 1995 IEEE

of a test case which fails on execution. Attention is focused on that slice and the rest of the program is ignored in searching for the fault. To narrow the search further, we assume that the fault does not lie in the slice of a test which succeeds. We then restrict our attention to the statements in the failed slice which do not appear in the successful slice, i.e., the dice. We postulate that the fault lies in this dice in most cases. By way of illustration, consider Figure 1. Here the leftmost oval represents slice B, the successful slice, while the rightmost oval represents slice A, the failure slice. As bug a is in the shaded intersection of the two slices, it is not in the dice of A and B, the rightmost crescent. However, bug b is in slice A and not in slice B and so will be in the dice of A minus B. The area of the dice of A minus B is smaller than the area of the slice A, and the programmer is saved the effort of searching the shaded area for bug b. The experiments we report in the remainder of the paper indicate that debugging based on this idea is both accurate and effective.

In the sections that follow we describe our experiment and the use of the tool xSlice. We then discuss the effectiveness of a simple dicing heuristic and how the precision of the heuristic can be improved. Finally, we conclude with some observations on the strengths, weaknesses, and applicability of our methods in debugging.

2 The method of our experiment

We conducted our experiment on a single program, the standard Unix sort program, a complex program of 914 lines. Using a dataflow testing tool, ATAC, we constructed a set of 56 tests which covered the sort program well. The sort program was seeded, one at a time, with 25 bugs (see Table 1), resulting in 25 erroneous variants of the sort program, each with a single bug. The 56 tests were executed on each of these 25 programs. Those bugs which were detectable by the tests allowed us to form dices of failing slices minus non-failing slices (see Table 2).

2.1 Generating tests with ATAC

ATAC is a tool designed to evaluate test set adequacy using dataflow coverage measures. ATAC computes dataflow coverage from static analysis of source code (represented in a ".atac" file) and dynamic analysis of execution paths (represented in a ".trace" file). The program constructs measured by ATAC include basic blocks, decisions, c-uses, and p-uses. A basic block is a sequence of C expressions in a program which contains no branches and whose expressions are always executed together. A decision exists for each possible value of a branch predicate. For instance, in if (*ep == '\0') the predicate *ep == '\0' may evaluate "true" (which means non-zero) or "false" (which means zero). A c-use is the relationship between a "definition" of a variable (e.g., x = 5) and the variable's subsequent use in a computational expression (e.g., x + y). A p-use is the relationship between a definition of a variable, the use of the variable in a subsequent predicate (e.g., if (x > 10)), and a decision on that predicate (e.g., "false"). ATAC measures how well a set of tests covers a program by counting the number of basic blocks, decisions, c-uses, and p-uses executed while running the tests on the program. ATAC also helps the programmer construct tests by displaying the uncovered elements of the program. Coverage analysis may be performed for each test case, source file, C function, or various combinations of these. Multiple source files may be tested together or one at a time. ATAC and its uses are fully discussed in Horgan and London [5, 6] and Horgan et al. [7].

To use ATAC on sort-408.c, the faulty program constructed from the sort program by inserting the fault designated 408 in Table 3, one compiles and links the sort-408.c source into the sort-408 executable using ATAC in place of the standard compiler and linker. Tests are then constructed to attain good code coverage. The 56 tests of our study were constructed in 1993 to demonstrate the degree of coverage possible using ATAC on a program in a single day's effort. The following coverage is achieved by those 56 tests.

    > atac -s sort-408.trace sort-408.atac
      % blocks     % decisions    % C-Uses     % P-Uses
    ------------  ------------  ------------  ------------
     96(487/508)   89(351/394)   74(653/878)   74(566/760)

This means that 487 of the 508 basic blocks (96 percent) were covered in the program sort-408 during execution of the 56 tests. The coverages for decisions, c-uses, and p-uses were 89, 74, and 74 percent, respectively. We expect such a test set to be reasonably good at revealing faults in the program.

2.2 Constructing faulty sort programs

Table 1 contains the differences, for each of 25 bugs, between the original sort program on the left and the faulty program on the right, indexed by the line number where the change occurs. Our first experimentation in the effectiveness of slicing in debugging concerns these 25 bugs. Table 3 shows an additional 7 bugs we studied in a less systematic way. Consider the fault designated by line 408 in Table 3, which differs from sort by changing a single line. The original source file for sort, sort.c, is modified to replace line 408 in the original source file with the faulty line indicated in Table 3. We designate this faulty program sort-408. The Unix diff command below displays the differences between the original and modified source files.

    > diff sort.c sort-408.c
    408c408
    < ...
    ---
    >       while(--i)
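The percentages in the ATAC coverage summary above are rounded covered/total ratios. A small sketch reproducing them (the `pct` helper is ours, not part of ATAC):

```python
def pct(covered, total):
    # ATAC-style percentage: covered elements over total, rounded
    return round(100 * covered / total)

# Counts reported for the 56 tests on sort-408
measures = {
    "% blocks":    (487, 508),
    "% decisions": (351, 394),
    "% C-Uses":    (653, 878),
    "% P-Uses":    (566, 760),
}
for label, (covered, total) in measures.items():
    print(f"{label}: {pct(covered, total)}({covered}/{total})")
# prints 96, 89, 74, and 74 percent, matching the summary
```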

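Once a faulty variant is built, each test is classified by comparing the variant's output with the unmodified sort's output on the same input. A sketch of that classification (the test names and outputs here are invented for illustration):

```python
# Captured stdout per test for the unmodified program and a faulty variant
golden = {"t1": "a\nb\n", "t2": "b\nc\n", "t3": "x\ny\n"}
faulty = {"t1": "a\nb\n", "t2": "c\nb\n", "t3": "x\ny\n"}

# A test "fails" when the variant's output differs from the golden output
failures  = sorted(t for t in golden if faulty[t] != golden[t])
successes = sorted(t for t in golden if faulty[t] == golden[t])

print(failures, successes)  # → ['t2'] ['t1', 't3']
```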
2.3 The use of xSlice

xSlice is a slicing and dicing tool for standard or ANSI C language programs. The tool has a graphical interface which permits the user to compose and browse slices and dices. It allows arbitrary boolean expressions over execution slices associated with tests executed under ATAC to designate complex "dices." For example, one might specify the dice representing the intersection of all failing test cases minus the union of all successful test cases. In this paper we consider only simple dices representing the set difference of two slices.

For instance, Figure 2 is an xSlice monochrome display showing the individual basic block coverages of some of the 56 tests on sort. It also singles out test D09.1, on which the faulty program sort-408 succeeds, and D05.1, on which sort-408 fails to execute as the original sort program does. Figure 2 also shows that D05.1 is x-marked and that D09.1 is d-marked. The highlighted code in Figure 3 is part of the execution slice of test D05.1 in sort-408. The fault which caused the failure of sort-408 is highlighted in this slice. The bitmapped scroll bar at the left of the display of Figure 3 shows a thumbnail sketch of the slice as it lies in the entire sort-408.c file. The highlighted slice is about 15% of the entire program.

The dice of D09.1 in D05.1 (the set complement of the slice of D09.1 in the slice of D05.1) is shown in Figure 4. Since the test D05.1 failed, the slice of D05.1, in all likelihood, will "contain" the fault which induced the failure. Moreover, because D09.1 did not fail, the fault is unlikely to be in the slice of D09.1. Therefore, the fault is likely to be in the dice of D09.1 in D05.1. This heuristic, as we shall see, seems to be a good one. In fact, in this case, the fault is in the clause while(--i), which

occurred in both the slice of D05.1 and the slice of D09.1 and is, therefore, subtracted from the dice of D05.1 with respect to D09.1. The slices and dices displayed to this point are basic block slices and dices. That is, only the basic block structure of the program is considered in the construction of the slices and dices. There are more revealing views of the dice of D09.1 in D05.1. One is in Figure 5. In this view the programmer has chosen to look at the dice of D09.1 in D05.1 with respect to decision coverage rather than block coverage. The expression

    while(--i)

is highlighted. The programmer has also clicked on this expression and a small window is displayed which shows that the “true” branch of the decision is in the dice while the “false” branch is not. Evidently, test D05.1 evaluated --i to true but D09.1 evaluated --i to false and never true. The basic block

as the target of the true branch of while(--i), occurs in the block dice of D09.1 in D05.1, and the true decision of --i occurs in the decision dice of D09.1 in D05.1. The result is that, although while(--i) occurs in the slice of both D05.1 and D09.1 and so is absent from the basic block dice of D09.1 in D05.1, the true decision on --i is in the decision dice of D09.1 in D05.1. The scenario just traced might well represent the reasoning of a programmer in using xSlice in attempting to find a bug in a program. The main question of this paper is whether automatic heuristic aid based on dicing is likely to be helpful in fault localization. It is to that topic that we now turn.

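The while(--i) scenario shows why a decision-level dice can be strictly more revealing than a block-level one. A sketch under invented element names (the real sets come from ATAC's block and decision coverage):

```python
# Block-level slices: the block containing while(--i), here "W", is executed
# by both tests, so it drops out of the block dice.
blocks_fail = {"A", "W", "T"}   # failing test D05.1; "T" is the true-branch target
blocks_pass = {"A", "W"}        # passing test D09.1
block_dice = blocks_fail - blocks_pass

# Decision-level slices: only the failing test ever took the true branch of --i.
decisions_fail = {("--i", True), ("--i", False)}
decisions_pass = {("--i", False)}
decision_dice = decisions_fail - decisions_pass

print("W" in block_dice)               # → False: the faulty clause is hidden
print(("--i", True) in decision_dice)  # → True: the decision dice exposes it
```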
2.4 The effectiveness of the test set

We now ask how effective this test set of 56 is in detecting faults in the sort program.

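Underlying the comparisons that follow, an execution slice is a set of covered program elements and a dice is a set difference. A minimal sketch (the block identifiers are invented for illustration; xSlice computes the real sets from ATAC trace files):

```python
# Hypothetical execution slices: sets of basic-block ids covered by two tests
slice_fail = {1, 2, 3, 5, 7, 8}   # execution slice of a failing test
slice_pass = {1, 2, 3, 5, 6}      # execution slice of a passing test

# The dice is the set difference: blocks executed by the failing test but
# not by the passing one. The heuristic says the fault likely lies here.
dice = slice_fail - slice_pass

print(sorted(dice))  # → [7, 8]
```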
Table 1: Twenty-five faults in Sort
(columns: Line #, Original code, Modified code; one entry reads "delete this case")

Table 1 represents 25 faults seeded into the Unix sort utility. These faults were seeded for the purpose of assessing the fault detecting capacity of dataflow tests in Wong et al. [11]. In that study, some of the faults were found to be difficult to detect by even rather complete test sets. The faults we are concerned with affect a single line or a few well localized lines. Although these faults are not too complex, we believe that they represent bugs that programmers often encounter. Recall that the test set of 56 tests we employ to detect these faults is reasonably strong as measured below by ATAC.

      % blocks     % decisions    % C-Uses     % P-Uses
    ------------  ------------  ------------  ------------
     96(487/508)   89(351/394)   74(653/878)   74(566/760)

The 56 tests leave only 21 of the 508 basic blocks of the sort program uncovered. The dataflow def/use coverages of 74% c-uses and 74% p-uses are also reasonably strong. For this test set the faults at lines 211, 235, 450, 511, 689, and 896 as indicated in Table 2 were not detectable. That is, distinct programs seeded with the single faults designated 211, 235, 450, 511, 689, and 896 each executed exactly as did the unaltered sort program on the 56 tests. The first lesson of testing is that some bugs escape even good tests. The obvious corollary for debugging is that debugging does not begin until a test results in a failure. Thus, our slicing method will be of no avail in locating these 6 faults and we shall devote no further attention to them. Stronger testing than that achieved by the test set of 56 tests is necessary to assure high quality code in programs such as sort. Stronger tests will detect some of the faults undetected by this set. The new version of ATAC is much more effective in guiding creation of strong tests than was the version used to select the tests we study here.

Table 2: Pairwise slice differences on 25 faults
(columns: Fault Line #, Number of Dices, Average Size, Good Dices, Percent Good)

Table 2 represents the principal data collected in our experiment. The 56 tests were run on each of the faulty programs representing the faults in Table 1. The outputs of these tests were compared to the outputs of the unmodified sort program. Those tests which differed in output between sort and the faulty programs were designated "failures"; those which were identical in output were designated "successes." So the 56 tests define two sets of slices for each faulty program, the slices of test failures and the slices of test successes. We form all possible dices in which a successful slice is subtracted from a failed slice. If there are k failure tests for a given faulty program, then there are 56k - k^2 such pairwise dices.

Suppose that our heuristic for searching for a bug is to present a randomly chosen dice to the programmer as an initial guess at the bug's location. For instance, consider the faulty program associated with fault line 634 in Table 1,

    if(--*ipa != '\0')

which substitutes for the correct code

    if(*--ipa != '\0')

to create a pointer error. There were six failing tests and 50 successful tests; therefore there are 300 dices for line 634, as indicated in Table 2, that we must consider. All 300 dices are in the column of "Good Dices." This means that each of the 300 dices contained line 634, and thus expressed the fault. In the parlance of information retrieval, we would say that the "precision" of our retrieval, the percent of the dices which contain the bug at line 634, is perfect.

Figure 2: Selecting the dice of D05.1 minus D09.1
(xSlice display with by-type, by-file, by-testcase, and by-function views; 129 of 508 blocks)

"Average Size" indicates that these 300 basic block dices for line 634 average approximately 116 lines in the 914 line faulty version of sort, which we designate sort-634. This seems to indicate that the "signal to noise" ratio in the case of line 634 is high for our heuristic. That is, if the programmer were presented with a randomly selected dice, an average of 116 lines of code would have to be studied to find the fault. That is better than 914 lines but still not as precise a start in debugging as one would like. Once the programmer has run the 56 tests and determined that six fail and the rest succeed, our tool can present any one or several of the dices as a starting point for debugging. On average we are presenting only about 13% of the program for the programmer to study. What is more, in this case, every dice contains the bug. The reader can examine Table 2 and notice that for the faults other than 634 the average dice is smaller than that of 634, but the likelihood of retrieving one of these other faults is often less than 100 percent.

The notion of a good slice is not always so simple as it is for fault 634. Fault 782 represents the deletion of an entire basic block. There seems to be no natural location for this fault until one notices that control will always pass incorrectly to the "default" case of the switch statement in which fault 782 occurs. Thus, we assign the location of that default case as the fault

Figure 3: The basic block slice of sort.c with respect to D05.1
(xSlice display; File: bug2.c, Line: 390 of 914, Coverage: block, Highlight: zero)

Table 3: Seven additional faults in Sort
(columns: Line #, Original code, Modified code)

location in our slices. In general, the location of the fault is the basic block in which a fault is introduced or, as in the case of fault 782, the basic block to which control flow may incorrectly transfer with the omission of the code associated with the bug. In using slicing and dicing the programmer must understand that the appearance of an unexpected basic block in a slice or dice might mean the omission of some important code which would affect control flow and, thus, the slice. Where faults are alterations to existing code, the fault is apparent and understood or not according to the cleverness of the debugger. Our heuristic seems to work well for most of the faults of Table 1.

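The dice counts behind Table 2 follow directly from the pass/fail split of the 56 tests, and "Percent Good" is an information-retrieval precision. A sketch using the fault-634 numbers (the helper names are ours):

```python
def num_pairwise_dices(total_tests, failing):
    # one dice per (failing test, passing test) pair: k*(n-k) = n*k - k^2
    return failing * (total_tests - failing)

# Fault 634: 6 of the 56 tests fail, so 6 * 50 = 300 pairwise dices
dices = num_pairwise_dices(56, 6)
print(dices)  # → 300

# Precision: the fraction of those dices that contain the faulty line.
# For fault 634 every dice contained line 634, so precision is perfect.
good = 300
print(100 * good / dices)  # → 100.0
```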
2.5 Faults inserted during demonstrations

The seven faults of Table 3 were all collected during demonstrations of the xSlice tool. Each of these faults was inserted into the otherwise unchanged sort code by an observer of the demonstration. Then the 56 tests were run (55 for the fault designated by line 214 in Table 3) and the results were automatically checked against those of the faultless sort program to identify passing and failing tests. Typically, the hidden fault could be found by guessing at various dices. We did not employ the heuristic of forming all dices and presenting some for study. Instead, we guessed at dices and more often than not found the fault. Had we used the heuristic, the results in terms of the expected size of the dice and precision of the dice are indicated in Table 4. These results seem rather similar to those

of Table 2 and indicate that there is nothing about the original faults in Table 1 as compared to those in Table 3 that is favorable to our study. It is our impression that the kinds of faults suggested as "tricky" (e.g., 408 of Table 3) by programmers are tractable by our methods.

Table 4: Additional dices on faults of sort
(columns: Fault Line #, Number of Dices, Average Size, Good Dices, Percent Good)

2.6 Attempting to improve the precision

The method of slicing is clear. If a test fails, its slice either contains incorrect code, the fault, or is missing code it should have. Thus, using slices seems a good approach to aid in debugging. However, we have observed that slices are often larger than is desirable as a starting point for debugging. Notice that the slice of Figure 3 presents many lines of code for the programmer to consider. The heuristic method we propose to this point is straightforward. A dice constructed by subtracting the slice of a successful test case from that of a failed test case is presented to the programmer. Figure 4 is such a block dice and presents fewer lines for the programmer to consider. That dice failed to retrieve the fault, although the fault is retrieved in the decision dice of Figure 5.

Figure 4: The block dice D05.1 minus D09.1
(xSlice display; File: bug2.c, Line: 390 of 914, Coverage: block, Highlight: zero)

Table 5: Small dices on 25 faults
(columns: Fault Line #, Number of Dices, Average Size, Good Dices, Percent Good; undetectable faults marked "Fault undetected by tests")

These heuristics may be seen as a trade-off between the expected size and precision of the dice. To be certain of capturing the fault, the entire program must be presented. Nearly as good certainty of bug retrieval is achieved by execution slices, with the advantage that the slice size is substantially smaller than the whole

Figure 5: The decision dice of D05.1 minus D09.1
(xSlice display; File: bug2.c, Line: 390 of 914, Coverage: decision, Highlight: zero)

program. The dicing heuristic improves the size of the code presented at the expense of the likelihood of the fault being present. In Table 5 we present data where only dices of twenty-five lines or fewer are considered. A comparison of Table 5 and Table 2 shows a substantial improvement in average size and a significant decrease in the percent of slices which contain the fault. Indeed, the faults designated by lines 414, 608, and 806 have no dices this small. The faults 177 and $11 have no dices of this size which contain the fault. The dices of 206 and 439 show a marked decrease in finding the fault, but the remainder of the faults give results not entirely unlike those of Table 2.

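The small-dice heuristic of Table 5 can be sketched as a filter over the pairwise dices, trading certainty of capturing the fault for reading effort. The dice records below are invented for illustration, not the measured sort data:

```python
# Each record: dice size in lines, and whether the dice contains the fault
dices = [
    {"lines": 12, "contains_fault": True},
    {"lines": 90, "contains_fault": True},
    {"lines": 8,  "contains_fault": False},
    {"lines": 25, "contains_fault": True},
]

# Keep only dices of twenty-five lines or fewer, as in Table 5
small = [d for d in dices if d["lines"] <= 25]
good = sum(d["contains_fault"] for d in small)
avg_size = sum(d["lines"] for d in small) / len(small)

print(len(small), good, avg_size)  # → 3 2 15.0
```

Filtering shrinks the average dice the programmer must read, but some small dices no longer contain the fault, which is exactly the precision loss Table 5 reports.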
452 8’(8

I

I

.3

I

’!2 I

1b.bl

20.59

1

I

.3

1.l

I

I

Observations and Conclusions

Our experimentation presented several surprises. The first of these came when we found that we were able to locate faults inserted by independent, sometime malicious, observers of our demonstrations. To be fair on this count, when guessing and not using our heuristic, we struck out as often as we succeeded in attempting to find inserted bugs under these circumstances. Some bugs inserted by observers were undetectable by our tests and some were always absent in dices even when present in slices. Observe that faults 507, 782, 851, 882, and 910 of Table 1 are all faults due to missing code. It is a sometimes said in testing circles that structural or coverage testing, such as that used in this study, is unlikely to detect faults of “omission.” It is true that coverage testing does not detect unimplemented requirements. However, our coverage tests detect specific faults of code omission and our structure-based debugging methods do well a t locating them. This localization is possible when omissions are from a basic block which is in an execution slice. In the case of 882, the removal of an entire basic block associated with the case statement causes the “default” case to appear inappropriately in a slice which is a telltale sign of an error to the programmer. We have discussed the chislice tool, one of a set

LUU (,I

Table 6: Small dices on 7 faults

Similar data are presented for small dices for the faults of Table 3 in Table 6. 150

of tools which exploit static dataflow analysis of programs and traces of tests run on those programs to better understand programs. The slicing tool, in its present form, seems effective in locating faults. The dicing data presented here indicate that simple dicing heuristics are effective in localizing faults. We believe more sophisticated heuristics may be even more effective. Our experiments show that, for the Unix sort program, our methods are effective in locating realistic faults. The tool, xSlice, works efficiently on arbitrarily large and small C programs. Execution slicing is a practical debugging tool for the working programmer.

References

[1] H. Agrawal, Toward Automatic Debugging of Computer Programs, PhD thesis, Purdue University, 1991.

[2] H. Agrawal and J. Horgan, "Dynamic program slicing," In Proceedings of the ACM SIGPLAN'90 Conference on Programming Language Design and Implementation, SIGPLAN Notices 25(6), 1990, pp 246-256.

[3] J. S. Colofello and L. Cousins, "Towards automatic software fault location through decision-to-decision path analysis," In Proceedings of the 1987 National Computer Conference, 1987, pp 539-544.

[4] B. Korel and J. Laski, "Dynamic program slicing," Information Processing Letters, 29, 3, 1988, pp 155-163.

[5] J. R. Horgan and S. A. London, "Data Flow Coverage and the C Language," In Proceedings of the Fourth Symposium on Software Testing, Analysis, and Verification, Victoria, British Columbia, Canada, October 1991, pp 87-97.

[6] J. R. Horgan and S. A. London, "ATAC: A data flow coverage testing tool for C," In Proceedings of the Symposium on Assessment of Quality Software Development Tools, New Orleans, LA, May 1992, pp 2-10.

[7] J. R. Horgan, S. London, and M. R. Lyu, "Achieving Software Quality with Testing Coverage Measures," IEEE Computer, Vol. 27, No. 9, 1994, pp 60-70.

[8] J. R. Lyle and M. Weiser, "Automatic bug location by program slicing," In Proceedings of the Second International Conference on Computers and Applications, Beijing, China, June 1987, pp 877-883.

[9] M. Weiser, "Programmers use slices when debugging," Communications of the ACM, 25, 7, July 1982, pp 446-452.

[10] F. Tip, Generation of Program Analysis Tools, Institute for Logic, Language and Computation, Amsterdam, The Netherlands, 1995.

[11] W. Wong, J. Horgan, S. London, and A. Mathur, "Effect of Test Set Size and Block Coverage on the Fault Detection Effectiveness," In Proceedings of the Fifth International Symposium on Software Reliability Engineering, Monterey, CA, 1994, pp 230-238.
