Locating Program Features by using Execution Slices

W. Eric Wong, Bellcore, Morristown, NJ ([email protected])
Swapna S. Gokhale, University of California, Riverside, CA ([email protected])
Joseph R. Horgan, Bellcore, Morristown, NJ ([email protected])
Kishor S. Trivedi, Duke University, Durham, NC ([email protected])
Abstract
An important step toward efficient software maintenance is to locate the code relevant to a particular feature. In this paper we report a study applying an execution slice-based technique to a reliability and performance evaluator to identify the code which is unique to a feature, or is common to a group of features. Supported by tools called ATAC and Vue, the program features in the source code can be tracked down to files, functions, lines of code, decisions, and then c- or p-uses. Our study suggests that the technique can provide software programmers and maintainers with a good starting point for quick program understanding.
Keywords: program comprehension, program feature, execution slice, invoking test, excluding test, unique code, common code
1 Introduction

The modules in a well-designed software system should exhibit a high degree of cohesion and a low degree of coupling, such that each module addresses a specific subfunction of the requirements and has a simple interface when viewed from other parts of the program structure [15, 16]. Cohesion is a measure of the relative functional strength of a module and is a natural extension of the concept of information hiding. A cohesive module should ideally do just one thing. Coupling is a measure of interconnection among the modules in a program structure that depends on the interface complexity between the modules. Although high cohesion and low coupling are very desirable characteristics, achieving them in practice is extremely difficult. Low cohesion and high coupling are inevitable because most language parameters are ubiquitous, which results in program features being mixed together in the code. Programmers, in the early development stage of the system, may try to follow certain standards [8, 14] to ensure a clear mapping between each feature and its corresponding code segments. However, as development continues, the pressure to keep a system operational will probably lead to the exclusion of such
traceability. Thus, as the software ages it is more likely that the program features are implemented across modules which are seemingly unrelated. A programmer's/maintainer's understanding can deteriorate because of such delocalized structures, and this can often lead to serious maintenance errors [18, 21].

An important step toward efficient software maintenance is to locate the code relevant to a particular feature. There are two methods for achieving this [12]. First, a systematic approach can be followed that requires a complete understanding of the program behavior before any code modification. Second, an as-needed approach can be adopted that requires only a partial understanding of the program so as to locate, as quickly as possible, certain code segments that need to be changed for the desired enhancement or bug-fixing. The systematic approach provides a good understanding of the existing interactions among program features, but is often impractical for large and complex systems which can contain millions of lines of code. On the other hand, the as-needed approach, although less expensive and less time-consuming, tends to miss some of the non-local interactions among the features. These interactions can be critical for avoiding unexpected side-effects during code modification. Thus, the need arises to identify those parts of the system that are crucial for the programmer and maintainer to understand. A possible solution is to read the documentation, and studies have been conducted on the effective design of documentation to compensate for delocalized plans [18]. However, it is not uncommon to find inadequate and incomplete documentation of a system. Even when a document is available, programmers and maintainers may be reluctant to read it. Perhaps a faster and more efficient way of identifying the important segments of the code is to let the system speak for itself. This paper reports our study of this issue.

Both static and dynamic slices can be used as an abstraction to help programmers and maintainers locate the implementation of different features in a software system [3, 9, 10, 13, 19]. However, a static slice is less effective in identifying code that is uniquely related to a given feature because it, in general, includes a larger portion of the program code with a great deal of common utility code. On the other hand, collecting dynamic slices may consume excessive time and file space. In this paper we use an execution slice-based technique, where an execution slice is the set of program components (either basic blocks, decisions, c-uses, or p-uses; these terms are explained in Section 2.4) executed by a test input. Compared with static- or dynamic-based techniques, a clear advantage of our approach is that if the complete traceability of each test has been collected properly during testing (see Section 3 for details), such information can be reused directly without any additional sophisticated slicing analysis. As a result, not only can testers use code coverage to improve their confidence in quality, but programmers and maintainers can also benefit from the same information to efficiently locate code that implements features.
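To make the set operations used throughout this paper concrete, an execution slice can be modeled as a bit vector over numbered program components. The sketch below, in C (the language of the programs we study), is our illustration only: it does not depict ATAC's internal data structures, and all names (Slice, N_COMPONENTS, and so on) are ours.

    #include <limits.h>
    #include <string.h>

    #define N_COMPONENTS 11752   /* e.g., all basic blocks of the program under study */
    #define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
    #define N_WORDS ((N_COMPONENTS + WORD_BITS - 1) / WORD_BITS)

    /* An execution slice: bit i is set iff component i (a basic block,
       decision, c-use, or p-use) is executed by the test. */
    typedef struct { unsigned long bits[N_WORDS]; } Slice;

    static void slice_clear(Slice *s)       { memset(s->bits, 0, sizeof s->bits); }
    static void slice_mark(Slice *s, int i) { s->bits[i / WORD_BITS] |= 1UL << (i % WORD_BITS); }
    static int  slice_has(const Slice *s, int i) {
        return (int)((s->bits[i / WORD_BITS] >> (i % WORD_BITS)) & 1UL);
    }

With such a representation, comparing the slices of different tests reduces to cheap bitwise operations, which is part of what makes reusing per-test coverage data attractive.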
The effectiveness of using the execution slice-based technique depends on several important factors, as shown in Figure 1.
Heuristics: Several studies have been reported [20, 21, 22] using heuristics to identify code from the feature point of view. However, these studies touch only a very specific part of the problem. Since, in general, the notion of an execution slice-based technique is very abstract, it is incorrect to assume that a single heuristic is applicable to all scenarios. In fact, depending on the need, different heuristics have to be applied in mapping program features to code. As a result, developing novel heuristics and experimenting with them to explore their potential will provide maximum benefits. In this paper we present several heuristics to identify code that is either unique to a given feature (Section 2.1) or common to a group of features (Section 2.2). We also discuss, in Section 5.4, the impact of the feature implementation and of the invoking and excluding tests on selecting the heuristic.
Test cases: As indicated above, the code identified by the execution slice-based technique depends not only on which heuristic is applied and how the features are implemented, but also on how tests are selected. In Section 2.3, we discuss the relation between the invoking and excluding tests in terms of their execution slices. Such a relationship depends heavily on the goal, that is, whether we are looking for the code unique to a given feature or common to a group of features. In Section 5.4, we explore a work-around for when no focused invoking tests exist with respect to the feature being located, i.e., no test exhibits only this feature and no others. These discussions and explorations, either completely ignored or only slightly acknowledged in previous work, provide important guidance in selecting appropriate tests.
Granularity: Program features in our study are located by using execution slices with respect to both control flow and data flow analyses to provide different levels of detail, whereas other studies use only control flow. This implies we can present code that is unique to a feature or common to a group of features at different levels of granularity (basic blocks, decisions, c-uses, and p-uses) rather than only with respect to a branch or a line of code. The importance of locating feature-related code at finer granularity levels such as c-uses and p-uses is explained in Section 5.1.
Tool support: A very important factor for a technique to be applicable in real-life contexts is that it be supported by automated tools. The tools used in our study can generate the execution slice with respect to any given test. They also provide a graphical interface for
visualizing code related to a feature (or features) (see Figures 4 and 5 in Section 3). Compared with the tool in references [20, 21, 22], which only shows such code in an annotated ASCII interface, we believe these tools will make our technique more attractive to practitioners.

[Figure 1: Important factors of the execution slice-based technique: heuristics, test cases, granularity, tool support, and how features are implemented in the program.]

It is clear that there is a close relationship between the heuristic applied, the invoking and excluding tests selected, and the way in which a feature is implemented. Each of these three has a profound impact on the code identified by an execution slice-based technique. In addition to the heuristics, we provide a thorough discussion of the impact of the last two factors on the quality of the code identified, whereas most prior efforts have focused only on the definition, demonstration, and selection of heuristics.

The remainder of this paper is organized as follows. Section 2 explains the general concepts, including heuristics for finding code, selection of invoking and excluding tests, and division of programs into components. Section 3 describes tool support for the execution slice-based technique. Section 4 presents a case study to show the merits of our technique. Lessons learned from our study are discussed in Section 5. Our conclusions appear in Section 6.
2 General Concepts

A program feature can be viewed as an abstract description of a functionality given in the specification. One good way to describe the features of a given program is through its specification. For example, the specification of the UNIX wordcount program (wc) is to count the number of lines, words, and/or characters given to it as input. Based on this, we can specify three features (with respect to three functionalities): one which returns the number of lines, another which returns the number of words, and one which returns the number of characters.
Suppose programmers and maintainers understand the features of the program being considered and can determine whether a feature is exercised by a test. For a given program P and a feature F, an invoking test is a test which when executed on P shows the functionality of F; an excluding test is one that does not. For example, consider the UNIX wordcount program again, and suppose F is the functionality to count the number of lines. A test (say t1) such as "wc -l data" that returns the number of lines in the file data is an invoking test, whereas another test (say t2) "wc -w data" that gives the number of words (instead of the number of lines) in the file data is an excluding test. An invoking test is said to be focused on a given feature if it exhibits only this feature and no other features. Following the same example, t1 is also a focused test with respect to F, which counts the number of lines. But this is not true for the test "wc data" (referred to as t3), even though it also returns the number of lines in the file data. This is because, in addition to the number of lines, t3 also returns the number of words and characters. That is, t3 also exhibits the features which count the number of words and characters, respectively.

There are many ways in which execution slices of invoking and excluding tests may be compared to identify pieces of code that are related to F. For example, we can form a union of the execution slices of all the invoking tests to find a set of code that is used to implement F. One clear problem of this approach is that some code which has nothing to do with F may also be included unless all the invoking tests exhibit only F and no other features (i.e., all the invoking tests are focused on F). We can also create an intersection of these execution slices to identify code that is executed by every test which exhibits F. Since it is impossible, in general, to identify all the invoking and/or excluding tests for a given P and F, a practical alternative is to run P using a small, carefully selected set of tests, say T, with some exhibiting F and others not (ideally, we would like all such tests to be focused invoking tests on F). Hereafter, we consider tests in T instead of all possible tests for the program. Let S_invoking and S_excluding represent the program components that are executed by at least one test in T that exhibits F and by at least one test in T that does not exhibit F, respectively. Similarly, T_invoking is the set of components that are executed by every invoking test in T.
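Using the bit-vector sketch from the Introduction, S_invoking, S_excluding, and T_invoking reduce to unions and an intersection over per-test slices. A minimal sketch; the function names are ours:

    /* acc := acc UNION s (used to build S_invoking or S_excluding) */
    static void slice_union(Slice *acc, const Slice *s) {
        for (int w = 0; w < (int)N_WORDS; w++) acc->bits[w] |= s->bits[w];
    }

    /* acc := acc INTERSECT s (used to build T_invoking) */
    static void slice_intersect(Slice *acc, const Slice *s) {
        for (int w = 0; w < (int)N_WORDS; w++) acc->bits[w] &= s->bits[w];
    }

    /* Summarize n >= 1 invoking tests: S_invoking is the union of their
       slices and T_invoking is their intersection. */
    static void summarize_invoking(const Slice tests[], int n,
                                   Slice *s_inv, Slice *t_inv) {
        *s_inv = tests[0];
        *t_inv = tests[0];
        for (int k = 1; k < n; k++) {
            slice_union(s_inv, &tests[k]);
            slice_intersect(t_inv, &tests[k]);
        }
    }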
2.1 Heuristics for finding code unique to a feature

One simple approach is to compare the execution slice of just one invoking test with that of one excluding test. To minimize the amount of code identified as relevant, the invoking test selected may be the one with the smallest execution slice (in terms of the number of blocks, decisions, c-uses, or p-uses in the slice) and the excluding test selected may be the one with the largest execution slice. Another approach is to identify code that is executed by any invoking test but not by any excluding test (i.e., S_invoking - S_excluding). In other words, code that is in the union of the invoking tests, but not in the union of the excluding tests, is identified. In a third approach, similar to the second, we can identify program components that are commonly executed by all invoking tests but not by any excluding test (i.e., T_invoking - S_excluding). This implies that the identified program components are in the intersection of all invoking tests but not in the union of the excluding tests. As a result, we only select those program components that are always executed when the feature is exhibited, but not otherwise.
Depending on how features are implemented, programmers and maintainers may have to try all of these approaches (see Section 5.4), or construct their own, in order to find the best set of code related to a given feature. In the study reported in Section 4, we found the second approach works best for identifying code uniquely related to each of the five features in Figure 3. We summarize this approach as follows (a sketch in code follows the list):

1. Find S_invoking: this includes components that implement F; some of them may also be used to implement other features.

2. Find S_excluding: this includes components that implement features exhibited by the set of excluding tests. It may also include code segments that are common to F and other features.

3. Subtract S_excluding from S_invoking: components that are executed by tests that exhibit F, but not by those that do not exhibit F, are uniquely related to F.
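In the bit-vector sketch, the second approach is a single set difference over the summaries built above; the third approach is obtained by passing T_invoking in place of S_invoking:

    /* Components unique to feature F: executed by some invoking test but
       by no excluding test, i.e., out = S_invoking minus S_excluding. */
    static void unique_to_feature(const Slice *s_inv, const Slice *s_exc,
                                  Slice *out) {
        for (int w = 0; w < (int)N_WORDS; w++)
            out->bits[w] = s_inv->bits[w] & ~s_exc->bits[w];
    }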
2.2 Heuristics for finding code common to more than one feature

Code common to a pair of features, say F1 and F2, is code that is executed both by at least one test that exhibits F1 and not F2, and by at least one test that exhibits F2 and not F1. One way to find such code is to run the program using a few carefully selected tests which exhibit only F1 and no other features (or at least not F2), and a few other tests which exhibit only F2 and no other features (or at least not F1). We then take the intersection of the set of code related to feature F1 with the set of code related to feature F2 (i.e., the intersection of S_invoking for F1 and S_invoking for F2). Similarly, code common to all features can be identified by taking the intersection of the sets of code executed for each individual feature, that is, the intersection of S_invoking for Fi, i = 1, ..., n, where n is the number of features in the application and S_invoking for Fi is the union of the code executed by the tests which exhibit only Fi.
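Likewise, code common to a group of features is an intersection of per-feature unions. A sketch, assuming s_inv[f] already holds S_invoking for the (f+1)-th feature:

    /* Components common to n features: intersect each feature's S_invoking. */
    static void common_to_features(const Slice s_inv[], int n, Slice *out) {
        *out = s_inv[0];
        for (int f = 1; f < n; f++)
            for (int w = 0; w < (int)N_WORDS; w++)
                out->bits[w] &= s_inv[f].bits[w];
    }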
2.3 Selection of invoking and excluding tests

In theory, to find S_invoking or S_excluding, one may have to run all possible tests for the program. In practice, only a few carefully selected tests are needed. Different sets of code may be identified by different sets of invoking and excluding tests. Poorly selected tests will lead to inaccurate identification, by either including code that is not unique to a given feature or excluding code that should not be excluded. When the invoking tests for a given feature are selected, they should be focused, as much as possible, with respect to this feature. To find code unique to a feature, the excluding tests should be as similar as possible, in terms of the execution slice, to the invoking tests so that as much common code as possible can be filtered out. Similarly, while identifying code common to a group of features, one would like the invoking tests for a feature to be as dissimilar as possible, in terms of the execution slice, to the invoking tests for the other features in the group. This enables the exclusion of the maximum amount of uncommon code.

To illustrate this concept, we use the sample code in Figure 2. Note that the code is written in a free format to explain its functionality; it is not intended to follow the syntax of any computer programming language. Suppose we want to find the code that is uniquely used to compute the area of an equilateral triangle. We first construct an invoking test t1 that exhibits this feature and two excluding tests t2 and t3 that compute the area of an isosceles triangle and a rectangle, respectively. Clearly, t1 is closer to t2 than to t3. The difference between the execution slices of t1 and t2 shows that only statements s10 and s13 are unique to this feature, whereas additional code, such as statements s3 to s7, would also be identified (but should not be) if t3 were used in place of t2 as the excluding test. Furthermore, this example also indicates that we do not even need to use the feature that computes the area of a rectangle to find code that is unique to computing the area of an equilateral triangle. The ability to identify program components unique to a feature without the necessity of knowing all the program's features greatly enhances the feasibility of using the execution slice-based technique to quickly highlight a small number of program components that are important to programmers and maintainers following the as-needed program understanding strategy.
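"As similar as possible" can be quantified in several ways; one plausible choice, which is our illustration and not prescribed by the technique, is the Jaccard ratio between two slices. An excluding test scoring highest against the invoking test then filters out the most common code, just as t2 does relative to t3 above:

    /* Portable population count for one word. */
    static int popcount_word(unsigned long x) {
        int c = 0;
        while (x) { x &= x - 1; c++; }
        return c;
    }

    /* Jaccard similarity |A intersect B| / |A union B| of two slices,
       in [0, 1]; higher means more shared components. */
    static double slice_similarity(const Slice *a, const Slice *b) {
        long inter = 0, uni = 0;
        for (int w = 0; w < (int)N_WORDS; w++) {
            inter += popcount_word(a->bits[w] & b->bits[w]);
            uni   += popcount_word(a->bits[w] | b->bits[w]);
        }
        return uni ? (double)inter / (double)uni : 0.0;
    }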
2.4 Division of programs into components

Natural components of a C program might be files or functions within a file (since the program used in our study is written in C, see Section 4.1, we use the term function instead of procedure or subroutine), but in many cases such a classification may not be satisfactory. For example, code in a function may be used to implement more than one feature, which precludes that function from being unique to any of these features. To solve this problem, we need to further decompose the program. Four additional, finer categories are used in this paper: blocks, decisions, c-uses, and p-uses. If necessary, programs can also be decomposed into a broader category such as a subsystem, which may contain many files and functions.

[Figure 2: Sample code written in a free format (not reproduced here).]

A basic block, or a block, is a sequence of consecutive statements or expressions, containing no branches except at the end, such that if one element of the sequence is executed, all are. This definition assumes that the underlying hardware does not fail during the execution of a block. A decision is a conditional branch from one block to another. A definition of a variable is a statement (or an expression) that assigns a value to it, and a use of that variable is an occurrence of it in another statement (or expression). Uses are classified as c-uses if the variable appears in a computational statement (or expression) and p-uses if the variable appears inside a predicate. (It is not our intent to enumerate all possible ways to define or use a variable; readers interested in the details of how blocks, decisions, c-uses, and p-uses are defined in our study should refer to [6, 7].) Note that while a c-use is a pair of a definition and its corresponding use with respect to a variable, a p-use includes not only the definition and the use of a variable but also a decision on the predicate in which the variable is used. For example, the definition and the use of w at s18 and s19, respectively, in Figure 2 and the true branch (s20) constitute a p-use. Similarly, the same definition-use pair and the false branch (s22) make another p-use.

We now use the same code in Figure 2 to illustrate c-use components that are unique to the equilateral triangle feature. We compute the difference between the set of c-uses executed by a test for the equilateral triangle (t1 in Section 2.3) and that executed by a test for the isosceles triangle (t2 in Section 2.3). We find that the definition of the variable a via the read statement at s3 is uniquely used at s13 for this feature (i.e., the one to compute the area of an equilateral triangle). However, the definition of the variable s at s14 and its use at s15 have nothing to do with the feature. This type of information provides a finer view of whether a def-use pair of a variable has anything to do with a given feature.
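Because Figure 2 is not reproduced here, the short C fragment below (our own example, not the paper's sample code) shows the same component kinds in concrete syntax:

    int classify(int a, int b)
    {
        int s = a + b;    /* one basic block; a definition of s and c-uses of a, b  */
        if (s > 10)       /* a decision; paired with the definition of s above,     */
                          /* its true branch is one p-use, its false branch another */
            return s * 2; /* a c-use of s: the definition above, used in a computation */
        return 0;
    }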
3 Tool Support for the Execution Slice-Based Technique

Collecting the execution slice of each test input by hand can be very time-consuming and prone to errors. Therefore, a tool which automatically analyzes and collects such information is necessary before any studies can be conducted on large, complicated systems. In our study, we used Vue [1, 4] (a tool built on top of ATAC [6, 7]) for visualizing features in code. Given a program, ATAC computes its set of testable attributes (blocks, decisions, c-uses, and p-uses) and instruments it at compile time. Once the instrumentation is complete, an executable is built from the instrumented code. Each time the program is executed on a test, new trace information with respect to that test, in terms of how many times each block, decision, c-use, and p-use is executed by that test (i.e., the complete traceability matrix of that test), is appended to a corresponding trace file. With this information, the execution slice of a test can be represented in terms of the blocks (decisions, c-uses, or p-uses) executed by that test. Through the Vue interface, tests that exhibit the feature being examined are moved into the invoking category and those that do not, into the excluding category. However, not every test has to be categorized in this way. Some tests can stay in the default "don't know" category, which means we either do not care about these tests or simply do not know whether they exhibit the feature. Code identified using the heuristics discussed earlier is displayed in a graphical interface to allow programmers and maintainers to visualize where a feature is implemented in the program. Two examples of this appear in Figures 4 and 5. Both ATAC and Vue are part of Suds(TM), a Software Understanding System developed at Bellcore. More information about Suds can be found at http://xsuds.bellcore.com.
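Conceptually, each per-test trace is a vector of execution counts, and the test's execution slice is the set of components with a nonzero count. The record layout below sketches that idea only; it does not reflect ATAC's actual trace-file format:

    /* Hypothetical per-test traceability record: how many times each
       component was executed by one test. */
    typedef struct {
        char     test_name[64];
        unsigned count[N_COMPONENTS];
    } TraceRecord;

    /* A component is in the test's execution slice iff its count > 0. */
    static void slice_from_trace(const TraceRecord *t, Slice *out) {
        slice_clear(out);
        for (int i = 0; i < N_COMPONENTS; i++)
            if (t->count[i] > 0)
                slice_mark(out, i);
    }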
4 A Case Study

We now present a case study to show our experience of using the execution slice-based technique to identify code that is uniquely related to a feature, or is common to a pair or a group of features.
4.1 The target system

SHARPE, a Symbolic Hierarchical Automated Reliability and Performance Evaluator [17] which analyzes stochastic models, was used in this study. SHARPE contains 35,412 lines of C code in 30 files, and has a total of 373 functions. It was first developed in 1986 for three groups of users: practicing engineers, researchers in performance and reliability modeling, and students in engineering and science courses. Since then, several major revisions have been made to fix bugs and adopt new requirements. SHARPE provides a specification language and analysis algorithms for nine different model types, such as Markov Chains (MC), Markov Reward Models (MRM), Generalized Stochastic Petri-Nets (GSPN), Product-Form Queuing Networks (PFQN), and Fault Trees (FT). Various transient and steady-state measures of interest can be computed using the built-in functions in SHARPE.

  FEATURES   SUB-FEATURES
  MC         cdf, prob, tvalue
  MRM        sreward, exrss, exrt, cexrt, rvalue, prob, cdf
  GSPN       etok, prempty, util, etokt, premptyt, tputt, utilt, tavetokt, tavtputt, cdf
  PFQN       tput, rtime, qlength, util, mtput, mrtime, mqlength, mutil
  FT         cdf, pqcdf

Figure 3: Features of SHARPE

In this study, we view the specification of a model and the built-in functions that can be used for computing the measures of interest pertaining to that model as a feature. For example, the specification of Markov chains and the grouping of the built-in functions cdf, prob, and tvalue, which facilitate the analysis process, constitute a single feature, while the specific functions (cdf, prob, tvalue) form the subfeatures. The five features used in this study are shown in Figure 3. We will refer to MC, MRM, GSPN, PFQN, and FT along with their respective built-in functions as features F1, F2, F3, F4, and F5, respectively. A subfeature j of a feature i is denoted as Fi,j. For example, etok is denoted as F3,1 and prempty as F3,2. The subfeatures need not be unique to a particular feature and can be shared among features. An example of this is the subfeature cdf, which is shared by features F1, F2, F3, and F5 as F1,1, F2,7, F3,10, and F5,1.
4.2 Data collection

For each feature, a small set of invoking and excluding tests was carefully selected. Heuristics, as discussed in Sections 2.1 and 2.2, were used to find the blocks, decisions, c-uses, and p-uses that are unique to each feature, or are common to a pair of features or to all five features.
4.3 Observations

Data collected in our experiments, such as the identified files, functions, blocks, and decisions, were given to experts who were familiar with SHARPE for verification. The results are very promising: the identified program components are unique to a feature as they should be, shared by a pair of features, or common to all five features. No complete verification was done with respect to the identified c-uses or p-uses, because it is very difficult for humans to have a complete understanding of a complicated system like SHARPE at such fine granularity. Nevertheless, our experts did examine some of the identified c-uses and p-uses and agreed with the descriptions assigned to each of them.

Table 1: Number of blocks unique to F3,j, 1 ≤ j ≤ 10

[The per-file body of this table (rows for the 30 SHARPE source files, columns for the ten subfeatures of F3) was garbled in extraction and is omitted. Per-subfeature totals range from 18 blocks (0.15%) to 798 blocks (6.79%); 226 blocks (1.92%) are unique to F3,1.]

Note: A blank entry means no block in the corresponding file is unique to F3,j.
Based on these data, we analyzed the code unique to every subfeature for all five features. Due to space constraints, we report the results only for the subfeatures of F3; the choice of F3 was completely arbitrary. Table 1 shows the number of blocks identified on a per-file basis for the subfeatures F3,j, 1 ≤ j ≤ 10. As can be seen, a total of 226 blocks were identified as unique to feature F3,1 out of the 11,752 blocks that constitute the entire application. This implies only about 1.9% of the total blocks are identified as unique to F3,1. Blocks so identified are important to programmers and maintainers because they provide a good starting point for understanding how F3,1 is implemented. Analysis at different granularities also finds only a small percentage: 148 of the 7,093 decisions (2.09%), 418 of the 21,936 c-uses (1.91%), and 264 of the 15,122 p-uses (1.75%) are identified as unique to F3,1. A similar observation (i.e., only a small percentage is identified) applies to the other subfeatures of F3 as well. A file-wise summary of decisions, c-uses, and p-uses unique to the subfeatures of F3 is shown in Appendix A.

The code unique to each subfeature can be viewed via the graphical interface of Vue. For the purpose of illustration, the unique blocks and decisions identified for feature F3,1, based on the three steps discussed in Section 2.1, are shown in Figures 4 and 5.

[Figure 4: Display of blocks unique to feature F3,1. The highlighted blocks are unique to the feature.]

As for the subfeatures of F1, F2, F4, and F5, although detailed data are not shown here, analyses of these data yield observations similar to those for the subfeatures of F3: at every granularity level, only a small percentage of the program components defined at that level are identified as unique to the corresponding subfeature.
[Figure 5: Display of decisions unique to feature F3,1. For one highlighted decision the true branch, not the false branch, is unique to the feature; for another, the false branch, not the true branch, is unique.]

Code common to pairs of features was also analyzed. All five features were used for this purpose, giving us a total of ten pairs. Table 2 shows the number of blocks common to the various possible pairs of features for every file. Similar summaries of decisions, c-uses, and p-uses are presented in Appendix B. Such information can be very useful in preventing certain types of unexpected maintenance errors. For example, a programmer may make a change to a function with one feature in mind without realizing that such a change may also affect another feature. In our case, 65 blocks in file analyze.c are used to implement both features F1 and F2. As a result, a change in these blocks may affect not only F1 but also F2. After being shown these data, experts who know SHARPE well indicated that results collected using our technique could provide some surprising insights about the target program. This is especially true when we move the granularity to c-uses and p-uses, because it is very difficult to have a complete understanding of a complicated system like SHARPE at such fine granularity.

Next, we present another view by showing in Table 3 the number of blocks, decisions, c-uses, and p-uses common to all five features. This information can help programmers and maintainers understand whether a change in certain code will have a global impact on all the features. In addition, such code in some sense represents the "utility code" and may have a potential for reuse if the program were to be expanded at a later stage.
Table 2: Number of blocks common to each pair of features

[The per-file body of this table (rows for the 30 SHARPE source files, columns for the ten feature pairs F1/F2, F1/F3, F1/F4, F1/F5, F2/F3, F2/F4, F2/F5, F3/F4, F3/F5, and F4/F5) was garbled in extraction and is omitted. Pairwise totals range from 1,058 blocks (9.00%) to 3,237 blocks (27.54%).]

Note: The notation F1/F2 indicates code common to features F1 and F2. A blank entry means no common block in the corresponding file.
5 Lessons Learned

In this section, based on our case study, we consider some pros and cons of using the execution slice-based technique to map program features to code.
Table 3: Code common to all five features

[The per-file body of this table (number of blocks, decisions, c-uses, and p-uses per file) was garbled in extraction and is omitted. Totals: 788 of 11,752 blocks (6.71%), 468 of 7,093 decisions (6.60%), 888 of 21,936 c-uses (4.05%), and 569 of 15,122 p-uses (3.76%). A blank entry means no common code in the corresponding file.]

5.1 Identification of starting points for program understanding

The code identified using the execution slice-based technique (either unique to a feature or common to a group of features) can be used as a starting point for studying program features. This is especially useful for features that are implemented in many different modules of a large, complicated, but poorly documented system. However, it is also important to realize that our technique may not find all the relevant code that makes up a feature, because the tests used in computing the execution slices may be incomplete. In fact, to find the complete set of code for a given feature, one may have to use many more invoking and excluding tests, in contrast to our technique, which requires only a few carefully selected tests. Nevertheless, experts who know SHARPE well have indicated that the small number of program components identified at various granularity levels using our technique provided some surprising insights that could be very useful in preventing certain types of unexpected maintenance errors (see Section 4.3). A particularly interesting reaction of these experts was that they were amazed by the results at the c-use and p-use levels, as they had never had such detailed information before. We have not collected any real data on how information at such fine granularities can help programmers and maintainers avoid making unexpected errors. However, our intuition leads us to believe that knowing the possible impact on other features, while changing the definitions of certain program variables and how they are used with one feature in mind, is always a plus and a good practice in software maintenance.
5.2 Ease of use

One advantage of using the execution slice-based technique is that programmers and maintainers only have to compute the execution slices of a few carefully selected tests (some invoking and some excluding) and can ignore the others. This significantly reduces the amount of effort required to use this technique. Moreover, the process can be automated by using ATAC and Vue (see Section 3). Code so identified can be displayed in a user-friendly graphical interface (see Figures 4 and 5). All of this simplifies the transfer of this technique into the real world. Furthermore, ATAC differs from many other test coverage measurement tools in that it collects the complete trace information (in terms of how many times each block, decision, c-use, and p-use is executed) with respect to each test, whereas others remember only whether a line of code (or a decision) has been executed or not. With such detailed traceability, not only can programmers and maintainers effectively locate code that implements features, but testers can use the same set of information to compute code coverage and improve their confidence in quality. As a result, we can combine program understanding and code coverage measurement into an integrated process.
5.3 Need for more objective verification

An important part of our case study is to verify whether the identified program components have the descriptions we assigned. The ideal approach is to ask those who are familiar with the application to highlight the code segments they think are important to each feature. Such information then serves as the basis for the verification. An obvious difficulty of this approach is that different segments might be highlighted by different people, which raises another series of problems about how to summarize such divergent information. As explained earlier, our goal is not to find the complete set of code used to implement a feature. Instead, the objective is to provide a small number of carefully selected program components, which can be at various levels of granularity, as the starting point for programmers and maintainers to get a quick understanding of how a feature is implemented in a system. To accomplish this, we only have to ask experts who are well-versed in the implementation details of the application to confirm whether the identified code is either unique to a feature or common to a group of features, as predicted. We realize that a more objective way to verify the data collected from our experiment is desirable, and more effort should be devoted to this task.
5.4 Variations of code identified

As with any dynamic technique, the code identified in our study depends on which heuristic is applied, which itself is affected not only by how the features are implemented in the program but also by how invoking and excluding tests are selected.

We first discuss how the feature implementation can decide which heuristic should be used. Let us assume that we want to find code that is unique to a given feature. Suppose a feature (say F_a) can only be exhibited if either feature F_b or feature F_c is also exhibited. This implies that all the invoking tests for F_a must also exhibit at least F_b or F_c, and perhaps many other features. Under this condition, the best way to find the code that is uniquely related to F_a is to use T_invoking - S_excluding. On the other hand, if F_a is not bundled with F_b or F_c in the way we just described (i.e., F_a can be exhibited by itself without other features being exhibited simultaneously), we probably should use a different heuristic, such as S_invoking - S_excluding. Similar arguments apply to other cases, such as finding code that is common to a group of features.

Next, we explain how invoking and excluding tests can affect the decision of which heuristic should be used. Suppose we are looking for code unique to a given feature (say F_a). If we can easily identify invoking tests that exhibit F_a only and no other features (i.e., we can easily identify some focused invoking tests with respect to F_a), then it is better to use S_invoking - S_excluding. On the other hand, if we have trouble identifying tests that exhibit only F_a (i.e., it is difficult to obtain focused invoking tests with respect to F_a), we probably should use T_invoking - S_excluding. Once the invoking tests are selected, we can follow the guidelines in Section 2.3 to select excluding tests. That is, to find the code unique to a feature, we would like the invoking and the excluding tests to be as similar as possible, in terms of the execution slice, in order to filter out as much common code as possible. And, to find the code common to a group of features, we would like the invoking tests for a feature to be as dissimilar as possible from the invoking tests for the other features in the group, so that the maximum amount of uncommon code can be excluded.

In short, there is no universal heuristic good for all cases. Programmers and maintainers have to use their own judgment to determine which approach best meets their needs. Although different heuristics and different invoking and excluding tests give different mappings between features and code, our experience suggests that, in general, they all provide good starting points to help programmers and maintainers understand the system being analyzed. The only difference is that the program components identified give different (1) recall, defined as the percentage ratio of the number of identified program components to the total number of program components that should be identified, and (2) false positive rates, determined as the percentage ratio of the number of identified program components which do not have the descriptions we assigned to the total number of identified program components.
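Stated as code, with the component counts assumed to come from a verification oracle (the names are ours), the two measures read:

    /* Recall: identified components as a percentage of the components
       that should be identified. */
    static double recall_pct(int n_identified, int n_should_identify) {
        return 100.0 * n_identified / n_should_identify;
    }

    /* False positive rate: identified components that do not have the
       descriptions we assigned, as a percentage of all identified ones. */
    static double false_positive_pct(int n_misidentified, int n_identified) {
        return 100.0 * n_misidentified / n_identified;
    }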
5.5 Extension to program debugging

Two very costly activities in software maintenance are enhancements and the removal of development bugs [11, 23]. Although the emphasis of our study is on identifying code unique to a particular feature, or common to a pair or group of features, from the point of view of program understanding, a similar approach could be used for program debugging. Excluding tests would be those that result in successful executions of the program, while invoking tests would be those that cause the program to fail. The excluding tests are thus "successful tests" and the invoking tests are "failing tests." The code components that are executed by the failing tests but not by the successful tests are the most likely places for the location of the fault [2].
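Under the bit-vector sketches given earlier, this debugging variant reuses the unique-code heuristic verbatim, with failing tests playing the role of invoking tests:

    /* Fault-localization sketch: components executed by failing tests
       but by no successful test are the most suspicious [2]. */
    static void fault_candidates(const Slice *s_failing,
                                 const Slice *s_successful, Slice *out) {
        unique_to_feature(s_failing, s_successful, out);  /* S_fail minus S_pass */
    }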
5.6 Enhancing results by including information from static slices

Execution slices extract the code contained within the bodies of functions and methods. Some features may introduce new structs and classes, or may add a few attributes to existing structs and classes. To include such information, it would be necessary to examine some static slices in addition to the results obtained using execution slices. The bottom line is that this is a trade-off between following the as-needed program understanding strategy (as discussed in Section 1) to locate, as quickly and as easily as possible, certain code segments as starting points, and looking for all the code related to a given feature. A study is underway to develop a hybrid method which can effectively combine the above static information with the code identified by the execution slice-based technique.
6 Conclusions

In our study, the program components identified as unique to a feature were, in general, unique, as verified by experts who understood the application thoroughly. Our technique can also reveal code common to a group of features, which can assist developers in recovering a design description from an implementation description [5]. Although a more rigorous and objective verification process still needs to be developed, our experience indicates that the execution slice-based technique is simple, yet very informative. It could be a useful part of a programmer's/maintainer's tool-kit, providing a starting point for understanding a large, complicated software system. The technique is valuable for finding code components unique to a functionality; however, it may be less effective in identifying all the components that are relevant or important from the point of view of a functionality. In fact, it is dangerous to assume that all the code for a feature can be identified by only a few invoking and excluding tests. One of the greatest advantages of our methodology is that it is supported by tools, ATAC and Vue, which makes it a viable option for immediate application to large-scale systems. Also, for large and complicated systems, mere identification of unique (or shared) files, functions, blocks, and decisions may not be sufficient. Identification of unique (or shared) def-uses can provide an in-depth understanding of the application and help prevent maintenance errors that could occur due to subtle interactions among the various features, which can easily be overlooked. Our tools support the identification of unique (or shared) def-uses in addition to files, functions, blocks, and decisions.

The fact that maintenance requires understanding a very small percentage of a very large system is exemplified by the Year 2000 problem. The "date-sensitive" code has to be identified, understood, and modified to ensure a smooth transition from the twentieth to the twenty-first century. The code unique to this feature may be only a few lines in a system consisting of millions of lines of code. The Year 2000 problem drives home the significance of highlighting unique components of code. Once again, the success of the execution slice-based technique in identifying the "date-sensitive" code depends on how effectively the invoking and the excluding tests can be designed: invoking tests execute segments pertaining to the date portion of the code, while excluding tests execute segments pertaining to the non-date portion.

To conclude, our experience suggests the execution slice-based technique can be immediately applicable in industry. It can answer, although not perfectly, a very important and difficult problem in large legacy systems, that is, to provide software programmers and maintainers with a good starting point for quick program understanding.
Acknowledgements

The authors are extremely grateful to Dr. Robin Sahner for all her help with SHARPE. The authors would also like to thank all the Research Scientists on the Suds team in the Software Environment Research group at Bellcore for their effort in making ATAC and Vue available for this study.
References

[1] H. Agrawal, J. L. Alberi, J. R. Horgan, J. J. Li, S. London, W. E. Wong, S. Ghosh, and N. Wilde, "Mining system tests to aid software maintenance," IEEE Computer, pp. 64-73, July 1998.
[2] H. Agrawal, J. R. Horgan, S. London, and W. E. Wong, "Fault localization using execution slices and dataflow tests," in Proceedings of the Sixth IEEE International Symposium on Software Reliability Engineering, pp. 143-151, Toulouse, France, October 1995.
[3] T. Ball, "Software visualization in the large," IEEE Computer, pp. 33-43, April 1996.
[4] "Suds User's Manual," Bellcore, 1998.
[5] S. C. Choi and W. Scacchi, "Extracting and restructuring the design of large systems," IEEE Software, 7(1):66-71, January 1990.
[6] J. R. Horgan and S. A. London, "Data flow coverage and the C language," in Proceedings of the Fourth Symposium on Software Testing, Analysis, and Verification, pp. 87-97, Victoria, British Columbia, Canada, October 1991.
[7] J. R. Horgan and S. A. London, "ATAC: A data flow coverage testing tool for C," in Proceedings of the Symposium on Assessment of Quality Software Development Tools, pp. 2-10, New Orleans, LA, May 1992.
[8] "IEEE Guide to Software Requirements Specifications," ANSI/IEEE Std 830-1984, 1984.
[9] B. Korel and J. W. Laski, "Dynamic program slicing," Information Processing Letters, 29(3):155-163, 1988.
[10] B. Korel and J. Rilling, "Dynamic program slicing in understanding of program execution," in Proceedings of the Fifth International Workshop on Program Comprehension, pp. 80-89, Dearborn, MI, May 1997.
[11] B. P. Lientz and E. B. Swanson, "Software Maintenance Management," Addison-Wesley, New York, 1980.
[12] D. Littman, J. Pinto, S. Letovsky, and E. Soloway, "Mental models and software maintenance," in Empirical Studies of Programmers (E. Soloway and S. Iyengar, Eds.), Ablex Publishing Corp., Norwood, NJ, 1986.
[13] A. D. Malony, D. H. Hammerslag, and D. J. Jablonowski, "Traceview: A trace visualization tool," IEEE Software, pp. 19-28, September 1991.
[14] "Military Standard: Defense System Software Development (DOD-STD-2167A)," Department of Defense, February 1988.
[15] M. Page-Jones, "The Practical Guide to Structured Systems Design (Yourdon Press Computing Series)," Prentice Hall, Englewood Cliffs, New Jersey, 1988.
[16] R. S. Pressman, "Software Engineering: A Practitioner's Approach," McGraw-Hill, New York, 1997.
[17] R. A. Sahner, K. S. Trivedi, and A. Puliafito, "Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package," Kluwer Academic Publishers, Boston, 1996.
[18] E. Soloway, J. Pinto, S. Letovsky, D. Littman, and R. Lampert, "Designing documentation to compensate for delocalized plans," Communications of the ACM, 31(11):1259-1267, November 1988.
[19] M. Weiser, "Program slicing," IEEE Transactions on Software Engineering, SE-10(4):352-357, July 1984.
[20] N. Wilde and C. Casey, "Early field experience with the software reconnaissance technique for program comprehension," in Proceedings of the International Conference on Software Maintenance, pp. 312-318, Monterey, CA, November 1996.
[21] N. Wilde, J. A. Gomez, T. Gust, and D. Strasburg, "Locating user functionality in old code," in Proceedings of the International Conference on Software Maintenance, pp. 200-205, Orlando, FL, November 1992.
[22] N. Wilde and M. S. Scully, "Software reconnaissance: Mapping program features to code," Software Maintenance: Research and Practice, 7(1):49-62, 1995.
[23] E. Yourdon, "Introduction to the March 1994 issue," American Programmer, 7(3), March 1994.
Appendix A: Number of decisions, c-uses, and p-uses unique to F3,j, 1 ≤ j ≤ 10

Note: A blank entry means no unique decision, c-use, or p-use in the corresponding file.

[The per-file bodies of the three parts of this appendix were garbled in extraction and are omitted.

Part I (unique decisions): per-subfeature totals range from 13 (0.18%) to 480 (6.77%) of the 7,093 decisions; 148 decisions (2.09%) are unique to F3,1.

Part II (unique c-uses): per-subfeature totals range from 26 (0.12%) to 1,496 (6.82%) of the 21,936 c-uses; 418 c-uses (1.91%) are unique to F3,1.

Part III (unique p-uses): per-subfeature totals range from 15 (0.10%) to 910 (6.02%) of the 15,122 p-uses; 264 p-uses (1.75%) are unique to F3,1.]
Appendix B: Number of decisions, c-uses, and p-uses shared by each pair of features

Note: The notation Fi/Fj indicates decisions, c-uses, or p-uses shared by features Fi and Fj. A blank entry means no shared decision, c-use, or p-use in the corresponding file.

[The per-file bodies of the three parts of this appendix were garbled in extraction and are omitted.

Part I (shared decisions): per-pair totals range from 579 (8.16%) to 1,667 (23.50%) of the 7,093 decisions.

Part II (shared c-uses): per-pair totals range from 1,156 (5.27%) to 4,739 (21.60%) of the 21,936 c-uses.

Part III (shared p-uses): per-pair totals range from 719 (4.75%) to 2,780 (18.38%) of the 15,122 p-uses.]