Co-Slicing for Program Comprehension and Reuse
Ran Ettinger
IBM Haifa Research Lab
[email protected]
Abstract

When trying to understand certain behaviors of a given program, a programmer's job can be facilitated by a slicing tool, which helps focus attention on the relevant subprogram. After, or even instead of, reading a selected program slice, a programmer might be interested in studying its complement, i.e. the rest of the program. But what is that complement? Simply taking all statements outside the slice will not do, as some of the slice's statements might be relevant too. Instead, one might ask which statements of a given slice can be removed from the original program, assuming the sliced results are already available. The removal of such redundant statements would yield a slice's complement. This paper introduces the concept of a complement-slice, or simply co-slice, and demonstrates its use in both program comprehension and reuse. Moreover, the paper presents a provably correct co-slicing algorithm for a simple programming language, and outlines a corresponding solution for the general case.
1. Introduction

Program slicing [11], the study of meaningful subprograms, is a well-established discipline with known applications in, among other areas, program comprehension and reuse. When trying to understand certain behaviors of a given program, a programmer's job can be facilitated by a slicing tool, which helps focus attention on the relevant subprogram. However, after, or even instead of, reading a selected program slice, a programmer might be interested in studying its complement, i.e. the rest of the program. This paper is interested in such complements. It introduces the concept of a complement-slice, or simply co-slice, and demonstrates its use in both program comprehension, for understanding programs, and reuse, for coping with change (e.g. through refactoring [10, 3]). Moreover, the paper presents a provably correct co-slicing algorithm for a simple programming language, and outlines a corresponding solution for the general case. The solution involves backward propagation of assertions, simple substitution of variables, and traditional slicing.

The result of slicing for sets of variables, from the end of a given program, as we do in this paper, can be defined as follows:

Definition 1 (A slice). Let S be a given statement and let V be a set of variables of interest. A statement S' is a slice of S on V if, for any input on which S terminates, S' will terminate too, and with the same result held in all program variables V.

In a similar manner, the novel concept of a co-slice can be defined as follows:

Definition 2 (A co-slice). Let S be a given statement and let V be a set of variables of no interest. (That is, the final value of the variables V, and the code for achieving it in S, can be avoided.) A statement S' is a co-slice of S on V if, for any input on which S terminates, S' will terminate too, and with the same result held in all program variables outside V. Moreover, suppose the result of V is available for reuse through a corresponding set of fresh variables fV. A co-slice S' on V with fV is free to use the final value of the variables in V through the corresponding elements of fV.

Note that a co-slice S' is not expected to preserve behavior with regard to the co-sliced variables V themselves. A co-slicing algorithm will be expected to find the smallest possible such S'. This will be achieved by maximizing reuse of the final values of V and, of course, by slicing for the complementary set of potentially modified variables. Such an algorithm will be presented later in the paper, in Section 3, and some applications will be discussed in Section 4. Section 5 will evaluate the approach and compare it with some related work, and Section 6 will conclude. But first, we turn to a motivating example.
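To make the two definitions concrete, the following minimal sketch models statements as Python functions over a state dictionary; the two-variable program, the function names, and the fresh variable fv are illustrative, not from the paper.

```python
def S(state):
    # the original statement: computes both v and w
    state["v"] = state["x"] * 2
    state["w"] = state["v"] + state["y"]

def slice_on_v(state):
    # a slice of S on {v}: preserves the final value of v (Definition 1)
    state["v"] = state["x"] * 2

def coslice_on_v(state):
    # a co-slice of S on {v} with fv: free to read the final value of v
    # through the fresh variable fv, and need not preserve v itself (Definition 2)
    state["w"] = state["fv"] + state["y"]

s1 = {"x": 3, "y": 4}
S(s1)                                    # s1["v"] == 6, s1["w"] == 10

s2 = {"x": 3, "y": 4}
slice_on_v(s2)
assert s2["v"] == s1["v"]                # same result in the variables V

s3 = {"x": 3, "y": 4, "fv": s1["v"]}     # fv holds the final value of v
coslice_on_v(s3)
assert s3["w"] == s1["w"]                # same result in all variables outside V
```

Note that the co-slice is smaller than S precisely because the final value of v is assumed to be available through fv.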
2. Understanding tangled code: a motivating scenario

Suppose we are maintaining a system that includes the following program, borrowed with slight modifications from Lakhotia and Deprez [9] (where a transformation for splitting tangled computations, called tuck, is presented):

    1   i = 0;
    2   while (i < days)
    3     cin >> sale[i++];
    4   if (shouldProcess) {
    5     i = 0;
    6     totalSale = 0;
    7     totalPay = 0;
    8     while (i < days) {
    9       totalSale = totalSale+sale[i];
    10      totalPay = totalPay+0.1*sale[i];
    11      if (sale[i]>1000)
    12        totalPay = totalPay+50;
    13      i = i+1;
    14    }
    15    pay = totalPay/days+100;
    16    profit = 0.9*totalSale-cost;
    17  }

and suppose we wish to understand the computation of profit. A slicing tool may be used to focus our attention on its slice, from the end of scope, yielding

    1   i = 0;
    2   while (i < days)
    3     cin >> sale[i++];
    4   if (shouldProcess) {
    5     i = 0;
    6     totalSale = 0;
    8     while (i < days) {
    9       totalSale = totalSale+sale[i];
    13      i = i+1;
    14    }
    16    profit = 0.9*totalSale-cost;
    17  }

A more coherent and sensible computation could be isolated by slicing for all possibly-modified variables other than the previously investigated profit, i.e. by slicing for i, sale, cin, totalSale, totalPay and pay. However, this would yield the whole program but line 16. Instead, it could be useful to identify all variables whose full computation has already been studied along with that of profit: those are the variables whose slice is included in the said slice. Following Gallagher [4], who investigates the dependences between various slices, we draw a graph depicting slice inclusion, where an edge leads upwards from x to y when the slice of x is included in the slice of y:(1)

        pay         profit
         |             |
     totalPay      totalSale
          \   \   /   /
           sale, i, cin

(1) Gallagher's slices are slightly different. His 'output-restricted decomposition slices' slice not only from the end of the program, for each variable, but also from all printing points. Instead, I prefer to consider the output stream (as well as the input) as any other program variable.

Following the graph, we observe that, in profit's case, the slices we have already encountered are those of i, cin, sale and totalSale, along with that of profit itself. Now slicing for the remaining possibly-modified variables, totalPay and pay, would yield the smaller

    1   i = 0;
    2   while (i < days)
    3     cin >> sale[i++];
    4   if (shouldProcess) {
    5     i = 0;
    7     totalPay = 0;
    8     while (i < days) {
    10      totalPay = totalPay+0.1*sale[i];
    11      if (sale[i]>1000)
    12        totalPay = totalPay+50;
    13      i = i+1;
    14    }
    15    pay = totalPay/days+100;
    17  }

where lines 6 and 9 are successfully eliminated, along with (the previously removed) line 16. Note, however, that the slices of i, sale and cin are still included in this complement. (This inclusion is also confirmed by observing the respective paths on the slice-inclusion graph.)

Alternatively, or once we are done, we might be interested in all but that computation of profit. Simply deleting the profit slice from the original would yield the following unintelligible code:

    7   totalPay = 0;
    10  totalPay = totalPay+0.1*sale[i];
    11  if (sale[i]>1000)
    12    totalPay = totalPay+50;
    15  pay = totalPay/days+100;

In response, our final improvement will be to assume the results of all co-sliced variables are readily available for use in the complement. In our case, this approach may yield the following co-slice of profit and the included i, sale, cin and totalSale:

    4   if (shouldProcess) {
    5     i = 0;
    7     totalPay = 0;
    8     while (i < days) {
    10      totalPay = totalPay+0.1*sale[i];
    11      if (sale[i]>1000)
    12        totalPay = totalPay+50;
    13      i = i+1;
    14    }
    15    pay = totalPay/days+100;
    17  }

Note how this time lines 1-3, i.e. the slices of sale and cin, have been removed too, hence yielding a smaller subprogram than the corresponding slice (of the complementary set of possibly-modified variables, {totalPay, pay}). In general, a co-slice of some selected variables is potentially smaller (but never larger) than the slice of the complementary set. Further note how adding totalPay to the set of co-sliced variables, e.g. after studying its own slice, would leave us with lines 4, 15 and 17, thus completely eliminating the loop of lines 5-14. Similarly, the co-slice of all variables whose slice is included in some other slice (i.e. the variables that are non-maximal in the slice-inclusion relation, again borrowing Gallagher's terminology [4]), namely the co-slice of {totalPay, totalSale, i, sale, cin}, is simply

    4   if (shouldProcess) {
    15    pay = totalPay/days+100;
    16    profit = 0.9*totalSale-cost;
    17  }

How do we automate the identification of such co-slices? In what follows, a co-slicing algorithm is presented. This last co-slicing example will be recalled when illustrating the various steps of that algorithm.

3. A co-slicing algorithm

3.1. Rationale

The goal of our co-slicing algorithm is to maximize reuse of the available final values, before slicing for the complementary set. However, a simplistic approach of substituting all uses of the selected variables V with their corresponding final values, fV, will not do, since some of the uses refer to intermediate, rather than final, values. Accordingly, the final-value references must be identified and substituted before slicing. Finally, after slicing is performed, for all non-selected variables, some of the earlier final-use substitutions may be undone.

3.2. The algorithm

1. Let S be a given (possibly compound) statement, acting as the program scope for co-slicing.
2. Let V be a given set of variables whose computation is of no interest, and on which S is to be co-sliced.
3. Let fV be a given set of fresh variables, not occurring in S, corresponding to the co-sliced variables V. Assume that, for each initial state, the elements of fV hold, on entry to (and in fact throughout) S, the final value of the corresponding elements of V. That is, when started in such an initial state, S is guaranteed to terminate, if at all, in a state satisfying V = fV.
4. Let S' be the program statement S after final-use substitution (see below) of the variables V with the corresponding fresh variables fV.
5. Let slice be a given slicer that, when applied to any statement T and set of variables U, produces another statement T' which is a correct slice of T on U (from the end of T).
6. Let coV be the set of all program variables outside V whose value may be modified by an execution of S (or, equivalently, of S').
7. Let S'' be the result of applying slice to S' and coV, i.e. a correct slice of the final-use-substituted statement S' on all non-selected, possibly modified variables coV.
8. Let V1 be the set of all variables in V that are not modified in S''.
9. Let fV1 be the subset of fV corresponding to the subset V1 of V.
10. Let S''' be the statement S'' after normal substitution of all variables fV1 with the corresponding program variables V1. This S''' is the result of the algorithm, i.e. the co-slice of S on V with fV.
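Under strong simplifying assumptions (straight-line code only, each statement modeled as a pair of a target variable and the list of variables its right-hand side reads, no expressions or control flow), the steps above can be sketched in code. All names here are illustrative, and the slicer is the textbook backward-liveness one rather than the SSA-based slicer used later in the paper.

```python
def final_use_substitute(stmts, v_to_fv):
    """Step 4: replace a use of v by fv only where v already holds its
    final value, i.e. v is not assigned at this statement or later."""
    assigned, out = set(), []
    for target, uses in reversed(stmts):
        assigned.add(target)  # includes this statement's own target
        out.append((target, [v_to_fv[u] if u in v_to_fv and u not in assigned
                             else u for u in uses]))
    return list(reversed(out))

def slice_on(stmts, wanted):
    """Step 5's slicer, for straight-line code: keep an assignment iff
    its target is live for the wanted variables at the end."""
    live, kept = set(wanted), []
    for target, uses in reversed(stmts):
        if target in live:
            kept.append((target, uses))
            live.discard(target)
            live.update(uses)
    return list(reversed(kept))

def coslice(stmts, v, fresh):
    """Steps 4-10 of the co-slicing algorithm on the toy representation."""
    v_to_fv = dict(zip(v, fresh))
    s1 = final_use_substitute(stmts, v_to_fv)             # step 4
    cov = {t for t, _ in stmts} - set(v)                  # step 6
    s2 = slice_on(s1, cov)                                # step 7
    v1 = [x for x in v if x not in {t for t, _ in s2}]    # step 8
    undo = {v_to_fv[x]: x for x in v1}                    # steps 9 and 10
    return [(t, [undo.get(u, u) for u in uses]) for t, uses in s2]

# A straight-line caricature of the sales example: co-slicing away totalSale
# removes its computation, while profit may still refer to totalSale by name
# (step 10), since its final value is assumed available on entry.
prog = [("totalSale", ["sale"]),
        ("totalPay", ["sale"]),
        ("pay", ["totalPay"]),
        ("profit", ["totalSale", "cost"])]
print(coslice(prog, ["totalSale"], ["fTotalSale"]))
```

The handling of loops, conditionals and the correctness argument are, of course, exactly what this sketch omits; they are treated in [1].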
3.3. Final-use substitution

The removal of redundant code when co-slicing is maximized through a so-called final-use substitution. A final use is a reference to a variable's value (e.g. totalPay in the last example above) at a program point where the variable is guaranteed to hold its final value (e.g. where totalPay participates in the assignment to pay, on line 15). As the final value of a co-sliced variable, e.g. totalPay, is available for reuse (e.g. through the fresh variable fTotalPay), its computation is redundant and can be avoided.

Final-use substitution can be formalized in the following way. Starting with "S ; {V = fV}", where fV is a fresh set of variables, we transform S into S' := S[final-use V \ fV], demanding "S ; {V = fV}" = "S' ; {V = fV}". Statement S' will be using the variables in fV instead of V at points to which the corresponding assertion can be propagated. The full derivation of S[final-use V \ fV], for a simple programming language, is given in [1, Appendix E]. The idea is to propagate the assertion backwards as far as possible, around and into all statements which do not alter the variables of the assertion. Since the language is simply structured, this propagation is performed in a syntax-directed manner by a single pass over the program tree representation, provided summary information (e.g. of possibly modified variables), which can be collected in a separate preliminary pass, is available for each node.
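The defining equation "S ; {V = fV}" = "S' ; {V = fV}" can be checked on a concrete pair of statements. In this sketch, statements are modeled as Python functions over a state dictionary; the names stmt and stmt_subst, and the use of division by 10 in place of the paper's arithmetic, are illustrative only.

```python
def stmt(state):
    # S: the last assignment to totalPay, followed by a final use of totalPay
    state["totalPay"] = state["sale"] / 10
    state["pay"] = state["totalPay"] / state["days"] + 100

def stmt_subst(state):
    # S' = S[final-use totalPay \ fTotalPay]: only the final use is replaced;
    # the assignment to totalPay itself is untouched (it may become dead later)
    state["totalPay"] = state["sale"] / 10
    state["pay"] = state["fTotalPay"] / state["days"] + 100

s = {"sale": 500, "days": 10}
stmt(s)

# The assertion {totalPay = fTotalPay} holds throughout: fTotalPay carries
# the final value of totalPay from the start.
t = {"sale": 500, "days": 10, "fTotalPay": s["totalPay"]}
stmt_subst(t)
assert s["pay"] == t["pay"] and s["totalPay"] == t["totalPay"]
```

Replacing an intermediate use instead (say, a use of totalPay before its last assignment) would break the equation, which is exactly why the backward propagation must stop at statements that modify the asserted variables.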
3.4. An illustration

To demonstrate the workings of final-use substitution, and of the co-slicing algorithm, we return to the earlier example:

1. Let S be the full code (lines 1-17) for reading and processing sales results, from the preceding section.
2. If we are only interested in the outline computation of the final results pay and profit (i.e. the maximal elements of the slice-inclusion relation), we can co-slice for all other possibly-modified variables; i.e. let V be the set {totalPay, totalSale, i, sale, cin}.
3. Let fV be the corresponding set of fresh variables {fTotalPay, fTotalSale, fI, fSale, fCin}. Assume the assertions totalPay=fTotalPay, totalSale=fTotalSale, i=fI, sale=fSale and cin=fCin are all guaranteed to hold on exit from S (i.e. after line 17).
4. Performing the appropriate final-use substitution on S would yield the following S':

    1    i = 0;
    2    while (i < days)
    3      cin >> sale[i++];
    4    if (shouldProcess) {
    5      i = 0;
    6      totalSale = 0;
    7      totalPay = 0;
    8      while (i < days) {
    9'       totalSale = totalSale+fSale[i];
    10'      totalPay = totalPay+0.1*fSale[i];
    11'      if (fSale[i]>1000)
    12         totalPay = totalPay+50;
    13       i = i+1;
    14     }
    15'    pay = fTotalPay/days+100;
    16'    profit = 0.9*fTotalSale-cost;
    17   }
Note that not all uses of the co-sliced variables were substituted; the uses of i, sale and cin in the first loop, for instance, are not of final values and must not be replaced. Similarly, the uses of i, totalSale and totalPay in the second loop are not of final values, whereas the use of sale there is indeed final; and so are the uses of totalPay and totalSale on lines 15 and 16, respectively.
5. Let slice be the SSA-based slicer, as formally developed in [1, Chapter 9]. There, in fact, the supported programming language is a simple one, with a slightly different syntax. Nevertheless, that language includes assignments, sequential composition of statements, conditional statements and loops, as well as support for arrays and streams, such that slicing our S' with that slicer is still possible. However, correctly supporting a more general language, with e.g. unstructured constructs, procedures, pointers or objects, would require a more general slicer, whose proof of correctness might have to be completely redone.
6. Let coV, the set of all program variables outside V whose value may be modified by an execution of S', be {pay, profit}.
7. As intended, final-use substitution has the potential of introducing dead code, which can subsequently be removed. At this point, slicing for {pay, profit} would successfully remove the irrelevant computation of the co-sliced variables i, sale, cin, totalSale and totalPay, leading to

    4    if (shouldProcess) {
    15'    pay = fTotalPay/days+100;
    16'    profit = 0.9*fTotalSale-cost;
    17   }
being our S''.
8. No element of our V, i.e. {totalPay, totalSale, i, sale, cin}, is modified in S''; hence our V1 is the full set {totalPay, totalSale, i, sale, cin}.
9. Similarly, fV1 is the full fV, i.e. {fTotalPay, fTotalSale, fI, fSale, fCin}.
10. Finally, the substitution of all occurrences of {fTotalPay, fTotalSale, fI, fSale, fCin} by the original program variables {totalPay, totalSale, i, sale, cin} leads to

    4    if (shouldProcess) {
    15     pay = totalPay/days+100;
    16     profit = 0.9*totalSale-cost;
    17   }
being our resulting S''', i.e. the co-slice of the original S on {totalPay, totalSale, i, sale, cin}. In this case, none of the final-value backup variables was required; they merely contributed, temporarily, to finding a co-slice smaller than the corresponding slice.
4. Applications

4.1. Studying non-exceptional code

As was demonstrated earlier, in Section 2, both slicing and co-slicing can be helpful for program comprehension. This application is further supported by the following scenario, in which co-slicing helps to ignore error-identification code by focusing attention on the main computation. Consider the following code for summarizing rainfall data, taken from Harman et al. [5] (where an amorphous approach for procedure and function extraction is proposed):

    1    anomalous = false;
    2    count = 0;
    3    total = 0;
    4    while (count < ...) {
           ...
    12   }
    13   if (total==0 || count==0) {
    14     average = 0;
    15   } else {
    16     average = total/count;
    17     if (average>highest) {
    18       highest = average;
    19     }
    20   }
    22   if (!anomalous)
    23     Report(total,average);
    24   else
    25     ReportAnomalous();
Then, assuming the variable anomalous is not live on exit, the final step is to eliminate it altogether by inlining its value at its only use, i.e. on line 22. However, such inlining must be done with care: since the parameter highest may be modified ahead of the new call, its initial value should be duplicated, as in the following:
    iHighest = highest;
    if (total==0 || count==0) {
      average = 0;
    } else {
      average = total/count;
      if (average>highest) {
        highest = average;
      }
    }
    if (!isAnom(total,count,iHighest))
      Report(total,average);
    else
      ReportAnomalous();
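The need for the backup iHighest can be demonstrated directly. In this sketch, isAnom is a stand-in for the extracted query, with a made-up anomaly test; only the shape of the problem matters: highest is updated before the new call site, so the call must receive the value highest had on entry.

```python
def isAnom(highest):
    # hypothetical extracted query that reads the parameter 'highest'
    return highest > 100

def without_backup(highest, average):
    if average > highest:
        highest = average       # 'highest' is modified ahead of the new call
    return isAnom(highest)      # wrong: the query sees the updated value

def with_backup(highest, average):
    iHighest = highest          # duplicate the initial value of 'highest'
    if average > highest:
        highest = average
    return isAnom(iHighest)     # correct: the query sees the value on entry

# Whenever 'highest' is actually updated, the two versions disagree:
assert without_backup(50, 150) != with_backup(50, 150)
```

Only with_backup preserves the original behavior, in which the anomaly test read highest before any update.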
The complete refactoring scenario above is known as Replace Temp with Query [3], and the sliding of a selected slice away from its complement may act as a building block in its automation. (To date, none of the existing development environments supports this refactoring.)

5. Related work

The main contribution of co-slicing, beyond its relevance to the understanding of programs, is to the automation of method-extraction refactoring transformations.

The scenario above appeared first in [5], where an amorphous function extraction was suggested. The amorphous approach combines further transformations with the extraction, thus not restricting the extraction to syntax-preserving transformations. One advantage of their approach can be seen, for example, in the extracted function's body, where the 12-line computation was simplified to a single return statement (whose expression encapsulates the full computation, using conditional expressions). However, this only works with straight-line code. The importance of a formal proof of correctness, as in the sliding approach, is highlighted by their apparent mistake in the final step above: there, the need to make a backup of highest was missed, leading to an incorrect transformation. Breaking Replace Temp with Query into three steps of sliding, function extraction and inlining, as in this paper, makes that need for a backup more explicit.

Earlier method-extraction solutions for non-contiguous tangled code include the Tuck transformation of Lakhotia and Deprez [9] and several algorithms by Komondoor and Horwitz [7, 8, 6]. In the former, the complement is a full slice (from all non-extracted points); no data is allowed to flow from the slice to its complement, and the transformation is rejected if a live-on-exit variable is modified in both the slice and its complement (which is not the case in sliding). Indeed, in our example above, tucking would have been rejected for that reason. Sliding is in fact an improved version of tucking, where the complement is a co-slice that potentially reuses extracted results, and is hence potentially smaller (yielding further improved applicability). In contrast, Komondoor and Horwitz do allow data to flow from the extracted code to the complement, and this indeed has inspired the invention of co-slices. However, they do not support duplication of assignments, which leads in turn to a need to extract larger portions of code through promotion, and hence to (undesired) extraction of non-selected code. It also leads to an inability to support the untangling of loops.

6. Conclusion

This paper has introduced the concept of co-slicing and suggested some applications for program reuse and comprehension. A co-slicing algorithm has been presented and illustrated through examples from the literature.

Ideas for future work include the development of a co-slicing tool for a 'real' programming language, with the immediate goals of collecting empirical data on the usefulness of the approach and of automating the relevant refactoring transformations.
Acknowledgements I would like to thank Mark Harman and Jeremy Gibbons for their valuable comments and the interesting discussion during my viva. I must also thank past and present members of Oege de Moor’s Programming Tools Group at Oxford, in particular Ivan Sanabria, Yorck Hunke, Stephen Drape, Mathieu Verbaere and Damien Sereni, for helpful feedback during various stages of the research leading to this paper.
References

[1] R. Ettinger. Refactoring via Program Slicing and Sliding. PhD thesis, University of Oxford, Oxford, United Kingdom, 2006. Submitted thesis, http://progtools.comlab.ox.ac.uk/members/rani/sliding_thesis_esub101006.pdf.
[2] R. Ettinger and M. Verbaere. Untangling: a slice extraction refactoring. In AOSD '04: Proceedings of the 3rd International Conference on Aspect-Oriented Software Development, pages 93-101, New York, NY, USA, 2004. ACM Press.
[3] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley, 2000.
[4] K. B. Gallagher and J. R. Lyle. Using program slicing in software maintenance. IEEE Transactions on Software Engineering, 17(8):751-761, 1991.
[5] M. Harman, D. Binkley, R. Singh, and R. M. Hierons. Amorphous procedure extraction. In SCAM, pages 85-94, 2004.
[6] R. Komondoor. Automated Duplicated-Code Detection and Procedure Extraction. PhD thesis, University of Wisconsin-Madison, WI, USA, 2003.
[7] R. Komondoor and S. Horwitz. Semantics-preserving procedure extraction. In POPL '00: Proceedings of the 27th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 155-169, New York, NY, USA, 2000. ACM Press.
[8] R. Komondoor and S. Horwitz. Effective automatic procedure extraction. In Proceedings of the 11th IEEE International Workshop on Program Comprehension, 2003.
[9] A. Lakhotia and J.-C. Deprez. Restructuring programs by tucking statements into functions. Information and Software Technology, 40(11-12):677-690, 1998.
[10] W. F. Opdyke. Refactoring Object-Oriented Frameworks. PhD thesis, University of Illinois at Urbana-Champaign, IL, USA, 1992.
[11] M. Weiser. Program slicing. IEEE Transactions on Software Engineering, 10(4):352-357, 1984.