Tracking Source Locations
Steven P. Reiss
Department of Computer Science, Brown University, Providence, RI 02912
[email protected]

ABSTRACT
Many programming tools require information to be associated with source locations. Current tools do this in different ways with different degrees of effectiveness. This paper is an investigation into the various approaches to maintaining source locations. It is based on an experiment that attempts to track a variety of locations over the evolution of a source file. The results demonstrate that relatively simple techniques can be very effective.
Categories and Subject Descriptors
D2.6 [Software Engineering]: Programming Environments — interactive environments.
General Terms
Experimentation, Algorithms.
Keywords
Source lines, software evolution, tool support.
1. INTRODUCTION
In a programming environment, various tools need to store and later retrieve information associated with the program source. Moreover, these tools often want to have the associations maintained as the programmer edits the source. Over the years a variety of different techniques have been used to accomplish this. These techniques vary considerably both in terms of effectiveness and in terms of their impact on the developer.
A simple example of such a tool is a debugger. The debugger wants to associate breakpoints with line numbers in the source. A debugging session typically involves the user finding a potential bug using breakpoints, fixing the code, and then rerunning the program, again checking values at the previously set breakpoints. Today’s debuggers take different approaches to relocating the breakpoints after the source has been changed. Some, like gdb, simply use the original line number and set the new breakpoint at that line. The Eclipse environment is more sophisticated. It saves the breakpoint location internally so that if you use the Eclipse editor, the breakpoint location will move around to match source changes. However, if you use emacs or some other editor outside of Eclipse, the breakpoint location falls back to using the original line number.

Since breakpoints are quite ephemeral and the changes between runs are relatively simple, the fact that the reassigned breakpoint might be in the wrong location is an annoyance but not a real problem. However, there are tools that would benefit from or require a more permanent association of information with the source and where the association is more critical.

An example of such a tool is lint [22]. Lint is a tool that checks the program for potential bugs such as unused variables or uninitialized values. Its main drawback is that it generally produces a large number of false positives, warnings that the programmer knows are not problems. To accommodate this, lint provides a mechanism for letting the programmer indicate that it should not generate warnings for a potential error. This mechanism is based on stylized comments that are inserted into the code. Such comments are effective in that they change locations correctly as the source changes. However, they are somewhat annoying to the programmer and, where the comment really doesn’t help in understanding the code, they can inhibit code readability.

The need for adding comments for such tools is becoming more of a problem as more such tools are developed. Much of what was done by lint is now done within the compiler, and languages such as Java have added annotations to avoid compiler warnings where they are inappropriate. Other tools such as FindBugs [7] and MJ [2] use a variety of techniques including flow analysis, model checking, and theorem proving to identify potential problems. Again, these tools are plagued by the problems of false positives. While comments could be used here to avoid seeing the errors the next time the analysis is run, doing this both clutters the code and becomes impractical when several of the tools are being used. A better technique would allow the tool to keep track of what the programmer indicated should be ignored and to remember this when a future analysis is done.
These aren’t the only examples of tools that need to associate information with source locations. Software visualization, especially for algorithm animation, involves identifying significant events in the program and using these events to drive the animation [6,24]. While these could again be done with comments, the comments here are typically unrelated to the code itself and get in the way. Early work with the FIELD environment [16,17] kept a copy of the original program and the corresponding lines and then ran the Unix file difference program, diff, to track where those lines were in the modified source. Unix provides ctags functionality to let the
programmer quickly find definitions and declarations of program constructs [14]. This is done by saving and later matching source lines to achieve portability when the source changes. More recent work has resorted to comments, to using program features such as assignments or calls [21], and to using aspects with languages such as AspectJ [10,11]. Today’s compilers are quite sophisticated and are capable of using performance information to improve optimization. However, this requires that the environment appropriately associate branch counts and other information with the source or that all experiments that yield that information be redone each time the source changes. Code analysis can often be better done if the programmer is allowed to give hints as to what is important or not [23], hints that might be better kept separate from the code. Similar situations arise when attempting to do program proofs or to simplify model checking.
There are a variety of other applications that could benefit from being able to associate information with the program source and track this information as the source evolves. These include new tools for identifying potential bugs, tools that use programmer hints to do code generation or verification, tools that identify and track design patterns [18], tools that track bug locations over time, tools that manage to-do lists, and tools that attempt to ensure that the different aspects of a software system remain synchronized as they evolve [20]. TagSEA is a framework where tags in the form of stylized source comments provide the basis for collaborative software development [25,26].

Our focus is on tracking source lines as the program changes. Most of the above applications require the programmer to associate information with source locations. Because today’s text and program editors and tool source displays such as those provided by debuggers are primarily line-oriented, this is generally done using line numbers. In gdb and other debuggers one provides the line number. Programming environments such as FIELD and Eclipse use annotations that are defined by pointing to or clicking on a line. Alternatives, such as storing information at a character position or associating it with the underlying abstract syntax tree, are less common, the former because positions may be ambiguous and difficult to show to the user, and the latter because abstract syntax trees are ephemeral and only exist within certain tools, not in a general environment.

But source lines are not the only thing that could be considered. Tracking software changes in general through multiple versions of a system has become a problem of interest [12]. This problem involves associating programming constructs in one version of a program with the same construct in a later version of the same program. Because this problem focuses on programming constructs, it has generally been tackled with techniques that are either structural or semantic in nature, for example, by matching abstract syntax trees [27], by semantic differencing [8], or by looking at program dependence graphs [5] or similar structures [1,15]. Recent techniques that attempt to actually identify the changes have proven quite effective at doing the matching [13]. Similar technology has been used for the related problems of clone detection [3,4,9].

Given that associating information with source lines is a desirable goal, the question is how this best can be done. As noted, prior work has used simple line numbers (gdb), matching the original line (ctags), specialized editors that maintain the information (Eclipse), using comments (lint), using abstract syntax and semantic information, and even using Unix tools like diff (FIELD). Given that simple line numbers work poorly, that users are disinclined to use different editors, that comments are often inappropriate for the particular applications, and that tools like diff are expensive (both in terms of having to run them and in terms of needing to maintain a complete copy of the original program), we wanted to know if there was a better way.

Many ideas suggest themselves, for example considering the content of the line, the context of the line, abstract syntax trees, string differencing, and various combinations of these. However, it is unclear which of these will work, how well they might work, and what they cost. To answer these questions we developed a variety of different approaches and set up an experiment to test these approaches for different types of lines under real-world conditions. The experiment itself is described in Section 2. Section 3 then describes the various methods for tracking source locations that were used. Section 4 describes the results of the experiment, while Section 5 discusses these results.

The bottom line, as we will demonstrate through the experiments, is that a relatively simple and inexpensive approach that combines naive context with string matching of the source line in question is quite effective and should be used.
2. THE EXPERIMENT
A technique for associating information with source should be able to track source locations across multiple versions of a file. In order to see how well different methods can accomplish this, we set up an experiment that attempts to look at what happens to source files in the real world. We started with twenty-five different versions of a source file from Clime [19] that were stored in CVS from September 2002 through June 2006. Over this period the file changed significantly, growing 50% in size, converting from Java 1.2 to Java 5, and evolving with the underlying system. We then identified fifty-three source lines to track in the original version of the file. This was done by looking at every tenth line, discarding those that didn’t make sense to track (e.g. blank lines), and finally augmenting the set with additional examples that we thought might be interesting or problematic (e.g. name and comment changes). The lines selected include comments (the copyright notice), the start of functions, the start and end of blocks, as well as simple declarations and statements.
For each of these fifty-three source lines we manually went through each of the twenty-five different source files and identified where the source line occurred in the new file or if that particular line had been eliminated. We note that this assignment of “correct” answers is somewhat subjective. There were instances, for example, where a set of lines containing an initial source location was moved from one function to another and, knowing that the new function was just an encapsulation of the old one’s functionality, we decided that the source actually moved. One could argue that it would be just as correct to say that the original source had disappeared in this case.
Once we had defined initial locations and had the correct answers for each location and each version of the source file, we were ready to test alternative methods. To do this we
presented each method with the initial locations, let it store whatever information it wanted from the original file for each of these locations, and then presented it with each of the twenty-five versions of the file in turn. For each version, we had the method determine where it believed each location had migrated to or if the location had been removed. Once it had determined the new location, we let the method update its information based on the modified file before going on to the next version of the file. From all this we got information about each location and each file version for each method.

We use this information to score the methods. We actually consider several different scores. The basic score is the number of locations that are correctly assigned (out of 25*53 = 1325 possibilities). This score could be broken down by looking at the types of errors because different types of errors might have different severities for different applications. Errors occur when the method assigns a location the wrong line, when the method says a location disappeared when it hadn’t, and when the method assigns a location a line when the actual location had disappeared.
Another scoring method we consider is the number of locations that are tracked correctly through all versions of the source. The raw score penalizes methods that get a location wrong in an early version, since they will generally get it wrong in all later versions as well. Finally, we consider the run time and space requirements of each of the methods.
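To make the experimental protocol concrete, the following is a minimal sketch of the evaluation loop in Java. The TrackingMethod interface and all names here are our own illustration, not the actual experiment code (which is part of the wadi project noted at the end of the paper); the oracle maps each tracked location to its correct line in a given version, with null meaning the location was eliminated.

import java.util.List;
import java.util.Map;

// Hypothetical interface each tracking method implements for the experiment.
interface TrackingMethod {
    // Store whatever information the method needs about each tracked
    // location (given by line number) in the original version of the file.
    void initialize(List<String> originalLines, List<Integer> locations);

    // For one subsequent version, report where each tracked location now
    // lives (location index -> new line number, or null if eliminated).
    // The method may update its stored information as a side effect.
    Map<Integer, Integer> track(List<String> versionLines);
}

class Harness {
    // Count correct assignments over all versions; here 25 versions of
    // 53 locations give 1325 possibilities.
    static int score(TrackingMethod method,
                     List<List<String>> versions,
                     List<Map<Integer, Integer>> oracle) {
        int correct = 0;
        for (int v = 0; v < versions.size(); v++) {
            Map<Integer, Integer> answer = method.track(versions.get(v));
            for (Map.Entry<Integer, Integer> truth : oracle.get(v).entrySet()) {
                Integer got = answer.get(truth.getKey());
                Integer want = truth.getValue();
                if (got == null ? want == null : got.equals(want)) correct++;
            }
        }
        return correct;
    }
}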
3. SOURCE TRACKING METHODS
There are many ways that one can imagine tracking source locations as their underlying files evolve. Earlier tools looked at the line number, at the contents of the line, and at the context (using diff). It is also possible to consider the local context and even program structure, for example as exhibited by abstract syntax trees. To understand what works best we implemented all these methods and provided the means to combine them in arbitrary weighted combinations.
The basic methods we considered can be divided into four categories: those that match the line in question, those that textually match the local context around the line (but not the line itself), those that match the abstract syntax tree for the line in question, and those that use whole-file information. The methods that match the actual line included:
• Line number match (A_NUMBER). This method looks for the line with the same line number as the original. It actually assigns a weight to each line that is the inverse of its distance from the original line number plus one. This is done so that it can be effectively combined with other methods.
• Exact match (A_EXACT). This method attempts to find an exact match to the original line. It is used by the Unix tags facility. Matching is actually done by looking at the hash code of the lines rather than by string comparison.
• Best match (A_BEST). This method takes the normal (Levenshtein) string difference between the original line and candidate lines and finds those that are the best match. It includes a threshold to determine if the line is relevant or not. Given the two lines S and T and a threshold t, it computes v = (|S| − Δ(S, T)) / |S|, ensures that v > t, and chooses the candidate line T with the highest value. (A sketch of this computation appears after this list.)
• Smart match (A_SMART). This is similar to best match except that it attempts to take into account semantic properties of the text. While best match assigns a cost of one to each insertion, deletion or modification, smart match assigns a cost that is dependent on the type and location of the difference. Changes inside comments have a low cost (0.1). Changes on the right of an assignment or in the arguments to a function call have a reduced cost (0.15), as do changes involving the insertion or deletion of white space (0.1). Other changes have an increased cost (1.5). Again a threshold for considering a match is used.

The methods we considered that look at the local context surrounding the actual line included:
• Matching lines (C_LINE). This method takes a window of k lines before and after the actual line and counts the number of matches between these lines and the lines of a potential target context.
• Area context (C_AREA). This is similar to matching lines except that the lines are not considered in order. This means that line insertions or deletions have a much smaller effect.
• Best match context (C_BEST). This method uses the string difference between the concatenation of the k lines before and after the actual line and the corresponding lines surrounding the target line. The value is computed as in the best match method.

For each of these methods except A_NUMBER, we also considered a version of the method that ignored indentation. The corresponding methods are called A_EXACT_I, A_BEST_I, A_SMART_I, C_LINE_I, C_AREA_I, and C_BEST_I.

The methods we considered that look at the abstract syntax tree included:
• AST exact match (T_EXACT). This method saves the abstract syntax tree corresponding to the original context and finds new contexts that match that tree.
• AST parent match (T_PARENT). This method creates a context for the original match using the abstract syntax tree by considering the node types of each of the parents and of the k previous and subsequent siblings of the node corresponding to the original line. This context is stored as a string and is matched against corresponding contexts created for each candidate line.
• AST best match (T_BEST). This method compares the original abstract syntax tree with each candidate tree by looking at the differences between the two trees. This is done hierarchically by matching each child tree with the corresponding child. An exact match is credited with 1/n where n is the number of children. Subtrees are matched recursively and the resultant match weighted appropriately.
• AST method match (T_METHOD). This method is designed to be used in conjunction with other methods. It accepts all nodes that occur in the same method as the original node, where methods are matched by fully qualified name.
Finally, the global method that we looked at was:
• Unix diff (U_DIFF). This method uses the Unix diff utility to determine where lines were inserted or deleted as the original file evolved to the current file. It then uses this information to determine where the original line number migrated to. This is essentially what was done in the FIELD environment.
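To illustrate the line-matching computation, here is a minimal sketch of A_BEST-style matching: a standard Levenshtein distance, the normalized value v described above, and a threshold below which the location is reported as eliminated. This is our reconstruction from the description, not the paper's code, and it assumes the original line is non-empty.

class BestMatch {
    // Standard Levenshtein edit distance using two rolling rows.
    static int editDistance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] cur = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) prev[j] = j;
        for (int i = 1; i <= s.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int subst = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(subst, Math.min(prev[j] + 1, cur[j - 1] + 1));
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[t.length()];
    }

    // Return the index of the best candidate line, or -1 if no candidate
    // scores above the threshold (the location is considered eliminated).
    static int findLine(String original, String[] candidates, double threshold) {
        int bestIndex = -1;
        double best = threshold;
        for (int i = 0; i < candidates.length; i++) {
            double v = (original.length() - editDistance(original, candidates[i]))
                       / (double) original.length();
            if (v > best) { best = v; bestIndex = i; }
        }
        return bestIndex;
    }
}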
TABLE 1. Results for each separate method

Method      % Correct  % Change  % Eliminated  % Spurious  % Ids Correct  Time (ms)
A_BEST      78.0%      77.7%     22.3%         0.0%        69.8%          14.56
A_BEST_I    74.7%      78.8%     16.1%         5.1%        66.0%          12.70
A_EXACT     76.0%      60.1%     39.9%         0.0%        66.0%          0.03
A_EXACT_I   72.9%      67.4%     27.9%         4.7%        64.2%          0.15
A_NUMBER    5.8%       90.8%     0.0%          9.2%        1.9%           0.65
A_SMART     76.9%      73.9%     10.1%         16.0%       67.9%          30.18
A_SMART_I   73.1%      77.6%     8.7%          13.7%       64.2%          24.80
C_AREA      84.6%      43.6%     0.0%          56.4%       77.4%          0.75
C_AREA_I    86.2%      37.2%     0.0%          62.8%       77.4%          1.56
C_BEST      85.2%      61.7%     1.5%          36.7%       75.5%          161.69
C_BEST_I    83.2%      59.6%     8.1%          32.3%       73.6%          134.60
C_LINE      85.2%      41.3%     0.0%          58.7%       79.2%          0.21
C_LINE_I    87.0%      33.1%     0.0%          66.9%       81.1%          0.90
T_BEST      71.3%      69.7%     0.0%          30.3%       52.8%          6.34
T_EXACT     68.8%      44.0%     51.9%         4.1%        49.1%          3.23
T_METHOD    17.9%      97.1%     0.0%          2.9%        7.5%           4.94
T_PARENT    58.3%      77.0%     16.5%         6.5%        39.6%          11.70
U_DIFF      91.8%      47.2%     12.0%         40.7%       83.0%          5.53
In addition to each of these methods, we were able to construct combined approaches that take a weighted average of two or more methods and threshold the result. This is most useful for combining a method that looks at the target line with one that looks at the surrounding context. Rather than considering all possible combinations, we looked at which methods worked best on their own and combined them. Moreover, for methods that worked reasonably well on their own, we attempted to find complementary methods that handled the cases where the method did not do as well. Finally, we experimented with different weightings and different thresholds for combinations that did well.
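As an illustration of this weighting scheme, here is a small sketch of how two per-line scorers might be averaged and thresholded; the Scorer interface is hypothetical and each component is assumed to yield a score in [0,1] for every candidate line.

// Hypothetical per-line scorer: one value in [0,1] per candidate line.
interface Scorer {
    double[] scoreLines(String[] newVersionLines);
}

class CombinedScorer implements Scorer {
    private final Scorer lineMethod;     // e.g. a method that looks at the line
    private final Scorer contextMethod;  // e.g. a method that looks at the context
    private final double contextWeight;  // weight given to the context score
    private final double threshold;      // scores at or below this become 0

    CombinedScorer(Scorer lineMethod, Scorer contextMethod,
                   double contextWeight, double threshold) {
        this.lineMethod = lineMethod;
        this.contextMethod = contextMethod;
        this.contextWeight = contextWeight;
        this.threshold = threshold;
    }

    public double[] scoreLines(String[] newVersionLines) {
        double[] a = lineMethod.scoreLines(newVersionLines);
        double[] b = contextMethod.scoreLines(newVersionLines);
        double[] combined = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            double v = (1.0 - contextWeight) * a[i] + contextWeight * b[i];
            combined[i] = v > threshold ? v : 0.0; // zero means "no match here"
        }
        return combined;
    }
}

Under this structure, W_BEST_LINE, for example, would correspond to wrapping an A_BEST scorer and a C_LINE scorer with a context weight of 0.4.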
The combinations we considered fell into four categories. The first were those that combined multiple methods of looking at a line. For example, we combined A_BEST with A_NUMBER to get A_BEST_NUMBER. The intuition here is that when there are multiple lines in the source that match the original line, we want to choose the one closest to the original line number.
The second category were combinations of a method that looked at the specified line with a method that looked at the context around that line. These are designated with a W_ prefix. For example, the method W_BESTI_LINE consists of combining A_BEST_I (best match ignoring indentation for the actual line) with C_LINE (matching context lines). The third category considered combining U_DIFF, which exhibited the best single-method score, with a method that did a line match. The result is U_BEST_DIFF. We did not test this category extensively since diff already combines information about context and the actual line.
The final category combined the different abstract syntax tree methods. For example, T_BEST_PARENT is a combination of T_BEST with T_PARENT, matching the abstract syntax tree both locally and by considering the context in terms of its parents.

In addition, we used the method T_METHOD, which restricts matches to within the same function, in conjunction with various other combinations. These are noted by the addition of METHOD to the end of the name, for example W_BESTI_LINEI_METHOD.

4. RESULTS
We determined for each method and each version of the file which source locations were correctly identified. For each incorrectly identified location, we determined if the location was changed (a different line number was reported than the correct one), eliminated (the location was reported as missing while it should have been reported at a line), or spurious (the location was reported at a line but should have been reported as missing).

The results for each of the methods separately are shown in Table 1. From this table it is clear that the simple approach of just maintaining line numbers is a relatively poor idea. On the other hand, using diff does reasonably well, getting over 90% of the locations correct and 83% of the ids. The methods that look only at the line tend to get about 75% of the locations and 65% of the ids correct. Those that look at context tend to get about 75%-85% of the locations and 70% of the ids correct. It is also noteworthy that ignoring indentation generally made matters worse.

The last column of the table gives the average time in milliseconds for the method to compute the probabilities for all lines in a source file. This varies considerably, from a fraction of a millisecond for some of the simple methods to over one tenth of a second for methods that find the string editing distance of the context. The time does not include the time to build abstract syntax trees where needed.

For the various context-related methods, one needs to determine the proper context size. We experimented with different context sizes for the different methods as shown in Table 2. These results show that the best context size is between 3 and 5 lines on either side of the actual line. Using larger contexts, which one would intuitively think would help, generally gets poorer results.

Based on these results, we looked at combining the methods that look at context with those that look at the current line. The results for these combinations are shown in Table 3. In combining context and actual line methods, we needed to determine both how to weight the two approaches and what is the appropriate cutoff value for deciding a line should be ignored. Table 4 shows the results of some of the experiments with different weightings. The weight column here indicates the weight associated with the context method. As can be seen, most of the combinations do best when the context method has a weight of around 0.4 and the actual line method has a weight of 0.6.
TABLE 2. The effect of different context sizes

Method  Size  % Correct  % Change  % Eliminated  % Spurious  % Ids Correct
C_AREA  1     55.8%      81.2%     1.7%          17.1%       47.2%
C_AREA  2     73.7%      64.1%     2.9%          33.0%       62.3%
C_AREA  3     80.2%      56.1%     0.0%          43.9%       69.8%
C_AREA  4     84.6%      43.6%     0.0%          56.4%       77.4%
C_AREA  5     82.6%      50.2%     0.0%          49.8%       71.7%
C_AREA  10    81.0%      54.4%     0.0%          45.6%       69.8%
C_AREA  20    74.9%      65.4%     0.0%          34.6%       62.3%
C_BEST  1     55.5%      78.6%     3.4%          18.0%       45.3%
C_BEST  2     80.8%      66.5%     5.1%          28.3%       71.7%
C_BEST  3     85.2%      61.7%     1.5%          36.7%       75.5%
C_BEST  4     81.5%      62.9%     4.1%          33.1%       67.9%
C_BEST  5     83.4%      80.0%     4.5%          15.5%       73.6%
C_BEST  10    76.0%      79.9%     0.0%          20.1%       62.3%
C_BEST  20    77.7%      83.4%     0.0%          16.6%       60.4%
C_LINE  1     55.8%      81.2%     1.7%          17.1%       47.2%
C_LINE  2     74.5%      63.0%     3.0%          34.0%       64.2%
C_LINE  3     82.8%      51.8%     4.4%          43.9%       73.6%
C_LINE  4     85.2%      41.3%     0.0%          58.7%       79.2%
C_LINE  5     83.6%      47.0%     0.0%          53.0%       77.4%
C_LINE  10    80.0%      56.6%     0.0%          43.4%       69.8%
C_LINE  20    79.2%      64.4%     0.0%          35.6%       69.8%
5. DISCUSSION
In order to appreciate which methods work best and why they work, one needs to look at where the methods fail. The best methods tend to fail in common ways. This is shown in Table 5, which shows which locations are not done correctly for some of the better methods. This table illustrates that there are some locations (numbers 3, 6, 8, 29, 36 and 43) that most of these methods have problems getting correct. The corresponding original and final lines for these locations are shown in Figure 1. For the first and fifth location shown, the line was eliminated in some version of the program, but some methods failed to detect the elimination, instead choosing a line that was similar (e.g. another import or return) and that occurred in a similar local context. For the second, third, and fourth locations, the text changed in what might be considered significant ways. For the last location, at some point in the evolution, the line changed a bit, while the contents of the function that followed, i.e. the context of the line, changed significantly.

These illustrate the trade-off in the methods that consider the context and the original line. They need to be able to handle significant changes in the line when the context stays the same and at the same time handle significant changes in the context when the line stays the same. Moreover, if the line changes, even in a simple way, the context is likely to change as well.

There are several interesting aspects of the experiments. First, while ignoring indentation never helped with the actual line and didn’t help with best match context, it does seem to help when the methods are combined, even with the best context. Our analysis of this is that the individual methods match too many lines when indentation is ignored and hence tend to give spurious results. However, when methods are combined this is less of a problem, and ignoring indentation allows the methods to find lines where the code was embedded in a block, for example moved inside a conditional.
TABLE 3. Combined methods

Method                 % Correct  % Change  % Eliminated  % Spurious  % Ids Correct
A_BEST_NUMBER          84.5%      68.4%     31.6%         0.0%        73.6%
A_NUMBER_EXACT         80.8%      50.2%     49.8%         0.0%        67.9%
T_BEST_PARENT          82.0%      51.9%     0.0%          48.1%       67.9%
T_BEST_PARENT_METHOD   88.7%      23.3%     0.0%          76.7%       79.2%
T_EXACT_PARENT         81.9%      72.5%     0.0%          27.5%       67.9%
U_BEST_DIFF            67.0%      76.2%     3.0%          20.8%       52.8%
W_BESTI_AREA           86.1%      43.5%     7.1%          49.5%       75.5%
W_BESTI_AREAI          86.1%      43.5%     7.1%          49.5%       75.5%
W_BESTI_BEST           96.5%      0.0%      48.9%         51.1%       90.6%
W_BESTI_BESTI          97.1%      0.0%      60.5%         39.5%       92.5%
W_BESTI_LINE           96.7%      0.0%      29.5%         70.5%       90.6%
W_BESTI_LINEI          97.0%      0.0%      57.5%         42.5%       92.5%
W_BESTI_LINEI_METHOD   97.7%      6.5%      93.5%         0.0%        86.8%
W_BEST_AREA            87.2%      47.3%     7.7%          45.0%       77.4%
W_BEST_AREAI           87.2%      47.3%     7.7%          45.0%       77.4%
W_BEST_BEST            96.5%      0.0%      48.9%         51.1%       90.6%
W_BEST_BESTI           95.3%      0.0%      21.0%         79.0%       90.6%
W_BEST_DIFF            92.8%      40.0%     13.7%         46.3%       84.9%
W_BEST_LINE            94.2%      0.0%      16.9%         83.1%       88.7%
W_BEST_LINEI           97.0%      0.0%      57.5%         42.5%       92.5%
W_EXACTI_AREA          92.8%      0.0%      50.0%         50.0%       84.9%
W_EXACTI_AREAI         92.8%      0.0%      50.0%         50.0%       84.9%
W_EXACTI_BEST          93.7%      25.3%     54.2%         20.5%       88.7%
W_EXACTI_BESTI         95.3%      0.0%      72.6%         27.4%       90.6%
W_EXACTI_LINE          96.1%      0.0%      67.3%         32.7%       92.5%
W_EXACTI_LINEI         96.1%      0.0%      67.3%         32.7%       92.5%
W_EXACT_AREA           92.0%      0.0%      70.8%         29.2%       83.0%
W_EXACT_AREAI          93.8%      0.0%      62.2%         37.8%       84.9%
W_EXACT_BEST           93.0%      22.6%     77.4%         0.0%        86.8%
W_EXACT_BESTI          94.6%      0.0%      100.0%        0.0%        88.7%
W_EXACT_LINE           95.3%      0.0%      100.0%        0.0%        90.6%
W_EXACT_LINEI          95.3%      0.0%      100.0%        0.0%        90.6%
W_SMARTI_AREA          85.6%      41.9%     6.8%          51.3%       75.5%
W_SMARTI_AREAI         85.6%      41.9%     6.8%          51.3%       75.5%
W_SMARTI_BEST          91.3%      18.3%     11.3%         70.4%       84.9%
W_SMARTI_BESTI         92.9%      0.0%      13.8%         86.2%       86.8%
W_SMARTI_LINE          93.3%      3.4%      41.6%         55.1%       88.7%
W_SMARTI_LINEI         95.3%      0.0%      21.0%         79.0%       90.6%
W_SMARTI_LINEI_METHOD  97.3%      5.6%      52.8%         41.7%       86.8%
W_SMART_AREA           85.6%      41.9%     6.8%          51.3%       75.5%
W_SMART_AREAI          85.6%      41.9%     6.8%          51.3%       75.5%
W_SMART_BEST           91.3%      18.3%     11.3%         70.4%       84.9%
W_SMART_BESTI          92.9%      0.0%      13.8%         86.2%       86.8%
W_SMART_LINE           93.3%      3.4%      41.6%         55.1%       88.7%
W_SMART_LINEI          95.3%      0.0%      21.0%         79.0%       90.6%
TABLE 4. The effect of different weights for the context

Method        Weight  % Correct  % Change  % Eliminated  % Spurious  % Ids Correct
W_BEST_BEST   0.3     93.7%      25.3%     15.7%         59.0%       88.7%
W_BEST_BEST   0.4     96.5%      0.0%      48.9%         51.1%       90.6%
W_BEST_BEST   0.5     93.7%      25.3%     15.7%         59.0%       88.7%
W_BEST_BEST   0.6     94.3%      28.0%     17.3%         54.7%       88.7%
W_BEST_LINE   0.2     93.4%      3.4%      11.5%         85.1%       86.8%
W_BEST_LINE   0.4     94.2%      0.0%      16.9%         83.1%       88.7%
W_BEST_LINE   0.6     91.2%      23.1%     8.5%          68.4%       84.9%
W_BEST_LINE   0.8     89.8%      20.0%     7.4%          72.6%       84.9%
W_EXACT_LINE  0.3     90.4%      0.0%      100.0%        0.0%        81.1%
W_EXACT_LINE  0.4     95.3%      0.0%      100.0%        0.0%        90.6%
W_EXACT_LINE  0.5     94.2%      0.0%      48.1%         51.9%       88.7%
W_EXACT_LINE  0.6     90.8%      19.7%     10.7%         69.7%       84.9%
W_SMART_LINE  0.2     91.4%      5.3%      8.8%          86.0%       84.9%
W_SMART_LINE  0.4     93.3%      3.4%      41.6%         55.1%       88.7%
W_SMART_LINE  0.6     91.2%      23.1%     8.5%          68.4%       84.9%
W_SMART_LINE  0.8     89.8%      20.0%     7.4%          72.6%       84.9%
TABLE 5. Location errors for some of the better methods.
Methods shown: W_BESTI_BEST, W_BESTI_BESTI, W_BESTI_LINE, W_BESTI_LINEI, W_BEST_BEST, W_BEST_BESTI, W_BEST_LINEI, W_EXACTI_BESTI, W_EXACTI_LINE, W_EXACTI_LINEI, W_EXACT_LINE, W_EXACT_LINEI, W_SMARTI_LINEI, W_SMART_LINEI. The body of the table is a grid of the tracked locations (numbered 0-52) against these methods, with an asterisk marking each location the method failed to track correctly; the column alignment of the grid did not survive extraction and is omitted here.
Second, there is little difference in the overall results from exact matching, best matching, and smart matching. This is somewhat surprising in that minor changes to the line would result in minor differences for the latter two, but major differences for the former. It turns out that while using exact match does eliminate some lines that change, it is also more likely to correctly identify lines that are eliminated. Methods that do approximate matching can sometimes find a similar line that occurs in a similar context elsewhere in the file.
Third, in terms of context, looking at the lines in arbitrary order (C_AREA), which did about as well by itself as doing exact matching (C_LINE) or approximate matching (C_BEST), did not do well when combined with other approaches. Considering the lines in arbitrary order was supposed to let new lines be inserted or deleted around the line in question without completely invalidating the match. It turned out, however, that doing this produced too many false matches and resulted in change errors.
Fourth, there is little difference between doing string difference matching for the context and doing exact matching. The context, when it changes, either changes significantly and the line itself doesn’t, or changes very little and the line itself changes. This is significant to note because the C_LINE methods are very fast while the C_BEST methods are significantly slower.
import java.text.*;
String key = proj + "@" + owner + "@" + group + "@" + user;
String key = cp.getName() + "@" + cp.getOwner() + "@" + user;
db_user = user;
db_user = System.getProperty("user.name");
Vector tbls = new Vector();
Vector tbls = new Vector();
return 0;
public synchronized String getLocationForId(ClideDataModel.TableView tbl,String id)
public synchronized Location getLocationForId(ClideDataModel.TableView tbl,String id)
Figure 1. Original and final versions of source lines that cause problems.
Fifth, the various methods based on abstract syntax trees did not as good a job as those based on lines and would not be worth the extra overhead of computing and storing trees. Our analysis shows this is primarily due to the association between trees and lines. Our experiment is based on source lines, not programming constructs. This was done deliberately since lines are the way the programmer normally refers to the code and most of the targeted applications deal with interacting with and saving information for the programmer. However, abstract syntax trees are not line oriented. It is often difficult to determine which node in the abstract syntax tree corresponds to a given line (or which line a node in the tree refers to). Nodes can easily span multiple lines or just a fraction of a line. This is further complicated by the association of comments the abstract syntax tree. We did our best to track the mappings, keeping track of where the original line is with respect to the chosen tree node, and using this information to select the target line, but still had significant problems. This problem would remain if one were to consider more sophisticated approaches, for example comparing program dependence graphs.
Sixth, although it is not shown directly in the tables presented here, most of the methods were sensitive to their thresholding value. When the threshold is set too high, the methods tend to eliminate lines when the line or its context has changed only slightly. When the threshold is too low, the methods tend to give spurious results, choosing a similar line and context when the actual line is removed. Additional (unreported) experiments were used to set the threshold appropriately for the better methods. In general, we found that a threshold of about 0.5 worked best, but the exact value depended on the methods being considered and their weights.

Finally, combinations that included U_DIFF were not effective. There were two reasons for this. The first is that diff computes its value based on both the context and the contents of the line. Attempting to add context or line information is essentially subverting its analysis. Secondly, U_DIFF yields a single result, not a probability distribution as the other methods do, and hence is difficult to combine.
In addition to the performance and the time cost of the various methods, one should also consider the space required for the methods. Some of the methods, for example those based on Unix diff or on comparing abstract syntax trees, effectively require that the original file be kept in its entirety. Context methods that require string matching require that the 6-10 lines of context that are matched be stored for each location. Similarly, methods that use string matching on the original line need to store that line. The methods with the least storage costs are those that either store just the line number or that do exact matching, since exact matching can be accurately approximated using good hash functions. Thus, the C_LINE method requires storing 6-10 integers (32 bytes) per location, while the C_BEST method requires storing 6-10 lines (about 300 bytes) per location. Similarly, the A_NUMBER method requires storing one integer (4 bytes), while the A_BEST method requires storing the original line (about 40 bytes). Thus, the method W_BEST_LINE, which did quite well, requires only 36 bytes of storage for each saved location.
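As an illustration of why hash-based exact matching is so compact, the sketch below (our own, not the experiment's code) stores only the hash of the tracked line plus the hashes of the k context lines on each side; with k = 4 that is eight context hashes plus one line hash, or 36 bytes per location.

class StoredLocation {
    final int lineHash;        // hash of the tracked line itself
    final int[] contextHashes; // hashes of the k lines before and after

    StoredLocation(String[] fileLines, int line, int k) {
        lineHash = fileLines[line].hashCode();
        contextHashes = new int[2 * k];
        for (int i = 0; i < 2 * k; i++) {
            // Lines line-k .. line-1, then line+1 .. line+k (skipping the line itself).
            int idx = line - k + i + (i >= k ? 1 : 0);
            contextHashes[i] = (idx >= 0 && idx < fileLines.length)
                ? fileLines[idx].hashCode() : 0;
        }
    }
}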
While we attempted, through our choice of locations, to make this experiment as comprehensive as possible, we do note that there might be some problems. First, we cannot claim to have covered all possible types of changes that one would want to track. We made sure we covered cases where the line changed, where the line and context changed, where the context changed, where comments changed, where there were insertions and deletions around the line, where the line was moved, where variables were renamed, where methods were refactored, etc., but still assume that there are other changes we did not consider. Second, because we only looked at one particular file, we restricted our experiment to a single coding style, which might affect the results. Third, we looked only at versions of the file that were checked into CVS. For most of the applications of source tracking we are interested in, one would actually look at much finer-grained versions, tracking changes after each change, after each save, or after each successful recompilation. Our sense is that this should only improve the performance of the methods, but we did not attempt to verify it. Finally, while we took great efforts to ensure that the methods we experimented with worked correctly and that the approaches they took were optimized with respect to accuracy, we did not prove that the code was correct.

To provide additional validation of the approach, we ran a simpler experiment tracking fourteen randomly chosen locations in twenty-seven versions of another Java program. Here W_BESTI_LINE tracked everything correctly and W_EXACTI_LINE indicated two of the locations had disappeared when they hadn’t. For comparison, U_DIFF got 81% of the
locations correct but managed to get every location wrong at least once.
6. CONCLUSION
Our experiment shows that tracking source lines through multiple versions of a file can be done quite effectively using relatively simple techniques. In particular, we would recommend the method W_BESTI_LINE, which is about as effective as any other method and which is fast and requires only a small amount of storage.
This method does a string comparison of the line in question to get a normalized match value between zero (no match) and one (exact match). It then looks at a context consisting of four lines before and after the line in question and counts the number of these lines that match the corresponding line in the new version. This is again normalized from zero to one. A combined score is computed by taking 0.6 times the line match and 0.4 times the context match. This is thresholded at a value of 0.5 so that smaller values are mapped to zero. Finally, for a given location, the location in the new file with the highest non-zero score is designated as the new match.
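Putting this together, here is a sketch of the recommended computation as we read it. It reuses the editDistance routine from the earlier sketch, strips indentation from the compared lines (the "I" in W_BESTI_LINE), and is our reconstruction rather than the released library code.

class WBestiLine {
    static final int K = 4;                 // context lines on each side
    static final double LINE_WEIGHT = 0.6;
    static final double CONTEXT_WEIGHT = 0.4;
    static final double THRESHOLD = 0.5;

    // Score line t of the new file as a match for line s of the old file;
    // a result of 0 means "not a match". The new location is the candidate
    // with the highest non-zero score, if any.
    static double score(String[] oldFile, int s, String[] newFile, int t) {
        // Best match on the indentation-stripped line, normalized to [0,1].
        String a = oldFile[s].trim();
        String b = newFile[t].trim();
        double lineScore = a.isEmpty() ? 0.0
            : Math.max(0.0, (a.length() - BestMatch.editDistance(a, b))
                / (double) a.length());

        // Exact-match count over the K lines before and after.
        int matches = 0;
        for (int d = -K; d <= K; d++) {
            if (d == 0) continue;
            String oldLine = lineAt(oldFile, s + d);
            String newLine = lineAt(newFile, t + d);
            if (oldLine != null && oldLine.equals(newLine)) matches++;
        }
        double contextScore = matches / (double) (2 * K);

        double combined = LINE_WEIGHT * lineScore + CONTEXT_WEIGHT * contextScore;
        return combined > THRESHOLD ? combined : 0.0;
    }

    private static String lineAt(String[] file, int i) {
        return (i >= 0 && i < file.length) ? file[i] : null;
    }
}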
Not only does our experiment show that this method works very well, but it also shows that methods that are significantly more sophisticated, either in terms of the nature of their matching or in terms of what they look at, do not really improve the result. This is important in that it means that the costs associated with these more complex methods are generally not justified.
The code used in these experiments can be obtained on the Internet as package limbotest in project wadi at ftp://ftp.cs.brown.edu/u/spr/wadi.tar.gz. The data files are available upon request. A simple library that maintains source locations using the approach outlined above is part of the wadi project.
7. ACKNOWLEDGEMENTS
This work is supported by the National Science Foundation through grant CCR0613162.
8. REFERENCES
1. Taweesup Apiwattanapong, Alessandro Orso, and Mary Jean Harrold, “A differencing algorithm for object-oriented programs,” ASE 2004, pp. 2-13 (September 2004).
2. Godmar Back and Dawson Engler, “MJ - a system for constructing bug-finding analyses for Java,” Stanford University Computer Systems Laboratory (2003).
3. Brenda S. Baker, “On finding duplication and near-duplication in large software systems,” Working Conference on Reverse Engineering, pp. 86-95 (1995).
4. I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, “Clone detection using abstract syntax trees,” ICSM ’98, pp. 368-377 (1998).
5. David Binkley, Susan Horwitz, and Thomas Reps, “Program integration for languages with procedure calls,” ACM Trans. on Software Engineering and Methodology Vol. 4(1), pp. 3-35 (1995).
6. Marc H. Brown and Robert Sedgewick, “A system for algorithm animation,” Computer Graphics Vol. 18(3), pp. 177-186 (July 1984).
7. David Hovemeyer and William Pugh, “Finding bugs is easy,” OOPSLA 2004 Companion, pp. 132-136 (2004).
8. Daniel Jackson and David A. Ladd, “Semantic diff: a tool for summarizing the effects of modifications,” ICSM ’94, pp. 243-252 (1994).
9. Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue, “CCFinder: a multilinguistic token-based code clone detection system for large scale source code,” IEEE Trans. on Software Engineering Vol. 28(7), pp. 654-670 (July 2002).
10. Gregor Kiczales, John Lamping, Anurag Mendhekar, Chris Maeda, Cristina Videira Lopes, Jean-Marc Loingtier, and John Irwin, “Aspect-oriented programming,” European Conference on Object-Oriented Programming (June 1997).
11. Gregor Kiczales, Erik Hilsdale, Jim Hugunin, Mik Kersten, Jeffrey Palm, and William Griswold, “An overview of AspectJ,” European Conference on Object-Oriented Programming (2001).
12. Miryung Kim and David Notkin, “Program element matching for multi-version program analyses,” MSR ’06 (May 2006).
13. Miryung Kim, David Notkin, and Dan Grossman, “Automatic inference of structural changes for matching across program versions,” ICSE ’07, pp. 333-343 (May 2007).
14. Linda Lamb and Arnold Robbins, Learning the vi Editor, O’Reilly (November 1998).
15. Janusz Laski and Wojciech Szermer, “Identification of program modifications and its applications in software maintenance,” ICSM ’92, pp. 282-290 (November 1992).
16. Steven P. Reiss, “On the use of annotations for integrating the source in a program development environment,” pp. 25-36 in Human Factors in Analysis and Design of Information Systems, ed. R. Traunmuller, North-Holland (1990).
17. Steven P. Reiss, FIELD: A Friendly Integrated Environment for Learning and Development, Kluwer (1994).
18. Steven P. Reiss, “Working with patterns and code,” Proc. HICSS-33 (January 2000).
19. Steven P. Reiss, Christina M. Kennedy, Tom Wooldridge, and Shriram Krishnamurthi, “CLIME: an environment for constrained evolution,” Proc. 25th ICSE, pp. 818-819 (May 2003).
20. Steven P. Reiss, “Incremental maintenance of software artifacts,” Proc. ICSM 2005, pp. 113-122 (September 2005).
21. Steven P. Reiss, “Visualizing program execution using user abstractions,” SOFTVIS ’06, pp. 125-134 (September 2006).
22. D. M. Ritchie, S. C. Johnson, M. E. Lesk, and B. W. Kernighan, “The C programming language,” Bell Systems Tech. J. Vol. 57(6), pp. 1991-2020 (1978).
23. Atanas Rountev, Ana Milanova, and Barbara G. Ryder, “Points-to analysis for Java using annotated constraints,” OOPSLA 2001, pp. 43-55 (2001).
24. John T. Stasko, “TANGO: a framework and system for algorithm animation,” IEEE Computer Vol. 23(9), pp. 27-39 (September 1990).
25. Margaret-Anne Storey, Li-Te Cheng, Ian Bull, and Peter Rigby, “Shared waypoints and social tagging to support collaboration in software development,” CSCW ’06, pp. 195-198 (November 2006).
26. Margaret-Anne Storey, Li-Te Cheng, Ian Bull, and Peter Rigby, “Waypointing and social tagging to support program navigation,” CHI 2006, pp. 1367-1372 (April 2006).
27. Wuu Yang, “Identifying syntactic difference between two programs,” Software Practice and Experience Vol. 21(7), pp. 739-755 (July 1991).