Studying Software Evolution Using Clone Detection
Filip Van Rysselberghe
Serge Demeyer
Abstract
In modern software engineering, researchers regard a software system as an organic life form that must continue to evolve to remain successful. Unfortunately, little is known about how successful software systems have evolved, and consequently little has been learned from previous experience. In this paper, we demonstrate a heuristic to reconstruct the evolution processes of existing software systems by exploiting techniques that detect duplication in large amounts of data. Based on two or three small experiments, we conclude that the heuristic helps to acquire a better understanding of the evolution processes of successful software, which may help to improve current software development methods.
1 Re-engineering Context
According to several scientific studies of large-scale software systems, more than 80% of the total budget of a software project is spent on system maintenance. However surprising it may seem at first, this percentage is even increasing when modern development methods are used [6]. The explanation for this observation is that modern software, more than its traditional counterparts, undergoes continual change. Software systems that do not evolve become progressively less useful and perish, a phenomenon quite similar to Darwin’s survival of the fittest. Not surprisingly, the recent trend in agile software development processes is to recognise change as the only constant factor in software development [8, 3].
Today, however, little is known about how changes affect software systems. Of course, there is the principle of software entropy, which states that an existing well-designed program gradually loses its structure and eventually turns into chaos [11]. On the other hand, experienced software engineers are well aware of this entropy phenomenon and take appropriate countermeasures in the form of refactoring [5]. Other re-engineering techniques, such as code visualisation, also help to reduce a system’s complexity. Unfortunately, these countermeasures and their use are seldom documented, and we currently lack concrete information about how successful software systems avoid the software entropy chaos.
A community of researchers tries to answer these why and what questions of software evolution [12]. Based on the idea that better insight into the evolution phenomenon must lead to improved methods, they address fundamental evolution issues such as the nature of software evolution and its impact. Our research fits within this view of software evolution research.
Figure 1: Dot plot showing the changes between two releases. The figure shows two matrices representing the comparison between two subsequent releases of the same program. In such a matrix, each column and each row represents a line of code, and a dot indicates that the corresponding lines are duplicates of each other. Thus, the perfect diagonal on the left-hand side shows that the first release is an exact copy of itself. On the right-hand side, however, the diagonal is broken in several locations, revealing added or deleted lines.
In this research we try to cope with the general lack of evolution information by applying a heuristic that we call the “software palaeontology” heuristic; it is explained in section 2. Section 3 presents a preliminary result that supports our belief in the usability of our approach.
2 Software Palaeontology Heuristic
To cope with the lack of evolution information, we propose a kind of software palaeontology approach. By comparing different releases of existing source code (the fossil remains of software systems) and analysing the differences, we reconstruct past evolution processes. In this way, we hope to learn how software systems gracefully adapt to changing requirements.
To compare different releases of a software system, we use existing techniques for detecting duplicated code fragments, also known as clone detection techniques. Because a lot of research effort has been spent on this topic over the last 20 years, these techniques can be regarded as scalable techniques for the analysis of software systems [1, 4, 2, 9, 10, 13]. However, we use them in a completely different way: rather than looking for matches, which represent duplicated code, we look for mismatches, which represent places where a program has changed.
Searching for such mismatches in the large amount of generated data is facilitated by dot plot visualisations. Dot plots are a visualisation technique originally developed for investigating similarities in DNA sequences and later adopted for analysing code duplication [4]. Figure 1 shows how such a visualisation helps to identify mismatches.
Our technique is in some sense similar to the work of Pinzger and Gall [14], who use predefined patterns to recover a system’s architecture, whereas we search for all occurring evolution patterns. Godfrey and Tu [7] use a technique they call “origin analysis” to study the evolution of a software system. Although this technique is also built around a clone detection technique, it is used in the much narrower context of finding the evolutionary predecessor of an entity.
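The comparison described in this section can be illustrated with a minimal Python sketch. It rests on assumptions that are ours and not taken from the paper: two lines count as duplicates when they are identical after all whitespace is removed, rows stand for lines of the older release and columns for lines of the newer one, and the names normalise, dot_plot and render are hypothetical. The clone detectors cited above [1, 4] are considerably more elaborate.

```python
def normalise(line: str) -> str:
    """Remove all whitespace so that layout changes do not hide a match."""
    return "".join(line.split())


def dot_plot(old_lines, new_lines):
    """Return the set of (old_index, new_index) pairs whose lines match.

    Blank lines are skipped because they would match each other everywhere.
    """
    positions = {}
    for j, line in enumerate(new_lines):
        key = normalise(line)
        if key:
            positions.setdefault(key, []).append(j)
    dots = set()
    for i, line in enumerate(old_lines):
        for j in positions.get(normalise(line), []):
            dots.add((i, j))
    return dots


def render(dots, n_old, n_new):
    """Draw the plot as text: one row per old line, one column per new line."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for j in range(n_new))
        for i in range(n_old)
    )


# Two tiny, made-up releases: one line changed and one line inserted.
old = ["int a = 0;", "a = a + 1;", "print(a);", "return a;"]
new = ["int a = 0;", "a = a + 2;", "log(a);", "print(a);", "return a;"]
plot = render(dot_plot(old, new), len(old), len(new))
print(plot)
```

In the printed plot, the changed line leaves an empty row and the inserted line shifts the rest of the diagonal one column to the right, which is the kind of mismatch the heuristic looks for.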
3 Preliminary Results
At the beginning of our research we wondered how general refactoring operations, which are said to be useful for keeping a system’s entropy under control, would be represented in a dot plot comparison. Therefore, we took one important refactoring, namely “Pull Up Method”, and tried to come up with a corresponding pattern that can easily be recognised in a dot plot visualisation.
3.1 Pull Up Method
In large software systems, code is often duplicated for various reasons. However, such duplicated code makes the code more complex and thus increases software entropy. A typical refactoring that removes duplicated code is “Pull Up Method”, which moves duplicated methods higher up the class hierarchy, replacing them with a single method that is reused by all subclasses [5]. The pull up method refactoring thus plays an important role in reducing complexity.
Because of this role in reducing program entropy, it would be interesting to be able to detect when the refactoring is used during evolution, not least because this helps to evaluate the usability and impact of the refactoring on a program’s evolution. By applying our clone detection techniques to a number of small cases in which we applied the pull up method refactoring, we established a dot plot pattern that indicates the presence of the refactoring between two versions.
Before showing the actual pattern, we go through the mismatches that occur when a method that is present in a certain class is pulled up in the next version of the system. If the system remained unchanged between two versions, each line of the first version would match the corresponding line of the second version, resulting in a diagonal when visualised. Now, however, a method is moved from one class to another. Because the method is removed from its class, the comparison of that class with itself is no longer a one-to-one match (a diagonal) but shows a shift where the moved method used to be. As an example, consider a class in which the method used to start on line 40. Once this method is moved, line 40 of the new version no longer matches line 40 of the old version but line 50, since the original method was 10 lines long. Visually, this removal appears as a small piece of diagonal removed from the longer diagonal, as denoted by arrow a in area A of figure 2.
Since the method was moved and not merely removed, an additional effect can be seen. The move introduces a match between the original class and the class that now contains the method. Visually, this match appears as a small diagonal between the classes of both versions (arrow b in area B of figure 2). When both patterns are combined, it looks as if a small piece of diagonal was cut out of the large diagonal (effect A) and moved to another class in the same row (effect B), as demonstrated by figure 2.
Figure 2: Dot plot visualisation containing a pull up refactoring. Area A corresponds to the removal, area B to the introduction in the new class.
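The line-40 example can be reproduced with a small Python sketch. Everything in it is made up for illustration: the class contents are synthetic strings, the 5-line superclass and the helper name matches are hypothetical, and lines match only when they are identical. It is a sketch of the two effects under these assumptions, not the tooling used in our experiments.

```python
def matches(old_lines, new_lines):
    """Return sorted (old_line, new_line) pairs, 1-based, for identical lines."""
    where = {}
    for j, line in enumerate(new_lines, start=1):
        where.setdefault(line, []).append(j)
    return sorted(
        (i, j)
        for i, line in enumerate(old_lines, start=1)
        for j in where.get(line, [])
    )


# Synthetic subclass: 39 lines, a 10-line method on lines 40..49, 11 more lines.
method = [f"method statement {k};" for k in range(10)]
before = [f"sub statement {k};" for k in range(1, 40)]
after = [f"sub statement {k};" for k in range(50, 61)]
superclass = [f"super statement {k};" for k in range(1, 6)]

old_sub = before + method + after   # old version: method still in the subclass
new_sub = before + after            # new version: method removed
new_super = superclass + method     # new version: method pulled up (lines 6..15)

# Area A: the subclass compared with itself across the two versions.
area_a = matches(old_sub, new_sub)
# Area B: the old subclass compared with the new superclass.
area_b = matches(old_sub, new_super)

print((50, 40) in area_a)                               # shifted diagonal
print([pair for pair in area_a if 40 <= pair[0] <= 49])  # the cut-out piece
print(area_b[0], area_b[-1])                             # the moved piece
```

In area_a, old line 50 pairs with new line 40 and no old line between 40 and 49 has a partner, which is the gap in the diagonal. In area_b, exactly those ten old lines form a short diagonal in the superclass, in the same rows as the gap.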
3.2 Scalability
A problem that might arise is the scalability of the visualisation, certainly for large systems. However, based on our early experiments, we are convinced that it is possible to reduce the amount of data by visualising only those classes that contain at least one larger match. In one of our experiments, for example, this would reduce a visualisation of 9737 classes to one of only 140 classes.
Another property we could exploit is that changes visually affect a vertical and a horizontal region. Take the pull up method refactoring, for example: the removal and the insertion of the match are located in the same horizontal region. We could therefore extend our visualisation framework with a summary of these horizontal and vertical zones, which would aid scalability. Finally, there is still the possibility of using automatic pattern recognition techniques.
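The class filter mentioned in this section could look roughly like the following Python sketch. The interpretation is an assumption on our side: a “larger match” is taken to be a diagonal of at least min_run consecutive identical lines, and the threshold value, the function names and the toy classes are all hypothetical.

```python
def longest_run(old_lines, new_lines):
    """Length of the longest diagonal of consecutive identical lines."""
    best = 0
    prev = [0] * (len(new_lines) + 1)
    for a in old_lines:
        cur = [0] * (len(new_lines) + 1)
        for j, b in enumerate(new_lines, start=1):
            if a == b:
                # Extend the diagonal that ended one row up, one column left.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def classes_worth_plotting(old_classes, new_classes, min_run=3):
    """Keep classes that take part in at least one match of min_run lines."""
    keep_old, keep_new = set(), set()
    for old_name, old_lines in old_classes.items():
        for new_name, new_lines in new_classes.items():
            if longest_run(old_lines, new_lines) >= min_run:
                keep_old.add(old_name)
                keep_new.add(new_name)
    return keep_old, keep_new


# Toy input (assumed): class name -> list of source lines, per version.
old_classes = {
    "A": ["x1", "x2", "x3", "x4"],
    "B": ["y1", "y2"],
    "C": ["z1", "z2", "z3"],
}
new_classes = {
    "A": ["x1", "x2", "x3", "x9"],
    "B": ["y1", "q"],
    "C": ["w1", "w2", "w3"],
}
keep_old, keep_new = classes_worth_plotting(old_classes, new_classes)
print(sorted(keep_old), sorted(keep_new))
```

With the assumed threshold of three lines, only class A is kept in both versions, so the dot plot would be drawn for one class instead of three. The pairwise comparison is quadratic in the number of classes, so a real implementation would need an index over lines rather than this brute-force loop.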
References
[1] Brenda Baker. On finding duplication and near-duplication in large software systems. In Working Conference on Reverse Engineering, 1995.
[2] I. D. Baxter, A. Yahin, L. Moura, and M. Sant’Anna. Clone detection using abstract syntax trees. In International Conference on Software Maintenance, 1998.
[3] K. Beck. Extreme Programming Explained. Addison-Wesley, 1999.
[4] S. Ducasse, M. Rieger, and S. Demeyer. A language independent approach for detecting duplicated code. In International Conference on Software Maintenance, 1999.
[5] Martin Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley, July 1999.
[6] R. Glass. Maintenance: Less is not More. IEEE, July/August 1998.
[7] Michael Godfrey and Qiang Tu. Tracking structural evolution using origin analysis. In Proceedings of the International Workshop on Principles of Software Evolution, pages 117–119. ACM Press, 2002.
[8] J. Highsmith. Adaptive Software Development. Dorset House Publishing, 1999.
[9] J. H. Johnson. Identifying redundancy in source code using fingerprints. In CASCON, 1993.
[10] K. Kontogiannis. Evaluation experiments on the detection of programming patterns using software metrics. In Working Conference on Reverse Engineering, 1997.
[11] M. Lehman and L. Belady. Program Evolution: Processes of Software Change. Academic Press, 1985.
[12] M. M. Lehman and J. F. Ramil. Software evolution. 2001.
[13] J. Mayrand, C. Leblanc, and E. Merlo. Experiment on the automatic detection of function clones in a software system using metrics. In International Conference on Software Maintenance, 1996.
[14] Martin Pinzger and Harald Gall. Pattern-supported architecture recovery. In Proceedings of the International Workshop on Program Comprehension, pages 53–xx, 2002.