1 Introduction. The objective of software reverse engineering is most often to gain a sufficient design-level understanding in support of maintenance, adaptation ...
Data-mining in Support of Detecting Class Co-evolution Zhenchang Xing and Eleni Stroulia Computing Science Department University of Alberta Edmonton AB, T6G 2H1, Canada {xing, stroulia}@cs.ualberta.ca
Abstract In an evolving system maintained over a long time period, there exist many non-trivial relationships among system classes, such as class co-evolutions, which usually are not easily perceivable in the source code. However, unfortunately, the continuing evolution of large, long-lived systems leads to lost information about these hidden relationships. In this paper, we propose a method for recovering such lost knowledge by data mining method. This method relies on the UMLDiff algorithm that, given a sequence of UML class models of a system, surfaces the design-level changes over its life span, thus eliminating the need for high quality modification reports and nonintuitive software code-based metrics. We employ Apriori association rule mining algorithm to the transactional database of class modifications, which elicit previously unknown or undocumented co-evolving relations among two or more classes. The recovered knowledge facilitates the overall understanding of system evolution and the planning of future maintaining activities. We report on one real world case study evaluating our approach.
1
Introduction
The objective of software reverse engineering is most often to gain a sufficient design-level understanding in support of maintenance, adaptation and feature extension [6]. In object-oriented systems, classes model abstractions of real-world entities around which these systems are designed. In an evolving system maintained over a long period, there exist many non-trivial relationships among system classes, which may not be intentional and usually are not easily perceivable in the source code. A particularly interesting such relation is class coevolution. Frequently, classes not explicitly related in the system design exhibit parallel evolution history. This phenomenon may indicate an implicit inter-dependence which, when understood, can be valuable in guiding subsequent evolution of the system in question. First, the system maintainers may decide to restructure the system in order to eliminate this interdependence, thus evolving it into a more modular and less coupled design. Alternatively, they may document the inter-dependence as a predictor of maintenance activities, so that when some of the co-
evolving classes have to be modified the rest of the cluster is also examined and retested. Recovering and making explicit such “lost knowledge” to increase the overall comprehensibility of a given system is one of the major objectives of reverse-engineering research. And, as many have already recognized [3,13], this task can benefit from Artificial Intelligence (AI), and more specifically data-mining, techniques. More specifically, Shirabad [15] recently proposed a method based on inductive-learning algorithm to address the problem of detecting the co-evolution of two code modules. Based on past maintenance experience, recorded in the form of change requests and code-update records, he explored the supervised inductive-learning method for recognizing co-updated modules and using this relation to predict whether updating one source file may require a change in another file. An important shortcoming of this work is its knowledge requirements. It essentially assumes the existence of a fairly detailed change-tracking system in which all change requests are recorded. Then these requests are co-related with the code updates committed in response to “closing” the requests. These co-related requests and updates become the examples input to the learning algorithms. Unfortunately, however, such consistently kept change-tracking systems are not always available [6]. In our work on detecting class co-evolution, we have adopted class-design models of subsequent system snapshots (which may be released versions or simply snapshots checked-out in regular time intervals) as the primary input of our method. These class-design models are easily obtainable, given a version-management system and any of a variety of existing round-trip softwaredevelopment tools [23,24]. The fundamental intuition underlying our class co-evolution detection method is that by comparing a sequence of snapshots of system’s class models, one can extract a history of the evolution of each individual class in terms of the “additions”, “deletions”, “moves”, “renamings” and “signature changes” of its super- and sub- classes, interfaces, and their fields and methods. Then rule- or sequence- mining algorithms, such as Apriori for example, can be used to detect common change co-occurrences among these class histories, thus uncovering co-evolving classes.
In addition to its simpler knowledge assumptions, our approach exhibits two advantages over Shirabad’s method. First it is unsupervised: unlike Shirabad’s method that requires a set of co-evolution examples in the form of sets of modules that were updated for the same change request, our method does not require labeled training examples. Second, it is relatively more scalable because it focuses on the changing system classes instead of all its modules. The remainder of the paper is structured as follows. Section 2 relates this work to previous researches. Section 3 presents the overall methodology and rationale of our approach. One case study illustrating our approach is discussed in section 4. Finally, Section 5 concludes with a summary of the lessons we have learned to date and our plans for future work.
Furthermore, they employ deduction algorithms that are computationally demanding.
2.2
Semantic UML model manipulation
There has been some work at analyzing the changes to software at the design level. Egyed [8] has investigated a suite of rule-based, constraint-based and transformational comparative methods for checking the consistency of the evolving UML diagrams of a software system. Selonen et al. [14] have also developed a method for UML transformations, including differencing. However, these projects have not explored the product of their analyses in service of evolution understanding.
3
Methodology
Our class co-evolution detection work spans over two related-research themes: first, the general area of employing artificial-intelligence methods in support of software reverse engineering, and second, the semantic manipulation of UML design models.
In this section, we present the structural modification detection algorithm, UMLDiff. We discuss the transaction database of class evolution histories and the Apriori algorithm used for detecting co-evolving classes given such a database. Finally, we discuss potential applications of recovered class co-evolution knowledge in the context of software maintenance.
2.1
3.1
2
Related work
AI in support of reverse engineering
Artificial-intelligence methods can benefit several reverse-engineering processes, and design recovery in particular. We have already discussed the work most similar to ours in terms of objectives and types of AI algorithms employed. Shirabad et al. [15] applied inductive machine learning method to elicit co-update file pairs of a subject system. The inductive learning method they applied, requires predefined classes, and need a lot of effort to select and extract features and label training samples, which significantly affect the quality of learned concepts or models. Besides, their methods require high quality change reports that are not always readily available. Gall et al. [10] use information in the product release history of a system to uncover logical coupling among modules based on sequence matching. Zimmermann et al. [19] identify (heavily dependent on visualization of historical data stored in CVS archive) the fine-grained coupling between program entities like methods and fields. Devanbu et al. [7] employed expert system and knowledge base as their underlying technology to assist in representing and deducing the relationships among components of software system. They built a system called LaSSIE, which integrates architectural, conceptual, code views of a large software system into a knowledge base represented in formal knowledge representation language and provides a semantic reasoning mechanism based on formal inference for developers to discover the structure of software system. Such knowledge-based system generally requires trained knowledge engineers to interview experts and build knowledge base and need a great effort to maintain and evolve knowledge base as system evolves.
UMLDiff: Class-modification detection
The overall problem of detecting and representing changes to data is important for version and configuration management. It is an active research area on its own in the area of data management. Probably the most well known algorithm for textual comparisons, GNU diff, was discussed as the string-to-string correction problem using dynamic programming in [17]. Used in the context of code differencing, it reports changes at the code line level rather than at higher level of abstraction of system structural modifications. As more data and documents are stored in XML format, some sophisticated version control systems include XMLaware features to handle XML documents. The general tree-to-tree correction problem has been studied extensively [2], and has been applied to show differences between XML data [22]. However, such general treedifferencing algorithms report changes as “XML element modifications” ignoring the domain-specific semantics of the nodes. Let us consider XMI, the XML Metadata Interchange for UML models, as an example. When a class implements a new interface, a general XMLdifferencing tool would only report that a set of XML nodes were inserted but would not recognize the implementation of a new interface, since it does not understand the XMI semantics. For the same reason, if an attribute or a method were moved from one class to another, the change would most likely be reported as two separate activities of node addition and the deletion. Recognizing changes while taking into account the UML-specific semantics of XMI documents is exactly the purpose of UMLDiff. Relying on the semantics of the model data, it identifies “moves” by hypothesizing
correspondences between additions and deletions of similarly-named elements of the same type. In the context of software evolution, where local transformations, such as refactoring, frequently involve moving features from one class to another, recognizing such “moves” is essential. As one of the most elementary operation of refactoring, a “move” often represents the redistribution of information or the reorganization of the class hierarchy, frequent perfective-maintenance modifications, such as, for example, moving methods from classes suffering strong coupling. Figure 1, discussed below, shows an example of such “move” operations. In general, the problem of detecting the class-model changes between two snapshots of an object-oriented system can be viewed as a graph-difference problem, since class models can be viewed as specific types of directed graphs. This problem is NP-complete which makes an automatic approach impractical. Therefore, we have limited our initial exploration to considering only the inheritance hierarchies of the class model. UMLDiff is essentially a domain-specific tree-differencing algorithm, aware of the UML semantics captured by the XMI syntax. It takes as input two UML class models, corresponding to two snapshots of the system under analysis, represented in XMI. Such class models can be either produced in the software-design phase by the system developers, or they can be reverse engineered from the system code, using any of the currently available software engineering tools [24]. The first step of the algorithm is to parse the input forests of class models into two labelled tree structures, in which the tree nodes are labelled with the type of objects, such as class, method, etc., and their corresponding attributes, such as modifiers, data type, parameter list, etc. The target representation contains the application classes and interfaces, their fields, their methods and their inheritance, implementation, and nested class relations. Nested classes of a particular class are enclosed in a special element in the context of the containing class to distinguish them from its subclasses. Multiple-inheritance is handled by duplicating the class node (not including its children) under each of its super classes. The next step of the algorithm is to identify the afterbefore changes between the two tree structures, in terms of the “additions”, “removals”, “moves”, “renamings” and “signature changes” of super- and sub- classes, interfaces, and their fields and methods. Currently, the comparison is based on simple identifier matching of the signatures of the various object-oriented entities of the same type. This UML differencing process brings to the surface structural modifications to the software design from one snapshot to another. The results are represented as change trees, i.e., trees of structural modifications, which, if applied to the before version would result in the after version. Change trees are represented in an XML-based syntax and are visualized to the user as shown in Figure 1.
The different icons to the left of each node represent the different object-oriented entities: “class”, “interface”, “method”, and “field”. The top-right adornments show the modifiers of the object, for example, “abstract”, “static”, etc. The bottom-right adornments represent the status of a particular object. It can be plus sign for “add”, minus sign for “remove”, filled triangle for “rename”, empty triangle for “change signature”, arrow with minus sign for “move out from source”, arrow with plug sign for “move into target”. Figure 1 shows that a new abstract class, “Statement”, was created with three newly created abstract methods, “eachRentalString”, “footerString”, and “headerString”. The “value” methods of its two subclasses, “HTMLStatement” and “PlainStatement”, were pulled up into the new class “Statement”. This change tree corresponds to the differences between version 27 and 28 of the extended refactoring sample from Fowler’s book [9] as found in [20]. It represents the modifications to the class model after an “Extract Superclass” refactoring, which is described as follows: “if you have two classes with similar features, then create a superclass and move the common features to the superclass [9]”
Figure 1 An example of change tree
3.2
The detection of co-evolving classes
UMLDiff reports the structural changes between two snapshots of a system’s class models. There are N such models in an evolving software system with N successive snapshots, and consequently UMLDiff can be applied N-1 times to generate the differences between the (I+1)th and Ith versions, where 1≤ I