An Automatic Approach to identify Class Evolution ... - Semantic Scholar

24 downloads 4156 Views 194KB Size Report
main name server. Almost all the ..... an old class A with a new one B changing the class name ..... an open source Domain Name Server (DNS) written in Java.
An Automatic Approach to identify Class Evolution Discontinuities Giuliano Antoniol∗ Massimiliano Di Penta∗ and Ettore Merlo∗∗ [email protected], [email protected], [email protected] RCOST - Research Centre on Software Technology University of Sannio, Department of Engineering Palazzo ex Poste, Via Traiano 82100 Benevento, Italy ∗

Abstract When a software system evolves, features are added, removed and changed. Moreover, refactoring activities are periodically performed to improve the software internal structure. A class may be replaced by another, two classes can be merged, or a class may be split in two others. As a consequence, it may not be possible to trace software features between a release and another. When studying software evolution, we should be able to trace a class lifetime even when it disappears because it is replaced by a similar one, split or merged. Such a capability is also essential to perform impact analysis. This paper proposes an automatic approach, inspired on vector space information retrieval, to identify class evolution discontinuities and, therefore, cases of possible refactoring. The approach has been applied to identify refactorings performed over 40 releases of a Java open source domain name server. Almost all the refactorings found were actually performed in the analyzed system, thus indicating the helpfulness of the approach and of the developed tool. Keywords: Software Evolution, Releases, Refactoring, Traceability

1. Introduction Software systems continuously evolve to meet everchanging user needs. As a system evolves, new functionalities are added and existing ones are removed or modified. In particular, when we look at the evolution of an Object– Oriented (OO) software system, we see that the lifetime of a class is only a limited segment in the whole system evolution. When a class is not considered useful anymore, it can be removed. On the contrary, new features can imply the creation of new classes. The latter is, however, only part of the reality. To improve the software internal structure, maintainability and comprehensibility, refactoring [8, 11]

∗∗ ´

Ecole Polytechnique de Montr´eal Montr´eal, Canada

activities are periodically performed. At class level, such refactorings may imply that new classes can be obtained by splitting or merging old ones. Moreover, it may happen that a class can be obtained factoring out part of another class or, on the contrary, a class can be merged with another. Often, for different reasons, those refactorings are not documented. The lack of configuration management and, in general, of a well-defined software development process can cause the lost of traceability between related classes. As a consequence, the software system maintainability and, in general, its quality, tend to deteriorate. Software evolution activities rapidly become extremely difficult as any change may produce unpredictable side effects on other portions of the system. It would be greatly useful to connect the independent segments representing class evolution during system lifetime. If, at a given release, a class terminates its life and other two classes, obtained splitting the first one, appear, then the three segments representing such classes should be connected to indicate such a relationship. The first, intuitive consequence of this information is related to understand software evolution: the lifetime of a class should be studied also across events such as renaming, replacement, merge and split. Second, the detection of refactorings helps locating functionalities over classes, thus giving a relevant support to software maintenance and, in particular, impact analysis. Last but not least, the approach can be a support to facilitate the reuse of test cases developed for the old class(es). This paper proposes to adopt techniques inspired by Information Retrieval (IR) approaches to automatically identify and document evolution discontinuities when analyzing the evolution of OO source code at class level. The approach is inspired from a number of studies [3, 5, 16, 17] aimed at recovering any mapping between software artifacts (e.g., free text documentation and code), or between subsequent releases of a software system. Without loosing the generality of the proposed approach,

this paper will focus on a limited number of refactoring events, namely class renaming or replacement, class merge and split, and factoring in/out a class into/from another one. Nevertheless, the approach can be extended to other (even finer-grained, for example at method level) types of refactoring. The primary contributions of this paper are the following: • it proposes a novel, automatic, approach to locate software evolution discontinuities due to possible refactorings; • it explicitly defines, in terms of such an approach, a way to identify merge, split and rename; and • it presents results of a study aiming at identifying such discontinuities in the evolution of 40 releases of a Java system. The remainder of the paper is organized as follows. After Section 2 discusses the related works and their relationship with the present paper, Section 3 presents, for completeness’ sake, a brief overview of IR based traceability link recovery. Section 4 describes the proposed approach, while Section 5 reports and discusses the results from the case study. Section 6 concludes.

2. Related Work The problem of identifying refactorings in subsequent releases of a software system has been previously tackled by Demeyer et al. [7]. They proposed heuristics, based on class and method-level metrics, to identify several forms of refactoring. As stated by the authors, Demeyer et al. approach requires metric analysis being complemented by manual browsing. We agree with authors that, in general, each category of refactoring requires a particular heuristic to be identified and that, in general, manual browsing is necessary. However, our aim is to propose an approach that automatically, without any human intervention, identifies some potential refactorings. Second, we are, in this paper, mainly interested in those refactorings that create discontinuities in a class lifetime. Zimmerman et al. [23, 22] presented a data mining approach over CVS repositories to identify changes, to detect coupling between fine–grained entities and to predict future or missing changes. Xing and Stroulia [21] presented an approach to understand class evolution in OO software. More generally, the issue of recovering traceability links between portions of code, or between code and free text documentation, has gained interest in recent years. In particular, a number of papers have been published in the area of impact analysis. For example, Turver and Munro [20]

assumed the existence of some form of ripple propagation graph, describing relations between software artifacts. When available, Concurrent Versioning System (CVS) data is used to track features between releases. Fischer et al. [10] proposed to combine CVS revision data with bug reporting data and to add some missing information such as, for example, merge points. The same authors also performed, on the same data, an analysis devoted to track features [9]. Finally, Gall et al. [13] analyzed CVS release history data for detecting logical coupling. Ratiu et al. [18] use version histories to detect design flaws. IR approaches were for the first time adopted by Maarek et al. [15]. Maarek introduced an IR method for automatically assembling software libraries based on a free textindexing scheme. Maletic and Marcus [16] presented a system called PROCSSI that uses Latent Semantic Indexing (LSI) to identify semantic similarity between pieces of code. Authors also prove that this semantic similarity measure is helpful in the comprehension task. Antoniol et al. [6] presented a method to establish and maintain traceability links between code and free text documents. The method exploits probabilistic IR techniques to estimate a language model for each document or document section, and applies Bayesian classification to score the sequence of mnemonics extracted from a selected area of code against the language models. The same method was applied in [2], to recover traceability links between the functional requirements and the Java source code, extending and validating the previous results on a more complex and difficult case study. The investigation was then extended [4, 5] to vector space model, to compare different model families and assess the relative influence of affecting factors. Maletic and Marcus performed a comparison of the probabilistic and vector space approaches with the LSI [17]. Antoniol et al. [3] used string matching and graph algorithms were used to recover a traceability mapping between different releases of OO systems. In particular, thresholds were used to discard matching (graph edges) unlikely to represent class evolution. While that work [3] was focused on traceability among classes, the present paper focuses on situations that create discontinuities in a class lifetime, and, as it will be clearer in Section 4, we rely on sum of vectors to model and identify situations of class split and merge.

3. Background To identify links between classes obtained from refactoring, we applied techniques inspired by IR vector space approaches. The vector space model treats documents and queries as vectors [14]; documents are ranked against queries by computing some similarity functions between

the corresponding vectors. For our study’s purpose, documents and queries are just classes of different releases. Each element of the vector corresponds to an identifier, extracted from the class source code. If |V | is the size of the vocabulary, then the vector [di,1 , di,2 , . . . di,|V | ] represents the class Di . The j-th element di,j is a measure of the weight of the j-th term of the vocabulary in the class (i.e., document) Di . To quantify the similarity, a well known IR metric called tf -idf [19] was used. According to this metric, the j-th element di,j is derived from the term frequency tfi,j of the j-th term in the document Di and from the inverse document frequency idfj of the same term over the entire set of documents, i.e., classes in a given release. The term frequency tfi,j is the ratio between the number of occurrences of the j-th term, that is identifier, attribute or method name, over the total number of terms contained in the class (document), Di . The inverse document frequency idfj is defined as: idfj =

T otal N umber of Documents N umber of Documents containing the j th term

The vector element di,j is:

Mapping of classes into a vector space is performed via tf-idf representation, described in Section 3, which in turn requires extraction of identifiers from class source code. Once identifiers are available, a dictionary encompassing all identifiers encountered in the software evolution is constructed and each class encoded by means of a vector which components are the tf-idf of the identifiers extracted from the class. The closer to one the cosine of the angle between two vectors (encoding two classes) is, the more similar classes are. It should be noticed that in vector-space based traceability recovery approaches [5], vectors represent documents at different levels of abstraction e.g., high-level documents and source code, and thus, to improve accuracy stopping and stemming are needed. Studying class evolution to locate refactoring events does not require to link concepts expressed at different abstraction levels; classes of subsequent releases are likely to contain the same or very similar identifiers. Moreover, language predefined identifiers play a role in class syntax and semantic and, since we are not abstracting higher level concepts, they should not be removed. In other words, while for traceability recovery [5] identifiers underwent stemming and stopping, the present approach avoids any identifier processing.

di,j = tfi,j ∗ log(idfj )

4.1 Cases of Refactoring Considered The term log(idfj ) acts as a weight for the frequency of a word in a document: the more the word is specific to the document, the higher the weight will be. A query SQ is represented in the same way of a document by a vector [q1 , q2 , ...q|V | ]. The similarity between a document Di and a query SQ is computed as the cosine of the angle between the corresponding vectors according to the equation:

σ(Di , SQ ) = pP

P

Di,k Sk P 2 2 k (Sk ) k (Di,k ) k

The following subsections will explain in detail the different class refactoring we considered in this paper. Each case is presented as a recipe that summarizes possible motivations, conditions to be applied to identify candidate classes and heuristics useful to reduce the search space. Conditions will be stated in terms of cos(θ), i.e., the cosine of the angle between vectors representing classes and computed as indicated in Section 3. Heuristics state constraints under which the search is performed.

(1)

4. Approach Description Our approach to locate refactoring events stems from vector spaces, more precisely from the linear algebra and vector composition i.e., vector sum. Suppose that classes are mapped into elements of a vector space; if a class is split into two new classes, then vectors representing the new classes have to add up to the original one. However, when refactoring is performed, code may be added, deleted or modified thus, in general, composition of refactored classes will only approximate the original class (see Figure 1). Clearly, a massive renaming of the identifiers constitutes a limitation for our approach.

4.1.1 Class Replacement This happens (Figure 1-a) when developers decide replace an old class A with a new one B changing the class name and, possibly, partially modifying it to change its purpose. Motivation: replace a class with a new one by modifying class code and changing its name to better reflect the class new purpose; Condition: the class B must have a different class name, and is likely to have a high similarity with A such that cos(θ) is: 1. maximum (i.e., the class is the most similar to A); and

Figure 1. Possible cases of class refactoring

2. above a threshold, under which the two classes are to be considered as significantly different i.e., the difference is not simply due to the evolution of the class across two subsequent releases.

Heuristic: 1. classes A and B must belong to different and subsequent releases say release n and n + 1;

2. class B is not present in release n and class A is not present in release n + 1.

4.1.2 Class Extraction This happens (see Fowler book [11]) when a class A continues to exist between releases n and n + 1, while part of it is factored out in a class B. Motivation: a new class B is extracted from A when B, for example, contains a data structure or both methods and attributes that i) are not highly cohesive with the rest of class A and ii) are used by classes other than class A. Condition: the class B will be identified as a class extracted from A if, as shown in Figure 1-b: 1. the cosine of the angle between class An and the vector resulting from the sum of An+1 and B is above a

threshold (the same used in the previous case) under which the “merge” of An+1 and B is significantly different from An ; and 2. such a cosine is greater than those obtained for any other candidate extracted class.

4.1.4 Class Merge By reversing time axis, class merge is brought back to class extraction, thus very similar and dual considerations hold. For example, developers may add a feature, then they could realize that such a feature should be part of an already existing class and thus perform the merge.

It is important to note that the tf-idf for the sum vector are computed after identifiers of the two classes to sum have been merged in a unique list. This avoids giving a low weight to identifiers appearing in both classes.

Motivation: class B is actually implementing part of class A state or implementing a subset of A behavior or a mixed of both;

Heuristic:

Condition: see the condition of class extraction;

• class A must exist in releases n and n + 1;

Heuristic: see the heuristic for class extraction.

• class B does not exist in release n;

4.1.5 Class Merge into a new Class

• the cosine between An and An+1 is below the threshold of change that is considered to be due to class evolution.

This case is the dual of class split. In this case, two classes, say B and C, disappear after release n, while class A appears.

4.1.3 Class Split This case is similar to class extraction. It may happen that developers decide to split a class into two different ones, for example because the original class is not highly cohesive, and it models two different concepts or entities. On the contrary of the previous case, class A disappears in release n, while classes B and C appear in release n + 1. Motivation: new classes B and C are obtained by splitting class A, to achieve higher cohesion (i.e., B and C have different purposes) and low coupling.

Motivation: as in class merge, however the overall behavior of the entity modeled by the new class substantially differs from the merged classes, thus class replacement take place. Condition: see the condition of class split; Heuristic: see the heuristic of class split. 4.1.6 Recombining Classes

Condition: classes B and C will be identified as classes obtained by splitting A if, as shown in Figure 1-d:

This is the more general case. Developers may have assigned some responsibilities to the wrong classes, they decide to perform refactoring, and thus they move methods/attributes from a class to another. Two different situations may happen:

1. the cosine of the angle between class A and the vector resulting from the sum of B and C is above a threshold (the same used in the previous case); and

1. given two classes, A and B at release n, part of methods/attributes can migrate from A to B and vice-versa. while being substantially changed, both classes will still be present in release n + 1,

2. such a cosine is greater than those obtained for any other combination of classes.

Heuristic: • class A must exist in releases n and not in n + 1; • classes B and C do not exist in release n; and • the cosine between A and B+C is above the threshold.

2. the combination of A and B will produce two new classes, C and D, that will appear in release n + 1, where A and B will not be present anymore.

Motivation: better (re)assign responsibilities to classes, improve cohesion, reduce coupling. Condition: given the two vectors Sn and Sn+1 , obtained from the sum of the A and B at release n and n + 1 respectively,

Type of Refactoring Performed Replacement Extraction Split Merge Merge into new class Recombination Recombination into new classes

Classes involved in release n A A A A, B B, C A, B A, B

Classes in involved in release n + 1 B A, B B, C A A A, B C, D

Condition

cos(A 6 B) > threshold cos(An 6 (An+1 + B)) > threshold cos(A 6 (B + C)) > threshold cos((An + B) 6 An+1 ) > threshold cos((B + C) 6 A) > threshold cos((An + Bn ) 6 (An+1 + Bn+1 )) > threshold cos((A + B) 6 (C + D)) > threshold

Table 1. Conditions that need to be verified for classes involved in refactoring. • the cosine of the angle between Sn and Sn+1 is above a fixed threshold; and • (only for the second situation, i.e., new classes C and D have been created by recombining A and B) such a cosine must be greater than those obtained with any other combination.

Heuristic: • classes A and B belong to both releases n and n + 1; or • classes A and B disappear in release n + 1, while classes C and D were not present in release n; • the cosine of the angle between Sn (A plus B) and Sn+1 (A plus B or C plus D) must be above a fixed threshold.

• software evolution across releases measured in terms of size metrics, or the amount of changes does not indicate radical changes across releases. A high threshold decreases computation time, and also reduces the number of class clusters. The latter aspect clearly reduces the amount of manual work needed to discern good results from false positives. However, the cost to pay may be a number of false negatives. On the contrary, a lower threshold (0.6 - 0.7) is better suited if: • the system is small, thus the search space is not very large, and the number of produced results is low; or • a high recall is needed, even at the cost of higher computation times and effort required to check the results. Clearly, the threshold value depends from case to case and it should be calibrated analyzing software system evolution across releases.

4.2 Calibrating the Threshold 4.3 Approach Scalability Threshold values influence method accuracy, as well as the required human intervention to perform inspection and discard false positives from good results. A threshold too high, i.e., close to 1, could produce too many false negatives and, in general, while increasing precision, decreases the approach’s recall [12]. This is due to the role played by threshold that, as reported in Table 1, is compared with the cosine of the angle between vectors. On the contrary, a lower threshold could produce false positives, thus decreasing the precision, while increasing the recall. However, higher human intervention is required to discard false positives. In the authors’ experience, the threshold needs to be calibrated in the range 0.8 - 0.9 when: • a large software system is under analysis, thus class combination between releases may be high and consequently the search space is large;

Complexity is tied to search space size, i.e., to the number of candidate classes between releases n and n + 1, and to the refactoring event to be located. In general, the complexity of the approach is bounded by polynomial time. The refactoring cases presented in Section 4.1 are a simplification, a limited view of what may happen in real-world software systems. Merge, split, extract or, in general, class reorganization can involve more than two classes, in general k classes. This will significantly influence the search space. Let us consider, for example, the case in which a class A is split in k new classes. If m candidate classes are available in release n + 1, then the search needs to be done among m k combinations. If large software systems are analyzed this may produce unacceptable computation times and a high number of false positives. However, the problem can be at least partially tackled by:

20

90

18

80

16

70

14

60

12

New classes Removed after this release

Classes

Classes

100

50 40

10 8

30

6

20 4 10 2 5

10

15

20 Releases

25

30

35

40

5

10

15

20 Releases

25

30

35

40

Figure 2. a) Classes contained in each release - b) Classes added/removed

1. a careful choice of the threshold, as explained in Section 4.2, to achieve a compromise between accuracy and manual intervention required;

software release history, without any kind of human intervention.

2. optimizing the search for an approximate solution using approaches such as genetic algorithms, simulated annealing or hill climbing. The set of classes involved in the refactoring can be, in fact, encoded in a genome, and the cosine value computed can be used as a fitness function. Such optimization approaches can, at least, reduce the computation time (while leaving high the number of potential false positives that can be obtained).

5. Case Study

Fortunately, the number of classes changed by a split, merge or reorganization is upper bounded and most of the times no more than four classes are involved (class recombination). Such in a worst case, a complexity of O(n2 ), may be experienced. However, while theoretical complexity remains polynomial, in practical cases this does not constitute a problem: the number of classes removed and added between two subsequent releases most of the time is considerably lower than the system size.

4.4 Tool Support Different tools have been implemented and reused to support the proposed approach, in particular: • a tool for extracting the identifiers from Java source code, implemented using the freely available parser generator JavaCC [1] and its Java lexer; and • a set of Perl scripts to implement the search algorithm itself. Noticeably, the use of such a toolkit is completely automatic: it proposes a series of candidate refactoring over the

To obtain a preliminary evidence of the proposed approach, we applied it on 40 releases of dnsjava1 . dnsjava is an open source Domain Name Server (DNS) written in Java. The system comprises classes for handling DNS names, records, addresses, for caching name resolutions and many others. Analyzed releases range from 0.1 to 1.4.3. The system size varies from 4.3 KLOC to 15.4 KLOC, and the number of classes from 39 to 99. As shown in Figure 2-a, the number of classes is, except few case, always increasing between subsequent releases. Figure 2-b shows, for each release, the number of classes added and the number of classes that terminate their life in that release. In other words, the figure shows the candidate classes among which refactoring activities should be searched. Figure 3 shows the lifetime of dnsjava classes across releases. The plot clearly indicates situations in which some classes appear while others were removed. For sake of simplicity, class names were omitted (a total of 123 classes were involved in dnsjava evolution).

5.1 Case Study Results The application of the proposed approach produced a list of potential class refactoring operations listed in Table 2. The table highlights how, in the initial results, a class could be potentially involved in different types of refactoring in the same release. For example, it could not be clear if, between releases 12 and 13, class CacheElement has 1 http://www.dnsjava.org

Type of Refactoring Performed Replacement Replacement Replacement Replacement Replacement Replacement Merge Merge Merge Merge Split Split Factor out

From

To

Release 3 4 7 7 11 32 7 11 12 32 4 7 2

Release 4 5 8 8 12 33 8 12 13 33 5 8 3

Classes involved in release n dnsServer CountedDataInputStream Resolver FindResolver CacheElement AXFREnumeration FindResolver, Resolver CacheElement, IO CacheResponse, ZoneResponse AXFREnumeration, Enumerator CountedDataInputStream FindResolver dns

Classes in involved in release n + 1 jnamed DataByteInputStream SimpleResolver FindServer Element AXFRIterator SimpleResolver Element SetResponse AXFRIterator DataByteInputStream, DataByteOutputStream ExtendedResolver, FindServer dns, Type

Cosine

0.71 0.32 0.79 0.31 0.85 0.90 0.75 0.77 0.88 0.88 0.24 0.31 0.76

Table 2. Potential refactorings found in dnsjava

120

100

Classes

80

60

40

20

5

10

15

20 Releases

25

30

35

40

Figure 3. Class lifetime for 40 dnsjava releases

been replaced by class Element or, instead, merged with class IO again to create class Element. In some cases, the decision can be taken automatically by analyzing the cosine value. In the last example, the cosine obtained for the merge is 0.77, while it is 0.85 for the replacement. This clearly weights in favor of the replacement. There may be however cases for which the decision is not that easy to be taken, even because the difference between the cosines in not that evident, or because the actual decision made by the developers could not be in full agreement with the computed cosines. Again, it is worth highlighting the fact that these results should (as for traceability recovery) not be considered as an absolute truth, while as indications that should be supported by code inspection. Let us now focus on the results obtained. The results

for replacement indicated that class jnamed replaced class dnsServer in release 4. Code inspection (and even class names) confirmed this was a good result, since both class model DNS daemons. The classes were quite similar, except for the fact that jnamed was able to handle DNS record sets and zones. The second case of replacement is somewhat more complex to be interpreted. In fact, it does not appear clear if class Resolver has been simply replaced by SimpleResolver or if, instead, SimpleResolver was obtained from the merge of Resolver and FindResolver. The higher similarity agrees with the manual inspection in favors of the replacement. The class FindResolver does not seem to have been included in SimpleResolver. Instead, we found that it was replaced by class FindServer. However, the first run of the algorithm missed that replacement. This because the class FindServer contained additional features than SimpleResolver (it also analyzes properties of DNS configuration other than reading them from a file). A threshold of 0.3 for cosine allowed the automatic identification of this replacement. The lowered threshold also allowed us to discover that at release 5 CountedDataInputStream was replaced by DataByteInputStream. The similarity (0.32) was not so high since several new features were added in the new class. Actually, also the dual class CountedDataOutputStream was replaced by DataByteOutputStream. However, here the similarity was even smaller (0.10) and therefore this refactoring could have been discovered only using a very low value of the threshold (acceptable for this case study, while not for others). Then, we analyzed if, in release 12, class

CacheElement was replaced by class Element or if, instead, the latter was obtained from merging classes IO and CacheElement. In this case, the cosine difference is relevant (0.85 vs. 0.77) and manual inspection confirmed that class Element replaced CacheElement (both were nested classed of class Cache, aiming to represent a cache element). The last possible case of replacement is, in release 33, between AXFREnumeration and AXFRIterator. In this case, the narrow cosine difference (0.90 vs. 0.88) made not clear if, instead, AXFRIterator was obtained by merging AXFREnumeration with Enumerator. Anyway, manual inspection confirmed that the higher cosine indicates the actual refactoring, i.e., a replacement. In fact, both were nested classes of Zone, providing access to the zone data structure. The replacement was due to moving from a simple enumeration type to an iterator design pattern. Instead, the class Enumerator was just a class similar to the other two, thus the algorithms proposed a merge with it. The remained case of potential merge combined CacheResponse and ZoneResponse into SetResponse, with a similarity of 0.88. Manual inspection confirmed this merge: both ZoneResponse and CacheResponse modeled a DNS response, even in a Zone and in a Cache (the latter two are subclasses of the class NameSet, modeling a DNS set of names). Since the behavior of the two classes was not that different, developers decided to merge them into SetResponse. Split results were a bit puzzling. It seems that class FindResolver was split into ExtendedResolver and FindServer. However we already know it is not the case, in that FindResolver and FindServer were simply involved in a replacement. On the contrary, this association helped to close the circle about the evolution of class Resolver. From replacement results, it appears to have been replaced by SimpleResolver. Instead, a careful analysis indicated that it also evolved in ExtendedResolver. However, the latter contained a number of additional features, making automatic matching not possible. Finally, in release 3 part of class dns was factored out in class Type. The manual inspection confirmed that a long list of type definitions (constants) for a DNS, were actually factored out in a separate class.

5.2 Discussion Although the size of the presented case study is mediumsmall, it represents the evolution, across a reasonable number of releases (40) of an open source project. Results suggest that the approach produces useful indications of class refactoring across releases. Even when a

class could, at the same time, have been involved in multiple refactoring actions (e.g., replacing, merge, split), the highest cosine value always indicates the correct choice, i.e., what developers actually did. Overall, we experienced that in cases where the automatic approach did not provide a correct decision, gathered information and the cosine of the angle between classes were a valuable mean to identify action actually carried out by developers. In some cases, even the class name indicated the transition (e.g., from CacheElement to Element) or (CacheResponse, ZoneResponse and SetResponse). Using a string distance (e.g., Levenshtein) between class names could be alternative (or complementary) to the proposed approach. However, while the possibility of complementing the two approaches is under investigation, we still believe that vectors of class identifiers convey more information that simple class names. For example, when a class Type was extracted from dns class the class name did not provide any indication. Moreover, it happens that in case of replacement, split or merge a class names are often changed e.g., by adding a prefix (e.g., CacheElement and Element). A prefix as long as the original class name would produce a string distance of 0.5 and thus, although words are similar, this event will not be detected. Instead, the combination of vector space identifiers with clone detection to highlight refactorings is worth of investigation. It is reasonable to believe that a class replacing another one should also be a near-clone of the latter, and that classes obtained from a split should contain cloned code from the source class. Finally, some considerations about scalability. Despite the theoretical complexity of the approach, in the practice there are often a few candidate classes (see Figure 2-b and Figure 3) for refactoring, thus the resulting search space is greatly reduced. On a 3GHz Pentium IV with 512 Mb RAM the complete search took 105 sec. of CPU user time (computed with the Unix time utility). Clearly, for largest case studies, as well as for values of k (i.e., number of classes in which a class could be split, or number of classes than could be merged) greater than two, search heuristics should be adopted, as suggested in Section 4.3.

6. Conclusions and work-in-progress We presented an approach, based on Vector Space cosine similarity on class identifiers, to automatically identify class-level refactorings between two subsequent releases. In particular, the approach aimed to identify cases of class replacement, split, merge, as well as factoring in and out of feature from/to other classes. The approach was useful to identify some replacement, merge and factoring during the evolution of dnsjava.

In our opinion, this can provide a useful support when a project manager is interested to study the evolution of a feature (or of a classes) across all the releases of a software system. In fact, even if a class could disappear at a given release, it could have been replaced, or simply merged with another into a new class. This refactoring therefore causes broken links in class evolution traceability. Identifying those links is also relevant for other important purposes, such as program comprehension and impact analysis. Work-in-progress is devoted to improve the identification algorithm by introducing search heuristics to find more complex cases of refactoring. Moreover, we will also investigate on how complementing clone detection to improve the approach. Finally, we will also focus on the identification of finer-grained cases of refactoring (e.g. of methods), also relying on the analysis of data from CVS repositories.

References [1] Java Compiler Compiler (JavaCC) - The Java Parser Generator. https://javacc.dev.java.net. [2] G. Antoniol, G. Canfora, G. Casazza, and A. De Lucia. Information retrieval models for recovering traceability links between code and documentation. In Proceedings of IEEE International Conference on Software Maintenance, pages 40–49, San Jose CA USA, October 2000. IEEE Comp. Soc. Press. [3] G. Antoniol, G. Canfora, G. Casazza, and A. De Lucia. Maintaining traceability links during object-oriented software evolution. Software - Practice and Experience, 31:1– 25, 2001. [4] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Tracing object-oriented code into functional requirements. In Proceedings of the 8th International Workshop on Program Comprehension, pages 227–230. Limerick Ireland, June 2000. [5] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo. Recovering traceability links between code and documentation. IEEE Transactions on Software Engineering, 28(10):970–983, October 2002. [6] G. Antoniol, G. Canfora, A. De Lucia, and E. Merlo. Recovering code to documentation links in OO systems. Proc. of the Working Conference on Reverse Engineering, pages 136–144, Oct 1999. [7] S. Demeyer, S. Ducasse, and O. Nierstrasz. Finding refactorings via change metrics. In Proceedings of the International Conference on Object-Oriented Programming, Systems, Languages, and Application (OOPSLA), 2000. [8] S. Demeyer, S. Ducasse, and O. Nierstrasz. Object–Oriented Reengineering Patters. Morgan Kaufmann/Elsevier, 20022. [9] M. Fischer, M. Pinzger, and H. Gall. Analyzing and Relating Bug Report Data for Feature Tracking. In 10th Working Conference on Reverse Engineering (WCRE), Victoria, Canada, pages 90–99, November 2003. [10] M. Fischer, M. Pinzger, and H. Gall. Populating a release history database from version control and bug tracking systems. In Proceedings of IEEE International Conference

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

on Software Maintenance, pages 23–32, Amsterdam, The Netherlands, Sep 2003. M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts. Refactoring: Improving the Design of Existing Code. Addison-Wesley Publishing Company, 1999. W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1992. H. Gall, M. Jazayeri, and J. Krajewski. CVS release history data for detecting logical couplings. In Proceedings of the International Workshop on Principles of Software Evolution, pages 13–23, Helsinki, Finand, Sep 2003. D. Harman. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, pages 363–392. PrenticeHall, Englewood Cliffs, NJ, 1992. Y. Maarek, D. Berry, and G. Kaiser. An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17:800–813, 8 1991. J. I. Maletic and A. Marcus. Supporting program comprehension using semantic and structural information. In Proc. of 23rd International Conference on Software Engineering, pages 103–112, Toronto, 2001. A. Marcus and J. I. Maletic. Recovering documentation-tosource-code traceability links using latent semantic indexing. In Proceedings of the International Conference on Software Engineering, pages 125–135, Portland Oregon USA, May 2003. D. Ratiu, S. Ducasse, T. Gˆırba, and R. Marinescu. Using history information to improve design flaws detection. In European Conference on Software Maintenance and Reengineering, pages 223–232, Tampere, Finland, Mar 2004. G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988. R. J. Turver and M. Munro. An early impact analysis technique for software maintenance. Journal of Software Maintenance - Research and Practice, 6(1):35–52, 1994. Z. Xing and E. Stroulia. Understanding class evolution in object–oriented software. In Proceedings of the IEEE International Workshop on Program Comprehension, pages 34– 43, Bari, Italy, Jun 2004. T. Zimmermann, S. Diehl, and A. Zeller. How history justifies system architecture (or not). In Proceedings of the International Workshop on Principles of Software Evolution, pages 73–83, Helsinki, Finland, Sep 2003. T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining version histories to guide software changes. In Proceedings of the International Conference on Software Engineering, pages 563–572, Edinburgh, Scotland, UK, May 2004.