Received: 4 April 2016
Revised: 16 January 2017
Accepted: 3 February 2017
DOI: 10.1002/smr.1859
RESEARCH ARTICLE
Using discriminative feature in software entities for relevance identification of code changes

Yuan Huang1,2 | Xiangping Chen2,3 | Zhiyong Liu1,2 | Xiaonan Luo1,2 | Zibin Zheng1

1 School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
2 National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China
3 Institute of Advanced Technology, Sun Yat-sen University, Guangzhou, China

Correspondence: Xiangping Chen, National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China. Email: [email protected]

Funding information: National Key Research and Development Program, Grant/Award Number: 2016YFB1000101; National Natural Science Foundation, Grant/Award Number: 61232011 and 61672545; Science and Technology Planning Project, Grant/Award Number: 2014B010118003

Abstract
Developers often bundle unrelated changes (eg, bug fix and feature addition) in a single commit and then submit a "poorly cohesive" commit to the version control system. Such a commit consists of multiple independent code changes and makes review of code changes harder. If the code changes can be identified as related and unrelated ones before commit, the "cohesiveness" of a commit can be guaranteed. Inspired by the effectiveness of machine learning techniques in the classification field, we model the relevance identification of code changes as a binary classification problem (ie, related and unrelated changes) and propose discriminative feature in software entities to characterize the relevance of code changes. In particular, to quantify the discriminative feature, 21 coupling rules and 4 cochanged type relationships are elaborately extracted from software entities to construct the related changes vector (RCV). The 21 coupling rules at the granularities of class, attribute, and method capture the relevance of code changes from the structural coupling dimension, and the 4 cochanged type relationships are defined to capture the change type combinations of software entities that may cause related changes. Based on RCV, machine learning algorithms are applied to identify the relevance of code changes. The experiment results show that probabilistic neural network and general regression neural network provide statistically significant improvements in accuracy of relevance identification of code changes over the other 4 machine learning algorithms. The related changes vector with 72 dimensions (RCV72) outperforms the other 2 RCVs with fewer dimensions.

KEYWORDS
coupling rules, cochanged types, discriminative feature, relevance of code changes
1 INTRODUCTION
During the software life cycle, developers make many code changes to meet ever-changing user requirements.1 After working for some time, developers commit their code changes to a version control system. Ideally, developers should bundle related code changes in a single commit. By doing so, developers will be required to frequently identify the relevance of code changes before commit, which may interrupt their work flow.2 So in practice, some commits inevitably consist of multiple unrelated code changes. As Figure 1 shows, the commit ([00d497] from JabRef*) in Subversion contains both a new feature addition (classes JabRefFrame, PluginInstallerAction, and PluginInstaller are cochanged for the feature of plug-in installer) and a feature improvement (class JabRefPreferences is changed for the improvement of a shortcut key). Such a commit is problematic, as it makes code review, reversion, and integration harder and historical analysis of the project less reliable.3,4

* https://sourceforge.net/p/jabref/code/ci/00d497fa5a7eaeabf5b559742181a7ffdc464678/

The key point to guarantee the "cohesiveness" of a commit is to identify whether the code changes bundled in a commit are related. Therefore, the problem reduces to the relevance identification of code changes, which is the aim of this paper. We hypothesize that the relevance of code changes can be characterized by the discriminative feature of software entities. The changes in the source code may occur at different levels: class, method, and attribute. Besides, after applying changes on a class, the involved software entities have 2 change types: either changed or new. Figure 2 shows a scene of related code changes. The method addContent() (calling method) in class Document invokes the method isRootElement() (method being called) in class Element. If the calling method is of changed type and the method being called is of new type, it can be concluded that the changed part in the calling method is induced by the method being called.
FIGURE 1 Example of commit consisting of unrelated code changes

FIGURE 2 Discriminative feature for related changes identification
The reason is that the method being called is a newly added one in the current release, which did not exist before. To invoke it, the calling method inevitably has to modify its code for the method invocation. From the perspective of classes, the changes in these 2 classes are related. This example illustrates that software entities' coupling relations and change types can be used as discriminative feature for change relevance identification.
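To make the scenario concrete, the following minimal Java sketch (hypothetical code written for illustration, not taken from JDOM's actual sources) shows the shape of such a pair of changes: an existing method is edited (changed type) so that it can call a method that was newly added (new type) in a structurally coupled class.

```java
// Hypothetical sketch of the scenario in Figure 2 (not JDOM's real implementation).
class Element {
    private Object parent;

    // NEW method added in the current release.
    boolean isRootElement() {
        return parent instanceof Document;
    }
}

// Document.addContent() is a *changed* entity: its body was edited so that it
// can invoke the newly added Element.isRootElement().
class Document {
    private final java.util.List<Element> content = new java.util.ArrayList<>();

    Document addContent(Element child) {
        if (child.isRootElement()) {               // changed part: call to the new method
            throw new IllegalArgumentException("The element is already a root element");
        }
        content.add(child);
        return this;
    }
}
```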
Concentrating on identifying code changes relevance, we present a method for detecting the relevance of code changes (DRCC) that uses supervised machine learning (SML) (eg, probabilistic neural network [PNN] and general regression neural network [GRNN]) to classify the relevance of code changes. Our method relies on SML, which has been used extensively in engineering classification problems. We hypothesize that changes relevance identification can be regarded as a binary classification problem. Thus, by establishing the most likely features for a given category (related or unrelated), an SML model to predict changes relevance can be learned. In particular, the coupling rules and cochanged types extracted from software entities are used as discriminative feature to characterize the relevance of code changes. Coupling rules at the granularities of class, attribute, and method capture the relevance of code changes from the structural coupling dimension, and cochanged type relationships are defined to capture the change type combinations of software entities that may cause related changes. Coupling rules combined with cochanged types construct the related changes vector (RCV), which is used as the input of machine learning algorithms.

In this paper, we want to explore (1) which machine learning technique is the most effective for related code changes detection, (2) whether the choice of feature granularities affects the accuracy of detection, and (3) how DRCC performs compared with an existing approach. With these 3 questions, we have evaluated our approach on 2 data sets: a manually collected data set and a public data set. The results show that using PNN with RCV72 as input obtains statistically significant improvements in accuracy over the other 5 machine learning techniques. Furthermore, our approach has been compared with the existing approach CLUSTERCHANGES,5 and the result indicates that our approach provides a better success rate when the number of change regions increases, although our approach spends a little more time than CLUSTERCHANGES. Our contribution is threefold: Firstly, we propose a discriminative feature-based approach to identify the relevance of code changes. Secondly, we conduct a comprehensive empirical evaluation of related changes detection. Thirdly, we design a user survey to prove that DRCC can help developers significantly improve the efficiency of code change reviews.

We significantly extend our previous work.6 In particular, we extend the instances of coupling rules from 16 to 21. We add and analyze data sets from another 2 software systems (JabRef and jEdit) for the related code changes detection; these results are not available in Huang et al.6 Also, we extend statistical tests to all the results achieved by different machine learning algorithms and RCVs with different dimensions. In addition, we investigate an additional research question (RQ3) in our empirical evaluation that compares our approach with CLUSTERCHANGES.5

Paper organization. Section 2 presents related works, while Section 3 describes the approach overview. The main process of construction of RCV is presented in Section 4. The setups and results of the case study are discussed in Sections 5 and 6, while Section 7 discusses threats to validity. Section 8 summarizes our approach and outlines directions of future work.

2 RELATED WORK

Our approach is closely related to the research on coupling metrics. The core mechanism in our approach for detecting related code changes is RCV, which is constructed by the combination of coupling rules and cochanged types. We discuss here the major approaches to coupling metrics. Then, we discuss the research on changes relevance identification.
Coupling metrics. Coupling refers to the degree of interdependence among the components of a software system,7 and developers always pursue an optimal balance between coupling and cohesion8 when modularizing the components of their systems. Coupling metrics are widely used in software engineering tasks, such as change propagation prediction,9,10 assessing the fault-proneness of classes,11,12 software dependency prediction,13 and software remodularization.14 Two types of coupling metrics have received significant attention in the literature, namely, dynamic15,16 and static.17–19 Dynamic metrics usually capture the dynamic coupling by observing the object interactions during program execution, while static metrics use different types of information such as structural, textual, and evolutionary to measure the degree of interaction between software entities. The basic idea underlying traditional static coupling metrics is very simple: count how many interactions there are between software entities in the system.20–22 The RCV proposed in our work uses structural coupling metrics to characterize related code changes. To make RCV more discriminatory and provide more information, we subdivided the structural coupling used in RCV into several types at different levels (eg, class, method, and attribute). Therefore, we mainly introduce the types of structural coupling classified by Briand et al7 and Ying et al.23

Briand et al7 defined structural couplings that capture 3 types of interactions between software entities, namely, Class-Attribute (CA), Class-Method, and Method-Method interactions. However, these couplings are categorized at a coarser-grained level. In contrast, Ying et al proposed 12 fine-grained "structural relationships" coming from the Java and C++ languages, and these structural relationships refer to structural couplings with 12 different types.23 By analyzing real source code in object-oriented projects, we extend the coupling types to 21 and classify them into several granularities (class, method, and attribute level) in our metrics. Furthermore, we redefined the coarser-grained types of interactions between software entities considering the common characteristics of coupling instances and finally classified them into 4 types: Class-to-Class (CR1), Method-to-Class (CR2), Method-to-Attribute (CR3), and Method-to-Method (CR4) interactions, where CR2 and CR4 correspond to Class-Method and Method-Method, respectively. Through further analysis on CA, we found that CA can be transformed into 1 of the 3 relationships CR2, CR3, and CR4.

Changes relevance identification. To the best of our knowledge, research on the problem of identifying the relevance of code changes was first proposed3 in 2013. Herzig and Zeller presented the earliest results of an empirical experiment on tangled commits in software development. The experiment shows that detecting related changes in source code is important and that software changes are tangled even in a version control system. They also proposed a heuristic-based algorithm to "untangle" changes based on code change information such as file distance and the call graph of a change. We noticed the importance of detecting code changes for program understanding and proposed6 a machine learning–based approach in 2014. In this paper, our approach uses 21 coupling rules and 4 cochanged type relationships to specify the static coupling relationship and change type combinations of 2 entities. Our result shows higher precision compared to Herzig and Zeller with different data sets in experiments.

Dias et al2 proposed EpiceaUntangler to help developers untangle code changes. They also use machine learning technology on code change information. This approach is different from ours in feature selection. Specifically, our approach fully uses static coupling and change type relationships in source code. No extra information, such as information related to testing or the version control system, is required. To facilitate developers' understanding of changes in the context of code reviews, Barnett et al5 proposed to relate separate regions of change within a changeset by using a static code analysis technique. The approach uses a single relationship, that between the use of a type, method, or field and its definition, to provide a useful decomposition. Their method is very effective when the change regions are small scale. As the number of change regions increases, most change regions tend to be related (called overrelated) under the definitions-and-uses mechanism. To avoid the overrelated results, our approach to identify the relevance of code changes is based on machine learning models, which are trained on empirical data collected from real cases of related and unrelated code changes in programs. We find that some change regions cannot be considered related even if they satisfy the definitions-and-uses mechanism. Moreover, besides the traditional coupling relationships in Barnett et al,5 we propose original cochanged type relationships to enhance the confidence of identifying related code changes.

Machine learning. There are 2 types of machine learning algorithms: supervised and unsupervised.24 In SML, a training set of precategorized vectors is used to build a mapping between the features and the categories. Then this mapping is used to predict the categories to which uncategorized vectors belong. On the other hand, unsupervised machine learning generates categories (eg, clusters) based on the latent structure (patterns, regularities, similarities, etc) of the features. In this paper, we use 6 supervised algorithms in the experiment section: k-nearest neighbor (kNN),25 Naive Bayes (NB),26 decision tree (DT),27 support vector machine (SVM),28 PNN,29 and GRNN.30

The k-nearest neighbor algorithm is a nonparametric method used for classification, which is a type of instance-based learning algorithm.25 In kNN, an object is classified by a majority vote of its neighbors, with the object being assigned to the category most common among its kNNs. Jaafar et al31 use kNN to group code changes, and Okutan et al32 use kNN to predict software defectiveness. Naive Bayes assumes that all the attributes are independent and that each contributes equally to the categorization.26 A category is assigned to an object by combining the contribution of each feature. This combination is achieved by estimating the posterior probabilities of each category by using Bayes theorem. Naive Bayes models have been used for text retrieval33 and classification34 and software refactoring.35 Decision tree uses a "divide and conquer" strategy to split the problem space into subsets.27 A DT is modeled like a tree in which the root and the nodes are questions, and the arcs between nodes are possible answers to the questions. The leaves of the tree are the categories. Decision tree models have been used for software defect prediction in Moser et al36 and Knab et al.37 Support vector machine splits the problem space into two possible sets by finding a hyperplane that maximizes the distance to the closest item of each subset.28 The function that splits the hyperplane is known as the kernel function. If the data are linearly separable, a linear kernel function is used with the SVM; otherwise, nonlinear functions such as polynomials, radial basis, and sigmoid should be used.
Support vector machine models have been used for bug prediction in Shivaji et al38 and Shivaji et al.39 Probabilistic neural network is a feedforward neural network, which introduces a radial basis function to measure the weight of the distance to each neighbor.29 In a PNN, the operations are organized into a multilayered feedforward network with 4 layers, namely, the input, pattern, summation, and decision layers. Ted et al40 use PNN to classify software modules. General regression neural network also has a 4-layer structure, but it has a radial basis layer and a special linear layer.30 General regression neural network is a memory-based feedforward neural network based on the approximate estimation of the probability density function from observed samples using Parzen-window estimation. Kanmani et al41 use GRNN for software quality prediction.

3 APPROACH OVERVIEW

In this paper, software entities include classes, methods, and attributes. The software entities involving code changes are called updated entities. If the code lines of an entity are changed, the entity is of changed type. If an entity is newly added, it is of new type. Therefore, an updated entity has only 1 change type, either changed or new.

Before developers commit their code changes to SVN, they need to identify the relevance between code changes. The code changes may occur in the same class or in different classes. Therefore, our approach needs to identify the changes relevance in 2 cases: either in different classes (case 1 in the left-hand side of Figure 3, where software entities with red background involve code changes) or in the same class (case 2 in the left-hand side of Figure 3). To facilitate extraction of the discriminative feature of code changes, the code change is organized as an entity-unit in this paper. To better understand the extraction of the discriminative feature of code changes occurring in the same class, we use 2 classes, A and A', to highlight the code changes occurring in different entities of a class (case 2 in the left-hand side of Figure 3, where A and A' are the same class). It is worth noting that we consider the distinct code changes occurring in the same method to be related.

Figure 3 shows the overview of our proposed approach, which takes a pair of classes involving code changes as input and reports the relevance of code changes. Firstly, 21 coupling features and 4 cochanged type features are combined to represent the discriminative feature. The coupling features refer to the structural coupling in the real programming paradigm, such as Inheritance (IH) and Implementing Interface (II). Cochanged type features mean the change types of a pair of coupling entities. Secondly, so as to quantify the discriminative feature, the coupling rules and cochanged types collaboratively construct the RCV. The related changes vector indicates what discriminative feature the updated entities satisfy. Finally, with a machine learning algorithm and the RCV, the relevance of code changes is identified as related and unrelated ones.

4 RELATED CHANGES VECTOR

In this work, we propose an RCV mechanism, based on coupling rules and cochanged types, to measure the relevance of code changes. We first introduce the concept of coupling rules and cochanged types and then introduce the definition of RCV.

The key idea behind our approach is to identify related changes via the discriminative feature derived from program entities. In most cases, the major factor behind the production of related changes is a change propagating from 1 entity to another entity. The change propagation mechanism9 shows that a code change tends to propagate from 1 entity to another entity when these 2 entities are structurally coupled. Therefore, structural coupling is a necessary condition for 2 code changes to be related. In this paper, we regard the structural coupling as one of the discriminative features used to judge the relevance of code changes.

4.1 Coupling rules

In the real programming paradigm, structural coupling is presented as a variety of coupling rules.23 To summarize the general types of structural coupling, it is necessary to explore what coupling rules the entities satisfy in practice. Thus, we go deep into the source code of program entities to analyze and collect the coupling rules and then classify them at different granularities.

Coupling rule 1. S is the set of updated classes in the program, S = {C1, C2, …, Cn}. The coupling rules CR are 2-tuple relations defined on S. For any classes Ci and Cj, Ci ∈ S, Cj ∈ S, if Ci and Cj establish a coupling relationship at the class level, they satisfy the Class-to-Class coupling rule, denoted by

CR1 = {Ci ⇒ Cj | Ci ∈ S, Cj ∈ S}.   (1)

"⇒" denotes the coupling relationship. In Java program syntax, IH and II42 are the most common cases that satisfy CR1. The first 2 rows in Table 1 describe these 2 instances.

A class is a set of attributes and methods. Namely, Ci = {Ai, Mi}, where Ai = {ai1, ai2, …, ait} and Mi = {mi1, mi2, …, mih} are the sets of attributes and methods of Ci, respectively. Similarly, Cj = {Aj, Mj}, where Aj = {aj1, aj2, …, aju} and Mj = {mj1, mj2, …, mjv} are the sets of attributes and methods of Cj, respectively.
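As an illustration of CR1, the following minimal sketch checks the two Class-to-Class instances (IH and II) between two classes using Java reflection. This is an assumption-laden simplification: DRCC works on the source code of updated classes rather than on loaded classes, and the classes used in the example are placeholders.

```java
import java.util.Arrays;

// Minimal sketch (not DRCC's actual implementation): checking the two
// Class-to-Class instances of CR1 between two classes via reflection.
public class Cr1Checker {

    // Inheritance (IH): Ci directly inherits from Cj.
    static boolean inherits(Class<?> ci, Class<?> cj) {
        return ci.getSuperclass() != null && ci.getSuperclass().equals(cj);
    }

    // Implementing Interface (II): Ci directly implements interface Cj.
    static boolean implementsInterface(Class<?> ci, Class<?> cj) {
        return cj.isInterface() && Arrays.asList(ci.getInterfaces()).contains(cj);
    }

    // Ci and Cj satisfy CR1 if either IH or II holds (for a given pair, the two are mutually exclusive).
    static boolean satisfiesCr1(Class<?> ci, Class<?> cj) {
        return inherits(ci, cj) || implementsInterface(ci, cj);
    }

    public static void main(String[] args) {
        // Placeholder example: java.util.ArrayList implements java.util.List, so II holds.
        System.out.println(satisfiesCr1(java.util.ArrayList.class, java.util.List.class));
    }
}
```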
FIGURE 3 Approach overview
TABLE 1 Common instances of structural coupling

Instances | Description | Abbrev | Coupling rules
Inheritance | Ci inherits from Cj | IH | Class-to-Class
Implementing Interface | Ci implements interface Cj | II | Class-to-Class
Type-Casting | Ci performs Type-Casting to Cj in mi* | TC | Method-to-Class
Instanceof | Ci performs Instanceof of Cj in mi* | IO | Method-to-Class
Return Type | Cj is return type of mj* | RT | Method-to-Class
Parameter Type | Cj is parameter type of mj* | PT | Method-to-Class
.class | Ci performs .class of Cj in mi* | DC | Method-to-Class
Exception Throws 1 | Cj is exception handler class, mi* throws Cj in the method definition | ET1 | Method-to-Class
Exception Throws 2 | Cj is exception handler class, mi* throws Cj in the method body | ET2 | Method-to-Class
Static Method Invoking | mi* invokes static method mj* of Cj | SMI | Method-to-Method
Static Attribute Invoking | mi* invokes static attribute aj* of Cj | SAI | Method-to-Attribute
Construction Method Invoking | mi* invokes construction method mj* of Cj | CMI | Method-to-Method
Method Member Attribute | mi* invokes attribute ai* | MMAUA | Method-to-Class
Method Member Attribute | mi* invokes attribute aj* of Cj | MMAIA | Method-to-Attribute
Method Member Attribute | mi* invokes method mj* of Cj | MMAIM | Method-to-Method
Class Member Attribute | mi* invokes attribute ai* | CMAUA | Method-to-Class
Class Member Attribute | mi* invokes attribute aj* of Cj | CMAIA | Method-to-Attribute
Class Member Attribute | mi* invokes method mj* of Cj | CMAIM | Method-to-Method
Function Parameter | mi* invokes parameter ai* | FPUP | Method-to-Class
Function Parameter | mi* invokes attribute aj* of Cj | FPIA | Method-to-Attribute
Function Parameter | mi* invokes method mj* of Cj | FPIM | Method-to-Method
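To make a few of these instances concrete, the hypothetical pair of classes below (invented for illustration, not drawn from the studied systems; the mapping to abbreviations follows our reading of Table 1) exhibits Type-Casting (TC), Static Method Invoking (SMI), a Method Member Attribute access (MMAIA), a Class Member Attribute access (CMAIA), and Construction Method Invoking (CMI).

```java
// Hypothetical classes illustrating several coupling instances from Table 1.
class Cj {
    static int counter = 0;                   // static attribute (target of SAI)
    int size;                                 // attribute (target of MMAIA/CMAIA)

    Cj(int size) { this.size = size; }        // construction method (target of CMI)

    static void resetCounter() { counter = 0; }   // static method (target of SMI)
}

class Ci {
    private Cj helper = new Cj(10);           // class member attribute defined with type Cj

    void update(Object o) {
        Cj casted = (Cj) o;                   // TC: mi* performs Type-Casting to Cj
                                              // ("casted" is a method member attribute of type Cj)
        Cj.resetCounter();                    // SMI: mi* invokes a static method of Cj
        int n = casted.size;                  // MMAIA: mi* invokes attribute aj* of Cj via a method member attribute
        helper.size = n + 1;                  // CMAIA: mi* invokes attribute aj* of Cj via a class member attribute
        helper = new Cj(helper.size);         // CMI: mi* invokes a construction method of Cj
    }
}
```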
There is another coupling relation between classes and methods.

Coupling rule 2. When the class Cj emerges in a method mi* of Ci and is neither defined as an attribute nor used via its static attributes and methods, the situation satisfies the Method-to-Class coupling rule:

CR2 = {Ci · mi∗ ⇒ Cj | mi∗ ∈ Mi}   (2)

The instances of CR2 are Type-Casting (TC), Instanceof (IO), Return Type (RT), Parameter Type (PT), Exception Throws (ET), and .class (DC), etc, where .class is Java syntax that obtains the class object of Cj via the form "classname.class". The more detailed description is shown in Table 1.

However, in most cases, finer-grained coupling rules occur at the method and attribute level. The finer-grained CRfg is defined as

CRfg = {Ci · ei ⇒ Cj · ej | ei ∈ (Ai ∪ Mi) ∧ ej ∈ (Aj ∪ Mj)},   (3)

where ei and ej are the attribute or method contained in Ci and Cj, respectively. ei ⇒ ej denotes that ei and ej couple together at the attribute or method level. According to the definition, there are 4 possible instances of the finer granularity CRfg, namely, Ci · ai∗ ⇒ Cj · aj∗, Ci · ai∗ ⇒ Cj · mj∗, Ci · mi∗ ⇒ Cj · aj∗, and Ci · mi∗ ⇒ Cj · mj∗.

In our investigation, the instances of Ci · ai∗ ⇒ Cj · aj∗ and Ci · ai∗ ⇒ Cj · mj∗ barely occur in real code, while Ci · mi∗ ⇒ Cj · aj∗ and Ci · mi∗ ⇒ Cj · mj∗ occur in most cases. Thus, we only consider the latter 2 in what follows.

Coupling rule 3. This rule builds a coupling relation between mi* contained in Ci and aj* contained in Cj, namely, a Method-to-Attribute rule:

CR3 = {Ci · mi∗ ⇒ Cj · aj∗ | mi∗ ∈ Mi ∧ aj∗ ∈ Aj}   (4)

The most representative instance for CR3 is Static Attribute Invoking (SAI).

Coupling rule 4. This rule builds a coupling relation between mi* contained in Ci and mj* contained in Cj, namely, a Method-to-Method rule:

CR4 = {Ci · mi∗ ⇒ Cj · mj∗ | mi∗ ∈ Mi ∧ mj∗ ∈ Mj}   (5)

CR4 builds the coupling relation at the method level, eg, Static Method Invoking and Construction Method Invoking. The more detailed description is shown in Table 1.

In addition to the above-mentioned instances for each coupling rule, there is a most common kind of instance in real code. This kind of instance uses Cj to define an attribute ai* contained in Ci, where ai* may be a Method Member Attribute, a Class Member Attribute, or a Function Parameter. Generally, the defined attribute ai* is then used in a method mi* of Ci, and there are 3 usages: (1) using ai* directly in a certain method mi* of Ci. Because ai* is an attribute declared with type Cj and ai* is used in a certain method mi* of Ci, we can regard the coupling relation between Ci and Cj as Method-to-Class. For example, the instances MMAUA, CMAUA, and FPUP satisfy this case, and the detailed description is shown in Table 1; (2) invoking the attribute aj* of Cj via ai* (because ai* is declared with type Cj), so Ci and Cj build a Method-to-Attribute coupling relation. As Table 1 shows, the instances MMAIA, CMAIA, and FPIA satisfy this case; (3) invoking the method mj* of Cj via ai*, so Ci and Cj build a Method-to-Method coupling relation. As Table 1 shows, the instances MMAIM, CMAIM, and FPIM satisfy this case.

We have classified the coupling rules into 4 types at the class, attribute, and method levels. Table 2 shows the 4 coupling rules and their corresponding formalization expressions and 21 instances.
TABLE 2 Four types of coupling rules

Name | Coupling rules | Formalization | Instances
CR1 | Class-to-Class | Ci ⇒ Cj | IH, II
CR2 | Method-to-Class | Ci · mi∗ ⇒ Cj | TC, IO, RT, PT, DC, ET1, ET2, MMAUA, CMAUA, FPUP
CR3 | Method-to-Attribute | Ci · mi∗ ⇒ Cj · aj∗ | SAI, MMAIA, CMAIA, FPIA
CR4 | Method-to-Method | Ci · mi∗ ⇒ Cj · mj∗ | SMI, CMI, FPIM, MMAIM, CMAIM
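One way to keep these definitions machine-checkable is to encode Table 2 directly in code. The sketch below is our own illustrative modeling, not part of DRCC as published; it simply maps each of the 21 instances to its coupling rule type.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative encoding of Table 2: the 4 coupling rule types and their 21 instances.
enum CouplingRule { CLASS_TO_CLASS, METHOD_TO_CLASS, METHOD_TO_ATTRIBUTE, METHOD_TO_METHOD }

enum CouplingInstance {
    IH(CouplingRule.CLASS_TO_CLASS), II(CouplingRule.CLASS_TO_CLASS),
    TC(CouplingRule.METHOD_TO_CLASS), IO(CouplingRule.METHOD_TO_CLASS),
    RT(CouplingRule.METHOD_TO_CLASS), PT(CouplingRule.METHOD_TO_CLASS),
    DC(CouplingRule.METHOD_TO_CLASS), ET1(CouplingRule.METHOD_TO_CLASS),
    ET2(CouplingRule.METHOD_TO_CLASS), MMAUA(CouplingRule.METHOD_TO_CLASS),
    CMAUA(CouplingRule.METHOD_TO_CLASS), FPUP(CouplingRule.METHOD_TO_CLASS),
    SAI(CouplingRule.METHOD_TO_ATTRIBUTE), MMAIA(CouplingRule.METHOD_TO_ATTRIBUTE),
    CMAIA(CouplingRule.METHOD_TO_ATTRIBUTE), FPIA(CouplingRule.METHOD_TO_ATTRIBUTE),
    SMI(CouplingRule.METHOD_TO_METHOD), CMI(CouplingRule.METHOD_TO_METHOD),
    MMAIM(CouplingRule.METHOD_TO_METHOD), CMAIM(CouplingRule.METHOD_TO_METHOD),
    FPIM(CouplingRule.METHOD_TO_METHOD);

    final CouplingRule rule;
    CouplingInstance(CouplingRule rule) { this.rule = rule; }

    // Convenience: all instances belonging to a given rule, eg, the 4 instances of CR3.
    static Set<CouplingInstance> of(CouplingRule rule) {
        Set<CouplingInstance> result = EnumSet.noneOf(CouplingInstance.class);
        for (CouplingInstance ci : values()) if (ci.rule == rule) result.add(ci);
        return result;
    }
}
```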
Note that, although this does not ensure that the coupling rules and instances in Table 2 represent all of the cases, they have already covered most of them.

4.2 Cochanged types

A new functionality in a program is usually implemented by adding new entities, by expanding the ability of old entities (changed), or by combining both. If 2 entities with the new change type are coupled, they are most likely to be involved in achieving the same functionality, and their changes are related. Another example of this is a changed type method coupling with a new type method, shown in Figure 2. From a statistical perspective, the change type combinations of 2 coupling entities can be regarded as another discriminative feature used to judge the relevance of code changes.

For a single updated entity ei, there are 2 change types, changed and new. Let CT(ei) be the change type of ei. For 2 coupling entities ei and ej, there are at most 4 change type combinations (eg, 1 of the entities is changed and the other is new). We call the change types of coupling entities cochanged types. Let Co-CT(ei ⇒ ej) represent the cochanged type of entities ei and ej. The possible values of CT(ei) and Co-CT(ei ⇒ ej) are

CT(ei) ∈ { <changed>, <new> }

Co-CT(ei ⇒ ej) ∈ { <changed, changed>, <new, changed>, <changed, new>, <new, new> }

4.3 Construction of RCV

So as to quantify the discriminative feature of code changes in Ci and Cj, the coupling rules and cochanged types collaboratively construct the RCV. The related changes vector represents the relevance vector of code changes. The definition is

RCV(Ci, Cj) = | 𝜂1, 𝜂2, 𝜂3, 𝜂4, …, 𝜂m |   (6)

In formula 6, for any k, 1 ⩽ k ⩽ m, 𝜂k = 0 or 1. '1' represents that the software entities involving code changes satisfy the corresponding coupling rule instance along with the cochanged type, and '0' means no such relation. In real source code, updated entities may satisfy 1 or several coupling rule instances listed in Table 1, and each instance has 4 possible cochanged types except SAI, MMAIA, CMAIA, and FPIA. These 4 instances build a Method-to-Attribute coupling relation. A method has 2 change types, changed and new. For an attribute, we only consider the new change type, because, in our survey, changes applied to an attribute rarely occur. As a result, each of the 4 instances has 2 possible cochanged types (<changed, new> and <new, new>), and they produce 8 cochanged types in total. Besides, there are a few mutually exclusive situations that cannot occur simultaneously. For example, 2 classes can only satisfy 1 of the instances of CR1, either IH or II, so we combine these 2 instances into 1, which produces 4 cochanged types. Adding the 60 cochanged types produced by the remaining 15 instances, there are 72 possible cochanged types in total; thus, the dimension of RCV is 72. Namely, m = 72, and each 𝜂k in RCV(Ci, Cj) maps to the cochanged type of a certain instance.
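Under these definitions, constructing the RCV amounts to setting one bit per observed (coupling instance, cochanged type) pair. The sketch below shows this bookkeeping under simplifying assumptions of our own (a fixed, arbitrary dimension ordering, and a pre-extracted list of observations); it reuses the illustrative CouplingInstance enum from the previous sketch and is not DRCC's published implementation, which additionally merges IH/II and restricts the Method-to-Attribute instances to 2 cochanged types to obtain 72 dimensions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative construction of a binary RCV from observed
// (coupling instance, cochanged type) pairs for a pair of updated classes.
public class RcvBuilder {

    enum CochangedType { CHANGED_CHANGED, NEW_CHANGED, CHANGED_NEW, NEW_NEW }

    record Observation(CouplingInstance instance, CochangedType type) {}

    // Assumption: dimensions are enumerated in a fixed order over all
    // instance/cochanged-type combinations.
    private final Map<String, Integer> dimensionIndex = new LinkedHashMap<>();

    RcvBuilder() {
        int index = 0;
        for (CouplingInstance ci : CouplingInstance.values()) {
            for (CochangedType ct : CochangedType.values()) {
                dimensionIndex.put(ci + "#" + ct, index++);
            }
        }
    }

    int[] build(List<Observation> observations) {
        int[] rcv = new int[dimensionIndex.size()];
        for (Observation o : observations) {
            Integer k = dimensionIndex.get(o.instance() + "#" + o.type());
            if (k != null) rcv[k] = 1;        // set the matching dimension to 1
        }
        return rcv;
    }
}
```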
5 CASE STUDY DESIGN

To have a meaningful evaluation, we outline 6 different machine learning algorithms (namely, PNN, GRNN, kNN, NB, DT, and SVM) for relevance identification of code changes and consider 3 types of RCV with different granularities (RCV72, RCV14, and RCV21). Besides, we empirically investigate how DRCC performs compared with an existing approach.

5.1 Related changes vector with different dimensions

RCV72 is based on the 21 instances of coupling rules and the 4 cochanged types. To evaluate whether fine-grained dimensions are necessary for related changes detection, we construct RCVs at coarse-grained dimensions and take these RCVs as input for the machine learning algorithms. Specifically, RCVs at more coarse-grained dimensions are constructed by combining related finer-grained dimensions. We construct 2 RCVs by combining the dimensions of different instances of a coupling rule and of different cochanged types of the same coupling rule, which are called RCV14 and RCV21, respectively. RCV14 is constructed at the coupling rules level instead of at the instances level. Specifically, RCV14 is constructed considering the 4 coupling rules and their corresponding 4 cochanged types. Because the coupling rule CR3 belongs to the Method-to-Attribute relationship and the attribute only has the new type, CR3 has 2 possible cochanged types. The rest of the coupling rules (CR1, CR2, and CR4, each of which has 4 cochanged types) have 12 dimensions in total. Therefore, RCV14 has 14 dimensions in total. Row 2 of Table 3 shows RCV14. RCV21 is constructed from the 21 instances of coupling rules, regardless of the cochanged types. Row 3 of Table 3 shows RCV21. If the accuracy of the detection results is affected by the choice of RCV, it means that the granularity of the coupling rule affects the detection accuracy.

TABLE 3 Related changes vector with different dimensions

RCV | Instances of coupling rules | Coupling rules | Cochanged types
RCV72 | IH, TC, RT, PT, DC, SAI, SMI, ET1, ET2, MMAUA, CMAUA, FPUP, MMAIA, CMAIA, FPIA, CMI, FPIM, MMAIM, CMAIM, IO | -- | <changed, changed>, <new, changed>, <changed, new>, <new, new>
RCV14 | -- | CR1, CR2, CR3, CR4 | <changed, changed>, <new, changed>, <changed, new>, <new, new>
RCV21 | IH, II, TC, RT, PT, DC, SAI, SMI, ET1, ET2, MMAUA, CMAUA, FPUP, MMAIA, CMAIA, FPIA, CMI, FPIM, MMAIM, CMAIM, IO | -- | --

5.2 Research questions

This section presents the empirical study for DRCC, with the purpose of analyzing its capability to identify related code changes. The quality focus is the precision, recall, F-measure, and false positive rate of the detection of related code changes with respect to the original ones.

• RQ1: Which machine learning technique is the most effective for related code changes detection?
• RQ2: Does the choice of RCV72, RCV14, and RCV21 affect the accuracy of detection?
• RQ3: How does DRCC perform compared with other approaches in detecting related changes?

For RQ1, we compare different machine learning approaches on large project sets and obtain quantitative measurements of how well they detect related code changes. Our objective is to choose the best classifier from numerous machine learning techniques, such as PNN, GRNN, kNN, NB, DT, and SVM. We also want to know how different granularities of features affect the accuracy of the machine learning algorithms. Specifically, we extract three granularities of features used in the case study, and in RQ2, we want to study which granularity of features can obtain better accuracy. Similarly, the purpose of RQ3 is to compare our approach with an existing approach, and we are interested in the comparison of DRCC and the other approaches.

5.3 Data sets and evaluation methodology

5.3.1 Data sets

When a development task (eg, bug fixing) involves multiple updated classes, there theoretically exist related changes between these updated classes. Fortunately, Dit et al43 have proposed the gold set mechanism, which is used to record the classes involved in a same development task. In the case study, gold sets are used to validate whether DRCC can detect the related code changes. Two public data sets (JabRef† and jEdit‡) provided by Dit et al43 as well as another 3 manually collected data sets (Jdom§, gwt-log¶, and XStream‖) are used in our case study. The process of manual collection of gold sets can be summarized as follows: firstly, we filter out the commits whose comments contain an IssueID. For example, the comment of commit #13321 is "fix for bug #732," where #732 is an IssueID. The issue #732 is mapped to commit #123321. Secondly, we manually verify each issue-commit mapping to ensure the correctness of the data and to discard commits that contain numbers that do not represent IssueIDs (eg, "Eliminated a small code duplication found in r10817"). At last, for each issue-commit mapping, we analyze its associated source code files (eg, classes), and the classes in the current commit form a gold set.

† http://jabref.sourceforge.net/ ‡ http://www.jedit.org/ § http://www.jdom.org ¶ https://code.google.com/p/gwt-log/ ‖ http://xstream.codehaus.org

Table 4 summarizes the project names, RQ, releases, and the number of gold sets. The RQ column represents the research questions (RQ) that each data set helps to answer. To better validate our approach, we select gold sets including at least 2 unique classes (for example, we select 13 from 39 gold sets for JabRef, and any one of the 13 gold sets includes at least 2 unique classes).

5.3.2 Evaluation methodology

Our evaluation strategy is to use the gold sets to determine whether DRCC correctly identifies the related code changes. The changes in different classes can be related when they address the same issue (eg, bug fixing), and a gold set exactly collects the classes addressing an issue. However, when the number of classes contained in a gold set is greater than 2, it cannot be guaranteed that the changes of any pair of classes in this gold set are related. For example, classes C1, C2, and C3 are in a same gold set. The change of class C1 is related with the change of class C2, and the change of class C3 is also related with the change of C2, but the changes of C1 and C3 are uncorrelated. Because these 3 classes fix a bug together, they are in the same gold set. We expect that DRCC can detect the truly related changes. The evaluation methodology can be summarized in the following steps:

1. For a project, if the number of classes contained in a gold set is greater than 2, we manually identify the relevance of changes in each pair of classes to obtain the truly related changes, and the truly related changes are used as an oracle in the training phase and validation phase.
2. For each pair of updated classes in a same gold set, our algorithm generates the corresponding RCV. According to the oracle, if the changes of a pair of classes are truly related, we use '1' to label this RCV; otherwise, it is labeled as '0'. All gold sets generate 1143 RCVs in total.
3. In the training process, DRCC uses a 5-fold cross validation, in which the RCVs of a specific project are randomly broken into 5 sections. One section (about 228 RCVs) is used to test the machine learning algorithms, which are trained on the other four-fifths (about 915 RCVs). There are 5 iterations, and each section is used as the testing set once.
TABLE 4 Projects used in case study

Source | RQ | Project | Release | # of selected gold sets
Manual Collection | RQ 1,2 | Jdom | b7 | 5
Manual Collection | RQ 1,2 | Jdom | b8 | 4
Manual Collection | RQ 1,2 | gwt-log | 2.5 | 4
Manual Collection | RQ 1,2 | gwt-log | 3.1 | 6
Manual Collection | RQ 1,2 | XStream | 1.1 | 7
Manual Collection | RQ 1,2 | XStream | 1.2 | 9
Manual Collection | RQ 1,2 | XStream | 1.3 | 7
Manual Collection | RQ 1,2 | XStream | 1.4 | 8
Public Data set | RQ 1,2,3 | JabRef | 2.6 | 13
Public Data set | RQ 1,2,3 | jEdit | 4.3 | 52
4. Finally, DRCC takes the test set as input to predict the relevance of changes. For the related changes judged by DRCC, we validate whether these changes are truly related according to the oracle.

To establish a valid oracle, we manually identify the truly related changes in each gold set. If the number of classes contained in the current gold set equals 2, the changes of these 2 classes are truly related changes; if the number of classes contained in a gold set is greater than 2, we manually investigate the source code of each pair of classes to judge whether the change in one class is induced by the one in the other class. Because the judgement accuracy is related to the programming experience of the participants, we tried to avoid bias during the data collection phase. For example, we first invite 3 graduate students to manually identify truly related changes, and another 3 graduate students validate the produced results, to verify that all the truly related changes identified by the first 3 students are correct; all the participants major in computer science and are not aware of the experimental goals or the way that DRCC identifies related code changes. The final results show that the ratio of truly related changes vs all the changes forming the gold sets is about 38.4%.
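The labeling and cross-validation steps above can be summarized in a short sketch. The code below is a simplified illustration under our own assumptions (RCVs are plain int arrays and the fold assignment is a uniform random shuffle); it is not the exact experimental harness used in the paper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified 5-fold split of labeled RCVs (label 1 = truly related, 0 = unrelated).
public class FiveFoldSplit {

    record LabeledRcv(int[] rcv, int label) {}

    static List<List<LabeledRcv>> split(List<LabeledRcv> all, long seed) {
        List<LabeledRcv> shuffled = new ArrayList<>(all);
        Collections.shuffle(shuffled, new java.util.Random(seed));

        List<List<LabeledRcv>> folds = new ArrayList<>();
        for (int f = 0; f < 5; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % 5).add(shuffled.get(i));     // round-robin assignment into 5 sections
        }
        return folds;
    }

    // In each of the 5 iterations, one fold is the test set and the other 4 form the training set.
    static List<LabeledRcv> trainingSet(List<List<LabeledRcv>> folds, int testFold) {
        List<LabeledRcv> training = new ArrayList<>();
        for (int f = 0; f < 5; f++) if (f != testFold) training.addAll(folds.get(f));
        return training;
    }
}
```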
5.4 Evaluation metrics and statistical analyses

5.4.1 Evaluation metrics

To evaluate the measures used in the case study, the related changes judged by DRCC are compared against the truly related changes. Average precision (PRE), recall (REC), F-measure (FM), and false positive rate (FPR) for each release are computed:

PRE = |correct ∩ detected| / |detected| %   (7)

REC = |correct ∩ detected| / |correct| %   (8)

FM = 2 ∗ (Precision ∗ Recall) / (Precision + Recall) %   (9)

FPR = |FP| / |TN + FP| %   (10)

where correct represents the set of truly related changes and detected is the set of related changes judged by DRCC. Precision is the percentage of related changes identified by DRCC that are truly related changes according to the oracle. Recall is the percentage of the truly related changes that are successfully retrieved by DRCC. The F-measure is a weighted harmonic mean of precision and recall and can be used as a comprehensive indicator of combined precision and recall values. FP is the number of false positives, namely, the related changes identified by DRCC that are not truly related. TN is the number of true negatives, namely, changes that are correctly identified as not related. FPR measures the proportion of false positives over the total number of negative instances.

5.4.2 Testing statistical significance

The goal of our RQs is to compare the PRE, REC, FM, and FPR achieved by different machine learning algorithms. According to the work on evaluating machine learning algorithms in Demšar,44 we use the Friedman test with Nemenyi's post hoc procedure to establish statistical significance. The Friedman test firstly compares the performance of k machine learning algorithms over m data sets; if the null hypothesis is rejected using the Friedman test, then the Nemenyi test is used as a post hoc procedure to compare pairs of machine learning algorithms. Besides, to control the family-wise error, Bonferroni correction is introduced.44 The ith null hypothesis used for the Nemenyi test is

H_0^i: There is no statistically significant difference between the PRE (or REC, or FM, or FPR) identified by the nth and (n + 1)th machine learning algorithms.

Moreover, a p̂ nonparametric effect size estimator45 is introduced to represent the probability that a value randomly drawn from 1 sample will be greater than a value randomly drawn from the other sample. The estimator p̂ is defined as

p̂_{a,b} = U / (n_a n_b),   (11)

where U is the Mann-Whitney statistic and n_a n_b is the product of the 2 sample sizes. Therefore, for RQ1 and RQ2, we use the Friedman test, and for all the research questions, we use the statistic p̂ as the effect size estimator.
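For reference, the four metrics can be computed directly from the confusion counts of one test fold. The following sketch is a direct restatement of formulas 7 to 10 in code; the variable names are ours.

```java
// Direct computation of PRE, REC, FM, and FPR (formulas 7-10) from confusion counts.
public class Metrics {
    final int truePositives;    // related pairs correctly identified as related
    final int falsePositives;   // unrelated pairs wrongly identified as related
    final int trueNegatives;    // unrelated pairs correctly identified as unrelated
    final int falseNegatives;   // related pairs wrongly identified as unrelated

    Metrics(int tp, int fp, int tn, int fn) {
        this.truePositives = tp; this.falsePositives = fp;
        this.trueNegatives = tn; this.falseNegatives = fn;
    }

    double precision() {                       // PRE = |correct ∩ detected| / |detected|
        int detected = truePositives + falsePositives;
        return detected == 0 ? 0.0 : 100.0 * truePositives / detected;
    }

    double recall() {                          // REC = |correct ∩ detected| / |correct|
        int correct = truePositives + falseNegatives;
        return correct == 0 ? 0.0 : 100.0 * truePositives / correct;
    }

    double fMeasure() {                        // FM = 2 * P * R / (P + R)
        double p = precision(), r = recall();
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    double falsePositiveRate() {               // FPR = FP / (TN + FP)
        int negatives = trueNegatives + falsePositives;
        return negatives == 0 ? 0.0 : 100.0 * falsePositives / negatives;
    }
}
```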
6 CASE STUDY RESULTS

6.1 RQ1: machine learning algorithms

Figure 4 shows a statistical summary of the PRE, REC, FM, and FPR for the detection results using each algorithm on the data sets of the 5 projects. Each boxplot represents the distribution of a metric for 1 algorithm, on the 3 RCVs with different dimension levels, with 5-fold cross validation. We observe that the medians of PNN and GRNN outperform the other algorithms. We apply the Friedman test to test the statistical significance of the difference in the results of each machine learning algorithm. When testing the 4 metrics (PRE, REC, FM, and FPR) at a 5% confidence level, we find that the P values (2-tailed) for the comparison of the four metrics of GRNN vs PNN are greater than the Bonferroni corrected significance level (0.0033). These values suggest that we accept the null hypotheses (H_0^1, H_0^6, H_0^11, and H_0^16) stating that there is no significant difference between the PRE, REC, FM, and FPR of PNN and GRNN. Moreover, when testing the statistical significance of the difference between the results of PNN and the rest of the machine learning algorithms, we find that the P values are less than 0.0001. Therefore, we reject the null hypothesis that there is no significant difference between the PRE, REC, FM, and FPR of PNN and the rest of the machine learning algorithms.

Table 5 lists the null hypotheses (H_0^1 to H_0^20) used for the Nemenyi post hoc test on the difference between specific pairs of machine learning algorithms. The P values and the effect sizes are also listed in Table 5. For PRE, REC, and FM, we find that the effect sizes reported for PNN versus the other algorithms (kNN, NB, DT, SVM) are higher than 0.84. For FPR, the effect size is small, with a probability that is lower than 0.44. This means that there are significant differences between the performance of PNN compared with kNN, NB, DT, and SVM. Therefore, we reject all the null hypotheses (except for H_0^1, H_0^6, H_0^11, and H_0^16), meaning that the mean PRE, REC, and FM given by PNN are statistically significantly higher than the results from kNN, NB, DT, and SVM, while the FPR given by PNN is statistically significantly lower than the results from kNN, NB, DT, and SVM.

In principle, PNN and GRNN are both radial basis networks suitable for classification problems. They have an adjustable parameter, spread, which is the spread of the radial basis functions (default = 1.0).
FIGURE 4 Performance metrics for each algorithm over a combination of RCV14 , RCV21 , and RCV72 . The red solid line is the median. The circle is the singular value and the box is the Interquartile Range (IQR). The thin line extends from Q1-1.5IQR to Q3+1.5IQR
TABLE 5 Null hypotheses and P values for RQ1 using the Nemenyi test and nonparametric effect sizes p̂

Comparison | PRE | REC | FM | FPR
GRNN vs PNN | H_0^1, P = .0644, p̂ = 0.5833 | H_0^6, P = .0250, p̂ = 0.5992 | H_0^11, P = .0061, p̂ = 0.6243 | H_0^16, P = .2713, p̂ = 0.5387
kNN vs PNN | H_0^2, P < .0001, p̂ = 0.9051 | H_0^7, P < .0001, p̂ = 0.8541 | H_0^12, P < .0001, p̂ = 0.8501 | H_0^17, P < .0001, p̂ = 0.3618
NB vs PNN | H_0^3, P < .0001, p̂ = 0.9354 | H_0^8, P < .0001, p̂ = 0.8795 | H_0^13, P < .0001, p̂ = 0.8731 | H_0^18, P < .0001, p̂ = 0.4361
DT vs PNN | H_0^4, P < .0001, p̂ = 0.9221 | H_0^9, P < .0001, p̂ = 0.8702 | H_0^14, P < .0001, p̂ = 0.9191 | H_0^19, P < .0001, p̂ = 0.4382
SVM vs PNN | H_0^5, P < .0001, p̂ = 0.9259 | H_0^10, P < .0001, p̂ = 0.8696 | H_0^15, P < .0001, p̂ = 0.8465 | H_0^20, P < .0001, p̂ = 0.4276

Abbreviations: PRE, average precision; REC, recall; FM, F-measure; FPR, false positive rate; GRNN, general regression neural network; PNN, probabilistic neural network. Each cell lists a null hypothesis that formulates that there is no significant difference between the values of a metric achieved with PNN and the values of the same metric achieved with another machine learning algorithm. For example, H_0^1 formulates that there is no significant difference between the precision of GRNN and PNN. The Bonferroni corrected significance level (for the Nemenyi test) is 0.0033.
TABLE 6 Null hypotheses and P values for RQ2 using the Nemenyi test and nonparametric effect sizes p̂

Comparison | PRE | REC | FM | FPR
RCV14 vs RCV72 | H_0^21, P = .0032, p̂ = 0.6392 | H_0^24, P = .9319, p̂ = 0.5057 | H_0^27, P = .0047, p̂ = 0.6295 | H_0^30, P = .3873, p̂ = 0.4678
RCV21 vs RCV72 | H_0^22, P < .0001, p̂ = 0.8358 | H_0^25, P < .0001, p̂ = 0.8244 | H_0^28, P < .0001, p̂ = 0.8822 | H_0^31, P = .0113, p̂ = 0.3977
RCV21 vs RCV14 | H_0^23, P < .0001, p̂ = 0.7963 | H_0^26, P < .0001, p̂ = 0.8064 | H_0^29, P < .0001, p̂ = 0.8403 | H_0^32, P = .0081, p̂ = 0.4396

Abbreviation: RCV, related changes vector. Each cell lists a null hypothesis. For example, H_0^21 formulates that there is no significant difference between the precisions of RCV14 and RCV72. The Bonferroni corrected significance level (for the Nemenyi test) was 0.0167.
FIGURE 5 Performance metrics for RCV72, RCV14, and RCV21 over 6 algorithms in all data sets
If spread is near zero, the network acts as a nearest neighbor classifier. As spread becomes larger, the network takes into account several nearby design vectors. Coincidentally, PNN and GRNN both achieve the best accuracy when the spread value is set to 0.4 in our project. With deeper analysis, we discover that these 2 classifiers are powerful memory-based networks that are able to deal with sparse data effectively. The RCV obtained is rather sparse. This is one of the reasons that PNN and GRNN can achieve higher accuracy compared with the other machine learning algorithms.

Therefore, we answer RQ1 by concluding that PNN and GRNN are the most effective machine learning algorithms for related code changes detection in our evaluation.
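To show where the spread parameter enters, the sketch below implements the basic PNN decision rule (a Gaussian Parzen-window sum per category) over binary RCVs. It is a minimal illustration written by us, not the network configuration used in the experiments.

```java
import java.util.List;

// Minimal PNN-style classifier: for each category, sum Gaussian kernels centered on
// the training vectors of that category and pick the category with the larger sum.
public class SimplePnn {
    private final List<int[]> relatedTraining;    // RCVs labeled related (1)
    private final List<int[]> unrelatedTraining;  // RCVs labeled unrelated (0)
    private final double spread;                  // spread of the radial basis functions

    SimplePnn(List<int[]> related, List<int[]> unrelated, double spread) {
        this.relatedTraining = related;
        this.unrelatedTraining = unrelated;
        this.spread = spread;                     // eg, 0.4 as in the paper's setting
    }

    private double kernelSum(List<int[]> patterns, int[] x) {
        double sum = 0.0;
        for (int[] p : patterns) {
            double squaredDistance = 0.0;
            for (int k = 0; k < x.length; k++) {
                double d = x[k] - p[k];
                squaredDistance += d * d;
            }
            sum += Math.exp(-squaredDistance / (2.0 * spread * spread));
        }
        // Average so that categories with more training samples are not favored.
        return patterns.isEmpty() ? 0.0 : sum / patterns.size();
    }

    // Returns 1 if the RCV is classified as related, 0 otherwise.
    int classify(int[] rcv) {
        return kernelSum(relatedTraining, rcv) >= kernelSum(unrelatedTraining, rcv) ? 1 : 0;
    }
}
```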
6.2 RQ2: performance of RCV72, RCV14, and RCV21

To evaluate whether RCVs with fine-grained dimensions are necessary for related code changes detection, we construct 3 RCVs (RCV72, RCV14, and RCV21) at different granularities, use these RCVs as input for the machine learning algorithms, and then compare the performance of related code changes detection. Each RCV is evaluated using the 6 machine learning algorithms (PNN, GRNN, kNN, DT, NB, and SVM) to minimize a threat to validity faced when using only 1 algorithm. At last, we group the results (eg, PRE, REC, FM, and FPR) for each RCV from the 6 machine learning algorithms.

A statistical summary of the results is presented in Figure 5. Each boxplot corresponds to an RCV. We observe that the average PRE for RCV72, RCV14, and RCV21 is 48.09%, 42.61%, and 25.73%, respectively. RCV72 is the RCV with the highest precision, and RCV21 is the lowest one. The average value of PRE for RCV72 presents a roughly 5.48% improvement over RCV14, and the average value of PRE for RCV14 presents a roughly 16.88% improvement over RCV21. The average values of REC for RCV72, RCV14, and RCV21 are 47.82%, 46.64%, and 30.79%, respectively. RCV72 presents a roughly 1.18% improvement over RCV14, and RCV72 presents a roughly 17.03% improvement over RCV21. The average values of FM for RCV72, RCV14, and RCV21 are 46.57%, 41.89%, and 24.26%, respectively. RCV72 provides a 4.68% improvement over RCV14 and a 22.31% accuracy improvement over RCV21. The average values of FPR for RCV72, RCV14, and RCV21 are 7.34%, 8.95%, and 9.06%, respectively. The FPR of RCV72 is the lowest, and the FPR of RCV21 is the highest.

Therefore, there is a difference between the performance of RCV72, RCV14, and RCV21. The Friedman test shows that the difference in averages is statistically significant. When testing the 4 metrics at a 5% confidence level, we find that the P values (2-tailed) are less than the corrected significance level (0.0167) except for H_0^24 and H_0^30. These values suggest that we reject the null hypotheses stating that there is no significant difference between the PRE, REC, FM, and FPR for RCV72, RCV14, and RCV21. Table 6 lists the null hypotheses we use in the Nemenyi post hoc test. We find that the P values for the comparison of PRE and FM of RCV21 vs RCV72, RCV14 vs RCV72, and RCV21 vs RCV14 are lower than the Bonferroni corrected significance level (0.0167). The lowest value of the effect sizes is in RCV21 vs RCV72, and we find that the effect sizes reported for RCV72 versus the other 2 RCVs are higher than 0.88. Therefore, all the hypotheses are rejected except for H_0^24 and H_0^30. The P values for H_0^24 and H_0^30 are greater than the Bonferroni corrected significance level (0.0167), which suggests that we accept the null hypotheses stating that there is no significant difference between the REC and FPR of RCV14 vs RCV72.

Although there is no significant difference between the REC and FPR of RCV14 vs RCV72, RCV72 is more effective because its REC is slightly higher and its FPR is lower. Moreover, RCV14 is more effective than RCV21 for the detection of related code changes.
In general, the results show that RCV72 outperforms RCV14 and RCV14 outperforms RCV21. This is not surprising, because fine-grained coupling rules provide more concrete modeling of the relationship between changed entities. Therefore, it is necessary to define fine-grained coupling rules for the detection of related code changes.

In addition, to judge whether every dimension is necessary in RCV72, we conduct a quantitative analysis. The information gain is used to measure the importance of every dimension. Information gain is originally used to decide the ordering of attributes in the nodes of a DT.27 Information gain tells us how important a given attribute of the feature vectors is. If an information gain is positive, it indicates that the corresponding dimension has a positive effect on distinguishing the data samples; if an information gain is negative, it indicates that the corresponding dimension has a negative effect on distinguishing the data samples, and we need to remove it from the vectors.

FIGURE 6 Information gain

As Figure 6 shows, the highest information gain value is for the coupling instance MMAUA with cochanged type <changed, new> (corresponding to feature 44 in Figure 6). Moreover, IH with <new, new> (feature 4), PT with <new, new> (feature 20), and Construction Method Invoking with <new, new> (feature 42) have relatively high information gain values. In general, the most important dimensions are broadly in line with what we expected, especially the cochanged types mechanism. For these most important dimensions, the cochanged types are either <changed, new> or <new, new>. We have mentioned that changed software entities with these 2 cochanged types are likely to have related changed code. There is no negative information gain value among the 72 features, while there are 9 features whose information gain values are 0. These features are IO with <new, changed> (feature 11), DC with <new, changed> (feature 23), Static Method Invoking with <changed, new> (feature 34), etc. The reason is that there is no case satisfying these features in all the RCVs generated from the 5 projects in the experiment. However, we cannot regard these 9 features as useless, since the RCVs generated from other projects may satisfy these features, so we retain them in this paper.
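Information gain for a single binary dimension of RCV72 can be computed as the reduction in label entropy after splitting on that dimension. The helper below is a generic sketch of that computation (our own code, with labels and features as plain arrays), not the exact tooling used to produce Figure 6.

```java
// Information gain of one binary feature (an RCV dimension) with respect to the
// binary label (related = 1, unrelated = 0): IG = H(label) - H(label | feature).
public class InformationGain {

    private static double entropy(int positives, int total) {
        if (total == 0) return 0.0;
        double p = (double) positives / total;
        double h = 0.0;
        if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        if (p < 1) h -= (1 - p) * (Math.log(1 - p) / Math.log(2));
        return h;
    }

    static double of(int[] feature, int[] label) {
        int n = label.length;
        int positives = 0, onCount = 0, onPositives = 0, offPositives = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] == 1) positives++;
            if (feature[i] == 1) {
                onCount++;
                if (label[i] == 1) onPositives++;
            } else if (label[i] == 1) {
                offPositives++;
            }
        }
        int offCount = n - onCount;
        double conditional = (onCount / (double) n) * entropy(onPositives, onCount)
                           + (offCount / (double) n) * entropy(offPositives, offCount);
        return entropy(positives, n) - conditional;   // higher means more discriminative
    }
}
```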
6.3 RQ3: comparing DRCC with CLUSTERCHANGES

In this section, we compare DRCC with the existing approach CLUSTERCHANGES (short for CLUS).5 CLUS conducted its experiment on nonpublic datasets with a specific platform (eg, C#) and environment (eg, Microsoft). It is difficult to conduct the comparison experiment on the same dataset and experimental environment. We reimplemented CLUS in Java according to the mechanism in Barnett et al5 and conducted the comparison experiment on the commits of jEdit (https://sourceforge.net/p/jedit/svn/commit_browser) and JabRef (https://sourceforge.net/p/jabref/code/commit_browser). We firstly select hundreds of continuous commits from jEdit and JabRef and then omit the commits involving no code changes (eg, a commit that only involves configuration file changes). If a commit is a tangled one,2 we manually split it into several self-contained (or cohesive) commits. After that, all the selected commits are atomic and form a commit sequence according to release time. Then, the commit sequence is used to evaluate DRCC and CLUS. We firstly combine any 2 adjacent commits in the sequence and then apply DRCC and CLUS to decompose the combined commits, respectively. We want to explore whether DRCC and CLUS can decompose the combined commits into the original ones (see the left-hand side in Figure 7) and focus on the success rate of decomposition of DRCC and CLUS. As a comparison, we also combine 4 or more adjacent commits each time and then apply DRCC and CLUS to decompose them again. In the whole process, DRCC uses PNN as the classifier and uses RCV72 as input vectors, and we extract 400 self-contained commits from jEdit and JabRef, respectively.

A commit in the sequence contains 1 or more involved classes. Therefore, when decomposing a combined commit, DRCC identifies the change relevance of each pair of classes in the combined commit and regards the classes involving related code changes as coming from the same commit. Specifically, because all code changes in a class are organized as entity-units (eg, methods or attributes), DRCC actually identifies the change relevance of any pair of changed methods (or attributes) in two classes. If a pair of changed methods (or attributes) is identified as related, DRCC regards these 2 classes' changes as related. Finally, the classes involving related code changes are clustered to form a decomposed commit.

To measure the success rate of decomposition of DRCC and CLUS, we need to know the ratio between the number of successfully decomposed classes and the total number of classes. To know whether a class has been successfully decomposed, we must find which decomposed commit best matches which original commit. Our evaluation method is inspired by Dias et al.2 The matrix on the right-hand side of Figure 7 shows a sample comparison between decomposed commits and original commits. The matrix represents the Jaccard indexes computed for each pair of decomposed and original commits. This index is defined by using the following formula:

J_{i,j} = |decomp_i ∩ commit_j| / |decomp_i ∪ commit_j|   (12)

From the resulting matrix, we want to know the matching relations between decomposed commits and original commits. This can be obtained by maximizing the sum of the Jaccard indexes over all permutations. For the sample in Figure 7, the maximum sum over all the permutations (1.0) is attained for this set of pairs:

Matching = {(decomp1, commit1), (decomp2, commit2)}.   (13)

We compute the success rate of decomposition using the following formula:
FIGURE 7 Comparison between original and decomposed commits. DRCC, detecting the relevance of code changes; CLUS, CLUSTERCHANGES
FIGURE 8 Success rate achieved by detecting the relevance of code changes (DRCC) and CLUSTERCHANGES (CLUS)
SuccessRate =
#SuccessfullyDecomposedClasses #Classes
(14)
A class cla_i is successfully decomposed if the original and decomposed commits that contain cla_i are in the same pair of the matching set. For the decomposition of commits 1 and 2 in Figure 7, all classes are successfully decomposed except cla_2, which gives a success rate of 2/3 = 0.67.
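Continuing the sketch above, Equation 14 can then be evaluated as follows. The class_of mapping and the assumption that each class appears in exactly one original and one decomposed commit are simplifications made for illustration.

```python
def success_rate(decomps, commits, matching, class_of):
    """Equation 14: the fraction of classes whose original and decomposed
    commits end up in the same pair of the matching set.

    decomps, commits: lists of sets of change identifiers
    matching:         (decomp_index, commit_index) pairs from best_matching()
    class_of:         maps a change identifier to its enclosing class
    """
    classes = {class_of[ch] for commit in commits for ch in commit}
    successes = 0
    for cla in classes:
        for i, j in matching:
            in_decomp = any(class_of[ch] == cla for ch in decomps[i])
            in_commit = any(class_of[ch] == cla for ch in commits[j])
            if in_decomp and in_commit:
                successes += 1
                break
    return successes / len(classes)

# Toy example: all classes land in matching pairs -> success rate of 1.0.
decomps = [{"a1", "b1"}, {"c1"}]
commits = [{"a1", "b1"}, {"c1"}]
class_of = {"a1": "claA", "b1": "claB", "c1": "claC"}
print(success_rate(decomps, commits, [(0, 0), (1, 1)], class_of))  # 1.0
```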
As Figure 8 shows, when only 2 adjacent commits are combined each time, DRCC and CLUS can successfully decompose the combined commits with high success rates, both in jEdit and JabRef. At this point, the success rates achieved by CLUS (about 92% in jEdit, 91% in JabRef) and DRCC (about 91% in jEdit, 91% in JabRef) show no significant difference. As the number of combined commits increases to 4, the success rates achieved by CLUS (about 69% in jEdit, 66% in JabRef) decline quickly. When the number of combined commits increases to 7, the performance of DRCC (about 65% in jEdit, 66% in JabRef) is significantly better than that of CLUS (about 57% in jEdit, 60% in JabRef). As the number of combined commits increases to 10, the success rates of the two approaches are both below 51%, while the performance of DRCC is still better than that of CLUS.

So we can come to this conclusion: as the number of combined adjacent commits increases, the performance of DRCC becomes superior to that of CLUS. The basic idea behind CLUS is that it relates all change regions as long as they involve definitions (eg, fields, methods, and local variables) and corresponding uses (eg, references to a definition). As a result, when adjacent commits are combined in the experiment, CLUS relates change regions from separate commits whenever the definition-and-use principle is satisfied; we call this being overrelated. For example, the commits [r24067] and [r24068] in jEdit are both atomic. [r24067] records the change region in method setValueAt() of class ManagePanel. [r24068] records the change region in another method, getDeclaredJars(), of class ManagePanel. Besides, [r24068] also records the change region in method getPluginCacheEntry() of class PluginJAR. If we combine these 2 commits and apply CLUS to decompose them, CLUS will consider the change region in setValueAt() to be related to the change region in getDeclaredJars() because of the "uses" of PluginJAR in both change regions; namely, they satisfy the "useUsesInDiffs" relation proposed in Barnett et al.5 In fact, these 2 change regions are unrelated. DRCC behaves differently from CLUS: it considers the change regions in setValueAt() and getDeclaredJars() to be unrelated because they do not satisfy any coupling rule or cochanged type. On the other hand, the change regions in getDeclaredJars() and getPluginCacheEntry() satisfy the coupling rule CR4 and the < changed, changed > cochanged type, so DRCC considers them to be related. As a result, DRCC can properly decompose the combination of [r24067] and [r24068] into the original commits.
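To make the overrelation effect concrete, the sketch below models each change region by the names it defines and uses and groups regions that share a def-use or use-use link, in the spirit of the useUsesInDiffs relation; it is a simplified illustration of the grouping principle, not the actual algorithm of Barnett et al.5 The def/use sets in the example are schematic and only encode the fact, stated above, that both change regions use PluginJAR.

```python
def defuse_groups(regions):
    """Group change regions linked by shared definitions and uses.

    regions: dict mapping region name -> (set of defined names, set of used names)
    """
    names = list(regions)
    parent = {r: r for r in names}

    def find(r):
        while parent[r] != r:
            r = parent[r]
        return r

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            defs_a, uses_a = regions[a]
            defs_b, uses_b = regions[b]
            # def-use or use-use overlap => treat the regions as related
            if (defs_a & uses_b) or (defs_b & uses_a) or (uses_a & uses_b):
                parent[find(a)] = find(b)

    groups = {}
    for r in names:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

# Both regions use PluginJAR, so a use-use link groups them together
# even though the underlying changes are actually unrelated.
regions = {
    "setValueAt":      (set(), {"PluginJAR"}),
    "getDeclaredJars": (set(), {"PluginJAR"}),
}
print(defuse_groups(regions))  # [['setValueAt', 'getDeclaredJars']]
```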
On the basis of observing a number of commits, we find that a developer may address the same bugs/features over a period of time in several commits, and these commits act on roughly the same code segments. When we combine these commits into a big one in the experiment, both DRCC and CLUS regard these code segments as related and cannot decompose the combined commit into the original ones. Another situation is that a developer may address the same bugs/features over a period of time in several commits that act on distinct code segments of the same method. Because DRCC and CLUS consider the distinct code changes occurring in the same method to be related, neither DRCC nor CLUS can decompose the commit combined from these atomic commits. For example, the commits [0b7de9] and [a0eb0d] in JabRef record distinct code changes of the method properBrackets() (in class AuthorList), and DRCC and CLUS judge that the code changes recorded in the two commits are related. There are similar cases in JabRef, such as commits [ac2c0f] and [8ce578], commits [0c1109] and [8754e6], etc.
However, in some cases, a developer addressed the same bugs/features in several commits, and we still kept these commits in our data set. For example, developer daleanson addressed the same feature #422 in 4 consecutive commits (from commit #23971 to #23974) in project jEdit. Although the 4 commits concern the same feature, the code segments involved in each commit are located in different classes, so these 4 commits are atomic. Therefore, we put these commits in our data set.

Besides, we count the total time spent on decomposing the commits when the combined numbers are 2, 4, 7, and 10. In the experiment, DRCC and CLUS are executed on Windows 7 with a quad-core 3.3-GHz Intel Core i5 processor and 8 GB of memory. As Figure 9 shows, both DRCC and CLUS can accomplish the task of decomposing the commits within 8 to 14 minutes. In general, the total time increases along with the number of combined commits. It is worth noting that the total number of commits does not change as the number of combined commits (2, 4, 7, and 10) changes.

FIGURE 9 Time cost. DRCC, detecting the relevance of code changes; CLUS, CLUSTERCHANGES
6.4 Usefulness analysis
To evaluate how our approach can be used to help developers understand source code changes, we design a questionnaire study. Specifically, we collect 12 nonatomic commits; a nonatomic commit contains 2 unrelated code changes. We use DRCC to decompose each nonatomic commit into 2 atomic commits. In the first questionnaire, we ask the participants to directly read the changed code segments of the 12 nonatomic commits and write a commit comment for each nonatomic commit. The time they start and finish the questionnaire is recorded. For comparison, in the second questionnaire, we ask the participants to read the changed code segments of the 24 atomic commits decomposed from the 12 nonatomic commits and write a commit comment for each atomic commit. The time they start and finish the questionnaire is recorded too.

The questionnaires are published as tasks on the zhubajie website (http://www.zbj.com), and each participant receives about $5 for completing a valid questionnaire. There are 5 participants in the first questionnaire and 5 participants in the second questionnaire. To determine whether the participants understand the commits (atomic or nonatomic), we compare the comments written by the participants with the original comments of the commits. For a nonatomic commit, its original comment describes the 2 code changes corresponding to its 2 atomic commits. If a participant's comment expresses the same or similar meaning as the original comment, we regard the participant as really understanding the commit and the questionnaire as valid. It is worth noting that the handwritten comment for a nonatomic commit is valid only when the comment describes the corresponding 2 code changes. All of the 10 participants engage in programming work at different Internet companies. The first 5 participants have on average 4.6 years of Java programming experience, and the latter 5 participants have on average 4.4 years of Java programming experience.

Figure 10A shows that 88.33% of the atomic commits and 76.67% of the nonatomic commits are correctly understood by the participants. This result shows that decomposing nonatomic commits into atomic ones can improve the accuracy in understanding commits. Meanwhile, we calculate the time spent in understanding atomic and nonatomic commits, respectively. Figure 10B shows that the average time spent on correctly understanding an atomic commit and a nonatomic commit is 5.8 and 18.2 minutes, respectively. This means that for a nonatomic commit containing 2 atomic commits, the average time spent on understanding it is 11.6 minutes after the commit is decomposed into 2 atomic commits. Therefore, about 6.6 minutes (namely, 18.2 − 11.6 = 6.6) can be saved in understanding a nonatomic commit using DRCC. DRCC can thus significantly improve the efficiency of programmers in code change reviews.

7 THREATS TO VALIDITY

In this section, we focus on the threats that could affect the results of our case studies. Several issues are listed as follows.

In this paper, although we consider most of the fine-grained structural coupling rules (see Table 1) in the object-oriented programming paradigm, we believe that there exist some coupling rules we have missed, which may make our set of coupling rules incomplete. On the other hand, we have chosen to ignore some coupling rules (see Table 7), for example, the CRaa coupling rule (where ai* is an attribute of class Ci, aj* is an attribute of class Cj, and ai* and aj* couple together via value assignment). From thousands of cases, we find that there are no related code changes induced by the CRaa coupling rule, so we do not take CRaa into consideration, and similarly for CRam.

Another threat to validity is the correctness of the data sets. We tried to avoid bias during the data collection phase. For example, we first invite 3 students to manually identify truly related changes and another 3 students to validate the results, and they are not aware of the experimental goals or the way that DRCC identifies related code changes. However, it is still possible that there are biases and noise in the results. Because few of the graduate students have been in contact with real project development as in a software house, they do not have enough coding experience in a real production environment, and their capacity for reading, analyzing, and comprehending code may fall short compared with that of professional software developers. Consequently, there may exist false positive or false negative results, which may affect the validity of the case study.

In the experiment, we only consider bug fixing commits as our dataset. Bug fixing commits are a small fraction of all the commits in a system, so it may be a somewhat restricted training/validation dataset.
FIGURE 10 The usefulness analysis of detecting the relevance of code changes
TABLE 7 Ignored instances of coupling rules

No.   Coupling Rules   Instances             Abbreviation   Description
1     CRaa             Ci · ai∗ ⇒ Cj · aj∗    –              Attribute-to-Attribute
2     CRam             Ci · ai∗ ⇒ Cj · mj∗    –              Attribute-to-Method
However, in our previous research,6 we find that bug fixing commits and other types of commits have no essential difference. In that paper,6 we do not differentiate the gold set coming from bug fixing commits from that coming from other types of commits, and the final accuracy in that paper6 and the accuracy in this paper are almost at the same level.
8 CONCLUSIONS AND FUTURE WORK

This paper describes an approach to identify related code changes. To obtain the discriminative feature derived from the interior of the entities themselves, we collect the coupling rules and cochanged types based on the object-oriented programming paradigm by analyzing real source code. The related changes vector (RCV) is defined to measure the discriminative feature, and machine learning algorithms are used to identify the relevance of code changes. The experiment results indicate that PNN performs better for related changes detection than the other machine learning algorithms, and that the 72-dimension RCV is the more reasonable choice. In addition, we empirically investigate how DRCC performs compared with CLUS. The results further validate the effectiveness of DRCC as the number of combined commits increases. Finally, we show that DRCC can help developers in practice. In general, DRCC can help developers in 2 scenarios: (1) in the daily development process, developers may create a nonatomic commit; they can then use DRCC to decompose the nonatomic commit into atomic ones, which prevents them from submitting nonatomic commits to the version control system; (2) for the nonatomic commits already existing in the version control system, DRCC can decompose them into atomic ones, which saves reviewers from having to understand several unrelated code changes at once.

The future research agenda mainly focuses on fitting DRCC into the development process. When a developer unconsciously creates a nonatomic commit, DRCC automatically decomposes it into atomic ones; then, we help the developer generate a readable comment for each commit. The first step has already been addressed by this work; the next step focuses on automatically generating readable commit comments. To reduce the complexity of the comment generation algorithm, we should first decompose a nonatomic commit into atomic ones. Because a nonatomic commit usually contains several independent code changes, it is difficult for a comment generation algorithm to generate a single comment that covers all the information of these code changes. Therefore, using DRCC to decompose a nonatomic commit into atomic ones can help improve the accuracy of the comment generation algorithm. In the future, DRCC in conjunction with a comment generation algorithm will offer assistance to developers in their daily development.

ACKNOWLEDGMENTS

This research is supported by the National Key Research and Development Program (2016YFB1000101), the National Natural Science Foundation of China (61232011 and 61672545), and the Science and Technology Planning Project of Guangdong Province (2014B010118003).
REFERENCES
1. Vaucher S, Sahraoui H, Vaucher J. Discovering new change patterns in object-oriented systems. 2008 15th Working Conference on Reverse Engineering. IEEE: Antwerp, 2008:37–41.
2. Dias M, Bacchelli A, Gousios G, Cassou D, Ducasse S. Untangling fine-grained code changes. Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering. IEEE: Montreal, 2015;341–350.
3. Herzig K, Zeller A. The impact of tangled code changes. Proceedings of 10th Conference on Mining Software Repositories, MSR '13. IEEE Press: San Francisco, 2013;121–130.
4. Gómez VU, Ducasse S, D'Hondt T. Visually characterizing source code changes. Sci Comput Program. 2015;98, Part 3:376–393.
5. Barnett M, Bird C, Brunet J, Lahiri SK. Helping developers help themselves: Automatic decomposition of code review changesets. Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE '15. Piscataway, NJ, USA: IEEE Press; 2015:134–144.
6. Huang Y, Chen X, Zou Q, Luo X. A probabilistic neural network-based approach for related software changes detection. 21st Asia-Pacific Software Engineering Conference, APSEC 2014, Jeju, South Korea, December 1-4, 2014. Volume 1: Research Papers; 2014:279–286.
7. Briand L, Devanbu P, Melo W. An investigation into coupling measures for C++. Proceedings of the 19th International Conference on Software Engineering, ICSE '97. New York, NY, USA: ACM; 1997:412–421.
8. Al Dallal J. Accounting for data encapsulation in the measurement of object-oriented class cohesion. J Software: Evol Process. 2015;27(5):373–400.
9. Hassan AE, Holt RC. Predicting change propagation in software systems. Proceedings of the 20th IEEE International Conference on Software Maintenance, ICSM '04. Washington, DC, USA: IEEE Computer Society; 2004:284–293.
10. Aryani A, Peake ID, Hamilton M. Domain-based change propagation analysis: An enterprise system case study. 2010 IEEE International Conference on Software Maintenance (ICSM), ICSM 2010; Timisoara 2010:1–9.
11. Gyimothy T, Ferenc R, Siket I. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Software Eng. 2005;31(10):897–910.
12. Olague HM, Etzkorn LH, Gholston S, Quattlebaum S. Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes. IEEE Trans Software Eng. 2007;33(6):402–419.
13. Aryani A, Perin F, Lungu M, Mahmood AN, Nierstrasz O. Predicting dependences using domain-based coupling. J Software: Evol Process. 2014;26(1):50–76.
14. Yang HY, Tempero E, Berrigan R. Detecting indirect coupling. Proceedings of the 2005 Australian Conference on Software Engineering, ASWEC '05. Washington, DC, USA: IEEE Computer Society; 2005:212–221.
15. Arisholm E, Briand LC, Foyen A. Dynamic coupling measurement for object-oriented software. IEEE Trans Software Eng. 2004;30(8):491–506.
16. Hassoun Y, Johnson R, Counsell S. A dynamic runtime coupling metric for meta-level architectures. 15th European Conference on Software Maintenance and Reengineering, CSMR 2011; Oldenburg, 2011:339–346.
17. Chidamber SR, Kemerer CF. A metrics suite for object oriented design. IEEE Trans Software Eng. 1994;20(6):476–493.
18. Churcher NI, Shepperd MJ. Towards a conceptual framework for object oriented software metrics. ACM SIGSOFT Software Eng Notes. 1995;20(2):69–75.
19. Briand LC, Morasca S, Basili VR. Property-based software engineering measurement. IEEE Trans Software Eng. 1996;22(1):68–86.
20. Li W, Henry S. Object-oriented metrics that predict maintainability. J Syst Software. 1993;23(2):111–122.
21. Martin R. OO design quality metrics-an analysis of dependencies. RODA. 1995;2(3):151–170.
22. Lee YS, Liang BS, Wu SF, Wang FJ. Measuring the coupling and cohesion of an object-oriented program based on information flow. Proceedings of the International Conference on Software Quality, Maribor, Slovenia; 1995:81–90.
23. Ying ATT, Murphy GC, Ng R, Chu-Carroll MC. Predicting source code changes by mining change history. IEEE Trans Software Eng. 2004;30(9):574–586.
24. Linares-Vásquez M, McMillan C, Poshyvanyk D, Grechanik M. On using machine learning to automatically classify software applications into domain categories. Empirical Software Eng. 2014;19(3):582–618.
25. Coomans D, Massart DL. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Anal Chim Acta. 1982;136:15–27.
26. Zolnierek A, Rubacha B. The Empirical Study of the Naive Bayes Classifier. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005;329–336.
27. Quinlan JR. Simplifying decision trees. Int J Man-Mach Stud. 1987;27(3):221–234.
28. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–297.
29. Specht DF. Probabilistic neural networks. Neural Networks. 1990;3(1):109–118.
30. Specht DF. A general regression neural network. IEEE Trans Neural Networks. 1991;2(6):568–576.
31. Jaafar F, Guéhéneuc Y-G, Hamel S, Antoniol G. Detecting asynchrony and dephase change patterns by mining software repositories. J Software: Evol Process. 2014;26(1):77–106.
32. Okutan A, Yildiz OT. A novel kernel to predict software defectiveness. J Syst Software. 2016;119:109–121.
33. Lewis DD. Naive (Bayes) at forty: The independence assumption in information retrieval. European Conference on Machine Learning. Springer: Chemnitz, 1998:4–15.
34. McCallum A, Nigam K, et al. A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, vol. 752: Citeseer: Madison, 1998:41–48.
35. Kosker Y, Turhan B, Bener A. An expert system for determining candidate software classes for refactoring. Expert Syst Appl. 2009;36(6):10000–10003.
36. Moser R, Pedrycz W, Succi G. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. 2008 ACM/IEEE 30th International Conference on Software Engineering. IEEE: Leipzig, 2008;181–190.
37. Knab P, Pinzger M, Bernstein A. Predicting defect densities in source code files with decision tree learners. Proceedings of the 2006 International Workshop on Mining Software Repositories. ACM; 2006:119–125.
38. Shivaji S, Whitehead EJ, Akella R, Kim S. Reducing features to improve code change-based bug prediction. IEEE Trans Software Eng. 2013;39(4):552–569.
39. Shivaji S, Whitehead EJ Jr., Akella R, Kim S. Reducing features to improve bug prediction. Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society: Auckland, 2009;600–604.
40. Washburne T, Stachowitz R, Hawley J, Romsdahl H. Automatic classification of software modules with probabilistic neural networks. Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on, vol. 6. IEEE: Orlando, 1994:3894–3899.
41. Kanmani S, Uthariaraj VR, Sankaranarayanan V, Thambidurai P. Object oriented software quality prediction using general regression neural networks. ACM SIGSOFT Software Eng Notes. 2004;29(5):1–6.
42. Nasseri E, Counsell S, Shepperd M. An empirical study of evolution of inheritance in Java OSS. 19th Australian Conference on Software Engineering (ASWEC 2008). IEEE: Perth, 2008;269–278.
43. Dit B, Holtzhauer A, Poshyvanyk D, Kagdi HH. A dataset from change history to support evaluation of software maintenance tasks. In: Zimmermann T, Penta MD, Kim S, eds. MSR: IEEE Computer Society; 2013:131–134.
44. Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.
45. Grissom RJ, Kim JJ. Effect Sizes for Research: Univariate and Multivariate Applications: Routledge; 2012.
How to cite this article: Huang Y, Chen X, Liu Z, Luo X, Zheng Z. Using discriminative feature in software entities for relevance identification of code changes. J Softw Evol Proc. 2017;29:e1859. https://doi.org/10.1002/smr.1859