Received: 4 April 2016
Revised: 16 January 2017
Accepted: 3 February 2017
DOI: 10.1002/smr.1859
RESEARCH ARTICLE
Using discriminative feature in software entities for relevance identification of code changes

Yuan Huang1,2 | Xiangping Chen2,3 | Zhiyong Liu1,2 | Xiaonan Luo1,2 | Zibin Zheng1

1 School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
2 National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China
3 Institute of Advanced Technology, Sun Yat-sen University, Guangzhou, China

Correspondence: Xiangping Chen, National Engineering Research Center of Digital Life, Sun Yat-sen University, Guangzhou, China. Email: [email protected]

Funding information: National Key Research and Development Program, Grant/Award Number: 2016YFB1000101; National Natural Science Foundation, Grant/Award Number: 61232011 and 61672545; Science and Technology Planning Project, Grant/Award Number: 2014B010118003

Abstract
Developers often bundle unrelated changes (eg, bug fix and feature addition) in a single commit and then submit a "poorly cohesive" commit to the version control system. Such a commit consists of multiple independent code changes and makes review of code changes harder. If the code changes can be identified as related and unrelated ones before commit, the "cohesiveness" of a commit can be guaranteed. Inspired by the effectiveness of machine learning techniques in the classification field, we model the relevance identification of code changes as a binary classification problem (ie, related and unrelated changes) and propose discriminative feature in software entities to characterize the relevance of code changes. In particular, to quantify the discriminative feature, 21 coupling rules and 4 cochanged type relationships are elaborately extracted from software entities to construct the related changes vector (RCV). The 21 coupling rules at the granularities of class, attribute, and method capture the relevance of code changes from the structural coupling dimension, and the 4 cochanged type relationships are defined to capture the change type combinations of software entities that may cause related changes. Based on RCV, machine learning algorithms are applied to identify the relevance of code changes. The experiment results show that probabilistic neural network and general regression neural network provide statistically significant improvements in accuracy of relevance identification of code changes over the other 4 machine learning algorithms. The related changes vector with 72 dimensions (RCV72) outperforms the other 2 RCVs with fewer dimensions.

KEYWORDS
coupling rules, cochanged types, discriminative feature, relevance of code changes
1 INTRODUCTION
During the software life cycle, developers make many code changes to meet ever-changing user requirements.1 After working for some time, developers commit their code changes to a version control system. Ideally, developers should bundle related code changes in a single commit. By doing so, developers will be required to frequently identify the relevance of code changes before commit, which may interrupt their work flow.2 So in practice, some commits inevitably consist of multiple unrelated code changes. As Figure 1 shows, the commit ([00d497] from JabRef*) in Subversion contains both a new feature addition (classes JabRefFrame, PluginInstallerAction, and PluginInstaller are cochanged for the feature of plug-in installer) and a feature improvement (class JabRefPreferences is changed for the improvement of a shortcut key). Such a commit is problematic, as it makes code review, reversion, and integration harder and historical analysis of the project less reliable.3,4

* https://sourceforge.net/p/jabref/code/ci/00d497fa5a7eaeabf5b559742181a7ffdc464678/

The key point to guarantee the "cohesiveness" of a commit is to identify whether the code changes bundled in a commit are related. Therefore, the problem reduces to the relevance identification of code changes, which is the aim of this paper. We hypothesize that the relevance of code changes can be characterized by the discriminative feature of software entities. The changes in the source code may occur at different levels: class, method, and attribute. Besides, after applying changes on a class, the involved software entities have 2 change types: either changed or new. Figure 2 shows a scene of related code changes. The method addContent() (calling method) in class Document invokes the method isRootElement() (method being called) in class Element. If the calling method is of changed type and the method being called is of new type, it can be concluded that the changed part in the calling method is induced by the method being called.
FIGURE 1 Example of commit consisting of unrelated code changes

FIGURE 2 Discriminative feature for related changes identification
The reason is that the method being called is a newly added one in the current release, which did not exist before. To invoke it, the calling method inevitably has to modify its code for the method invocation. From the perspective of classes, the changes in these 2 classes are related. This example illustrates that software entities' coupling relations and change types can be used as discriminative feature for change relevance identification.
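To make the scenario concrete, the following minimal Java sketch (hypothetical code written for illustration, not taken from JDOM's actual sources) shows the shape of such a pair of changes: an existing method is edited (changed type) so that it can call a method that was newly added (new type) in a structurally coupled class.

```java
// Hypothetical sketch of the scenario in Figure 2 (not JDOM's real implementation).
class Element {
    private Object parent;

    // NEW method added in the current release.
    boolean isRootElement() {
        return parent instanceof Document;
    }
}

// Document.addContent() is a *changed* entity: its body was edited so that it
// can invoke the newly added Element.isRootElement().
class Document {
    private final java.util.List<Element> content = new java.util.ArrayList<>();

    Document addContent(Element child) {
        if (child.isRootElement()) {               // changed part: call to the new method
            throw new IllegalArgumentException("The element is already a root element");
        }
        content.add(child);
        return this;
    }
}
```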
Concentrating on identifying code changes relevance, we present a method for detecting the relevance of code changes (DRCC) that uses supervised machine learning (SML) (eg, probabilistic neural network [PNN] and general regression neural network [GRNN]) to classify the relevance of code changes. Our method relies on SML, which has been used extensively in engineering classification problems. We hypothesize that changes relevance identification can be regarded as a binary classification problem. Thus, by establishing the most likely features for a given category (related or unrelated), an SML model to predict changes relevance can be learned. In particular, the coupling rules and cochanged types extracted from software entities are used as discriminative feature to characterize the relevance of code changes. Coupling rules at the granularities of class, attribute, and method capture the relevance of code changes from the structural coupling dimension, and cochanged type relationships are defined to capture the change type combinations of software entities that may cause related changes. Coupling rules combined with cochanged types construct the related changes vector (RCV), which is used as the input of machine learning algorithms.

In this paper, we want to explore (1) which machine learning technique is the most effective for related code changes detection, (2) whether the choice of feature granularities affects the accuracy of detection, and (3) how DRCC performs compared with an existing approach. With these 3 questions, we have evaluated our approach on 2 data sets: a manually collected data set and a public data set. The results show that using PNN with RCV72 as input obtains statistically significant improvements in accuracy over the other 5 machine learning techniques. Furthermore, our approach has been compared with the existing approach CLUSTERCHANGES,5 and the result indicates that our approach provides a better success rate when the number of change regions increases, although our approach spends a little more time than CLUSTERCHANGES. Our contribution is threefold: Firstly, we propose a discriminative feature-based approach to identify the relevance of code changes. Secondly, we conduct a comprehensive empirical evaluation of related changes detection. Thirdly, we design a user survey to prove that DRCC can help developers significantly improve the efficiency of code change reviews.

We significantly extend our previous work.6 In particular, we extend the instances of coupling rules from 16 to 21. We add and analyze data sets from another 2 software systems (JabRef and jEdit) for the related code changes detection; these results are not available in Huang et al.6 Also, we extend statistical tests to all the results achieved by different machine learning algorithms and RCVs with different dimensions. In addition, we investigate an additional research question (RQ3) in our empirical evaluation that compares our approach with CLUSTERCHANGES.5

Paper organization. Section 2 presents related works, while Section 3 describes the approach overview. The main process of construction of RCV is presented in Section 4. The setups and results of the case study are discussed in Sections 5 and 6, while Section 7 discusses threats to validity. Section 8 summarizes our approach and outlines directions of future work.

2 RELATED WORK

Our approach is closely related to the research on coupling metrics. The core mechanism in our approach for detecting related code changes is RCV, which is constructed by the combination of coupling rules and cochanged types. We discuss here the major approaches to coupling metrics. Then, we discuss the research on changes relevance identification.
Coupling metrics. Coupling refers to the degree of interdependence among the components of a software system,7 and developers always pursue an optimal balance between coupling and cohesion8 when modularizing the components of their systems. Coupling metrics are widely used in software engineering tasks, such as change propagation prediction,9,10 assessing the fault-proneness of classes,11,12 software dependency prediction,13 and software remodularization.14 Two types of coupling metrics have received significant attention in the literature, namely, dynamic15,16 and static.17–19 Dynamic metrics usually capture the dynamic coupling by observing the object interactions during program execution, while static metrics use different types of information such as structural, textual, and evolutionary to measure the degree of interaction between software entities. The basic idea underlying traditional static coupling metrics is very simple: count how many interactions there are between software entities in the system.20–22 The RCV proposed in our work uses structural coupling metrics to characterize related code changes. To make RCV more discriminatory and provide more information, we subdivided the structural coupling used in RCV into several types at different levels (eg, class, method, and attribute). Therefore, we mainly introduce the types of structural coupling classified by Briand et al7 and Ying et al.23

Briand et al7 defined structural couplings that capture 3 types of interactions between software entities, namely, Class-Attribute (CA), Class-Method, and Method-Method interactions. However, these couplings are categorized at a coarser-grained level. In contrast, Ying et al proposed 12 fine-grained "structural relationships" coming from the Java and C++ languages, and these structural relationships refer to structural couplings with 12 different types.23 By analyzing real source code in object-oriented projects, we extend the coupling types to 21 and classify them into several granularities (class, method, and attribute level) in our metrics. Furthermore, we redefined the coarser-grained types of interactions between software entities considering the common characteristics of coupling instances and finally classified them into 4 types: Class-to-Class (CR1), Method-to-Class (CR2), Method-to-Attribute (CR3), and Method-to-Method (CR4) interactions, where CR2 and CR4 correspond to Class-Method and Method-Method, respectively. Through further analysis on CA, we found that CA can be transformed into 1 of the 3 relationships CR2, CR3, and CR4.

Changes relevance identification. To the best of our knowledge, research on the problem of identifying the relevance of code changes was first proposed3 in 2013. Herzig and Zeller presented the earliest results of an empirical experiment on tangled commits in software development. The experiment shows that detecting related changes in source code is important and that software changes are tangled even in a version control system. They also proposed a heuristic-based algorithm to "untangle" changes based on code change information such as file distance and the call graph of a change. We noticed the importance of detecting code changes for program understanding and proposed6 a machine learning–based approach in 2014. In this paper, our approach uses 21 coupling rules and 4 cochanged type relationships to specify the static coupling relationship and change type combinations of 2 entities. Our result shows higher precision compared to Herzig and Zeller with different data sets in experiments.

Dias et al2 proposed EpiceaUntangler to help developers untangle code changes. They also use machine learning technology on code change information. This approach is different from ours in feature selection. Specifically, our approach fully uses static coupling and change type relationships in source code. No extra information, such as information related to testing or the version control system, is required. To facilitate developers' understanding of changes in the context of code reviews, Barnett et al5 proposed to relate separate regions of change within a changeset by using a static code analysis technique. The approach uses a single relationship, that between the use of a type, method, or field and its definition, to provide a useful decomposition. Their method is very effective when the change regions are small scale. As the number of change regions increases, most change regions tend to be related (called overrelated) under the definitions-and-uses mechanism. To avoid the overrelated results, our approach to identify the relevance of code changes is based on machine learning models, which are trained on empirical data collected from real cases of related and unrelated code changes in programs. We find that some change regions cannot be considered related even if they satisfy the definitions-and-uses mechanism. Moreover, besides the traditional coupling relationships in Barnett et al,5 we propose original cochanged type relationships to enhance the confidence of identifying related code changes.

Machine learning. There are 2 types of machine learning algorithms: supervised and unsupervised.24 In SML, a training set of precategorized vectors is used to build a mapping between the features and the categories. Then this mapping is used to predict the categories to which uncategorized vectors belong. On the other hand, unsupervised machine learning generates categories (eg, clusters) based on the latent structure (patterns, regularities, similarities, etc) of the features. In this paper, we use 6 supervised algorithms in the experiment section: k-nearest neighbor (kNN),25 Naive Bayes (NB),26 decision tree (DT),27 support vector machine (SVM),28 PNN,29 and GRNN.30

The k-nearest neighbor algorithm is a nonparametric method used for classification, which is a type of instance-based learning algorithm.25 In kNN, an object is classified by a majority vote of its neighbors, with the object being assigned to the category most common among its kNNs. Jaafar et al31 use kNN to group code changes, and Okutan et al32 use kNN to predict software defectiveness. Naive Bayes assumes that all the attributes are independent and that each contributes equally to the categorization.26 A category is assigned to an object by combining the contribution of each feature. This combination is achieved by estimating the posterior probabilities of each category by using Bayes theorem. Naive Bayes models have been used for text retrieval33 and classification34 and software refactoring.35 Decision tree uses a "divide and conquer" strategy to split the problem space into subsets.27 A DT is modeled like a tree in which the root and the nodes are questions, and the arcs between nodes are possible answers to the questions. The leaves of the tree are the categories. Decision tree models have been used for software defect prediction in Moser et al36 and Knab et al.37 Support vector machine splits the problem space into two possible sets by finding a hyperplane that maximizes the distance to the closest item of each subset.28 The function that splits the hyperplane is known as the kernel function. If the data are linearly separable, a linear kernel function is used with the SVM; otherwise, nonlinear functions such as polynomials, radial basis, and sigmoid should be used.
Support vector machine models have been used for bug prediction in Shivaji et al38 and Shivaji et al.39 Probabilistic neural network is a feedforward neural network, which introduces a radial basis function to measure the weight of the distance to each neighbor.29 In a PNN, the operations are organized into a multilayered feedforward network with 4 layers, namely, the input, pattern, summation, and decision layers. Ted et al40 use PNN to classify software modules. General regression neural network also has a 4-layer structure, but it has a radial basis layer and a special linear layer.30 General regression neural network is a memory-based feedforward neural network based on the approximate estimation of the probability density function from observed samples using Parzen-window estimation. Kanmani et al41 use GRNN for software quality prediction.

3 APPROACH OVERVIEW

In this paper, software entities include classes, methods, and attributes. The software entities involving code changes are called updated entities. If the code lines of an entity are changed, the entity is of changed type. If an entity is newly added, it is of new type. Therefore, an updated entity has only 1 change type, either changed or new.

Before developers commit their code changes to SVN, they need to identify the relevance between code changes. The code changes may occur in the same class or in different classes. Therefore, our approach needs to identify the changes relevance in 2 cases: either in different classes (case 1 in the left-hand side of Figure 3, where software entities with red background involve code changes) or in the same class (case 2 in the left-hand side of Figure 3). To facilitate extraction of the discriminative feature of code changes, the code change is organized as an entity-unit in this paper. To better understand the extraction of the discriminative feature of code changes occurring in the same class, we use 2 classes, A and A', to highlight the code changes occurring in different entities of a class (case 2 in the left-hand side of Figure 3, where A and A' are the same class). It is worth noting that we consider the distinct code changes occurring in the same method to be related.

Figure 3 shows the overview of our proposed approach, which takes a pair of classes involving code changes as input and reports the relevance of code changes. Firstly, 21 coupling features and 4 cochanged type features are combined to represent the discriminative feature. The coupling features refer to the structural coupling in the real programming paradigm, such as Inheritance (IH) and Implementing Interface (II). Cochanged type features mean the change types of a pair of coupling entities. Secondly, so as to quantify the discriminative feature, the coupling rules and cochanged types collaboratively construct the RCV. The related changes vector indicates what discriminative feature the updated entities satisfy. Finally, with a machine learning algorithm and the RCV, the relevance of code changes is identified as related and unrelated ones.

4 RELATED CHANGES VECTOR

In this work, we propose an RCV mechanism, based on coupling rules and cochanged types, to measure the relevance of code changes. We first introduce the concept of coupling rules and cochanged types and then introduce the definition of RCV.

The key idea behind our approach is to identify related changes via the discriminative feature derived from program entities. In most cases, the major factor behind the production of related changes is a change propagating from 1 entity to another entity. The change propagation mechanism9 shows that a code change tends to propagate from 1 entity to another entity when these 2 entities are structurally coupled. Therefore, structural coupling is a necessary condition for 2 code changes to be related. In this paper, we regard the structural coupling as one of the discriminative features used to judge the relevance of code changes.

4.1 Coupling rules

In the real programming paradigm, structural coupling is presented as a variety of coupling rules.23 To summarize the general types of structural coupling, it is necessary to explore what coupling rules the entities satisfy in practice. Thus, we go deep into the source code of program entities to analyze and collect the coupling rules and then classify them at different granularities.

Coupling rule 1. S is the set of updated classes in the program, S = {C1, C2, …, Cn}. The coupling rules CR are 2-tuple relations defined on S. For any classes Ci and Cj, Ci ∈ S, Cj ∈ S, if Ci and Cj establish a coupling relationship at the class level, they satisfy the Class-to-Class coupling rule, denoted by

CR1 = {Ci ⇒ Cj | Ci ∈ S, Cj ∈ S}.   (1)

"⇒" denotes the coupling relationship. In Java program syntax, IH and II42 are the most common cases that satisfy CR1. The first 2 rows in Table 1 describe these 2 instances.

A class is a set of attributes and methods. Namely, Ci = {Ai, Mi}, where Ai = {ai1, ai2, …, ait} and Mi = {mi1, mi2, …, mih} are the sets of attributes and methods of Ci, respectively. Similarly, Cj = {Aj, Mj}, where Aj = {aj1, aj2, …, aju} and Mj = {mj1, mj2, …, mjv} are the sets of attributes and methods of Cj, respectively.
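As an illustration of CR1, the following minimal sketch checks the two Class-to-Class instances (IH and II) between two classes using Java reflection. This is an assumption-laden simplification: DRCC works on the source code of updated classes rather than on loaded classes, and the classes used in the example are placeholders.

```java
import java.util.Arrays;

// Minimal sketch (not DRCC's actual implementation): checking the two
// Class-to-Class instances of CR1 between two classes via reflection.
public class Cr1Checker {

    // Inheritance (IH): Ci directly inherits from Cj.
    static boolean inherits(Class<?> ci, Class<?> cj) {
        return ci.getSuperclass() != null && ci.getSuperclass().equals(cj);
    }

    // Implementing Interface (II): Ci directly implements interface Cj.
    static boolean implementsInterface(Class<?> ci, Class<?> cj) {
        return cj.isInterface() && Arrays.asList(ci.getInterfaces()).contains(cj);
    }

    // Ci and Cj satisfy CR1 if either IH or II holds (for a given pair, the two are mutually exclusive).
    static boolean satisfiesCr1(Class<?> ci, Class<?> cj) {
        return inherits(ci, cj) || implementsInterface(ci, cj);
    }

    public static void main(String[] args) {
        // Placeholder example: java.util.ArrayList implements java.util.List, so II holds.
        System.out.println(satisfiesCr1(java.util.ArrayList.class, java.util.List.class));
    }
}
```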
FIGURE 3 Approach overview
TABLE 1 Common instances of structural coupling

Instances | Description | Abbrev | Coupling rules
Inheritance | Ci inherits from Cj | IH | Class-to-Class
Implementing Interface | Ci implements interface Cj | II | Class-to-Class
Type-Casting | Ci performs Type-Casting to Cj in mi* | TC | Method-to-Class
Instanceof | Ci performs Instanceof of Cj in mi* | IO | Method-to-Class
Return Type | Cj is return type of mj* | RT | Method-to-Class
Parameter Type | Cj is parameter type of mj* | PT | Method-to-Class
.class | Ci performs .class of Cj in mi* | DC | Method-to-Class
Exception Throws 1 | Cj is exception handler class, mi* throws Cj in the method definition | ET1 | Method-to-Class
Exception Throws 2 | Cj is exception handler class, mi* throws Cj in the method body | ET2 | Method-to-Class
Static Method Invoking | mi* invokes static method mj* of Cj | SMI | Method-to-Method
Static Attribute Invoking | mi* invokes static attribute aj* of Cj | SAI | Method-to-Attribute
Construction Method Invoking | mi* invokes construction method mj* of Cj | CMI | Method-to-Method
Method Member Attribute | mi* invokes attribute ai* | MMAUA | Method-to-Class
Method Member Attribute | mi* invokes attribute aj* of Cj | MMAIA | Method-to-Attribute
Method Member Attribute | mi* invokes method mj* of Cj | MMAIM | Method-to-Method
Class Member Attribute | mi* invokes attribute ai* | CMAUA | Method-to-Class
Class Member Attribute | mi* invokes attribute aj* of Cj | CMAIA | Method-to-Attribute
Class Member Attribute | mi* invokes method mj* of Cj | CMAIM | Method-to-Method
Function Parameter | mi* invokes parameter ai* | FPUP | Method-to-Class
Function Parameter | mi* invokes attribute aj* of Cj | FPIA | Method-to-Attribute
Function Parameter | mi* invokes method mj* of Cj | FPIM | Method-to-Method
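To make a few of these instances concrete, the hypothetical pair of classes below (invented for illustration, not drawn from the studied systems; the mapping to abbreviations follows our reading of Table 1) exhibits Type-Casting (TC), Static Method Invoking (SMI), a Method Member Attribute access (MMAIA), a Class Member Attribute access (CMAIA), and Construction Method Invoking (CMI).

```java
// Hypothetical classes illustrating several coupling instances from Table 1.
class Cj {
    static int counter = 0;                   // static attribute (target of SAI)
    int size;                                 // attribute (target of MMAIA/CMAIA)

    Cj(int size) { this.size = size; }        // construction method (target of CMI)

    static void resetCounter() { counter = 0; }   // static method (target of SMI)
}

class Ci {
    private Cj helper = new Cj(10);           // class member attribute defined with type Cj

    void update(Object o) {
        Cj casted = (Cj) o;                   // TC: mi* performs Type-Casting to Cj
                                              // ("casted" is a method member attribute of type Cj)
        Cj.resetCounter();                    // SMI: mi* invokes a static method of Cj
        int n = casted.size;                  // MMAIA: mi* invokes attribute aj* of Cj via a method member attribute
        helper.size = n + 1;                  // CMAIA: mi* invokes attribute aj* of Cj via a class member attribute
        helper = new Cj(helper.size);         // CMI: mi* invokes a construction method of Cj
    }
}
```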
There is another coupling relation between classes and methods.

Coupling rule 2. When the class Cj emerges in a method mi* of Ci and is neither defined as an attribute nor used via its static attributes and methods, the situation satisfies the Method-to-Class coupling rule:

CR2 = {Ci · mi∗ ⇒ Cj | mi∗ ∈ Mi}   (2)

The instances of CR2 are Type-Casting (TC), Instanceof (IO), Return Type (RT), Parameter Type (PT), Exception Throws (ET), and .class (DC), etc, where .class is Java syntax that obtains the class object of Cj via the form "classname.class". The more detailed description is shown in Table 1.

However, in most cases, finer-grained coupling rules occur at the method and attribute level. The finer-grained CRfg is defined as

CRfg = {Ci · ei ⇒ Cj · ej | ei ∈ (Ai ∪ Mi) ∧ ej ∈ (Aj ∪ Mj)},   (3)

where ei and ej are the attribute or method contained in Ci and Cj, respectively. ei ⇒ ej denotes that ei and ej couple together at the attribute or method level. According to the definition, there are 4 possible instances of the finer granularity CRfg, namely, Ci · ai∗ ⇒ Cj · aj∗, Ci · ai∗ ⇒ Cj · mj∗, Ci · mi∗ ⇒ Cj · aj∗, and Ci · mi∗ ⇒ Cj · mj∗.

In our investigation, the instances of Ci · ai∗ ⇒ Cj · aj∗ and Ci · ai∗ ⇒ Cj · mj∗ barely occur in real code, while Ci · mi∗ ⇒ Cj · aj∗ and Ci · mi∗ ⇒ Cj · mj∗ occur in most cases. Thus, we only consider the latter 2 in what follows.

Coupling rule 3. This rule builds a coupling relation between mi* contained in Ci and aj* contained in Cj, namely, a Method-to-Attribute rule:

CR3 = {Ci · mi∗ ⇒ Cj · aj∗ | mi∗ ∈ Mi ∧ aj∗ ∈ Aj}   (4)

The most representative instance for CR3 is Static Attribute Invoking (SAI).

Coupling rule 4. This rule builds a coupling relation between mi* contained in Ci and mj* contained in Cj, namely, a Method-to-Method rule:

CR4 = {Ci · mi∗ ⇒ Cj · mj∗ | mi∗ ∈ Mi ∧ mj∗ ∈ Mj}   (5)

CR4 builds the coupling relation at the method level, eg, Static Method Invoking and Construction Method Invoking. The more detailed description is shown in Table 1.

In addition to the above-mentioned instances for each coupling rule, there is a most common kind of instance in real code. This kind of instance uses Cj to define an attribute ai* contained in Ci, where ai* may be a Method Member Attribute, a Class Member Attribute, or a Function Parameter. Generally, the defined attribute ai* is then used in a method mi* of Ci, and there are 3 usages: (1) using ai* directly in a certain method mi* of Ci. Because ai* is an attribute declared with type Cj and ai* is used in a certain method mi* of Ci, we can regard the coupling relation between Ci and Cj as Method-to-Class. For example, the instances MMAUA, CMAUA, and FPUP satisfy this case, and the detailed description is shown in Table 1; (2) invoking the attribute aj* of Cj via ai* (because ai* is declared with type Cj), so Ci and Cj build a Method-to-Attribute coupling relation. As Table 1 shows, the instances MMAIA, CMAIA, and FPIA satisfy this case; (3) invoking the method mj* of Cj via ai*, so Ci and Cj build a Method-to-Method coupling relation. As Table 1 shows, the instances MMAIM, CMAIM, and FPIM satisfy this case.

We have classified the coupling rules into 4 types at the class, attribute, and method levels. Table 2 shows the 4 coupling rules and their corresponding formalization expressions and 21 instances.
TABLE 2 Four types of coupling rules

Name | Coupling rules | Formalization | Instances
CR1 | Class-to-Class | Ci ⇒ Cj | IH, II
CR2 | Method-to-Class | Ci · mi∗ ⇒ Cj | TC, IO, RT, PT, DC, ET1, ET2, MMAUA, CMAUA, FPUP
CR3 | Method-to-Attribute | Ci · mi∗ ⇒ Cj · aj∗ | SAI, MMAIA, CMAIA, FPIA
CR4 | Method-to-Method | Ci · mi∗ ⇒ Cj · mj∗ | SMI, CMI, FPIM, MMAIM, CMAIM
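One way to keep these definitions machine-checkable is to encode Table 2 directly in code. The sketch below is our own illustrative modeling, not part of DRCC as published; it simply maps each of the 21 instances to its coupling rule type.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative encoding of Table 2: the 4 coupling rule types and their 21 instances.
enum CouplingRule { CLASS_TO_CLASS, METHOD_TO_CLASS, METHOD_TO_ATTRIBUTE, METHOD_TO_METHOD }

enum CouplingInstance {
    IH(CouplingRule.CLASS_TO_CLASS), II(CouplingRule.CLASS_TO_CLASS),
    TC(CouplingRule.METHOD_TO_CLASS), IO(CouplingRule.METHOD_TO_CLASS),
    RT(CouplingRule.METHOD_TO_CLASS), PT(CouplingRule.METHOD_TO_CLASS),
    DC(CouplingRule.METHOD_TO_CLASS), ET1(CouplingRule.METHOD_TO_CLASS),
    ET2(CouplingRule.METHOD_TO_CLASS), MMAUA(CouplingRule.METHOD_TO_CLASS),
    CMAUA(CouplingRule.METHOD_TO_CLASS), FPUP(CouplingRule.METHOD_TO_CLASS),
    SAI(CouplingRule.METHOD_TO_ATTRIBUTE), MMAIA(CouplingRule.METHOD_TO_ATTRIBUTE),
    CMAIA(CouplingRule.METHOD_TO_ATTRIBUTE), FPIA(CouplingRule.METHOD_TO_ATTRIBUTE),
    SMI(CouplingRule.METHOD_TO_METHOD), CMI(CouplingRule.METHOD_TO_METHOD),
    MMAIM(CouplingRule.METHOD_TO_METHOD), CMAIM(CouplingRule.METHOD_TO_METHOD),
    FPIM(CouplingRule.METHOD_TO_METHOD);

    final CouplingRule rule;
    CouplingInstance(CouplingRule rule) { this.rule = rule; }

    // Convenience: all instances belonging to a given rule, eg, the 4 instances of CR3.
    static Set<CouplingInstance> of(CouplingRule rule) {
        Set<CouplingInstance> result = EnumSet.noneOf(CouplingInstance.class);
        for (CouplingInstance ci : values()) if (ci.rule == rule) result.add(ci);
        return result;
    }
}
```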
Note that, although this does not ensure that the coupling rules and instances in Table 2 represent all of the cases, they have already covered most of them.

4.2 Cochanged types

A new functionality in a program is usually implemented by adding new entities, by expanding the ability of old entities (changed), or by combining both. If 2 entities with the new change type are coupled, they are most likely to be involved in achieving the same functionality, and their changes are related. Another example of this is a changed type method coupling with a new type method, shown in Figure 2. From a statistical perspective, the change type combinations of 2 coupling entities can be regarded as another discriminative feature used to judge the relevance of code changes.

For a single updated entity ei, there are 2 change types, changed and new. Let CT(ei) be the change type of ei. For 2 coupling entities ei and ej, there are at most 4 change type combinations (eg, 1 of the entities is changed and the other is new). We call the change types of coupling entities cochanged types. Let Co-CT(ei ⇒ ej) represent the cochanged type of entities ei and ej. The possible values of CT(ei) and Co-CT(ei ⇒ ej) are

CT(ei) ∈ { <changed>, <new> }

Co-CT(ei ⇒ ej) ∈ { <changed, changed>, <new, changed>, <changed, new>, <new, new> }

4.3 Construction of RCV

So as to quantify the discriminative feature of code changes in Ci and Cj, the coupling rules and cochanged types collaboratively construct the RCV. The related changes vector represents the relevance vector of code changes. The definition is

RCV(Ci, Cj) = | 𝜂1, 𝜂2, 𝜂3, 𝜂4, …, 𝜂m |   (6)

In formula 6, for any k, 1 ⩽ k ⩽ m, 𝜂k = 0 or 1. '1' represents that the software entities involving code changes satisfy the corresponding coupling rule instance along with the cochanged type, and '0' means no such relation. In real source code, updated entities may satisfy 1 or several coupling rule instances listed in Table 1, and each instance has 4 possible cochanged types except SAI, MMAIA, CMAIA, and FPIA. These 4 instances build a Method-to-Attribute coupling relation. A method has 2 change types, changed and new. For an attribute, we only consider the new change type, because, in our survey, changes applied to an attribute rarely occur. As a result, each of the 4 instances has 2 possible cochanged types (<changed, new> and <new, new>), and they produce 8 cochanged types in total. Besides, there are a few mutually exclusive situations that cannot occur simultaneously. For example, 2 classes can only satisfy 1 of the instances of CR1, either IH or II, so we combine these 2 instances into 1, which produces 4 cochanged types. Adding the 60 cochanged types produced by the remaining 15 instances, there are 72 possible cochanged types in total; thus, the dimension of RCV is 72. Namely, m = 72, and each 𝜂k in RCV(Ci, Cj) maps to the cochanged type of a certain instance.
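Under these definitions, constructing the RCV amounts to setting one bit per observed (coupling instance, cochanged type) pair. The sketch below shows this bookkeeping under simplifying assumptions of our own (a fixed, arbitrary dimension ordering, and a pre-extracted list of observations); it reuses the illustrative CouplingInstance enum from the previous sketch and is not DRCC's published implementation, which additionally merges IH/II and restricts the Method-to-Attribute instances to 2 cochanged types to obtain 72 dimensions.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative construction of a binary RCV from observed
// (coupling instance, cochanged type) pairs for a pair of updated classes.
public class RcvBuilder {

    enum CochangedType { CHANGED_CHANGED, NEW_CHANGED, CHANGED_NEW, NEW_NEW }

    record Observation(CouplingInstance instance, CochangedType type) {}

    // Assumption: dimensions are enumerated in a fixed order over all
    // instance/cochanged-type combinations.
    private final Map<String, Integer> dimensionIndex = new LinkedHashMap<>();

    RcvBuilder() {
        int index = 0;
        for (CouplingInstance ci : CouplingInstance.values()) {
            for (CochangedType ct : CochangedType.values()) {
                dimensionIndex.put(ci + "#" + ct, index++);
            }
        }
    }

    int[] build(List<Observation> observations) {
        int[] rcv = new int[dimensionIndex.size()];
        for (Observation o : observations) {
            Integer k = dimensionIndex.get(o.instance() + "#" + o.type());
            if (k != null) rcv[k] = 1;        // set the matching dimension to 1
        }
        return rcv;
    }
}
```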
5 CASE STUDY DESIGN

To have a meaningful evaluation, we outline 6 different machine learning algorithms (namely, PNN, GRNN, kNN, NB, DT, and SVM) for relevance identification of code changes and consider 3 types of RCV with different granularities (RCV72, RCV14, and RCV21). Besides, we empirically investigate how DRCC performs compared with an existing approach.

5.1 Related changes vector with different dimensions

RCV72 is based on the 21 instances of coupling rules and the 4 cochanged types. To evaluate whether fine-grained dimensions are necessary for related changes detection, we construct RCVs at coarse-grained dimensions and take these RCVs as input for the machine learning algorithms. Specifically, RCVs at more coarse-grained dimensions are constructed by combining related finer-grained dimensions. We construct 2 RCVs by combining the dimensions of different instances of a coupling rule and of different cochanged types of the same coupling rule, which are called RCV14 and RCV21, respectively. RCV14 is constructed at the coupling rules level instead of at the instances level. Specifically, RCV14 is constructed considering the 4 coupling rules and their corresponding 4 cochanged types. Because the coupling rule CR3 belongs to the Method-to-Attribute relationship and the attribute only has the new type, CR3 has 2 possible cochanged types. The rest of the coupling rules (CR1, CR2, and CR4, each of which has 4 cochanged types) have 12 dimensions in total. Therefore, RCV14 has 14 dimensions in total. Row 2 of Table 3 shows RCV14. RCV21 is constructed from the 21 instances of coupling rules, regardless of the cochanged types. Row 3 of Table 3 shows RCV21. If the accuracy of the detection results is affected by the choice of RCV, it means that the granularity of the coupling rule affects the detection accuracy.

TABLE 3 Related changes vector with different dimensions

RCV | Instances of coupling rules | Coupling rules | Cochanged types
RCV72 | IH, TC, RT, PT, DC, SAI, SMI, ET1, ET2, MMAUA, CMAUA, FPUP, MMAIA, CMAIA, FPIA, CMI, FPIM, MMAIM, CMAIM, IO | -- | <changed, changed>, <new, changed>, <changed, new>, <new, new>
RCV14 | -- | CR1, CR2, CR3, CR4 | <changed, changed>, <new, changed>, <changed, new>, <new, new>
RCV21 | IH, II, TC, RT, PT, DC, SAI, SMI, ET1, ET2, MMAUA, CMAUA, FPUP, MMAIA, CMAIA, FPIA, CMI, FPIM, MMAIM, CMAIM, IO | -- | --

5.2 Research questions

This section presents the empirical study for DRCC, with the purpose of analyzing its capability to identify related code changes. The quality focus is the precision, recall, F-measure, and false positive rate of the detection of related code changes with respect to the original ones.

• RQ1: Which machine learning technique is the most effective for related code changes detection?
• RQ2: Does the choice of RCV72, RCV14, and RCV21 affect the accuracy of detection?
• RQ3: How does DRCC perform compared with other approaches in detecting related changes?

For RQ1, we compare different machine learning approaches on large project sets and obtain quantitative measurements of how well they detect related code changes. Our objective is to choose the best classifier from numerous machine learning techniques, such as PNN, GRNN, kNN, NB, DT, and SVM. We also want to know how different granularities of features affect the accuracy of the machine learning algorithms. Specifically, we extract three granularities of features used in the case study, and in RQ2, we want to study which granularity of features can obtain better accuracy. Similarly, the purpose of RQ3 is to compare our approach with an existing approach, and we are interested in the comparison of DRCC and the other approaches.

5.3 Data sets and evaluation methodology

5.3.1 Data sets

When a development task (eg, bug fixing) involves multiple updated classes, there theoretically exist related changes between these updated classes. Fortunately, Dit et al43 have proposed the gold set mechanism, which is used to record the classes involved in a same development task. In the case study, gold sets are used to validate whether DRCC can detect the related code changes. Two public data sets (JabRef† and jEdit‡) provided by Dit et al43 as well as another 3 manually collected data sets (Jdom§, gwt-log¶, and XStream‖) are used in our case study. The process of manual collection of gold sets can be summarized as follows: firstly, we filter out the commits whose comments contain an IssueID. For example, the comment of commit #13321 is "fix for bug #732," where #732 is an IssueID. The issue #732 is mapped to commit #123321. Secondly, we manually verify each issue-commit mapping to ensure the correctness of the data and to discard commits that contain numbers that do not represent IssueIDs (eg, "Eliminated a small code duplication found in r10817"). At last, for each issue-commit mapping, we analyze its associated source code files (eg, classes), and the classes in the current commit form a gold set.

† http://jabref.sourceforge.net/ ‡ http://www.jedit.org/ § http://www.jdom.org ¶ https://code.google.com/p/gwt-log/ ‖ http://xstream.codehaus.org

Table 4 summarizes the project names, RQ, releases, and the number of gold sets. The RQ column represents the research questions (RQ) that each data set helps to answer. To better validate our approach, we select gold sets including at least 2 unique classes (for example, we select 13 from 39 gold sets for JabRef, and any one of the 13 gold sets includes at least 2 unique classes).

5.3.2 Evaluation methodology

Our evaluation strategy is to use the gold sets to determine whether DRCC correctly identifies the related code changes. The changes in different classes can be related when they address the same issue (eg, bug fixing), and a gold set exactly collects the classes addressing an issue. However, when the number of classes contained in a gold set is greater than 2, it cannot be guaranteed that the changes of any pair of classes in this gold set are related. For example, classes C1, C2, and C3 are in a same gold set. The change of class C1 is related with the change of class C2, and the change of class C3 is also related with the change of C2, but the changes of C1 and C3 are uncorrelated. Because these 3 classes fix a bug together, they are in the same gold set. We expect that DRCC can detect the truly related changes. The evaluation methodology can be summarized in the following steps:

1. For a project, if the number of classes contained in a gold set is greater than 2, we manually identify the relevance of changes in each pair of classes to obtain the truly related changes, and the truly related changes are used as an oracle in the training phase and validation phase.
2. For each pair of updated classes in a same gold set, our algorithm generates the corresponding RCV. According to the oracle, if the changes of a pair of classes are truly related, we use '1' to label this RCV; otherwise, it is labeled as '0'. All gold sets generate 1143 RCVs in total.
3. In the training process, DRCC uses a 5-fold cross validation, in which the RCVs of a specific project are randomly broken into 5 sections. One section (about 228 RCVs) is used to test the machine learning algorithms, which are trained on the other four-fifths (about 915 RCVs). There are 5 iterations, and each section is used as the testing set once.
TABLE 4 Projects used in case study

Source | RQ | Project | Release | # of selected gold sets
Manual Collection | RQ 1,2 | Jdom | b7 | 5
Manual Collection | RQ 1,2 | Jdom | b8 | 4
Manual Collection | RQ 1,2 | gwt-log | 2.5 | 4
Manual Collection | RQ 1,2 | gwt-log | 3.1 | 6
Manual Collection | RQ 1,2 | XStream | 1.1 | 7
Manual Collection | RQ 1,2 | XStream | 1.2 | 9
Manual Collection | RQ 1,2 | XStream | 1.3 | 7
Manual Collection | RQ 1,2 | XStream | 1.4 | 8
Public Data set | RQ 1,2,3 | JabRef | 2.6 | 13
Public Data set | RQ 1,2,3 | jEdit | 4.3 | 52
4. Finally, DRCC takes the test set as input to predict the relevance of changes. For the related changes judged by DRCC, we validate whether these changes are truly related according to the oracle.

To establish a valid oracle, we manually identify the truly related changes in each gold set. If the number of classes contained in the current gold set equals 2, the changes of these 2 classes are truly related changes; if the number of classes contained in a gold set is greater than 2, we manually investigate the source code of each pair of classes to judge whether the change in one class is induced by the one in the other class. Because the judgement accuracy is related to the programming experience of the participants, we tried to avoid bias during the data collection phase. For example, we first invite 3 graduate students to manually identify truly related changes, and another 3 graduate students validate the produced results, to verify that all the truly related changes identified by the first 3 students are correct; all the participants major in computer science and are not aware of the experimental goals or the way that DRCC identifies related code changes. The final results show that the ratio of truly related changes vs all the changes forming the gold sets is about 38.4%.
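The labeling and cross-validation steps above can be summarized in a short sketch. The code below is a simplified illustration under our own assumptions (RCVs are plain int arrays and the fold assignment is a uniform random shuffle); it is not the exact experimental harness used in the paper.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified 5-fold split of labeled RCVs (label 1 = truly related, 0 = unrelated).
public class FiveFoldSplit {

    record LabeledRcv(int[] rcv, int label) {}

    static List<List<LabeledRcv>> split(List<LabeledRcv> all, long seed) {
        List<LabeledRcv> shuffled = new ArrayList<>(all);
        Collections.shuffle(shuffled, new java.util.Random(seed));

        List<List<LabeledRcv>> folds = new ArrayList<>();
        for (int f = 0; f < 5; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < shuffled.size(); i++) {
            folds.get(i % 5).add(shuffled.get(i));     // round-robin assignment into 5 sections
        }
        return folds;
    }

    // In each of the 5 iterations, one fold is the test set and the other 4 form the training set.
    static List<LabeledRcv> trainingSet(List<List<LabeledRcv>> folds, int testFold) {
        List<LabeledRcv> training = new ArrayList<>();
        for (int f = 0; f < 5; f++) if (f != testFold) training.addAll(folds.get(f));
        return training;
    }
}
```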
5.4 Evaluation metrics and statistical analyses

5.4.1 Evaluation metrics

To evaluate the measures used in the case study, the related changes judged by DRCC are compared against the truly related changes. Average precision (PRE), recall (REC), F-measure (FM), and false positive rate (FPR) for each release are computed:

PRE = |correct ∩ detected| / |detected| %   (7)

REC = |correct ∩ detected| / |correct| %   (8)

FM = 2 ∗ (Precision ∗ Recall) / (Precision + Recall) %   (9)

FPR = |FP| / |TN + FP| %   (10)

where correct represents the set of truly related changes and detected is the set of related changes judged by DRCC. Precision is the percentage of related changes identified by DRCC that are truly related changes according to the oracle. Recall is the percentage of the truly related changes that are successfully retrieved by DRCC. The F-measure is a weighted harmonic mean of precision and recall and can be used as a comprehensive indicator of combined precision and recall values. FP is the number of false positives, namely, the related changes identified by DRCC that are not truly related. TN is the number of true negatives, namely, changes that are correctly identified as not related. FPR measures the proportion of false positives over the total number of negative instances.

5.4.2 Testing statistical significance

The goal of our RQs is to compare the PRE, REC, FM, and FPR achieved by different machine learning algorithms. According to the work on evaluating machine learning algorithms in Demšar,44 we use the Friedman test with Nemenyi's post hoc procedure to establish statistical significance. The Friedman test firstly compares the performance of k machine learning algorithms over m data sets; if the null hypothesis is rejected using the Friedman test, then the Nemenyi test is used as a post hoc procedure to compare pairs of machine learning algorithms. Besides, to control the family-wise error, Bonferroni correction is introduced.44 The ith null hypothesis used for the Nemenyi test is

H_0^i: There is no statistically significant difference between the PRE (or REC, or FM, or FPR) identified by the nth and (n + 1)th machine learning algorithms.

Moreover, a p̂ nonparametric effect size estimator45 is introduced to represent the probability that a value randomly drawn from 1 sample will be greater than a value randomly drawn from the other sample. The estimator p̂ is defined as

p̂_{a,b} = U / (n_a n_b),   (11)

where U is the Mann-Whitney statistic and n_a n_b is the product of the 2 sample sizes. Therefore, for RQ1 and RQ2, we use the Friedman test, and for all the research questions, we use the statistic p̂ as the effect size estimator.
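For reference, the four metrics can be computed directly from the confusion counts of one test fold. The following sketch is a direct restatement of formulas 7 to 10 in code; the variable names are ours.

```java
// Direct computation of PRE, REC, FM, and FPR (formulas 7-10) from confusion counts.
public class Metrics {
    final int truePositives;    // related pairs correctly identified as related
    final int falsePositives;   // unrelated pairs wrongly identified as related
    final int trueNegatives;    // unrelated pairs correctly identified as unrelated
    final int falseNegatives;   // related pairs wrongly identified as unrelated

    Metrics(int tp, int fp, int tn, int fn) {
        this.truePositives = tp; this.falsePositives = fp;
        this.trueNegatives = tn; this.falseNegatives = fn;
    }

    double precision() {                       // PRE = |correct ∩ detected| / |detected|
        int detected = truePositives + falsePositives;
        return detected == 0 ? 0.0 : 100.0 * truePositives / detected;
    }

    double recall() {                          // REC = |correct ∩ detected| / |correct|
        int correct = truePositives + falseNegatives;
        return correct == 0 ? 0.0 : 100.0 * truePositives / correct;
    }

    double fMeasure() {                        // FM = 2 * P * R / (P + R)
        double p = precision(), r = recall();
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    double falsePositiveRate() {               // FPR = FP / (TN + FP)
        int negatives = trueNegatives + falsePositives;
        return negatives == 0 ? 0.0 : 100.0 * falsePositives / negatives;
    }
}
```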
6 CASE STUDY RESULTS

6.1 RQ1: machine learning algorithms

Figure 4 shows a statistical summary of the PRE, REC, FM, and FPR for the detection results using each algorithm on the data sets of the 5 projects. Each boxplot represents the distribution of a metric for 1 algorithm, on the 3 RCVs with different dimension levels, with 5-fold cross validation. We observe that the medians of PNN and GRNN outperform the other algorithms. We apply the Friedman test to test the statistical significance of the difference in the results of each machine learning algorithm. When testing the 4 metrics (PRE, REC, FM, and FPR) at a 5% confidence level, we find that the P values (2-tailed) for the comparison of the four metrics of GRNN vs PNN are greater than the Bonferroni corrected significance level (0.0033). These values suggest that we accept the null hypotheses (H_0^1, H_0^6, H_0^11, and H_0^16) stating that there is no significant difference between the PRE, REC, FM, and FPR of PNN and GRNN. Moreover, when testing the statistical significance of the difference between the results of PNN and the rest of the machine learning algorithms, we find that the P values are less than 0.0001. Therefore, we reject the null hypothesis that there is no significant difference between the PRE, REC, FM, and FPR of PNN and the rest of the machine learning algorithms.

Table 5 lists the null hypotheses (H_0^1 to H_0^20) used for the Nemenyi post hoc test on the difference between specific pairs of machine learning algorithms. The P values and the effect sizes are also listed in Table 5. For PRE, REC, and FM, we find that the effect sizes reported for PNN versus the other algorithms (kNN, NB, DT, SVM) are higher than 0.84. For FPR, the effect size is small, with a probability that is lower than 0.44. This means that there are significant differences between the performance of PNN compared with kNN, NB, DT, and SVM. Therefore, we reject all the null hypotheses (except for H_0^1, H_0^6, H_0^11, and H_0^16), meaning that the mean PRE, REC, and FM given by PNN are statistically significantly higher than the results from kNN, NB, DT, and SVM, while the FPR given by PNN is statistically significantly lower than the results from kNN, NB, DT, and SVM.

In principle, PNN and GRNN are both radial basis networks suitable for classification problems. They have an adjustable parameter, spread, which is the spread of the radial basis functions (default = 1.0).
FIGURE 4 Performance metrics for each algorithm over a combination of RCV14 , RCV21 , and RCV72 . The red solid line is the median. The circle is the singular value and the box is the Interquartile Range (IQR). The thin line extends from Q1-1.5IQR to Q3+1.5IQR
TABLE 5 Null hypotheses and P values for RQ1 using the Nemenyi test and nonparametric effect sizes p̂

Comparison | PRE | REC | FM | FPR
GRNN vs PNN | H_0^1, P = .0644, p̂ = 0.5833 | H_0^6, P = .0250, p̂ = 0.5992 | H_0^11, P = .0061, p̂ = 0.6243 | H_0^16, P = .2713, p̂ = 0.5387
kNN vs PNN | H_0^2, P < .0001, p̂ = 0.9051 | H_0^7, P < .0001, p̂ = 0.8541 | H_0^12, P < .0001, p̂ = 0.8501 | H_0^17, P < .0001, p̂ = 0.3618
NB vs PNN | H_0^3, P < .0001, p̂ = 0.9354 | H_0^8, P < .0001, p̂ = 0.8795 | H_0^13, P < .0001, p̂ = 0.8731 | H_0^18, P < .0001, p̂ = 0.4361
DT vs PNN | H_0^4, P < .0001, p̂ = 0.9221 | H_0^9, P < .0001, p̂ = 0.8702 | H_0^14, P < .0001, p̂ = 0.9191 | H_0^19, P < .0001, p̂ = 0.4382
SVM vs PNN | H_0^5, P < .0001, p̂ = 0.9259 | H_0^10, P < .0001, p̂ = 0.8696 | H_0^15, P < .0001, p̂ = 0.8465 | H_0^20, P < .0001, p̂ = 0.4276

Abbreviations: PRE, average precision; REC, recall; FM, F-measure; FPR, false positive rate; GRNN, general regression neural network; PNN, probabilistic neural network. Each cell lists a null hypothesis that formulates that there is no significant difference between the values of a metric achieved with PNN and the values of the same metric achieved with another machine learning algorithm. For example, H_0^1 formulates that there is no significant difference between the precision of GRNN and PNN. The Bonferroni corrected significance level (for the Nemenyi test) is 0.0033.
TABLE 6 Null hypotheses and P values for RQ2 using the Nemenyi test and nonparametric effect sizes p̂

Comparison | PRE | REC | FM | FPR
RCV14 vs RCV72 | H_0^21, P = .0032, p̂ = 0.6392 | H_0^24, P = .9319, p̂ = 0.5057 | H_0^27, P = .0047, p̂ = 0.6295 | H_0^30, P = .3873, p̂ = 0.4678
RCV21 vs RCV72 | H_0^22, P < .0001, p̂ = 0.8358 | H_0^25, P < .0001, p̂ = 0.8244 | H_0^28, P < .0001, p̂ = 0.8822 | H_0^31, P = .0113, p̂ = 0.3977
RCV21 vs RCV14 | H_0^23, P < .0001, p̂ = 0.7963 | H_0^26, P < .0001, p̂ = 0.8064 | H_0^29, P < .0001, p̂ = 0.8403 | H_0^32, P = .0081, p̂ = 0.4396

Abbreviation: RCV, related changes vector. Each cell lists a null hypothesis. For example, H_0^21 formulates that there is no significant difference between the precisions of RCV14 and RCV72. The Bonferroni corrected significance level (for the Nemenyi test) was 0.0167.
FIGURE 5 Performance metrics for RCV72, RCV14, and RCV21 over 6 algorithms in all data sets
If spread is near zero, the network acts as a nearest neighbor classifier. As spread becomes larger, the network takes into account several nearby design vectors. Coincidentally, PNN and GRNN both achieve the best accuracy when the spread value is set to 0.4 in our project. With deeper analysis, we discover that these 2 classifiers are powerful memory-based networks that are able to deal with sparse data effectively. The RCV obtained is rather sparse. This is one of the reasons that PNN and GRNN can achieve higher accuracy compared with the other machine learning algorithms.

Therefore, we answer RQ1 by concluding that PNN and GRNN are the most effective machine learning algorithms for related code changes detection in our evaluation.
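To show where the spread parameter enters, the sketch below implements the basic PNN decision rule (a Gaussian Parzen-window sum per category) over binary RCVs. It is a minimal illustration written by us, not the network configuration used in the experiments.

```java
import java.util.List;

// Minimal PNN-style classifier: for each category, sum Gaussian kernels centered on
// the training vectors of that category and pick the category with the larger sum.
public class SimplePnn {
    private final List<int[]> relatedTraining;    // RCVs labeled related (1)
    private final List<int[]> unrelatedTraining;  // RCVs labeled unrelated (0)
    private final double spread;                  // spread of the radial basis functions

    SimplePnn(List<int[]> related, List<int[]> unrelated, double spread) {
        this.relatedTraining = related;
        this.unrelatedTraining = unrelated;
        this.spread = spread;                     // eg, 0.4 as in the paper's setting
    }

    private double kernelSum(List<int[]> patterns, int[] x) {
        double sum = 0.0;
        for (int[] p : patterns) {
            double squaredDistance = 0.0;
            for (int k = 0; k < x.length; k++) {
                double d = x[k] - p[k];
                squaredDistance += d * d;
            }
            sum += Math.exp(-squaredDistance / (2.0 * spread * spread));
        }
        // Average so that categories with more training samples are not favored.
        return patterns.isEmpty() ? 0.0 : sum / patterns.size();
    }

    // Returns 1 if the RCV is classified as related, 0 otherwise.
    int classify(int[] rcv) {
        return kernelSum(relatedTraining, rcv) >= kernelSum(unrelatedTraining, rcv) ? 1 : 0;
    }
}
```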
6.2 RQ2: performance of RCV72, RCV14, and RCV21

To evaluate whether RCVs with fine-grained dimensions are necessary for related code changes detection, we construct 3 RCVs (RCV72, RCV14, and RCV21) at different granularities, use these RCVs as input for the machine learning algorithms, and then compare the performance of related code changes detection. Each RCV is evaluated using the 6 machine learning algorithms (PNN, GRNN, kNN, DT, NB, and SVM) to minimize a threat to validity faced when using only 1 algorithm. At last, we group the results (eg, PRE, REC, FM, and FPR) for each RCV from the 6 machine learning algorithms.

A statistical summary of the results is presented in Figure 5. Each boxplot corresponds to an RCV. We observe that the average PRE for RCV72, RCV14, and RCV21 is 48.09%, 42.61%, and 25.73%, respectively. RCV72 is the RCV with the highest precision, and RCV21 is the lowest one. The average value of PRE for RCV72 presents a roughly 5.48% improvement over RCV14, and the average value of PRE for RCV14 presents a roughly 16.88% improvement over RCV21. The average values of REC for RCV72, RCV14, and RCV21 are 47.82%, 46.64%, and 30.79%, respectively. RCV72 presents a roughly 1.18% improvement over RCV14, and RCV72 presents a roughly 17.03% improvement over RCV21. The average values of FM for RCV72, RCV14, and RCV21 are 46.57%, 41.89%, and 24.26%, respectively. RCV72 provides a 4.68% improvement over RCV14 and a 22.31% accuracy improvement over RCV21. The average values of FPR for RCV72, RCV14, and RCV21 are 7.34%, 8.95%, and 9.06%, respectively. The FPR of RCV72 is the lowest, and the FPR of RCV21 is the highest.

Therefore, there is a difference between the performance of RCV72, RCV14, and RCV21. The Friedman test shows that the difference in averages is statistically significant. When testing the 4 metrics at a 5% confidence level, we find that the P values (2-tailed) are less than the corrected significance level (0.0167) except for H_0^24 and H_0^30. These values suggest that we reject the null hypotheses stating that there is no significant difference between the PRE, REC, FM, and FPR for RCV72, RCV14, and RCV21. Table 6 lists the null hypotheses we use in the Nemenyi post hoc test. We find that the P values for the comparison of PRE and FM of RCV21 vs RCV72, RCV14 vs RCV72, and RCV21 vs RCV14 are lower than the Bonferroni corrected significance level (0.0167). The lowest value of the effect sizes is in RCV21 vs RCV72, and we find that the effect sizes reported for RCV72 versus the other 2 RCVs are higher than 0.88. Therefore, all the hypotheses are rejected except for H_0^24 and H_0^30. The P values for H_0^24 and H_0^30 are greater than the Bonferroni corrected significance level (0.0167), which suggests that we accept the null hypotheses stating that there is no significant difference between the REC and FPR of RCV14 vs RCV72.

Although there is no significant difference between the REC and FPR of RCV14 vs RCV72, RCV72 is more effective because its REC is slightly higher and its FPR is lower. Moreover, RCV14 is more effective than RCV21 for the detection of related code changes.
In general, the results show that RCV72 outperforms RCV14 and RCV14 outperforms RCV21. This is not surprising, because fine-grained coupling rules provide more concrete modeling of the relationship between changed entities. Therefore, it is necessary to define fine-grained coupling rules for the detection of related code changes.

In addition, to judge whether every dimension is necessary in RCV72, we conduct a quantitative analysis. The information gain is used to measure the importance of every dimension. Information gain is originally used to decide the ordering of attributes in the nodes of a DT.27 Information gain tells us how important a given attribute of the feature vectors is. If an information gain is positive, it indicates that the corresponding dimension has a positive effect on distinguishing the data samples; if an information gain is negative, it indicates that the corresponding dimension has a negative effect on distinguishing the data samples, and we need to remove it from the vectors.

FIGURE 6 Information gain

As Figure 6 shows, the highest information gain value is for the coupling instance MMAUA with cochanged type <changed, new> (corresponding to feature 44 in Figure 6). Moreover, IH with <new, new> (feature 4), PT with <new, new> (feature 20), and Construction Method Invoking with <new, new> (feature 42) have relatively high information gain values. In general, the most important dimensions are broadly in line with what we expected, especially the cochanged types mechanism. For these most important dimensions, the cochanged types are either <changed, new> or <new, new>. We have mentioned that changed software entities with these 2 cochanged types are likely to have related changed code. There is no negative information gain value among the 72 features, while there are 9 features whose information gain values are 0. These features are IO with <new, changed> (feature 11), DC with <new, changed> (feature 23), Static Method Invoking with <changed, new> (feature 34), etc. The reason is that there is no case satisfying these features in all the RCVs generated from the 5 projects in the experiment. However, we cannot regard these 9 features as useless, since the RCVs generated from other projects may satisfy these features, so we retain them in this paper.
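Information gain for a single binary dimension of RCV72 can be computed as the reduction in label entropy after splitting on that dimension. The helper below is a generic sketch of that computation (our own code, with labels and features as plain arrays), not the exact tooling used to produce Figure 6.

```java
// Information gain of one binary feature (an RCV dimension) with respect to the
// binary label (related = 1, unrelated = 0): IG = H(label) - H(label | feature).
public class InformationGain {

    private static double entropy(int positives, int total) {
        if (total == 0) return 0.0;
        double p = (double) positives / total;
        double h = 0.0;
        if (p > 0) h -= p * (Math.log(p) / Math.log(2));
        if (p < 1) h -= (1 - p) * (Math.log(1 - p) / Math.log(2));
        return h;
    }

    static double of(int[] feature, int[] label) {
        int n = label.length;
        int positives = 0, onCount = 0, onPositives = 0, offPositives = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] == 1) positives++;
            if (feature[i] == 1) {
                onCount++;
                if (label[i] == 1) onPositives++;
            } else if (label[i] == 1) {
                offPositives++;
            }
        }
        int offCount = n - onCount;
        double conditional = (onCount / (double) n) * entropy(onPositives, onCount)
                           + (offCount / (double) n) * entropy(offPositives, offCount);
        return entropy(positives, n) - conditional;   // higher means more discriminative
    }
}
```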
6.3 RQ3: comparing DRCC with CLUSTERCHANGES

In this section, we compare DRCC with the existing approach CLUSTERCHANGES (short for CLUS).5 CLUS conducted its experiment on nonpublic datasets with a specific platform (eg, C#) and environment (eg, Microsoft). It is difficult to conduct the comparison experiment on the same dataset and experimental environment. We reimplemented CLUS in Java according to the mechanism in Barnett et al5 and conducted the comparison experiment on the commits of jEdit (https://sourceforge.net/p/jedit/svn/commit_browser) and JabRef (https://sourceforge.net/p/jabref/code/commit_browser). We firstly select hundreds of continuous commits from jEdit and JabRef and then omit the commits involving no code changes (eg, a commit that only involves configuration file changes). If a commit is a tangled one,2 we manually split it into several self-contained (or cohesive) commits. After that, all the selected commits are atomic and form a commit sequence according to release time. Then, the commit sequence is used to evaluate DRCC and CLUS. We firstly combine any 2 adjacent commits in the sequence and then apply DRCC and CLUS to decompose the combined commits, respectively. We want to explore whether DRCC and CLUS can decompose the combined commits into the original ones (see the left-hand side in Figure 7) and focus on the success rate of decomposition of DRCC and CLUS. As a comparison, we also combine 4 or more adjacent commits each time and then apply DRCC and CLUS to decompose them again. In the whole process, DRCC uses PNN as the classifier and uses RCV72 as input vectors, and we extract 400 self-contained commits from jEdit and JabRef, respectively.

A commit in the sequence contains 1 or more involved classes. Therefore, when decomposing a combined commit, DRCC identifies the change relevance of each pair of classes in the combined commit and regards the classes involving related code changes as coming from the same commit. Specifically, because all code changes in a class are organized as entity-units (eg, methods or attributes), DRCC actually identifies the change relevance of any pair of changed methods (or attributes) in two classes. If a pair of changed methods (or attributes) is identified as related, DRCC regards these 2 classes' changes as related. Finally, the classes involving related code changes are clustered to form a decomposed commit.

To measure the success rate of decomposition of DRCC and CLUS, we need to know the ratio between the number of successfully decomposed classes and the total number of classes. To know whether a class has been successfully decomposed, we must find which decomposed commit best matches which original commit. Our evaluation method is inspired by Dias et al.2 The matrix on the right-hand side of Figure 7 shows a sample comparison between decomposed commits and original commits. The matrix represents the Jaccard indexes computed for each pair of decomposed and original commits. This index is defined by using the following formula:

J_{i,j} = |decomp_i ∩ commit_j| / |decomp_i ∪ commit_j|   (12)

From the resulting matrix, we want to know the matching relations between decomposed commits and original commits. This can be obtained by maximizing the sum of the Jaccard indexes over all permutations. For the sample in Figure 7, the maximum sum over all the permutations (1.0) is attained for this set of pairs:

Matching = {(decomp1, commit1), (decomp2, commit2)}.   (13)

We compute the success rate of decomposition using the following formula:
FIGURE 7 Comparison between original and decomposed commits. DRCC, detecting the relevance of code changes; CLUS, CLUSTERCHANGES
FIGURE 8 Success rate achieved by detecting the relevance of code changes (DRCC) and CLUSTERCHANGES (CLUS)
SuccessRate =
#SuccessfullyDecomposedClasses #Classes
(14)
A class cla_i is successfully decomposed if the original and decomposed commits that contain cla_i are in the same pair of the matching set. For the decomposition of commits 1 and 2 in Figure 7, all classes are successfully decomposed except cla_2, which gives a success rate of 2/3 = 0.67.
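Continuing the sketch above, Equation 14 can then be evaluated as follows. The class_of mapping and the assumption that each class appears in exactly one original and one decomposed commit are simplifications made for illustration.

```python
def success_rate(decomps, commits, matching, class_of):
    """Equation 14: the fraction of classes whose original and decomposed
    commits end up in the same pair of the matching set.

    decomps, commits: lists of sets of change identifiers
    matching:         (decomp_index, commit_index) pairs from best_matching()
    class_of:         maps a change identifier to its enclosing class
    """
    classes = {class_of[ch] for commit in commits for ch in commit}
    successes = 0
    for cla in classes:
        for i, j in matching:
            in_decomp = any(class_of[ch] == cla for ch in decomps[i])
            in_commit = any(class_of[ch] == cla for ch in commits[j])
            if in_decomp and in_commit:
                successes += 1
                break
    return successes / len(classes)

# Toy example: all classes land in matching pairs -> success rate of 1.0.
decomps = [{"a1", "b1"}, {"c1"}]
commits = [{"a1", "b1"}, {"c1"}]
class_of = {"a1": "claA", "b1": "claB", "c1": "claC"}
print(success_rate(decomps, commits, [(0, 0), (1, 1)], class_of))  # 1.0
```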
As Figure 8 shows, when only 2 adjacent commits are combined each time, DRCC and CLUS can successfully decompose the combined commits with high success rates, both in jEdit and JabRef. At this point, the success rates achieved by CLUS (about 92% in jEdit, 91% in JabRef) and DRCC (about 91% in jEdit, 91% in JabRef) show no significant difference. As the number of combined commits increases to 4, the success rates achieved by CLUS (about 69% in jEdit, 66% in JabRef) decline quickly. When the number of combined commits increases to 7, the performance of DRCC (about 65% in jEdit, 66% in JabRef) is significantly better than that of CLUS (about 57% in jEdit, 60% in JabRef). As the number of combined commits increases to 10, the success rates of the two approaches are both below 51%, while the performance of DRCC is still better than that of CLUS.

So we can come to this conclusion: as the number of combined adjacent commits increases, the performance of DRCC becomes superior to that of CLUS. The basic idea behind CLUS is that it relates all change regions as long as they involve definitions (eg, fields, methods, and local variables) and corresponding uses (eg, references to a definition). As a result, when adjacent commits are combined in the experiment, CLUS relates change regions from separate commits whenever the definition-and-use principle is satisfied; we call this being overrelated. For example, the commits [r24067] and [r24068] in jEdit are both atomic. [r24067] records the change region in method setValueAt() of class ManagePanel. [r24068] records the change region in another method, getDeclaredJars(), of class ManagePanel. Besides, [r24068] also records the change region in method getPluginCacheEntry() of class PluginJAR. If we combine these 2 commits and apply CLUS to decompose them, CLUS will consider the change region in setValueAt() to be related to the change region in getDeclaredJars() because of the "uses" of PluginJAR in both change regions; namely, they satisfy the "useUsesInDiffs" relation proposed in Barnett et al.5 In fact, these 2 change regions are unrelated. DRCC behaves differently from CLUS: it considers the change regions in setValueAt() and getDeclaredJars() to be unrelated because they do not satisfy any coupling rule or cochanged type. On the other hand, the change regions in getDeclaredJars() and getPluginCacheEntry() satisfy the coupling rule CR4 and the < changed, changed > cochanged type, so DRCC considers them to be related. As a result, DRCC can properly decompose the combination of [r24067] and [r24068] into the original commits.
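To make the overrelation effect concrete, the sketch below models each change region by the names it defines and uses and groups regions that share a def-use or use-use link, in the spirit of the useUsesInDiffs relation; it is a simplified illustration of the grouping principle, not the actual algorithm of Barnett et al.5 The def/use sets in the example are schematic and only encode the fact, stated above, that both change regions use PluginJAR.

```python
def defuse_groups(regions):
    """Group change regions linked by shared definitions and uses.

    regions: dict mapping region name -> (set of defined names, set of used names)
    """
    names = list(regions)
    parent = {r: r for r in names}

    def find(r):
        while parent[r] != r:
            r = parent[r]
        return r

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            defs_a, uses_a = regions[a]
            defs_b, uses_b = regions[b]
            # def-use or use-use overlap => treat the regions as related
            if (defs_a & uses_b) or (defs_b & uses_a) or (uses_a & uses_b):
                parent[find(a)] = find(b)

    groups = {}
    for r in names:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())

# Both regions use PluginJAR, so a use-use link groups them together
# even though the underlying changes are actually unrelated.
regions = {
    "setValueAt":      (set(), {"PluginJAR"}),
    "getDeclaredJars": (set(), {"PluginJAR"}),
}
print(defuse_groups(regions))  # [['setValueAt', 'getDeclaredJars']]
```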
On the basis of observing a number of commits, we find that a developer may address the same bugs/features over a period of time in several commits, and these commits act on roughly the same code segments. When we combine these commits into a big one in the experiment, both DRCC and CLUS regard these code segments as related and cannot decompose the combined commit into the original ones. Another situation is that a developer may address the same bugs/features over a period of time in several commits that act on distinct code segments of the same method. Because DRCC and CLUS consider the distinct code changes occurring in the same method to be related, neither DRCC nor CLUS can decompose the commit combined from these atomic commits. For example, the commits [0b7de9] and [a0eb0d] in JabRef record distinct code changes of the method properBrackets() (in class AuthorList), and DRCC and CLUS judge that the code changes recorded in the two commits are related. There are similar cases in JabRef, such as commits [ac2c0f] and [8ce578], commits [0c1109] and [8754e6], etc.
However, in some cases, a developer addressed the same bugs/features in several commits, and we still kept these commits in our data set. For example, developer daleanson addressed the same feature #422 in 4 consecutive commits (from commit #23971 to #23974) in project jEdit. Although the 4 commits concern the same feature, the code segments involved in each commit are located in different classes, so these 4 commits are atomic. Therefore, we put these commits in our data set.

Besides, we count the total time spent on decomposing the commits when the combined numbers are 2, 4, 7, and 10. In the experiment, DRCC and CLUS are executed on Windows 7 with a quad-core 3.3-GHz Intel Core i5 processor and 8 GB of memory. As Figure 9 shows, both DRCC and CLUS can accomplish the task of decomposing the commits within 8 to 14 minutes. In general, the total time increases along with the number of combined commits. It is worth noting that the total number of commits does not change as the number of combined commits (2, 4, 7, and 10) changes.

FIGURE 9 Time cost. DRCC, detecting the relevance of code changes; CLUS, CLUSTERCHANGES
6.4 Usefulness analysis
To evaluate how our approach can be used to help developers understand source code changes, we design a questionnaire study. Specifically, we collect 12 nonatomic commits; a nonatomic commit contains 2 unrelated code changes. We use DRCC to decompose each nonatomic commit into 2 atomic commits. In the first questionnaire, we ask the participants to directly read the changed code segments of the 12 nonatomic commits and write a commit comment for each nonatomic commit. The time they start and finish the questionnaire is recorded. For comparison, in the second questionnaire, we ask the participants to read the changed code segments of the 24 atomic commits decomposed from the 12 nonatomic commits and write a commit comment for each atomic commit. The time they start and finish the questionnaire is recorded too.

The questionnaires are published as tasks on the zhubajie website (http://www.zbj.com), and each participant receives about $5 for completing a valid questionnaire. There are 5 participants in the first questionnaire and 5 participants in the second questionnaire. To determine whether the participants understand the commits (atomic or nonatomic), we compare the comments written by the participants with the original comments of the commits. For a nonatomic commit, its original comment describes the 2 code changes corresponding to its 2 atomic commits. If a participant's comment expresses the same or similar meaning as the original comment, we regard the participant as really understanding the commit and the questionnaire as valid. It is worth noting that the handwritten comment for a nonatomic commit is valid only when the comment describes the corresponding 2 code changes. All of the 10 participants engage in programming work at different Internet companies. The first 5 participants have on average 4.6 years of Java programming experience, and the latter 5 participants have on average 4.4 years of Java programming experience.

Figure 10A shows that 88.33% of the atomic commits and 76.67% of the nonatomic commits are correctly understood by the participants. This result shows that decomposing nonatomic commits into atomic ones can improve the accuracy in understanding commits. Meanwhile, we calculate the time spent in understanding atomic and nonatomic commits, respectively. Figure 10B shows that the average time spent on correctly understanding an atomic commit and a nonatomic commit is 5.8 and 18.2 minutes, respectively. This means that for a nonatomic commit containing 2 atomic commits, the average time spent on understanding it is 11.6 minutes after the commit is decomposed into 2 atomic commits. Therefore, about 6.6 minutes (namely, 18.2 − 11.6 = 6.6) can be saved in understanding a nonatomic commit using DRCC. DRCC can thus significantly improve the efficiency of programmers in code change reviews.

7 THREATS TO VALIDITY

In this section, we focus on the threats that could affect the results of our case studies. Several issues are listed as follows.

In this paper, although we consider most of the fine-grained structural coupling rules (see Table 1) in the object-oriented programming paradigm, we believe that there exist some coupling rules we have missed, which may make our set of coupling rules incomplete. On the other hand, we have chosen to ignore some coupling rules (see Table 7), for example, the CRaa coupling rule (where ai* is an attribute of class Ci, aj* is an attribute of class Cj, and ai* and aj* couple together via value assignment). From thousands of cases, we find that there are no related code changes induced by the CRaa coupling rule, so we do not take CRaa into consideration, and similarly for CRam.

Another threat to validity is the correctness of the data sets. We tried to avoid bias during the data collection phase. For example, we first invite 3 students to manually identify truly related changes and another 3 students to validate the results, and they are not aware of the experimental goals or the way that DRCC identifies related code changes. However, it is still possible that there are biases and noise in the results. Because few of the graduate students have been in contact with real project development as in a software house, they do not have enough coding experience in a real production environment, and their capacity for reading, analyzing, and comprehending code may fall short compared with that of professional software developers. Consequently, there may exist false positive or false negative results, which may affect the validity of the case study.

In the experiment, we only consider bug fixing commits as our dataset. Bug fixing commits are a small fraction of all the commits in a system, so it may be a somewhat restricted training/validation dataset.
FIGURE 10 The usefulness analysis of detecting the relevance of code changes
TABLE 7 Ignored instances of coupling rules

No.   Coupling Rules   Instances             Abbreviation   Description
1     CRaa             Ci · ai∗ ⇒ Cj · aj∗    –              Attribute-to-Attribute
2     CRam             Ci · ai∗ ⇒ Cj · mj∗    –              Attribute-to-Method
However, in our previous research,6 we find that bug fixing commits and other types of commits have no essential difference. In that paper,6 we do not differentiate the gold set coming from bug fixing commits from that coming from other types of commits, and the final accuracy in that paper6 and the accuracy in this paper are almost at the same level.
8 CONCLUSIONS AND FUTURE WORK

This paper describes an approach to identify related code changes. To obtain the discriminative feature derived from the interior of the entities themselves, we collect the coupling rules and cochanged types based on the object-oriented programming paradigm by analyzing real source code. The related changes vector (RCV) is defined to measure the discriminative feature, and machine learning algorithms are used to identify the relevance of code changes. The experiment results indicate that PNN performs better for related changes detection than the other machine learning algorithms, and that the 72-dimension RCV is the more reasonable choice. In addition, we empirically investigate how DRCC performs compared with CLUS. The results further validate the effectiveness of DRCC as the number of combined commits increases. Finally, we show that DRCC can help developers in practice. In general, DRCC can help developers in 2 scenarios: (1) in the daily development process, developers may create a nonatomic commit; they can then use DRCC to decompose the nonatomic commit into atomic ones, which prevents them from submitting nonatomic commits to the version control system; (2) for the nonatomic commits already existing in the version control system, DRCC can decompose them into atomic ones, which saves reviewers from having to understand several unrelated code changes at once.

The future research agenda mainly focuses on fitting DRCC into the development process. When a developer unconsciously creates a nonatomic commit, DRCC automatically decomposes it into atomic ones; then, we help the developer generate a readable comment for each commit. The first step has already been addressed by this work; the next step focuses on automatically generating readable commit comments. To reduce the complexity of the comment generation algorithm, we should first decompose a nonatomic commit into atomic ones. Because a nonatomic commit usually contains several independent code changes, it is difficult for a comment generation algorithm to generate a single comment that covers all the information of these code changes. Therefore, using DRCC to decompose a nonatomic commit into atomic ones can help improve the accuracy of the comment generation algorithm. In the future, DRCC in conjunction with a comment generation algorithm will offer assistance to developers in their daily development.

ACKNOWLEDGMENTS

This research is supported by the National Key Research and Development Program (2016YFB1000101), the National Natural Science Foundation of China (61232011 and 61672545), and the Science and Technology Planning Project of Guangdong Province (2014B010118003).
REFERENCES
1. Vaucher S, Sahraoui H, Vaucher J. Discovering new change patterns in object-oriented systems. 2008 15th Working Conference on Reverse Engineering. IEEE: Antwerp, 2008:37–41.
2. Dias M, Bacchelli A, Gousios G, Cassou D, Ducasse S. Untangling fine-grained code changes. Proceedings of the 22nd International Conference on Software Analysis, Evolution and Reengineering. IEEE: Montreal, 2015;341–350.
3. Herzig K, Zeller A. The impact of tangled code changes. Proceedings of 10th Conference on Mining Software Repositories, MSR '13. IEEE Press: San Francisco, 2013;121–130.
4. Gómez VU, Ducasse S, D'Hondt T. Visually characterizing source code changes. Sci Comput Program. 2015;98, Part 3:376–393.
5. Barnett M, Bird C, Brunet J, Lahiri SK. Helping developers help themselves: Automatic decomposition of code review changesets. Proceedings of the 37th International Conference on Software Engineering - Volume 1, ICSE '15. Piscataway, NJ, USA: IEEE Press; 2015:134–144.
6. Huang Y, Chen X, Zou Q, Luo X. A probabilistic neural network-based approach for related software changes detection. 21st Asia-Pacific Software Engineering Conference, APSEC 2014, Jeju, South Korea, December 1-4, 2014. Volume 1: Research Papers; 2014:279–286.
7. Briand L, Devanbu P, Melo W. An investigation into coupling measures for C++. Proceedings of the 19th International Conference on Software Engineering, ICSE '97. New York, NY, USA: ACM; 1997:412–421.
8. Al Dallal J. Accounting for data encapsulation in the measurement of object-oriented class cohesion. J Software: Evol Process. 2015;27(5):373–400.
9. Hassan AE, Holt RC. Predicting change propagation in software systems. Proceedings of the 20th IEEE International Conference on Software Maintenance, ICSM '04. Washington, DC, USA: IEEE Computer Society; 2004:284–293.
10. Aryani A, Peake ID, Hamilton M. Domain-based change propagation analysis: An enterprise system case study. 2010 IEEE International Conference on Software Maintenance (ICSM), ICSM 2010; Timisoara 2010:1–9.
11. Gyimothy T, Ferenc R, Siket I. Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Trans Software Eng. 2005;31(10):897–910.
12. Olague HM, Etzkorn LH, Gholston S, Quattlebaum S. Empirical validation of three software metrics suites to predict fault-proneness of object-oriented classes developed using highly iterative or agile software development processes. IEEE Trans Software Eng. 2007;33(6):402–419.
13. Aryani A, Perin F, Lungu M, Mahmood AN, Nierstrasz O. Predicting dependences using domain-based coupling. J Software: Evol Process. 2014;26(1):50–76.
14. Yang HY, Tempero E, Berrigan R. Detecting indirect coupling. Proceedings of the 2005 Australian Conference on Software Engineering, ASWEC '05. Washington, DC, USA: IEEE Computer Society; 2005:212–221.
15. Arisholm E, Briand LC, Foyen A. Dynamic coupling measurement for object-oriented software. IEEE Trans Software Eng. 2004;30(8):491–506.
16. Hassoun Y, Johnson R, Counsell S. A dynamic runtime coupling metric for meta-level architectures. 15th European Conference on Software Maintenance and Reengineering, CSMR 2011; Oldenburg, 2011:339–346.
17. Chidamber SR, Kemerer CF. A metrics suite for object oriented design. IEEE Trans Software Eng. 1994;20(6):476–493.
18. Churcher NI, Shepperd MJ. Towards a conceptual framework for object oriented software metrics. ACM SIGSOFT Software Eng Notes. 1995;20(2):69–75.
19. Briand LC, Morasca S, Basili VR. Property-based software engineering measurement. IEEE Trans Software Eng. 1996;22(1):68–86.
20. Li W, Henry S. Object-oriented metrics that predict maintainability. J Syst Software. 1993;23(2):111–122.
21. Martin R. OO design quality metrics-an analysis of dependencies. RODA. 1995;2(3):151–170.
22. Lee YS, Liang BS, Wu SF, Wang FJ. Measuring the coupling and cohesion of an object-oriented program based on information flow. Proceedings of the International Conference on Software Quality, Maribor, Slovenia; 1995:81–90.
23. Ying ATT, Murphy GC, Ng R, Chu-Carroll MC. Predicting source code changes by mining change history. IEEE Trans Software Eng. 2004;30(9):574–586.
24. Linares-Vásquez M, McMillan C, Poshyvanyk D, Grechanik M. On using machine learning to automatically classify software applications into domain categories. Empirical Software Eng. 2014;19(3):582–618.
25. Coomans D, Massart DL. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Anal Chim Acta. 1982;136:15–27.
26. Zolnierek A, Rubacha B. The Empirical Study of the Naive Bayes Classifier. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005;329–336.
27. Quinlan JR. Simplifying decision trees. Int J Man-Mach Stud. 1987;27(3):221–234.
28. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–297.
29. Specht DF. Probabilistic neural networks. Neural Networks. 1990;3(1):109–118.
30. Specht DF. A general regression neural network. IEEE Trans Neural Networks. 1991;2(6):568–576.
31. Jaafar F, Guéhéneuc Y-G, Hamel S, Antoniol G. Detecting asynchrony and dephase change patterns by mining software repositories. J Software: Evol Process. 2014;26(1):77–106.
32. Okutan A, Yildiz OT. A novel kernel to predict software defectiveness. J Syst Software. 2016;119:109–121.
33. Lewis DD. Naive (Bayes) at forty: The independence assumption in information retrieval. European Conference on Machine Learning. Springer: Chemnitz, 1998:4–15.
34. McCallum A, Nigam K, et al. A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization, vol. 752: Citeseer: Madison, 1998:41–48.
35. Kosker Y, Turhan B, Bener A. An expert system for determining candidate software classes for refactoring. Expert Syst Appl. 2009;36(6):10000–10003.
36. Moser R, Pedrycz W, Succi G. A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. 2008 ACM/IEEE 30th International Conference on Software Engineering. IEEE: Leipzig, 2008;181–190.
37. Knab P, Pinzger M, Bernstein A. Predicting defect densities in source code files with decision tree learners. Proceedings of the 2006 International Workshop on Mining Software Repositories. ACM; 2006:119–125.
38. Shivaji S, Whitehead EJ, Akella R, Kim S. Reducing features to improve code change-based bug prediction. IEEE Trans Software Eng. 2013;39(4):552–569.
39. Shivaji S, Whitehead EJ Jr., Akella R, Kim S. Reducing features to improve bug prediction. Proceedings of the 2009 IEEE/ACM International Conference on Automated Software Engineering. IEEE Computer Society: Auckland, 2009;600–604.
40. Washburne T, Stachowitz R, Hawley J, Romsdahl H. Automatic classification of software modules with probabilistic neural networks. Neural Networks, 1994. IEEE World Congress on Computational Intelligence., 1994 IEEE International Conference on, vol. 6. IEEE: Orlando, 1994:3894–3899.
41. Kanmani S, Uthariaraj VR, Sankaranarayanan V, Thambidurai P. Object oriented software quality prediction using general regression neural networks. ACM SIGSOFT Software Eng Notes. 2004;29(5):1–6.
42. Nasseri E, Counsell S, Shepperd M. An empirical study of evolution of inheritance in Java OSS. 19th Australian Conference on Software Engineering (ASWEC 2008). IEEE: Perth, 2008;269–278.
43. Dit B, Holtzhauer A, Poshyvanyk D, Kagdi HH. A dataset from change history to support evaluation of software maintenance tasks. In: Zimmermann T, Penta MD, Kim S, eds. MSR: IEEE Computer Society; 2013:131–134.
44. Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.
45. Grissom RJ, Kim JJ. Effect Sizes for Research: Univariate and Multivariate Applications: Routledge; 2012.
How to cite this article: Huang Y, Chen X, Liu Z, Luo X, Zheng Z. Using discriminative feature in software entities for relevance identification of code changes. J Softw Evol Proc. 2017;29:e1859. https://doi.org/10.1002/smr.1859