Int. J. Computer Applications in Technology, Vol. 51, No. 3, 2015
Identifying related commits from software repositories
Mustafa Hammad
Department of Information Technology, Mutah University, Al-Karak, Mutah 61710, Jordan
Email: [email protected]
Abstract: Source code modifications are saved in software repositories as individual and independent commits. A high-level programming task is usually carried out through several related or similar code change activities. This paper presents an approach to automatically identify related and similar source code modifications from software repositories. Discovering related commits helps maintainers to understand and trace the implementation of a specific programming task. Furthermore, identifying the commits of a programming task simplifies code fixing and debugging activities. The identification is based on discovering relations among commits from software repositories. A relation is exposed based on the textual similarity between commits. These relationships are then used to categorise commits into disjoint groups. Each generated group represents related or similar code modification activities. A group can be a set of maintenance tasks related to a specific feature in the system. A case study on an open source project is presented to investigate the proposed approach.
Keywords: software maintenance; software repositories; source code modifications.
Reference to this paper should be made as follows: Hammad, M. (2015) ‘Identifying related commits from software repositories’, Int. J. Computer Applications in Technology, Vol. 51, No. 3, pp.212–218.
Biographical notes: Mustafa Hammad is an Assistant Professor at the Information Technology department in Mu’tah University, Al Karak, Jordan. He received his PhD in Computer Science from New Mexico State University, NM, USA in 2010. He received his Masters degree in Computer Science from Al-Balqa Applied University, Jordan in 2005 and his BSc in Computer Science from The Hashemite University, Jordan in 2002. His research interest is in Software Engineering with a focus on source code analysis, visualisation and software evolution.
1 Introduction
Programmers are responsible for the evolution of software systems. Software is changed for many reasons: to correct a bug, to add a feature, to refactor, etc. Code changes are managed by version control tools such as CVS and SVN. These tools enable maintenance activities to be applied as a set of commits made by different programmers. A bug can often be fixed by one commit, because most bug fixes are small code changes (Raghavan et al., 2004; Hindle et al., 2008). On the other hand, some programming tasks, such as adding features, need more than one commit to be fully implemented.
The problem arises when we need to identify the commits responsible for implementing or maintaining a specific programming task. This problem has two aspects. The first aspect is how to group the archive of code changes (the history) into related commits. The archive is a set of independent commits, so we need to recover relationships between commits in order to group related ones. Grouping related commits helps maintainers to understand and trace the implementation of a specific programming task. For example, to update the code responsible for user login operations, it is necessary to understand how the login process was implemented. This requires us to recover all commits that implemented the login operations. The second aspect is how to give a summary or topic to the grouped commits. The summary can help in identifying the feature that is maintained by the commits of the group. To maintain a feature, it is essential to understand all previous coding activities on this feature.
Widely used version control tools do not support locating related commits that are responsible for implementing a specific feature. These tools do not support querying the history to locate commits, so it is hard for developers to locate a commit based on a request. Even if a commit is located, it is very difficult to locate other commits that are related to it.
In this paper, an approach is presented to automatically partition the archive of commits into disjoint groups. The approach is based on recovering relationships between commits. Our premise is that, if two commits have similar text contents, then there is a relationship between them. A closeness measure is used to identify how close two commits are to each other.
Copyright © 2015 Inderscience Enterprises Ltd.
The degree of relation between commits is used to group related commits into disjoint groups. Agglomerative hierarchical clustering is used in the grouping process. Each group of commits represents a set of related coding activities that together represent a specific programming task. For each group of commits, a textual summary is generated.
The approach examines the textual contents of commits. Each commit is represented as a set of tokens. Then, agglomerative hierarchical clustering is applied to the sets. The approach has the advantage of recovering relations between commits from their textual contents alone; there is no need to parse the syntax of the source code or of the code modifications.
The research contributions of this paper are:
• an approach to recover relationships between commits based on their textual contents
• the use of clustering techniques to partition the history of code changes into disjoint groups of related commits
• a method to generate a summary that marks the code change activities of a cluster.
Figure 1 Snapshot from the content of commit 17008 from QuantLib with its extracted tokens (see online version for colours)
This paper is organised as follows. Section 2 details the set representation of commits. The process of identifying related commits is given in Section 3. Section 4 describes the process of applying hierarchical clustering to commits. Generating summaries for a group of related commits is shown in Section 5. A case study on an open source project is discussed in Section 6, followed by related work in Section 7. Finally, our conclusions and future work are presented in Section 8.
2 Representation of commits
Code changes are saved in software repositories as commits. Each commit basically has a number, a date, an author, a list of changed files, the code changes and a commit message. In this paper, each commit is represented as a set of tokens that are extracted from the contents of the commit. The tokens are extracted from:
• Names of the files added, modified or deleted by the commit. The full path is partitioned into a set of tokens.
• Words in the commit message.
• Identifiers in changed lines of code.
As an example of a commit and its extracted tokens, Figure 1 shows a snapshot of commit 17008, which was extracted from the repository of the QuantLib open source project. QuantLib is a free, open source C++ library for quantitative finance (http://quantlib.org/index.shtml). The lines that begin with the keyword ‘Tokens’ are not part of the commit body; we added them to show the tokens extracted from the textual contents of the commit body. The following subsections describe the process of extracting tokens from commits.
2.1 Tokens of file name
File names are important in discovering related commits. There are relations between commits that are applied to the same files. So, the full paths of files are included as tokens in the representation of commits. The names of the packages or folders that appear in the path of each changed, added or modified file are extracted. For example, as shown in Figure 1, the path of the file “ql/experimental/finitedifferences/fdmhestonvariancemesher.cpp” is split into this set of tokens: {ql, experimental, finitedifferences, fdmhestonvariancemesher.cpp}.
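The path splitting described above can be sketched in a few lines of Python. This is only an illustration under the assumption that path components are separated by '/'; the function name path_tokens is hypothetical and not part of the paper.

```python
def path_tokens(path: str) -> set[str]:
    """Split a repository path on '/' and return its non-empty components."""
    return {part for part in path.split("/") if part}


tokens = path_tokens("ql/experimental/finitedifferences/fdmhestonvariancemesher.cpp")
print(sorted(tokens))
```

Empty components are dropped so that leading, trailing or doubled slashes do not produce empty tokens.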
2.2 Tokens from commit message
Similar commit messages help to identify relations between commits. So, all words of a commit message, except stop words, are extracted. Stop words, such as ‘then’ and ‘the’, do not help in grouping commits.
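A minimal sketch of this filtering step is given here. The stop-word list is an assumption made for illustration (the list actually used is not given), and message_tokens is a hypothetical name.

```python
import re

# Illustrative stop-word list; this is an assumption, not the list used in the paper.
STOP_WORDS = {"a", "an", "and", "of", "the", "then", "to"}


def message_tokens(message: str) -> set[str]:
    """Return the words of a commit message, excluding stop words."""
    words = re.findall(r"[A-Za-z0-9_]+", message)
    # Compare in lower case so that 'The' and 'the' are both filtered,
    # but keep the original spelling of the surviving tokens.
    return {w for w in words if w.lower() not in STOP_WORDS}


print(sorted(message_tokens("ensure monotony of pGrid")))
```

The original spelling of each token is kept because identifiers such as pGrid are case sensitive.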
The commit message in Figure 1, “ensure monotony of pGrid”, is extracted into the tokens {ensure, monotony, pGrid}. The stop word ‘of’ is ignored.
2.3 Tokens from code changes
The tokens from a code change are all identifiers that appear in any added or deleted line of code. Identifiers include user-defined names of variables, classes and methods. Each line of code added or deleted by the programmer is considered for the extraction process; unchanged lines are ignored. Each added or deleted line of code is processed as follows:
• The text line is tokenised into distinct tokens based on all programming language operators.
• Tokens are filtered by removing all reserved words of the programming language.
• All tokens that are one character long are ignored. This excludes counters and local variables, which are not useful in recovering relations between commits.
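The three steps above can be sketched as follows. Instead of splitting on every operator, this sketch uses a regular expression that picks out identifier-shaped runs of characters, which has the same effect for ordinary code lines and also discards numeric literals. The reserved-word set is a small illustrative subset of C++ keywords, and code_line_tokens is a hypothetical name; both are assumptions.

```python
import re

# A small subset of C++ reserved words, for illustration only (assumption).
RESERVED = {"class", "const", "double", "else", "for", "if", "int",
            "return", "void", "while"}


def code_line_tokens(line: str) -> set[str]:
    """Extract identifiers from one changed line of code."""
    # Step 1: find identifier-shaped tokens (letters, digits, underscores).
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", line)
    # Step 2: drop reserved words. Step 3: drop one-character tokens.
    return {t for t in identifiers if t not in RESERVED and len(t) > 1}


print(sorted(code_line_tokens("const Size e = ((i + 1)*tp.size())/size;")))
```

Because the result is a set, an identifier that occurs twice on a line is counted once, and Size and size remain distinct tokens.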
For example, the line of code “const Size e = ((i + 1)*tp.size())/size;” is tokenised into: const, Size, e, i, tp, size and size. Then, the tokens are filtered to: Size, size and tp. Tokens of size one are mostly temporary identifiers such as indexes. These tokens are ignored because they do not reflect the nature of the feature under consideration.
Finally, all tokens extracted from a commit form a set for that commit. The tokens identified in Sections 2.1–2.3 for commit 17008 in Figure 1 form the following final set of tokens: {ql, experimental, finitedifferences, fdmhestonvariancemesher.cpp, ensure, monotony, pGrid, Real, vGrid, size, Size, grid, reserve, tAvgSteps, qMin, epsilon, vx, first, second, tp}. This set is considered as a base for identifying the relations between commits. To clarify the idea, another example is provided in Figure 2. It shows another commit from QuantLib (commit 17009). The final set representation of this commit is {ensure, monotony, pGrid, finitedifferences, fdmhestonvariancemesher.cpp, Real, grid, reserve, size, tAvgSteps, qMin, epsilon, vx, tp, vGrid, size, first, second}.
Figure 2 Snapshot from the content of commit 17009 from QuantLib with its extracted tokens (see online version for colours)
3 Identifying related commits
Identifying how close commits are to each other is the key to identifying related commits. Since each commit is represented as a set of tokens, the closeness is measured as the ratio of the size of the intersection of two sets to the size of their union (Jaccard, 1912). This ratio is used to determine how close two commits are and hence how related they are. For example, the closeness between the set {ql, experimental, finitedifferences, fdmhestonvariancemesher.cpp, ensure, monotony, pGrid, Real, vGrid, size, Size, grid, reserve, tAvgSteps, qMin, epsilon, vx, first, second, tp} and the set {ensure, monotony, pGrid, finitedifferences, fdmhestonvariancemesher.cpp, Real, grid, reserve, size, tAvgSteps, qMin, epsilon, vx, tp, vGrid, size, first, second} is equal to the number of tokens in common, which is 18, divided by the number of different tokens, which is 21. As a result, the final closeness value will be 0.86.
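The closeness ratio of two token sets A and B, |A ∩ B| / |A ∪ B|, can be computed directly with set operations. The sketch below is illustrative: the function name closeness and the two token sets are hypothetical and are not taken from the figures.

```python
def closeness(a: set[str], b: set[str]) -> float:
    """Jaccard ratio: shared tokens divided by distinct tokens overall."""
    union = a | b
    if not union:
        # Two empty sets share nothing; treat them as unrelated (assumption).
        return 0.0
    return len(a & b) / len(union)


# Hypothetical token sets used only to exercise the function.
first = {"ql", "math", "solver.cpp", "fix", "tolerance"}
second = {"ql", "math", "solver.hpp", "fix"}
print(closeness(first, second))  # 3 shared tokens out of 6 distinct ones
```

The value ranges from 0 (no shared tokens) to 1 (identical sets), so it can be compared directly against a percentage threshold.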
4 Grouping related commits
We need to group related commits into disjoint groups. Each group can be a set of related maintenance activities that implement or modify a feature. Commits that have similar code changes are grouped together, so commits in the same group are likely to be related.
Relations among commits are identified in two phases. First, the textual contents of each commit are analysed to extract tokens from file names, commit messages and changed code, as described in Section 2. As a result, each commit is represented as a set of tokens. In the next phase, agglomerative hierarchical clustering is applied to all the sets that represent commits. The final result is a set of groups of commits. Each group contains only closely related commits. A group of commits represents related code changes for a specific programming task. In this case, each group can be given a topic or representative summary that may reflect this programming task.
Clustering is applied to the sets that represent commits. Agglomerative hierarchical clustering (Johnson, 1967) is a bottom-up clustering technique. It starts by placing each set in its own group and then merges these single-element groups into larger and larger groups until all sets are gathered in one group. A stopping condition is needed to stop the grouping process at some level. The condition is a threshold value for the closeness measure (Han, 2005). The grouping of elements is done in a hierarchical way. One representation of such a structure is a tree called a dendrogram, which shows how elements are grouped together in each step (Han, 2005).
For example, we extracted seven commits from the open source project QuantLib (http://quantlib.org). The extracted commits and their tokens are shown in Table 1. The tokens are separated by a colon ‘:’. Hierarchical clustering was applied to these commits. In the first grouping level, commits 17508 and 17450 are grouped together in one group with closeness value ~30%. Commits 17483 and 17528 are also grouped in another cluster. The other three commits are not clustered into any group. In the next level, commit 17448 is added to the cluster {17483, 17528} to form a new group of commits {17483, 17528, 17448}. The process continues until all commits are grouped together in one cluster at closeness value 15%. An appropriate closeness threshold value is needed to stop the process. For example, with threshold value 35%, the generated groups are:
• {17514}
• {17508, 17450}
• {17483, 17528, 17448}
• {17487}.
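The bottom-up merging with a closeness threshold can be sketched in plain Python as follows. Two points are assumptions rather than details from the text: the closeness between two groups is taken as the average pairwise closeness of their commits (the linkage criterion is not stated), and the commit names and token sets are hypothetical toy data.

```python
from itertools import combinations


def closeness(a, b):
    """Jaccard ratio between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_closeness(g1, g2, tokens):
    """Average pairwise closeness between two groups (assumed linkage)."""
    pairs = [(x, y) for x in g1 for y in g2]
    return sum(closeness(tokens[x], tokens[y]) for x, y in pairs) / len(pairs)


def cluster_commits(tokens, threshold):
    """Merge the closest pair of groups until no pair reaches the threshold."""
    groups = [frozenset([commit]) for commit in tokens]
    while len(groups) > 1:
        best_pair = max(combinations(groups, 2),
                        key=lambda p: group_closeness(p[0], p[1], tokens))
        if group_closeness(best_pair[0], best_pair[1], tokens) < threshold:
            break  # stopping condition: remaining groups are not close enough
        groups = [g for g in groups if g not in best_pair]
        groups.append(best_pair[0] | best_pair[1])
    return groups


# Hypothetical commits and tokens, used only to exercise the sketch.
toy = {
    "c1": {"ql", "math", "solver", "fix"},
    "c2": {"ql", "math", "solver", "tolerance"},
    "c3": {"docs", "readme", "typo"},
    "c4": {"docs", "readme", "links"},
    "c5": {"build", "Makefile"},
}
for group in cluster_commits(toy, threshold=0.35):
    print(sorted(group))
```

Lowering the threshold lets more merges happen; with a threshold of zero, every commit ends up in a single group, which corresponds to the root of the dendrogram.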
To check the quality of the generated clusters, a manual check was applied to the commits of cluster {17483, 17528, 17448}. It was found that all three commits modified the same header file “/trunk/QuantLib/ql/pricingengines/basket/all.hpp”, which indicates related coding activities.
Table 1 An example of seven commits with their set representation of tokens
Commit No.    Tokens
17528    trunk:QuantLib:ql:experimental:finitedifferences:all.hpp:fdbatesvanillaengine:hpp:fdmbatesop:fdmbatessolver:fdmhestonlikesolverfactory:Updated:all:file:
17483    trunk:QuantLib:ql:pricingengines:basket:all.hpp:kirkengine:hpp:Updated:all:file
17448    trunk:QuantLib:ql:experimental:processes:all.hpp:extendedornsteinuhlenbeckprocess:hpp:Updated:all:with:files:
17514    trunk:QuantLib:ql:termstructures:volatility:optionlet:all.hpp:strippedoptionlet:hpp:strippedoptionletbase:Regenerated:all:file:with:ordered:headers:
17487    trunk:QuantLib:ql:experimental:finitedifferences:fdhestonvanillaengine.cpp:fdhestonvanillaengine.hpp:fdmhestonsolver:hpp:fdmbackwardsolver:removed:useless:
17450    trunk:QuantLib:ql:experimental:coupons:Makefile.am:all.hpp:proxyibor:hpp:cpp:Added:untracked:files:autotools:build:
17508    trunk:QuantLib:ql:experimental:finitedifferences:Makefile.am:all.hpp:fdmamericanstepcondition:hpp:Added:missing:file:Makefile:
5 Generating summaries for groups of commits
Each group of commits represents related programming activities. A summary is generated for each cluster to help in identifying the feature or the concept maintained by the group. The summary is generated as follows:
• for each cluster of commits, extract the common tokens (the intersection) from the set representations of its commits
• the extracted tokens form a summary for the cluster.
Shared tokens among all sets of a cluster help in providing useful information about the programming activities done collectively by the group. For example, the cluster {17483, 17528, 17448} in Section 4 is summarised by the following set of tokens: {trunk, QuantLib, ql, all.hpp, Updated, all}. These tokens result from the intersection of the three sets that represent the three commits. Another example is the cluster {17508, 17450}. The common tokens in these two sets are {trunk, QuantLib, ql, experimental, Makefile.am, all.hpp, hpp, Added}. These tokens are the summary for the two commits.
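The summary of a cluster is a plain set intersection, as the following sketch shows for the cluster {17508, 17450}. The token strings are the colon-separated rows of Table 1 for these two commits; the helper names parse_tokens and cluster_summary are hypothetical.

```python
def parse_tokens(row: str) -> set[str]:
    """Turn a colon-separated token string (as in Table 1) into a set."""
    return {token for token in row.split(":") if token}


def cluster_summary(token_sets: list[set[str]]) -> set[str]:
    """Return the tokens shared by every commit in the cluster."""
    if not token_sets:
        return set()
    return set.intersection(*token_sets)


commit_17450 = parse_tokens(
    "trunk:QuantLib:ql:experimental:coupons:Makefile.am:all.hpp:proxyibor:"
    "hpp:cpp:Added:untracked:files:autotools:build:"
)
commit_17508 = parse_tokens(
    "trunk:QuantLib:ql:experimental:finitedifferences:Makefile.am:all.hpp:"
    "fdmamericanstepcondition:hpp:Added:missing:file:Makefile:"
)
print(sorted(cluster_summary([commit_17450, commit_17508])))
```

Tokens are compared as exact strings, so file and files, or Makefile and Makefile.am, count as different tokens and do not appear in the summary.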
6 Case study
We investigated commits from the QuantLib open source project (http://quantlib.org). QuantLib is a comprehensive software framework for quantitative finance. A subset of commits covering one year, specifically the year 2010, was extracted. The total number of extracted commits is 97. With a closeness value of 20%, the clustering produced 59 groups. The average group size is 1.6 commits per cluster. Of the 59 clusters, 36 contain only one commit. This result suggests that most of the design changes made to QuantLib during 2010 had different purposes and were not focused on a specific part of the system.
The dendrogram resulting from applying hierarchical clustering is shown in Figure 3. The x-axis of the diagram shows the numbers of the 97 commits, and the closeness values between commits are shown on the y-axis. Each identified group of commits represents related or similar code change activities. We manually checked randomly chosen groups to verify the quality of the generated groups. Some of our findings are given in the following examples.
Figure 3 The dendrogram of clustering 97 commits extracted from QuantLib open source project (see online version for colours)
One large group generated by the clustering is composed of the commits {17384, 17387, 17393, 17394, 17408, 17415, 17440, 17480}. All these commits were committed by the same developer (nando). Manual checking of the code changes revealed that these commits have the same code changing activity. This activity is stated in the commit message shared by all commits of the cluster: “merged branches/R01000x-branch into trunk, respecting ancestry”. So, all the commits of the cluster performed the merging activity. Even though the commits of the cluster are not adjacent (they have different dates), they are clustered together.
Another example is the cluster of commits {17487, 17488, 17543}. Commit number 17487 modified the files finitedifferences/fdhestonvanillaengine.cpp and finitedifferences/fdhestonvanillaengine.hpp by updating the include statements. Commit 17488 also fixed the include statements for another two files in the finitedifferences directory. Both commits (17487 and 17488) have the same commit message, which is “removed useless include”. The third commit in the cluster (17543) also has a related code change activity on another two files in the finitedifferences directory. It updated one include statement that uses a file from commit 17488 and redefined some variables. The commit message for this commit is “removed obsolete typedef”. The summary of the cluster, which is the intersection of the tokens of the three commits, is: {trunk, QuantLib, ql, experimental, finitedifferences, removed}.
Another generated cluster is commits {17567, 17568}. The first commit modified the header file math/linearleastsquaresregression.hpp and the test source file test-suite/linearleastsquaresregression.cpp. According to the commit message, the developer added template constructors and did some redesign. In the second commit, the same developer modified the same header file from the first commit (math/linearleastsquaresregression.hpp) by removing a method named swap. The summary of the cluster is represented by the tokens: {trunk, QuantLib, ql, math, linearleastsquaresregression.hpp, Copyright, Other LinearLeastSquaresRegression, swap, Size, xContainer, yContainer, LinearFcts, value, lfs, value_type, Argument Type, Super, Real, intercept, calculate, begin}.
7 Related work
The related work can be grouped into identifying artefact dependencies, code summarisation and clustering.
In the field of identifying dependencies between artefacts, Aryani et al. (2011) proposed an approach to predict software dependencies based on domain-based coupling, which is derived from the domain-level relationships between software components. They proposed a model to trace dependencies among source code, database and user interface components. Beyer and Fararooy (2010) proposed the CheckDep tool, which manages dependencies at an abstract level. A software developer can use it to inspect introduced and removed dependencies before committing new versions, and other developers receive summaries of the changed dependencies via e-mail. Dhaliwal et al. (2012) proposed two grouping approaches that identify dependencies among commits and create groups of dependent commits that need to be integrated as a whole into a code branch. In their approach, identifying dependencies is based on four proposed metrics: File Dependency Distance, File Association Distance, Developer Dissimilarity Distance and Change Request Dependency Distance. In our approach, we used simple text analysis to recover dependencies between commits. Hassan and Holt (2004) addressed the issue of propagating a change from one source to other entities. They proposed several heuristics to predict change propagation. Antoniol et al. (2002) proposed a method based on information retrieval to recover traceability links between source code and free text documents. The method is based on using the names of program items, such as functions, variables, types, classes and methods. Abdelkader et al. (2013) proposed an approach that formulates the problem of identifying candidate web services in legacy software. Kagdi and Maletic (2007) and Kagdi et al. (2006, 2007) showed that open source projects are sources for traceability links between different types of software artefacts. They showed that artefacts that are frequently committed together have a high probability of having a traceability link between them. Ali and Antoniol (2012) proposed Trustrace, which is a trust-based traceability recovery approach. They showed that mining software repositories and combining the mined results with IR techniques can improve the accuracy of IR techniques. Zimmermann et al. (2006) presented a study to identify line changes across several versions. They defined the annotation graph, which captures how lines evolve over time.
In the field of code summarisation, Thomas (2011) proposed the use of statistical topic models to automatically discover structure in the textual representation of software repositories. Sridhara et al. (2010) presented a technique to automatically generate descriptive summary comments for Java methods. Haiduc et al. (2010) proposed an approach to automatically determine textual descriptions for source code. The approach is based on automated text summarisation technology, which is the creation of a shortened version of a text by a computer program.
In software clustering, Beck and Diehl (2010) evaluated the impact of software evolution on software clustering. They found a positive impact of evolutionary data on software clustering. Kuhn et al. (2007) retrieved the topics present in the source code vocabulary to support program comprehension. They introduced semantic clustering, which is a technique based on Latent Semantic Indexing and clustering to group source artefacts that use similar vocabulary. They used information retrieval techniques to derive topics from the vocabulary usage at the source code level. Beyer and Noack (2005) introduced a method for clustering software artefacts, based on historical co-changes and interpretable graph layout, to identify clusters of artefacts that are frequently changed together. Ejnioui et al. (2013) presented an approach using grey relational analysis for prioritising software requirements. Vanya et al. (2008) described a history-based approach to assess the extent to which a certain partition allows its parts to evolve independently. They used the assumption that a set of software entities that often co-evolved in the past are likely to be modified together in the near future. The approach uses hierarchical clustering to construct evolutionary clusters from historical information.
Our approach is distinguished from the related work in the area by locating related commits based on their textual representation. So, it is a lightweight approach that needs no semantic information about the code change.
8 Conclusions and future work
This paper proposed an approach to identify related commits that have similar or related code change activities. The textual contents of commits are used to recover the traceability links between them. Traceability is recovered based on a set of tokens extracted from the body of a commit. The contents include changed identifiers, file names and commit messages. Hierarchical clustering is used to group related commits into well-defined disjoint clusters. The goal is to locate related commits that are responsible for maintaining a specific coding activity or feature. The approach has been applied to an archive of an open source project. Results showed that our approach could be a good method to locate related commits and to partition the archive of code changes into meaningful clusters or groups.
Our future work focuses on applying more textual analysis to the tokens extracted from commits. One possible extension is using latent semantic indexing (LSI) and stemming to increase the matching between commits. We also aim to apply more and different text summarisation methods to generate summaries for commits and clusters. The set representation of commits is useful for building an information retrieval system for commits. In this case, users can query commits based on their set representation, which helps in locating commits based on their textual content. We are currently working on building such a system.
References
Abdelkader, M., Malki, M. and Benslimane, S.M. (2013) ‘A heuristic approach to locate candidate web service in legacy software’, International Journal of Computer Applications in Technology, Vol. 47, Nos. 2–3, pp.152–161.
Ali, N. and Antoniol, G. (2012) ‘Trustrace: mining software repositories to improve the accuracy of requirement traceability links’, IEEE Transactions on Software Engineering, Vol. 39, No. 5, pp.725–741.
Antoniol, G., Canfora, G., Casazza, G., De Lucia, A. and Merlo, E. (2002) ‘Recovering traceability links between code and documentation’, IEEE Transactions on Software Engineering, Vol. 28, No. 10, pp.970–983.
Aryani, A., Perin, F., Lungu, M., Mahmood, A.N. and Nierstrasz, O. (2011) ‘Can we predict dependencies using domain information?’, 18th IEEE Working Conference on Reverse Engineering (WCRE’11), Limerick, Ireland, pp.55–64.
Beck, F. and Diehl, S. (2010) ‘Evaluating the impact of software evolution on software clustering’, 17th Working Conference on Reverse Engineering (WCRE’10), Beverly, MA, USA, pp.99–108.
Beyer, D. and Fararooy, A. (2010) ‘CheckDep: a tool for tracking software dependencies’, 18th IEEE International Conference Program Comprehension (ICPC’10), Braga, Minho, Portugal, pp.42–43.
Beyer, D. and Noack, A. (2005) ‘Clustering software artifacts based on frequent common changes’, 13th IEEE International Workshop on Program Comprehension (IWPC’05), Missouri, USA, pp.259–268.
Dhaliwal, T., Khomh, F., Zou, Y. and Hassan, A.E. (2012) ‘Recovering commit dependencies for selective code integration in software product lines’, 2012 IEEE International Conference on Software Maintenance (ICSM’12), Trento, Italy, pp.202–211.
Ejnioui, A., Otero, C.E. and Otero, L.D. (2013) ‘Prioritisation of software requirements using grey relational analysis’, International Journal of Computer Applications in Technology, Vol. 47, Nos. 2–3, pp.100–109.
Haiduc, S., Aponte, J. and Marcus, A. (2010) ‘Supporting program comprehension with source code summarization’, 32nd ACM/IEEE International Conference on Software Engineering (ICSE’10), Cape Town, pp.223–226.
Han, J. (2005) Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ISBN:1558609016.
Hassan, A.E. and Holt, R.C. (2004) ‘Predicting change propagation in software systems’, 20th IEEE International Conference on Software Maintenance (ICSM’04), Chicago, Illinois, USA, pp.284–293.
Hindle, A., German, D.M. and Holt, R. (2008) ‘What do large commits tell us? A taxonomical study of large commits’, International Working Conference on Mining Software Repositories (MSR’08), Leipzig, Germany, pp.99–108.
Jaccard, P. (1912) ‘The distribution of the flora in the alpine zone’, New Phytologist, Vol. 11, pp.37–50.
Johnson, S.C. (1967) ‘Hierarchical clustering schemes’, Psychometrika, Vol. 2, pp.241–254.
Kagdi, H. and Maletic, J.I. (2007) ‘Software repositories: a source for traceability links’, 4th ACM International Workshop on Traceability in Emerging Forms of Software Engineering (GCT/TEFSE’07), Lexington, KY, USA.
Kagdi, H., Maletic, J.I. and Sharif, B. (2007) ‘Mining software repositories for traceability links’, 15th IEEE International Conference on Program Comprehension (ICPC’07), Banff, Alberta, BC, Canada, pp.145–154.
Kagdi, H., Yusuf, S. and Maletic, J.I. (2006) ‘Mining sequences of changed-files from version histories’, International Workshop on Mining Software Repositories (MSR’06), Shanghai, China, pp.47–53.
Kuhn, A., Ducasse, S. and Gírba, T. (2007) ‘Semantic clustering: identifying topics in source code’, Information and Software Technology, Vol. 49, No. 3, pp.230–243.
Raghavan, S., Rohana, R., Leon, D., Podgurski, A. and Augustine, V. (2004) ‘Dex: a semantic-graph differencing tool for studying changes in large code bases’, 20th IEEE International Conference on Software Maintenance (ICSM’04), Chicago, Illinois, USA, pp.188–197.
Sridhara, G., Hill, E., Muppaneni, D., Pollock, L. and Vijay-Shanker, K. (2010) ‘Towards automatically generating summary comments for Java methods’, IEEE/ACM International Conference on Automated Software Engineering (ASE’10), Antwerp, Belgium, pp.43–52.
Thomas, S.W. (2011) ‘Mining software repositories using topic models’, 33rd IEEE International Conference on Software Engineering (ICSE’11), Waikiki, Honolulu, Hawaii, pp.1138–1139.
Vanya, A., Holland, L., Klusenser, S., Laar, P. and Vliet, H. (2008) ‘Assessing software archives with evolutionary clusters’, 16th IEEE International Conference on Program Comprehension (ICPC’08), Amsterdam, Netherlands, pp.192–201.
Zimmermann, T., Kim, S., Zeller, A. and Whitehead Jr., E.J. (2006) ‘Mining version archives for co-changed lines’, International Workshop on Mining Software Repositories (MSR’06), Shanghai, China, pp.72–75.