Mining Software Revision History using Advanced Social Network Analysis

Bharath Cheluvaraju
Infosys Labs, Infosys Limited, Bangalore, India
[email protected]

Kartikay Nagal
Infosys Labs, Infosys Limited, Bangalore, India
[email protected]

Anjaneyulu Pasala
Infosys Labs, Infosys Limited, Bangalore, India
[email protected]

Abstract: In this paper, we propose a novel method to investigate the relationships between files that are committed together by applying advanced social network analysis to a “network” of source files. The source files constitute the nodes of the network, and an edge is created between files that are committed together in the same revision. We present our findings, with recommendations on how mining revision histories from a social network analysis perspective can be used to build inferences on change propagation, evaluate impact analysis, and extract cross-programming-language relationships. We performed an empirical analysis on the revision histories of a well-known open-source web application testing system, ‘Selenium’, and report the results.

Keywords: Mining software repository; social network analysis; data mining; version histories; software engineering

I. INTRODUCTION

One of the best practices when committing a change to a revision control system is to make sure that the change reflects a single purpose, such as fixing a bug or adding a new feature. A commit, therefore, is a wrapper for related changes. To supplement this, several researchers have demonstrated that when relevant parts of source code are changed together repeatedly, functional and logical relationships among them might become apparent from an analysis of the source code development history [1] [2]. However, identifying functional and logical relationships alone is inadequate in polyglot programming [3], where multiple languages and modularity are used to leverage the best practices of each paradigm. In such scenarios, identifying cross-language dependencies becomes both critical and challenging, as they are imperceptible and follow no definitive patterns.

We propose a novel method to extract cross-platform and cross-language dependencies by mining software revision histories. The co-occurring changes to files in a single commit are modeled as a “network”, and elements of social network analysis are applied to derive inferences on cross-platform and cross-language dependencies.

Social network analysis (SNA) is the method of mapping and measuring relationships and information flows between information processing entities such as people and organizations. The visualization of the entities and the mapping between them is captured using a “sociogram” [4]. A sociogram is a network of nodes connected by edges. The nodes in the network represent the processing entities; an edge or link shows the relationship or information flow between two nodes. SNA methodology is gaining attention in scientific and technological disciplines. Although traditional social networks deal with people or organizations, they are now being applied in various other fields. This widespread application of SNA is attributed to its innate ability to leverage linkages and build inferences among interacting entities.

In this paper, we propose an exploratory approach to analyze software revision histories by modeling them as a sociogram of source files that are committed or edited together. Social network analysis metrics like betweenness [5], closeness [5], community detection [6], Erdös number [7] and link prediction are used to build inferences on software engineering parameters like estimating cross-language change propagation [8], evaluating impact analysis, predicting change impact [8] and extracting cross-programming-language relationships [9]. We believe that this preliminary analysis provides a new dimension to mining software repositories and to characterizing imperceptible cross-language and cross-platform connections across software artifacts.

II. RELATED WORK
Several researchers have applied data mining and SNA techniques to software repositories to study the dynamics of various artifacts related to software engineering. These techniques have been applied independently to address the challenges of software engineering. Here, we discuss the work reported on mining data from version control repositories and on the application of social network analysis to software engineering.

Ying et al. [1] present an approach that uses association rule mining over version histories to identify rules related to a set of files that have been changed together over a period of time. Zimmermann et al. [10] and Gall et al. [2] perform data mining on CVS history to detect more fine-grained logical coupling between classes, files and functions. A tool called ROSE [11] has been built that mines associations between files and finer-grained entities. These works have two things in common: (1) they perform data mining on version histories, and (2) they extract relationships between source files. All these approaches are necessary and provide useful information on how files in large software repositories are “associated” together. The disadvantage of association rule mining algorithms, however, is that most of them need to be configured before being executed. Often, the user has to provide appropriate values for the parameters in advance, and a non-optimal value leads to the generation of either a huge number of redundant rules or too few rules. Also, visualizing relationships with association rules is non-intuitive and not easily comprehensible.

Madey et al. [12] studied the joint membership of developers in projects, derived inferences on team-member relationships, and showed that large open-source projects form self-organizing social networks. Lopez-Fernandez et al. [13] investigated CVS histories as a developer network and a module network. Zimmermann et al. [14] used social network analysis on dependency information for binaries in Windows 2003 to build prediction models for post-release failures.

Our work is motivated by the work of Zimmermann et al. [11] on mining software version histories to guide software changes. However, our work is aimed not only at guiding changes, but also at guiding changes in a polyglot programming environment and at deriving inferences on non-obvious parameters such as change propagation, impact analysis, and cross-programming-language relationships. To evaluate our hypothesis, we analyzed the version histories of the Selenium software project (http://seleniumhq.org/) as a “social network”, applied advanced social network analysis techniques like community detection, information diffusion, centrality, closure, etc., and evaluated the applicability of these techniques in extracting software engineering parameters.

III. PROPOSED METHODOLOGY
A. Concepts of Social Network Analysis

Social network concepts use specialized jargon and notation borrowed from graph theory. A social network is formally represented as $G = (V, E)$, where $V$ is the vertex set and $E$ is the edge set. There are two types of relationships (often called edges): undirected dyadic relations, which are intrinsically symmetric and have edge sets consisting of unordered pairs of vertices, and directed dyadic relations, which are not inherently symmetric and have edge sets consisting of ordered pairs of vertices. Mere visualization of the data can provide illuminating inferences on many of the basic questions in the study of social networks, but it is not sufficiently precise to serve as an adequate basis for scientific work. Therefore, we need a means of specifying particular structural properties to be examined and of quantifying them in a systematic way. The following are the structural measurements used in this paper.

(1) Degree Centrality ($C_d$): In the case of undirected dyadic relations, it is defined as the size of the neighborhood of the focal vertex. Formally, $C_d(v, G) = |N(v)|$, where $N(v)$ is the neighborhood of $v$. The degree centrality measures the number of partners $|N(v)|$ of the vertex $v$ in a graph $G$. It tends to serve as a proxy for activity and/or involvement in the relation.

(2) Betweenness Centrality ($C_b$): Betweenness quantifies the number of shortest paths from all vertices to all others that pass through a given node. High-betweenness individuals thus tend to act as ‘boundary spanners’, bridging groups which are otherwise distantly connected. Formally, betweenness is defined in the undirected case as $C_b(v) = \sum_{s \neq v \neq t} \sigma_{st}(v) / \sigma_{st}$, where $\sigma_{st}$ is the total number of shortest paths from node $s$ to node $t$ and $\sigma_{st}(v)$ is the number of those paths that pass through $v$.

(3) Eigenvector Centrality ($C_e$): It is used to find the most central nodes in terms of the “global” structure of the network and to pay less attention to patterns that are more “local”. This centrality measure is obtained by applying factor analysis to the adjacency matrix of the graph to identify “dimensions” of the distances among nodes. A detailed formal explanation of eigenvector centrality is given in [15].

(4) Erdös Number: It is used to compute how close other nodes are to a specified node. It is based on the idea of “The Small-World Phenomenon” [7]. Mathematically, the Erdös number is the average count of nodes that lie between the selected node and the other nodes in the network.

(5) Community Detection: Detecting clusters or communities in real-world graphs such as large social networks has been a problem of considerable interest. A “community” is thought of as a group of nodes with better interactions amongst its members than with the remainder of the network. Precise formulations of this optimization problem are known to be computationally intractable, but several algorithms [16] [17] have been proposed to find reasonably good partitions reasonably fast. For convenience and consistency, we use the community detection algorithm proposed by Blondel et al. [17], which has been shown to outperform all other known community detection methods in terms of quality and computation time.

(6) Link Prediction (LP): It is used to predict non-intuitive interactions and collaborations that might happen in a network or a graph in the future. Given a network or a graph at a time $t$, we predict the new edges that will be added to the network or graph in the interval until a later time $t'$ is reached. Various approaches have been proposed to calculate link prediction in both homogeneous and heterogeneous networks that make this intuition more precise.

B. Mining Revision Histories using SNA

In order to appreciate the contribution of social network analysis to the network of source files, we provide a mapping between the essential SNA metrics and the corresponding parameters that they could affect. Table 1 illustrates the mapping between the SNA parameters and their plausible applicability in mining revision history. We treat this mapping as analogous to a null hypothesis, which is tested on the Selenium dataset. We present an empirical study to test our hypothesis and to derive preliminary evidence on whether software engineering metrics, such as the predicted risk of software changes, could be estimated using several SNA metrics.

IV. EMPIRICAL STUDY
In this section, we present the details of the dataset, elaborate on the experiments that were conducted, and discuss the admissible inferences that could be derived from the gathered observations.

A. Dataset details

For our experiments, we used the revision history of Selenium (http://seleniumhq.org/). Data cleaning was performed on these revision histories to retain only the source-code files. The details of the constructed sociogram are provided in Table 2.
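The construction of the sociogram described above can be illustrated with a minimal sketch. It assumes that each revision is already available as a list of changed file paths; the extension list `SOURCE_EXTENSIONS`, the function name `build_cochange_network`, and the toy commits are hypothetical and are not taken from the Selenium dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical list of source-file extensions kept by the cleaning step.
SOURCE_EXTENSIONS = (".java", ".js", ".cs", ".rb", ".py", ".m", ".h")


def build_cochange_network(commits):
    """Return (nodes, edge_weights) for a list of commits.

    Each commit is an iterable of file paths changed in one revision.
    edge_weights maps an unordered file pair to the number of revisions
    in which both files were committed together.
    """
    nodes = set()
    edge_weights = Counter()
    for files in commits:
        # Data cleaning: keep only source-code files, once per commit.
        sources = sorted({f for f in files if f.endswith(SOURCE_EXTENSIONS)})
        nodes.update(sources)
        # Every pair of files in the same revision gets (or strengthens) an edge.
        for a, b in combinations(sources, 2):
            edge_weights[(a, b)] += 1
    return nodes, edge_weights


# Toy revision history (made-up data).
commits = [
    ["a/Driver.java", "a/driver.js", "README.txt"],
    ["a/Driver.java", "a/driver.js"],
    ["a/Driver.java", "b/client.py"],
    ["docs/index.html"],
]
nodes, edge_weights = build_cochange_network(commits)
```

In this sketch, a pair of files that co-occurs in several revisions is stored once with a weight greater than one, which is one possible way to distinguish edges that occur once from edges that occur repeatedly.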
Table 1. Mapping between SNA and Mining Revision Histories parlance

| In SNA | What it means in Mining Revision Histories parlance |
|---|---|
| Degree Centrality | A node (or source file) with high degree centrality maintains numerous co-changes with other source files. It serves as a source or conduit for larger volumes of information exchange and other resource transactions with other nodes. Hence, such nodes could represent critical nodes that might have large dependency. Example: configuration files. |
| Betweenness Centrality | A node with high betweenness controls the flow of information or acts as a service renderer. Such nodes act as bridges between two or more modules. Examples: drivers, adapters, servlets. |
| Eigenvector Centrality | Eigenvector centrality is a way to capture indirect influence. Hence, nodes with high eigenvector centrality represent critical nodes from a global perspective, taking into account the cascaded influences on many other modules. Example: security module. |
| Erdös number | It is used to compute how close other nodes are to a specified node. In SNA, the Erdös number is traditionally used to analyze and estimate collaborators for a given node. In revision history mining parlance, it could be used to provide inferences on change propagation and impact analysis. |
| Community Detection | In the context of a network of revision histories, community structure refers to the occurrence of groups of source files that are more densely connected internally than with the rest of the network. It is likely to provide inferences on strongly dependent modules and on modules with cross-programming-language and cross-platform dependencies. |
| Link prediction | Link prediction helps to predict non-intuitive dependencies between the files in a network of revision histories, i.e., it helps to predict which files may need to be modified, using the information from previous revisions. Using these predictions, a recommender system can be designed to suggest the files that need to be checked when a particular file is modified. |
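The recommender idea in the link prediction row of Table 1 could be sketched as follows. The paper does not state which link predictor is used, so the Jaccard coefficient of the two files' neighborhoods is an assumption made purely for illustration; the function names and the toy network are hypothetical.

```python
from itertools import combinations


def jaccard_link_scores(adjacency):
    """Score every unlinked file pair by neighbourhood overlap (Jaccard)."""
    scores = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already co-changed, nothing to predict
        union = adjacency[u] | adjacency[v]
        if union:
            scores[(u, v)] = len(adjacency[u] & adjacency[v]) / len(union)
    return scores


def recommend(adjacency, changed_file, top_k=3):
    """Files not yet linked to changed_file, ranked by predicted link score."""
    ranked = []
    for (u, v), score in jaccard_link_scores(adjacency).items():
        if score > 0 and changed_file in (u, v):
            ranked.append((score, v if u == changed_file else u))
    # Highest score first; ties broken alphabetically for determinism.
    return [f for _, f in sorted(ranked, key=lambda t: (-t[0], t[1]))[:top_k]]


# Toy undirected co-change network (made-up file names).
adjacency = {
    "Driver.java": {"driver.js", "client.py"},
    "driver.js": {"Driver.java", "client.rb"},
    "client.py": {"Driver.java", "client.rb"},
    "client.rb": {"driver.js", "client.py"},
    "build.xml": set(),
}
suggestions = recommend(adjacency, "Driver.java")
```

Here `client.rb` has never been committed together with `Driver.java`, but both share the same co-change partners, so it is the file a recommender built on this score would suggest checking.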
Table 2. Details of the dataset (Selenium)

Programming languages used: Java, JavaScript, C#, Ruby, Python, Objective-C and Java (Android)

| Property | Value |
|---|---|
| Vertices | 6,167 |
| Unique Edges | 81,765 |
| Edges With Duplicates | 63,266 |
| Connected Components | 147 |
B. Results and Discussions

As a first step, we systematically performed basic SNA to extract measures like degree, betweenness centrality, eigenvector centrality, page rank, modularity index and Erdös number for each node in the sociogram. The frequencies of the centrality measures and page rank are shown in Figure 1, and Table 3 lists the top 5 nodes for each measure. A community detection algorithm [18] was applied to the network of revision history. A total of 179 communities were detected. A visualization of the 10 largest communities with color coding, their corresponding community numbers, and their mapping (which was done manually) to the actual modules (depicted as a block diagram obtained from the wiki documents of project Selenium) are shown in Figure 2.
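As an illustration of the community detection step and of the subsequent check for cross-language communities, a minimal sketch follows. It does not implement the algorithm of Blondel et al.; instead it uses a naive greedy merge that maximizes Newman modularity, which is only practical for toy graphs. The file names, the helper `languages`, and the use of file extensions as a proxy for programming language are assumptions.

```python
import os
from itertools import combinations


def modularity(adjacency, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    two_m = sum(len(nbrs) for nbrs in adjacency.values())  # 2 * edge count
    q = 0.0
    for group in communities:
        # Each internal edge is counted twice here, once per endpoint.
        internal = sum(1 for u in group for v in adjacency[u] if v in group)
        degree_sum = sum(len(adjacency[u]) for u in group)
        q += internal / two_m - (degree_sum / two_m) ** 2
    return q


def greedy_communities(adjacency):
    """Merge communities pairwise while modularity keeps improving."""
    communities = [frozenset([n]) for n in sorted(adjacency)]
    best_q = modularity(adjacency, communities)
    while len(communities) > 1:
        candidates = []
        for a, b in combinations(communities, 2):
            merged = [c for c in communities if c not in (a, b)] + [a | b]
            candidates.append((modularity(adjacency, merged), merged))
        q, merged = max(candidates, key=lambda t: t[0])
        if q <= best_q:
            break  # no merge improves modularity any further
        best_q, communities = q, merged
    return communities


def languages(group):
    """File extensions in a community, used as a proxy for languages."""
    return {os.path.splitext(path)[1] for path in group}


# Toy network: two triangles joined by a single bridge edge (made-up files).
edges = [("a.java", "b.java"), ("a.java", "c.js"), ("b.java", "c.js"),
         ("d.py", "e.rb"), ("d.py", "f.py"), ("e.rb", "f.py"),
         ("c.js", "d.py")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

communities = greedy_communities(adjacency)
# Communities that mix at least two languages (the paper counts three or more).
mixed = [sorted(c) for c in communities if len(languages(c)) >= 2]
```

On the toy network the two triangles are recovered as separate communities, and both mix two file extensions, which is the kind of observation that the manual mapping to modules then has to confirm.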
Figure 1: Details of frequencies of various centrality measures

Table 3. Top-5 files for centrality measure in Selenium

| Cd (Degree) | Cb (Betweenness) | Ce (Eigenvector) |
|---|---|---|
| /remote/server/DriverServlet.java | /mappings/javascript.rb | /htmlunit/HtmlUnitDriver.java |
| /remote/HttpCommandExecutor.java | /scripts/seleniumbrowserbot.js | /browserlaunchers/MacProxyManager.java |
| /htmlunit/HtmlUnitDriver.java | /IEDriver/Generated/atoms.h | /grid/internal/RemoteProxy.java |
| /remote/RemoteWebDriver.java | /SeleniumServerStarter.java | /grid/internal/Registry.java |
| /htmlunit/HtmlUnitWebElement.java | remote/server/DriverServlet.java | /remote/server/DriverServlet.java |
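A ranking such as the one in Table 3 could be produced along the following lines. The sketch covers degree centrality, betweenness centrality (using Brandes' algorithm for unweighted graphs) and hop distances from a seed node as a simple reading of the Erdös number; eigenvector centrality and page rank are omitted for brevity. The toy network and all function names are hypothetical and do not reproduce the Selenium results.

```python
from collections import deque


def betweenness(adjacency):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalised)."""
    cb = dict.fromkeys(adjacency, 0.0)
    for s in adjacency:
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = dict.fromkeys(adjacency, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adjacency, 0.0)
        while stack:  # accumulate dependencies, farthest nodes first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                cb[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in cb.items()}


def erdos_numbers(adjacency, seed):
    """Hop distance from the seed to every node reachable from it."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        v = queue.popleft()
        for w in adjacency[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def top_k(scores, k=5):
    """Names of the k highest-scoring nodes (ties broken alphabetically)."""
    return [n for n, _ in sorted(scores.items(), key=lambda t: (-t[1], t[0]))[:k]]


# Toy co-change network (made-up file names).
adjacency = {
    "core.java": {"ui.js", "api.py", "cli.rb"},
    "ui.js": {"core.java"},
    "api.py": {"core.java"},
    "cli.rb": {"core.java", "docs.rb"},
    "docs.rb": {"cli.rb"},
}
degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
between = betweenness(adjacency)
distances = erdos_numbers(adjacency, "core.java")
```

Nodes that never appear in `distances` are unreachable from the seed, which is one way disconnected modules could show up in such an analysis.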
Figure 2: Visualization of the network and its mapping with the architecture of Selenium

In the visualization, node sizes were made proportional to the betweenness centrality of the node (i.e., the larger nodes in the visualization have a higher betweenness centrality measure). Since the mere application of SNA to the revision histories does not provide a validation criterion, we performed a manual analysis of the dependencies using the project wikis and the design documentation (available at https://code.google.com/p/selenium/w/list). The observations and admissible inferences are tabulated in Table 4.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we performed a study to investigate how mining software revision histories from an SNA perspective could be used to derive software engineering parameters like cross-language change dependencies, change propagation and impact analysis. We performed a preliminary empirical analysis on project Selenium’s revision histories and mapped our observations to a set of inferences by cross-validating the observations with the project wikis and design
Table 4. A table of observations and plausible inferences

| Observations | Inferences |
|---|---|
| Nodes with high degree centrality are the files that belong to the WebDriver module. The Selenium documentation states that WebDriver forms a significant part of Selenium and is used in all 7 modules of Selenium. The other nodes with high degree centrality are the RegistryManagers and CommandLine modules. | (i) A possible inference from the observations is that configuration files or core modules tend to have a high degree centrality. (ii) The impact of modifying a file with high Cd is high, but potential problems due to the change would impact only those modules that are in the near vicinity (same folder, same package) of the proposed changes. |
| Nodes with high betweenness centrality are files belonging to the mapping module. The Selenium documentation refers to these files as the ones that perform buildTargets for the Java, JavaScript, Rake/Ruby and Python modules. The other files that have high betweenness centrality are the Adapter modules, which act as a bridge between multiple modules. | (i) After reading the design document, it seems obvious that these files have high betweenness. An interesting inference, however, is that this metric might be useful in estimating impact analysis and the riskiness of refactoring a certain file (specifically one with high betweenness). |
| The ProxyManager, RequestHandler and TestSuite modules have high eigenvector centrality measures. The documentation refers to these modules as cross-sectional modules that handle multiple segments of Selenium. | (i) Eigenvector centrality is typically capable of identifying files that interact with large modules. We could infer that, if these files also have high betweenness, they possibly form a part of the configuration system, and those with moderate to low betweenness are ideal candidates for test suites. |
| The openqa module has a high page rank. The documentation revealed that these modules deal with testing cross-browser compatibility and other browser issues. | (i) A possible inference is that page rank might be a useful metric to extract details about aspects that are common to most of the other core modules or cross-sectional modules in the software. |
| Of the 179 communities that were detected, 108 belong to a logical partition in the Selenium repository (i.e., they contain only files which are a part of the same folder, package, etc.). | (i) The communities detected are self-organized; they can be used to find logical partitions in a software repository. However, this inference needs further investigation on a wide variety of projects. |
| 51 out of 179 communities contain source files belonging to three or more programming languages. When we investigated the 3 largest communities in detail, we were able to extract the following dependencies. (a) The Android, SeleniumEmulation, RendererModule and linux-webdriver-interaction modules formed a community. (b) The Firefox, Chrome and Safari modules formed their respective communities with the Session, Remote, WebDriver and BrowserLauncher modules. (c) Interestingly, files with the .rb (Ruby) extension formed more communities with .py (Python) files, and files with the .java (Java) extension formed more communities with .cs (C#) files. | (i) Observations (a) and (b) provide positive results towards the plausibility of community detection being a good tool to investigate cross-language change propagation. However, further empirical research has to be performed to confirm this hypothesis. (ii) Observation (c) is interesting. Several programming language sources have pointed out similarities between Ruby and Python and between Java and C# as programming languages. Our observations support this, with files coded in these language pairs co-occurring in a large number of detected communities. However, further investigations have to be performed to analyze the existence of such dependencies in the software repositories of various projects. |
| We chose five random nodes as seeds to calculate the Erdös number and observed that, for any of the seed nodes, the corresponding nodes with the lowest Erdös number (implying high collaboration) had equal contributions from all the modules in project Selenium except the grid module. A detailed study of the documentation showed that the grid module has been deprecated. | The Erdös number has been extensively used in classical SNA to measure “collaborative distance”. Through our observations, we could infer that Erdös numbers could be a useful index, specifically to identify dependencies on deprecated or disconnected modules. |
documents of Selenium. Although we present preliminary work, we believe that these findings may stimulate further research in this area of software engineering and hopefully motivate researchers to employ SNA-based techniques in other areas of mining software repositories. As future work, we aim to perform empirical analysis on diverse software repositories and to study the applicability of this technique at the finer-grained levels of classes and methods, as opposed to source files.

REFERENCES

[1] A. Ying, G. Murphy, R. Ng, and M. Chu-Carroll, "Predicting Source Code Changes by Mining Change History," IEEE Transactions on Software Engineering, vol. 30, no. 9, pp. 574-586, 2004.
[2] H. Gall, M. Jazayeri, and J. Krajewski, "CVS release history data for detecting logical couplings," in International Workshop on Principles of Software Evolution, Helsinki, 2003, pp. 13-23.
[3] B. Meyer, "Polyglot Programming," Software Development, vol. 10, no. 5, pp. 68-71, 2002.
[4] N. Katz, D. Lazer, H. Arrow, and N. Contractor, "Network Theory and Small Groups," Small Group Research, vol. 35, no. 3, pp. 307-332, June 2004.
[5] L. C. Freeman, "Centrality in social networks conceptual clarification," Social Networks, pp. 215-239, 1978.
[6] F. Radicchi et al., "Defining and identifying communities in networks," Proceedings of the National Academy of Sciences of the United States of America, 2004, pp. 2658-2663.
[7] D. Watts and S. Strogatz, "Collective dynamics of small-world networks," Nature, vol. 393, 1998.
[8] K. H. Bennett and V. T. Rajlich, "Software maintenance and evolution: a roadmap," in Conference on The Future of Software Engineering, USA, 2000, pp. 73-87.
[9] D. L. Moise and K. Wong, "Extracting and representing cross-language dependencies in diverse software systems," in Working Conference on Reverse Engineering, 2005, pp. 7-11.
[10] T. Zimmermann, S. Diehl, and A. Zeller, "How history justifies system architecture (or not)," in International Workshop on Principles of Software Evolution, Helsinki, 2003, pp. 73-83.
[11] T. Zimmermann, A. Zeller, P. Weissgerber, and S. Diehl, "Mining Version Histories to Guide Software Changes," IEEE Transactions on Software Engineering, vol. 31, pp. 429-445, 2005.
[12] G. Madey, V. Freeh, and R. Tynan, "The open source software development phenomenon: An analysis based on social network theory," in Americas Conference on Information Systems, 2002, pp. 1806-1813.
[13] L. Lopez-Fernandez, G. Robles, and J. M. Gonzalez-Barahona, "Applying Social Network Analysis to the Information in CVS Repositories," in International Workshop on Mining Software Repositories, 2004, pp. 101-105.
[14] T. Zimmermann and N. Nagappan, "Predicting defects using network analysis on dependency graphs," in International Conference on Software Engineering, 2008, pp. 531-540.
[15] W. D. Richards and A. J. Seary, "Eigen analysis of networks," Journal of Social Structure, vol. 1, no. 1, 2000.
[16] M. Girvan and M. Newman, "Community Structure in Social and Biological Networks," Proceedings of the National Academy of Sciences of the United States of America, 2002, pp. 7821-7826.
[17] V. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, vol. 10, pp. 1-12, 2008.