Source Code Repositories and Agile Methods

Alberto Sillitti and Giancarlo Succi
Free University of Bozen
{Alberto.Sillitti,Giancarlo.Succi}@unibz.it

Abstract. Source repositories are a promising database of information about software projects. This paper proposes a tool to extract and summarize information from CVS logs in order to identify whether there are differences in the development approach of Agile and non-Agile teams. The tool aims to improve empirical investigation of the Agile Methods (AMs) without affecting the way developers write code. There are many claims about the benefits of AMs; however, these claims are seldom supported by empirical analysis. Configuration management systems contain a huge amount of quantitative data about a project. The retrieval and part of the analysis can be automated in order to get useful insights about the status and the evolution of the project. However, this task poses formidable challenges because the data source is not designed as a measurement tool. This paper proposes a tool for extracting and summarizing information from CVS (Concurrent Versions System) repositories and a set of analyses that can be useful to identify common or different behaviors.
H. Baumeister et al. (Eds.): XP 2005, LNCS 3556, pp. 243–246, 2005. © Springer-Verlag Berlin Heidelberg 2005

1 Introduction

The main purpose of version control systems is to collect and manage source code effectively when several people modify the same set of files. However, such systems collect a huge amount of information that is not limited to source code. Version control systems store precise information regarding the files created and keep track of the modifications introduced, storing both timestamps and user names. Such data are process data that can be useful to study the development process of an Agile team. Moreover, analyzing code repositories presents several advantages, such as:

1. Data collection is non-invasive: It does not require effort from the developers [2].
2. Data are "for free": All development teams are already using a version control system, therefore they already have a database that can be analyzed.
3. Plenty of data available: It is possible to analyze projects that are already finished or at any point in the development process. Data collection does not need to start with the project.

However, there are some drawbacks as well:

1. It is not possible to collect data regarding the effort spent developing a piece of code.
2. It is not possible to trace code that is developed but never checked into the repository (spikes, code developed and then deleted, etc.).

This paper presents CodeMart (CM), a tool for retrieving and analyzing data related to the software development process stored inside version control systems. The paper is organized as follows: section 2 discusses the problems of analyzing software repositories; section 3 presents the architecture of CodeMart; section 4 proposes a set of analyses for data stored in code repositories; finally, section 5 draws the conclusions.
2 Analyzing Software Repositories

Source code repositories store useful process information, but they do not collect any data describing the nature of the modifications that are stored. Systems such as CVS ask the developer to insert a short comment before checking in the code, but this approach presents two problems:

1. Most of the time, developers do not insert any comment.
2. If a comment is inserted, it is free text, which is hard for an automated system to interpret.

In order to overcome this limitation, we have added a classification mechanism to our data extraction tool [3]. A simple classification identifies three main types of modifications that developers can introduce in source code (Table 1).

Table 1. Classification of code modifications

Type                          Code Identifier
Comment                       Any changes in the code comments
Structural modification       Any changes in the code affecting execution paths (statements such as: if-then-else, for, do-while, switch, etc.)
Non-structural modification   Any changes in the code not affecting execution paths (any statement other than those included in the previous type)
Comment modifications include all changes that affect source code comments and do not modify any executable instructions. Non-structural modifications include modifications of the source code instructions that do not change any execution path of the program (all function calls and all instructions except the flow control ones: if-then-else, for, do-while, switch, etc.). Structural modifications include modifications of the source code that change the execution paths of the program.
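The three types of Table 1 can be illustrated with a minimal sketch. It is not the classifier used in CodeMart, which relies on parsing the source code; it is a keyword heuristic that assumes Java or C-style syntax, and the names classify_line, FLOW_CONTROL, and COMMENT_ONLY are hypothetical. Keywords inside string literals and multi-line statements are not handled.

```python
import re

# Flow-control keywords suggested by Table 1 ("case" is an assumed addition).
FLOW_CONTROL = re.compile(r"\b(if|else|for|do|while|switch|case)\b")
# Lines that open a comment or continue a block comment (Java/C-style, assumed).
COMMENT_ONLY = re.compile(r"^\s*(//|/\*|\*)")


def classify_line(line: str) -> str:
    """Return 'comment', 'structural', or 'non-structural' for one changed line."""
    if COMMENT_ONLY.match(line):
        return "comment"
    # Drop a trailing // comment so that keywords inside it are ignored.
    code = line.split("//", 1)[0]
    if FLOW_CONTROL.search(code):
        return "structural"
    return "non-structural"
```

A real classifier would compare the parsed old and new versions of a file rather than single lines, since a change can touch an execution path without adding or removing a keyword on the modified line.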
3 Architecture of CodeMart

CodeMart is a tool for the collection and analysis of data stored in version control systems. It is able to connect to software repositories and analyze the structure of the code and the sequences of operations performed by all the developers [1, 4]. The system includes:

• Data extractor: It accesses a version control system, extracts all available data, parses the source code, and finally stores both the raw collected data and the processed data into the data warehouse system;
• Data analyzer: It queries the data warehouse, collects the answers, performs the analysis, and shows the results to the user through dynamically generated web pages, displaying data in different ways according to user preferences.
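As an illustration of the first step of a data extractor, the sketch below parses the text printed by the cvs log command into one record per revision. It is an assumption-laden example, not CodeMart's code: it assumes the older slash-separated date format (newer CVS releases print dates differently), it keeps any "branches:" line as part of the message, and the names Revision, parse_cvs_log, and SAMPLE_LOG, as well as the sample data, are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Revision:
    file: Optional[str]
    number: str
    author: str
    timestamp: datetime
    message: str


FILE_RE = re.compile(r"^Working file: (.+)$")
REV_RE = re.compile(r"^revision (\d+(?:\.\d+)+)$")
META_RE = re.compile(
    r"^date: (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});\s+author: ([^;]+);"
)


def parse_cvs_log(text: str) -> List[Revision]:
    """Extract file, revision number, author, timestamp, and message."""
    revisions: List[Revision] = []
    current_file: Optional[str] = None
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        file_match = FILE_RE.match(lines[i])
        if file_match:
            current_file = file_match.group(1)
        rev_match = REV_RE.match(lines[i])
        meta = META_RE.match(lines[i + 1]) if rev_match and i + 1 < len(lines) else None
        if rev_match and meta:
            # The message runs until the next separator line.
            j = i + 2
            message: List[str] = []
            while j < len(lines) and not lines[j].startswith(("-----", "=====")):
                message.append(lines[j])
                j += 1
            revisions.append(Revision(
                file=current_file,
                number=rev_match.group(1),
                author=meta.group(2),
                timestamp=datetime.strptime(meta.group(1), "%Y/%m/%d %H:%M:%S"),
                message="\n".join(message).strip(),
            ))
            i = j
            continue
        i += 1
    return revisions


# Made-up sample in the shape of cvs log output.
SAMPLE_LOG = """\
RCS file: /cvsroot/demo/src/Main.java,v
Working file: src/Main.java
head: 1.2
description:
----------------------------
revision 1.2
date: 2005/01/12 09:30:00;  author: alice;  state: Exp;  lines: +4 -1
Handle empty input
----------------------------
revision 1.1
date: 2005/01/10 16:05:42;  author: bob;  state: Exp;
Initial revision
=============================================================================
"""
```

Calling parse_cvs_log(SAMPLE_LOG) yields two records, newest first, which could then be stored alongside the classification of each modification.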
4 Sequence Analysis

In order to identify interesting development patterns, we propose to use a behavioral analysis of the developers over time. In particular, we have applied the concepts of the gamma analysis [5] used in the social sciences. In this kind of analysis, a set of phases of the model is identified, then the sequence of the phases is analyzed. Phases are associated with the classification of the modifications described in Table 1. The gamma analysis describes the order of the phases in the sequence and provides a measure of their overlap. It is based on the gamma score, defined as follows:

$$\gamma(A,B) = \frac{P - Q}{P + Q}$$

where $P$ is the number of A-phases preceding the B-phases and $Q$ is the number of A-phases following the B-phases. The $\gamma$ calculated in this way is symmetric and varies between -1 and +1. If $\gamma(A,B) < 0$, the A-phases follow the B-phases; if $\gamma(A,B) > 0$, the A-phases precede the B-phases; finally, if $\gamma(A,B) = 0$, the A-phases and the B-phases are independent.

The gamma score is used to calculate the precedence score and the separation score. The precedence score is defined as follows:

$$\gamma_A = \frac{1}{N} \sum_i \gamma(A,i)$$

where $N$ is the number of phases and $\gamma(A,i)$ is the gamma score calculated between the phases $A$ and $i$. This score varies between -1 and +1. The separation score is defined as follows:

$$s_A = \frac{1}{N} \sum_i \left| \gamma(A,i) \right|$$

This score varies between 0 and +1. A separation score of 0 means that the phases are independent, while +1 means that there is a separation among the phases.

The values of $\gamma$ and $s$ are calculated for each file in the projects considered. Then, their values for the whole project are calculated as a weighted average as follows:

$$\gamma = \frac{\sum_i v_i \gamma_i}{\sum_i v_i}, \qquad s = \frac{\sum_i v_i s_i}{\sum_i v_i}$$

where $\gamma_i$, $s_i$, and $v_i$ are the precedence score, the separation score, and the number of versions of the file $i$.
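The scores of this section can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the implementation used in CodeMart: P and Q are counted over ordered pairs of phases in the modification history of one file, the gamma score of a phase with itself (or of two phases that never co-occur) is taken as 0, and the separation score is the mean of the absolute gamma scores, which is what its stated range of 0 to +1 implies. All function names are hypothetical.

```python
from typing import Sequence


def gamma_score(seq: Sequence[str], a: str, b: str) -> float:
    """Gamma score (P - Q) / (P + Q) between phases a and b in one sequence."""
    p = q = 0            # pairs with a before b, and pairs with a after b
    a_seen = b_seen = 0  # occurrences of each phase seen so far
    for phase in seq:
        if phase == a:
            q += b_seen  # this a-phase follows every b-phase seen so far
            a_seen += 1
        elif phase == b:
            p += a_seen  # every a-phase seen so far precedes this b-phase
            b_seen += 1
    return (p - q) / (p + q) if (p + q) else 0.0


def precedence_score(seq: Sequence[str], a: str, phases: Sequence[str]) -> float:
    """Mean gamma score of phase a against all N phases."""
    return sum(gamma_score(seq, a, i) for i in phases) / len(phases)


def separation_score(seq: Sequence[str], a: str, phases: Sequence[str]) -> float:
    """Mean absolute gamma score of phase a against all N phases."""
    return sum(abs(gamma_score(seq, a, i)) for i in phases) / len(phases)


def weighted_average(scores: Sequence[float], versions: Sequence[int]) -> float:
    """Project-level score: per-file scores weighted by number of versions."""
    return sum(v * s for v, s in zip(versions, scores)) / sum(versions)


if __name__ == "__main__":
    phases = ["comment", "structural", "non-structural"]
    # Hypothetical history of one file, one label per checked-in version.
    history = ["comment", "comment", "structural", "non-structural"]
    for phase in phases:
        print(phase,
              precedence_score(history, phase, phases),
              separation_score(history, phase, phases))
```

In this made-up history every comment modification comes before the structural one, so the gamma score between the two phases is +1; interleaving the phases moves the score toward 0.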
5 Conclusions and Future Work

The system presented in this paper is the first implementation of an automated system for the identification of relevant sequences in source code repositories. Its goal is to identify whether there are differences in the development patterns of Agile and traditional developers. The work presented is only an exploratory phase of the analysis of sequence patterns in software development.
References

1. R. Cooley, B. Mobasher, J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.
2. P.M. Johnson, "You can't even ask them to push a button: Toward ubiquitous, developer-centric, empirical software engineering", The NSF Workshop for New Visions for Software Design and Productivity: Research and Applications, Nashville, TN, USA, December 2001.
3. S.J. Metsker, "Building Parsers with Java", Addison-Wesley, 2001.
4. J. Myllymaki, J. Jackson, "Web-based data mining, Automatically extract information with HTML, XML, and Java", IBM developerWorks, http://www-106.ibm.com/developerworks/web/library/wa-wbdm/?dwzone=web
5. D.C. Pelz, "Innovation Complexity and Sequence of Innovating Strategies", Knowledge: Creation, Diffusion, Utilization, Vol. 6, 1985, pp. 261-291.