
Code Churn Estimation Using Organisational and Code Metrics: An Exploratory Comparison

Siim Karus, Marlon Dumas

Institute of Computer Science, University of Tartu, Estonia

Abstract

Context: Source code revision control systems contain vast amounts of data that can be exploited for various purposes. For example, the data can be used as a basis for estimating future code maintenance effort in order to plan software maintenance activities. Previous work has extensively studied the use of metrics extracted from object-oriented source code to estimate future coding effort. In comparison, the use of other types of metrics for this purpose has received significantly less attention.

Objective: This paper applies machine learning techniques to unveil predictors of yearly cumulative code churn of software projects on the basis of metrics extracted from revision control systems.

Method: The study is based on a collection of object-oriented code metrics, XML code metrics, and organisational metrics. Several models are constructed with different subsets of these metrics. The predictive power of these models is analysed based on a dataset extracted from 8 open-source projects.

Results: The study shows that a code churn estimation model built purely with organisational metrics is superior to one built purely with code metrics. However, a combined model provides the highest predictive power.

Conclusion: The results suggest that code metrics in general, and XML metrics in particular, are complementary to organisational metrics for the purpose of estimating code churn.

Keywords: XML, XSLT, code metrics, organisational metrics, code churn estimation, software maintenance

Email addresses: [email protected] (Siim Karus), [email protected] (Marlon Dumas)

Preprint submitted to Information and Software Technology

August 2, 2011

1. Introduction

Accurately estimating the future code maintenance effort of a software system is one of the keystones of software project planning [1]. Over the past decades, significant attention has been paid to estimating code maintenance effort based on object-oriented code metrics [2, 3, 4]. In the meantime, the eXtensible Markup Language (XML) has grown into a ubiquitous language in contemporary software projects. In a separate study [5], we found that in the context of open-source software development, XML files frequently co-evolve with other types of files: about 20% of changes to non-XML files were accompanied by changes to XML files. Furthermore, the widespread adoption of revision control systems in software projects has made it possible to readily extract a wealth of metrics pertaining to organisational aspects of software projects. These trends have opened the possibility of building more accurate models for code maintenance prediction by making use of a wider set of metrics than traditional object-oriented code metrics. Accordingly, this paper studies the relative performance and potential complementarity of different families of metrics in the context of code maintenance effort estimation.

A common indirect measure of code maintenance effort, which we adopt in this paper, is code churn: the sum of the number of lines of code added, modified and deleted between two revisions of a software module [6]. In addition to providing an indicator of code maintenance effort, code churn has also been shown to be correlated with software defects [7, 8].

The present study focuses on estimating long-term code churn. Specifically, the dependent variable of our study is the cumulative yearly code churn of a project: the sum of the cumulative yearly code churn of all (non-binary) files in a project, where the cumulative code churn of a file is the sum of the code churn of the file across all its revisions in a given 12-month period. This paper specifically addresses the following research question: What is the relative performance of models built to estimate the yearly cumulative code churn of a project based on:

1. XML/XSLT code metrics
2. Imperative/object-oriented programming language metrics
3. Organisational metrics
4. Organisational metrics and code metrics combined

Given these questions, we have a choice between hypothesising that certain relations exist between a selected set of input metrics and code churn, or uncovering such relations in an exploratory manner. In the first approach, we would start with a set of hypotheses and use statistical conformance testing to validate these hypotheses on the chosen dataset. However, as mentioned above, we are not aware of previous studies on possible relations between XML metrics and code churn. Hence, there is little basis for formulating a priori hypotheses about such relations. On the other hand, there is a wealth of data available from open-source repositories that can be leveraged to uncover such relations. Accordingly, we adopt a bottom-up approach based on data mining and exploratory data analysis. The adopted data mining approach comprises the following steps:

1. Data pre-processing: choice of prediction targets and proposition of input features (attributes that might influence the value we need to predict), data gathering, normalisation, and cleansing.
2. Learning: choice of data mining algorithms and application of these algorithms.
3. Results validation: evaluation of model fit using standard statistical techniques.

This data mining approach allows us to identify interference between input features and the prediction target, and in doing so uncovers the existence of a predictive model. However, the data mining approach itself does not allow us to explain the cause of the interference. To compensate for this shortcoming, exploratory data analysis was used in this study to gain an understanding of the models created by the data mining algorithms. Compared to a conformance testing approach, data mining and exploratory data analysis offer the benefit of not requiring an a priori model to test. The aim of exploratory data analysis is to propose models that can then be conformance-tested. Moreover, data mining and exploratory data analysis can uncover non-intuitive relations. In fact, the results of this study show, among other things, that there are no generally-applicable straightforward models to estimate coding effort, that is, linear models based on one or a very small number of interactions between input features. The models with better predictive power uncovered in the study involve a non-trivial number of interactions between input features.

Project    XML file ratio   XML LOC ratio   XML files   Avg. XML LOC   Years covered   Address
Commons    22%              30%             695         232            3               http://www.wso2.org/
Dia        16%              2%              124         350            11              http://www.gnome.org/projects/dia/
Docbook    88%              54%             2733        514            8               http://docbook.sourceforge.net/
Esb        52%              31%             381         257            3               http://www.wso2.org/
eXist      12%              7%              453         371            7               http://exist.sourceforge.net/
Groovy     6%               4%              168         327            7               http://groovy.codehaus.org/
Valgrind   3%               2%              43          928            2               http://valgrind.org/
Wsas       38%              34%             341         286            3               http://www.wso2.org/

Table 1: Projects whose source code repositories were studied.

The paper is structured as follows. Sections 2 and 3 introduce the dataset used in the study, the features extracted for training, and the training algorithms and process employed. Next, evaluation results are presented in Section 4. In Section 5, related work is discussed. Finally, conclusions and directions for future work are presented in Section 6.

2. Dataset

Eight open-source software project repositories were mined for data (see Table 1). These projects were chosen so that they would represent different types of software products: WSO2 (Commons, Esb, Wsas) is an enterprise application platform; Docbook is a documentation formatting tool; eXist is a database management system; Dia is a drawing tool; Groovy and Valgrind are software development tools. The projects also differed in terms of age. WSO2 is a relatively young project with only three years of version data available; Valgrind is a short-term project (commits from only two to three years); Docbook and Dia are long-term projects (8 and 11 years); the remaining projects (eXist and Groovy) have about seven years of commit data.

A total of 31,636 project revisions were extracted from these projects. About one quarter of the XML files in these projects were XSL files and one twelfth were IDE, Maven or Ant files. The rest were project-specific or otherwise minor languages. The distribution of XML-based files is shown in Figure 1.

Figure 1: Distribution of XML-based languages.
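For illustration, revision metadata of this kind can be extracted from a Subversion repository by parsing the XML output of svn log. This is a minimal sketch only: the choice of Subversion and the selection of fields are assumptions made for illustration, not a description of the extraction pipeline actually used in the study.

    import subprocess
    import xml.etree.ElementTree as ET

    def extract_revisions(repo_url):
        # 'svn log --xml --verbose' emits one <logentry> element per revision,
        # carrying the author, the timestamp and the list of changed paths.
        log_xml = subprocess.run(
            ['svn', 'log', '--xml', '--verbose', repo_url],
            capture_output=True, text=True, check=True).stdout
        for entry in ET.fromstring(log_xml).iter('logentry'):
            yield {
                'revision': int(entry.get('revision')),
                'author': entry.findtext('author'),
                'date': entry.findtext('date'),
                'paths': [p.text for p in entry.iter('path')],
            }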

Code churn between two file revisions is defined as LOC modified + LOC added + LOC deleted between the newer and the older revision. During data collection, the code churn of each file revision was calculated based on GNU diff output.

The purpose of the study is to predict code churn over long periods of time. A common reference period for long-term code churn prediction is a year [9, 3]. Accordingly, we measured the (cumulative) yearly code churn, which is defined as the sum of the code churn of all file revisions during a year of observation. In the dataset of this study, cumulative yearly code churn ranged from 600 to 2,068,100 LOC (average 1,318,900 LOC).

At the end of the data collection phase, the resulting dataset consisted of project revisions, each one with a timestamp, the contents of the files in the revision and their associated metadata (e.g. committer). For each project revision, we calculated the project's cumulative code churn during the 12-month period from the revision's timestamp. Of course, this means we could not use project revisions for which we did not have data about their 12-month churn (i.e. file revisions with timestamps later than the date of data collection minus 12 months). The dataset was collected during June 2009.
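As an illustration of this measure, the sketch below derives the three churn components from the change commands of GNU diff's default (normal) output format: lines of the form "12a13,15" (add), "5,7d4" (delete) and "20,22c20,21" (change). The paper does not state how changed ranges of unequal length are apportioned; treating the paired lines as modified and the surplus as added or deleted is our assumption.

    import re
    import subprocess

    CHANGE_CMD = re.compile(r'^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$')

    def _span(first, last):
        # number of lines in a range such as '20,22' (or a single line '12')
        return (int(last) if last else int(first)) - int(first) + 1

    def code_churn(old_file, new_file):
        # GNU diff exits with status 1 when the files differ, so no check=True
        out = subprocess.run(['diff', old_file, new_file],
                             capture_output=True, text=True).stdout
        added = deleted = modified = 0
        for line in out.splitlines():
            m = CHANGE_CMD.match(line)
            if not m:
                continue  # skip the '<', '>' and '---' payload lines
            l1, l2, op, r1, r2 = m.groups()
            if op == 'a':
                added += _span(r1, r2)
            elif op == 'd':
                deleted += _span(l1, l2)
            else:  # 'c': paired lines are modified, the surplus added/deleted
                left, right = _span(l1, l2), _span(r1, r2)
                modified += min(left, right)
                added += max(right - left, 0)
                deleted += max(left - right, 0)
        return added + modified + deleted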

3. Features And Training Algorithms

According to the research questions, we are interested in building models to predict code churn based on imperative/object-oriented code metrics, XML/XSL code metrics, and organisational metrics. Accordingly, we identified a set of metrics covering all these categories (see Table 2). The following sub-sections explain each category of metrics in turn.

3.1. Object-oriented metrics

Among all possible imperative/object-oriented programming languages, this study considers C, C++ and Java, since these were the most represented ones in the dataset. Given these languages, we selected fifteen metrics divided into three categories:

• Language-independent metrics found in effort estimation models (metrics 1-5 in the first column of Table 2). It should be noted that in the first column of Table 2, the number of files refers to the number of C, C++ and Java files. The number of "all files" is included among the organisational metrics.
• Metrics defined in the context of object-oriented languages (e.g. [3, 4, 2]) (metrics 6-10 in the first column of Table 2). Since we deal with C/C++ files, we also counted the total number of functions. Also, in addition to counting classes, we counted C/C++ structure definitions.
• Structure and complexity metrics for procedural programming languages (metrics 11-15 in the first column of Table 2), which are typically associated with code maintenance effort.

The values of these metrics were collected using SourceMonitor (http://www.campwoodsw.com/sourcemonitor.html).

3.2. XML Metrics

We selected a set of XML metrics in such a way that it mirrors as closely as possible the set of OO metrics discussed earlier. The concept of "statement" in OO programming languages maps to nodes in XML. XML is composed of five different types of nodes: elements, attributes, processing instructions, comments and text. The selected set of XML metrics includes the total number of nodes (excluding text nodes) and the total number of nodes (including text nodes). The number of text nodes is not included because it is linearly derivable from the total number of nodes and the counts of the other four types of nodes. The machine learning algorithms employed test linear relations between input features. Thus, the number of text nodes is implicitly used by the models.
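A sketch of how such node-level metrics can be computed is shown below, using the lxml library (which, unlike the standard library parser, exposes comment and processing-instruction nodes during iteration). The function and key names are ours; the sketch mirrors the metric definitions rather than the tooling used in the study.

    from lxml import etree

    def xml_metrics(path):
        root = etree.parse(path).getroot()
        elements = attributes = comments = pis = 0
        depths = []
        for node in root.iter():
            if isinstance(node.tag, str):            # a regular element
                elements += 1
                attributes += len(node.attrib)
                depths.append(sum(1 for _ in node.iterancestors()) + 1)
            elif node.tag is etree.Comment:
                comments += 1
            elif node.tag is etree.ProcessingInstruction:
                pis += 1
        total = elements + attributes + comments + pis   # text nodes excluded
        return {
            'nodes_excluding_text': total,
            'elements': elements,
            'attributes': attributes,
            'processing_instructions': pis,
            'comment_node_pct': 100.0 * comments / total if total else 0.0,
            'first_level_elements': sum(1 for c in root if isinstance(c.tag, str)),
            'max_tree_depth': max(depths, default=0),
            'avg_tree_depth': sum(depths) / len(depths) if depths else 0.0,
        }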

C/C++/Java Metrics:
1. # of files
2. Lines of code (LOC)
3. # of statements
4. # of lines with comments
5. % of lines with comments
6. # of functions
7. # of types (classes, interfaces, structures)
8. Avg. # of methods per class
9. Avg. # of statements per method
10. Avg. # of calls per method
11. % of branching statements
12. Max. block depth
13. Avg. block depth
14. Max. complexity
15. Avg. complexity

XML Metrics:
1. # of files
2. Lines of code (LOC)
3. # of XML nodes
4. # of XML elements
5. # of XML attributes
6. # of XML processing instructions
7. # of lines with comments
8. % of XML comment nodes
9. Avg. # of first-level elements
10. Maximum XML tree depth
11. Average XML tree depth

XSL Metrics:
1. Avg. # of templates
2. Avg. # of elements and attributes per template
3. Avg. # of 'call-template' and 'apply-templates' per template
4. # of simple test expressions
5. # of simple select expressions
6. # of simple match expressions
7. # of complex test expressions
8. # of complex select expressions
9. # of complex match expressions
10. Ratio of complex expressions to all expressions
11. % of branching elements
12. Avg. complexity
13. Max. complexity

Table 2: Code metrics used in the study.

As XML comments are blocks (not line-based), we counted comment nodes instead of counting lines with XML comments. Accordingly, the percentage of comments is calculated with respect to the number of nodes instead of the number of lines. In XML, first-level elements (children of the root element) are commonly the main structural entities of the document. This corresponds to the number of types in object-oriented languages. XML tree depth is similar to block depth in imperative languages. The remaining OO metrics did not have a natural equivalent in the context of XML, although some have an equivalent in the context of XSL, as discussed below.

3.3. XSL Metrics

In this study, we consider XSLT version 1.0. The latest version of XSLT is 2.0, but we found that the features added by XSLT 2.0 on top of XSLT 1.0 are still not widely used in practice. As a result, we could not find enough data to evaluate XSLT 2.0 specific features. XSLT is an XML-based language, meaning that every XSLT file is an XML file. This property allows us to apply the same metrics defined for XML files to XSLT files. Additionally, the defined semantics of XSL allow us to better align object-oriented metrics with XSLT features. For example, XSLT templates, which determine the execution flow, are similar to functions and methods in imperative languages. Thus we can construct the metrics "Average number of elements and attributes per template" and "Average number of 'call-template' and 'apply-templates' per template" as counterparts to the "Statements per method" and "Calls per method" object-oriented code metrics.

XSLT uses XPath expressions for matching, selecting and testing input data. These expressions can only be present in the XSLT attributes "match", "select" and "test", and in attributes outside the XSLT namespace enclosed between curly braces (these are called "inline expressions"). Inline XPath expressions can only be used for selecting data. As these expressions mark decision points in XSLT, counting them gives us information on the flow of the execution of transformation rules. It is important to differentiate between two types of XPath expressions that may appear in the "match", "select" and "test" attributes of XSLT templates: (i) simple expressions identifying specific elements or attributes by their name and namespace; and (ii) complex expressions that can match several nodes in the input document. Complex expressions contain wildcards or function calls, while simple expressions do not. The number of complex expressions is likely to give an indication of the complexity of the transformation. Accordingly, Table 2 includes 7 metrics based on the distinction between complex and simple expressions (metrics 4-10 in the XSL column). Function calls are detected by looking for brackets in the expressions.

In XSLT, each template must be applied to all source document nodes that match the XPath expression specified in the template's "select" attribute (assuming the presence of an 'apply-templates' instruction). This effectively means that every XPath expression is a decision point, and calculating cyclomatic complexity comes down to counting the number of match and test expressions and elements in the transformation. This observation was used to calculate XSL metrics 11-13.
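The following sketch classifies the expressions of an XSLT file according to the rules above (an expression is complex if it contains a wildcard or a bracketed function call; inline expressions are taken from curly braces in attributes outside the XSLT namespace). The exact tokenisation used by the authors is not given, so the regular expression and the complexity test are assumptions.

    import re
    from lxml import etree

    XSLT_NS = '{http://www.w3.org/1999/XSL/Transform}'

    def is_complex(expr):
        # per the paper: complex expressions contain wildcards or function
        # calls; function calls are detected by looking for brackets
        return '(' in expr or '*' in expr

    def expression_metrics(path):
        counts = {f'{kind}_{attr}': 0 for kind in ('simple', 'complex')
                  for attr in ('match', 'select', 'test')}
        for el in etree.parse(path).getroot().iter():
            if not isinstance(el.tag, str):
                continue  # skip comments and processing instructions
            for attr, value in el.attrib.items():
                if el.tag.startswith(XSLT_NS) and attr in ('match', 'select', 'test'):
                    kind = 'complex' if is_complex(value) else 'simple'
                    counts[f'{kind}_{attr}'] += 1
                else:
                    # inline XPath expressions: '{...}' in non-XSLT attributes;
                    # these can only be used for selecting data
                    for expr in re.findall(r'\{([^}]*)\}', value):
                        kind = 'complex' if is_complex(expr) else 'simple'
                        counts[f'{kind}_select'] += 1
        total = sum(counts.values())
        complex_count = sum(v for k, v in counts.items() if k.startswith('complex'))
        counts['complex_ratio'] = complex_count / total if total else 0.0
        return counts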

3.4. Organisational Metrics

The organisational metrics included in the study (Table 3) were selected with the aim of covering five aspects:

• The total size of the project team, measured in terms of the number of active developers in the project (metrics 1-5)
• The current size of the project team, measured in terms of the number of developers as of the date of the revision (metrics 6-10)
• The previous activity of the committing developer, measured in terms of numbers of commits (metrics 11-15)
• The previous activity of all developers, measured in terms of average commits per developer (metrics 16-20)
• The project revision's size, measured in terms of number of files (metrics 22-26)

Organisational metrics:
1. Total number of developers (active with at least one commit) on project as of the date of data collection
2. Total number of XML developers (active with at least one commit to an XML file) on project as of the date of data collection
3. Total number of XSL developers (active with at least one commit to an XSL file) on project as of the date of data collection
4. Total number of imperative language (C/C++/C#/Java) developers (active with at least one commit to a .h, .c, .cpp, .cxx, .hpp, or .java file) on project as of the date of data collection
5. Total number of object-oriented language (C++/C#/Java) developers (active with at least one commit to a .cpp, .cxx, .hpp, or .java file) on project as of the date of data collection
6. Number of developers on project to revision date (developers who have made at least one commit prior to the date of the file revision)
7. Number of XML developers on project to revision date
8. Number of XSL developers on project to revision date
9. Number of imperative language developers on project to revision date
10. Number of object-oriented language developers on project to revision date
11. Number of previous commits by committing developer
12. Number of previous commits to XML files by committing developer
13. Number of previous commits to XSL files by committing developer
14. Number of previous commits to imperative language files by committing developer
15. Number of previous commits to object-oriented language files by committing developer
16. Avg. number of previous commits per developer
17. Avg. number of previous commits to XML files per developer
18. Avg. number of previous commits to XSL files per developer
19. Avg. number of previous commits to imperative language files per developer
20. Avg. number of previous commits to object-oriented language files per developer
21. Revision number
22. Total number of files in revision
23. Total number of XML files in revision
24. Total number of XSL files in revision
25. Total number of imperative language files in revision
26. Total number of object-oriented language files in revision

Table 3: Organisational metrics used in the study.
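Most of these metrics can be derived in a single pass over the chronologically ordered commit log. The sketch below computes two representative metrics (metrics 6 and 11 in Table 3); the commit-record layout matches the hypothetical extract_revisions() output shown earlier, and the per-file-type variants would simply filter the changed paths by extension.

    from collections import defaultdict

    def organisational_features(commits):
        # commits: chronologically ordered dicts with 'revision' and 'author'
        seen_developers = set()
        commits_by_developer = defaultdict(int)
        rows = []
        for commit in commits:
            author = commit['author']
            rows.append({
                'revision': commit['revision'],
                # developers with at least one commit prior to this revision
                'developers_to_date': len(seen_developers),
                'prev_commits_by_committer': commits_by_developer[author],
            })
            seen_developers.add(author)
            commits_by_developer[author] += 1
        return rows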


For each of these aspects, we included five metrics: one per file type (XML, XSL, OO, imperative) and one for the total. In addition, we included the revision number metric (metric 21), which captures the previous number of commits to the project.

3.5. Training Algorithms

The dataset was analysed using Microsoft SQL Server 2008 Analysis Services (SSAS). SQL Server automatically applied additional feature selection when it was considered necessary. The feature selection algorithm used by SQL Server is Bayesian Dirichlet with uniform prior [10]. Two learning algorithms were used for training the code churn prediction models: a neural networks algorithm (a Back-Propagated Delta Rule network composed of three layers of perceptrons), and a decision tree algorithm for continuous attributes (a.k.a. regression trees). Other learning algorithms could have been used; we chose these two in order to cover two ends of the spectrum from simple to complex relations: Decision Trees are better at identifying (multi-)linear relations, while Neural Networks can identify more complex relations but are less suitable for identifying linear relations.

4. Results

We trained and tested several models based on different subsets of input features, specifically, using OO code metrics only, using XML/XSL metrics only, using organisational metrics only, and using combinations of these subsets of features. Each model was validated using cross-validation with 7:1 splits by project.

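This split scheme, spelled out in the next paragraph, is leave-one-project-out cross-validation. A minimal sketch using scikit-learn's LeaveOneGroupOut is given below for illustration; the actual modelling in the study was done in SQL Server Analysis Services.

    import numpy as np
    from sklearn.model_selection import LeaveOneGroupOut

    def cross_validate(model, X, y, project_ids):
        # X, y: numpy arrays; project_ids: one project label per row.
        # Each fold trains on 7 projects and tests on the held-out one.
        predictions = np.empty(len(y), dtype=float)
        for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=project_ids):
            model.fit(X[train_idx], y[train_idx])
            predictions[test_idx] = model.predict(X[test_idx])
        return predictions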

In other words, we selected one of the 8 projects for testing and used the remaining 7 projects for training the model, then took another project for testing and used the other 7 for training, and so on. The quality of the resulting models was evaluated using visual scatter plots of actual and predicted values, as well as the following measures of fit between the predicted and actual values:

• Pearson's correlation coefficient – shows linear dependence between estimated and actual values. Higher is better.
• Kendall tau rank correlation coefficient – shows (linear or non-linear) dependence between estimated and actual values. Higher is better.
• Mean absolute error (MAE) – shows how much the estimated values differ from the actual values on average. Smaller is better.
• Normalised mean absolute error (NMAE) – MAE divided by the mean of actual values. Smaller is better.
• Mean relative error (MRE) – mean of relative errors. Smaller is better.
• Root Mean Squared Deviation (RMSD) – measures differences between predicted values and actual values. Compared to MAE, RMSD is more affected by large errors. Lower is better.
• Normalised Root Mean Squared Deviation (NRMSD) – RMSD divided by the range of actual values. Lower is better.

In addition, plots of the cumulative distribution of absolute error were consulted during the evaluation of the models. The cumulative distribution of absolute error at a given absolute error value is the (distribution-independent) confidence level that the absolute estimation error is less than or equal to the absolute error value in question. The aim of these plots is to show the accuracy of different models for different "acceptable probabilities of error", where the "acceptable probability of error" is defined as one minus the confidence level.

In this section, we review each set of models in turn. This is followed by a comparison of the models and an analysis of the most influential features of the models. Finally, the section concludes with a discussion of the limitations of the study.
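All of these measures can be computed directly from paired vectors of actual and estimated values. A sketch with NumPy/SciPy, following our reading of the definitions above (it assumes the actual values are non-zero, which holds for the churn figures in this dataset):

    import numpy as np
    from scipy import stats

    def fit_statistics(actual, predicted):
        a = np.asarray(actual, dtype=float)
        p = np.asarray(predicted, dtype=float)
        abs_err = np.abs(p - a)
        mae = abs_err.mean()
        rmsd = np.sqrt(np.mean((p - a) ** 2))
        return {
            'pearson': stats.pearsonr(a, p)[0],
            'kendall': stats.kendalltau(a, p)[0],
            'mae': mae,
            'nmae': mae / a.mean(),               # MAE / mean of actual values
            'mre': np.mean(abs_err / a),          # mean of relative errors
            'rmsd': rmsd,
            'nrmsd': rmsd / (a.max() - a.min()),  # RMSD / range of actual values
        }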

Algorithm   Model Type   Pearson corr.   Kendall corr.   NMAE     NRMSD    MRE
DT          org          0.9998          0.9943          0.0048   0.0029   0.0086
DT          org xsl      0.8174          0.9604          0.0860   0.0878   0.1415
DT          org oo       0.7767          0.9242          0.1464   0.1123   0.2058
DT          all          0.6815          0.9133          0.1966   0.1335   0.5145
NN          all          0.9149          0.6552          0.3096   0.0566   2.5514
NN          org xsl      0.9173          0.6192          0.3172   0.0555   2.4413
NN          org          0.9072          0.5951          0.3261   0.0584   1.8920
NN          org oo       0.8963          0.6088          0.3397   0.0616   1.1172
NN          xsl          0.8823          0.5620          0.3816   0.0696   1.9593
NN          oo           0.8174          0.4252          0.4938   0.0805   4.7868
DT          xsl          0.4929          0.7462          0.5109   0.2194   1.5453
DT          oo           0.5092          0.4103          0.7387   0.2023   7.1812
constant    mean         -0.5433         -0.4194         0.7796   0.1322   17.1931

Table 4: Statistics of the models to predict total LOC churn of a project.

4.1. Models from Code Metrics

The first set of models was trained using code metrics only. Within this set, we considered two subsets: models trained with object-oriented and imperative code metrics only, and models trained with XML/XSL metrics only. Within each subset, we trained separate models using Neural Networks (NN) and Decision Trees (DT). The performance of the models is summarised in Table 4. In this table and in subsequent plots, the abbreviation "oo" refers to models trained using object-oriented and imperative code metrics, while "xsl" refers to models trained using both XML and XSL metrics.

The results show that models trained with the Neural Networks algorithm performed better in terms of NMAE and NRMSD. However, even these best-in-set models performed rather weakly. We also observe a surprisingly strong Kendall correlation coefficient for the model trained with XML/XSL metrics using the Decision Trees algorithm. As Kendall correlation reflects the correlation of rankings, this observation suggests that the Decision Trees algorithm was able to identify general estimation rules, but was not able to find the correct scale to predict the yearly code churn with a sufficient level of accuracy. Thus, it might be possible to increase the performance of these models significantly by applying some transformation to the estimations produced by the model. This would also partially explain why the scatter plot of the "DT xsl" model (see Figure 2) looks straighter than the scatter plots of the NN models (the metrics of "DT xsl" are weak because some estimations are far out of the expected range).


Figure 2: Predictions and actual values for models based on source code or organisational metrics (diagonal line shows ideal).


4.2. Models from Organisational Metrics

Next, models trained only on organisational metrics ("org" models) were considered. These models showed extremely good results. DT models were far superior to NN models with respect to all metrics. In fact, DT models achieved almost ideal correlation coefficients (Pearson 1.00, Kendall 0.99) and NMAE, MRE, and NRMSD below 0.01. In other words, the organisational metrics used in this study led to models that estimate yearly code churn with an average error of less than 1%. The NN models trained with organisational metrics also outperformed the corresponding models trained with code metrics. Thus, we conclude that organisational metrics are better at estimating future churn than code metrics.

The scatter plots of the NN models shown in Figure 2 look similar, implying that there might be some sort of dependency between organisational metrics and code metrics. The verification and identification of such a dependency requires an in-depth study (possibly with other datasets) and is thus beyond the scope of this study.

4.3. Combined Feature Set Models

Finally, models using both types of metrics (cf. all metrics in Table 2 and Table 3) were trained. We trained three types of models with both the Decision Trees and Neural Networks algorithms: (1) a model trained with organisational and object-oriented metrics ("org oo"); (2) a model trained with organisational and XML metrics ("org xsl"); and (3) a model trained on both types of code metrics plus organisational metrics ("all"). The performance metrics of the models are shown in Table 4 and the scatter plots of these models are given in Figure 3.

As expected from the rather similar scatter plots of NN models trained on separate feature sets, the models trained on combined feature sets are also similar to one another in terms of performance. A closer look, however, shows that NN models built from combined feature sets clearly have better performance metrics than NN models with separate feature sets. A small caveat is that the "org oo" model performs worse than the "org" model with regard to Pearson correlation, NMAE and NRMSD. This can be seen as a trade-off for achieving higher Kendall correlation and lower MRE. The differences between the NN models' performance metrics are very small, apart from the clearly worst "oo" model; these differences could be caused by the different initialisations of the NN models at the time of training.

Figure 3: Predictions and actual values for models based on combined feature sets (diagonal line shows ideal).

Figure 4: Empirical cumulative distribution of estimation absolute error for models trained with the Decision Trees algorithm.


DT models trained on combined feature sets scored high across all performance metrics. This performance is confirmed by the scatter plots. When comparing the performance scores of DT models trained on combined feature sets with those of DT models trained only on organisational metrics, it seems at first glance that the latter are (slightly) superior. However, it is misleading to jump to such a conclusion. A limitation of the performance metrics is that they try to combine different performance aims into a single number. In doing so, they lose a significant amount of detail. In order to obtain deeper insights into the relative performance of DT models trained with different feature sets, we examined in detail the cumulative absolute error distribution graph (see Figure 4), specifically in the area corresponding to small errors (i.e. high confidence levels). In this graph, we observe that the "all" model has the lowest error (less than 300 LOC) at 70% confidence or less, while the "org xsl" model has the lowest error in the confidence level range 70%-95.6% (less than 4200 LOC). Interestingly, the "org" model outperforms the others only at confidence levels over 95.6%, and is worse than even the "org oo" model at confidence levels below 98%. In practice this means the following:

1. If one is willing to tolerate more than a 5% probability of unplanned churn, then the "org xsl" model (the model built with organisational and XML/XSL features) offers the lowest estimation error.
2. If one is not willing to tolerate more than a 5% probability of unplanned churn, or if the expected yearly churn is high (in this case relative to the 4200 LOC churn/year threshold), then the "org" model is more suitable.

4.4. Comparison of Models

In general, it is clear that organisational metrics are superior at predicting the yearly code churn of a project. The comparison in Table 4 shows the fit statistics, which confirm this superiority. On the other hand, code metrics do have certain predictive power towards code churn and seem to be more relevant in smaller projects with low yearly churn. Figure 5 shows the empirical cumulative distribution of absolute estimation error. The figure shows that models trained by the Decision Trees algorithm on organisational metrics have excellent accuracy even in high confidence ranges. Errors at the more common confidence levels are shown in Table 5.

Figure 5: Empirical cumulative distribution of absolute estimation error.

Confidence   Best model   Error
0.70         DT all       300 LOC
0.80         DT org xsl   600 LOC
0.90         DT org xsl   1,400 LOC
0.95         DT org xsl   3,400 LOC
0.99         DT org       17,200 LOC
0.999        DT org       83,000 LOC

Table 5: Error margin at different confidence levels for total LOC churn predictions.
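Reading an error margin off the empirical cumulative distribution amounts to taking an order statistic of the absolute errors. A sketch (the index convention is our choice):

    import numpy as np

    def error_at_confidence(abs_errors, confidence):
        # smallest bound e such that at least `confidence` of the observed
        # absolute errors satisfy |error| <= e
        errors = np.sort(np.asarray(abs_errors, dtype=float))
        k = int(np.ceil(confidence * len(errors))) - 1
        return errors[max(k, 0)]

    # e.g. error_at_confidence(abs_errors, 0.95) corresponds to the
    # 0.95 confidence row of Table 5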


4.5. Analysis of Influencers

The most accurate and interesting models were trained using the Decision Trees algorithm, which suggests a relatively linear nature of the relations of the input features to code churn. Decision Trees are also much easier to interpret than Neural Networks. This makes it possible to inspect these models in order to identify their main influencing features. Below we discuss the main influencers identified in the different classes of models.
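The influencer rankings discussed below are the dependency rankings reported by SSAS. As an illustration with open tooling (not the software used in the study), a regression tree's impurity-based feature importances yield a comparable ranking; the hyper-parameters here are arbitrary:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def top_influencers(X, y, feature_names, k=10):
        # X: feature matrix, y: yearly cumulative code churn (numpy arrays)
        tree = DecisionTreeRegressor(min_samples_leaf=20, random_state=0).fit(X, y)
        ranked = np.argsort(tree.feature_importances_)[::-1]
        return [(feature_names[i], float(tree.feature_importances_[i]))
                for i in ranked[:k]]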

4.5.1. Organisational Decision Trees Model

The decision tree built on organisational metrics only was 9 levels deep, which makes it rather complex for human understanding. Nevertheless, we could identify that the first split was based on the "Average Number of XSL Commits To Revision Date" metric. From the value distributions on the nodes, we observed that the highest churn was identified for revisions with more than 306 "XSL Commits To Revision Date On Average". This could be explained by intensive ongoing development, as 306 commits to XSL files per developer is very high and XSL is neither a stand-alone nor a highly popular language [5]. At lower values of "Average Number of XSL Commits To Revision Date", the code churn ranges were similar (the first split had 10 nodes). For deeper levels the splits were very different and based on different features. SSAS did, however, identify the following top dependencies (from strongest to weakest):

1. Number of Object-Oriented Files
2. Revision Number
3. Number of Developers on Project to Revision Date
4. Number of XSL Files
5. Average Number of XSL Commits to Revision Date
6. Number of XML Files
7. Total Number of Files
8. Average Number of XML Commits to Revision Date
9. Number of Imperative Programming Language Files
10. Number of XSL Developers till Data Collection Point

This shows that data on a project's history is more relevant than general statistics on the project (e.g. "Total Number of Developers"). We can also see that XSL plays a stronger part in churn determination than XML, as XSL-related features were ranked higher than XML-related features. The strongest influencer, "Number of Object-Oriented Files", is not surprising, as we have previously determined that object-oriented languages tend to be more churn-prone than imperative, functional or markup languages. The next two features are purely organisational metrics independent of languages, which suggests that the model could even be used on projects without XML or object-oriented files.

4.5.2. Organisational and XML Decision Trees Model

The decision tree built on organisational and XML metrics (combined) was 11 levels deep, which is a minor increase in complexity compared to the organisational model. This could mean that the model could be improved if more training data were available, as it appears that the complexity and support limits were reached for this model. The first split was based on the "Average Calls Per Template" metric. From the value distributions on the nodes, we observed an even stronger distinction of churn levels. The highest churn was identified for revisions with 1.9-2.2 "Calls Per Template" (1 node) and the lowest churn levels were identified by 2.2-2.8 "Calls Per Template" (2 nodes). At other values of "Calls Per Template" the code churn ranges were similar to each other (the first split had 9 nodes). For deeper levels the splits were very different and based on different features. SSAS identified the following top dependencies (from strongest to weakest):

1. Average Number of First Level Elements
2. Total Number of Files
3. Number of Developers on Project to Revision Date
4. Revision Number
5. Number of Simple Tests
6. Number of XML Files
7. Average Number of Templates
8. Percentage of Comments in XML Files
9. Number of XSL Files
10. Average Number of Imperative Programming Language Commits To Revision Date

The strongest influencer of the organisational model ("Number of Object-Oriented Files") was among the top twenty dependencies. The fact that OO metrics were ranked lower explains why the model performed worse than models trained only on organisational metrics: the model performed weaker when more OO files were present and stronger when XML/XSL was more dominant (3 features in the top ten were XSL-related, 3 XML-related, 3 language-independent).

4.5.3. All-Metrics Decision Trees Model

The decision tree built on all metrics was only 8 levels deep, which is a clear sign that the complexity limits were reached for this model and that the model could be improved with more training data. The splits in this model were very similar to the ones in the model trained on organisational and XML metrics; the most notable difference was that the deeper splits were pruned. SSAS identified the following top dependencies (from strongest to weakest):

1. Number of XSL Files
2. Number of Object-Oriented Files
3. Number of Simple Tests
4. Number of Simple Selects
5. Lines of Imperative Code
6. Average Number of Statements in Method
7. Revision Number
8. Total Number of Files
9. Lines of Object-Oriented Code
10. Number of Functions

The model improved on the use of OO metrics; however, it also underranked organisational features, hence its weaker predictive power.

4.6. Limitations of the Study

It is important to keep in mind that the results achieved in this study might not be transferable to projects not included in the study. There are many unknown factors that might affect the coding effort in various situations. This is not a limitation of this specific study; it is a limitation of the data mining approach used in this study. Nonetheless, the fact that we included projects with different characteristics in terms of size, age and type of product, and used independent projects for testing the models, gives some confidence that the obtained models are likely to have reasonable predictive power on other projects, or at least, that the proposed approach can be used to train models for other projects.

Another limitation is that some predictive features might not have been considered or might have been excluded by the techniques used in this study. It might be possible to use or construct other metrics or features that allow better predictions to be made. Also, the use of other feature selection algorithms might yield different results.

It might be the case that some code metrics are influenced by project organisation and thus there might be some correlation between project and code structure metrics. These cases, which were not specifically looked for, can be useful for removing noise from the input data.

5. Related Work

Nagappan et al. used metrics from organisational structure to identify failure-prone binaries with great success [11]. They used more organisational information than is available from version control systems; for example, information about organisation members who do not commit, and about the organisational hierarchy. Their results also showed that code churn is the second best predictor of failure-prone binaries.

One of the earliest attempts to estimate code churn was made by Khoshgoftaar et al. [12]. In their study, they used multiple source code structure

and control-flow metrics to train models for churn estimation, which were used to determine whether a software component is fault-prone or not.

Zhou et al. used linear regression to investigate relations between object-oriented design metrics and maintainability in open-source Java projects [4]. In their paper, feature selection was done using linear regression. A similar approach can be used to identify interactions between features for feature construction purposes. Alternatively, the approach outlined by Smith et al. [13], which uses genetic algorithms to construct and select features, could also be employed for this purpose.

Our work is to some extent related to a large body of research in the area of software cost estimation [14]. The main differentiator is that software cost estimation aims at producing an estimate of a software project's cost (e.g. expressed in person-months) based on characteristics of the project, usually known ex ante, whereas the aim of our work is to make predictions of future coding effort based on a snapshot of a project, and specifically based on code metrics associated with file revisions in the project snapshot.

In separate work [15], we studied the problem of predicting code churn in the next revision of a given XML file. Interestingly, in that work we found that code metrics provide relatively accurate predictions of next-revision code churn, whereas in the present study we have concluded that code metrics alone are not accurate predictors of yearly code churn when compared to organisational metrics. We have also studied the problem of predicting yearly project churn based on an alternative set of organisational and code metrics, albeit from a different angle, in [16]. Specifically, the models studied in [16] attempt to predict the three components of code churn separately (i.e. added LOC, deleted LOC and modified LOC).

6. Conclusions and Future Work

We have shown that, in the context of projects rich in XML and XSLT, organisational metrics provide a better basis for predicting the long-term code churn of a project than code metrics. In fact, models trained on organisational metrics were clearly superior to the other models considered in the study. Organisational models give excellent predictions with low error (error below 3,400 LOC/year in 95% of the cases). These models can aid in planning projects by providing insights into the evolution of project size and complexity and indicating the effort needed to maintain the code base.

We have also shown that even though organisational models are highly accurate, using XML metrics to complement organisational metrics yields better results in some scenarios. We also see a potential to build improved models based on combined feature sets with larger training datasets. Building models on larger datasets, along with developing more sophisticated feature selection algorithms, is an avenue for future work. (We tried AIC-based, principal component analysis and SQL Server Analysis Services' (SSAS) feature selection algorithms, with no success.) Another perspective for future work is to study the factors that affect the predictive power of source code metrics and how these factors correlate with organisational metrics. Understanding this question could help us to build more sophisticated models, for example by employing more complex training algorithms with better feature selection methods.

7. Acknowledgements

This research was started during a visit of the first author to the Software Engineering Group at the University of Zurich (visit funded by the ESF DoRa 6 Program). We thank Harald Gall and the members of his group for their valuable advice. The work is also funded by ERDF via the Estonian Centre of Excellence in Computer Science.

References

[1] B. W. Boehm, C. Abts, S. Chulani, Software development cost estimation approaches – a survey, Ann. Software Eng. 10 (2000) 177–205.

[2] W. Li, S. Henry, Object-oriented metrics which predict maintainability, The Journal of Systems and Software 23 (2) (1993) 111–122.

[3] M. M. T. Thwin, T.-S. Quah, Application of neural networks for software quality prediction using object-oriented metrics, Journal of Systems and Software 76 (2) (2005) 147–156.

[4] Y. Zhou, B. Xu, Predicting the maintainability of open source software using design metrics, Wuhan University Journal of Natural Sciences 13 (2008) 14–20. doi:10.1007/s11859-008-0104-6.



[5] S. Karus, H. Gall, A study of language usage evolution in open source software, in: MSR, ACM, 2011 (in press).

[6] G. A. Hall, J. C. Munson, Software evolution: code delta and code churn, Journal of Systems and Software 54 (2) (2000) 111–118.

[7] N. Nagappan, T. Ball, Use of relative code churn measures to predict system defect density, in: G.-C. Roman, W. G. Griswold, B. Nuseibeh (Eds.), ICSE, ACM, 2005, pp. 284–292.

[8] J. C. Munson, S. G. Elbaum, Code churn: A measure for estimating the impact of code change, in: International Conference on Software Maintenance (ICSM), Bethesda, MD, 1998, pp. 24–31.

[9] C. van Koten, A. R. Gray, An application of bayesian network for predicting object-oriented software maintainability, Information & Software Technology 48 (1) (2006) 59–67.

[10] D. Heckerman, D. Geiger, D. M. Chickering, Learning bayesian networks: The combination of knowledge and statistical data, Machine Learning 20 (3) (1995) 197–243.

[11] N. Nagappan, B. Murphy, V. R. Basili, The influence of organizational structure on software quality: an empirical case study, in: W. Schäfer, M. B. Dwyer, V. Gruhn (Eds.), ICSE, ACM, 2008, pp. 521–530.

[12] T. M. Khoshgoftaar, E. B. Allen, N. Goel, A. Nandi, J. McMullan, Detection of software modules with high debug code churn in a very large legacy system, in: ISSRE '96: Proceedings of the Seventh International Symposium on Software Reliability Engineering, IEEE Computer Society, Washington, DC, USA, 1996, p. 364.

[13] M. G. Smith, L. Bull, Feature construction and selection using genetic programming and a genetic algorithm, in: C. Ryan, T. Soule, M. Keijzer, E. P. K. Tsang, R. Poli, E. Costa (Eds.), EuroGP, Vol. 2610 of Lecture Notes in Computer Science, Springer, 2003, pp. 229–237.

[14] M. Jørgensen, M. J. Shepperd, A systematic review of software development cost estimation studies, IEEE Trans. Software Eng. 33 (1) (2007) 33–53.

[15] S. Karus, M. Dumas, Predicting the maintainability of XSL transformations, Science of Computer Programming, in press (2010). doi:10.1016/j.scico.2010.12.006.

[16] S. Karus, M. Dumas, Predicting coding effort in projects containing XML code, ArXiv e-prints, arXiv:1010.2354.

