Supporting Code Review by Automatic Detection of Potentially Buggy Changes

Mikołaj Fejzer, Michał Wojtyna, Marta Burzańska, Piotr Wiśniewski, Krzysztof Stencel

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
{mfejzer,goobar,quintria,pikonrad,stencel}@mat.umk.pl

Abstract. Code reviews constitute an important activity in software quality assurance. Although they are essentially based on human expertise and scrupulosity, they can also be supported by automated tools. In this paper we present such a solution, integrated with code review tools. It is based on an SVM classifier that indicates potentially buggy changes. We train the classifier on the history of a project. In order to construct a training set, we assume that a change/commit is buggy if its modifications have later been altered by a bug-fix commit. We evaluated our approach on 77 selected projects taken from GitHub and achieved promising results. We also assessed the quality of the resulting classifier depending on the size of a project and on the fraction of the project's history that has been used to build the training set.

Keywords: Bug detection, Code review, Github, Gerrit, SVM, Weka

1  Introduction

Contemporary projects use numerous tools to improve the quality of the resulting software systems. Among them there are code review tools (e.g. Gerrit), build tools (e.g. Maven) and testing environments (e.g. JUnit). Before a change is merged into the code repository, it is thoroughly proof-read and automatically checked to see whether it breaks the build process or unit tests. Unfortunately, only a fraction of all defects can be detected by reviewers, and many common software bugs are missed. In our opinion, proper automated support of the code review process can prevent a significant number of bugs that would otherwise be left unnoticed.

Inspiration and Related Work

In the literature there is a number of approaches to the automatic verification of changes. The article [1] presents a method based on the extraction of knowledge from relations between data items, the so-called multi-relational data mining. It consists in building new relations in the form of "hypotheses" describing the given problem.

These hypotheses are implied by induction from known examples of existing relations. A training set is composed of positive and negative examples of the examined relation, described by existing relations and background knowledge. The article analyses C++ student projects and an automatically generated set; the goal of those analyses was to tell whether a given class contains errors or not. Jeong et al. [2], on the other hand, strive not only to recognize buggy changes but also to indicate the best person to review the code. Their method is based on Bayesian networks and was tested on Firefox's source code. Ostrand et al. [3] asked whether adding the identity of the programmer to the predictors would aid error detection based on quality metrics. They examined a dozen projects using negative binomial regression and confirmed the correlation between the identity of a programmer and bugs. The presented tool was able to indicate the 20% of the whole system that contained from 75% up to 90% of the bugs. We were especially interested in the findings of Sunghun Kim, who described an SVM-based method in [4]. Our approach to improving the code review process was inspired by the work of Nelson and Schumann [5], who described how different approaches to code review might change the outcome of the reviews. The usage of parser combinators [6] for repository querying was influenced by [7], whose authors describe common pitfalls of mining Git repositories. The decision to use parser combinators allowed us to create numerous parsers by combining reusable functions, although it may also cause problems with memory management and garbage collection in the case of large repositories. Automatic analyses of the quality of changes have also been presented in [8-10, 4, 11, 12].

Contribution

Recent studies on bug detection by Sunghun Kim [4, 13] have inspired us to create a bug detection tool integrated with a code review system. This automated tool identifies potential bugs in changes that are about to be merged into the repository of a version control system. It is integrated with Gerrit, a popular code review system, and it is based on a classifier that takes changes and answers whether they are potentially buggy. The classifier is trained on the history of the project. We build the training set from past commits: we assume that a commit is clean if none of its changed lines has later been bug-fixed, and all other commits are considered buggy. The trained classifier is integrated into Gerrit and flags suspicious changes. A reviewer is thus warned, but he/she can ignore such a signal. We prepared a proof-of-concept implementation of this classifier [14]. We tested it on 77 Github repositories and cross-validated it [4]. The results of these tests are promising. They also give insights into how to create better tools in the future.

The contributions of this paper are as follows:

– A novel idea to build classifiers of changes submitted to version control repositories.

– A proof-of-concept implementation of the classifier and its thorough experimental evaluation, with cross-validation, on 77 Github projects. The implementation is available at https://github.com/mfejzer/CommitClassification.
– An integration of the classifier into the code review system Gerrit.

The article is organized as follows. Section 2 presents the details of the creation of the training set and the construction of the classifier. Section 3 shows the results of an experimental evaluation of the proposed method. Section 4 concludes and outlines possible future directions of research.

2  The method and its implementation details

Our solution is based on an SVM (Support Vector Machine) classifier [15, 16]. Such a classifier is a hyperplane in a multidimensional space. Training it consists in selecting the parameters of the hyperplane so that it effectively separates the sets of positive and negative points. In our method an SVM classifier tells clean and buggy commits apart.

The training set is derived from the development history of a project. In order to identify buggy commits we use the algorithm from [17], which discovers bugs based on their later corresponding fixes. We assume that fixing commits are those whose messages contain a term such as "fixes", "bug", etc. For each such commit, we query the project repository for the commits that added or altered the lines removed or modified by the fixing commit. Those commits are considered buggy, while all other commits are assumed clean. This yields the training set: the content of commits together with the decision whether each of them is clean or buggy. We use a specified number of initial commits to build the training set. We call this number the history limit.

Having the training set, we build the classifier. Its multidimensional space is spanned by all the words that occur in commits. A single commit is represented as a bag of words, i.e. a point in this space whose coordinates are the numbers of occurrences of the corresponding words in the commit. This is the standard "bag of words" model. In order to train the classifier (i.e. to find an appropriate hyperplane) we use a dedicated library from the Weka toolkit [18]. The resulting classifier is saved to be used later, when new changes (candidate commits) arrive. The experiments discussed in Section 3 show that a history limit of 100 commits is sufficient to build a good classifier.

The classification of a new change comprises its conversion to the bag-of-words model and the application of the trained classifier, which decides whether the change is buggy or clean. The resulting diagnosis is then communicated to the Gerrit code review system. This way the author and the reviewers are notified whether the change looks risky.
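The following sketch illustrates the labelling step described above. It is a simplification for presentation purposes rather than the actual implementation (which is available in the repository referenced above): the keyword pattern and the helper names are chosen only for the example, and whole files are blamed instead of the exact hunks modified by the fixing commit.

import scala.sys.process._

// Sketch of the labelling step (assumed names; whole-file blame for brevity).
object BuggyCommitLabelling {

  private val fixPattern = "(?i).*\\b(fix|fixes|fixed|bug)\\b.*".r  // assumed keyword list

  private def git(repo: String, args: String*): List[String] =
    (Seq("git", "-C", repo) ++ args).lineStream_!.toList

  // Commit hashes, oldest first, restricted to the history limit.
  def history(repo: String, limit: Int): List[String] =
    git(repo, "log", "--reverse", "--format=%H").take(limit)

  // A commit is a fixing commit if its message matches one of the keywords.
  def isFixing(repo: String, commit: String): Boolean = {
    val message = git(repo, "log", "-1", "--format=%s %b", commit).mkString(" ")
    fixPattern.pattern.matcher(message).matches
  }

  // Commits blamed for the files touched by a fixing commit: for every changed
  // file we blame the parent revision and collect the commits that last
  // modified it.  The full algorithm restricts this to the removed or modified
  // line ranges of the diff; blaming whole files keeps the sketch short.
  def blamedCommits(repo: String, fix: String): Set[String] = {
    val files = git(repo, "diff-tree", "--no-commit-id", "--name-only", "-r", fix)
    files.flatMap { file =>
      git(repo, "blame", "--line-porcelain", s"$fix^", "--", file)
        .filter(_.matches("[0-9a-f]{40} .*"))
        .map(_.split(" ")(0))
    }.toSet
  }

  // A commit from the considered history is buggy if any fixing commit blames it;
  // all remaining commits are assumed clean.
  def buggyCommits(repo: String, limit: Int): Set[String] = {
    val commits = history(repo, limit)
    val fixes = commits.filter(isFixing(repo, _))
    fixes.flatMap(blamedCommits(repo, _)).toSet.intersect(commits.toSet)
  }
}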

Table 1. The statistics of preliminary tests

Project  History  Training  Testing   Buggy  Clean  Correctly   Incorrectly  Revision
         limit    set size  set size                classified  classified
Gedit    500      400       100       16     484    96.00%      4.00%        99f6154
Egit     500      400       100       124    376    75.00%      25.00%       704f311
Svn      2000     1500      500       792    1208   78.25%      21.75%       840072

The processing of commits begins with the execution of a series of git commands and the parsing of their output. We chose parser combinators [6] to construct the parsers. This decision was motivated by the extensibility and the ease of unit testing that this method enables. New and more complex parsers are constructed from existing ones with combining functions provided by the Scala library, so additional tools such as parser generators are not needed.
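As a toy illustration of this style, the parser below handles a single line of "git log --numstat" output (lines added, lines deleted, file path). The grammar and the names are chosen only for this example and do not come from our tool; the example requires the scala-parser-combinators module.

import scala.util.parsing.combinator.RegexParsers

// A file statistic as reported by "git log --numstat": additions, deletions, path.
case class FileStat(added: Int, deleted: Int, path: String)

// Small reusable parsers are combined into a larger one with ~ (sequencing)
// and ^^ (mapping the parsed pieces to a value).
object NumstatParser extends RegexParsers {
  private val num: Parser[Int]     = """\d+""".r ^^ (_.toInt)
  private val path: Parser[String] = """\S+""".r

  val fileStat: Parser[FileStat] =
    num ~ num ~ path ^^ { case added ~ deleted ~ p => FileStat(added, deleted, p) }

  def parseLine(line: String): Either[String, FileStat] =
    parse(fileStat, line) match {
      case Success(result, _) => Right(result)
      case failure: NoSuccess => Left(failure.msg)
    }
}

// Example: NumstatParser.parseLine("12  3  src/main/scala/Main.scala")
// yields Right(FileStat(12, 3, "src/main/scala/Main.scala")).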

3  Experimental evaluation

We divided our experimental evaluation into the following three groups of tests:

1. preliminary tests, conducted on Egit, Gedit and Svn,
2. comparison tests, run on 77 Github repositories with the same history limit,
3. detailed tests, with different history limits, on selected Github repositories.

The goal of the first group of tests was to check whether our software works for different kinds of version control repositories. We developed it to be used primarily with Git repositories, but we have also verified that it works with SVN (namely Apache SVN). We also wanted to examine differences in training the bug detection classifier when it is applied to projects of various sizes. We used a small project (Gedit), a medium project (Egit) and a large project (Svn). Our goal was to validate that our solution is general enough to be used for a software project of any size. Tests were run on projects cloned directly from Github, without any preprocessing. The results of the preliminary tests (the first group) are presented in Table 1. They attest to the efficiency of the implemented method. Those tests were conducted in a way similar to that described in [4].

The results of the preliminary tests inspired us to try our method on a much broader class of projects. As the second test group, we applied our solution to 77 project repositories. We partitioned them into four categories according to the project size, as listed in Table 2. The results show that the total number of commits determines the number of fixes that can be identified (see Table 3). This has a significant impact on the process of classifier training.

The third group of tests was meant to determine how the history limit influences the accuracy of the classification. Tables 4 and 5 show that the quality of the classification does not differ dramatically across the history-length groups. However, we noted that the classifiers trained for very young projects are significantly worse than the classifiers obtained for older projects.

Table 2. Groups of Github projects by repository size

Group name  Repository size (MB)     Project names chosen
            min     max      avg
Small       0.5     6.30     4.01    beanstalkd, devise, devtools, django-debug-toolbar, flockdb,
                                     folly, gizzard, httpie, impress.js, jekyll, mosh, octopress,
                                     paperclip, shiny, Slim, stat-cookbook, vinc.cc, zipkin
Medium      6.80    19.00    11.79   android, boto, clojure, CraftBukkit, facebook-android-sdk,
                                     flask, html5-boilerplate, jquery, knitr, libuv, memcached,
                                     phpunit, plupload, redis, requests, scalatra, Sick-Beard,
                                     storm, tornado
Large       20.00   54.00    36.63   ActionBarSherlock, bitcoin, cakephp, ccv, CodeIgniter,
                                     compass, d3, django-cms, finagle, Font-Awesome, homebrew,
                                     libgit2, netty, ProjectTemplate, reddit, RestSharp, sbt,
                                     SparkleShare, ThinkUp
XL          57.00   757.00   178.00  akka, diaspora, django, elasticsearch, foundation, gitlabhq,
                                     hiphop-php, kestrel, mongo, mono, node, phantomjs, rails,
                                     scala, ServiceStack, SignalR, symfony, three.js, TrinityCore,
                                     zf2

Table 3. The statistics of tests on Github projects by repository size

Group name  Correctly classified  %       Incorrectly classified  %
Small       72.95                 82.28%  16.11                   17.72%
Medium      81.74                 82.29%  17.53                   17.71%
Large       79.89                 84.29%  15.26                   15.71%
XL          84.40                 84.40%  15.60                   15.60%
On average  79.81                 83.33%  16.12                   16.67%

Table 4. Groups of Github projects by history length

Group name  History length           Project names chosen
            min     max      avg
Small       1       985      607     beanstalkd, ccv, facebook-android-sdk, flockdb, folly,
                                     Font-Awesome, httpie, impress.js, memcached, mosh, octopress,
                                     plupload, ProjectTemplate, RestSharp, shiny, stat-cookbook,
                                     vinc.cc, zipkin
Medium      986     2693     1757    ActionBarSherlock, android, clojure, CraftBukkit, devise,
                                     devtools, django-debug-toolbar, flask, gizzard,
                                     html5-boilerplate, kestrel, libuv, paperclip, phantomjs,
                                     scalatra, Slim, storm, ThinkUp, tornado
Large       2694    5892     3859    bitcoin, boto, compass, d3, finagle, foundation, jekyll,
                                     jquery, knitr, netty, phpunit, reddit, redis, requests, sbt,
                                     ServiceStack, Sick-Beard, SignalR, SparkleShare
XL          5893    94432    18965   akka, cakephp, CodeIgniter, diaspora, django, django-cms,
                                     elasticsearch, gitlabhq, hiphop-php, homebrew, libgit2, mongo,
                                     mono, node, rails, scala, symfony, three.js, TrinityCore, zf2

Table 5. The statistics of tests on Github projects by history length

Group name  Correctly classified  %       Incorrectly classified  %
Small       64.37                 78.65%  19.11                   21.35%
Medium      82.84                 82.84%  17.16                   17.16%
Large       84.32                 84.32%  15.68                   15.68%
XL          87.30                 87.30%  12.70                   12.70%
On average  79.81                 83.33%  16.12                   16.67%

Table 6. The relation between history limit and training/testing sets

History limit  Training set size  Testing set size
100            80                 20
200            160                40
500            400                100
2500           2000               500
5000           4000               1000

Table 7. The statistics of tests with history limit set to 100

Project  Correctly   %        Incorrectly  %       Bugs      False     False     Non-bugs
         classified           classified           detected  negative  positive  detected
akka     17          85.00%   3            15.00%  0         3         0         17
mongo    19          95.00%   1            5.00%   0         1         0         19
reddit   20          100.00%  0            0.00%   0         0         0         20
scala    19          95.00%   1            5.00%   0         1         0         19

Table 8. The statistics of tests with history limit set to 200

Project  Correctly   %        Incorrectly  %       Bugs      False     False     Non-bugs
         classified           classified           detected  negative  positive  detected
akka     35          87.50%   5            12.50%  0         5         0         35
mongo    37          92.50%   3            7.50%   0         3         0         37
reddit   38          95.00%   2            5.00%   0         2         0         38
scala    36          90.00%   4            10.00%  0         4         0         36

Table 9. The statistics of tests with history limit set to 500

Project  Correctly   %        Incorrectly  %       Bugs      False     False     Non-bugs
         classified           classified           detected  negative  positive  detected
akka     82          82.00%   18           18.00%  0         18        0         82
mongo    86          86.00%   14           14.00%  0         14        0         86
reddit   92          92.00%   8            8.00%   0         8         0         92
scala    82          82.00%   18           18.00%  0         17        1         82

Table 10. The statistics of tests with history limit set to 2500

Project  Correctly   %        Incorrectly  %       Bugs      False     False     Non-bugs
         classified           classified           detected  negative  positive  detected
akka     386         77.20%   114          22.80%  29        95        19        357
mongo    389         77.80%   111          22.20%  7         103       8         382
reddit   402         80.40%   98           19.60%  6         92        6         396
scala    375         75.00%   125          25.00%  30        105       20        345

Table 11. The statistics of tests with history limit set to 5000

Project  Correctly   %        Incorrectly  %       Bugs      False     False     Non-bugs
         classified           classified           detected  negative  positive  detected
akka     737         73.70%   263          26.30%  133       176       87        604
mongo    768         76.80%   232          23.20%  28        211       21        740
reddit   591         77.36%   173          22.64%  60        129       44        531
scala    732         73.20%   268          26.80%  158       185       83        574

Another important question is how to set the history limit, i.e. how many commits to use to train the classifier. The next group of tests is devoted to answering this question. For a given history limit, the training set comprised 80% of the most recent commits, selected randomly; the remaining commits were used for evaluation, as shown in Table 6. We obtained better results using only the last 500 commits (Table 9) than with 2500 (Table 10) or 5000 (Table 11) commits. This shows that overfitting could also be a concern in our solution. These results indicate that a history limit of 100 commits is sufficient to train a satisfactory classifier. This observation allows our approach to be used with daily training of the classifier from scratch, where the classification of incoming commits is performed using the few hundred most recent commits. This seems most appropriate, since the classifier is then continuously synchronized with the current (up to yesterday) composition and maturity of the development team.
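To make this setting concrete, the sketch below shows how such an experiment can be expressed on top of Weka: commit contents are converted into word-count vectors by the StringToWordVector filter, an SMO classifier (Weka's SVM implementation) is trained on a random 80% of the instances, and the remaining 20% serve as the testing set. The attribute names and the shape of the input (pairs of commit text and label) are assumptions made for this example, and for brevity the word dictionary is built before the split, whereas a stricter evaluation would fit the filter on the training part only; the sketch is not the code of our tool.

import java.util.{ArrayList, Random}

import weka.classifiers.Evaluation
import weka.classifiers.functions.SMO
import weka.core.{Attribute, DenseInstance, Instances}
import weka.filters.Filter
import weka.filters.unsupervised.attribute.StringToWordVector

object CommitClassifierSketch {

  // Builds a Weka dataset with one string attribute (the commit content)
  // and a nominal class attribute ("clean"/"buggy").
  def toInstances(commits: Seq[(String, Boolean)]): Instances = {
    val classValues = new ArrayList[String]()
    classValues.add("clean")
    classValues.add("buggy")
    val attributes = new ArrayList[Attribute]()
    attributes.add(new Attribute("content", null.asInstanceOf[ArrayList[String]]))
    attributes.add(new Attribute("label", classValues))
    val data = new Instances("commits", attributes, commits.size)
    data.setClassIndex(1)
    for ((text, buggy) <- commits) {
      val instance = new DenseInstance(2)
      instance.setDataset(data)
      instance.setValue(0, text)
      instance.setValue(1, if (buggy) "buggy" else "clean")
      data.add(instance)
    }
    data
  }

  // Converts commit texts to word counts, trains SMO on 80% of the data
  // and reports the accuracy on the remaining 20%.
  def trainAndEvaluate(commits: Seq[(String, Boolean)]): Unit = {
    val raw = toInstances(commits)
    val toWords = new StringToWordVector()
    toWords.setOutputWordCounts(true)          // word counts, not mere presence
    toWords.setInputFormat(raw)
    val data = Filter.useFilter(raw, toWords)

    data.randomize(new Random(1))
    val trainSize = (data.numInstances() * 0.8).toInt
    val train = new Instances(data, 0, trainSize)
    val test  = new Instances(data, trainSize, data.numInstances() - trainSize)

    val svm = new SMO()
    svm.buildClassifier(train)

    val evaluation = new Evaluation(train)
    evaluation.evaluateModel(svm, test)
    println(f"Correctly classified: ${evaluation.pctCorrect()}%.2f%%")
  }
}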

4  Conclusions and future work

In this paper we have presented a method to detect potentially buggy changes. The method is based on a classifier trained on a project's history. The training set is built according to the assumption that clean changes are not later altered by bug-fixing commits. We evaluated the method on a number of projects and achieved promising results.

We have observed that small projects are harder to analyse due to the low number of commits, and thus of fixes. However, code review is less likely to be used in such projects, so the benefits of potential error detection are marginal there. The results of our evaluation show that it is mainly large projects with rich commit histories that can benefit from our method of commit classification. Further research on small projects might reveal indicators of quality other than subsequent fixes. A completely different commit classifier could then be prepared. Its goal would be similar, i.e. to identify whether a change is going to be reverted in the future, but its training set would be based on a different assumption than the one taken in this paper. Another possible research direction is to check how the number of developers corresponds to the size of a project and its error rate. This might allow preparing predefined default classification parameters that ease the integration of our tool.

We believe that the results and the algorithms from [19] can be used to enrich our solution and further increase the performance of the classifier.

This effect can be obtained by taking into account additional information about a project, such as the kind of programming language (object-oriented, functional, logic, etc.) and the type system used. It was shown in [19] that those factors can have a significant impact on the prevalence of bugs, and thus also on our detection system. We also expect that better results can be obtained if developers tag bug fixes. Such tags [20] would eliminate the need to classify commits as fixes or non-fixes, which can have a significant impact on the training quality. Furthermore, different bug-fix detection algorithms, such as the one in [21], could also improve the training quality.

References

1. Cohen, W.W., Devanbu, P.T.: Automatically exploring hypotheses about fault prediction: A comparative study of inductive logic programming methods. International Journal of Software Engineering and Knowledge Engineering 9 (1999) 519–546
2. Jeong, G., Kim, S., Zimmermann, T., Yi, K.: Improving code review by predicting reviewers and acceptance of patches. Research on Software Analysis for Error-free Computing Center Tech-Memo (ROSAEC MEMO 2009-006) (2009)
3. Ostrand, T.J., Weyuker, E.J., Bell, R.M.: Programmer-based fault prediction. In: Proceedings of the 6th International Conference on Predictive Models in Software Engineering, PROMISE 2010, Timisoara, Romania, September 12-13, 2010. (2010) 19
4. Kim, S., Jr., E.J.W., Zhang, Y.: Classifying software changes: Clean or buggy? IEEE Trans. Software Eng. 34 (2008) 181–196
5. Nelson, S.D., Schumann, J.: What makes a code review trustworthy? In: 37th Hawaii International Conference on System Sciences (HICSS-37 2004), CD-ROM / Abstracts Proceedings, 5-8 January 2004, Big Island, HI, USA, IEEE Computer Society (2004)
6. Moors, A., Piessens, F., Odersky, M.: Parser combinators in Scala. CW Reports (2008)
7. Bird, C., Rigby, P.C., Barr, E.T., Hamilton, D.J., Germán, D.M., Devanbu, P.T.: The promises and perils of mining Git. In Godfrey, M.W., Whitehead, J., eds.: Proceedings of the 6th International Working Conference on Mining Software Repositories, MSR 2009 (Co-located with ICSE), Vancouver, BC, Canada, May 16-17, 2009, Proceedings, IEEE (2009) 1–10
8. D'Ambros, M., Lanza, M., Robbes, R.: An extensive comparison of bug prediction approaches. In: Proceedings of the 7th International Working Conference on Mining Software Repositories, MSR 2010 (Co-located with ICSE), Cape Town, South Africa, May 2-3, 2010, Proceedings. (2010) 31–41
9. Radjenovic, D., Hericko, M., Torkar, R., Zivkovic, A.: Software fault prediction metrics: A systematic literature review. Information & Software Technology 55 (2013) 1397–1418
10. Arisholm, E., Briand, L.C., Johannessen, E.B.: A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. Journal of Systems and Software 83 (2010) 2–17
11. Catal, C., Diri, B.: A systematic review of software fault prediction studies. Expert Systems with Applications 36 (2009) 7346–7354
12. Shivaji, S., Jr., E.J.W., Akella, R., Kim, S.: Reducing features to improve code change-based bug prediction. IEEE Trans. Software Eng. 39 (2013) 552–569
13. Kim, S.: Adaptive bug prediction by analyzing project history. University of California, Santa Cruz (2006)
14. Fejzer, M.: Commit classification application. Project on Github code repository https://github.com/mfejzer/CommitClassification (2014)
15. Vapnik, V.: Estimation of Dependences Based on Empirical Data (Springer Series in Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA (1982)
16. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. Springer (1998)
17. Śliwerski, J., Zimmermann, T., Zeller, A.: When do changes induce fixes? In: Proceedings of the 2005 International Workshop on Mining Software Repositories, MSR 2005, Saint Louis, Missouri, USA, May 17, 2005, ACM (2005)
18. Hall, M.A., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explorations 11 (2009) 10–18
19. Ray, B., Posnett, D., Filkov, V., Devanbu, P.T.: A large scale study of programming languages and code quality in GitHub. In Cheung, S., Orso, A., Storey, M.D., eds.: Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE-22), Hong Kong, China, November 16-22, 2014, ACM (2014) 155–165
20. Treude, C., Storey, M.D.: Work item tagging: Communicating concerns in collaborative software development. IEEE Trans. Software Eng. 38 (2012) 19–34
21. Tian, Y., Lawall, J.L., Lo, D.: Identifying Linux bug fixing patches. In Glinz, M., Murphy, G.C., Pezzè, M., eds.: 34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland, IEEE (2012) 386–396