Finding the Core Developers
Jaco Geldenhuys
Computer Science Division, Department of Mathematical Sciences
Stellenbosch University, Stellenbosch, South Africa
Email:
[email protected]
Abstract—It has been suggested that 20% of the participants in a free/libre/open source software (FLOSS) project contribute 80% of the work. This paper attempts to verify this claim for nine projects and for various metrics of user activity, such as the number of contributions, the duration of involvement with the project, and the number of modifications to source code files.

Keywords—open source software, mining software repositories, developer behaviour
I. INTRODUCTION

Participants in free/libre/open source software (FLOSS) projects are often classified as core developers, peripheral developers and users—the so-called "onion model" [10]. Sometimes a finer distinction is drawn between peripheral developers who report bugs and those who fix them. For some projects, usually large ones, it is also possible to identify leaders who act as managers and guide the overall direction of the project.

The 20/80 rule was first reported in the context of FLOSS projects by Mockus et al. [8], [9], who observed that the top 15 developers contributed 83% of the commits and 88% of the lines of code in the Apache project. They then hypothesized that:

    Open source developments will have a core of developers who control the code base, and will create approximately 80% or more of the new functionality. If this core group uses only informal, ad hoc means of coordinating their work, it will be no larger than 10–15 people.

They also suggested that for larger projects, more formal mechanisms (such as code ownership, structured software processes, or code inspections) are needed to coordinate development. Data from the Mozilla project (with a core team of 22 to 35) seem to support this. Dinh-Trong and Bieman tried to confirm this hypothesis for FreeBSD [3] and found that the number of participants who contributed 80% ranged from 28 to 42. Koch and Schneider report a similar trend for the GNOME project [7]. In a larger study, Singh et al. looked at 205 projects and found that, on average, 20% of the top developers contributed about 81% of the source code [11]. While they claim that this agrees with the studies mentioned above, this is not strictly true. Core developers comprised just 3.9% of
the overall number of developers in the Apache project, 4.5–7.2% in the Mozilla project, 13.9–16.8% in the FreeBSD project, and 17.3% in the GNOME project. This does not mean that the 20/80 finding is inaccurate. Ghosh and Prakash report almost exactly the same figure: in their study of 3149 projects, 20% of the top contributors were responsible for 81.2% of contributions to the code base [4].

This raises interesting questions: why are the 80% cutoff values in the two large-scale studies so different from those in the single-project studies? Why do the values differ so much from project to project? Is this a sensible definition of "core developer"? To address these questions, this paper explores a variety of metrics. Each metric measures some aspect of developer behaviour (such as the number of contributions), and the users are ranked according to the metrics. We then compute the correlation between these rankings to determine whether or not previous definitions of the term "core developer" are robust, in the sense that it is always clear who these core developers are.

II. DATA AND METHODOLOGY

Nine FLOSS projects are examined in this work: autoconf is an extensible package of macros that produce shell scripts to automatically configure software source code packages; coreutils is a collection of UNIX command-line utilities; linux is the Linux kernel release 2.6; mesa implements the OpenGL graphics standard; pciutils is a library for portable access to PCI bus configuration registers, together with utilities based on the library; samba is a free SMB and CIFS client and server for sharing files and printers; vlc is a cross-platform multimedia framework, player, and server for a variety of formats; wine implements the Microsoft Windows API and runs Windows programs under UNIX; and xserver is the standard implementation of the traditional X11 server.

Table I shows the number of authors, the number of commits (i.e., updates in the version control software), and the date of the first and last commit for each project. The projects are all long-running, widely used, and could be classified as successful; success was not a deciding factor in their selection, but it is clearly related to their longevity. The projects represent a spectrum in terms of their size, number of developers, and application domain.
Table I. THE PROJECTS USED IN THIS STUDY

Project     Authors   Commits   First    git      Last
autoconf         33       950   Mar 92   Sep 07   Mar 10
coreutils        73      2304   Oct 92   Oct 06   Mar 10
linux          3668     95949   Apr 02   Apr 05   May 08
mesa            243     23397   Feb 98   Dec 06   Mar 10
pciutils         15       317   Dec 97   May 06   Jan 10
samba            78      5950   May 96   Aug 07   Sep 09
vlc             216     15941   Aug 99   Mar 08   Mar 10
wine            554     43152   Jun 93   Nov 05   Mar 10
xserver         252      6874   Nov 99   Jun 06   Mar 10
All the projects changed to git [5]—a modern, distributed version control system—during the last five years; the exact date is shown in the "git" column of Table I. Unlike older systems, git distinguishes between the roles of author and committer and, for every change (commit), includes information about each role in its version history. Bird et al. [2] discuss other advantages of git. Shell scripts, standard Unix tools (e.g., awk, sort), and gnuplot were used to generate the figures and tables in this paper. The scripts and their output are available at http://www.cs.sun.ac.za/~jaco/CORE/.

A. Identifying the Users

A first step in classifying user roles is identifying the users. This can be complicated, because the same user may be identified by different names. For example, although "John Smith" may be a common name, it is reasonable to assume that the following names all refer to the same person:

1. John Smith <[email protected]>
2. John Smith <[email protected]>
3. John Q. Smith <[email protected]>
4. John Q. Smith <[email protected]>

The term identifier is used to refer to the combination of email address and real name. The identifiers in the nine projects were manually compared and reconciled; 36 users use eight or more different identifiers, and one user uses 16 different identifiers. Overall, 5605 individuals use 7909 identifiers; 23.5% of them use more than one.
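The data behind Tables I and II were produced with shell scripts and standard Unix tools, as noted above. As a rough illustration of the first step, the following Python sketch (not one of the published scripts; the function name and repository path are hypothetical) extracts author identifiers and per-author commit counts from a cloned git repository.

    # Illustrative sketch only: count commits per identifier (author name plus
    # email address) in a local git clone. The paper's own pipeline used shell
    # scripts and awk; this is a hypothetical re-implementation.
    import subprocess
    from collections import Counter

    def commit_counts(repo_path):
        """Return a Counter mapping 'Name <email>' identifiers to commit counts."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--pretty=format:%an <%ae>"],
            capture_output=True, text=True, check=True,
        ).stdout
        return Counter(line.strip() for line in log.splitlines() if line.strip())

    if __name__ == "__main__":
        counts = commit_counts("path/to/clone")      # hypothetical repository path
        for identifier, n in counts.most_common(10):
            print(f"{n:6d}  {identifier}")

A script of this kind can at best suggest candidate merges, for example identifiers that share an email address or a normalised real name; as noted above, the identifiers in the nine projects were ultimately compared and reconciled manually.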
B. Project composition

One possible behaviour is that "core" developers focus on certain parts of a project. To investigate this possibility, files were classified into six categories:

1) MAK: files controlling the build process (e.g., Makefiles);
2) CNF: files for configuration of the installed software;
3) SRC: program code, either compiled or interpreted once installed;
4) TST: source code and files associated with testing;
5) DOC: (user and developer) documentation; and
6) LOG: development history and notes.

This classification can be automated, but only partially; for instance, a file called test_input.c might contain source code, testing code executed during development or installation, or even code used for the build process. In the end, many of the files were inspected manually. There are, no doubt, some misclassifications, and some files defy classification, but such cases are rare enough not to influence the results.
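A rough first pass over this classification can be scripted. The path patterns in the following sketch are illustrative guesses rather than the rules used in the study, and the test_input.c example shows why a purely automatic pass is not enough.

    # Illustrative sketch only: heuristic first-pass classification of repository
    # paths into the six categories. The patterns are guesses, not the rules used
    # in the study; SRC is checked last so that it acts as a catch-all for code.
    import re

    RULES = [
        ("MAK", re.compile(r"(^|/)(GNUmakefile|[Mm]akefile.*|.*\.mk|CMakeLists\.txt)$")),
        ("CNF", re.compile(r"(^|/)(configure(\.ac|\.in)?|config\..*|.*\.cfg|.*\.conf)$")),
        ("TST", re.compile(r"(^|/)(tests?|testsuite)(/|$)|.*_test\.[ch]$")),
        ("DOC", re.compile(r"(^|/)(docs?|man)(/|$)|\.(txt|texi|[1-9]|html)$|(^|/)README.*$")),
        ("LOG", re.compile(r"(^|/)(ChangeLog.*|NEWS|TODO|AUTHORS)$")),
        ("SRC", re.compile(r"\.(c|h|cc|cpp|py|sh|pl|y|l)$")),
    ]

    def classify(path):
        """Return the first matching category, or None if the file needs manual review."""
        for category, pattern in RULES:
            if pattern.search(path):
                return category
        return None

    # classify("src/Makefile") -> "MAK", classify("tests/run.sh") -> "TST", but
    # classify("test_input.c") -> "SRC" even if it really belongs to TST, which is
    # exactly the kind of case that had to be inspected by hand.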
Table II. USER CONTRIBUTIONS AS PERCENTAGE OF COMMITS

Rank        A      C      L      M      P      S      V      W      X
1        55.4   72.1    4.1   17.1   87.7   25.4   26.2   11.4   21.8
2        21.7    6.7    1.7   13.6    5.0   20.3   12.3    6.0    8.9
3         9.2    5.8    1.7    8.3    2.2   16.1   11.9    5.8    8.0
4         3.7    3.6    1.7    6.5    1.6   13.3    8.3    4.7    7.5
5         2.0    1.6    1.3    4.4    0.3    7.4    8.1    4.6    7.0
80%       7.1    3.1    8.9    7.7    6.1    6.0    4.3    6.6    7.4
Number      3      3    327     19      1      5     10     37     19
III. RERUN OF THE 20/80 EXPERIMENT

The first part of Table II shows the five participants who produced the most commits in each project (each project is identified by the first letter of its name). The values are percentages of the total number of commits; for instance, in the wine project, the top developer contributed 11.4% of the commits. The bottom two lines show the percentage and the actual number of participants who produced 80% of the commits.

It is clear that the projects differ significantly. Smaller projects (autoconf, coreutils, pciutils) appear much less egalitarian than the larger ones (linux, wine). The percentage of participants needed to produce 80% of the commits falls far short of the predicted 20%, ranging instead from 3.1% to 8.9%. These values agree with the single-project studies, but the actual number of developers ranges from very small (1) to surprisingly large (327), which clearly contradicts the values reported in the literature, including those in the single-project studies. The values do not seem to be correlated with project size.

Accurate identification of users does not explain these results: if the table is computed purely on the basis of email addresses, the results are more or less the same. Bias in the choice of projects may also skew the cutoff values, but if, for example, the age or success of a project influenced the number of top developers, the bias would be expected to push the values consistently in one direction; instead, some values are higher and others lower than the 10–15 bound proposed by Mockus et al.
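The bottom two lines of Table II follow directly from the per-author commit counts: sort the authors by decreasing number of commits and take the smallest prefix whose cumulative share reaches 80% of all commits. A minimal sketch (hypothetical names, toy data) is:

    # Illustrative sketch only: smallest group of top committers accounting for
    # at least 80% of all commits.
    def core_by_commits(counts, share=0.80):
        """counts maps an identifier to its number of commits."""
        total = sum(counts.values())
        cumulative, core = 0, []
        for identifier, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
            core.append(identifier)
            cumulative += n
            if cumulative >= share * total:
                break
        return core

    # Toy data, not taken from any of the nine projects:
    counts = {"alice": 70, "bob": 20, "carol": 10}
    core = core_by_commits(counts)
    print(len(core), 100.0 * len(core) / len(counts))   # 2 of 3 authors, i.e. 66.7%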
IV. USER RANKING

Below, a set of metrics is defined and justified. The aim is to capture different aspects of user behaviour. For each metric, the users are ranked 1, 2, 3, . . ., and the correlation between these rankings is computed. If a pair of metrics is consistently highly correlated across all of the projects, this indicates that they agree on the identity of the core developers. The choice of metrics is somewhat subjective, but if metrics µ1 and µ2 are both believed to describe important behavioural qualities of core developers, and if their rankings are not correlated, this would mean that they do not identify the same set of core developers.

A. Metrics

Consider a sequence of commits c1, c2, . . . , cn made by user u, and let t(c) be the time of commit c.

1) Commit Count, Timespan, and Frequency: The most straightforward metrics are the number of commits α1(u) = n, the commit timespan α2(u) = t(cn) − t(c1), and the commit frequency α3(u) = n/α2(u). The intuition is that core developers are likely to commit more times, more often, and over a longer period of time.

2) File Modifications: For a commit c, define m(c) as the number of files modified in c. Define as metrics the size of the largest commit β1(u) = maxi m(ci), the total number of files modified β2(u) = Σi m(ci), and the average number of files modified per commit β3(u) = β2(u)/n. Justification: core developers may be called on to make large commits that modify many files, they may modify more files during their involvement with a project, and they may, on average, modify more files in each commit than other users.

3) Lines Added and Deleted: For a commit c, define ℓa(c) as the number of lines added in commit c, and ℓd(c) as the number of lines deleted. Define as metrics the total number of lines added γ1(u) = Σi ℓa(ci) and the total number of lines deleted γ2(u) = Σi ℓd(ci). The number of commits alone does not accurately reflect the effort of contribution; a participant could, in theory, make only a few commits but add many new lines to the code base.

4) Files Added and Deleted: For a commit c, define fa(c) as the number of files added in commit c, and fd(c) as the number of files deleted. Define as metrics the total number of files added δ1(u) = Σi fa(ci) and the total number of files deleted δ2(u) = Σi fd(ci). It seems reasonable that newcomers may be reluctant to change a project's structure in a major way, while more experienced (i.e., core) developers need to make structural changes to a project.

5) Class Modifications: For a commit c, define Cx(c) as the number of files of class x modified in c, where x ∈ {M, C, S, T, D, L} corresponds to MAK, CNF, SRC, TST, DOC, and LOG. Define as metrics the total number of class x files modified, x(u) = Σi Cx(ci); these metrics appear in Table III simply as M, C, S, T, D, and L.

B. Ranking and correlations

The users were ranked according to each metric, with ties resolved consistently according to a unique user identifier. This introduces a linear order among equally-ranked users which boosts the correlation, but which also makes it more robust against outliers. The pairwise (Pearson's product-moment) correlation between rankings was computed for all of the projects. Note that the correlation coefficients are not subject to sampling error: for each project, the correlation was calculated over all of the users and no sampling was used.
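Most of the metrics above can be read off a single pass over the output of git log --numstat, and the correlation between two rankings is then a straightforward computation. The sketch below is illustrative only (the study's own pipeline used shell scripts, awk, and gnuplot, and the function names and repository path here are hypothetical): it computes α1, α2, and β2 per author and the Pearson correlation between two of the resulting rankings. The class metrics would additionally require the file classifier of Section II-B.

    # Illustrative sketch only: per-author metrics from `git log --numstat` and the
    # Pearson correlation between the rankings they induce.
    import subprocess
    from collections import defaultdict
    from math import sqrt

    def collect(repo_path):
        """Return {author: {"alpha1": commits, "alpha2": timespan, "beta2": files}}."""
        out = subprocess.run(
            ["git", "-C", repo_path, "log", "--numstat", "--pretty=format:@%at\t%an <%ae>"],
            capture_output=True, text=True, check=True,
        ).stdout
        raw = defaultdict(lambda: {"n": 0, "files": 0, "first": None, "last": None})
        author = None
        for line in out.splitlines():
            if line.startswith("@"):                   # one header line per commit
                stamp, author = line[1:].split("\t", 1)
                entry, t = raw[author], int(stamp)
                entry["n"] += 1
                entry["first"] = t if entry["first"] is None else min(entry["first"], t)
                entry["last"] = t if entry["last"] is None else max(entry["last"], t)
            elif line.strip() and author is not None:  # "<added>\t<deleted>\t<path>"
                raw[author]["files"] += 1
        return {a: {"alpha1": e["n"], "alpha2": e["last"] - e["first"], "beta2": e["files"]}
                for a, e in raw.items()}

    def ranking(metrics, name):
        """Rank 1, 2, 3, ... by decreasing metric value; ties broken by identifier."""
        order = sorted(metrics, key=lambda a: (-metrics[a][name], a))
        return {a: rank for rank, a in enumerate(order, start=1)}

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

    if __name__ == "__main__":
        metrics = collect("path/to/clone")             # hypothetical repository path
        authors = sorted(metrics)
        r_alpha1, r_alpha2 = ranking(metrics, "alpha1"), ranking(metrics, "alpha2")
        print(pearson([r_alpha1[a] for a in authors], [r_alpha2[a] for a in authors]))

Applying the same computation to every pair of metrics and every project yields the minima and maxima summarized in Table III.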
Sample scatterplots are shown in Figure 1. The dotted lines indicate the elusive 80% cutoff point for the metrics. In other words, in the α1–α2 scatterplot for the xserver project, points to the right of the vertical dotted line represent the top-ranked users who contributed 80% of α2, i.e., whose commits accounted for 80% of the file modifications.

Table III shows the minimum (below the diagonal) and maximum (above the diagonal) correlation coefficients for each pair of metrics, taken over all of the projects. While some metrics agreed strongly for some projects, the surprising and important aspect of the table is the low minimum values. For at least one project, almost all of the metric pairs disagreed on who the core developers are. The strongest pairings are α1/β2 (0.920–1.000), α1/α2 (0.918–1.000), and α2/β2 (0.863–1.000). In retrospect, this is hardly surprising: the longer a developer is involved in a project (α2), the more commits he/she is likely to make (α1) and the more files he/she is likely to modify (β2).

V. DISCUSSION

The message of this paper is a negative one. None of the metrics proposed agreed strongly on who the core developers are, suggesting that the concept of "core developer" is not necessarily as well-defined as the literature would suggest. This casts doubt on the 20/80 rule and its variants. Of course, it is possible that more sophisticated metrics may cast a different light on the question. In general, more fine-grained analysis is required to determine the roles that developers undertake in FLOSS projects [1], [6].

An even more serious problem is that the nine projects investigated did not exhibit the same patterns of developer involvement. As many others have noted, FLOSS projects are not only significantly different from traditional, commercial projects, but they also differ among themselves. What is true for one project may be false for another.

An even subtler caveat is that care must be taken when making general statements, even about a single project. Since FLOSS projects do not, generally speaking, follow a formalized software process, it is difficult, if not impossible, to identify particular software lifecycle stages. As a project evolves, the behaviour of its participants changes, and it is important to recognize that what is true for a particular project may change over time. This poses somewhat of a problem, since researchers would like to understand FLOSS software processes and identify trends across different projects. Without such patterns, predicting aspects such as project success becomes more difficult, if not impossible.
Figure 1. Scatterplots for high, mid, and low values of the correlation coefficient ρ for the mesa and xserver projects: α1 against α2 for xserver (ρ = .9333), β3 against M for xserver (ρ = .5196), and γ2 against T for mesa (ρ = .1569).
Table III. MINIMUM AND MAXIMUM VALUES OF THE CORRELATION COEFFICIENTS (minima below the diagonal, maxima above)

      α1     α2     α3     β1     β2     β3     γ1     γ2     δ1     δ2     M      C      S      T      D      L
α1    −      1.00   1.00   .798   1.00   .543   .847   .848   .821   .804   .747   .971   .895   .656   .760   1.00
α2    .918   −      1.00   .769   1.00   .517   .783   .821   .821   .804   .702   .971   .840   .619   .744   .979
α3    .718   .752   −      .751   1.00   .547   .732   .721   .821   .804   .798   .971   .786   .653   .704   .902
β1    .250   .250   .250   −      .796   .950   .883   .818   .624   .585   .677   .579   .872   .655   .778   .535
β2    .920   .863   .663   .250   −      .554   .847   .848   .821   .804   .746   .971   .897   .657   .686   .934
β3    −.057  −.045  .003   .665   −.057  −      .705   .648   .500   .501   .520   .418   .696   .489   .560   −.264
γ1    .307   .307   .307   .461   .307   .250   −      .925   .515   .393   .605   .551   .833   .660   .814   .763
γ2    .511   .511   .454   .436   .511   .081   .711   −      .521   .409   .568   .548   .824   .612   .721   .832
δ1    .490   .431   .490   −.039  .439   −.106  −.004  .186   −      .989   .839   .884   .560   .755   .767   1.00
δ2    .210   .180   .395   −.068  .210   −.040  −.050  .139   .568   −      .726   .921   .530   .862   .822   .989
M     .454   .454   .454   −.046  .454   −.180  −.139  .075   .650   .516   −      .677   .605   .854   .661   .660
C     .207   .186   .312   −.091  .203   −.007  .016   .049   .550   .434   .377   −      .636   .953   .816   .961
S     .605   .532   .446   .543   .605   .314   .487   .444   .376   .202   −.096  .083   −      .658   .818   .725
T     .218   .160   .322   .235   .192   .057   .027   .059   .491   .377   .354   .294   .136   −      .789   .966
D     .234   .203   .419   .348   .234   .254   .107   .170   .178   .034   .093   −.009  .220   .150   −      .858
L     .062   .037   .261   −.039  .059   −.021  −.004  −.002  .456   .384   .210   .329   .014   .267   .314   −

REFERENCES

[1] J. J. Amor, G. Robles, and J. M. González-Barahona, "Effort estimation by characterizing developer activity," in Proceedings of the 2006 International Workshop on Economics Driven Software Engineering Research, 2006, pp. 3–6.

[2] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. Germán, and P. T. Devanbu, "The promises and perils of mining git," in Proceedings of the 6th International Working Conference on Mining Software Repositories (MSR'09), May 2009, pp. 1–10.

[3] T. T. Dinh-Trong and J. M. Bieman, "The FreeBSD project: A replication case study of open source development," IEEE Transactions on Software Engineering, vol. 31, no. 6, pp. 481–494, Jun. 2005.

[4] R. Ghosh and V. V. Prakash, "The Orbiten free software survey," First Monday, vol. 5, no. 7, Jul. 2000.

[5] git home page. [Online]. Available: http://git.or.cz/

[6] G. Gousios, E. Kalliamvakou, and D. Spinellis, "Measuring developer contribution from software repository data," in Proceedings of the 5th International Working Conference on Mining Software Repositories (MSR'08), May 2008, pp. 129–132.

[7] S. Koch and G. Schneider, "Effort, co-operation and coordination in an open source software project: GNOME," Information Systems Journal, vol. 12, no. 1, pp. 27–42, Jan. 2002.
[8] A. Mockus, R. T. Fielding, and J. D. Herbsleb, "A case study of open source software development: the Apache server," in Proceedings of the 22nd International Conference on Software Engineering (ICSE'00), Jun. 2000, pp. 263–272.

[9] A. Mockus, R. T. Fielding, and J. D. Herbsleb, "Two case studies of open source software development: Apache and Mozilla," ACM Transactions on Software Engineering and Methodology, vol. 11, no. 3, pp. 309–346, Jul. 2002.

[10] K. Nakakoji, Y. Yamamoto, Y. Nishinaka, K. Kishida, and Y. Ye, "Evolution patterns of open-source software systems and communities," in Proceedings of the International Workshop on Principles of Software Evolution, May 2002, pp. 76–85.

[11] P. V. Singh, M. Fan, and Y. Tan, "An empirical investigation of code contribution, communication participation, and release strategy in open source software: A conditional hazard model approach," 2007. [Online]. Available: http://opensource.mit.edu/papers/singh fan tan.pdf