Author and file entropy in the Perl 6 documentation repository

JJ Merelo

June 1st, 2018

Abstract

As the development of a software project proceeds, its complexity increases, and this is reflected at several levels, from the file level to the author level. This complexity can be represented via entropy. In this report we will analyze entropy at the file level and try to find some correlations, possibly deriving some advice on how to improve the process of fully volunteer software projects such as this one.

Introduction

Entropy has been the subject of several papers (Yu, Cawley, and Ramaswamy 2012; Taylor et al. 2011; Casebolt et al. 2009), with the main focus on how it reveals the increasing complexity of software projects. Entropy is related to the amount of information, and in this case it refers more to the information it reveals about the software development process than to the information content at any other level. In a previous report (Merelo 2018b), we established how contributions through issues evolved in the Perl 6 documentation repository, a purely volunteer-based repository devoted to the creation of the official documentation of the Perl 6 language. We will now focus on different entropy measurements.

There is no single way of measuring entropy, however. The reports above talk about author entropy, which is measured by calculating the Shannon entropy of the number of lines every author has contributed to a single file. This is a way of focusing on the information content of the file as it is today. The main conclusion reached in (Casebolt et al. 2009) is an inverse relationship between file size and entropy in the Gnome suite of applications, with a dominant author, implying low entropy, usually being the case in big files. That result need not be general, but in any case it reveals the power of a particular measure of entropy for uncovering software development patterns.

We already used entropy in (Merelo 2018a), although in that case we measured the monthly commit authorship entropy, applying Shannon entropy to the number of commits contributed by developers each month. No clear pattern emerged, only that entropy increased with the number of authors, which is only to be expected. However, the picture is different if we look at the normalized entropy, which we will do in the next section.
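To make the line-based measure used in those reports concrete, here is a minimal sketch (not taken from any of the cited papers) of how author entropy could be computed for a single file, assuming the per-author line counts have already been extracted, for instance with git blame; the counts in the example are invented.

```python
# Minimal sketch of the line-based author entropy used in the cited reports:
# Shannon entropy of the per-author line counts of a single file.
# The counts below are invented; in practice they could come from `git blame`.
from math import log2

def author_entropy(line_counts):
    """Shannon entropy (in bits) of a per-author line count distribution."""
    total = sum(line_counts)
    probabilities = [count / total for count in line_counts if count > 0]
    return -sum(p * log2(p) for p in probabilities)

# Hypothetical file: three authors contributed 120, 30 and 10 lines.
print(author_entropy([120, 30, 10]))  # ~1.01 bits
```

With counts this skewed, the entropy stays well below its maximum of log2(3) ≈ 1.58 bits, which is the intuition behind the dominant-author result cited above.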

Monthly normalized entropy

Below we use the data in (Merelo 2018a), updated to the date of this paper, but using normalized Shannon entropy, which divides the regular entropy by the logarithm of the number of authors that have participated each month.
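As an illustration of the normalization, this is a small sketch of how the normalized monthly entropy just described could be computed from per-author commit counts; the function and the example counts are hypothetical, not the exact code used to produce the data below.

```python
# Sketch of the normalized monthly entropy described above: Shannon entropy of
# per-author commit counts in a month, divided by log2 of the number of authors,
# so that the result always lies between 0 and 1. Counts are invented.
from math import log2

def normalized_entropy(commit_counts):
    """Normalized Shannon entropy of a per-author commit count distribution."""
    counts = [c for c in commit_counts if c > 0]
    if len(counts) < 2:
        return 0.0  # a single author gives no distribution to measure
    total = sum(counts)
    h = -sum((c / total) * log2(c / total) for c in counts)
    return h / log2(len(counts))

print(normalized_entropy([95, 2, 2, 1]))    # one dominant author -> ~0.18
print(normalized_entropy([12, 11, 10, 9]))  # even spread -> ~1.0
```

Dividing by the logarithm of the number of authors makes months with different numbers of contributors comparable on a single 0 to 1 scale.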


[Figure: normalized monthly entropy vs. number of commits, with points colored by the number of authors per month.]

[Figure: normalized monthly entropy vs. number of authors, with points colored by the number of commits per month.]

The inverse relationship found by (Casebolt et al. 2009) for the Gnome suite does not hold in this case: entropy roughly increases with the number of authors, and also roughly increases with the number of commits.


As indicated in (Casebolt et al. 2009), this entropy is related to the evenness of the distribution of commits among authors. Low entropy means that a dominant author emerges, no matter how many authors (or commits) there are; high entropy means that the authors, however few, distribute the load more or less evenly among themselves. In our case entropy does not crash when the number of authors increases, implying that the load is still distributed in a more or less even way. This is simply an emergent behavior that arises from the self-organization of the authors, and is not imposed from above in any way. In fact, the month with the lowest entropy corresponds to a relatively low number of authors, seven. With a high number of authors, normalized entropy tends to stay high, with values above 0.5. The same goes for the number of commits: even with a high number of commits, entropy stays around 0.5, although it roughly decreases with the number of commits.

But we can also look at entropy at the file level. This will be done in the next section.
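To make this interpretation concrete, the following example contrasts two hypothetical seven-author months, one with a dominant author and one with an evenly spread load; the commit counts are invented, and SciPy's entropy helper is used only for brevity.

```python
# Two hypothetical months with the same number of authors (seven) but a
# different spread of commits, illustrating the interpretation above.
# scipy.stats.entropy normalizes the counts to probabilities internally.
from math import log2
from scipy.stats import entropy

dominant = [60, 2, 2, 1, 1, 1, 1]   # one author does almost everything
even = [11, 10, 10, 9, 9, 9, 8]     # load spread across the seven authors

for label, counts in [("dominant", dominant), ("even", even)]:
    h = entropy(counts, base=2) / log2(len(counts))
    print(f"{label}: normalized entropy ~ {h:.2f}")
# prints roughly 0.29 for the dominant month and 1.00 for the even one
```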

File-level entropy

Instead of computing the entropy from the lines written by each author, we will look at the number of commits that affect a file, independently of how many lines each one changed. With this we achieve two things. First, we look at the whole process of creation and editing of a file, including all the authors that have contributed to it. Second, we use a measure, commits, that reflects effort differently from lines of code (which is what the authors mentioned above use). This might help us confirm their findings, but it also gives a more dynamic measure of entropy than lines changed per author, which after a certain amount of time is not going to change much; author commit entropy will change with the number of authors and the number of commits, giving a more up-to-date and accurate view of the complexity we want to reveal using entropy.

Let us look first at the entropy including all repository history.
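Before looking at the charts, this is a rough sketch, not the original analysis pipeline, of how the per-file commit-author entropy used in this section could be obtained; the git log invocation and the commented-out example path are assumptions for illustration.

```python
# Rough sketch (not the original analysis pipeline) of the per-file measure used
# in this section: count commits per author for one file and normalize the
# Shannon entropy of those counts. Run from inside a clone of the repository.
import subprocess
from collections import Counter
from math import log2

def file_commit_entropy(path):
    """Normalized entropy of commits-per-author for a single file."""
    log = subprocess.run(
        ["git", "log", "--follow", "--format=%an", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    counts = Counter(line for line in log.splitlines() if line)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum((c / total) * log2(c / total) for c in counts.values())
    return h / log2(len(counts))

# Example, using a path that appears later in this report:
# print(file_commit_entropy("doc/Language/structures.pod6"))
```

The --follow flag is used so that renames do not truncate a file's history.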

[Figure: normalized entropy vs. file size (logarithmic x axis), with points colored by the number of authors per file.]

This chart plots normalized entropy vs. file size. In general, the smaller the file, the higher the entropy, but there are also many cases in which big files have either low or high entropy. We can conclude that, if we measure entropy this way, there is no relationship between file size and entropy.

[Figure: normalized entropy vs. number of authors (logarithmic x axis), with points colored by file size.]

[Figure: normalized entropy for each number of authors per file.]

The first of these charts, which uses a logarithmic x scale, shows a certain downward tendency, or inverse relationship, between the number of authors and entropy: the higher the number of authors, the lower the entropy, although the lowest values are reached by files with one or two authors. The file with the lowest entropy is a relatively recent one, doc/Language/structures.pod6, with just three authors and an entropy of ~0.30. In general, this low entropy is due to one of the authors making the majority of the commits to the file; note, however, that most of the files have an entropy higher than 0.5, indicating a good amount of collaboration. This might be peculiar to the nature of the repository, which is based mainly on text. Let's look at another, similar repository: the Rakudo compiler repo, which is also volunteer-based, deals with the Perl 6 interpreter, and shares a good number of developers with the documentation repository.

[Figure: Rakudo repository: normalized entropy vs. file size (logarithmic x axis), with points colored by the number of authors per file.]

The situation is remarkably similar to the documentation repo, with the main difference being the file size, whose range is smaller, as can be observed in this chart. Files tend to be more uniform in size, and also bigger.


[Figure: Rakudo repository: normalized entropy vs. number of authors (logarithmic x axis), with points colored by file size.]

[Figure: Rakudo repository: normalized entropy for each number of authors per file.]

The other two charts, representing normalized entropy vs. number of authors, reflect in general a downward trend, but also a lower range of variation of entropy. As above, when there are few authors, the whole range of situations is possible: from close to equal contribution (which, in general, will correspond to just a few commits, and the same number of them, by every author) to clear domination by one of the authors. The bigger the number of authors, the smaller the difference between the contributions made by each of them tends to be. Entropy in this case can be much lower, with several files under 0.5, mainly those with few authors. In general, however, the pattern of self-organized distribution is remarkably similar.

Conclusions

The main objective of this paper was to find out how entropy behaves in the Perl 6 documentation repository and what it tells us about the self-organizing nature of the work in this kind of repository. What we have found is that, in general, author commits to files organize themselves in a more or less even way, with files being created by several authors who each contribute a not too different number of commits. This probably means that, even if a file is originally created by a single author, further contributions spread authorship, and responsibility, among repository contributors. This result, which is achieved via self-organization and not a centralized allocation of work, makes the output more even in quality, and larger, since every developer deciding what to do by herself makes her more engaged with the work and the project than a central allocation of tasks would, which would certainly result in a different, lower-entropy pattern.

This entropy pattern is not peculiar to the fact that this is a documentation, and not only software, repository. Although an exhaustive investigation has not been made, the sister Rakudo repository shows a very similar pattern.

If we can conclude from these metrics that work is employed in a relatively efficient way, is there a chance that they could help improve the output or the quality of the work? In general, entropy tends to decrease towards a certain value as the number of authors grows, and the number of authors increases with time. A low value of entropy with a high number of authors could reveal a file that has not received enough attention from the volunteers. For instance, doc/Language/typesystem.pod6 has many different authors, but a low entropy of ~0.58. Looking at git blame we can see that it was written mostly by a single person in 2016. It would be interesting to highlight these documents for revision, just in case they hide some error or need to be updated to the latest versions of the Perl 6 interpreter. This applies, of course, to a documentation repository; in the case of a code repository, it is probably fine if a file has been left alone for a long time, as long as it passes the tests. Documentation repositories, however, are different in this sense, and finding new targets for improvement can help allocate work better and obtain a better result in the mid and long term.
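As a hedged sketch of this revision-flagging idea, the following snippet filters a table of per-file statistics for files with many authors but low normalized entropy; the thresholds, the input format, and the author count in the example row are illustrative assumptions, not values from this paper.

```python
# Hedged sketch of the revision-flagging idea: given per-file statistics
# (path, number of authors, normalized entropy) computed as in the previous
# sections, list files with many authors but low entropy. Thresholds and the
# author count in the example row are illustrative, not values from the paper.
def files_needing_attention(stats, min_authors=10, max_entropy=0.6):
    """Return (path, n_authors, entropy) rows that look under-revised."""
    return [
        (path, n_authors, h)
        for path, n_authors, h in stats
        if n_authors >= min_authors and h <= max_entropy
    ]

# Hypothetical row mirroring the typesystem.pod6 example mentioned above:
stats = [("doc/Language/typesystem.pod6", 20, 0.58)]
print(files_needing_attention(stats))
```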

Acknowledgements

This report has been supported by The Perl Foundation under the grant "Curating and improving Perl 6 documentation".

References

Casebolt, Jason R., Jonathan L. Krein, Alexander C. MacLean, Charles D. Knutson, and Daniel P. Delorey. 2009. "Author Entropy vs. File Size in the Gnome Suite of Applications." In Mining Software Repositories, 2009. MSR '09. 6th IEEE International Working Conference on, 91–94. IEEE.

Merelo, JJ. 2018a. "Perl 6 Documentation Repository Through Time: Contributions Through Commits." GeNeura Team, CITIC, ETSIIT, University of Granada. https://www.researchgate.net/publication/325020270_Perl_6_documentation_repository_through_time_contributions_through_commits/references.

———. 2018b. "Perl 6 Documentation Repository Through Time: An Initial Report." GeNeura Team, CITIC, ETSIIT, University of Granada. https://www.researchgate.net/profile/Juan_Merelo_Guervos/publication/324829397_Perl_6_documentation_repository_through_time_An_initial_report/.

Taylor, Quinn C., Jonathan L. Krein, Alexander C. MacLean, and Charles D. Knutson. 2011. "An Analysis of Author Contribution Patterns in Eclipse Foundation Project Source Code." In IFIP International Conference on Open Source Systems, 269–81. Springer.

Yu, Liguo, John Cawley, and Srini Ramaswamy. 2012. "Entropy-Based Study of Components in Open-Source Software Ecosystems." INFOCOMP Journal of Computer Science.
