Folder Structure Evolution in Open Source Software Andrea Capiluppi Dipartimento di Automatica e Informatica Politecnico di Torino – ITALY
[email protected]
Maurizio Morisio Dipartimento di Automatica e Informatica Politecnico di Torino – ITALY
[email protected]
Abstract
Predicting when and how a software system will evolve is one of the most fascinating challenges of software engineering. No matter what approach one is using to study such evolution, empirical studies, including observations of systems used in the real world, and of their software processes, are needed in order to identify correlations, find recurring patterns, and eventually predict how systems are likely to evolve. In the empirical study presented in this paper, we take 25 software systems released as Open Source, and observe their evolution. Our focus is not only on how much systems grow in size, but rather on how code structure is adapted and gets modified as the system is evolved. The main goal of this study is to recognize recurring patterns and practices used in evolving long-lived real world software systems. In our study we find three dominant patterns of code structure evolution of Open Source systems: horizontal expansion, vertical expansion, vertical shrinking. By detailed study of exemplars of these three patterns one can identify under which conditions a particular pattern is more likely to prevail than the others.
Juan F. Ramil Computing Department Faculty of Maths and Computing The Open University, UK
[email protected]
1. Introduction
The long term evolution of E-type software, that is, software systems which are actively used in real world domains and environments [Lehman and Belady 1985], is an important issue for empirical study which can lead to useful insight and applicable lessons both for researchers and practitioners. Toy systems and prototypes are surely worth analysis, but the conclusions drawn from them are limited in their applicability to real world applications and domains. Empirical studies of real world software processes and products are, in turn, limited by the kind of artifacts that an investigator is able to obtain and measure: proprietary systems are in general difficult to study, since the public disclosure of data reflecting those systems tends to be restricted. In this paper we use metrics derived from a number of open source¹ systems (OSS) in order to study the characteristics of their long term evolution and, in particular, how their internal structure evolves. Choosing OSS systems for studying software evolution is an advantage, since substantial amounts of data concerning software products and processes are available in freely accessible forms such as mailing lists, releases, configuration management repositories (such as the Concurrent Versions System, CVS), etc.
In this paper we study OSS systems from the point of view of their structural evolution. This involves the study of their enhancement, adaptation and restructuring. We are interested in finding patterns in the evolution of software code structure. Our data set is composed of 25 OSS systems which we observed in a discrete-time perspective, that is, studying each available release. The dataset globally represents 992 releases or data points. We are interested in observing source code structure and its changes, to learn from long-lived OSS systems what types of structural patterns emerge, what structural changes are more frequently made to the source code, and also to seek patterns in the evolutionary trends. Given that code structure and, in general, system architecture can be visualized using a variety of means, we focus on the simplest possible approach: the source folder structure. By folder we mean any directory in the code repository which contains source files. Our research goal is to understand how OSS projects evolve with regard to source code internal structures. In future studies we plan to relate the source folder view of software structural evolution and other structural views (for example, obtained through design recovery [Di Lucca et al 2000]) with factors such as size and type of application, effort subsumed by the evolution and the type of software process.
¹ The authors are aware of the distinction between Free and Open Source software. The distinction is relevant, for example, with regard to users' rights over software artifacts. In this paper, however, we use Free and Open Source software as synonyms.
2. Related work
Empirical studies on software development gained momentum after the pioneering work of Lehman and his collaborators on the evolution of the proprietary operating system OS/360 [Belady and Lehman, 1976]. The initial study observed 20 releases of OS/360. The results that emerged from that investigation, and from subsequent studies of other proprietary commercial software [Lehman, 1974, 1980], [Lehman and Belady, 1985], [Lehman et al, 1987], [Lehman et al, 1988], include the SPE program classification and a set of laws of E-type software evolution. The findings made in the seventies and eighties were refined and supplemented in the recent FEAST projects [Lehman et al 1998]. More recently, other researchers have studied software evolution phenomena. For example, Kemerer and Slaughter [1999] studied the evolution of two different proprietary systems using two approaches: one based on time series analysis, and the other based on sequence analysis. A study which identifies and categorizes software evolution patterns is also reported in [Barry et al., 2003].
During the last few years, it has been realized that OSS systems have an edge over commercial ones when it comes to availability of data: many studies have been carried out since the initial research involving the Apache web server and the Mozilla browser [Mockus et al. 2002]. More recent studies include those which examine single OSS projects [German 2003], [Koch and Schneider 2000], [Aoki et al. 2002], [González-Barahona et al 2001], [Stamelos et al 2002], [Godfrey and Tu 2000], and those which involve several systems [Capiluppi et al 2003], [Capiluppi 2003]. Even though the vast majority of OSS evolution studies are based on direct trend visualisation and curve fitting, interesting new approaches to studying the evolution phenomenon have recently been proposed through both quantitative [Antoniades et al 2003] and qualitative [Smith et al 2004] simulation methods. The work presented in this paper explores the evolution of the code structure, a new dimension not covered in any of the above studies. In doing so, this work aims at complementing the understanding of OSS evolution.
3. Rationale
When investigating the code structure of various OSS systems, one may encounter different patterns of modification: if we consider code structure from the perspective of its organisation and storage (one example is depicted in Figure 1), it is possible to visualize basic components (source files, source folders) as composing a tree, with the root of the tree being represented by the parent folder. When analyzing software evolution in a tree perspective, one distinguishes two dimensions:
1. vertical growth, that is, creating a sub-branch in an existing branch (upper part of Figure 1),
2. horizontal growth, that is, adding a new branch at the same level as an existing branch (lower part of Figure 1).
If we consider Figure 1 from a tree perspective, we may also state that any vertical growth adds depth to the code structure, i.e. a new level has been nested under an existing level. The upper part of Figure 1 shows that the creation of folder F3 has introduced a nested level under a current level, which is composed of F1 and F2. Alternatively, as shown in the lower part of Figure 1, F3 can be added at the same level as F1 and F2, that is, without adding a new level.
The initial focus of the research reported here is based on Figure 1, and on the common assumption that evolution in software systems is generally implemented in an incremental fashion. Our aim is to understand whether source code trees have a common pattern of growth, and if (and how) those patterns have an impact on the evolvability of the systems. In particular, we would like to assess a working hypothesis which is based on anecdotal observations by one of the authors. The hypothesis states that a system is likely to grow from an initial low-level tree, first by adding branches to the existing levels, and next by adding additional levels. If this or any other common evolutionary pattern is supported by empirical observations, the next question would be why such a pattern occurs and whether it can be linked to other characteristics of the software and its related domains. Moreover, the empirical study of structural evolution can help us to identify, and even predict, when and how structural changes occur and whether this can be related to transitions between stages [Rajlich and Bennett 2000], [Nakakoji et al 2002] in the evolution of a software system.
This investigation of code structure evolution in OSS requires one to address the following research questions:
• How does the source tree evolve over time or releases?
• How does the depth of the source tree relate to code size?
• How does the code structure evolution relate to the rate of functional growth and change of a system?
• What common patterns emerge in source tree growth, given the horizontal and vertical perspective introduced in Figure 1 and in the above discussion?
• How could one, by visualizing the evolving code structure, distinguish functional enhancement and adaptation activities, usually the predominant effort during the evolution of source code, from refactoring and restructuring, also called anti-regressive activities [Lehman 1974]?
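[Diagram: a parent folder containing folders F1 and F2. Upper part: F3 is created one level below the existing folders. Lower part: F3 is added at the same level as F1 and F2.]
Figure 1 - Two possible modifications of code structure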
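The two dimensions of Figure 1 can be illustrated with a minimal Python sketch. It is not the tooling used in this study: the function names are hypothetical, folders are represented simply as paths relative to the parent folder, and a new folder is labeled "vertical" only when it lies deeper than any folder of the previous release, which is a simplifying assumption. Nesting F3 under F1 in the first example is an arbitrary choice made for the illustration.

```python
from pathlib import PurePosixPath


def depth(folder: str) -> int:
    """Number of path components of a folder, counted below the parent folder."""
    return len(PurePosixPath(folder).parts)


def classify_growth(old_folders: set[str], new_folders: set[str]) -> dict[str, list[str]]:
    """Label each folder added between two releases as vertical or horizontal growth.

    Assumption: a folder is 'vertical' growth only if it is deeper than every
    folder of the older release; otherwise it is 'horizontal' growth.
    """
    old_max = max((depth(f) for f in old_folders), default=0)
    result: dict[str, list[str]] = {"vertical": [], "horizontal": []}
    for folder in sorted(new_folders - old_folders):
        kind = "vertical" if depth(folder) > old_max else "horizontal"
        result[kind].append(folder)
    return result


before = {"F1", "F2"}
upper = {"F1", "F2", "F1/F3"}  # F3 nested under an existing folder
lower = {"F1", "F2", "F3"}     # F3 added next to F1 and F2

print(classify_growth(before, upper))  # {'vertical': ['F1/F3'], 'horizontal': []}
print(classify_growth(before, lower))  # {'vertical': [], 'horizontal': ['F3']}
```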
4. Methodology
Our methodological approach can be summarized as the list of steps presented below. The list is not intended as fully sequential, since some steps are intertwined and provide feedback to other steps:
1. Projects selection: as reported in previous work [Capiluppi et al 2003], we have created a large database with data representing over 400 OSS systems, randomly selected from a popular OSS repository. Initially, we classified these systems based on a number of process and product characteristics. For the study of structural evolution we decided to focus on the larger systems, which involve more complex and richer folder structures. For the present study, we define as 'large' those systems composed of over 100 KLOCs of code. Furthermore, we extracted from the data set some smaller systems for which all the releases in the system's evolution were publicly available for investigation. In total, the sample for the present study includes 25 OSS systems, which is what we could investigate within the time and the resources available.
2. Attribute definition and metrics derivation: since our focus is on measuring systems' evolution, we collect a set of metrics which include the system's size, an indicator which is generally accepted as a surrogate of the functional power of the system. Section 5 provides a description of this and other attributes.
3. Parsing tools: automatic data extraction is key in systems' evolution analysis. In this study we used off-the-shelf, freely available utilities [XSCC] for counting lines of code. In addition, we built our own tools for parsing source trees (these tools are available to anyone who wishes to replicate this study). Next, we used the dot graphic tool [Graphviz] for extracting source trees out of data and, finally, a Perl script to quantify the number of changes made between subsequent releases.
4. Data analysis and pattern recognition: basic plots and visualisations were used as a means to identify recurring patterns.
5. Interpretation: in addition to observing (and recognizing) patterns, one needs to formulate possible explanations for them, based on existing literature (e.g. [Lehman and Belady 1985], [Rajlich and Bennett 2000]), new observations by the authors and hints provided by the documentation of the observed systems.
5. Definition of attributes
5.1. Source code size
The vast majority of studies on the evolution of software systems so far have involved one type or another of source code size metric [Lehman and Belady 1985], with only some exceptions [e.g. Anton and Potts 2001]. In this study we measured source code size in three different forms:
1. LOCs: the total number of lines of code, which we usually counted through off-the-shelf utilities (wc -l, for instance).
2. SLOCs: the total number of source lines of code, i.e. the LOCs remaining after blank lines and comments have been purged.
3. KBs: the size of a source file in kilobytes.
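As an illustration of the three size measures, the following Python sketch computes them for a single source text. It is not the counting utility used in this study: the function name is hypothetical, and it assumes C-style comments (// and /* ... */) and ignores comment markers that appear inside string literals. The line count matches wc -l only when the text ends with a newline.

```python
def size_metrics(source: str) -> dict[str, float]:
    """Return LOC, SLOC and size in KB for one C-like source text (simplified)."""
    lines = source.splitlines()
    sloc = 0
    in_block = False  # True while inside a /* ... */ comment
    for raw in lines:
        line = raw.strip()
        code = ""
        i = 0
        while i < len(line):
            if in_block:
                end = line.find("*/", i)
                if end == -1:
                    i = len(line)      # comment continues on the next line
                else:
                    in_block = False
                    i = end + 2
            elif line.startswith("//", i):
                break                  # rest of the line is a comment
            elif line.startswith("/*", i):
                in_block = True
                i += 2
            else:
                code += line[i]
                i += 1
        if code.strip():
            sloc += 1                  # line holds something besides blanks and comments
    return {
        "loc": len(lines),
        "sloc": sloc,
        "kb": len(source.encode("utf-8")) / 1024,
    }


sample = "/* header */\n\nint main(void) {\n  return 0; // done\n}\n"
print(size_metrics(sample)["loc"], size_metrics(sample)["sloc"])  # 5 3
```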
5.2. Code Structure
5.2.1. Code components
Research has been done aiming at correlating various structural evolutionary metrics to fault and failure discovery rates [Nikora and Munson 2003], based on the view that evolutionary characteristics may be directly related to a few common evolution attributes measured at the file or module level. Other studies in search of patterns of software evolution have concentrated on metrics [Barry et al 2003] but not on visualizations of software structure.
The present study focuses on a coarser level of granularity by measuring attributes of the whole system. We deal with code structure in three different forms:
1. source files, that is, all files that are supposed to contain source code (e.g., “*.c”);
2. source folders, that is, directories containing at least one source file;
3. folder levels, that is, each level in the code structure where folders may topologically be placed.
Files, folders and levels together form a structure which may be interpreted as a simple architectural view of the system.
5.2.2. Folders level
Observing Figure 1, one would wish to know in which sequence F1 and F2 were added to the system. In order to investigate this, we use the term encapsulated to refer to a folder that is contained inside another one. Each encapsulation is associated with a specific depth inside the source file structure; therefore each encapsulation may be related to a depth attribute, which we call level. Our interest is therefore to analyze the characteristics of folder levels, and to observe maximum depths, the size of each level, patterns of growth, and break points in the evolution of source folder trees.
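The level attribute can be sketched in Python as follows. This is only an illustration under stated assumptions, not the parsing tool built for this study: the function name and the set of source extensions are hypothetical, and the root of the release is assumed to be level 1.

```python
import os
from collections import Counter

# Assumption: which extensions count as source files depends on the project.
SOURCE_EXTENSIONS = {".c", ".h"}


def files_per_level(root: str) -> Counter:
    """Count source files per folder level, taking the root folder as level 1."""
    counts: Counter = Counter()
    root = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        n_sources = sum(
            1 for name in filenames
            if os.path.splitext(name)[1] in SOURCE_EXTENSIONS
        )
        if n_sources == 0:
            continue  # not a source folder: it holds no source file directly
        rel = os.path.relpath(dirpath, root)
        level = 1 if rel == "." else 1 + len(rel.split(os.sep))
        counts[level] += n_sources
    return counts
```

Calling files_per_level on the unpacked tree of each release gives one row of a files-per-level series, and max(counts) gives the maximum depth of that release.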
5.3. Modification types
Different approaches for classifying maintenance and evolution activity have been proposed over the years, e.g. [Kemerer and Slaughter 1999], [Chapin et al 2001]. The application of these classification schemes in an empirical study involves considerable work. In this study we focused on two types of activity, which are relatively simple to identify by observing which files have been added, modified or deleted between two releases, and on one measure derived from them:
1. source additions, i.e. the set of source files added between two subsequent releases or over a given period of time (e.g. one month);
2. source deltas, i.e. the set of files modified or deleted between two subsequent releases or over a given period of time (e.g. one month);
3. number of touched files (or files handled [Lehman and Belady 1985]), i.e. the cardinality of the union of source additions and source deltas.
The percentage of touched files at release (or period) j is calculated as the number of files touched at release (or period) j, divided by the total number of files present at the previous release (or period), j-1.
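The definitions above can be written down as a short Python sketch. It is not the script used in this study: the function name is hypothetical, and each release is assumed to be given as a mapping from file path to a content fingerprint (for example a checksum), so that a changed fingerprint marks a modified file.

```python
def touched_files(prev: dict[str, str], curr: dict[str, str]) -> dict:
    """Compare two releases given as {file path: content fingerprint}."""
    additions = set(curr) - set(prev)                       # source additions
    deleted = set(prev) - set(curr)
    modified = {p for p in set(prev) & set(curr) if prev[p] != curr[p]}
    deltas = modified | deleted                             # source deltas
    touched = additions | deltas                            # files touched
    # Percentage relative to the number of files in the previous release.
    percent = 100.0 * len(touched) / len(prev) if prev else float("nan")
    return {
        "additions": additions,
        "deltas": deltas,
        "touched": len(touched),
        "percent_touched": percent,
    }


release_1 = {"a.c": "h1", "b.c": "h2", "c.c": "h3", "d.c": "h4"}
release_2 = {"a.c": "h1", "b.c": "h2x", "d.c": "h4", "e.c": "h5"}
print(touched_files(release_1, release_2)["percent_touched"])  # 75.0
```

Because additions are counted in the numerator while the denominator is the size of the previous release, the percentage can exceed 100 when many files are added at once.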
6. Patterns in structural evolution
6.1. Evolution of size
In this section we briefly summarize our findings with regard to the evolution of source code, correlations between different measures and the composition and structure of source trees in the 25 OSS systems studied. The size and the length of the evolution period studied for each of these systems are presented in Table A1 in the appendix.
Whilst visualizing the code evolution of the 25 systems studied, one interesting invariance emerged when plotting the average size of source files as a function of the system size. In almost all the cases (except for the IMLIB system), the average size of source files is not greater than about 20 kilobytes. In Figure 2, we observe that almost all projects stabilize the average size of their source files, although in general their total size grows over releases. There are particularly interesting cases in which these stabilization points are reached after a digressive trend (plots not shown here due to space limitations).
[Plot: "Project size vs average source file size"; x-axis: project latest size [KB]; y-axis: average source file size [KB] at latest state]
Figure 2 - Average size of individual files as a function of total size of the system, measured at the most recent release
If we perform the same analysis for the second basic source component (source folders), we observe that the average source folder size, measured by adding up the size of the files located in each folder and then taking the average over all folders in the same system, varies widely amongst systems (Figure 3). In the plot one can observe, close to the origin, the smallest system of the sample, whose size is around 800 LOCs at the most recent release.
[Plot: "Project size vs latest average folders size"; x-axis: project latest release size [KB]; y-axis: latest state source folders average size]
Figure 3 - Average folder size, as a function of the total size of the system, both measured at the most recent release
Figure 3 is more scattered than Figure 2, which implies that there is a higher variability in the average number of files per folder in the systems studied. This suggests that the study of source folder evolution provides an orthogonal, complementary view to that provided by studying source file evolution only. Figure 4 presents the plot of system size in kilobytes versus the size in number of files, both measured at the latest release: there seems to be a linear relationship, which is what one would have expected from the roughly stable average size of files displayed in Figure 2.
[Plot: "Projects size vs. number of files"; x-axis: project's latest state size [KB]; y-axis: latest state nr. of files]
Figure 4 - Project size as a function of number of files at the most recent release
6.2. Evolution of code structure
When we observe the evolution of the folder structure, some recurring patterns can be recognized. In a first attempt to categorize these patterns, we were able to identify three main cases. Here we briefly describe all of them, while in the next sections we present some illustrative exemplars of each of these three types. Before discussing the types, we need to briefly introduce the notion of articulated source tree. By articulated source tree we mean a tree which consists of at least two levels, which in turn implies the presence of at least one sub-branch in the source folder structure. The three structural patterns which emerged are the following:
1. Horizontally expanding: a first pattern is characterized by the early presence of an articulated source tree at the first release available for study. The articulated tree continues to exist during the subsequent releases; no vertical growth is observed (that is, the number of levels does not grow), but there is horizontal growth. We observed this pattern in 10 out of 25 analyzed projects.
2. Vertically shrinking: a second pattern is characterized by an initial articulated source tree which evolves into a source tree with a smaller number of levels. This vertical shrinking is not in general accompanied by horizontal shrinking: in other words, some levels get lost in the evolution of the source tree (vertical dimension), but we do not observe a decrease in the number of source folders (horizontal dimension). We observed this pattern in 4 out of 25 projects.
3. Vertically expanding: a third recognized evolution pattern starts with a simple tree structure which then evolves by adding at least one level. We observed this in 11 out of 25 projects. In the majority of the cases the pattern followed is a vertical expansion from an early articulated source tree. However, there are 3 systems from this set of 11 whose first observation was a simple source tree (consisting of 1 level only), which in turn evolved into an articulated one.
It is worth noting that a horizontally shrinking pattern did not emerge in any of the systems studied. That pattern simply did not exist in the dataset.
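As a rough illustration of how the three patterns above can be told apart, the following Python sketch labels a system from its series of maximum folder depths, one value per release. It is a deliberate simplification and not the procedure followed in this study: the function name is hypothetical, only the first and last observed releases are compared, and the depth series in the examples are made-up values.

```python
def classify_pattern(max_depths: list[int]) -> str:
    """Label a system from its per-release maximum number of folder levels.

    Simplifying assumption: only the first and the last release are compared,
    so levels that are added and later discarded do not change the label.
    """
    if not max_depths:
        raise ValueError("at least one release is required")
    first, last = max_depths[0], max_depths[-1]
    if last > first:
        return "vertically expanding"
    if last < first:
        return "vertically shrinking"
    # Same number of levels at both ends: the first pattern applies only to an
    # articulated tree (at least two levels) at the first available release.
    return "horizontally expanding" if first >= 2 else "unclassified"


print(classify_pattern([4, 4, 5, 4]))  # horizontally expanding
print(classify_pattern([7, 7, 6, 5]))  # vertically shrinking
print(classify_pattern([1, 2, 3]))     # vertically expanding
```

A fuller classification would also check the number of source folders per release, since the first pattern additionally requires horizontal growth.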
6.3. Horizontally expanding
The first evolutionary pattern that we have identified is based on a structure whose vertical dimension remains constant over the entire observed evolution of the application: we observe, in general, a horizontal growth of new branches and leaves, but there is no growth in the vertical dimension, that is, the maximum depth keeps the same value. In some specific cases, new vertical levels were added in the evolution of the system, but then they were discarded in later releases (e.g. the Grace system). In the following sub-sections, we analyze a subset of the systems which display this first pattern, and we give some background information on their evolution in order to better understand and interpret the observed behaviors.
6.3.1. ARLA
The ARLA project made available its first public release in February 1998, and its most recent release is labeled 0.35.12 (February 2003). 35 major releases were developed. 62 total releases are made available through its web sites, which thus include 27 minor releases. The ARLA project's main purpose is to achieve functionality similar to that of the IBM AFS file system. It is likely that ARLA has currently achieved even more functionality than AFS. Its application domain is distributed file systems management, a domain in which a lot of knowledge is available and openly shared. In this respect, this system is similar to flagship OSS successes (such as Linux or Apache). In ARLA's evolution there have been two basic ways of enhancing and evolving the system: adding common features to the system (e.g. support for specific network protocols), and adding ports so that the system supports different architectures.
Observing its folder makeup, as measured by the number of files per folder level (Figure 5), we observe that the majority of the files have been located at Levels 2 and 3. Level 4 experienced a sudden midlife increase at around release 25, accompanied by a sudden decrease at Level 3. Several new folders were added at Level 4, and others were moved from other parts of the system, which also significantly affected Level 3.
[Plot: "ARLA - growth of levels"; x-axis: releases; y-axis: nr. of files per level; series: Level-1, Level-2, Level-3, Level-4]
Figure 5 - Number of files per level for the ARLA system
In Figure 6, we observe the makeup of the evolution of ARLA as depicted by the total number of source files and the touched files over releases. The growth trend can be interpreted as two segments of decaying growth rate with a midlife growth regeneration point at about release 32. The trend presents similarities with those observed in commercial systems [Ramil and Smith 2002]. The plot of files per level in Figure 5 is useful for identifying the midlife growth rate regeneration points, and Figure 5 suggests that such regeneration in growth rate was linked to a restructuring of the system. In Figure 6 the number of files touched per release presents only one major peak at release 50, that is, around ¾ of the system's life-cycle (95 percent or so of the size of the system at the previous release was touched), while all other peaks of files touched do not go beyond 60 percent. Except for the outlier around release 50, one can observe a predominantly decreasing trend with a superimposed oscillation in this attribute. The peaks correspond to the major releases. In the case of the ARLA system, the decreasing growth rate in the last third of its evolution history can be linked to a possible move of the system into a “servicing stage” [Rajlich and Bennett 2000], [Nakakoji et al 2001], as revealed by the declining evolution rate, suggested by the decreasing trend in the proportion of files touched.
[Plot: "ARLA - adaptations"; x-axis: releases; left y-axis: number of files; right y-axis: adaptations (files touched), 0% to 100%; series: total # of files, adaptations]
Figure 6 - ARLA evolutionary trends: total size and files touched per release
6.3.2. KSI
The KSI project is aimed at building a lightweight implementation of the Scheme programming language and interpreter [Scheme]: therefore, it counts on an existing system's knowledge base to build a portable and embedded environment for the Scheme executables. The CVS archives do not include the very first releases of the system. This means that we are limited to studying the most recent evolution of this system. This issue typically emerges in empirical studies of software evolution in which release 1 in the data set does not correspond to the actual very first release², because the oldest data has been deleted or is unavailable. In spite of this, the study of the available subset of releases is meaningful, since we were able to access data for the most recent 12 releases of KSI, which span an interval of 860 days. We were also able to identify 3 cycles of major releases (3.2, 3.3 and 3.4), and we also noticed that this system is the only one in the data set that shrinks in size from the earliest to the latest available release (from 111,288 LOCs to 100,157 LOCs). In this particular system Level 1 contains source files used for building the package only, while Level 2 and Level 3 hold nearly the whole code of the application.
It is interesting to observe the disposition of folders in a graphical fashion: since few folders are involved, the visualization of this system's tree structure is easier and clearer than for the larger systems of our study database. Figure 7 shows the disposition of source folders in the earliest release available: each ellipse is a source folder, and all source folders at the same level have the same associated source level number (rectangles on the left of the figure). In Figure 7, the edges between two folders are annotated with the number of files contained in the lowermost side of the connection. For example, close to the connection “ksi-3.20” -> “gc”, the label 59 means that the “gc” folder contains 59 source files.
Figure 7 - KSI earliest folder structure
Figure 8 displays the KSI folder structure for the most recent release available for study: the reduction in size in LOCs is reflected in a reduced number of both source files and source folders. Some branches were pruned away, and the whole design has become more compact, in the sense that the number of folders at Level 2, Level 3 and Level 4 has been reduced. The profile of files touched between releases for KSI, displayed in Figure 9, shows peaks which can be related to the major releases, as in other systems.
Figure 8 - KSI folder structure at most recent release.
A possible cause for the declining growth behavior of KSI, as seen in Figure 9, is the possible existence of memory or performance constraints in the target hardware which limit further growth of the system. Another, more fundamental explanation would be that KSI is not an E-type system [Lehman and Belady 1985] in the strict sense, since it addresses a problem (the implementation of Scheme) which can be sufficiently well specified as to consider it an S-type system. In this regard, the behavior of KSI is the one that one may expect to see for compilers and other precisely specified programs, which are closer to the S-type than to the E-type.
[Plot: "KSI - adaptations"; x-axis: releases (1 to 12); left y-axis: total number of files; right y-axis: files touched [%], 0% to 100%; series: number of files, adaptations]
Figure 9 - KSI evolutionary trends: total size and files touched per release
² In this and the remainder of the figures and text of this paper, release 1 does not necessarily correspond to the first release of the system, and release should be read as release sequence number.
6.3.3. Ganymede Ganymede has been initially developed and evolved by academic staff, and includes both an application/database server, and its client. We were able to recover data on its first 12 releases. This data set includes, we believe, its very first release. Only one major release cycle is recognizable (series 1.0) for this system so far. This is reflected both in the size change, and in the structure changes, which are all relatively small. Rather than showing the graph of changes, we present a table (Table 1) with data displaying the number of files per levels for this application. We observe very few additions or deletions of source components, even though the application has grown from 221.893 LOCs to 229.110 LOCs over its lifetime. There's no significant evolution with regards to structural changes: all the components seem to remain in the same place within the folder structure, and few additions are made available during the system lifetime. One possible explanation for tis emerges by going through the Changelog of this system: all developers and code contributors appear to be a small group, with few new contributions coming from outside this group. One of the challenges in software evolution, which is particularly evident is OSS, is the difficulty for outsiders or new contributors to assimilate and comprehend, sometimes massive, amounts of code which has been generated in a closed environment. If we observe the latest available release, we realize that it's dated November 2002, which means that neither new features, nor modifications have been released since then. A possible interpretation is that the feedback coming from the open source external
3
3
3
3
3
274
274
275
275
276
277
14
14
14
14
14
14
3
2
3
3
3
3
3
273
268
273
273
274
274
Lev4
14
14
14
14
14
14
Lev5
76
75
76
77
77
77
77
77
77
77
77
77
Lev6
95
95
95
95
95
95
95
95
95
95
95
95
Lev7
12
12
12
12
12
12
12
12
12
12
12
12
Table 1 - Ganymede source levels evolution When we plot code additions and modifications, presented in Figure 10, we observe that, while this system is growing very slowly, code adaptations (expressed in percentage of files touched per release) are dispersed throughout the code, and represent on average 70 percent of the whole system for each release. We may conclude that, in this particular case, functional enhancement has been very limited by lack of sufficient external feedback, while “servicing” of the application [Rajlich and Bennett 2000], [Nakakoji et al 2001] was conducted by a small group of developers. 100%
450
90%
400
80%
350
70%
300
60%
250
50%
200
40%
150
30%
100
20%
50
10%
0
fi l es tou ch ed [ %]
total number of files
Ganymede - adaptations 500
Gwydion-Dylan - growth of levels 600 550 500
2
3
4
5
6
total files
7
8
9
10
% files touched
11
12
releases
Figure 10 - Ganymede evolutionary trends: total size and files touched per release
400
Level-1 Level-2 Level-3 Level-4 Level-5 Level-6 Level-7
350 300 250 200 150 100 50 0
releases
Figure 11 - Number of files per level for Gwydion-Dylan
6.4. Vertically shrinking
The second evolutionary pattern is based on a structure which becomes less articulated over the observed evolution. This means that some branches are pruned from the source tree, so that the overall number of vertical levels is lower than in the initial observations. As we did for the first pattern, we present below a subset of the systems which display this pattern; some background information on their evolution is given in order to better understand and interpret the observed behaviors.
6.4.1. Gwydion-Dylan
Gwydion-Dylan is an object-oriented compiler supporting rapid application development, and aiming to become a complete development environment. We observe 21 subsequent releases for this system, but they do not represent its whole life-cycle, since its earlier evolution is not available for study, either in the form of releases or in CVS storage. The available releases reflect 4 cycles of major releases, spanning 1673 days. We observe in Figure 11 that the first available data point is composed of 7 nested levels, which were most likely built up progressively through a previous series of releases for which we do not have data.
The folder structure of the most recently observed release is composed of only 5 levels. The number of source folders and files grows proportionally with the evolution of the code (at its earliest stage: 64 source folders, 607 source files; at its latest stage: 137 source folders, 1147 source files). A midlife restructuring of the system is clearly observable in Figure 11 after release 11, which can explain the increase in growth rate experienced by the system during the last half of the observed sequence of releases (Figure 12). The behavior of the proportion of files touched for this system is also displayed in Figure 12.
communities was not sufficiently strong as to guarantee a sustained evolution of the application.
Figure 12 - Gwydion-Dylan evolutionary trends: total size and files touched per release
6.4.2. Gist
Gist is a set of tools for building dynamic web sites. We have access to the whole history of this project (20 releases), which enables us to observe how growth and change are related from the first release onwards. There are 4 cycles of major releases for this system, which are clearly noticeable when one plots code changes over releases. In the profile of code adaptations (Figure 13) we observe peaks corresponding to major releases. These peaks are quite noticeable: during these releases, more than 80 percent of the files are touched. Like the Ganymede system analyzed before, this system is mainly developed by a small and stable group of developers, so we were interested in finding out whether trends similar to those of Ganymede would appear. However, as shown in Figure 13, the size of the Gist code is not roughly constant, as it is in Ganymede: several shrinks in size, both in overall LOCs and in the number of source files, are visible at different points of the system's life-cycle, while the overall trend indicates an increasing size with a declining growth rate. The growth from the first to the last release is about 20 percent, which is a low evolution rate in comparison to other systems in the dataset. What is also interesting from the point of view of the folder structure is that it is simple at both the initial and the most recent releases of the system: growth seems to proceed on a horizontal basis, while the vertical dimension seems to shrink as new horizontal folders are added. In the latest available release, over 30 folders sit at the same nesting level, while in the earliest release we found around 20. For the Gist system, we are able to conclude that the horizontal growth of folders was effective in the evolution of the system.
Figure 13 - Gist evolutionary trends: total size and files modified per release (plot "GIST - adaptations"; x axis: releases; y axes: number of files and files touched [%])
6.5. Vertically expanding
The third evolutionary pattern is based on a structure that expands during the observed evolution of the application: this means that new branches are added in one or more sections of the tree, and new vertical levels appear. Horizontal levels may be added as well, but we did not find a clear relation between the two dimensions. Two case studies are analyzed in the following sections, and additional information, besides size and structure, is provided in order to gain insight into the observed pattern.
6.5.1. Lcrzo
Lcrzo is a shared library for developing network applications. Its functionality consists of a common framework for nearly all network protocols (Ethernet, TCP, and so on). In Figure 14 we depict the evolution trend of the application over 1400 days of its evolution: we were able to access 56 releases of this system, and these represent, we believe, its whole evolution history. We have been able to recognize four cycles of major releases (1.0, 2.0, 3.0 and 4.0).
Figure 14 - Number of files per level for LCRZO system (plot "LCRZO - growth of levels"; x axis: releases; y axis: # of files per level; series: Level-2 to Level-5)
When investigating source levels, we found 4 different vertical levels. Two sudden jumps in the number of files at Level 2 suggest that something significant happened twice in the operational lifetime of this system. We then investigated these jumps using the tree structure perspective. In Figure 15 we depict the structure of the source folders before the first jump (release 22): there are 5 source levels. The number of files contained in each level represents the relative weight of that level in the overall structure.
Figure 15 - LCRZO code tree at release 22, before the first jump
Figure 15 displays the folder structure at the last release of the 2.0 cycle of releases. The subsequent release is the first of the 3.0 cycle, so we may assume that some stable status had been reached when it was made available. We see in Figure 16 that some branches were pruned: in a major release, unstable features are typically excluded. What is more, the folder “example” in Level 2 gets filled with some 28 KLOCs of new source code. These files are models, skeletons and schemas of potential new applications which can be implemented by using this library. In a sense, they provide the community with an entry point for new development (they show potential users what the system may be able to do for them).
Figure 16 - LCRZO state at the 23rd release (after first jump)
We also see (Figure 17) that after a series of releases the code structure is changed again, after which two different levels seem to grow in parallel. The first is dedicated to the code of the application, the second acts as a sort of incubator for new features: in the latest instances of this system we observe new levels, that is, vertical branches in the code structure.
Figure 17 - LCRZO most recent state
Observing the plots for size and change activity (Figure 18), we note that this system underwent many rewritings and large adaptations. This is interesting because here large adaptations occur between minor releases, not only in major releases.
Figure 18 - LCRZO evolutionary trends: total size and files touched per release (plot "LCRZO - adaptations"; x axis: releases; y axes: total files and % files touched)
6.5.2. Vovida SIP stack
Vovida is the system which has experienced one of the largest size deltas in the data set (13 KLOC to 650 KLOC) from the first to the most recent available release. Vovida is an open source application that implements the SIP (Session Initiation Protocol) stack for multimedia sessions. It is a particularly interesting application from the point of view of the growth of levels: we have been able to access the entire lifetime of this system, and it evolved through nesting several levels (from a single level in the first release to 8 levels in the latest available one).
Figure 19 - Number of files per level for VOVIDA system (plot "Vovida Sip Stack - growth of levels"; x axis: releases; series: Level-1 to Level-8)
In Figure 19 we can observe that, starting with Level 1, then Level 2 and so on, all levels were added at different moments in time, together with new source files and folders. That is to say, a massive amount of evolution effort has been put into this application in order to add new features and functionality. As part of this large effort in code additions, we observed that the source adaptation trend displays high peaks, reflecting the high evolution rate of this system (Figure 20). The rapid evolution rate can be linked to a dynamic and growing community of developers.
Figure 20 - VOVIDA evolutionary trends: total size and files modified per release (plot "VOVIDA - adaptations"; x axis: releases; y axes: number of files and files touched per release [%])
7. Conclusions
We have analyzed in this paper the evolution of 25 OSS systems, possibly the widest and largest dataset in a study of this kind. The systems studied range in size from 700 LOCs to 700 KLOCs and represent a diverse set of application domains. 20 out of 25 may be categorized as large-sized systems, given that their size at the most recently observed release is greater than 100 KLOCs. These systems are a subset taken from a version history database of OSS systems which we have collected for our research. The systems in this database were randomly extracted from a popular software repository dedicated to open source. In this particular study we have sought to identify interesting patterns in the evolution of these systems, with a focus on the source code. Our aim is to better understand the evolution of OSS systems and to relate traditional analyses, such as plotting growth trends, to the visualization of the evolving folder structure. In particular, we are interested in topological patterns, that is, when and how new source components are added, and how they relate to existing components and to the existing overall structure. In this work, we define a “source file” as each single file containing source code, and a “source folder” as each directory containing at least one source file. Our first result (Figure 2) shows that there is a stabilization point in the average size of the source files, between 5 and 20 KB. However, when investigating the relation between the average size of the source folders and the size of the system, we find no apparent correlation. This suggests that the number of source files and the number of source folders provide two complementary views. This is an improvement on previous studies, which have been based on source file counts and related metrics but have not considered the folder structure.
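The definitions of “source file” and “source folder” given above can be turned into a small measurement sketch. It is not the tooling used in the study: the set of extensions is an assumption (the text does not list the ones considered), a folder is counted only when it directly holds a source file, and the root of the tree is taken as level 1. The names `SOURCE_EXTENSIONS` and `folder_structure` are hypothetical.

```python
import os
from collections import Counter

# Assumed set of source-file extensions; the paper does not list the ones it used.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".java", ".py"}


def folder_structure(root):
    """Return (source file count, source folder count, files per level).

    Level 1 is the root folder itself; each nested folder adds one level.
    Assumption: a folder is a source folder only if it directly holds a source file.
    """
    files_per_level = Counter()
    source_folders = 0
    root = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        n_source = sum(1 for name in filenames
                       if os.path.splitext(name)[1].lower() in SOURCE_EXTENSIONS)
        if n_source == 0:
            continue  # not a source folder under the assumed definition
        rel = os.path.relpath(dirpath, root)
        level = 1 if rel == os.curdir else rel.count(os.sep) + 2
        files_per_level[level] += n_source
        source_folders += 1
    return sum(files_per_level.values()), source_folders, dict(files_per_level)
```

Running this on the unpacked tree of each release would yield one row of files-per-level counts per release, the kind of data plotted in the "growth of levels" figures.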
We analyzed the structure of the source folders, visualizing them as a tree containing branches (source folders) and leaves (source files). In doing so, we have been able to distinguish three main evolution patterns, related to how the folders evolve along the vertical and horizontal dimensions. The first pattern is based on an invariant code structure on the vertical dimension: we observed this pattern in 10 of the 25 systems analyzed. Looking more closely at these 10 systems, we realized that in three of them the system was given a structure before its first public release: in other words, a core group was in charge of developing it before it became publicly available. In general, when one is studying patterns in software evolution, the smallest systems are likely to display a less disciplined evolutionary behavior, driven by the decisions and actions of a small group of developers. In contrast, larger projects are more likely to exhibit evolution dynamics of their own, for reasons that have been discussed in the literature [Lehman and Belady 1985]. In our study, depending on the system, we observed faster growth in some cases and slower growth in others. We then tried to identify why this was so by looking at details of the development process, and we found higher growth rates for the systems in which it was easier for potential contributors to join, that is, where more feedback was available (the ARLA system could take advantage of several added developers, while Ganymede could not). Furthermore, the percentage of files touched per release shows few peak values, nearly all of them corresponding to important releases. These peaks range on average between 70 and 90 percent. One interesting case was also described (Ganymede), where very few new components were added, while on average 70 percent of the existing ones were touched in every release. This behavior requires further investigation.
The second recognizable pattern is when the vertical dimension grows. We had initially expected this to be the predominant pattern emerging from our analysis, but we found it in only 10 systems out of 25. What is more, several of these also underwent some shrinks and expansions in the depth of the code tree. The trend of files touched per release for this class of systems has in general higher peaks than in the first pattern, partly because new components are added. Several peaks around 100 percent (LCRZO and VOVIDA systems) and around 80 percent (VOVIDA system) can be observed.
A third, less frequent, pattern also emerges, in which the vertical dimension shrinks. The profile of files touched per release here lies between those of pattern one and pattern two; remarkable peaks in the range of 90 to 100 percent are recognizable when major releases are prepared (Gwydion-Dylan and GIST), but these peaks are sporadic and rarely recur.
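As a rough first pass, the three patterns can be approximated by comparing the depth of the folder tree at the first and last observed release (the "Depth ini" and "Depth fin" columns of Table A1). This is only a sketch under that simplifying assumption: the study looks at the whole sequence of releases, and several systems shrink and expand in between, so an endpoint comparison need not reproduce the classification reported in the paper. The function name is hypothetical.

```python
def classify_by_depth(depth_ini, depth_fin):
    """Label a system from the depth of its folder tree at first and last release.

    Simplification: only the two endpoints are compared; intermediate
    restructurings are not visible to this function.
    """
    if depth_fin > depth_ini:
        return "vertical expansion"
    if depth_fin < depth_ini:
        return "vertical shrinking"
    return "horizontal expansion"  # vertical structure unchanged


# Depth values as listed in Table A1.
print(classify_by_depth(4, 4))  # Arla: horizontal expansion
print(classify_by_depth(6, 5))  # Gwydion-Dylan: vertical shrinking
print(classify_by_depth(1, 6))  # Lcrzo: vertical expansion
```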
In our future work we plan to refine the identification of patterns of structural evolution by considering metrics which reflect both the evolution of the horizontal and vertical dimension of the code structure, and relate this to other system characteristics by applying cluster analysis.
8. References
[Aoki et al. 2001] Aoki A., Hayashi K., Kishida K., Nakakoji K., Nishinaka Y., Reeves B., Takashima A., and Yamamoto Y., “A Case Study of the Evolution of Jun: an Object-Oriented Open-Source 3D Multimedia Library”, Proc. 23rd Intl. Conference on Software Engineering, ICSE 23, Toronto, Canada, 12-19 May 2001, pp. 524 – 533
[Anton and Potts 2001] Anton A. and Potts C., “Functional Paleontology: System Evolution as the User Sees It”, Proc. 23rd ICSE, Toronto, Canada, 12-19 May 2001, pp. 421 – 430
[Antoniades et al 2003] Antoniades P., Samoladas I., Stamelos I., Bleris G.L., “Dynamical simulation models of the Open Source Development process”. To appear in Free/Open Source Software Development, Stefan Koch (ed.), Idea Group, Inc.
[Barry et al 2003] Barry E.J., Kemerer C.F., and Slaughter S.A., “On the Uniformity of Software Evolution Patterns”, Proc. ICSE 25, Portland, Oregon, May 3 – 10, 2003, pp. 106 – 113
[Basili et al 1996] Basili, V. R. et al, “Understanding and Predicting the Process of Software Maintenance Releases”. Proc. 18th ICSE, Berlin, March 25 – 29, 1996, pp. 464 – 474
[Belady et al 1976] Belady L.A, Lehman M.M, “A Model of Large Program Development”, IBM Systems J., vol. 15, no. 1, 1976, pp. 225 – 252.
[Capiluppi 2003] Capiluppi A., “Models for the evolution of OSS projects”, Proc. of the 7th International Conference on Software Maintenance, ICSM, Amsterdam, September 22 – 26 2003, pp. 65 – 74.
[Capiluppi et al 2003] Capiluppi A., Lago P., Morisio M., “Characteristics of Open Source Projects”, Proc. of the 7th European Conference on Software Maintenance and Reengineering, CSMR, March 26 – 28 2003, pp. 317 – 327.
[Chapin et al 2001] Chapin N., Hale J.E., Khan K.M., Ramil J.F. and Tan W.G., “Types of Software Evolution and Software Maintenance”, Journal of Software Maintenance and Evolution: Res. and Practice, 13(1), January-February, pp. 1 – 30, 2001
[Curtis et al 1979] Curtis B., Sheppard S.B., Milliman P., Borst M.A. and Love T., “Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics”, IEEE Trans. on Softw. Eng., 5(2), 1979, pp. 96 – 104
[Di Lucca et al 2000] Di Lucca G.A. et al, “Recovering Class Diagrams from Data Intensive Legacy Systems”, Proc. ICSM 2000, 11 – 14 Oct. 2000, San Jose CA, pp. 52 – 63
[El-Emam et al 2000] K. El-Emam, S. Benlarbi, N. Goel, W. Melo, H. Lounis, and S. Rai, “The Optimal Class Size for Object-Oriented Software: A Replicated Study”, National Research Council of Canada, NRC/ERB 1074, 2000.
[German 2003] German D., “Using software trails to rebuild the evolution of software”, International Workshop on Evolution of Large-scale Industrial Software Applications (ELISA), 23 September 2003, Amsterdam, The Netherlands, http://prog.vub.ac.be/FFSE/Workshops/ELISA-Workshop.html (as of Sept. 2003)
[Godfrey and Tu 2000] Godfrey, M., and Tu Q., “Evolution in Open Source Software: A Case Study”. Proc. of 2000 ICSM, October 11-14 2000, pp. 131 – 142
[González-Barahona et al 2001] González-Barahona J.M., Ortuño-Pérez M. A., de las Heras-Quirós P., Centeno-González J., Matellán-Olivera V., “Counting potatoes: The size of Debian 2.2”, http://people.debian.org/~jgb/debian-counting/countingpotatoes-0.2/ (as of June 2004)
[Graphviz] Graphviz - open source graph drawing software, http://www.research.att.com/sw/tools/graphviz/
[Kemerer and Slaughter 1999] Kemerer, C.F., and S. Slaughter, “An Empirical Approach to Studying Software Evolution”. IEEE Transactions on Software Engineering, 1999, 25(4), pp. 493 – 509.
[Koch and Schneider 2000] Koch S., Schneider G., “Results from Software Engineering Research into Open Source Development Projects Using Public Data”, in “Zum Tätigkeitsfeld Informationsverarbeitung und Informationswirtschaft”, Hans R. Hansen und Wolfgang H. Janko (eds.), Nr. 22, Wirtschaftsuniversität Wien, 2000.
[Lehman 1969] Lehman M.M., “The Programming Process”, IBM Res. Rep. RC 2722, Dec. 1969: 46 pp. Also as Chapter 3 in [Lehman and Belady 1985]
[Lehman 1974] Lehman M.M., “Programs, Cities, Students, Limits to Growth?”, Inaugural Lecture, in Imperial College of Science and Technology Inaugural Lecture Series, v. 9, 1970, 1974, pp. 211 – 229. Also in Programming Methodology, Gries D (ed.), Springer Verlag, 1978, pp. 42 – 62. Reprinted as Chapter 7 in [Lehman and Belady 1985]
[Lehman 1980] Lehman M.M, “Programs, Life Cycles, and Laws of Software Evolution”, Proc. Special Issue Software Eng., IEEE, vol. 68, no. 9, 1980, pp. 1,060 – 1,076
[Lehman and Belady 1985] Lehman M.M. and Belady L.A., (eds.) Program Evolution – Processes of Software Change, Academic Press, London, 1985
[Lehman et al 1997] Lehman M.M., J.F. Ramil, P.D. Wernick, D.E. Perry, and W.M. Turski, “Metrics and Laws of Software Evolution - The Nineties View”, Proc. Fourth Intl. Software Metrics Symp., Metrics '97, Albuquerque, N.M., 1997, pp. 20 – 32
[Lehman et al 1998] Lehman M. M., D. E. Perry, and J. F. Ramil, “Implications of evolution metrics on software maintenance”, Proc. of the 1998 ICSM 98, Bethesda, Maryland, Nov. 1998, pp. 208 – 217.
[Mockus et al 2002] Mockus A., Fielding R.T., Herbsleb J.D., “Two Case Studies of Open Source Development: Apache and Mozilla”. In ACM Transactions on Software Engineering and Methodology, Vol. 11, No. 3, 2002, pp. 309 – 346.
[Nakakoji et al 2002] Nakakoji K., Yamamoto Y., Nishinaka Y., Kishida K., Ye Y., “Evolution Patterns of Open-Source Software Systems and Communities”. In Proceedings of International Workshop on Principles of Software Evolution (IWPSE 2002), Orlando, Florida, 19 – 20 May, 2002, pp. 76 – 85
[Nikora and Munson 2003] Nikora A.P. and Munson J.C., “Understanding the Nature of Software Evolution”, Proc. ICSM 2003, 22 – 26 Sept., Amsterdam, pp. 83 – 93
[Rajlich and Bennett 2000] Rajlich V.T. and Bennett K.H., “A Staged Model for the Software Life Cycle”, IEEE Computer, July, 2000, pp. 66 – 71
[Scheme] The Scheme programming language, project available at http://www.swiss.ai.mit.edu/projects/scheme/ (as of June 2004)
[Shankland 2000] Shankland S., “Linux kernel release falls behind schedule”, available on-line at http://news.com.com/21001001-240061.html?legacy=cnetandtag=st.ne.1002.thed.1003200-1808165 (as of June 2004)
[Smith et al 2004] Smith N., Capiluppi A., Ramil J.F., 2004, “Qualitative Analysis and Simulation of Open Source Software Evolution”, in Proc. of the 5th Int. Workshop on Software Process Simulation and Modeling, May 24 – 25 2004, pp. 103 – 112
[Stamelos 2002] Stamelos, I., Angelis, L., Oikonomou, A., Bleris, G.L., “Code Quality Analysis in Open-Source Software Development”, Information Systems Journal, 2nd Special Issue on OS Software, 12(1), January 2002, pp. 43 – 60.
[XSCC] A tool for extraction source lines of code, http://members.tripod.com/vgoenka/unixscripts/xscc.html (as of June 2004)
APPENDIX
System | Files ini | Folders ini | Files fin | Folders fin | Kbs ini | Kbs fin | LOCs ini | LOCs fin | SLOCs ini | SLOCs fin | Depth ini | Depth fin | Time interval (days)
Arla | 321 | 31 | 658 | 69 | 1.831 | 4.091 | 63.663 | 162.218 | 40.009 | 108.838 | 4 | 4 | 1.820
Ganymede | 473 | 28 | 478 | 28 | 5.455 | 5.646 | 221.893 | 229.110 | 123.093 | 126.955 | 6 | 6 | 558
Gwydiondylan | 607 | 64 | 1.147 | 137 | 6.606 | 11.012 | 213.688 | 348.644 | 151.145 | 252.997 | 6 | 5 | 1.673
Ghemical | 586 | 12 | 555 | 12 | 6.426 | 6.716 | 217.463 | 226.769 | 171.998 | 180.159 | 4 | 4 | 454
Gimpprint | 7 | 1 | 136 | 14 | 305 | 2.206 | 11.156 | 80.567 | 9.172 | 61.895 | 1 | 3 | 1.304
Gist | 778 | 27 | 1.067 | 37 | 4.098 | 4.519 | 172.111 | 190.933 | 126.987 | 131.401 | 5 | 4 | 1.436
Grace | 91 | 4 | 310 | 14 | 2.025 | 4.428 | 73.691 | 157.919 | 63.423 | 113.668 | 2 | 2 | 2.730
Htdig | 136 | 16 | 511 | 24 | 441 | 3.926 | 21.300 | 153.722 | 14.529 | 102.621 | 3 | 5 | 2.451
Imlib | 27 | 4 | 36 | 4 | 2.631 | 2.692 | 52.651 | 55.839 | 50.300 | 53.163 | 2 | 2 | 1.277
Ksi | 259 | 19 | 191 | 14 | 2.933 | 2.708 | 111.288 | 100.157 | 81.681 | 75.561 | 4 | 4 | 860
Lcrzo | 19 | 3 | 235 | 9 | 197 | 3.658 | 6.409 | 109.323 | 4.955 | 70.517 | 1 | 6 | 1.435
Linuxconf | 586 | 46 | 1.347 | 117 | 2.475 | 6.104 | 103.498 | 239.223 | 82.810 | 191.594 | 4 | 4 | 2.028
Mit-scheme | 1.511 | 31 | 1.946 | 51 | 17.127 | 21.941 | 545.093 | 704.864 | 467.151 | 614.141 | 3 | 5 | 3.430
Motion | 2 | 1 | 28 | 1 | 7 | 160 | 239 | 6.836 | 204 | 5.901 | 2 | 2 | 1.281
Mutt | 120 | 2 | 201 | 6 | 1.131 | 2.391 | 48.640 | 96.415 | 37.477 | 70.171 | 2 | 3 | 2.032
Nicestep | 44 | 4 | 140 | 17 | 1.173 | 2.414 | 33.990 | 74.441 | 27.555 | 59.729 | 1 | 2 | 1.168
Parted | 52 | 6 | 122 | 16 | 417 | 1.354 | 16.911 | 51.907 | 12.431 | 38.720 | 3 | 3 | 1.405
Pliant | 227 | 37 | 641 | 94 | 1.255 | 4.270 | 36.347 | 116.947 | 28.868 | 101.363 | 5 | 5 | 1.845
Quakeforge | 396 | 17 | 696 | 58 | 3.815 | 5.696 | 172.946 | 233.534 | 123.234 | 175.377 | 3 | 5 | 1.268
Rblcheck | 1 | 1 | 7 | 5 | 2 | 19 | 104 | 772 | 68 | 447 | 1 | 3 | 1.493
Rrdtool | 113 | 10 | 153 | 26 | 1.926 | 3.025 | 86.138 | 128.211 | 68.695 | 102.298 | 3 | 4 | 1.634
Siagoffice | 42 | 5 | 322 | 18 | 356 | 3.618 | 15.386 | 137.504 | 13.743 | 108.254 | 2 | 2 | 2.594
Vovida Sip Stack | 49 | 1 | 2.618 | 135 | 13.307 | 19.809 | 13.307 | 665.749 | 7.406 | 398.938 | 1 | 6 | 1.309
Weasel | 16 | 1 | 36 | 2 | 142 | 511 | 4.449 | 17.591 | 2.629 | 11.924 | 1 | 2 | 834
Xfce | 207 | 12 | 450 | 69 | 1.323 | 8.450 | 46.808 | 277.423 | 35.317 | 225.736 | 2 | 3 | 1.662
Table A1 – Various size measures and length of evolution period studied (time interval) for the 25 OSS systems
NOTES:
• In the table header, “ini” indicates size measured at the first publicly available release, “fin” indicates size measured at the last publicly available release.
• Columns 2 to 13 represent various size measures.
• Column 14 represents the length of the period studied for each system, measured as the interval between the first and the latest available release.
• A period in a value is a thousands separator (for example, 1.147 files for Gwydiondylan corresponds to the 1147 source files quoted in Section 6.4.1).
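The values of Table A1 can be used to check the average source file size discussed in the conclusions. The sketch below rests on two assumptions: that the periods in the table are thousands separators (the text quotes 1147 source files for Gwydion-Dylan where the table reads 1.147), and that the "Kbs" columns give the total size of the source files in kilobytes. The helper names are hypothetical.

```python
def parse_count(text):
    """Parse a Table A1 value, treating '.' as a thousands separator (assumption)."""
    return int(text.replace(".", ""))


def average_file_size_kb(kbs, files):
    """Average source file size in KB, assuming 'Kbs' is the total source size."""
    return parse_count(kbs) / parse_count(files)


# Arla, first and last observed release (values copied from Table A1).
arla_ini = average_file_size_kb("1.831", "321")
arla_fin = average_file_size_kb("4.091", "658")
print(round(arla_ini, 2), round(arla_fin, 2))  # 5.7 6.22
```

Under these assumptions both Arla values fall inside the 5 to 20 KB band reported for the average source file size.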