Folder Structure Evolution in Open Source Software

Andrea Capiluppi, Dipartimento di Automatica e Informatica, Politecnico di Torino – ITALY, [email protected]

Maurizio Morisio, Dipartimento di Automatica e Informatica, Politecnico di Torino – ITALY, [email protected]

Juan F. Ramil, Computing Department, Faculty of Maths and Computing, The Open University, UK, [email protected]

Abstract

Predicting when and how a software system will evolve is one of the most fascinating challenges of software engineering. Whatever approach one uses to study such evolution, empirical studies, including observations of systems used in the real world and of their software processes, are needed in order to identify correlations, find recurring patterns, and eventually predict how systems are likely to evolve. In the empirical study presented in this paper, we take 25 software systems released as Open Source and observe their evolution. Our focus is not only on how much systems grow in size, but also on how their code structure is adapted and modified as they evolve. The main goal of this study is to recognize recurring patterns and practices used in evolving long-lived real world software systems. In our study we find three dominant patterns of code structure evolution in Open Source systems: horizontal expansion, vertical expansion and vertical shrinking. By studying exemplars of these three patterns in detail, one can identify the conditions under which a particular pattern is more likely to prevail than the others.

1. Introduction

The long term evolution of E-type software, that is, software systems which are actively used in real world domains and environments [Lehman and Belady 1985], is an important subject for empirical study, one that can lead to useful insights and applicable lessons for both researchers and practitioners. Toy systems and prototypes are also worth analysing, but the conclusions drawn from them have limited applicability to real world applications and domains. Empirical studies of real world software processes and products are limited by the kind of artifacts that an investigator is able to obtain and measure: proprietary systems are in general difficult to study, since the public disclosure of data about those systems tends to be restricted.

In this paper we use metrics derived from a number of open source¹ systems (OSS) in order to study the characteristics of their long term evolution and, in particular, how their internal structure evolves. Choosing OSS systems for studying software evolution is an advantage, since substantial amounts of data concerning software products and processes are available in freely accessible forms such as mailing lists, releases, configuration management repositories (such as the Concurrent Versions System, CVS), etc.

In this paper we study OSS systems from the point of view of their structural evolution. This involves the study of their enhancement, adaptation and restructuring. We are interested in finding patterns in the evolution of software code structure. Our data set is composed of 25 OSS systems, which we observed in a discrete-time perspective, that is, by studying each available release. Overall, the data set represents 992 releases or data points. We are interested in observing source code structure and its changes, in order to learn from long-lived OSS systems what types of structural patterns emerge and what structural changes are most frequently made to the source code, and also to look for patterns in the evolutionary trends.

Given that code structure and, in general, system architecture can be visualized using a variety of means, we focus on the simplest possible approach: the source folder structure. By folder we mean any directory in the code repository which contains source files. Our research goal is to understand how OSS projects evolve with regard to the internal structure of their source code. In future studies we plan to relate the source folder view of software structural evolution and other structural views (for example, those obtained through design recovery [Di Lucca et al 2000]) to factors such as size and type of application, effort spent on the evolution and the type of software process.

¹ The authors are aware of the distinction between Free and Open Source software. The distinction is relevant, for example, with regard to users' rights over software artifacts. In this paper, however, we use Free and Open Source software as synonyms.

2. Related work

Empirical studies on software development gained momentum after the pioneering work of Lehman and his collaborators on the evolution of the proprietary operating system OS/360 [Belady and Lehman, 1976]. The initial study observed 20 releases of OS/360. The results that emerged from that investigation, and from subsequent studies of other proprietary commercial software [Lehman, 1974, 1980], [Lehman and Belady, 1985], [Lehman et al, 1987], [Lehman et al, 1988], include the SPE program classification and a set of laws of E-type software evolution. The findings made in the seventies and eighties were refined and supplemented in the recent FEAST projects [Lehman et al 1998].

More recently, other researchers have studied the software evolution phenomenon. For example, Kemerer and Slaughter [1999] studied the evolution of two different proprietary systems using two approaches: one based on time series analysis, and the other based on sequence analysis. A study which identifies and categorizes software evolution patterns is also reported in [Barry et al., 2003].

During the last few years, it has been realized that OSS systems have an edge over commercial ones when it comes to availability of data: many studies have been carried out since the initial research involving the Apache web server and the Mozilla browser [Mockus et al. 2002]. More recent studies include those which examine single OSS projects [German 2003], [Koch and Schneider 2000], [Aoki et al. 2002], [González-Barahona et al 2001], [Stamelos et al 2002], [Godfrey and Tu 2000], and those which involve several systems [Capiluppi et al 2003], [Capiluppi 2003]. Even though the vast majority of OSS evolution studies are based on direct trend visualisation and curve fitting, interesting new approaches to the study of the evolution phenomenon have recently been proposed through both quantitative [Antoniades et al 2003] and qualitative [Smith et al 2004] simulation methods.

The work presented in this paper explores the evolution of code structure, a new dimension not covered in any of the above studies. In doing so, this work aims at complementing the understanding of OSS evolution.

3. Rationale

When investigating the code structure of various OSS systems, one may encounter different patterns of modification. If we consider code structure from the perspective of its organisation and storage (one example is depicted in Figure 1), it is possible to visualize the basic components (source files, source folders) as forming a tree whose root is the parent folder. When analyzing software evolution from a tree perspective, one distinguishes two dimensions:
1. vertical growth, that is, creating a sub-branch in an existing branch (upper part of Figure 1);
2. horizontal growth, that is, adding a new branch alongside an existing branch (lower part of Figure 1).

If we consider Figure 1 from a tree perspective, we may also state that any vertical growth adds depth to the code structure, i.e. a new level is nested under an existing level. The upper part of Figure 1 shows that the creation of folder F3 has introduced a nested level under the current level, which is composed of F1 and F2. Alternatively, as shown in the lower part of Figure 1, F3 can be added at the same level as F1 and F2, that is, without adding a new level.

The initial focus of the research reported here is based on Figure 1, and on the common assumption that evolution in software systems is generally implemented in an incremental fashion. Our aim is to understand whether source code trees have a common pattern of growth, and whether (and how) those patterns have an impact on the evolvability of the systems. In particular, we would like to assess a working hypothesis which is based on anecdotal observations by one of the authors. The hypothesis states that a system is likely to grow from an initial shallow tree, first by adding branches to the existing levels, and next by adding further levels. If this or any other common evolutionary pattern is supported by empirical observations, the next question would be why such a pattern occurs and whether it can be linked to other characteristics of the software and its related domains. Moreover, the empirical study of structural evolution can help us to identify, and even predict, when and how structural changes occur and whether this can be related to transitions between stages [Rajlich and Bennett 2000], [Nakakoji et al 2002] in the evolution of a software system.

This investigation of code structure evolution in OSS requires one to address the following research questions:
• How does the source tree evolve over time or releases?
• How does the depth of the source tree relate to code size?
• How does the code structure evolution relate to the rate of functional growth and change of a system?
• What common patterns emerge in source tree growth, given the horizontal and vertical perspective introduced in Figure 1 and in the above discussion?
• How could one, by visualizing the evolving code structure, distinguish functional enhancement and adaptation activities, usually the predominant effort during the evolution of source code, from refactoring and restructuring, also called anti-regressive activities [Lehman 1974]?

[Diagram: a parent folder with sub-folders F1 and F2. Upper part: F3 is created one level below the level formed by F1 and F2 (vertical growth). Lower part: F3 is added at the same level as F1 and F2 (horizontal growth).]

Figure 1 - Two possible modifications of code structure
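The two dimensions of Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names are hypothetical, folders are represented as '/'-separated paths relative to the parent folder, and a new folder is assumed to count as vertical growth only when it lies deeper than every folder of the earlier release. Nesting F3 under F2 (rather than F1) is also an arbitrary choice made for the example.

```python
from pathlib import PurePosixPath


def depth(folder: str) -> int:
    """Number of path components below the parent folder (F1 -> 1, F2/F3 -> 2)."""
    return len(PurePosixPath(folder).parts)


def classify_growth(before: set[str], after: set[str]) -> dict[str, list[str]]:
    """Split the folders added between two releases into vertical and horizontal growth.

    Assumption: a new folder is 'vertical' when it is deeper than any folder
    of the earlier release (it adds a level), and 'horizontal' otherwise.
    """
    old_max = max((depth(f) for f in before), default=0)
    added = sorted(after - before)
    return {
        "vertical": [f for f in added if depth(f) > old_max],
        "horizontal": [f for f in added if depth(f) <= old_max],
    }


initial = {"F1", "F2"}
upper = classify_growth(initial, {"F1", "F2", "F2/F3"})  # F3 nested under F2
lower = classify_growth(initial, {"F1", "F2", "F3"})     # F3 beside F1 and F2
print(upper)
print(lower)
```

Under these assumptions the upper case of Figure 1 is reported as vertical growth and the lower case as horizontal growth.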

4. Methodology

Our methodological approach can be summarized as the list of steps presented below. The list is not intended to be fully sequential, since some steps are intertwined and provide feedback to other steps:
1. Project selection: as reported in previous work [Capiluppi et al 2003], we have created a large database with data representing over 400 OSS systems, randomly selected from a popular OSS repository. Initially, we classified these systems based on a number of process and product characteristics. For the study of structural evolution we decided to focus on the larger systems, which involve more complex and richer folder structures. For the present study, we define as 'large' those systems composed of over 100 KLOCs of code. Furthermore, we extracted from the data set some smaller systems for which all the releases in the system's evolution were publicly available for investigation. In total, the sample for the present study includes 25 OSS systems, which is what we could investigate within the time and resources available.
2. Attribute definition and metrics derivation: since our focus is on measuring systems' evolution, we collect a set of metrics which includes the system's size, an indicator which is generally accepted as a surrogate for the functional power of the system. Section 5 provides a description of this and other attributes.
3. Parsing tools: automatic data extraction is key in the analysis of systems' evolution. In this study we used off-the-shelf, freely available utilities [XSCC] for counting lines of code. In addition, we built our own tools for parsing source trees (these tools are available to anyone who wishes to replicate this study). Next, we used the dot graphic tool [Graphviz] for drawing source trees from the data and, finally, a Perl script to quantify the number of changes made between subsequent releases.
4. Data analysis and pattern recognition: basic plots and visualisations were used as a means to identify recurring patterns.
5. Interpretation: in addition to observing (and recognizing) patterns, one needs to formulate possible explanations for them, based on existing literature (e.g. [Lehman and Belady 1985], [Rajlich and Bennett 2000]), new observations by the authors and hints provided by the documentation of the observed systems.

5. Definition of attributes

5.1. Source code size

The vast majority of studies on the evolution of software systems so far have involved one type or another of source code size metric [Lehman and Belady 1985], with only some exceptions [e.g. Anton and Potts 2001]. In this study we measured source code size in three different forms:
1. LOCs: the total number of lines of code, usually counted through off-the-shelf utilities (wc -l, for instance).
2. SLOCs: the total number of source lines of code, i.e. the LOCs remaining after blank lines and comments have been removed.
3. KBs: the size of a source file in kilobytes.
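The three size measures can be sketched for a single C-like file as follows. This is an illustrative approximation and not the counting utility used in the study: the function name `size_metrics` is hypothetical, and the comment handling is a deliberate simplification (only blank lines, `//` lines and `/* ... */` blocks that start a line are excluded from SLOC).

```python
import os
import tempfile


def size_metrics(path: str) -> dict[str, float]:
    """Return LOC, SLOC and size in KB for one C-like source file.

    Assumption: SLOC drops blank lines, '//' lines and '/* ... */' blocks
    that begin at the start of a line; trailing comments are not stripped.
    """
    loc = sloc = 0
    in_block = False
    with open(path, encoding="utf-8", errors="replace") as handle:
        for raw in handle:
            loc += 1
            line = raw.strip()
            if in_block:
                # Still inside a block comment; leave it when '*/' appears.
                if "*/" in line:
                    in_block = False
                continue
            if not line or line.startswith("//"):
                continue
            if line.startswith("/*"):
                in_block = "*/" not in line
                continue
            sloc += 1
    return {"loc": loc, "sloc": sloc, "kb": os.path.getsize(path) / 1024}


sample = "/* demo */\n\nint main(void) {\n    // nothing to do\n    return 0;\n}\n"
with tempfile.NamedTemporaryFile("w", suffix=".c", delete=False) as tmp:
    tmp.write(sample)
metrics = size_metrics(tmp.name)
os.unlink(tmp.name)
print(metrics["loc"], metrics["sloc"])  # prints: 6 3
```

A production counter would need a real tokenizer to handle comments that share a line with code, or comment markers inside string literals.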

5.2. Code Structure

5.2.1. Code components

Research has been done aiming at correlating various structural evolutionary metrics with fault and failure discovery rates [Nikora and Munson 2003], based on the view that evolutionary characteristics may be directly related to a few common evolution attributes measured at the file or module level. Other studies in search of patterns of software evolution have concentrated on metrics [Barry et al 2003] but not on visualizations of software structure.

The present study works at a coarser level of granularity by measuring attributes of the whole system. We deal with code structure in three different forms:
1. source files, that is, all files that are supposed to contain source code (e.g., “*.c”);
2. source folders, that is, directories containing at least one source file;
3. folder levels, that is, each level in the code structure where folders may be placed topologically.

Files, folders and levels together form a structure which may be interpreted as a simple architectural view of the system.

5.2.2. Folders level

Observing Figure 1, one would wish to know in which sequence the folders were added to the system. In order to investigate this, we use the term encapsulated to refer to a folder that is contained inside another one. Each encapsulation is associated with a specific depth inside the source file structure; therefore each encapsulation may be related to a depth attribute, which we call level. Our interest is therefore to analyze the characteristics of folder levels, and to observe maximum depths, the size of each level, patterns of growth, and break points in the evolution of source folder trees.
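As an illustration of these definitions, the sketch below counts source files per folder level for a directory tree. It is not the parsing tool built for the study: the name `files_per_level`, the extension list and the convention that the root folder is level 1 are all assumptions made for the example.

```python
import os
import tempfile
from collections import Counter

SOURCE_EXTENSIONS = (".c", ".h")  # assumption: what counts as a source file


def files_per_level(root: str) -> Counter:
    """Count source files per folder level; the root folder is taken as level 1."""
    root = os.path.abspath(root)
    counts: Counter = Counter()
    for folder, _subdirs, files in os.walk(root):
        n_source = sum(name.endswith(SOURCE_EXTENSIONS) for name in files)
        if n_source == 0:
            continue  # not a source folder: it holds no source file
        rel = os.path.relpath(folder, root)
        level = 1 if rel == "." else 1 + len(rel.split(os.sep))
        counts[level] += n_source
    return counts


# Build a small throw-away tree to exercise the function.
with tempfile.TemporaryDirectory() as tmp:
    for rel_path in ("main.c", "gc/alloc.c", "gc/alloc.h", "gc/os/unix.c", "doc/readme.txt"):
        full = os.path.join(tmp, *rel_path.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    levels = files_per_level(tmp)

print(dict(sorted(levels.items())))  # prints: {1: 1, 2: 2, 3: 1}
```

The maximum depth of the tree is then simply `max(levels)`, and running the function on each release in turn yields per-level series of the kind plotted later in the paper.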

5.3. Modification types

Different approaches for classifying maintenance and evolution activity have been proposed over the years, e.g. [Kemerer and Slaughter 1999], [Chapin et al 2001]. Applying these classification schemes in an empirical study involves considerable work. In this study we focused on two types of activity, which are relatively simple to observe by identifying which files have been added, modified or deleted between two releases, together with a measure derived from them:
1. source additions, i.e. the set of source files added between two subsequent releases or over a given period of time (e.g. one month);
2. source deltas, i.e. the set of files modified or deleted between two subsequent releases or over a given period of time (e.g. one month);
3. number of touched files (or files handled [Lehman and Belady 1985]), i.e. the cardinality of the union of source additions and source deltas.

The percentage of touched files at release (or period) j is calculated as the number of files touched at release (or period) j, divided by the total number of files present at the previous release (or period), j-1.
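The definitions above translate directly into set operations. The following sketch assumes each release is summarized as a mapping from file path to some content fingerprint (for instance a hash). The function name and the sample data are hypothetical, and the original study used a Perl script whose details are not given here.

```python
def touched_percentage(previous: dict[str, str], current: dict[str, str]) -> float:
    """Percentage of files touched at release j relative to release j-1.

    Both arguments map a file path to a content fingerprint.
    """
    additions = set(current) - set(previous)            # source additions
    deleted = set(previous) - set(current)
    modified = {p for p in set(previous) & set(current) if previous[p] != current[p]}
    deltas = modified | deleted                         # source deltas
    touched = additions | deltas                        # union of the two sets
    return 100.0 * len(touched) / len(previous)


# Hypothetical releases: b.c modified, c.c deleted, e.c added.
release_1 = {"a.c": "h1", "b.c": "h2", "c.c": "h3", "d.c": "h4"}
release_2 = {"a.c": "h1", "b.c": "h2x", "d.c": "h4", "e.c": "h5"}
print(touched_percentage(release_1, release_2))  # prints: 75.0
```

Because additions are counted in the numerator but the denominator is the size of the previous release, the percentage can in principle exceed 100 when a release adds many files. The sketch does not guard against an empty previous release.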

6. Patterns in structural evolution

6.1. Evolution of size

In this section we briefly summarize our findings with regard to the evolution of source code size, the correlations between different measures, and the composition and structure of source trees in the 25 OSS systems studied. The size and the length of the evolution period studied for each of these systems are presented in Table A1 in the appendix.

Whilst visualizing the code evolution of the 25 systems studied, one interesting invariance emerged when plotting the average size of source files as a function of the system size. In almost all cases (the exception being the IMLIB system), the average size of source files does not exceed 20 Kilobytes or so. In Figure 2, we observe that almost all projects stabilize the average size of their source files, even though in general their total size grows over releases. There are particularly interesting cases in which these stabilization points are reached after a digressive trend (plots not shown here due to space limitations).

[Plot: Project size vs average source file size. x-axis: Project latest size [KB]; y-axis: Average source file size [KB] at latest state]

Figure 2 - Average size of individual files as a function of total size of the system, measured at the most recent release

If we perform the same analysis for the second basic source component (source folders), we observe that the average source folder size, measured by adding up the sizes of the files located in each folder and then taking the average over all folders in the same system, varies widely amongst systems (Figure 3). In the plot one can observe, close to the origin, the smallest system of the sample, whose size is around 800 LOCs at the most recent release.

[Plot: Project size vs latest average folders size. x-axis: Project latest release size [KB]; y-axis: Latest state source folders average size]

Figure 3 - Average folder size, as a function of the total size of the system, both measured at the most recent release

Figure 3 is more scattered than Figure 2, which implies that there is a higher variability in the average number of files per folder in the systems studied. This suggests that the study of source folder evolution provides an orthogonal, complementary view to that provided by studying the evolution of source files only. Figure 4 presents the plot of system size in Kilobytes versus the size in number of files, both measured at the latest release: there seems to be a linear relationship, which is what one would have expected from the roughly stable average size of files displayed in Figure 2.

[Plot: Projects size vs. number of files. x-axis: Project's latest state size [KB]; y-axis: Latest state nr. of files]

Figure 4 - Project size as a function of number of files at the most recent release

6.2. Evolution of code structure

When we observe the evolution of the folder structure, some recurring patterns can be recognized. In a first attempt at categorizing these patterns, we were able to identify three main cases. Here we briefly describe all of them, while in the next sections we present some illustrative exemplars of each of the three types.

Before discussing the types, we need to briefly introduce the notion of an articulated source tree. By articulated source tree we mean a tree which consists of at least two levels, which in turn implies the presence of at least one sub-branch in the source folder structure. The three structural patterns which emerged are the following:
1. Horizontally expanding: the first pattern is characterized by the early presence of an articulated source tree at the first release available for study. The articulated tree continues to exist during the subsequent releases; no vertical growth is observed (that is, the number of levels does not grow), but there is horizontal growth. We observed this pattern in 10 out of 25 analyzed projects.
2. Vertically shrinking: the second pattern is characterized by an initial articulated source tree which evolves into a source tree with a smaller number of levels. This vertical shrinking is in general not accompanied by horizontal shrinking: in other words, some levels get lost in the evolution of the source tree (vertical dimension), but we do not observe a decrease in the number of source folders (horizontal dimension). We observed this pattern in 4 out of 25 projects.
3. Vertically expanding: the third evolution pattern starts with a simple tree structure which then evolves by adding at least one level. We observed this in 11 out of 25 projects. In the majority of the cases the pattern followed is a vertical expansion from an early articulated source tree. However, there are 3 systems from this set of 11 whose first observation was a simple source tree (consisting of 1 level only), which in turn evolved into an articulated one.

It is worth noting that a horizontally shrinking pattern did not emerge in any of the systems studied: that pattern simply did not exist in the data set.
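The three patterns can be approximated by a simple rule over per-release measurements. The sketch below is only one possible operationalization and is not the procedure followed in the study, where patterns were recognized from plots: it compares just the first and last observed release, and the function name and all sample numbers are hypothetical.

```python
def classify_pattern(levels: list[int], folders: list[int]) -> str:
    """Label a release history from its first and last observations.

    levels[i]  : number of folder levels at release i
    folders[i] : number of source folders at release i
    Assumption: intermediate releases are ignored, so levels that are added
    and later discarded do not change the label.
    """
    if levels[-1] > levels[0]:
        return "vertically expanding"
    if levels[-1] < levels[0]:
        return "vertically shrinking"
    if levels[0] >= 2 and folders[-1] > folders[0]:
        return "horizontally expanding"  # articulated tree, same depth, more folders
    return "unclassified"


# Hypothetical histories, one per pattern.
print(classify_pattern([4, 4, 4], [20, 26, 31]))   # horizontally expanding
print(classify_pattern([7, 6, 5], [60, 90, 130]))  # vertically shrinking
print(classify_pattern([1, 2, 3], [1, 4, 9]))      # vertically expanding
```

A rule of this kind is coarse: a history in which a level appears and then disappears again would need inspection of the intermediate releases, as the visual analysis in the following sections does.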

6.3. Horizontally expanding

The first evolutionary pattern that we have identified is based on a structure whose vertical dimension remains constant over the entire observed evolution of the application: we observe, in general, a horizontal growth of new branches and leaves, but there is no growth in the vertical dimension, that is, the maximum depth keeps the same value. In some specific cases, new vertical levels were added during the evolution of the system, but they were then discarded in later releases (e.g. the Grace system). In the following sub-sections, we analyze a subset of the systems which display this first pattern, and we give some background information on their evolution in order to better understand and interpret the observed behaviors.

6.3.1. ARLA

The ARLA project made its first public release available in February 1998, and its most recent release is labeled 0.35.12 (February 2003). 35 major releases were developed. In total, 62 releases are available through the project's web site, which therefore include 27 minor releases. The ARLA project's main purpose is to provide functionality similar to that of the IBM AFS file system. It is likely that ARLA has by now achieved even more functionality than AFS. Its application domain is distributed file systems management, a domain in which a lot of knowledge is available and openly shared. In this respect, this system is similar to flagship OSS successes (such as Linux or Apache).

In ARLA's evolution there have been two basic ways of enhancing and evolving the system: adding common features (e.g. support for specific network protocols), and adding ports so that the system supports different architectures. Observing the makeup of its folders, as measured by the number of files per folder level (Figure 5), we see that the majority of the files are located at Levels 2 and 3. Level 4 experienced a sudden midlife increase at around release 25, accompanied by a sudden decrease at Level 3. Several new folders were added at Level 4 and others were moved there from other parts of the system, which also significantly affected Level 3.

In Figure 6, we observe the evolution of ARLA as depicted by the total number of source files and the touched files over releases. The growth trend can be interpreted as two segments of decaying growth rate with a midlife growth regeneration point at about release 32. The trend presents similarities with those observed in commercial systems [Ramil and Smith 2002]. The plot of files per level in Figure 5 is useful for identifying the midlife growth rate regeneration points, and it suggests that such regeneration in growth rate was linked to a restructuring of the system. In Figure 6 the number of files touched per release presents only one major peak at release 50, that is, around three quarters of the way through the system's life-cycle (95 percent or so of the size of the system at the previous release was touched), while all other peaks of files touched do not go beyond 60 percent. Apart from the outlier around release 50, one can observe a predominantly decreasing trend with a superimposed oscillation in this attribute. The peaks correspond to the major releases. In the case of the ARLA system, the decreasing growth rate in the last third of its evolution history can be linked to a possible move of the system into a “servicing stage” [Rajlich and Bennett 2000], [Nakakoji et al 2001], as revealed by the declining evolution rate suggested by the decreasing trend in the proportion of files touched.

[Plot: ARLA - growth of levels. x-axis: releases; y-axis: Nr. of files per level; series: Level-1, Level-2, Level-3, Level-4]

Figure 5 - Number of files per level for the ARLA system

[Plot: ARLA - adaptations. x-axis: releases; left axis: total # of files; right axis: adaptations (files touched), in percent]

Figure 6 - ARLA evolutionary trends: total size and files touched per release

6.3.2. KSI

The KSI project is aimed at building a lightweight implementation of the Scheme programming language and interpreter [Scheme]: it can therefore count on an existing knowledge base to build a portable and embedded environment for Scheme executables. The CVS archives do not include the very first releases of the system, which means that we are limited to studying its most recent evolution. This issue typically emerges in empirical studies of software evolution in which release 1 in the data set does not correspond to the actual very first release², because the oldest data has been deleted or is unavailable. In spite of this, the study of the available subset of releases is meaningful, since we were able to access data for the most recent 12 releases of KSI, which span an interval of 860 days. We were also able to identify 3 cycles of major releases (3.2, 3.3 and 3.4), and we noticed that this system is the only one in the data set that shrinks in size from the earliest to the latest available release (from 111,288 LOCs to 100,157 LOCs).

² In this and the remainder of the figures and text of this paper, release 1 does not necessarily correspond to the first release of the system, and release should be read as release sequence number.

In this particular system Level 1 contains only source files used for building the package, while Level 2 and Level 3 hold nearly the whole code of the application. It is interesting to observe the disposition of folders graphically: since few folders are involved, the visualization of this system's tree structure is easier and clearer than for the larger systems in our study database. Figure 7 shows the disposition of source folders in the earliest release available: each ellipse is a source folder, and all source folders at the same level have the same associated source level number (rectangles on the left of the figure). In Figure 7, the edge between two folders is annotated with the number of files contained in the folder at the lower end of the connection. For example, close to the connection “ksi-3.20” -> “gc”, the label 59 means that the “gc” folder contains 59 source files.

Figure 7 - KSI earliest folder structure

Figure 8 displays the KSI folder structure for the most recent release available for study: the reduction in LOCs is reflected in a reduced number of both source files and source folders. Some branches were pruned away, and the whole design has become more compact, in the sense that the number of folders at Level 2, Level 3 and Level 4 has been reduced. The profile of files touched between releases for KSI, displayed in Figure 9, shows peaks which can be related to the major releases, as in other systems.

Figure 8 - KSI folder structure at most recent release.

A possible cause of the declining growth behavior of KSI, as seen in Figure 9, is the existence of memory or performance constraints in the target hardware which limit further growth of the system. Another, more fundamental, explanation would be that KSI is not an E-type system [Lehman and Belady 1985] in the strict sense, since it addresses a problem (the implementation of Scheme) which can be specified well enough to consider it an S-type system. In this regard, KSI's behavior is what one may expect from compilers and other precisely specified programs, which are closer to the S-type than to the E-type.

[Plot: KSI - adaptations. x-axis: releases (1 to 12); left axis: total number of files; right axis: files touched [%]; series: Number of files, Adaptations]

Figure 9 - KSI evolutionary trends: total size and files touched per release
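The drawing convention described for Figure 7, in which each edge is annotated with the number of source files in the child folder, can be reproduced by emitting Graphviz DOT text. The sketch below is an assumption about how such input could be generated, not the authors' script. Only the “ksi-3.20” -> “gc” edge with label 59 comes from the text; the second folder and its count are invented for the example.

```python
def folder_tree_to_dot(file_counts: dict[str, int], root: str) -> str:
    """Render a folder tree as Graphviz DOT text.

    file_counts maps each folder path (relative to root, '/'-separated) to the
    number of source files it holds. Each edge is labelled with the count of
    the child folder. Node ids are full paths so that equally named folders
    in different branches do not collide.
    """
    lines = ["digraph folders {", f'    "{root}";']
    for path in sorted(file_counts):
        parent, _, name = path.rpartition("/")
        parent_id = f"{root}/{parent}" if parent else root
        node_id = f"{root}/{path}"
        lines.append(f'    "{node_id}" [label="{name}"];')
        lines.append(f'    "{parent_id}" -> "{node_id}" [label="{file_counts[path]}"];')
    lines.append("}")
    return "\n".join(lines)


# "gc": 59 is taken from the text; "gc/include": 7 is a made-up value.
dot = folder_tree_to_dot({"gc": 59, "gc/include": 7}, root="ksi-3.20")
print(dot)
```

The resulting text can be passed to the `dot` command-line tool to obtain a drawing. Grouping folders by level, as the rectangles in Figure 7 do, would require additional rank constraints that this sketch omits.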

6.3.3. Ganymede

Ganymede was initially developed and evolved by academic staff, and includes both an application/database server and its client. We were able to recover data on its first 12 releases. This data set includes, we believe, its very first release. Only one major release cycle is recognizable (series 1.0) for this system so far. This is reflected both in the size change and in the structure changes, which are all relatively small. Rather than showing the graph of changes, we present a table (Table 1) with the number of files per level for this application. We observe very few additions or deletions of source components, even though the application has grown from 221,893 LOCs to 229,110 LOCs over its lifetime.

There is no significant evolution with regard to structural changes: all the components seem to remain in the same place within the folder structure, and few additions are made during the system's lifetime. One possible explanation for this emerges by going through the Changelog of the system: all developers and code contributors appear to belong to a small group, with few new contributions coming from outside this group. One of the challenges in software evolution, which is particularly evident in OSS, is the difficulty for outsiders or new contributors to assimilate and comprehend the sometimes massive amounts of code which have been generated in a closed environment. If we observe the latest available release, we realize that it is dated November 2002, which means that neither new features nor modifications have been released since then. A possible interpretation is that the feedback coming from the open source external

Files per release
Level     1    2    3    4    5    6    7    8    9   10   11   12
Lev1     --   --   --   --   --   --   --   --   --   --   --   --
Lev2      3    2    3    3    3    3    3    3    3    3    3    3
Lev3    273  268  273  273  274  274  274  274  275  275  276  277
Lev4     14   14   14   14   14   14   14   14   14   14   14   14
Lev5     76   75   76   77   77   77   77   77   77   77   77   77
Lev6     95   95   95   95   95   95   95   95   95   95   95   95
Lev7     12   12   12   12   12   12   12   12   12   12   12   12

Table 1 - Ganymede source levels evolution

When we plot code additions and modifications, presented in Figure 10, we observe that, while this system is growing very slowly, code adaptations (expressed as the percentage of files touched per release) are dispersed throughout the code, and represent on average 70 percent of the whole system for each release. We may conclude that, in this particular case, functional enhancement has been very limited by the lack of sufficient external feedback, while “servicing” of the application [Rajlich and Bennett 2000], [Nakakoji et al 2001] was conducted by a small group of developers.

[Plot: Ganymede - adaptations. x-axis: releases (1 to 12); left axis: total number of files; right axis: files touched [%]; series: total files, % files touched]

Figure 10 - Ganymede evolutionary trends: total size and files touched per release

6.4. Vertically shrinking

The second evolutionary pattern is based on a structure which becomes less articulated over the observed evolution. This means that some branches are pruned from the source tree, so that the overall number of vertical levels is lower than in the initial observations. As we did for the first pattern, we present below a subset of the systems which display this pattern, with some background information on their evolution in order to better understand and interpret the observed behaviors.

6.4.1. Gwydion-Dylan

Gwydion-Dylan is an object-oriented compiler supporting rapid application development, and aiming to become a complete development environment. We observe

[Plot: Gwydion-Dylan - growth of levels. x-axis: releases; y-axis: number of files per level; series: Level-1 to Level-7]

Figure 11 - Number of files per level for Gwydion-Dylan

The folder structure of the most recently observed release is composed of only 5 levels. The numbers of source folders and source files grow proportionally with the evolution of the code (at its earliest stage: 64 source folders, 607 source files; at its latest stage: 137 source folders, 1147 source files). A midlife restructuring of the system is clearly observable in Figure 11 after release 11, which can explain the increase in growth rate experienced by the system during the last half of the observed sequence of releases (Figure 12). The behavior of the proportion of files touched for this system is displayed in Figure 12.

[Plot: Gwydion-Dylan adaptations. x-axis: releases; left axis: total files; right axis: files touched [%]; series: total files, % files touched]

1

Lev1

1

# of files per level RSN

21 subsequent releases for this system, but they don't represent its whole life-cycle, since its earlier evolution is not available for study, neither in form of releases, nor in CVS storing. The available releases reflect 4 cycles of major releases, spanning over 1673 days. We observe in Figure 11 that the first available data point is composed of 7 nested levels, which have been progressively mostly likely accomplished through a previous series of releases for which we do not have data.

number of files per level

communities was not sufficiently strong as to guarantee a sustained evolution of the application.

releases

Figure 12 - Gwydion-Dylan evolutionary trends: total size and files touched per release
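The "% files touched" series plotted in Figures 10 and 12 can be approximated along the following lines. This is a minimal sketch, not the tooling used in the study. It assumes that each release is available as a mapping from relative file path to a content digest (the names `touched_percentage`, `prev` and `curr` are illustrative), and it counts a file as touched if it is new or if its content differs from the previous release, using the file count of the current release as the denominator. The paper does not spell out the denominator or the treatment of deleted files, so both choices are assumptions.

```python
from typing import Mapping


def touched_percentage(prev: Mapping[str, str], curr: Mapping[str, str]) -> float:
    """Percentage of files in `curr` that are new or changed relative to `prev`.

    Both arguments map a relative file path to a content digest
    (an assumed input format, not one prescribed by the paper).
    """
    if not curr:
        return 0.0
    # A missing path yields None from prev.get(), which never equals a digest,
    # so newly added files are counted as touched.
    touched = sum(1 for path, digest in curr.items() if prev.get(path) != digest)
    return 100.0 * touched / len(curr)


# Hypothetical releases: one file modified, one file added, two unchanged.
release_1 = {"src/a.c": "h1", "src/b.c": "h2", "lib/c.c": "h3"}
release_2 = {"src/a.c": "h1x", "src/b.c": "h2", "lib/c.c": "h3", "lib/d.c": "h4"}
print(touched_percentage(release_1, release_2))
```

Applying the function to each pair of consecutive releases gives one data point per release, which is the shape of the series shown in the adaptation plots.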

6.4.2. Gist

Gist is a set of tools for building dynamic web sites. We have access to the whole history of this project (20 releases), which enables us to observe how growth and change are related from the first release onwards. There are 4 cycles of major releases for this system, which are clearly noticeable when one plots code changes over releases. In the profile of code adaptations (Figure 13) we observe peaks corresponding to major releases. These peaks are quite noticeable: during these releases, more than 80 percent of the files are touched.

Like the Ganymede system analyzed before, this system is mainly developed by a small and stable group of developers. We were interested in finding out whether trends similar to those of Ganymede would appear. However, as shown in Figure 13, the size of the Gist code is not roughly constant, as it is in Ganymede: several shrinks in size, both in global LOCs and in the number of source files, are visible at different points of the system's life-cycle, while the overall trend is one of increasing size at a declining rate. The growth from the first to the last release is about 20 percent, which is a low evolution rate in comparison to other systems in the dataset.

What is also interesting from the point of view of the folder structure is that it is simple at both the initial and the most recent releases of the system: growth seems to proceed on a horizontal basis, while the vertical dimension seems to shrink as new horizontal folders are added. In the latest available release, over 30 folders form the structure of the same nested level, while in the earliest release we found around 20. For the Gist system, we are able to conclude that the horizontal growth of folders was effective in the evolution of the system.

[Plot: "GIST - adaptations"; x-axis: releases (1 to 20); axes: number of files and files touched (%); series: total files, % files touched]

Figure 13 - Gist evolutionary trends: total size and files modified per release

6.5. Vertically expanding

The third evolutionary pattern is based on a structure that expands during the observed evolution of the application: new branches are added in one or more sections of the tree, and new vertical levels appear. Horizontal levels may be added as well, but we found no clear relation between the two dimensions. Two case studies are analyzed in the following sections, and additional information, besides size and structure, is provided in order to gain insight into the observed pattern.

6.5.1. Lcrzo

Lcrzo is a shared library for developing network applications. Its functionality consists of a common framework for nearly all network protocols (Ethernet, TCP, and so on). In Figure 14 we depict the evolution trend of the application over 1400 days of its evolution: we were able to access 56 releases of this system, and these represent, we believe, its whole evolution history. We have been able to recognize four cycles of major releases (1.0, 2.0, 3.0 and 4.0).

[Plot: "LCRZO - growth of levels"; x-axis: releases; y-axis: # of files per level; series: Level-2 to Level-5]

Figure 14 - Number of files per level for LCRZO system

When investigating source levels, we found 4 different vertical levels. Two sudden jumps in the number of files at Level 2 suggest that something significant happened twice in the operational lifetime of this system. We then investigated these jumps from the tree structure perspective. In Figure 15 we depict the structure of the source folders before the first jump (release 22): there are 5 source levels. The number of files contained in each level represents the relative weight of that level in the overall structure.

Figure 15 - LCRZO code tree at release 22, before the first jump
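The per-level file counts that underlie plots such as Figure 14 and trees such as Figure 15 can be derived from a plain list of source file paths. The following is a minimal sketch under stated assumptions: paths are relative to the root of the release, a file placed directly in the root folder counts as Level 1, and each enclosing folder adds one level. The paper does not state its exact level numbering, and the function name and the sample paths are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Dict, Iterable


def files_per_level(paths: Iterable[str]) -> Dict[int, int]:
    """Count files per nesting level (root-level files are Level 1, by assumption)."""
    counts: Counter = Counter()
    for p in paths:
        # Number of path components = enclosing folders + the file itself.
        level = len(PurePosixPath(p).parts)
        counts[level] += 1
    return dict(sorted(counts.items()))


# Hypothetical contents of one release.
sample = [
    "main.c",
    "src/net.c",
    "src/tcp.c",
    "src/proto/eth.c",
    "example/demo/client.c",
]
levels = files_per_level(sample)
print(levels)       # number of files found at each level
print(max(levels))  # deepest level present in this release
```

Running the function on every release and comparing the results makes sudden jumps at a given level, or the appearance and disappearance of levels, easy to spot.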

Figure 15 displays the folder structure at the last release of the 2.0 cycle of releases. The subsequent release is the first of the 3.0 cycle, which suggests that some stable status had been reached by then. We see in Figure 16 that some branches were pruned: in a major release, unstable features are typically excluded. What is more, the folder “example” in Level 2 is filled with some 28 KLOCs of new source code. On inspection, these files are models, skeletons and schemas of potential new applications which can be implemented using this library. In a sense, they provide the community with an entry point for new development (they show potential users what the system may be able to do for them).

Figure 16 - LCRZO state at the 23rd release (after first jump)

We also see (Figure 17) that after a series of releases the code structure is changed again, after which two different levels seem to grow in parallel. The first is dedicated to the code of the application, while the second acts as a sort of incubator for new features: in the latest instances of this system, we observe new levels, that is, vertical branches in the code structure.

Figure 17 - LCRZO most recent state

Observing the plots of size and change activity (Figure 18), we note that this system underwent many rewritings and large adaptations. This is interesting because here large adaptations occur between minor releases, not only at major releases.

[Plot: "LCRZO - adaptations"; x-axis: releases; axes: files per release and files touched per release (%); series: total files, % files touched]

Figure 18 - LCRZO evolutionary trends: total size and files touched per release

6.5.2. Vovida SIP stack

Vovida is the system which has experienced one of the largest size deltas in the data set (13 KLOC to 650 KLOC) from the first to the most recent available release. Vovida is an open source application that implements the SIP (Session Initiation Protocol) stack for multimedia sessions. It is a particularly interesting application from the point of view of the growth of levels: we have been able to access the entire lifetime of this system, and it evolved by nesting several levels (from a single level in the first release to 8 levels in the latest available one).

[Plot: "Vovida Sip Stack - growth of levels"; x-axis: releases (1 to 15); y-axis: number of files per level; series: Level-1 to Level-8]

Figure 19 - Number of files per level for VOVIDA system

In Figure 19 we can observe that, starting with Level 1, then Level 2 and so on, all levels were added at different moments in time, together with new source files and folders. That is to say, a massive amount of evolution effort has been put into this application in order to add new features and functionality. As part of this large effort in code additions, we observed that the source adaptation trend displays high peaks, reflecting the high evolution rate of this system (Figure 20). The rapid evolution rate can be linked to a dynamic and growing community of developers.

[Plot: "VOVIDA - adaptations"; x-axis: releases (1 to 15); axes: number of files and files touched (%); series: total files, % files touched]

Figure 20 - VOVIDA evolutionary trends: total size and files modified per release

7. Conclusions

We have analyzed in this paper the evolution of 25 OSS systems, possibly the widest and largest dataset in a study of this kind. The systems studied vary in size, ranging from 700 LOCs to 700 KLOCs, and represent a diverse set of application domains. 20 out of 25 may be categorized as large systems, given that their size at the most recently observed release is greater than 100 KLOCs. These systems are a subset taken from a version history database of OSS systems which we have collected for our research. The systems in this database were randomly extracted from a popular software repository dedicated to open source.

In this particular study we have sought to identify interesting patterns in the evolution of these systems, with a focus on the source code. Our aim is to better understand the evolution of OSS systems and to relate traditional analyses, such as plotting growth trends, to the visualization of the evolving folder structure. In particular, we are interested in topological patterns, that is, when and how new source components are added, and how they relate to existing components and to the existing overall structure. In this work, we define a “source file” as each single file containing source code, and a “source folder” as each directory containing at least one source file.

Our first result (Figure 2) shows that the average size of the source files stabilizes between 5 and 20 KB. However, when investigating the relation between the average size of the source folders and the size of the system, we found no apparent correlation. This suggests that the number of source files and the number of source folders provide two complementary views. This is an improvement on previous studies, which have been based on source file counts and related metrics but have not considered the folder structure.
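The definitions of source file and source folder given in this section translate directly into a directory walk. The sketch below is only an illustration: the set of extensions treated as source code (`SOURCE_EXTENSIONS`) and the function name `count_sources` are assumptions, since the paper does not list the criteria used to recognise source files. It also reports the average source file size in KB, the quantity discussed with Figure 2.

```python
import os
import tempfile
from typing import Tuple

# Assumed, illustrative set of source-code extensions (not taken from the paper).
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".java", ".py"}


def count_sources(root: str) -> Tuple[int, int, float]:
    """Return (source files, source folders, average source file size in KB)."""
    n_files = n_folders = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        sources = [f for f in filenames
                   if os.path.splitext(f)[1].lower() in SOURCE_EXTENSIONS]
        if sources:  # a source folder contains at least one source file
            n_folders += 1
        n_files += len(sources)
        total_bytes += sum(os.path.getsize(os.path.join(dirpath, f)) for f in sources)
    avg_kb = total_bytes / 1024 / n_files if n_files else 0.0
    return n_files, n_folders, avg_kb


# Demonstration on a small throwaway tree with hypothetical contents.
with tempfile.TemporaryDirectory() as root:
    layout = {"main.c": 2048, "src/net/tcp.c": 1024,
              "src/net/tcp.h": 1024, "doc/README": 100}
    for rel, size in layout.items():
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as fh:
            fh.write(b"x" * size)
    n_files, n_folders, avg_kb = count_sources(root)

print(n_files, n_folders, round(avg_kb, 2))
```

In the demonstration tree, `src` and `doc` are not counted as source folders because neither directly contains a file with a recognised extension.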
We analyzed the structure of the source folders, visualizing them as a tree containing branches (source folders) and leaves (source files). In doing so, we have been able to distinguish three main evolution patterns, basically related to how the folders evolve along the vertical and horizontal dimensions.

The first pattern is based on an invariant code structure in the vertical dimension: we observed this pattern in 10 of the 25 systems analyzed. Looking more closely at these 10 systems, we realized that in three of them the system was given a structure before its first public release: in other words, a core group was in charge of developing it before it became publicly available. In general, when one is studying patterns in software evolution, the smallest systems are likely to display a less disciplined evolutionary behavior, driven by the decisions and actions of a small group of developers. By contrast, larger projects are more likely to exhibit evolution dynamics of their own, for reasons that have been discussed in the literature [Lehman and Belady 1985]. In our study, depending on the system, we observed faster growth in some cases and slower growth in others. We then tried to identify why this was so by looking at details of the development process, and we found higher growth rates for the systems in which it was easier for potential contributors to join, that is, where more feedback was available (the ARLA system could take advantage of several added developers, while Ganymede could not). Furthermore, the percentage of files touched per release shows few peak values, nearly all of them corresponding to important releases. These peaks range on average between 70 and 90 percent. One interesting case was also described (Ganymede), where very few new components were added, while on average 70 percent of the existing ones were touched in every release. This behavior requires further investigation.

The second recognizable pattern is when the vertical dimension grows. We had initially expected this to be the predominant pattern emerging from our analysis, but we found it in only 10 systems out of 25. What is more, several of these also underwent some shrinks and expansions in the depth of the code tree. The trend of files touched per release for this class of systems has in general higher peaks than the first pattern, also because new components are added. Several peaks around 100 percent (LCRZO system, VOVIDA system) and around 80 percent (VOVIDA system) can be observed.

A third, less frequent, pattern also emerges, in which the vertical dimension shrinks. The profile of files touched per release here lies in between those of patterns one and two; remarkable peaks in the range 90 to 100 percent are recognizable when major releases are prepared (Gwydion-Dylan and GIST), but these peaks are rather sporadic and rarely recurrent.
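As a rough illustration of the three patterns, the initial and final depth values listed in Table A1 can be compared. This is a simplified sketch, not the classification procedure used in the study: it looks only at the first and last release, whereas the analysis above considers the whole sequence of releases (several systems shrink and expand along the way), so its output need not reproduce the counts reported here. The function name and the labels are assumptions; the sample values are the Depth ini and Depth fin entries of four systems in Table A1.

```python
from typing import Dict, Tuple


def vertical_pattern(depth_ini: int, depth_fin: int) -> str:
    """Label a system by comparing the depth of its first and last release."""
    if depth_fin > depth_ini:
        return "vertical expansion"
    if depth_fin < depth_ini:
        return "vertical shrinking"
    return "invariant depth"


# (Depth ini, Depth fin) as listed in Table A1.
depths: Dict[str, Tuple[int, int]] = {
    "Ganymede": (6, 6),
    "Gwydiondylan": (6, 5),
    "Lcrzo": (1, 6),
    "Vovida Sip Stack": (1, 6),
}
labels = {name: vertical_pattern(*d) for name, d in depths.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

A fuller treatment would compare the depth at every release rather than only the endpoints, so that intermediate restructurings are not missed.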
In our future work we plan to refine the identification of patterns of structural evolution by considering metrics which reflect both the evolution of the horizontal and vertical dimension of the code structure, and relate this to other system characteristics by applying cluster analysis.

8. References

[Aoki et al 2001] Aoki A., Hayashi K., Kishida K., Nakakoji K., Nishinaka Y., Reeves B., Takashima A., and Yamamoto Y., “A Case Study of the Evolution of Jun: an Object-Oriented Open-Source 3D Multimedia Library”, Proc. 23rd Intl. Conference on Software Engineering, ICSE 23, Toronto, Canada, 12-19 May 2001, pp. 524 – 533

[Anton and Potts 2001] Anton A. and Potts C., “Functional Paleontology: System Evolution as the User Sees It”, Proc. 23rd ICSE, Toronto, Canada, 12-19 May 2001, pp. 421 – 430

[Antoniades et al 2003] Antoniades P., Samoladas I., Stamelos I., Bleris G.L., “Dynamical simulation models of the Open Source Development process”. To appear in Free/Open Source Software Development, Stefan Koch (ed.), Idea Group, Inc.

[Barry et al 2003] Barry E.J., Kemerer C.F., and Slaughter S.A., “On the Uniformity of Software Evolution Patterns”, Proc. ICSE 25, Portland, Oregon, May 3 – 10, 2003, pp. 106 – 113

[Basili et al 1996] Basili, V. R. et al, “Understanding and Predicting the Process of Software Maintenance Releases”. Proc. 18th ICSE, Berlin, March 25 – 29, 1996, pp. 464 – 474

[Belady et al 1976] Belady L.A., Lehman M.M., “A Model of Large Program Development”, IBM Systems J., vol. 15, no. 1, 1976, pp. 225 – 252.

[Capiluppi 2003] Capiluppi A., “Models for the evolution of OSS projects”, Proc. of the 7th International Conference on Software Maintenance, ICSM, Amsterdam, September 22 – 26 2003, pp. 65 – 74.

[Capiluppi et al 2003] Capiluppi A., Lago P., Morisio M., “Characteristics of Open Source Projects”, Proc. of the 7th European Conference on Software Maintenance and Reengineering, CSMR, March 26 – 28 2003, pp. 317 – 327.

[Chapin et al 2001] Chapin N., Hale J.E., Khan K.M., Ramil J.F. and Tan W.G., “Types of Software Evolution and Software Maintenance”, Journal of Software Maintenance and Evolution: Res. and Practice, 13(1), January-February, pp. 1 – 30, 2001

[Curtis et al 1979] Curtis B., Sheppard S.B., Milliman P., Borst M.A. and Love T., “Measuring the Psychological Complexity of Software Maintenance Tasks with the Halstead and McCabe Metrics”, IEEE Trans. on Softw. Eng., 5(2), 1979, pp. 96 – 104

[Di Lucca et al 2000] Di Lucca G.A. et al, “Recovering Class Diagrams from Data Intensive Legacy Systems”, Proc. ICSM 2000, 11 – 14 Oct. 2000, San Jose CA, pp. 52 – 63

[El-Emam et al 2000] El-Emam K., Benlarbi S., Goel N., Melo W., Lounis H., and Rai S., “The Optimal Class Size for Object-Oriented Software: A Replicated Study”, National Research Council of Canada, NRC/ERB 1074, 2000.

[German 2003] German D., “Using software trails to rebuild the evolution of software”, International Workshop on Evolution of Large-scale Industrial Software Applications (ELISA), 23 September 2003, Amsterdam, The Netherlands, http://prog.vub.ac.be/FFSE/Workshops/ELISA-Workshop.html (as of Sept. 2003)

[Godfrey and Tu 2000] Godfrey, M., and Tu Q., “Evolution in Open Source Software: A Case Study”. Proc. of 2000 ICSM, October 11-14 2000, pp. 131 – 142

[González-Barahona et al 2001] González-Barahona J.M., Ortuño-Pérez M. A., de las Heras-Quirós P., Centeno-González J., Matellán-Olivera V., “Counting potatoes: The size of Debian 2.2”, http://people.debian.org/~jgb/debian-counting/countingpotatoes-0.2/ (as of June 2004)

[Graphviz] Graphviz - open source graph drawing software, http://www.research.att.com/sw/tools/graphviz/

[Kemerer and Slaughter 1999] Kemerer, C.F., and S. Slaughter, “An Empirical Approach to Studying Software Evolution”. IEEE Transactions on Software Engineering, 1999, 25(4), pp. 493 – 509.

[Koch and Schneider 2000] Koch S., Schneider G., “Results from Software Engineering Research into Open Source Development Projects Using Public Data”, in “Zum Tätigkeitsfeld Informationsverarbeitung und Informationswirtschaft”, Hans R. Hansen und Wolfgang H. Janko (eds.), Nr. 22, Wirtschaftsuniversität Wien, 2000.

[Lehman 1969] Lehman M.M., “The Programming Process”, IBM Res. Rep. RC 2722, Dec. 1969: 46 pp. Also as Chapter 3 in [Lehman and Belady 1985]

[Lehman 1974] Lehman M.M., “Programs, Cities, Students, Limits to Growth?”, Inaugural Lecture, in Imperial College of Science and Technology Inaugural Lecture Series, v. 9, 1970, 1974, pp. 211 – 229. Also in Programming Methodology, Gries D (ed.), Springer Verlag, 1978, pp. 42 – 62. Reprinted as Chapter 7 in [Lehman and Belady 1985]

[Lehman 1980] Lehman M.M., “Programs, Life Cycles, and Laws of Software Evolution”, Proc. Special Issue Software Eng., IEEE, vol. 68, no. 9, 1980, pp. 1,060 – 1,076

[Lehman and Belady 1985] Lehman M.M. and Belady L.A. (eds.), Program Evolution – Processes of Software Change, Academic Press, London, 1985

[Lehman et al 1997] Lehman M.M., J.F. Ramil, P.D. Wernick, D.E. Perry, and W.M. Turski, “Metrics and Laws of Software Evolution - The Nineties View”, Proc. Fourth Intl. Software Metrics Symp., Metrics '97, Albuquerque, N.M., 1997, pp. 20 – 32

[Lehman et al 1998] Lehman M. M., D. E. Perry, and J. F. Ramil, “Implications of evolution metrics on software maintenance”, Proc. of the 1998 ICSM 98, Bethesda, Maryland, Nov. 1998, pp. 208 – 217.

[Mockus et al 2002] Mockus A., Fielding R.T., Herbsleb J.D., “Two Case Studies of Open Source Development: Apache and Mozilla”. In ACM Transactions on Software Engineering and Methodology Vol. 11, No. 3, 2002, pp. 309 – 346.

[Nakakoji et al 2002] Nakakoji K., Yamamoto Y., Nishinaka Y., Kishida K., Ye Y., “Evolution Patterns of Open-Source Software Systems and Communities”. In Proceedings of International Workshop on Principles of Software Evolution (IWPSE 2002), Orlando, Florida, 19 – 20 May, 2002, pp. 76 – 85

[Nikora and Munson 2003] Nikora A.P. and Munson J.C., “Understanding the Nature of Software Evolution”, Proc. ICSM 2003, 22 – 26 Sept., Amsterdam, pp. 83 – 93

[Rajlich and Bennett 2000] Rajlich V.T. and Bennett K.H., “A Staged Model for the Software Life Cycle”, IEEE Computer, July, 2000, pp. 66 – 71

[Scheme] The Scheme programming language, project available at http://www.swiss.ai.mit.edu/projects/scheme/ (as of June 2004)

[Shankland 2000] Shankland S., “Linux kernel release falls behind schedule”, available on-line at http://news.com.com/21001001-240061.html?legacy=cnetandtag=st.ne.1002.thed.1003200-1808165 (as of June 2004)

[Smith et al 2004] Smith N., Capiluppi A., Ramil J.F., 2004, “Qualitative Analysis and Simulation of Open Source Software Evolution”, in the Proc. of the 5th Int. Workshop on Software Process Simulation and Modeling, May 24 – 25 2004, pp. 103 – 112

[Stamelos 2002] Stamelos, I., Angelis, L., Oikonomou, A., Bleris, G.L., “Code Quality Analysis in Open-Source Software Development”, Information Systems Journal, 2nd Special Issue on OS Software, 12(1), January 2002, pp. 43 – 60.

[XSCC] A tool for extraction source lines of code, http://members.tripod.com/vgoenka/unixscripts/xscc.html (as of June 2004)

APPENDIX

System            Files  Folders  Files  Folders  Kbs     Kbs     LOCs     LOCs     SLOCs    SLOCs    Depth  Depth  Time interval
                  ini    ini      fin    fin      ini     fin     ini      fin      ini      fin      ini    fin    (days)
Arla              321    31       658    69       1.831   4.091   63.663   162.218  40.009   108.838  4      4      1.820
Ganymede          473    28       478    28       5.455   5.646   221.893  229.110  123.093  126.955  6      6      558
Gwydiondylan      607    64       1.147  137      6.606   11.012  213.688  348.644  151.145  252.997  6      5      1.673
Ghemical          586    12       555    12       6.426   6.716   217.463  226.769  171.998  180.159  4      4      454
Gimpprint         7      1        136    14       305     2.206   11.156   80.567   9.172    61.895   1      3      1.304
Gist              778    27       1.067  37       4.098   4.519   172.111  190.933  126.987  131.401  5      4      1.436
Grace             91     4        310    14       2.025   4.428   73.691   157.919  63.423   113.668  2      2      2.730
Htdig             136    16       511    24       441     3.926   21.300   153.722  14.529   102.621  3      5      2.451
Imlib             27     4        36     4        2.631   2.692   52.651   55.839   50.300   53.163   2      2      1.277
Ksi               259    19       191    14       2.933   2.708   111.288  100.157  81.681   75.561   4      4      860
Lcrzo             19     3        235    9        197     3.658   6.409    109.323  4.955    70.517   1      6      1.435
Linuxconf         586    46       1.347  117      2.475   6.104   103.498  239.223  82.810   191.594  4      4      2.028
Mit-scheme        1.511  31       1.946  51       17.127  21.941  545.093  704.864  467.151  614.141  3      5      3.430
Motion            2      1        28     1        7       160     239      6.836    204      5.901    2      2      1.281
Mutt              120    2        201    6        1.131   2.391   48.640   96.415   37.477   70.171   2      3      2.032
Nicestep          44     4        140    17       1.173   2.414   33.990   74.441   27.555   59.729   1      2      1.168
Parted            52     6        122    16       417     1.354   16.911   51.907   12.431   38.720   3      3      1.405
Pliant            227    37       641    94       1.255   4.270   36.347   116.947  28.868   101.363  5      5      1.845
Quakeforge        396    17       696    58       3.815   5.696   172.946  233.534  123.234  175.377  3      5      1.268
Rblcheck          1      1        7      5        2       19      104      772      68       447      1      3      1.493
Rrdtool           113    10       153    26       1.926   3.025   86.138   128.211  68.695   102.298  3      4      1.634
Siagoffice        42     5        322    18       356     3.618   15.386   137.504  13.743   108.254  2      2      2.594
Vovida Sip Stack  49     1        2.618  135      13.307  19.809  13.307   665.749  7.406    398.938  1      6      1.309
Weasel            16     1        36     2        142     511     4.449    17.591   2.629    11.924   1      2      834
Xfce              207    12       450    69       1.323   8.450   46.808   277.423  35.317   225.736  2      3      1.662

Table A1 – Various size measures and length of evolution period studied (time interval) for the 25 OSS systems

NOTES:
• In the table header, “ini” indicates size measured at the first publicly available release, “fin” indicates size measured at the last publicly available release.
• Columns 2 to 13 represent various size measures
• Column 14 represents the length of the period studied for each software, measured as the interval between the first and the latest available release
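When reusing the figures of Table A1, note that values such as 221.893 appear to use the dot as a thousands separator (the Ganymede LOC counts are quoted in the same form in Section 6.3.3). The sketch below parses such values under that assumption and computes the relative growth between the first and the last release. The helper names `parse_count` and `relative_growth` are illustrative and are not part of the study's tooling.

```python
def parse_count(text: str) -> int:
    """Parse a Table A1 entry, assuming '.' is a thousands separator."""
    return int(text.replace(".", ""))


def relative_growth(initial: str, final: str) -> float:
    """Growth from the initial to the final value, as a percentage."""
    ini, fin = parse_count(initial), parse_count(final)
    return 100.0 * (fin - ini) / ini


# LOCs ini and LOCs fin for Ganymede, as listed in Table A1.
ganymede = relative_growth("221.893", "229.110")
print(f"Ganymede LOC growth: {ganymede:.2f}%")
```

The same two helpers can be applied to any pair of ini and fin columns, for instance files or folders, to compare growth across systems.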