Using the Linux Kernel for Data about Software Maintenance Activities

Ayelet Israeli
Dror G. Feitelson
Department of Computer Science, The Hebrew University, 91904 Jerusalem, Israel
Abstract

Software maintenance involves different types of activities: corrective, perfective, adaptive, and preventative. However, research regarding these distinct activities is hampered by a lack of empirical data that is labeled to identify the type of maintenance being performed. A promising dataset is provided by the more than 800 versions of the Linux kernel that have been released over the last 14 years. The Linux release scheme differentiates between development versions (where perfective maintenance is expected to be common) and production versions (expected to be dominated by corrective maintenance). The structure of the codebase also distinguishes code involved in handling different architectures and devices (where additions reflect adaptive maintenance) from the core of the system’s kernel. However, these distinctions are not always crisp, and a detailed analysis may be needed in order to identify the different maintenance activities reliably.

Key words: Perfective maintenance, Corrective maintenance, Adaptive maintenance, Linux kernel

1. Introduction

Software engineering textbooks typically take the view that software is developed, released, and then maintained. They focus on the development process, especially the requirements, specification, and design, and say little about maintenance [23, 30, 25]. Nevertheless, software maintenance has been an active field of research at least since the 1970s, and there is growing recognition of its importance. To quote Parnas, “A sign that the Software Engineering profession has matured will be that we lose our preoccupation with the first release and focus on the long term health of our products” [21].

A large-scale example of software maintenance¹ is provided by the Linux kernel. This highly successful system was first released in March 1994, and since that time over 800 releases have been made in multiple branches. Moreover, Linux is an open source project, and all the code of all the releases is freely available on the Internet (from www.kernel.org and numerous mirrors). This makes it suitable for a study of software maintenance at a scale that has not been done before.
¹ A better term would be “software evolution”, but in this paper we use “maintenance” in accordance with the common usage in the literature to describe this classification of activities.
Our main objective in this study is to try to identify and characterize different types of maintenance activities, and how they affect the body of software being maintained. Software maintenance activities have been classified into four types: corrective, adaptive, perfective, and preventative (preventative is sometimes bundled into perfective) [31]. These will be studied by analyzing and comparing different versions of Linux and different subdirectories in the code tree. In particular, we expect development versions to reflect perfective maintenance, production versions to reflect corrective maintenance, and the arch and drivers directories to reflect adaptive maintenance.

Linux has served as a source of data for multiple studies of software development (e.g. [1, 6, 26, 22, 35, 32]), as have other open source projects (e.g. [18, 3, 16]). While there is some debate on whether this also reflects the behavior of closed-source software [22], the growth of the open source movement implies that studies based on such data are also important in their own right. Linux itself is undoubtedly interesting and important enough to be studied for its own sake, and our study can also serve as a basis for comparison with other systems.

The organization of this paper is as follows. Section 2 provides background regarding the Linux kernel and describes our methodology. Section 3 describes the different categories of maintenance activities. In Section 4 we present our measurements of various aspects of the Linux kernel and how they change with time. These are then interpreted with regard to maintenance activities in Section 5. Section 6 concludes with a summary and discussion of future work.

2. Data and Methodology

2.1. The Linux Kernel

The Linux operating system kernel was originally developed by Linus Torvalds, who announced it on the Internet in August 1991. There followed 2½ years of development by a growing community of developers, and the first production version was released in March 1994.

Initially a 3-digit system was used to identify releases. The first digit is the generation; this was 1 in 1994 and changed to 2 in 1996. The second digit is the major kernel version number. Importantly, a distinction was made between even and odd major kernel numbers: the even digits (1.0, 1.2, 2.0, 2.2, and 2.4) correspond to stable production versions, whereas versions with odd major numbers (1.1, 1.3, 2.1, 2.3, and 2.5) are development versions used to test new features and drivers leading up to the next stable version. The third digit is the minor kernel version number. New releases (with new minor numbers) of production versions supposedly included only bug fixes and security patches, whereas new releases of development versions included new (but not fully tested) functionality.

The problem with the above scheme was the long lag time until new functionality (and new drivers) found their way into production kernels, because this happened only on the next major version release. The scheme was therefore changed with the 2.6 kernel in December 2003. Initially, releases were made at a relatively high rate. Then, with version 2.6.11, a fourth number was added. The third number now indicates new releases with added functionality, whereas the fourth number indicates bug fixes and security patches. Kernel 2.6 therefore acts as production and development rolled into one. However, it is actually more like a development version, because new functionality is released with relatively little testing.
It has therefore been decided that version 2.6.16 will continue to be updated with bug fixes even beyond the release of subsequent versions, and act as a more stable production version². (A short sketch of this classification scheme is given after the directory list below.)

The Linux kernel sources are arranged in several major subdirectories [24]. The main ones are:

init: contains the initialization code for the kernel.
kernel: contains the architecture-independent main kernel code.
mm: contains the architecture-independent memory management code.
fs: contains the file system code.
net: contains the networking code.
ipc: contains the inter-process communication code.
arch: contains all the kernel code which is specific to different architectures (for example Intel processors with 32 or 64 bits, PowerPC, etc.).
drivers: contains all the system device drivers.
include: contains most of the include (.h) files needed to build the kernel code.
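To make the release scheme concrete, the following minimal sketch classifies a version string according to the even/odd convention and the four-number 2.6 scheme described above. It is purely illustrative (a hypothetical helper, not part of the study's tooling), and ignores suffixed releases such as 2.3.99-pre9:

```python
# Illustrative sketch of the Linux release scheme described above
# (hypothetical helper, not the paper's actual analysis tool).

def classify_release(version: str) -> str:
    """Classify a kernel version string as production or development."""
    parts = [int(p) for p in version.split(".")]
    generation, major = parts[0], parts[1]
    if (generation, major) >= (2, 6):
        # 2.6.x is development-like; a fourth digit (2.6.x.y) marks
        # bug-fix/security updates to an existing 2.6.x release.
        return "2.6 stable update" if len(parts) == 4 else "2.6 release"
    # Before 2.6: even major numbers are production, odd are development.
    return "production" if major % 2 == 0 else "development"

assert classify_release("2.4.25") == "production"
assert classify_release("2.5.36") == "development"
assert classify_release("2.6.16.62") == "2.6 stable update"
```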
An interesting note is that both the arch and drivers directories are practically external to the core of the kernel, and each specific kernel configuration uses only a small portion of these directories. However, these directories constitute a major part of the code, which grows not only due to improvements and addition of features, but also due to the need to support additional processors and devices.

2.2. Research Data

We examined all the Linux releases from March 1994 to August 2008. There are 810 releases in total (we only consider full releases, and ignore interim patches that were popular mainly in 1.x versions). This includes 144 production versions (1.0, 1.2, 2.0, 2.2, and 2.4), 429 development versions (1.1, 1.3, 2.1, 2.3, and 2.5), and 237 versions of 2.6 (up to release 2.6.25.11). This represents a significant extension of the work of Godfrey and Tu [6], who studied the development of Linux from its inception until January 2000 (their last versions were 2.2.14 and 2.3.39).

We aimed at examining the entire source code of the Linux kernel. A typical C program consists of files with two different suffixes: .h and .c. The .c files contain the executable statements and the .h files the declarations. Both should be looked at in order to get the full picture of the Linux code. In this we follow Godfrey and Tu [6] and Thomas [32], as opposed to Schach et al. [26] who only included the .c files. Given that we are interested in code maintenance rather than in execution, we take the developer point of view and consider the full code base. Schach et al. [26] considered only the kernel subdirectory, which contains but a small fraction of the most basic code. Thomas [32] spent much effort on defining a consistent configuration that spans versions 2.0 to 2.4, thus excluding much of the code (mainly drivers), and limiting the study of innovations. In any case, such a consistent version cannot be found spanning the full range from 1.0 to 2.6. We ignore makefiles and configuration files, as was done by others.
² This seems to have stopped with 2.6.16.62, released on 21 July 2008.
2.3. Software Metrics

The perception of a software crisis has led to the invention of many different quantitative measures of software products [10, 17]. These can be classified as product metrics, which measure the software product itself (such as size as reflected by LOC and percentage of documented lines), and process metrics, which measure the development process (such as development time and experience of the programming team). We focus mainly on product metrics, as these can be extracted reliably directly from the code.

There are two reasons why it is important to perform code-based measurements. The first is accuracy, as the alternatives of using surveys and logs can be highly inaccurate. For example, a survey of maintenance managers yielded the result that 17.4 percent of maintenance is corrective in nature [14], whereas an analysis of changes to source code (albeit in different projects at a different time) led to a figure more than three times larger [27]. Similarly, a comparison between change logs for three software products and the corresponding changed source code itself showed that up to 80 percent of changes made to the source code were omitted from change logs [4]. The second reason why code-based metrics are important is that certain phenomena can be measured only by examining the code itself. For example, common coupling has been validated as a measure of maintainability [2], and the only way to measure the common coupling within a software product is to examine the code itself.

The metrics we measure are the following (a small computational sketch is given at the end of Section 2.4):

1. Lines of code (LOC), including its related variants: comment lines and total lines.

2. Number of modules, as expressed by the number of directories, files, and functions.

3. McCabe’s cyclomatic complexity (MCC), which is equivalent to the number of conditional branches in the program plus 1 [15, 19]. These include if-then-else, for, while, do, and case statements. We also measure the extended version (EMCC), where one counts the actual conditions and not only the conditional statements, based on the conception that Boolean connectives add complexity to the code.

4. Metrics defined as part of Halstead’s software science [8]. The building blocks of these metrics are the total number of operators N1 and the number of unique operators n1, as well as the total number of operands N2 and the number of unique operands n2. Using them, Halstead defined the following:

   The volume HV = (N1 + N2) log2(n1 + n2). This actually measures the total number of bits needed to write the program.

   The difficulty HD = (n1 / 2) · (N2 / n2). This is proportional to the available tools (operators) and the average usage of operands.

   The effort HE = HV · HD. This is simply the product of how much code there is and the difficulty of producing it.

   In cases when the metrics are undefined (e.g. for an empty function, where n1 = n2 = 0) they were taken as 0. This happened in around 1% of the functions.
5. Oman’s maintainability index (MI) [20, 33], which is a composite metric that attempts to fit data from several software projects. Its definition is

   MI = 171 − 5.2 ln(aveHV) − 0.23 aveMCC − 16.2 ln(aveLOC) + 50 sin(√(2.46 pCM))

   where aveX denotes the average of X over all modules, and pCM denotes the percentage of lines that are comments. However, following Thomas [32], we will interpret pCM as a fraction (between 0 and 1) rather than as a percentage (between 0 and 100), because with this interpretation √(2.46 pCM) has a range of 0 to approximately π/2.

6. Files and directories handled (added, deleted, or modified).

7. The rate of releasing new versions.

A notable omission from the above list is common coupling, which has been used to assess the Linux code in several previous studies [26, 35, 32]. However, all those studies neglected to fully follow inter-procedural pointer dereferences, and thus potentially miss many instances of coupling. As this is an extremely difficult issue, we leave its resolution to future work.

2.4. Analysis Tool

In order to analyze our full dataset, a static analysis tool was required. We initially considered using a commercial CASE tool, but in the end developed a tool of our own that was simpler and did precisely what we wanted.

Linux is written in C, which includes pre-processor directives (#define, #ifdef, etc.) that are processed before the compilation proper. One of the big challenges we faced was to handle such directives. Generally there are two approaches to performing static code analysis: either to pre-process the code or not to. Applying pre-processing is useful when the goal is to understand how the code actually runs. However, our objective is to study software engineering, and in particular the maintenance of the code. Thus we believe that the correct way to go is to analyze the code as the developer views it, i.e. before applying the pre-processor. For example, macros may be used as an aid to abstraction that has lower cost than subroutines, because the code is inlined by the pre-processor. Due to the inlining, if a developer uses the macro #define MAX(X,Y) (X>Y)?X:Y, the “real” complexity of the code increases because a branch is added in each use. But from the developer’s point of view this added complexity is hidden, and therefore should not be counted. Thus macros should be measured with all their properties at their definition, but as a function call in their uses. The same applies for #include, which can induce considerable bloat.

A major problem comes from the use of conditional compilation directives such as #ifdef. Such code sections are used mainly to handle different configurations with essentially the same code, by singling out only the differences and selectively compiling the correct version for each configuration. This implies that if we use pre-processing we will actually only analyze a certain configuration and not the whole code base. Assuming the developer’s perspective again, we want to analyze all the code — all the versions of the #if directives, and their #else parts too — because a developer maintaining such a file must be aware of all the different possibilities of the flow of the code. This again implies that pre-processing is inappropriate for our needs. In this we differ from Thomas [32], who used a CASE tool which requires pre-processing, and thus only files and code sections in the pre-processed configuration were examined.

The major problem with not performing any pre-processing is that the resulting code is not always consistent. For example, a function may have slightly different signatures in different configurations, and this can be expressed using #ifdef and #else. With pre-processing, the compiler will only see the correct definition each time. But if we just delete the pre-processing directives, we will get two contradicting definitions of the same function one after the other, and moreover, sharing the same function body. As it turns out, in most cases we were able to analyze files which had #ifdef sections. In other cases, we saw that when analyzing both paths of the #if, malformed code is created, and thus we were not able to analyze it with our automated tool. Trying to do this manually is also a challenge — how does one decide which path to analyze? Therefore, when we encountered such files we removed them from the analysis. Other malformed files (very few) were removed as well and are not included in our calculations. Overall, less than 1.5% of the source files were not analyzed at all. Among the arch and drivers subdirectories between 0.3% and 3% of the files were not analyzed, whereas in the other parts of the kernel the worst case of un-analyzed files was less than 0.7%. Thus the vast majority of the files were analyzed and their data is aggregated in the different metrics.

Our analysis tool, while not free of problems and limitations, is tailored to perform the analysis based on the above considerations. We coded a Perl script that, given a C file, parses it into its different tokens, and generates an output file with the metrics. Pre-processor directives were stripped out, and all the remaining code was analyzed. In those cases where this practice led to inconsistent code, the file was removed from consideration as explained above. Empty functions and files were not considered problematic and are included in the metrics, as they are part of the design.

We ran this program on all the .c and .h files of all the versions, creating an output file with the calculated metrics for each one. Then, for each version we aggregated all the data from these output files. This was done for three groupings: the whole kernel, only the arch and drivers directories, and only the other (core) directories. This allows us to study whether these subsystems behave differently from each other and from the whole system [5, 6].

In order to aggregate the metrics at the kernel level, we used the same approach used in other studies (such as Thomas [32]) and as explained in the original metrics definitions. For example, LOC, MCC, and EMCC are simply summed across all files (note that files with no functions, such as some header files, will have an MCC of 0, because MCC is a function-level metric). In order to compare the different versions, despite the addition of new files or functions, we will sometimes look at the average metric values of the files and functions of the kernel rather than at the aggregate values. The same is true for the function-level Halstead metrics. Oman’s MI is defined at the file level. Since it has a 100-point scale we cannot aggregate the values. Instead, we will use the average of the MI values of all files in the kernel.
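As an illustration of the metric definitions and the directive-stripping approach described above, here is a minimal sketch in Python. The paper's actual tool was a Perl script; the tokenization below is deliberately crude, and the operator/operand classification is our simplifying assumption rather than the authors' exact rules:

```python
import math
import re
from collections import Counter

# Crude illustrative token classes; a real C analyzer needs a proper lexer.
OPERATORS = {"+", "-", "*", "/", "%", "=", "==", "!=", "<", ">", "<=", ">=",
             "&&", "||", "!", "&", "|", "^", "<<", ">>", "(", ")", "{", "}",
             "[", "]", ";", ",", "?", ":", "->", "."}
DECISION_KEYWORDS = {"if", "for", "while", "do", "case"}  # MCC counts these

def strip_directives(source: str) -> str:
    # As in the paper's approach: drop pre-processor lines, keep all branches.
    return "\n".join(line for line in source.splitlines()
                     if not line.lstrip().startswith("#"))

def tokenize(code: str):
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_0-9]{1,2}", code)

def metrics(source: str) -> dict:
    tokens = tokenize(strip_directives(source))
    ops = Counter(t for t in tokens if t in OPERATORS or t in DECISION_KEYWORDS)
    opnds = Counter(t for t in tokens
                    if t not in OPERATORS and t not in DECISION_KEYWORDS)
    N1, n1 = sum(ops.values()), len(ops)
    N2, n2 = sum(opnds.values()), len(opnds)
    if n1 + n2 == 0:           # undefined case: taken as 0, as in the paper
        hv = hd = he = 0.0
    else:
        hv = (N1 + N2) * math.log2(n1 + n2)          # Halstead volume
        hd = (n1 / 2) * (N2 / n2) if n2 else 0.0     # Halstead difficulty
        he = hv * hd                                  # Halstead effort
    # Branches + 1; the paper computes MCC per function, this simplification
    # works on a whole code fragment.
    mcc = 1 + sum(ops[k] for k in DECISION_KEYWORDS)
    return {"HV": hv, "HD": hd, "HE": he, "MCC": mcc}

def maintainability_index(ave_hv, ave_mcc, ave_loc, pcm) -> float:
    # Oman's MI, with pCM as a fraction in [0, 1] following Thomas [32].
    return (171 - 5.2 * math.log(ave_hv) - 0.23 * ave_mcc
            - 16.2 * math.log(ave_loc) + 50 * math.sin(math.sqrt(2.46 * pcm)))
```

Per-version aggregation then proceeds as described: summing LOC/MCC across files, or averaging per-file and per-function values when comparing versions.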
3. Categories of Software Maintenance

“Maintenance” is often used to describe whatever happens to a software product after the first release. According to the literature, at least 50% of the effort invested in software, and perhaps 80% or even more, is dedicated to maintenance (e.g. [14, 11, 12]). Maintenance is classified into four types [31]:
Corrective: correction of discovered problems, i.e. fixing bugs.

Adaptive: adapting the software product to the changing environment.

Perfective: improving the software, e.g. improving its performance.

Preventative: detection and correction of latent faults in the software product before they become effective faults.

The term “maintenance” by nature implies a preservation of the existing. However, software often undergoes further development after being released. To accommodate this, the addition of new features is usually included as part of perfective maintenance. Preventative maintenance is also sometimes bundled together with perfective maintenance.

Our goal is to use the many versions of the Linux kernel to characterize these four types of maintenance. Based on the structure of the Linux releases, we expect corrective maintenance to be reflected in successive versions of production kernels. Perfective maintenance, on the other hand, especially as it pertains to the addition of new features, is expected to be reflected in successive versions of development kernels. Adaptive maintenance is also reflected in development versions, but is largely confined to the arch and drivers directories, as this is the locus for code that handles interaction with the environment. The identification of preventative maintenance is harder. We assume that preventative maintenance is related to code reorganization, and can therefore be identified, for example, by changes in the number of directories [3]. We will therefore look for isolated events in which many files are partitioned, removed, or moved, again mainly in development kernels. (This mapping from observables to maintenance types is summarized in a short sketch at the end of this section.)

To characterize the different types of maintenance and their effect, we will collect data and calculate various software metrics for different versions and groups of directories. For example, we will compare successive production versions of Linux and successive development versions to establish whether code metrics can be used to distinguish between corrective and perfective maintenance. We also intend to examine these metrics before and after restructuring events, to determine the effects of such restructuring.

To study the effect of different maintenance activities in the Linux kernel we calculated numerous code metrics on all the code of all the releases. The code structure and releases were described above in Section 2.1. The metrics we applied include all the well-known metrics, such as lines of code (LOC), McCabe’s cyclomatic complexity (MCC) [15], Halstead’s volume, difficulty, and effort metrics (HV, HD, and HE, respectively) [8], and Oman’s maintainability index (MI) [20] — as described above in Section 2.3.
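As a compact summary of the study design just described, the following illustrative mapping (hypothetical code, not part of the paper's tooling) records where each maintenance type is expected to be visible in the data:

```python
# Where each maintenance type is expected to show up in the Linux data,
# per the study design above (illustrative summary only).
EXPECTED_SIGNALS = {
    "corrective":   "successive releases of production kernels"
                    " (1.0, 1.2, 2.0, 2.2, 2.4)",
    "perfective":   "successive releases of development kernels"
                    " (1.1, 1.3, 2.1, 2.3, 2.5)",
    "adaptive":     "growth of the arch and drivers directories,"
                    " mainly in development kernels",
    "preventative": "isolated restructuring events: many files"
                    " partitioned, moved, or removed",
}
```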
[Figure 1 shows two panels, “Number of Source Files: Other Directories” and “Number of Source Files: Arch and Drivers Directories”, plotting file counts against release dates from 1994 to 2008 for each version series.]
Figure 1: The growth of the number of files in Linux.
4. Measurements of the Linux Kernel

To observe the changes in metric values from one version to the next, we typically graph the results for all 810 versions, distinguishing between the arch and drivers directories and the core of the kernel. The X axis in these graphs represents the release date, and we make a distinction between the development versions (1.1, 1.3, 2.1, 2.3, and 2.5), the production versions (1.0, 1.2, 2.0, 2.2, and 2.4), and the 2.6 series.

4.1. Total Size

The size of a software system may be measured in different ways. We start by considering the number of files, the number of functions, and the number of lines of code (LOC). In particular, we focus on non-comment non-blank lines of code, i.e. lines that actually include executable statements.

Fig. 1 shows the growth in the number of files in Linux. We see here that most of the growth occurs in the development versions, whereas the production versions are usually stable, except for some increase at the beginning. This is especially noticeable in 2.4, which continues to grow till 2.6 is released, indicating that much code is added. This strange behavior in the first years of 2.4 will be discussed in detail below. The 2.6 series demonstrates a combination of the development and production versions: for each minor version (third digit) the number of files is constant as in production versions, but it grows between them as in development versions.

An interesting observation is that the pattern of growth in the arch and drivers directories can be different from the rest of the kernel, especially at select points. For example, version 2.5 exhibits two large “jumps” in the core directories, only one of which exists in arch and drivers, where it is much smaller. Another example is 2.2, which exhibits significant growth in the beginning for arch and drivers and is fairly steady for other directories.

The first jump in 2.5 (from 4017 to 4783 files in the core directories) is explained by the addition of the sound subdirectory. In previous versions there was a much smaller sound directory under drivers, but generally the sound project was developed separately. In 2.5.5 a new sound directory
[Figure 2 shows two panels, “Number of Functions: Other Directories” and “Number of Functions: Arch and Drivers Directories”, plotting function counts against release dates from 1994 to 2008.]
Figure 2: The growth of the number of functions in Linux.
was created at the top level, and the include directory was updated with header files and also grew. The drivers directory lost the old sound directory, but it did not shrink because other unrelated additions were made. The second jump is actually a combination of two effects. One is in the arch and drivers directories, where improvements and developments of the arch/um directory were made in version 2.5.35. In the consecutive version (2.5.36) support for a new file system (xfs) was added, generating an increase in the number of files in the core kernel. Thus the two seemingly correlated jumps are actually unrelated.

Fig. 2 shows the growth in the number of functions in Linux. Overall, we see a similar picture to that of the number of files — they are increasing over time in development versions and usually steady in production versions, with the special cases of initial growth in 2.4 and jumps in 2.5. However, some details are different. First, we see that in terms of sheer number of functions, about 2/3 to 3/4 of them are in arch and drivers and only 1/4 to 1/3 are in all other kernel directories. Second, the jumps in 2.5 are much more prominent in the core directories, and completely disappear in the arch and drivers directories, where 2.5 overlaps the growth of 2.4. This implies that the added files in 2.5 were relatively large, and had much more than the average number of functions. It is also worthwhile to notice the smaller jump in 2.2 (mainly 2.2.18, where 200 files and 3000 functions were added). These increases are due to many individual changes, including improved USB support and the addition of network drivers. Note that these features were added to 2.2 despite the fact that it is a production version.

We calculated LOC both as total lines in the file and as lines of actual code sans empty lines and comments. The resulting patterns were very similar to each other and to those of files and functions. Fig. 3 shows the growth in actual code lines. Comparing the core directories to arch and drivers, LOC closely follows the pattern for functions (i.e. larger jumps than for files in core 2.5, but no jumps in arch and drivers).

A completely different metric for code size is Halstead’s volume. Fig. 4 shows the aggregate values of this metric for all the Linux kernel versions. The pattern is generally similar to what we
[Figure 3 shows two panels, “LOC: Other Directories” and “LOC: Arch and Drivers Directories”, plotting lines of code against release dates from 1994 to 2008.]

Figure 3: The growth of LOC in Linux for different directories.
[Figure 4 shows two panels, “Halstead Volume: Other Directories” and “Halstead Volume: Arch and Drivers Directories”, plotting aggregate Halstead volume against release dates from 1994 to 2008.]
Figure 4: Halstead volume for each kernel.
have seen for several previous metrics: growth in development versions, stability in production versions, and several unique jumps. In particular, the growth pattern of HV is nearly identical to that of LOC. This should be expected, since the Halstead metrics grow with the code by their very definitions.

4.2. Size of Files and Functions

Next, we calculated LOC at the file and the function level (averaging the values over the number of files/functions respectively). We will show the results for lines of code only, excluding empty lines and comments, as the graphs for total lines are qualitatively similar.

The most obvious observation regarding the average size of files (Fig. 5) is that they are somewhat volatile and relatively large in the arch and drivers directories, but much smaller and more stable in the other directories. Another interesting finding is that within the arch and drivers directories production versions tend to have much higher values. This is due to a combination of initial growth in these directories which is larger than in contemporary development versions, and
[Figure 5 plots the average LOC per file against release dates from 1994 to 2008, with separate curves for the arch and drivers directories and for the other directories.]

Figure 5: The average LOC per file.
[Figure 6 shows two panels, “LOC per Function: Other Directories” and “LOC per Function: Arch and Drivers Directories”, against release dates from 1994 to 2008.]
Figure 6: The average LOC per function.
stability while the values for subsequent development versions decline. The end result is that the arch and drivers directories in the 2.0, 2.2, and 2.4 production versions have the highest average file sizes.

Looking at the LOC per function data (Fig. 6) we find a slightly different picture. The average sizes are similar in all directories, but with relatively large fluctuations and jumps. And there is a distinctive downwards trend in 2.6, indicating that the number of functions is growing faster than the LOC.

Note that the graphs of average LOC per file (Fig. 5) or function (Fig. 6) are generally more volatile than those of the number of files (Fig. 1), number of functions (Fig. 2), or just total LOC (Fig. 3). The reason is that average LOC is the quotient of two other metrics, and while all of these metrics generally grow, the relative rates of growth may fluctuate. However, some patterns
[Figure 7 shows two panels, “Average Halstead Volume per Function: Other Directories” and “Average Halstead Volume per Function: Arch and Drivers Directories”, against release dates from 1994 to 2008.]
Figure 7: Average Halstead volume per function.
Category             Avg. HV    Avg. HD     Avg. HE
all core 2.4.25       627.15      19.37    39224.57
xfs 2.4.25           1063.94      30.65   107102.20
other core 2.4.25     590.18      18.41    33478.82
all core 2.4.24       592.43      18.45    34252.18

Table 1: Halstead metrics for core kernel directories in version 2.4.25, with and without xfs, and compared with version 2.4.24.
do emerge. For example, in versions 2.0, 2.2, and 2.4 we can see that the growth in LOC was larger than in the number of files or functions, meaning each file or function, on average, has “gained” more LOC. In 2.6, by contradistinction, the average LOC per function (but not per file) is decreasing with time.

There are also a couple of interesting singularities, especially in the “core” directories. One is the significant drop in LOC per function in version 1.1. Another is the big jump in version 2.2.16. This is an example of the effect of outliers: it was caused by the addition of only 4 files with 7 functions each to the fs directory — but with a total of over 54,000 LOC.

Fig. 7 shows the average Halstead volume per function, as an alternative size metric. It is interesting to see that the patterns are somewhat different for the arch and drivers vs. other directories. The core directories exhibit a significant decrease in version 1.1 and some in 2.3, but are relatively stable at all other times. The arch and drivers directories, by contradistinction, exhibit a significant decrease in versions 2.5 and 2.6. They also show an increase in the production versions, at least for 2.0 and 2.2. However, for 2.4, HV in the core directories displays a large “jump” in 2.4.25 (which was released alongside 2.6.3, and has similar Halstead metric values). Notice that there are two jumps in 2.5 too. Finally, the pattern for HV is quite different from the pattern for LOC, especially where large jumps are concerned.

The cause of the jump between 2.4.24 and 2.4.25 appears to be the introduction of a new file system, xfs (this file system was first introduced into 2.5.36, and was a part of the initial 2.6 version). A deeper investigation of the xfs files revealed the following: a total of 170 new source files were added, containing a total of 1786 functions. These functions are numerous enough, and
[Figure 8 shows two panels, “McCabe Cyclomatic Complexity: Other Directories” and “McCabe Cyclomatic Complexity: Arch and Drivers Directories”, plotting total MCC against release dates from 1994 to 2008.]
Figure 8: Total McCabe’s cyclomatic complexity for all kernels.
their metrics are high enough, to have a strong effect on the statistics of all the core directories. This is demonstrated in Table 1, which compares the values of these files and the rest to the values of the previous version. As mentioned above, these results can also explain the second jump in 2.5 (the release of 2.5.36).

4.3. Complexity, Effort, and Maintainability

McCabe’s cyclomatic complexity (MCC) quantifies the complexity of a program as the number of linearly independent paths in the program’s control flow graph. We also examined the EMCC (extended MCC), but the results were qualitatively very similar to those of the MCC, so we will discuss only the MCC here. Our results (Fig. 8) show that MCC grows with time, and moreover that the pattern of growth is very similar to that of LOC or HV. This matches previous results that noted the general correlation of MCC with LOC, and the specific growth of MCC in Linux that stems from the growing code base [28, 32].

Next, we looked at the average MCC per function, which is interesting since MCC was originally proposed as a metric for independent code units. The results (Fig. 9) indicate a pronounced decrease over time, especially in the core directories of early (1.x) development versions and in the arch and drivers directories of all development versions. This gives the impression that with time the average complexity of the functions is decreasing, and thus maybe the quality of the kernel is improving. This conclusion remains valid also after a more extensive analysis of the whole distribution of MCC values of all functions, and how the distribution changes with time [9]. However, this doesn’t say whether the improvement is in existing functions (which can be interpreted as preventative maintenance) or simply that new functions that are added have lower complexity. We examined this directly by looking at the new files that are added each year separately from the older files (Fig. 10), focusing on development versions. This shows that new files have a decidedly lower complexity in the arch and drivers directories, and most of the time also in the core directories.
[Figure 9 shows two panels, “Average McCabe Cyclomatic Complexity per Function: Other Directories” and “Average Cyclomatic Complexity per Function: Arch and Drivers Directories”, against release dates from 1994 to 2008.]

Figure 9: Average McCabe’s cyclomatic complexity per function for all kernels.
[Figure 10 shows two panels comparing average MCC per function in old files, all files, and new files, for the other directories and for the arch and drivers directories, per year from 1995 to 2008.]

Figure 10: Average MCC per function in new and old files each year.
[Figure 11 shows two panels, “Halstead Effort: Other Directories” and “Halstead Effort: Arch and Drivers Directories”, against release dates from 1994 to 2008.]
Figure 11: Halstead effort for each kernel.
[Figure 12 shows two panels, “Ratio of Comments in Code: Other Directories” and “Ratio of Comments in Code: Arch and Drivers Directories”, against release dates from 1994 to 2008.]
Figure 12: Percentage of comment lines for each kernel.
Another metric for complexity is the Halstead effort. Fig. 11 shows that it behaves like other metrics we have seen, and in particular is very similar to aggregate MCC.

Moving to maintainability, a metric related to LOC is the percentage of comments (more commented code is expected to be more maintainable), which gives some perspective on the relation between total lines and those that actually contain code (Fig. 12). On average, comments comprise around 25% (similar to the results of Godfrey and Tu, who found that the percentage of comments and blank lines in the files is almost constant at between 28–30% [6]), with a general decreasing trend, and somewhat higher values in production kernels (2.0, 2.2, and 2.4). Dissecting this behavior we find that in the arch and drivers directories there is a big increase in comments in version 1.1, and then a general decreasing trend (to around 20%), where again the production versions’ values are more constant and higher. The trend in the core kernel directories is also a big increase in 1.1, but then it stays relatively flat, except for big jumps in 2.2 (upwards) and 2.5 (downwards).

Each of the previous metrics has a following of researchers who believe in it, while others criticize its deficiencies. Oman’s maintainability index (MI) is an attempt to pool them together in a way that matches empirical data. Its value is on a 100-point scale from 25 to 125, where low values correspond to lower maintainability and high values to higher maintainability. As MI is measured per module (or in our case, function), the data used are the average LOC, MCC, and HV per function. As we saw above, all of these decrease with time, thus contributing to a higher MI. The percentage of comment lines, on the other hand, has a slight downwards tendency. However, the change is small, so we do not expect it to have a large negative effect. As a result, the general trend of MI is expected to be increasing with time, as is indeed seen in Fig. 13.

Dissecting the general results according to directories, we observe that the sharp initial improvement in 1.1 is confined to the core directories. The subsequent slower improvement has more to do with the arch and drivers directories. The lower values attributed to production versions are also due to these two directories. Another interesting point is that since the quality values for the core kernel directories are typically better than those of arch and drivers (i.e. less LOC, lower values for HV and MCC, and slightly more comments), we also see that the MI for these directories is higher — meaning that the core has “better quality” than arch and drivers. However, the arch
[Figure 13 shows two panels, “Oman’s Maintainability Index: Other Directories” and “Oman’s Maintainability Index: Arch and Drivers Directories”, against release dates from 1994 to 2008.]
Figure 13: Oman’s maintainability index for each kernel, divided by directories.
and drivers are showing a larger improvement with time.

4.4. Correlation Between Metrics

The literature is full of arguments about software metrics and their relative merits (e.g. [34, 28, 29]). In particular, it has been claimed that MCC is correlated with LOC and thus does not provide new information. MI is by definition related to both MCC and Halstead’s metrics. From our observations of these metrics as they change in different versions of the Linux kernel, we can say that they are indeed often highly correlated with each other. In particular, LOC, MCC, and Halstead’s metrics all display very similar overall patterns.

However, there are specific instances where the metrics behave differently. As an example, we show the evolution of LOC and MCC in the core directories in Fig. 14. Several discrepancies are evident. Perhaps the most striking is the large jump in LOC in version 2.2.16 in mid 2000, due to the addition of 4 large files. However, these did not contribute significantly to the MCC, and thus there was no corresponding jump in MCC. A similar large increase in LOC with no change in MCC occurred in 2.4 at the beginning of 2002. Upon observation one may also see the steady increase in LOC in 1.3 and the initial segment of 2.1, which is paralleled by a steady decrease in MCC. And finally, version 2.5 starts out with a high LOC that drops precipitously before jumping up again, while its MCC starts out low and does not have such a drop. (A small sketch of how such per-version correlations can be quantified follows Fig. 14.)

5. Analysis of Maintenance Activities

We now use the above results to reflect on the four types of maintenance activities: corrective, perfective, adaptive, and preventative.

5.1. Corrective Maintenance

As indicated previously, we analyze corrective maintenance as reflected in successive versions of production kernels, since we assume that changes in successive versions of production kernels are usually corrective, due to the structure of the releases in Linux.
[Figure 14 shows two stacked panels for the core directories, average MCC per function and average LOC per function, each against release dates from 1994 to 2008.]
Figure 14: Side by side comparison of LOC and MCC per function on core directories.
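As an aside to the correlation discussion in Section 4.4, per-version metric series such as those behind Fig. 14 can be compared with a standard Pearson correlation. The following sketch is illustrative only; the short series are made-up stand-ins for the real per-version LOC and MCC averages:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-version averages (one entry per release), not real data.
avg_loc_per_func = [38.1, 37.9, 37.5, 37.6, 37.2, 36.8]
avg_mcc_per_func = [4.61, 4.60, 4.55, 4.57, 4.52, 4.47]
print(pearson(avg_loc_per_func, avg_mcc_per_func))
```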
As we have seen in the results, for each of the different metrics calculated — number of files, number of functions, LOC, MCC, Halstead’s metrics, and Oman’s maintainability index — the metric values for the production versions are essentially constant. This is seen in 1.2, 2.0, 2.2, and 2.4, and also in each of the minor versions of 2.6.

However, the metric values in production kernels do change in two cases. One is the large “jumps” seen in versions 2.2 and 2.4. These are explained by changes in functionality, where significant new functionality was propagated into a production version, but without calling this a new major version. The main examples are the improved USB support and additional drivers that were added to the 2.2.18 kernel, and the introduction of the xfs file system to the 2.4.25 kernel. Thus these jumps testify that the assumption that production versions represent only corrective maintenance is not always correct. At the same time, these jumps do not reflect corrective maintenance.
The other metric change observed in production versions is that both size and complexity metrics tend to grow initially and only then become constant. For aggregate metrics there is little growth in 2.0, more in 2.2, and a lot in 2.4. For average size metrics, such as LOC per file and LOC per function, the picture is reversed: the largest growth occurs in 2.0, while 2.4 is largely stable from the beginning. This initial growth may indicate that corrective maintenance — and specifically the combination of extensive testing and attempting to respond to user bug reports regarding new production versions — tends to add code and complexity to the existing structure, without a commensurate investment in restructuring and refactoring. However, it may also be just another case of creeping functionality updates, especially considering that most of the growth often occurs in the arch and drivers directories. Resolving this will require a detailed analysis of the actual modifications done in the initial part of production versions.

One can thus tentatively conclude that corrective maintenance does not have a strong effect on the different kinds of metrics. If anything, there may be a mild degradation initially, but then the values of the metrics are constant. We can also state that corrective maintenance is much less frequent than the other types, as the releases of production versions are much less frequent.

5.2. Perfective Maintenance

We analyze perfective maintenance as reflected in successive versions of development kernels, since according to the structure of the Linux versions, the main motivation for new development versions is to improve and add functionality.

Our results show that the different metrics for successive versions of development kernels usually change more than for production kernels. The direction of the change often depends on whether one considers the whole kernel or average values per file or per function. The detailed behavior observed depends on the metric being studied, with average values displaying considerable volatility. On one hand, it seems like less work is being done to “maintain” the code in development kernels. For example, the ratio of comments is lower for development versions than for production versions (i.e. they can be considered less readable). On the other hand, when comparing metrics such as MCC or the Halstead metrics per function, the values for development kernels are lower and exhibit a decreasing trend. As a consequence, the MI is higher for the development kernels, indicating that development kernels are less complex and more maintainable than production kernels. Of course, if we compare at the whole-kernel level, the picture will seem reversed, since there are more files and more LOC in the development kernels.

Regarding the trend of improved average metrics over time in development kernels, two interpretations are possible: either the code in development versions is indeed being improved, or else many small files and functions are simply being added at a higher rate than the complexity grows, so the averages shift in the desirable direction. Conversely, files and functions that are added to the production kernels after their initial release may be more complex, since they solve critical bugs and security issues. While more detailed research is needed to quantify the relative contribution of these two options, we have seen specific examples (in the analysis of high MCC value functions [9]) of improvements to code complexity and structure. Moreover, these improvements were limited to the
development versions. Thus we indeed have preliminary evidence of activity that improves code structure in development versions. This is discussed further below when we deal with preventative maintenance.

5.3. Adaptive Maintenance

We analyze adaptive maintenance as reflected in the arch and drivers directories (especially in the development kernels), since they best reflect the adaptation to changes in the environment — the addition of new architectures and new devices that need to be supported.

As we have seen above, the arch and drivers directories usually show the same trends as the rest of the kernel, although many times with higher magnitude (for example, the LOC in the arch and drivers directories is higher and grows faster than in other directories). However, there are differences. For example, the changes in LOC and number of files in 2.5 are more noticeable in the core kernel, whereas in 2.2 they are more noticeable in the arch and drivers directories. Also, arch and drivers have a direct effect on the core directories due to the include directory, which is usually updated with new header files once a major change or update in arch and drivers takes place.

When looking at normalized metric values, that is per function rather than an aggregate value for the whole kernel, the observed trends are quite different in the different directories. In general, average size and complexity metric values (LOC, MCC, and Halstead’s metrics) in the arch and drivers directories exhibit a steady decline throughout the development versions. The picture in other directories is typically much more volatile, with different versions overlapping each other and large discrete jumps occurring in various average metric values at different times. This seems to indicate that adaptive maintenance, as reflected in the arch and drivers directories, leads to an improvement in code quality.

The only exception to the above conclusion is the percentage of comment lines in the code, which also declines in the arch and drivers directories. This might indicate that more code is being added than comments, which in turn might indicate a degradation of maintainability, due to declining readability of the code. But when combined with the improvements noted above, as done by the Oman maintainability index, the overall trend is still one of improvement.

An especially interesting observation is the apparent interaction between adaptive and corrective maintenance. In practically all the average metrics, the production versions have consistently higher values than contemporary development versions when only the arch and drivers directories are considered. This leads to significantly lower MI values for these versions, as seen in Fig. 13.

5.4. Preventative Maintenance

Preventative maintenance may be inferred from a consistent and marked improvement in code quality metrics. Such considerations indicate that version 1.1 is special, as there seems to be especially significant improvement in the code in that version: the fraction of comments grew, and all the complexity metrics (MCC, Halstead) decreased considerably. One may therefore conjecture that this version saw much preventative maintenance, being the first version after 2½ years of development from the initial announcement in August 1991 to the first release in March 1994. However, verifying this conjecture will require detailed scrutiny of the code to assess the reasons for the improvement in the metric values.
[Figure 15 shows eight panels: files added, deleted, grown, and shrunk between successive development versions, as absolute counts (left column) and as percentages of the total number of files (right column), per year from 1994 to 2003.]
Figure 15: Files added, deleted, grown, or shrunk among development versions.
An alternative approach is to analyze preventative maintenance as reflected in code churn [7], that is, the partitioning, moving, or deletion of code. In particular, we will look for isolated events where a large churn is observed [3]. We focused on code changes between successive development versions (including version 2.4, since it seems to have served, at least in its first year, for development; see below). Changes were quantified by looking at the number of files which were changed (i.e. added, deleted, grown, or shrunk) in Fig. 15, and the same for directories in Fig. 16. The “deleted” group are files/directories that were removed, and “added” is the difference between the versions plus the number deleted (thus counting all additions to the second version). “Grown” and “shrunk” refer to the size of the files/directories (for directories, this is the number of files in them). In both figures, the left column shows the actual number of files/directories for each metric, and the right column shows the percentage. (A small sketch of this churn computation is given below.)

The comparison for each is continuous, i.e. we compared minor versions within the same major version and also did a cross-version comparison between the first release of a major version and the release of the previous major version that was the base of this release. The first release of version 1.1 was compared to the only release of 1.0. Version 1.3.0 was compared to version 1.2.10, version 2.1.0 to 2.0.21, version 2.3.0 to 2.2.8, and version 2.5.0 to 2.4.15. The last two comparisons revealed that the initial versions of 2.3 and 2.5 are completely identical to the production versions they originate from (i.e. all values in the graphs for the initial versions of 2.3 and 2.5 are zero). This agrees with our findings above and our observations regarding the release patterns.
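A minimal sketch of this churn computation, assuming each version is represented as a mapping from file path to size in LOC (the helper name churn is ours, hypothetical, not the paper's tooling):

```python
# Illustrative churn computation between two versions, each given as a
# dict mapping file path -> size (e.g. LOC).
def churn(old: dict, new: dict) -> dict:
    common = old.keys() & new.keys()
    return {
        # "added" counts all additions to the second version,
        # i.e. the net difference plus the number deleted.
        "added":   len(new.keys() - old.keys()),
        "deleted": len(old.keys() - new.keys()),
        "grew":    sum(1 for f in common if new[f] > old[f]),
        "shrunk":  sum(1 for f in common if new[f] < old[f]),
    }

v_old = {"kernel/sched.c": 1200, "fs/inode.c": 800, "net/core/dev.c": 1500}
v_new = {"kernel/sched.c": 1250, "fs/inode.c": 780, "mm/slab.c": 600}
print(churn(v_old, v_new))  # {'added': 1, 'deleted': 1, 'grew': 1, 'shrunk': 1}
```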
[Figure 16 shows eight panels: directories added, deleted, grown, and shrunk between successive development versions, as absolute counts (left column) and as percentages of the total number of directories (right column), per year from 1994 to 2003.]
Figure 16: Directories added, deleted, grown, or shrunk among development versions.
Versions 1.3 and 2.1 were not identical to the production versions they originated from, which were slightly modified. Version 2.4.0 was compared with the last 2.3 version (2.3.99-pre9), as it emerged from that version. The last comparison in each graph is between the last version in 2.5 (2.5.75) and the first in 2.6 (2.6.0). The spaces between the bars are times with no development versions.

When looking at the two figures, we see immediately (and not surprisingly) that they are similar. Looking at the left column, especially for the “grew” and “shrunk” data, we see that the number of files/directories changed grows with time. But when looking at the right column of the figures, the percentage of files/directories changed appears to be constant or maybe even slightly decreasing. Thus the rate of change is fairly constant between consecutive development versions. The strongest downwards trend is in the percentage of directories growing, which might be explained by the stabilization of the versions and releases. These results are slightly different from those found by Lehman [13], who claimed that the percentage of added files tends to decline, and that the percentage of changed files tends to grow.

For version 2.4 we see relatively large changes in 2002 and 2003, after it ceased to be used as a development version. This most probably reflects the fact that releases are so rare during this period, and thus each one must integrate changes that accumulated over multiple development versions.

The graphs include only a few discrete peaks of activity. For example, in version 2.5 we
For version 2.4 we see relatively large changes in 2002 and 2003, after it ceased to be used as a development version. This most probably reflects the fact that releases were so rare during this period that each one had to integrate changes that had accumulated over multiple development versions.
The graphs include only a few discrete peaks of activity. For example, in version 2.5 we see a large number of files deleted around November 2002. This is the result of replacing the configuration system in version 2.5.45: the old configuration files Config.help and Config.in were deleted in many directories and replaced by a single Kconfig file. We also see a relatively large number of directories added and deleted around March 2002. This is due to changes in the structure of the arch and drivers directories at that time, and also the addition of the sound subdirectory to the kernel. In general, however, preventative maintenance as reflected by code churn in Linux seems to be continuous rather than concentrated in isolated events, in contrast to the finding of Capiluppi et al. for the ARLA system [3]. This may also result from the fact that Linux is in effect an agglomeration of numerous subsystems: it may be that preventative maintenance is applied to subsystems in discrete events, but this is masked by the rest of the system, which is not subject to such activity at the same time.
The initial releases of the two production versions, 2.4 and 2.6, show large changes in comparison to the development versions they originated from. This is not surprising. First, since they are production versions, we expect them to differ from development versions, which are less tested and might contain more functionality. Second, both of these versions were released after a long wait of around six months after the last development release, during which significant additional work was done.
5.5. Development and Production Versions
As noted above, we expect development versions of the Linux kernel to reflect perfective maintenance, whereas production versions should reflect corrective maintenance. This expectation is based on the definitions of these two types of branches of the code, which are maintained in parallel. In practice, however, it seems that this distinction was not always followed, especially in the initial period of each production version.
The most extreme example of "mixing" the roles of the versions occurs at the beginning of 2.4. The last version of 2.3 was released on 24 May 2000. The first version of 2.4 was only released on 5 January 2001, and the first version of 2.5 only on 23 November 2001. Thus there is a gap of some 18 months between successive development versions. However, it seems that the initial part of 2.4 served for development (or at least reflected development activity that was being done without officially being released in a development version), as all our graphs indicate development-like growth, and 2.5 branched out of 2.4. The remaining gap between 2.3 and 2.4 seems to have been filled at least partially by 2.2. Specifically, 2.2 exhibits strong growth in this period, especially in the arch and drivers directories, which matches the difference in size between the end of 2.3 and the beginning of 2.4.
In addition to large-scale events like those listed above, there have been instances of "feature creep" from development to production versions. The main reason for this appears to be the desire to reduce the delay until new functionality becomes available in production releases (the same force that later led to the 2.6 release scheme). An additional reason is the contribution of reasonably stable code from third parties [6], for example the incorporation of the xfs file system in 2.4.25 (in parallel to its addition in 2.5.36). The mixing is also evident in the release rates of the different versions.
With regard to major releases, the principle up to 2.5 was that the development versions form a continuous development backbone, from which stable production versions branch out.
[Figure 17, "Kernels Release Dates", plots the minor release index against time (1994–2008) for versions 1.0 through 2.6 and the 2.6.x series.]
Figure 17: Minor number as a function of time, such that the slope of the line indicates the release rate.
However, in reality the effort to stabilize a production version nearly always caused interruptions in the development. The two largest gaps occurred between 2.3 and 2.4 and between 2.5 and 2.6. According to the 2.4 change logs and various Linux forums, the first was a result of the version "not being ready" for release (in accordance with Linus Torvalds's policy of releasing only when versions are stable), due to complications in developing and testing the many new features in that version. The second gap is a result of the structural change in the Linux release scheme. In the 2.6 series, releases have been much more closely managed and regular.
Fig. 17 displays the release rates of successive minor releases. The graph shows the times at which minor releases were made: the steeper the line, the higher the release rate [32]. Development versions exhibit a steady high rate, whereas production versions tend to start at a somewhat high rate and then taper off when the next development version is started. This indicates that the initial phase may serve as part of the development path, even if only in the role of stabilizing the code (i.e. corrective rather than perfective maintenance); indeed, version 1.3 is based on 1.2.10, 2.1 on 2.0.21, 2.3 on 2.2.8, and 2.5 on 2.4.15.
All this changed with the new release scheme of 2.6. In the first 10 releases of the 2.6 series, the distribution of release intervals was similar to that of previous production versions, but the tail of long intervals was effectively eliminated. With the 4th-digit releases, from 2.6.11 on, the whole distribution of release intervals is generally much lower (i.e. releases are more frequent) than in most production versions, albeit still higher than in the development versions. This should not be considered a problem, as 4th-digit releases are not development releases but rather bug fixes and security patches.
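As a concrete illustration of this view of the data, the following Python sketch computes the intervals between successive minor releases of one branch; shorter intervals correspond to a steeper line in Fig. 17. The dates shown are hypothetical placeholders, not the actual release dates.

from datetime import date
from statistics import median

def release_intervals(dates):
    """Days between successive minor releases of a single branch."""
    dates = sorted(dates)
    return [(b - a).days for a, b in zip(dates, dates[1:])]

# Hypothetical release dates for one branch (illustrative only).
branch = [date(2004, 1, 9), date(2004, 2, 4),
          date(2004, 3, 11), date(2004, 4, 3)]
print("median interval (days):", median(release_intervals(branch)))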
6. Conclusions and Future Work
The Linux kernel is one of the most successful open-source software projects in the world. It has been maintained continuously over the last 14 years in order to satisfy the needs of its users. We have presented a detailed characterization of this process, covering over 800 versions that represent new developments, major production releases, and minor updates. Many interesting phenomena are only visible at this fine resolution, and would be lost using the traditional approach of studying only major production releases. Our results are of course specific to Linux, but some observations may generalize to other software systems as well.
Due to the structure of Linux releases, we analyzed our results along several different axes: we compared the development versions, the production versions, and version 2.6, and also distinguished between the arch and drivers directories and the core kernel.
Looking at development versions, we found, as expected, that Linux is growing. In fact, until version 2.5 it was growing at a super-linear rate, but starting with version 2.6 the growth is closer to linear. However, the average LOC per function is decreasing, indicating that the number of functions is growing faster than the total LOC. A similar pattern is seen for commonly used metrics such as MCC and the Halstead metrics. This appears to be at least partly due to real improvements and preventative maintenance. As a result, Oman's maintainability index was generally increasing (i.e. the Linux kernel seems to become more maintainable with time). A special case is version 1.1, which improved significantly over time, perhaps because much effort was devoted to preventative and perfective maintenance.
In the production versions the various metrics usually stabilized at a constant level after a while. However, they typically suffer some initial degradation before stabilizing, possibly because of corrective maintenance that disregards increased complexity. We also found that, contrary to our presumption, production versions sometimes include perfective maintenance and not only corrective maintenance. Specific cases where the production or development versions did not follow the general trend were analyzed and explained in detail; these were usually instances of adding a large module that had been developed elsewhere.
As for the distinction between the arch and drivers directories and the core kernel, we found that arch and drivers usually (but not always) displayed the same trends as the core kernel, but with higher (or "worse") values: they have more LOC, files, and functions, and their complexity and maintainability indices imply lower-quality software. The same behavior holds at the function level. Thus it seems that adaptive maintenance, as reflected by the arch and drivers directories, leads to lower maintainability and higher complexity values, and is more volatile than other types of maintenance.
The results quoted in the last two paragraphs indicate that a globally observed behavior can sometimes be attributed to a very local change somewhere in the codebase. In particular, various subsystems, not only the arch and drivers directories, may exhibit distinct behaviors. This suggests that in future work we should look more closely at distinct subsystems. Indeed, the main drawback of our results so far is their coarse nature, due to viewing a huge system like the Linux kernel in the aggregate. It would be interesting to repeat select parts of this study for specific subsystems, and to perform a much more detailed analysis.
Another direction for future work is to try to estimate the amount of work invested in the different maintenance activities. For example, a survey of maintenance managers found that 17.4 percent of maintenance is corrective in nature [14], but a study that analyzed changes to source code obtained a figure more than three times larger [27]. It would be interesting to derive the proportions for Linux by quantifying the work done on the different versions.
Acknowledgments
This work was supported by the Dr. Edgar Levin Endowment Fund.
References
[1] I. T. Bowman, R. C. Holt, and N. V. Brewster, “Linux as a case study: its extracted software architecture”. In 21st Intl. Conf. Softw. Eng., pp. 555–563, May 1999.
[2] L. C. Briand, J. Wust, and H. Lounis, “Using coupling measurement for impact analysis in object-oriented systems”. In Intl. Conf. Softw. Maintenance, pp. 475–482, Aug 1999.
[3] A. Capiluppi, M. Morisio, and J. F. Ramil, “Structural evolution of an open source system: a case study”. In 12th IEEE Intl. Workshop Program Comprehension, pp. 172–182, Jun 2004.
[4] K. Chen, S. R. Schach, L. Yu, J. Offutt, and G. Z. Heller, “Open-source change logs”. Empirical Softw. Eng. 9, pp. 197–210, 2004.
[5] H. Gall, M. Jazayeri, R. R. Klösch, and G. Trausmuth, “Software evolution observations based on product release history”. In Intl. Conf. Softw. Maintenance, pp. 160–166, Oct 1997.
[6] M. W. Godfrey and Q. Tu, “Evolution in open source software: a case study”. In 16th Intl. Conf. Softw. Maintenance, pp. 131–142, Oct 2000.
[7] G. A. Hall and J. C. Munson, “Software evolution: code delta and code churn”. J. Syst. & Softw. 54(2), pp. 111–118, Oct 2000.
[8] M. Halstead, Elements of Software Science. Operating and Programming Systems Series, Elsevier Science Inc., 1977.
[9] A. Israeli and D. G. Feitelson, “The Linux kernel as a case study in software evolution”. Feb 2009. Submitted for publication.
[10] S. H. Kan, Metrics and Models in Software Quality Engineering. Addison Wesley, 2nd ed., 2004.
[11] C. F. Kemerer and S. Slaughter, “An empirical approach to studying software evolution”. IEEE Trans. Softw. Eng. 25(4), pp. 493–509, Jul/Aug 1999.
[12] M. M. Lehman, “Programs, life cycles, and laws of software evolution”. Proc. IEEE 68(9), pp. 1060–1076, Sep 1980.
[13] M. M. Lehman, D. E. Perry, and J. F. Ramil, “Implications of evolution metrics on software maintenance”. In 14th Intl. Conf. Softw. Maintenance, pp. 208–217, Nov 1998.
[14] B. P. Lientz, E. B. Swanson, and G. E. Tompkins, “Characteristics of application software maintenance”. Comm. ACM 21(6), pp. 466–471, Jun 1978.
[15] T. McCabe, “A complexity measure”. IEEE Trans. Softw. Eng. 2(4), pp. 308–320, 1976.
[16] T. Mens, J. Fernández-Ramil, and S. Degrandsart, “The evolution of Eclipse”. In Intl. Conf. Softw. Maintenance, pp. 386–395, Sep 2008.
[17] E. Mills, Software Metrics. Technical Report Curriculum Module SEI-CM-12-1.1, Software Engineering Institute, Dec 1988.
[18] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of open source software development: Apache and Mozilla”. ACM Trans. Softw. Eng. & Methodology 11(3), pp. 309–346, Jul 2002.
[19] G. Myers, “An extension to the cyclomatic measure of program complexity”. SIGPLAN Notices 12(10), pp. 61–64, Oct 1977.
[20] P. Oman and J. Hagemeister, “Construction and testing of polynomials predicting software maintainability”. J. Syst. & Softw. 24(3), pp. 251–266, Mar 1994.
[21] D. L. Parnas, “Software aging”. In 16th Intl. Conf. Softw. Eng., pp. 279–287, May 1994.
[22] J. W. Paulson, G. Succi, and A. Eberlein, “An empirical study of open-source and closed-source software products”. IEEE Trans. Softw. Eng. 30(4), pp. 246–256, Apr 2004.
[23] R. S. Pressman, Software Engineering: A Practitioner’s Approach. McGraw-Hill, 2nd ed., 1987.
[24] D. A. Rusling, “The Linux kernel”. URL http://tldp.org/LDP/tlk/.
[25] S. R. Schach, Object-Oriented and Classical Software Engineering. McGraw-Hill, 7th ed., 2007.
[26] S. R. Schach, B. Jin, D. R. Wright, G. Z. Heller, and A. J. Offutt, “Maintainability of the Linux kernel”. IEE Proc.-Softw. 149(2), pp. 18–23, 2002.
[27] S. R. Schach, B. Jin, L. Yu, G. Z. Heller, and J. Offutt, “Determining the distribution of maintenance categories: survey versus measurement”. Empirical Softw. Eng. 8, pp. 351–365, 2003.
[28] M. Shepperd, “A critique of cyclomatic complexity as a software metric”. Software Engineering J. 3, pp. 30–36, Mar 1988.
[29] M. Shepperd and D. C. Ince, “A critique of three metrics”. J. Syst. & Softw. 26, pp. 197–210, Sep 1994.
[30] I. Sommerville, Software Engineering. Addison-Wesley, 8th ed., 2007.
[31] E. B. Swanson, “The dimensions of maintenance”. In 2nd Intl. Conf. Softw. Eng., pp. 492–497, Oct 1976.
[32] L. Thomas, An Analysis of Software Quality and Maintainability Metrics with an Application to a Longitudinal Study of the Linux Kernel. PhD thesis, Vanderbilt University, 2008.
[33] E. VanDoren, Maintainability Index Technique for Measuring Program Maintainability. Technical Report, Software Engineering Institute, Mar 2002.
[34] E. J. Weyuker, “Evaluating software complexity measures”. IEEE Trans. Softw. Eng. 14(9), pp. 1357–1365, Sep 1988.
[35] L. Yu, S. R. Schach, K. Chen, and J. Offutt, “Categorization of common coupling and its application to the maintainability of the Linux kernel”. IEEE Trans. Softw. Eng. 30(10), pp. 694–706, Oct 2004.