Simula Research Laboratory Technical Report 2008-05

A systematic review of empirical software engineering studies that analyze individual changes

Hans Christian Benestad, Bente Anda and Erik Arisholm
Simula Research Laboratory/University of Oslo

Abstract. Understanding, managing and reducing costs and risks inherent in change are key challenges of software maintenance and evolution, addressed in empirical studies with many different research approaches. Change-based studies analyze data that describes the individual changes that are made to software systems. This approach can be effective in order to discover cost and risk factors that are hidden at more aggregated levels. However, it is not trivial to derive appropriate measures of individual changes for specific measurement goals. The purpose of this review is to improve change-based studies by 1) summarizing how attributes of changes have been measured to reach specific measurement goals, and 2) describing validity issues, and hence improvement areas, for change-based studies. Thirty-four papers conformed to the inclusion criteria. Forty-three attributes of changes were identified, and classified according to a conceptual model that we developed for the purpose of this classification. The goal of each study was either to characterize the evolution process, to assess causal factors of cost and risk, or to predict costs and risks. Effective accumulation of knowledge across change-based studies requires precise definitions of attributes and measures of change. We recommend that new change-based studies base such definitions on the proposed conceptual model.

1. INTRODUCTION

Software systems that are used actively need to be changed continuously [1, 2]. Understanding, managing and reducing costs and risks of software maintenance and evolution are important goals for both research and practice in software engineering. However, it is challenging to collect and analyze data in a manner that exposes the intrinsic features of software maintenance and evolution, and a number of different approaches have been taken in empirical investigations. A key differentiator between classes of software maintenance and evolution studies is the selection of entities and attributes to measure and analyze:

• Lehman's laws of software evolution were developed on the basis of measuring new and affected components in subsequent releases of a software system, cf. [2, 3].
• Investigations into cost drivers during software maintenance and evolution have examined the effects of project properties such as maintainer skills, team size, development practices, execution environment and documentation, cf. [4-7].
• Measures of structural attributes of the system source code have been used to assess and compare the ease with which systems can be maintained and evolved, cf. [8-10].

An alternative perspective is to view software maintenance and evolution as the aggregate of the individual changes that are made to a software system throughout its lifecycle. An individual change involves a change request, a change task and a set of revisions to the components of the system. With this perspective, software maintenance and evolution can be assessed from attributes that pertain to the individual changes. Such attributes are henceforth referred to as change attributes, the measures that operationalize the change attributes are referred to as change measures, and the studies that base their analysis on change attributes and change measures are referred to as change-based studies. Two examples of topics that can be addressed in a change-based study are:

• Identify and understand factors that affect change effort during maintenance and evolution. This knowledge would contribute to the understanding of software maintenance and evolution in general, because the total effort expended by developers to perform changes normally constitutes a substantial part of the total lifecycle cost. For a particular project, it is essential to know the factors that drive costs in order to make effective improvements to the process or product. For example, if system components that are particularly costly to change are identified, better decisions can be made regarding refactoring.
• Measure performance trends during maintenance and evolution. Projects should be able to monitor and understand performance trends in order to plan evolution and take corrective actions if negative trends are observed.

A central challenge is to identify change attributes and change measures that are effective for such analyses. For example, in order to assess and compare changes with respect to the man-hours that were needed to perform them, it is necessary to characterize the changes in some way, e.g., by measuring their size and complexity. This paper addresses this challenge by performing a comprehensive literature review of change-based studies. Conducting such a review is a means of identifying, evaluating and interpreting all available research relevant to a particular research question, topic area, or phenomenon of interest [11]. This review describes the change attributes that have been used in empirical investigations, and we propose a conceptual model for change-based studies that enables us to classify them. We will argue that future change-based studies can benefit from using this model as a basis for classifications and definitions of change attributes and change measures.

To sum up, the objective of this literature review is to facilitate more effective investigations into the costs and risks of software maintenance and evolution, whether they are conducted by empirical researchers or by practitioners who are implementing a measurement-based improvement program. The approach is to summarize and critically review the state of the practice in change-based studies. We address two research questions:

RQ1. Which overall measurement goals have been set in change-based studies, and which attributes were measured to achieve these goals?
RQ2. How can change-based studies be improved over the current state of practice?

The remainder of this paper is organized as follows: Section 2 provides a summary of related work. Section 3 describes the review procedures, including the criteria for inclusion and exclusion of primary papers for the review. Section 4 describes the conceptual model for change-based studies. Sections 5 and 6 answer RQ1 and RQ2, respectively. Section 7 discusses limitations of the review. Section 8 concludes.

2. RELATED WORK

We are not aware of other attempts to provide a comprehensive review of change-based studies of software maintenance and evolution. Graves and Mockus summarized three of their own studies, which showed that time of change, tool usage, and the subsystem affected by a change affected change effort [12]. They also recommended that statistical models of change effort should control for developer effects, change size and maintenance type. Niessink listed six change attributes that affect change effort that have been identified in empirical work by other authors [13]. Of these, maintenance type and change size matched the change attributes identified by Graves and Mockus.


Kagdi et al. conducted a literature review of studies that have mined data from software repositories for the purpose of investigating changes to software components [14]. Their perspective is complementary to ours, because automated extraction of data from software repositories can be an attractive method for obtaining certain change measures.

One contribution of this paper is a proposed conceptual model for change-based studies. Existing conceptual models that describe software maintenance and evolution [15-17] constituted a foundation for the model. Relationships between these models and our model are further described in Section 4.

3. REVIEW PROCEDURES

3.1. Criteria for inclusion and exclusion

The following top-level criterion for inclusion of papers was derived from the objective of the review stated above: peer-reviewed papers that report on case studies that assessed or predicted maintenance and evolution activities on the basis of properties of individual changes, in the context of managed development organizations.

Assessment and prediction are two broad purposes of measurement [18]. They are highly interdependent, and we chose to include studies that involved one or both purposes. Due to our primary interest in the management of costs and risks of software maintenance and evolution, we focused on studies that have been conducted within managed development organizations, and chose to exclude investigations of distributed, volunteer-based development, commonly used in open source software development. Our review targeted both quantitative and qualitative studies.

Candidate papers were identified using the following procedure:
1. Send queries based on the inclusion criterion to search engines using full-text search.
2. Read identified papers to the extent necessary to determine whether they conformed to the criterion.
3. Follow references to and from included papers; then repeat from step 2.

Step 1 was piloted in several iterations in order to increase the sensitivity and precision of the search. A discussion of the tradeoffs between sensitivity and precision in the context of research on software engineering is provided by Dieste and Padua [19]. We arrived at the following search criterion for the first step, from which we derived search strings in the query languages that are supported by the selected search engines:

( ( size | type | complexity of [a] change | modification | maintenance [ task | request ] ) OR ( change | modification | maintenance [ task | request ] size | complexity | type ) ) AND project | projects AND software
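To illustrate, the sketch below expands such a pattern into concrete query phrases before they are adapted to each engine's query language. The phrase lists are our reading of the pattern above; the exact expansions used in the review are not reported, so this is illustrative only.

```python
# Illustrative expansion of the search pattern into concrete query phrases.
# The phrase lists are assumptions based on the pattern, not the review's
# actual search strings.
from itertools import product

qualifiers = ["size", "type", "complexity"]
subjects = ["change", "modification", "maintenance task", "maintenance request"]

phrases = [f'"{q} of a {s}"' for q, s in product(qualifiers, subjects)]
phrases += [f'"{s} {q}"' for s, q in product(subjects, qualifiers)]

query = f'({" OR ".join(phrases)}) AND (project OR projects) AND software'
print(query)
```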

We used Google Scholar (http://scholar.google.com) and IEEE Xplore (http://ieeexplore.ieee.org) because full-text search was required to obtain reasonable sensitivity. The queries returned 446 results from Google Scholar and 169 results from IEEE Xplore on 19 April 2007. In total, 261 papers remained after excluding papers on the basis of the title alone, i.e., non-software-engineering work, papers definitely off topic, or papers that were not peer reviewed. After merging the two sources, 230 papers remained. These underwent Steps 2 and 3 above. Sixty-two papers were judged as "included" or "excluded, but under some doubt". These were re-examined by the second and third authors, resulting in 33 included papers. Disagreements were resolved by discussion and by further clarifying and documenting the criteria for inclusion and exclusion. As a final quality assurance, the search criterion was applied to all papers from 27 leading software engineering journals and conference proceedings (1993 to 2007 volumes); see [20] for details of this source. One additional study was identified by this step, resulting in a total of 34 included papers.

In order to convey the criteria for inclusion or exclusion more explicitly, the remainder of this section summarizes studies of software maintenance and evolution that were excluded, but were considered to lie on the boundaries of the criteria.

An influential body of research on software evolution has based analysis on software releases and the components, i.e., the system parts of some type and at some level of granularity, that were present in successive releases. Belady and Lehman [3] measured the number of components that were created or affected in successive releases of the same system. Using this study as a basis, they postulated the law of continuing change, the law of increasing entropy, and the law of statistically smooth growth. Kemerer and Slaughter [21] provided an overview of empirical studies that have followed this line of research. Studies that used a unit of analysis other than the individual change, e.g., releases or components, were excluded from this review.

Based on an industrial survey on maintenance of application software, Lientz et al. quantified the amount of new development versus maintenance, and how work was distributed over types of maintenance [22]. This work has been influential in that it has drawn attention to later phases of the software lifecycle, and in that it spread the change classification scheme of corrective, perfective and adaptive changes, originally described by Swanson [23] and frequently used as a change attribute in the body of research included in this review. This work is not included in the review, because it was based on a survey rather than a case study.

Measures of structural attributes (code metrics) have been conjectured to provide inexpensive and possibly early assessments and predictions of system qualities. Measures have normally been extracted from individual source code components, or from succeeding versions of source code components. Briand and Wüst [24] provided an overview of empirical work on relationships between structural measures of object-oriented software, and process and product qualities. In order to identify erroneous components when building fault prediction models, some studies identified the components that were affected by a corrective change request, cf. [25-27]. However, we did not consider these studies to be change-based, because the unit of analysis was the individual component.

Studies on the analysis of software defects have attempted to understand the causes and origins of defects. Generally, these studies have analyzed and extracted measures from individual components. Some of the studies collected data about corrective change tasks, e.g., [28-30]. We chose to exclude studies that analyzed the causes of defects retrospectively, but to include studies that analyzed the change tasks that were performed to isolate or correct defects.
Research on cognitive aspects of software engineering has attempted to understand the mental processes that are involved in software engineering tasks. Some of these studies have been conducted in the context of change tasks that are performed during software maintenance and evolution, cf. [31]. We chose to exclude these studies, because Détienne and Bott [32] have provided a comprehensive summary of this specialized line of research.


3.2. Extraction of data

Goals, change attributes, and study context (RQ1) were described and classified by combining existing description frameworks with data-driven analysis similar to the constant comparison method of qualitative analysis [33]. In particular, for measurement goals, passages of relevant text were identified, condensed, and rephrased using terms consistent with the description template for measurement goals under the Goal Question Metric (GQM) paradigm [34]. These procedures resulted in the taxonomy listed in Table 1.

In order to describe and classify conceptual change attributes, we extracted information about the concrete change measures that were used in the studies. Key information was names, definitions, value ranges, and methods for data collection. This information was then compared and grouped with respect to the conceptual model in Figure 1, and with respect to a set of more detailed measurement questions, as listed in Tables 2 and 3. The procedures for developing the conceptual model for change-based studies are described in Section 4.

For study context, we describe the business context, measurement procedures and extent of data collection. We identified two measures for each of these attributes by using information that was available in the reviewed papers. The results are shown in Table A4.

Our approach to assessing the quality of change-based studies (RQ2) was to assess the studies in light of recurring validity issues, as described by Shadish et al. [35]. In order to identify areas for improvement for change-based studies, we focused on those validity issues that we judged to be particularly relevant in the context of such studies.

4. A CONCEPTUAL MODEL FOR CHANGE-BASED STUDIES

Our proposed conceptual model for change-based studies is depicted in Figure 1. The goals for the design of the model were 1) to create a minimal model that 2) facilitates the understanding and definition of the entities, attributes and measures that were used in the reviewed body of research, while 3) maintaining compatibility with existing concepts that have been used to discuss software maintenance and evolution. We developed and refined the model iteratively during the course of the review, in order to capture the change attributes that were used in the reviewed studies. Tables 2 and 3 list the relationships between these attributes and the entities in the model.

Wherever possible we reused concepts from existing conceptual models of software maintenance. In particular, the entities Development organization, Human resources, Change task, Change request, Component, System and Release, some under different names, were reused from the ontology of software maintenance proposed by Kitchenham et al. [16]. Similar conceptual frameworks have been defined by Dias et al. [15] and Ruiz et al. [17]. We used terms in our model that were 1) commonly used in the reviewed body of research, 2) neutral with respect to specific technologies, practices or disciplines in software engineering, and 3) internally consistent. For example, we used the term change task for the entity that is named maintenance task in [16]. Compared to the existing frameworks, the entities Change set, Version and Revision and their interrelationships were added, because they are necessary to describe and classify the change attributes that concern changes to the system components.
The relationships between some of the reused entities were changed, in order to better represent the change-oriented perspective taken in this paper.


Figure 1. A conceptual model for change-based studies

Standard UML syntax is used in the diagram. A role multiplicity of 1 should be assumed when role multiplicity is not shown. Role names are assigned in one direction only, in order to avoid clutter. For compositions, indicated by filled diamonds, the roles in the two directions can be read as composed by and part of.

The perspective adopted in this paper is that a change task constitutes the fundamental activity around which software maintenance and evolution is organized. A change task is a coherent and self-contained unit of work that is triggered by a change request. A change request describes the requirements for the change task. A change task is manifested in a corresponding change set. A change set consists of a set of revisions, where each revision creates a new version of a component of the system. The new version can be based on an existing version of the component, or it can be the first version of the component. A component can, in principle, be any kind of work product that is considered to be part of the system, although the reviewed studies focused primarily on measurement of source code components. Components can form a hierarchy in which a large component is composed of components of finer granularity.

A system is deployed to its users through releases. A release is composed of a set of versions of components. A release can also be described by the change sets, or the corresponding change requests, that the release incorporates. It is convenient to use the term change as an aggregating term for the change task, the originating change request, and the resulting change set. Changes, in this sense, involve human resources, and are managed and resolved by a development organization. Large changes, sometimes referred to as new features in the reviewed body of research, can be broken down into smaller changes that are more manageable for the development organization.

A change attribute is a property of a change task, of the originating change request, or of the resulting change set. A change attribute can also be derived from attributes of other entities in the conceptual model. For example, the sizes of all components that were involved in a change may be averaged, or otherwise combined, in order to form a change attribute that represents the size of changed components. Change measures can be extracted from change management systems, which are tools that manage the kind of information that is defined by our conceptual model. Such systems include tools that are used to manage and track change requests and change tasks, and tools that are used to support controlled change of the system components. A change outcome is a change attribute that represents the primary quality focus of the study, e.g., change effort. A change outcome measure is the operationalization of a change outcome, and is typically used as the dependent variable in statistical analyses.
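To make the model concrete, the sketch below encodes the core entities and two derived change measures in Python. The class and attribute names are illustrative assumptions of ours, not definitions taken from the reviewed studies; a real study would substitute its locally defined measures.

```python
# A minimal sketch of the conceptual model in Figure 1. All names are
# illustrative assumptions; they are not definitions from the reviewed studies.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Version:
    component: str                 # the component this version belongs to
    lines_of_code: int

@dataclass
class Revision:
    creates: Version
    based_on: Optional[Version] = None   # None if this is the first version

@dataclass
class ChangeSet:
    revisions: List[Revision] = field(default_factory=list)

    def span(self) -> int:
        """Change span: the number of revisions that are part of the change set."""
        return len(self.revisions)

    def mean_affected_component_size(self) -> float:
        """Mean lines of code of the versions that the revisions are based on."""
        sizes = [r.based_on.lines_of_code for r in self.revisions if r.based_on]
        return sum(sizes) / len(sizes) if sizes else 0.0

@dataclass
class Change:
    """Aggregates a change request, its change task and the resulting change set."""
    request_id: str
    maintenance_type: str          # e.g., "corrective", "adaptive", "perfective"
    effort_hours: float            # change effort: a typical change outcome measure
    change_set: ChangeSet = field(default_factory=ChangeSet)
```

The two derived methods correspond to the example operational definitions of span and size of affected components discussed below.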


It is beyond the scope of this paper to provide operational definitions of all variations of the specific change measures used in the reviewed body of research. However, the conceptual model in Figure 1 can be utilized further in a specific measurement context to facilitate precise definitions of change measures. For example, the span of a change could be operationalized as "the number of revisions that are part of a change set", while a measure of the size of affected components could be defined as "the arithmetic mean of lines of code in the versions that revisions in the change set are based on". Such definitions can be expressed formally using the Object Constraint Language (OCL) [36].

5. GOALS AND MEASURED CHANGE ATTRIBUTES (RQ1)

By following the procedures described in Section 3.2, three main categories and 10 sub-categories of studies were identified, as shown in Table 1. Key properties of each individual study are listed in Tables A1, A2 and A3 in Appendix A.

Table 1. Goals and sub-goals for change-based studies.

| Main category | Sub-category | References |
|---|---|---|
| Goal 1: Characterize the work performed on evolving systems (Table A1) | Goal 1.1: Understand and improve the maintenance and evolution process in a development organization | [37-42] |
| | Goal 1.2: Manage and control the maintenance and evolution process in a development organization | [43-45] |
| | Goal 1.3: Investigate selected elements in the maintenance and evolution process | [46-49] |
| | Goal 1.4: Understand the general nature of maintenance and evolution work | [21, 50-52] |
| Goal 2: Assess change attributes that explain change outcome (Table A2) | Goal 2.1: Identify change attributes that influence change outcome | [53, 54] |
| | Goal 2.2: Assess effects of a specific process element | [55-58] |
| | Goal 2.3: Validate change measures | [59, 60] |
| Goal 3: Predict the outcome of changes (Table A3) | Goal 3.1: Propose methodology for building predictive models | [61-64] |
| | Goal 3.2: Assess prediction frameworks | [65, 66] |
| | Goal 3.3: Investigate predictive power of change measures | [13, 67, 68] |

Goal 2 and Goal 3 studies employed quantitative models that related independent change measures to the change outcome measure of interest. Goal 2 studies attempted to identify causal relationships for the purpose of understanding and assessment, while Goal 3 studies focused on correlations and predictions. In contrast, most Goal 1 studies used summary statistics to provide a bird's-eye view of the work that was performed during maintenance and evolution. They focused on observing trends in the values of selected change attributes, rather than attempting to explain the observations.

5.1. Summary of characterization studies (Goal 1)

Goal 1 studies were split according to the sub-categories listed in Table 1. Goal 1.1 and Goal 1.2 studies are characterized by close involvement with the measured development organization. The measurement programs were planned in advance, e.g., following the GQM paradigm [34]. They are similar with respect to goals, the difference being that Goal 1.1 studies had the overall goal of improving the maintenance and evolution process, while Goal 1.2 studies focused on improving management control in ongoing projects.

The four earliest Goal 1.1 studies are from the space domain, characterized by a long-lasting mutual commitment between the development organization and software engineering researchers. A certain amount of overhead for data collection was accepted in these environments. The studies show a progression over time from studies for assessment and insight [41, 42], via studies for understanding and improved predictability [37], towards studies that took concrete actions in the form of process improvements [39]. Lam and Shankararaman [40] showed that these measurement goals were also feasible in projects that are managed less strictly. While the above studies focused on analyzing a comprehensive set of real changes, Bergin and Keating [38] used a benchmarking approach that evaluated the outcome of artificial changes that were designed to be representative of actual changes.

The Goal 1.2 studies were conducted within strictly managed development organizations. Arnold and Parker [44] involved management in setting threshold values for a set of selected indicators. This was an early attempt to use change measures to support decisions made by managers in a development organization. Likewise, Abran and Nguyenkim [43] focused on management decision support, and provided upfront and careful considerations of validity issues that pertain to change-based studies. Finally, Stark [45] suggested a rich set of indicators that provided answers to questions about the services provided by the development organization to its clients.

Goal 1.3 and Goal 1.4 studies collected data from change management systems, and attempted to provide insight into software maintenance and evolution that was generalizable beyond the immediate study context. Generalizability to other contexts was claimed on the basis of recurring characteristics of systems and development organizations. Goal 1.3 studies investigated the effects or intrinsic properties of specific process elements. Ng [46] investigated change effort in the domain of Enterprise Resource Planning (ERP) implementation. The remaining three studies addressed three different process topics: the intrinsic properties of parallel changes [48], instability in requirements [47], and the intrinsic properties of small changes [49].

Goal 1.4 studies addressed the nature of the software evolution and maintenance process in general. Kemerer and Slaughter [21] categorized change logs that had been written by developers maintaining 23 systems within one development organization, in order to identify patterns in the types of change that occurred during the investigated period of 20 years. Mohagheghi [52] analyzed a smaller set of change requests to answer specific questions about who requested changes, which quality aspects were improved by the changes, the time/phase at which the requests were created, and to what extent change requests were accepted by the development organization.

5.2. Change attributes in characterization studies (Goal 1)

Change attributes, typical questions and typical values used during data collection in Goal 1 studies are shown in Table 2. The leftmost column indicates the part of the conceptual model in Figure 1 that normally provides the data for change measures derived from the listed change attributes.
Table 2. Change attributes that were measured in Goal 1 studies, ordered by number of studies.

| Entity providing information | Change attribute | Question asked | Typical values | # |
|---|---|---|---|---|
| Change request | Maintenance type | What was the purpose of the change? | Fix/enhance/adapt | 12 |
| Change request | Change count | Was it a change? | Simple count of changes | 11 |
| Change request | Time (period) | When did the change occur? | Date, year, time since first deployment | 8 |
| Change task | Change effort | How much effort was expended on the change task? | Person hours, ordinal or ratio | 9 |
| System | System name | To which system or project did the change belong? | Nominal measure | 4 |
| Change request | Quality focus | Which system quality was improved by the change? | Functionality/security/efficiency/reliability | 4 |
| Change request | Status | What is the current state of the change request? | New/accepted/rejected/solved | 4 |
| Change request | Origin | In what context or by which party was the change request made? | Internal test/external users | 3 |
| Change task | Change interval | How long did it take to resolve the change request? | Days, ordinal or ratio | 3 |
| Revision | Change size | How much content was added, deleted or changed? | Lines of code, ordinal or ratio | 3 |
| Change task | Activity | Which activities were involved in the change task? | Requirements/analysis/design/coding/test | 2 |
| Change request | Change/defect source | Which activity caused the defect or the need for change? | Requirements/analysis/design/coding/test | 2 |
| Revision | Change span | How many components were affected? | Count of components | 2 |
| Change request | Defect type | What kind of coding error was committed? | Initialization/logic/data/interface/computational | 2 |
| Change request | Artifact type | What kind of component was affected? | Query/report/field/layout/data | 1 |
| Change request | Detection | By which technique was the defect/need for change detected? | Inspection/test-run/proof techniques | 1 |
| Change task | Delayed | Was the change task resolved later than scheduled? | Delayed/not delayed | 1 |
In summary, all Goal 1 studies attempted to characterize the work performed by development organizations. A predominant principle of measurement was to categorize changes according to selected characteristics. The proportion of changes that belonged to each category was compared to organizational standards, to other projects/systems, and between releases or time periods. Maintenance type, originally described by Swanson [23], was the criterion for classification that was applied most frequently. In particular, the proportion of corrective changes versus other types of change was frequently used as an indicator of quality, the assumption being that corrective work is a symptom of deficiencies in process or product. In most cases, observations and conclusions were based on descriptive statistics. In four studies, the statistical significance of proportions was investigated [21, 37, 51, 52].

Change effort, measured in person hours, was a key change measure for studies that focused on resource consumption. The number of changes was sometimes used as a surrogate measure when data on effort was not available. Some studies suggested using the average change effort per maintenance type as a rough prediction of the effort required to perform future change tasks of the same type.
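As an illustration of this style of analysis, the sketch below tests whether the proportion of corrective changes differs between two releases. The counts are invented for the example and do not come from any reviewed study.

```python
# Illustrative only: test whether the proportion of corrective changes differs
# between two releases (counts are invented, not taken from any reviewed study).
from statsmodels.stats.proportion import proportions_ztest

corrective = [46, 28]    # corrective changes observed in release 1 and release 2
total = [120, 110]       # all changes observed in release 1 and release 2

stat, p_value = proportions_ztest(count=corrective, nobs=total)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```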


5.3. Summary of studies that assess change attributes (Goal 2)

Goal 2 studies were split according to the goal sub-categories listed in Table 1. The studies used correlation analysis at different levels of complexity in order to identify relationships between the change measures used as independent variables and the change outcome measure. An overview of change outcome measures is given in Section 5.5. Goal 2.1 studies attempted to identify causal relationships between change attributes and change outcome, while Goal 2.2 studies investigated the effects of specific process elements.

Graves and Mockus [53] controlled for variations due to maintenance type and change size, and showed that change effort increased with system age. They automated the extraction of change measures from change management systems in order to minimize measurement overhead. Schneidewind [54] used historical change requests to investigate correlations between change attributes and the presence of defects. Atkins et al. [55] showed that introducing a new tool to support the development of parallel versions of the same components had a beneficial effect on change effort. Herbsleb and Mockus [57] showed that decentralization prolonged the change interval. Rostkowycz et al. [58] showed that re-documenting a system reduced subsequent change effort, and demonstrated that the break-even point for the investment in re-documentation versus saved change effort was reached after 18 months.

Goal 2.3 studies attempted to find appropriate change measures for concepts that are commonly assumed to influence change outcome. Maya et al. [60] described how function point analysis could be adapted to the measurement of small functional enhancements. They tested whether the function point measure could predict change effort, and observed a weak correlation in their study. Arisholm [59] showed that aggregations of certain measures of structural attributes of changed components could be used to assess the ease with which object-oriented systems could be changed.

5.4. Summary of prediction studies (Goal 3)

While Goal 2 studies attempted to identify change attributes that influence change outcome, Goal 3 studies attempted to predict that outcome. These studies used various prediction frameworks in order to build development-organization-specific prediction models of change outcome. The studies can be split according to the sub-categories listed in Table 1.

Goal 3.1 studies investigated methods and processes for building prediction models. In [61], Briand and Basili suggested and validated a process for building predictive models that classified corrective changes into different categories of effort. Evanco [62] used similar procedures to predict the effort of isolating and fixing defects, and validated the prediction model by comparing the results with the actual outcomes in new projects. Xu et al. [64] employed decision tree techniques to predict the change interval. The predictions from the model were given to clients to set their expectations, and the authors quantified the approach's effect on customer satisfaction. Mockus and Weiss [63] predicted the risk of system failures as a consequence of changes that were made to the system. They automated the statistical analysis required to build the models, and integrated the predictions into the change process that was used by the developers.
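The model-building step in such studies can be illustrated with a small sketch: a decision tree trained on a few change measures to classify change effort. The features and data below are invented; the reviewed studies used their own locally defined measures and prediction frameworks.

```python
# Illustrative sketch: train a decision tree to classify change effort from a
# few change measures (invented data; not a model from any reviewed study).
from sklearn.tree import DecisionTreeClassifier

# Each row: [change size (LOC), change span (#components), corrective? (0/1)]
X = [[10, 1, 1], [250, 6, 0], [40, 2, 1], [600, 12, 0], [15, 1, 0], [300, 8, 1]]
y = ["low", "high", "low", "high", "low", "high"]  # effort class per change

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(model.predict([[120, 4, 0]]))  # predicted effort class for a new change
```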
Goal 3.2 studies compared prediction frameworks with respect to their predictive power and the degree to which the frameworks exposed explanations for the predictions. In [65], Jørgensen assessed and compared neural networks, pattern recognition and regression models for predicting change effort. He concluded that models can assist experts in making predictions, especially when the models expose explanations for the predictions. In [66], Reformat and Wu compared Bayesian networks, IF-THEN rules and decision trees for predicting change effort on an ordinal scale. They concluded that the methods complemented each other, and suggested that practitioners should use multi-method analysis to obtain more confidence in the predictions.

Goal 3.3 studies attempted to identify change measures that could operationalize the conceptual change attribute of interest. Niessink and van Vliet [13] created and compared models for predicting change effort in two different development organizations. They suggested that the large difference in explanatory power between the organizations was due to differences in the degree to which the development organizations applied a consistent change process. In [67], the same authors investigated variants of function point analysis to predict change effort. Although the regression models improved when the size of affected components was accounted for, the authors suggested that analogy-based predictions might be more appropriate for heterogeneous data sets. Using data on change requests and measures of system size from 55 banking systems, Polo et al. [68] attempted to build predictive models that could assist in the early determination of the value of maintenance contracts. Considerable predictive power was obtained from rudimentary measures, a finding that the authors attributed to the homogeneity of the context (banking systems) and the maturity of the technology (Cobol).

5.5. Change attributes in assessment and prediction studies (Goal 2 and Goal 3)

Although Goal 2 and Goal 3 studies have very different goals, they are quite similar from the perspective of measurement, and they are therefore described together in this section. The choice of dependent variable, i.e., the change outcome measure, is a key discriminator with respect to the focus and goal of a study. The dependent variables in the reviewed studies are derived from four change attributes:

Change effort. The number of person hours expended on performing the change task is used as a change outcome measure in studies on change attributes that may influence productivity, and in studies on the estimation of effort for change tasks. Twelve of the 17 studies had these foci. In most cases, the measure was reported explicitly per change task by developers. Graves and Mockus proposed an algorithm that made it possible to infer change effort from more aggregated effort data [12]. This algorithm was put to use in, e.g., [55].

Change interval. While change effort is a measure of the internal cost of performing a change task, the time interval between receiving and resolving the change request can be a relevant dependent variable for stakeholders external to the development organization. This change measure was used in studies that focused on customer service and customer satisfaction [57, 64], where the measure could be extracted from information resident in change management systems.

Defects and failures. Historical data on defects and failures were used to identify change attributes that caused or correlated with defects and failures, to assess probabilities of defects or failures, and to assess the effect on defect proneness or failure proneness of a specific product improvement program. Such change measures are not straightforward to collect, because it can be difficult to establish a link from an observed defect or failure to the change that caused it. The two studies that used this dependent variable analyzed relatively large changes [54, 63].
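The effort-inference idea of Graves and Mockus can be conveyed with a much-simplified sketch: spread a developer's monthly reported hours over the changes completed that month, here weighted by change size. The actual algorithm in [12] fits a statistical model rather than applying a fixed weighting; the function below is only a conceptual approximation.

```python
# Much-simplified sketch of inferring per-change effort from aggregated effort
# data. The real algorithm [12] fits a statistical model; this illustration
# just allocates hours proportionally to change size.
def infer_change_effort(monthly_hours, changes):
    """monthly_hours: hours one developer reported for one month.
    changes: list of (change_id, size_in_loc) completed by that developer."""
    total_size = sum(size for _, size in changes)
    if total_size == 0:
        return {cid: monthly_hours / len(changes) for cid, _ in changes}
    return {cid: monthly_hours * size / total_size for cid, size in changes}

print(infer_change_effort(140, [("c1", 50), ("c2", 200), ("c3", 100)]))
# -> {'c1': 20.0, 'c2': 80.0, 'c3': 40.0}
```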
Change attributes, typical questions and typical values used during data collection in Goal 2 and Goal 3 studies are shown in Table 3. The leftmost column indicates the parts of the conceptual model in Figure 1 that provide data for deriving change measures from the listed change attributes. Measures of the change request, the change task and the revisions that are part of a change set occurred most frequently. Size, structure and age were the most frequently measured change attributes that used information from changed components and their versions. Information about revisions, versions and components that were involved in a change set could only be measured after the change had been made. For the prediction goals, such change measures needed to be predicted first. The degree of collaboration (developer span) was the most frequently measured change attribute that used information about the human resources involved. No attribute of the development organization was used more than once.

Table 3. Change attributes measured in Goal 2 and 3 studies, ordered by category.

| Entity providing data | Change attribute | Question asked | Typical values/scale | # |
|---|---|---|---|---|
| Change request | Maintenance type | What was the purpose of the change? | Fix/enhance/adapt | 9 |
| Change request | Criticality | What would be the effect of not accepting the change request? | Minor/major inconvenience/stop | 4 |
| Change request | Change/defect source | Which activity caused the defect or the need for change? | Requirements/analysis/design/coding | 2 |
| Change request | Defect type | What kind of coding error was committed? | Init/logic/interface/data | 2 |
| Change request | Requirements instability | To what extent were the change requirements changed? | Number of requirement changes | 2 |
| Change request | Quality focus | Which system quality was improved by the change? | Security/efficiency/reliability | 1 |
| Change task | Change effort | How much effort was expended on performing the change task? | Person hours, ordinal or ratio | 15 |
| Change task | Change interval | How long did it take to resolve the change request? | Days, ordinal or ratio | 5 |
| Change task | Subjective complexity | How complex was the change perceived to be? | 3-point ordinal scale | 3 |
| Change task | Test effort | What was the test effort associated with the change? | # test runs | 1 |
| Revision | Change span | How many components were affected? | # components changed | 9 |
| Revision | Change size | How much content was changed? | Added + deleted LOC | 7 |
| Revision | Function points | How many logical units will be changed, added or deleted by the change? | Count of changed, added and deleted units, weighted by complexity | 2 |
| Revision | Coding mode | Was content changed or added? | Changed/added | 1 |
| Revision | Execution resources | How many (added) computational resources were required by the change? | CPU-cycles, bytes of memory | 1 |
| Version | Size | How large were the changed components? | Lines of code, number of components affected | 4 |
| Version | Structural attributes | What was the profile of the structural attributes of the changed components? | Count of structural elements (coupling, branching statements) | 3 |
| Version | Data operation | Which data operation did the affected components perform? | Read/update/process | 2 |
| Version | Criticality | How critical was the affected component? | Is mission critical? | 1 |
| Version | Code quality | Had the changed components been refactored? | Refactored/not refactored | 1 |
| Component | Age | How old were the changed components? | Years since deployment, version number, date | 3 |
| Component | Documentation quality | How well were the changed components documented? | Was documentation rewritten? | 1 |
| Component | Code volatility | How frequently had the affected components been changed? | Total number of changes | 1 |
| Component | Component kind | What kind of component was affected? | Batch/online program | 1 |
| Component | Component id | Which specific components were changed? | Class or subsystem name | 1 |
| Component | Technology | Which technology was applied in the changed components? | 3GL/4GL | 1 |
| Human resource | Developer span | How many developers were involved in performing the change task? | Number of people | 3 |
| Human resource | System experience | For how long had the developers been involved in developing or maintaining the system? | Number of years | 1 |
| Human resource | Developer id | Who performed the change task? | Nominal measure | 1 |
| Human resource | Maintenance experience | For how long had the developers performed software maintenance work? | Number of years | 1 |
| Human resource | Objective change experience | How many changes had the developers previously performed on the affected components? | Number of previous check-ins in version control system | 1 |
| Human resource | Subjective experience | How was experience with respect to the affected components perceived? | 3-point ordinal scale | 1 |
| Development organization | Team id | Which team was responsible for the change task? | Nominal measure | 1 |
| Development organization | Location | Where were human resources located physically? | Distributed/not distributed | 1 |
| Development organization | Tool use | Which tool was involved in the change task? | Tool used/not used | 1 |
6. VALIDITY ISSUES AND IMPROVEMENTS TO CHANGE-BASED STUDIES (RQ2)

Study validity refers to the truth of propositions about correlations (conclusion validity) and causal relationships (internal validity) between variables within the studied context, whether these variables captured the intended aspects of real-world concepts (construct validity), and whether results are applicable beyond the studied context (external validity), cf. [35]. The following more specific validity issues were investigated in order to identify possible improvements to change-based studies:

1. Did the measured entities and change attributes adequately represent the main phenomenon or quality under study (construct validity)?


An assumption that underlies most change-based studies is that software maintenance and evolution can be viewed as the aggregation of changes applied to the evolving system. However, if change effort constitutes a small proportion of the total cost of maintenance, the validity of this assumption is questionable. Furthermore, if study questions concern the effects of factors at the project level, such as principles of project management or contract regimes, a change-based study is not necessarily appropriate. Abran and Nguyenkim [43] stated and handled the former threat explicitly, by comparing change effort data to activity-based effort reports. This procedure verified the appropriateness of using change measures to characterize the activities of the development organization. We recommend comparing the effort expended on individual changes to the total cost of maintenance, and using this proportion as a rough indicator of the relevance of a change-based study.

2. Was a clear rationale applied when deriving change measures from change attributes (construct validity)?

This question concerns the construct validity of the individual change measures. An illustrative example is the use of the term change size. Investigators should make clear whether the change measure is supposed to capture the amount of affected code in the change set, or the amount of work involved in resolving a change request. Only four of the reviewed studies discussed such issues [48, 49, 52, 59]. A plausible explanation is the lack of conceptual frameworks from which to derive change measures. The extracted change attributes that are listed in Tables 2 and 3, and that refer to the conceptual model in Figure 1, are intended to be a framework that is useful for deriving change measures in new change-based studies.

3. Were all change attributes that could contribute to the observed outcome considered (internal validity)?

It is challenging to identify causal relationships in change-based studies, due to the presence of, and interactions between, the multitude of change attributes that might influence change outcome. Even with a coarse grouping of change attributes, as in Tables A2 and A3, very few studies considered change measures from every group. We recommend that new studies consider a broader set of change measures. The summaries in Tables 2 and 3 are intended to be helpful in this respect. In addition, accumulation of knowledge of causal relationships through continuously improved conceptual frameworks and theories would support new studies in the selection of appropriate change attributes and corresponding operational change measures.

4. Did the study have sufficient power to detect relationships in the population (conclusion validity)?

In statistical hypothesis testing, power is the probability of discovering a relationship that exists in the population. The power of an analysis increases as the number of data points increases. Table A4 summarizes the extent of data collection in the reviewed studies. One factor that might have affected the extent of data collection is whether the data was created by the change process that the development organization applied, i.e., naturally created, or created for the purpose of measurement. The median numbers of analyzed changes in the reviewed studies were 1724 and 129 for naturally created and purpose-created data, respectively.
The median durations of data collection were 48 and 21 months for the same categories. These observations support the intuitive idea that relying on data that occurs naturally not only reduces organizational overhead, but also facilitates prolonged and more extensive measurement programs. We therefore recommend that investigators look for naturally created data that can be used as alternatives to purpose-created data.
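The power consideration can be made concrete with a standard sample-size calculation. The effect size and thresholds below are invented for illustration, not values from any reviewed study.

```python
# Illustrative power calculation (invented numbers): how many changes per group
# would be needed to detect a small-to-medium effect with conventional power?
from statsmodels.stats.power import TTestIndPower

n_required = TTestIndPower().solve_power(effect_size=0.3, alpha=0.05, power=0.8)
print(f"about {n_required:.0f} changes per group")
```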


5. Did the study collect appropriate kinds of data for the research questions (internal validity)?

The ability to discover causal relationships also depends on the type of data that was collected, and on the method of analysis. In the reviewed studies, the data collected and the analyses are primarily quantitative; qualitative methods are used to a limited degree. Briand et al. [39] elicited root causes of changes by interviewing developers and by inspecting change artifacts. Nurmuliani and Williams [47] employed qualitative methods, such as interviews and observations, in order to extract quantitative change measures. Indeed, many studies relied on interpretations of qualitative data. For example, reading change requests for the purpose of classifying changes is a simple use of qualitative procedures. The study by Lam and Shankararaman [40] went one step further by creating a system-specific hierarchy of types of changes, based on the changes that actually occurred in the system. In studies that attempt to explain why and how effects occur, the use of systematic qualitative procedures can be appropriate. The review shows that there is potential for change-based studies to utilize such procedures.

6. Could change measures be reliably collected (conclusion validity)?

Some of the change measures defined in Table 3 (criticality, subjective experience, subjective complexity and maintenance type) depend on human judgment. The potential unreliability inherent in such change measures can be a threat to conclusion validity, because it may weaken or strengthen an observed effect beyond the true effect. Abran and Nguyenkim [43] used a pilot study to verify that change data could be classified reliably. Remedies for unreliability include improved training and the use of change measures that are adapted to the local context. Graves and Mockus [53] applied techniques to automatically deduce the maintenance type from information in change management systems. Such techniques improve reliability and reduce measurement overhead, but may introduce other validity issues.

7. Were the study results generalizable to situations beyond the studied context (external validity)?

Case studies rely on analytic generalization, which means that the investigator attempts to determine whether or not results are applicable to other contexts through the application of theory [69]. A case study can confirm or refute an existing theory, or indicate that the theory needs to be modified in some way. A strengthened or modified theory can subsequently be applied in other contexts in order to make predictions or explain observed phenomena. Kemerer and Slaughter [21] intended to study and develop a theory of the process of software evolution. No other study discusses theory in the sense described above. This finding is not surprising, because it is known that the use of theory in software engineering research is weak [70]. The result is that many of the studies provide results that are useful within the studied context, while their applicability to other contexts is more questionable. A basic practice to improve generalizability is to select and report context attributes that may have affected the results. Whenever possible, the rationale for selecting specific context attributes should also be reported.

7. LIMITATIONS

The process by which papers were selected balanced the use of systematic, repeatable procedures against the intent to identify a comprehensive set of change-based studies. A more repeatable process could have been achieved by limiting searches to abstracts and titles only, by omitting the traversal of literature references, and by excluding the Google Scholar search engine, which yielded low precision for paper retrieval. However, a more repeatable process might have failed to retrieve many of the included papers. Given that meeting the objective and answering the research questions of this study relied on identifying a broad set of change-based studies, we chose to assign lower priority to repeatability. As a consequence, the procedures we followed did not fully comply with the procedures for systematic reviews suggested by Kitchenham et al. [11]. It is worth noting that the challenges experienced in attempting to follow systematic procedures stem from the lack of common conceptual frameworks. A common conceptual basis would clearly improve sensitivity and precision during the selection of papers.

The elicitation of measurement goals, change attributes and study contexts (RQ1) was based partly on the coding of qualitative information; hence, decisions regarding coding that were made on the basis of subjective judgment could have influenced the results. The use of existing description frameworks mitigated this effect to some extent, and contributed to a relatively straightforward coding process.

The validity issues that were investigated in the quality assessment (RQ2) were identified by the judgment of the authors. They should therefore not be taken as comprehensive. However, we believe that the key issues for change-based studies are addressed.

8. CONCLUSIONS AND FURTHER WORK

Change-based studies assume that software maintenance and evolution is organized around change tasks that transform change requests into sets of modifications to the components of the system. This review of change-based studies has shown that specific study goals have been to characterize projects, to understand the factors that drive costs and risks during software maintenance and evolution, and to predict costs and risks. Change management systems constitute the primary source for extracting change measures. Several of the reviewed studies have demonstrated how measurement and analysis can be automated and integrated seamlessly into the maintenance and evolution process.

Although this review includes examples of successful measurement programs, it was difficult to determine whether and how insights into software maintenance and evolution could be transferred to situations beyond the immediate study context. On the basis of our discussion of generalizability and other selected study qualities, we recommend that new change-based studies base measurement on conceptual models and, eventually, theories. This observation may be seen as an instance of a general need for an improved theoretical basis for empirical software engineering research. In order to make progress along this line, we anchored this review in a minimal, empirically based conceptual model with the intention of supporting change-based studies. We built the model by ensuring compatibility with existing ontologies of software maintenance, and by extracting and conceptualizing the change measures applied in 34 change-based studies from a period of 25 years.

In future work, we will conduct a change-based multiple-case study with the aim of understanding more about the factors that drive the costs of software maintenance and evolution. The results from this review constitute important elements of the study design. We believe that this review will be useful to other research and measurement programs, and will facilitate a more effective accumulation of knowledge from empirical studies of software maintenance and evolution.

ACKNOWLEDGEMENTS


We thank Simon Andresen for discussions on conceptual models of software change, the anonymous reviewers for useful feedback, Chris Wright for proofreading the paper, and Aiko Fallas Yamashita for her comments, which improved the readability of this paper.

APPENDIX A. SUMMARY OF EXTRACTED DATA

The three main classes of included studies are listed in Tables A1, A2 and A3. Within each class, the studies are listed in chronological order. In Tables A2 and A3, an asterisk (*) indicates that the variable was statistically significant, at the level applied by the authors of the papers, in multivariate statistical models. Table A4 summarizes the business context, measurement procedures and extent of data collection in the reviewed studies.

Table A1. Characterize the work performed on evolving systems (Goal 1).

| Study | Study goal | Indicators and change attributes |
|---|---|---|
| | Manage the maintenance process | Change count by: maintenance type (fix, enhance, restructure); status (solved requests vs. all requests), per maintenance type; change effort (little/moderate vs. extensive), per maintenance type. Measures were compared to local threshold values for several systems |
| Weiss and Basili [42] | Assess maintenance performance in | Change count by: defect source (req. specification, design, language, …); change effort ( |