
POSITION PAPERS

On Increasing Our Knowledge of Large-Scale Software Comprehension

ANNELIESE VON MAYRHAUSER [email protected]
Computer Science Department, Colorado State University, Fort Collins, CO 80523

A. MARIE VANS [email protected]
Computer Science Department, Colorado State University, Fort Collins, CO 80523

1. Status and Purpose of Studies

Program comprehension is a central task during software maintenance, evolution, and reuse. Some estimates put the cost of understanding software at 50% of the maintenance effort. Clearly, understanding how programmers go about comprehending software they have not written is a worthwhile endeavor. It is a prerequisite to developing maintenance guidelines, documentation, and tools that improve on our current processes because they support cognitive processes.

A variety of theories and models of program comprehension exist; for a discussion of these, see (von Mayrhauser and Vans, 1995a; von Mayrhauser and Vans, 1995b). However, many of them are based on studying novices (e.g., undergraduate students) working with small programs (less than 200 lines of code), without documentation or the platform tools commonly used by professionals. Most of the work centers on issues of general understanding rather than on specific, goal-driven tasks like corrective maintenance, adaptation, enhancement, or code leverage. Thus, it is not clear to what degree these theories and models apply to professional maintenance tasks that routinely occur on large-scale software.

Recent work (von Mayrhauser and Vans, 1995c; von Mayrhauser and Vans, 1996a; von Mayrhauser and Vans, 1996b) explored these issues through an observational study. Eleven professional programmers were observed on actual maintenance tasks. The objective was theory building through observation. The aim was to cover as wide a variety of tasks, levels of prior exposure to the code, and expertise (in language and application domain) as possible. Table 1 shows the distribution of participants along the three dimensions of subject classification. Task types are general understanding, corrective maintenance, adaptation, enhancement, and code leverage. Prior exposure to the code (accumulated knowledge) is ranked on a six-point scale from “never seen before” to “worked with code for several years”. Expertise is classified by language platform, domain knowledge, or both. (A sketch of one possible encoding of this classification follows Table 1.)

Like other observational studies, this one faced three issues in determining the validity of generalizing its results:

1. Task: while the tasks and the code were realistic (they represented actual work assignments), work assignments differed between individuals.

2. Sampling of participants: participants were professional maintenance programmers. The small sample does not claim to represent the full range of cognition behavior in all professional software maintenance engineers. The study, however, did cover a wide range of backgrounds.

3. External validity: the degree to which the conditions used for collecting observational data are applicable to actual maintenance tasks. Observational studies are usually based on realistic field observations and have neither the rigor of controlled experiments nor their sample size. They are used to build theory and are valuable for eliciting a broad spectrum of behaviors. Observational studies should be followed up with validation experiments for each aspect of the theory built.

Table 1. Participant classification: AD = Adaptation, C = Corrective, EN = Enhancement, G = General Understanding, L = Leverage Code.

Accumulated Knowledge ⇓ / Expertise ⇒        | Language Expert            | Domain Expert        | Language & Domain Expert
---------------------------------------------+----------------------------+----------------------+-------------------------------------
Never seen before (Little)                   | C2: Understand Bug         |                      | G1: Program: General Understand;
                                             |                            |                      | EN2: Add Function
File structure, call graph (Little)          |                            | C3: Fix Reported Bug | G2: Understand one Module
Requirement & design documents (Some)        |                            | C1: Fix Reported Bug | C4: Track Down Bug
Worked some with code, style familiar (Some) | L1: Leverage Small Program |                      | AD2: Add Function, Prototype Assess
Prior code enhancement, debugging,           |                            |                      | AD1: Port Program across Platforms
adaptations (Significant)                    |                            |                      |
Worked with code several years (Significant) |                            |                      | EN1: Add Functionality
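Anticipating proposal 1 in Section 3, the classification above could be encoded directly in a shared experimental database. The following C sketch is ours, not part of the original study: all identifiers are illustrative, and the levels simply transcribe Table 1’s task types, six-point accumulated-knowledge scale, and expertise classes.

```c
/* Illustrative encoding of Table 1's subject classification; all names
 * are our own assumptions, not from the original study. */
enum task_type {
    TASK_GENERAL_UNDERSTANDING,   /* G  */
    TASK_CORRECTIVE,              /* C  */
    TASK_ADAPTATION,              /* AD */
    TASK_ENHANCEMENT,             /* EN */
    TASK_CODE_LEVERAGE            /* L  */
};

/* Six-point accumulated-knowledge scale, from "never seen before"
 * up to "worked with code for several years". */
enum accumulated_knowledge {
    KNOW_NEVER_SEEN,                 /* Little      */
    KNOW_FILE_STRUCTURE_CALL_GRAPH,  /* Little      */
    KNOW_REQ_AND_DESIGN_DOCS,        /* Some        */
    KNOW_WORKED_SOME_STYLE_FAMILIAR, /* Some        */
    KNOW_PRIOR_ENHANCE_DEBUG_ADAPT,  /* Significant */
    KNOW_SEVERAL_YEARS               /* Significant */
};

enum expertise {
    EXP_LANGUAGE,
    EXP_DOMAIN,
    EXP_LANGUAGE_AND_DOMAIN
};

struct participant {
    char                       id[4];  /* e.g. "C2", "AD1" */
    enum task_type             task;
    enum accumulated_knowledge knowledge;
    enum expertise             expertise;
};

/* Example entry, following Table 1's first row: */
/* struct participant c2 =
       { "C2", TASK_CORRECTIVE, KNOW_NEVER_SEEN, EXP_LANGUAGE }; */
```

A standard encoding of this kind would let different researchers describe their subjects in comparable terms.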


2. Challenges for Validation of Software Comprehension Theory

Recent work has built a theory of large-scale, professional software maintenance activities. This theory should be validated. Yet this is not an easy task, and traditional validation experiments may well prove unrealistic. The experimental variables that would have to be considered are:

1. Software under maintenance. This should specify code size, application domain, documents, and platform (language, utilities, tools).

2. Task. Possible categories are general understanding, corrective maintenance, enhancement, adaptation, code leverage, and reuse.

3. Prior exposure to software. This requires a rating mechanism similar to the one implied in Table 1.

4. Level of expertise. Again, this could be accomplished by a ranking similar to that of Table 1. However, in experiments that propose to evaluate the effect of (growing) expertise over time, the issue of training subjects arises, as do the complexities of longitudinal studies.

5. Availability of subjects. Professional programmers are busy people, and observational studies take time out of their work schedule. This is particularly true when a set of programmers is asked to do the same task on the same code. Asking a group of 20 to 30 programmers to duplicate actual maintenance work is unlikely to succeed, given common deadline pressures in industry. Thus, availability of subjects could be a big obstacle.

6. Cost versus value of information gained. The most detailed and revealing information about programmers’ cognitive processes is undoubtedly provided by think-aloud protocols and subsequent protocol analysis (Ericsson and Simon, 1993). This is a very expensive method. Not only can recording equipment be quite costly; transcription of the recorded information is expensive as well. Identifying analysis categories (annotating the protocol and extracting information) has to be done manually, by and large. The more subjects and the longer the task, the more expensive this becomes. On the other hand, questionnaires given at the end of a task are far less reliable in the information they provide (Ericsson and Simon, 1993).

7. Variation in individuals. Accounting for this would require giving the same task to a large enough sample of subjects. In reality, this would require a pre-test for prior exposure to the code and for expertise, followed by giving a realistic maintenance task to these individuals.

Obviously, large-scale experiments on professional maintenance programmers doing a variety of types of maintenance on large-scale code have a feasibility problem.

3. Possible Solutions

While acknowledging the problems with validation of software comprehension theories, validation is still necessary if our knowledge of software comprehension is to be more than anecdotal. Given that large, controlled experiments are probably out of the question, we will have to assemble this validation information from a larger set of experiments. These could be similar to the naturalistic observations that led to the development of the original theory, but this will require comparing and aggregating the results of observational studies by different researchers. One way to deal with this problem is to have a sufficiently detailed standard for describing the experimental environment and setup, the data recorded, and the analysis methods employed. This would enable comparison and aggregation of results. We propose the following:

1. Develop a classification system for experimental variables. This would include software under maintenance, task, prior exposure to the software, and level of expertise (see Section 2). The ratings and classifications implied by Table 1 could be part of the description of such an experimental design.

2. Develop commonly accepted analysis categories. There is little value in equivalent experimental design descriptions if the outcomes cannot be compared or aggregated. Thus, one must analyze the observations according to common categories and analysis methods. von Mayrhauser and Vans (1996a) identified lists of action types, hypothesis types, and methods for process analysis. They also explained a method for relating actions to information needs (von Mayrhauser and Vans, 1995c). Undoubtedly, further analysis categories and methods will emerge over time. Re-analysis of existing data will become possible when categories, methods, and experiences with them are shared.

3. Provide common analysis tools to make aggregation and interpretation of results possible. The first phase of the analysis on the session transcripts is necessarily a manual process. For example, the inflection of a subject’s voice may be the only distinguishing feature marking an utterance as a hypothesis rather than a statement of fact. Once this first-phase assignment of action types has occurred, the transcript has been transformed into information that can be analyzed automatically. We developed a relatively simple lex/yacc program for this second phase of analysis (a minimal sketch of such a tool follows this list).

4. Share first-level analysis data. Federal regulations and confidentiality requirements with regard to subjects in many cases prohibit sharing of the actual recorded transcript. This is not the case with the results of the first-level, manual analysis. Providing this information together with subject and task classifications can build a growing and fertile analysis database for researchers.

5. Evolve the action classification scheme and the first-phase, manual analysis. Coding a transcript with action types is a subjective process. Making it reliable requires a precise set of rules for recognizing the various action types, training of individuals in this classification scheme, and inter-rater reliability analysis to validate it (a sketch of one standard reliability measure follows the analyzer sketch below).
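To make proposal 3 concrete, here is a minimal sketch of the kind of second-phase analyzer we have in mind. The original tool was written with lex/yacc; this stand-in is plain C, and its input format (one utterance per line, prefixed with a bracketed action-type tag assigned during the first, manual phase) is an assumption for illustration, as are the tag names.

```c
/* Sketch of a second-phase transcript analyzer. Assumed input format:
 * one utterance per line, prefixed with a bracketed action-type tag
 * assigned during the first, manual coding phase, e.g.
 *
 *   [HYP] maybe this routine initializes the device table
 *   [FACT] sio_init is called once from main
 *
 * The tags here are illustrative, not the authors' actual category set.
 * The program tallies how often each action type occurs in a session.
 */
#include <stdio.h>
#include <string.h>

#define MAX_TYPES 32
#define TAG_LEN   15

int main(void) {
    char line[1024];
    char tags[MAX_TYPES][TAG_LEN + 1];
    int counts[MAX_TYPES];
    int ntypes = 0;

    while (fgets(line, sizeof line, stdin) != NULL) {
        char tag[TAG_LEN + 1];

        /* Expect a leading "[TAG]"; skip unannotated lines. */
        if (sscanf(line, " [%15[A-Z_]]", tag) != 1)
            continue;

        int i;
        for (i = 0; i < ntypes; i++)
            if (strcmp(tags[i], tag) == 0)
                break;
        if (i == ntypes) {
            if (ntypes == MAX_TYPES)
                continue;           /* tag table full; ignore new tags */
            strcpy(tags[ntypes], tag);
            counts[ntypes] = 0;
            ntypes++;
        }
        counts[i]++;
    }

    /* Frequency profile of action types for this session. */
    for (int i = 0; i < ntypes; i++)
        printf("%-15s %d\n", tags[i], counts[i]);
    return 0;
}
```

A session coded this way can be piped through the program (e.g., `analyzer < session01.txt`), and the resulting frequency profiles can then be compared or aggregated across studies.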

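The inter-rater reliability analysis called for in proposal 5 can be carried out with any standard chance-corrected agreement measure; Cohen’s kappa is one common choice (the choice is ours, the paper does not prescribe one). The sketch below computes kappa from a confusion matrix of two raters’ codings of the same transcript.

```c
#include <stdio.h>

/* Cohen's kappa for two raters coding the same utterances into k action
 * types. m[i][j] counts utterances coded as type i by rater A and type j
 * by rater B. Returns (p_o - p_e) / (1 - p_e), where p_o is observed
 * agreement and p_e is the agreement expected by chance. */
double cohens_kappa(int k, const int m[k][k]) {
    double n = 0.0, agree = 0.0, chance = 0.0;
    for (int i = 0; i < k; i++)
        for (int j = 0; j < k; j++)
            n += m[i][j];
    for (int i = 0; i < k; i++) {
        double row = 0.0, col = 0.0;
        agree += m[i][i];
        for (int j = 0; j < k; j++) {
            row += m[i][j];     /* rater A's marginal for type i */
            col += m[j][i];     /* rater B's marginal for type i */
        }
        chance += (row / n) * (col / n);
    }
    return (agree / n - chance) / (1.0 - chance);
}

int main(void) {
    /* Toy data: three hypothetical action types (e.g., hypothesis,
     * statement of fact, chunking), 50 coded utterances. */
    const int m[3][3] = { { 20,  2, 1 },
                          {  3, 15, 2 },
                          {  0,  1, 6 } };
    printf("kappa = %.3f\n", cohens_kappa(3, m));
    return 0;
}
```

Values near 1 indicate that the coding rules and rater training have made the classification reliable; low values signal that the action-type definitions need refinement.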

References

Ericsson, K. A., and Simon, H. A. 1993. Protocol Analysis: Verbal Reports as Data, 2nd edition. Cambridge, MA: MIT Press.
von Mayrhauser, A., and Vans, A. M. 1995a. Program comprehension during software maintenance and evolution. IEEE Computer 28(8), August: 44–55.
von Mayrhauser, A., and Vans, A. M. 1995b. Program understanding: Models and experiments. Advances in Computers 40: 1–38.
von Mayrhauser, A., and Vans, A. M. 1995c. Industrial experience with an integrated code comprehension model. IEE Software Engineering Journal, September: 171–182.
von Mayrhauser, A., and Vans, A. M. 1996a. Identification of dynamic comprehension processes during large scale maintenance. IEEE Transactions on Software Engineering 22(6), June: 424–437.
von Mayrhauser, A., and Vans, A. M. 1996b. On the role of hypotheses during opportunistic understanding while porting large scale code. Proceedings of the International Workshop on Program Comprehension, March, Berlin, 68–77.

Applying QIP/GQM in a Maintenance Project

SANDRO MORASCA [email protected]
Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da Vinci 32, I-20133 Milano, Italy

Motivations

The introduction of a measurement program can be quite beneficial to a software organization’s maintenance activities. Quantitative analyses allow a software organization to evaluate its current maintenance practices, propose changes, predict the results of such changes, and evaluate the actual results of the changes on objective grounds. As a precondition for its practical usefulness, a measurement program should address the real needs of the software maintenance organization where it is used. Therefore, the measurement goals should be closely related to the organization’s corporate goals. Based on these goals and the characteristics of the specific maintenance environment, measurement programs should enable a software maintenance organization to identify both:

• the quality characteristics of maintenance processes and/or maintained products (e.g., reliability, reusability) that are of interest with respect to the measurement goals, and

• the factors (e.g., programmers’ experience, code characteristics) influencing the quality characteristics of interest

that are most relevant in the context of the specific maintenance environment. This allows software managers to make decisions on firmer grounds.