Evaluating Defect Detection Techniques for Software Requirements Inspections

Filippo Lanubile and Giuseppe Visaggio
University of Bari, Dipartimento di Informatica
Via Orabona, 4 - 70126 Bari, Italy
+39 080 544 3270
{lanubile, visaggio}@di.uniba.it

Abstract

Perspective-Based Reading (PBR) is a family of defect detection techniques which have been proposed to improve the effectiveness of software requirements inspections. PBR drives individual document reading by means of perspective-dependent procedural scenarios, which are different for each inspector in the team. Building on the former PBR experiments, we present a controlled experiment with more than one hundred undergraduate students who conducted software inspections as part of a university course. We tested an enhanced procedural version of PBR by comparing it to ad hoc reading and checklist-based reading, and by analyzing data both at the individual and team level. The experimental results do not support previous findings that PBR improves defect detection effectiveness with respect to nonsystematic reading techniques. We also investigated team composition issues by assessing the effects of combining different or identical perspectives at inspection meetings. The empirical test did not support the assumption that different perspectives in an inspection team achieve a higher coverage of a document. Process conformance has been explicitly considered and used to check whether readers departed from the assigned guidelines and then to filter the raw data. We discovered that only one fifth of the PBR reviewers had actually followed their scenarios, and even many checklist reviewers had reverted to an ad hoc reading style. From the debriefing questionnaires, we measured the reviewers' self-assessment of the assigned reading techniques. The results revealed no relationship with the type of reading technique, but the subjective evaluation was significantly related to trust in process conformance. This experiment provides evidence that process conformance issues play a critical role in the successful application of reading techniques and, more generally, software process tools.

Keywords: perspective-based reading, software reading techniques, software inspection, requirements document, process conformance.

1 Introduction

Software inspection is one of the industry best practices for delivering high-quality software (Wheeler et al., 1996). The main benefit of software inspections derives from applying them early during software development, thus preventing the exponential growth of defect repair cost (Boehm, 1981). Software inspection is a structured process for the static verification of software documents, including requirements specifications, design documents, and source code. From the seminal work of Fagan (Fagan, 1976; Fagan, 1986) to its variants (Humphrey, 1989; Gilb and Graham, 1993), the software inspection process is essentially made up of four consecutive steps: planning, preparation, meeting, and rework. During planning, a member of the inspection team sends the inspection material to the rest of the team and makes a schedule for the next steps. During preparation, each inspector in the team individually studies and reviews the document to find defects. During the meeting, all the inspectors meet to collect and discuss the defects from the individual reviews and to find further defects. Finally, during rework, the author revises the document to fix the defects. The main change from the original Fagan inspection has been a shift of primary goals for the preparation and meeting steps. The main goal of preparation has changed from pure understanding to defect detection, so inspectors have to individually take note of defects. Consequently, the main goal of the inspection meeting has been reduced from defect discovery to defect collection, including the discussion of defects individually found during preparation. Among the many sources of variation in software inspections, Porter et al. (1998) have shown that changes in the inspection process structure can cut inspection cost and shorten the inspection interval but do not improve inspection effectiveness (basically measured as the number or density of defects found). Rather, the reading techniques used to analyze documents are the key to improving inspection effectiveness.

1.1 Reading Techniques for Defect Detection

In a software inspection, document reading is first performed by inspectors working alone during preparation. Many inspectors read documents for defect detection using ad hoc or checklist techniques. Ad hoc reading for defect detection is a very nonsystematic technique, which leaves inspectors free to rely on their own review guidelines to find defects. Checklist reading for defect detection requires inspectors to read the document while answering a list of yes/no questions, based on past knowledge of typical defects. Checklist reading can also be considered a nonsystematic technique, although less so than ad hoc reading, because it does not provide a guideline on how to answer the questions. Scenario-based reading techniques have been proposed to support inspectors throughout the reading process in the form of operational scenarios (Basili, 1997). A scenario consists of a set of activities aimed at building a model, plus a set of questions tied to that model. While building the model and answering the questions, the reader should write down the defects he finds in the document. Each reader in the inspection team gets a different and specific scenario in order to minimize the overlap of discovered defects among team members and thus increase the inspection effectiveness after defect collection at the meeting. Two families of scenario-based reading techniques have been generated for defect detection in software requirements documents.

1.1.1 Defect-Based Reading

The first family of scenario-based reading techniques, Defect-Based Reading (DBR), was defined for detecting defects in requirements documents written in a state machine notation for event-driven process control systems (Porter et al., 1995). Each DBR scenario is based on a different class of requirements defects and requires a different model to be built before answering specific questions. In order to empirically validate the proposal, controlled experiments were first run with graduate students at the University of Maryland (Porter et al., 1995) and then with professional software developers from Lucent Technologies (Porter and Votta, 1998). Both experiments showed that DBR was significantly more effective than ad hoc and checklist reading. Replications were performed by other researchers (Fusaro et al., 1997; Miller et al., 1998; Sandahl et al., 1998) who reused the same experimental material and slightly changed the experimental design. The external replications did not measure any improvement in inspection effectiveness. Since all the external replications were conducted with undergraduate students, the main hypothesis for interpreting the difference in results is the students' lower familiarity with review activities, the requirements specification language, and the software domains.

1.1.2 Perspective-Based Reading

Perspective-Based Reading (PBR) is another family of scenario-based reading techniques which have been proposed to improve the inspection effectiveness for requirements documents expressed in natural language (Basili et al., 1996; Shull et al., 2000). The idea behind PBR is that the various customers of a product (here the requirements) should read the document from a particular point of view. For PBR, the different roles are those within the software development process, e.g., analyst, tester, user. To support the inspector throughout the reading process, operational descriptions, i.e., scenarios, are provided for each role. PBR has three basic properties:

• It is systematic, because it provides a procedure for how to read a document.

• It is specific, because the reader is only responsible for his role and the defects he can find from his particular point of view.

• It is distinct, because readers in a team have different roles and there is the assumption that the overlap of defects found between readers with different roles is kept to a minimum.

PBR was first empirically validated with software developers of the NASA/Goddard Space Flight Center (Basili et al., 1996). The results showed that nominal [1] teams using PBR achieved better coverage of documents and were more effective at detecting defects in requirements documents than teams which did not use PBR. These results were confirmed in a replicated experiment conducted by other researchers with undergraduate students and real team meetings (Ciolkowski et al., 1997). PBR has also been tailored and applied for reviewing source code documents in an industrial setting (Laitenberger and DeBaud, 1997). Results from an empirical comparison with checklist-based inspection show that PBR is more effective and less costly than using checklists (Laitenberger et al., 1999). In the empirical studies of PBR, an unexpected effect was that individuals applying PBR, rather than just teams, were more effective at defect detection than individuals applying less systematic reading techniques. This effect seems to provide evidence that the first property of PBR, being systematic, might be a sufficient cause for improved inspection effectiveness. A more procedure-oriented version of PBR was applied in a related empirical investigation [2] (Lanubile et al., 1998). To increase the specificity of the PBR techniques, more detailed guidelines for each perspective were provided to inspectors, with specific questions distributed at key points in the guidelines.

1.2 Research Questions and Hypotheses

We were interested in further assessing the effects of systematic reading techniques on defect detection. Our main research question is the following:

• Are there differences in defect detection effectiveness between reviewing requirements documents using systematic reading techniques and reviewing requirements documents using nonsystematic reading techniques?

Focusing on the enhanced procedural version of PBR as the systematic reading technique, and based on findings from previous studies, our hypotheses are the following:

1. Inspection teams applying PBR find a higher percentage of defects than inspection teams applying nonsystematic reading techniques, such as ad hoc reading and checklists.

2. Individual reviewers applying PBR find a higher percentage of defects than individual reviewers applying nonsystematic reading techniques, such as ad hoc reading and checklists.

As a by-product of the main research question we are able to compare the performance of the different PBR scenarios. Our secondary research question is the following:

• Are there differences in defect detection effectiveness between reviewing requirements documents using different PBR scenarios?

If the scenarios have been fairly developed, the hypothesis is the following:

3. Individual reviewers applying PBR find the same percentage of defects with any of the assigned scenarios.

However, we are also interested in assessing the effects of having distinct roles when composing inspection teams. Our research question is the following:

• Are there differences in defect detection effectiveness between inspection teams whose members have unique roles and inspection teams whose members have identical roles?

Based on the theory of scenario-based reading, which states that the coordination of distinct and specific scenarios achieves a higher coverage of documents, our hypothesis is the following:

4. Inspection teams composed of unique PBR perspectives are more effective at detecting defects than inspection teams composed of identical PBR perspectives.

We investigated these research questions and tested the research hypotheses by means of a controlled experiment in a classroom environment with more than one hundred undergraduate students.

[1] There were no real team meetings in these experiments. Meetings were simulated by pooling defects found during individual preparation into nominal team defect logs.

[2] The goal of that experiment was to understand the effect of abstracting errors from faults in requirements documents rather than to compare PBR with other reading techniques. Because of the experimental design, differences between reading techniques are confounded with other factors.

1.3 Paper Outline

The remainder of this paper is organized as follows. Section 2 describes the experiment, including the variables, design, threats to validity, instrumentation, and execution. Section 3 presents the results from data analysis. The final section summarizes and discusses our findings.

2 The Experiment

The experiment was conducted as part of a two-semester software engineering course at the University of Bari. Since software requirements specification and software inspections were covered in the first half of the course syllabus, the experiment was run as a midterm exam and the reviewers' performance was therefore subject to grading. Subjects were third-year undergraduate students of the computer science department. Because midterm exams are optional in Italian academic courses, participation was on a volunteer basis. However, the "premium" grades led most of the students to participate seriously in the experiment. The experiment simulated in a classroom environment the preparation and meeting steps of an inspection process. We conducted two runs of the experiment. All subjects, with few exceptions, participated in both runs, each corresponding to a different software inspection. Some differences between runs were planned in advance and will be described in the experimental design section, while some changes were introduced after the first run was over, based on subjects' feedback, and will be described in the execution section.

2.1 Variables

The independent variables are the variables whose values (or levels) determine the different experimental conditions to which a subject may be assigned. We manipulated the following independent variables:

• The reading technique. Subjects, and then teams, can apply a systematic reading technique (PBR) or a nonsystematic reading technique (Ad Hoc or Checklist).

• The team coordination. PBR teams are further decomposed into teams made up of different perspectives (multi-perspective PBR) and teams made up of identical perspectives (mono-perspective PBR). Only multi-perspective PBR teams are consistent with the concept of coordinated teams with focused and distinct responsibilities, while mono-perspective PBR teams have focused but identical responsibilities. Teams applying an Ad Hoc or Checklist reading technique have unfocused and identical responsibilities.

• The perspective. Within PBR, a subject may use only one scenario based on one of the assumed perspectives. For this experiment we used the following three perspectives:
  - Use Case Analyst (UCA)
  - Structured Analyst (SA)
  - Object-Oriented Analyst (OOA)
  In the former PBR experiments, only the first two perspectives had been used and the third one was the perspective of a tester. However, having to run the experiment as a midterm exam, we could not have taught and trained students in testing. Because the first part of the course focuses on requirements specification and analysis methods, all the perspectives had to center around system analysis activities.

Subjects who do not follow a scenario (i.e., those with the Ad Hoc or Checklist reading technique) have no perspective associated with them. We measured the following dependent variables:

• The individual defect detection rate: the number of true defects reported by an individual inspector divided by the total number of known defects in the document.

• The team defect detection rate: the number of true defects reported at the inspection meeting divided by the total number of known defects in the document.

• Preparation time: the time spent, in minutes, by an inspector on the individual preparation.

• Meeting time: the time spent, in minutes, by a team on the inspection meeting.

We also collected additional data through two debriefing questionnaires, one for the individual preparations and another for the inspection meetings. Most of the questions were in closed form and were aimed at measuring conformance to the assigned guidelines, self-confidence in the inspection results, understanding of the detection techniques, and satisfaction with the inspection process.
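To make the two rate definitions concrete, the following minimal Python sketch computes them from sets of defect identifiers; all names and values are hypothetical and only illustrate the arithmetic.

```python
# Minimal sketch (hypothetical data): computing the individual and team
# defect detection rates as defined above.

def detection_rate(reported_true_defects, known_defects):
    """Distinct true defects reported divided by the number of known defects."""
    return len(set(reported_true_defects) & set(known_defects)) / len(known_defects)

# ATM document: 29 known defects in the master list (labelled d1..d29 here for illustration).
known_atm = {f"d{i}" for i in range(1, 30)}

reviewer_log = {"d2", "d5", "d11", "d17"}           # true defects found by one reviewer
team_log = {"d2", "d5", "d7", "d11", "d17", "d23"}  # true defects logged at the meeting

idefrate = detection_rate(reviewer_log, known_atm)  # individual defect detection rate
tdefrate = detection_rate(team_log, known_atm)      # team defect detection rate
print(f"IDEFRATE = {idefrate:.3f}, TDEFRATE = {tdefrate:.3f}")
```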

2.2 Design

The main goals of this experiment are to compare systematic reading techniques (PBR) with nonsystematic reading techniques (Ad Hoc and Checklist), and to compare distinct versus overlapping roles with scenario-based reading techniques. With respect to the former PBR experiment, we introduced a new defect detection technique, the checklist, and a new organization of inspection teams, composed of identical PBR perspectives (mono-perspective PBR). Hence, we could not reuse the experiment plan from the previous experiments and we devised an entirely new design. We first decomposed the experiment into two runs, the second run one week after the first. Each run required subjects to inspect a requirements document, starting with an individual preparation and finishing with a team meeting. The runs had the following differences:

- The document to be inspected: ATM in the first run and PG in the second run (see the Instrumentation section for a brief description of these documents).
- The nonsystematic reading technique used: Ad Hoc in the first run and Checklist in the second run.
- The inspection goal: "find as many defects as you can with the help of a defect detection technique" in the first run, and "find as many defects as you can while following a defect detection technique" in the second run (see the Execution section for the rationale behind the goal shift).

The same subjects participated in both runs using the same technique. However, reviewers assigned to a nonsystematic reading technique applied ad hoc reading in the first run and checklist reading in the second run of the experiment.

In the individual preparation, the experimental plan consists of one independent variable (the reading technique) with two main levels: the systematic reading technique (PBR) and the nonsystematic reading technique (Ad Hoc or Checklist). Nested in the PBR level, there are three perspectives (UCA, SA, and OOA). The reading technique and the perspective variables vary between subjects because no subject is exposed to more than one experimental condition. Subjects were randomly assigned to the experimental conditions.

In the team meeting, the experimental plan consists of one independent variable (the reading technique) with two main levels: the systematic reading technique (PBR) and the nonsystematic reading technique (Ad Hoc or Checklist). Nested in the PBR level, there are two levels of the team coordination variable: multi-perspective PBR and mono-perspective PBR. The reading technique and the team coordination variables vary between teams, which are the units of analysis. Subjects were randomly assigned to the inspection teams and teams were randomly assigned to the experimental conditions. Teams had to be composed of three persons, but in some cases we had to create four-person teams to accommodate leftover subjects. Table 1 shows the experimental design for each of the two experimental runs. Differences in the number of subjects and teams between the two runs are due to subject withdrawals (see the Execution section for more details).
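The random allocation of subjects to conditions and to three-person teams (with occasional four-person teams for leftover subjects) can be sketched as below; the subject count, seed, and labels are illustrative and do not reproduce the actual allocation procedure used in the experiment.

```python
# Illustrative sketch only: random assignment of subjects to experimental
# conditions and to three-person teams, with spare subjects joining existing
# teams (producing a few four-person teams).
import random

random.seed(42)  # arbitrary seed, only to make the illustration repeatable

subjects = [f"S{i:03d}" for i in range(1, 115)]   # e.g., 114 participants
conditions = ["ADHOC", "UCA", "SA", "OOA"]

random.shuffle(subjects)
# Deal the shuffled subjects round-robin into the four conditions.
assignment = {cond: subjects[i::4] for i, cond in enumerate(conditions)}

def make_teams(members, size=3):
    """Split a shuffled list into teams of `size`; spare members join existing teams."""
    random.shuffle(members)
    teams = [members[i:i + size] for i in range(0, len(members), size)]
    if len(teams) > 1 and len(teams[-1]) < size:
        spare = teams.pop()
        for i, person in enumerate(spare):
            teams[i].append(person)  # creates a few four-person teams
    return teams

adhoc_teams = make_teams(list(assignment["ADHOC"]))
print(len(adhoc_teams), [len(t) for t in adhoc_teams])
```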

2.3 Threats to Validity

This section discusses the threats to validity that are relevant for our experiment. To rule out the threats we could not overcome or mitigate, other experiments may use different experimental settings, with other threats to validity of their own. Basili et al. (1999) discuss how processes, products, and context models have an impact on experimental designs in the software engineering domain.

2.3.1 Threats to Internal Validity

Threats to internal validity are rival explanations of the experimental findings that make the cause-effect relationship between independent and dependent variables more difficult to believe. We identified the following threats to internal validity:


First Run - ATM doc

Individual Preparation - week 1 - day 1
  Reading Technique        Perspective                      #subjects
  Nonsystematic (Ad Hoc)   no perspective (NONE)               37
  Systematic (PBR)         use case analyst (UCA)              25
                           structured analyst (SA)             26
                           object-oriented analyst (OOA)       26

Inspection Meeting - week 1 - day 2
  Reading Technique        Team Coordination                                              #teams
  Nonsystematic (Ad Hoc)   unspecific / identical responsibilities (Ad Hoc)                 12
  Systematic (PBR)         specific / distinct responsibilities (multi-perspective PBR)     14
                           specific / identical responsibilities (mono-perspective PBR):
                             UCA only                                                        4
                             SA only                                                         4
                             OOA only                                                        4

Second Run - PG doc

Individual Preparation - week 2 - day 1
  Reading Technique         Perspective                      #subjects
  Nonsystematic (Checklist) no perspective (NONE)               34
  Systematic (PBR)          use case analyst (UCA)              24
                            structured analyst (SA)             26
                            object-oriented analyst (OOA)       25

Inspection Meeting - week 2 - day 2
  Reading Technique         Team Coordination                                              #teams
  Nonsystematic (Checklist) unspecific / identical responsibilities (Checklist)              11
  Systematic (PBR)          specific / distinct responsibilities (multi-perspective PBR)     12
                            specific / identical responsibilities (mono-perspective PBR):
                              UCA only                                                        4
                              SA only                                                         4
                              OOA only                                                        4

Table 1. Experimental Plan. The experiment consists of two experimental runs. In the first run subjects reviewed the ATM requirements document and in the second run the PG requirements document. In each run, on the first day subjects performed an individual preparation and on the second day an inspection meeting. The same subjects participated in both runs using the same technique.


History. The history threat refers to specific events that might occur between successive measurements and influence the dependent variables in addition to the experimental variable. In our experiment there were four different points of measurement. Because the experimental tasks were part of a midterm exam, the highest-risk event is plagiarism, with subjects exchanging information in the intervals between tasks. This might be the case for the two one-day intervals between individual preparations and team meetings. To reduce this risk, we told students that only individual tasks were subject to grading. Furthermore, the individual defect lists were collected after individual preparation and returned to subjects just before the team meeting. Plagiarism could not occur between the two experimental runs because the requirements documents were different.

Maturation. The maturation threat refers to changes within the subjects that occur over time and influence the dependent variables in addition to the experimental variable. Possible changes might occur due to boredom, tiredness, or learning. Boredom might have affected the second run of the experiment, because subjects had to perform a second complete inspection using the same technique. However, because the inspections were assessed as midterm exams, we believe that the concern for grading was stronger than some initial boredom. Tiredness occurs when too much effort is required of subjects. In our experiment, four hours were allocated for each experimental task, each inspection activity was performed on a distinct day, and the two complete inspections were conducted in different weeks. While boredom and tiredness tend to degrade performance, learning tends to amplify it. Although we minimized the learning effect by teaching requirements analysis and review and by holding a training session before the experiment itself, we cannot exclude that learning was still in progress during the experiment. However, the learning effect should be symmetric across the values of the independent variables, because all the subjects were novices with respect to any defect detection technique.

Instrumentation. The instrumentation threat refers to changes in the measuring instrument or changes in the observers or scores used. In our experiment, two different requirements documents were used, one for each inspection run. Although the specifications have approximately the same structure, size, and number of defects, we cannot exclude that the difference in problem domain might have an effect on inspection effectiveness. However, the requirements to be reviewed change symmetrically with respect to the independent variables, and in the former PBR studies no interaction effects were observed between documents and reading techniques.

Selection. The selection threat refers to natural differences in human performance. In our experiment, we reduced the selection effect by randomly assigning subjects to defect detection techniques and by choosing individuals at random to form inspection teams. We had a large enough number of subjects to be confident that a few talented people could not mask differences in reading technique performance.

Mortality. The mortality threat refers to differences in the loss of subjects from the comparison groups. Because our subjects were highly motivated by grading, we did not expect many cases of subject drop-out. We decided that we would use only the data points of subjects and teams who completed an entire inspection.
Process conformance. The process conformance threat refers to changes that the subjects autonomously apply to the process they are supposed to follow. In our experiment, we discovered from the answers to the questionnaires of the first run that many subjects were not following the assigned systematic technique, thus reverting to a nonsystematic technique. Subjects were just concentrating on successfully accomplishing the inspection goal: find as many defects as you can. Although we knew that we could not strictly enforce the application of the assigned technique, we wanted subjects to really try it. Therefore, before the second inspection, we told subjects that the reading techniques had to be actually followed in order to be positively graded and were not just to be considered an option. The subjects took the announcement seriously because they had to return the analysis models and had to cross-reference the defects with the questions in their procedure. However, the effect of this change is confounded with maturation and instrumentation changes and thus we cannot assess it separately. We finally decided to perform distinct analyses of the two experiment runs, and to draw conclusions from each run.

2.3.2 Threats to External Validity

Threats to external validity are factors that limit the generalization of the experimental results to the context of interest, here the industrial practice of software inspections. For our experiment, we can identify the following threats to external validity:

Representative subjects. Our students may not be representative of the population of software professionals. However, a former PBR experiment with NASA developers (Basili et al., 1996) failed to reveal a significant relationship between PBR inspection effectiveness and reviewers' experience. Probably, being a software professional does not imply that one's experience matches the skills that are relevant to the object of study.

Representative artifacts. The requirements documents inspected in this experiment may not be representative of industrial requirements documents. Our documents are smaller and simpler than industrial ones, although in industrial practice long and complex artifacts are inspected in separate pieces.

Representative processes. The inspection process in this experiment may not be representative of industrial practice. Although there are many variants of the inspection process in the literature and in industry, we conducted inspections on the basis of a widespread inspection process (Gilb and Graham, 1993). However, our inspections differ from the industrial practice of inspections because individual preparations are not performed at subjects' own desks with possible interruptions, and inspection meetings do not include the document's author.

All these threats are inherent to running classroom experiments and can only be overcome by conducting replications with people, products, and processes from an industrial context.

2.4 Instrumentation

The experiment reused most of the material from a previous PBR experiment (Lanubile et al., 1998). The material is available as a lab package on the web (Shull, 1998), but we had to translate everything from English to Italian, otherwise many students would not have been comfortable reading and using it. The material includes the requirements documents, instructions and aids for each defect detection technique, defect report forms to be used both for the individual preparation and the team meeting, and debriefing questionnaires.

2.4.1 Requirements Documents

The software requirements specifications were written in natural language and adhered to the IEEE format for SRS (IEEE, 1984). The requirements documents used for the experiment were:

• Automated Teller Machine (ATM), 17 pages long and containing 29 defects

• Parking Garage control system (PG), 16 pages long and containing 27 defects

2.4.2 Defect Detection Techniques

Defect detection was supported by means of instruction documents, which were specific to each reading technique: ad hoc, checklist, and PBR. The PBR instructions consisted of three distinct scenarios, one for each perspective. Ad hoc reviewers received a defect taxonomy including definitions for the main classes of requirements faults: missing information, ambiguous information, inconsistent information, incorrect fact, and extraneous information. Checklist reviewers received a single checklist derived from the defect taxonomy, with 17 questions covering all the defect classes. The checklist was not present in the lab package, so we created the questions by detailing the defect class definitions. PBR reviewers received one of three scenarios corresponding to the UCA, SA, and OOA perspectives. Each scenario contains a stepwise guide for creating an abstract model from the requirements document and model-specific questions distributed at key points in the guidelines. The models to be built, and therefore the guidelines and questions, are different for each scenario:

• UCA: the scenario requires creating a use case diagram, including use case descriptions, and answering 12 questions.

• SA: the scenario requires creating a hierarchy of data flow diagrams, and answering 9 questions.

• OOA: the scenario requires creating a class diagram, including attributes and operations, and answering 11 questions.

All the scenarios were reused, with modifications, from the lab package. However, the OOA scenario had never been applied in any former PBR experiment. PBR reviewers also received a scenario-specific model skeleton to be used for model drawing. The diagrams had to be returned together with the list of defects found, in order to check whether the reviewers had really built the required abstraction while reviewing the document.


2.4.3 Defect Report Forms

Defect report forms had to be filled out by individuals after the inspection preparation and by the team recorder after the inspection meeting. A defect report form contains a header and an entry for each defect reported. The header includes various identifiers such as the SRS name, reviewer name, team name, date, and start and finish times. A defect entry includes a progressive defect number, the defect location (requirement identifier and page number), a textual description, and the question in the reading technique that was helpful for defect discovery. This last field was not applicable for ad hoc reviewers and was optional for the others. It was included as a traceability mechanism to understand whether inspectors had actually tried to answer the questions, and thus to check for process conformance.
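As an illustration only, a defect report entry and form could be represented by records like the following sketch; the field names paraphrase the form contents described above and the instance values are made up.

```python
# Minimal sketch (field names paraphrased from the form described above).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DefectEntry:
    number: int                 # progressive defect number
    requirement_id: str         # defect location: requirement identifier
    page: int                   # defect location: page number
    description: str            # textual description of the defect
    question_id: Optional[str]  # question that helped discovery (optional; not applicable for ad hoc)

@dataclass
class DefectReportForm:
    srs_name: str
    reviewer_name: str
    team_name: str
    date: str
    start_time: str
    finish_time: str
    entries: List[DefectEntry] = field(default_factory=list)

# Illustrative instance with made-up values.
form = DefectReportForm("ATM", "reviewer-01", "team-05", "1999-05-10", "09:00", "11:30",
                        [DefectEntry(1, "FR-3.2", 7, "Missing timeout behaviour", "Q4")])
print(len(form.entries), "defect(s) reported by", form.reviewer_name)
```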

2.5 Training

All subjects were prepared, as part of the undergraduate software engineering course, with a set of lectures on requirements specification, software inspections, and analysis model building. We gave a 2-hour lecture on the IEEE standard for SRS and taught the requirements defect taxonomy. A requirements document for a course scheduler system was presented and an assignment was given for finding defects. The results were discussed in class and a list of known defects was written out according to the schema of the defect report forms.

Another 2-hour lecture was given on software inspections, explaining the goals and the specific process to be used in this study. We then introduced a new requirements document for a video rental system, which was available in the experiment lab package for training purposes. As a trial inspection, students were asked to individually read the document and record defects on the defect report forms to be used in this experiment. We then created teams, assigned roles inside the teams (moderator, reader, and recorder), and a trial inspection meeting was conducted. After the trial inspection we discussed with the students the list of known defects and the defects they had found.

Afterwards, we gave a set of lectures on requirements analysis in which we taught use case analysis, structured analysis, and object-oriented analysis. For each analysis method, we presented and discussed with the students the analysis models for the course scheduler system. Students were given three assignments (use case model building, data flow model building, and class model building) in which they were asked to build analysis models for the video rental system. The assignments were started in class, to allow students to ask questions, and then completed at home. The results of each assignment were discussed in class and a solution was presented and commented on together with the students.

Finally, we spent one lecture presenting the defect detection techniques and the experiment organization. We also communicated the outcome of randomly assigning subjects to the experimental conditions. Teams were free to choose their roles as moderator, reader, and recorder.

2.6 Execution

The experiment was run as a midterm exam. Each experiment run, corresponding to a separate inspection (the ATM document first and then the PG document), took two consecutive days, one for individual preparation and one for the team meeting. The second run was scheduled one week after the first run. Subjects always worked in two big rooms with enough space to avoid plagiarism and confusion. We were always present to answer questions and to prevent unwanted communication. Each experimental task was limited to four hours, and before leaving subjects were asked to complete a debriefing questionnaire.

Before each individual preparation step, subjects were given a package containing the requirements document, specific instructions for the assigned reading technique, and blank defect report forms. After each individual preparation step, we collected all the material. This material was returned to subjects before the inspection meeting together with new blank defect report forms. At the inspection meeting, the reader paraphrased each requirement and the team discussed defects found during preparation or any new defect. The moderator was responsible for managing discussions and the recorder for filling out the team's defect report forms.

Immediately after the first inspection, a preliminary analysis of the questionnaires was performed and the results were fed back to the students before the second inspection. From the questionnaire answers and the discussion with students we realized that many subjects were concentrating on finding as many defects as they could without applying the assigned technique. There was an uncontrolled migration of subjects from the systematic reading technique, PBR, to the nonsystematic reading technique, ad hoc reading. PBR reviewers were complaining about having to build a model and follow a procedure. On the other hand, ad hoc reviewers were complaining about not having a guide. We therefore made two major changes for the second run of the experiment. First, we told subjects that they would also be graded with respect to their ability to follow the assigned process. We could check for process conformance by assessing the model developed while following a reading scenario and by counting the percentage of defects cross-referenced with the questions of the reading technique. The second change was to replace ad hoc reading with checklist-based reading. With this change, reviewers applying a nonsystematic reading technique would also have a guide, albeit not a procedural one. Five subjects, three from the same team, did not participate in the second inspection and we had to cancel two teams.

2.7 Data Collection

We collected data through individual defect report forms, team defect report forms, and questionnaires. We validated the reported defects by comparing the location and description information with those in the master defect list from the first PBR experiment. All the reported defects that could be matched to some known defect were classified as original true defects. The other reported defects could be classified as other true defects, duplicates, false positives, or don't care. The number of known defects in the original PBR experiment was 29 for ATM and 27 for PG. After the first PBR study, other defects were discovered, including some found by our reviewers. After adding these other true defects to the original ones, the total number of true defects amounts to 32 for ATM and 36 for PG. However, to make the data analysis results directly comparable to the figures in the former PBR experiments, we consider in our analysis only original true defects, which are treated as a benchmark of seeded defects.
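The bookkeeping behind this validation step can be sketched as below. In the study the matching was a manual judgement on location and description; here a precomputed mapping stands in for it, all identifiers are hypothetical, and the "don't care" category is omitted for brevity.

```python
# Rough sketch of the defect validation bookkeeping (hypothetical identifiers).
# In the study, matching was done manually by comparing location and description;
# here a precomputed mapping stands in for that judgement.

original_master = {"ATM-01", "ATM-02", "ATM-03"}   # known (seeded) defects; 29 in total for ATM
other_true = {"ATM-N1"}                            # true defects discovered after the first PBR study

# reported defect id -> matched master id (None = no match)
manual_match = {"r1": "ATM-02", "r2": "ATM-02", "r3": "ATM-N1", "r4": None}

counts = {"original true": 0, "other true": 0, "duplicate": 0, "false positive": 0}
seen = set()
for reported, matched in manual_match.items():
    if matched is None:
        counts["false positive"] += 1
    elif matched in seen:
        counts["duplicate"] += 1
    else:
        seen.add(matched)
        counts["original true" if matched in original_master else "other true"] += 1

print(counts)  # only "original true" defects enter the detection-rate analysis
```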

3 Results

Because of the major changes we made between the two runs of the experiment, we conduct separate analyses for each run. In the following, we first test the stated hypotheses by performing analysis of defect detection effectiveness at the team and individual levels. Next, we analyze the individual performance with respect to process conformance. We then analyze the subjective evaluation of the reading techniques looking at the answers in the debriefing questionnaires. Finally we analyze the relationship between time and defect detection effectiveness.

3.1 Analysis of Team Performance

For the team analysis we have a between-groups nested design. The first factor is the reading technique (RTECH) with two levels:

1. nonsystematic (ADHOC in the first run and CHKL in the second run)
2. systematic (PBR).

The second factor is the team coordination (TCOORDIN) with three levels:

1. unspecific/identical responsibilities (ADHOC in the first run and CHKL in the second run)
2. specific/identical responsibilities, i.e., mono-perspective PBR (MONOPBR)
3. specific/distinct responsibilities, i.e., multi-perspective PBR (MULTIPBR).

The first TCOORDIN level only appears within the first RTECH level (in fact they share the same name, ADHOC in the first run and CHKL in the second run), while the other two TCOORDIN levels occur within the PBR level of the RTECH factor. The dependent variable is the team defect detection rate (TDEFRATE), defined as the number of true defects reported at the inspection meeting divided by the total number of known defects in the document. Figure 1 presents the distribution of the defect detection rate for both experimental runs using boxplots. Boxplots allow one to visualize and quickly assess the strength of the relationship between the grouping and dependent variables. Boxplots are also used to spot the values that deviate from the central tendencies of their respective groups. As can be seen, there are two outliers and one extreme value for the MULTIPBR group in the first run and for the CHKL group in the second run. For each group, we applied the Shapiro-Wilks W test of normality and we verified the equality-of-variances assumption with the Brown-Forsythe test. The statistics were not significant at the 0.05 level, so we could proceed to perform a parametric analysis of variance (ANOVA) to test for significant differences between means.
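Both checks are available in standard statistical libraries; the sketch below uses scipy, where `shapiro` implements the Shapiro-Wilk W test and `levene` with median centering corresponds to the Brown-Forsythe test. The data are simulated placeholders, not the experimental measurements.

```python
# Sketch of the pre-ANOVA checks (simulated placeholder data, not the experimental values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Team defect detection rates for two groups, e.g. ADHOC vs. PBR teams (simulated here).
adhoc = rng.normal(0.42, 0.09, size=12)
pbr = rng.normal(0.42, 0.09, size=26)

for name, sample in [("ADHOC", adhoc), ("PBR", pbr)]:
    w, p = stats.shapiro(sample)              # Shapiro-Wilk W test of normality
    print(f"{name}: W = {w:.3f}, p = {p:.3f}")

# Brown-Forsythe test = Levene's test with median centering.
bf_stat, bf_p = stats.levene(adhoc, pbr, center="median")
print(f"Brown-Forsythe: F = {bf_stat:.3f}, p = {bf_p:.3f}")
```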

The two-way ANOVA for the hierarchically nested design allows us to test two null hypotheses concerning the effects of the two factors on the team defect detection rate. The null and alternative hypotheses for the first factor may be stated as follows:

H1.0: there is no difference between teams using a nonsystematic reading technique and teams using a systematic reading technique with respect to defect detection rate.
H1.a: there is a difference between teams using a nonsystematic reading technique and teams using a systematic reading technique with respect to defect detection rate.

The null and alternative hypotheses for the second factor may be stated in a similar fashion:

H2.0: within the group of teams with specific responsibilities, there is no difference between teams with identical responsibilities and teams with distinct responsibilities with respect to defect detection rate.
H2.a: within the group of teams with specific responsibilities, there is a difference between teams with identical responsibilities and teams with distinct responsibilities with respect to defect detection rate.

Because of the hierarchical design, interactions of the nested factor (TCOORDIN) with the factor in which it is nested (RTECH) cannot be evaluated. Thus, in the present study we cannot test the hypothesis that the reading technique property of being systematic and the particular team coordination interact in their effect on the defect detection rate. The results, summarized in Table 2, revealed no significant effects of the type of reading technique or the team coordination on the defect detection rate. Table 3 reports the mean scores of the defect detection rate for the groups that define the effects in the analysis. As can be seen, all groups showed similar defect detection rates.
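The paper reports a two-way hierarchically nested ANOVA; as a rough, simplified stand-in, the sketch below runs two one-way F tests on simulated placeholder data: one for the reading-technique effect across all teams and one for the team-coordination effect within PBR teams only. It approximates, but does not reproduce, the nested analysis of Table 2.

```python
# Sketch approximating the hierarchically nested ANOVA with two one-way F tests
# (simulated placeholder data; the paper uses a proper two-way nested ANOVA).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
adhoc    = rng.normal(0.42, 0.09, size=12)   # nonsystematic teams
multipbr = rng.normal(0.42, 0.09, size=14)   # PBR teams with distinct perspectives
monopbr  = rng.normal(0.43, 0.09, size=12)   # PBR teams with identical perspectives

# Effect of the reading technique: nonsystematic vs. systematic (all PBR teams pooled).
f_rtech, p_rtech = stats.f_oneway(adhoc, np.concatenate([multipbr, monopbr]))

# Effect of team coordination, evaluated only within the PBR teams (the nested factor).
f_coord, p_coord = stats.f_oneway(multipbr, monopbr)

print(f"RTECH:    F = {f_rtech:.3f}, p = {p_rtech:.3f}")
print(f"TCOORDIN: F = {f_coord:.3f}, p = {p_coord:.3f}")
```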

[Figure 1 shows boxplots of the team defect detection rate (TDEFRATE) for each experimental run: by reading technique (ADHOC or CHKL vs. PBR) on the left and by team coordination (ADHOC or CHKL, MULTIPBR, MONOPBR) on the right, with medians, quartiles, non-outlier ranges, outliers, and extremes.]

Figure 1. Boxplots of Team Defect Detection Rate on the two experimental runs. On the left side, values are plotted for groups of reading techniques; on the right side, values are plotted for groups of team coordination.


Comparing these scores with those achieved in experiments using the same documents, we can note that:

• Inspection teams in our experiment found fewer defects than simulated teams in the Basili et al. (1996) experiment:
  - in the pilot study the average defect detection rate was 44.3 for the nonsystematic technique and 52.9 for the systematic technique;
  - in the 1995 run the average defect detection rate was 48.2 for the nonsystematic technique and 79.3 for the systematic technique.

• Inspection teams in our experiment found fewer defects than simulated teams in the Ciolkowski et al. (1997) experiment using a systematic technique:
  - in the 95/96 run the average defect detection rate was 52.0 with the ATM document and 53.5 with the PG document;
  - in the 96/97 run the average defect detection rate was 46.5 with the ATM document and 48.9 with the PG document.

• Inspection teams in our experiment found more defects than simulated teams in the Ciolkowski et al. (1997) experiment using a nonsystematic technique, with the exception of the ATM document in the 95/96 run:
  - in the 95/96 run the average defect detection rate was 47.7 with the ATM document and 38.1 with the PG document;
  - in the 96/97 run the average defect detection rate was 30.8 with the ATM document and 33.5 with the PG document.

First Run (ATM doc)
  Source          df Effect   MS Effect   df Error   MS Error        F          p
  {1} RTECH           1        .000155       35       .008138     .019014    .891116
  {2} TCOORDIN        1        .000826       35       .008138     .101560    .751859

Second Run (PG doc)
  Source          df Effect   MS Effect   df Error   MS Error        F          p
  {1} RTECH           1        .000623       32       .010345     .060262    .807651
  {2} TCOORDIN        1        .004267       32       .010345     .412457    .525300

Table 2. ANOVA results from testing the hypotheses concerning the effects of reading technique and team composition on the team defect detection rate.

First Run (ATM doc)
  Effect          Level of Factor    N    Mean TDEFRATE
  Total                             38      .421842
  {1} RTECH       ADHOC             12      .419167
  {1} RTECH       PBR               26      .423077
  {2} TCOORDIN    ADHOC             12      .419167
  {2} TCOORDIN    MULTIPBR          14      .417857
  {2} TCOORDIN    MONOPBR           12      .429167

Second Run (PG doc)
  Effect          Level of Factor    N    Mean TDEFRATE
  Total                             35      .387143
  {1} RTECH       CHKL              11      .380909
  {1} RTECH       PBR               24      .390000
  {2} TCOORDIN    CHKL              11      .380909
  {2} TCOORDIN    MULTIPBR          12      .376667
  {2} TCOORDIN    MONOPBR           12      .403333

Table 3. Mean scores of the team defect detection rate for the reading technique and team coordination groups.


However, these two previous experiments report scores computed from pooling defects logged in the individual preparation and applying permutation tests on hypothetical teams, while in our experiment the average defect detection rates are computed from defects that were actually logged during real team meetings.

3.2 Analysis of Individual Performance

Analogously to the team analysis, for the individual analysis we have a between-groups nested design. The first factor is the reading technique (RTECH) with two levels:

1. nonsystematic (ADHOC in the first run and CHKL in the second run)
2. systematic (PBR).

The second factor is the perspective (PERSP) with four levels:

1. no perspective (ADHOC in the first run and CHKL in the second run)
2. use case analyst (UCA)
3. structured analyst (SA)
4. object-oriented analyst (OOA).

The first PERSP level only appears within the first RTECH level (in fact there is no perspective with the Ad Hoc or Checklist reading techniques), while the other three PERSP levels occur within the PBR level of the RTECH factor. The dependent variable is the individual defect detection rate (IDEFRATE), defined as the number of true defects reported by an individual inspector divided by the total number of known defects in the document. Figure 2 presents the boxplots of the individual defect detection rate grouped according to reading technique and perspective. As can be seen, in the first run there is one outlier for the ADHOC group (or NONE group), and in the second run there are two outliers for the PBR group and one outlier for each of the SA and OOA groups. For each group, we applied the Shapiro-Wilks W test of normality and we verified the equality-of-variances assumption with the Brown-Forsythe test. The statistics were not significant at the 0.05 level, except for the W statistic of the CHKL group (or NONE group) in the second run, for which the hypothesis that the distribution is normal should be rejected. Nevertheless, we performed a parametric analysis of variance (ANOVA) to test for significant differences between means because the ANOVA test is robust against moderate departures from normality when the group size is greater than thirty.

The two-way ANOVA for the hierarchically nested design allows us to test two null hypotheses concerning the effects of the two factors on the individual defect detection rate. The null and alternative hypotheses for the first factor may be stated as follows:

H3.0: there is no difference between subjects using a nonsystematic reading technique and subjects using a systematic reading technique with respect to defect detection rate.
H3.a: there is a difference between subjects using a nonsystematic reading technique and subjects using a systematic reading technique with respect to defect detection rate.

The null and alternative hypotheses for the second factor may be stated in a similar way:

H4.0: within the group of subjects following a scenario, there is no difference between subjects with a UCA perspective, subjects with an SA perspective, and subjects with an OOA perspective with respect to defect detection rate.
H4.a: within the group of subjects following a scenario, there is a difference between at least two of the following groups with respect to defect detection rate: subjects with a UCA perspective, subjects with an SA perspective, and subjects with an OOA perspective.

As in the team analysis, there are no interaction effects to test because of the hierarchical design.


[Figure 2 shows boxplots of the individual defect detection rate (IDEFRATE) for each experimental run: by reading technique (ADHOC or CHKL vs. PBR) on the left and by perspective (NONE, UCA, SA, OOA) on the right, with medians, quartiles, non-outlier ranges, and outliers.]

Figure 2. Boxplots of Individual Defect Detection Rate on the two experimental runs. On the left side, values are plotted for groups of reading techniques; on the right side, values are plotted for groups of perspectives.

The results, summarized in Table 4, revealed a significant effect on the defect detection rate only for the type of reading technique in the first experimental run. Table 5 reports the mean scores of the defect detection rate for the groups that define the effects in the analysis. As can be seen, while in the second run all groups showed similar defect detection rates, in the first run subjects using the Ad Hoc reading technique detected more defects than subjects using PBR. We then tested for differences between the absence of a perspective and each of the perspectives using the Spjotvoll & Stoline test, which is a generalization of the Tukey HSD test to the case of unequal sample sizes (Winer et al., 1991). The post-hoc comparison of means failed to reveal significant differences, although the p-value for the comparison between subjects using no perspective and subjects with the OOA perspective is only slightly higher than the 0.05 level (p = 0.052186).

Comparing the mean scores of the individual defect detection rate with those achieved in experiments using the same documents, we can note that:

• In the nonsystematic reading technique group, subjects in our experiment found more defects than subjects in the Basili et al. (1996) experiment: the average defect detection rate was 20.58 in the pilot study and 24.64 in the 1995 run.

• In the systematic reading technique group, subjects in our experiment found fewer defects than subjects in the Basili et al. (1996) experiment: the average defect detection rate was 24.92 in the pilot study and 32.14 in the 1995 run.

• In the nonsystematic reading technique group, subjects in our experiment found more defects than subjects in the Ciolkowski et al. (1997) experiment: the average defect detection rate was 23.08 with the ATM document and 19.58 with the PG document.

• In the systematic reading technique group, subjects in our experiment found fewer defects in the ATM document than subjects in the Ciolkowski et al. (1997) experiment, where the average defect detection rate was 25.93, and more defects in the PG document, where the average defect detection rate was 25.93.
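The Spjotvoll & Stoline procedure mentioned above is not commonly available in statistics libraries; as an approximation, the sketch below runs a standard Tukey HSD comparison of the perspective groups with statsmodels, on simulated placeholder data that merely mimics the first-run group sizes.

```python
# Sketch of a post-hoc comparison between perspective groups (simulated placeholder data).
# The paper uses the Spjotvoll & Stoline test (a Tukey HSD generalization for unequal
# sample sizes); standard Tukey HSD is used here only as an approximation.
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
rates = np.concatenate([
    rng.normal(0.28, 0.11, size=37),   # NONE (ad hoc, no perspective)
    rng.normal(0.24, 0.11, size=25),   # UCA
    rng.normal(0.22, 0.11, size=26),   # SA
    rng.normal(0.21, 0.11, size=26),   # OOA
])
groups = ["NONE"] * 37 + ["UCA"] * 25 + ["SA"] * 26 + ["OOA"] * 26

print(pairwise_tukeyhsd(rates, groups, alpha=0.05))
```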

First Run (ATM doc)
  Source        df Effect   MS Effect   df Error   MS Error        F          p
  {1} RTECH         1        .104446      110       .012019    8.689919    .003908
  {2} PERSP         2        .005703      110       .012019     .474481    .623475

Second Run (PG doc)
  Source        df Effect   MS Effect   df Error   MS Error        F          p
  {1} RTECH         1        .001860      105       .009523     .195322    .659432
  {2} PERSP         2        .003082      105       .009523     .323596    .724261

Table 4. ANOVA results from testing the hypotheses concerning the effects of reading technique and perspective on the individual defect detection rate.

First Run (ATM doc)
  Effect        Level of Factor    N    Mean IDEFRATE
  Total                           114     .240789
  {1} RTECH     ADHOC              37     .284595
  {1} RTECH     PBR                77     .219740
  {2} PERSP     NONE               37     .284595
  {2} PERSP     UCA                25     .235600
  {2} PERSP     SA                 26     .218462
  {2} PERSP     OOA                26     .205769

Second Run (PG doc)
  Effect        Level of Factor    N    Mean IDEFRATE
  Total                           109     .282202
  {1} RTECH     CHKL               34     .288235
  {1} RTECH     PBR                75     .279467
  {2} PERSP     NONE               34     .288235
  {2} PERSP     UCA                24     .279167
  {2} PERSP     SA                 26     .290385
  {2} PERSP     OOA                25     .268400

Table 5. Mean scores of the individual defect detection rate for the reading technique and perspective groups.

3.3 Analysis of Individual Performance Based on Process Conformance

So far, we have considered the inspection performance of all the individuals under the assumption that they actually followed the assigned reading technique. While this assumption is certainly true for the Ad Hoc group in the first experimental run, because ad hoc reading implies the absence of any given reading technique, we cannot know a priori whether the PBR reviewers actually followed the prescribed procedure. In order to check a posteriori for process conformance, for the first experimental run we used the debriefing questionnaire as a data source. Each PBR reviewer was asked two closed-ended questions related to process conformance, one about the extent to which the proposed technique was followed and the other about the extent to which the reviewer had focused on the questions for defect detection. Figure 3 details these questions. The first response category ("not at all" for question Q1A_4 and "not careful" for question Q1A_5) is equivalent to a rejection of the assigned reading technique, while the second response category ("partially" for question Q1A_4 and "little careful" for question Q1A_5) means that reviewers were likely to fall back to an ad hoc reading style. Only the third response category ("fully" for question Q1A_4 and "very careful" for question Q1A_5) implies that a reviewer has closely followed his assigned PBR scenario.

Q1A_4. Did you follow the assigned reading technique completely?
(0) not at all (I have completely ignored it)
(1) partially (I tried but I have not followed it all the time)
(2) fully (very carefully, step by step)

Q1A_5. How carefully did you focus on the questions as a help for defect detection?
(0) not careful (I have completely ignored the questions)
(1) little careful (I have read the questions several times and taken them into account during reading)
(2) very careful (I tried to answer the questions when I encountered them in the procedure)

Figure 3. Questions regarding process conformance in the first experimental run.

We analyzed the answers in the interval between the two experimental runs. Table 6 and Table 7 summarize the answers to question Q1A_4 and question Q1A_5, respectively. The results show that only 5 PBR reviewers (two with a UCA perspective, three with an SA perspective, and none with an OOA perspective) followed the assigned reading technique completely, and just 6 PBR reviewers focused very carefully on the questions as a help for defect detection. Furthermore, only one PBR reviewer answered positively to both questions.

First Run (ATM doc) - Q1A_4: Did you follow the assigned reading technique completely?

  Effect       Level of    N     not at all       partially        fully          Missing
               Factor            Count   Pct      Count   Pct      Count   Pct    Count   Pct
  Total                    77      6     7.8%      63    81.8%       5     6.5%     3     3.9%
  {2} PERSP    UCA         25      2     8.0%      21    84.0%       2     8.0%     0     0.0%
  {2} PERSP    SA          26      1     3.9%      19    73.1%       3    11.5%     3    11.5%
  {2} PERSP    OOA         26      3    11.5%      23    88.5%       0     0.0%     0     0.0%

Table 6. Summary of answers to question Q1A_4 after the first experimental run.

First Run (ATM doc) - Q1A_5: How carefully did you focus on the questions as a help for defect detection?

  Effect       Level of    N     not careful      little careful   very careful   Missing
               Factor            Count   Pct      Count   Pct      Count   Pct    Count   Pct
  Total                    77     14    18.2%      51    66.2%       6     7.8%     6     7.8%
  {2} PERSP    UCA         25      6    24.0%      15    60.0%       3    12.0%     1     4.0%
  {2} PERSP    SA          26      3    11.5%      17    65.4%       1     3.9%     5    19.2%
  {2} PERSP    OOA         26      5    19.2%      19    73.1%       2     7.7%     0     0.0%

Table 7. Summary of answers to question Q1A_5 after the first experimental run.


From the analysis of the debriefing questionnaires, we understood that in the first experimental run the PBR reviewers had not appreciated the opportunity to use a procedure as a tool for defect detection. Thus, before the second experimental run, we told them that their goal was going to change from "find as many defects as possible with the help of the assigned reading technique" to "follow the assigned reading technique and find as many defects as possible". To be fair, we replaced the Ad Hoc reading technique with the Checklist reading technique, which is also nonsystematic but uses the checklist as a tool for defect detection, just as PBR uses the perspective-based scenario.

Having made the reading technique mandatory in the second experimental run, we had to look for some proof of process conformance. For this purpose we examined the analysis models built while applying the PBR scenarios and checked to what extent the defect entries in the preparation logs were explicitly and reasonably mapped to the questions of the assigned reading technique (Checklist or PBR). We measured the results of this post-hoc verification activity with the variable EVIDCONF, whose categories are ordered in terms of evidence for process conformance:

0 = no evidence (the analysis models built by PBR reviewers are just sketched and few defect entries are linked to questions)
1 = weak evidence (the analysis models built by PBR reviewers are just sketched or few defect entries are linked to questions)
2 = strong evidence (the analysis models built by PBR reviewers are sufficiently developed and most defect entries are linked to questions)

Based on this scale, only the third category ("2 = strong evidence") gives us enough confidence that a reviewer closely followed the assigned reading technique. Table 8 shows the frequencies of the EVIDCONF variable grouped according to the two independent variables, the reading technique (RTECH) and the perspective (PERSP). The results show that for 32 checklist reviewers and 37 PBR reviewers (12 with a UCA perspective, 16 with an SA perspective, and 9 with an OOA perspective) there was strong evidence that they had followed the assigned reading technique completely.

However, since reviewers had been warned in advance that the verification was part of the academic evaluation, we suspected that they could first detect defects on an informal basis and only then produce the deliverables expected as the output of the reading technique. Thus, in the debriefing questionnaire at the end of the second experimental run, we asked reviewers one more closed-ended question related to process conformance, about the extent to which the reviewer had focused on the questions for defect detection. Figure 4 details this question. The first two response categories ("not careful" and "little careful") mean that reviewers were not actually following any reading technique, because both the checklist and the PBR scenarios require that the questions be used to detect defects, and not the other way round. Only the third response category ("very careful") implies that a reviewer closely followed the assigned reading technique.
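For readability only, the EVIDCONF scale described above can be read as a decision rule over the two indicators we checked for PBR reviewers (how developed the analysis models are, and how many defect entries are linked to questions). The sketch below is our interpretation of that rubric in Python; the function and parameter names are hypothetical and no such script was used in the experiment.

    def evidconf(models_sufficiently_developed: bool, most_defects_linked: bool) -> int:
        """Ordinal evidence-of-conformance score for a PBR reviewer's deliverables."""
        # 2 = strong evidence: models sufficiently developed AND most defect entries
        #     in the preparation log are linked to the scenario questions.
        if models_sufficiently_developed and most_defects_linked:
            return 2
        # 0 = no evidence: models just sketched AND few defect entries linked.
        if not models_sufficiently_developed and not most_defects_linked:
            return 0
        # 1 = weak evidence: only one of the two indicators is satisfied.
        return 1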

Second Run (PG doc)              Evidence for process conformance (EVIDCONF)

Effect    Level of Factor      Total N    no evidence    weak evidence    strong evidence    Missing
          All reviewers        109        15 (13.8%)     25 (22.9%)       69 (63.3%)         0 (0.0%)
RTECH     CHKL                 34         0 (0.0%)       2 (5.9%)         32 (94.1%)         0 (0.0%)
RTECH     PBR                  75         15 (20.0%)     23 (30.7%)       37 (49.3%)         0 (0.0%)
PERSP     NONE                 34         0 (0.0%)       2 (5.9%)         32 (94.1%)         0 (0.0%)
PERSP     UCA                  24         3 (12.5%)      9 (37.5%)        12 (50.0%)         0 (0.0%)
PERSP     SA                   26         5 (19.2%)      5 (19.2%)        16 (61.6%)         0 (0.0%)
PERSP     OOA                  25         7 (28.0%)      9 (36.0%)        9 (36.0%)          0 (0.0%)

Table 8. Summary of scores of the evidence for process conformance in the second experimental run.


Q1B_3. How carefully did you focus on the questions as a help for defect detection?
(0) not careful (I first discovered defects and then looked for questions to match)
(1) little careful (when I found a defect I looked for a question to match)
(2) very careful (I tried to answer the questions when I encountered them in the procedure)

Figure 4. Question regarding process conformance in the second experimental run.

After the second experimental run we analyzed the answers to question Q1B_3; the results are summarized in Table 9. They show that 11 reviewers using the checklist and 25 PBR reviewers (9 with a UCA perspective, 9 with an SA perspective, and 7 with an OOA perspective) focused carefully on the questions for defect detection. With respect to process conformance, we can only trust those reviewers who provided the strongest evidence of process conformance ("strong evidence") and answered positively ("very careful") to the process-conformance question in the debriefing questionnaire after the second experimental run. Table 10 shows how these "trustable" reviewers are distributed according to the grouping variables reading technique (RTECH) and perspective (PERSP). Less than one third of the checklist reviewers can be trusted to have actually used the checklist, and only one fifth of the PBR reviewers can be trusted to have followed the assigned scenario. The OOA scenario was the least followed PBR scenario (only 3 reviewers out of 25); it had been introduced by us for this experiment and had therefore never been tested in former experiments. If we consider the restricted dataset made up of only the defect logs produced by trustable reviewers, we could test the effects of the two factors on the individual defect detection rate by repeating the two-way ANOVA for the hierarchically nested design, as in the previous section. However, the cell sizes of the PERSP factor are too small, especially for the OOA level (only three observations). Thus, we can only evaluate the difference in means between the two groups (CHKL and PBR) of the RTECH variable. Figure 5 presents the boxplots of the individual defect detection rate (IDEFRATE) grouped according to the reading technique.
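To make the filtering criterion concrete, the following minimal sketch shows how the "trustable" subset could be selected, assuming the per-reviewer data were loaded into a pandas DataFrame. The column names follow the paper's variables (RTECH, PERSP, EVIDCONF, IDEFRATE) plus a hypothetical Q1B_3 column for the debriefing answer; the values are made up for illustration only.

    import pandas as pd

    # Hypothetical per-reviewer records for the second run (illustrative values only).
    reviewers = pd.DataFrame({
        "RTECH":    ["CHKL", "PBR", "PBR", "CHKL", "PBR"],
        "PERSP":    ["NONE", "UCA", "OOA", "NONE", "SA"],
        "EVIDCONF": [2, 2, 1, 0, 2],   # 0 = no, 1 = weak, 2 = strong evidence
        "Q1B_3":    [2, 2, 1, 1, 0],   # 0 = not careful, 1 = little careful, 2 = very careful
        "IDEFRATE": [0.30, 0.27, 0.15, 0.22, 0.19],
    })

    # A reviewer is "trustable" only with strong evidence of conformance (EVIDCONF = 2)
    # AND a "very careful" answer to the debriefing question (Q1B_3 = 2).
    trusted = reviewers[(reviewers["EVIDCONF"] == 2) & (reviewers["Q1B_3"] == 2)]

    # Group sizes and mean individual defect detection rates by reading technique,
    # i.e., the comparison visualized in Figure 5.
    print(trusted.groupby("RTECH")["IDEFRATE"].agg(["count", "mean"]))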

Second Run (PG doc)              How carefully did you focus on the questions as a help for defect detection?

Effect    Level of Factor      Total N    not careful    little careful    very careful    Missing
          All reviewers        109        17 (15.6%)     53 (48.6%)        36 (33.0%)      3 (2.8%)
RTECH     CHKL                 34         2 (5.9%)       21 (61.8%)        11 (32.3%)      0 (0.0%)
RTECH     PBR                  75         15 (20.0%)     32 (42.7%)        25 (33.3%)      3 (4.0%)
PERSP     NONE                 34         2 (5.9%)       21 (61.8%)        11 (32.3%)      0 (0.0%)
PERSP     UCA                  24         6 (25.0%)      9 (37.5%)         9 (37.5%)       0 (0.0%)
PERSP     SA                   26         3 (11.5%)      14 (53.9%)        9 (34.6%)       0 (0.0%)
PERSP     OOA                  25         6 (24.0%)      9 (36.0%)         7 (28.0%)       3 (12.0%)

Table 9. Summary of answers to question Q1B_3 after the second experimental run.


Second Run (PG doc)              Reviewers trustable with respect to process conformance

Effect    Level of Factor      Total N    Count    Pct
          All reviewers        109        25       22.9%
RTECH     CHKL                 34         10       29.4%
RTECH     PBR                  75         15       20.0%
PERSP     NONE                 34         10       29.4%
PERSP     UCA                  24         6        25.0%
PERSP     SA                   26         6        23.1%
PERSP     OOA                  25         3        12.0%

Table 10. Summary of "trustable" reviewers with respect to process conformance.

[Boxplot figure: IDEFRATE (y-axis, 0.0 to 1.0) by RTECH group (CHKL, PBR); boxes show the median, the 25%-75% quartiles, and the non-outlier minimum and maximum.]

Figure 5. Boxplots of the individual defect detection rate for "trustable" reviewers with respect to process conformance in the second experimental run.


Since the sample sizes are small (10 observations in the CHKL group and 15 in the PBR group) and the IDEFRATE variable is not normally distributed within the CHKL group (the p-value of the Shapiro-Wilk W statistic is 0.0566 and the Lilliefors probability is less than 0.05), we applied the nonparametric Mann-Whitney U test. This test only assumes that the dependent variable (IDEFRATE) was measured on at least an ordinal scale. The interpretation of the test is analogous to that of the t-test for independent samples, except that the U statistic is computed from rank sums rather than means; for small to moderately sized samples, the Mann-Whitney U test may even offer greater power to reject the null hypothesis than the t-test. The null and alternative hypotheses for the restricted dataset of trustworthy reviewers can be stated as follows:

H50: there is no difference between trusted subjects using a nonsystematic reading technique and trusted subjects using a systematic reading technique with respect to defect detection rate.

H5a: there is a difference between trusted subjects using a nonsystematic reading technique and trusted subjects using a systematic reading technique with respect to defect detection rate.

Although the mean defect detection rate of the PBR group (0.286) is higher than that of the CHKL group (0.277), the analysis failed to reveal a significant difference between the two groups (U = 66.5, p = 0.637).
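A minimal sketch of this analysis with SciPy is shown below; the IDEFRATE values are invented placeholders with the same group sizes, not the experimental data.

    from scipy import stats

    # Placeholder defect detection rates for the "trustable" reviewers
    # (10 CHKL observations, 15 PBR observations); illustrative values only.
    chkl = [0.31, 0.25, 0.28, 0.35, 0.22, 0.30, 0.24, 0.27, 0.29, 0.26]
    pbr  = [0.33, 0.29, 0.26, 0.31, 0.24, 0.35, 0.28, 0.30, 0.22, 0.27,
            0.32, 0.25, 0.29, 0.31, 0.26]

    # Normality check within a group (Shapiro-Wilk W): a low p-value argues
    # against the t-test and in favor of a rank-based test.
    print(stats.shapiro(chkl))

    # Two-sided Mann-Whitney U test: compares the rank sums of the two groups.
    u, p = stats.mannwhitneyu(chkl, pbr, alternative="two-sided")
    print(f"U = {u:.1f}, p = {p:.3f}")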

3.4 Analysis of the Subjective Evaluation of the Reading Technique

In the debriefing questionnaire at the end of the second experimental run, we asked reviewers to give a self-evaluation of the reading technique they had just finished applying for finding defects. Figure 6 details this closed-ended question. The first response category ("harmful") means that the reviewer judged the reading technique negatively, because it was considered an obstacle to the task of defect detection. The second response category ("no help") means that the reviewer was neutral with respect to the applied technique, because it was considered just a waste of time. Only the third response category ("helpful") implies a positive judgment of the assigned reading technique.

Table 11 shows how the percentages of answers related to the subjective evaluation of the reading techniques are distributed according to the grouping variables reading technique (RTECH) and perspective (PERSP). The percentages are presented separately with respect to the criterion of process conformance, so that "trusted" reviewers are kept apart from "untrusted" reviewers; recall that the former group is much smaller than the latter (as already shown in Table 10). As can be seen in Table 11, the answers are similarly distributed with respect to the reading technique and perspective variables. On the other hand, "trusted" reviewers appear to be more positive in their evaluation than "untrusted" reviewers.

In order to test whether the two groups significantly differ with respect to the subjective evaluation, we applied the Chi-square test to the resulting 2 x 3 contingency table. This is the most common test of the significance of the relationship between categorical variables; it compares the observed frequencies with the expected frequencies of the contingency table (i.e., the frequencies we would expect if there were no relationship between the variables). The null and alternative hypotheses can be stated as follows:

H60: there is no relationship between trust in process conformance and subjective evaluation of the reading technique.

H6a: there is a relationship between trust in process conformance and subjective evaluation of the reading technique.

The analysis yielded a Chi-square value of 4.91 with a p-value of 0.0267. We may therefore reject the null hypothesis and tentatively conclude that trust in process conformance is related to the subjective evaluation of the reading technique.
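The test can be sketched with SciPy as follows; the 2 x 3 table of counts below is hypothetical and is not the paper's data, it only shows the shape of the computation.

    from scipy import stats

    # Hypothetical 2 x 3 contingency table: rows are "untrusted" and "trusted"
    # reviewers, columns are the answers "harmful", "no help", "helpful".
    observed = [
        [5, 30, 20],   # untrusted reviewers (hypothetical counts)
        [1,  6, 14],   # trusted reviewers (hypothetical counts)
    ]

    # Pearson chi-square test of independence: expected frequencies are derived
    # from the row and column marginals under the null hypothesis of no relationship.
    chi2, p, dof, expected = stats.chi2_contingency(observed)
    print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p:.4f}")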

Q1B_5. Did the reading technique help you identify defects in the requirements?
(0) harmful (it was an obstacle; I would have done better without it)
(1) no help (I think I would have found the same defects even without it)
(2) helpful (I discovered defects that otherwise I could not have found)

Figure 6. Question regarding the self-evaluation of the reading technique in the second experimental run.


Second Run (PG doc)              Did the reading technique help you identify defects in the requirements?

                                 "Untrusted" reviewers                        "Trusted" reviewers
Effect    Level of Factor        harmful   no help   helpful   Missing        harmful   no help   helpful   Missing
          All reviewers          8.3%      50.0%     38.1%     3.6%           8.0%      28.0%     64.0%     0.0%
RTECH     CHKL                   4.2%      45.8%     45.8%     4.2%           0.0%      30.0%     70.0%     0.0%
RTECH     PBR                    10.0%     51.7%     35.0%     3.3%           13.3%     26.7%     60.0%     0.0%
PERSP     NONE                   4.2%      45.8%     45.8%     4.2%           0.0%      30.0%     70.0%     0.0%
PERSP     UCA                    11.1%     50.0%     38.9%     0.0%           0.0%      33.3%     66.7%     0.0%
PERSP     SA                     10.0%     45.0%     45.0%     0.0%           16.7%     33.3%     50.0%     0.0%
PERSP     OOA                    9.1%      59.1%     22.7%     9.1%           33.3%     0.0%      66.7%     0.0%

Table 11. Summary of answers to question Q1B_5 after the second experimental run.

3.5 Analysis of the Relationship between Time and Detection Effectiveness

We wanted to verify whether the amount of time available for preparation and meeting might have influenced the inspection performance. Table 12 shows, for both experimental runs, the correlation coefficients between time (MTNGTIME for the inspection meeting and PREPTIME for the individual preparation) and the dependent variables related to detection effectiveness (TDEFRATE at the team level and IDEFRATE at the individual level). The last row includes only those individuals who can be trusted with respect to process conformance (using the same filtering procedure applied in Section 3.3). Since the time variables are not normally distributed, we used a nonparametric correlation coefficient, the Spearman R, which only assumes that the variables under consideration were measured on at least an ordinal scale. As can be seen, there is no significant correlation between time and detection effectiveness, at either the team or the individual level, and thus we can exclude that the time spent on the review task influenced the result of the inspection.

Dataset                               Pair of Variables       Valid N    Spearman R    p-level
First Run (teams)                     TDEFRATE & MTNGTIME     38         -.298876      .068339
Second Run (teams)                    TDEFRATE & MTNGTIME     35         .098098       .575047
First Run (individuals)               IDEFRATE & PREPTIME     114        .125819       .182242
Second Run (individuals)              IDEFRATE & PREPTIME     109        -.020920      .829053
Second Run ("trusted" individuals)    IDEFRATE & PREPTIME     25         -.174370      .404490

Table 12. Correlation between time and detection effectiveness variables.
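For illustration, the Spearman rank correlation between preparation time and individual detection rate can be computed as in the following sketch; the paired values are invented, not the collected data.

    from scipy import stats

    # Hypothetical paired observations: preparation time in minutes (PREPTIME)
    # and individual defect detection rate (IDEFRATE); illustrative values only.
    preptime = [95, 120, 80, 150, 110, 100, 135, 90]
    idefrate = [0.21, 0.28, 0.18, 0.25, 0.30, 0.22, 0.26, 0.19]

    # Spearman R works on the ranks of the values, so it does not require the
    # time variable to be normally distributed.
    r, p = stats.spearmanr(preptime, idefrate)
    print(f"Spearman R = {r:.3f}, p-level = {p:.3f}")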


4 Summary and Conclusions

Reading is considered a key technical activity for analyzing a document and improving its quality when applied for defect detection during software inspections. Past studies, such as (Basili et al., 1996) and (Ciolkowski et al., 1997), have shown that Perspective-Based Reading (PBR) improves the inspection effectiveness on requirements documents with respect to nonsystematic reading techniques, such as Ad Hoc or Checklist.

We tested the effectiveness of PBR in two runs of a controlled experiment with more than one hundred undergraduate students. The subjects performed both the preparation and inspection meeting phases on the same requirements documents that had been used in the previous studies. The subjects reviewed the documents either applying a nonsystematic reading technique (Ad Hoc in the first run and Checklist in the second run) or a systematic reading technique (PBR). Each PBR reviewer was assigned one of three scenarios based on different perspectives: use case analyst, structured analyst, and object-oriented analyst. The two experimental runs used distinct requirements documents: ATM in the first run and Parking Garage in the second run. While in the first run the reviewers were invited to use the assigned reading technique as a help for finding defects, in the second run they were required to strictly use their reading technique for defect detection.

The main research question was: "Are there differences in defect detection effectiveness between reviewing requirements documents using systematic reading techniques and reviewing requirements documents using nonsystematic reading techniques?" The findings from our experiment are the following:

• No difference was found between inspection teams applying PBR and inspection teams applying Ad Hoc or Checklist reading with respect to the percentage of discovered defects (H10).

This finding does not support the expected hypothesis, based on past studies, that inspection teams applying PBR find a higher percentage of defects than inspection teams applying nonsystematic reading techniques, such as ad hoc reading and checklists. However, the analysis in past studies was performed on simulated inspection teams, whereas our analysis is based on real team meetings.

• Individual reviewers applying PBR found a smaller percentage of defects than Ad Hoc reviewers (H3a for the first run), but no difference was found between reviewers applying PBR and Checklist reviewers (H30 for the second run).

This finding does not support the expected hypothesis, based on past studies, that individual reviewers applying PBR find a higher percentage of defects than individual reviewers applying nonsystematic reading techniques, such as ad hoc reading and checklists. A secondary question, derived from the main research question, was: "Are there differences in defect detection effectiveness between reviewing requirements documents using different PBR scenarios?" Our finding is the following:

• Individual reviewers applying PBR found the same percentage of defects with any of the assigned scenarios (H40). Thus, we can consider the three scenarios equivalent with respect to defect detection effectiveness.

We also investigated the effects of having distinct roles when composing inspection teams. The related research question was: "Are there differences in defect detection effectiveness between reviewing requirements documents having unique roles in an inspection team and reviewing requirements documents having identical roles in an inspection team?" The finding is the following:

• There was no difference between teams with identical responsibilities and teams with distinct responsibilities with respect to the percentage of detected defects (H20).

This finding does not support the theory of scenario-based reading, which states that the coordination of distinct and specific scenarios achieves a higher coverage of the document. Furthermore, we verified whether the amount of time available for preparation and meeting could have influenced the inspection performance, and we can state that the time spent to perform the review task did not influence the result of the inspection.

We went further in our analysis to find an explanation for these contradictory findings. We looked at the debriefing questionnaires in order to check the process conformance assumption, i.e., whether subjects had actually followed the assigned reading techniques. We found that in the first experimental run only one PBR reviewer had declared to have both fully followed the reading technique and carefully focused on the questions in the assigned scenario. This was the main reason for making the use of the reading technique mandatory in the second experimental run, and for checking process conformance through the deliverables of the inspection preparation. We measured the results of this post-hoc verification activity and asked again in the debriefing questionnaire to what extent the reviewers had focused on the questions for defect detection. The result was that less than one third of the Checklist reviewers could be trusted to have used the checklist and only one fifth of the PBR reviewers could be trusted to have followed the assigned scenario. We then tested the main research question again, but this time we considered only the restricted subgroup of reviewers who could be trusted with respect to process conformance. The result is the following:

• There was no difference between "trusted" reviewers using a nonsystematic reading technique and "trusted" reviewers using a systematic reading technique with respect to the percentage of defects found (H50).

However, this time the PBR group scored better than the Checklist group, although the difference was not significant. The analysis could not be repeated at the team level because there were no inspection teams exclusively made up of trusted reviewers. We also investigated the subjective evaluation of the reading techniques by asking reviewers to self-evaluate their reading technique. The answers were similarly distributed with respect to the reading technique and the perspective, and thus we can conclude that the subjective evaluation of the reading technique does not depend on the type of reading technique. On the contrary, the distribution was different with respect to trust in process conformance: trusted reviewers were more positive in their evaluation than untrusted reviewers. We tested the significance of this relationship and the result was the following:

• Trust in process conformance was related to the subjective evaluation of the reading technique (H6a).

However, the relationship between the two variables does not provide evidence of a cause-and-effect relationship. One might argue that some reviewers followed the reading technique as it was written because they were positively impressed by the assigned reading technique. On the other hand, one might also argue that some reviewers gave a positive evaluation of their reading technique because they had actually followed it, and thus had the opportunity to appreciate the technique as it was conceived. This latter explanation might imply that some reviewers have a disposition to be guided by a technique while performing a task. We need to better understand how a process tool, such as a reading technique, is perceived by users and which process tool characteristics are compatible with the users' attitudes. The disposition of users towards a process tool might be influenced by many factors, including sociological and psychological characteristics. We may, for instance, believe that being an undergraduate student makes a subject less inclined to follow instructions than graduate or professional software engineers. Nevertheless, we may also believe that some individuals are more self-disciplined than others because of their education or personal character. Further work on reading techniques, and more generally on software processes, should look more deeply into theories and experimental investigations of human behavior, as social scientists do in their disciplines.

Acknowledgments

We gratefully acknowledge the collaboration of Nicola Barile in the execution and data collection phases of the experiment. Our thanks also go to all the students of the SE class for their hard work.

References

V. Basili, S. Green, O. Laitenberger, F. Lanubile, F. Shull, S. Sorumgard, and M. Zelkowitz, "The Empirical Investigation of Perspective-based Reading", Empirical Software Engineering, 1, 133-164, 1996.
V. R. Basili, "Evolving and packaging reading technologies", Journal of Systems and Software, 38(1): 3-12, July 1997.
V. R. Basili, F. Shull, and F. Lanubile, "Building Knowledge through Families of Experiments", IEEE Transactions on Software Engineering, 25(4), July/August 1999.
B. W. Boehm, Software Engineering Economics, Prentice Hall, Englewood Cliffs, NJ, 1981.
M. Ciolkowski, C. Differding, O. Laitenberger, and J. Munch, "Empirical Investigation of Perspective-based Reading: A Replicated Experiment", ISERN Report 97-13, 1997.
M. E. Fagan, "Design and Code Inspections to Reduce Errors in Program Development", IBM Systems Journal, 15(3): 182-211, 1976.
M. E. Fagan, "Advances in Software Inspections", IEEE Transactions on Software Engineering, 12(7): 744-751, July 1986.
P. Fusaro, F. Lanubile, and G. Visaggio, "A Replicated Experiment to Assess Requirements Inspection Techniques", Empirical Software Engineering, 2, 39-57, 1997.
T. Gilb and D. Graham, Software Inspection, Addison-Wesley Publishing Company, 1993.
W. S. Humphrey, Managing the Software Process, Addison-Wesley Publishing Company, 1989.
IEEE, IEEE Guide to Software Requirements Specifications, IEEE Std. 830, Software Engineering Technical Committee of the IEEE Computer Society, 1984.
O. Laitenberger and J. M. DeBaud, "Perspective-based Reading of Code Documents at Robert Bosch GmbH", Information and Software Technology, 39: 781-791, 1997.
O. Laitenberger, K. El Emam, and T. Harbich, "An Internally Replicated Quasi-Experimental Comparison of Checklist and Perspective-based Reading of Code Documents", Technical Report ISERN-99-01, International Software Engineering Research Network, 1999.
F. Lanubile, F. Shull, and V. Basili, "Experimenting with Error Abstraction in Requirements Documents", in Proc. of METRICS '98, 1998.
J. Miller, M. Wood, and M. Roper, "Further Experiences with Scenarios and Checklists", Empirical Software Engineering, 3, 37-64, 1998.
A. Porter, L. G. Votta, and V. R. Basili, "Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment", IEEE Transactions on Software Engineering, 21(6): 563-575, June 1995.
A. Porter and L. Votta, "Comparing Detection Methods for Software Requirements Specification: A Replication Using Professional Subjects", Empirical Software Engineering, 3, 355-379, 1998.
A. Porter, H. Siy, A. Mockus, and L. Votta, "Understanding the Sources of Variation in Software Inspections", ACM Transactions on Software Engineering and Methodology, 7(1): 41-79, January 1998.
K. Sandahl, O. Blomkvist, J. Karlsson, C. Krysander, M. Lindvall, and N. Ohlsson, "An Extended Replication of an Experiment for Assessing Methods for Software Requirements Inspections", Empirical Software Engineering, 3, 327-354, 1998.
F. Shull, "Procedural Techniques for Perspective-Based Reading and Error Abstraction", http://www.cs.umd.edu/projects/SoftEng/ESEG/manual/error_abstraction/manual.html, 1998.
F. Shull, I. Rus, and V. Basili, "How Perspective-Based Reading Can Improve Requirements Inspections", Computer, 33(7): 73-79, July 2000.
D. A. Wheeler, B. Brykczynski, and R. N. Meeson, Jr. (Eds.), Software Inspection: An Industry Best Practice, IEEE Computer Society Press, 1996.
B. J. Winer, D. R. Brown, and K. M. Michels, Statistical Principles in Experimental Design, 3rd edition, McGraw-Hill, New York, 1991.
