A PRELIMINARY STUDY ON ASYNCHRONOUS DISCUSSIONS FOR DISTRIBUTED SOFTWARE INSPECTIONS

Filippo Lanubile, Teresa Mallardo 1)
Abstract
The face-to-face (F2F) meeting is the cornerstone of the traditional inspection process. However, F2F meetings hinder the applicability of software inspections in the context of distributed software development. We present a preliminary study in which we compare six collocated with six distributed inspections. In the former group, team members physically meet and discuss in a collocated manner, while in the latter group, team members discuss at different times and from different places through a web inspection tool. The study shows that asynchronous discussions within distributed software inspections can replace F2F meetings without reducing inspection performance in terms of effectiveness.
1. Introduction
Over recent years, global software development has become widespread in the software industry [7]. However, the ability to distribute software engineering activities over multiple geographical sites is limited for those software processes, such as software inspection, that require software engineers to meet at the same time and in the same place. The face-to-face (F2F) meeting is the cornerstone of Fagan's software inspection [4], but it hinders the applicability of software inspection in the context of distributed software development. A repeated case study at a global telecommunication company showed that collocated inspection teams were more efficient and effective than distributed teams [3]. On the other hand, several empirical studies [1, 2, 5, 9, 12, 14, 15, 20] have observed that inspection meetings contribute little to the discovery of new defects, although they succeed at identifying false positives. Based on these findings and on a behavioral theory of group performance, Sauer et al. [17] have proposed a reorganization of the inspection process that mainly consists of replacing the preparation and meeting phases of the traditional inspection process with three new phases: defect discovery, defect collection and defect discrimination (see Figure 1).
1) University of Bari, Dipartimento di Informatica, email: [lanubile | mallardo]@di.uniba.it
Figure 1. The reengineered inspection process (Planning and Overview performed by the moderator or author; Discovery and Collection performed by inspectors in parallel; Discrimination performed by the whole inspection team; followed by Rework and Follow-up)
The discovery stage reflects a historical shift in the goal of the preparation phase, which has changed from pure understanding to defect detection, with inspectors recording defects on an individual basis. The collection stage consists of merging the defects found by individual reviewers and requires no more than one person (either the moderator or the author). The discrimination stage is the only phase where inspectors interact in a meeting, in order to remove false positives. The goal of finding further defects as an effect of team synergy is definitively abandoned. The goal of our research is to determine whether an asynchronous discussion can replace the F2F meeting in the discrimination stage, allowing distributed inspection teams to work together from their own places and at different times. To achieve this goal, we conducted a first empirical study in which we compare six collocated with six distributed inspections. In the former group, team members physically meet and discuss during the discrimination stage, while in the latter group, team members discuss at different times and from different places through a web inspection tool. In the following sections we first introduce the experiment, then we show the results from data analysis and discuss threats to validity. Finally, we present our conclusions and future research.
2. The Experiment
Our experiment compared the performance of collocated inspection teams and distributed inspection teams. The main research question was: "Can asynchronous discussions replace F2F meetings for discrimination purposes without reducing effectiveness?". The subjects were 36 fifth-year computer science students attending a web engineering course at the University of Bari. As a course assignment, the students had to develop a web application, working in groups. Each of the twelve development groups was free to choose a specific problem to address and had to specify the problem in a requirements document adhering to a use-case style format.
2.1 Design
With twelve requirements documents available, we conducted six inspections where the discrimination stage was performed as a F2F meeting, and six inspections where the participants in the discrimination stage interacted at a distance, from university labs or home, in an asynchronous manner. Table 1 characterizes the twelve inspections. When a requirements document was ready, it was submitted by its authors for inspection. All the requirements documents were inspected in submission order.

Table 1. The twelve requirements inspections

                   G1   G2   G3   G4   G5     G6     G7     G8     G9   G10    G11  G12
No. of pages       19   16   20   24   28     26     29     21     17   16     19   19
No. of use cases   12   8    7    20   27     15     25     23     7    5      11   18
Team interaction   F2F  F2F  F2F  F2F  async  async  async  async  F2F  async  F2F  async
The inspection teams were composed of three reviewers, who were randomly selected from the other development groups, plus the authors of the requirements document. For the discrimination stage, one of the three reviewers was randomly selected to play the role of moderator. The moderator was assigned the responsibility of managing the discussion and marking defects as false positives, provided that there was consensus among the other inspectors. We invited participants to F2F meetings two days in advance. To ensure similar conditions between the two types of team interaction, we likewise reserved two days for the asynchronous discussions of the other experimental group.

2.2 Variables
The experiment manipulated one independent variable, team interaction during discrimination, with two treatments: F2F meeting (F2F) and asynchronous discussion (async). We gathered the following measures from the collection, discrimination and rework stages:
• collated defects, which result from merging unique and duplicate defects previously found by reviewers;
• true defects, which are defects recognized as valid by the team during discrimination;
• removed false positives, which are defects recognized as invalid by the team during discrimination;
• slipped false positives, which are defects recognized as invalid by the author during rework;
• false positives, which are the sum of removed false positives and slipped false positives.
In order to assess team interaction, our main dependent variable was the discrimination effectiveness, that is, the ratio of removed false positives to false positives.

2.3 Execution
Inspections were supported by a tool, the Internet-Based Inspection System (IBIS) [10], which covers all the phases of the reengineered inspection process in Figure 1. Our participants had been trained to use the IBIS tool in a previous trial inspection.
During discovery, the tool was used to view the inspected document, apply the assigned reading technique, and annotate defects. In the collection stage, all discovery logs were automatically merged into a single defect list. Then, again with the help of the tool, the author marked as duplicates those defects which had been discovered by multiple inspectors. The discrimination stage was planned to include the entire inspection team and all collated defects. The IBIS tool was used both for F2F meetings and for asynchronous discussions. However, in the former case the inspectors physically met at the university labs and the tool was used only by the moderator to remove false positives. In the latter case, by contrast, the IBIS tool made it possible to conduct asynchronous discussions at a distance, as in a discussion forum, with each defect mapped to a threaded discussion. Comments were added by discussants within a thread, and the other participants were notified by an email message. To support decision making, discussants could also vote by rating potential defects as true defects or false positives. The moderator was thus able to mark as false positives those potential defects for which a consensus had been reached.
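The voting rule just described can be sketched as follows. This is a hypothetical illustration only, not the actual IBIS implementation: the function and constant names are ours, and we assume a configurable consensus threshold (defaulting to unanimity) because the paper does not specify one.

```python
from collections import Counter

# Hypothetical sketch of the discrimination decision rule described above:
# discussants vote on each potential defect, and the moderator may mark it
# as a false positive only once the votes reach a consensus.
# NOT the actual IBIS code; names and the unanimity default are assumptions.

TRUE_DEFECT = "true defect"
FALSE_POSITIVE = "false positive"

def consensus(votes, decision, quorum=1.0):
    """Return True if at least a `quorum` fraction of votes back `decision`."""
    if not votes:
        return False
    counts = Counter(votes)
    return counts[decision] / len(votes) >= quorum

# Usage: three inspectors vote on a potential defect.
print(consensus([FALSE_POSITIVE, FALSE_POSITIVE, FALSE_POSITIVE], FALSE_POSITIVE))  # True
print(consensus([FALSE_POSITIVE, TRUE_DEFECT, FALSE_POSITIVE], FALSE_POSITIVE))     # False under unanimity
```

With a lower quorum (e.g. 0.5), the same function models majority voting instead of unanimity.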
3. Data Analysis
Table 2 shows that discrimination effectiveness was slightly higher for asynchronous discussions (mean = 0.90) than for F2F meetings (mean = 0.88). Since there are too few observations, testing for differences would be unfair: the small sample size works in favor of our hypothesis of no difference in discrimination effectiveness. Only in G6 and G8 (0.65 and 0.78, respectively) is the effectiveness of the asynchronous discussion lower than the minimum effectiveness of the F2F meetings, reached in G1 (0.80). The low value for G8 could be justified by the high number of collated defects (60) which went through discussion, while the low effectiveness of discrimination in G6 (0.65) might depend on poor discussion intensity.

Table 2. Comparison of inspection performance

                               F2F meeting                          Asynchronous discussion
                               G1    G2    G3    G4    G9    G11    G5    G6    G7    G8    G10   G12
Discrimination effectiveness   0.80  0.89  0.95  0.83  0.81  1.00   1.00  0.65  0.94  0.78  1.00  1.00
Collated defects               19    25    43    27    35    17     31    28    45    60    17    35
True defects                   4     16    21    21    19    12     9     8     12    42    7     29
False positives                15    9     22    6     16    5      22    20    33    18    10    6
Removed false positives        12    8     21    5     13    5      22    13    31    14    10    6
Slipped false positives        3     1     1     1     3     0      0     7     2     4     0     0
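The figures in Table 2 are internally consistent: for each inspection, collated defects = true defects + false positives, false positives = removed + slipped false positives, and discrimination effectiveness = removed false positives / false positives. The short script below (ours, not part of the study's tooling) recomputes the derived quantities from the raw counts:

```python
# Recompute Table 2's derived values from the raw counts (our verification script).
# Each tuple: (group, true defects, removed false positives, slipped false positives).
f2f = [("G1", 4, 12, 3), ("G2", 16, 8, 1), ("G3", 21, 21, 1),
       ("G4", 21, 5, 1), ("G9", 19, 13, 3), ("G11", 12, 5, 0)]
async_ = [("G5", 9, 22, 0), ("G6", 8, 13, 7), ("G7", 12, 31, 2),   # 'async' is a keyword
          ("G8", 42, 14, 4), ("G10", 7, 10, 0), ("G12", 29, 6, 0)]

def effectiveness(removed, slipped):
    """Discrimination effectiveness: removed false positives over all false positives."""
    return removed / (removed + slipped)

for group, true_defects, removed, slipped in f2f + async_:
    false_positives = removed + slipped
    collated = true_defects + false_positives   # collated = true defects + false positives
    print(f"{group}: collated={collated}, "
          f"effectiveness={effectiveness(removed, slipped):.2f}")

mean_f2f = sum(effectiveness(r, s) for _, _, r, s in f2f) / len(f2f)
mean_async = sum(effectiveness(r, s) for _, _, r, s in async_) / len(async_)
print(f"mean effectiveness: F2F = {mean_f2f:.3f}, async = {mean_async:.3f}")
```

The asynchronous mean stays above the F2F mean whether computed from exact ratios or from the two-decimal values reported in the table.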
In a questionnaire administered one month after the experiment, we asked the subjects how they had perceived this experience. Subjects had to state their degree of agreement with the following statements: (1) the F2F meeting / asynchronous discussion has been useful for the purpose of removing false positives; (2) the F2F meeting / asynchronous discussion has been useful for the purpose of learning. We solicited answers by email and received replies from 26 subjects who had participated in a F2F meeting and 24 subjects who had participated in an asynchronous discussion. The total number of replies (50) is higher than the number of subjects (36) because some participants were involved in two inspections, either as an author or as a reviewer.
Results are shown in Table 3 and Table 4. The rows represent the categories of the team interaction variable, while the columns represent the answers, that is, the categories of a second variable. Only 3 subjects did not find the asynchronous discussion useful for the purpose of removing false positives, while a minority of participants disagreed about the usefulness of F2F meetings and asynchronous discussions for the purpose of learning (3 and 7 subjects, respectively). We performed two chi-square tests of independence on the 2x2 contingency tables to see whether there is any evidence that the answers are related to team interaction. The probability values for the chi-square statistics are both higher than 0.05, and therefore we cannot reject the null hypotheses that the variables are independent, or unrelated.

Table 3. Contingency table for statement (1)

                                   I agree    I disagree   Row totals
F2F meeting (freq.)                26         0            26
  Percent of total                 52.00%     0.00%        52.00%
Asynchronous discussion (freq.)    21         3            24
  Percent of total                 42.00%     6.00%        48.00%
Column totals                      47         3            50
  Percent of total                 94.00%     6.00%
Chi-square (df=1)                  3.46       p = 0.0630

Table 4. Contingency table for statement (2)

                                   I agree    I disagree   Row totals
F2F meeting (freq.)                23         3            26
  Percent of total                 46.00%     6.00%        52.00%
Asynchronous discussion (freq.)    17         7            24
  Percent of total                 34.00%     14.00%       48.00%
Column totals                      40         10           50
  Percent of total                 80.00%     20.00%
Chi-square (df=1)                  2.42       p = 0.1195
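The two chi-square statistics can be reproduced from the contingency tables with a few lines of standard-library Python (again, our verification script, not the study's tooling). For a 2x2 table there is one degree of freedom, so the p-value reduces to erfc(sqrt(x/2)); the statistics below use no Yates continuity correction, which matches the reported values:

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table,
    without Yates continuity correction. Returns (statistic, p-value).
    For df = 1, the p-value is erfc(sqrt(x / 2))."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]                      # row totals
    col = [a + c, b + d]                      # column totals
    cells = [(a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)]
    stat = sum((obs - row[r] * col[co] / n) ** 2 / (row[r] * col[co] / n)
               for obs, r, co in cells)       # sum of (observed - expected)^2 / expected
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Table 3: usefulness for removing false positives (agree / disagree).
stat3, p3 = chi_square_2x2([[26, 0], [21, 3]])
# Table 4: usefulness for learning.
stat4, p4 = chi_square_2x2([[23, 3], [17, 7]])
print(f"Table 3: chi2 = {stat3:.2f}, p = {p3:.4f}")   # chi2 ≈ 3.46, p ≈ 0.063
print(f"Table 4: chi2 = {stat4:.2f}, p = {p4:.4f}")   # chi2 ≈ 2.42, p ≈ 0.12
```

Both p-values exceed 0.05, consistent with the decision not to reject the null hypotheses of independence.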
4. Threats to Validity
Threats to internal validity are rival explanations of the experimental findings that make the cause-effect relationship between independent and dependent variables more difficult to believe [21]. In our study we identified the following threats to internal validity:
• Selection. The selection threat refers to natural differences in human performance. In our experiment, we reduced the selection effect by randomly assigning subjects to F2F meetings and asynchronous discussions.
• Maturation. This threat is due to changes in the subjects' behavior over the course of the experiment. One example of such changes is the learning effect. Because in our study some subjects participated twice in the discrimination stage, either as an author or as a reviewer, they might have improved their performance in the second inspection. In order to mitigate this learning effect, the order was randomly chosen. Furthermore, we gave similar training to the subjects before the experiment started.
• Plagiarism. Subjects exchanging information during or between experimental tasks is a typical risk for experiments in an academic context. This threat is unlikely to have occurred in our experiment, because the task was limited to removing false positives from a document which was not shared among inspections, although recurring defects could be a bias. To mitigate this threat, both F2F meetings and asynchronous discussions were monitored by one of the researchers, and no intervention proved necessary.
Threats to external validity are factors that limit the generalization of the experimental results to the context of interest [21], here the industrial practice of software inspections. The identified threats to external validity are the following:
• Representative subjects. Since we involved students both as document authors and as reviewers, they may not be representative of the population of software professionals. However, our fifth-year students can be considered equivalent to the newcomers who are usually recruited into inspection teams for learning purposes. This might not be true for the moderator, who was selected from among the participants.
• Representative artifacts. The requirements documents inspected in this study may not be representative of industrial requirements documents. Our documents were requirements specifications for web applications, for which time to delivery is the dominating factor, while inspections are often conducted for dependable systems, for which quality and rework costs are perceived as critical.
• Representative processes. The reengineered inspection process and tool support for distributed inspections may not be representative of industrial practice. Although software inspections are often identified with Fagan's model [4], there are actually many variants of the inspection process which have been applied in industry and reported in the literature [8, 16]. Furthermore, tool-supported inspections have also gained industrial adoption [13, 19].
5. Conclusions and Further Work
A number of studies have assessed alternatives to the F2F inspection meeting. For the sake of space we report only a few of them. Mashayekhi et al. [11] conducted an empirical study comparing conventional inspection meetings with tool-augmented meetings, where they put forth their CAIS tool as a complement to the F2F meeting rather than its substitute. Stein et al. [18] evaluated AISA, a web-based tool supporting asynchronous discussions, but without a formal comparison with F2F meetings. More recently, Grünbacher et al. [6] ran an experiment in which the inspection meeting had the sole purpose of discriminating between true defects and false positives. However, they adopted a hybrid strategy based on a mix of asynchronous voting and synchronous discussion, where votes were used to filter out those defects for which there was already a consensus among participants. Our study shows that asynchronous discussions within distributed software inspections can be considered equivalent to F2F meetings with respect to the effectiveness of discriminating between true defects and false positives. Since there were too few observations, we could not statistically test the hypothesis of no difference in discrimination effectiveness. However, we were able to test the hypothesis that there is no relationship between team interaction and usefulness as perceived by inspectors. Results reveal that inspectors found both F2F meetings and asynchronous discussions useful for the purpose of removing false positives. Furthermore, learning was not affected by the type of interaction during discrimination. As further work, we intend to run additional controlled experiments in the next editions of our courses. Besides assessing discrimination effectiveness, we will also assess discrimination efficiency. While we were able to collect effort data for F2F meetings, we had no equivalent measures for the asynchronous discussions.
For this purpose, we are going to extend the IBIS tool to automatically collect process data from all the inspection stages, including discrimination. We also plan to include remote synchronous discussions (e.g., via chat or videoconferencing) in the comparison as a further type of team interaction.
Acknowledgements We gratefully acknowledge the valuable collaboration of Fabio Calefato for the execution of the experiment. We also thank all students of the web engineering class for their fruitful work.
References
[1] BIANCHI, A., LANUBILE, F., VISAGGIO, G., A controlled experiment to assess the effectiveness of inspection meetings, Proceedings of the 7th International Symposium on Software Metrics (METRICS 2001), London, England, IEEE Computer Society, pp. 42-50, 2001.
[2] CIOLKOWSKI, M., DIFFERDING, C., LAITENBERGER, O., MÜNCH, J., Empirical Investigation of Perspective-based Reading: A Replicated Experiment, ISERN Report 97-13, 1997.
[3] EBERT, C., PARRO, C.H., SUTTELS, R., KOLARCZYK, H., Improving Validation Activities in a Global Software Development, Proceedings of the 23rd International Conference on Software Engineering (ICSE 2001), Toronto, Ontario, Canada, IEEE Computer Society, pp. 545-554, 2001.
[4] FAGAN, M.E., Design and Code Inspections to Reduce Errors in Program Development, IBM Systems Journal, Vol. 15, Issue 3, pp. 182-211, 1976.
[5] FUSARO, P., LANUBILE, F., VISAGGIO, G., A Replicated Experiment to Assess Requirements Inspection Techniques, Empirical Software Engineering, Vol. 2, pp. 39-57, 1997.
[6] GRÜNBACHER, P., HALLING, M., BIFFL, S., An Empirical Study on Groupware Support for Software Inspection Meetings, Proceedings of the 18th International Conference on Automated Software Engineering (ASE 2003), Montreal, Canada, IEEE Computer Society, pp. 4-11, 2003.
[7] HERBSLEB, J.D., MOITRA, D., Global Software Development, IEEE Software, Vol. 18, No. 2, pp. 16-20, 2001.
[8] LAITENBERGER, O., DEBAUD, J.M., An Encompassing Life Cycle Centric Survey of Software Inspection, The Journal of Systems and Software, Vol. 50, Issue 1, pp. 5-31, 2000.
[9] LAND, L.P.W., JEFFERY, R., SAUER, C., Validating the Defect Detection Performance Advantage of Group Designs for Software Reviews: Report of a Replicated Experiment, Caesar Technical Report 97/2, University of New South Wales, 1997.
[10] LANUBILE, F., MALLARDO, T., Tool Support for Distributed Inspection, Proceedings of the 26th Annual International Computer Software & Applications Conference (COMPSAC 2002), Oxford, England, IEEE Computer Society, pp. 1071-1076, 2002.
[11] MASHAYEKHI, V., FEULNER, C., RIEDL, J., CAIS: Collaborative Asynchronous Inspection of Software, Proceedings of the 2nd ACM SIGSOFT Symposium on the Foundations of Software Engineering, New Orleans, Louisiana, 1994.
[12] MILLER, J., WOOD, M., ROPER, M., Further Experiences with Scenarios and Checklists, Empirical Software Engineering, Vol. 3, pp. 37-64, 1998.
[13] PERRY, D.E., PORTER, A., WADE, M.W., VOTTA, L.G., PERPICH, J., Reducing Inspection Interval in Large-Scale Software Development, IEEE Transactions on Software Engineering, Vol. 28, No. 7, pp. 695-705, 2002.
[14] PORTER, A., VOTTA, L.G., BASILI, V.R., Comparing Detection Methods for Software Requirements Inspections: A Replicated Experiment, IEEE Transactions on Software Engineering, Vol. 21, No. 6, pp. 563-575, 1995.
[15] PORTER, A., VOTTA, L.G., Comparing Detection Methods for Software Requirements Specification: A Replication Using Professional Subjects, Empirical Software Engineering, Vol. 3, pp. 355-379, 1998.
[16] PORTER, A., SIY, H., MOCKUS, A., VOTTA, L.G., Understanding the Sources of Variation in Software Inspections, ACM Transactions on Software Engineering and Methodology, Vol. 7, Issue 1, pp. 41-79, 1998.
[17] SAUER, C., JEFFERY, D.R., LAND, L., YETTON, P., The Effectiveness of Software Development Technical Reviews: A Behaviorally Motivated Program of Research, IEEE Transactions on Software Engineering, Vol. 26, No. 1, pp. 1-14, 2000.
[18] STEIN, M., RIEDL, J., HARNER, S.J., MASHAYEKHI, V., A Case Study of Distributed, Asynchronous Software Inspection, Proceedings of the 19th International Conference on Software Engineering (ICSE 1997), Boston, Massachusetts, IEEE Computer Society, pp. 107-117, 1997.
[19] VAN GENUCHTEN, M., VAN DIJK, C., SCHOLTEN, H., VOGEL, D., Using Group Support Systems for Software Inspections, IEEE Software, Vol. 18, No. 3, pp. 60-65, 2001.
[20] VOTTA, L.G., Does Every Inspection Need a Meeting?, ACM Software Engineering Notes, Vol. 18, No. 5, pp. 107-114, 1993.
[21] WOHLIN, C., Experimentation in Software Engineering: An Introduction, The Kluwer International Series in Software Engineering, 2000.