Experiences with a Case Study on Pair Programming

Marcus Ciolkowski, Michael Schlemmer
Universität Kaiserslautern
[email protected]

Abstract. Agile methods are becoming more and more popular. Probably the best known among them is Extreme Programming (XP) [2]. One key practice of XP is Pair Programming (PP), in which two developers work simultaneously on a programming task. However, despite their popularity, little empirical knowledge exists about the limitations of these methods. Some empirical studies on Pair Programming exist [5][8]; these studies compared PP to solo programming and were conducted on small, isolated tasks. In this paper, we describe a case study conducted in the context of a more realistic task within a university practical course, carried out in teams of six students and comprising about 700 person-hours of total effort. Within our case study setting, we were able to find weak support for the results achieved in earlier studies. More importantly, we describe the experiences we gained in conducting the case study and suggest improvements for future investigations.

1. Introduction

A few years ago, agile processes and methods were introduced into the software engineering world. These processes are characterized by a combination of methods focused on little documentation and high flexibility, or, in other words, on simplicity and speed [1]. One prominent example, and probably the best known one, is Extreme Programming (XP) [2].

One important element of XP is Pair Programming (PP). In PP, two developers always work together, especially during implementation. One of them takes the role of the "driver", which means that he types the code and has control over keyboard and mouse. Meanwhile, the other, in the role of the "observer", watches the driver's actions, tries to find errors, and plans ahead [3]. These roles are switched frequently. This practice represents a kind of continuous, informal inspection of the code and is expected to detect defects as early as possible.

However, even though PP is growing in popularity, we still do not know much about its benefits and weaknesses. When starting our case study, we knew of only a few empirical studies; for example, by Nosek [5], and by Cockburn, Williams and Kessler [3][8]. Basically, their results were that, for their type of programming task, PP needed ca. 15% more effort than solo programming but produced code of better quality. However, it is not clear whether these results also hold for larger projects and for teams consisting of more than two programmers.

In this paper, we present a case study on PP in a university environment. In this case study, students had to work in teams of six, and they had to implement changes to an already existing software system written in Java.

Further, we present experiences and lessons we learned during the execution of the case study; thereby, we hope to facilitate replication and to suggest improvements for future investigations.

This paper is structured as follows: Section 2 briefly describes empirical studies on PP. Section 3 describes the case study we conducted, and Section 4 contains our results. Finally, we summarize and discuss our findings in Section 5.

2. Studies on Pair Programming

When we planned our case study, we knew of only two documented empirical studies examining PP: one by Nosek [5], and one by Williams, Kessler and Cockburn [3][8].

In the study conducted by John Nosek, 15 full-time programmers were split into 5 pairs and 5 individuals and given a task to solve; the allotted time for the task was 45 minutes. He thereby compared teams using PP against solo programmers, using subjective criteria and programming effort to compare the techniques. His result was that pairs needed about 70% of the time compared to individuals, which means that pairs needed 40% more effort.

In the study conducted by Laurie Williams, the researchers split 42 students into 14 pairs and 14 individuals. The students then had to write four programming assignments in six weeks. The researchers report that pairs needed 50-60% of the time, which means the same effort or up to 20% more effort than individuals. It is unclear exactly what kind of task the students had to solve, but it seems that the assignments were independent of each other; that is, they were not connected to creating or changing a larger software system. Further, the researchers claim that pairs produced code of better quality, measured through the percentage of failed system tests and through the lines of code of the solution.

In both studies, the participants had to complete small, isolated programming assignments. Although it is important to study the effect of a technique on isolated tasks in order to increase control over the study, this setting is not realistic: usually, programmers work on a large software system, make changes (e.g., add functionality) to it, and interact with each other.

3. Description of the Case Study

One motivation for this study was to verify the results of previous studies in a more realistic setting, which means: (a) working on a larger system, and (b) working in larger teams than in the previous studies on PP.

In this situation, we had the opportunity to use an undergraduate practical course on Java programming, the so-called "Software-Praktikum" (SWP). Every computer science student at the University of Kaiserslautern has to take this course in the second year. Students work in teams of six on a project that lasts 13 weeks (i.e., a total effort of about 700 person-hours). During this time, the students have to work on a larger project that basically requires them to extend and modify an existing system. Thus, this practical course gave us the opportunity to test Pair Programming in a more realistic environment.

We decided to evaluate the benefits of PP by the following three criteria: (1) the quality of the produced code, (2) the effort used for implementation, and (3) the subjective impressions of the students.

However, it was clear that we would not be able to obtain many data points for evaluation, so we did not expect to achieve statistically significant results, only trends or tendencies in the data. Thus, one main objective of this study was to gather experiences, to learn how we could use this kind of practical course for such studies, and where the pitfalls in evaluating PP are. With the experiences gained, we plan to conduct more rigorous case studies on PP in future runs of this practical course.

The rest of this section is structured as follows: Section 3.1 details the project the students had to work on during the course and describes the design of our case study, and Section 3.2 presents the hypotheses and metrics we used in our study.

3.1. The Project

The system we used in the practical course is called "web-based quiz system", or WBQS. It is implemented in Java and documented through its API (using JavaDoc) and, additionally, through a UML-based requirements and design document. The purpose of the WBQS is to process multiple-choice quizzes: a user starts a Java application; the WBQS then loads (via HTTP) and displays a table of existing quizzes, and the user chooses one of them. Next, the system loads (again via HTTP) an XML specification of the quiz as well as the necessary pictures and displays the first question. After answering the questions, the user can request an evaluation of her answers and review the incorrectly answered questions. In its initial state, the system had about 4000 lines of Java code (including comments) and consisted of 21 classes.

During the practical course, the students had to inspect and modify the requirements and design documents. Further, they had to carry out two iterations, during which they had to implement changes and test cases, as well as test the system and produce documentation. For our case study, only the programming tasks (including writing test cases) of the two iterations are relevant. In total, they covered 5 of the 13 weeks of the practical course.

The first iteration dealt with enhancing the existing system functionality and was divided into three separate subtasks. This iteration was very well suited for PP, as the six-person teams could easily split themselves into three PP teams. The goal of the second iteration was to move the whole system to a client-server architecture (using Java RMI); a hypothetical sketch of such a remote interface is given at the end of this subsection. For this task, it was not obvious how the work could be split. The interesting question here was how the PP teams would react to this challenge: Would they continue working as PP teams or switch to a different mode?

Altogether, we had 55 students in nine teams in the course. We had to remove three teams from the analysis, as some students dropped out of the course and these teams were reduced to four students. In the end, we had three teams working with PP and three teams working in their usual (unsystematic) collaboration mode.

It can be argued that the level of control we had in this study is higher than for a typical case study, as we were able to compare two treatments (PP and usual collaboration). However, considering the low process control (for details, see Section 4.3) and the long span of time we covered, we cannot speak of a controlled experiment. If at all, we could consider our study a quasi-experiment, taking into account that the level of control is somewhere between that of a case study and that of an experiment. However, because of the low level of control we had, we decided to refer to our study as a case study, not as a quasi-experiment.
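The concrete remote interfaces of the restructured system are not part of this paper. Purely as an illustration of the kind of change the second iteration required, the following is a minimal, hypothetical sketch of a Java RMI remote interface for the quiz service; the names (QuizService, getAvailableQuizzes, getQuizSpecification) are our own assumptions and not the actual WBQS API.

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical remote interface for a client-server WBQS; all names are illustrative only.
    public interface QuizService extends Remote {

        // Returns the table of available quizzes that the client displays to the user.
        String[] getAvailableQuizzes() throws RemoteException;

        // Returns the XML specification of the selected quiz.
        String getQuizSpecification(String quizName) throws RemoteException;
    }

A server class would implement such an interface, typically by extending java.rmi.server.UnicastRemoteObject, and register itself in an RMI registry that the clients look up.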

3.2. Hypotheses and Metrics

We used two hypotheses to evaluate Pair Programming against the usual collaboration mode.

Hypothesis 1: The quality of code produced with PP is better than with the usual collaboration.

One quality metric that is usually used in such case studies is the number of defects in the resulting code. However, in our case, this metric turned out to be useless (for details, see Section 4.3). Instead, we tried to measure the quality of the code through several other metrics. Some of the most important ones are:

o Lines of Code: LOC (PP)
This is the traditional measure of size. It counts the number of code lines (without comments). We decided to include this metric to be able to relate our results to earlier studies, which also used LOC as a quality indicator. The argument often used is that shorter code for the same task means that it is better planned. However, we are well aware that lower LOC does not necessarily imply better code quality.

o Comment Ratio: CR (PP)
Counts the ratio between comment lines and total lines of code (including comments). The underlying assumption is that a higher comment ratio means better documented code, and thus higher quality.

o Coupling Factor: CF (PP)
CF is calculated as the fraction of the number of non-inheritance couplings over the maximum possible number of couplings in the system (see the formula and sketch at the end of this subsection). The underlying assumption is that a lower CF indicates a less complex system architecture.

Hypothesis 2: The effort necessary for PP is 20% higher than with the usual collaboration mode.

We measured effort as effort per person in hours. We chose this hypothesis in accordance with the results of previous studies. However, it can be argued that unsystematic collaboration can result in higher overhead than the systematic collaboration enforced by PP. The assumption behind hypothesis 2 is that the "unsystematic" teams split their work and then work individually on their tasks; that is, they still work as solo programmers.
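For clarity, the coupling factor described above can also be written as a formula. The following restates the textual definition in the style of the MOOD metrics suite, where TC denotes the total number of classes and is_client(c_i, c_j) is 1 if class c_i has a non-inheritance reference to class c_j and 0 otherwise; this formalization is our reading of the definition, not taken from the measurement tool we used.

    CF = \frac{\sum_{i=1}^{TC} \sum_{j=1,\, j \neq i}^{TC} \mathrm{is\_client}(c_i, c_j)}{TC^2 - TC}

Similarly, LOC and the comment ratio rely only on a simple classification of source lines. The following is a minimal sketch of how such a counter could look for a single Java file; it is an illustration under the simplifying assumption that block comments and code never share a line, and it is not the tool used in the study.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Minimal, simplified LOC / comment-ratio counter for a single Java file.
    // Assumes block comments and code never share a line (good enough for a sketch).
    public class SimpleLineMetrics {

        public static void main(String[] args) throws IOException {
            List<String> lines = Files.readAllLines(Path.of(args[0]));
            int code = 0, comments = 0;
            boolean inBlockComment = false;

            for (String raw : lines) {
                String line = raw.trim();
                if (line.isEmpty()) continue;                  // blank lines count as neither
                if (inBlockComment) {
                    comments++;
                    if (line.contains("*/")) inBlockComment = false;
                } else if (line.startsWith("//")) {
                    comments++;
                } else if (line.startsWith("/*")) {
                    comments++;
                    inBlockComment = !line.contains("*/");
                } else {
                    code++;                                    // everything else is code
                }
            }

            double total = code + comments;
            System.out.printf("LOC (code only): %d%n", code);
            System.out.printf("Comment ratio: %.1f%%%n", total == 0 ? 0 : 100.0 * comments / total);
        }
    }

Aggregating such counts over all files modified in an iteration would yield values comparable to those reported in Section 4.1.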

4. Results

In this section, we describe the results and experiences we gained in our case study. In Sections 4.1 and 4.2, we present results concerning the quality of the code and the effort used for the implementation tasks. Section 4.3 describes experiences and lessons learned during our case study.

4.1. Results Concerning the Quality of the Produced Code

Lines of Code: The following table shows the LOC for teams using PP and teams using the usual collaboration. As can be seen, the mean LOC is slightly lower for the PP teams in both iterations. Although the difference is not statistically significant, the result can be seen as weak support for our hypothesis.

| Iteration | Mean LOC, PP | Mean LOC, Usual Collab. | Std. Dev., PP | Std. Dev., Usual Collab. |
|-----------|--------------|-------------------------|---------------|--------------------------|
| It. 1     | 3182.3       | 3411.7                  | 385.9         | 107.9                    |
| It. 2     | 3886.0       | 4028.7                  | 111.5         | 230.8                    |

Comment Ratio: The following table presents the comment ratio for teams using PP and for teams using the usual collaboration in both iterations. Teams using PP had a slightly lower comment ratio than teams using the usual collaboration. This metric is an example of a (weak) contradiction of our hypothesis.

| Iteration | Mean CR (%), PP | Mean CR (%), Usual Collab. | Std. Dev., PP | Std. Dev., Usual Collab. |
|-----------|-----------------|----------------------------|---------------|--------------------------|
| It. 1     | 22.0            | 24.3                       | 7.0           | 15.0                     |
| It. 2     | 20.7            | 20.7                       | 6.7           | 2.3                      |

Coupling Factor: The following table shows the coupling factor for the first iteration. The coupling factor is, of course, dominated by the existing system. Therefore, we decided to exclude the second iteration, because it is even more strongly dominated by the client-server architecture of the system, which reduces the influence of PP on the system quality. As can be seen, teams using PP produced code with a slightly lower coupling factor. However, the difference is not statistically significant.

| Iteration | Mean CF (%), PP | Mean CF (%), Usual Collab. | Std. Dev., PP | Std. Dev., Usual Collab. |
|-----------|-----------------|----------------------------|---------------|--------------------------|
| It. 1     | 17.3            | 17.7                       | 0.6           | 0.6                      |

4.2. Results Concerning Effort

The following table shows the mean effort per person in hours for the two iterations. As expected, the effort for teams using PP is slightly higher than for teams using the usual collaboration (ca. 10%). For this analysis, we were able to use 16 data points. However, we could not observe any statistically significant differences.

| Iteration | Mean effort (h), PP | Mean effort (h), Usual Collab. | Std. Dev., PP | Std. Dev., Usual Collab. |
|-----------|---------------------|--------------------------------|---------------|--------------------------|
| It. 1     | 15.1                | 13.8                           | 3.4           | 4.2                      |
| It. 2     | 26.4                | 24.3                           | 16.0          | 9.6                      |

4.3. Experiences and Lessons Learned

In this section, we present the main experiences gathered during the case study.

Treatment Assignment: Our initial idea was to randomly select some teams and train them in Pair Programming. However, this would have meant teaching PP to some students and not to others. One very important restriction was that all students had to be taught the same topics, so we could not use this type of assignment. Our solution was to provide PP material to all students, so that everybody had access to the same information. Then, we tried to motivate some teams to use PP. However, the disadvantage is that the performance of teams using PP may be suboptimal, because they are not trained in it.

Alternatively, we could have asked for volunteers, but that would also have biased the results, as volunteers are usually more motivated than the rest. It turned out that this motivation worked well enough for our purposes; all students we motivated applied PP later during the course.

Process Control: Another challenge we had to deal with was that students could work at home, which means that we could not supervise what process they followed. To find out whether the students applied Pair Programming, we used several questionnaires and interviews. We found that this kind of process control has to be improved: the questionnaires we used turned out to be "obvious", in that students could easily answer what they thought we wanted to hear. One possible solution could be to develop a better questionnaire to measure the process they followed, or to require certain presence times during which we can observe the teams.

Measurement: Effort and defect measurement depend highly on the reliability and discipline of the students, which we could not control. To get more reliable measurement data, we made turning in effort and defect measurement sheets part of the grading. Although this guarantees that the students report data on a regular (and thus more reliable) basis, they may not deliver accurate reports about the effort they spent, or about defects in their system, as they probably fear that the data will be used against them. We tried to control this by emphasizing that only turning in the measurement sheets, not their content, was part of the grading. Additionally, we had a student collect and analyze the measurement data, making it clear that no tutor could use the data during the course. However, it turned out that this approach was not sufficient. For example, we did not receive clear reports about defects or problems with the code. We have to find a better way to motivate students to report their defects. Another possible reason for not receiving clear reports is that the discipline in recording effort and defects may not have been very high. This is, of course, a problem common to all measurement. One solution could be to further facilitate the measurement, either by automatic support (e.g., tools like Hackystat [4]) or by using a simpler, for example web-based, collection method; a minimal sketch of such a lightweight collection helper is given at the end of this subsection.

Quality Measurement: The system itself was GUI-intensive. This means that, unlike in the previous studies on PP [3][8], we could not use automatic test cases to check the system for defects in order to evaluate the quality of the code. We tried to use test cases based on scenarios, but they turned out to be useless, as every system passed all tests. Thus, we had to rely on different metrics, such as complexity, to evaluate the quality of the code. Our experience was that this approach is basically feasible, but that we need to refine our quality metrics. For example, we used a tool that computed the worst-case complexity for the system. Although we computed complexity only for the part of the system the students modified, some very complex classes that already existed still dominated the measurement. Further, we need to focus the metrics more on measuring only the modifications done by the students.
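To illustrate what such a lightweight collection method could look like, the following is a minimal, hypothetical sketch of a command-line helper that appends timestamped effort entries to a CSV file; it is not the mechanism used in the study, and all names (EffortLogger, effort-log.csv) are our own.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.time.LocalDateTime;

    // Hypothetical, minimal effort logger: appends "timestamp;student;task;hours" to a CSV file.
    // Usage: java EffortLogger <student-id> <task> <hours>
    public class EffortLogger {

        public static void main(String[] args) throws IOException {
            if (args.length != 3) {
                System.err.println("Usage: java EffortLogger <student-id> <task> <hours>");
                return;
            }
            String entry = String.format("%s;%s;%s;%s%n",
                    LocalDateTime.now(), args[0], args[1], args[2]);

            // Create the file if it does not exist yet, otherwise append to it.
            Files.writeString(Path.of("effort-log.csv"), entry,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

A web-based front end could wrap the same append-only idea behind a simple form.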

5. Conclusions

Regarding the first hypothesis, that teams using PP would produce code of higher quality than teams using the usual collaboration mode, we were not able to find strong support.

However, for most of the metrics we used, we were able to find tendencies that confirm the hypothesis. In previous studies, researchers found a decrease in LOC of about 20% [3] and concluded that this indicated better code quality. We were not able to show such a large difference in our case study. One reason may be that our teams modified an existing system (rather than solving a small, isolated task), so the measure was dominated by the existing system, and the students' creativity was restricted by it.

Regarding our second hypothesis, that teams using PP would need slightly more effort, we were able to find weak support. In our study, teams using PP needed slightly more effort (ca. 10%) than teams using the usual collaboration.

However, PP supporters often claim that, through its collaborative nature, PP helps people learn to use new techniques. In our case study, the students had to adapt to many new techniques, among them XML and Java RMI (remote method invocation). We could not observe any positive effect for PP teams in terms of adopting the new techniques. One reason could be that this improved learning effect does not exist, or that the students only programmed in pairs but did not work in pairs when learning about the new techniques.

In addition to collecting metrics, we conducted interviews with the students to ask them about their impressions of PP. The students described PP as creating a very constructive working atmosphere. They also claimed they would use PP again. We observed that they used PP for the second iteration, even though the task was not very well suited for splitting work, because they had a good impression of PP.

All in all, we were able to find tendencies that weakly support the results achieved in earlier studies on PP. Moreover, we were able to find them in a more realistic setting (i.e., for larger teams and a larger project). However, our experiences show that we need to improve and repeat our study in the future. We also hope that others will conduct case studies on PP to increase our knowledge about the applicability of this technique.

References

[1] Abrahamsson, P., Salo, O., Ronkainen, J., and Warsta, J.: Agile Software Development Methods – Review and Analysis. VTT Publications 478, 2002. Online at http://www.inf.vtt.fi/pdf/
[2] Beck, K.: Extreme Programming Explained: Embrace Change. Addison-Wesley, 1999.
[3] Cockburn, A. and Williams, L.: The Costs and Benefits of Pair Programming. Online at http://collaboration.csc.ncsu.edu/laurie/research.htm
[4] Johnson, P.: You Can't Even Ask Them to Push a Button: Toward Ubiquitous, Developer-Centric, Empirical Software Engineering. NSF Workshop for New Visions for Software Design and Productivity, December 2001. Online at http://csdl.ics.hawaii.edu/Publications/categoryHackystat.html
[5] Nosek, J. T.: The Case for Collaborative Programming. Communications of the ACM, Vol. 41, No. 3, 1998, pp. 105–108.
[6] Schlemmer, M.: "Eine Fallstudie zum Vergleich von Pairprogramming mit herkömmlicher Programmierung". Project Thesis (in German), Universität Kaiserslautern, 2002.
[7] Williams, L. and Kessler, R.: All I Really Need to Know about Pair Programming I Learned in Kindergarten. Communications of the ACM, May 2000.
[8] Williams, L. and Kessler, R.: The Effects of "Pair-Pressure" and "Pair-Learning" on Software Engineering Education. CSEET 2000. Online at http://collaboration.csc.ncsu.edu/laurie/research.htm