An Experimental Investigation of Formality in UML-based Development

Carleton University, Technical Report SCE 03-22, Version 2

January 2005

An Experimental Investigation of Formality in UML-based Development L. C. Briand, §Y. Labiche, ¶M. Di Penta, §H. D. Yan-Bondoc

§

§

Software Quality Engineering Laboratory Department of Systems and Computer Engineering, Carleton University 1125 Colonel By Drive, Ottawa, ON K1S5B6, Canada {briand, labiche}@sce.carleton.ca ¶

RCOST - Research Centre on Software Technology Department of Engineering, University of Sannio Piazza Roma, I-82100 Benevento, Italy [email protected]

Abstract The Object Constraint Language (OCL) was introduced as part of the Unified Modeling Language (UML). Its main purpose is to make UML models more precise and unambiguous by providing a constraint language describing constraints that the UML diagrams alone do not convey, including class invariants, operation contracts, and statechart guard conditions. There is an ongoing debate regarding the usefulness of using OCL in UML-based development, questioning whether the additional effort and formality is worth the benefit. It is argued that natural language may be sufficient, and using OCL may not bring any tangible benefits. This debate is in fact similar to the discussion about the effectiveness of formal methods in software engineering, but in a much more specific context. This paper presents the results of two controlled experiments that investigate the impact of using OCL on three software engineering activities using UML analysis models: detection of model defects through inspections, comprehension of the system logic and functionality, and impact analysis of changes. The results show that, once past an initial learning curve, significant benefits can be obtained by using OCL in combination with UML analysis diagrams to form a precise UML analysis model. But this result is however conditioned on providing substantial, thorough training to the experiment participants.

1


January 2005

1 INTRODUCTION The Object Constraint Language (OCL) [20, 26, 27] was proposed as a way to bring additional precision to system analysis and/or design models described using the Unified Modeling Language (UML). OCL is currently part of the UML standard [4, 21]. However, a number of authors recommend against using OCL [12, 18]1 or leave it out of their proposed methodologies [13]. Some recommend using OCL only during low-level design [8]. Furthermore, very few organizations using UML make use of OCL. Though it is clear that OCL brings precision to UML models and offers a number of potential benefits [7, 10, 17], the question comes down to assessing whether the additional effort and formality associated with OCL bring any tangible benefits in practice. This question is akin to the old, on-going debate in software engineering regarding the degree of formality required in the early phases of software development to develop high-quality software. There is anecdotal evidence that using some kind of formal approach in the early phases of software development brings benefits [14] (e.g., fewer overall effort, fewer faults produced), but there are few rigorous empirical evaluations of those benefits and the factors that influence them. For instance, benefits may be simply due to “the nature of the people who willingly and enthusiastically apply formal methods” [2]. However, two empirical studies have been reported which are of particular interest in our context. In the first one, Pfleeger and Hatton performed a case study on the use of Finite State Machines, VDM and Communicating Sequential Processes [24]. The main results suggest that using formal specifications can lead to code that is relatively simple and easy to test. The authors, however, also report that their results are far from conclusive due to a lack of control and incomplete data as the study is a post-mortem analysis. Sobel and Clarkson performed a quasi experiment2 on the use of a formal method based on first-order logic [25]. The main conclusion is that the use of formal analysis (pre- and post-conditions in first-order logic) during software development produces better, 1

“Unless there is a compelling practical reason to require people to learn and use the OCL, keep things simple and use natural language.” [18] “Unless you have readers [e.g., designers] who are comfortable with predicate calculus, I’d suggest using natural language.” [12] 2 A quasi-experiment is an experiment which is unable to control potential factors that may influence the results, for instance because of a lack of randomization [28]. 2


January 2005

less erroneous programs. However, in [3], it is pointed out that this experiment presents severe threats to validity (e.g., lack of randomization, lack of control) and therefore needs to be replicated with a different design. In particular, the experiment did not exclude the possibility that the benefits may be due simply to the nature of the people who volunteered to learn the formal treatment. This paper reports on a series of two controlled experiments performed in a university setting. Their purpose was to understand whether the OCL has an impact on three software engineering activities, that is, (1) the detection of defects in UML models, (2) the comprehension of a system’s functionality, behavior, and structure based on UML models, and (3) the maintenance of UML documents, with a particular focus on change impact analysis. The subjects are 4th year computer and software engineering students who have been carefully trained in UML-based software development over several courses. As further discussed below, the choice of performing a controlled experiment was made to control other extraneous factors which could have affected the results (e.g., students’ ability) and to ensure that the subjects were all adequately trained (i.e., have sufficient knowledge of UML analysis and OCL). Though effort was a constant here—each student spent roughly the same time to perform the activities—our goal was to determine whether OCL would make a practically significant difference. Results showed that the use of OCL, combined with UML, offers significant benefits, in terms of defect detection, comprehension, and maintenance of UML analysis documents. However, the significant benefits are obtained only after a certain learning curve is overcome. This paper is structured as follows: Section 2 provides details of the design of the experiment, including the experiment definition, context selection, hypotheses formulation and instrumentation. Section 3 reports on the results and provides plausible interpretations. Section 4 discusses threats to validity. Conclusions in terms of practical significance of the results and future work are reported in Section 5.

3


January 2005

2 EXPERIMENT PLANNING Two experiment trials have successively been conducted and are referred to as Experiment I and Experiment II, respectively, in the rest of this article. Sections 2.1 to 2.7 describe Experiment I, following the template provided in [28], a well-known textbook on software engineering experimentation. Experiment II is a replication of Experiment I. Although some changes in the design of the experiment are performed, most of the descriptions in Sections 2.1 to 2.7 apply also to Experiment II, and differences between Experiment I and Experiment II are contrasted in Section 2.8.

2.1 Experiment Definition The purpose of the experiment is to evaluate the impact of using OCL during object-oriented analysis [8] (i.e., when class diagrams, statecharts, sequence diagrams, etc. are first produced), on the effectiveness of three typical software engineering activities3: 1

Understanding the analysis document: Different people have to understand analysis models at different stages of software development: During the analysis phase, different analysts have to produce a common, coherent, set of UML diagrams; During system design, once analysis has been completed [8], designers (likely to be different from the analysts) have to work from the analysis document. If OCL expressions help engineers to understand the analysis, then using OCL may save the effort required to clarify issues with analysts. Furthermore, errors discovered during analysis, and/or misunderstanding of the analysis during the design phase (possibly leading to wrong design choices), may be very expensive if these are not caught at an early stage of development.

2

Modifying the analysis document: During software development, it is common for system requirements to change and for faults to be corrected. Each type of change may require that the UML model be changed and a small change may lead to many other related changes. For

3

Many other software engineering activities could have been considered. But those three were selected as they correspond to typical expectations from people using UML models. The higher level of abstraction of models should help them understand a system, modify it, and the model should help avoid introducing defects in early stages of development. 4


January 2005

example, a change in a class may lead to changing the state definitions of this class and other classes or the pre- / post-conditions of operations. If the presence of OCL expressions helps to point out the impact of a change4 easily, correctly, and quickly, then the effort of devising OCL expressions would be more justified. 3

Detecting defects in the analysis document: Defects are likely to exist in a UML model. They could be due for instance to a misunderstanding of the system requirements or miscommunication among team members. Usually, models are subject to inspection, to improve their correctness and completeness. If the OCL expressions were shown to help engineers to detect more defects in the models before they are being used in subsequent steps, the effort of defining OCL expressions during analysis would be further justified.

According to standard experiment design terminology [28], our experiment has one independent variable or factor, denoted as Method, with two treatments: OCL and no_OCL, corresponding to whether OCL expressions are used in UML analysis models. The experiment has three dependent variables, namely Comprehension (C), Maintainability (M) and Defect Detection (D), which are defined to measure the effectiveness of the three above-mentioned software engineering activities. Determining the effect of the factor Method on these three dependent variables is the objective of the experiments reported here.

2.2 Context selection The context of the experiment is a fourth year Computer and Software Engineering course at Carleton University, Ottawa, Canada. The students have all been trained in UML-based object-oriented software development in at least three previous courses, with an increasing focus on software modeling. The focus of this last course is to make them use what was learnt in the context of a well-defined development process, including analysis and design phases. Experimental tasks are performed on two system analysis documents, related to two real world applications, namely a Cab Distribution (CD) system and a Video Store (VS) system. Each system was 4

Both in the UML diagrams and the code. 5


January 2005

selected because it has an adequate level of complexity (see Figure 1) and represents a familiar application domain5. Each analysis document (CD or VS) is approximately forty pages long, and they are somewhat comparable in terms of complexity and size, as illustrated in Figure 1. Each analysis document entails: a partial use case diagram along with use case descriptions following standard templates [8]; sequence diagrams for a subset of the use cases in the use case diagram; a couple of statecharts along with a textual description of their states and transitions; a class diagram including every model element (e.g., operation) used in sequence diagrams and statecharts; and a data dictionary reporting on classes, attributes, operations, and class associations. For the versions of the analysis documents containing OCL constraints, OCL was used to document the following aspects of the models, which are typically documented in natural language: operation preconditions and postconditions (Figure 2) and class invariants (in the data dictionary), guard conditions in statecharts, path conditions in sequence diagrams.. Figure 1. Case study characteristics # of actors # of use cases # of classes # of classes with inheritance # of inheritance depth # of associations # of inheritance relationship # of operations

CD 7 10 20 1 1 5 2 55

VS 8 12 18 0 0 8 0 66

# of attributes # of states # of state transitions # messages # of OCL clauses # of pages (with OCL) # of pages (without OCL)

5

CD 23 9 16 48 174 41 37

VS 40 10 15 39 291 47 39

Note that these models have little or no inheritance. The role of inheritance in this context will need investigation. However, our conjecture is that inheritance will increase even further the need for using OCL contracts, for instance to check the conformance of hierarchies to the Liskov Substitution Principle [19]. 6


January 2005

Figure 2. Example of operation description cancel() Description: Cancels the current reservation. The reservation object will transit to state Cancelled from states Pending or Outstanding (see description of state invariants) Parameter-List: Return-List: Pre: self.state=#Pending or self.state=#Outstanding Post: self->whenCancelled=TimePoint.currentTime() and self.state=#Cancelled and self.copy->isEmpty and if self.copy@pre->notEmpty then [email protected]=#ForRent endif

2.3 Hypothesis formulation The following null hypothesis (H0) is formulated for testing the effect of Method on each dependent variable: There is no difference in the subjects’ Comprehension (C), Maintenance (M), and Defect Detection (D) effectiveness while working on UML analysis documents using or not using OCL. The alternative hypothesis (Ha) is that using OCL improves effectiveness for all three dependent variables. The formal definitions are provided in Table 1: The rows present the three dependent variables; the columns summarize the corresponding null and the alternative hypotheses. Table 1. Experiment hypotheses Dependent variable Comprehension Maintenance Defect Detection

Null hypothesis H0: C(OCL) = C(no_OCL) H0: M(OCL) = M(no_OCL) H0: D(OCL) = D(no_OCL)

Alternative hypothesis Ha: C(OCL) > C(no_OCL) Ha: M(OCL) > M(no_OCL) Ha: D(OCL) > D(no_OCL)

2.4 Selection of subjects The subjects are the students registered in the last, most advanced software engineering course in their four-year Bachelor program, as it is desired that the subjects possess a reasonable level of technical maturity and knowledge for UML-based, object-oriented software development. The students are familiar with concepts such as pre- and post-conditions, state invariants, and class invariants. There is no need to screen students further as variations in ability are also present in industrial settings. As most

7


January 2005

of these students will actually become professionally registered engineers, they represent the population of new graduates entering the software engineering profession. The experiment is part of a series of compulsory laboratory exercises. The students are told that they are not graded on performance but that they are expected to perform their tasks individually in a professional manner to obtain the points assigned to the laboratory.

2.5 Experiment Design This section carefully presents the design choices we made to optimize the validity of the experiment. Section 2.5.1 defines the tasks performed and time allocated; Section 2.5.2 discusses other undesirable factors, whose effects may confound with the effect of using OCL; Section 2.5.3 describes the experiment design and how the extraneous factors are controlled.

2.5.1 Experimental tasks and time allocation Three tasks are defined to study the dependent variables. The Defect Detection task is formulated to study dependent variable Defect Detection: Thirty defects are seeded in each analysis document, and the subjects are asked to find them all during a period of 150 minutes without a priori knowing the number of seeded defects. The subjects’ performance for this task is measured as the percentage of seeded defects detected. The Comprehension task is formulated to study dependent variable Comprehension: The subjects have to answer to twenty multiple choice questions for each system, during a period of 90 minutes. The subjects’ performance for this task is measured as the percentage of correctly answered comprehension questions. The Maintenance task is formulated to study dependent variable Maintenance: A list of system changes is prescribed for each system, affecting 71 and 60 model elements in the CD and VS systems, respectively. The subjects are asked to identify all the model elements affected by the system changes during a period of 60 minutes, without knowing how many model elements are indeed affected. The subjects’ performance for this task is measured as the percentage of affected model elements correctly identified. As the subjects perform the three tasks on the same system, Defect Detection has to be performed first: The subjects are first given an erroneous document to perform the Defect Detection task, and then a 8


January 2005

correct document to answer comprehension and maintenance questions. Moreover, since the Comprehension task is believed to be simpler than the Maintenance task6 (the latter task also requires understanding the systems), Comprehension is performed before Maintenance. Due to time limitations imposed on each laboratory, the Defect Detection task is completed in one laboratory, while the Comprehension and Maintenance tasks are completed in a subsequent laboratory, with the Comprehension task being performed before the Maintenance task. More time is allocated to the Comprehension task (90 minutes) than to the Maintenance task (60 minutes), as it is assumed that a thorough understanding of each system would help the subjects in performing the subsequent Maintenance task.

2.5.2 Other factors to be controlled Method is the only factor of interest in this experiment. However, other factors may also impact the subjects’ performance in an undesirable way and their effect may be confounded with Method’s effect. Extraneous factors need to be controlled so that only the effect due to Method, if there is any, is discernable. 1

Academic background: Some subjects may come from a program, such as Electrical Engineering, where the software development training received may not have been as intensive compared to that received by subjects coming from a Software/Computer Engineering program. On the other hand, students in the Software Engineering program might be more receptive to the tasks than students in the other two programs.

2

OCL experience: Some subjects may have learned OCL prior to this course, while others may not have.

3

Ability: The subjects may have different levels of understanding of UML, OCL, and objectoriented system analysis.

6

Results (Section 3.4) confirm this intuition. 9


4

January 2005

Order of Method: As indicated below, subjects apply one treatment (OCL, no-OCL) for a system and then the other treatment for the other system (e.g., CD with OCL and then VS without OCL). The order in which the subjects are exposed to OCL may impact the observable effect of OCL as learning effects may be confounded with Method.

5

System: The subjects’ performance may depend on their comprehension of the application domain of each system. Further, system complexity may also have a confounding effect with Method. For example, the effect of OCL may be more or less discernable in a more complex system.

2.5.3 Experiment design Students were assigned to one of four groups, and the tasks they performed over four laboratories (three hours each), as well as the order of the tasks, are summarized in Table 2. The rows represent the four subsequent labs students went through (each lab being a week apart) and the three tasks they performed, whereas the columns are the four groups of students. The table cells show, for each group and lab, which system the students worked on and whether they had OCL constraints for it. Table 2. Experiment design Task Lab 1 (week 1)

Defect Detection

Lab 2 (week 2)

Comprehension, Maintenance

Lab 3 (week 3)

Defect Detection

Lab 4 (week 4)

Comprehension, Maintenance

Group 1 CD (OCL) CD (no_OCL) VS (no_OCL) VS (OCL)

Group 2 CD (no_OCL) CD (OCL) VS (OCL) VS (no_OCL)

Group 3 VS (OCL) VS (no_OCL) CD (no_OCL) CD (OCL)

Group 4 VS (no_OCL) VS (OCL) CD (OCL) CD (no_OCL)

This experiment involves only one factor whose effect is of interest to us: whether or not OCL is used in Analysis documents. We could have used a simple, completely randomized one-factor design, but we wanted to account for individual differences in such a way as to increase the statistical power of our analysis and ensure the design would not create any systematic bias in our results [15]. One way to achieve these goals is to adopt a randomized block design [28] and to account for subjects’ ability 10


January 2005

during data analysis (Section 2.7). Another advantage of proceeding this way is that we can study the interaction between subjects’ ability and OCL. Students were therefore grouped (into “blocks”) according to whether or not they had obtained a minimum grade of B- in the previous course on UML and OCL. Admittedly, other ways to capture ability and form blocks could have been considered, but our focus here was on their UML modeling capability and, as instructors of their previous courses on this topic, we felt it was the best alternative. Subjects were grouped also according to whether they learned OCL in a prerequisite course or in the course (the material and number of hours of lectures used were the same), and according to the undergraduate engineering program they were registered in. Each of the four student groups was then randomly assigned subjects from blocks in nearly identical proportions (Table 3). It was also important for us, in order to facilitate the data analysis, that each of the four groups be of approximately the same size (9 or 10 in this case). Table 3. Subjects’ group assignment for Experiment I

Group 1 Group 2 Group 3 Group 4 Total

OCL Experience Learned prior Learned in to the course the course 5 5 4 5 5 5 5 4 19 19

Undergraduate Program Computer Software System 5 5 7 2 4 6 5 4 21 17

Ability High

Low

5 5 5 5 20

5 4 5 4 18

For each task, each subject worked individually on each of the two systems, using UML+OCL in one case and only UML in the other. Students were also prevented from collaborating and individual work could easily be monitored as tasks were performed in a laboratory setting. The rationale for making subjects work on both systems was (1) to maximize the number of data points (observations) so as to increase statistical power, (2) to ensure that the differences in systems’ complexity would not bias the results (though we tried to select systems of “similar” complexity), and (3) to give the opportunity to every student to learn the same material. However, when asking experimental subjects to perform several times similar tasks (e.g., Defect Detection) and use the same 11


January 2005

techniques (e.g., OCL contracts), we are subject to learning or fatigue effects [15] and we need to ensure they do not confuse the results. This was done in two ways. First, two of the groups used OCL the first time they performed the task whereas the other two groups did so the second time around, thus ensuring learning effects (regarding the tasks, UML, and OCL) would not be confounded with OCL effects. As described in Section 2.7, our data analysis procedure will however account for learning effects and quantify them. Moreover, we used four groups of participants instead of two in order to avoid ordering effects (e.g., learning or fatigue effects) being confounded with system effects. For example, if all participants in group A would have used first the VS system with OCL (and then the CD system without OCL) while all participants in group B would have started with non-OCL documents on the CD system (and then VS with OCL), we could have observed that OCL helps more on the VS system as people in group B had more time to further mature their knowledge of UML and OCL. We would have then been unable to assess learning effects or system effects as they would be entirely confounded. Another result of our design choices is that each group uses each of the two systems twice in a row, sometimes using OCL first or second. One may therefore ask whether this is a threat to the validity of the results. First it is important to observe that when a system is used twice in a row, it is to perform different tasks and the data collected will be part of separate analyses (i.e., for different dependent variables). So if there is a system learning effect, say for Group 1 from Lab 1 to Lab 2, this learning will result in a better Comprehension and Maintenance performance. However, that learning effect will affect all four groups and we can still pool together the four groups’ data for analysis purposes within each of the labs. Furthermore, we expect that system learning effect to be weak as labs take place a week apart and students do not have access to documents between labs. Therefore, we also do not expect them to remember OCL expressions from one week to the other, and certainly not the level of detail where the performance of non-OCL groups would get closer to that of OCL groups. However unlikely, even in the situation where people would remember OCL expressions from one week to the other (e.g., Group 1 from Lab 1 to Lab 2), this would result in decreasing the impact of OCL but could not create an inflated difference between OCL and non-OCL performance scores.

12


January 2005

All the students were monitored by the experimenters as the tasks are performed. They were aware that our goal was to evaluate the impact of using OCL as well as to improve their practical training in UML and OCL, but they were not aware of the exact hypotheses tested or what particular dependent variables were of interest. The subjects were not allowed to communicate with each other during the laboratories, and were required to hand in all experimental materials at the end of each laboratory. Two additional, related issues that can be raised from the chosen experiment design are the possible exchange of information among students between labs and the possibility they might have kept working on the tasks outside the laboratory environment, thus invalidating our assumption that the effort spent on the tasks is nearly constant. This behavior was prevented in several ways. Students were not aware of the detailed plan for the labs and therefore did not know they would have to perform the same tasks as their fellow students from other groups at a later point. Second, as mentioned above, students did not have access to the lab material between labs, whether the system descriptions or questionnaires. As a result any collaboration or continuation of the work between labs was unlikely and difficult.

2.6 Instrumentation The materials used in this experiment comprise: UML analysis documents with or without OCL constraints, and with or without seeded defects (Section 2.6.1), questionnaires for the Comprehension and Maintenance tasks (Section 2.6.2), and a post-lab survey questionnaire (Section 2.6.3). Additionally, to verify if the blocks are appropriate (Section 2.5.3), a pre-lab survey questionnaire is administered to obtain information about the background of the subjects, e.g., their previous UML experience, the undergraduate program to which they belong (Appendix C). Each seeded defect, comprehension question, and maintenance question was carefully selected according to a number of criteria: It had to cover different aspects/parts of the systems, to the largest extent possible, however excluding implementation details; It should not be trivial or nearly impossible to answer based on the available information; It could be answered with or without OCL. In other words, each OCL and non-OCL document contains, in different ways, relevant information to answer questions. If we had introduced question that could be answered only with OCL expressions then the 13


January 2005

experiment would have no point as the results would then become obvious. Such questions are, anyway, difficult to find, and are expected to be rare in practice. Last but not least, the questions had to be relevant, that is had to concern information of interest. In other words, we had to be convinced that those would be questions that software engineers could ask about the system. For example, maintenance changes had to be plausible changes. Standard techniques of phrasing subjective questions and designing survey questionnaires were followed [22], so as to avoid bias and to increase reliability of responses.

2.6.1 The UML analysis documents Two systems are used in this experiment: the Cab Distribution system (CD), and the Video Store system (VS). There are four versions of each of the system documents: with or without OCL expressions, and with or without defects. Each document was carefully reviewed by the authors. Each document is about 40 pages long and, though smaller than industrial analysis/specification documents, they are far from being trivial. To perform the Defect Detection task, 30 defects were seeded in the correct UML document versions, containing or not containing OCL expressions. The subjects were asked to identify all the seeded defects and submit the findings via e-mail. To make the measurement objective and repeatable, the subjects’ performance is measured as the percentage of seeded defects correctly identified. Defects, typically encountered in UML documents, were seeded in class diagrams, data dictionaries, sequence diagrams, and statecharts. Example of these defects are missing/wrong associations in class diagrams, inconsistencies between diagrams and data dictionary, missing/wrong sequences in sequence diagrams, missing/wrong states, transitions guard conditions or actions in statecharts. The distribution of the seeded defects is similar in the two systems, as shown in Appendix A.

2.6.2 Questionnaires for the Comprehension and the Maintenance tasks During the Comprehension and the Maintenance tasks, the correct analysis documents with or without OCL are provided to the subjects, and Comprehension and Maintenance questionnaires are distributed.

14


January 2005

The Comprehension questionnaire comprises 20 multiple-choice questions. Its purpose is to quickly evaluate, in a repeatable and objective way, the extent to which a subject understands the system internal logic and functionality, which is measured as the percentage of all questions correctly answered. Figure 3 shows an example of a multiple choice question used for the VS system. To investigate on how OCL could have helped maintainers, we identified 6 and 5 prescribed changes for CD and VS respectively, and the Maintenance questionnaires focus on impact analysis of these changes. 71 (resp. 60) model elements are impacted in the CD (resp. VS) system by the changes. The subjects are asked to list model elements that would have to undergo change, and their performance is measured as the percentage of the total number of affected model elements that have been correctly identified. Figure 3 shows an example of such a question for the VS system. A complementary measure that is considered but not used is the total number of wrongly identified model elements. Figure 3. Example of Comprehension and Maintenance questions Comprehension Question

Maintenance Question

What is the copy’s state, after it is made available? A. OnHold B. ForRent C. OnHold or ForRent D. Other answer: ____________

The video store company decides to limit the maximum number of unfulfilled reservations that a member can perform in a day. Each member may have a different limit. 1. Which class(es) must be modified? 2. Which modifications (addition, change of attributes, operations, and/or operations’ contracts) must be made in this (these) class(es)?

2.6.3 Post-lab surveys In addition, because we wanted to gain enough insight to strengthen and explain our results, survey questionnaires were distributed after the execution of each Defect Detection, Comprehension and Maintainability task. We asked each student whether he or she had enough time to perform the task, how easy the task was to perform, whether he or she understood the system’s functionalities, or whether the lab instructions were clear and easy to follow. Moreover, for each using OCL, questions were asked about the amount of time used to read OCL expressions, how easy they were to understand, and how helpful they were perceived to be. Most questions are answered on a Likert scales from 1 (Strongly agree) to 5 (Strongly disagree), following recommended templates [22]. An example questionnaire for the Defect Detection task with OCL is shown in Figure 4.

15


January 2005

Figure 4. A survey questionnaire example You are required to rate your level of agreement of the following statements. There are 5 levels of agreement: 1 – Strongly agree 2 – Agree 3 – Not certain 4 – Disagree 5 – Strongly disagree Please mark the corresponding box dark to provide your level of agreement of each statement. 1 2

3

4

5

1. I had enough time to read through the analysis document. 2. I had no problems to find errors in the document. 3. I would have found more errors if there would have been more time. 4. I fully understood what the system is trying to do. 5. I was not sure what type of errors I was looking for. 6. The system analysis was fairly complex. 7. The lab instruction is clear and easy to follow. 8. The OCL expressions in the document were easy to understand. 9. The OCL expression in the document helped me to find errors. 10.How much time did you spend in understanding the OCL expressions during this lab? A. = 25% and =50% and < 75% D. >= 75%

The questionnaires are not part of the experiment proper, but rather gather information about the experiment. The output from the survey questionnaires is just meant to support and/or explain the quantitative results by providing qualitative insight.

2.7 Analysis Procedure A two-step data analysis is performed on subjects’ performance data. The first step accounts for all the observations of both attempts at the tasks (across two labs) for each dependent variable. However, it is also necessary to consider possible ordering effects, which cannot be avoided, and to determine whether they interfere with the overall analysis (e.g., learning effect). In other words, we want to test whether the difference between UML+OCL and UML-Only measurements can be explained by whether OCL is used during the first or the second attempt at a task. If this is the case, the data regarding each laboratory may have to be analyzed independently as we can expect different results. Thus, a second analysis step is performed for each dependent variable for each laboratory. In general, our statistical analysis will assume a level of significance α = 0.05. Assumptions that have to be satisfied by the various analysis techniques we employ will be discussed in the next subsections.

16


January 2005

2.7.1 Step 1: Analyzing data of both attempts at a task 2.7.1.1

All observations analysis

As each subject performs each task twice, using and not using OCL, on each system, the one-tailed paired t-test for independent samples [11] is performed for each task, using all available observations for that task. If the derived p-value ≤ α = 0.05, it can be concluded that using OCL has a significant positive effect on the corresponding dependent variable. Note that we have systematically performed a non-parametric statistical test as well for every hypothesis to be tested. The Wilcoxon test was used as an alternative to the t-test, as the Wilcoxon test is resilient to strong departures from the t-test assumptions. In this article, the results from the Wilcoxon test are in most cases consistent with those of the t-test and are reported only when different. They can be found in the appendices. 2.7.1.2

Testing effect of Order of Method

Half of the subjects apply the no_OCL treatment first and then the OCL treatment (referred to as order N_O), while the other half applies these treatments in the reverse order (referred to as order O_N). If there is a significant effect of Order of Method, it will then be necessary to study separately the effects of Method for each attempt at the tasks. To study the effect of Order of Method, the measurement differences between observations of the subjects when using and not using OCL are calculated, as follows: Diffx = observationx(OCL) – observationx(no_OCL), where x denotes a particular subject. Diff(N_O) denotes the performance difference (Diff) of the subjects who work on the system documents in the order N_O, while Diff(O_N) denotes the performance difference of the subjects who work on the system documents in the order O_N. We also use the symbol µ to denote averages of task performance or differences. A one-tailed t-test for independent samples [11] is performed to test whether Diff(N_O) is larger than Diff(O_N). Diff(N_O) is expected to be larger than Diff(O_N), as the subjects’ ability to use OCL may improve through lectures and training received between laboratories. Therefore: H0: Diff(N_O) = Diff(O_N), Ha: Diff(N_O) > Diff(O_N). A p-value below α = 0.05 would indicate that using OCL during the second attempt brings significantly larger performance differences when using and not using OCL than during the first attempt.

17


2.7.1.3

January 2005

Performance comparison between the first and the second attempt at a task across Ability levels

In order to further understand the effect of Order of Method and learning effects, the one-tailed paired t-test is also performed to compare low- and high-ability subjects’ performance between the first and the second attempt for each specific task. If the subjects’ performance improves over time, we want to determine whether there is a difference between low-ability and high-ability subjects. A p-value below α = 0.05 for a particular Ability level would indicate that each subject performs significantly better during the second attempt than during the first.

2.7.2 Step 2: Data analysis per laboratory When significant ordering effects take place, it is necessary to look into the data one laboratory at a time. To study the effect of Method in each laboratory, a one-tailed t-test for independent samples [11] for the Method factor would suffice. However, we expect student ability to have a strong effect on the task results. Further, Method may have a different effect (interaction) at different levels of subjects’ ability. Though we control for Ability in our design, in order to refine our data analysis, we can thus use a two-way Analysis of Variance (ANOVA) [11, 16] using Ability as a factor as well. Likewise, though we try to minimize its effect by using comparable systems, a three-way ANOVA that includes the System variable could be considered as well. Unfortunately, in the first experiment trial, the available sample size does not permit three-way ANOVA. Instead, two-way ANOVA is performed for Method & Ability, and Method & System, respectively. A three-way ANOVA is performed for Method, Ability, and System in the second experiment trial, as the sample size is this time significantly larger. The reason why two/three way ANOVA is better is two-fold [11]. First, ANOVA allows us to investigate interaction effects between Ability and the use of OCL (Method). Perhaps OCL has a different effect on students with different ability. Second, by introducing Ability, the variation it explains in the data is removed and the analysis of the effect of OCL becomes as a result statistically more powerful, e.g., we are more likely to see any significant effect if there is one.

18


January 2005

When looking at ANOVA result tables, we want to determine whether sample means in the groups formed by Ability and Method categories (levels) are significantly different, i.e., that the independent variables have a significant impact on the dependent variables (Defect Detection, Comprehension, Maintainability) and that the variations in means could not have been expected by chance alone. This is typically captured by the results of a F-test [11] that is yielding a p-value which captures the probability that we could obtain the results we observe if the populations from which the group samples are drawn had equal means (e.g., in terms of Defect Detection). Another standard statistic measures the strength of the relationship between each of the independent variables and the dependent variable. It is denoted as E2 and computes the relative improvement in our ability to predict the value of the dependent variable when we know what group an observation belongs to [16]—it is in many ways similar to R2 in regression analysis. Examples of the above statistics are provided in Appendix E and are used throughout Section 3. The ANOVA procedure is based on a number of assumptions. Each combination of factor levels (e.g., OCL and high-ability) has to be assigned subjects in a random fashion (possibly using blocking as a constraint, as described in Section 2.5.3) and should also have an equal number of observations (subjects). The dependent variable should be, for each level combination, normally distributed and the variance should be equal. Obviously, the last two assumptions are usually not met in practice but ANOVA is known to be very robust to departures from these assumptions [16], especially when the samples sizes are equal or similar for each level combination. If the differences are not too large, unequal sample sizes do not prevent the use of ANOVA, it just makes it more complex to interpret its results: The results may depend on the order in which factors are included in the model. Without further expanding on this, in our experiments the differences in sample sizes exist but are rather small (Appendix E). As expected, departure from normality and equal variance assumptions exist, but a detailed analysis shows that differences in variance are not statistically significant (Appendix F) and the dependent variable distributions do not show extreme outliers or multi-modal distributions, thus exhibiting only mild departures from normality. Moreover, we are in a situation where we are interested only in the effect of one variable (OCL) and the other factors are merely used to refine the analysis. In that case it is recommended to include the important variable last in the model, i.e., the factors are included in the model in order of increasing importance. In our case, Method is therefore 19


January 2005

the last variable to be included. In other words, we test to which extent OCL explains the remaining variation after the effects of Ability and System have been accounted for. Though the differences in sample sizes are very small in our case, we will follow this procedure to be on the safe side. In order to help further the interpretation of results, Descriptive statistics tables and Graph of means diagrams are also provided either in the main text or appendices, in addition to the ANOVA tables. They help assess and visualize the effects of factors and their interactions.

2.7.3 Analyzing the survey data Survey questionnaires are used to detect differences between groups using UML+OCL and UML only (e.g., in terms of their perception of the usefulness of OCL). Such data are used to better explain and support our quantitative results. A simple t-test [11] for independent samples is used to compare groups. This test is one-tailed as we expect OCL to have an effect in a specific direction (e.g., questions are easier to answer and therefore exhibit better scores with OCL). When comparing differences in the central tendencies of questionnaire data between subsequent labs our goal is to monitor for learning effects and explain differences in results that may be observed. Since groups of students are identical when comparing subsequent labs, we use a statistical test for paired samples, the paired t-test [11]. This test is also one-tailed as we expect a learning effect over subsequent laboratories and therefore a performance improvement in terms of questionnaire scores.

2.8 Replicated experiment trial (Experiment II) A second experiment trial, referred to as Experiment II, is conducted, repeating the steps of the first experiment trial as an attempt to confirm its results. The number of subjects is twice that of Experiment I (84 instead of 38), which enables a three-way ANOVA for the Method, Ability and System factors. Due to classroom space constraints, each group has to be further divided into two sub-groups, with each sub-group attending laboratories every two weeks (Table 4). Experiment II covers a time period of eight weeks (instead of 4 weeks for Experiment I). More training was administered before the second experiment trial than in the first trial. This was motivated by our observation during the first trial that subjects needed to be better prepared, despite 20


January 2005

the courses they had already taken and passed so as to minimize the learning effects we had observed during the first experiment. More examples were given to the subjects to familiarize themselves with UML system analysis documents, OCL expressions, and the tasks at hand. It was also observed in Experiment I that, overall, the subjects did not use all the time allocated to perform the Comprehension task (90 minutes), while the subjects needed more time to perform the Maintenance task (60 minutes). Thus 75 minutes are allocated to each of the two tasks in Experiment II. Overall, the grade distributions are similar across the two experiment trials and the same blocking strategy was used for Experiment II, as summarized in Table 4. However, the differences between Experiment I and Experiment II introduce new threats to internal validity that are discussed in Section 4.1. Table 4. Subjects Grouping for Experiment II OCL experience

Group 1

Group 2

Total

Sub group Group 1-1 Group 1-2 Group 1-3 Group 1-4 Group 2-1 Group 2-2 Group 2-3 Group 2-4

Learned prior to the course 1 0 1 0 1 0 1 0 4

Learned in the course 10 10 9 11 10 10 9 11 80

Undergraduate program Computer Software System 10 1 10 0 10 0 10 1 11 0 9 1 10 0 11 0 81 3

Ability

High 5 4 5 5 5 5 4 5 38

Low 6 6 5 6 6 5 6 6 46

3 THE EXPERIMENTAL RESULTS AND ANALYSIS This section presents the results and the analysis for both experiment trials. Sections 3.1, 3.2, and 3.3 respectively provide the analysis results for the dependent variables Defect Detection, Comprehension and Maintenance. In each section, results of Experiment I, Experiment II, and their comparison, are reported. For each dependent variable in an experiment trial, the results from the two-step analysis

21


January 2005

(Section 2.7) are reported. Main results are presented in these sections and complete detailed data can be found in appendices: Appendix E for dependent variables, and Appendix D for survey results. In order to simplify the presentation of the tasks performed during the different laboratories in the different experiment trials, the following convention is used: --. For example, Comprehension-2-I (C-2-I) refers to the Comprehension task performed during Lab 2 in Experiment I. Also, in order to simplify the characterization of the subjects, a number of acronyms are used, as shown in Table 5. Table 5. Types of subjects Type of subjects OCL subjects No_OCL subjects N_O subjects

O_N subjects

Description The subjects who performed a task on the UML+OCL version of system analysis documents The subjects who performed a task on the UML-only version of system analysis documents The subjects who performed a task on the UML-only version of a system analysis document during the first attempt at the task, and on UML+OCL version of the document during the second attempt The subjects who performed a task on the UML+OCL version of a system analysis document during the first attempt at the task, and on UML-only version of the document during the second attempt

3.1 Defect Detection (D) This section reports results related to the Defect Detection task. Section 3.1.1 reports analyses on data from both laboratories and compares the results of Lab 1 and Lab 3. Sections 3.1.2 and 3.1.3 report, for Lab 1 and Lab 3 respectively, the two and three-way ANOVA using the factors Ability, Method, and System.

3.1.1 Analyzing data of both laboratories 3.1.1.1

Analysis of all observations and order effects

In Experiment I, overall, the subjects performed slightly better when OCL is used than when OCL is not used, but the difference is not significant: a t-test does not yield any significant result (Table 6, row

22


January 2005

1). Results are different in Experiment II: the performance of the subjects is significantly better when OCL is used compared to when OCL is not used (Table 6, row 3, µ(OCL) = 9.48%, µ(no_OCL) = 7.76%, where µ is used throughout the report to denote the average values of dependent variables or differences, for particular groups of subjects across experiments). Our most plausible explanation for the difference in results across experiment trials is that subjects in Experiment II received more training in using OCL and in performing the experiment tasks than those of Experiment I. This suggests that substantial training is required for making OCL useful to effectively detect defects in UML models. Table 6. All observations analysis for both experiment trials - D Experiment trial Experiment I

Experiment II

Hypothesis H0: D(OCL) = D(no_OCL) Ha: D(OCL) > D(no_OCL) H0: Diff(N_O) = Diff(O_N) Ha: Diff(N_O) > Diff(O_N) H0: D(OCL) = D(no_OCL) Ha: D(OCL) > D(no_OCL) H0: Diff(N_O) = Diff(O_N) Ha: Diff(N_O) > Diff(O_N)

p-value 0.3 0.002 0.05 0.05

In each experiment trial, Order of Method has a significant effect: the difference between the performance of OCL and no-OCL subjects is significantly larger for the N_O subjects than for the O_N subjects as shown in Table 6 (rows 2 and 4). In Experiment I, the N_O subjects perform better when they apply OCL than when they do not (µ(Diff(N_O)) = 7.95%), while the O_N subjects perform worse (µ(Diff(O_N)) = -6.36%), as illustrated in Table 7. In Experiment II, OCL brings improvement to all subjects though the N_O subjects’ performance improvement (3.53%) is larger than the negligible difference observed for O_N subjects (0.09%). One plausible interpretation is that we have learning effects confounded with the effect of using OCL. Such learning effects are clearly visible as, for all subjects, the average fault detection rates double from Lab 1 to Lab 3 (Table 70). In other words, the N_O subjects’ performance difference between Lab 1 and Lab 3 may be due to the compounded effect of two factors: use of OCL (Method) and

23


January 2005

learning effects. For N_O subjects, learning effects may have entirely overridden the positive effect of using OCL in Experiment I and most of it in Experiment II. There is also evidence of learning effects in the survey data (Table 41), indicating that the subjects are significantly more confident in the type of defects to look for in Lab 3 than in Lab 1 (p-value = 0.03). Also the subjects spend significantly more time in reading and using OCL in Lab 3 than in Lab 1 (Table 42). Table 7 Order effects on OCL usage - D

O_N N_O

3.1.1.2

Lab 1 OCL No_OCL

Lab 3 No_OCL OCL

Experiment I Diff(OCL – no_OCL) -6.36% 7.95%

Experiment II Diff(OCL – no_OCL) 0.09% 3.53%

Comparing Defect Detection between Lab 1 and Lab 3 across Ability levels

Experiment I In Experiment I, while the absolute performance improvement of the subjects is modest (14.58% 7.36% = 7.22%), it corresponds to a substantial relative improvement (98%) and is statistically significant (p-value = 0.002) (Table 70). It is particularly large for low-ability N_O subjects (159%, Table 8) and suggests that, in addition to learning effects, using OCL may particularly help low ability subjects during the second attempt at the task in this experiment trial. The survey data also indicates a positive effect of using OCL for low-ability subjects (Table 43 and Table 44). Table 8. Comparing subjects’ Defect Detection between labs (paired t-test) - D

O_N Experiment I N_O O_N Experiment II N_O

Ability High Low High Low High Low High Low

24

µ(Lab 1) 12.00% 4.43% 8.33% 5.70% 8.97% 5.87% 9.60% 7.30%

µ(Lab 3) 24.00% 6.10% 15.00% 14.77% 7.30% 6.98% 15.50% 8.33%

p-value 0.06 0.32 0.11 0.02 0.82 0.31 0.005 0.29


January 2005

Experiment II In Experiment II, results are different as this is the performance of N_O high-ability subjects that significantly improves over time. It indicates that using OCL may have helped only high-ability subjects in the second attempt at the task after a certain learning curve has been overcome. The survey results suggest that the subjects’ understanding level is enhanced from Lab 1 to Lab 3 in this experiment trial (Table 50). The subjects are much more comfortable with the task and the systems in Lab 3 than in Lab 1: they feel less time pressure, have more confidence of the type of defects to look for, have less problems to find defects, and perceive the system analysis to be less complex. Further analyzing the survey data (Table 53), we found that high-ability subjects consider OCL to be more helpful in Lab 3 than in Lab 1.

3.1.2 Defect Detection ANOVA in Lab 1 3.1.2.1

Experiment I

Recall from Section 3.1.1.1 that, though the OCL subjects perform slightly better than the No_OCL subjects, such a difference is not statistically significant. ANOVA confirms this result as only Ability has a statistically significant effect (p-value = 0.04) on the percentage of defects detected (see Table 9, descriptive statistics are reported in Table 71) and explains 15% of the total variation (E2 = 0.15). There is no significant interaction effect. Table 9. Two-way ANOVA (Method & Ability) – D-1-I Source

DF

Sum of Squares

Mean Square

F Ratio

Prob > F

E2

Ability

1

0.0190

0.0190

4.79

0.04

0.15

Method

1

0.0004

0.0004

0.10

0.75

0.003

Ability*Method

1

0.0042

0.0042

1.07

0.31

0.03

Residual

27

0.1074

0.0039

Total

30

0.1313

0.82

Figure 5 clearly shows that high-ability subjects perform better than low-ability subjects, irrespective of the methods used.

25


January 2005

Figure 5. Graph of means by Method and Ability – D-1-I

Defect Detection-1-I by Ability 30% 20%

OCL

10%

no_OCL

0% High Ability

Low Ability

Performing a two-way ANOVA on Method & System showed that, although the subjects working on the CD system perform slightly better than the subjects working on the VS system, System does not have any significant main effect. Neither the effect of Method nor the interaction between Method and System is found to be significant (Table 72 and Table 73). To conclude, Ability is the only factor that has a significant main effect on Defect Detection in Lab 1 of Experiment I. Neither Method, nor System has any significant main or interaction effect. 3.1.2.2

Experiment II

A t-test shows that there is no difference between using and not using OCL in this laboratory. The three-way ANOVA results (Table 10) show that both Ability (p-value = 0.03) and System (p-value = 0.004) have significant main effects, while Method does not, in Lab 1 of Experiment II. System explains 11% of the total variation, while Ability explains 6%. Descriptive statistics are shown in Table 74. The effect of Ability can be observed in Figure 6: high-ability subjects perform better than lowability subjects for both the CD and VS systems (Figure 6a and Figure 6b). The effect of System is shown in Figure 7: the performance for the CD system is better than that for the VS system, for both high (a) and low (b) ability subjects. There seems to be a mild interaction between Method and System, but it is not significant.

26


January 2005

Table 10. Three-way ANOVA (Method, Ability & System) – D-1-II Source System Ability Method Ability*System Method*System Ability*Method Ability*Method *System Residual Total

DF 1 1 1 1 1 1 1 68 75

Sum of Squares 0.0364 0.0195 0.0003 0.0006 0.0097 0.0018 0.0006 0.2707 0.3390

Mean Square 0.0364 0.0195 0.0003 0.0006 0.0097 0.0018 0.0006 0.0039

F Ratio 9.14 4.89 0.09 0.17 2.44 0.46 0.17

Prob > F 0.004 0.03 0.76 0.68 0.12 0.50 0.68

E2 0.11 0.06 0.001 0.002 0.03 0.006 0.002 0.8

D-1-II by Method & Ability (CD)

D-1-II by Method & Ability (VS)

15%

15%

10%

Score

Score

Figure 6. Graph of Means by Method and Ability – D-1-II

OCL

5%

no_OCL

10%

OCL

5%

no_OCL

0%

0% High

High

Low

Low

Ability

Ability

(a)

(b)

The survey results explain the existence of a significant effect of Ability (Table 45). Compared to lowability subjects, high-ability subjects have less problems to find defects (p-value = 0.02, µ(high-ability) = 2.94, µ(low-ability) = 3.4); and they know better what to look for in finding defects (p-value = 0.003, µ(high-ability) = 3.14, µ(low-ability) = 2.42). Moreover, high-ability subjects spend less time reading OCL expressions than low-ability subjects (p-value = 0.03, µ(high-ability)=1.59, µ(low-ability)=2.09). This suggests that low-ability subjects may have more difficulties to understand OCL compared to high-ability subjects.

27


January 2005

Figure 7. Graph of Means by Method and System – D-1-II D-1-II by Method & System (Low Ability)

D-1-II by Method & System (High Ability)

15%

OCL

10% 5%

Score

Score

15% no_OCL

0% CD

10%

OCL no_OCL

5% 0%

VS

CD

System

VS System

(a)

(b)

The results of the survey questionnaire (Table 46) also support the main effect of System being significant: the subjects feel less time pressure when performing the task on the CD system than on the VS system (p-value = 0.01, µ(CD) = 2.95, µ(VS) = 3.55).

3.1.3 Defect Detection ANOVA in Lab 3 3.1.3.1

Experiment I

A t-test shows that there is no significant difference between using and not using OCL in this laboratory. ANOVA results in Table 11 show that Ability has a significant main effect (p-value = 0.02), explaining 20% of the total variation. Corresponding descriptive statistics are shown in Table 75. Method does not have a significant main effect, but has a significant interaction effect with Ability (p-value = 0.02): This interaction explains 19% of the total variation. Figure 8a shows that using OCL seems to have a negative effect on high-ability subjects, but a positive effect on low-ability subjects. Table 11. Two-way ANOVA (Method & Ability) – D-3-I Source Ability Method Ability*Method Residual Total

DF 1 1 1 20 23

Sum of Squares 0.0485 0.0000 0.0460 0.1543 0.2418

Mean Square 0.0485 0.0000 0.0460 0.0077

28

F Ratio 6.29 0.002 5.97

Prob > F 0.02 0.96 0.02

E2 0.20 0.0001 0.19 0.64


January 2005

Figure 8. Graph of means – D-3-I Defect Detection-3-I by Ability

Defect Detection-3-I by System

30%

30%

20%

OCL

20%

OCL

10%

no_OCL

10%

no_OCL

0%

0%

CD

High Ability Low Ability

(a)

VS

(b)

Table 12 shows that System has a significant main effect (p-value = 0.008), explaining 31% of the total variation. Descriptive statistics are reported in Table 76. Figure 8-(b) shows that the subjects perform better for the CD system than for the VS system, irrespective of the methods used. Method, again, does not seem to have a significant effect. There is no significant interaction between Method and System, which is also apparent from Figure 8-(b). Table 12. Two-way ANOVA (Method & System) – D-3-I Source System Method System*Method Residual Total

DF 1 1 1 20 23

Sum of Squares 0.0737 0.0001 0.0002 0.1677 0.2418

Mean Square 0.0737 0.0001 0.0002 0.0083

F Ratio 8.79 0.01 0.02

Prob > F 0.008 0.90 0.88

E2 0.31 0.0006 0.0009 0.69

To conclude, both Ability and System have significant main effects, while Method does not impact the subjects’ performance. However, Method has a significant interaction effect with Ability in Lab 3 representing 19% of the total variation. High ability subjects seem to fare better without OCL, while low-ability subjects seem to fare better with OCL. One plausible explanation is that OCL may have helped low-ability subjects to find more defects in a mechanical way, through a systematic comparison of OCL expressions and diagrams. For example, by comparing statecharts and class invariants, inconsistencies between valid states can be observed without understanding what they really are. On the other hand, using OCL may have hindered high-ability subjects who have a more intuitive way to perform the task, based on a good understanding of object-oriented modeling, and who may be slowed

29


January 2005

down by trying to systematically analyze OCL expressions. This interaction remains, however, to be further investigated. The survey results of Lab 3 (Appendix D, Table 39) show that low-ability subjects who use OCL actually perceive the systems to be relatively more complex, compared to those low-ability subjects who do not (µ(OCL)=2.14 vs. µ(no_OCL)=3.29, p-value = 0.03). On the other hand, high-ability subjects who use OCL have more confidence regarding the type of defects to look for, compared to those who do not (µ(OCL)=3.83, µ(no_OCL)=2.6, p-value = 0.03). This result is inconsistent with the observed interaction between Method and Ability, and requires further investigation. Results from the survey questionnaire confirm the significant main effect of System (Table 40). In Lab 3, the subjects working on the CD system tend to consider that they have enough time to finish the task, whereas the subjects working on the VS system do not (µ(CD) = 2.38, µ(VS) = 3.39, p-value = 0.03). 3.1.3.2

Experiment II

The t-test shows (p-value = 0.003, µ(OCL) = 11.87%, µ(no_OCL) = 7%) that using OCL has a significant positive effect on Defect Detection. Table 13 shows that Method has a significant main effect (p-value = 0.002), explaining 11% of the total variation. Ability also has a significant main effect (p-value = 0.024), explaining 5% of the total variation (descriptive statistics are reported in Table 77). Table 13. Three-way ANOVA (Method, Ability & System) – D-3-II Source

DF

Sum of Squares

Mean Square

F Ratio

Prob > F

E2

System

1

0.0156

0.0156

3.60

0.06

0.04

Ability

1

0.0230

0.0230

5.31

0.02

0.05

Method

1

0.0469

0.0469

10.80

0.002

0.11

Ability*System

1

0.0049

0.0049

1.14

0.29

0.01

Method*System

1

0.0340

0.0340

7.82

0.007

0.08

Ability*Method

1

0.0194

0.0194

4.48

0.04

0.04

Ability*Method *System

1

0.0005

0.0005

0.11

0.73

0.001

Residual

67

0.2912

0.0043

Total

74

0.4400 30

0.66


January 2005

Figure 9 shows, for each system, that using OCL has a positive effect for high-ability subjects, but not necessarily for low-ability subjects. This interaction between Method and Ability is significant (p-value = 0.04), explaining 4% of the total variation. In the survey results, there are some differences between high and low-ability subjects (Table 48): compared to low-ability subjects, high-ability subjects are more confident when looking for defects (p-value = 0.03); they also consider OCL to be more helpful (p-value(Wilcoxon test) = 0.05) (Table 49). This explains the positive effect observed in high-ability subjects. Figure 10 also shows an interaction between Method and System which is significant (p-value = 0.007), explaining 8% of the total variation (Table 13). Using OCL has a positive effect for the VS system, but not for the CD system. A plausible explanation is that the CD system is much simpler and OCL may be beneficial for Defect Detection only for more complex models such as the one of the VS system. Figure 9. Graph of Means by Method and Ability – D-3-II

20% 15% 10% 5% 0%

D-3-II by Method & Ability (VS)

OCL

Score

Score

D-3-II by Method & Ability (CD)

no_OCL High

20% 15% 10% 5% 0%

OCL no_OCL High

Low

Low

Ability

Ability

(a)

(b)

31


January 2005

Figure 10. Graph of Means by Method and System – D-3-II

20% 15% 10% 5% 0%

D-3-II by Method & System (Low Ability)

Score

Score

D-3-II by Method & System (High Ability)

OCL CD

VS

no_OCL

20% 15% 10% 5% 0%

OCL no_OCL CD

System

VS System

(a)

(b)

3.1.4 Comparing ANOVA Defect Detection Results in Experiment I and Experiment II Sections 3.1.4.1 and 3.1.4.2 provide comparisons of ANOVA results for the two experiment trials for both main and interaction effects within each laboratory, respectively. 3.1.4.1

Main effects

The significant main effects of Experiment I and Experiment II are summarized in Table 14. Ability has significant main effects in each laboratory of each experiment trial for the Defect Detection dependent variable, which indicates that an individual’s ability is key to such a task, as for many other software engineering tasks. The factors which have significant main effects in both D-3-I and D-1-II are: Ability and System. Method has a significant main effect only for D-3-II, explaining 11% of the total variation. Recall that in Experiment II, the subjects were provided more OCL training compared to the subjects of Experiment I. Therefore, a plausible way to interpret these results is perhaps that the subjects of Experiment II have a knowledge of OCL in Lab 1 that is similar to that of the subjects of Experiment I in Lab 3. Then, because the subjects of Experiment II in Lab 3 have a more advanced understanding, OCL starts showing benefits. Furthermore, Method has significant interactions with Ability in both D3-I and D-1-II, as further discussed in Section 3.1.4.2.

32


January 2005

System has a significant main effect for D-3-I and D-1-II, but not for D-1-I or D-3-II. For D-3-II, there exists a significant interaction only between System and Method. It is possible that when UML/OCL knowledge is lacking (in D-1-I), the complexity level of a system does not matter: the subjects are likely to perform poorly on any system. Once more UML/OCL knowledge is gained (in D-3-I), this impacts a simpler system (CD) more than a more complex system (VS). In Experiment II, since the subjects received additional prior training, this phenomenon shows right away during the first attempt at the task (in D-1-II). With even further experience gained, System, by itself, becomes less important (in D-3-II). As a piece of supporting evidence, survey results (Table 52) show that low-ability subjects consider the systems to be less complex in D-3-II than D-1-II (p-value = 0.01, µ(lab1) = 2.32, µ(lab3) = 2.8). Table 14. Comparing Defect Detection in Experiment I and Experiment II

Lab 1

Lab 3

3.1.4.2

Experiment I (Two-way ANOVA) Effect p-value Ability 0.04 (ANOVA with Method) Ability 0.02 (ANOVA with Method) System 0.008 (ANOVA with Method)

E2 0.15

Experiment II (Three-way ANOVA) Effect p-value E2 Ability 0.06 0.03 System 0.004 0.11

0.2

Ability

0.02

0.05

0.31

Method

0.002

0.11

Interaction effects

The significant interaction effects in Experiment I and Experiment II are summarized in Table 15. A significant interaction between Method and Ability is visible for Lab 3 of each experiment trial, and a significant interaction between Method and System is visible only in Lab 3 of Experiment II. This latter interaction may be visible only whenever Method has a significant main effect (Table 14), that is when OCL has a strong enough impact due to training and experience.

33


January 2005

Table 15. Comparing Experiment I and Experiment II – Interaction effects Experiment I (Two-way ANOVA) Effect p-value E2

Lab 3

Interaction of Method and Ability

0.02

Experiment II (Three-way ANOVA) Effect p-value E2 Interaction of Method and 0.04 0.04 Ability 0.19 Interaction of Method and 0.08 0.007 System

There is an important difference between Experiment I and Experiment II for the interaction between Method and Ability in Lab 3. Recall that in Experiment I, using OCL seems to hinder high-ability subjects but helps low-ability subjects; whereas in Experiment II, OCL helps high-ability subjects, but not low-ability subjects. The additional training given to the subjects of Experiment II likely explains the abovementioned difference as well. At first, high-ability subjects may have used their intuition to find the defects, without the help of OCL; whereas low-ability subjects may have used OCL in a systematic way to detect document inconsistencies. When the subjects become much more comfortable with OCL, high-ability subjects may then use it better than low-ability subjects. We see from this discussion a pattern that is starting to emerge: learning effects and subjects’ training have a huge impact on the results and complicate our analysis. Though it is difficult to determine beforehand how much and what type of training is required, and given the fact that learning effects are unavoidable, it is important that such factors be carefully analyzed.

3.2 Comprehension (C) This section reports on results related to the Comprehension task. Section 3.2.1 reports analyses on data from both laboratories and compares results from Lab 2 and Lab 4. Sections 3.2.2 and 3.2.3 report, for Lab 2 and Lab 4 respectively, the two and three-way ANOVA on the factors Ability, Method, and System.

34


January 2005



In Experiment I the subjects perform significantly better when using OCL (p-value = 0.03, µ(OCL) = 49.3%, µ(no_OCL) = 42.9%). The result (Table 16) is confirmed in Experiment II (p-value = 0.0001, µ(OCL) = 49.9%, µ(no_OCL) = 42.4%) and therefore suggests a significant improvement of roughly 7 percent in correctly answering questions. Weaker significance levels are to be expected for Experiment I as the number of observations is smaller but there is nevertheless strong evidence that using OCL brings benefits for the Comprehension task. Table 16. All observations analysis of both experiment trials - C Experiment trial

Experiment I

Experiment II

Hypothesis H0: C(OCL) = C(no_OCL) Ha: C(OCL) > C(no_OCL) H0: Diff(N_O) = Diff(O_N) Ha: Diff(N_O) > Diff(O_N) H0: C(OCL) = C(no_OCL) Ha: C(OCL) > C(no_OCL) H0: Diff(N_O) = Diff(O_N) Ha: Diff(N_O) > Diff(O_N)

p-value 0.03 0.06 0.0001 0.04

For Experiment I, as shown in Table 17, the N_O subjects clearly perform better when using OCL (µ(Diff(N_O)) = 11.4%), while the O_N subjects perform only slightly better (µ(Diff(O_N)) = 1.4%). In Experiment II, once again, the N_O subjects perform better when using OCL and show quantitative results very similar to Experiment I (µ(Diff(N_O)) = 10.8%). It is once again the case that, though O_N subjects perform better when they use OCL (µ(Diff(O_N)) = 4%), the difference is significantly smaller than with N_O subjects. Like for Defect Detection, consistent results across experiments strongly suggest that there are compounded learning and OCL effects for N_O subjects whereas for O_N subjects OCL effects are partly cancelled by learning effects. It is therefore clear that OCL has a positive effect on Comprehension but that it lies below 11% and is probably around 6-7% (difference based on all subjects, in Experiment II: Table 82 & Table 84).

35


January 2005

Table 17 Order effects on OCL usage - C Experiment I Experiment II Lab 2 Lab 4 Diff(OCL – no_OCL) Diff(OCL – no_OCL) O_N OCL No_OCL 1.4% 4% N_O No_OCL OCL 11.4% 10.8%

3.2.1.2

Comparing Comprehension between Lab 2 and Lab 4 across ability levels

Experiment I Looking further into the data, the N_O high-ability subjects’ comprehension improves significantly from Lab 2 to Lab 4 (p-value = 0.003), as shown in Table 18. This suggests that high-ability subjects improve further than low-ability subjects when OCL is used during the second attempt at the task. The survey results (Table 59, Table 61) support this interpretation as instructions appear clearer and comprehension questions are perceived to be easier to answer in Lab 4. Table 18. Comparing subjects’ comprehension between labs (paired t-test)

O_N Experiment I N_O O_N Experiment II N_O

Ability

µ(Lab 2)

µ(Lab 4)

p-value

High Low High Low High Low High Low

46.45% 35.70% 46.45% 45.70% 50.56% 44.30% 42.94% 40.71%

44.30% 35.00% 65.00% 50.00% 44.06% 42.25% 56.77% 49.05%

0.63 0.53 0.003 0.21 0.96 0.68 0.003 0.007

Experiment II Experiment II shows an overall significant improvement from Lab 2 to Lab 4 (p-value = 0.04). This is also supported by the survey results (Table 68 and Table 69) which show that there was less time pressure and less understanding problems in Lab 4. However, Table 18 shows that only the N_O subjects’ comprehension improves significantly from Lab 2 to Lab 4 (p-value(high-ability) = 0.003, pvalue(low-ability) = 0.007). There is no such improvement for the O_N subjects. This confirms that, for the reasons discussed above, a significant improvement is visible only for N_O subjects and in particular high-ability ones. 36


January 2005

It is interesting to note that in Experiment I only the N_O high-ability subjects improve significantly from Lab 2 to Lab 4, whereas in Experiment II, the performance of the N_O subjects of both high and low abilities improve significantly (Table 18). One likely explanation is that low-ability subjects require more training than high-ability subjects to use OCL in an effective manner, something they obtained in Experiment II.

3.2.2 Comprehension ANOVA in Lab 2 3.2.2.1

Experiment I

A t-test shows that there is no difference when using OCL in this laboratory. Similar to Method, Ability and System do not have any significant main or interaction effect on Comprehension for Lab 2. For details see Table 78, Table 79, Table 80 and Table 81. 3.2.2.2

Experiment II

A t-test shows a positive effect (p-value of 0.017, µ(OCL) = 46%, µ(no_OCL) = 41%) when using OCL for the Comprehension task. Moreover, Table 19 shows that only Method has a significant main effect in this laboratory, explaining 7% of the total variation, while Ability and System do not. Descriptive statistics are reported in Table 82. In other words, using OCL has a positive effect on the subjects’ performance, regardless of the subjects’ ability or the system used, as shown in Figure 11 and Figure 12. There is no significant interaction among the factors. Table 19. Three-way ANOVA (Method, Ability & System) – C-2-II Source System Ability Method Ability*System Method*System Ability*Method Ability*Method *System Residual Total

DF 1 1 1 1 1 1 1 69 76

Sum of Squares 0.0416 0.0226 0.0900 0.0162 0.0007 0.0208 0.0174 1.1591 1.3536 37

Mean Square 0.0416 0.0226 0.0900 0.0162 0.0007 0.0208 0.0174 0.0168

F Ratio 2.48 1.35 5.35 0.96 0.04 1.23 1.03

Prob > F 0.11 0.25 0.02 0.33 0.84 0.27 0.31

E2 0.03 0.02 0.07 0.01 0.0005 0.02 0.01 0.86


January 2005

Figure 11. Graph of Means by Method and Ability – C-2-II C-2-II by M ethod & Ability (VS)

C-2-II by M ethod & Ability (CD)

60%

40%

OCL

20%

no_OCL

Score

Score

60%

40%

OCL

20%

no_OCL

0%

0% High

High

Low

Low Ability

Ability

(a)

(b)

Figure 12. Graph of Means by Method and System – C-2-II C-2-II by Method & System (Low Ability)

60% 40% 20% 0%

Score

Score

C-2-II by Method & System (High Ability)

OCL no_OCL CD

VS

60% 40% 20% 0%

OCL no_OCL CD

System

VS

System

(a)

(b)

3.2.3 Comprehension ANOVA in Lab 4 3.2.3.1

Experiment I

A t-test (p-value = 0.002, µ(OCL) = 57.9%, µ(no_OCL) = 39.6%) shows that using OCL has a significant positive effect on Comprehension in this laboratory. Performing a Two-way ANOVA (Table 20 and Table 21), we found that Method, Ability and System have significant main effects (related descriptive statistics are reported in Table 83). As illustrated by Figure 13-(a), using OCL has a positive effect on both high and low-ability subjects, though high-ability subjects always perform better than low-ability subjects. Method explains 27% of the total variation, which is particularly strong and higher than the effect of Ability (12%). Although only N_O high-ability subjects improve significantly from Lab 2 to Lab 4, ANOVA does not show a significant interaction between Method

38


January 2005

and Ability. Performing again a two-way ANOVA, Method and System have significant main effects (p-value = 0.001 and 0.007, respectively), explaining 28% and 18% of the total variation, respectively. Table 20. Two-way ANOVA (Method & Ability) – C-4-I Source

DF

Sum of Squares

Mean Square

F Ratio

Prob > F

E2

Ability

1

0.1032

0.1032

4.98

0.03

0.12

Method

1

0.2232

0.2232

10.77

0.003

0.27

Ability*Method

1

0.0057

0.0057

0.27

0.60

0.007

Residual

24

0.4971

0.0207

Total

27

0.8292

0.6

Table 21. Two-way ANOVA (Method and System) – C-4-I Source

DF

Sum of Squares

Mean Square

F Ratio

Prob > F

E2

System

1

0.1454

0.1454

8.80

0.007

0.18

Method

1

0.2346

0.2346

14.21

0.001

0.28

System*Method

1

0.0581

0.0581

3.52

0.07

0.07

Residual

24

0.3963

0.0165

Total

27

0.8292

0.48

Figure 13. Graph of means – C-4-I Comprehension-4-I by Ability

Comprehension-4-I by System

80%

80%

60%

60%

OCL

40%

OCL no_OCL

40%

no_OCL

20%

20%

0%

0%

High Ability Low Ability

CD

(a)

VS

(b)

Figure 13-(b) shows that using OCL has a positive effect for each system, though subjects always perform better for the CD system than for the VS system. However, ANOVA does not show the interaction between Method and System to be significant (p-value = 0.07). Overall, we can conclude the main effects of all three factors (Method, System and Ability) are significant but that there is no significant interaction effect involving Method.

39


January 2005

The survey questionnaire data (Table 56) also helps to explain why Method has a significant main effect. Compared to the no_OCL subjects, the OCL subjects spend less time in understanding the system (p-value = 0.006, µ(OCL) = 3.87, µ(no_OCL) = 2.86) and consider the Comprehension questions less difficult to answer (p-value = 0.0002, µ(OCL) = 3.5, µ(non-OCL) = 1.92). Survey results (Appendix D, Table 58) also show that the subjects consider OCL easier to understand for the CD system than for the VS system (p-value = 0.001, µ(CD) = 1.88, µ(VS) = 3.33), which confirms the significant main effect of System. 3.2.3.2

Experiment II

A t-test (p-value = 0.01, µ(OCL) = 51%, µ(no_OCL) = 44%) also shows that using OCL has a significant positive effect on the Comprehension dependent variable in this laboratory. A three-way ANOVA (Table 22, Figure 14 and Figure 15) show that all three factors—Method, System, and Ability—have significant main effects (p-value = 0.015, 0.014 and 0.018, respectively): using OCL has positive effect on the subjects’ performance overall; high-ability subjects perform better than lowability subjects; and the subjects perform better for the CD system than for the VS system. Each of them explains about 7% of the total variation. There is no significant interaction among them. Descriptive statistics are reported in Table 84. Table 22. Three-way ANOVA (Method, Ability & System) – C-4-II Source System Ability Method Ability*System Method*System Ability*Method Ability*Method *System Residual Total

DF 1 1 1 1 1 1 1 69 76

Sum of Squares 0.1013 0.0945 0.0999 0.0161 0.0012 0.0244 0.0262 1.1060 1.4577

40

Mean Square 0.1013 0.0945 0.0999 0.0161 0.0012 0.0244 0.0262 0.0160

F Ratio 6.32 5.90 6.23 1.00 0.07 1.52 1.63

Prob > F 0.01 0.02 0.01 0.32 0.78 0.22 0.20

E2 0.07 0.06 0.07 0.01 0.0009 0.02 0.02 0.76


January 2005

Figure 14. Graph of Means by Method and Ability – C-4-II C-4-II by M ethod & Ability (CD)

C-4-II M ethod & Ability (VS) 60%

40%

OCL

20%

no_OCL

Score

Score

60%

0%

40%

OCL

20%

no_OCL

0%

High

Low

High

Ability

Low Ability

(a)

(b)

Figure 15. Graph of Means by Method and System – C-4-II C-4-II by Method & System (High Ability)

C-4-II by Method & System (Low Ability) 60%

40%

OCL

20%

no_OCL

Score

Score

60%

0%

40%

OCL

20%

no_OCL

0% CD

VS

CD

System

VS System

(a)

(b)

The survey results explain all the three factors having significant main effects. Although the OCL subjects feel more time pressure (p-value = 0.02, µ(OCL) = 3.22, µ(no_OCL) = 3.47), they consider the questions easier to answer compared to the no_OCL subjects (p-value = 0.003, µ(OCL) = 2.95, µ(no_OCL) = 2.3) (Table 65). High-ability subjects spend less time reading and using OCL than lowability subject (p-value = 0.02, µ(high) = 1.56, µ(low) = 2.2) and this may explain why Ability is having a significant effect (Table 66). Finally, subjects spend significantly less time in performing the task for the CD system than for the VS system (p-value = 0.05, µ(CD)=3.36, µ(VS)=2.95). The subjects also spend more time using OCL for the VS system than for the CD system (p-value = 0.04, µ(CD) = 1.63, µ(VS) = 2.16) (Table 67). This explains why System is having a significant effect.

41


January 2005

3.2.4 Comparing ANOVA Comprehension Results in Experiment I and Experiment II Sections 3.2.4.1 and 3.2.4.2 provide comparisons of ANOVA results for the two experiment trials for both main and interaction effects within each laboratory, respectively. 3.2.4.1

Main effects

As shown in Table 23, in Experiment I no significant main effects exist in Lab 2, while all three factors have significant main effects in Lab 4. In Experiment II, Method is the only factor that has significant main effect in both Lab 2 and Lab 4, whereas Ability and System significantly impact the subjects’ performance in Lab 4, but not in Lab 2. The results of the two experiment trials are consistent, except that Method does not have a significant main effect in C-2-I, which can be explained again by the weaker skills of subjects in both OCL and UML system analysis. Method explains a large portion of the total variation for Lab 4 of Experiment I (27%). However, it explains only about 7% of the total variation in each laboratory of Experiment II. Table 23. Comparing Comprehension in Experiment I and Experiment II Experiment I (two-way ANOVA) Significant Effect Lab 2

Lab 4

p-value

Method 0.003 (ANOVA with Ability) Ability 0.03 (ANOVA with Method) Method 0.001 (ANOVA with System) System 0.007 (ANOVA with Method)

Experiment II (three-way ANOVA) 2

E

Significant Effect

p-value

E2

-

Method

0.02

0.07

0.27

Method

0.01

0.07

0.12

Ability

0.01

0.06

0.28

System

0.02

0.07

0.18

Similarly, in each experiment trial, Ability and System have significant main effects only during the second attempt at the task. This is possibly due to the subjects’ lack of familiarity with the task during the first attempt, thus leading to poor scores across the board and no visible impact of System and Ability. 42


3.2.4.2

January 2005

Interaction effects

No significant interaction effect is observed in any of the laboratories. For the Comprehension task, using OCL has a similar effect for high and low-ability subjects, and for the simpler and more complex system.

3.3 Maintenance (M) This section reports results related to the Maintenance task. Section 3.3.1 reports analyses on data from both laboratories and compares results between Lab 2 and Lab 4. Sections 3.3.2 and 3.3.3 report, for Lab 2 and Lab 4 respectively, the two and three-way ANOVA on the factors Ability, Method and System.



Table 24 summarizes the results of the hypothesis testing of all observations for both experiment trials for the Maintenance task. First, by looking at row 1 we see there is no difference when using and not using OCL in Experiment I, while using OCL has a significant positive effect on the subjects’ performance in Experiment II, as visible from row 3 (p-value = 0.003, µ(OCL) = 41.37%, µ(no_OCL) = 33.25%). This can be explained by the fact that the subjects of Experiment II have received more prior training in OCL and the tasks than those of Experiment I: in the latter, the subjects are not yet sufficiently prepared to use OCL in maintaining UML system analysis documents.

43


January 2005

Table 24. All observations analysis of both experiment trials - M Experiment trial Experiment I

Experiment II

Hypothesis H0: M(OCL) = M(no_OCL) Ha: M(OCL) > M(no_OCL) H0: Diff(N_O) = Diff(O_N) Ha: Diff(N_O) > Diff(O_N) H0: M(OCL) = M(no_OCL) Ha: M(OCL) > M(no_OCL) H0: Diff(N_O) = Diff(O_N) Ha: Diff(N_O) > Diff(O_N)

p-value 0.16 0.03 0.003 0.26

Regarding the Order of Method, in Experiment I, the N_O subjects perform better when they use OCL (µ(Diff(N_O)) = 13.75%), while the O_N subjects perform worse (µ(Diff(O_N)) = -4.32%), as shown in Table 25. This seems to indicate, once again, that learning effects must have overcome the positive effect of using OCL (Table 25), even more so than for Defect Detection and Comprehension. This could be explained by the fact that the Maintenance task is probably the most difficult of all tasks, as mentioned in Section 2.5.1. In Experiment II, the N_O subjects perform better when they apply OCL (µ(Diff(N_O)) = 9.82%), whereas the O_N subjects also perform better though not as well (µ(Diff(O_N)) = 6.06%). This suggests that the order of using OCL does not have as strong of an effect as in Experiment I and this is likely due to the improved training we provided before Experiment II which resulted into weaker learning effects. Also, since Maintenance is the last task, after working with OCL during the Defect Detection and the Comprehension tasks, the subjects may have by then gained enough experience to be able to use OCL effectively the very first time they perform the Maintenance task in Lab 2. Table 25 Order effects on OCL usage - M

O_N N_O

Lab 2 OCL No_OCL

Lab 4 No_OCL OCL

Experiment I Diff(OCL – no_OCL) -4.32% 13.75%

44

Experiment II Diff(OCL – no_OCL) 6.06% 9.82%


3.3.1.2

January 2005

Comparing Maintenance between Lab 2 and Lab 4 across Ability levels

Experiment I In Experiment I there is a significant overall improvement from Lab 2 to Lab 4 (p-value = 0.03, µ(lab2) = 30%, µ(lab4) = 39%). High ability subjects experience a weak (0.05 < p-value < 0.1) improvement irrespective of the Order of Method: using OCL does not seem to have an effect on highability subjects. On the other hand, the O_N low-ability subjects’ performance seems to deteriorate from Lab 2 to Lab 4, whereas the N_O low-ability subjects improve (Table 26). Though none of the abovementioned differences are statistically significant and remain to be confirmed with larger samples, the results seem to suggest that using OCL might have a positive effect on low-ability subjects. The survey results suggest the existence of both learning effects and a positive effect when using OCL (Table 59). The subjects have less difficulties to answer questions in Lab 4 than in Lab 2 (pvalue=0.01, µ(lab2) = 3.1, µ(lab4) = 2.55). Furthermore (Table 60), the subjects perceive the use of OCL to be more helpful in performing the Maintenance task in Lab 4 than in Lab 2 (p-value=0.04 with µ(lab2)=3.23 and µ(lab4)=2.5). Table 26. Comparing subjects’ maintenance between labs (paired t-test) O_N Experiment I

N_O O_N

Experiment II

N_O

Ability High Low High Low High Low High Low

µ(Lab 2) 28 25.8 33.38 31.1 40.5 42.88 36 27.1

µ(Lab 4) 44.71 17.4 51 40.3 36.69 34.71 41.67 40.48

p-value 0.06 0.89 0.056 0.202 0.721 0.867 0.156 0.003

Experiment II Regarding Experiment II, we found that there is no significant overall performance difference between the two laboratories (Table 26). Only the N_O low-ability subjects have a significant improvement in performance from Lab 2 to Lab 4 (p-value = 0.003), which demonstrates a significant positive effect of

45


January 2005

using OCL for low-ability subjects. For high-ability subjects, a deterioration of performance is visible for the O_N group whereas an improvement is visible for N_O group, thus suggesting also some weaker but positive effect when using OCL. However, none of these differences are statistically significant and therefore need to be further investigated with larger samples. In any case, these results suggest an interaction effect between Method and Ability for the Maintenance task. Based on the survey data, the subjects have significantly less problems to understand the system in Lab 4 than in Lab 2 (p-value = 0.02, µ(lab2) = 2.68, µ(lab4) = 2.43) (Table 68), suggesting the existence of learning effects. Although the subjects spend significantly more time to use OCL in Lab 4 than in Lab 2 (p-value = 0.04, µ(lab2) = 1.62, µ(lab4) = 2.04), they do not find it more useful (p-value = 0.31, µ(lab2) = 2.59, µ(lab4) = 2.47) (Appendix D, Table 69).

3.3.2 Maintenance ANOVA in Lab 2 3.3.2.1

Experiment I

A t-test shows that there is no evidence that using OCL has any effect in this laboratory. A Two-way ANOVA further reveals that neither Method nor Ability has a significant main or interaction effect on the subjects’ performance for the Maintenance task in Lab 2 (Table 85 and Table 86). Table 27 (see descriptive statistics in Table 87) shows that only System has a significant main effect, explaining 26% of the total variation. Figure 16 shows that the subjects perform better for the CD system than for the VS system. There is no significant interaction between Method and System. Table 27. Two-way ANOVA (Method & System) – M-2-I Source System Method System*Method Residual Total

DF 1 1 1 28 31

Sum of Squares 0.1534 0.0141 0.0016 0.4274 0.5992

Mean Square 0.1534 0.0141 0.0016 0.0152

46

F Ratio 10.04 0.92 0.11

Prob > F 0.004 0.34 0.74

E2 0.26 0.02 0.003 0.71


January 2005

Figure 16. Graph of Means – M-2-I

Maintenance-2-I by System 60% 40%

OCL no_OCL

20% 0% CD

3.3.2.2

VS

Experiment II

A t-test shows (p-value=0.003, µ(OCL) = 40%, µ(no_OCL) = 31%) that using OCL has a significant positive effect on Maintenance in Lab 2. This confirms the results, discussed above, that using OCL has positive effect from the very first attempt at the task. A Three-way ANOVA (Table 28) shows that both Method and System have significant main effects in this laboratory. Method and System explain 9% and 39% of the total variation, respectively. (Descriptive statistics are reported in Table 88.) As shown in Table 28, Figure 17 and Figure 18, using OCL has a positive effect on the subjects’ overall performance, and the subjects perform better for the CD system than for the VS system. The survey results provide further evidence by showing that less time is being spent on using OCL for the CD system than for the VS system, thus confirming the former is a simpler model (p-value = 0.02, µ(CD) = 1.33, µ(VS) = 1.89) (Table 64). Although Ability does not have a significant main effect, it interacts with Method (p-value = 0.05) for the CD system. As visible in Figure 17 (a), using OCL benefits low-ability subjects more than high-ability subjects. There is also a significant interaction between System and Method (p-value = 0.004): from Figure 18, we see that low-ability subjects benefit more from using OCL for the CD system than for the VS system, perhaps because they are able to handle only the simpler of the two models.

47


January 2005

Table 28. Three-way ANOVA (Method, Ability & System) – M-2-II Source System Ability Method Ability*System Method*System Ability*Method Ability*Method *System Residual Total

DF 1 1 1 1 1 1 1 69 76

Sum of Squares 0.7158 0.0231 0.1690 0.0240 0.0944 0.0424 0.0392 0.7333 1.8315

Mean Square 0.7158 0.0231 0.1690 0.0240 0.0944 0.0424 0.0392 0.0106

F Ratio 67.35 2.17 15.90 2.26 8.88 3.99 3.68

Prob > F F 0.006 0.03 0.20

E2 0.22 0.13 0.04 0.61

In Table 30, both System and Method have significant main effects (p-value(System) < 0.0001, pvalue(Method) = 0.01), explaining 47% and 11% of the total variation, respectively. The effects can be observed from Figure 19-(b): using OCL has a positive effect on the subjects’ overall performance, and the subjects perform better for the CD system than for the VS system. There is no significant interaction between System and Method. Table 30. Two-way ANOVA (Method and System) – M-4-I Source

DF

Sum of Squares

Mean Square

F Ratio

Prob > F

E2

System

1

0.5469

0.5469

30.87

= 3 months, and = 6 months, and < 12 months >= 12 months 2. Did you have UML knowledge before this course? a) Yes b) No If “yes” to question 3), please answer question 2.1 and 2.2, otherwise jump to question 3. 2.1 Have you ever used UML in analyzing / designing “real-world” software systems? a) Yes b) No 2.2 Which of the following statements is most appropriate to describe your experience of using UML? a) I was introduced to UML in this course for the first time. b) I have learned UML in school, but never applied it to a “real-world” system analysis. c) I have used UML in analyzing “real-world” systems for just a few times. d) I am very experienced in using UML and apply it to analyze “real-world” systems often. e) None of the above, please specify: ___________________________________________ 3. Did you have any knowledge of OCL before this course? a) Yes b) No If “no” to question 3, please skip the rest of the questions, otherwise continue. 3.1 How did you gain the knowledge of OCL? a) I was taught in school. b) I learned it by myself. c) I was taught by co-workers. d) None of the above, please specify ___________________________________________ 3.2 Have you ever used OCL in analyzing a system? a) Never. b) Yes, I have used it in my assignment at school only. c) Yes, I have used it in analyzing “real-world” (commercial) systems. d) None of the above, please specify ________________________________________________________

78


January 2005

C.2 Defect Detection Survey Questionnaire C.2.1 Common questions 1

2

3

4

5

1 2 8. The OCL expressions in the document were easy to understand. 9. The OCL expression in the document heled me to find errors. 10.How much time did you spend in understanding the OCL expressions during this lab? A. =25% and =50% and =75%

3

4

5

1. I had enough time to read through the analysis document. 2. I had no problems to find errors in the document. 3. I would have found more errors if there would have been more time. 4. I fully understood what the system is trying to do. 5. I was not sure what type of errors I was looking for. 6. The system analysis was fairly complex. 7. The lab instruction is clear and easy to follow.

C.2.2 Extra questions for OCL groups

C.3 Comprehension and Maintenance Survey Questionnaire C.3.1 Common questions 1 1. It took me a lot of time to understand the system, before I could answer any questions. 2. I did not have any problems to understand what the system is doing. 3. I had no difficulties to complete all the questions in this module. Comprehension Maintenance 4. I hope there could have been more time for me to finish this module. Comprehension Maintenance 5. Some of the questions of this module were not easy to answer. Comprehension Maintenance 6. The lab instruction is clear and easy to follow.

79

2

3

4

5


January 2005

C.3.2 Extra questions for OCL groups 1

2

3

7. The OCL expressions in the document were easy to understand. 8. The OCL expressions in the document made the system easier to understand 9. The OCL expression in the document helped to clear up some ambiguities 10. The OCL expression in the document helped me to answer the questions in this module. Comprehension Maintenance 11. How much time did you spend in understanding the OCL expressions during this lab? A. =25% and =50% and =75%

D.3.3 Extra questions for Lab 4 (last lab) 1 2 12. Video system is more complex than Cab system 13. I would like to use OCL in my future system analyses. Specify the reasons: __________________________________________________

80

3

4

5

4

5


January 2005

Appendix D Survey Results This appendix presents the survey results. The survey feedback was provided on a Likert scale [22]: 1. Strongly agree 2. Agree 3. Not certain 4. Disagree 5. Strongly disagree Two exceptions are question 10 of the Defect Detection survey, and question 11 of the Comprehension and Maintenance survey. The subjects’ feedback of how much time they spent on OCL was provided on the following scale: 1. < 25% of the lab time 2. 25~50% of the lab time 3. 50~75% of the lab time 4. 75~100% of the lab time Table 38. Defect Detection-1-I by Ability 1 2 3 4 5 6 7 8 9 10

Enough time No problem to find error Found more if more time Understand the system Not know what to look for System analysis is complex Instructions clear OCL easy to understand OCL helped Time spent

High

Low

p-value (t-test)

p-value (Wilcoxon test)

3 3.2 2.4 1.8 3.2 2.4 1.8 3 2.71 1.57

2.94 3.19 2.19 1.88 2.69 2.75 1.88 2.67 3.33 1.56

0.45 0.49 0.32 0.41 0.10 0.18 0.39 0.27 0.06 0.49

0.49 0.47 0.41 0.33 0.1 0.16 0.23 0.33 0.06 0.33

81


January 2005

Table 39. Defect Detection-3-I by Ability & Method High-ability

1 2 3 4 5 6 7 8 9 10


OCL

no_OCL

2.83 2.67 2.67 2.17 3.83 2.5 2

1.8 2.6 3.4 2 2.6 2.2 1.6

3 2.67 2.17

3.29 2.71 2.67

p-value (t test) 0.13 0.45 0.18 0.40 0.03 0.25 0.15 0.33 0.46 0.23

Low-ability p-value (Wilcoxon) 0.11 0.42 0.15 0.34 0.04 0.31 0.17 0.27 0.41 0.23

OCL

no_OCL

2.86 3.29 2.29 2 3.29 2.14 2.14

2.14 2.43 2.29 1.86 3.71 3.29 1.71

p-value p-value (t test) (Wilcoxon) 0.08 0.08 0.08 0.09 0.5 0.47 0.3 0.38 0.18 0.17 0.03 0.03 0.14 0.15

Table 40. Defect Detection-3-I by System 1 2 3 4 5 6 7 8 9 10

Cab 2.38 2.92 2.54 1.85 3.15 2.92 1.69 3.5 3.17 1.67


Video 3.39 3.39 2.11 1.83 2.78 2.33 1.94 2.4 3 1.5

p-value (t test) 0.02 0.10 0.17 0.49 0.18 0.06 0.18 0.02 0.35 0.37

p-value (Wilcoxon) 0.03 0.07 0.1 0.41 0.19 0.05 0.26 0.03 0.34 0.5

Table 41. Comparing Defect Detection-1-I and Defect Detection-3-I 1 2 3 4 5 6 7

Had enough time No problems to find defects More found if more time Understood the system Not sure what type of defects System analysis complex Instruction clear

Lab 1 2.96 3.2 2.24 1.88 2.96 2.48 1.84

Lab 3 2.44 2.76 2.6 2 3.4 2.56 1.88

p-value 0.06 0.09 0.92 0.24 0.03 0.4 0.39

Table 42. Comparing D-1-I and D-3-I – attitudes regarding OCL 8 9 10

OCL easy to understand OCL helped Time spent

Lab 1 2.81 3.06 1.56

Lab 3 3.15 2.69 2.42

82

p-value (t test) 0.2 0.12 0.01

p-value (Wilcoxon) 0.22 0.11 0.02


January 2005

Table 43. Comparing D-1-I and D-3-I by Ability High-ability

Low-ability

Lab 1

Lab 3

p-value

Lab 1

Lab 3

p-value

1 2

Enough time No problem to find error

3.1 3.09

2.36 2.64

0.03 0.14

2.86 3.29

2.5 2.86

0.25 0.20

3 4 5 6 7

Found more if more time Understand the system Not know what to look for System analysis is complex Instructions clear

2.27 1.91 3.36 2.18 1.82

3 2.09 3.27 2.36 1.82

0.03 0.28 0.34 0.29 0.5

2.21 1.86 2.64 2.71 1.86

2.29 1.93 3.5 2.71 1.93

0.42 0.36 0.008 0.5 0.36

Table 44. Comparing D-1-I and D-3-I – attitudes regarding OCL by Ability High-ability

8 9 10


Low-ability

Lab 1

Lab 3

p-value (t test)

p-value (Wilcoxon)

Lab 1

Lab 3

p-value (t test)

p-value (Wilcoxon)

3 2.71 1.57

3 2.67 2.17

0.50 0.46 0.19

0.47 0.41 0.14

2.67 3.33 1.56

3.29 2.71 2.67

0.08 0.06 0.01

0.09 0.06 0.024

Table 45. Defect Detection-1-II by Ability 1 2 3 4 5 6 7 8 9 10


High 3.26 2.94 2.11 2 3.14 2.63 2 3.24 3.31 1.59

Low 3.27 3.4 2.16 2.09 2.42 2.33 2.27 3.17 2.96 2.09

p-value (t-test) 0.49 0.02 0.43 0.33 0.003 0.08 0.12 0.42 0.19 0.03

p-value (Wilcoxon test) 0.45 0.01 0.39 0.38 0.003 0.06 0.22 0.39 0.17 0.02

Table 46. Defect Detection-1-II by System 1 2 3 4 5 6 7 8 9 10


CD

VS

p-value (t-test)

p-value (Wilcoxon test)

2.95 3.13 2.29 1.95 2.71 2.61 2.03 3.11 2.79 1.95

3.55 3.26 2 2.14 2.76 2.33 2.26 3.29 3.4 1.81

0.01 0.28 0.12 0.16 0.42 0.10 0.15 0.29 0.06 0.31

0.01 0.29 0.14 0.08 0.46 0.10 0.12 0.27 0.06 0.28

83


January 2005

Table 47 Defect Detection-3-II by Method 1 2 3 4 5 6 7

Enough time No problem to find error Found more if more time Understand the system Not know what to look for System analysis is complex Instructions clear

OCL 2.08 2.74 3 2.03 2.95 2.67 1.97

no_OCL 2.1 2.67 3.12 1.95 3.02 2.9 2.07

p-value (t test) 0.47 0.37 0.31 0.34 0.39 0.15 0.34

p-value (Wilcoxon) 0.43 0.34 0.29 0.4 0.34 0.17 0.3

Table 48 Defect Detection-3-II by Ability 1 2 3 4 5 6 7 8 9 10


High-ability 2.32 2.68 3.08 1.92 3.24 2.78 2 2.89 2.56 1.83

Low-ability 1.89 2.73 3.05 2.05 2.77 2.8 2.05 2.9 3.05 2

p-value (t test) p-value (Wilcoxon) 0.03 0.03 0.41 0.46 0.44 0.45 0.24 0.3 0.03 0.04 0.48 0.45 0.42 0.38 0.49 0.43 0.07 0.05 0.29 0.27

Table 49 Defect Detection-3-II – attitudes regarding OCL by Ability 8 9 10


High 2.89 2.56 1.83

Low 2.9 3.05 2

p-value (t test) 0.49 0.07 0.29

p-value (Wilcoxon) 0.43 0.05 0.27

Table 50 Comparing Defect Detection-1-II and Defect Detection-3-II 1 2 3 4 5 6 7

Lab 1 3.27 3.2 2.13 2.06 2.75 2.46 2.16

Had enough time No problems to find defects More found if more time Understood the system Not sure what type of defects System analysis complex Instruction clear

Lab 3 2.08 2.68 3.1 1.99 3.02 2.81 2.04

p-value

An Experimental Investigation of Formality in UML-based Development

An Experimental Investigation of Formality in UML-based Development

Suggest Documents

An experimental investigation of improvised

An Experimental Investigation of Turbulent

An Experimental Investigation of

Experimental investigation of helium migration in an

An Experimental Investigation - Semantic Scholar

An Experimental Investigation on Inclined

Preliminary Experimental Investigation and An

An Experimental Investigation - Google Sites

An Experimental Investigation to Facilitate an

An experimental-differential investigation of ... - Psychologie aktuell

an experimental investigation of the physical ... - CiteSeerX

An Experimental Investigation of Showerhead Film ... - ScienceDirect

An experimental investigation of risky choices

AN EXPERIMENTAL INVESTIGATION OF VISUAL ENHANCEMENTS ...

An Experimental and Numerical Investigation of ... - CiteSeerX

Investigation of an experimental ejector refrigeration machine ...

An experimental investigation into perceptions of ...

An experimental and modeling investigation of

Experimental Investigation of an Aluminium Fuelled

An experimental and theoretical investigation of

an experimental investigation of gray water treatability

An Experimental Investigation of Factors Affecting

Experimental And Numerical Investigation Of An

An experimental investigation of two different