Manual Test Case Derivation from UML Activity Diagrams and State Machines: a Controlled Experiment

Michael Felderer
University of Innsbruck
Technikerstr. 21a, A-6020 Innsbruck, Austria
[email protected]

Andrea Herrmann
Herrmann & Ehrlich
Daimlerstr. 121, D-70372 Stuttgart, Germany
[email protected]

ABSTRACT Context: It is a difficult and challenging task to fully automate model-based testing because this demands complete and unambiguous system models as input. Therefore, in practice, test cases, especially on the system level, are still derived manually from behavioral models like UML activity diagrams or state machines. This kind of manual test case derivation is error-prone, and knowing these errors makes it possible to provide guidelines to reduce them. Objective: The objective of the study presented in this paper is therefore to examine which errors are possible and actually made when manually deriving test cases from UML activity diagrams or state machines, and whether there are differences between these diagram types. Method: We investigate the errors made when deriving test cases manually in a controlled student experiment. The experiment was performed and internally replicated with overall 84 participants divided into three groups at two institutions. Results: As a result of our experiment, we provide a taxonomy of the errors made and their frequencies. In addition, our experiment provides evidence that activity diagrams have a higher perceived comprehensibility but also a higher error-proneness than state machines with regard to manual test case derivation. This information helps to develop guidelines for manual test case derivation from UML activity diagrams and state machines. Conclusion: Most errors observed were due to missing test steps, conditions or results, or to content written into the wrong field. As activity diagrams have a higher perceived comprehensibility but are also more error-prone than state machines, both diagram types are useful for manual test case derivation. Their application depends on the context and should be complemented with clear rules on how to derive test cases.

KEYWORDS System testing; test design; model-based testing; controlled experiment.

1. INTRODUCTION Model-based testing is a variant of system testing that relies on explicit models that encode the intended behavior of a system under test (SUT) [1]. An advantage of model-based testing is that it allows tests to be linked directly to the SUT's requirements, which improves the readability, understandability, and maintainability of tests. Furthermore, it helps to ensure a repeatable basis for testing and to provide good coverage of all the behaviors of the SUT [2]. For the derivation of test cases, model-based testing relies on behavior models of the system, which in practice are often UML activity diagrams or state machines [3], as these two diagram types are most frequently used to model system requirements [4]. Therefore, the derived tests are system tests concerned with testing an entire system based on its specification [5]. As indicated by an empirical case study [6], both automatically and manually derived model-based test suites have the potential to detect significantly more requirements defects than hand-crafted test suites directly derived from the requirements. The fully automated derivation of test cases from UML activity diagrams or state machines is still difficult and challenging. This is due to the fact that high-quality UML models, which contain all information needed for automatically deriving test cases, are required for this purpose. However, such models are rarely available in practice. In addition, there is empirical evidence [6] that automatically generated model-based test suites do not detect more errors than hand-crafted model-based test suites with the same number of tests. Therefore, in this paper, we investigate the case which is still practically more relevant than automation: a system tester analyzes a UML activity diagram or state machine and derives several test cases manually from it in order to achieve
test coverage. These testers are typically key users or domain experts without in-depth experience and knowledge in testing [7]. Like all complex manual activities, and especially when taking the often missing testing expertise of system testers into account, this kind of test case derivation is error-prone, which impacts the quality of the derived test cases. Knowing the possible errors made when manually deriving test cases from UML activity diagrams or state machines, as well as the comprehensibility and suitability of these diagrams for test case derivation, is valuable for providing guidelines for systematically deriving test cases while avoiding these errors. The objective of the study presented in this paper is therefore to examine which errors are possible and actually made when manually deriving test cases from UML activity diagrams or state machines, and whether there are differences between these diagram types with regard to manual test case derivation. Both serve as system models from a behavioral perspective and are in practice often used as alternatives [4]. Knowing which of the two better serves the purpose of test case derivation can therefore be useful in practice when one has to select the right UML model type for system modeling while taking testing aspects into account. In industry, it is common to have system tests executed by test personnel with some domain knowledge but only little experience in systematic testing [7], for instance by key users. The required skills in test design are then often provided in short trainings [7]. This situation is similar to university classes if domains familiar to the students and suitable training are provided. We therefore investigate the difference between the two diagram types, i.e., UML activity diagrams and state machines, in a controlled experiment with overall 84 students divided into three groups at two institutions, i.e., the experiment was performed with two groups at the Duale Hochschule Baden-Württemberg in Karlsruhe (Germany) and its internal replication [8] by the same researchers was performed at the University of Innsbruck (Austria). From the results, we derive a taxonomy of errors, identify the most frequent errors for each diagram type, as well as differences between the two diagram types with regard to perceived comprehensibility and errors made. As a result, we provide a taxonomy of the errors made and their frequencies. In addition, our experiment and its internal replication provide evidence that activity diagrams are perceived to be more comprehensible but are also more error-prone with regard to manual test case derivation. This paper follows established guidelines for reporting experiments in software engineering [9], [10] and is structured as follows. In Section 2, we provide an overview of related work. In Section 3, we present the experiment planning and execution. Then, in Section 4, we present the experiment results and their analysis. In Section 5, we discuss the interpretation of the results and threats to validity. Finally, in Section 6, we outline conclusions and future work.

2. RELATED WORK The manual derivation of test cases from UML models has not been investigated empirically before. However, there are two types of related work: (1) empirical studies on the comprehensibility of UML models as well as on the manual derivation of test cases, discussed in Section 2.1, and (2) methods for semi-automatically deriving test cases from UML models, discussed in Section 2.2. In the remainder of this paper, we sometimes omit the term "UML" when referring to UML activity diagrams or state machines.

2.1 Empirical studies on comprehensibility of UML models and test case derivation The quality attribute of UML models studied most empirically is comprehensibility [11]. It has been investigated from different perspectives. Some approaches focus on layout and visualization aspects of UML models, e.g., [12] or [13]. Others compare the effect of using different UML diagram types, e.g., [14] or [15]. Another line of research considers styles and rigor in UML models, e.g., [16] or [17]. Finally, other studies in the area of modeling style looked into the effect of using stereotypes on model comprehension, e.g., [18], [19], or [20]. Closest to our work are approaches comparing the effect of using different UML diagram types. In two controlled student experiments, Otero and Dolado [15] compare sequence diagrams and state machines. They observe that (1) state machines provide higher semantic comprehension of dynamic modeling in UML when the domain is real-time, whereas sequence diagrams are better in the case of a management information system, and (2) regardless of the domain, a higher semantic comprehension of the UML designs is achieved when the dynamic behavior is modeled using both a sequence diagram and a state machine. In a controlled student experiment, Glezer et al. [14] compare collaboration diagrams to sequence diagrams and indicate that collaboration diagrams are easier to comprehend than sequence diagrams in real-time systems, but that there is no difference in the comprehension of the two diagram types in management information systems. UML activity diagrams and state machines have not been compared in controlled experiments, neither with respect to comprehensibility nor with respect to manual test case derivation. But especially comparing the effect of manual test case derivation from both diagram types is highly relevant, as motivated in the introduction. As mentioned before, several approaches compared different UML diagram types with each other (e.g., [14] or [15]) or even UML diagrams with non-UML diagrams (like [21], where UML class diagrams are compared to Entity Relationship diagrams with the finding that class diagrams are easier to understand). However, activity diagrams and state machines have not been compared so far. In
fact, class diagrams are the UML diagram type investigated most, while only 16.16% of research investigates state machines and 2.02% activity diagrams [22]. Adding another diagram, here a sequence diagram, to a system model can enhance system comprehension [23]. As there are several aspects of model comprehensibility, Aranda et al. [24] propose a framework for the empirical evaluation of model comprehensibility which recommends the following four metrics to measure model comprehensibility: correctness of understanding (i.e., the degree to which persons can answer questions about the representation correctly), time (i.e., the time required to understand the representation), confidence (i.e., the subjective confidence of people regarding their own understanding of the representation), and perceived difficulty (i.e., the subjective judgment of people regarding the ease of obtaining information through the representation). Many empirical studies have investigated testing techniques [25] or compared different defect detection methods like testing and inspection [26]. Juristo et al. [25] summarize 25 years of empirical research on software testing techniques based on more than 20 studies and conclude that collective knowledge on testing techniques is limited. Runeson et al. [26] provide a survey on empirical defect detection studies comparing inspection and testing techniques based on evidence reported in 10 experiments and 2 case studies. Both techniques have low effectiveness; reviewers find only 25 to 50 percent of an artifact's defects using inspection, and testers find 30 to 60 percent using testing. Using test models is one approach claimed to improve effectiveness, although there is hardly any evidence for this. Pretschner et al. [6] present an evaluation of model-based testing based on a case study of an automotive network controller. Both automatically and manually derived model-based test suites detected significantly more requirements errors than hand-crafted test suites that were directly derived from the requirements. The number of detected programming errors did not depend on the use of models. Automatically generated model-based test suites detected as many errors as hand-crafted model-based suites with the same number of tests. Finally, Nugroho and Chaudron [27], [28] investigate the effect of UML modeling, specifically the production of class and sequence diagrams, on the quality of the code, as measured by defect density, and on defect resolution time in the context of an industrial Java system. The authors found that the production of UML class diagrams and sequence diagrams reduces defect density in the code and the time required to fix defects. The findings discussed in this section motivate the usage of models on the system level for manually deriving test cases. As different types of models may have different comprehensibility, comparing the manual test case derivation from UML activity diagrams and state machines is a relevant research problem that has not been addressed before.

2.2 Methods for deriving test cases from UML models Related work which derives test cases from UML activity diagrams or state machines differs from our work either in the form of the input or the output, usually in both. Many authors try to automate the test case derivation from activity diagrams completely or partially. However, to deliver all information needed for this automation, the activity diagram must be complete and supplemented by information that is often not modelled in practice. The input used by other researchers is an activity diagram enhanced by additional information in order to be complete and unambiguous. We, instead, investigate how human testers cope with the ambiguity of activity diagrams. For instance, the TOTEM methodology derives system test cases from UML models enhanced by OCL expressions [5]. In [29], the activity diagram is annotated with stereotypes like "userAction", "SystemResponse", "Include" and "Define", and in [30] with test-specific information like the creation or state changes of objects. Other authors use activity diagrams which do not model user and system activities on the system level, but in which each activity represents a Java method [31], [32], [33]. Offutt and Abdurazik [34] enhance state machines by OCL expressions in order to contain complete information. Weissleder and Sokenou [35] correlate OCL pre-/postconditions of operations and guard conditions of state machines to derive test data input partitions automatically. Some state machines used for test case derivation do not model the system on the system level, but on the class level [36]. In the research literature, the generated test cases are typically derived for automatic execution on a specific platform, and their format differs from our test case format for manual execution (see Section 3.3). For instance, in [34], [37] and [38], a test sequence is described by a sequence of transitions only. Samuel et al. [39] state that a test case is the triplet [I, D, O], where I is the initial state of the system at which the test data is input, D is the test data input to the system and O is the expected output of the system. This means that a test case contains one initial state, a sequence of test data as input and the final state of the system after the test case execution. The output of single steps is neither defined nor checked in this approach. When generating test cases, intermediate graphs are often constructed. For instance, finite state machines are transformed into flow graphs in [36] and [38] as well as into usage models with transition probabilities [37]. Also from activity diagrams, intermediate models are generated, like the sets of directed graphs in [31], [32], [33], [40] and [30], which each represent one test scenario or test case.


3. EXPERIMENT PLANNING AND EXECUTION In this section, we discuss planning and execution of the experiment, which include goals and investigated research questions, participants, experiment tasks and material, variables and hypotheses, design of the experiment, experiment procedure, execution of the experiment, as well as the applied analysis procedure.

3.1 Goals and Research Questions Using the GQM template [9], [10], we can formulate the goal of this experiment as follows: Analyze manual test case derivation from UML system models for the purpose of understanding what errors are made and which differences there are between UML activity diagrams and state machines, with respect to the errors made and the perceived comprehensibility of the diagrams, from the point of view of the test case quality in terms of errors made, in the context of university students manually deriving test cases from diagrams. From this overall goal, we derive the following research questions investigated in this paper.
(RQ1) What types of errors are made when manually deriving test cases from a UML activity diagram or state machine?
(RQ2a) What are the most frequent types of errors made when manually deriving test cases from a UML activity diagram?
(RQ2b) What are the most frequent types of errors made when manually deriving test cases from a UML state machine?
(RQ3) Are there differences between test models described as UML activity diagrams and state machines with regard to perceived comprehensibility or errors made when manually deriving test cases?

3.2 Participants The experiment and its internal replication were performed with students as experimental subjects. The two authors used their access to three groups of students at two institutions, i.e., two groups of Business Informatics students at the Duale Hochschule Baden-Württemberg in Karlsruhe (Germany) as well as a group of Computer Science students at the University of Innsbruck (Austria), to perform the experiment. All students in each of the groups knew UML models from previous courses. At the Duale Hochschule Baden-Württemberg, the students attended a one-semester course on software analysis where they learned about the different UML diagram types, including activity diagrams and state machines, before the experiment actually took place. Additionally, they applied UML models in an extensive one-semester case study project. At the University of Innsbruck the situation was similar: the students attended a one-semester course on software development and project management where they also learned about the different UML diagram types, including activity diagrams and state machines. In parallel, the students applied UML in an accompanying practical one-semester software project. All students received the same introductory training, and domains familiar to them were chosen, i.e., a gumball machine in the training phase as well as a drink vending machine and an automated teller machine in the actual experiment. Besides basic knowledge about the applied UML models, neither specific business nor technical knowledge was required with respect to the main task to execute during the experiment, i.e., the manual derivation of test cases from activity diagrams and state machines. As it was therefore unlikely that the Business Informatics and Computer Science students had different prerequisites with respect to executing the experiment task, each group performed the same task. We tested our assumption that there are no differences between the three groups with a Kruskal-Wallis one-way analysis of variance [41], which confirmed that the perceived comprehensibility and the number of errors made for activity diagrams and state machines do not differ significantly between the three groups. This context is similar to how system testing is often performed in industry, where it is not uncommon to have system tests executed by test personnel with some domain knowledge but only little experience in systematic testing [7]. For instance, when implementing enterprise systems, testing is often performed by key users who are not software engineers but future users of the system. The required skills in test design are then often provided in short trainings, i.e., similar to the training phase in our experiment. Such an experiment with inexperienced participants can show what is difficult about a task, i.e., in our case, the manual derivation of test cases from UML activity diagrams and state machines, and where a procedure is not intuitive. We did not prescribe any procedure for the concrete test case derivation, but only investigated the derived test cases themselves.


3.3 Experiment Tasks and Material The main task for each participant was to derive test cases from activity diagrams and state machines in order to achieve edge coverage (sometimes also called branch coverage) [42], [43]. For activity diagrams, this means that all transitions between two activities must be called by at least one test case. For state machines, each transition between two states must be called by at least one test case. All students had to perform this task twice with varying domain, i.e., drink vending machine (DVM) and automated teller machine (ATM), and diagram type, i.e., UML activity diagram (AD) and state machine (SM), used as experimental objects (see Appendix A for the diagrams). This resulted in two treatments: in Treatment A, test cases were first derived from the AD of the DVM and then from the SM of the ATM, and in Treatment B first from the SM of the DVM and then from the AD of the ATM (see Section 3.5 on the experimental design for more details). For each task, the students had to derive test cases and then answer questions regarding comprehensibility. For each task, the students received two sheets. The first sheet included the diagram, the instructions for test case derivation, and the comprehensibility questions. The second sheet contained tabular test case templates to be filled in by the students. All sheets used as experiment material were created by the two authors and are available online at http://homepage.uibk.ac.at/~c703409/manual_test_derivation. In the following, we explain the content of the material in more detail. The concrete wording of the instruction for test case derivation on the first sheet is as follows: "Please write down all test cases which you need for achieving edge coverage. For each test case, define the preconditions, individual test steps, input data and expected results. Enter the test cases in the template. Try to define test cases with the possible minimum of test steps." The provided diagrams showed basic system models for a drink vending machine (DVM) and an automated teller machine (ATM), which are both domains familiar to the students (see Appendix A). The activity diagrams contain swimlanes, activities and decision nodes, one initial node and final nodes. Each decision node has several outgoing edges with disjunctive conditions. Figure 1 shows an example of such an activity diagram modeling the flow of the gumball machine used in the training phase. This activity diagram comprises 5 activities and has a cyclomatic complexity [44] of 3 (with 11 nodes and 12 edges). The activity diagrams used in the experiment (see Figure 5 and Figure 7 in Appendix A) were slightly larger and comprised 12 activities, both for the DVM and the ATM. Their cyclomatic complexities were 2 for the DVM (with 19 nodes and 19 edges) and 3 for the ATM (with 19 nodes and 20 edges).
Figure 1. Test Model of Gumball Machine Described as UML Activity Diagram Used to Prepare the Participants in the Training Phase
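To make the coverage criterion and the complexity metric used above concrete, the following Python sketch is our own illustration (it is not part of the experiment material); it assumes the diagram is given as a list of directed edges and each test case as a path of node names.

    # Our own illustration, not part of the experiment material: a diagram is
    # given as a list of directed edges, a test case as a path of node names.
    def cyclomatic_complexity(edges):
        """McCabe's metric E - N + 2 for a single connected diagram."""
        nodes = {node for edge in edges for node in edge}
        return len(edges) - len(nodes) + 2

    def edge_coverage(edges, test_paths):
        """Fraction of edges exercised by the test paths (1.0 = edge coverage)."""
        covered = set()
        for path in test_paths:
            covered.update(zip(path, path[1:]))  # consecutive node pairs
        return len(covered & set(edges)) / len(edges)

    # Tiny made-up flow, loosely inspired by the gumball machine
    edges = [("Start", "Insert Coin"), ("Insert Coin", "Turn Lever"),
             ("Insert Coin", "End"), ("Turn Lever", "End")]
    print(cyclomatic_complexity(edges))  # 4 edges - 4 nodes + 2 = 2
    print(edge_coverage(edges, [["Start", "Insert Coin", "Turn Lever", "End"],
                                ["Start", "Insert Coin", "End"]]))  # 1.0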

The state machines contain states and transitions with triggers, guards and events. Figure 2 shows such a state machine modeling the flow of the gumball machine used in the training phase. This state machine comprises 4 states and has a cyclomatic complexity [44] of 5 (with 6 nodes and 9 edges). The state machines used in the experiment (see Figure 6 and Figure 8 in Appendix A) had 3 states for the DVM and 5 states for the ATM. Their cyclomatic complexities were 5 for the DVM (with 5 nodes and 8 edges) and 6 for the ATM (with 7 nodes and 11 edges).

Figure 2. Test Model of Gumball Machine Described as UML State Machine Used to Prepare the Participants in the Training Phase

The input and output of an activity was usually obvious based on the model (in the activity diagrams) or modelled explicitly (in the state machines). In the gumball example, we have the activity "insert coin" and later on the condition "coin matching", from which the tester can conclude that the input to the step "insert coin" is a matching coin and the expected result is "coin inserted". We took care that both models of the same machine were equivalent in terms of level of abstraction as well as content, although the activity diagram notation is originally meant as a black-box model of a system and the state machine as a white-box model. According to the definition of the International Software Testing Qualifications Board (ISTQB) [43], a test case contains a set of input values, execution preconditions, expected results and execution postconditions developed for a particular objective or test condition, such as to exercise a particular program path or to verify compliance with a specific requirement. In our experiment, test cases are supposed to be executed manually and on the system level. Based on our practical experience with the structure of test cases, we adapted the ISTQB definition. A test case consists of a precondition as well as a sequence of test steps, each with an operation call, input data and expected results, developed for a particular objective or test condition, such as to exercise a particular path or to verify compliance with a specific requirement. A test step is mainly determined by its operation call, which further defines the input data and expected result. We skip an explicit postcondition, which in our template is included in the expected result of the last test step. A test suite is a set of several test cases for the same system under test. We specify each test case in a tabular template. The template for test cases, which is also commonly used in industrial projects [45], is shown in Table 1. An identifier is inserted in the upper left corner next to the precondition. Each test step corresponds to one numbered row including an operation call, input data and expected result(s). All activities of an activity diagram as well as all triggers and events of a state machine correspond to test steps. The students obtained a sheet for each task with four test case templates on it.

Table 1. Test Case Template
(identifier) | Precondition:
Nr. | Operation Call | Input Data | Expected Result
1   |                |            |
2   |                |            |
3   |                |            |
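The adapted test case structure described above can be sketched as a small data model; the following Python sketch uses class and field names of our own choosing, not names taken from the experiment material.

    # Our own sketch (names not taken from the paper) of the adapted ISTQB test
    # case structure: a precondition plus numbered test steps, each with an
    # operation call, optional input data and an optional expected result.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class TestStep:
        operation_call: str                     # e.g. "Insert Coin"
        input_data: Optional[str] = None        # e.g. "Matching Coin"
        expected_result: Optional[str] = None   # e.g. "Coin Inserted"

    @dataclass
    class TestCase:
        identifier: str                         # e.g. "TC_2"
        precondition: str                       # e.g. "No Coin Inserted"
        steps: List[TestStep] = field(default_factory=list)

    # A test suite is simply a collection of test cases for the same SUT
    TestSuite = List[TestCase]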

Table 2 shows an example test case derived from the UML activity diagram of the gumball machine shown in Figure 1.

Table 2. Example Test Case Derived from the Activity Diagram in Figure 1
TC_2 | Precondition: No Coin Inserted
Nr. | Operation Call | Input Data | Expected Result
1 | Insert Coin | Not Matching Coin |
2 | Turn Lever | | Coin Not Thrown In
3 | Remove Coin | | No Gumball Ejected and Coin Removed

Table 3 shows an example test case derived from the UML state machine of the gumball machine shown in Figure 2.

Table 3. Example Test Case Derived from the State Machine in Figure 2
TC_1 | Precondition: Coin Inserted and Coin Not Matching
Nr. | Operation Call | Input Data | Expected Result
1 | Turn Lever | | Coin Not Thrown In
2 | Remove Coin | | Standby
3 | Insert Coin | Matching Coin |
4 | Turn Lever | | Coin Thrown In
5 | Check Coin Number | | Not Enough Coins Thrown In
6 | Insert Coin | Matching Coin |
7 | Turn Lever | | Coin Thrown In
8 | Check Coin Number | | Enough Coins Thrown In
9 | Eject Gumball | | Gumball Ejected and Standby

The derived test cases are abstract as they contain abstract rather than concrete input data. We decided to focus on abstract test cases as the derivation of concrete input data would require additional data flow modeling and would introduce confounding factors. But already the derivation of abstract test cases, as investigated in our experiment, is of high practical relevance, as the creation of test data is in practice often performed independently of the abstract test case derivation and even applies specific test design techniques like equivalence partitioning [45], which are out of the scope of the experiment. In addition to the main task of deriving test cases, the participants were also asked several open comprehension questions with respect to the diagrams, in order to verify that they basically understood the respective diagram. Finally, the participants were asked to judge the perceived comprehensibility of the respective diagram on the following five-point Likert scale: very good = +2, good = +1, average = 0, bad = -1, very bad = -2. We created and internally discussed a sample solution for each of the four sets of test cases. The sets of test cases needed for edge coverage for each of the four diagrams were modelled completely. The sample solution contained 3 test cases for the ATM's activity diagram and 4 for the DVM's activity diagram. Both state machines demanded only one test case for edge coverage by traversing loops several times. However, semantically this one test case contained the same test cases as the activity diagrams' test cases, e.g., "wrong PIN entered", one after the other. We also created sample solutions for all other questions we asked.

3.4 Variables and Hypotheses Research question RQ1 (as stated in Section 3.1) on the types of errors made when manually deriving test cases from activity diagrams or state machines is answered by analyzing the derived test cases and extracting the types of errors. It therefore does not require an independent variable but collects the possible error types from all four diagrams, i.e., the activity diagrams and state machines for the ATM and DVM. The possible types of errors are categorized according to the main affected artifact, i.e., the precondition, input data, expected result, operation call, test case, or the overall test suite, and result in a taxonomy of error types. Research questions RQ2a and RQ2b investigate which are the most frequent error types when manually deriving test cases from an activity diagram or state machine. RQ2a and RQ2b are answered by the number of errors per error type (in the taxonomy derived to answer RQ1) for activity diagrams (RQ2a) and state machines (RQ2b), respectively. Errors with regard to input data, expected result or the overall test step are counted on the level of test steps, i.e., for each row in a test
case description as shown in Table 1. Errors with regard to the precondition or the overall test case are counted on the level of test cases, i.e., zero or one time for each test case as shown in Table 1. Finally, errors with regard to the test suite (i.e., errors which appear in all test cases of one test suite) are counted on the level of test suites. If a specific error affects all test steps, then it counts on the level of test cases, and if an error affects all test cases (including their preconditions), then it counts on the level of the test suite. For instance, if the error "precondition missing" holds for all test cases of a test suite (with more than one test case), then the error "precondition missing" is counted for the overall test suite. If input is missing for all test steps of a test case (with more than one test step), then the error "no input" is counted for the overall test case. RQ3 compares the diagram types with respect to perceived comprehensibility and the number of errors made. RQ3 is answered by testing two one-sided hypotheses investigating the differences between the diagram types with respect to the perceived comprehensibility and the errors made per error type, respectively. The corresponding null hypotheses H-1_null and H-2_null as well as their related alternative hypotheses H-1_alt and H-2_alt are as follows.
H-1_null: UML activity diagrams (AD) do not have a significantly higher perceived comprehensibility (COMP) than state machines (SM).
H-1_alt: UML activity diagrams have a significantly higher perceived comprehensibility than UML state machines.
H-2_null: There are not significantly more errors per error type (ERROR) made when deriving test cases from UML activity diagrams (AD) than from state machines (SM).
H-2_alt: There are significantly more errors per error type made when deriving test cases from UML activity diagrams than from state machines.
The goal of the statistical analysis is to reject the null hypotheses and possibly accept the alternative ones. For both hypotheses, the independent variable is the diagram type (DT), i.e., UML activity diagram (AD) or UML state machine (SM). For the first hypothesis, the dependent variable is the list of perceived comprehensibility values (COMP) of all ADs and SMs, respectively, created for the ATM and DVM during the experiment task. The perceived comprehensibility of a diagram is judged by the participants of the experiment on a five-point Likert scale with the values "very good", "good", "average", "bad", and "very bad", based on the subjective metrics of a framework for the empirical evaluation of model comprehensibility [24]. For the second hypothesis, the dependent variable is the distribution of errors per error type (ERROR), i.e., the number of errors made when deriving test cases from ADs or SMs for each error type in the taxonomy (derived to answer RQ1), for AD and SM, respectively.
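A minimal Python sketch of the counting rules described above follows; it is our own formalization (the authors coded the errors by hand), assuming that for one error type we know, per test case, how many of its test steps are affected.

    # Our own formalization of the counting rules above, not the authors' tooling.
    def counting_level(affected_steps_per_case, suite_size):
        """Level at which one error type is counted for a given test suite.
        affected_steps_per_case: one (affected_steps, total_steps) pair for each
        test case of the suite in which the error was observed."""
        whole_case = [total > 1 and affected == total
                      for affected, total in affected_steps_per_case]
        if suite_size > 1 and len(whole_case) == suite_size and all(whole_case):
            return "test suite"   # e.g. no input defined in any test case at all
        if any(whole_case):
            return "test case"    # e.g. no input defined for one whole test case
        return "test step"        # otherwise counted once per affected row

    # Error observed in every step of both test cases of a two-case suite:
    print(counting_level([(3, 3), (2, 2)], suite_size=2))  # -> test suite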

3.5 Experiment Design The experiment in Karlsruhe and its internal replication by the same researchers in Innsbruck were performed with three groups of students named after their institution, i.e., Karlsruhe Group 1, Karlsruhe Group 2, and Innsbruck. The students from each group were assigned to one of two treatments. For each group, we selected a balanced design [46] in which each treatment has an equal number of student participants as experimental subjects. The assignment of the students to one of the two treatments was performed randomly. The tasks to be executed were the derivation of test cases from a UML activity diagram and a state machine. For this purpose, in Treatment A, each student received an activity diagram of the drink vending machine (DVM-AD) and a state machine of the automated teller machine (ATM-SM), and in Treatment B vice versa a state machine of the drink vending machine (DVM-SM) and an activity diagram of the automated teller machine (ATM-AD). So in both treatments, the DVM model was used first. Table 4 provides an overview of the experiment treatments per group. The design of the experiment tries to minimize carryover effects, as both the domain and the diagram type change between the tasks of each treatment, which avoids learning effects regarding the domain and the diagram type when deriving test cases.

Table 4. Overview of Treatments per Group: For each Group, Treatment A means that the Participants first derived Test Cases from the DVM's AD and then from the ATM's SM, and Treatment B vice versa first from the DVM's SM and then from the ATM's AD
            | Karlsruhe Group 1 | Karlsruhe Group 2 | Innsbruck
Treatment A | DVM-AD, ATM-SM    | DVM-AD, ATM-SM    | DVM-AD, ATM-SM
Treatment B | DVM-SM, ATM-AD    | DVM-SM, ATM-AD    | DVM-SM, ATM-AD


3.6 Experiment Procedure The experiment followed the same procedure for all three student groups. As no previous experience with the task of test case derivation could be assumed and we wanted all participants to understand the task and the artefacts used, the experiment was prepared by a preceding, similar exercise. Consequently, each experiment was executed in two sessions, i.e., the training phase and the actual experiment. The first session was the training phase. After an explanation of the task, the students were asked in this session to derive test cases from an activity diagram as well as a state machine describing a gumball machine. The students first received instructions on the task and then tried to perform the task themselves. Afterwards, our sample solution was presented and discussed with the whole group of students. This sample solution contained a set of test cases for each of the diagrams which achieves edge coverage and uses the minimum number of loops and steps. The sample solution also included the correct answers to all questions on the questionnaire: the number of test cases needed for each of the three types of coverage, and the answers to the comprehensibility questions. The students also received the sample solution. The experiences and feedback from this exercise helped us to make the task description for the experiment more precise, to plan the time schedule and to refine the material used. The observation of the students when deriving test cases and the joint discussion of the solution reflected a broad understanding of the UML diagrams and of test derivation from them. Additionally, the two researchers designed sample solutions before executing the experiment to estimate the required time and the difficulty of the experiment task. The second session comprised the actual experiment. All students had to perform the same task twice with varying domain and diagram type (see Section 3.3). The provided system models showed a drink vending machine (DVM) and an automated teller machine (ATM), which are both domains familiar to the students. We randomly distributed the students from each group to Treatment A or Treatment B. In both treatments, the DVM model was used first. The reason was that if a student asked a question concerning the DVM, it was better that the other group was also working on the DVM, so as not to be confused. In addition to the design of test cases, the participants were asked to answer two questionnaires, one for each test case design task. The task descriptions for each treatment, including the underlying diagrams and questionnaires, are available online at http://homepage.uibk.ac.at/~c703409/manual_test_derivation.

3.7 Experiment Execution The experiment and its replication followed the same procedure described in Section 3.6; only the students differed. The group sizes were 31, 25 and 28 students, respectively. The first two groups were students of Business Informatics at the Duale Hochschule Baden-Württemberg in Karlsruhe in their third year of studies, and the experiment took place on 14 October 2013 during the software engineering lecture. The members of the third group were 28 bachelor students of Computer Science at the University of Innsbruck. This experiment was conducted in a lecture on software quality and took place on 31 October 2013. Table 5 shows how many participants from each of the three groups, i.e., Karlsruhe Group 1, Karlsruhe Group 2, and Innsbruck, received which of the two treatments, i.e., Treatment A and Treatment B.

Table 5. Group Sizes for each Treatment and Group: Treatment A means that the Participants first derived Test Cases from the DVM's AD and then from the ATM's SM, and Treatment B vice versa first from the DVM's SM and then from the ATM's AD
            | Karlsruhe Group 1 | Karlsruhe Group 2 | Innsbruck | Sum
Treatment A | 15                | 13                | 14        | 42
Treatment B | 16                | 12                | 14        | 42
Sum         | 31                | 25                | 28        | 84

During the actual experiment, there were almost no questions asked, and the domains, i.e., drink vending machine and automated teller machine, were well understood. The students were assigned randomly to the two treatments by alternating the treatment from one occupied seat to the next. The time available to perform both experiment tasks was not limited. It would have been interesting to additionally log the time needed for each individual task of the experiment (reading, answering questions and test case derivation). However, for several practical reasons, we decided against logging the time. As we used printed questionnaires, the participants themselves would have had to manually note the time needed. In previous experiments, we observed that noting the time distracts the participants from the test case derivation, and it is difficult to
obtain correct and complete time values. In addition, drawing conclusions from the time needed would be difficult anyway: a long time needed to perform an activity could mean slow comprehension, but also high motivation and a systematic working style (including re-reading one's results). Therefore, we decided not to log time and not to limit time, to avoid any time pressure on the participants. The motivation for the students to participate in the experiment was the hint that a similar task would be part of the course's final exam.

3.8 Analysis Procedure Our objective was to analyze which errors are made during test case derivation and which differences there are between UML activity diagrams and state machines. For deriving the error taxonomy, we proceeded iteratively as follows. The creation of the sample solutions had already provided hints on possible error types. After the experiment, the authors compared all test cases resulting from the experiment and its replication with the sample solution to determine their quality in terms of errors made. Next, classes of errors were derived from the errors observed. When deriving test cases, errors can be made in each of the template fields. For each template field, fundamental types of errors can be made, i.e., something is missing, wrong or redundant. In addition, content can be written into the wrong field. We distinguish whether an error is made only in single test steps or whether it is made for a complete test case or even for all test cases of a test suite. We assigned an identifier to each error type (see Table 6). Then, the authors coded all errors found into the identified error categories. Afterwards, the authors discussed the findings and the difficulties and uncertainties observed during coding. This led to a clearer and less ambiguous error taxonomy. The errors found had to be partly re-coded afterwards because new error types were included. When an error was made several times within one test case (e.g., "input missing"), we counted it as often as it appeared. We did not count it as an error if a test step was reasonably contained in another test case than in the sample solution, as there can be alternative correct sets of test cases for the same system model. If edge coverage was not achieved, this counted either as the error "test case missing" or "test step missing", depending on whether a complete test case was missing or not. The state machines' tests could be modelled either as one multi-loop test case (containing 3 or 4 semantic test cases) or as several individual test cases. We did not count it as an error if participants wrote several test cases instead of one, although we had asked for the lowest number of test cases possible. For error types which refer to single test cases, we counted each (semantic) test case individually, also for the state machines. The error taxonomy and the coded number of errors found for each error type and each system model, i.e., the activity diagrams and state machines for the ATM and DVM, respectively, were the basis for further descriptive and inferential statistics. We provide an overview of the most frequent error types per diagram type in a bar chart and grouped the errors made (also per diagram type) according to whether something was missing, redundant, in the wrong template field or invalid. To answer whether there are differences between activity diagrams and state machines with regard to perceived comprehensibility or errors made when manually deriving test cases, we finally formulated hypotheses and performed tests within the statistical computing environment R [41]. For hypothesis H-1, which addresses the difference between the perceived comprehensibility (COMP) of UML activity diagrams (AD) and state machines (SM), we performed the non-parametric one-tailed Wilcoxon rank sum test [41], as the dependent variable COMP is measured on an ordinal scale.
For hypothesis H-2, which addresses the difference in the number of errors made per error type (ERROR) when deriving test cases from UML activity diagrams (AD) and state machines (SM), we apply a paired difference test, as the corresponding error types for AD and SM are matched and measured on different levels of granularity, i.e., test step, test case or test suite. We use a Shapiro-Wilk normality test [41] to determine whether the hypothesis that the data is normally distributed can be rejected. If the hypothesis cannot be rejected, we test H-2 by means of the one-tailed paired t-test [41], which is a parametric test. Otherwise, we apply the non-parametric one-tailed Wilcoxon signed rank test [41]. For testing each of the two hypotheses, we use a significance level of α = 0.05 and report the effect size as well as the statistical power. The results of applying the described analysis procedure are presented in the next section.
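The test selection just described can be illustrated with the following Python sketch; the authors performed the analysis in R, so the SciPy calls below are only an equivalent illustration, with the Shapiro-Wilk test applied to the paired differences.

    # Equivalent sketch of the described test selection with SciPy; the authors
    # performed the actual analysis in R.
    from scipy import stats

    def test_h1(comp_ad, comp_sm, alpha=0.05):
        """H-1: one-tailed Wilcoxon rank sum (Mann-Whitney U) test on the
        Likert-coded comprehensibility ratings of ADs vs. SMs."""
        _, p = stats.mannwhitneyu(comp_ad, comp_sm, alternative="greater")
        return p, p < alpha               # p-value, null hypothesis rejected?

    def test_h2(err_ad, err_sm, alpha=0.05):
        """H-2: paired comparison of errors per error type (one pair per
        taxonomy code); Shapiro-Wilk on the differences selects the test."""
        diffs = [a - s for a, s in zip(err_ad, err_sm)]
        _, p_norm = stats.shapiro(diffs)
        if p_norm > alpha:                # normality not rejected: paired t-test
            _, p = stats.ttest_rel(err_ad, err_sm, alternative="greater")
        else:                             # otherwise Wilcoxon signed rank test
            _, p = stats.wilcoxon(err_ad, err_sm, alternative="greater")
        return p, p < alpha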

4. RESULTS In this section, we present the results and their interpretation according to the stated research questions. To answer RQ1, we collected the types of errors and categorized them according to the main affected artifact, i.e., the precondition, input data, expected result, overall test step including the determining operation call, test case, or the complete test suite. The resulting taxonomy of error types is shown in Table 6. This taxonomy covers, for each system model, i.e., activity diagram for the drink vending machine (DVM-AD), state machine for the drink vending machine (DVM-SM), activity diagram for the automated teller machine (ATM-AD), and state machine for the automated teller machine (ATM-SM), the overall number of errors observed during the experiment.


For each affected artifact, the taxonomy defines error types with an assigned error code. The error code consists of an abbreviation for the affected artifact, i.e., "P" for precondition, "I" for input data, "R" for expected result, "T" for test step/operation call, "TC" for test case, and "TS" for test suite, as well as a running number and letter for the error type and the concrete error, respectively. For instance, a typical error of the artifact input is "Input missing" with the concrete sub-errors "Condition missing" and "Input value missing", with error codes "I1A" and "I1B", respectively. For the artifact result, a typical error is "Result invalid" with the concrete sub-errors "Input as result" ("R3A"), "Operation call as result" ("R3B"), and "Condition invalid" ("R3C"). The notion "as", which is used for instance in the concrete error "Operation call as result", means that one artifact (in this case an operation call) is written in a place where another artifact (in this case a result) is expected. Table 6 lists all error types and the numbers of errors found in the experiment. Table 6 therefore answers RQ1. All error types except absent test steps in a single test case (TC5A) and in the overall test suite (TS4A) were observed at least once. For reasons of completeness, these error types are included in the taxonomy. Table 6 additionally contains the number of errors observed in each category, which we use to answer RQ2a and RQ2b. We first give an overview of the overall distribution of errors, and then consider activity diagrams and state machines separately to answer research questions RQ2a and RQ2b.

Table 6. Taxonomy of Error Types and their Numbers Observed in the Experiment: DVM stands for the Drink Vending Machine, ATM for the Automated Teller Machine, AD for Activity Diagram and SM for State Machine
Affected Artifact | Error Type | Concrete Error | Code | DVM-AD | DVM-SM | ATM-AD | ATM-SM
Precondition | Precondition missing | No precondition | P1A | 4 | 1 | 2 | 0
Precondition | Precondition redundant | Precondition but no precondition is expected | P2A | 19 | 2 | 0 | 23
Precondition | Precondition invalid | Operation call as precondition | P3A | 6 | 0 | 3 | 1
Precondition | Precondition invalid | Behavior of test case as precondition | P3B | 6 | 1 | 7 | 1
Precondition | Precondition too generic | e.g. physical or cognitive precondition | P4A | 6 | 0 | 26 | 7
Precondition | Precondition too concrete | e.g. "2€" instead of "sufficient money" | P4B | 2 | 5 | 5 | 4
Input Data | Input missing | Condition missing | I1A | 26 | 29 | 50 | 48
Input Data | Input missing | Input value missing | I1B | 24 | 12 | 12 | 4
Input Data | Input redundant | Input but no input expected | I2A | 3 | 0 | 18 | 0
Input Data | Input invalid | Result as input | I3A | 0 | 0 | 9 | 10
Input Data | Input invalid | Operation call as input | I3B | 0 | 55 | 81 | 20
Input Data | Input invalid | Condition invalid | I3C | 1 | 0 | 0 | 4
Input Data | Input invalid | Wrong input | I3D | 0 | 4 | 12 | 0
Input Data | Wrong level of abstraction | Test data too concrete | I4A | 13 | 19 | 9 | 12
Input Data | Wrong level of abstraction | Test data too abstract | I4B | 8 | 0 | 0 | 0
Expected Result | Result missing | Result missing | R1A | 29 | 24 | 44 | 26
Expected Result | Result redundant | Result but no result expected | R2A | 21 | 0 | 15 | 36
Expected Result | Result invalid | Input as result | R3A | 5 | 9 | 24 | 8
Expected Result | Result invalid | Operation call as result | R3B | 128 | 2 | 131 | 16
Expected Result | Result invalid | Condition invalid | R3C | 0 | 0 | 2 | 5
Expected Result | Wrong level of abstraction | Test data too concrete | R4A | 9 | 4 | 2 | 2
Expected Result | Wrong level of abstraction | Test data too abstract | R4B | 3 | 0 | 0 | 0
Test Step | Test step missing | Test step missing | T1A | 29 | 20 | 83 | 56
Test Step | Test step missing | Only user actions as test steps | T1B | 6 | 0 | 4 | 0
Test Step | Test step missing | Only system actions as test steps | T1C | 0 | 8 | 9 | 0
Test Step | Test step redundant | Test step but specific test step not expected | T2A | 3 | 3 | 4 | 2
Test Step | Test step invalid | Synonym for action name used | T3A | 3 | 0 | 16 | 0
Test Step | Test step invalid | Operation call named after state | T3B | 0 | 5 | 0 | 5
Test Step | Test step invalid | Condition as operation call | T3C | 9 | 37 | 28 | 73
Test Step | Test step invalid | Invalid name used | T3D | 1 | 0 | 0 | 0
Test Step | Test step invalid | Input as operation call | T3E | 0 | 1 | 0 | 4
Test Step | Test step invalid | Result as operation call | T3F | 0 | 17 | 1 | 2
Test Case | Test case irrelevant | Test case does not fit the defined model | TC1A | 2 | 0 | 1 | 0
Test Case | Incomplete test flow | No complete test flow from initial to final node | TC2A | 11 | 5 | 32 | 4
Test Case | No input | No input defined for a test case | TC3A | 3 | 2 | 5 | 1
Test Case | No result | No result defined for a test case | TC4A | 2 | 0 | 0 | 1
Test Case | No test step | No test step defined for a test case | TC5A | 0 | 0 | 0 | 0
Test Suite | No precondition | No precondition defined at all | TS1A | 20 | 12 | 18 | 5
Test Suite | No input | No input defined at all | TS2A | 5 | 3 | 3 | 5
Test Suite | No result | No result defined at all | TS3A | 6 | 5 | 2 | 6
Test Suite | No test step | No test step defined at all | TS4A | 0 | 0 | 0 | 0
Test Suite | No test case | No test case defined at all | TS5A | 4 | 4 | 4 | 6
Test Suite | Test case missing | A specific test case is missing | TS6A | 0 | 1 | 1 | 2
Test Suite | Test case redundant | Test case which is not expected | TS7A | 2 | 0 | 1 | 0
Test Suite | Wrong number of test cases specified | Wrong number specified relative to objective | TS8A | 4 | 22 | 3 | 15

Figure 3 shows the numbers of the six by far most frequent error types on the level of test steps, partitioned into errors made in activity diagrams (AD) and state machines (SM). The most frequent error, mainly made for AD, is "operation call as result" (R3B). The other most frequent error types are "test step missing" (T1A), "operation call as input" (I3B), "condition missing" (I1A), "condition as operation call" (T3C), and "result missing" (R1A). On the level of test cases, the by far most frequent error types are "incomplete test flow" (observed in 52 test cases), "precondition but no precondition expected" (44), and "precondition too generic" (39). For the overall test suite, the by far most frequently observed error types are "no precondition defined at all" (observed in 55 test suites) as well as "wrong number of test cases specified" (observed in 44 test suites).
Figure 3. The Six Most Frequent Error Types When Manually Deriving Test Cases From Activity Diagrams (AD) and State Machines (SM)

The overall most frequent errors also cover the most frequent errors for each diagram type. For activity diagrams, the five by far most frequent error types on the level of test steps are "operation call as result" (259), "test step missing" (112), "operation call as input" (81), "condition missing" (76), and "result missing" (73). The by far most frequent error types on the test case level are "incomplete test flow" (43) as well as "precondition too generic" (32). On the test suite level, the by far most frequent error type for activity diagrams is "no precondition" (38). For state machines, the five by far most frequent error types on the test step level are "condition as operation call" (110), "condition missing" (77), "test step missing" (76), "operation call as input" (75) as well as "result missing" (50). On the test case level, the by far most frequent error type is "precondition but no precondition expected" (25). Finally, on the test suite level the most frequent error type is "wrong number of test cases specified" (37). Figure 4 summarizes the error statistics using four categories:
a) Content is missing. This category sums up the following error types from Table 6: P1A + I1A + I1B + R1A + T1A + T1B + T1C + TC2A + TC3A + TC4A + TC5A + TS1A + TS2A + TS3A + TS4A + TS5A + TS6A
b) Content is redundant: P2A + I2A + R2A + T2A + TC1A + TS7A
c) Content is in the wrong field: P3A + P3B + I3A + I3B + R3A + R3B + T3C + T3E + T3F
d) Content is invalid: P4A + P4B + I3C + I3D + I4A + I4B + R3C + R4A + R4B + T3A + T3B + T3D + TS8A
The numbers presented in Figure 4 show that it was clearly more frequent that something was missing, compared to redundant content. Errors in terms of invalid content were rare, but it often happened that content was written into the wrong field of the provided test case template.
Figure 4. Error statistics, summarized using the four categories missing, redundant, wrong field, and invalid, for Activity Diagrams (AD) and State Machines (SM)
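The aggregation underlying Figure 4 can be expressed as a short Python sketch of our own; it assumes the per-code counts of Table 6 are available as a dictionary per diagram type, and the example call uses only an illustrative subset of the data.

    # Our own sketch of the Figure 4 aggregation; `counts` maps an error code
    # from Table 6 to the number of errors observed for one diagram type.
    CATEGORIES = {
        "missing": ["P1A", "I1A", "I1B", "R1A", "T1A", "T1B", "T1C", "TC2A",
                    "TC3A", "TC4A", "TC5A", "TS1A", "TS2A", "TS3A", "TS4A",
                    "TS5A", "TS6A"],
        "redundant": ["P2A", "I2A", "R2A", "T2A", "TC1A", "TS7A"],
        "wrong field": ["P3A", "P3B", "I3A", "I3B", "R3A", "R3B", "T3C", "T3E",
                        "T3F"],
        "invalid": ["P4A", "P4B", "I3C", "I3D", "I4A", "I4B", "R3C", "R4A",
                    "R4B", "T3A", "T3B", "T3D", "TS8A"],
    }

    def summarize(counts):
        return {category: sum(counts.get(code, 0) for code in codes)
                for category, codes in CATEGORIES.items()}

    # Illustrative subset only, not the full data of Table 6:
    print(summarize({"R3B": 259, "T1A": 112}))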

As mentioned in the description of the analysis procedure in Section 3.8, hypothesis H-1 was tested with the non-parametric Wilcoxon rank sum test (also known as Mann-Whitney U test). With a p-value of 0.000316 (effect size r = 0.2778, power 1-β = 0.55), we can reject the null hypothesis H-1_null and conclude that UML activity diagrams have a significantly higher perceived comprehensibility than UML state machines. For hypothesis H-2, the p-value of the Shapiro-Wilk normality test is lower than 0.001. This indicates that the data are not normally distributed, and therefore H-2 was tested according to the analysis procedure with a Wilcoxon signed rank test due to the matched error types. With a p-value of 0.01491 (effect size r = 0.2656, power 1-β = 0.77), we can reject the null hypothesis H-2_null and conclude that significantly more errors are made when deriving test cases from UML activity diagrams than from UML state machines.

5. DISCUSSION AND THREATS TO VALIDITY In this section, we interpret the results of the previous section, compare them to related work and discuss threats to validity.

5.1 Interpretation In this experiment, we found that a large diversity of errors is made when manually deriving system test cases from UML activity diagrams and state machines. However, some are clearly more frequent than others. The most frequent errors made were, in this order: R3B "operation call as result", T1A "test step missing", I3B "operation call as input", I1A "condition missing", T3C "condition as operation call" and R1A "result missing". So, most often, complete test steps, conditions or results were missing, or content was written into the wrong field. UML activity diagrams were perceived to be (with statistical significance) more comprehensible than UML state machines by the participants. However, more errors were made when deriving test cases from UML activity diagrams than from UML state machines. This is not necessarily a contradiction. These results could mean that the higher perceived comprehensibility of activity diagrams led the participants to underestimate the carefulness demanded to derive test cases, while state machines demanded more care both for comprehension and for test case derivation. Activity diagrams, probably due to lower formality and less rigid semantics, are on the one hand easier to understand than state machines, but on the other hand more ambiguous, which may induce errors when manually deriving test cases. For instance, preconditions and expected results are modelled explicitly in the UML state machines, whereas in the UML activity diagrams this information is contained only implicitly in the diagram. In our experiment, although describing the same machine, the activity diagrams had a smaller cyclomatic complexity [44] than the state machines, which might have supported comprehension, but evidently not the test case derivation. The differences observed between activity diagrams and state machines imply that the two diagram types might be optimized for different project stakeholders. Activity diagrams might be easier to understand for customers, whereas for system-level testers, the more formal state machines might fit better because they contain more information valuable for test case derivation.


With regard to the observed error types and their frequencies, we wondered how the most frequently made errors arose and how they could be better prevented in the future. Figure 4 shows that, regardless of the underlying diagram type, most often either content was missing or content was written to the wrong field of the test template when manually deriving test cases. From this finding, we conclude that important guidelines for improving manual test case derivation could be as follows.

The definition of all fields in the test case template must be clearly documented and explicitly handed out to the participants in written form. There must be explicit completeness criteria for the test cases. The most frequently observed errors could presumably be (at least partly) prevented by defining clear rules on how specific diagram elements are transformed into specific test case elements. Such rules could, for instance, be formulated in the following way: "Each traversed activity in the UML activity diagram is transformed into one test step. The activity's name can be used as the test step's operation call.", or "Input includes all data, triggers or conditions that have to be provided such that the test step can be executed. Each input needs to be documented in the test case. Attention: The activity diagram does not necessarily contain all inputs explicitly."

Furthermore, it should be made more obvious under which conditions fields like preconditions have to be filled; they are neither always required nor can they be ignored completely. A precondition has to be satisfied and checked before a test case is executed because it determines a condition which must be fulfilled for the test case to take the intended direction. For instance, in the DVM, one test case tested the situation that there are not enough coins in the machine to return the change. This means that the test case requires, as a precondition, that the machine starts with no or very little monetary stock.
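To illustrate how such derivation rules could be operationalized, the following Python sketch encodes the two example rules above for one traversed path. The test step fields, helper names, and the example path are hypothetical and only loosely based on Figure 5; the sketch is illustrative and not part of the experiment material.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TestStep:
    operation_call: str                               # rule: the activity's name becomes the operation call
    inputs: List[str] = field(default_factory=list)   # rule: all data, triggers or conditions needed for the step
    expected_result: str = ""

def derive_test_steps(traversed_activities: List[str]) -> List[TestStep]:
    """Rule: each traversed activity of a path through the diagram becomes one test step."""
    return [TestStep(operation_call=activity) for activity in traversed_activities]

# Hypothetical path through the DVM activity diagram (cf. Figure 5):
path = ["enter article number", "show prize", "enter money",
        "compare amount with prize", "calculate change",
        "dispose change", "dispose article"]
steps = derive_test_steps(path)
steps[2].inputs.append("Amount >= Price")  # inputs may be implicit in the diagram and must be added manually
```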

5.2 Comparison to Related Work

Our results are consistent with related work. We found that errors are made during the manual derivation of test cases from UML models. This also holds for other complex activities performed manually during software engineering, like inspections [26]; such activities cannot be performed manually without errors. This, however, is no argument for fully automated test case derivation, because automation would simply propagate defects from the UML models to the test cases, while manual test case derivation can detect and correct defects in the UML model. In the literature, we have found no comparison of the comprehensibility and error-proneness of UML activity diagrams and state machines with regard to manual test case derivation. So, this is the first research showing the higher perceived comprehensibility and higher error-proneness of activity diagrams compared to state machines when manually deriving test cases.

In order to verify that the participants had really understood the diagrams, we asked them several comprehension questions. Unlike comparable work, we did not ask multiple choice questions but free text questions, whose answers were scored as correct (1 point), partly correct (1/2 point), or incorrect. When comparing to comprehensibility measured by multiple choice questions, the F-Measure from the multiple choice questions is mathematically comparable to the ratio of correct free text answers. Our participants achieved a ratio of 0.66 correct answers for the activity diagram of the DVM and 0.64 for the activity diagram of the ATM. A series of student experiments [47] achieved an F-Measure of 0.52 to 0.64 for activity diagrams, which is comparable to the results achieved in our experiment.

5.3 Threats to Validity

Finally, we must consider certain issues which may have threatened the validity of the experiment, i.e., its conclusion, internal, construct, and external validity [10].

Conclusion validity is concerned with data collection, the reliability of the measurement, and the relationship between the treatment and the outcome, i.e., whether there is a statistically significant relationship. In our experiment, 84 students participated, which is a sufficient sample size providing enough data to derive significant results. The collected data was reviewed and analyzed by both experimenters in three iterations. After each iteration, an intensive discussion of the results and a refinement of the analysis procedure took place.

Internal validity is threatened if a relationship is observed between the treatment and the outcome although there in fact is none. This can happen when the observed effect is caused by a factor other than the treatment that has not been controlled or measured. Another potential threat is the exchange of information among the participants. We emphasize that participants were not allowed to communicate with each other; we prevented this from happening by seating them appropriately, i.e., students with the same treatment did not sit next to each other, and by monitoring them during the run of the experiment. When the experiment was concluded, the participants were asked to hand back all experiment material. In our specific experiment, the errors made by the participants depend strongly on the quality of the explanations and instructions they obtained, as no previous experience in test case derivation can be assumed. One potentially irritating aspect with regard to the state machines could be that we did not explicitly model the transitions "Start" and "Stop" of the state machines, leading from the initial state to state "Standby" and from "Standby" to the final state, respectively (see Figure 2). However, this convention was the same for all state machines used in the training phase and the actual experiment and did not pose any problems. We did not mention this convention in the actual experiment, but explained it explicitly in the initial training phase.


In addition, it did not cause any obvious errors in the derived test cases. We also took care that the activity diagrams and state machines modelling the same domain were equivalent in terms of their level of abstraction as well as semantically in terms of their content, although the activity diagram notation is meant as a black box model of the system and the state machine as a white box model. Of course, a test step like "insert coin" is modelled as an activity in the activity diagram and as an event in the state machine. However, this is the natural difference between these two types of diagrams, and observing the effect of this difference is one of the objectives of the experiment, as both diagram types are used for the same purpose in practice.

In the experiment itself, we avoided unnecessary irritation of the students by designing the experiment such that each participant received only one diagram type, i.e., activity diagram or state machine, per domain, i.e., Drink Vending Machine and Automated Teller Machine. We executed the experiment with three groups which received the same explanations. They were prepared by a previous exercise, so there was a feedback loop for the participants and also for the experimenters. Nevertheless, misunderstandings in this learning process could have provoked errors which otherwise would not have happened. We understand our experiment results as a hint on where the training process can be improved so that the participants would make fewer errors. We do not expect differences between the Business Informatics students and the Computer Science students because their previous knowledge with respect to the task to execute was similar. All students knew UML models from previous courses, received the same introductory training, and were familiar with the chosen domains. In addition, neither specific business nor technical knowledge is required to perform the experiment task. We tested our assumption that there are no differences between the three groups by a Kruskal-Wallis one-way analysis of variance [41], which confirmed that the perceived comprehensibility and the number of errors made for activity diagrams and state machines do not differ significantly between the three groups (an illustrative sketch of such a check is given below).

Construct validity refers to the extent to which the experiment setting actually reflects the construct under study, e.g., whether the chosen measures actually capture it. The metrics to answer the research questions were defined in several iterations on the basis of the available material, an empirical framework for the evaluation of model comprehensibility [24], and an in-depth discussion between the two experimenters. Students were given enough time to answer all questions and to perform the experiment tasks. Social threats (e.g., evaluation apprehension) have been avoided, since the students were not graded on the results obtained. We did not define any explicit "quality measure" which condenses the test cases' quality into a single value, because we believe that such a metric is questionable. In order to verify that the participants had really understood the diagrams, we asked them several comprehension questions. As already stated in Section 5.2, our participants achieved a ratio of 0.66 correct answers for the activity diagram of the DVM and 0.64 for the ATM activity diagram. This is comparable to the results from a similar series of student experiments [47]. The state machines achieved 0.56 correct answers for the DVM and 0.74 for the ATM. So, the students have basically understood the diagrams.
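As an illustration of the group-difference check mentioned above, a Kruskal-Wallis test could be run along the following lines; the per-group values are hypothetical placeholders, not the actual experiment data.

```python
from scipy import stats

# Hypothetical per-participant error counts for the three groups (placeholder data):
group_1 = [10, 12, 9, 14, 11, 8]
group_2 = [11, 13, 10, 12, 9, 12]
group_3 = [9, 15, 12, 10, 13, 11]

h_statistic, p_value = stats.kruskal(group_1, group_2, group_3)
# A p-value well above 0.05 supports the assumption that the three groups do not differ significantly.
print(h_statistic, p_value)
```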
Finally, to minimize carryover effects, our experimental design varied both the domain and the diagram type between the tasks of each treatment. This avoids learning effects regarding the domain and the diagram type when deriving test cases.

External validity is concerned with the extent to which it is possible to generalize the findings. It is an important issue and can be threatened when experiments are performed with students. To evaluate whether a testing approach is suitable for practice, it should ideally be applied in an industrial context by professionals. Nevertheless, for practical reasons, approaches are regularly evaluated in experiments performed with students, and the representativeness of these participants may be doubtful in comparison to that of test professionals. However, our experiment setting with student participants is similar to the situation of system testing in industry, where it is common to have system tests executed by test personnel with some domain knowledge but only little experience in systematic testing [7]. As mentioned before, when implementing enterprise systems, testing is often performed by key users. The required skills in test design are then provided in short trainings (i.e., similar to the training phase in our experiment). Like students, testers in industry also have to familiarize themselves with the system model when a new project starts. Therefore, we think our student experiment can be considered appropriate, because students and novice test professionals are in a comparable situation. It has to be stated, however, that the results of our experiment cannot be generalized to industrial settings where testing is performed by experienced testers, as they would probably derive more complete test cases. Furthermore, our results are not generalizable to systems of arbitrary size or complexity, due to the rather low size and complexity of the considered problems, i.e., the basic functionality of a drink vending machine and an automated teller machine, and the simplicity of the provided diagrams. Both the rather low complexity of the considered problems and the simplicity of the diagrams result from the requirements of the controlled setting of the experiment. But also in industry, a remarkable share of the diagrams used are simple and illustrative, and testers are often not testing experts but rather domain experts, which again increases the transferability of our results to industrial settings.

The appropriateness of our experiment is also supported by Tichy [48], who outlines four situations where it is acceptable to use students as experimental subjects: (1) if students were trained well enough to perform the task they are asked for, (2) to establish trends, i.e., when comparing methods, the trend of the difference, if not its magnitude, can be expected to be comparable to that of professionals, (3) to eliminate hypotheses, i.e., if there is no effect observable in the student experiment, it is very unlikely that an effect is observed with professionals, or (4) as a prerequisite for experiments with professionals. This means that student experiments are important and helpful for initial studies or when the students' experience is comparable to that of professionals, which holds in our case as argued before. We use the experiment for gaining results on what errors are intuitively made by participants who have low previous experience in deriving test cases but who have been trained before.


Thus, our student participants are to some extent comparable to system testers with only a short training in test case derivation, like key users or domain experts. We expect that more experienced testers would make fewer errors and consider the empirical investigation of this expectation as future work.

6. CONCLUSION AND FUTURE WORK

In this paper, we empirically evaluated the manual derivation of test cases from UML activity diagrams and state machines in a controlled experiment with 84 student participants as experimental subjects. The students were divided into three groups at two institutions: the experiment was performed with two groups at Duale Hochschule Baden-Württemberg in Karlsruhe (Germany), and its internal replication by the same researchers was performed at the University of Innsbruck (Austria). The presented results are limited to the manual derivation of system-level test cases from UML activity diagrams or state machines and have been achieved in a student experiment.

In our experiment, we found a large diversity of errors made, where some are clearly more frequent than others. Most often, complete test steps, conditions, or results were missing, or content was written into the wrong field. UML activity diagrams were perceived by the participants to be (statistically significantly) more comprehensible than UML state machines. However, more errors were made when deriving test cases from UML activity diagrams than from UML state machines. This is an interesting result which may signify that it is difficult for students to judge their own model comprehension. However, understanding the model is only the first step; test case derivation is the second one. So, it is possible that one model is easier to understand, while the other one better supports test case derivation. Another possible explanation is that deriving test cases from a less comprehensible model (i.e., a UML state machine) forces the tester to work more systematically and with more focus (i.e., avoiding errors) than a more comprehensible model (i.e., a UML activity diagram) does. Finally, we must remark that the statistical correlations observed refer to the group and not to individual participants.

Based on our results, we will continue following this direction of research and plan further activities. We will perform additional controlled experiments to refine our findings. For instance, we plan to give more detailed instructions and to extend the introductory preparation with the aim of investigating whether these improvements indeed reduce the total number of errors and, in particular, the error types we expect them to address. For instance, clear derivation rules should reduce the number of errors related to writing content to the wrong field. We will also collect data on personality traits of the participants and try to correlate them with the errors made, in order to support the selection of suitable testers. Furthermore, we plan to quantify comprehensibility by asking comprehension questions and measuring how many of them are answered correctly. Moreover, potential causes of the errors can be investigated in more detail, e.g., by measuring the time needed for each activity, like reading the UML model, deriving the test cases, and answering further questions. Measuring time is difficult to perform manually, but it is automatically logged when using electronic questionnaires. Finally, as manual derivation of test cases from activity diagrams or state machines is common in practice, we plan additional industrial case studies to investigate our research questions in an industrial context and to provide and evaluate guidelines for manual test derivation to improve testing in practice.

ACKNOWLEDGEMENTS

This work was sponsored by the projects QE LaB – Living Models for Open Systems (FFG 882740) and MOBSTECO (FWF P 26194-N15). In addition, we thank all participants of the experiment for their time and concentration.

APPENDIX A

This appendix provides the diagrams used as experimental objects. Figure 5 shows the activity diagram and Figure 6 the state machine of the drink vending machine. Figure 7 shows the activity diagram and Figure 8 the state machine of the automated teller machine.


Figure 5. Test Model of Drink Vending Machine (DVM) Described as UML Activity Diagram (AD)

Figure 6. Test Model of Drink Vending Machine (DVM) Described as UML State Machine (SM)


Figure 7. Test Model of Automated Teller Machine (ATM) Described as UML Activity Diagram (AD)


Figure 8. Test Model of Automated Teller Machine (ATM) Described as UML State Machine (SM)
