TASK SEQUENCE EFFECTS IN USABILITY TESTS

Yiqi Li, Nina Hollender, Theo Held
SAP AG, Dietmar-Hopp-Allee 16 / UX, D-69190 Walldorf, Germany
ABSTRACT
How to order tasks in a usability test is an important question, since the task order can bias success rates, time on task and subjective measures such as satisfaction, and may thereby provide incorrect information for decision-making. We hypothesize that if a subsequent task comprises components similar to a previous task, performance on the subsequent task will improve due to learning and transfer, accompanied by more favorable subjective ratings. An experiment was carried out to examine this hypothesis using two pairs of similar tasks presented in different orders. In each pair, one task was more complex than the other. The task sequence turned out to have varying effects, depending on the kind of measure and the complexity of the subsequent task. Further implications for usability testing are discussed. One important point is that the task sequence should not only be set and monitored carefully; controlled transfer between tasks can also be utilized to improve the detection of specific usability problems.

KEYWORDS
Usability test, usability evaluation, task sequence, task complexity, transfer
1. INTRODUCTION

In usability tests, participants work on a set of tasks using a software program or prototype that is being tested. While the participants complete their tasks, their feedback as well as specific measures, such as completion rates and task times, are gathered (Dumas & Redish, 1999; Rogers, Sharp & Preece, 2011; Rubin & Chisnell, 2008). Since the general goal of usability tests is to reveal usability problems or to measure the actual usability reliably, we need to be aware of the influencing factors and how they impact the results. Several influencing factors have been discussed in the past, for example the number of participants (Sauro & Lewis, 2012) as well as moderator effects (Rosenthal, 1976). However, as Lindgaard and Chattratichart (2007) point out, too little attention has been paid to the impact of the test tasks compared to other aspects. They suggest that research should focus more on test tasks and how various aspects related to them influence task performance and other measures. Lindgaard and Chattratichart found significant correlations between the number of tasks in usability tests performed by different teams and the number of identified issues. Alshamari and Mayhew (2008) investigated how different task types influence which kinds of usability issues are found: tasks that guide the user through a workflow tend to reveal surface issues, while problem-solving tasks without guidance reveal severe usability issues. This article examines another potential factor, the effect of the task order in a usability test on different measures.

Different suggestions have been made on how to order tasks in a usability test. A prominent one is to sequence the tasks according to how they would most probably be ordered in real life, which aims at making the test setting more authentic. Two further options are to repeat the same or similar tasks within a test in case learning effects are of interest, and to put important tasks closer to the beginning of the test, since later tasks may not be completed by all participants. Another common recommendation is to vary the task sequence by balancing or randomizing the order of tasks in order to prevent sequencing effects (Rubin & Chisnell, 2008; Sauro, 2004). Interestingly, some suggestions explicitly aim at revealing sequencing effects, while others explicitly try to prevent them, depending on the purpose of the test.

While empirical research regarding tasks in usability tests is in general very rare, there is a larger research base regarding task design and task sequencing effects in educational research, with the goal of ordering tasks in a way that improves learning outcomes (Scheiter & Gerjets, 2007). It has been shown that the ordering of learning tasks with different attributes can impact performance (Scheiter & Gerjets, 2007; van Merriënboer & Kester, 2005). Often, studies examined whether "simple to complex" sequences of tasks could facilitate learning. The results vary, depending on the type of task. For example, Sweller (1976, 1980) showed that a simple-to-complex sequence accelerated the learning of numerical rules, whereas it did not affect the speed of solving specific problem-solving tasks (e.g. Tower of Hanoi).

Solving usability test tasks using a certain interface is a different type of task compared to the tasks studied in educational research. Based on prior knowledge as well as on exploring features of the interface, participants try to achieve the goal described in the test task, whereby the solution path is initially unknown. In the course of solving a task, initial mental models (e.g. Anderson, 2001) of the solution path and of the function of specific features are constructed. If the participant does not successfully solve a task, correct mental models may still be created if the moderator shows the solution to the participant, for example to get further feedback regarding specific features. When two tasks require the use of the same or similar features or even share several solution steps, the task in the subsequent position may be solved more effectively and efficiently and perceived as less difficult, because adequate mental representations of the overlapping feature functions and solution steps have already been built by solving the previous task. This effect should be larger if the solution steps of the subsequent task are completely contained in a previous task that comprises some additional steps. As to satisfaction, it can be assumed that satisfaction goes hand in hand with task success and efficiency, so that subsequent tasks should receive more favorable satisfaction ratings. The same assumption applies to other subjective measures such as the estimated task difficulty.

In the next section, an experimental study designed to test the following hypotheses is reported:
H1: Given that two tasks in the same task set comprise the use of similar features and solution steps, either of these two tasks will be solved more often, require less time and lead to more favorable satisfaction and difficulty ratings if it is presented in the subsequent rather than the initial position.
H2: If the subsequent task is fully contained in the previously completed task, the effect described in H1 will be larger.
2. METHOD

A laboratory experiment with a mixed design was conducted to examine the transfer effect between tasks. The within-subject factor was task type, while the between-subject factor was group/task order.
2.1 Materials and Design

2.1.1 Software
A user interface prototyping tool, CogTool version 1.2.1.0 (John, Prevas, Salvucci, & Koedinger, 2004), was used as the test platform in this experiment. This program supports the prediction of the time a skilled user needs to accomplish tasks using a certain design. The design can be modeled based on screenshots of a software prototype. In this study, three screenshots from Google Maps (maps.google.com) were used. CogTool was chosen as a relevant application for the selected sample of participants (usability professionals).
2.1.2 Task set
Two pairs of experimental tasks were constructed as the within-subject factor, in a way that the similarity of the tasks was high within each pair but low across pairs. Within each pair, the two tasks resembled each other structurally by requiring partially the same course of actions relating to similar features and functionalities. One task in each pair was more complex than the other by involving more complex features and additional solution steps. In contrast, the two pairs of tasks differed from each other structurally by requiring a different course of actions related to different features and functionalities. The pair "widget" (Table 1) consisted of two tasks that required defining specific widgets on a screenshot. The pair "transition" consisted of two tasks that asked participants to create a mouse transition or a keyboard transition from a widget in one screenshot to another screenshot. Similar tasks were worded similarly, so that the wording was consistent with the content.

Table 1. Example for a simple and a complex task and the corresponding steps for solving the tasks

Task "widget simple"
Description: Define the button "My places" on screen "start"
Step 1: Open the frame "start"
Step 2: Select the button widget tool in the tool box on the left
Step 3: Draw an orange rectangle covering the button "My places" on the screenshot
Step 4: Enter the button name "My places" into the properties pane on the right
Steps 5-8: -

Task "widget complex"
Description: Define the menu "Satellite" on the screen "find location" with the menu items "Traffic," "Photos" and "Weather" (visible on mouse-over)
Step 1: Open the frame "find location"
Step 2: Select the menu widget tool in the tool box on the left
Step 3: Draw an orange rectangle covering the square "Satellite" on the screenshot
Step 4: Enter the menu name "Satellite" into the properties pane on the right
Step 5: Select the option "Hover" in the submenu transition action in the properties pane
Step 6: Enter the name "Traffic" of the menu item into the blank field under the orange rectangle
Steps 7-8: Press return to create a new blank field and proceed accordingly to enter the other two menu items "Photos" and "Weather"
The whole task set consisted of six tasks: the two pairs of experimental tasks, a practice task and a control task. Both the practice task and the control task had nothing in common with the experimental tasks regarding the structure: the practice task required changing the name of a frame, while the control task required creating a new frame using an existing image.
2.1.3 Group/Task order
The between-subject factor group had two conditions: in one condition, the two simple tasks were presented prior to the two complex tasks; in the other condition, tasks were presented in exactly the reverse order. Table 2 displays the set-up of the study.

Table 2. Study set-up

Group           Practice   Position 1       Position 2           Position 3   Position 4           Position 5
Simple-first    practice   widget simple    transition simple    control      transition complex   widget complex
Complex-first   practice   widget complex   transition complex   control      transition simple    widget simple
2.2 Participants, Procedure, and Measures
Twenty user experience professionals (interface designers and user researchers) volunteered to participate in the study. None of them had prior experience with CogTool. They were assigned randomly to the two experimental conditions.

Participants were seated in front of a PC monitor with a standard keyboard and a mouse. The display was extended to a laptop screen placed on the left side of the monitor. The PC monitor served as the main screen with the application to be worked on (CogTool). The laptop screen was used to display the test tasks in a font size of 24 pt. Participants read the task description once before working on the task. Once they turned to work on the task in CogTool, the laptop screen went black and the task description remained invisible as long as participants kept working in CogTool. However, participants were allowed to access the task description as often as required by moving the mouse to the laptop screen and clicking once.

Participants were first informed about the procedure, the timing of the ratings, and the way in which the task description could be accessed. They were instructed to take time to read the test script carefully when they saw a task for the first time. They were also told that the moderator would not give any assistance or feedback. Next, participants read a brief introduction to CogTool which outlined the main design concepts that were relevant for the test tasks. During this time, they were given the opportunity to ask questions about the introduction. They then read the test scenario and completed the practice task to get familiar with the procedure. Afterwards, participants read the description of each test task, rated the expected difficulty and worked on the tasks in the prescribed order according to the experimental condition they had been assigned to. The moderator did not indicate whether a task had been completed successfully or not. Participants were allowed to work on a task until they solved it or gave up. If they gave up, the solution was demonstrated to them by the moderator after the actual difficulty and satisfaction ratings. After that, participants moved on to the next task. At the end of the session, the participants were compensated with a small gift.

The following variables were measured as central indicators for each task: task success (yes vs. no), total time spent on the task (regardless of success), satisfaction, expected difficulty before working on a task, and actual difficulty after completing a task (the latter three each on a 7-point rating scale). To examine the transfer effect, the baseline performance, i.e. a user's performance on a task in the initial position, has to be compared to the performance on the same task in the subsequent position. Note that, as a result of the mixed design, the position of a certain task is determined jointly by the group (simple-first/complex-first) and the task type (simple/complex). For the simple tasks, the initial position was realized in the simple-first condition; for the complex tasks, in contrast, the initial position was realized in the complex-first condition. Thus, a transfer predicts an interaction between the two factors group and task type. Furthermore, this interaction effect should be larger for the simple tasks, because their solution steps are almost fully contained in the complex tasks.
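To make this measurement scheme concrete, the following sketch shows one hypothetical way the per-task measures could be recorded in long format (one row per participant and task) for later analysis. The participant IDs, column names such as log_time, and all values are made up for illustration and are not taken from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format layout: one row per participant and task.
# Column names and values are illustrative, not the original data.
records = [
    # participant, group, task, success, time_s, satisfaction, expected_difficulty, actual_difficulty
    ("P01", "simple-first",  "widget simple",  1, 185, 5, 3, 3),
    ("P01", "simple-first",  "widget complex", 0, 420, 2, 5, 6),
    ("P02", "complex-first", "widget complex", 0, 510, 1, 4, 6),
    ("P02", "complex-first", "widget simple",  1,  70, 5, 3, 2),
]
df = pd.DataFrame(records, columns=[
    "participant", "group", "task", "success",
    "time_s", "satisfaction", "expected_difficulty", "actual_difficulty",
])

# Time on task is right-skewed, so the analyses use its natural logarithm.
df["log_time"] = np.log(df["time_s"])
print(df)
```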
3. RESULTS

One participant from the simple-first group did not complete the last task ("widget complex"). The measures of this person on this task were treated as missing values. Table 3 displays the means and standard deviations for all the metric dependent variables. According to distribution tests, the metric dependent variables could be assumed to be normally distributed. Since they were significantly correlated with each other (ranging from r = -.79 to r = .72), a doubly multivariate analysis (Tabachnick & Fidell, 2012) was performed to analyze all the metric dependent variables simultaneously. The between-subject factor was group and the within-subject factor was task type. The dependent variables were the logarithm of the task solving time (for an explanation why the logarithm of time on task should be used, see Sauro & Lewis, 2012) and the ratings for satisfaction, expected difficulty and actual difficulty.

Table 3. Means and standard deviations of the metric variables

Measure                     Group           Control       Widget Simple   Widget Complex   Transition Simple   Transition Complex
                                            M     SD      M     SD        M     SD         M     SD            M     SD
Logarithm of solving time   Simple-first    4.80  0.48    5.42  0.62      5.83  0.28       5.07  0.83          5.29  0.63
                            Complex-first   4.81  0.38    4.18  0.49      6.15  0.60       3.94  0.51          5.64  0.40
Satisfaction                Simple-first    4.20  1.32    2.70  1.77      2.11  1.36       3.80  2.44          3.70  1.64
                            Complex-first   3.70  2.21    4.50  2.12      1.50  1.27       4.40  2.27          2.70  2.36
Expected difficulty         Simple-first    4.10  1.91    3.40  1.65      4.60  1.58       3.70  1.34          4.50  1.72
                            Complex-first   4.50  2.27    3.60  2.12      4.30  1.25       3.90  2.08          5.10  1.79
Actual difficulty           Simple-first    3.20  1.23    5.40  1.96      5.67  1.87       3.80  2.53          4.60  1.96
                            Complex-first   3.60  1.71    2.80  1.93      5.90  1.60       3.10  2.18          4.80  2.53
The doubly multivariate analysis showed that the within-subject factor task type had a significant main effect according to Hotelling's criterion, F (16, 254) = 10.55, p < .001. The group by task type interaction was also significant, F (16, 254) = 4.34, p < .001. This means that, regardless of the group, all participants performed differently on the different task types, yet the pattern of these differences varied between the two groups, which indicates a transfer effect on at least one metric measure for at least one task. The main effect of the between-subject factor group was not significant, F (4, 14) = 1.68, p = .21, which indicates that the two groups performed on the same general level when all tasks are taken into account. Our central hypothesis (H1) makes predictions only concerning the interaction effect of the two factors. Therefore, corresponding interaction contrasts with Scheffé adjustment (Tabachnick & Fidell, 2012) were examined separately for each metric dependent variable at each level of the within-subject factor task type, to reveal the pattern of the transfer effect on each task.
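The doubly multivariate analysis and the Scheffé-adjusted contrasts are not reproduced here. As a rough, univariate sketch of how a group by task interaction could be tested for a single dependent variable on data in the long format shown earlier, a mixed ANOVA on simulated values is shown below; the use of the pingouin package and all data values are assumptions for illustration, not the authors' procedure.

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(0)
tasks = ["widget simple", "widget complex", "transition simple", "transition complex"]

# Simulated, hypothetical long-format data (10 participants per group),
# used only to demonstrate the call; the values do not come from the study.
rows = []
for i in range(20):
    group = "simple-first" if i < 10 else "complex-first"
    for task in tasks:
        rows.append((f"P{i:02d}", group, task, rng.normal(5.0, 0.5)))
df = pd.DataFrame(rows, columns=["participant", "group", "task", "log_time"])

# Univariate sketch of the group x task interaction for one dependent
# variable; this approximates, but does not replicate, the reported
# doubly multivariate analysis with Scheffé-adjusted contrasts.
aov = pg.mixed_anova(data=df, dv="log_time", within="task",
                     between="group", subject="participant")
print(aov)
```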
3.1 Success rate
Figure 1 illustrates the success rate on each task in the two groups. For each task, the success rate was descriptively higher in the subsequent position. Being a categorical variable, success rate could not be analyzed within the doubly multivariate analysis. Thus, Barnard's unconditional test (Barnard, 1945) was performed to examine whether the descriptive differences reflect a nonrandom association between success rate and group. For the task "transition complex", the success rate increased significantly to 90% in the subsequent position (simple-first group), compared to the baseline of 40%, p < .05 (one-sided). Correspondingly, the effect size was d (Cox logit) = 1.58 with confidence limits from .11 to 3.04, which indicates a strong effect. For the task "widget complex", the success rate rose from 20% in the initial position to 44% in the subsequent position. However, this increase did not reach the significance level (one-sided p = .17). Taking the small sample into account, the divergent results for the two complex tasks are not necessarily due to an actual difference in transfer but could be a result of low statistical power.
Figure 1. Success rate on each task in the two groups. "*" indicates that the difference between two values is significant at the level p < .05.
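As an illustration of this success-rate analysis, the sketch below recomputes Barnard's unconditional test and the Cox logit effect size for "transition complex", assuming success counts of 9/10 (subsequent position) and 4/10 (initial position) inferred from the reported percentages. The use of SciPy's barnard_exact and the ln(OR)/1.65 formula are assumed reconstructions of the computation, not taken from the paper.

```python
import numpy as np
from scipy.stats import barnard_exact

# 2x2 table for "transition complex", with the two groups as columns
# (SciPy's convention) and success/failure as rows; counts are inferred
# from the reported 90% vs. 40% success with 10 participants per group.
#                  subsequent  initial
#        success        9         4
#        failure        1         6
table = np.array([[9, 4],
                  [1, 6]])

# One-sided Barnard's unconditional test (higher success rate in the
# subsequent position); the paper reports a one-sided p < .05.
res = barnard_exact(table, alternative="greater")
print(f"Barnard's test: one-sided p = {res.pvalue:.3f}")

# Cox logit effect size: log odds ratio divided by 1.65 -- an assumed
# reconstruction that reproduces the reported d = 1.58 for these counts.
odds_ratio = (9 * 6) / (1 * 4)
d_cox = np.log(odds_ratio) / 1.65
print(f"d (Cox logit) = {d_cox:.2f}")
```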
To gain a better understanding, a qualitative classification of the failures was carried out. A review of the video recordings revealed that 3 out of the 8 participants who failed the task "widget complex" in the initial position failed at finding the relevant tool box or at using the required tool appropriately to create a widget. The other 5 made the mistake of creating a widget of another type that was similar to the requested widget type. The first type of mistake involves the part common to the two widget tasks, whereas the second type involves merely the part specific to the task "widget complex." In comparison, all 5 participants who failed the same task in the subsequent position made exactly the second type of mistake. The absence of the failure type related to the common part suggests that a transfer from "widget simple" to "widget complex" did happen but was constrained to the first steps of the solution. Since the failure type reduced by the transfer accounted for only the smaller part of the original failures, the transfer was not sufficient to produce a significant increase in the success rate for the whole task. The failure classification for the task "transition complex" yielded the opposite pattern: 4 out of the 6 failures on this task in the initial position were ascribed to mistakes related only to the common part, whereas the other 2 were due to mistakes related only to the specific part. In comparison, the single failure on the same task in the subsequent position was of the second type. This supports the account that a transfer increases the probability of solving a task by reducing the failures related to the part shared with the previous task, when this failure type makes up the larger part of the original failures.

For the simple tasks, the increases did not reach the significance criterion, p = .12 (one-sided) for "widget simple" and p = .46 (one-sided) for "transition simple." Note that there is evidence for a ceiling effect: 80% of the participants already solved both tasks in the initial position, which constrained the maximum possible increase to merely 20%. This could account for the fact that the increases were not significant, although the success rate reached 100% and 90% for the two tasks in the subsequent position, respectively.
3.2 Time on Task
The geometric mean (see Sauro & Lewis, 2010) is reported in Figure 2 to characterize the central tendency of the time on task for each task in the two groups.
Figure 2. Geometric mean of the time on task in seconds on each task in the two groups. "**" indicates that the difference between two values is significant at the level p < .001; "***" indicates that the difference is significant at the level p < .0001.
Participants spent descriptively less time on each task if it appeared in the subsequent position. This decrease was significant for both simple tasks: for "widget simple," F (1, 18) = 24.26, p < .0001, with a corresponding effect size of Cohen's d = 2.20 with confidence limits from 1.09 to 3.31; for "transition simple," F (1, 18) = 13.59, p < .005, with an effect size of d = 1.65 with confidence limits from .63 to 2.66. Both are strong effects. For both complex tasks, the decreases did not reach the significance criterion, F (1, 17) = 2.20, p = .16 for "widget complex" and F (1, 18) = 2.21, p = .15 for "transition complex."
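The sketch below illustrates the two reporting choices used in this section: the geometric mean as the central tendency of time on task, and Cohen's d on the log-transformed times with an approximate 95% confidence interval. The task times are made-up example values, and the d and CI formulas are standard textbook approximations assumed here, not necessarily the exact computation the authors used.

```python
import numpy as np

def geometric_mean(times_s):
    """Geometric mean of raw task times (exp of the mean log time)."""
    return float(np.exp(np.mean(np.log(times_s))))

def cohens_d_with_ci(x, y, z=1.96):
    """Cohen's d for two independent samples (here: log times) with an
    approximate 95% CI based on a normal approximation of its SE."""
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    d = (np.mean(x) - np.mean(y)) / pooled_sd
    se = np.sqrt((nx + ny) / (nx * ny) + d**2 / (2 * (nx + ny)))
    return d, d - z * se, d + z * se

# Made-up times (seconds) for one task in the two positions, 10 per group.
rng = np.random.default_rng(1)
initial_times = np.exp(rng.normal(5.4, 0.6, 10))     # hypothetical values
subsequent_times = np.exp(rng.normal(4.2, 0.5, 10))  # hypothetical values

print("Geometric means (s):",
      round(geometric_mean(initial_times)),
      round(geometric_mean(subsequent_times)))

d, lo, hi = cohens_d_with_ci(np.log(initial_times), np.log(subsequent_times))
print(f"Cohen's d = {d:.2f}, approx. 95% CI [{lo:.2f}, {hi:.2f}]")
```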
3.3 Satisfaction
In general, the satisfaction rating was descriptively higher when a task was performed in the subsequent position. However, none of the increases reached the significance criterion. The most distinct increase was for "widget simple," F (1, 18) = 4.25, p = .054. For the other three tasks, the increase was negligible: F (1, 18) = .32, p = .58 for "transition simple," F (1, 17) = 1.02, p = .33 for "widget complex" and F (1, 18) = 1.21, p = .29 for "transition complex."
3.4 Difficulty
Regarding the expected difficulty, the tasks were descriptively rated as less difficult when presented in the subsequent position, with the exception of "transition complex." However, none of the differences reached the significance criterion: F (1, 18) = .06, p = .82 for "widget simple," F (1, 18) = .07, p = .80 for "transition simple," F (1, 17) = .22, p = .64 for "widget complex" and F (1, 18) = .59, p = .45 for "transition complex."

Regarding the actual difficulty, all tasks were descriptively rated as less difficult when presented in the subsequent position. The effect was significant only for the task "widget simple," F (1, 18) = 8.95, p < .01. The effect size was d = 1.34 with confidence limits from 0.37 to 2.31, which indicates a strong effect. The decrease was negligible for the other tasks: F (1, 18) = .44, p = .52 for "transition simple," F (1, 17) = .04, p = .77 for "widget complex" and F (1, 18) = .09, p = .85 for "transition complex."

For the control task, the two groups differed in neither the mean nor the variance of any of the performance and rating measures. This indicates that the differences found on the other tasks are not due to pre-existing differences between the group members.

To sum up the findings, hypothesis 1 was supported with regard to success rate and time on task, whereas hypothesis 2 was supported only with regard to time on task. If similar tasks are presented in the same task set, the one presented in the subsequent position is more likely to be solved or requires less time, compared with the initial position. This position advantage manifested itself for the simple tasks mainly in the time on task and for the complex tasks mainly in the success rate. This pattern of findings can be accounted for by the following explanation: transfer occurs above all in the common part of the task solutions, to the extent that the easiness of this part allows; in addition, the relative proportion of this part to the whole task also influences whether a significant improvement in success rate can be observed. The simple tasks in this study are by design contained in the complex tasks and consequently were more likely to be solved and required less time. Therefore, a ceiling effect may have occurred for the simple tasks, so that the effect of transfer on the success rate was concealed. Subjective ratings, in contrast, appeared not to be affected much by the sequence of tasks. This provides evidence for the account that users adjust their expectations and judgments, probably because they were aware of the transfers and attributed their performance improvement to previous learning experiences from the initial tasks or to the moderator's demonstration of the solution.
4. CONCLUSION

The results imply that usability test results can be influenced by the task sequence to a considerable extent. In the presented study, this influence was found for performance on and ratings of individual tasks. The bias appears to affect performance at the task level but not the overall performance. The final outcome of a usability test can nevertheless be biased, since it is common practice to benchmark each usability measure separately at the task level before the measures are integrated into an overall assessment of the product. In view of this, a self-evident recommendation is to randomize the order of test tasks, especially when the task set includes component-wise overlapping tasks.

Although randomizing the task order can provide a more precise estimate of the true usability of each task, it remains a conservative strategy. Taking the usually small number of participants (in most cases below 20) into consideration, the task sequences that can actually be realized in a usability test make up only a small subset of the possible permutations (k! for k tasks). Consequently, it is doubtful that randomizing can statistically counterbalance sequence effects as intended. In such small-scale settings, what randomizing does is just to "leave it to chance." This is especially grave when the effect size of the influencing factor is large. A more sustainable approach in the long run is to control unwanted impacts carefully so that they can be estimated precisely and taken into account when decisions are made based on the usability test results (e.g. design decisions).

First, the presence and absence of transfer, respectively, can help to detect usability problems. Given the knowledge of transfers, we can expect that test participants will necessarily learn to handle a certain user interface during the course of a usability test. Accordingly, transfer from task to task ought to lead to certain tasks being solved more often, faster, and in a more satisfying way when their components are encountered repeatedly. If this expected learning evidently does not happen, the existence of an underlying usability problem can be surmised. Typical cases are interaction sequences that are extremely complicated and unintuitive, or user interface elements with markedly low or incorrect affordance. Secondly, in combination with an error categorization, we can quickly find out which interface features play the most important role for the solution of a task, as shown in the result analysis section. Furthermore, how fast users acquire the rules and concepts of an interface or of certain interface elements can reveal useful information about users' expectations, needs and thinking styles, which can help to develop more usable software as well as to improve user training. However, these strategies require a deeper understanding of the underlying mechanisms of users' cognitive processes and a precise description of the involved factors. This cannot be achieved within the framework of traditional testing and data analysis. Thus, further research should be directed at model-based analyses, with the aid of which the influence of each involved factor can be estimated as a concrete parameter.
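To make the k! argument above concrete, a minimal sketch follows, assuming the five test tasks of the present study (with the practice task fixed first) and a comparatively large sample of 20 participants; the numbers are purely illustrative.

```python
import math
import random

# The five test tasks from the present study (the practice task is fixed first).
tasks = ["widget simple", "transition simple", "control",
         "transition complex", "widget complex"]

n_orders = math.factorial(len(tasks))   # 5! = 120 possible sequences
n_participants = 20                     # a comparatively large usability-test sample

print(f"{n_orders} possible task orders, {n_participants} participants: "
      f"at most {n_participants / n_orders:.0%} of the permutations can be realized.")

# Randomizing merely samples a handful of these orders by chance:
random.seed(42)
for _ in range(3):
    print(random.sample(tasks, k=len(tasks)))
```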
ACKNOWLEDGEMENT

The authors wish to express their gratitude to Mr. Christopher Lafleur for his helpful comments and language editing.
REFERENCES

Alshamari, M. and Mayhew, P., 2008. How different types of task can influence usability testing. IADIS International Conference Interfaces and Human Computer Interaction. Amsterdam, The Netherlands, pp. 244-248.
Anderson, J. R., 2001. Kognitive Psychologie. 3rd Edition. Heidelberg, Spektrum Akademischer Verlag.
Barnard, G. A., 1945. A new test for 2x2 tables. Nature. Vol. 156, No. 3954, p. 177.
Dumas, J. S. and Redish, J. C., 1999. A Practical Guide to Usability Testing. Revised Edition. Exeter, Intellect.
John, B., Prevas, K., Salvucci, D., and Koedinger, K., 2004. Predictive Human Performance Modeling Made Easy. Proceedings of the Conference on Human Factors in Computing Systems (CHI 2004). Vienna, Austria, pp. 455-462.
Lindgaard, G. and Chattratichart, J., 2007. Usability testing: what have we overlooked? Proceedings of the Conference on Human Factors in Computing Systems (CHI 2007). San Jose, California, USA, pp. 1415-1424.
Rogers, Y., Sharp, H., and Preece, J., 2011. Interaction Design: Beyond Human-Computer Interaction. 3rd Edition. New York, John Wiley.
Rosenthal, R., 1976. Experimenter Effects in Behavioral Research. New York, John Wiley.
Rubin, J. and Chisnell, D., 2008. Handbook of usability testing: how to plan, design and conduct effective tests. 2nd Edition. Indianapolis, Wiley Pub.
Sauro, J., 2004. The Importance of Task Order Randomizing during a Usability Test. measuringusability.com. Retrieved December 1, 2012, from http://www.measuringusability.com/random.htm
Sauro, J. and Lewis, J. R., 2010. Average Task Times in Usability Tests: What to Report. Proceedings of the Conference on Human Factors in Computing Systems (CHI 2010). Atlanta, Georgia, USA, pp. 2347-2350.
Sauro, J. and Lewis, J. R., 2012. Quantifying the user experience: practical statistics for user research. Amsterdam, Elsevier.
Scheiter, K. and Gerjets, P., 2007. Making your own order: order effects in system- and user-controlled settings for learning and problem solving. In Ritter, F. E., Nerb, J., Lehtinen, E., O'Shea, T. M. (eds.), In order to learn: how the sequence of topics influences learning. New York, Oxford University Press. pp. 195-212.
Sweller, J., 1976. The Effect of Task Complexity and Sequence on Rule Learning and Problem Solving. British Journal of Psychology. Vol. 67, No. 4, pp. 553-558.
Sweller, J., 1980. Transfer effects in a problem solving context. Quarterly Journal of Experimental Psychology. Vol. 32, No. 2, pp. 233-239.
Tabachnick, B. G. and Fidell, L. S., 2012. Using multivariate statistics. 6th Edition. Munich, Pearson.
Van Merriënboer, J. J. and Kester, L., 2005. The four-component instructional design model: Multimedia principles in environments for complex learning. In Mayer, R. E. (ed.), The Cambridge handbook of multimedia learning. New York, Cambridge University Press. pp. 71-93.