WHAT KINDS OF USABILITY-PROBLEM DESCRIPTION ARE USEFUL TO DEVELOPERS?
Kasper Hornbæk & Erik Frøkjær
Department of Computer Science, University of Copenhagen, Denmark
Descriptions of problems found in usability evaluations aim to help developers improve an interactive system. However, little is known about what makes a problem description useful to developers. The present paper describes how four developers assessed the utility of 619 usability problems and relates their assessments to characteristics of how the problems were described. Developers find problem descriptions that are clear, propose solutions, and elaborate on why something is a problem to be of significantly higher utility than problems without such information. Problems coded as persistent for expert users are assessed as being of higher utility than descriptions of novice users' problems. Reference to observable user actions made no difference to developers' assessment of utility. While developers did not see problems produced with an empirical and an inspection method as differing in utility, the methods differed in how problems were described. We conclude by discussing recommendations for how to report the results of usability evaluations.
INTRODUCTION
Many models of system development and user-centered design recommend that usability evaluation form part of both early and late activities in the development lifecycle. To support this, a host of usability evaluation methods is available for practitioners to choose from. These methods have been extensively described in the scientific literature on human-computer interaction (e.g., Nielsen & Molich, 1990) and in handbooks aimed at practitioners (e.g., Dumas & Redish, 1999). In general, however, evaluation methods and handbooks for practitioners offer little advice on how to report the findings of a usability evaluation. Besides an early paper by Jeffries (1994), only recently have papers appeared that present data on how to describe usability problems (e.g., Dumas, Molich & Jeffries, 2004; Capra & Smith-Jackson, 2005). Those papers, however, are based on the opinions of usability specialists. While they give valuable advice, the uptake and use of deliverables from usability evaluations are in many contexts handled mainly by developers and/or designers. Their opinions may differ from those of usability specialists, which would invalidate the above recommendations on how to describe usability problems. This paper therefore explores what kinds of usability-problem description developers consider useful in their work. In the context of a large web application, developers' assessments of usability-problem descriptions were investigated across various
characteristics of the problems' content and presentation. The aim is to provide data on developers' needs and wishes regarding descriptions of usability problems, and thereby to give practitioners advice on how to describe such problems.
METHOD
Reporting of usability problems
The usability problems used in the present study were reported by 43 undergraduate and graduate students who evaluated one of the largest job portals in Denmark, Jobindex (see www.jobindex.dk). Twenty-one evaluators received Molich (2003) as a description of think-aloud user testing (an empirical usability evaluation method); twenty-two evaluators received Hornbæk and Frøkjær (2002) as a description of the usability inspection method called metaphors of human thinking. The evaluators had one week to conduct the evaluation and performed it individually. They were told to spend approximately eight to ten hours performing and reporting the evaluation. For each usability problem identified, evaluators were instructed to give (a) a brief title, (b) a detailed description, and (c) a seriousness rating. A total of 619 problems were reported.
Developers' assessment of usability problems
In practical usability work, the development team plays a decisive role in choosing which usability problems to correct and which redesign proposals to follow. Therefore, the usability problems were assessed by four core members of the development team at Jobindex, here referred to as the developers. The developers individually assessed a selection of problem descriptions: one developer rated all 619 problems; the other developers rated those problems that concerned parts of the application they worked on. Developers assessed each problem on four dimensions; see Hornbæk and Frøkjær (2005). Here we are only interested in the assessment of utility, as indicated by the question: "How useful is the problem in the further development of Jobindex? Does the description of the usability problem or redesign suggestion contain something valuable that you want to use in the future development of Jobindex, for example because you find something new or get ideas for improvements?" To answer this question, developers put a cross on a continuous, graphical rating scale (shown as a 100 mm horizontal line) with end-points labeled "not useful" and "very useful". We quantify utility as the number of millimeters from the "not useful" end-point to the place where the developer put the cross, giving a value between 1 and 100 for every usability problem. In this paper we are not interested in the absolute values of utility reported, nor do we place any special emphasis on the granularity of the scale. In cases where several developers rated the utility of a problem, we used the average of their ratings.
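As a minimal illustration of this quantification, the following Python sketch (with hypothetical millimeter marks, not the study's data) computes the utility score of each problem as the mean of the marks of the developers who rated it:

ratings_mm = {          # hypothetical marks, in mm from the "not useful" end-point
    "P1": [34, 41],     # a problem rated by two developers
    "P2": [12],         # a problem rated by one developer
}
utility = {pid: sum(marks) / len(marks) for pid, marks in ratings_mm.items()}
print(utility)          # {'P1': 37.5, 'P2': 12.0}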
Independent judgment of usability problems
To understand the benefit of different kinds of usability-problem description, one author of the present paper judged each usability problem on five aspects of description. For each aspect it was judged whether the problem provided or failed to provide a description containing the aspect under consideration. Problems were judged in random order and blind to which evaluation method had helped identify a particular problem. The five aspects are presented in turn below, and their relation to the literature is discussed. We judged a problem as containing a solution proposal if it gives one or more recommendations for how to change the application under evaluation. The motivation for investigating whether developers value proposals for solutions is that such solutions have been
suggested as a useful outcome of usability evaluation (Hornbæk & Frøkjær, 2005). However, it remains a debated point whether usability professionals should or should not provide solutions (Capra & Smith-Jackson, 2005). Operationally, we looked for mentions of user interface elements that should be added, removed, or changed. We did not consider the identification of missing functionality a solution proposal unless it was accompanied by a suggestion for how to design or implement the missing functionality. We judged a problem as being persistent if even users proficient with the system would experience the problem. We expected problems that persisted for expert users to be valued more highly by developers than novice-only problems. Operationally, persistence was defined as whether experienced users (i.e., persons who had used an interface more than 10 times within a week) would experience the problem more than half of the times they were in a situation similar to that in which the problem occurred. Descriptions of missing functionality were considered persistent problems; we did not consider the existence of a work-around to imply a novice-only problem. We also judged the extent to which a problem was justified, that is, whether it was clearly described what posed a problem and why it was a problem. Many authors have suggested that providing clear justification is key to useful descriptions of usability problems (e.g., Jeffries, 1994; Lavery, Cockton & Atkinson, 1997). Jeffries (1994), for example, suggested that developers often go directly to describing solutions, without being explicit about what poses a problem to the user. However, the extent to which a justification matters to developers (or whether they easily infer it on their own) is unclear. Operationally, we examined whether it was clearly described what constituted the problem and why it would pose a problem to the user. We also judged the degree to which a problem described observable user actions. Describing observed user actions has been suggested as an important quality of usability-problem descriptions (Capra & Smith-Jackson, 2005). It may also make problem descriptions easier to understand, even for problems generated by inspection techniques (Connell, Blandford & Green, 2004), and appears to catch developers' attention (Hornbæk & Frøkjær, 2005). Operationally, we looked for words like he, she, user, participant, and so forth, and checked whether they were used in a sense where they referred to observed (or concretely predicted) user actions. It seems obvious that usability problems should be clear, that is, easily understandable. Dumas et al. (2004), for example, suggested being careful to avoid usability
jargon in the description of usability problems. Therefore, we also judged the clarity of the problem descriptions. Operationally, we distinguished whether or not the reader was left with a relatively sure understanding of what was intended on the part of the evaluator. This aspect was also assessed by Hornbæk and Frøkjær (2004). To check the reliability of the ratings, two raters each coded a random sample of 10% of the problems. This was done independently of the authors of this paper and of the other rater. The average Cohen's kappa values for the inter-rater reliability between the author and each of the raters were .63 and .66; according to Landis and Koch (1977), this indicates substantial agreement.
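For readers who wish to compute such reliability figures themselves, the following Python sketch shows the calculation on hypothetical binary codes (1 = aspect present, 0 = absent); the data are invented for illustration and are not the codes used in this study:

from sklearn.metrics import cohen_kappa_score

# Hypothetical codes for the same ten problems by the author and one independent rater
author_codes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_codes  = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa_score(author_codes, rater_codes)
print(f"Cohen's kappa: {kappa:.2f}")  # values of .61-.80 count as substantial agreement (Landis & Koch, 1977)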
RESULTS
Table 1 summarizes the results of the independent judgments. For each of the five aspects, the table gives the number of problems where the aspect is present (the row labeled Yes) or absent (the row labeled No). For each of these possibilities, the mean and standard deviation of developers' utility assessments are given. The rightmost column of the table reports a test of the difference between the utility ratings for the presence and absence of a particular description aspect. The results indicate that the presence of a solution proposal made developers assess a problem as being of higher utility in their work compared to problems offering no solution. The utility score increases by about 15% when a solution proposal is present. Whether developers actually use or intend to use the solution proposals is unclear. Given the sketchy nature of most solution proposals, we are inclined to think that the main contribution of a solution proposal is to suggest a direction to pursue in alleviating a problem and to enable developers to form a more complete understanding of the problem. Usability-problem descriptions judged as persistent for expert users were of more utility to developers: the difference between the utility of non-persistent (M = 22.4) and persistent problems (M = 27.2) was significant. One characteristic of persistent problems is that they are more complex and less obvious than novice-only problems; developers apparently appreciate this kind of information. Providing a justification for problems appears to increase developers' assessment of the utility of a problem significantly (by about 8%). It seems that explicit information on why something is a problem (e.g., a description of the behavioral consequences of a difficulty) is useful and non-trivial to developers.
Describing problems in terms of observable user actions had no significant influence on the utility assessment of problems. This is contrary to our expectations and to the stated opinions of the developers, who during interviews had articulated a great reliance on problems that explicitly mentioned users and their actions (see Hornbæk & Frøkjær, 2005). Unsurprisingly, problems that were judged as unclear were also seen to have significantly lower utility (M = 22.4) compared to clear problem descriptions (M = 26.0). Note that only 55 problems were assessed as unclear; most of these were difficult to understand because they were very brief. We may also use our data to say something about the difference in problem descriptions between the empirical method, think-aloud user testing (TA), and the inspection method, metaphors of human thinking (MOT). The rationale behind comparing methods is to investigate whether one method or the other is more likely to lead to descriptions of a particular kind. If this is the case, evaluators might need to take extra care in providing the relevant description aspects when they use a particular method. Table 2 summarizes the relevant data. We find no difference between methods in whether solution proposals are given, whether a problem is persistent, whether observable user actions are described, or in the clarity of the problem descriptions. However, Table 2 contains one significant result regarding the justification of problems. Problems identified with MOT are more likely to contain a justification than problems found with TA; the difference is relatively small, about eight percentage points.
Table 1: Utility ratings for the five independently-judged description aspects.

Description aspect          Level   N     Utility M   SD     Sig.
Solution proposal           Yes     152   28.4        12.8   t(617) = 3.44, p < .001
                            No      467   24.8        10.6
Persistence                 Yes     418   27.2        11.9   t(617) = 5.08, p < .001
                            No      201   22.4         9.15
Justification for problem   Yes     260   26.8        11.2   t(617) = 2.24, p < .05
                            No      359   24.8        11.3
Observable user action      Yes     180   26.0        10.4   t(617) = 0.45, p > .5
                            No      439   25.5        11.6
Clarity                     Yes     564   26.0        11.1   t(617) = 2.28, p < .05
                            No       55   22.4        12.4
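The t-tests in Table 1 can be reproduced approximately from the summary statistics alone. The Python sketch below uses the Solution proposal row and assumes an equal-variance, independent-samples t-test; it recovers t(617) ≈ 3.45, which matches the reported 3.44 up to rounding of the table values:

from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=28.4, std1=12.8, nobs1=152,   # problems with a solution proposal
                            mean2=24.8, std2=10.6, nobs2=467,   # problems without one
                            equal_var=True)
print(f"t(617) = {t:.2f}, p = {p:.4f}")  # approximately t(617) = 3.45, p < .001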
DISCUSSION
This study has presented initial data on what kinds of usability-problem description developers find useful in their work. In contrast to the existing literature on usability evaluation, we focus on developers' perception of utility. Although such perceptions may be inherently biased, they play a large role in determining which usability problems get addressed and in shaping the extent to which usability evaluations impact systems development. This study has shown that developers see solution proposals as being of utility. This finding challenges the belief, held by some usability professionals and presented in the literature, that such information would be irrelevant to developers. We have also shown how aspects of the content of usability problems (e.g., the persistence of a problem) to some extent impact developers' assessment of usability problems. In this study we have not investigated whether other factors could have affected the utility ratings by co-varying with persistence. Further research could explore other kinds of problem content that developers value, so as to develop recommendations about what to focus on in reporting usability tests. It appears that justifications for usability problems are useful to developers; which kinds of justification (HCI principles, observable user difficulties, etc.) are most valued remains an issue to be explored further. We have also investigated differences between methods in terms of the kinds of usability-problem description a particular method is most likely to produce. Interestingly, think-aloud testing did not differ from metaphors of human thinking on the obvious parameters of providing a solution and of reporting observable user actions. Rather, problems found using the metaphors were more frequently seen as providing a justification. Note that according to most common definitions of effect size, all the effects reported in this paper are small. However, the data contain a lot of variance, in particular related to individual differences in the description style employed and to the content of the problems. An obvious step for future work is to validate the findings in a systematic manner, where the reporting format is varied according to the factors discussed in this paper rather than, as done here, coded after the problems were described and rated. For the purpose of practical usability work, we present four recommendations for describing usability problems: (1) include solution proposals in the description of usability problems; (2) justify why something is a problem, possibly by referring to the behavioral consequences of a problem or to general principles for usable design; (3) in addition to the typical novice-only problems, also present descriptions of usability problems that are complex and persistent for users; and (4) make problem descriptions long enough to be understandable by other people.
Table 2: Differences in description aspects between problems found with an empirical (TA, N = 321) and an analytic evaluation method (MOT, N = 298).

Description aspect          Method   Yes    No     Sig.
Solution proposal           TA       22%    78%    χ2 = 1.63, n.s.
                            MOT      27%    73%
Persistence                 TA       65%    35%    χ2 = 2.27, n.s.
                            MOT      70%    30%
Justification for problem   TA       38%    62%    χ2 = 4.37, p < .05
                            MOT      46%    54%
Observable user action      TA       31%    69%    χ2 = 1.39, n.s.
                            MOT      27%    73%
Clarity                     TA       90%    10%    χ2 = 0.97, n.s.
                            MOT      93%     7%
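Similarly, the chi-square tests in Table 2 can be approximated by reconstructing the contingency tables from the rounded percentages and group sizes. The Python sketch below does this for the justification aspect; because the percentages are rounded, the resulting statistic (about 4.0) only approximates the reported χ2 = 4.37:

from scipy.stats import chi2_contingency

ta_yes = round(0.38 * 321)             # 122 TA problems judged to contain a justification
mot_yes = round(0.46 * 298)            # 137 MOT problems judged to contain a justification
table = [[ta_yes, 321 - ta_yes],
         [mot_yes, 298 - mot_yes]]
chi2, p, dof, expected = chi2_contingency(table, correction=False)  # no Yates correction
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")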
REFERENCES
Capra, M. & Smith-Jackson, T. (2005). Developing guidelines for describing usability problems (Rep. No. ACE/HCI-2005002). Blacksburg, VA: Virginia Tech.
Connell, I., Blandford, A., & Green, T. (2004). CASSM and cognitive walkthrough: usability issues with ticket vending machines. Behaviour & Information Technology, 23(5), 307-320.
Dumas, J., Molich, R., & Jeffries, R. (2004). Describing usability problems: Are we sending the right message? Interactions, 4, 24-29.
Dumas, J. & Redish, J. (1999). A practical guide to usability testing (2nd ed.). Exeter, England: Intellect.
Hornbæk, K. & Frøkjær, E. (2002). Evaluating user interfaces with metaphors of human thinking. In N. Carbonell & C. Stephanidis (Eds.), Proceedings of the 7th ERCIM Workshop "User Interfaces for All", Lecture Notes in Computer Science 2615 (pp. 486-507). Berlin: Springer-Verlag.
Hornbæk, K. & Frøkjær, E. (2004). Usability inspection by metaphors of human thinking compared to heuristic evaluation. International Journal of Human-Computer Interaction, 17(3), 357-374.
Hornbæk, K. & Frøkjær, E. (2005). Comparing usability problems and redesign proposals as input to practical systems development. In Proceedings of ACM Conference on Human Factors in Computing Systems (pp. 391-400). New York, NY: ACM Press.
Jeffries, R. (1994). Usability problem reports: Helping evaluators communicate effectively with developers. In J. Nielsen & R. L. Mack (Eds.), Usability Inspection Methods (pp. 273-294). New York, NY: John Wiley.
Landis, J. & Koch, G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159-174.
Lavery, D., Cockton, G., & Atkinson, M. P. (1997). Comparison of evaluation methods using structured usability problem reports. Behaviour & Information Technology, 16(4/5), 246-266.
Molich, R. (2003). User testing, Discount user testing. www.dialogdesign.dk
Nielsen, J. & Molich, R. (1990). Heuristic evaluation of user interfaces. In Proceedings of ACM Conference on Human Factors in Computing Systems (pp. 249-256). New York, NY: ACM Press.