DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
On the Advantages of a Systematic Inspection for Evaluating Hypermedia Usability A. De Angeli+, M. Matera°, M.F. Costabile*, F. Garzotto°, P. Paolini° +
NCR - Knowledge Lab, London, UK
°Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy *Dipartimento di Informatica, Università di Bari, Italy
[email protected], {matera, garzotto, paolini}@elet.polimi.it,
[email protected]
ABSTRACT It is indubitable that usability inspection of complex hypermedia is still an "art", in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Training inspectors is difficult and often quite expensive. The SUE inspection technique has been proposed to help usability inspectors share and transfer their evaluation know-how, to simplify the hypermedia inspection process for newcomers, and to achieve more effective and efficient evaluation results. SUE is based on the use of evaluation patterns, called Abstract Tasks, which precisely describe the activities to be performed by evaluators during inspection. This paper highlights the advantages of this inspection technique, by presenting its empirical validation through a controlled experiment. Two groups of novice inspectors have been asked to evaluate a commercial hypermedia CD-ROM applying the SUE inspection or the traditional heuristic technique. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results have shown a clear advantage of the SUE inspection over the traditional inspection on all dimensions, demonstrating that Abstract Tasks are efficient tools to drive the evaluator's performance.
1. INTRODUCTION One of the ultimate goals of the Human-Computer Interaction (HCI) discipline is to define methods for ensurin g usability, which is now universally acknowledged as a significant aspect of the overall quality of interactive systems. ISO Standard 9241-11 (International Standard Organization, 1997) defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use”. In this framework, effectiveness is defined as the accuracy and the completeness with which users achieve goals in particular environments. Efficiency refers to the resources expended in relation to the accuracy and completeness of the goals achieved. Satisfaction is defined as the comfort and the acceptability of the system for its users and other people affected by its use. 1
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Much attention on usability is currently paid by industry, which is recognizing the importance of adopting evaluation methods during the development cycle, for verifying the quality of new products before they are put on the market (Madsen, 1999). One of the main complains of industry is however that cost-effective evaluation tools are still lacking. This refrains most companies from actually performing usability evaluation, with the consequent result that a lot of software is still poorly designed and unusable. Therefore, usability inspection methods are emerging as preferred evaluation procedures, being less costly than traditional user-based evaluation. It is indubitable that usability inspection of complex applications, such as hypermedia, is still an "art", in the sense that a great deal is left to the skills, experience, and ability of the inspectors. Moreover, training inspectors is difficult and often quite expensive. As part of an overall methodology called SUE, Systematic Usability Evaluation (Costabile et al. , 1997), a novel inspection technique has been proposed to help usability inspectors share and transfer their evaluation know-how, make it easier the hypermedia inspection process for newcomers, and achieve more effective and efficient evaluations. As described in previous paper (Garzotto et al., 1998; Costabile & Matera, 1999; Garzotto et al., 1999), the inspection proposed by SUE is based on the use of evaluation patterns, called Abstract Tasks, which precisely describe the evaluator's activities to be performed during the inspection. This paper presents an empirical validation of the SUE inspection technique. Two groups of novice inspectors have been asked to evaluate a commercial hypermedia CD-ROM applying the SUE inspection or the traditional heuristic technique. The comparison was based on three major dimensions: effectiveness, efficiency, and satisfaction. Results demonstrated a clear advantage of the SUE inspection over the heuristic evaluation on all dimensions, showing that Abstract Tasks are effic ient tools to drive the evaluator's performance. The paper has the following organization. Section 2 provides the rationale for the SUE inspection by describing the current situation of usability inspection techniques. Section 3 briefly describes the main characteristics of the SUE methodology, while Section 4 outlines the inspection technique proposed by SUE. Section 5 is the core of the paper. It describes the experiment that has been performed in order to validate the SUE inspection. Finally, Section 6 gives the conclusions.
2. BACKGROUND Different methods can be used for evaluating usability, among which the most commonly used are user-based methods and inspection methods. User-based evaluation mainly consists of user testing: it assesses usability properties by observing how the system is actually used by some representative of real users (Whiteside et al., 1988; Dix et al., 1993; Preece et al., 1994). Usability inspection refers to a set of methods in which expert evaluators examine usability-related aspects of an application and provide 2
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
judgments based on their knowledge. Examples of inspection methods are heuristic evaluation, cognitive walkthrough, guideline review, and formal usability inspection (Nielsen & Mack, 1994). User-based evaluation claims to provide, at least until now, the most reliable results, since they involve samples of real users. Such a method, however, has a number of drawbacks, such as the difficulty to properly select a correct sample of the user community and to train it to master the most sophisticated and advanced features of an interactive system. Furthermore, it is difficult and expensive to reproduce actual situations of usage (Lim et al., 1996) and failure in creating real-life situations may lead to “artificial” findings rather than to realistic results. Therefore, the cost and the time to set up reliable empirical testing may be conspicuous. With respect to user-based evaluation, usability inspection methods are more subjective. They are strongly dependent upon the inspector skills. Therefore, it may happen that different inspectors produce not comparable outcomes. However, usability inspection methods “save users” (Jeffries et al., 1991; Nielsen & Mack, 1994) and do not require special equipment, nor lab facilities. In addition, experts can detect problems and possible future faults of a complex system in a limited amount of time. For all these reasons, inspection methods have seen an increasing widespread use in recent years, especially in industrial environments (Nielsen, 1994a). Among usability inspection methods, heuristic evaluation (Nielsen, 1993; Nielsen 1994b) is the most commonly used. It prescribes to have a small set of experts inspecting the system and evaluating its interface against a list of recognized usability principles - the heuristics. Experts in heuristic evaluation can be usability specialists, experts of the specific domain of the application to be evaluated, or (preferably) double experts, with both usability and domain experience. During the evaluation session, each evaluator goes individually through the interface at least twice. The first step is to get a feel of the flow of the interaction and the general scope of the system. The second is to focus on specific objects and functionality, evaluating their design, implementation, etc., against a list of well-known heuristics. Typically, such heuristics are general principles, which refer to common properties of usable systems. However, it is desirable to develop and adopt category-specific heuristics that apply to a specific class of products (Nielsen & Mack, 1994; Garzotto & Matera, 1997). The output of a heuristic evaluation session is a list of usability problems with reference to the violated heuristics. Reporting problems in relation to heuristics enables the easy generation of revised design, in accordance with what is prescribed by the guidelines provided by the violated principles. Once the evaluation has been completed, the findings of the different evaluators are compared in order to come out with a possible unified report summarizing all the findings. Heuristic evaluation is one of the “discount usability” methods (Nielsen, 1993; Nielsen 1994a). In fact, some researches have shown that it is a very efficient usability engineering method (Jeffries & Desurvire, 3
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
1992), with a high benefit-cost ratio (Nielsen, 1994a). It is especially valuable when time and resources are short, because skilled evaluators, without needing the involvement of representative users, can produce high quality results in a limited amount of time (Kantner & Rosenbaum, 1997). This technique has however a number of drawbacks. As highlighted in (Jeffries et al., 1991; Doubleday et al., 1997; Kantner & Rosenbaum, 1997), its major disadvantage is the high dependence upon the skills and experiences of the evaluators. Nielsen states that novice evaluators with no usability expertise are poor evaluators and that usability experts are 1.8 times as good as novices. Moreover, application domain and usability experts (the double experts) are 2.7 as good as novices and 1.5 as good as usability experts (Nielsen, 1992). This means that the experience with the specific category of applications really improves the evaluators’ performance. Unfortunately, usability specialists may lack domain expertise and domain specialists are rarely trained or experienced in usability methodologies. In order to overcome this problem for hypermedia usability evaluation, the SUE inspection technique has been introduced. It uses evaluation patterns, called Abstract Tasks, for guiding the inspectors’ activity. Abstract Tasks precisely describe which hypermedia objects to look for and which actions the evaluators must perform in order to analyze such objects. In this way, also less experienced evaluators, with lack of expertise in usability and/or hypermedia, are able to come out whit more complete and precise results. The SUE inspection technique solves a further heuristic evaluation drawback, which is reported in (Doubleday et al., 1997). The problem is that not always heuristics, as they are generally formulated, are able to deeply guide the evaluators. At this proposal, the SUE inspection framework provides evaluators with a list of detailed heuristics that are specific for hypermedia. Abstract Tasks, then, provide a detailed description of the activities to be performed for detecting possible violations of the hypermedia heuristics. In SUE, the overall inspection process is driven by the use of an application model, HDM (Garzotto et al., 1993). The HDM concepts and primitives allow evaluators to identify precisely the hypermedia constituents that are worth to be investigated. Moreover, both the hypermedia heuristics and the Abstract Tasks focus on such constituents and are formulated through the HDM terminology. Such a terminology is also used by evaluators for reporting problems, thus avoiding the generation of not comprehensible and vague inspection reports. In some recent papers (Hartson et al., 1999; Andre et al., 1999), authors highlight the need for more focused usability inspection methods and for a classification of usability problems, so that to support the production of inspection reports easy to read and compare. These authors have defined the User Action Framework (UAF), which is a unifying and organizing environment, supporting design guidelines, usability inspection, classification and reporting of usability problems. UAF provides a knowledge base, in which different usability problems are organized taking into account how users are affected by the design during the interaction, at various points where they must accomplish cognitive or physical actions. The classification of design problems and usability concepts is a way to 4
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
capitalize on past evaluation experiences. It allows evaluators to better understand the design problems they encounter during the inspection and helps them identify precisely which physical or cognitive aspects cause the problems. Evaluators are therefore able to propose well-focused redesign solutions. The motivations behind this research are similar to ours. Reusing the past evaluation experience and making it available to less experienced people is also one of our basic goals, which we pursue through the use of Abstract Tasks. Their formulation is in fact the reflection of the experiences of some skilled evaluators. Differently from the UAF, rather than recording problems, Abstract Tasks offer a way to keep track of the activities to be performed for discovering problems.
3. THE SUE METHODOLOGY SUE is a methodology for evaluating the usability of interactive systems, which prescribes a structured flow of activities (Costabile et al., 1997). SUE has been largely specialized for hypermedia (Costabile et al., 1998; Garzotto et al., 1998; Garzotto et al., 1999), but this methodology can be easily exploited for evaluating the usability of any interactive application (Costabile & Matera, 1999). A core idea of SUE is that the most reliable evaluation can be achieved by systematically combining inspection with user-based evaluation. In fact, several studies have outlined how such two methods are complementary and can be effectively coupled for obtaining a reliable evaluation process (Virzi et al., 1993; Desurvire, 1994; Karat, 1994). The inspection proposed by SUE, based on the use of Abstract Tasks (in the following SUE Inspection), is carried out first. Then, if inspection results are not sufficient to predict the impact of some critical situations, user-based evaluation is also conducted. Being driven by the inspection outcome, in SUE the user-based evaluation tends to be better focused and cost-effective. Another basic assumption of SUE is that, in order to be reliable, the usability evaluation should encompass a variety of dimensions of a system. Some of these dimensions may refer to general layout features common to all interactive systems, others may be more specific for the design of a particular product category, or a particular domain of use. For each dimension, the evaluation process consists of a preparatory phase and an execution phase. The preparatory phase is performed only once for each dimension and its purpose is to create a conceptual framework that will be used to carry out actual evaluations. As it will be better explained in the next section, such a framework includes a design model, a set of usability attributes, and a library of Abstract Tasks. The activities in the preparatory phase may require an extensive employment of resources. Therefore, they should be regarded as a long-term investment. The execution phase is performed every time a specific application must be evaluated. It mainly consists of an inspection, performed by expert evaluators. If needed, inspection can be followed by sessions of user testing, involving real users.
5
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
4. THE SUE INSPECTION TECHNIQUE FOR HYPERM EDIA The SUE inspection is based on the use of an application design model for describing the application, a set of usability attributes to be verified during the evaluation, and a set of Abstract Tasks (ATs for short), to be applied during the inspection phase. The term “model” is used in a broad sense, meaning a set of concepts, representation structures, design principles, primitives and terms, which can be used to build a description of an application. The model helps organize concepts, so identifying and describing, in a nonambiguous way, the components of the application which constitute the entities of the evaluation (Fenton, 1991). For the evaluation of hypermedia, we have adopted HDM – Hypermedia Design Model (Garzotto et al., 1993), which focuses on structural and navigation properties, as well as on active media features. The usability attributes are obtained by decomposing general usability principles into finer-grained criteria that can be better analyzed. In accordance with the suggestion given in (Nielsen & Mack, 1994), namely to develop category-specific heuristics, we have defined a set of usability attributes, able to capture the peculiar features of hypermedia (Garzotto & Matera, 1997; Garzotto et al., 1998). A correspondence exists between such hypermedia usability attributes and the ten Nielsen’s heuristics (Nielsen, 1993). The hypermedia usability attributes, in fact, can be considered a specialization for hypermedia of the Nielsen's heuristics, with the only exception of “Good Error Messages” and “Help and Documentation”, which do not need to be further specialized.
Table I: Two ATs from the library of hypermedia ATs (Matera, 1999).
ATs are evaluation patterns which provide a detailed description of the activities to be performed by expert evaluators during inspection (Garzotto et al., 1998; Garzotto et al., 1999). They are formulated precisely by following a pattern template, which provides a consistent format including the following items: •
AT Classification Code and Title: these items univocally identify the AT and succinctly convey its essence;
•
Focus of Action: it shortly describes the context, or focus, of the AT by listing the application constituents that correspond to the evaluation entities;
•
Intent: it describes the problem addressed by the AT and its rationale, trying to make clear which is the specific goal to be achieved through the AT application;
•
Activity Description: it describes in detail the activities to be performed when applying the AT;
•
Output: it describes the output of the fragment of the inspection the AT refers to. 6
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Optionally, a comment is provided, with the aim of indicating further ATs to be applied in combination or highlighting related usability attributes. A further advantage of the use of a model is that it provides the terminology for formulating the ATs. The forty ATs defined for hypermedia (Matera, 1999) have been formulated by using the HDM vocabulary. Two examples are reported in Table I. The two ATs focus on active slots1 . The list of ATs provides a systematic guidance on how to inspect a hypermedia application. Most evaluators are very good in analyzing only certain features of interactive applications; often they neglect some other features, strictly dependent on the specific application category. Exploiting a set of ATs ready for use allows evaluators with no experience in hypermedia to come up with good results. During the inspection, evaluators analyze the application and specify a viable HDM schema, when it is not already available, for describing the application. During this activity, the different application components (i.e., the objects of the evaluation) are identified. Then, having in mind the usability criteria, evaluators apply the ATs and produce a report in which the discovered problems are described. The terminology provided by the model is used by the evaluators for referring to objects and describing critical situations while reporting troubles, so attaining a major precision in their final evaluation report.
5. THE VALIDATION EXPERIMENT In order to validate the SUE inspection technique, we have conducted a comparison study involving a group of senior students of a Human-Computer Interaction class at the University of Bari, Italy. The aim of the experiment was to compare the performance of evaluators carrying out the SUE inspection, based on the use of ATs, with the performance of evaluators carrying out the heuristic inspection, based on the use of heuristics only. For the sake of brevity, in the rest of the paper we indicate as SI the SUE Inspection and as HI the Heuristic Inspection. As better explained in Section 5.2, the validation metrics were defined along three major dimensions: effectiveness, efficiency, and user satisfaction. Such dimensions actually correspond to the principal usability factors as defined by the Standard ISO 9241-11 (International Standard Organization, 1997). Therefore, we can say that the experiment has allowed us to assess the usability of the inspection technique (John, 1996). In the defined metrics, effectiv eness refers to the completeness and accuracy with which inspectors performed the evaluation. Efficiency refers to the time expended in relation to the effectiveness of the evaluation. Satisfaction refers to a number of subjective parameters, such as perceived usefulness, difficulty, acceptability and confidence with respect to the evaluation technique. For each dimension we have tested a specific hypothesis.
1
In the HDM terminology, a “slot” is an atomic piece of information, such as texts, pictures, videos, sounds, etc.
7
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
•
Effectiveness Hypothesis. As a general hypothesis we predicted that SI should increase the evaluation effectiveness as compared to HI. The advantage is related to two factors: (a) the systematic nature of the SI technique, deriving from the use of the HDM model for precisely identifying the application constituents; (b) the use of ATs, which suggest the activity to be conducted over such objects. Since the ATs directly address hypermedia applications, this prediction should be also weighted with respect to the nature of problems detected by evaluators. The hypermedia specialization of the SI could constitute both the method advantage and its limit. Indeed, while it could be particularly effective with respect to hypermedia-specific problems, it could neglect other flaws related to presentation and content. In other words, the limit of ATs could be that of taking evaluators away from defects not specifically addressed by the AT activity.
•
Efficiency Hypothesis. A limit of SI could be that a rigorous application of several ATs is time consuming. However, we expect that SI should not compromise inspection efficiency as compared to a less structured inspection technique. Indeed, the higher effectiveness of the SI technique should compensate for the major time demand required by its application.
•
Satisfaction Hypothesis. Although we expect that SI should be perceived as a more complex technique than HI, we also hypothesize that it should enhance the evaluators’ control over the inspection process and their confidence on the obtained results.
5.1. Method In this section, we describe the experimental method adopted to test the effectiveness, efficiency and user satisfaction hypotheses. Participants Twenty-eight senior students from the University of Bari participated in the experiment as part of their credit for an HCI course. Their curriculum comprised training and hands-on experience in the Nielsen’s heuristic -evaluation method, which they had applied to paper prototypes, computer–based prototypes, and hypermedia CD-ROMs. During lectures they were also exposed to the HDM model. Design The inspection technique was manipulated between participants. Randomly, half of the sample was assigned to the HI condition, the other half to the SI condition.
8
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Procedure A week before the experiment, participants were introduced to the conceptual tools to be used during the inspection. The training session lasted 2 hours and 30 minutes for the HI group and 3 hours for the SI group. The discrepancy was due to the different conceptual tools used during the evaluation by the two groups, as better explained in the following. A preliminary 2-hour seminar shortly reviewed the HDM Model and introduced all the participants with hypermedia -specific heuristics, as defined by SUE. Particular attention was devoted to inform students without influencing their expectations and attitudes towards the two different inspection techniques. A couple of days later, all the participants were presented with a short demo of the application, lasting almost 15 minutes. A few summary indications about the application content and the main functions were introduced, without providing, however, too many details. In this way, participants, having a limited time at their disposal, did not start their usability analysis from scratch, but had an idea (although vague) of how to get oriented in the application. Then, partic ipants assigned to the SI group were shortly introduced to the HDM schema of the application and to the key concepts on applying ATs. In the proposed application schema, only the main application components inthe-large were introduced (i.e., structure of entity types and applicative links for the hyperbase, collection structure and navigation for the access layer), without revealing any detail that could give indications of usability problems. Table II: The list of ATs submitted to inspectors.
The experimental session lasted three hours. Participants had to inspect the CD-ROM applying the technique they were assigned to. All participants were provided with a list of ten SUE heuristics, summarizing the usability guidelines for hypermedia 2 (Garzotto & Matera, 1997; Garzotto et al., 1998). The SI group was also provided with the HDM application schema and with ten ATs to be applied during the inspection (see Table II). The limited number of ATs was due to the limited amount of time participants had at their disposal. We selected the most basic ATs, which could guide SI inspectors in the analysis of the main application constituents. For example, we disregarded ATs addressing advanced hypermedia features. Working individually, participants had to find the maximum number of usability problems in the application and to record them on a report booklet. It differed according to the experimental conditions. In the HI group, the booklet included ten forms, one for each of the hypermedia heuristics. The forms
2
By providing both groups with the same heuristic list we have been able to measure the possible added value of the systematic inspection induced by SUE, with respect to the subjective application of heuristics.
9
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
required information about the application point where that heuristic was violated and a short description of the problem. The SI group was instead provided with a report booklet including ten forms, each one corresponding to an AT. Again, the forms required information about the violations detected through that AT and where they occurred. Examples of the forms included in the report booklets provided to the two groups are reported in Figures 1 and 2. Figure 1: An example of table in the report form provided to the HI group.
Figure 2 : One of the ATs submitted to the evaluators in the ATI group and the correspondent form for reporting problems.
At the end of the evaluation, participants were invited to fill in the evaluator-satisfaction questionnaire, which combined several item formats to measure three main dimensions: user-satisfaction with the evaluated application, evaluator-satisfaction with the inspection technique, and evaluator-satisfaction with the results achieved. The psychometric instrument was organized in two parts. The first was concerned with the application, the second included the questions about the adopted evaluation technique. Two final questions asked participants to specify how much they felt satisfied about their performance as evaluators. The Application The evaluated application is the Italian CD-ROM “Camminare nella pittura” – meaning “Walking through Painting” (Mondadori New Media, 1997). It is composed of two CD-ROMs, each one presenting the analysis of Painting and some relevant artworks in two periods. The first CD-ROM (CD1 in the following) covers the period from Cimabue to Leonardo; the second one the period from Bosch to Cezanne. The CD-ROMs are identical in the structure, and each one can be used independently from the other. Each CD-ROM is a distinct and "complete" application of limited size, particularly suitable for being exhaustively analyzed in a limited amount of time. Therefore, CD1 only has been submitted to participants. The limited number of navigation nodes in CD1 has simplified the post-experiment analysis of the paths followed by the evaluators during the inspection and the identification of the application points where they highlighted the problems. Data Coding The report booklets were analyzed by three expert hypermedia designers (expert evaluators, in the following) with a strong HCI background, to assess effectiveness and efficiency of the applied evaluation
10
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
technique. All reported measures had a reliability value of at least .85. Evaluator satisfaction was instead measured analyzing the self-administered post-experiment questionnaires. All the statements written in the report booklets were scored as Problems or Non-Problems. Problems are actual usability flaws, which could impact on user performance. Non-Problems include: (a) observations reflecting only evaluators’ personal preferences but not real usability bugs; (b) evaluation errors, reflecting evaluators’ misjudgments or system defects due to a particular hardware configuration; (c) not understandable statements (i.e., statements not clearly reported). For each statement scored as Problem or Non-Problem of type (a), a severity rating was performed. As suggested by Nielsen (Nielsen, 1994b), severity was estimated considering three factors: i) the frequency of the problem; ii) the impact of the problem on the user; iii) the persistence of the problem during interaction. The evaluation was modulated on a Likert scale, ranging from 1 = "I don’t agree that this is a usability problem at all", to 5 = "Usability catastrophe". Each problem was further classified in one of the following dimensions, according to the nature of the problem itself: •
Navigation, which includes problems related to the task of moving within the hyperspace. It refers to the appropriateness of mechanisms for accessing information and for getting oriented in the hyperspace.
•
Active media control, which includes problems related to the interaction with dynamic multimedia objects, such as video, animation, audio comment, etc. It refers to the appropriateness of mechanisms for controlling the dynamic behavior of media, and of mechanisms providing feedback about the current state of the media activation.
•
Interaction with widgets, which includes problems related to the interaction with the widgets of the visual interface, such as buttons of various types, icons, scrollbars etc. It includes problems related to the appropriateness of mechanisms for manipulating widgets, and their self-evidence. Note that navigation and active media control are dimensions specific to hypermedia systems.
5.2. Results The total number of problems detected in the application is 38. Among these, 29 problems were discovered by the expert evaluators, through an inspection prior to the experiment. The remaining 9 were identified only by the experimental inspectors. During the experiment, inspectors reported a total number of 36 different types of Problems. They also reported 25 different types of Non-Problems of type (a) and (b). Four inspectors reported at least one nonunderstandable statement, Non-Problems of type (c). The results of the psychometric analysis are reported in the following paragraphs with reference to the three experimental hypotheses. 11
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Effectiveness Effectiveness can be decomposed into the completeness and accuracy with which inspectors performed the evaluation. Completeness corresponds to the percentage of problems detected by a single inspectors out of the total number of problems. It is computed by the following formula:
Completenessi =
Pi * 100 n
where Pi is the number of problems found by the i-th inspectors, and n is the total number of problems existing in the application (n = 38). On average, inspectors in the SI group individually found the 24% of all the usability defects (mean S.E. = 1.88); inspectors in the HI group the 19% (mean S.E. = 1.99). As shown by a Mann-Whitney U test, the difference is statistically significant (U = 50.5, (N = 28), p < .05). It follows that the SI technique enhances evaluation completeness, allowing individual evaluators to discover a major number of usability problems. Accuracy can be defined by two indices: precision and severity . Precision is given by the percentage of problems detected by a single inspector out of the total number of statements. For a given inspector, precision is computed by the following formula:
Precisioni =
Pi * 100 si
where Pi is the number of problems found by the i-th evaluator, and si is the total number of statements he/she reported (including Non-Problems). In general, the distribution of precision is affected by a severe negative skewness, with 50% of participants not committing any errors. The variable ranges from 40 to 100, with a median value of 96. In the SI group, inspectors were totally accurate (precision value = 100), with the exception of two of them, which were slightly inaccurate (precision value > 80). On the other hand, only two participants of the HI condition were totally accurate. The mean value for the HI group is 77.4 (mean S.E. = 4.23), the median value is 77.5. Moreover, four evaluators in the HI group reported at least one non-understandable statement, while all the statements reported by the SI group were clearly expressed and referred to application objects using a comprehensible and consistent terminology.
Figure 3: Average numb er of problems as a function of experimental conditions and problem categories.
12
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
This general trend reflecting an advantage of SI over HI has been confirmed also by the analysis of the severity index, which refers to the average rating of all the scored statements for each participant. A t-test analysis demonstrated that the mean rating of the two groups varied significantly (t(26)= -3.92, p < .001). Problems detected applying the ATs were scored as more serious than those detected when only the heuristics were available (means and standard errors are reported in Table III). Table III: Means and standard errors of the severity index.
The effectiveness hypothesis also states that the SUE inspection technique could be particularly effective for detecting hypermedia -specific problems, while it could neglect other bugs related to GUI widgets. In order to test this aspect, we have analyzed the distribution of problem types as a function of experimental conditions. As can be seen in the bar chart reported in Fig ure 3, the most common problems detected by all the evaluators were concerned with navigation, followed by defects related to active media control. Only a minority of problems regarded interaction with widgets. In general, it is evident that the SI inspectors found more problems. However, this superiority especially emerges for hypermedia related defects (navigation and active media control), t(26) = -2.70, p < .05. A slight major average number of "Interaction with widgets" problems has been found by the HI group. Anyway, a Mann-Whitney U test, comparing the percentage of problems in the two experimental conditions, has shown that the difference is not significant (U = 67, N = 28, p = .16). This means that differently from what we have hypothesized, the systematic inspection activity suggested by ATs does not take evaluators away from other problems not covered by the activity description. Since the problems found by the SI group in the “Interaction with widgets” category are those having the highest severity, we can also assume that the hypermedia ATs do not prevent evaluators from noticing usability catastrophes related to presentation aspects. Also, we believe that supplying evaluators with ATs focusing on presentation aspects, such as those presented in (Costabile & Matera, 1999), we could obtain a deep analysis of the graphical user interface, with the result of having a major number of “Interaction with widgets” problems found by the SI evaluators. Efficiency Efficiency has been considered both at the individual and at the group level. Individual efficiency refers to the number of problems extracted by a single inspector, in relation to the time spent. It is computed by the following formula:
Ind_Efficiencyi = 13
Pi ti
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
where Pi is the number of problems de tected by the i-th inspector, and ti is the time spent for finding the problems. On average, SI inspectors found 4.5 problems in 1 hour of inspection, versus the 3.6 problems per hour found by the HI inspectors. A t-test on the variable normalized by a square root transformation demonstrates that the difference is not significant (t(26) = -1.44, p = .16). Such a result confirms the efficiency hypothesis, since the application of the ATs do not compromise efficiency as compared to a less structured evaluation technique. Rather SI shows a positive tendency in finding a major number of problems per hour. Group efficiency refers to the evaluation results achieved aggregating the performance of several inspectors. At this aim, the Nielsen’s cost-benefit curve, relating the proportion of usability problems to the number of evaluators, has been calculated (Nielsen, 1994b). This curve derives from a mathematical model, which is based on the prediction formula for the number of usability problems found in a heuristic evaluation reported in the following (Nielsen, 1992): Found(i) = n(1-(1-λ)i) where Found(i) is the number of problems found by aggregating reports from i independent evaluators, n is the total number of problems in the application, and λ is the probability of finding the average usability problem when using a single average evaluator.
Figure 4: The cost-benefit curve (Nielsen & Landauer, 1993), calculated for the two techniques, HI (Heuristic Inspection) and SI (SUE Inspection). Each curve shows the proportion of usability problems found by each technique, when using different numbers of evaluators.
As suggested by the authors, one possible use of this model is the estimation of the number of the inspectors needed for identifying a given percentage of usability errors. We have therefore used the model for determining how many inspectors for the two techniques would enable the detection of a reasonable percentage of problems in the application. The curves calculated for the two groups are reported in Figure 4 (n = 38, λHI = 0.19, λSI = 0.24). As emerging from the graph, SI tend to reach better performance with a lowest number of evaluators. Assuming the Nielsen’s 75% threshold, SI can reach it with five evaluators. The HI group needs instead seven evaluators. Satisfaction With respect to an evaluation technique, satisfaction refers to many parameters, such as perceived usefulness, difficulty, and acceptability of applying the method. The post-experiment questionnaire addressed three main dimensions: user-satisfaction with the application evaluated , evaluator-satisfaction with the inspection technique, and evaluator -satisfaction with the results achieved. At a first sight it may 14
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
appear that the first dimension, addressing the satisfaction of the evaluators with the application, is out of the scope of the experiment, whose main intent was to compare two inspection techniques. However, we wanted to verify in which way the used technique influenced the inspector severity. User-satisfaction with the application evaluated was assessed through a semantic -differential scale that required inspectors to judge the application on 11 pairs of adjectives describing satisfaction with information systems. Inspectors could modulate their evaluation on 7 points (after re-coding of reversed items 1 = very negative, 7 = positive). The initial reliability of the satisfaction scale is moderately satisfying (α = .74), with three items (reliable – unreliable; amusing – boring; difficult – simple) presenting a corrected item-total correlation inferior to .30. Therefore, the user-satisfaction index was computed averaging scorings for the remaining 8 items (α = .79). Then, the index was analyzed by a t-test. Results showed a significant effect of the inspection group (t(26) = 2.38, p < .05). On average, the SI inspectors evaluated the application more severely (mean = 4.37, mean S.E .= .23) than HI inspectors (mean = 5.13, mean S.E. = .22). From this difference, it can be inferred that ATs provide evaluators with a more effective framework to weight limits and benefits of the application. The hypothesis is supported by the significant correlation between the number of usability problems found by an evaluator and his/her satisfaction with the application (r = -.42, p < .05). The negative index indicates that as more problems were found, as less positive the evaluation was. Evaluator-satisfaction with the inspection technique was assessed by 11 pairs of adjectives, modulated on 7 points. The original reliability value is .72, increasing to .75 after deletion of 3 items (tiring – restful; complex – simple; satisfying – unsatisfying). The evaluator-satisfaction index was then computed averaging scorings to the remaining 8 items. The index is highly correlated to a direct item assessing learnability of the inspection technique (r = .53, p < .001). As easiest the technique was perceived, as better it was evaluated. A t-test showed no significant differences in the satisfaction with the inspection technique (t(26) = 1.19, p = .25). On average, evaluations were moderately positive for both techniques, with a mean difference of .32 slightly favoring the HI group. To conclude, despite being objectively more demanding, SI is not evaluated worse than HI. Evaluator-satisfaction with the result achieved was assessed directly by a Likert-type item asking participants to express their gratification on a 4 point-scale (from “not at all” to “very much”) and indirectly by a percentage estimation of the number of problems found. The two variables are highly correlated (r = .57, p < .01). The more problems an inspector thinks to have found. the more satisfied he/she is with his/her performance. Consequently, the final satisfaction index was computed multiplying the two scores. A Mann-Whitney U-test showed a tendency towards a difference in favor of the HI group (U = 54.5, p = .07). Participants in the HI group felt more satisfied about their performance than those in the SI group. 15
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
By considering this finding in the general framework of the experiment, it appears that ATs provide participants with higher critical abilities than heuristics. Indeed, despite the major effectiveness achieved by participants in the SI group, they are still less satisfied with their performance, as if they could better understand the limit of an individual evaluation. Summary Table IV provides a summary of the experimental results presented in the previous paragraphs. The advantage of the systematic approach adopted by the evaluators assigned to the SI condition is evident. The implication of these findings will be discussed in the final paragraph.
Table IV: Summary of the experimental results.
6. CONCLUSIONS In the last decade, several techniques for evaluating the usability of software systems have been proposed. Unfortunately, research in HCI has not devoted sufficient efforts towards the validation of such techniques and therefore some open questions persist about them (John, 1996). The study reported in this paper has provided some answers about the effectiveness, efficiency, and satisfaction of the SUE inspection technique. The experiment seems to confirm our general hypothesis of sharp increasing of the overall quality of inspection when ATs are used. More specifically we can conclude that: •
The SUE inspection increases the evaluation effectiveness. The SI group has shown a major completeness and precision in reporting problems and has also identified more severe problems.
•
Although more rigorous and structured, the SUE inspection does not compromise the inspection efficiency. Rather, it enhances the group efficiency, defined as the number of different usability problems found by aggregating the reports of several inspectors and shows a similar individual efficiency, defined as the number of problems extracted by a single inspector in relation to the time spent.
•
The SUE inspection enhances the inspectors’ control over the inspection process and their confidence on the obtained results. SI inspectors evaluated the application more severely than HI inspectors. Although SUE inspection is perceived as a more complex technique, SI inspectors have been moderately satisfied about it. Finally, they have shown a major critical ability, feeling less satisfied with their performance, as if they could understand the limit of their inspection activity better than the HI inspectors. 16
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
We are very confident on the validity of such results, since by no means our evaluators have been influenced by our attachment to the SUE inspection. Actually, the evaluators were more familiar with Nielsen’s heuristic inspection, being exposed to this method during the HCI course. On the contrary, they learnt the SUE inspection only during the training session. We are now planning further experiments involving expert evaluators, in order to prove whether ATs provide greater power to them as well.
ACKNOWLEDGEMENTS We are immensely grateful to Prof. Rex Hartson, from Virginia Tech, for his valuable suggestions. We also thank Francesca Alonzo and Alessandra Di Silvestro, from the Hypermedia Open Center of Polytechnic of Milan, for the help offered during the experiment data coding.
REFERENCES Andre, T.S., Hartson, H.R., & Williges, R.C. (1999). Expert-based Usability Inspections: Developing a Foundational Framework and Method. Proceedings of the 2nd Annual Student’s Symposium on Human Factors of Complex Systems. Costabile, M.F., Garzotto, F., Matera, M., & Paolini., P. (1997). SUE: A Systematic Usability Evaluation (Technical Report 19-97). Milano: Dipartimento di Elettronica e Informazione, Politecnico di Milano. Costabile, M.F., Garzotto, F., Matera, M., & Paolini., P. (1998). Abstract Tasks and Concrete Tasks for the Evaluation of Multimedia Applications. Proceedings of the ACM CHI’98 Workshop From Hyped-Media to Hyper-Media: Towards Theoretical Foundations of Design Use and Evaluation , Los Angeles, CA, April 1998. Available at: http://www.eng.auburn.edu/department/cse /research/vi3rg/ws/papers.html. Costabile, M.F., & Matera, M. (1999). Evaluating WIMP Interfaces through the SUE Approach. Proceedings of IEEE ICIAP’99 – International Conference on Image Analysis and Processing, 1192-1197. Los Alamitos (CA): IEEE Computer Society. Desurvire, H.W. (1994). Faster, Cheaper!! Are Usability Inspection Methods as Effective as Empirical Testing? In (Nielsen, & Mack, 1994), 173-202. Dix, A., Finlay, J., Abowd, G., & Beale, R. (1998). Human-Computer Interaction (Second Edition). London: Prentice Hall Europe. Doubleday, A., Ryan, M., Springett, M., & Sutcliffe, A. (1997). A Comparison of Usability Techniques for Evaluating Design. Proceedings of ACM DIS’97 - International Conference on Designing Interactive Systems, 101-110. Berlin: Springer Verlag. Fenton, N.E. (1991). Software Metrics - A Rigorous Approach. London: Chapmann & All. Garzotto, F., & Matera, M. (1997). A Systematic Method for Hypermedia Usability Inspection. The New Review of Hypermedia and Multimedia, 3, 39-65. Garzotto, F., Matera, M., & Paolini, P. (1998). Model-based Heuristic Evaluation of Hypermedia Usability. Proceedings of AVI’98 – International Conference on Advanced Visual Interfaces, 135-145. New York: ACM Press. Garzotto, F., Matera, M., & Paolini, P. (1999). Abstract Tasks: a Tool for the Inspection of Web Sites and Off-line Hypermedia. Proceedings of ACM Hypertext’99, 157-164. New York: ACM Press. Garzotto, F., Paolini, P., & Schwabe, D. (1993). HDM - A Model Based Approach to Hypermedia Application Design. ACM Transactions on Information Systems, 11 (1), 1-26.
17
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION Hartson, H.R., Andre, T.S., Williges, R.C., & Van Rens, L. (1999). The User Action Framework: a Theory-bas ed Foundation for Inspection and Classification of Usability Problems. Proceedings of HCI International’99, 10581062. Oxford: Elsevier Publisher. International Standard Organisation. (1997). Ergonomics Requirements for Office Work with Visual Display Terminal (VDT) - Parts 1-17. International Standard Organization 9241, Geneva, Switzerland. Jeffries, R., & Desurvire, H.W. (1992). Usability Testing vs. Heuristic Evaluation: Was There a Context? ACM SIGCHI Bulletin, 24(4), 39-41. Jeffries, R., Miller, J., Wharton, C., & Uyeda, K.M. (1991). User Interface Evaluation in the Real Word: a Comparison of Four Techniques. Proceedings of ACM CHI’91 - International Conference on Human Factors in Computing Systems, 119-124. New York: ACM Press. John, B.E. (1996). Evaluating Usability Evaluation Techniques. ACM Computing Surveys, 28(4es). Kantner, L., & Rosenbaum, S. (1997). Usability Studies of WWW Sites: Heuristic Evaluation vs. Laboratory Testing. Proceedings of ACM SIGDOC’97 – International Conference on Computer Documentation, 153-160. New York: ACM Press. Karat, C.M. (1994). A Comparison of User Interface Evaluation Methods . In (Nielsen & Mack, 1994), 203-230. Lim, K.H., Benbasat, I., & Todd, P.A. (1996). An Experimental Investigation of the Interactive Effects of Interface Style, Instructions, and Task Familiarity on User Performance. ACM Transactions on Computer-Human Interaction, 3(1), 1-37. Madsen, K.H. (1999). Special Issue on “The Diversity of Usability Practices”. Communication of ACM, 42 (5). Matera, M. (1999). SUE: A Systematic Methodology for Evaluating Hypermedia Usability. (Ph.D. Thesis). Milano: Dipartimento di Elettronica e Informazione, Politecnico di Milano. Mondadori New Media . (1997). Camminare nella Pittura. CD-ROM. Milano (Italy): Mondadori New Media Publisher. Nielsen, J. (1992) . Finding Usability Problems through Heuristic Evaluation. Proceedings of ACM CHI’92 – International Conference on Human Factors in Computing Systems, 373-380. New York: ACM Press. Nielsen, J. (1993). Usability Engineering. Cambridge (MA): Academic Press. Nielsen, J. (1994a). Guerrilla HCI: Using Discount Usability Engineering to Penetrate Intimidation Barrier. In R.G. Bias & D.J. Mayhew (Eds.), Cost-Justifying Usability. Cambridge (MA): Academic Press. Available at http://www.useit.com/papers/guerrilla_hci.html. Nielsen, J. (1994b). Heuristic Evaluation. In (Nielsen & Mack, 1994), 25-62. Nielsen, J., & Landauer, T. K. (1993). A Mathematical Model of The Finding of Usability Problems. Proceedings of ACM INTERCHI’93 – International Conference on Human Factors in Computing Systems, 296-213. New York: ACM Press. Nielsen, J., & Mack, R.L. (1994). Usability Inspection Methods. New York: John Wiley & Sons. Preece, J., Rogers, Y., Sharp, H., Benyon, D., Holland, S., & Carey, T. (1994 ). Human-Computer Interaction. New York: Addison Wesley. Virzi, R.A., Sorce, J.F., & Herbert L.B. (1993). A comparison of Three Usability Evaluation Methods: Heuristic, Think-Aloud, and Performance Testing. Proceedings of Human Factors and Ergonomics Society 30th Annual Meeting. Whiteside, J., Bennet, J., & Holtzblatt, K. (1988). Usability Engineering: Our Experience and Evolution. In M. Helander (Ed.), Handbook of Human-Computer Interaction, 791-817. Oxford: Elsevier Science.
18
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION Table I: Two ATs from the library of hypermedia ATs (Matera, 1999). AS-1: “CONTROL ON ACTIVE S LOTS ” FOCUS OF A CTION: an active slot. INTENT : to evaluate the control provided over the active slot, in terms of:
A. mechanisms for the control of the active slot; B. mechanisms supporting the state visibility, i.e., the identification of any intermediate state of the slot activation. ACTIVITY DESCRIPTION: given an active slot:
A. execute commands such as play, suspend, continue, stop, replay, get to an intermediate state, etc.; B. at a given instant, during the activation of the active slot, verify if it is possible to identify its current state, as well as its evolution up to the end. OUTPUT : A. a list and a short description of the set of control commands, and of the mechanisms supporting the B.
• • •
state visibility; a statement saying if: the type and number of commands is appropriate, in accordance with the intrinsic nature of the active slot; besides the available commands, some further commands would make the active slot control more effective; the mechanisms supporting the state visibility are evident and effective.
AS-6: “NAVIGATIONAL B EHAVIOR OF ACTIVE SLOTS ” FOCUS OF A CTION: an active slot + links. INTENT : to evaluate the cross effects of navigation on the behavior of active slots. A CTIVITY DESCRIPTION: consider an active slot: A.
activate it, and then follow one (or more) link(s) while the slot is still active; return to the “original” node where the slot has been activated, and verify the actual slot state;
B.
activate the active slot; suspend it; follow one (or more) link(s); return to the original node where the slot has been suspended and verify the actual slot state;
C.
execute activities A. and B. traversing different types of links, both to leave the original node and to return to it;
D.
execute activities A. and B. by using only backtracking to return to the original node.
OUTPUT : A. a description of the behavior of the active slot, when its activation is followed by the execution of navigational links and, eventually, backtracking; B.
a list and a short description of possible unpredictable effects or semantic conflicts in the source and/or in the destination node, together with the indication of the type and nature of the link that has generated the problem.
19
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION Table II: The list of ATs submitted to inspectors. AT Classification
AT Title
Code AS -1
Control on Active Slots
AS -6
Navigational Behavior of Active Slots
PS -1
Control on Passive Slots
HB-N1
Complexity of Structural Navigation Patterns
HB-N4
Complexity of Applicative Navigation Patterns
AL-S 1
Coverage Power of Access Structures
AL-N1
Complexity of Collection Navigation Patterns
AL-N3
Bottom-up Navigation in Index Hierarchies
AL-N11
History Collection Structure and Navigation
AL-N13
Exit Mechanisms Availability
20
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Violated heuristic
Where
Problem Description
H1: Control Command Availability over Active Slots, Passive Slots, and Automatic Guided Tours.
Figure 1: An example of table in the report form provided to the HI group.
21
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION AL-N1: "C OMPLEXITY OF C OLLECTION N AVIGATION PA TTERNS" F OCUS OF A CTION: a collection + the set of collection links. INTENT: A. to evaluate the collection navigation pattern in a given collection; B. to evaluate the mechanisms supporting the navigation status visibility, i.e., those elements which give indications about the placement of members within the collection. A CTIVITY D ESCRIPTION: in a collection: A. from the collection center access a generic collection member; B. from a generic member, access the collection center (if any); C. from a generic member of the collection, access: the previous and the next member in the collection order, the first collection
member and the last one, another arbitrary member; D. from the first (respectively, the last) member, try to go “previous” (respectively "next”); E. in parallel with each one of the previous activities, every time a member is accessed, verify if it is possible to identify its
placement within the collection. O UTPUT: A. a description of the collection navigation pattern, in terms of links for moving among collection members, and between the collection center and each collection member; B. a description of the mechanisms supporting the navigation status visibility; C. a statement expressing if the collection navigation pattern adequately supports the collection navigation; D. a statement expressing if the mechanisms described at point B. are evident and effective.
Where
Problem Description
Figure 2: One of the ATs submitted to the evaluators in the SI group and the correspondent form for reporting problems.
22
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
6
Navigation
4 Active Media Control
2 0 HI
SI
Interaction with Widgets
Figure 3: Average number of problems as a function of experimental conditions and problem categories.
23
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Table III: Means and standard errors for the analysis of severity.
Severity index
HI
SI
3.66 (.12)
4.22 (.08)
24
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Proportion of Usability Problems
100%
75%
50%
25% HI SI 0% 1
2
3
4
5
6
7
8
9
10 11 12 13 14 15
Number of Evaluators
Figure 4: The cost-benefit curve (Nielsen & Landauer, 1993), calculated for the two techniques, HI (Heuristic Inspection) and SI (SUE Inspection). Each curve shows the proportion of usability problems found by each technique, when using different numbers of evaluators.
25
DRAFT SUBMITTED TO I NTERNATIONAL JOURNAL OF H UMAN COMPUTER I NTERACTION
Table IV: Summary of the experimental results. Hypothesis EFFECTIVENESS Completeness Accurateness EFFICIENCY S ATISFACTION
Indices Precision Severity
Individual Efficiency Group Efficiency User Satisfaction with the application evaluated Evaluator Satisfaction with the inspection technique Evaluator Satisfaction with the achieved results
Legend: -: worse performance +: better performance =: equal performance : major critical ability
26
HI = < =
= >