Science and Information Conference 2014 August 27-29, 2014 | London, UK
How Many Participants are Really Enough for Usability Studies?

Roobaea AlRoobaea
College of Computers and Information Technology, Taif University, Saudi Arabia
& School of Computing Sciences, University of East Anglia, UK
[email protected]

Pam J. Mayhew
School of Computing Sciences, University of East Anglia, UK
[email protected]
Abstract—The growth of the Internet and related technologies has enabled the development of a new breed of dynamic websites, applications and software products that are growing rapidly in use and that have had a great impact on many businesses. These technologies need to be continuously evaluated by usability evaluation methods (UEMs) to measure their efficiency and effectiveness, to assess user satisfaction, and ultimately to improve their quality. However, estimating the sample sizes for these methods has become the source of considerable debate at usability conferences. This paper aims to determine an appropriate sample size through empirical studies on the social network and educational domains by employing three types of UEM; it also examines further the impact of sample size on the findings of usability tests. Moreover, this paper quantifies the sample size required for the Domain Specific-to-context Inspection (DSI) method, which itself is developed through an adaptive framework. The results show that there is no certain number of participants for finding all usability problems; however, the rule of 16±4 users gains much validity in user testing. The magic number of five evaluators fails to find 80% of problems in heuristic evaluation, whereas three evaluators are enough to find 91% of usability problems in the DSI method.

Keywords—Heuristic evaluation (HE); User Testing (UT); Domain Specific Inspection (DSI); methodological framework; sample size
I. INTRODUCTION
A crucial aspect in planning UEM sessions is establishing the sample size, as each method entails certain implementation costs; we need to balance costs against benefits. Many companies struggle with limited budgets, and so usability experts recommend recruiting only five participants for usability testing, rather than the large samples needed for experimental research; however, some experts are opposed to this figure [1]. This issue has concerned usability professionals for the last two decades, and no usability conference is complete without one or more heated debates on it [2]. For this reason, many theoretical models and equations have been established for determining the most appropriate sample sizes for usability evaluation studies. This paper aims to readdress this issue and to quantify the sample size required, based on empirical studies conducted on two different domains. It also seeks to measure the effect that different sample sizes have on the number of usability problems found.
This paper is organized in the following way. Section 2 is a brief literature review. Section 3 presents the analysis and results. Section 4 discusses the important points, and Section 5 presents the conclusion and suggestions for future work.

II. LITERATURE REVIEW
Throughout the literature, one of the most frequently asked questions in the usability field, and one whose answer is very important for developers, designers, market researchers, and usability practitioners and experts, is "how many users are really enough?" This question has challenged researchers and professionals in the field of usability engineering and Human-Computer Interaction (HCI) because it has consequences for any evaluation results. Although this issue has been hotly debated amongst researchers for many years, there is no consensus on any rule that could be relied upon to determine this number, because every usability practitioner seems to have a different opinion. This is a major challenge in HCI because no one can know in advance how many problems exist; thus any estimate of how many participants are required to find a certain percentage of interface problems is based on an assumption [3]. There are thorny issues making an unequivocal answer to this question almost impossible; for instance, the aim of the researcher's study, the size of the project, the accuracy of the uncovered usability problems, and the design and scope of the tasks [4]. For example, if the aim of an evaluation is to identify the major usability problems in a small part of a system, the researcher may recruit a small sample. Should the researcher wish to evaluate the whole system with many task scenarios, a large sample size would be needed to identify the remaining issues [5]. Over the years, two different camps have emerged on this topic: one that believes that five users are enough to identify most of the usability problems, and another that believes that this number is nowhere near enough [1; 6]. These are discussed in the following subsections.

A. Why are only five users needed?

The pioneers of the first camp, such as Nielsen, Lewis and Virzi, believe that 80% of usability problems can be identified with a sample of five users, which is known as the "magic number". They arrived at this conviction after analysing the results of many empirical studies.
They find that observing five users allows them to discover 80% of a product's usability problems [7]. More specifically, they find that the first user discovers almost one-third of all usability problems; the second discovers many repeated problems but new ones appear; the third user discovers a small number of new problems; and the fourth and fifth users also find a small number. After the fifth user, many problems are merely repeated, and fewer and fewer new problems are revealed [8]. [9] argued that the optimal sample size in terms of commercial cost-benefit may be as low as three users. [10] summarized how different user sample sizes lead to discovering various percentages of usability problems, based on a large number of studies. This advice can assist in choosing the sample size according to the probable minimum level or mean percentage of discoverable usability problems.

TABLE I. ABSTRACTED FROM FAULKNER (2003)

# of users    Minimum % found    Mean % found
5             55                 85.55
10            82                 94.686
15            90                 97.050
20            95                 98.4
30            97                 99.0
40            98                 99.6
50            98                 100
This camp uses the following formula (1) to estimate the proportion of problems found from the problem discovery rate p, which [4] defines as "the average of the proportion of participants experiencing each observed problem":

P = 1 - (1 - p)^n    (1)

where p is the average problem discovery rate computed across subjects/problems, n is the number of subjects, and P is the proportion of problems that can be discovered. p can be computed by listing all the usability problems identified during the test, marking, for each user, the problems they encountered, adding up the problems identified by each user, and finally dividing by the product of the number of users and the total number of problems.
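To make formula (1) concrete, the following sketch (our own illustration, not part of the original study; the participant-by-problem matrix is hypothetical) estimates p from raw test data, projects the proportion of problems a given sample would be expected to uncover, and inverts the formula to find the smallest sample reaching a target discovery level.

```python
import math

# Hypothetical participant-by-problem matrix: rows are participants,
# columns are problems; 1 means the participant experienced that problem.
matrix = [
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
]

n_participants = len(matrix)
n_problems = len(matrix[0])

# p: average proportion of participants experiencing each observed problem.
p = sum(sum(row) for row in matrix) / (n_participants * n_problems)

def expected_discovery(p: float, n: int) -> float:
    """Formula (1): proportion of problems expected to be found by n participants."""
    return 1 - (1 - p) ** n

def participants_needed(p: float, target: float = 0.80) -> int:
    """Smallest n such that 1 - (1 - p)^n >= target."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(f"p = {p:.2f}")
print(f"Expected discovery with 5 users: {expected_discovery(p, 5):.0%}")
print(f"Users needed for 80% discovery: {participants_needed(p, 0.80)}")
```

Note that the projected sample size is only as good as the estimate of p, which is exactly where the two camps described in this section diverge.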
B. Why and when five users are not enough

The pioneers of the second camp, such as Lindgaard, Chattratichart, Spool, Schroeder, Hwang and Salvendy, disagree with the above assertion. They criticise the use of a small number of users, arguing that reliability may be lost and usability problems may be missed. Moreover, employing only a small number of users ignores the individual differences between them, even though this aspect underpins relatively straightforward studies utilizing quite closed/specific tasks. Accordingly, they recommend recruiting more than five users. For example, [11] evaluated four different electronics websites and found that five users discovered only 35% of the usability problems. [12] conducted nine usability tests, comparing the results of two teams, where team A consisted of six users and team B of twelve; the analyses showed that the teams discovered 42% and 43% of the problems, respectively. [13] reported that five users failed to reach an 80% overall usability discovery rate, and that eleven users were needed to achieve this percentage. [14] analysed the quantitative data of 27 experiments. Those 27 studies all employed three evaluation methods, and linear regression was applied to determine the samples for each. They found that Think Aloud (TA), Heuristic Evaluation (HE) and Cognitive Walkthrough (CW) required nine users, eight evaluators and eleven evaluators, respectively, to discover 80% of usability flaws. As a result, they proposed a new rule for optimal sample size, 10±2, recommending its application under general evaluation conditions. In this regard, [10] found that a sample size of ten participants will most likely reveal a minimum of 82% of the problems. However, [15] doubts the ability of ten users or experts to find 80% of usability problems; this rule also ignores usability practitioners who test with only a few participants in iterative design cycles. [16] developed an adjustable sample size estimation model for usability assessment using two factors, Beta (β) and Alpha (α); they found that the best estimate for the sample size is about eight users. However, [17] pointed out that seven users are optimal for a small project and fifteen users are optimal for a medium or large project. Later, [7] improved the small-sample estimation of p by using the Good-Turing adjustment model, which they applied to eight users. They found that the appropriate sample size would be seven users, even where the study is quite complex in nature. [18] argued that twenty users are suitable for many commercial studies. [19] found that "8 to 25 participants per team is a sensible range to consider and that 10 to 12 participants are probably a good baseline range".

C. Why is there no consensus on sample size?

There are various factors impacting the estimation of sample size, as discussed in [17; 7; 20]. These are:
- Properties of the system and interface, including the size of the software product.
- Stage in the usability lifecycle at which the product is assessed, whether early in the design phase or after several iterations of test and re-design.
- Type and quality of the methodology used to conduct the assessment (summative or formative).
- Specific tasks selected.
- Match between the assessment and the context of real-world usage.
- Representativeness of the assessment users.
- Skill of the evaluator.
- Personality of the participants (introverts, extroverts).

Furthermore, increasing the effectiveness of usability studies is a major aim of HCI researchers, and their usual solution is to increase the sample size. [21] mentions that a large sample size is needed for quantitative studies (at least twenty users), for card sorting (at least fifteen users), and for eye-tracking (at least 39 users). Moreover, there is a conviction amongst usability professionals that major problems need to be found and fixed; they are also convinced that most usability testing methods will never discover all of the usability problems, even if a large sample is used. From an industry point of view, time and money are more important than finding all the problems;
if all the usability problems are discovered but most of them cannot be fixed, it is of no value. It is not logical to expend all of the available time and money on finding problems; rather, it is more commercially fruitful to balance time between finding problems and fixing them [22]. There is agreement amongst usability professionals that testing with a large sample leads to increased costs and time consumed in analysing data, albeit with improved reliability. Nevertheless, testing with too few users can lead to important usability problems being left undiscovered [19]. [15] asserts that most usability practitioners continue to apply strategies of iterative low-budget assessment, where quantitative data are unreliable or unnecessary. He also concludes that the importance of sample size depends primarily on the context of a study. Finally, [3] adjusted the above question to "how many users can be afforded?" and "how many users do we have time for?"

D. Improving the Current UEMs

The current challenge in UEMs is how to improve them such that they can be used to evaluate new technologies in an efficient manner. We have decided to take this opportunity to address this challenge, and we have accordingly developed an adaptive framework. This framework is applicable across numerous domains; it is used here to generate two Domain Specific Inspection (DSI) evaluation methods, for the social network and educational domains [23, 24, 25]. Furthermore, we have developed two checklists (from both of the newly developed DSI methods) that can be applied to any website in the social network and educational domains, as tools that can be used by designers, developers, instructors, evaluators and website owners to facilitate their evaluation process in designing interactive interfaces or in assessing the quality of existing websites. These checklists also allow anyone to adopt any particular area of usability (or principle) to identify usability problems relating to seven specific areas in the social network domain or five specific areas in the educational domain [26, 27]. In addition, the DSI methods for both domains were tested intensively through rigorous validation methods to verify the extent to which they achieve the identified goals, needs and requirements that the adaptive framework was originally developed to address. The DSI approach is applied alongside heuristic evaluation (HE) and user testing (UT) on each domain to identify which problems are discovered by HE and DSI and not discovered by UT, and vice versa. In this paper, we aim to quantify the sample size required for this method.

Selection of the Targeted Websites and Recruitment of Participants

For the educational domain, three websites were chosen based on specific criteria (good interface design, rich functionality, good representatives of completely free educational websites, not familiar to the users, and no changes occurring before or during the actual evaluation): Skoool, AcademicEarth and BBC KS3bitesize. The researchers decided to recruit eight expert evaluators to employ two methods, namely DSI and HE, to evaluate the three different websites. Also, 60 users were engaged for the UT method; they were chosen carefully to reflect the real users of the targeted websites and were divided into three groups for each website,
i.e. a total of 20 users for each website. The majority of the users were students, and they were mixed across the three user groups in terms of gender, age, education level and computer skills.

For the social network domain, three websites were chosen based on specific criteria (good interface design, rich functionality, good representatives of the social network domain, not familiar to the users, and no changes occurring before or during the actual evaluation): Google+, LinkedIn and Ecademy. The researchers decided to recruit six expert evaluators to employ two methods, namely DSI and HE, to evaluate the three different websites. Also, 75 users were engaged for the UT method; they were chosen carefully to reflect the real users of the targeted websites and were divided into three groups for each website, i.e. a total of 25 users for each website. The majority of the users were students and employees, and they were mixed across the three user groups in terms of gender, age, education level and computer skills.

Experimental Procedure

For HE and DSI, the evaluators were divided into groups. The evaluations were carried out in a prescribed sequence, i.e. one group used DSI on Website 1, then HE on Website 2, and finally DSI on Website 3, while the second group used HE on Websites 1 and 3 and DSI on Website 2. The researchers adopted this technique to avoid any bias in the results and also to avoid the risk of any expert reproducing his/her results in the second session through over-familiarity with one set of heuristics, i.e. each evaluation was conducted with a fresh frame of mind. The researchers emphasized to each evaluator group that they should apply a lower threshold before reporting a problem in order to avoid missing real problems in the system. Then the actual expert evaluation was conducted; the evaluators evaluated all websites consecutively, rating all the problems they found within a limited time (90 minutes). They used checklists that were developed to facilitate the evaluation process. After that, they were asked to submit their evaluation report and to give feedback on their own evaluation results. Finally, the researchers extracted the problems discovered by the experts from the checklists of both DSI and HE. They then conducted a debriefing session with the same expert evaluators to agree on the discovered problems and their severity, and to remove any duplicate problems, false positives or subjective problems. The problems agreed upon were merged into a master problem list, and any problems upon which the evaluators disagreed were removed.

For UT, each testing session started with a training (familiarization) session for the users; it involved a quick introduction on the task scenarios and the purpose of the study. The next step entailed explaining the environment and equipment, followed by a quick demonstration of how to 'think aloud' while performing the given tasks. Prior to the tests, the users were asked to read and sign the consent letter, and to fill out a demographic data form that included details such as level of computer skill.
All the above steps took approximately ten minutes for each test session. Furthermore, the researchers emphasized to each user group that they should apply a lower threshold before reporting a problem in order to avoid missing real problems in the system. The actual test started from this point, i.e. when the user was given the task scenario sheet and asked to read and then perform one task at a time. Once they had finished the session, they were asked to write down their comments and thoughts, and to explain any reaction that had been observed by the observers (researchers) during the test, all in a feedback questionnaire. This was followed by a brief discussion session with independent evaluators to rank the severity of the problems derived from the user testing and to remove any duplicate problems. Following this, the list of usability problems for UT was established. Subsequently, a single unique master list of usability problems was consolidated from the three methods.

III. RESULTS AND DISCUSSION
Recent challenges to estimating the most appropriate sample size are the reliability and validity of the gathered data. In this study, Nielsen's scale, which has been widely used in past studies, was used to rank the severity of the usability problems found [28]. We used multiple evaluators to minimize any negative effects they may have on the DSI and HE methods. For the educational domain, they were divided into two groups of four, carefully balanced in terms of experience. In each group, there were two 'double expert' evaluators (usability specialists in educational websites) and two 'single expert' evaluators (usability specialists in general). For the social network domain, the evaluators were divided into two groups of three, again carefully balanced in terms of experience. In each group, there were two double expert evaluators (usability specialists in social networking websites) and one single expert evaluator. Each evaluator conducted his/her evaluation separately, rating all the problems they found within a limited timeframe (1 hour) in order to ensure independent and unbiased evaluations. In addition, the evaluation was carried out in a prescribed sequence, i.e. Group 1 used HE, DSI and HE on Websites 1, 2 and 3, respectively, while Group 2 used DSI, HE and DSI on Websites 1, 2 and 3, respectively. The researchers adopted this technique to avoid any bias in the results and also to avoid the risk of any expert reproducing his/her results in the second session through over-familiarity with one method, i.e. each evaluation was conducted with a fresh frame of mind. Moreover, the researchers extracted the problems found by the three methods from the problem sheets and removed all false positive ('not real') problems, evaluators' subjective problems and duplicated problems during the debriefing session. Thus, the problems agreed upon were merged into a unique master problem list. The Any-Two-Agreement reliability formula (2) was used in order to measure the performance of the evaluators individually on the same websites, to determine the level of agreement amongst them on the usability problems found. The results ranged from 0.08 to 0.28 for HE and from 0.21 to 0.58 for DSI in both experiments.

Any-Two-Agreement = average of |Pi ∩ Pj| / |Pi ∪ Pj| over all n(n-1)/2 pairs of evaluators    (2)

where Pi is the set of problems discovered by evaluator i, Pj is the set of problems discovered by evaluator j, and n is the number of evaluators [29].
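Formula (2) can be computed mechanically from each evaluator's problem set. The sketch below is our own minimal illustration with hypothetical problem identifiers, not data from this study.

```python
from itertools import combinations

# Hypothetical problem sets found by each evaluator (problem IDs).
evaluator_problems = [
    {"P1", "P2", "P5"},        # evaluator 1
    {"P2", "P3", "P5", "P7"},  # evaluator 2
    {"P1", "P5", "P7"},        # evaluator 3
]

def any_two_agreement(problem_sets):
    """Formula (2): average Jaccard overlap |Pi ∩ Pj| / |Pi ∪ Pj|
    over all n(n-1)/2 pairs of evaluators."""
    pairs = list(combinations(problem_sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

print(f"Any-Two-Agreement = {any_two_agreement(evaluator_problems):.2f}")
```

Low values (such as those reported above for HE) indicate that individual evaluators find largely different problem sets, which is the evaluator effect described in [29].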
In terms of users, they were chosen carefully to reflect the real users of the targeted websites and were divided into three groups for each website. For example, in the educational domain there were a total of twenty users for each website, the majority of whom were students, and they were mixed across the three user groups in terms of gender, age, education level and computer skills. Also, in the social network domain there were a total of 25 users for each website, the majority of whom were students and employees, and they were mixed across the three user groups in terms of gender, age, education level and computer skills. Pilot studies were conducted to assess the time needed for each task, and the user testing sessions were observed by the researchers, who used an observation sheet to write down the behaviour of each user and the number of problems encountered. The test environment was a quiet room. The researchers attempted to identify what equipment the users regularly used and set it up for them before the test, for example using the same type of machine and browser. Furthermore, the researchers used a ranking sheet to help the independent evaluators, who ranked the severity of the usability problems found by the users.

A. Sample sizes for DSI and HE

After analysing the evaluators' results, we used Formula 1 to quantify the appropriate sample size for both HE and the new method (DSI). In terms of the performance of each group in discovering unique and overlapping problems in the educational domain experiment, Table II illustrates the total number of real problems discovered, which was 99 on the three websites, out of which 25 were identified using HE-Group and 74 using DSI-Group. All the duplicated problems were removed and compared by two independent evaluators in order to identify the unique and overlapping problems. When the problems from the two evaluation groups were consolidated, there were 19 duplicates; we thus identified a total of 80 real problems in all websites. The total for uniquely identified problems in all websites was 61; DSI-Group identified 55 real problems (69% of the 80 problems) that were not identified by HE-Group, and there were 6 real problems (8% of the 80) identified by HE-Group that were not identified by DSI-Group. 19 real problems (24% of the 80) were discovered by both groups (as depicted in Fig. 1). The t-test was used for comparing the means of the two samples in this study, and the results reveal that there are significant differences between the samples (p-value < 0.001; Table II). Moreover, the single expert evaluators were more efficient in discovering usability problems relating to design, navigation and layout, whereas the double expert evaluators were more efficient in discovering usability problems relating to content quality, learning process and motivational factors; this is because they have expertise in this domain.
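The comparison reported above is an independent two-sample t-test on per-evaluator problem counts. A minimal sketch of that style of analysis follows; the counts are illustrative placeholders rather than the exact study data, and SciPy is assumed to be available.

```python
from scipy import stats

# Illustrative per-evaluator problem counts (placeholders, not the study data):
he_counts = [8, 5, 4, 6, 2, 3]        # problems found by HE evaluators
dsi_counts = [12, 21, 11, 19, 9, 14]  # problems found by DSI evaluators

# Independent two-sample t-test comparing the mean number of problems per evaluator.
t_stat, p_value = stats.ttest_ind(he_counts, dsi_counts)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```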
TABLE II. SUMMARY OF EACH EVALUATOR IN EACH TEAM ACCORDING TO THE PROBABILITY LEVEL OF PROBLEM DISCOVERY IN THE EDUCATIONAL DOMAIN
(+) double expert  (^) single expert  (Ev.) evaluator

Website          Group  Evaluators (type)        Method  # of problems found  # of problems       P     % of problems
                                                         (per evaluator)      without repetition        discovered
Skoool           G1     Ev.1+ Ev.2^ Ev.3^ Ev.4+  HE      8, 5, 1, 5           10                  0.29  75%
Skoool           G2     Ev.1+ Ev.2^ Ev.3+ Ev.4^  DSI     12, 21, 21, 21       33                  0.59  98%
Academic Earth   G1     Ev.1+ Ev.2^ Ev.3^ Ev.4+  DSI     11, 10, 9, 12        29                  0.64  98%
Academic Earth   G2     Ev.1+ Ev.2^ Ev.3+ Ev.4^  HE      6, 6, 7, 4           21                  0.37  84%
BBC KS3bitesize  G1     Ev.1+ Ev.2^ Ev.3^ Ev.4+  HE      1, 1, 1, 2           2                   0.35  82%
BBC KS3bitesize  G2     Ev.1+ Ev.2^ Ev.3+ Ev.4^  DSI     6, 6, 7, 5           12                  0.71  99%

t-test summary:  Method  Mean   Std. Deviation  Sig.
                 HE      3.92   2.610           .000
                 DSI     10.83  4.707
Fig. 1. Overlapping problems between both groups in the educational domain (19 problems, 24%, discovered by both groups)
In terms of the performance of each group in discovering unique and overlapping problems in the social network domain, Table III illustrates that the total number of real problems discovered was 182 across all three websites, out of which 47 were identified using HE-Group and 135 using DSI-Group. When the problems from the two evaluation groups were consolidated, there were 24 duplicates; we thus identified a total of 158 problems in all websites. The total for uniquely identified real problems in all websites was 128. The evaluation using DSI-Group identified 96 real problems (61% of the 158 problems) that were not identified by HE-Group, and there were 32 real problems (20% of the 158) identified by HE-Group that were not identified by DSI-Group. 30 real problems (19% of the 158) were discovered by both groups (as depicted in Fig. 2). The t-test was used for comparing the means of the two samples in this study, and the results reveal that there are significant differences between the samples (p-value = 0.003; Table III). Moreover, the single expert evaluators were more efficient in discovering usability problems relating to layout, formatting, navigation and search, and content quality, whereas the double expert evaluators were more efficient in discovering usability problems relating to business support, user usability, sociability and management activities; this is because they know, based on their expertise, the factors that lead to the success of websites in this domain.

In summary, DSI worked better than HE, but neither was able to discover all the usability issues in this study. The above figures show that the results of both methods depend noticeably on the evaluators' performance; the double evaluators in both domains, whether using HE or DSI, discovered more problems than the single evaluators. Also, the former were more efficient in discovering catastrophic and major problems, whereas the latter were good at discovering minor and cosmetic problems. Therefore, the effects of the evaluators' characteristics (double or single) and of the method type have been confirmed in these usability studies. Another effect confirmed here is that of sample size, which can be looked at from two sides. The first is when the sample size of evaluators is examined without focusing on their characteristics (double or single), as shown in Tables II and III. For the HE method, four evaluators are enough to find from 75% to 84% of the usability problems, which is in line with previous studies [7]; however, three evaluators find only from 23% to 27%.
TABLE III. SUMMARY OF EACH EVALUATOR IN EACH TEAM ACCORDING TO THE PROBABILITY LEVEL OF PROBLEM DISCOVERY IN THE SOCIAL NETWORK DOMAIN
(+) double expert  (^) single expert  (Ev.) evaluator

Website   Group  Evaluators (type)  Method  # of problems found  Total # of problems  P     % of problems
                                            (per evaluator)      without repetition         discovered
Google+   G1     Ev.1^ Ev.2+ Ev.3+  DSI     16, 33, 17           55                   0.97  99%
Google+   G2     Ev.1+ Ev.2^ Ev.3+  HE      6, 5, 11             22                   0.1   27%
LinkedIn  G1     Ev.1^ Ev.2+ Ev.3+  HE      2, 8, 6              13                   0.09  25%
LinkedIn  G2     Ev.1+ Ev.2^ Ev.3+  DSI     24, 8, 27            47                   0.73  98%
Ecademy   G1     Ev.1^ Ev.2+ Ev.3+  DSI     6, 28, 23            33                   0.55  90%
Ecademy   G2     Ev.1+ Ev.2^ Ev.3+  HE      5, 3, 4              12                   0.08  23%

t-test summary:  Method  Mean   Std. Deviation  Sig.
                 HE      5.56   2.698           .003
                 DSI     20.22  9.162
Fig. 2. Overlapping problems between both groups in the social network domain (DSI-only: 96 problems, 61%; HE-only: 32 problems, 20%; both: 30 problems, 19%)
For the DSI method, four evaluators are enough to find from 98% to 99% of the usability problems, whereas three evaluators are enough to find from 90% to 99% of the usability problems. The second side is when the sample size of evaluators is examined with a focus on their characteristics (double or single), as shown in Table IV. For the HE method, four single evaluators are not enough to find 80% of the usability problems; in the best case, 26% of the total problems were found. However, four double evaluators, in the best case, found 65% of the total problems, which is much nearer to 80%. Consequently, seven evaluators (mixed double and single) are sufficient to find 80% or more of the usability problems using the HE method. For the DSI method, four double evaluators, in the best case, discovered 98%, whereas four single evaluators discovered 66% and two single evaluators discovered 48%. Consequently, three evaluators (mixed double and single) are enough to find 91% of the usability problems using the DSI method.
TABLE IV. TEAM SIZE ACCORDING TO THE PROBLEMS DISCOVERED AND AVERAGE PROPORTION IN BOTH DOMAINS

Domain                  Evaluators  Method type  Total # of problems found  % of problems discovered
Educational domain      4 double    HE           29                         42%
                        4 single    HE           18                         26%
                        4 double    DSI          69                         91%
                        4 single    DSI          52                         66%
Social networks domain  4 double    HE           40                         65%
                        2 single    HE           28                         17%
                        4 double    DSI          125                        98%
                        2 single    DSI          30                         48%
B. Sample size for User Testing

During the testing, we studied the users in chunks punctuated by time slots for incrementally analysing the data from each group. In this study, the users were divided into four group sizes, two in each experiment. For example, testing with Group One (8 users) on Website One in the educational domain was followed by analysis and discussion of their results; after that, testing with Group Two (12 users) on the same website was followed by analysis and discussion of their results, and so on for the other websites and in the second experiment. Based on the results of other studies mentioned in the literature review, we chose groups of 5, 8, 12 and 20 users across the two experiments.
In the first experiment, the figures in Table V show that, on the Skoool website, Team A reported 46 issues, whereas Team B reported 92 issues. On Academic Earth, Team C reported 39 issues, while Team D reported 91 issues. Finally, Team E on BBC KS3bitesize reported 50 issues, but Team F reported 88 issues. The maximum overlap was 122 issues; it occurred between Team E, which tested 8 users and reported 50 issues, and Team F, which tested 12 users and reported 88 issues. The minimum overlap was 114 issues, between Teams A and B. The whole study identified 41 critical issues, 1 catastrophic issue, 8 major issues, 9 minor issues and 23 cosmetic issues. The t-test was used for comparing the means of the two samples in this study, and the result reveals that there was no significant difference between the samples (p-value = 0.596; Table V). Moreover, when the problems of both groups were classified according to the five problem areas in this domain, the group of 8 users was more efficient in discovering problems relating to two particular areas, namely 'User usability' and 'Content information and process orientation', whereas the group of 12 users was more efficient in discovering problems relating to three areas, namely 'Design and media usability', 'Learning process' and 'Motivational factors'.
TABLE V. TEAM SIZE ACCORDING TO THE PROBABILITY LEVEL OF PROBLEM DISCOVERY IN THE EDUCATIONAL DOMAIN

Website          Team name  # of users  # of issues found  Mean found  P     % of problems discovered
Skoool           A          8           46                 5.75        0.23  88%
Skoool           B          12          92                 7.66        0.31  98.9%
Academic Earth   C          8           39                 4.88        0.18  80%
Academic Earth   D          12          91                 7.58        0.37  99.7%
BBC KS3bitesize  E          8           50                 6.25        0.27  92%
BBC KS3bitesize  F          12          88                 7.33        0.34  99.4%

t-test summary:  Group size  Mean  Std. Deviation  Sig.
                 8 users     6.29  2.911           .596
                 12 users    7.53  3.094
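As a small consistency check (our own, under the assumption that the "% of problems discovered" column in Table V follows formula (1) applied to each team's discovery rate P and team size), the reported percentages can be approximately reproduced as follows.

```python
# Reproduce the "% of problems discovered" column of Table V from formula (1),
# assuming that column equals 1 - (1 - P)^n for each team's rate P and size n.
teams = {
    "A": (8, 0.23), "B": (12, 0.31), "C": (8, 0.18),
    "D": (12, 0.37), "E": (8, 0.27), "F": (12, 0.34),
}
for name, (n_users, p) in teams.items():
    print(f"Team {name}: {1 - (1 - p) ** n_users:.1%}")
```

The output (roughly 88%, 99%, 80%, 100%, 92%, 99%) matches the table to within rounding, which suggests the percentages were projected from the estimated discovery rates rather than measured against a known total of problems.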
In the second experiment, Table VI shows that Team G on Google+ reported 43 issues, whereas Team H reported 198 issues. On LinkedIn, Team I reported 48 issues, while Team J reported 159 issues. Finally, Team K on Ecademy reported 35 issues, but Team L reported 123 issues. The maximum overlap was 207 issues; it occurred between Team G, which tested 5 users and reported 43 issues, and Team H, which tested 20 users and reported 198 issues. The minimum overlap was 139 issues, between Teams K and L. The whole study identified 79 critical issues, 6 catastrophic issues, 17 major issues, 25 minor issues and 32 cosmetic issues. The t-test was used for comparing the means of the two samples in this study, and the result reveals that there was no significant difference between the samples (p-value = 0.699; Table VI). Nielsen [1] proclaims that 5 users are enough to catch 80% of the problems on practically any website. However, our data provide evidence to the contrary: when analysing samples of five users, in the best case only 37% of the total problems were found, i.e. nowhere near the 80% objective. Moreover, when the problems of both groups were classified according to the seven problem areas in this domain, the group of 5 users was more efficient in discovering problems relating to three areas, namely 'Layout and formatting', 'Content quality' and 'Accessibility and compatibility', whereas the group of 20 users was more efficient in discovering problems relating to four areas, namely 'User usability, sociability and management activities', 'Navigation system and search quality', 'Security and privacy' and 'Business support'.
TABLE VI. TEAM SIZE ACCORDING TO THE PROBABILITY LEVEL OF PROBLEM DISCOVERY IN THE SOCIAL NETWORK DOMAIN

Website   Team name  # of users  # of issues found  Mean found  P     % of problems discovered
Google+   G          5           43                 7           0.03  15%
Google+   H          20          198                9.9         0.1   88%
LinkedIn  I          5           48                 9.6         0.06  26.7%
LinkedIn  J          20          159                7.95        0.21  99.2%
Ecademy   K          5           35                 8.6         0.09  37.6%
Ecademy   L          20          123                6.15        0.18  98.2%

t-test summary:  Group size  Mean  Std. Deviation  Sig.
                 5 users     8.40  3.355           0.699
                 20 users    8     3.723
IV. PRACTITIONERS' ADVICE
This study shows that a sample of 16±4 participants is valid for discovering over 90% of the usability problems in the tested interfaces using the UT method. We arrived at these results by using differing combinations of task designs and think-aloud approaches, which increased the complexity of these experiments [30] [31]. Therefore, we should rethink Nielsen's argument when he argues in favour of increasing the sample size only to test complex systems. Furthermore, the figures in this study cannot be generalised to other domains because of the complexity and context of this study: it employed different types of complex task designs, the websites were specifically chosen, and the differences between and among the users and evaluators in terms of their characteristics and knowledge may have been significant. Also, the participants were recruited from different cultures, namely British, Indian and Arab, which means that the interaction, communication and tested interfaces may all be perceived differently than they would be by participants from a single culture. This could explain the differences between the users' results, and it needs further investigation in the future. Table VII shows the appropriate sample sizes for various usability study purposes based on the findings of this study.
TABLE VII. SAMPLE SIZE ESTIMATION FOR VARIOUS USER TESTING PURPOSES

Main Purpose                                                                   # of users
To find more cosmetic problems and problems relating to structure
and content.                                                                   5
To find a few major and more minor problems; also more appropriate for
commercial studies and for finding more problems in layout and formatting.     8
To find more catastrophic, major, minor and cosmetic problems; also for
finding more problems relating to design, navigation and the key aims and
functions for which the system is built. Moreover, it is more appropriate
for comparative studies.                                                       16±4
For statistically significant studies and analysis of performance metrics,
such as success rate.                                                          ≥20
It can be seen that there is no solid sample size for finding all usability problems. Also, for studies where statistically significant findings are being sought, or for comparative studies, a group size of greater than or equal to twenty users is valid. This paper strongly recommends considering twenty users as the highest sample size and twelve users as the lowest, along with the study's complexity and the criticality of its context, before starting an evaluation study in order to achieve a successful evaluation. Furthermore, for the HE and DSI methods, when recruiting experts one must consider that the number of evaluators and their expertise (double or single) can affect the results to a considerable degree, and probably more than participant group size, as seen in Table VIII.
TABLE VIII. SAMPLE SIZE ESTIMATION FOR THE VARIOUS EXPERT TESTS

Main Purpose                                                                  # of experts
For HE, this sample of mixed double and single experts is enough to find
80% of problems, applying user testing (UT) as a complementary method.       7
For DSI, this sample of mixed double and single experts is enough to find
80% of problems without applying the user testing (UT) method.               3

V. CONCLUSION

It is challenging to determine the optimal sample size based on problem discovery or level of confidence and then to generalize this advice, because the result should be driven by the study's context; there is no universal solution to this challenge. The above results provide evidence that the first camp's affirmation, which states that a sample of 5 users will discover 80% of all usability problems, is not wholly correct, whereas the 16±4 rule gains much validity for user testing. For heuristic evaluation, the rule of 7 with mixed double and single evaluators is sufficient; likewise, 3 mixed double and single evaluators are sufficient for the DSI method. This study confirms the importance of involving double expert evaluators to take advantage of their expertise in finding the main problems that might lead to product failure. The reality is that most usability testing will never discover all or even most problems; also, even if all of them were discovered, most of them would never be fixed because of their cost. Consequently, there is no unique model for sample size estimation, because the sample size depends on the objective of each particular study, as shown in Tables VII and VIII. The group size should typically be increased along with the study's complexity and the criticality of its context, and we should be careful when dealing with any advice offered in the literature. Furthermore, it is better to split the sample into groups of users (the data can be analysed for each group); a study can also be terminated in the early stages when its purpose has been achieved, to save time and money. Further necessary work is to generate additional DSI methods for other systems, such as mobile applications, and to determine the sample size for them by performing diverse tests on increasing numbers of users, and then to compare these data to those processed here in order to more conclusively verify this rule (16±4). Finally, it is important to examine the impact of culture on usability testing and on determining the sample size.

ACKNOWLEDGMENT

We thank the expert evaluators and users at the University of East Anglia (UEA) and the Aviva Company for their participation in the comparative study. Also, we thank the Editor in Chief and the reviewers of this paper at the Science and Information (SAI) Conference 2014.
REFERENCES
[1] J. Nielsen, "Why You Only Need to Test with 5 Users", http://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/, 2000.
[2] R. Molich, "A Critique of 'How To Specify the Participant Group Size for Usability Studies: A Practitioner's Guide' by Macefield", Journal of Usability Studies, 5(3), 124-128, 2010.
[3] J. Lazar, J. H. Feng, and H. Hochheiser, "Research Methods in Human-Computer Interaction", Wiley, 2010.
[4] J. R. Lewis, "Sample sizes for usability tests: mostly math, not magic", Interactions, 13(6), 29-33, 2006.
[5] T. Tullis and B. Albert, "Measuring the User Experience: Collecting, Analyzing, and Presenting Usability Metrics", Newnes, 2013.
[6] A. Woolrych and G. Cockton, "Why and when five test users aren't enough", in Proceedings of the IHM-HCI 2001 Conference (Vol. 2, pp. 105-108), Toulouse, France: Cépadèus, 2001.
[7] C. W. Turner, J. R. Lewis, and J. Nielsen, "Determining usability test sample size", International Encyclopedia of Ergonomics and Human Factors, 3, 3084-3088, 2006.
[8] C. Zapata and J. A. Pow-Sang, "Sample size in a heuristic evaluation of usability", Software Engineering: Methods, Modeling, and Teaching, Pontificia Universidad Católica del Perú, 37, 2012.
[9] R. A. Virzi, "Refining the test phase of usability evaluation: How many subjects is enough?", Human Factors, 34, 457-468, 1992.
[10] L. Faulkner, "Beyond the five-user assumption: Benefits of increased sample sizes in usability testing", Behavior Research Methods, Instruments and Computers, 35(3), 379-383, 2003.
[11] J. Spool and W. Schroeder, "Testing web sites: Five users is nowhere near enough", in CHI'01 Extended Abstracts on Human Factors in Computing Systems (pp. 285-286), ACM, 2001.
[12] G. Lindgaard and J. Chattratichart, "Usability testing: what have we overlooked?", in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1415-1424, ACM, 2007.
[13] L. C. Law and E. T. Hvannberg, "Analysis of combinatorial user effect in international usability tests", in Proceedings of the CHI Conference on Human Factors in Computing Systems, ACM, 2004.
[14] W. Hwang and G. Salvendy, "Number of people required for usability evaluation: the 10±2 rule", Communications of the ACM, 53(5), 130-133, 2010.
[15] M. Schmettow, "Sample size in usability studies", Communications of the ACM, 55(4), 64-70, 2012.
[16] H. S. Jabbar, T. V. Gopal, and S. J. Aboud, "An Adjustable Sample Size Estimation Model for Usability Assessment", American Journal of Applied Sciences, 4(8), 525, 2007.
[17] J. Nielsen and T. K. Landauer, "A mathematical model of the finding of usability problems", in Proceedings of the INTERACT'93 and CHI'93 Conference on Human Factors in Computing Systems (pp. 206-213), ACM, 1993.
[18] C. Perfetti and L. Landesman, "Eight is not enough", http://www.uie.com/articles/eight_is_not_enough, 2001 (retrieved March 3, 2009).
[19] R. Macefield, "How to specify the participant group size for usability studies: A practitioner's guide", Journal of Usability Studies, 5(1), 34-45, 2009.
[20] A. Vassar, "The effect of personality in sample selection for usability testing", School of Computer Science and Engineering, The University of New South Wales, 2012.
[21] J. Nielsen, "How Many Test Users in a Usability Study?", http://www.nngroup.com/articles/how-many-test-users/, 2012.
[22] D. Wixon, "Evaluating usability methods: why the current literature fails the practitioner", Interactions, 10(4), 28-34, 2003.
[23] R. AlRoobaea, A. Al-Badi, and P. Mayhew, "Generating a Domain Specific Inspection Evaluation Method through an Adaptive Framework: A Comparative Study on Educational Websites", International Journal of Human Computer Interaction (IJHCI), 4(2), 88, 2013.
[24] R. S. AlRoobaea, A. H. Al-Badi, and P. J. Mayhew, "A framework for generating a domain specific inspection evaluation method: A comparative study on social networking websites", in Science and Information Conference (SAI), 2013 (pp. 757-767), IEEE, 2013.
[25] R. AlRoobaea, A. H. Al-Badi, and P. J. Mayhew, "Generating a Domain Specific Inspection Evaluation Method through an Adaptive Framework", International Journal of Advanced Computer Science and Applications, 4(6), 2013.
[26] R. AlRoobaea, A. H. Al-Badi, and P. J. Mayhew, "Generating an Educational Domain Checklist through an Adaptive Framework for Evaluating Educational Systems", International Journal of Advanced Computer Science and Applications, 4(8), 2013.
[27] R. AlRoobaea, A. H. Al-Badi, and P. J. Mayhew, "Generating a Domain Specific Checklist through an Adaptive Framework for Evaluating Social Networking Websites", International Journal of Advanced Computer Science and Applications, 4(10), 2013.
[28] J. Nielsen, "Heuristic evaluation", in Usability Inspection Methods, vol. 24, p. 413, 1994.
[29] M. Hertzum and J. Nielsen, "The evaluator effect: A chilling fact about usability evaluation methods", International Journal of Human-Computer Interaction, 13(4), 421-443, 2001.
[30] R. AlRoobaea, A. H. Al-Badi, and P. J. Mayhew, "The Impact of the Combination between Task Designs and Think-Aloud Approaches on Website Evaluation", Journal of Software and Systems Development, Vol. 2013, Article ID 172572, DOI: 10.5171/2013.172572, 2013.
[31] O. Alhadreti, R. Alroobaea, K. Wnuk, and P. Mayhew, "The Impact of Usability of Online Library Catalogues on the User Performance", in Information Science and Applications (ICISA), 2014 International Conference on, IEEE, 2014.