Interpreting Results from Large Scale Automatic Evaluation of Web Accessibility

Christian Bühler¹, Helmut Heck¹, Olaf Perlick¹, Annika Nietzio¹, and Nils Ulltveit-Moe²

¹ Forschungsinstitut Technologie-Behindertenhilfe (FTB) der Evangelischen Stiftung Volmarstein, Grundschötteler Str. 40, 58300 Wetter (Ruhr), Germany, [email protected], www.ftb-net.de
² Høgskolen i Agder, Fakultet for Teknologi, Grooseveien 36, 4876 Grimstad, Norway, [email protected], http://www.hia.no

Abstract. The large amount of data produced by automatic web accessibility evaluation has to be preprocessed in order to enable disabled users or policy makers to draw meaningful conclusions from the assessment. We study different methods for interpretation and aggregation of the results provided by automatic assessment tools. Current approaches do not meet all the requirements suggested in the literature. Based on the UCAB approach described in UWEM 0.5 we develop a new aggregation function targeted at the requirements.

1 Introduction

In the Information Society, where a lot of information is made available on the web, it is essential to make this content accessible to all people, including people with disabilities.¹ To get an overview of the accessibility status of a large number of sites, manual evaluation by experts or disabled users can produce the most reliable results but often turns out to be too time-consuming and expensive. An automatic assessment of web accessibility is an alternative, even though it cannot perform all the tests necessary for a conformance claim. However, it can measure certain features that can be utilised as indicators for accessibility, and it allows the monitoring of a large number of web sites. The majority of existing automatic web accessibility assessment tools (e.g. Watchfire Bobby² or the ATRC web accessibility checker³) are critique systems designed for the needs of web site developers. The user can select a set of guidelines

¹ The European Union has defined eInclusion as part of the Lisbon strategy: "Ensure that every citizen should have the appropriate skills needed to live and work in a new Information Society for all."
² http://webxact.watchfire.com/
³ http://tile-cridpath.atrc.utoronto.ca/acheck/servlet/ShowGuide

K. Miesenberger et al. (Eds.): ICCHP 2006, LNCS 4061, pp. 184–191, 2006. © Springer-Verlag Berlin Heidelberg 2006


like WCAG 1.0 [1]. The tool checks the web site or single web pages against these guidelines and reports a list of error messages, often combined with repair suggestions. Such a report is not very helpful for disabled people or policy makers, who have questions like

– How accessible is this web site for a certain disability group?
– How accessible is this web site compared to previous versions or compared to other sites?

A conformance approach that summarises the evaluation result into a conformance category (WCAG A, AA or AAA) is too coarse to answer these questions. Therefore, it is useful to present the results instead as a continuous quality measure that allows comparison and grading.

The EIAO project⁴ is currently establishing the technical basis for a European Internet Accessibility Observatory (EIAO). An internet robot for automatic and frequent collection of data on web accessibility has been developed. The evaluation of this data is performed by a set of web accessibility metrics that report accessibility problems and deviations from web standards according to UWEM 0.5 [2] and WCAG 1.0. A data warehouse will provide on-line access to the collected accessibility data. To present the large amount of data to the public we need meaningful interpretation and aggregation methods.

The remainder of this paper is organised as follows. In section 2 we discuss the requirements that a large scale assessment of web accessibility should meet. We also review some approaches addressing the interpretation of results from automatic web accessibility evaluation that have been proposed in the literature. The next section presents new aggregation functions for large scale web accessibility evaluation and shows how they comply with the requirements. We conclude with the results from a preliminary experimental evaluation. We also point out some open questions and give prospects for future research directions.

2 Measuring Web Accessibility

Determining the accessibility of a web page can be viewed as a two stage process [2]. First, the resources are inspected with regard to possible barrier types. Several different properties (e.g. nesting, relation and attribute values of HTML elements) are extracted and reported (e.g. via an EARL report [3]). The goal of the second stage is to compute a single value representing the accessibility of the web page from these fine-grained reports. In this paper we examine methods for the second stage. These values can facilitate the comparison and presentation of the results. Further statistical analyses allow estimation of the accessibility of a whole web site, or of groupings of web sites, from the results for the parts.

⁴ The EIAO project is co-funded by the European Commission under the IST contract 2003-004526-STREP.

2.1 Requirements

In his study of requirements for a web accessibility metric, Zeng [4] discusses the following properties:

1. Continuous range of values (more discriminative power than binary pass/fail results or conformance levels)
2. Take into account size and complexity of the web site (or web page)
3. Efficient computation (scalability)⁵
4. Normative definition of accessibility (derived from WCAG or another standard)⁶

From UWEM 0.5 [2] two further requirements can be derived:

5. Enable unique interpretability, repeatability and comparability of results
6. Take into account different disability groups

During the development of methods for the EIAO project it has become apparent that there is an additional requirement for large scale automatic web accessibility assessment:

7. Support for efficient sampling algorithms⁷ (provide preliminary results for parts of the web site already during data collection)

However, this is beyond the scope of this paper. We will focus our study on requirements 1, 2, 5, and 6.

2.2 Terms and Notation

We will use the following notation to refer to the quantities of a web resource that are involved in the calculations.

b is the barrier type (e.g. the UWEM test name)
u is a disability group (e.g. blind, hard of hearing, physically disabled)
i is a unique identifier of the location that was inspected (e.g. URL + line/column number or URL + XPath)

A sample p is denoted by a set of location identifiers p = {i_0, i_1, ..., i_n} containing all locations from the relevant web pages, or key use scenarios. The results from stage one are given by a report

R_i^b = 1 if barrier b was detected at location i,
R_i^b = 0 if there is no evidence for b at location i.

⁵ This requirement is met because we consider only functions that can be calculated from a closed expression. Methods involving higher level statistical analysis of the results from stage one (e.g. statistical / machine learning approaches) will not be taken into account here.
⁶ We assume that the results from stage one have been derived in accordance with WCAG or another standardised methodology.
⁷ Sampling algorithms can be employed to sample to a given error margin within a given confidence interval. This improves performance because a complete assessment of all pages of a web site is not necessary.


The total number of reports for barrier b within sample p is denoted by N_p^b. B_p^b = #{i ∈ p : R_i^b = 1} denotes the number of fail reports for barrier b. A sum over all barrier types b yields N_p = Σ_b N_p^b, the total number of reports for p, and B_p = Σ_b B_p^b, the total number of fail reports for p. The severity of barrier type b for disability group u is given by S_u^b ∈ [0; 1].⁸

2.3 Related Work

Sullivan & Matson, 2000. Sullivan and Matson [5] describe the evaluation of eight priority 1 checkpoints from WCAG. They calculate the ratio between the potential points of failure and the actual points of failure:

failure rate(p) = B_p / N_p    (1)

This approach does not distinguish the barrier types; it only counts the total number of barriers reported, B_p.⁹ The result is interpreted as a failure rate (0 meaning no accessibility problems and 1 complete failure). This approach meets requirements 1 and 5. Requirement 2 is addressed as well. The barrier model does not distinguish different barrier types and consequently does not offer support for user group modelling (req. 6).

Zeng, 2004. Zeng [4] proposes a scoring function called "WAB score" for a web page p:

WABscore(p) = Σ_b (B_p^b / N_p^b) · w_b    (2)

where w_b is the inverse of the WCAG priority of the checkpoint relevant for barrier b. A high WAB score means low accessibility. The score for a web site S = {p_0, p_1, ..., p_m} is the arithmetic mean of the scores for the individual pages:

WABscore(S) = ( Σ_{p∈S} Σ_b (B_p^b / N_p^b) · w_b ) / |S|    (3)

This approach complies with requirements 1 and 5. It has no support for different disability groups. The handling of complexity (ratio of encountered violations and possible violations) favours samples with few barrier types. E.g. a page with three true barriers out of three potential barriers will get a score of 3·w_b if all three barriers have a different type, but only w_b if the three barriers have the same type.

⁸ The severity is sometimes denoted by the term barrier probability (i.e. the probability that a barrier of type b is a barrier for disability group u). In UWEM 0.5 the notation F_cui is used, where c is a WCAG checkpoint, u a disability group, and i a failure mode (equivalent to barrier type b in our notation).
⁹ It is mentioned that the formula has been adapted to include the size of the web page – 10 failures out of 100 should be treated differently from 1 failure out of 10. However, a formal description of this procedure is not given.
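To make the two scoring approaches concrete, they can be sketched in a few lines of Python. The report representation used here – a mapping from barrier type to the counts (B_p^b, N_p^b) – and the example data and priority weights are our own illustrative assumptions, not part of the cited tools.

```python
# Sketch of the failure rate (1) and the WAB score (2).
# reports: {barrier_type: (B_pb fails, N_pb total reports)} -- assumed format.

def failure_rate(reports):
    """Sullivan & Matson: total fail reports B_p over total reports N_p."""
    total_fails = sum(b for b, n in reports.values())
    total_reports = sum(n for b, n in reports.values())
    return total_fails / total_reports

def wab_score(reports, priority):
    """Zeng: sum over barrier types of (B_pb / N_pb) * w_b,
    where w_b is the inverse of the barrier's WCAG priority."""
    return sum((b / n) * (1.0 / priority[t]) for t, (b, n) in reports.items())

# Hypothetical example: two barrier types on one page.
reports = {"img-alt": (3, 10), "table-headers": (1, 4)}
priority = {"img-alt": 1, "table-headers": 2}

print(failure_rate(reports))          # 4/14 ≈ 0.286
print(wab_score(reports, priority))   # 3/10 * 1 + 1/4 * 0.5 ≈ 0.425
```

Note how the failure rate pools all barrier types into one ratio, while the WAB score keeps one ratio per barrier type – which is exactly why the WAB score favours samples with few barrier types, as discussed above.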


UWEM 0.5. The Unified Web Evaluation Methodology (UWEM) 0.5 [2] introduces a probabilistic model: the User Centric Accessibility Barrier model (UCAB). The barriers are grouped by WCAG checkpoint c. The severity is denoted by F_cub. Because the barriers are assumed to be (statistically) independent, the barrier probability of sample p for disability group u can be calculated by multiplication of the involved probabilities F_cub (the probability that barrier type b constitutes a barrier for disability group u):

F(p, u) = 1 − Π_{c=1}^{n} (1 − F_cub)    (4)

where n is the number of checkpoints. A lower value of F(p, u) indicates higher accessibility. This approach satisfies requirement 6 because it allows calculation of assessments for different user groups.¹⁰ It also has properties 1 and 5, but the complexity of the inspected web resource is not taken into account. The aggregation includes only one barrier b for each checkpoint c. It is not clear how b is determined if there are multiple barriers corresponding to one checkpoint c. As we are looking for a statement about the accessibility of the entire sample p, it seems necessary to take into account all barriers that were reported. Note that this methodology is still under development. The upcoming version UWEM 1.0 will presumably address some of the issues mentioned above.

Other Methods. Some methodologies, like the BITV short test¹¹ and the AIR judging form¹², also perform aggregation of test results. However, they are not subject to our investigation because they are based on manual assessment. The reports contain only one statement about the occurrence of each barrier within the whole resource and are not broken down into statements for single locations, which is a prerequisite for the approaches we compare.

3 An Improved Aggregation Method

We base our further exploration of aggregation functions on the UCAB model [2] because of the advantages it has over the simple additive calculation of score values (cf. the discussion of Zeng's approach). First of all, we propose to aggregate all reports for one sample p (notation as described in section 2.2):

A_1(p, u) = 1 − Π_{R_i^b : i∈p} (1 − R_i^b · S_u^b)    (5)

¹⁰ UWEM 0.5 does not cover the question how the severity values for the different user groups can be estimated. We will address this issue in section 3.2.
¹¹ http://www.wob11.de/bitvkurztest.html
¹² http://www.knowbility.org/
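A sketch of the aggregation over individual reports in equation (5). The report format (a list of (R_i^b, barrier type) pairs with binary results) and the severity table are illustrative assumptions:

```python
# A1 (5): aggregate all individual reports of sample p for user group u.
# reports: list of (r_ib, barrier_type) with r_ib in {0, 1} -- assumed format.
# severity: {barrier_type: S_ub in [0, 1]} -- hypothetical values.

def a1(reports, severity):
    prod = 1.0
    for r_ib, barrier in reports:
        prod *= (1.0 - r_ib * severity[barrier])
    return 1.0 - prod

reports = [(1, "img-alt"), (0, "img-alt"), (1, "contrast")]  # example data
severity = {"img-alt": 0.5, "contrast": 0.2}
print(a1(reports, severity))  # 1 - (1-0.5)*(1-0)*(1-0.2) = 0.6
```

Pass reports (r_ib = 0) contribute a factor of 1 and thus leave the result unchanged, so A_1 depends only on the fail reports – which is what licenses the conversion into a product over barrier types in the next subsection.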


Subsequently, we address the two major issues: how can the complexity of the web resource and the needs of different disability groups be taken into account?

3.1 Including Complexity

The function A_1 can be converted into a product over barrier types:

A_2(p, u) = 1 − Π_b (1 − S_u^b)^(B_p^b)    (6)

This is easy to verify, as the number of factors (1 − S_u^b) for one barrier type b is B_p^b. Note that this formula enables modelling of absolute barriers. A barrier is absolute if it prevents the user from completing a task. The severity for such a barrier is S_u^b = 1. This yields A_2(p, u) = 1 because the product becomes zero (one of the factors is (1 − S_u^b)^(B_p^b) = (1 − 1)^(B_p^b) = 0). We propose to model the complexity of the sample by adapting the exponent in this formula:

A_3(p, u) = 1 − Π_b (1 − S_u^b)^(C_p^b)    (7)

where C_p^b is a value describing the complexity of p with regard to barrier type b. Quantities relevant for the calculation of C_p^b are B_p^b, N_p^b, and B_p. C_p^b should satisfy:

– If no barriers of type b are encountered, there is no contribution to the aggregation function. (B_p^b = 0 ⇒ C_p^b = 0)
– If a barrier of type b is encountered, this will decrease the result of the aggregation function. (B_p^b > 0 ⇒ C_p^b > 0)

The following complexity function has the desired properties:

C_p^b = B_p^b / N_p^b + B_p^b / B_p    (8)

This function takes into account the ratio of actual to potential barriers of type b and, in addition, the proportion of failures of type b among all failures. This additional contribution ensures that barriers are considered according to their overall proportion of occurrences within the web resource.
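Under the same assumed per-barrier count representation as before, equations (7) and (8) can be sketched together. The example counts and the uniform severity value are illustrative:

```python
# A3 (7) with the complexity exponent C_pb (8):
#   C_pb = B_pb / N_pb + B_pb / B_p
# reports: {barrier_type: (B_pb fails, N_pb total reports)} -- assumed format.
# severity: {barrier_type: S_ub} -- hypothetical values.

def a3(reports, severity):
    b_p = sum(b for b, n in reports.values())  # total fail reports B_p
    if b_p == 0:
        return 0.0  # no barriers encountered, fully accessible result
    prod = 1.0
    for barrier, (b_pb, n_pb) in reports.items():
        c_pb = b_pb / n_pb + b_pb / b_p
        prod *= (1.0 - severity[barrier]) ** c_pb
    return 1.0 - prod

reports = {"img-alt": (3, 10), "contrast": (1, 4)}  # example data
severity = {"img-alt": 0.05, "contrast": 0.05}      # uniform S_ub as in Sect. 4
print(round(a3(reports, severity), 4))
```

The two required properties hold by construction: B_p^b = 0 makes the exponent 0 (factor 1, no contribution), and B_p^b > 0 makes it positive, so each encountered barrier type shrinks the product. An absolute barrier (S_u^b = 1) still drives the result to 1.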

3.2 Estimating Severity

Simple Heuristics. The barrier types which are relevant for a specific disability group can be identified rather straightforwardly. E.g. for deaf users without visual impairment, a missing textual description of an image is not a barrier. This yields the following estimate (all relevant barrier types get the same weight):

S_u^b = 0 if barrier type b is not relevant for disability group u,
S_u^b = s > 0 if barrier type b is a barrier for disability group u.
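The heuristic amounts to a lookup table of relevant barrier types per disability group. The relevance sets and barrier type names below are purely illustrative assumptions, not taken from UWEM:

```python
# Simple severity heuristic: S_ub = s for relevant barrier types, 0 otherwise.
# The relevance sets per disability group are illustrative assumptions.

RELEVANT = {
    "blind": {"img-alt", "table-headers", "frame-title"},
    "deaf": {"audio-transcript"},
}

def severity(user_group, barrier_type, s=0.05):
    """Uniform weight s for all barrier types relevant to the group."""
    return s if barrier_type in RELEVANT.get(user_group, set()) else 0.0

print(severity("deaf", "img-alt"))   # 0.0 -> not a barrier for deaf users
print(severity("blind", "img-alt"))  # 0.05
```

Plugging such a table into A_3 yields one aggregated value per disability group, which is how requirement 6 is met.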


User Model. In an iterative process, the results from automatic evaluation and aggregation are compared to results acquired from manual testing. The initial values are adapted accordingly to improve the predictions of the automatic system.

Involving Disabled Users. The most reliable way of estimating severity weights involves feedback from disabled users, i.e. asking users to rate the severity of the different barrier types.

4 Experimental Evaluation

The goal of an aggregation function is to provide a value that indicates the accessibility of a web page for a disabled user. Therefore we chose to compare the results from automatic web accessibility assessment with the ratings given by a group of fifteen disabled people during a user testing study.

Table 1. Results from experimental evaluation

                   | page A    | page B    | page C    | page D    | page E    | page F
                   | all  blind| all  blind| all  blind| all  blind| all  blind| all  blind
Sullivan & Matson  | 0.33 n/a  | 0.20 n/a  | 0.34 n/a  | 0.08 n/a  | 0.57 n/a  | 0.27 n/a
UWEM 0.5           | 0.99 0.99 | 0.51 0.51 | 0.60 0.60 | 0.97 0.97 | 0.10 0.10 | 0.96 0.96
A3                 | 0.42 0.42 | 0.20 0.20 | 0.15 0.15 | 0.10 0.10 | 0.15 0.15 | 0.36 0.36
User rating        | 0.58 0.58 | 0.30 0.13 | 0.10 0.08 | 0.33 0.08 | 0.47 0.75 | 0.29 0.38

Table 1 presents the average results from all users and the results for a selected user group. The reports that are the input to the aggregation functions were generated by the EIAO observatory.¹³ The severity values are set to S_u^b = 0.05 for all barrier types b and disabled user groups u. The evaluation shows that the values from the improved aggregation function A_3 are in most cases closest to the user ratings. It is interesting to see that the simple failure rate measure proposed by Sullivan & Matson also yields good results.¹⁴

5 Conclusion and Research Prospects

We studied different methods for interpreting results from large scale automatic evaluation of web accessibility. We found that current approaches do not meet all the requirements suggested in the literature. Based on the UCAB approach

¹³ EIAO covers automatically testable features that are a subset of the WCAG 1.0 AA requirements.
¹⁴ The aggregation method proposed by Zeng has not been included in the comparison because the WAB score has no upper boundary and can therefore not be normalised to values in [0; 1].


described in UWEM 0.5 [2], we developed a new aggregation function targeted at the requirements. A preliminary experimental evaluation shows some promising results. To strengthen the statistical evidence, additional experiments involving more users will be conducted.

There are still many open questions in this field. Directions for future research include:

– Improved modelling of key use scenarios: A key use scenario is a sequence of tasks that a user performs on a web site. An accessibility evaluation has to take into account that there are crucial parts within the scenario (following links, filling in forms), which should be modelled accordingly.
– Introducing uncertain reports: Some barriers can only be identified with limited confidence. The range of the reports can be extended to probability values R_i^b ∈ [0; 1].

Acknowledgements. The authors would like to thank Jenny Craven and Peter Brophy, who designed and conducted the user testing experiments, and all other EIAO partners who provided comments and valuable feedback during our work.

References

1. W3 Consortium: Web Content Accessibility Guidelines 1.0. Available at http://www.w3.org/TR/WCAG10/ (1999)
2. Web Accessibility Benchmarking Cluster: D-WAB2 Unified Web Evaluation Methodology (UWEM 0.5). Available at http://www.wabcluster.org/uwem05/ (2005)
3. McCathieNevile, C., Abou-Zahra, S.: Evaluation and Report Language (EARL) 1.0 Schema. W3C editor's working draft. Available at http://www.w3.org/WAI/ER/EARL10/WD-EARL10-Schema-20060101 (2006)
4. Zeng, X.: Evaluation and Enhancement of Web Content Accessibility for Persons with Disabilities. PhD thesis, University of Pittsburgh (2004)
5. Sullivan, T., Matson, R.: Barriers to use: Usability and content accessibility on the web's most popular sites. In: Proceedings of ACM Conference on Universal Usability – CUU (2000)
