Univ Access Inf Soc (2002) 2: 56–69 / Digital Object Identifier (DOI) 10.1007/s10209-002-0036-4

Intelligent non-visual navigation of complex HTML structures

E. Pontelli¹, D. Gillan², G. Gupta³, A. Karshmer⁴, E. Saad¹, W. Xiong¹

¹ Department of Computer Science, New Mexico State University, Las Cruces, NM 88003, USA; E-mail: [email protected]
² Department of Psychology, New Mexico State University, Las Cruces, NM 88003, USA; E-mail: [email protected]
³ Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA; E-mail: [email protected]
⁴ Department of Information Technology, University of South Florida, Lakeland, FL, USA; E-mail: [email protected]

Published online: 6 November 2002 – © Springer-Verlag 2002

Abstract. This paper provides an overview of a project aimed at using knowledge-based technology to improve accessibility of the Web for visually impaired users. The focus is on the multi-dimensional components of Web pages (tables and frames); our cognitive studies demonstrate that spatial information is essential in comprehending tabular data, and this aspect has been largely overlooked in the existing literature. Our approach addresses these issues by using explicit representations of the navigational semantics of documents and by using a domain-specific language to query the semantic representation and derive navigation strategies. Navigational knowledge is explicitly generated and associated with the tabular and multi-dimensional HTML structures of documents. This semantic representation provides the blind user with an abstract representation of the layout of the document; the user is then allowed to issue commands from the domain-specific language to access and traverse the document according to its abstract layout.

Keywords: Non-visual Web – Universal accessibility – Domain-specific languages

1 Introduction

Accessibility to resources in our information-based society is considered an entitlement by the majority of the industrialized world. Through a reasonably simple series of keystrokes and mouse clicks, we are able to access vast amounts of information from distant parts of the world. While the value of all the information available is sometimes questionable, it is there and it is accessible. Accessible, that is, if you are not one of the many handicapped members of our society. For many in this group, such access is difficult or impossible. In the United States, landmark legislation has mandated equal access to the information world for all Americans. Through laws such as the 1990 Americans with Disabilities Act (ADA), the 1996 Telecommunications Act, and the Section 508 Amendment to the 1973 Rehabilitation Act, disabled people are guaranteed accessibility “where reasonably achievable.” These three words belie the problem. In the domains of hardware or architectural modifications, the problems, for the most part, are understandable and achievable. In the domain of computer software, however, the situation is somewhat different. The areas of operating-system interfaces and Web access highlight this problem. In the current work, we have focused on the accessibility of the World Wide Web by blind and severely visually impaired people, both in general and more specifically in the educational setting. In this work, we face two basic problem areas: the accessibility offered by the popular Web browsers and the effect of the design of the actual Web pages. In the former, we are not interested in making changes to the Web browser's code, while in the latter, we are constrained to making accessibility recommendations to Web page designers. Additionally, we focus on the delivery of Web content to non-visual users in aural format, specifically through speech synthesis. Given these constraints, our project has identified three of the most difficult areas of Web access through current browsers: tables, frames, and forms. While other groups are working on the more general problem areas of Web browsing [2, 21, 22], we have confined our research to these three aspects of the problem, because they are the most difficult to solve in a general way. The general reading of a Web page by the visually impaired can be implemented by presenting a totally text-based version of the page, while this is not possible in the areas of tables, frames, and forms. Text is linear in nature, while tables, frames, and forms are multi-dimensional, and their layout is an inherent part of their semantic content.


The problem is broad in scope for the visually impaired. Lack of access to these structures has a negative impact not only on everyday-life decision-making, but also on access to educational materials that are now being presented in larger measure via the Web.

2 Project overview

The focus of this work is on developing a collection of tools aimed at improving the accessibility of a well-defined set of complex information structures commonly found on the Web. The focus is on non-visual accessibility – to provide support to visually impaired individuals as well as users accessing the Web with devices with no or limited display capabilities – and the research is currently aimed at analyzing the role of HTML tables and frames. These two constructs are widely used in virtually every Web page (indeed, the two constructs are often used interchangeably) and they are by nature multi-dimensional. This implies that a linear translation of these components into speech would lead to a substantial loss of semantic content – as illustrated by the cognitive studies conducted as part of this project and described in the next section. This is particularly evident in the case of tables. A table is inherently a multi-dimensional structure, and its spatial layout is an essential component of the semantics encoded in it. The desired output is represented by tools that can assist blind users in navigating such complex data organizations, producing aural output. Our work has led to the design and implementation of a system providing user-directed and semantic-based navigation of tables, frames, and other non-linear data structures [13, 18]. This paper provides an overview of the results accomplished in this project to date. The paper is organized as follows. We first present an overview of the cognitive studies performed; these studies have highlighted the importance of providing users with an adequate representation of the spatial nature of tabular information. We next illustrate the structure of the system we have designed, focusing on the use of knowledge-representation techniques (conceptual structures) and domain-specific languages to support user-directed navigation of tables and other complex HTML structures.

3 Related work

A number of proposals have recently been made toward the development of tools to improve Web accessibility for visually impaired people. Official government acts (such as the ADA and the Telecommunications Act) have imposed the adoption of design practices to make Web pages universally accessible, and initiatives such as the W3C Web Accessibility Guidelines [21] have tried to put these directions into practice.

In spite of this, a substantial part of the Web is still not designed according to these criteria – even popular Web design tools do not enforce such guidelines. For this reason, in recent years a large number of proposals have emerged, aimed at providing accessibility at the level of the final user, i.e., providing the final user with tools to improve the accessibility of existing documents. This solution is orthogonal to the application of design guidelines. The initial proposals, based on the use of standard screen readers (such as JAWS) in association with traditional Web browsers, had limited success – information on the Web is highly context-sensitive and highly structured, and only ad-hoc screen readers are capable of taking advantage of this context-sensitivity and adapting their behavior to the structure of the document; even fewer tools are capable of modifying their behavior to respond also to the desires and capabilities of the end users. The most notable outcome of this thread of research is represented by the design of aural Web browsers; in these tools, the screen reader is an integral part of a browser and obtains direct access to the internal structure of the documents recovered from the Web. The knowledge of the structure of the documents is employed to drive the generation of an accessible version of the document in the following ways:
• Provide a top-level presentation of the document's title and headings [14]
• Create aural summaries of the document [22]
• Restructure the document to place emphasis on its main components [2]
• Use sound cues to identify the different components of the document [11]
The Web Accessibility for Blind (WAB) effort at ETH Zürich [14] represents one of the first efforts in this area, where transcoding proxy servers are employed to extend HTML documents with (static) navigation information – specifically, links to facilitate the retrieval of titles and hyperlinks. Substantial effort has been invested in IBM's HomePage Reader (HPR) [2] and the related work by Asakawa et al. [1, 16]. This is indeed the only other proposal that explicitly deals with the issue of table navigation. Similarly to HPR, the pwWebSpeak browser [4] also offers speech output through HTML interpretation. The reported version of pwWebSpeak does not support frames; it represents tables as linear sequences of links (each table location is represented as a separate page), and tables are sequentially navigated left-to-right, top-to-bottom. The BrookesTalk [22] speech-enabled browser provides access to different parts of a Web page using function keys; its novelty is in the use of natural-language summarization to facilitate non-visual navigation. Another line of related work is represented by the efforts in providing sound and haptic access to data in graphical form (e.g., business graphics); relevant efforts in this area have been proposed by Brewster et al. [19] and Kurze [15].

In this project, our aim is to provide integrated client-side (at the browser level) and server-side (at the level of the Web-page author or provider) semantic specifications for the understanding and navigation of complex HTML structures (frames and tables). Relatively few other proposals have tackled this problem, and most of them rely exclusively on the linearization of the row-column HTML structure to provide navigation. The need for adaptation of browsers and direct access to the HTML structure for accessibility support has been raised by various authors [3, 10]. This was further underscored by recent survey studies that explicitly pointed out the ineffective performance of existing screen readers when coupled with the Web [5, 8]. Gunderson and Mendelson specifically cite the task of locating information in table structures as one of the most challenging tasks for visually impaired users [8].

4 Cognitive aspects of table navigation

Our first step in designing tools for the navigation of tables and frames is to study how individuals typically read a table. To study the cognitive processes used in reading tables, we began by conducting an archival study in which we examined the structure of tables in selected scientific journals (Science and Nature) and Web pages (from government and commercial sites), as well as the tasks for which the tables were designed. For both media, the structure of the table was a function of the difficulty of the user's projected task: simple tasks (such as an overview of the data in a table) were typically associated with simple tables (such as the row-by-column matrix), whereas the more complex tasks (such as understanding an interaction between variables) were associated with tables in which the rows or columns contained spatially-defined groups. In addition, Web tables typically contained graphics and hyperlinks. This archival study indicated that two types of spatial organization are used to provide organizational information in tables: the two-dimensional row-by-column layout and spatial grouping. Following from the results of the archival study, we conducted two series of experiments to examine different means for supplying a table reader with information about the structure of a table via non-spatial means. In both series, we used sighted participants in an attempt to understand the basic cognitive mechanisms used by sighted readers interacting with visually presented information structured in a table format. In some of the
experiments, we attempted to simulate the flow of information that blind users would experience when they interacted with a table by restricting the visually available information to one item at a time. The goals that underlie this approach are to provide a baseline against which later to compare the cognitive processes of blind users and to understand the information that spatial cues provide a sighted user and possible ways of providing that same information by non-spatial and non-visual means. The participants were from the participant pool at New Mexico State University, which is representative of the larger student body at the university. Consequently, the participants included both males and females ranging in age from 18 to 50, with most of the participants falling between 18 and 23. In one series of experiments, we investigated the use of tables to learn structured information, much like a chemistry student might learn the periodic table. The experiments were conducted using 5 × 5 tables, and participants were instructed to learn the information (i.e., a word) in each of the 25 cells. For example, in Table 1 each target word is associated with an individual combination of row and column headers. However, to simulate the flow of information experienced by a blind reader, each cell was presented individually; accordingly, during the learning phase in each experiment, groups comprising the row header, column header and a unique target word were presented to the participant. This type of presentation was called the temporal format. Depending on the condition, the temporal format might be represented by a specific cue designed to provide structural information. Participants did not see the underlying table at any time during the experiment; nor were they told that the information was extracted from a tabular display (except as an experimental condition in one experiment). Following the learning phase, participants received a test in which they had to select, for each row-column combination, the correct target word from a list consisting of the correct target word, the remaining target words from the row and column and six foils (for a total of fifteen possible responses). The first experiment examined the effect of the temporal presentation versus the spatial presentation of tabular information. Participants studied and were tested on three tables that differed by the information contained and the type of cue displayed with the row header, column header and target word. The temporal-only condition (n = 24) consisted of only the row and column

Table 1. An example of a table (on pesticides) used to study table reading

            Type         Formulation  Target       Toxicity  Exposure
Vanquish    herbicide    flowable     dandelions   medium    dermal
Zapper      fungicide    powder       mildew       low       ocular
Riddance    insecticide  fogger       cockroaches  high      inhalation
BeGone      rodenticide  granules     mice         extreme   ingestion
Purge       miticide     dust         ticks        nominal   cutaneous


headers with the target word. The spatial-separated condition (n = 24) also showed an empty 5 × 5 table in which the relevant position where the target word would normally be located in a table was highlighted. Likewise, the spatial-integrated condition (n = 24) showed this same table, and the indicator cell contained the target word. During the subsequent test, participants from the spatial-integrated condition made the fewest errors. The data suggest that providing spatial cues results in better learning than the pure temporal presentation of the tabular information. A second experiment examined whether non-spatial cues could also produce better learning than the temporal presentation format. Experiment 2 used the same basic procedures as experiment 1, in that participants received training with the temporal format and two conditions in which the temporal format was enhanced with another cue. The two non-spatial enhancements in the learning phase involved the use of color or auditory tones. In the color condition (n = 24), each column in the table was assigned a distinct hue, with rows represented by an amount of saturation. In the tone condition (n = 24), each column was assigned a discrete degree of auditory balance of sound-pressure level in the left and right ears. This manipulation of the sound intensity between the ears resulted in the perception of column 1 as being to the far left, column 3 directly in front, and column 5 to the far right of the participant; column 2 was perceived between columns 1 and 3, whereas column 4 was perceived between columns 3 and 5. In other words, the five columns were presented in an auditory spatial structure along the horizontal dimension. Row information was indicated by changes in pitch, with higher pitch indicating higher rows. In this tone condition, participants wore Sony Dynamic Stereo headphones to ensure the desired manipulation of the intensities and pitches to the two ears. In addition, a third condition (n = 24) was identical to the temporal-only condition in the first experiment. Contrary to our hypothesis, the results showed that providing color or auditory cues to the structure of the tabular information resulted in worse test performance than the temporal format. As a consequence of the surprising finding that adding structural information led to worse performance in the second experiment, a third experiment was designed to explore the hypothesis that the auditory and color cues disrupted learning because participants did not recognize that the cues provided structural information, since they had not been informed that the target information could be organized as a table. Accordingly, in the third experiment, participants were randomly assigned to one of three groups: the no-instruction group (n = 24) received no additional information about the underlying tabular structure. In the structure condition (n = 24), participants were told that the data was organized in a systematic fashion.
The table group (n = 24) was told that the data was organized in a systematic fashion and was also presented with an example table and explicitly shown how the keywords (column header and row header) were related to the target word. Additionally, the tone cues were paired with the example table, and participants interacted with the table to explore the relationship between the tones and the spatial locations of the target words. For example, if a participant clicked on each row in a single column of the example table, a change in pitch at a constant balance level would be heard progressively down the column. Again, the results were contrary to our hypothesis. Participants in the table condition, who were explicitly informed of the underlying structure and shown an example, produced significantly more errors than either the no-instruction or structure conditions. No significant differences in recall performance were detected between these latter conditions. Thus, these results show a disadvantage for table users when they are given information in a procedure that simulates the flow of information to blind users using a screen reader. The results also show that providing a table user with additional structural cues was helpful only when the cues were spatial in nature; adding color or auditory cues made learning performance worse. The second series of studies examined the use of spatial and non-spatial forms for indicating groups within a row-by-column matrix table. Thirty participants used tables to answer a variety of questions that varied in complexity. Across the experiments, the participants' performance was compared when they used a table organized simply by alphabetic listing (baseline), with groupings of table rows indicated by (1) spatial proximity on the display, (2) color-coding, (3) participant-controlled graying-out of the group not selected by the participant, and (4) exploding-out of the selected grouping (that is, the rows of the selected group expand in size and appear to move closer to the reader). The results showed that the various methods used to cue grouping information only aided readers answering complex questions. Spatial grouping, color-coding, graying-out, and exploding-out all produced better performance on selected questions than the alphabetical organization, although exploding-out was the least effective. Interestingly, when the spatial grouping method was combined with any of the other methods, the combination fared no better than the cues taken one at a time. We have proposed that these findings suggest that these forms of grouping cues permit the table reader to shift, on complex questions, to the procedures that they use in answering simple questions. As a consequence, their performance improves only on those complex questions. These archival studies and experiments have helped us to understand how important spatial cues are to the construction of tables. Replacing spatial cues with non-spatial ones can confuse a table reader, even when it is understood that the cues provide organizational information. Work on non-visual representation of interrelated
objects has also confirmed that audio representations (sound cues) alone without interrelational representation significantly lower the mental-perception level of blind people [12]. These results motivate our efforts toward making use of user interaction and semantic-based navigation instead of automatic introduction of fixed sets of auditory cues, as done in related proposals. These results also provide a fundamental motivation to our work by highlighting the inherent complexity of table reading (even for sighted users) and the intrinsic dependence of content learning on spatial layout. The lessons learned from these experiments have been valuable in designing the tools described in the rest of the paper and have provided us with valuable evaluation criteria.

5 System structure

Figure 1 provides an overview of the proposed system that we are developing for making complex HTML structures (e.g., tables) accessible. The system is composed of two subsystems. The first subsystem deals with the retrieval of documents from the Web, as directed by visual users (e.g., a teacher preparing course material) or by visually impaired users during Web navigation. Each retrieved document is properly filtered and analyzed, leading to the generation of a semantic description (SD), which is then stored in a local database. The SD is aimed at explicitly representing the desired spatial layout for the components of the document, and it supports the user-driven traversal of such layout. Thus, the SD effectively becomes a roadmap that the blind user can interactively follow to navigate the document's components. As explained later, SDs can be generated either manually through a dedicated graphical user interface (GUI) or automatically through syntactic and semantic analyzers. The second phase involves the actual non-visual navigation, where requests (navigational goals) from a blind user are paired with the available SDs to support interactive and intelligent non-visual navigation. This second phase is conducted with the help of an aural browser. Navigational goals can range from simple interactive traversals of the multi-dimensional structure to more sophisticated goal-oriented search problems (such as locating items of the structure according to a given set of properties). A domain-specific language (DSL) is introduced to support interaction between blind users and the aural browser. In the following sections, we describe in more detail the structure of the SDs used in this project and the domain-specific language used to support navigation.

6 Semantic descriptions

One of the key aspects of the solution we propose in this project is the adoption of knowledge-representation mechanisms, specifically conceptual graphs [20], to capture the navigation structure of a complex HTML component. We will refer to the structure used to encode the navigation structure of a complex HTML component as its navigational SD (or simply semantic description (SD)). The purpose of the SD is to explicitly describe the desired spatial layout of the HTML component (e.g., a table). The SD is synthesized in a semi-automatic fashion (as described later). The SDs are used to (interactively) guide the process of navigating the document's component, through the use of an aural navigator. Ultimately, the use of SDs allows one to customize the navigation process according to the semantic structure of each individual document.

Fig. 1. Overall system structure


In the rest of this section, we describe in detail the structure of our SDs and how SDs are obtained in the system we are developing.

6.1 Conceptual graphs

The knowledge-representation scheme that we propose to adopt in this project is based on conceptual graphs. Conceptual graphs and their associated theory, conceptual structure theory, were proposed in the 1970s as a way of drawing logical statements in diagrammatic form rather than in a linear text-based calculus. The basic ontology is very simple, as it is in mathematical logic. A conceptual graph can have two kinds of nodes: a concept node, which represents types and objects of that type, and a relation node, which represents a relationship between these objects. The theory allows for a basic expressiveness that is equivalent to first-order logic, as well as mechanisms for defining concepts and for representing type hierarchies of concepts and relations. Researchers have extended the formalism to allow representation and manipulation of more advanced ontologies, especially those involving actions and events, and higher-level structures such as viewpoints and nested contexts. There are two kinds of structure present in any Web page: syntactic and semantic. The former is hierarchical, as reflected in the use of HTML/XML. Before XML was proposed, the only way to navigate a Web page was to follow this syntactic structure. Any other navigation technique was forced to rely on searching through content, especially if no meta-data was present. With the advent of XML, meta-data representation has become far easier, since a Web page can now store arbitrary amounts of meta-data in addition to its main content. An XSL style sheet (or other XML processor) can choose to ignore this extra content when displaying the document.
It has been suggested that this extra meta-data content can be used for Web-page retrieval; it can also clearly be used for Web-page navigation. It is this latter use that we are proposing as semantic navigation. Since the content of this meta-data is arbitrary, links between meta-data items and between meta-data and the main content of the page can cut across the syntactic structure of the page. These links have to be represented separately from the page itself, possibly in an associated document, just as display information can be separated from the page and placed in a style sheet. We are proposing to use conceptual graph formalisms for the representation of these semantic links. The creation of the SD for a multi-dimensional component of a document (e.g., a table) represents the key problem we propose to solve in this project. The process involves two steps: (i) identification of the concept nodes of the conceptual graph, and (ii) identification of the relation nodes of the conceptual graph. In our context, the concept nodes of the graph represent the semantic entities that are described by the document's component, either the individual cells of a table or semantically meaningful groups of cells. Nodes are commonly organized according to one or more hierarchies. The lowest level of each hierarchy commonly includes syntactic elements directly extracted from the document (such as cells of a table or panes of a frame structure). The higher levels of the hierarchies provide semantic entities representing general concepts or collections of concepts (such as viewing a column as a collection of cells). The edges of the conceptual graphs represent relationships between the conceptual entities identified as nodes of the graphs. A natural class of relationships originates from the presence of a hierarchy between the different concepts present in the graph, such as the natural inclusion relationship between collections of cells (a cell belongs to a row that belongs to a given group of rows, and so on).
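To make this organization concrete, the following sketch (ours, in Python; the class name, fields, and example groups are illustrative assumptions, not the project's actual implementation) encodes such a hierarchy of concept nodes:

# A minimal sketch of a navigational SD as a hierarchy of concept nodes;
# leaves carry cell content, inner nodes represent groups (rows, columns,
# or semantically meaningful collections of cells).
class Concept:
    def __init__(self, label, content=None):
        self.label = label        # spoken description of the concept
        self.content = content    # HTML content for leaf cells, else None
        self.parent, self.children = None, []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

# A hypothetical SD fragment for a travel-expense table (a running example
# later in the paper): cells grouped by date, dates grouped by city.
table = Concept("travel expenses")
seattle = table.add(Concept("Seattle trips"))
aug28 = seattle.add(Concept("28 August"))
aug28.add(Concept("meals", content="37.24"))
aug28.add(Concept("lodging", content="150.00"))

Traversing this tree upward and downward corresponds exactly to the abstraction/reification moves that the navigation language of Sect. 7 exposes to the user.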

Fig. 2. Creation of semantic descriptions

The conceptual graph representing a document component (e.g., an HTML table) will be created by combining three sources of knowledge: (i) the syntactic content of the document (e.g., use of HTML tags and attributes); (ii) direct input from a human annotator (e.g., the teacher, the creator of the document, a third party); and (iii) the history of how the document's components have been used in the past (through databases of access traces). These methods are analyzed in more detail in the next three subsections (and summarized in Fig. 2). It is important to observe that, in our framework, the knowledge representation is not static; rather, it is highly dynamic and can be updated at any time.

6.2 Explicit generation of SDs

It is fairly clear that reliable, purely automatic generation of SDs for every complex HTML structure (CHS) is an impossible task [20]; for example, the syntax used to express CHSs is inadequate to explicitly express complete navigation strategies for such document components. In order to make the task of constructing the SD more realistic, we restricted our focus to the context of coursework management; that is, the various documents are part of the material offered to the students of a course. There are some clear advantages in taking such a perspective: (i) we can assume the presence of an instructor, who will take charge of generating and/or completing the semantic descriptions wherever they are lacking or incorrect; and (ii) we can assume the presence of a controlled population of users, whose actions can be, to a certain extent, supervised.
The first and simplest approach toward the construction of the SD of a complex structure is the manual approach: a human annotator (e.g., an instructor or a TA) makes use of a specialized tool to create the descriptions while assembling the course material. This task is accomplished through the use of a specialized GUI. As shown in Fig. 2, each request generated by the instructor/TA is filtered by a proxy server; each time the incoming document contains a CHS (in our tests, we have focused on HTML tables), the proxy server automatically extracts the CHS and starts the annotation GUI. As shown in Fig. 3, the GUI presents the user with an abstraction of the table (rows and columns, no content); the human annotator can select arbitrary groups of cells in the table and assign a description to them – thus generating new abstraction levels in the SD. The tool has been developed in Java and provides the following capabilities: (i) specialized parsing routines focused on the extraction of CHSs; and (ii) integration of the syntactic analyzer (described next), thus offering the annotator an initial SD to work on.

6.3 Syntactic analysis of tables

The task of developing the SD for a complex HTML structure is facilitated by the introduction of a syntactic analyzer. The objective of the syntactic analyzer is to extract as much navigational information as possible from the syntactic structure of the CHS. Once again, we have focused on the analysis of HTML tables to assess the feasibility of this phase. The intuition behind this syntactic analysis is to recognize collections of table cells (i.e., <TD> elements) that the syntactic layout of the table suggests should be grouped together. There are a variety of different items in the syntactic layout of the table that may suggest groupings of cells: for example, (i) the explicit grouping of cells and rows performed through the use of elements (such as <THEAD>, <TBODY>, and <COLGROUP>) and attributes

Fig. 3. Graphical user interface for creation of semantic descriptions


such as COLSPAN; (ii) the use of indexing elements in the table, such as header tags (<TH>, in combination with the scope attribute, and <CAPTION>); (iii) each cell may identify one or more header elements (through the headers attribute in combination with the axis attribute); and (iv) in addition to the above components (which are all explicit HTML constructs to structure a table), Web authors frequently use additional visual features to suggest groupings of cells; the most typical approach is through the use of different background colors, font colors, and sizes. We have developed a syntactic analyzer that explores the use of these features to recognize possible meaningful groupings of cells. These groupings of cells are proposed to the manual annotator (through the GUI described in the previous section); at that stage, the human annotator has the option to accept, modify, or reject the suggested collections. For example, consider the table on the left in Fig. 4; the syntactic analyzer is capable of recognizing the components of the semantic structure illustrated on the right in Fig. 4. The original table explicitly indicated (using HTML 4.0 attributes) the cells Lodging and Meals as headers (through the headers attribute). The cells with the dark background were also singled out by the analyzer and grouped together under a separate artificial header (called Gray cells in the figure). The nodes in the graph represent either concepts (e.g., the whole table, the content of individual cells) or relations (header names, other groupings of cells suggested by the layout of the table). The edges in the graph connect elements of the table according to their layout. This simple graph representation provides a clean layout of the table as composed of different groups of columns (meals and lodging), themselves grouped according to the travel destination. In turn, the analyzer has recognized an additional grouping of cells, containing the totals for each trip entry.
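As an illustration of the kind of hints involved, the following sketch (our own Python approximation; the project's analyzer is a Java component covering many more cases) harvests a few of the layout features listed above from table markup:

# A minimal sketch of hint extraction from HTML table markup; each hint is
# a (reason, attributes) pair that could seed a grouping proposal in the SD.
from html.parser import HTMLParser

class TableHints(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hints = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "th" and "scope" in attrs:
            self.hints.append(("header-with-scope", attrs))
        if tag in ("td", "th") and "colspan" in attrs:
            self.hints.append(("spanning-cell", attrs))
        if tag in ("td", "tr") and "bgcolor" in attrs:
            self.hints.append(("color-group", attrs))

parser = TableHints()
parser.feed('<table><tr><th scope="col">Meals</th>'
            '<td colspan="2" bgcolor="gray">Total</td></tr></table>')
print(parser.hints)

In the real tool, hints like these become the candidate groupings that the annotator can accept, modify, or reject through the GUI.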

6.4 Usage-based synthesis of SDs

6.4.1 Pathfinder networks

In this phase of the project, we have employed Pathfinder networks to generate a model for Web navigation. The intuition is to maintain a collection of traces, where each trace contains the events generated during a previous navigation session on the same document. Pathfinder networks can be used to process these traces and generate clusters of table cells, which can be translated into new levels of abstraction in the SD. Network models have been used in many areas of computer science, including artificial intelligence, operating systems, and databases. One of the methods that produce such network models is Pathfinder (PFNET) [6]. The Pathfinder algorithm finds the shortest path between every two entities in a weighted graph and produces a network structure based on estimates of the similarities between its entities. Distances can be obtained by estimating the similarity between each pair of entities.

Fig. 4. Result of syntactic analysis


The input to the Pathfinder algorithm is represented by a collection of entities (represented as nodes in a graph) and estimates of the distance (or similarity) between certain pairs of nodes. The outcome is the generation of a network where links between nodes are present if the corresponding entities are considered sufficiently similar (and each link bears a weight measuring such similarity). PFNET introduces a special metric, the Minkowski r-metric, to compute the distance between each pair of nodes that are not directly linked. The r-metric is the rth root of the sum of the rth powers of the link weights between the considered nodes.
For r = 1, the path length is the sum of all weights in the path; for r = 2, the length is the Euclidean distance; and for r = ∞, the length is the maximum weight of any link in the path. The generation of the network is based on a second parameter, the q-parameter, which specifies the maximum number of links in paths that are guaranteed to satisfy the triangle inequality. Intuitively, a path between two nodes satisfies the triangle inequality if the weight of the link between the extremes of the path is no greater than the length of the path (according to the r-metric).
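In symbols (our restatement of the verbal definition above), the r-metric length of a path P with link weights w_1, ..., w_k is

\[
L_r(P) = \Bigl(\sum_{i=1}^{k} w_i^{\,r}\Bigr)^{1/r},
\qquad
L_\infty(P) = \max_{1 \le i \le k} w_i ,
\]

so r = 1 gives the ordinary sum of the weights, while larger values of r weigh the heaviest link of the path more and more heavily.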

Fig. 5. Network obtained from PFNET algorithm


Links are eliminated from PFNET because they violate the triangle inequality for paths that have q or fewer links. In the resulting PFNET, triangle inequality violations can only be found in paths that have more than q links. Larger values of q result in fewer triangle inequality violations; there is no triangle inequality violation in the resulting PFNET when q equals the number of nodes minus 1. The link membership rule (LMR) is used to determine whether or not links should be added to the Pathfinder network. This is accomplished by ordering all links in the network in increasing order of their weights. A direct link between a pair of nodes is included in the Pathfinder network if and only if its weight is less than the weight of any path with q or fewer links (triangle inequality), using the r-metric to compute path weights. In addition, the Pathfinder network uses the link labeling rule (LLR), which provides a label for each link according to a classification scheme. Links are labeled as primary, secondary, or ternary depending on their role in the formation of the Pathfinder network. A primary link is a link that is the only path joining a connected subgraph of the PFNET to a subgraph consisting of only one node. A secondary link is a link that joins subgraphs in the PFNET or provides an alternate path between nodes already connected by primary links. A link is labeled ternary if it joins nodes within a subgraph for which alternate paths already exist.

6.4.2 Application of Pathfinder networks

In this project, we have encoded the Pathfinder network generation algorithm to support the maintenance of SDs. A specialized tool has been developed to offer sighted and blind Web users the opportunity to generate traces of the steps taken in navigating a document. Traces are in turn translated into inputs to the PFNET algorithm; nodes correspond to table cells and/or known clusters of cells, while initial measures of similarity are derived from the distance between nodes in the traces. The resulting network is used to detect new clusters of cells, i.e., nodes linked by primary edges. We have performed a variety of experiments to validate the effectiveness of the usage-based synthesis of semantic representations via Pathfinder networks. The example in Fig. 5 has been obtained by processing one of the tables from the Ohio State University weather server; this home page provides marine, NCEP, and satellite textual weather data. The data is organized in tables; each cell in a table is a hyperlink to another home page. Our target is to navigate through these tables to produce a number of navigation traces, and then use such traces to create a network model (PFNET) from the hyperlinks accessed during the navigation. We have worked on the Alaskan marine data table. This table has 15 hyperlinks; these hyperlinks represent the nodes in the PFNET. We have used 14 navigation sequences (traces) to build the PFNET.
We have used the values r = 2 and q = 14 as parameters.¹ The distances between nodes are measured according to how close cells appear in a navigation sequence. The network model obtained from PFNET can then be used to automatically add new clusters to the SD of the table. For example, the figure shows a significant clustering around the “Eastern Gulf Coast” cell – the cells reached from this one can be clustered into a single group for navigation purposes. This allows us to navigate through the nodes (links) level-wise, depth-wise, or both. This way of navigating improves on traditional row- or column-wise navigation because it visits the most important (most frequently visited) hyperlinks first, which may lead the user to the target information much faster. A final post-processing phase is used to integrate the PFNET obtained into the main SD; groups of nodes clustered by PFNET are translated into new levels of abstraction in the SD. Informal evaluations have demonstrated that PFNET generates clusters of nodes of good quality, especially in the presence of a large number of traces.

¹ Parameters suggested for this problem by D. Dearholt, one of the original designers of Pathfinder.
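As an illustration of the pruning rule described above, the following sketch (our own re-implementation in Python, handling finite r only; not the original Pathfinder code) computes a PFNET from a set of weighted node pairs:

# PFNET(r, q) pruning: a link survives only if no alternative path of at
# most q links is strictly shorter under the Minkowski r-metric.  Since
# len(path) = (sum of w**r)**(1/r), costs are additive in w**r and can be
# compared in r-th powers directly.
import math

def pfnet(weights, r=2, q=None):
    """weights: dict {(i, j): w} over undirected node pairs; returns the
    list of edges kept by PFNET(r, q)."""
    nodes = sorted({n for edge in weights for n in edge})
    q = len(nodes) - 1 if q is None else q
    w = lambda i, j: weights.get((i, j), weights.get((j, i), math.inf))
    d = {(i, j): w(i, j) ** r for i in nodes for j in nodes if i != j}
    alt = {e: math.inf for e in d}   # best alternative path with 2..q links
    cur = dict(d)                    # best walks, extended one link per pass
    for _ in range(q - 1):
        cur = {(i, j): min((cur[(i, k)] + d[(k, j)]
                            for k in nodes if k not in (i, j)),
                           default=math.inf)
               for (i, j) in d}
        alt = {e: min(alt[e], cur[e]) for e in d}
    return [e for e in weights if d[e] <= alt[e]]

# Example: the direct A-C link (weight 3) is pruned, because the two-link
# path A-B-C has r-metric length 2 under r = 1.
print(pfnet({("A", "B"): 1, ("B", "C"): 1, ("A", "C"): 3}, r=1, q=2))

In our setting, the surviving primary links are the ones that induce the new cell clusters folded back into the SD.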

7 Domain-specific languages for table navigation

In [18] we presented for the first time a domain-specific language designed to support navigation of complex HTML structures. This solution implies that each step in the navigation of a CHS is reduced to the execution of a sequence of commands in a specialized language. There are various advantages in following this approach; for example, one could predefine (parts of) navigation strategies and provide a clean interface to support different forms of disabilities (such as partial blindness). The SD can be explicitly generated through a sequence of DSL commands (thus allowing for its dynamic manipulation). This ability allows the dynamic reconfiguration of the SD, allowing users to “keep their finger on the table” during navigation. It also reduces the management of the SD to an explicit procedure that can be reused and parameterized to accommodate orthogonal changes in documents' content (e.g., dynamic Web pages). In turn, the DSL commands that generate the SD can be automatically generated by the syntactic and usage-based analyzers. The effect of each navigation command is to produce an aural response. As long as the navigation remains within the layout of the table (non-leaf elements of the conceptual graph), the aural response is a description of the element of the conceptual graph (typically obtained from the header elements of the table). Whenever the navigation reaches a leaf element of the SD – a cell of the table – a basic HTML aural reader is invoked to reproduce the content of the cell.
The key commands used to describe the semantic structure are: (i) connected(Node, Node, Description), stating that the two nodes are part of two layers and that the description is the conceptual definition of their relation; and (ii) content(Node), describing the content of a given node (e.g., the HTML stored in a table cell). Navigation is performed through standard graph-traversal commands: (i) concrete(Concept) and abstract(Concept), which allow one to move between levels in the semantic representation (e.g., move from a group of rows to a specific row); and (ii) previous and next, to move within a conceptual layer (e.g., move between cells in a column). For example, the construction of the SD for a table representing some travel expenses will be obtained by executing commands of the type:

group(expenses, [san-jose, seattle])
group(san-jose, [25august])
group(seattle, [27august, 28august])
group(25august, [8-25meal, 8-25hotel, 8-25transport])
group(27august, [8-27meal, 8-27hotel, 8-27transport])
group(28august, [8-28meal, 8-28hotel, 8-28transport])
group(8-25meal, [37.24])
...

Some additional commands [18] are also available to query the current status of the navigation process – such as remembering the cells that have already been explored. A sample sequence of commands that allows the user to access a table representing travel expenses and obtain the travel, lodging, and meal expenses for a trip to Seattle on a given date looks as follows:

// Sample navigation
concrete(seattle)
concrete(28august)
concrete(8-28meal)
abstract
concrete(8-28hotel)
abstract
concrete(8-28transport)

The response to each command is an aural message describing the current location of the user in the structure; for example, after executing concrete(seattle), the aural browser will speak the dates of the travels to Seattle, allowing the user to select the next command, such as concrete(28august), which will access the expenses of the travel to Seattle conducted on 28 August.
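A minimal interpreter for the group/concrete/abstract/next commands can be sketched as follows (in Python; the names and data layout are our own assumptions, while the actual engine is built with logic-programming technology, as discussed in Sect. 10):

# A sketch of the navigation semantics: group/2 facts build the SD, and
# concrete/abstract/next move a cursor through its levels.
class Navigator:
    def __init__(self):
        self.children = {}      # group name -> ordered list of members
        self.parent = {}        # member name -> enclosing group
        self.position = None    # current node in the SD

    def group(self, name, members):
        self.children[name] = members
        for m in members:
            self.parent[m] = name
        if self.position is None:
            self.position = name     # start at the first (top) group

    def concrete(self, name):
        """Move down one level, to the named member of the current group."""
        if name in self.children.get(self.position, []):
            self.position = name
        return self.position

    def abstract(self):
        """Move up to the enclosing group, if one exists."""
        self.position = self.parent.get(self.position, self.position)
        return self.position

    def next(self):
        """Move to the next sibling within the current conceptual layer."""
        siblings = self.children.get(self.parent.get(self.position), [])
        if self.position in siblings:
            i = siblings.index(self.position)
            self.position = siblings[min(i + 1, len(siblings) - 1)]
        return self.position

nav = Navigator()
nav.group("expenses", ["san-jose", "seattle"])
nav.group("seattle", ["27august", "28august"])
nav.group("28august", ["8-28meal", "8-28hotel", "8-28transport"])
nav.concrete("seattle"); nav.concrete("28august"); nav.concrete("8-28meal")
print(nav.next())    # -> 8-28hotel

In the deployed system, each of these moves additionally triggers the aural response described above (a spoken description of the reached node, or the aural reading of a leaf cell).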

8 DSL for frame navigation

Frame-based Web pages are yet another source of difficulty for blind users. Most screen readers will traverse the frames in a frame-based Web page in an arbitrary order. Quite often, the author of the frame-based page intended the frames to be viewed in some specific order.
The information regarding the order in which the frames are to be viewed is implicit in the contents of the frames and has to be inferred through visual inspection. This information regarding the order is lost to the screen reader, and thus the screen reader might read out the frames in the wrong order. For example, the most common usage of frames is for displaying an index frame and a content frame, where the content of the various items in the index can be seen in the content frame by clicking on them; in this situation, a rational screen reader should read the index page first, then the contents page, and then go back to the index page. For making frame-based pages accessible, we use an approach similar to the one used for making tables accessible to blind users, described in the previous section. A DSL (which is essentially a subset of the one described earlier for tables) is designed to allow navigation of frame-based pages. A frame-based page is divided into a number of frames. Our DSL treats these frames as a set and allows the creation of a conceptual graph representing the layout of the different frames. We assume that the graph for a frame page is hierarchical; the levels in this graph represent groups of frames. The DSL also provides commands for reification and abstraction, to allow one to move between levels in such a hierarchy; thus, navigation is effectively viewed as the user-directed traversal of a tree structure, where the leaves represent the content of the frames. A simplified grammar for this DSL is given below:

Program     ::= Declaration ; Command .
Declaration ::= Declaration ; Declaration
              | url Name = URL
Command     ::= Command ; Command
              | group(Name, Namelist)
              | abstractto(FromGName, ToGName)
              | goto(Name)
              | abstract
              | concrete(Name)

The url command allows associating symbolic names (URL variables) with specific URLs. The goto(Name) statement can direct the browsing sequence to any declared URL or group; Name can be either a URL variable name or a group name. The current position and current level (in the abstraction hierarchy) are updated to Name. The abstract statement directs the browsing sequence to the next higher level (if one exists) in the conceptual hierarchy. The current level and current position in the hierarchy are updated as follows: if an upper level does not exist, the current level and position are unchanged. Otherwise, if the current level is a URL variable, the new current level is set to the group that contains this URL, and the new current position is set to the last element in the list of URLs associated with the group. If the current level is a group name, the new current level is set to the group name of the upper level, and the new current position is set to the last element in the list of URLs associated with the upper-level group.
The concrete(Name) command directs the browsing sequence to a lower conceptual level (closer to the actual frame panes), if one exists. The current level and current position are updated as follows: if a lower level does not exist, the current level and position are unchanged. Otherwise, the first element of the list associated with Name is examined: if it is a URL variable, the new current level is set to it, and the new current position is set to the URL variable name. If it is a group name, the new current level is set to this group name, and the new current position is set to the first element in the list of names corresponding to this group. The group(Name, Namelist) statement assigns a set of URL variables to a group. A group represents a new level in the browsing hierarchy. The URL variable list, current position, and browsing sequence remain unchanged by this command. The abstractto(FromGName, ToGName) statement associates an upper (more abstract) level ToGName with FromGName. Applying the operation abstract(FromGName) will result in control moving to the higher level (ToGName) in the hierarchy. The concreteto(FromGName, ToGName) statement defines a (lower) concrete level ToGName for the level FromGName. The URL variable list and current position remain unchanged. Applying the operation concrete(FromGName) will result in control moving to ToGName.
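The following sketch (ours; a Python rendering of an assumed index/content page rather than an actual DSL program) mimics the concrete rule just described:

# A sketch of the frame hierarchy for a typical index/content page; the
# group layout forces a rational reading order (index frame first).
groups = {
    "page":    ["index", "content"],    # group(page, [index, content])
    "index":   ["url_toc"],             # group(index, [url_toc])
    "content": ["url_ch1", "url_ch2"],  # group(content, [url_ch1, url_ch2])
}
urls = {"url_toc": "toc.html", "url_ch1": "ch1.html", "url_ch2": "ch2.html"}

def concrete(level):
    """Descend one conceptual level: move to the first element of the list
    associated with the current level; unchanged if no lower level exists."""
    members = groups.get(level)
    return members[0] if members else level

pos = "page"
pos = concrete(pos)   # -> "index": the index frame is visited first
pos = concrete(pos)   # -> "url_toc", i.e. the pane loaded from toc.html
print(urls[pos])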

9 From DSLs to action theories

To understand a table, even when familiar with its content, a visually impaired user has no choice but to examine all of its cells. Any language that supports the process of understanding a table needs to be designed to help users navigate the table, perhaps at different levels of abstraction. It also needs to provide a simple way for users to find specific information, i.e., to locate a specific cell whose content satisfies a given criterion. We argue that all of these features can be supported under an action theory framework [7]. Action theories have been proposed in the artificial-intelligence community to allow the formal description of domains of interest, in terms of properties that describe the domain and actions that an agent can perform to interact with and modify the domain. Under this perspective, a visually impaired user can be viewed as an agent that tries to explore an environment. The agent has the capability of moving around (moving a finger from location to location) and can comprehend what is in the cell.² Gradually, together with general knowledge about the world, our agent will arrive at useful conclusions. For instance, the agent could conclude that the table in Fig. 4 contains travel-expense information and could, for example, use it to determine the total expenses for the LA trip.

² How this can be done is an interesting topic and deserves an in-depth study that is not the main topic of this paper.

In the action theory framework, the results of the navigation process are described by a set of properties describing the environment, called fluents. The commands that allow the user to navigate between table cells are called actions. Actions can be divided into two types: basic actions and table-specialized actions. The former are similar to moving a finger around the table, thus providing basic navigation capabilities to users; observe that the structure of the basic actions is table independent. The latter are, in many cases, carefully developed procedures that help users understand the table without having to navigate through the whole table. Action languages can be used to formulate goals that an agent intends to accomplish; planning technology can then be employed to automatically derive the sequence of actions needed to accomplish the goal. In our context, this generalization provides the ability to carry out more complex navigation tasks. For example:

1. It allows the user to describe the navigation objective as a goal and let automatic mechanisms (i.e., a planner) develop (part of) the navigation process; for example, a query of the type

   ? achieve located_at(Cell), type(Cell, city),
             connected(City, Expense, lodging),
             content(Expense) ≥ $150

   will develop a navigation plan (on behalf of the user) that leads to a cell (located_at) that contains a city (type) and where the lodging expenses are more than $150.

2. It allows the semantic description to predefine not only complete navigation strategies but also partial skeletons, making the remaining part of the navigation dependent on run-time factors, such as the user's goals, specific aspects of the table's content, and the user's run-time decisions. For example, the following action constraint can be added to Table 1 to force the user to access information about the toxicity level immediately after accessing a new pesticide:

   always ( if located_at(Cell) and type(Cell, pesticide)
            then next moveto(Cell, toxicity) )

Details of this generalization have been presented in [17].
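To suggest how such a goal could be operationalized, the following sketch (our own simplification in Python; a real planner searches over sequences of navigation actions rather than inspecting the SD directly) reduces the lodging query above to a search over a toy semantic description:

# A sketch of goal-directed navigation: find the city group whose lodging
# cell satisfies the goal.  The table encoding and figures are illustrative
# assumptions, not data from the paper.
sd = {
    "seattle":  {"lodging": 175.00, "meals": 37.24},
    "san-jose": {"lodging": 120.00, "meals": 25.10},
}

def achieve(goal):
    """Return the first city cell whose expenses satisfy the goal, if any."""
    for city, expenses in sd.items():
        if goal(expenses):
            return city
    return None

print(achieve(lambda e: e["lodging"] >= 150))   # -> "seattle"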

10 Implementation details

System interfaces. The interfacing of the various system components with the outside world has been accomplished using standard off-the-shelf interfaces. The system interacts with the outside world at three different levels. First of all, the interactions with the external servers are accomplished via transcoding proxy servers, as illustrated in the previous sections. We have currently implemented two different proxy servers:
the first handles the annotation process; that is, it interacts with a standard Web browser (Netscape) and redirects the incoming documents to the annotation GUI and the syntactic analyzer. This transcoding proxy server also enriches the received documents with calls to a tracer applet; this allows the Web browser to offer the user the ability to trace accesses to the tables' content. These traces are then used by the Pathfinder algorithms to improve the SD. This proxy server has been built from scratch, as a simple C program interacting with external Java programs (the annotation GUI and the syntactic analyzer). The second proxy server is employed by the speech navigator to retrieve each document accessed by the non-visual user and to retrieve (if present) the SDs associated with the CHSs present in such documents. In this second proxy server, we have experimented with IBM's WBI (Web Based Intermediary) Development Kit. WBI is programmed to act as a proxy and to serve as an HTTP request and response processor; a plug-in (a module programmed into WBI) has been developed to interface the incoming HTTP requests and the incoming documents with the SDs in the local repositories.

Speech synthesis. Speech synthesis is currently produced using the IBM ViaVoice Runtime and Software Development Kit for text-to-speech tools.

Implementation of DSLs. The development of the DSL relies on the use of advanced logic-programming technology. The specification of the DSL has been accomplished through the use of Horn-logic denotations [9], a software-engineering methodology based on logic programming to specify domain-specific languages and automatically derive provably correct implementations. The DSL inference engine is also currently being extended to take advantage of the powerful capabilities offered by the underlying logic-programming layer. As discussed earlier, the DSL can be seen as an instance of an action language; this allows the use of sophisticated logic-based planning and reasoning algorithms to (partially) automate the navigation process and to delegate to software agents some of the more tedious tasks.

11 Conclusions and future work

This paper provides an overview of a project conducted over the last two years with the objective of improving Web accessibility for visually impaired individuals. The proposed approach is based on the idea of deriving – either automatically or with the help of a human annotator – a semantic description of the more complex components of a document (e.g., tables, frames), and using such a description to guide the navigation of the document. We have proposed different ways to derive and maintain the semantic representations. We have also developed a domain-specific language, which allows the user to query the semantic representation and derive navigation strategies. The language has been validated in two separate contexts:
the navigation of tables and the navigation of frames. The different components have been interfaced and integrated, and the complete system is now becoming operational. We are currently experimenting with adapting the same technology to the interactive navigation of XML fragments. XML provides a clean syntactic organization of documents (hierarchical structure), but often the syntactic organization does not reflect the desired navigational behavior (in the same way that we need to resort to XSLT to provide an adequate visual display). We are currently exploring methodologies to develop a navigational SD from the information contained in the XSLT sheets associated with an XML document. The actual navigation can then be realized by adopting the same type of DSL as that used for tables and frames. We plan to perform extensive usability studies to investigate the effectiveness of the proposed techniques in accessing multi-dimensional data organizations in a non-visual fashion. Preliminary experiments with sighted users are in progress, and experiments with blind users are in the planning stage.

Acknowledgements. The work has been partially supported by NSF grants HRD-9906130, CCR-9875279, EIA-0130887, EIA-9810732, and by a DoEd NIDRR grant. The authors would like to thank S. Pazuchanics and E. Pennington for the effort invested in developing the cognitive component of this project.

References

1. Asakawa C, Itoh T (1998) User interface of a home page reader. In: Blattner MM, Karshmer AI (eds) Proceedings of the third international ACM conference on assistive technologies. ACM Press, New York
2. Asakawa C, Laws C (1998) Home page reader: IBM's talking web browser. Technical report. IBM
3. Brewer J, Dardailler D, Vanderheiden G (1998) Toolkit for promoting web accessibility. In: Proceedings of the technology and persons with disabilities conference, Los Angeles, Calif., 17–23 March 1998
4. De Witt JC, Hakkinen MT (1998) Surfing the web with pwWebSpeak. In: Proceedings of the technology and persons with disabilities conference, Los Angeles, Calif., 17–23 March 1998
5. Earl C, Leventhal J (1999) A survey of Windows screen reader users: recent improvements in accessibility. J Vis Impairment Blindness 93(3):174–177
6. Fowler RH, Dearholt DW (1989) Pathfinder networks in information retrieval. Technical report MCCS-89-147. New Mexico State University
7. Gelfond M, Lifschitz V (1998) Action languages. Electr Trans AI 3(16). http://www.ep.liu.se/ea/cis/1998/016/. Cited 11 September 2002
8. Gunderson J, Mendelson R (1997) Usability of World Wide Web browsers by persons with visual impairments. In: Proceedings of the RESNA annual conference, Pittsburgh, Penn., 20–24 June 1997. RESNA Press
9. Gupta G, Pontelli E (2002) Specification, implementation, and verification of domain specific languages. In: Computational logic: from logic programming into the future. Springer, Berlin Heidelberg New York
10. Hendrix P, Birkmire M (1998) Adapting Web browsers for accessibility. In: Proceedings of the technology and persons with disabilities conference, Los Angeles, Calif., 17–23 March 1998
11. James F (1997) Presenting HTML structures in audio. Technical report. Stanford University

12. Kamel HM, Roth P, Sinha RR (2001) Graphics and user's exploration via Simple Sonics (GUESS): providing interrelational representation of objects in a non-visual environment. In: Proceedings of the international conference on auditory display, Espoo, Finland, 29 July–1 August 2001
13. Karshmer A, Pontelli E, Gupta G (1999) Software technology and computer interfaces for the disabled: non-visual navigation of the World Wide Web. In: Proceedings of HCI international, Munich, Germany, 22–27 August 1999
14. Kennel A, Perrochon L, Darvishi A (1996) WAB: WWW access for blind and visually impaired computer users. In: Burger D (ed) New technologies in the education of visually handicapped. John Libbey Eurotext, Paris
15. Kurze M, Holmes E (1996) 3-D concepts by the sighted, the blind, and from the computer. In: Proceedings of the 5th international conference on computers helping people with special needs, Linz, Austria, 17–19 July 1996
16. Oogane T, Asakawa C (1998) An interactive method for accessing tables in HTML. In: Blattner MM, Karshmer AI (eds) Proceedings of the third international ACM conference on assistive technologies. ACM Press, New York


17. Pontelli E, Son TC (2002) Navigating HTML tables: planning, reasoning, and agents. In: Proceedings of the international conference on assistive technologies, Edinburgh, Scotland, 8–10 July 2002. ACM Press, New York, pp 73–80
18. Pontelli E, Xiong W, Gupta G, Karshmer A (2000) A domain-specific language framework for non-visual browsing of complex HTML structures. In: Tremaine M, Cole E, Mynatt E (eds) The fourth international ACM conference on assistive technologies. ACM Press, New York
19. Ramloll R, Yu W, Brewster S, Riedel B, Burton M, Dimigen G (2000) Constructing sonified haptic line graphs for the blind student. In: Tremaine M, Cole E, Mynatt E (eds) The fourth international ACM conference on assistive technologies. ACM Press, New York
20. Sowa JF (1984) Conceptual structures. Addison Wesley, Upper Saddle River, N.J.
21. Vanderheiden G, Chisholm W, Jacobs I (1998) WAI accessibility guidelines: page authoring. Technical report WD-WAIPAGEAUTH-19980918. W3C
22. Zajicek M, Powell C, Reeves C (1999) Ergonomic factors for a speaking computer interface. In: Hanson M (ed) Contemporary ergonomics. Taylor and Francis, London
