ENHANCING A VOICE-ENABLED WEB BROWSER FOR THE VISUALLY IMPAIRED
Atiwong Suchato*, Jirasak Chirathivat, Proadpran Punyabukkana
Spoken Language Systems Research Group, Department of Computer Engineering, Chulalongkorn University, Phyathai Rd., Pathumwan, 10330 Bangkok Thailand
ABSTRACT
A vast body of digital content lies in the World Wide Web, which people access via web browsers. Automatic speech recognition and Text-to-Speech (TTS) components have been incorporated into some web browsers to help the visually impaired access web content. Webpage-reader programs usually read out textual content sequentially, in the order in which it appears on the webpage. Presenting a webpage’s content aurally in such a sequential fashion often does not give visually-impaired users a clear picture of the content structure of that webpage. Furthermore, it is not uncommon for webpage-reader programs to read texts that are unrelated to the main content. This paper proposes a hierarchical webpage content representation, which allows webpage content to be stored in a tree structure, and an associated XML format that defines the rules for parsing webpage content into the hierarchical representation. The parser was implemented in a Thai voice-enabled web browser with multilingual TTS capability, named CUVoiceBrowser, so that the targeted webpage content is parsed into the hierarchical representation before being read to the users accordingly. The browser’s command list was extended to support navigation through the hierarchical webpage content representation. The concept is demonstrated by applying it to the front page of a newspaper website.
KEYWORDS: Assistive Technologies for Persons with Disabilities, Speech Recognition Application
* Corresponding author. Tel: 66-2-218-6956 Fax: 66-2-218-6955 E-mail: [email protected]
1. INTRODUCTION
With advances in internet software and high-speed connections, surfing the web has become an integral part of many people’s lives. While the web seems to be an ideal source of information and services for many people, it is currently far from ideal for blind people and people with low vision. The visually impaired usually have difficulty using the web, let alone reaping the full benefits enjoyed by people with normal sight.
An obvious source of difficulty for the visually impaired is that most web content is presented visually, i.e. in the form of text and pictures, on web browsers. Digital content in other formats that include sound is available on some websites, but only rarely. More importantly, such content is not intended to represent the overall content of a webpage by itself. Thus, it will not render textual content obsolete, at least in the foreseeable future.
Another source of difficulty concerns the methods for receiving user input that are deployed in the user interfaces (UIs) of currently available web browsers. Web browsers’ UIs are usually designed to work with the most commonly used input devices, which are keyboards and mice. Unlike motor-handicapped people, the visually impaired usually have no problem learning to use keyboards and mice. The difficulty is that they cannot see the positions of their mouse pointers or text cursors. When surfing the web, navigating via hyperlinks and locating elements of web forms, such as input textboxes, are almost impossible without appropriate assistive technologies.
Technologies have been used to ease such difficulties. Some libraries around the world deploy assistive technologies extensively so that every individual can access library resources equally, regardless of eyesight [1].
Many assistive technologies are based on Braille characters [2]. However, speech technology is the preferred choice of many researchers and developers [3]. One reason is that speech communication is natural. Also, the data rate of speech communication is higher than that of typing words on an ordinary or Braille keyboard. Furthermore, efforts spent on improving speech technology benefit not only the visually-impaired community but also society at large.
Common components that use speech technology to help the visually impaired access web-based information include webpage-reader programs, automatic speech recognition systems, and voice-enabled web browsers. A webpage-reader program is a computer program that reads texts on a webpage. It utilizes a Text-to-Speech (TTS) module, which generates speech sounds from the corresponding texts. Note that the most immediate goal of a webpage-reader for the visually impaired is not to read texts more naturally but to read them appropriately and with high intelligibility. Automatic Speech Recognition (ASR) software provides a probabilistic mapping from sounds to texts [4]. The role of an automatic speech recognizer in assisting the visually impaired is to provide an alternative means of controlling voice-enabled computer programs and entering data into them. A voice-enabled web browser is a web browser that can be operated using voice commands, in addition to normal keyboard and mouse operations. It usually has an integrated webpage-reader program, so that users can surf the web purely by voice if they want or need to.
Many researchers have successfully developed prototypes that demonstrate the concept of web surfing via voice in various languages. Hemphill et al. have developed voice-enabled navigation via a speakable hotlist, speakable links and smart pages using a speaker-independent speech recognizer [5]. Brondsted et al. have built a Danish voice-enabled web browser for motor-handicapped users [6]. Punyabukkana et al. have demonstrated such a capability in Thai [7].
2. MATERIALS AND METHODS
2.1 Hierarchical Structured Content
A webpage-reader program usually reads webpage content in order of appearance. The information is thus presented sequentially, regardless of how the actual content is organized. Users with normal sight can visually capture the structure of the webpage content from its formatting and build a mental model of its organization. Without visual information, however, the visually impaired are easily lost after the program has been reading the webpage content for a while.
Consider the illustration of a webpage content structure in Fig. 1. This webpage contains an article on volcanoes¹. The font sizes, the bold faces, and the underlines in this illustration resemble the formatting used in the original webpage. Such formatting naturally helps the reader understand the organization of the content. If this content is read by a webpage-reader sequentially from top to bottom, without any cues to the topic hierarchy, visually-impaired users could easily have trouble understanding how these topics are organized.

Volcano
  Volcano classification
    Erupted material
      Lava composition … (content) …
      Lava texture … (content) …
    Shape
      Shield volcanoes … (content) …
      Cinder cones … (content) …
      Stratovolcanoes … (content) …
      Supervolcanoes … (content) …
      Submarine volcanoes … (content) …
      Subglacial volcanoes … (content) …
    Classifying volcanic activity … (content) …
  Notable volcanoes
    Volcanoes on Earth
Fig. 1. A webpage and the illustration showing its content organization
Fig. 2. Organization of the hierarchical structured content (figure labels: Article root; 1st level topic; 2nd level topic; 3rd level topic; …)
Fig. 3. Illustration of a Topic object (figure labels: Topic; topicName; contentList)
A convenient way to preserve the organization of topics on a webpage is to arrange those topics into a hierarchical structure, in a similar fashion to the illustration in Fig. 2, and let a webpage-reader work with this arranged content instead of the original one. Each circle in Fig. 2 represents a Topic object, a data store for an individual user-defined topic. The Topic object has two properties, namely topicName and contentList. The property topicName stores a text string chosen to be the name of that Topic object, while the property contentList stores a collection of text excerpts associated with the Topic object. In the context of webpage-reader programs, this collection of text excerpts is what is read under the topic associated with its Topic object. In the hierarchy, if Topic B is a child of Topic A, we say that Topic B is a subtopic of Topic A. Topic objects on the same level and with the same parent are called siblings. Every Topic object must belong to an Article.
Adding navigation commands based on this hierarchical structured content to voice-enabled web browsers will let the visually impaired browse through the content more efficiently. The implementation of such commands in a voice-enabled browser is described in a later section. The next section describes how one can define the mapping from the content of a webpage into the above hierarchical structure.

¹ http://en.wikipedia.org/wiki/Volcano retrieved June 24th, 2006
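As a minimal sketch of the data structure described above, the Python fragment below models a Topic object and its Article. Only topicName and contentList come from the text; the subtopics and parent links, the helper names add_subtopic, siblings and outline, and the nesting chosen for the volcano topics are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Topic:
    """One node of the hierarchy: a name plus the excerpts read under it."""
    topicName: str
    contentList: List[str] = field(default_factory=list)
    # Assumed tree links; the paper only names topicName and contentList.
    subtopics: List["Topic"] = field(default_factory=list)
    parent: Optional["Topic"] = field(default=None, repr=False, compare=False)

    def add_subtopic(self, child: "Topic") -> "Topic":
        """Attach child under this topic and return it for chaining."""
        child.parent = self
        self.subtopics.append(child)
        return child

    def siblings(self) -> List["Topic"]:
        """Topics with the same parent, excluding this one."""
        if self.parent is None:
            return []
        return [t for t in self.parent.subtopics if t is not self]


@dataclass
class Article:
    """Root container: every Topic object belongs to an Article."""
    topics: List[Topic] = field(default_factory=list)


def outline(topic: Topic, depth: int = 0) -> List[str]:
    """Return the topic names as an indented outline, one line per topic."""
    lines = ["  " * depth + topic.topicName]
    for sub in topic.subtopics:
        lines.extend(outline(sub, depth + 1))
    return lines


# Part of the volcano article from Fig. 1 (nesting assumed from the figure).
volcano = Topic("Volcano")
classification = volcano.add_subtopic(Topic("Volcano classification"))
erupted = classification.add_subtopic(Topic("Erupted material"))
lava = erupted.add_subtopic(Topic("Lava composition", ["... (content) ..."]))
erupted.add_subtopic(Topic("Lava texture", ["... (content) ..."]))
article = Article(topics=[volcano])

print("\n".join(outline(volcano)))
```

Keeping a parent link on each node is one possible design choice; it makes "go up" and "next sibling" style navigation cheap without searching the tree from the root.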
2.2 Obtaining Hierarchical Structured Content Using XML-based Template
Hierarchical structured content can be used to prepare webpage content for a webpage-reader program. It preserves the organization of the webpage content when visual formatting cannot, as in the case of visually-impaired users. Apart from arranging webpage content in a meaningful organization, it is also desirable to control which text elements are read and which are not. This control over what is read is another benefit of mapping normal webpage content to the hierarchical structured one.
Many webpage-reader programs let users set parameters that select which types of elements on a webpage should be read and which should be left out. These parameters can be set on a page-by-page basis or applied globally to every webpage. The downside is that the selection is made by visitors to the webpage, when it makes more sense for the author of the visited webpage to do the preparation. An author who would like to control how a specific webpage-reader program reads the content of his/her webpage might be able to do so partially by offering a suggested set of parameters for visitors to apply to their webpage-reader programs. Still, this is not very convenient, especially for visually-impaired visitors.
Here, we propose the use of XML-based parsing templates that describe the parsing rules defining how the content of a webpage should be mapped into the hierarchical structure. A program called “Hierarchical Structured Content Parser (HSC parser)”, which was implemented as a part of this work, applies the parsing rules defined in a parsing template to the webpage of interest and creates the associated hierarchical structured content, which is also stored in an XML format.
Since both the parsing template and the resulting hierarchical structured content are in plain text, it should be simple to create these parsing templates in a normal text editor and make use of the parsed content. Fig. 4 illustrates the hierarchical structured content parsing process. Parsing rules are defined in parsing templates using XML. The parsing rules identify how Topic objects are created, as well as how their properties are filled.
Fig. 4. Obtaining hierarchical structured content of a webpage content from its source code using HSC parser and the corresponding parsing template (figure labels: Parsing template; Webpage source; HSC parser; Hierarchical structured content)
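The flow in Fig. 4 can be sketched in a few lines of Python. This is not the HSC parser itself, and the template vocabulary used here (rule, content, tag, level) as well as the output element names (article, topic, content) are hypothetical, since the paper does not publish its XML schema. The sketch only shows the general idea: an XML template decides which HTML elements create Topic objects and which fill their content, and anything not named in the template is left out.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical template: each <rule> maps an HTML tag to a topic level,
# and each <content> names a tag whose text is kept as topic content.
TEMPLATE = """
<template>
  <rule tag="h1" level="1"/>
  <rule tag="h2" level="2"/>
  <content tag="p"/>
</template>
"""


class HSCSketchParser(HTMLParser):
    """Builds an <article> tree of <topic> elements from HTML source."""

    def __init__(self, template_xml):
        super().__init__()
        root = ET.fromstring(template_xml)
        self.levels = {r.get("tag"): int(r.get("level")) for r in root.iter("rule")}
        self.content_tags = {c.get("tag") for c in root.iter("content")}
        self.article = ET.Element("article")
        self.stack = [(0, self.article)]  # (level, element) path to current topic
        self.current_tag = None           # template tag currently being read
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if self.current_tag is None and (tag in self.levels or tag in self.content_tags):
            self.current_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current_tag is not None:  # text outside template tags is skipped
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current_tag:
            return
        text = " ".join("".join(self.buffer).split())
        self.current_tag = None
        if tag in self.levels:
            level = self.levels[tag]
            while self.stack[-1][0] >= level:  # climb back to the parent level
                self.stack.pop()
            topic = ET.SubElement(self.stack[-1][1], "topic", topicName=text)
            self.stack.append((level, topic))
        elif text and len(self.stack) > 1:     # content needs an enclosing topic
            ET.SubElement(self.stack[-1][1], "content").text = text


def parse_page(html_source, template_xml):
    """Apply the template to the page and return the <article> element."""
    parser = HSCSketchParser(template_xml)
    parser.feed(html_source)
    parser.close()
    return parser.article


PAGE = """
<html><body>
<div>Advertisement: ignored text</div>
<h1>Volcano</h1>
<h2>Shape</h2><p>Shield volcanoes are broad.</p>
<h2>Notable volcanoes</h2><p>Many are on Earth.</p>
</body></html>
"""
article = parse_page(PAGE, TEMPLATE)
print(ET.tostring(article, encoding="unicode"))
```

Because the output is itself XML, it can be stored as plain text alongside the template, which matches the storage choice described above.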
3. RESULTS AND DISCUSSION
3.1 Implementation in CUVoiceBrowser
CUVoiceBrowser is a voice-enabled web browser integrated with Thai automatic speech recognition and multilingual TTS modules. Both the visually impaired and people with normal sight were taken into consideration in the design of CUVoiceBrowser. Navigation can be done via both traditional and voice inputs. Together with the webpage reading capability of the TTS, voice commands allow users to perform most of the tasks required for accessing mainstream content on the web, including going to desired URLs, following links shown on the webpage, asking for the list of links on the webpage, opening the user’s bookmark page, navigating forward and backward, activating the user’s pre-defined search procedure, filling in web forms and performing searches using character-wise data entry, controlling the text reading of the TTS module, requesting instructions from the help page, and performing simple program controls. CUVoiceBrowser automatically extracts webpage content from its surrounding formatting tags. By default, the reading is performed sequentially from the top to the bottom of the page. Different voices are used to distinguish between normal and hyperlinked texts in order to help visually-impaired users identify links they can follow.
The HSC parser was integrated into CUVoiceBrowser so that the browser supports hierarchical structured content parsing. CUVoiceBrowser was also modified so that, when it retrieves a webpage from a URL, it looks for an appropriate parsing template. Parsing templates can be found in two ways. First, the author of a webpage can provide the URL of the parsing template designed for that webpage in the head section of the webpage’s source. Second, a template can be taken from the browser’s pre-loaded parsing template inventory.
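The two lookup paths just described can be sketched as follows. The paper does not say how the template URL is written in the head section or how the inventory is keyed, so the link element with rel="hsc-template" and the dictionary keyed by host name are both assumptions made for this illustration, as is the function name find_template.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _HeadLinkFinder(HTMLParser):
    """Finds the first <link rel="hsc-template" href="..."> inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.href is None:
            attributes = dict(attrs)
            # "hsc-template" is a hypothetical rel value, not from the paper.
            if attributes.get("rel") == "hsc-template":
                self.href = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_template(page_url, html_source, inventory):
    """Return a template location, preferring the author's own declaration.

    inventory is an assumed dict mapping host names to pre-loaded templates.
    Returns None when neither source provides a template.
    """
    finder = _HeadLinkFinder()
    finder.feed(html_source)
    if finder.href:
        return urljoin(page_url, finder.href)  # resolve relative template URLs
    return inventory.get(urlparse(page_url).hostname)


inventory = {"www.nytimes.com": "templates/nytimes.xml"}
page = '<html><head><link rel="hsc-template" href="/hsc/front.xml"></head><body></body></html>'
print(find_template("http://example.org/news/index.html", page, inventory))
print(find_template("http://www.nytimes.com/", "<html><head></head></html>", inventory))
```

Checking the author-provided template first reflects the argument made earlier that the page author is best placed to decide what should be read; the inventory then covers sites whose authors provide nothing.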
When the browser engine receives a webpage via an HTTP response and a parsing template associated with that webpage is present, the source of that webpage is fed to the HSC parser together with its parsing template. The hierarchical structure for that webpage content is then created and stored in the browser’s memory. Using an extended set of voice commands, users can navigate through the hierarchical structure and have the TTS read the content or the name of the desired topic, as well as the list of topics at any level of the structure. The original voice commands can still be used, and they are applied to the content of the topic that the browser is currently in.
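A minimal sketch of such navigation is given below. The paper does not list the extended voice commands, so the method names (list_topics, enter, up, next_topic, read_content) are hypothetical stand-ins for them, the Node class is a simplified stand-in for the Topic object, and the speak callback stands in for the TTS module.

```python
class Node:
    """Simplified stand-in for a Topic object with tree links."""

    def __init__(self, name, content=None):
        self.topicName = name
        self.contentList = list(content or [])
        self.children = []
        self.parent = None

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


class TopicNavigator:
    """Keeps track of the current topic and speaks what the user asks for."""

    def __init__(self, root, speak=print):
        self.current = root
        self.speak = speak  # stand-in for the TTS module

    def list_topics(self):
        """Speak and return the names of the current topic's subtopics."""
        names = [c.topicName for c in self.current.children]
        self.speak(", ".join(names) if names else "No subtopics")
        return names

    def enter(self, name):
        """Move into the subtopic with the given name (case-insensitive)."""
        for child in self.current.children:
            if child.topicName.lower() == name.lower():
                self.current = child
                self.speak(child.topicName)
                return True
        self.speak("Topic not found")
        return False

    def up(self):
        """Move to the parent topic, if any."""
        if self.current.parent is None:
            return False
        self.current = self.current.parent
        self.speak(self.current.topicName)
        return True

    def next_topic(self):
        """Move to the next sibling, if any."""
        parent = self.current.parent
        if parent is None:
            return False
        index = parent.children.index(self.current)
        if index + 1 >= len(parent.children):
            return False
        self.current = parent.children[index + 1]
        self.speak(self.current.topicName)
        return True

    def read_content(self):
        """Speak every excerpt stored under the current topic."""
        for excerpt in self.current.contentList:
            self.speak(excerpt)


front = Node("Front page")
breaking = front.add(Node("Breaking News"))
breaking.add(Node("Headline one", ["First story text."]))
breaking.add(Node("Headline two", ["Second story text."]))

nav = TopicNavigator(front)
nav.enter("Breaking News")
nav.list_topics()
nav.enter("Headline one")
nav.read_content()
```

In a real browser, a recognized voice command would be dispatched to one of these methods, and the original commands would operate on the content of nav.current.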
3.2 An Example Usage
To illustrate its usage, we applied our approach to organizing webpage content on some frequently-updated webpages. Here, we present a demonstration of the approach on the front page of The New York Times (http://www.nytimes.com). The format of the webpage was analyzed, the targeted hierarchical structure of the desired content on that webpage was defined, and the corresponding parsing template was constructed and used. Fig. 5 shows a portion of the webpage where breaking news items are listed in two sections, labeled A and B. A typical webpage-reader that reads sequentially will not do a good job narrating the content in this portion. In contrast, we can write a parsing template that looks for news headlines in section A and section B by searching for specific formatting tags (in this case, …) and makes each one of them, together with its corresponding content, a Topic object. These Topic objects are listed as subtopics of another Topic object named “Breaking News”, and their topic names are extracted from the <h3> elements (for the first news) and the <h5> elements (for the other news) in the vicinity of each Topic object. The part of the parsing template that defines how to create Topic objects in section A is shown in Fig. 6. Note that some details of the XML are omitted in the figure. Once the webpage has been parsed into the hierarchical structured content, CUVoiceBrowser allows users to navigate through the parsed content.
4. CONCLUSIONS
We proposed a method to arrange webpage content into a hierarchical structure using XML-based parsing templates. When webpage content is arranged in such a structure, a webpage-reader program that supports the structured content can read the content to the users in a more organized and controllable manner. This will help the visually impaired navigate through the content of a webpage in a content-driven fashion. Such an approach is our next step toward helping the visually impaired access web-based information beyond the use of a voice-enabled web browser with a traditional webpage-reader module.
Although the parsing templates are flexible and small in size due to their text-based nature, writing them might still be too complicated for novices unfamiliar with markup languages. A graphical user interface could be developed so that one can define Topic objects via simple mouse actions on the webpage itself rather than via tags in its source code.
Another interesting direction, which could make the spoken webpage content as informative as the visually formatted original, is to use various aspects of voice quality, together with other sounds, to communicate the intention of each visual formatting element. For example, a weak beep sound could be played while a hyperlink is read, or a higher-pitched voice could be used to read blinking texts. Psychological studies are needed for such sound representations.
Fig. 5. A portion of a front page of The New York Times website (figure labels: sections A and B)
Definition of a Topic object named “Breaking News”: … Breaking News …
Definition of a Topic object named after the element: … h3 …
Definition of an unspecified number of Topic objects named after elements: … h5 …
Fig. 6. Portion of the parsing template used to parse the content in section A
REFERENCES
[1] Lee, Young Sook, 2005: The Impact of ICT on Library Services for the Visually Impaired: 8th International Conference on Asian Digital Libraries (ICADL2005). Bangkok, Thailand. 44-51.
[2] Zagler, W. L., Mayer, P., 1992: Microprocessor Devices to Lower the Barriers for the Blind and Visually Impaired: Journal of Microcomputer Applications 15(1): 57-64.
[3] Nolan, Y.M., de Paor, A., 2005: Phoneme Recognition Based Software System for Computer Interaction by Disabled People: The Int. Conf. on Computer as a Tool, 2005 (EUROCON 2005) Vol.1. 394-397.
[4] Rabiner, L.R., 1989: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition: Proc. of the IEEE Vol.77(2) 257-286.
[5] Hemphill, C.T., Thrift, P.R., 1995: Surfing the Web by Voice: ACM Multimedia 95 Electronic Proceedings, San Francisco, CA, USA.
[6] Brondsted, T., Aaskoven, E., 2005: Voice-Controlled Internet Browsing for Motor-handicapped Users, Design and Implementation Issues: Interspeech 2005. 9th European Conference on Speech Communication and Technology (Interspeech2005). Lisboa, Portugal.
[7] Punyabukkana, P., Chirathivat, J., Maekwongtrakarn, J., Chanma, C., Suchato, A., 2005: The Implementation of CUVoiceBrowser, a Voice Web Navigation Tool for the disabled Thais: Unpublished paper.