Technology and Disability 20 (2008) 135–156 IOS Press
Speech technologies for blind and low vision persons

Diamantino Freitas a,∗ and Georgios Kouroupetroglou b

a Department of Electrical and Computer Engineering, University of Porto, Porto, Portugal
b Department of Informatics and Telecommunications, University of Athens, Athens, Greece
Abstract. In this work a review is presented of speech technologies and their applications that provide or augment access to printed or electronic information, to daily or social activities, and to private or public facilities for blind or low vision persons. Speech technologies are currently considered essential for providing general-purpose interfacing, besides providing accessibility for people who are visually impaired. Speech-enabled devices, reading machines, accessible computers, software applications, World Wide Web (WWW) content and structured environments constitute the main areas addressed throughout this paper, with reference to the background technologies, architectures, formats, on-going research activities and projects. In the state of the art of the accessibility field, the speech communication channel is considered by the authors as one of the most important modalities to benefit blind and low vision persons.

Keywords: Screen-readers, text-to-speech, speech recognition, audio description, talking devices, automated reading devices, voice browsers, voice portals, accessibility, mobility, digital talking books

∗ Address for correspondence: Diamantino Freitas, Department of Electrical and Computer Engineering, University of Porto, Rua Dr Roberto Frias, 4200-465 Porto, Portugal. Tel.: +351 22 5081837; Fax: +351 22 5081443; E-mail: [email protected].
1. Introduction

Some of the major limitations that persons with visual impairments face are:
– To access written information,
– To operate devices with complex user interfaces,
– To get orientation and mobility support,
– To view television broadcasts, movies, live performances or shows.
The term written information is used here to include all print material (mainly on paper, but also on labels, signs or tags), electronic material (stored in information systems or WWW content and displayed on monitors or projected on screens) and handwritten material, including text, scientific symbols and graphical representations (e.g. diagrams, maps, pictures, drawings). When the written information is language-based (i.e. text), it is straightforward to state that speech (i.e. the oral form of language) is the direct alternative or substitute modality to convey the information to a blind person, and it is very helpful for partially sighted persons. This is confirmed by the fact that, traditionally, recordings on audiotapes were widely used by the blind community to access information in books, magazines and newspapers. In case the written content is not in text form, language-based descriptions can be used to render the information, at least partially. The same approach, named audio description, has been used in theatres, live performances and in television. Furthermore, even persons without vision loss prefer to interact with computers and other devices or appliances using speech commands, or even spoken dialogues. Thus, it is clear that speech technologies, i.e., coding, synthesis and recognition of speech signals [13], play an important role as assistive technologies for visually impaired people [12,54] to alleviate the major barriers described above. The high potential of the speech communication channel to convey visual and environmental information converted or adapted into speech has already led
to a number of solutions, as described throughout this paper. What is generally necessary in this case is to provide the appropriate transformation of the information into a linguistic form. For example, some visual objects, like mathematical expressions and chemical formulas, have remained inaccessible until very recently. Other domains remain almost untouched, like graphics and charts. Automatic and standardized speech descriptions of non-linguistic information have been slow to become available, which constitutes an enormous handicap for blind persons.

Speech technology may and must use the dominant linguistic information tiers: the lexical, the syntactic, the semantic and the prosodic properties. Metalinguistic and paralinguistic information are extra tiers capable of conveying contextual information as well. Speaker characteristics may also be used for this purpose (for instance, using alternative voices). Changing languages could be considered, if appropriate, in the same scope. Automatic conversion of text into speech, or of speech into text, is more or less straightforward nowadays [48]. It is being achieved more and more effectively and affordably, which leads to ubiquity. However, a lot remains to be done in the upper layers of the speech chain that connect with the information that is being conveyed [50].

The main problems affecting speech transmission are a low signal-to-noise ratio and acoustic interference, which, to a greater or lesser degree, compromise the intelligibility and effectiveness of the communication channel. Other difficulties are the lack of privacy, the serial nature of speech transmission (pieces of information follow one another in a rather slow timeline, with a negative impact on the cognition of longer messages), and the practical impossibility of perceiving parallel speech channels [9,59]. However, parallel combinations of speech with other sounds and other media, like tactile information [62], can enhance the possibilities of relatively simple parallel information. Speech and non-speech sounds, when combined, may span a multi-dimensional information space [55]. The use of non-speech sounds for communication is still too underdeveloped to be widely adopted, but it is slowly increasing in importance, mainly in the research field [11]. Speech input is also generally hampered by acoustic interference and has remained immature for many applications. A few applications, like machine or terminal voice commands, have already been introduced but still receive rather low acceptance from blind users.

Technical means for deploying speech interface solutions
normally incorporate a logical part and a communication part. In the former, the functionality is obtained mainly by means of a structured dialogue. The use of XML dialects in this sector allows compact and coherent technical representations, as well as adaptability by means of style sheets. A number of domains may be handled, from raw text to technical texts or even literary documents enriched with metalinguistic annotations. A pre-processor module transforms annotations and coded information into full text with prosodic commands. VoiceXML, Speech Synthesis Markup Language (SSML), Semantic Interpretation for Speech Recognition (SISR), Speech Recognition Grammar Specification (SRGS), Voice Browser Call Control (CCXML) and Synchronized Multimedia Integration Language (SMIL) [95] are some of the more important text annotation, processing and operation control languages in the field of automatic speech interfacing.

In the production of artificial speech, the conversion is done with awareness of the linguistic and metalinguistic information. In some cases even some non-linguistic information may be used. The main types of devices are the Text-to-Speech converter (TtS) and the Automatic Speech Recognizer (ASR). Vocabulary size is an essential limitation in terms of the technical feasibility of the latter type of application. For small vocabularies, ASR is already quite effective in a quiet environment.

In this paper a series of speech applications is presented, starting from speech-enabled devices and moving on to speech technologies in digital television and DVDs. Electronic reading devices are addressed first, followed by spoken access to the computer and documents. Access to the WWW is also a major issue, because of its universality and potential as an information resource and a communication infrastructure. Voice browsers and voice portals are therefore presented thereafter. Speech can also be used to complement the basic mobility capabilities of a blind person. Speech technologies in mobility systems are an important emerging domain due to the increased support that they can provide when connected to an appropriate information and guidance system. Education and professional content need good accessibility, not only to the reading material but also to the learning environment. Speech-based accessibility in the classroom is therefore presented later in this work. Access to the main visual medium, television, is also very important for what it delivers, not only in terms of information but also in terms of entertainment.
Table 1
Speech-enabled devices grouped by application area. (In the original table each device is additionally marked with the embedded speech technologies it uses: compression, synthesis and/or recognition.)

Healthcare: Talking (clinical) Thermometers; Talking Blood Pressure Meters; Talking Glucose Meters; Talking Weight/Bathroom Scales; Talking Pill Bottles (pill organizers and reminders); Talking Calorie Counters; Talking Pedometers
Communication: Speech Command Phone Dialler; Talking Caller-ID Mobile Cell-Phone
Measuring: Talking Indoor/Outdoor Thermometer; Talking Programmable Thermostat; Talking Tape Measure
Reading/Writing/Education: Talking Calculators; Talking Scientific Calculators; Talking Dictionaries; Talking Braille Note-Takers; Talking (word) Translators
Time: Talking Clocks/Alarms; Talking Watches; Event Talkers; Talking Timers
Daily living/Cooking: Talking Kitchen Timers; Talking Cooking Thermometers; Talking Kitchen Scales; Talking Measuring Jugs; Talking Microwave Ovens
Identifiers: Talking Identifier of Colors; Talking Barcode Readers; Talking RFID Tag Readers; Note Teller Money Identifiers
Mobility: Talking Compasses; Talking Electronic Canes; Portable Navigation System with GPS; Indoor Navigation System
Controlling: Speech-activated Talking Remote Controls for Home Appliances; Speech-activated Talking TV Remote Controls
Organizing: Personal Digital Assistants (PDAs)
2. Speech-enabled devices

The progress made over the last few decades in Digital Signal Processing, VLSI and speech processing algorithms has enabled the integration of the three main speech technologies (compression, synthesis and recognition [12]) into everyday devices and products. Embedded speech technologies are offered on both hardware (as chips or Integrated Circuits) and software platforms. Speech-enabled devices can be classified as follows:
– Talking devices,
– Speech-activated devices,
– Spoken dialogue-based devices.
Typical speech-enabled devices along with the corresponding embedded speech technologies are presented in Table 1. Talking devices provide spoken messages to the visually impaired user. The length of these messages can vary from a single word to a sentence, or even a paragraph. A range of devices is depicted in Fig. 1. The example devices are a Talking Watch (Fig. 1a), a Speech-Activated Talking Remote Control (Fig. 1b), a Voice-Activated Phone Dialer (Fig. 1c), a Talking Barcode Reader (Fig. 1d) and a Talking Pill Organizer and Reminder (Fig. 1e). Speech processing chips, which are continuously appearing and being updated in the market, constitute the core of the devices' operation.
Fig. 1. Speech-enabled devices: a) Talking watch, b) Speech-activated talking remote control, c) Voice Activated Phone Dialer, d) Talking Barcode Reader [37], e) Talking Pill Organizer and Reminder [36].
Fig. 2. Typical architecture of a speech compression and synthesis chip.
A typical architecture for a speech compression and synthesis chip is presented in Fig. 2. Three main methodologies are used in talking devices:
– Prerecorded digitized speech,
– Concatenated prerecorded speech,
– Text-to-Speech synthesis.
Prerecorded digitized speech based on waveform coding [15] is utilized in cases where the words of the message are predefined and the total length of all the messages is limited. The messages can be recorded
once by the manufacturer (e.g. talking thermometer) or by the user (e.g. talking pill organizers/reminders, Fig. 1e). The content of the message may be an announcement of the status of the device (e.g. talking microwave oven), the delivery of a measurement result (e.g. talking weight/bathroom scale), a message/reminder or a prompt. The speech produced with this methodology has natural quality. Concatenated prerecorded speech [12] provides better flexibility in the formation of the message, as it can use not only prerecorded words or phrases, but also parts of a word (e.g. stems, endings) [100]. This is particularly useful when an inflectional language is used. Text-to-Speech synthesis can deliver messages with an unlimited or unrestricted vocabulary, without the need for prerecording.
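To make the concatenation methodology concrete, the following sketch (in Python) assembles the announcement for a thermometer reading such as 37.2 degrees from short prerecorded word clips. The clip file names and the deliberately simplified number-to-words rule are hypothetical; an embedded device would perform the same joining over clips stored in ROM or flash memory rather than files on disk.

```python
# A minimal sketch of concatenated prerecorded speech: build one announcement
# by joining per-word WAV clips. File names and the number-to-words rule are
# illustrative only and cover just readings between 0.0 and 99.9.
import wave

CLIP_DIR = "clips"   # hypothetical folder holding one WAV file per word

ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty",
        "fifty", "sixty", "seventy", "eighty", "ninety"]

def words_for_reading(value: float) -> list:
    """Spoken words for a reading, e.g. 37.2 -> thirty seven point two degrees."""
    whole, tenth = int(value), int(round(value * 10)) % 10
    if whole < 10:
        words = [ONES[whole]]
    elif whole < 20:
        words = [TEENS[whole - 10]]
    else:
        words = [TENS[whole // 10]] + ([ONES[whole % 10]] if whole % 10 else [])
    return words + ["point", ONES[tenth], "degrees"]

def concatenate(words: list, out_path: str) -> None:
    """Join the per-word clips (all assumed to share the same sample format)."""
    with wave.open(out_path, "wb") as out:
        for index, word in enumerate(words):
            with wave.open(f"{CLIP_DIR}/{word}.wav", "rb") as clip:
                if index == 0:
                    out.setparams(clip.getparams())
                out.writeframes(clip.readframes(clip.getnframes()))

concatenate(words_for_reading(37.2), "announcement.wav")
```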
Fig. 3. Typical architecture for a speech recognition chip designed to support Hidden Markov Modeling as well as Neural Network methodologies for speaker dependent or speaker independent speech recognition.
Speech-activated devices use speech recognition to control their own functionality (e.g. voice dialing in mobile cell phones) or that of external devices (e.g. talking TV control, Fig. 1b). Both speaker-independent and speaker-dependent speech recognition can be embedded [97]. Speaker-independent speech recognition [50] works right out of the box, does not require end-user training, and is most suitable for small-vocabulary applications, like phone numbers and digit strings (voice dialing applications such as handsets, personal dialers, mobile phones and hands-free kits). On the other hand, speaker-dependent speech recognition is language independent and works well with any accent or dialect, and even with dysarthric speech or less spoken languages. Spoken dialogue-based devices allow the user to interact with them in a more natural way. Most of them use a structured, machine-directed dialogue approach [58]. Advanced devices utilize word-spotting as well as barge-in techniques in speech recognition. Figure 3 presents a typical architecture for a speech recognition chip designed to support Hidden Markov Modeling (HMM) as well as Neural Network methodologies for speaker-dependent or speaker-independent speech recognition [48].
Embedded speech technologies are accompanied by appropriate development toolkits and SDKs (Software Development Kits) that facilitate their integration in consumer products [63,97]. We will now describe in brief some characteristics of speech-enabled devices. A speech-activated talking clock/alarm can respond to voice commands such as:
– “What time is it?”
– “Snooze”
– “Play Memo”
– “Any Alarms?”
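Such small command vocabularies are typically specified as a finite grammar and matched against the recognizer's word hypotheses. The sketch below shows only the application-side dispatch once an embedded recognizer has returned a text hypothesis; the command set and handlers are invented for illustration and do not correspond to any particular product.

```python
# A minimal sketch of application-side command dispatch for a speech-activated
# talking clock. The embedded recognizer is assumed to return a text
# hypothesis; command patterns and responses are illustrative only.
import re
from datetime import datetime

def say(message: str) -> None:
    print(f"[TTS] {message}")   # placeholder for the device's speech output

COMMANDS = {
    r"\bwhat time\b":  lambda: say(datetime.now().strftime("It is %H %M")),
    r"\bsnooze\b":     lambda: say("Alarm snoozed for nine minutes"),
    r"\bplay memo\b":  lambda: say("Playing your last memo"),
    r"\bany alarms\b": lambda: say("One alarm is set, for seven thirty"),
}

def dispatch(hypothesis: str) -> None:
    """Run the first command whose pattern appears in the recognized text."""
    text = hypothesis.lower()
    for pattern, action in COMMANDS.items():
        if re.search(pattern, text):
            action()
            return
    say("Sorry, I did not understand")

dispatch("what time is it")
```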
A voice input/voice output universal remote control may be used with a television (TV), Video Cassette Recorder (VCR), DVD player/recorder, cable TV box or satellite receiver. Users have the choice of hands-free voice operation or hand-held operation through tactile buttons and voice feedback. The remote control can perform simple operations, such as turning on the TV and DVD player, tuning the television to a specific channel and telling the DVD player to begin playing a movie, based on a single spoken command or button press. Voice feedback tells users which button they have pushed.

Dedicated handheld digital organizers with voice recognition have been developed for use by persons who are blind or have low vision. Voice recognition enables users to store and retrieve information, such as phone numbers, addresses and appointments, using voice commands. When the user speaks a name that is already stored, the device repeats it, finds the phone number and can send a tone signal to the user's telephone to dial the number. In addition to a phone book and dialling assistance, these devices feature an appointment book, voice notepad, talking alarm clock and talking calculator. The organizer speaks all menus
and control panel options, and also speaks buttons' characters as they are pressed. Similar functionality has already been incorporated in mainstream Personal Digital Assistants (PDAs) as well as in mobile phones. A speech-enabled PDA/mobile phone alerts users with messages for incoming e-mails, text messages and calls. Furthermore, it uses the caller-ID function, which enables users to know who is calling without picking up the phone. The caller's identity is read aloud to the user in a configurable voice. In addition, speech-enabled access to features, such as battery and signal level, ensures that users can be aware of the telephone's status. Furthermore, the user can utilize speech to access more features, including location-based services such as news and traffic alerts, ring tone downloads and access to Web content. In addition, users can use a GPS navigation system to find a map and hear detailed directions to the nearest restaurant or hotel, or explore a route before starting a walk.

A Talking Barcode Reader (TBR) (Fig. 1d) aids the visually impaired person in identifying items, using a portable machine that transforms bar code or Universal Product Code (UPC) information into audible speech. Using digital voice recording and Text-to-Speech technologies, these devices allow users to access a large database of product descriptions, along with a tailored set of their own voice messages. TBRs can be used to quickly identify any product or item that is not easily recognized by means of the other senses, including cans, jars, boxes, bottles, clothing, playing cards, prescription drugs, compact discs, albums, cassette tapes, diskettes, books, important papers or file folders, etc. Bar codes are already placed on virtually every product sold in stores today. A TBR already contains a huge database of items, including grocery, spirits, pharmacy, movies, music and much more. As an added benefit, many items in this database include extended package details, such as nutritional information, ingredients, warnings, instructions, package size, and miscellaneous package details. These bar code data, which are stored on the memory card, are growing every day, and database updates are available online or on a CD-ROM. The user can also load his/her own specialized database. In addition, a TBR allows the user to associate any number of recorded speech messages with a scanned bar code. If no description is found in its database, the user is prompted to record his/her own voice description. When that bar code is scanned in the future, the recorded descriptions are played back. Today, mainstream cell-phones with embedded bar code readers and embedded speech synthesis and recognition are already available in the market.
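At its core, a TBR performs a lookup from the decoded barcode to a description that is then spoken, falling back to a user-recorded message when the item is unknown. The sketch below illustrates only that lookup step; the database layout, barcodes and helper names are hypothetical.

```python
# A minimal sketch of the lookup step in a Talking Barcode Reader: map a
# decoded UPC/EAN code to a spoken description, or fall back to a message the
# user recorded earlier. Database layout and barcodes are illustrative only.
PRODUCT_DB = {                       # shipped database (excerpt)
    "5601234567890": "Canned tomato soup, 400 g. Ingredients: tomato, water, salt.",
    "5609876543210": "Green tea, 20 bags.",
}
USER_RECORDINGS = {                  # messages the user has recorded, by barcode
    "2000000000012": "user_clips/blue_sweater.wav",
}

def play_recording(path: str) -> None:
    print(f"[AUDIO] playing {path}")     # placeholder for audio playback

def speak(text: str) -> None:
    print(f"[TTS] {text}")               # placeholder for text-to-speech output

def announce(barcode: str) -> None:
    """Speak the stored description, or a user recording, or prompt to record one."""
    if barcode in PRODUCT_DB:
        speak(PRODUCT_DB[barcode])
    elif barcode in USER_RECORDINGS:
        play_recording(USER_RECORDINGS[barcode])
    else:
        speak("Unknown item. Press record to attach your own description.")

announce("5601234567890")
```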
A similar approach can be applied to RFID (Radio-Frequency Identification) tags: when an RFID reader identifies a tag's unique ID, an associated spoken message can be produced. RFID is not simply a replacement for barcodes, as its usage spans from consumer products to location-based tags. The main advantages of RFID compared to barcodes are that the content and the length of the spoken message can be modified easily or can even be user specific.

3. Automated Reading Devices

Automated Reading Devices (ARDs) are stand-alone machines that can convert printed or electronic text to audible speech. They do not need to be connected to any other device, such as a computer. ARDs have been designed for use by individuals who are blind or have low vision. Depending on the type of the source material, there are two main classes of ARDs:
– Printed-Text (PT) ARDs,
– Electronic-Text (ET) ARDs.
Printed-Text ARDs (or PT-ARDs) use a scanner to capture an image of the printed text. This image is then processed by an Optical Character Recognition (OCR) application in order to extract the content in an electronic text format, which drives a Text-to-Speech system. In Fig. 4, some desktop (Fig. 4a and 4b) and handheld devices (Fig. 4c) are depicted. They allow users to read anything from letters and brochures to newspapers, reports, books or any other printed material. The user just has to place a text document on the glass surface and press the start button, and the document is read aloud in a synthesized, easy-to-understand voice. If the user wants to save a scanned document, he/she has to press a button and speak a filename into the built-in microphone. Although the same functionality can be achieved using a conventional scanner connected to a Personal Computer (PC) with OCR and TtS software, the single-task approach of PT-ARDs is much more convenient and faster for the visually impaired. The remarkable attributes of a PT-ARD are speed, reading accuracy, simplicity of use, low weight and small size. Other features include internal memory for storage (of hundreds of thousands of pages), multilingual capabilities, a USB port, connection to a Braille display, a stop button, a headphone jack, volume, speed and pitch control of the synthesised speech, along with navigation buttons. PT-ARDs are sub-classified into desktop devices with a flat-bed scanner (Fig. 4a,b) and handheld devices (Fig. 4c).
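The PT-ARD pipeline (capture, OCR, TtS) can be roughly approximated on a desktop PC with open-source components, as in the sketch below using the pytesseract and pyttsx3 packages; this is only an illustration of the processing chain, not the firmware of any actual PT-ARD, and the image file name is hypothetical.

```python
# A rough desktop approximation of the PT-ARD pipeline: image -> OCR -> TtS.
# Requires the pytesseract and pyttsx3 packages plus the Tesseract OCR engine.
from PIL import Image
import pytesseract
import pyttsx3

def read_page_aloud(image_path: str) -> str:
    text = pytesseract.image_to_string(Image.open(image_path))  # OCR step
    engine = pyttsx3.init()
    engine.setProperty("rate", 170)      # speaking rate, user adjustable
    engine.say(text)                     # queue the recognized text
    engine.runAndWait()                  # block until speech output finishes
    return text

if __name__ == "__main__":
    read_page_aloud("scanned_page.png")
```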
Fig. 4. Desk-top (a) (b) and hand-held (c) Printed-Text Automated Reading Devices (PT-ARD).
Fig. 5. Electronic Text Automated Reading Devices (ET-ARD).
Electronic-Text Automated Reading Devices (ET-ARDs) are hand-held machines that can convert files of electronic text to audible speech through an embedded Text-to-Speech system (Fig. 5). These text files can be read from flash cards inserted in an appropriate slot of the ET-ARD, or downloaded through a wired connection to a computer or through a wireless service. The most common file format they support is the DAISY/NISO (Digital Accessible Information System / National Information Standards Organization) standard [7] for Digital Talking Books (DTB) [10]. They are also referred to as Digital Talking Book Players. The main desirable features of ET-ARDs include:
– Support of the DAISY/NISO standard.
– Support of removable flash cards such as SD (Secure Digital) or CF (Compact Flash).
– Ability to play NLS (National Library Service for the Blind and Physically Handicapped) downloaded books and NLS book cartridges.
– Voice recording capability through a built-in microphone.
– Playback through headphones or a secondary small speaker.
– USB port to transfer books from a PC.
– Long-life rechargeable batteries.
Other advanced DAISY book player features include:
– Four arrow keys for navigation by chapter, section, page and bookmarks.
– Go-to features to jump to a specific page, heading, book or bookmark.
– Browsing of a bookmark list.
– Simple bookmarks to mark a reading position.
– Several types of bookmarking saved separately for multiple books.
– Audio bookmarking with recorded voice note.
– Highlight bookmarking to mark start and end of a passage.
– Book information key.
– Where-Am-I key for information on reading position.
– Built-in user guide and Key Describer.
– Audio messages for battery capacity.
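Navigation by chapter, section and page in a DAISY book is driven by the book's navigation control (NCX) file. As a minimal sketch, assuming a DAISY 3 (ANSI/NISO Z39.86) book whose NCX file is named nav.ncx, the navigation points an ET-ARD would expose through its arrow keys can be listed with the standard XML parser:

```python
# A minimal sketch of reading the navigation map of a DAISY 3 talking book
# from its NCX file. The file name is illustrative; error handling is omitted.
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}

def navigation_points(ncx_path: str) -> list:
    """Return (label, content reference) pairs in playback order."""
    root = ET.parse(ncx_path).getroot()
    points = []
    for nav_point in root.iter(f"{{{NCX_NS['ncx']}}}navPoint"):
        label = nav_point.find("ncx:navLabel/ncx:text", NCX_NS)
        content = nav_point.find("ncx:content", NCX_NS)
        if label is not None and content is not None:
            points.append((label.text, content.get("src")))
    return points

for label, src in navigation_points("nav.ncx"):
    print(label, "->", src)
```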
4. Spoken access to the computer and documents

4.1. Introduction to document aspects

Documents may generally be considered as physical information elements or sets of elements. Several types of information exist in such elements, namely symbolic, diagrammatic or sensory-representational information:
– Linguistic information can be found in most documents, mostly in written form and in audio-spoken documents;
– Symbolic information is very abundant in technical and other documents; mathematical and chemical expressions and artworks are a few examples;
– Diagrammatic elements such as block diagrams, electronic schematics, charts, graphs, technical drawings and project timelines may also be found [87];
– Ambient or environmental information, such as maps, meteorological charts, natural world documents, city plans and route identification information, is also a very important set of information elements;
– Sensory or cognitive representational information, such as images and movies, non-linguistic sounds and music, plastic or figurative arts and emoticons (i.e. icons that illustrate emotions).

Documents are associated with many different and relevant processes or operations that must be considered when dealing with the problem of their accessibility. Firstly, documents must be created and edited. Generally, a guide or template may be available for that, but many times the document is created from raw material, text or other elements. Support for creation activities is extremely important for a satisfying inclusion of disabled people. Secondly, documents must be reached, handled and used for information retrieval or general management actions, in a random-access way. Handling documents is in general related to preparatory operations for access, like selection or browsing locally or on the Web, but also to document life-cycle management, including preservation, security, transportation or transmission, and destruction. Thirdly, documents need, in general, some form of indexing or cataloguing for future reference, archiving, search and retrieval.

Media-related aspects of documents are fundamental in terms of accessibility and usability for persons who are disabled or have special needs. In fact, disability generally leads to a partial or total access limitation to one or more media; therefore, it is very important to introduce the possibility of making a document convertible from its original medium or media into alternative ones that are accessible by the document user. This can be done with a substitution approach or an augmentative one. Contextual issues and activities are also important when considering the use and handling of documents. Working at home or at the office imposes fewer access restrictions on a wide range of documents than during
traveling or outdoors. At home or at the office there is in general easier access to computer terminals that may support the required conversion and interfacing, but when moving, this interface is almost completely lost, except when using alternative and portable computers or similar devices such as PDAs. Outdoor environments generally bring privacy issues into the discussion; therefore, the available channels (and related media) become fewer. Speech coming from a loudspeaker or spoken into a microphone is quite hard to use outdoors because of interference from ambient noise and because of eavesdropping. Blind or low-vision persons may access documents and their contents only when they are converted into an accessible medium, such as speech or sound, or a tactile one, like Braille embossing, swell-paper printing or a haptic display. The interfacing possibilities offered by the physical support of the document are the next essential element in the chain of transaction of content. This is related to the hardware and software of the terminal, be it a PC, a PDA, a mobile phone or an e-book. In the remainder of this section the medium conversion approach will be explored for a number of document types, with a discussion of existing and foreseeable solutions. Accessibility tools will be addressed too.

4.2. Retrieving written linguistic information

The availability of a sound communication channel is ubiquitous in currently available computational and communication terminals, which are more and more convergent and integrated. Through this channel a speech output is easily set up with the use of the operating system's features and drivers. Recorded digital speech and synthetic speech need a speech engine that basically performs the selection or generation of waveforms for a specified voice, the needed time alignment and other speech feature processing, and subsequently outputs the waveforms into the terminal's sound channel. Software control of this time-sequencing operation is obviously essential, and real-time control by the listener is mandatory, not only for practical reasons of operation control, but also for cognitive reasons. The user must be able to control the flow of auditory information, adjusting the rate to levels suitable for a comfortable comprehension of the content. One of the main features of today's terminals is the speech synthesizer module that executes the Text-to-Speech synthesis function already mentioned in Section 2.
Fig. 6. Illustration for the relationship between the Screen-Reader, the GUI and the application showing the paths of information flow.
However, using this hardware module is not very practical, because of the low-level commands and data that are required to feed it. Therefore, a range of applications for computer screen reading has become available over the past decades, the so-called Screen-Readers.
Fig. 7. A blind person can benefit in a crucial way from the speech output produced by screen reader software.
4.2.1. Screen-Readers
Since the early 1980s a family of applications known as Screen-Readers has been introduced, aiming to produce a vocal rendering of the text contents of the computer screen under user control through the keyboard, using a Text-to-Speech converter (TtS). Screen-Reader software stays active in the background, analysing the actual content of the screen that is produced by any software application. The screen memory's pixel information makes the task of extracting the embedded content very difficult. Therefore the Screen-Reader software must interpret the operating system messages to build an off-screen model, a substantially difficult task. Screen-Reader software has evolved considerably over two and a half decades. Figure 6 depicts the relationship between the Screen-Reader, the graphical user interface (GUI) and the document access application. Screen readers can also analyse menus and message or dialog boxes and contribute to the user interface by producing the corresponding speech output. User-controlled modes allow many kinds of text scan, from introductory reading of a few characters of the first line of each paragraph, to full-text reading, or even individual character reading. Non-linear or random screen exploration is possible.
Keyboard control of speech output allows quite fast navigation when the user uses shortcuts. A simulation of the use of a screen reader is available at the WebAIM website [44]. Figure 7 illustrates a blind person using Screen-Reader software to access the laptop's content. The audio channel is conveyed to the person's ear by means of an earphone. Many Screen-Reader applications exist (see [23]): some stand-alone commercial examples are JAWS (Job Access With Speech) from Freedom Scientific [28], Window-Eyes from GW Micro [29] and Hal from Dolphin. A few operating systems provide basic screen readers, like Narrator from Microsoft [33] and VoiceOver from Apple Computer [25]. Emacspeak, from T.V. Raman [22], is a distinguished free Screen-Reader and aural user interface (AUI) system among the many available for Linux. However, there are many limitations that current screen readers cannot overcome, for instance those related to images (screen readers cannot describe images, only read out their textual description when available), to visual layout (the user has no means to realize how the page is organized and to go directly to
the point of interest, as the screen reader usually reads linearly and does not skip over uninteresting parts) and to data constructs that use positional information in rows and columns, like tables and plots (data tables can be quite confusing and lengthy when reproduced through speech, putting a big strain on the user's interpretation ability and memory).

Scripting is a technique adopted by some Screen-Readers to adapt to application semantics that are not observable in the GUI data; this is the case, for example, of the JAWS Screen-Reader. The Text-to-Speech converter should allow spelling and reading of individual characters and of all kinds of text elements that may appear, like numeric expressions, abbreviations, acronyms and other coded elements, including punctuation. Text features like formatting (bold, italic character formats) and meta-information may be conveyed through variation of speech features, like giving a higher pitch according to the user's preferences (see the section on Document-to-Speech synthesis). The World Wide Web Consortium (W3C) introduced in 1998 a chapter relative to the acoustical rendering of a web page, the Aural Cascading Style Sheets (ACSS) [41], recommending the use of auditory icons that are to be rendered to the user acoustically by means of an appropriate surround-sound output system (stereo). Auditory icons hold acoustical spatial characteristics and, when used along with voice, provide an augmented acoustic interface. There are a number of additional decorative features or actions that many documents encompass. For instance, onmouseover events can be made accessible by including descriptive data in the file.

4.2.2. TtS module control and other features
Screen-Reader software should give its user the capability of interrupting the current utterance after it has been sent to the TtS module. Command-to-speech-output latency should be conveniently small, so that the user can really navigate at her/his own pace and not at the TtS's output pace. Screen-Readers compliant with ACSS mark acoustically some features of the rendered text. For this purpose, orthographic error correction tools and controls of supra-segmental speech features, like prosody, rhythm, articulation and pauses, must be available in the TtS. The blind user may take these under his/her strict control, besides their being used automatically by the Screen-Reader.
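As a simple illustration of mapping such text features onto speech features, the sketch below converts a few formatting annotations into SSML prosody and emphasis markup that a compliant TtS engine could render. The mapping rules (heading spoken slower and higher, bold emphasised) are invented for illustration and are not the ACSS rules or those of any particular Screen-Reader.

```python
# A minimal sketch of mapping text formatting to speech features via SSML.
# The specific mapping (heading -> slower and higher pitch, bold -> emphasis)
# is illustrative only.
import xml.etree.ElementTree as ET

def fragments_to_ssml(fragments: list) -> str:
    """fragments: (style, text) pairs, style in {'plain', 'bold', 'heading'}."""
    speak = ET.Element("speak", version="1.0")
    for style, text in fragments:
        if style == "heading":
            node = ET.SubElement(speak, "prosody", rate="slow", pitch="+15%")
        elif style == "bold":
            node = ET.SubElement(speak, "emphasis", level="strong")
        else:
            node = ET.SubElement(speak, "s")      # plain sentence
        node.text = text
    return ET.tostring(speak, encoding="unicode")

print(fragments_to_ssml([
    ("heading", "Chapter 2. Speech-enabled devices."),
    ("plain", "Talking devices provide spoken messages."),
    ("bold", "Warning: replace the battery."),
]))
```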
4.3. Retrieving symbolic, diagrammatic and technical information

Until recently, one of the main access limitations to technical documents has been the lack of conversion possibilities to accessible formats. Mathematical expressions, especially, have gained the attention of the research community, with the aim of reaching a conversion system capable of being used by blind and low-vision persons through a combination of hearing and touch. Braille codes have existed for mathematical representation (e.g. the Nemeth code [34]), and newer 8-dot versions are capable of elevating the computer's capabilities for Braille representation. “Math-to-sound” conversion (or audio rendering) of mathematical expressions has been explored more recently by a number of research institutions and companies [26,32]. The use of speech is an established trend. One of the main problems concerning the rendering of mathematical and technical expressions is the hearing memory overload that arises when a moderately complex expression or piece of information is acoustically rendered. When hearing the two-dimensional mathematical information linearly, the reader cannot get a quick glance at the full content of the expression, nor browse it as a sighted reader normally does. To solve this issue it is necessary to introduce some methods of summarizing the technical contents and a way of exploring their structure and contents. Graphics are very important in diagrammatic and technical information. Tactile representation of graphics seems to be more effective when compared with audio description. Paper plotting and embossing are possible, but refreshable tactile graphic displays would be the best option for this purpose, although their resolution is still low. When combined with audio rendering, the result is a promising conversion system [39]. Mathematical expressions are essential in scientific documents, occurring at all levels of the learning structure as well as in professional life. Like language, mathematical reasoning has been considered another pillar of intellect formation. Access to mathematical content is therefore fundamental.

4.3.1. Accessing mathematical contents through speech
Digital representation of mathematics has been an object of research for many decades. The first usable results appeared in the context of the publication of scientific documents in the 1980s. LaTeX [31] coding
of mathematical formulae or expressions is a linear structured code that allows a semantic approach to the formula or expression contents. A description of a formula can be based on basic mathematical entities followed by a series of comma-separated arguments. For instance, the expression

lim(n \to \infty) sum(k = 1, n, frac(1, k^2)) = frac(pi^2, 6)

represents

\[ \lim_{n \to \infty} \sum_{k=1}^{n} \frac{1}{k^{2}} = \frac{\pi^{2}}{6} \]

and is the code for the printed representation of “limit, when n tends to infinity, of the sum from k equal to 1 to n of one over k square equals pi square over 6”. As, historically, the main purpose of this codification was the printout of the document, it took a while before other conversions started taking place.

Many printed mathematical objects have a two-dimensional layout, and the relative and absolute positions of symbols are important. Therefore the codification evolved by considering a matrix-type structure, with rows and columns, for interpreting the positioning of the symbols. The advent of the Internet brought the need to convey the code information in a coherent and universal way, so efforts were directed towards obtaining a different and modern codification that could be compatible and interoperable between the various players in the field, mainly web browsers, math authoring tools and math rendering software add-ons. MathML emerged in the context of the W3C as a response to this need. Many software companies adhered to this de facto standard, making their products compatible [98]. An objective of this markup language is to increase accessibility to scientific and mathematical contents for persons with special needs. This has benefited mostly blind and low vision people.

In the year 2000, the lack of available software to access mathematical content in the e-learning system that disabled students could use prompted research at FEUP's Speech Processing Laboratory (LSS) with the objective of producing an audio rendering of mathematical expressions. This should be done by means of a software module suitable for machine-human communication applied to blind students in the Faculty. Taking into consideration existing LaTeX-based initiatives, for example the work of T.V. Raman in the development of ASTER, or the Lambda Project [30], the Audiomath 2005 system was designed and implemented for the European Portuguese language. It is presently in a pre-release phase. A demonstration web page is available at [24].
Audiomath 2005 is based on the assumption that mathematical expressions can be entirely and unambiguously conveyed through speech alone. Although this assumption seems theoretically reasonable, it unfolds into lengthy descriptive phrases even for moderately complex formulas. The user's auditory memory risks being overloaded, and the user will not be able to retain a good mental representation of the formula unless something is done to ease the process. A few critical aspects must be considered. First of all, it is necessary to produce unambiguous descriptions. This requires that formula elements are kept together with their fellow neighbors in sub-expressions and not connected with elements that belong to other sub-expressions. A solution for this is to employ words that signal the existence of boundaries between sub-expressions. Consider, for instance, the expression

\[ \sqrt{a^{3}} + b^{2} \]

that can be rendered through speech as: “the square root of a to the third, end of radicand, plus b squared”. If the boundary-signaling element “end of radicand” is omitted, the textual description becomes ambiguous. Of course, the full textual representation is just an intermediate phase before obtaining the speech waveform from the TtS, and the final prosodic cues that are added should not be disregarded, because they can help in producing the right boundaries. A professional speaker, having understood the mathematical expression, reads the textual description using as much as possible a number of prosodic cues, composed of a series of pauses and a corresponding series of intonation movements, in order to signal the existing formula boundaries. In Audiomath 2005 [24], a study of the distribution of prosodic cues in speaking some two-level nested expressions was made, and the results of the speech prosodic analysis were clearly organized as rules. Two classes of pauses (two quite different duration values) and two classes of intonation movements (two different patterns) are normally used by the speaker to signal the two main boundary types (major and intermediate). Taking advantage of these results, a more effective and understandable TtS output can be produced. Another important aspect to consider during the production of a long description of a mathematical expression is the listener's memory overload. With this in mind, the description was broken down into chunks short enough to make sense, and an intra-formula navigation mechanism was introduced. The description should follow a meaningful tree, but the progression up to the tree leaves should be left to the decision and rhythm the user desires.
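A toy version of this boundary-word idea, operating on a small hand-built expression tree rather than on real MathML input, is sketched below; the node types and the English wording are invented for illustration and are not the Audiomath rules.

```python
# A toy sketch of generating an unambiguous spoken description of a nested
# expression by inserting a boundary word at the end of a radicand.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                   # 'sqrt', 'power', 'plus', 'symbol'
    children: list = field(default_factory=list)
    value: str = ""

def describe(node: Node) -> str:
    if node.kind == "symbol":
        return node.value
    if node.kind == "power":
        base, exponent = node.children
        return f"{describe(base)} {describe(exponent)}"
    if node.kind == "sqrt":
        # The closing boundary word keeps the end of the radicand unambiguous.
        return f"the square root of {describe(node.children[0])}, end of radicand"
    if node.kind == "plus":
        return ", plus ".join(describe(child) for child in node.children)
    raise ValueError(node.kind)

expr = Node("plus", children=[
    Node("sqrt", children=[Node("power", children=[
        Node("symbol", value="a"), Node("symbol", value="to the third")])]),
    Node("power", children=[
        Node("symbol", value="b"), Node("symbol", value="squared")]),
])
print(describe(expr))
# -> the square root of a to the third, end of radicand, plus b squared
```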
For blind users, the definition of a set of at least four arrow keys is sufficient for the purpose of this navigation. Home and End keys may also help. The navigation mechanism was tested with blind users, showing quite good results in comparison with audio rendering approaches without the navigation mechanism. MathML code may be produced in two modes: presentation and content. While the first is more directed to visual rendering, the second is more adequate for audio rendering. However, the conversion between the two modes is not straightforward. The Audiomath 2005 system was developed for the MathML presentation mode, because this is the mode most frequently encountered in MathML editing software. User customization (user-defined modes) of the reading mechanism is possible in some cases, in order to adjust the system's behavior to the user's cognitive requirements and to obtain better comfort. The artificial reading of mathematical expressions can be considered a special case of structured text generation for speech production. Any kind of codification of mathematical expressions allows a document to be searchable; specific search systems need to be produced for this purpose, relieving the user from having to learn the codes. Besides the reading of documents with mathematical expressions, the other most important activity that must be considered is writing or editing mathematical expressions. The Screen-Reader can help blind users to handle the production, reading simultaneously the current result of the keystrokes. However, any person who cannot handle a keyboard easily enough is thereby prevented from creating and editing mathematical expressions. In order to overcome this unacceptable situation, the Speakmath project was recently started at LSS [27]. The main objective of this project is to study and design a speech recognition interface that may allow the control of mathematical expression editing software.

4.3.2. Document-to-speech synthesis
Beyond the text (commonly referred to as plain text), which is the carrier of the content, almost all the different types of printed or electronic documents, like newspapers, books, journals and magazines, contain visual and non-visual metadata. Visual information includes text formatting (such as font size, colour and type, as well as typesetting like bold and italics) and simple or complex text structures (e.g. tables, hierarchical lists, scientific formulas, and layout such as
columns, borders, boxes) along with diagrams, drawings, charts, figures, logos and photos. It is interesting to note that text formatting carries domain-specific or context-specific semantics; in other words, there are various reasons for the use of a specific text formatting. Non-visual information in documents includes text style, like title, subtitles, header/footer, footnote, captions, etc. Most current Text-to-Speech systems do not include effective provision of the semantics and the cognitive aspects of the visual and non-visual document elements. Recently, there has been an effort towards Document-to-Speech synthesis, supporting the extraction of the semantics of document metadata [14] and the efficient acoustic representation of both text formatting [89,99–102] and tables [78,79] through modelling the parameters of the synthesised speech signal.

4.3.3. Voice access to the computer
In the last couple of decades, extensive research has been carried out on computer accessibility options for the benefit of people with vision loss. In this section we attempt to present, in a non-exhaustive way, the current state of knowledge of spoken access to windows-based Graphical User Interfaces. Alternatives to the usual desktop metaphor have been introduced [96], the most important among them being ROOMS [21,66] and Audio Rooms [60]. Afterwards, the concept of dual user interfaces was developed [70] and demonstrated by the HOMER User Interface Management System [67,71]. Stephanidis [82] has introduced the concept of Unified User Interfaces (UUI), i.e. interfaces that are self-adapting to user and usage context, either using offline knowledge or real-time adaptivity. This approach follows the principles of Inclusive Design and Universal Design [69]. The theoretical approach [73,75], the design methodology [74,76], an appropriate architectural framework [68] for the development of UUI, and tools that facilitate and support the design and implementation phases [1,72] are well documented in the literature. Four levels of adaptation were identified in UUI:
– semantic (internal functionality, information represented),
– syntactic (dialogue sequencing, syntactic rules, user tasks),
– constructional,
– physical (devices, object attributes, interaction techniques).
According to this approach, automatic adaptation at any level implies a polymorphism feature, i.e., for a given task, designing alternative interactive artefacts according to different attribute values of the user and usage context [85]. Based on the above UUI paradigm, a number of accessible Public Access Terminal (PAT) prototypes were developed and evaluated by FORTH (Foundation for Research and Technology – Hellas). First was the AVANTI web browser, deployed as an example of the accessible kiosk metaphor [83]. NAUTILOS [85] was an information kiosk enabling accessibility by blind users. Its interface supports multiple languages, offering Braille and synthetic speech in dual interface mode, i.e. both the visual and the non-visual browsers are displayed concurrently, with synchronisation of the loaded web site. The PALIO framework was implemented as a self-adaptive tourism information system [84,103]. The ARGO system [81] has been developed in order to operate as an information point in public places. ARGO supports visual and non-visual interaction in order to satisfy the requirements of blind users and users with vision problems. To facilitate blind users in locating the system and learning how to use it without any assistance, a camera is attached to the system, and when motion is detected, instructions for use are announced.

EZAccess [57,92] consists of a set of interface enhancements developed by the TRACE Center [88] which can be applied to electronic products and devices so that they can be used by more people, including those with visual disabilities. Among the main EZ features are:
– Voice + 4 button navigation gives complete access to any onscreen controls and content. This feature also provides feedback and information in a logical way, such that it can be used by both sighted and non-sighted users. Typical items include onscreen text, images and controls.
– Touch talk lets users touch onscreen text (and graphics) to hear them read (or described) aloud.
– Button help provides a way for users to instantly identify any button on the device. At any time, a person can see and/or hear any button's name and status. They can also get more information about what that button can be used for.
– Layered help provides context-sensitive information about using the device. If a person needs more help, it suffices to press the help button repeatedly, receiving more information each time.
The EZAccess methodology can be applied to a wide range of interactive electronic systems, from
public information and transaction machines, such as kiosks, to personal handheld devices, like remote controls, mobile phones and PDAs [91]. By using EZAccess, developers can design Public Access Terminals (PATs) that are usable not only by more people, but also in a wider range of environments and contexts. EZAccess principles have been deployed in a number of prototype Public Access Terminals and electronic voting machines [56,93].
5. Voice browsers and voice portals

Interacting with a web page using voice can be achieved to different extents. Voice output alone allows speech rendering of the web page, using some of the page's structural features. A Screen-Reader, for example, may just read out the web page contents. However, in general, this is not the most convenient approach, because the user will have little chance to obtain an early overview of the page contents. Using voice input as well allows a user to intervene dynamically in the reading, issuing commands and selecting options to get the desired reading and to operate web services. A voice browser is software capable of interpreting and executing voice mark-up languages, which allow the generation of speech output and accept speech input as well. The W3C (World Wide Web Consortium) is undertaking a Voice Browser activity [95] and has recently relaunched the Voice Browser Working Group, aiming at standardizing languages for capturing and producing speech and for managing the dialog between users and computers [40]. The Voice Browser Working Group has created the W3C Speech Interface Framework suite of specifications, which includes the VoiceXML Recommendation (Voice Extensible Mark-up Language), which specifies the flow control and exchange of information between users and computers; the SRGS Recommendation (Speech Recognition Grammar Specification), which specifies the words and phrases that a speech recognition system can convert from speech to text; the SSML Recommendation (Speech Synthesis Mark-up Language), which specifies how to render text as human-like speech by a speech synthesis system; the Pronunciation Lexicon, which specifies how words are pronounced; and CCXML (Call Control Extensible Mark-up Language), which specifies how to manage the telephone system (answer incoming calls, initiate outgoing calls, create conference calls, etc.).
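To give a flavour of how these specifications combine, the sketch below programmatically builds a very small VoiceXML form (a single field with a spoken prompt and a reference to an SRGS grammar) and prints the resulting markup; the dialog content, grammar file and target URL are invented for illustration, and a VoiceXML platform, not shown here, would interpret the document.

```python
# A minimal sketch of generating a small VoiceXML document. The "city" field,
# the grammar file name and the submit target are illustrative only.
import xml.etree.ElementTree as ET

vxml = ET.Element("vxml", version="2.1", xmlns="http://www.w3.org/2001/vxml")
form = ET.SubElement(vxml, "form", id="weather")
field = ET.SubElement(form, "field", name="city")
prompt = ET.SubElement(field, "prompt")
prompt.text = "For which city would you like the weather forecast?"
ET.SubElement(field, "grammar", src="cities.grxml",
              type="application/srgs+xml")        # SRGS grammar reference
filled = ET.SubElement(field, "filled")
ET.SubElement(filled, "submit", next="/forecast")  # hand the result to a server

print(ET.tostring(vxml, encoding="unicode"))
```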
VoiceXML is designed for creating audio dialogs that feature synthesized speech, digitized audio, recognition of spoken and DTMF key input, recording of spoken input, telephony, and mixed-initiative conversations. SSML defines a mark-up language for prompting users via a combination of pre-recorded speech, synthetic speech and music. The user can select voice characteristics (name, gender and age) as well as the speed, volume, pitch and emphasis. There is also provision for overriding the synthesis engine's default pronunciation [42].

A voice portal is a World Wide Web portal (or gateway) that can be accessed entirely through spoken dialogue technology over the common fixed or cellular telephone. Ideally, any type of information or transaction found on the WWW could be accessed through a voice portal. Typical services offered by voice portals include phone directory, weather information, travel and hotel bookings, e-mail reading or sending, stock exchange information or transactions, banking, e-government transactions, etc. Voice portals are modern interactive voice response (IVR) systems (old-style IVR mostly used touch-tone input: “press 1 for sales, 2 for service”, etc.). VoiceXML provides a simple interface between an IVR system and other XML applications, such as Web sites, allowing, for example, the spoken retrieval of information from databases. Many companies offer development systems for the preparation of web and telephone applications using the speech interface [6,19,77].

Parsing of a web page's content is a necessary step on the way towards a desired speech rendering. Through careful analysis, the structure or outline of a web page can be discovered and used as a table of contents for speech rendering. In a former project developed for Portuguese in cooperation between LSS and DI-UM, but applicable to any other language (AudioBrowser, executed during 2003–2005) [3,46], this idea was implemented with interesting success. The idea is also addressed in the Emacspeak package [22], in the form of a tool for outline processing of files. In AudioBrowser, a special web browser was developed with the capability of displaying separate but related frames of information. The created (or existing) table of contents of the page is displayed in one frame, the original page's linked contents in a second one, and the portion under observation in a third one with magnification (for low-vision users). Figure 8 depicts a screen with the mentioned frames.
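For plain HTML input, the outline-extraction step can be approximated with nothing more than the standard library's HTML parser, as in the sketch below, which collects the page headings as a table of contents that a voice browser could read out; this is only an illustration of the idea, not the AudioBrowser implementation.

```python
# A minimal sketch of extracting a web page outline (its headings) to serve
# as a spoken table of contents.
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.outline = []          # (level, heading text) pairs
        self._current = None       # heading level currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = int(tag[1])

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.outline.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._current = None

page = "<h1>City guide</h1><p>intro</p><h2>Getting there</h2><h2>Museums</h2>"
parser = OutlineParser()
parser.feed(page)
for level, text in parser.outline:
    print("  " * (level - 1) + text)
```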
Fig. 8. A representation of the frames of the AudioBrowser accessibility software. On the left frame, the structure or outline of the webpage is presented and on the right the contents of the page are linked to the items on the left. Navigation and speech output can easily be toggled from one frame to the other.
On the left frame, the obtained document outline is represented, and links or forms are viewable by selection through the keyboard or by the tabs that exist on the left of each frame. General-purpose menus are kept on the upper bar, near the URL display box. The user of this application can freely navigate inside the contents of each frame or jump between frames, from contents to table of contents or vice-versa, in order to scan or navigate through the page in a more structured and friendly way. The Text-to-Speech device constantly assists a blind or low-vision user, following the navigation accurately. This browser accessibility model is less general but more structured, gaining in terms of speed and clarity of browsing. It may nevertheless fall back to the traditional approach of page voice rendering if, by user option, the discovered structure is not followed. Voice manipulation in this case is even more important, in order to signal the switching between windows. It can be achieved with one or with two voices in different speech modification styles.

The W3C consortium, besides the above-mentioned Voice Browser Working Group, has been issuing, through its Web Accessibility Initiative (WAI), a relevant set of Web Content Accessibility Guidelines (WCAG), currently in version 2. These guidelines are greatly helpful in orienting web page design for accessibility [43]. The Authoring Tool Accessibility Guidelines (ATAG), also in version 2.0, are likewise important for developers of authoring tools. Testing is one of the best ways to check that a web page is accessible, even if a sighted developer does this testing. Coding errors, images with missing ALT text, spelling mistakes and grammatical errors are a
few issues that are easily detected by such testing [38], although many of them go unnoticed by visual inspection. Currently, a number of publicly available web browsers claim compliance with the voice browser concept, namely with the voice mark-up languages; Opera [35] is one of the browsers claiming conformity with the voice browser and screen reader concepts. Recently, spatially enriched browsing shortcuts, along with an appropriate framework, have been introduced in order to improve information seeking for blind people [51–53,65,86]. Voice browsers have also been developed for the mobile user [5].
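A small part of such testing can be automated. The sketch below, a minimal illustration built on Python's standard html.parser module, lists img elements that lack an ALT attribute; it covers only one of the many checks implied by the WCAG guidelines and is not a complete evaluation tool.

```python
# Minimal sketch: flag <img> elements without an alt attribute in an HTML page.
# This covers only one of the many WCAG checkpoints mentioned above.
from html.parser import HTMLParser

class MissingAltChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []          # (line, column) of offending <img> tags

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_names = {name for name, _ in attrs}
            if "alt" not in attr_names:
                self.missing.append(self.getpos())

SAMPLE = '<p><img src="logo.png"><img src="map.png" alt="Site map"></p>'

checker = MissingAltChecker()
checker.feed(SAMPLE)
for line, col in checker.missing:
    print(f"img without alt text at line {line}, column {col}")
```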
6. Speech technologies in mobility systems

Mobility systems are essential contributors to the independence of vision-disabled persons. Navigation is a natural activity that requires support in various ways [16,94]. Short-range mobility is ideally covered by personal capabilities, with or without special equipment beyond the cane, such as an obstacle detector. Outdoor or indoor navigation may be supported by an adequately organized information system, capable of providing an effective flow of information in response to the user's choices. The preferred communication medium for blind or low vision persons is the audio channel [64], which can be made available at certain information points and/or through a personal wireless communicator.

The built environment and transportation facilities are particularly important to consider for the accessibility of a blind or low-vision person, because in these environments the person is confronted with artificial barriers arising from space restrictions, technological devices, the complexity of the spaces and health or injury risks. On the other hand, these locations are unavoidable and needed for all the purposes of daily life, social inclusion and individual needs and rights. Information resources describing the features of the locations and the available guidance must therefore be provided. In the planning phase of travelling the information resources are the most important ones; in the actual travelling phase the guidance resources become dominant.

How can the audio or speech interfaces be used for this? To answer this question, a short discussion of some related daily-life paradigms is useful [17]. The first one is the travel guide (book) paradigm. A travel guide allows the user to access information about a certain location, enabling the planning of a trip, building
an itinerary and making a list of actions to accomplish. The second paradigm is the human tour guide, who acts as a guide and travel companion; excellent communication skills and a fluent knowledge of the features of the location are required. The third paradigm is the virtual tour. In this type of tour the relevant information is available in order to reconstruct the simulated space or location by computational means and artificial representations, in such a way that the user can explore it at will with as much sensorial completeness as possible.

The main difficulties in travelling in the built environment and in transportation systems arise from the need to identify, at any moment of the travel, the sites that are referenced in the travel plan and the directions to reach the next ones: in general each site has a number of signs, direction indications and other information plates that help persons who can see them to navigate. Blind or low-vision persons have total or partial difficulty in accessing this visual information. A second difficulty is accessing the complex information involved in each location. Generally this is solved by means of the travel guide, the tour guide or local information panels. Unfortunately neither the travel guide-books nor the tour guide are an accessible solution. Therefore an alternative must be found: a system that combines the ideas of the virtual tour, the tour guide and the travel guide to build a new information, guidance and navigation system. Speech is the main medium involved. For everyone, it is one of the possible media for accessing the travel guide, but for blind persons it may be the only one. The tour guide is now a virtual one, interacting with the user through speech and providing the guidance for the actual travelling in the spaces. The virtual tour is also accessible through speech and allows the user to discover the detailed navigation information relevant to the intended travel. Speech directions are, however, not totally unambiguous, and some other physical cues may be necessary at each site, depending on its complexity; tactile marks on the ground are an example. Another crucial aspect of the virtual tour concept that is embedded in this application is user tracking, which is a way to co-ordinate the guide book, the tour guide and the virtual tour. User tracking technologies exist that may be used. The required connectivity between the system's infrastructure and the user should be wireless.

The system should provide speech-mediated access to information and to guidance. Dialogue system control, speech input and speech output are the essential interface elements of the system. In terms of privacy, speech input may need an alternative in public premises. The types of locations involved in this concept are mainly public places away from home or the workplace, which are generally too familiar to the user to need the contribution of this system. Built environments, including transportation facilities and vehicles, are the most relevant, but guidance may be extended to outside locations. In Fig. 9, an example is given of a situation in which personal area networking communication (PAN – a short-range wireless communication infrastructure capable of automatically logging in an approaching user and providing a specific network service) allows the exchange of transport information that is essential in the case of a blind person waiting at the bus stop. Personal area networking is a way of creating the required wireless connectivity.

Fig. 9. Personal area networking communication allows the exchange of transport information in the case of waiting at the bus stop.

Some international R&D projects have appeared targeting this general type of problem. One is particularly interesting and comprehensive and addresses, amongst others, this type of need of disabled users [2]. Situations for the use of this system may be found at almost every time of day in professional or private life. Two accessibility projects, INFOMETRO and NAVMETRO, sponsored by the Portuguese Government Program POSConhecimento [12], started recently in Porto, Portugal, aiming to build an information and navigation infrastructure for the blind clients of Metro do Porto, S.A., the company that runs the underground transportation system of the region. The main concept is to provide access, through the mobile phone, to the visual information that is available to the clients who can see, employing dialogue systems, speech recognition and speech synthesis technology.
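To illustrate the bus-stop scenario of Fig. 9, the sketch below shows one possible shape of the information exchange. It assumes a hypothetical information-point object that is notified when the short-range link detects a registered user; all class names, method names and timetable data are illustrative only and do not correspond to the INFOMETRO/NAVMETRO systems or to any deployed service.

```python
# Hypothetical sketch of a bus-stop information point in the PAN scenario of Fig. 9.
# Names, data and the "device detected" trigger are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class Departure:
    line: str
    destination: str
    minutes: int

class BusStopInfoPoint:
    def __init__(self, stop_name: str, departures: List[Departure]):
        self.stop_name = stop_name
        self.departures = departures

    def on_user_detected(self, user_id: str) -> str:
        """Called when the short-range (PAN) link logs in an approaching user.
        Returns the text that a Text-to-Speech engine would speak to that user."""
        lines = [f"Welcome to stop {self.stop_name}."]
        for dep in sorted(self.departures, key=lambda d: d.minutes):
            lines.append(
                f"Line {dep.line} towards {dep.destination} "
                f"departs in {dep.minutes} minutes."
            )
        return " ".join(lines)

# Example use with made-up timetable data.
stop = BusStopInfoPoint("City Hospital", [
    Departure("205", "Campanhã", 4),
    Departure("502", "Matosinhos", 9),
])
print(stop.on_user_detected("user-42"))
```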
7. Speech-based accessibility in the classroom
The main requirements of blind students for an effective inclusive educational system are given in the first column of Table 2. Nowadays, information technologies offer significant total or partial solutions to fulfill them. As shown in Table 2, almost all of these solutions incorporate one or more key speech technologies, i.e. speech compression, speech synthesis and speech recognition. Moreover, advanced methodologies, such as emotional speech synthesis [89], Document-to-Speech [49,90], Concept-to-Speech [4] and lecture mining by speech recognition, have recently been applied [18,47].

Nowadays, e-books surpass printed books by providing accessible browsing, navigation, searching, highlighting and multimedia facilities to the reader. The content is rendered on the basis of embedded meta-data describing the structural and layout properties that shape the visual presentation. E-books that use speech technologies can be classified into:

– Digital Audio Books (DABs) and
– Digital Talking Books (DTBs).

In the case of DABs the content is pre-recorded as digitally compressed speech and is reproduced through a common portable or desktop device such as a CD player, an MP3 player or a Personal Computer. DTBs provide audio presentation of the content through a Text-to-Speech synthesizer and, in addition, support special tagging of the content in order to facilitate navigation and audio synchronization with the text. Automated Electronic Text Reading Devices or common Personal Computers can be used for the reproduction of the content of DTBs.
Table 2
Speech-based accessibility in the classroom: requirements, information technology solutions and embedded speech technologies (each solution embeds one or more of speech compression, speech synthesis and speech recognition)

Requirement: Access to the printed educational material: principal (text-books) and peripheral (encyclopedias, journals, newspapers)
Information technology solutions:
– PC with Screen-Reader & Text-to-Speech
– Printed Text Automated Reading Device
– Digital Audio Books
– Digital Talking Books
– Electronic Text Automated Reading Device

Requirement: Access to the whiteboard and the projections in the classroom (overhead & data projectors)
Information technology solutions:
– Smart whiteboard with OCR & Text-to-Speech

Requirement: Writing in the classroom, at home and in examinations
Information technology solutions:
– PC with Screen-Reader & Text-to-Speech
– Talking Calculators
– Talking Scientific Calculators
– Talking Dictionaries
– Talking Braille Note-Takers
– Talking (word) Translators

Requirement: Access to the electronic educational material, electronic libraries, the content of the WWW and the educational software applications
Information technology solutions:
– PC with Screen-Reader & Text-to-Speech

Requirement: Navigation to the structured places and related information (shingles)
Information technology solutions:
– Talking Electronic Canes
– Talking Indoor Navigation System
– Talking Barcode Readers
– Talking RFID Tag Readers

Requirement: Access to interpersonal communication (e-mail, SMS, e-chatting)
Information technology solutions:
– Accessible Mobile Cell-Phone
– Accessible PC
The complexity of accessing DTBs in a user-friendly way has led to the proposal of several guidelines and specifications in the field of electronic content accessibility. The NISO/ANSI standard [7] has been proposed for the creation of Digital Talking Books. This standard is supported by the DAISY consortium and specifies access, for visually impaired and print-disabled readers, to a faithful representation of a published work, equivalent to that available to readers of the original printed publication.
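To make the DAB/DTB distinction concrete, the following reduced sketch models the kind of structure a DTB player works with: text fragments tagged with a heading level and paired with either a pre-recorded clip or a Text-to-Speech fallback, plus a navigation primitive that jumps between headings. It only mirrors the concept and does not implement the NISO/ANSI (DAISY) format itself.

```python
# Reduced illustration of the Digital Talking Book idea: tagged text fragments,
# optionally paired with pre-recorded audio, navigable by heading level.
# This does NOT implement the NISO/ANSI (DAISY) format; it only mirrors the concept.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Fragment:
    level: int                 # 0 = plain text, 1 = chapter heading, 2 = section heading
    text: str
    audio_clip: Optional[str]  # path to a pre-recorded clip, if any

def render(fragment: Fragment) -> str:
    """Decide how a fragment would be voiced: play the recorded clip when present,
    otherwise fall back to Text-to-Speech of the tagged text."""
    if fragment.audio_clip:
        return f"[play {fragment.audio_clip}] {fragment.text}"
    return f"[TtS] {fragment.text}"

def next_heading(book: List[Fragment], position: int) -> int:
    """Navigation primitive: jump from the current position to the next heading."""
    for i in range(position + 1, len(book)):
        if book[i].level > 0:
            return i
    return position

book = [
    Fragment(1, "Chapter 1. Introduction", "ch1_title.mp3"),
    Fragment(0, "This chapter presents the basic notions.", None),
    Fragment(2, "1.1 Background", None),
    Fragment(0, "Earlier work is summarized here.", None),
]

pos = 0
print(render(book[pos]))
pos = next_heading(book, pos)     # skip ahead to "1.1 Background"
print(render(book[pos]))
```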
8. Speech technologies in digital television

Access services are provided along with broadcasts in order to improve accessibility for people with disabilities, moving towards an Inclusive Television [80]. Access services can be provided either as open services (provided to all) or as closed services (which can be turned on/off by the user). The most common access services for blind and partially sighted persons are:

– Audio Description (AD),
– Talking Subtitling,
– Talking Electronic Program Guide (EPG),
– Talking Teletext.
Audio Description (AD) is a technique for annotating television broadcasts with spoken narration of the screen background, actions and expressions, so that blind people get a description of what is going on visually in a scene, delivered mainly during gaps in the programme dialogue and effects. Normally this is done by sending additional hidden information along with the broadcast, which is then decoded into a spoken description. In the US the corresponding service is usually called Video Description (or Descriptive Video Service) and is applied to video, movies and TV broadcasts. There has been a period of successful provision of AD by public service and commercial broadcasters, via both satellite and terrestrial Digital Television (DTV), also referred to as Digital Video Broadcasting (DVB) or Digital Broadcasting Television (DBT). Although TV audio description trials started in the 1990s, the service has only really become available to “viewers” of digital television from 2001 onwards; audio description on video has been available for about thirteen years and in cinemas for just seven. During the authoring of AD, speech recognition can be used for the automatic production of the narration text, and advanced multi-voice Text-to-Speech synthesis can replace the rather costly use of a human audio describer. In previous years, analogue audio description was provided:
– using a single channel of a stereo audio pair (as in Germany), thus forcing the normal programme to be broadcast in mono, and
– broadcasting the audio description via AM radio (as in Italy, by RAI).

For digital delivery, two options are available:

– an extra DVB audio channel (pre-mixed), also called “Broadcaster Mixed AD” or Narrative, and
– DVB receiver-mix audio AD.

The pre-mixed solution typically uses 192–256 kbit/s, rather than the 64 kbit/s used for the mono description channel, in order to maintain the audio clarity of the programme sound. This makes it unpopular for use in terrestrial DVB, which has bandwidth limitations when compared with satellite DVB. The capacity may also be traded against other services; for example, the capacity of multiple AD channels may instead be used to enhance or add a video channel. Another possibility in DVB is to broadcast only the text of the description, with the Text-to-Speech synthesis being performed by the DVB receiver.

Fig. 10. Delivery of spoken subtitling.

Subtitles originated with silent movies and were naturally adopted for language translation. All past and present generations of Set-top Boxes (STBs) or DBT receivers can render digital subtitles. The delivery of subtitles in spoken form is essential for blind and some low vision individuals. In some countries where a substantial part of the TV programming is in a foreign
language, synthetic speech is generated automatically from translation subtitles. This service is commonly called Spoken subtitling or Audio subtitling. Speech recognition is used, not for automatic recognition of the programme sound, but to allow subtitlers to (re)speak the words to be subtitled. The use of controlled acoustic conditions takes away most of the problems caused by background noise. At the same time, the recognition constraints are eased, because the software can be trained for the subtitlers using it, and specialised vocabularies can be selected to optimize recognition (e.g. a “sports – volley” vocabulary). Also Text-to-Speech synthesis can be used during the authoring of talking subtitles. Furthermore, embedded TtS synthesis can be used in the DVB receiver (Fig. 10) [8]. With the spread of digital TV, Electronic Programming Guides (EPGs) are becoming more and more common. There is currently no standard on EPG design, which means the presentation of EPGs varies widely. For people with disabilities, the increasing use of graphics, combined with the lack of a common approach, is bad news. A positive factor is that the information, in principle, should be easy to parse as, typically, it is sent in text format instead of video. This leads to the concept of the “talking EPG”, which basically would require a TtS converter (synthesizer) at the user’s end, plus an agreed data format from the broadcaster. There are special consumer products available that take the textual data from traditional Teletext magazine
Table 3
Access services and speech technologies: possible use of Text-to-Speech (TtS) and speech recognition in the authoring and in the delivery of each service

Audio Description: authoring – TtS and speech recognition; delivery – TtS
Talking Subtitling: authoring – TtS and speech recognition; delivery – TtS
Talking Electronic Program Guide: delivery – TtS
Talking Teletext: delivery – TtS
Control and Navigation: delivery – TtS and speech recognition
information services and generate synthesized audio for visually impaired users. Digital television offers new opportunities for delivering such information services through newer, more creative and flexible technologies (e.g. OpenTV, MHEG-5 and MHP) rather than Teletext. In principle, new commercial products can be designed to perform, with digital text, a function similar to that of the Teletext “readers”. Another major challenge will be for the designers of new text services to ensure that the navigational structure of any new digital text service (which is now typically menu-driven rather than accessible by page number) will not preclude easy and user-friendly access to information for the visually impaired user [20]. Spoken feedback is crucial for blind users in order to confirm that a DBT device is functioning correctly and to assist during all forms of device setup and application navigation. Table 3 summarizes the possible use of speech technologies in the authoring and the delivery process of accessible DBT services. The DVD (Digital Versatile Disc) is not normally considered as DTV, but it is linked to broadcast content. The DVD is a key medium for delivering access-enabled content to the blind and low vision communities, and it is quite common for DVDs to carry multiple language audio tracks; Audio Description is available on some DVDs as an alternative language track called Narrative or Description. A successful implementation of access services requires consistency of technology and user interface even across non-broadcast platforms.
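The receiver-mix delivery option described earlier in this section can be pictured as a simple signal-processing step in the set-top box: the low-bit-rate mono description channel is added to the programme sound, with the programme attenuated ("ducked") while the describer speaks. The NumPy sketch below illustrates only that mixing step; equal length and sample rate of the two signals, the activity threshold and the duck factor are simplifying assumptions, not part of any DVB specification.

```python
# Simplified illustration of DVB "receiver-mix" audio description:
# the mono description channel is added to the programme sound, and the
# programme is attenuated (ducked) wherever the description is active.
# Equal length, equal sample rate and the duck factor are assumptions.
import numpy as np

def receiver_mix(programme: np.ndarray, description: np.ndarray,
                 duck_gain: float = 0.35) -> np.ndarray:
    """Mix a mono AD channel into mono programme audio with simple ducking."""
    active = np.abs(description) > 0.01          # where the describer is speaking
    gain = np.where(active, duck_gain, 1.0)      # reduce programme level there
    mixed = programme * gain + description
    return np.clip(mixed, -1.0, 1.0)             # avoid clipping of the sum

# Tiny synthetic example: 1 s of programme audio, description in the middle 0.5 s.
rate = 8000
t = np.linspace(0, 1, rate, endpoint=False)
programme = 0.5 * np.sin(2 * np.pi * 220 * t)
description = np.zeros_like(programme)
description[rate // 4: 3 * rate // 4] = 0.4 * np.sin(2 * np.pi * 440 * t[: rate // 2])
print(receiver_mix(programme, description).shape)
```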
9. Conclusions

The main objective of the present paper has been to review the status of the work directed towards alleviating the accessibility barriers posed to blind and vision-disabled persons in accessing printed, written or visual information through the use of the speech channel.

Speech-enabled, talking, speech-activated or spoken-dialogue devices using either pre-recorded, concatenated or synthesized speech may already be found and used in many situations in healthcare, communication, environmental control, leisure and education activities, indoor and outdoor navigation and personal organization. Automated reading machines may provide an answer to the need for a spoken version of printed or digital/electronic content.

The personal computer, its applications and the content stored in the machine or on the WWW can also be made accessible to visually disabled persons by means of the speech channel, mainly through the speech synthesizer in collaboration with the Screen-Reader software, or talking browsers, and the automatic speech recognizer. Universal information formats are available that allow an easy application of these principles. The W3C consortium has developed a number of tools and annotation and processing languages, normally XML dialects, like VoiceXML, SSML, SVG, MathML, etc., that, together with structural tools in the form of style sheets (CSS and ACSS, for example), support the development of document formats and the creation of accessible documents, readily handled by current browsers and Screen-Readers. Technical documents are also becoming accessible with the continuous development of specific reading software; this is, for instance, the case of mathematical and other related technical and graph-like contents. Further development is heading towards improving the navigation of the user in complex web pages.

Mobility systems may also employ the speech channel of a personal communication device to inform the blind person about the physical environment, in order to allow safe, autonomous navigation outdoors and even indoors. They also help to access most public facilities and services, as far as the environment and the sites are equipped with the appropriate systems, which are already being tested at laboratory or prototype level.

Educational environments and activities are also widely supported by speech technology solutions. Accessibility to the printed material, to the whiteboard, to
the electronic documentation resources and to the pedagogical systems is also possible nowadays by employing speech and Braille. Last but not least comes the entertainment domain. Digital television can be accessed by means of audio subtitling of broadcasts, films, live scenes, teletext and other resources, and by speech-actuated menus and navigation software in the receivers. One main general conclusion the authors wish to share is that the already proven speech communication channel keeps growing in terms of the coverage of information it conveys and of the daily-life interfacing situations it supports for blind and low-vision persons, and is contributing to the substantial alleviation of many important barriers. The proliferation of separate devices for each of the situations to be covered is a problem that should be handled through a stronger technical convergence of the different electronic means into a reduced number of devices: possibly one communicator device will do most of the mobile work in the future. Accessibility-aware design and development is the main principle that should be reinforced, so that solutions are disseminated amongst manufacturers and service providers and accessibility may be uniformly granted, at least in public environments, facilities and resources.
References [1]
[2]
[3] [4]
[5]
[6] [7] [8]
D. Akoumianakis and C. Stephanidis, USE-IT: A Tool for Lexical Design Assistance in: User Interfaces for All – Concepts, Methods, and Tools, C. Stephanidis, ed., Lawrence Erlbaum Associates, Mahwah, NJ, 2001, pp. 469–487. Ask-IT – Ambient Intelligence System of Agents for Knowledge-based and Integrated Services for Mobility Impaired users. EU FP6-IST e-Inclusion Project. (search in http://cordis.europa.eu/search). Audiobrowser Project, FCT (POSI/SRI/41952/2001), Portugal. J. Calder, A.C. Melengoglou, C. Callaway, E. Not, F. Pianesi, I. Androutsopoulos, C.D. Spyropoulos, G. Xydas, G. Kouroupetroglou and M. Roussou, Multilingual Personalized Information Objects, in: Multimodal Intelligent Information Presentation, O. Stock and M. Zancanaro, eds, Springer, Series: Test, Speech and Language Technology, Vol. 27. ISBN: 1-4020-3049-5, 2005, pp. 177–201. X. Chen, M. Tremaine, R. Lutz, J. Chung and P. Lacsina, AudioBrowser: a mobile browsable information access for the visually impaired, Journal of Universal Access in the Information Society 5 (2006), 4–22. M. Cohen, J. Giangola and J. Balogh, Voice User Interface Design, Addison Wiley, 2004. DAISY/NISO, http://www.daisy.org/z3986/. Frans de Jong, Access Services for digital television, EBU Technical Review, October, 2004.
[9] [10]
[11]
[12]
[13]
[14]
[15] [16]
[17]
[18]
[19] [20]
[21]
[22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36]
R. Diehl, A. Lotto and L. Holt, Speech perception, Annual Revue of Psychology 55 (2004), 149–179. C. Duarte and L. Carric¸ o, Conveying Browsing Context Through Audio on Digital Talking Books, Lecture Notes in Computer Science 4556 (2007), 259–268. D.N. Alistair Edwards and E. Mitsopoulos, A principled methodology for the specification and design of nonvisual widgets, ACM Transactions on Applied Perception 2(4) (2005), 442–449. K. Fellbaum and D. Freitas, Speech Processing, in: Towards an Inclusive Future, P. Roe, ed., COST, Brussels, 2007, pp. 24–42. K. Fellbaum and G. Kouroupetroglou, Principles of Electronic Speech Processing with Applications for People with Disabilities, Assistive Technologies, Technology and Disability 20 (2008), 55–85. F. Fourli-Kartsouni, K. Slavakis, G. Kouroupetroglou and S. Theodoridis, A Bayesian Network Approach to Semantic Labelling of Text Formatting, in: XML Corpora of Documents, Lecture Notes in Computer Science 4556 (2007), 299–308. S. Furui, Digital Speech Processing, Synthesis and Recognition, Marcel Dekker, 1989. V. Gaudissart, S. Ferreira, C. Thillou and B. Gosselin, SYPOLE: A Mobile Assistant for the Blind, in: International Conference on Speech and Computer (SPECOM), 2004, 538–544. F. Gaunet, Verbal guidance rules for a localized wayfinding aid intended for blind-pedestrians in urban areas, Universal Access in the Information Society 4(4) (April 2006), 328– 343. J. Glass, Timothy J. Hazen, S. Cyphers, I. Malioutov, D. Huynh and R. Barzilay, Recent Progress in the MIT Spoken Lecture Processing Project, Proc. EUROSPEECH 2007, 2553–2556. R. Harris, Voice Interaction Design, Elsevier, 2005. A. Helal, S.E. Moore and B. Ramachandran, Drishti, An Integrated Navigation System for the Visually Impaired and Disabled, International Symposium on Wearable Computers, Zurich, Switzerland, October, 2001, 149–156. D. Henderson and S. Card, Rooms, The use of multiple virtual workspaces to reduce space contention in a window based graphical user interface, ACM Transactions on Graphics 5(3) (1986), 211–243. http://emacspeak.sourceforge.net/. http://en.wikipedia.org/wiki/List of screen readers. http://lpf-esi.fe.up.pt/˜audiomath/. http://www.apple.com/macosx/features/voiceover/. http://www.dessci.com/en/products/mathplayer/tech/ accessibility.htm, (accessed on the 6/February/2008). http://www.fe.up.pt/si uk/PROJECTOS GERAL.MOSTRA PROJECTO?P ID=1099. http://www.freedomscientific.com/fs products/software jaws.asp. http://www.gwmicro.com/Window-Eyes/. http://www.lambdaproject.org. http://www.latex-project.org/. http://www.logicalsoft.net/Math.html (accessed on the 6/ February/2008). http://www.microsoft.com/enable/training/windowsxp/ narratorturnon.aspx. http://www.nfbcal.org/s e/list/0033.html, (accessed on the 6/February/2008). http://www.opera.com/. http://www.talkingrx.com/.
[37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47]
[48] [49]
[50] [51]
[52]
[53]
[54]
[55] [56]
[57]
[58] [59] [60]
http://www.tiresias.org/equipment/labelling/electronic labelling.htm. http://www.utexas.edu/research/accessibility/resource/faq. html#question9. http://www.viewplus.com, IVEO. http://www.w3.org/News/2007#item58 and http://www.w3. org/2006/12/voice-charter.html. http://www.w3.org/TR/WD-acss. http://www.w3.org/Voice/. http://www.w3.org/WAI/. http://www.webaim.org/simulations/screenreader.php. http://www.yourdolphin.com/productdetail.asp?id=5. https://repositorium.sdum.uminho.pt/bitstream/1822/761/4/ iceis04.pdf#search=%22audio browser%22. J. Huang, M. Westphal, S. Chen, O. Siohan, D. Pover, V. Libal and A. Soneiro, The IBM Rich Transcription Spring 2006 speech-to-text system for lecture meetings, Lecture Notes in Computer Science 4299 (2006), 432–443. X. Huang, A. Acero and H.-W. Hon, Spoken Language Processing, Prentice Hall, 2001. K. Ikospentaki, S. Vosniadou, D. Tsonos and G. Kouroupetroglou, HOMER: A Design for the Development of Acoustical-Haptic Representations of Document Meta-Data for Use by Persons with Vision Loss, Proc. of the Second European Cognitive Science Conference (EuroCogSci 2007), Delphi, 23–27, May, 2007, 912. D. Jurafksy and J. Martin, Speech and Language Processing, Prentice Hall, 2000. C. Kouroupetroglou, M. Salampasis and A. Manitsaris, A Semantic-web based framework for developing applications to improve accessibility in the WWW, in: Proceedings of the 2006 international cross-disciplinary workshop on Web accessibility (W4A), 2006, 98–108. C. Kouroupetroglou, M. Salampasis, A. Manitsaris, Browsing shortcuts as a means to improve information seeking of blind people in the WWW, Universal Access Information Society Vol. 6 (2007), 273–283. C. Kouroupetroglou, M. Salampasis and A. Manitsaris, The effects of spatially enriched browsing shortcuts on web browsing of blind users, Lecture Notes in Computer Science 4556 (2007). G. Kouroupetroglou and G. Nemeth, Speech Technology for Disabled and Elderly People, in: Telecommunications for All, Patrick Roe, ed., the European Commission – Directorate General XIII, Catalogue number: CD-90-95-712-ENC, 1995, 186–195. G. Kramer, ed., Auditory Display: Sonification, Audification, and Auditory Interfaces, Addison-Wesley, 1994. C.M. Law and G.C. Vanderheiden, EZ Access strategies for cross-disability access to kiosks, telephones and VCRs, CSUN’98, Los Angeles, CA, March, 1998. C.M. Law and G.C. Vanderheiden, The development of a Simple, Low Cost Set of Universal Access Features for Electronic Devices, The Association of Computer Machinery Conference on Universal Usability, Washington, DC, 2000, 118–123. M. McTear, Spoken Dialogue Technology: Toward the Conversational User Interface, Springer-Verlag, 2004. B. Moore, Psychology of Hearing, Elsevier, 2004. E. Mynatt and K. Edwards, Metaphors for nonvisual computing, chapter in the book: Extra-Ordinary Human-Computer Interaction, A. Edwards, ed., Cambridge University Press, 1995, pp. 201–220.
[61]
[62] [63] [64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
[73]
[74]
[75]
[76]
National Center for Accessible Media (NCAM), A Developer’s Guide to Creating Talking Menus for Set-top Boxes and DVDs, 2003, http://ncam.wgbh.org/resources/talkingmenus /index.html. I. Oakley and S. Brewster, eds, Haptic and Audio Interaction Design, Lecture Notes in Computer Science 4813 (2007). I. Pitt and A. Edwards, Designing Speech-based Devices, Springer, 2003. J. Rajam¨aki, P. Viinikainen, J. Tuomisto, T. Sederholm and M. S¨aa¨ m¨anen, LaureaPOP Indoor Navigation Service for the Visually Impaired in a WLAN Environment, in: Proceedings of the 6th WSEAS Int. Conf. on Electronics, Hardware, Wireless and Optical Communications, Corfu Island, Greece, February 16–19, 2007. M. Salampasis, C. Kouroupetroglou and A. Manitsaris, Semantically enhanced browsing for blind people in the WWW. HYPERTEXT ’05: Proceedings of the sixteenth ACM conference on Hypertext and hypermedia, 2005, 32–34. A. Savidis and C. Stephanidis, Building non-visual interaction through the development of the Rooms metaphor, in: Companion Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’95), Denver, Colorado, 7–11 May. ACM Press, New York, 1995, 244–245. A. Savidis and C. Stephanidis, Developing Dual User Interfaces for Integrating Blind and Sighted Users: the HOMER UIMS, in: Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’95), Denver, Colorado, ACM Press, New York, 7–11, May, 1995, 106–113. A. Savidis and C. Stephanidis, Development Requirements for Implementing Unified User Interfaces, in: User Interfaces for All – Concepts, Methods, and Tools, C. Stephanidis, ed., Lawrence Erlbaum Associates, Mahwah, NJ, 2001, pp. 441–468. A. Savidis and C. Stephanidis, Inclusive development: Software engineering requirements for universally accessible interactions, Interacting with Computers 18 (2006), 71–116. A. Savidis and C. Stephanidis, Integrating the Visual and Non-visual Worlds: Developing User Interfaces, in: Proceedings of RESNA ’95 Annual Conference, Vancouver, Canada, 9–14 June, RESNA Press Washington, 1995, 458– 460. A. Savidis and C. Stephanidis, The HOMER UIMS for Dual User Interface Development: Fusing Visual and Non-visual Interactions, International Journal of Interacting with Computers 11(2) (1998), 173–209. A. Savidis and C. Stephanidis, The I-GET UIMS for Unified User Interface Implementation in: User Interfaces for All – Concepts, Methods, and Tools, C. Stephanidis, ed., Lawrence Erlbaum Associates, Mahwah, NJ, 2001, pp. 489-523. A. Savidis and C. Stephanidis, The Unified User Interface Software Architecture, in: User Interfaces for All – Concepts, Methods, and Tools, C. Stephanidis, ed., Lawrence Erlbaum Associates, Mahwah, NJ, 2001, pp. 389–415. A. Savidis and C. Stephanidis, Unified User Interface Design: Designing Universally Accessible Interactions, International Journal of Interacting with Computers 16(2) (2004), 243– 270. A. Savidis and C. Stephanidis, Unified User Interface Development: Software Engineering of Universally Accessible Interactions, Universal Access in the Information Society 3(3) (2004), 165–193. A. Savidis, D. Akoumianakis and C. Stephanidis, The Unified User Interface Design Method, in: User Interfaces for All –
[77]
[78]
[79]
[80]
[81]
[82]
[83]
[84]
[85]
[86]
[87]
[88]
[89]
[90]
Concepts, Methods, and Tools, C. Stephanidis, ed., Lawrence Erlbaum Associates, Mahwah, NJ, 2001, pp. 417–440.
[91]
[92]
[93]
[94]
[95] [96]
[97] [98]
[99]
[100]
[101]
[102]
[103]
G.C. Vanderheiden, Everyone Interfaces, in: User Interfaces for All – Concepts, Methods, and Tools, C. Stephanidis, ed., Lawrence Erlbaum Associates, Mahwah, NJ, 2001, pp. 441– 468. G.C. Vanderheiden, C.M. Law and D. Kelso, EZ Access Interface Techniques for Anytime Anywhere Anyone Interfaces, CHI’99 (Computer-Human Interaction), Pittsburgh, PA, May 1999. G.C. Vanderheiden, Cross Disability Access to Touch Screen Kiosks and ATMs, in: Design of Computing Systems, Proceedings of the Seventh International Conference on HumanComputer Interaction (HCI International ’97, M.J. Smith, G. Salvendy and R.J. Koubek, eds, Elsevier, New York, 1997, pp. 417–420. A. Virtanen and S. Koskinen, NOPPA Navigation and Guidance System for the Visually Impaired, in: Proceedings of the 11th World Congress and Exhibition on ITS, Nagoya, Japan, October 2004. Voice Browser Activity, http://www.w3.org/Voice/. G. Weber, D. Kochanek, C. Stephanidis and G. Homatas, Access by Blind People to Interaction Objects in MS Windows, in: Proceedings of the 2nd European Conference on the Advancement of Rehabilitation Technology (ECART-2), E. Jacobsson, ed., Stockholm, Sweden, 26–28 May, Swedish Institute for the Handicapped, Stockholm, 1993, pp. 202– 204. S. Weinschenk and D. Barker, Designing Effective Speech Interfaces, John Wiley, 2000. Wolfram Research-Mathematica, The Mathworks-Matlab, Design Science – Mathtype, Mackichan Software – Scientific Workplace, etc. G. Xydas and G. Kouroupetroglou, Augmented Auditory Representation of e-Texts for Text-to-Speech Systems, in: Lecture Notes in Artificial Intelligence 2166 (2001), 134– 141. G. Xydas and G. Kouroupetroglou, Text-to-Speech Scripting Interface for Appropriate Vocalisation of e-Texts, in: Proceedings of EUROSPEECH 2001, Sept. 3–7, Aalborg, Denmark, 2001, 2247–2250. G. Xydas, D. Spiliotopoulos and G. Kouroupetroglou, Modelling Emphatic Events from Non-Speech Aware Documents in Speech Based User Interfaces, in: Human-Computer Interaction, Theory and Practice, Proceedings of HCI International 2003 – The 10th International Conference on HumanComputer Interaction, June 22–27, 2003, C. Stephanidis and J. Jacko, eds, Crete, Greece. Lawrence Erlbaum Associates, Inc., Mahwah, NJ, 2003, pp. 806–810. G. Xydas, G. Argyropoulos, Th. Karakosta and G. Kouroupetroglou, An Experimental Approach in Recognizing Synthesized Auditory Components in a Non-Visual Interaction with Documents, in: Proc of the 11th Int Conference on Human-Computer Interaction, Las Vegas, Nevada, USA, 22–27 July, 2005. V. Zarikas, G. Papatzanis and C. Stephanidis, An Architecture for a Self-Adapting Information System for Tourists, in: Proceedings of the Workshop on Multiple User Interfaces over the Internet: Engineering and Applications Trends (in conjunction with HCI-IHM’2001), Lille, France, 10–14 September, 2001.