
Technology and Disability 20 (2008) 97–107 IOS Press

Visualization of speech and audio for hearing impaired persons

Jonas Beskow, Olov Engwall, Björn Granström∗, Peter Nordqvist and Preben Wik

Centre for Speech Technology, School of Computer Science and Communication, Kungliga Tekniska Högskolan (KTH), Stockholm, Sweden

Abstract. Speech and sounds are important sources of information in our everyday lives for communication with our environment, be it interacting with fellow humans or directing our attention to technical devices with sound signals. For hearing impaired persons this acoustic information must be supplemented or even replaced by cues using other senses. We believe that the most natural modality to use is the visual, since speech is fundamentally audiovisual and these two modalities are complementary. We are hence exploring how different visualization methods for speech and audio signals may support hearing impaired persons. The goal in this line of research is to allow the growing number of hearing impaired persons, children as well as the middle-aged and elderly, equal participation in communication. A number of visualization techniques are proposed and exemplified with applications for hearing impaired persons.

Keywords: Sound classification, speech processing, hearing impairment, communication support, talking heads, speech reading, speech training

1. Introduction

Speech and sounds are important sources of information in our everyday lives for communication with our environment, be it interacting with fellow humans or directing our attention to technical devices with sound signals. For hearing impaired persons this acoustic information must be supplemented or even replaced by cues using other senses. We believe that the most natural modality to use is the visual, since speech is fundamentally audiovisual and these two modalities are complementary. We are hence exploring how different visualization methods for speech and audio signals may support hearing impaired persons. The goal in this line of research is to allow the growing number of hearing impaired persons, children as well as the middle-aged and elderly, equal participation in communication.

To reach this goal, our concept focuses on an integrated support that performs audio processing and presents the information using visualizations that are adapted to different types of tasks, as illustrated in Fig. 1. For every acoustic input, the signal is first analyzed to determine whether it is speech, an alarm signal, e.g., from the microwave oven or the telephone, or merely background noise. For speech signals, audio processing, such as noise reduction and speech enhancement, is performed to improve both the acoustic feedback given to the hearing impaired user and the input to the visualization module.

Several methods and solutions presented in this article will be implemented, tested, and evaluated in a recently started EU project named Hearing at Home (HaH). The goal of the project is to develop the next generation of assistive devices that will allow the growing number of hearing impaired persons – which predominantly includes the elderly – equal participation in communication and empower them to play a full role in the information society. The project focuses on the needs of hearing impaired persons in home environments.

∗Address for correspondence: B. Granström, Centre for Speech Technology, Lindstedtsv. 24, KTH, SE-100 44 Stockholm, Sweden. E-mail: [email protected].




Fig. 1. Concept outline for an audio/visual communication support system for hearing impaired persons. The incoming audio signal is pre-processed and analyzed (alarm signal detection, sound environment classification, phoneme recognition); the visual support then comprises sound illustration, visible articulation synthesis for speech perception support, and pronunciation error detection with articulatory inversion for speech production support.

Formerly separate devices such as the personal computer, Hi-Fi system, TV, digital camera, telephone, fax and intercom, and services such as internet access, VoIP, Personal Information Management (PIM), pay TV and home automation, grow together and become accessible via a TV set connected to a PC or set-top box (STB) that implements interfaces to network gateways as well as to home automation services. The TV thus becomes the central Home Information and Communication (HIC) platform of the household in the communication society.

This article first describes the audio processing used to detect alarm signals, classify the background sound environment and perform phoneme recognition of incoming speech. We then propose different visualization techniques and exemplify them with a number of visualization applications for hearing impaired persons.

2. Speech and audio pre-processing

The audio in the system can be delivered to the user either by loudspeakers or by a headset. In both solutions it is possible to pre-process the audio to better fit the individual hearing loss. The purpose of the pre-processing is to increase the intelligibility and the listening comfort for the user. The pre-processing also includes noise reduction methods in order to present the signal with minimal noise disturbance.

The following sections present further pre-processing methods that are used to extract relevant information from the audio signal. These solutions span from overall recognition of the listening environment to recognition of phoneme sequences.

2.1. Sound classification

Automatic sound classification has become an important area in various types of applications, for example surveillance systems, hearing aids, home automation, and communication support systems. The typical usage of a classification system is to support other functions, e.g. switching features on or off in a hearing aid or enabling/disabling functionality in a larger system.

A sound classification system consists of three components: a microphone, a feature extractor, and a classifier. The microphone receives a signal from a signal source that may have several internal states. Since it would be too demanding to use all the information in the signal, the task of the feature extractor is to extract the relevant information from it. The classifier consists of a number of statistical models that are used to calculate the likelihood that the current sound belongs to a certain category, for example babble noise.
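As an illustration of this three-component pipeline, the following minimal sketch extracts crude log band-energy features from framed audio and scores them against one diagonal Gaussian model per sound category. It is not the classifier used in the system described here; the feature bands, category names and test signals are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

def extract_features(signal, rate, frame_len=0.02):
    """Split the signal into frames and compute log band energies (a toy feature extractor)."""
    n = int(rate * frame_len)
    frames = signal[: len(signal) // n * n].reshape(-1, n)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(n), axis=1)) ** 2
    # Four coarse frequency bands as toy features; a real system would use e.g. mel or modulation features.
    bands = np.array_split(spectra, 4, axis=1)
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

class DiagonalGaussianClassifier:
    """One diagonal Gaussian per sound category; the most likely category wins."""
    def __init__(self):
        self.models = {}

    def train(self, label, features):
        self.models[label] = (features.mean(axis=0), features.var(axis=0) + 1e-6)

    def classify(self, features):
        def loglik(mean, var):
            return np.sum(-0.5 * (np.log(2 * np.pi * var) + (features - mean) ** 2 / var))
        return max(self.models, key=lambda lbl: loglik(*self.models[lbl]))

if __name__ == "__main__":
    rate = 16000
    rng = np.random.default_rng(0)
    t = np.arange(rate) / rate
    speech_like = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))  # crude stand-in for speech
    babble_like = rng.normal(size=rate)                                                # crude stand-in for noise
    clf = DiagonalGaussianClassifier()
    clf.train("speech", extract_features(speech_like, rate))
    clf.train("non-speech", extract_features(babble_like, rate))
    print(clf.classify(extract_features(speech_like + 0.1 * rng.normal(size=rate), rate)))
```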

J. Beskow et al. / Visualization of speech and audio for hearing impaired persons

Sound classification can be divided into two subcategories: sound environment classification and signal event detection. Sound environment classification is the process of detecting the overall acoustic situation, e.g. music, babble noise, traffic noise, or speech. Signal event detection is used when the task is instead to detect a particular signal that is generated by a well-defined source. These types of signals are defined as signal events since they have a relatively short duration and are used for a specific purpose, e.g. the door bell signal.

2.2. Sound environment classification

The speech perception and speech production visual support functionalities presented in this work are only meaningful, and should only be activated, when the audio input is speech. A fully automatic system supporting these two functions must therefore include at least a two-category sound classifier that labels the incoming sound as either speech or non-speech. The general principle for sound classification is presented in a separate work [23]. There are several approaches to building a robust sound environment classifier. One solution focuses on classifying background noises using features based on the modulation spectrum [17]. A similar solution is presented in another work [31]. A sound environment classifier based on hidden Markov models and delta-cepstrum features has also been investigated [24]. In another study, environmental noises are classified to increase context awareness [22]. It is also possible to classify and detect more specialized listening environments, for example the usage of the phone [25].

2.3. Signal event detection

The main challenge of signal event detection in home environments is that the system must be robust against background noises, e.g. vacuum cleaning and music. In the approach presented here it is assumed that at least one spectral peak from the signal is visible above the noise spectrum. The frequency and the duration of the spectral peaks for each signal event category are stored and compared against the current incoming sound.
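The spectral-peak idea can be sketched as follows: each event category is represented by the frequency of its dominant spectral peak and a typical duration, both measured from recordings in quiet, and an incoming sound is matched against these templates. The tolerances and template values below are illustrative assumptions, not the parameters of the detector evaluated in this article.

```python
import numpy as np

def dominant_peak(signal, rate):
    """Return the frequency (Hz) of the strongest spectral peak of a sound snippet."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / rate)
    return freqs[np.argmax(spectrum)]

def detect_event(signal, rate, templates, freq_tol=30.0, dur_tol=0.3):
    """Match the peak frequency and duration of a sound against stored event templates.

    templates: {name: (peak_frequency_hz, duration_s)} learned from recordings in quiet.
    Returns the matching event name, or None if no template is close enough.
    """
    peak = dominant_peak(signal, rate)
    duration = len(signal) / rate
    for name, (f_ref, d_ref) in templates.items():
        if abs(peak - f_ref) < freq_tol and abs(duration - d_ref) < dur_tol:
            return name
    return None

if __name__ == "__main__":
    rate = 16000
    templates = {"door bell": (880.0, 1.0), "microwave": (2000.0, 0.5)}  # hypothetical event models
    t = np.arange(int(rate * 1.1)) / rate
    ring = np.sin(2 * np.pi * 880 * t) + 0.3 * np.random.default_rng(1).normal(size=t.size)  # noisy bell
    print(detect_event(ring, rate, templates))  # -> "door bell"
```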

99

The signal event categories used in the results presented here were door bell, digital clock, mobile phone, and phone. The signal sources were placed in an apartment at four different locations: living room, TV room, bedroom, and hall. A microphone was connected to a standard PC and placed in the hall. The signal sources were recorded in quiet and the spectral peak characteristics were stored and used as models for the detection algorithm. After the training of the system, the same signals were presented again, but now together with background noise at various signal-to-noise ratios. The types of background noise were TV, music and home noise (e.g. vacuum cleaning or porcelain clattering).

The result from the evaluation is presented in Fig. 2. As expected, there is a strong dependency between the hit rate (the number of detected alarms over the number of generated alarms) and the signal-to-noise ratio. The system works well, with hit rates close to 100 percent, for signal-to-noise ratios above 0 dB. The performance decreases rapidly when the signal-to-noise ratio falls below 0 dB. It is interesting to notice that the hit rates for some of the signals are relatively high even at poor signal conditions. In particular, signals designed for mobile phones stood out and obtained high hit rates even at low signal-to-noise ratios.

The false alarm rate is also an important design variable. The false alarm rate is defined as the number of unwanted detections, caused by other sound sources, per time unit. This number must be kept low in order for the user to accept the system. In this evaluation no false alarms were detected. Another solution for recognition of acoustical alarm signals for the profoundly deaf uses cepstrum features and hidden Markov models [26].
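The two evaluation measures can be written down directly; the session numbers below are made up purely for illustration and are not results from the evaluation above.

```python
def hit_rate(detected_alarms, generated_alarms):
    """Fraction of generated alarm signals that the detector reported."""
    return detected_alarms / generated_alarms

def false_alarm_rate(false_detections, observation_hours):
    """Unwanted detections (caused by other sound sources) per hour of listening."""
    return false_detections / observation_hours

# Hypothetical session: 48 of 50 door-bell signals detected, no false alarms in 10 hours.
print(f"hit rate: {hit_rate(48, 50):.0%}, false alarms/h: {false_alarm_rate(0, 10):.1f}")
```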



Fig. 2. Acoustic signal event detection in noisy home environments. Hit rate (%) as a function of signal-to-noise ratio (dB).

2.4. Phoneme recognition

Speech recognition is a promising technology when it comes to supportive methods for hearing impaired persons. It has however proven difficult to employ speech-to-text conversion as a general-purpose technique in real-life conditions, for example as an aid during telephone conversations. One problem is that such a device would have to cope with a virtually unlimited vocabulary in order to be useful for general-purpose conversation. It should also be speaker independent and operate in close to real time.

An alternative to performing full-blown speech-to-text recognition is phoneme recognition. The results of a real-time phoneme recognizer may be used in several ways: e.g., they may be displayed as (phonetic) symbols on a screen, as was done in [16], or they may, as is described in this article, be used to drive a facial animation system that provides speechreading support to hearing impaired persons. In this system, visible speech movements appear in synchrony with the speech signal, so that they may be naturally integrated with the acoustic signal. Since the interpretation is left to the user, the consequences of recognition errors are likely to be less severe than in the case of speech-to-text conversion – single misrecognised phonemes can often be overlooked or repaired in the audio-visual perception process, while a speech-to-text system may produce an entirely different word or phrase based on a single misrecognition.

In such a system, there are some constraints placed on the recognition system that are normally not present, the most important of which is that it has to operate in real time and produce results a fraction of a second after the sound signal is received. In the system described later in this article, these constraints were met by implementing a special-purpose real-time phoneme recognition system, based on a hybrid of recurrent neural networks (RNNs) and hidden Markov models (HMMs), that can deliver recognition results with a delay of only 30 ms. This is described in detail in [27,28].
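To make the real-time constraint concrete, the schematic loop below maps a stream of recognized phoneme labels to mouth-shape (viseme) parameters while buffering the audio by a few frames so that face and voice stay aligned. Only the 30 ms order of magnitude comes from the text; the phoneme-to-viseme table, parameter names and frame handling are invented for the sketch and do not describe the actual recognizer or animation engine.

```python
from collections import deque

# Hypothetical mapping from recognized phonemes to viseme (mouth-shape) parameters.
PHONEME_TO_VISEME = {
    "a":   {"jaw_opening": 0.8, "lip_rounding": 0.1},
    "u":   {"jaw_opening": 0.3, "lip_rounding": 0.9},
    "m":   {"jaw_opening": 0.0, "lip_rounding": 0.3},
    "sil": {"jaw_opening": 0.0, "lip_rounding": 0.0},
}

def stream_animation(phoneme_frames, audio_frames, delay_frames=3):
    """Yield (audio_frame, viseme_parameters) pairs with the audio held back by `delay_frames`
    (e.g. 3 frames of 10 ms ~= 30 ms) to compensate for the recognizer's processing delay."""
    audio_buffer = deque()
    for phoneme, audio in zip(phoneme_frames, audio_frames):
        audio_buffer.append(audio)
        viseme = PHONEME_TO_VISEME.get(phoneme, PHONEME_TO_VISEME["sil"])
        if len(audio_buffer) > delay_frames:
            yield audio_buffer.popleft(), viseme  # play delayed audio together with the current face frame

if __name__ == "__main__":
    phonemes = ["sil", "m", "a", "a", "u", "sil"]
    audio = [f"audio_frame_{i}" for i in range(len(phonemes))]
    for out_audio, viseme in stream_animation(phonemes, audio):
        print(out_audio, viseme)
```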

3. Visualization of signal events and sound environment

Visualization of the acoustic environment can be used to help hearing impaired persons identify and reduce the uncertainty of the current listening situation. Illustrating the signal events and the listening environment is one way to increase the environment awareness for hearing impaired persons. A signal event could be illustrated with a symbol representing the action that caused the event. Similarly, the current listening environment could be illustrated by, e.g., one of four different symbols representing speech, babble, noise, and music. The usage of symbols for increasing environment awareness will be further investigated and evaluated in the HaH project presented in the introduction.

4. Visualization of speech

For visualization of speech sounds we mainly rely on synthetic talking heads that can support both speech perception and speech production training through computer-animated facial movements of, e.g., the lips, jaw and tongue, as described below.

4.1. The talking head model

The talking head consists of face, tongue and teeth models, as shown in Fig. 3, based on static 3D wireframe meshes that are deformed by applying weighted transformations to their vertices [1].

Fig. 3. The Talking Head models of (a) the face, (b) the tongue, (c) a side view of the face with the cheek made transparent.

These transformations are described by parameters, which for the face are jaw opening, jaw shift, jaw thrust, lip rounding, upper lip raise, lower lip depression, upper lip retraction and lower lip retraction. For the tongue, the parameters are dorsum raise, body raise, tip raise, tip advance and width. Additional parameters control facial expressions, enabling the talking face to display emotions and non-linguistic cues, such as eyebrow raising, eye gaze and head nodding. The talking heads may hence potentially provide perception support for higher-level information, such as prosody or speaker mood, as well as for the actual phonemes uttered.
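The deformation principle can be illustrated with a toy example in which each articulatory parameter contributes a weighted per-vertex displacement to a rest mesh. The actual model applies more general weighted transformations [1]; the vertex coordinates, displacement fields and weights below are invented for the sketch.

```python
import numpy as np

# Rest positions of a toy "lip" mesh: 4 vertices in 3D (x, y, z), purely illustrative.
rest_vertices = np.array([
    [-1.0,  0.0, 0.0],
    [ 1.0,  0.0, 0.0],
    [ 0.0,  0.5, 0.2],
    [ 0.0, -0.5, 0.2],
])

# Displacement of every vertex at full parameter activation (weight = 1).
# A real model would use weighted rotations/translations per vertex; linear offsets keep the sketch short.
parameter_displacements = {
    "jaw_opening":  np.array([[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, -0.8, 0]], dtype=float),
    "lip_rounding": np.array([[0.4, 0, 0.1], [-0.4, 0, 0.1], [0, -0.1, 0.2], [0, 0.1, 0.2]]),
}

def deform(vertices, displacements, weights):
    """Apply weighted parameter deformations to the rest mesh and return the deformed vertices."""
    out = vertices.copy()
    for name, w in weights.items():
        out += w * displacements[name]
    return out

# An open, rounded mouth shape: jaw half open, lips strongly rounded.
print(deform(rest_vertices, parameter_displacements, {"jaw_opening": 0.5, "lip_rounding": 0.9}))
```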


To ensure that the talking head animations are phonetically correct, several measurement sources have been used. The tongue and teeth models are based on a database of magnetic resonance images of one Swedish subject [9] producing 13 vowels and 10 consonants in three symmetric vowel-consonant-vowel (VCV) contexts. Simultaneous measurements of the speech acoustics and of face and tongue movements have also been performed [4]. The role of these measurements is not only to serve as a basis for the computer animations, but also to establish statistical relations between the different modalities, which may be employed in automatic audiovisual analysis of the speech signal, as described in Sections 4.3.1 and 4.3.2 below.

The measurement setup consisted of an audio recorder for the speech acoustics, a video camera and an optical motion capture system for the facial movements, and an electromagnetic tracking system for the tongue movements. The video recordings are full frontal images of the speaker's face collected at 25 Hz, while the optical motion tracking was made using a Qualisys system with four cameras. This system combines the images from the four cameras to calculate the 3D coordinates of small reflectors at a rate of 60 frames per second. 28 reflectors were glued to the subject's jaw, cheeks, lips, nose and eyebrows. The data on the tongue movements was collected with the electromagnetic articulography system Movetrack [6], which uses two transmitters on a light-weight head mount and receiver coils placed on the tongue.

In the simultaneous recordings, the subject was a female native speaker of Swedish, who has received high intelligibility ratings in audio-visual tests. She produced a corpus with 270 short Swedish everyday sentences, 138 symmetric VCV and VCC{C}V words and 41 asymmetric C1VC2 words.
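One simple way to exploit such parallel recordings is sketched below: the streams are resampled to a common frame rate and a linear least-squares mapping from facial marker positions to tongue coordinates is estimated. This is only a schematic illustration of "establishing statistical relations between the modalities"; the cited analyses use more elaborate models, and all array shapes and rates here are invented.

```python
import numpy as np

def resample(frames, src_rate, dst_rate):
    """Linearly interpolate multichannel data (frames x channels) to a new frame rate."""
    n_dst = int(round(len(frames) * dst_rate / src_rate))
    t_src = np.arange(len(frames)) / src_rate
    t_dst = np.arange(n_dst) / dst_rate
    return np.column_stack([np.interp(t_dst, t_src, frames[:, c]) for c in range(frames.shape[1])])

def fit_linear_mapping(face, tongue):
    """Least-squares mapping from facial marker coordinates to tongue coil coordinates."""
    face1 = np.hstack([face, np.ones((len(face), 1))])  # add a bias term
    W, *_ = np.linalg.lstsq(face1, tongue, rcond=None)
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    face_60hz = rng.normal(size=(600, 6))        # 10 s of hypothetical marker data (3 markers x 2D)
    tongue_100hz = rng.normal(size=(1000, 4))    # 10 s of hypothetical coil data (2 coils x 2D)
    face_100hz = resample(face_60hz, 60, 100)    # bring the face data to the tongue frame rate
    W = fit_linear_mapping(face_100hz[:1000], tongue_100hz)
    predicted_tongue = np.hstack([face_100hz[:1000], np.ones((1000, 1))]) @ W
    print(predicted_tongue.shape)
```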



4.2. Perception support

Numerous studies have shown that computer-animated talking heads may be used to influence and support speech perception and to increase speech intelligibility for normal-hearing [7,20,27] as well as hearing impaired subjects [2]. Typically, the basic structure of these studies is borrowed from audiovisual perception experiments [14,29] and aims to measure either phoneme identification (representing bottom-up processing) or speechreading performance (top-down processing). In one such experiment, conducted as part of the SynFace project [30], audio from Swedish, English and Dutch sentences was degraded to simulate the information losses that arise in severe-to-profound hearing impairment. Twelve normal-hearing native speakers of each language took part. Auditory signals were presented alone, with the synthetic face, and with a video of the original talker. The data show a significant benefit from the synthetic face under the degraded auditory conditions. Intelligibility in the purely auditory conditions was low (an average of 30% across the three languages) and representative of intelligibility in the SynFace target group (severely hearing impaired persons) for the same or similar sentences. With an average improvement of 20%, the magnitude of the intelligibility increase for the synthetic face compared to no face was broadly consistent, statistically reliable, and large enough to be important in everyday communication.

4.2.1. The SynFace telephone support

For a hearing impaired person it is often necessary to be able to lip-read as well as hear the person they are talking with in order to communicate successfully. This puts hearing impaired users at a distinct disadvantage when it comes to telephone communication. Video telephony can provide essential visual speech information; however, videophones require specialised equipment at both ends, as well as broadband connections. SynFace provides an alternative that works over ordinary telephone lines and over IP telephony alike. The idea behind SynFace is to try to re-create the visible articulation of the speaker at the other end, in the form of an animated talking head. The visual signal is presented in synchrony with the acoustic speech signal, which means that the user can benefit from the combined, synchronized audiovisual perception of the original speech acoustics and the re-synthesized visible articulation. Compared to video telephony solutions, SynFace has the distinct advantage that only the user on the receiving end needs special equipment – the speaker at the other end can use any telephone terminal and technology: fixed, mobile or IP telephony.

The SynFace technology has its background in the Teleface project [1,27], which demonstrated that synthesised facial movements driven by an automatic speech recogniser can provide phonetic information, not available in the auditory signal, to a hearing impaired user. In the SynFace project this has been further developed into a multilingual synthetic talking face.

There are two main research areas that have been addressed in the SynFace project: the visual speech information requirements of auditory-visual communication have been defined, and techniques to derive this information from the acoustic speech signal in near real time have been developed. A multilingual prototype of the SynFace system has been developed for Dutch, English and Swedish; it is currently being extended to other languages in an ongoing EU project and is also being commercially developed by a private company.

Technically, SynFace is a computer program that employs a specially developed real-time phoneme recognition system, which delivers information about the speech signal to a speech animation module, which in turn renders the talking face on the computer screen using 3D graphics, as shown in Fig. 6. Input can come either from an analogue phone line, via the computer sound card, or from an IP-telephony client program on the computer. The total delay from speech input to animation is only about 0.2 seconds, which is low enough not to disturb the flow of conversation. However, in order for the face and voice to be perceived coherently, the acoustic signal also has to be delayed by the same amount. This delay is implemented in the SynFace software.

During the SynFace project, the system was evaluated by 49 users with varying degrees of hearing impairment in the UK and Sweden, in both lab and home environments. SynFace was found to give support to the users, especially in perceiving numbers and addresses, and was considered an enjoyable way to communicate. A majority deemed SynFace to be a useful product. For details on the evaluation, see [1].
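The synchronization requirement described above amounts to buffering the incoming audio by the processing delay before playback. A minimal sketch is shown below; the 0.2 s figure comes from the text, while the sample rate, block size and class name are assumptions, and this is not the SynFace implementation itself.

```python
import numpy as np

class AudioDelayLine:
    """Delay incoming audio blocks by a fixed number of samples so that playback
    stays in sync with the later-arriving facial animation."""

    def __init__(self, delay_seconds=0.2, rate=8000):
        self.buffer = np.zeros(int(delay_seconds * rate), dtype=np.float32)

    def process(self, block):
        """Push a new block of samples in, get the equally long delayed block out."""
        combined = np.concatenate([self.buffer, block])
        out, self.buffer = combined[: len(block)], combined[len(block):]
        return out

if __name__ == "__main__":
    delay = AudioDelayLine(delay_seconds=0.2, rate=8000)   # telephone-bandwidth audio, assumed
    block = np.ones(160, dtype=np.float32)                  # one 20 ms block of samples
    first_out = delay.process(block)
    print(first_out[:5])                                     # zeros: the first 0.2 s of output is still silence
```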


4.2.2. Increased articulation support

In some situations, and for users with a severe hearing impairment, there is a need for more visual support than is given by the talking face. Many phonemes are either visually identical (such as /t/, /n/ and /d/) or impossible to identify by looking at the speaker's face, since the place of articulation is too far back in the mouth. For speech perception, disambiguation between these phonemes may be achieved with cued speech [8], where the facial movements are supplemented with hand sign gestures. The cued speech gestures are however an additional iconic system that needs to be learnt, and it may be beneficial to instead represent all relevant articulatory features in a way that is faithful to actual speech production, since the listener may relate this representation directly to their own speech movements. As an extension to SynFace, we have therefore created a perception support in which the interior articulation is visualized by removing parts of the talking face's cheek. The use of this talking head is twofold: either as a support for perception or as a tool for production training, as described in Section 4.3.

The French LABIAO project has demonstrated that a perception support with cued speech synthesis from phoneme recognition input can be used successfully by hearing impaired children in the classroom to improve their understanding of the teacher's lesson. In the LABIAO application, the teacher wears a microphone, and lip movements and cued speech hand gestures are synthesized using an animated character on a computer screen at the hearing impaired pupil's desk. As a demonstration showcase in the European network of excellence MUSCLE, we are currently working on an alternative visual output, consisting of one front view of the whole face, similar to the SynFace setup, and one side view in which parts of the skin have been made transparent to show the tongue and teeth, as shown in Fig. 3(c). The two views of the face are animated using the output of the phoneme recognizer as input to the visual speech synthesis, just as for the SynFace telephone support. In order to better illustrate fast speech movements, a "slow-motion" feature has been introduced, and a time scale makes it possible to get an articulatory snapshot at any point in the utterance. We are presently working on methods to display additional information on the face, such as prosodic information, which is very important for correctly perceiving e.g. word stress.

4.3. Production support

Children who are born with a severe auditory deficit have a limited acoustic speech target to imitate, and this often results in unintelligible speech. Training with a speech therapist can result in dramatic improvements of the child's speech, but other senses must be used to supplement the auditory feedback that hearing children use when they learn to speak. Computer-based visualization tools can play an important role in this training, especially if they provide information about visual or tactile properties of the pronunciation. The animations of the tongue positions and movements may thus be used to help a hearing impaired child understand how to produce articulations that are difficult to infer from a view of the speaker's face. The animations in the visualization tool are created based on the output from a phoneme recognizer, as described previously for SynFace and the perception support.


When used as a tool to help the child grasp the articulation of a difficult phoneme, it is essential that the important features shown in the visualization are noticed by the child if they are to be transferred to the child's own production. We have therefore implemented the functionality to rewind the animation or play it in slow motion, in order to enhance the visual information.

Seeing a speech therapist or a computer animation produce a phoneme gives the child information on the articulatory target that is to be reached, but learning will be more effective if the child can relate the target to his or her own performance. The talking head may hence also be used to provide feedback on how the child should change his or her own production. This is a much more difficult task, since it requires that the child's actual production is estimated, rather than visualizing a generic, typical correct production. This means that an acoustic-to-articulatory inversion must be performed instead of a phoneme recognition followed by an animation. Independent computer-based speech training (CBST) software must hence detect errors, identify the cause of each error through an articulatory inversion, and provide feedback to the user with audiovisual instructions. We are developing such software with a computer-animated speech therapist or language teacher, Artur, the ARticulation TUtoR [10].

4.3.1. Audiovisual detection of pronunciation errors

Pronunciation errors may occur on different levels and be of different types, e.g. prosodic errors or phoneme insertions, deletions or substitutions. An automatic speech tutor should ideally address all types of pronunciation errors, but we here focus on phonemic errors, i.e. that a phoneme has been incorrectly articulated. Automatic detection of such pronunciation errors is difficult, but the task becomes more feasible if video images of the speaker's face are added, since features that may be confused acoustically are often visually distinct, such as lip rounding for vowels or place of articulation for consonants. We have therefore investigated how audio-visual phoneme classification may support pronunciation training applications [18].

The most common approach to visual speech recognition is to track the lip contours, but we instead track the upper part of the face, extract the mouth region in the stabilized image and represent its articulatory information implicitly in terms of image pixel values. The reason is that this representation preserves the information about the visibility of the tongue tip and the position of the lips relative to the face. Basis functions representing the most prominent lip shape variations were learned and used together with the acoustic signal to classify Swedish vowels and consonants.
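The idea of representing the mouth region implicitly by pixel values and learned basis functions can be sketched with a small principal-component computation; the image sizes and data are arbitrary, and this is not the feature extraction pipeline of [18].

```python
import numpy as np

def learn_basis(mouth_images, n_components=5):
    """Learn basis functions (principal components) of vectorized mouth-region images."""
    X = mouth_images.reshape(len(mouth_images), -1).astype(float)
    mean = X.mean(axis=0)
    # The SVD of the centred data gives the directions of largest lip-shape variation.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def project(image, mean, basis):
    """Visual feature vector: coefficients of the image in the learned basis."""
    return basis @ (image.ravel().astype(float) - mean)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    frames = rng.integers(0, 256, size=(200, 32, 48))   # 200 hypothetical 32x48 mouth-region crops
    mean, basis = learn_basis(frames)
    print(project(frames[0], mean, basis))              # 5 coefficients describing the lip shape
```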

The method was trained and tested on the speech material described in Section 4.1, using C1VC2 words for vowel classification and VCV words for consonants. The acoustic and video data for each phoneme were divided into four equally large parts, and each part was tested using the other three parts for that phoneme, together with all data for the other phonemes, as training material. The phoneme classification was evaluated on separate frames, without any contextual information, vocabulary or grammar defined. The acoustic and visual input data were combined using late fusion, which means that classification is performed separately on the two signals before combining the results under the assumption that the signals are statistically independent. This method of combining different modalities after separate analysis is similar to the processing believed to occur in human speech perception [21].

The results, summarized in Fig. 4, show that video images of the speaker's face improve phoneme classification compared to classification on the acoustic signal only. The addition of visual data means that confusions between unrounded and rounded vowels, and between acoustically similar consonants with different places of articulation, are less common. Moreover, when misclassifications occur, they tend to be less serious from the point of view of an automatic tutor, since the confused phonemes are more articulatorily similar, e.g. [f] classified as [v] by the audiovisual classifier, compared to as [p, , ] by the acoustic-only classifier, or [u:] as [ :] instead of as [i]. Feedback generated based on the audiovisual classifier will therefore not only be correct more often than if an acoustic-only classifier is used, it will also give less confusing feedback when misclassifications do occur.
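Late fusion as described above can be written compactly: each classifier produces per-phoneme (log-)likelihoods, and under the independence assumption these are summed in the log domain before the best phoneme is picked. The scores and stream weight below are invented for illustration.

```python
def late_fusion(acoustic_loglik, visual_loglik, audio_weight=0.5):
    """Combine per-phoneme log-likelihoods from independent acoustic and visual classifiers.

    Under the independence assumption the joint log-likelihood is a (weighted) sum
    of the two streams; the phoneme with the highest combined score wins.
    """
    combined = {}
    for phoneme in acoustic_loglik:
        combined[phoneme] = (audio_weight * acoustic_loglik[phoneme]
                             + (1.0 - audio_weight) * visual_loglik[phoneme])
    return max(combined, key=combined.get), combined

# Invented scores: acoustically /f/ and /p/ are confusable, but visually /f/ is clearly labiodental.
acoustic = {"f": -4.1, "v": -5.0, "p": -4.2}
visual = {"f": -1.0, "v": -1.5, "p": -6.0}
best, scores = late_fusion(acoustic, visual)
print(best, scores)
```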

Fig. 4. Results (percent correct classifications) from acoustic (AO), visual (VO) and audiovisual (AV) classification of Swedish phonemes.

Fig. 5. Results (correlation coefficients) from articulatory inversion with a neural network, for VCV words and sentences, using either the acoustic signal (AO) or the acoustic signal and the horizontal position of the lip corners and the vertical position of the upper and lower lip and the chin (AV) as input.

4.3.2. Audiovisual-to-articulatory inversion

If a mispronunciation has been detected for a phoneme, the next step towards helping the user to correct the articulation is to process the speech signal to estimate how he or she shaped the vocal tract to produce that phoneme, i.e. to perform an acoustic-to-articulatory inversion. This is however a presently unresolved task, mainly because many vocal tract shapes may have produced the same speech sound. For a speech training application that is to give feedback to the user on how the articulation should be changed, this is certainly problematic, as the feedback instructions should relate both to the target and to what the user is currently doing instead.

One possible solution to improve the articulatory inversion is to add visual data of the subject's face, since information on the jaw position, mouth opening and lip rounding may limit the number of possible vocal tract shapes that could have produced the sound. Automatically analyzing video images of the speaker's lips may hence be beneficial, just as for the mispronunciation detection described in Section 4.3.1. In order to test the potential contribution provided by the lips if the automatic tracking is error-free, motion capture data corresponding to features extractable from a front view of the lip region was used, namely the horizontal position of the lip corners and the vertical position of the upper and lower lip and the chin. Significant improvements of the estimation of the tongue shape and position were achieved when the acoustic input data was supplemented with this motion capture face data, as shown in Fig. 5 [23]. The results are measured using correlation coefficients, which measure the similarity between the estimated and the actual data.
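The correlation measure referred to above is, per articulatory channel, the standard Pearson coefficient between the estimated and the measured trajectories; a minimal version with random stand-in data is shown below (the channel names and numbers are invented).

```python
import numpy as np

def articulatory_correlation(estimated, measured):
    """Pearson correlation per articulatory channel (columns) between estimated and measured trajectories."""
    est = estimated - estimated.mean(axis=0)
    mea = measured - measured.mean(axis=0)
    return (est * mea).sum(axis=0) / np.sqrt((est ** 2).sum(axis=0) * (mea ** 2).sum(axis=0))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    measured = rng.normal(size=(500, 3))                          # e.g. tongue tip x/y and jaw, hypothetical
    estimated = 0.7 * measured + 0.3 * rng.normal(size=(500, 3))  # a fake inversion output
    print(articulatory_correlation(estimated, measured))          # one coefficient per channel
```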



Fig. 6. The SynFace telephone support as a PC application. Several features have been implemented, such as a user-created telephone book with name calling and selection of the face associated with the calling/called number.

The increase with automatic analysis of video images is similar, but smaller (0.11, compared to 0.18, for the correlation coefficients of the VCV words [19]). An articulatory analysis of the estimation shows that the facial measures mainly provide information to recover the movements of the jaw and the tongue tip, and that the combination of audio and visual data improves the estimation of the front-back movement of the tongue body.

4.3.3. Audiovisual feedback from the speech tutor

To test the usability of the Artur system and the benefit of using audiovisual feedback instructions, we have conducted Wizard-of-Oz studies with hearing- or speech-impaired children [10] and second language learners [12]. In these tests, a human, phonetically trained judge replaced the automatic detection of mispronunciations and the articulatory inversion, and chose the feedback given to the student from a set of pre-generated audiovisual instructions on how to improve the articulation.

The audiovisual feedback from the virtual speech tutor is in the form of spoken, subtitled instructions accompanied by computer animations that show the most important part of the instructions. An instruction to the user to lower the tongue tip and raise the back part of the tongue is illustrated with an animation showing the talking head doing just this.

Important features, such as the place of articulation, are pointed out using green (for correct) and red (for incorrect) circles. The animations are played as real-time movements or in slow motion, depending on the user's request (Fig. 6).

The feedback given to the user is based on the conclusions from interviews with speech therapists [15], language teachers and students, and from classroom observations [12], and is handled automatically using a system for feedback management [13]. This means, e.g., that the amount and detail of feedback is adapted to the user's previous performance, so that the feedback is varied even if the user repeats the same error several times. In addition, the user is actively involved in the monitoring of the pronunciation: firstly, the user controls the amount of feedback given and may request more feedback on the difference between the target and his or her own previous attempt. Secondly, proprioceptive feedback is used, such as encouraging the user to try to feel the contact between specific parts of the tongue and the palate. Simultaneously, this contact is illustrated with computer animations, which means that yet another modality is added to support the user's learning.
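The feedback-management principle (vary the amount and detail of feedback with the user's history rather than repeating the same instruction) can be illustrated with a toy rule. The messages and thresholds below are invented and are not the rules implemented in [13].

```python
class FeedbackManager:
    """Toy feedback manager: the more often the same error recurs, the more detailed the instruction."""

    def __init__(self):
        self.error_counts = {}

    def feedback(self, error):
        count = self.error_counts.get(error, 0) + 1
        self.error_counts[error] = count
        if count == 1:
            return "Listen again and try the word once more."            # minimal hint
        if count == 2:
            return "Pay attention to your tongue when you say it."       # more specific hint
        return f"Lower the tongue tip and raise the back of the tongue ({error})."  # detailed, with animation

manager = FeedbackManager()
for _ in range(3):
    print(manager.feedback("retroflex /t/ produced as dental"))
```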



Users who have tested the system in short training sessions state in interviews and questionnaires that they became more aware of their own tongue movements and were able to relate to the instructions and animations after a very short adaptation period. The responses were very positive, with users stating that the audiovisual instructions were extremely helpful. The wizard's auditory judgment was, furthermore, that the users' pronunciation improved during the sessions. We are now analyzing ultrasound data collected from users during training sessions with the tutor, to estimate the changes in pronunciation in articulatory terms.


5. Conclusions

We have described how acoustic signals, both speech and alarm sounds, can be processed automatically and visualized. Sound classification and signal event detection are promising techniques to improve the context awareness of hearing impaired persons. Synthetic talking heads can be used to support both speech perception and speech production training through computer-animated facial movements of the lips, jaw and tongue. Thus, with careful adaptation, several advances in visualization and in audio and speech processing promise to benefit hearing impaired persons.

Acknowledgments

SynFace originated under the EU project SYNFACE and is now further developed within the EU project HaH. The development of the increased articulation support is funded by the European network MUSCLE. The work on the pronunciation tutor ARTUR is funded by the Swedish Research Council. The work on audiovisual-to-articulatory inversion is part of the EU project ASPI. The work on mispronunciation detection is part of the ADEPT (Audiovisual Detection of Errors in Pronunciation Training) project, funded by the Swedish Research Council and the Swedish International Development Cooperation Agency.

References

[1] E. Agelfors, J. Beskow, I. Karlsson, J. Kewley, G. Salvi and N. Thomas, User evaluation of the SYNFACE talking head telephone, in: ICCHP 2006, LNCS 4061, K. Miesenberger et al., eds, 2006, pp. 579–586.
[2] E. Agelfors, J. Beskow, M. Dahlquist, B. Granström, M. Lundeberg, K.-E. Spens and T. Öhman, Synthetic faces as a lipreading support, in: Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia, 1998, pp. 3047–3050.
[3] J. Beskow, Animation of talking agents, in: Proceedings of AVSP'97, 1997, pp. 149–152.
[4] J. Beskow, O. Engwall and B. Granström, Resynthesis of facial and intraoral articulation from simultaneous measurements, in: Proceedings of the 15th ICPhS, M.J. Solé, D. Recasens and J. Romero, eds, 2003, pp. 431–434.
[5] J. Beskow, M. Dahlquist, B. Granström, M. Lundeberg, K.-E. Spens and T. Öhman, The Teleface project: multimodal speech communication for the hearing impaired, in: Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH'97), Rhodos, Greece, 1997.
[6] P. Branderud, Movetrack – a movement tracking system, in: Proceedings of the French-Swedish Symposium on Speech, Grenoble, 1985, pp. 113–122.
[7] M.M. Cohen, R.L. Walker and D.W. Massaro, Perception of synthetic visual speech, in: Speechreading by Humans and Machines: Models, Systems and Applications, D.G. Stork and M.E. Hennecke, eds, Springer, Berlin, 1996, pp. 153–168.
[8] O. Cornett and M.E. Daisey, The Cued Speech Resource Book for Parents of Deaf Children, National Cued Speech Association, 1992.
[9] O. Engwall, Combining MRI, EMA & EPG in a three-dimensional tongue model, Speech Communication 41(2–3) (2003), 303–329.
[10] O. Engwall, O. Bälter, A.-M. Öster and H. Kjellström, Designing the user interface of the computer-based speech training system ARTUR based on early user tests, Journal of Behavioural and Information Technology 25(4) (2006), 353–365.
[11] O. Engwall, Evaluation of speech inversion using an articulatory classifier, in: Proceedings of the Seventh International Seminar on Speech Production, H. Yehia, D. Demolin and R. Laboissière, eds, Ubatuba, São Paulo, Brazil, 2006, pp. 469–476.
[12] O. Engwall and O. Bälter, Feedback from real and virtual teachers in pronunciation training, to appear in Journal of Computer Assisted Language Learning (in press).
[13] O. Engwall, O. Bälter, A.-M. Öster and H. Kjellström, Feedback management in the pronunciation training system ARTUR, in: Proceedings of the International Conference on Human Factors in Computing Systems (CHI), 2006, pp. 231–234.
[14] N.P. Erber, Interaction of audition and vision in the recognition of speech stimuli, Journal of Speech and Hearing Research 12 (1969), 423–425.
[15] E. Eriksson, O. Bälter, O. Engwall, A.-M. Öster and H. Kjellström, Design recommendations for a computer-based speech training system based on end-user interviews, in: Proceedings of the Tenth International Conference on Speech and Computers, Patras, Greece, 2005, pp. 483–486.
[16] M. Johansson, M. Blomberg, K. Elenius, L.-E. Hoffsten and A. Torberger, A phoneme recognizer for the hearing impaired, in: Proceedings of ICSLP 2002, Denver, Colorado, USA, 2002, pp. 433–436.
[17] J.M. Kates, Classification of background noises for hearing-aid applications, J Acoust Soc Am 97 (1995), 461–470.
[18] H. Kjellström, O. Engwall, S. Abdou and O. Bälter, Audiovisual phoneme classification for pronunciation training applications, in: Proceedings of Interspeech, 2007.
[19] H. Kjellström, O. Engwall and O. Bälter, Reconstructing tongue movements from audio and video, in: Proceedings of Interspeech, 2006.
[20] B. Le Goff, Automatic modeling of coarticulation in text-to-visual speech synthesis, in: Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH'97), Rhodos, Greece, 1997, pp. 1667–1670.
[21] D. Massaro, Speech Reading by Ear and Eye, Erlbaum, Hillsdale, 1987.
[22] L. Ma, D. Smith and B. Milner, Context awareness using environmental noise classification, in: Proceedings of Eurospeech, Geneva, 2003, pp. 2237–2240.
[23] P. Nordqvist, Sound Classification in Hearing Instruments, PhD thesis, 2004.
[24] P. Nordqvist and A. Leijon, An efficient robust sound classification algorithm for hearing aids, J Acoust Soc Am 115 (2004), 3033–3041.
[25] P. Nordqvist and D. Huanping, Automatic classification of the telephone-listening environment in a hearing aid, J Acoust Soc Am 109 (2001), 2491.
[26] S. Oberle and A. Kaelin, Recognition of acoustical alarm signals for the profoundly deaf using hidden Markov models, in: IEEE International Symposium on Circuits and Systems, Hong Kong, 1995, pp. 2285–2288.
[27] G. Salvi, Truncation error and dynamics in very low latency phonetic recognition, in: ISCA Workshop on Non-Linear Speech Processing, 2003.
[28] G. Salvi, Dynamic behaviour of connectionist speech recognition with strong latency constraints, Speech Communication 48(7) (2006), 802–818.
[29] W.H. Sumby and I. Pollack, Visual contribution to speech intelligibility in noise, J Acoust Soc Am 26 (1954), 212–215.
[30] C. Siciliano, G. Williams, J. Beskow and A. Faulkner, Evaluation of a multilingual synthetic talking face as a communication aid for the hearing impaired, in: Proceedings of ICPhS, 2003, pp. 131–134.
[31] J. Tchorz and B. Kollmeier, Using amplitude modulation information for sound classification, in: Psychophysics, Physiology, and Models of Hearing, T. Dau, V. Hohmann and B. Kollmeier, eds, World Scientific, Oldenburg, 1998, pp. 275–278.
