IADIS International Conference Interfaces and Human Computer Interaction 2007

SILENT VOICE COMMAND RECOGNITION FOR HCI USING VIDEO WITHOUT EVALUATING AUDIO

Wai Chee Yau 1,2, Dinesh Kant Kumar 1, Hans Weghorn 2

1 School of Electrical and Computer Engineering, RMIT University, GPO Box 2476V, Melbourne, Victoria 3001, Australia.
2 Information Technology, BA-University of Cooperative Education, Stuttgart, Germany.

ABSTRACT

Speech control provides a flexible and natural way for users to interact with machines. However, speech recognition systems are not widely used as human-computer interfaces due to their intrinsic sensitivity to variations in acoustic conditions. The performance of such a system degrades when the user changes speaking style (speaks softly or whispers) or in noisy environments. This research proposes a method to overcome these problems by recognizing speech from video recordings, without evaluating the audio signals. The study applies video analysis techniques to recognize voice commands, and a video segmentation method to detect the start and end of isolated utterances is presented. Potential applications of such a system include car radio control, defense applications, speech control in noisy environments and HCI for disabled people.

KEYWORDS

human-computer interaction, silent speech recognition, video analysis, computer vision.

1. INTRODUCTION

The rapid growth in computer technology has resulted in an increasing demand for HCI methods with enhanced flexibility. One drawback of conventional human-computer interfaces such as mice and keyboards is that they must be operated by hand. Such interfaces are not suitable in "hands-eyes-busy" situations such as the control of a navigation system in a moving vehicle. This limitation can be overcome by using speech control. Speech systems allow users to control computers by uttering voice commands. Such systems are also useful for people with limb disabilities caused by amputation, strokes or amyotrophic lateral sclerosis to control their environment (Feng et al. 2006).

A major shortcoming of audio-based speech interfaces is their sensitivity to variations in acoustic conditions. The performance of such systems is affected by the ambient noise level and the user's speaking style. Audio-based speech HCI is also not suitable for confidential communications or for giving discreet commands when other people may be talking in the vicinity. To overcome these limitations, this study proposes to adopt a non-acoustic modality to recognize utterances. Such techniques require only the sensing of facial and speech articulator movements, without the need to sense the sound output of the speaker. The available options include visual recording, recording of vocal cord movements through an electroglottograph (EGG) and recording of facial muscle activity (Arjunan et al. 2006). This paper proposes to identify speech using video recordings. The visual modality is selected because the acquisition of video data is non-intrusive, as opposed to other methods that involve the placement of sensors on the user. Another motivation for adopting a vision-based method is that such systems can be incorporated into devices with embedded cameras such as mobile phones and laptops. The advantages of a visual-only speech recognition system are that it (i) is not affected by audio noise, (ii) is not affected by changes in acoustic conditions and (iii) does not require the user to make a sound.

Video data of the speaker contain information on the visible movement of the speech articulators such as the lips, facial muscles, tongue and teeth. Visual information from a speaker's face has long been known to aid the understanding of spoken language by humans with normal hearing (Summerfield 1987). The ability of people with hearing impairment to lip read is another clear demonstration of the significance of visual speech information. Research in which audio and video inputs are combined to recognize large-vocabulary, complex speech patterns has been reported in the literature (Potamianos et al. 2003; Hazen 2006). Without the voice signals, such systems have very high error rates for visual-only speech recognition tasks, with errors in the order of 90% (Potamianos et al. 2003; Hazen 2006). Few researchers have focused on visual-only speech recognition as a stand-alone problem. The need for such systems arises in situations where the audio signals are unreliable. A typical example is the voice control of car radios. Such systems allow users to control the car radio while keeping their hands on the wheel and their attention on the road. Nevertheless, the audio signals inside a moving vehicle are highly contaminated by engine noise and sounds from the radio, and hence the information from the video is very important in identifying the control commands. The low visual-only recognition accuracies reported in (Potamianos et al. 2003; Hazen 2006) suggest that it is difficult to recognize a large vocabulary of continuous speech without using acoustic data. The visual cues contain less classification power for speech than the audio signals, and hence it is to be expected that such systems would have a small vocabulary.

This paper reports on a novel technique to recognize voice commands from video recordings without evaluating the audio signals. Earlier work by the authors has demonstrated a video-based approach that can identify English consonants with good accuracy (Yau et al. 2006). This paper reports an enhanced approach based on that previous work and evaluates its performance in classifying English vowels and consonants. The paper also describes a new framework to automatically segment an individual utterance from a sequence of utterances in the video. The segmented utterances can be fed into the recognition system to identify the commands given by the user and elicit machine actions to complete the tasks.

2. DESCRIPTION OF THE PROPOSED APPROACH

This paper proposes a vision-based technique to identify voice commands without evaluating the sound signals. The proposed method can be divided into two phases: (i) the recognition of utterances based on facial movement and (ii) the segmentation of individual utterances from a video recording containing multiple utterances.

2.1 Speech Command Recognition from Video

This research proposes a method to classify the voice commands of users from video data. Figure 1 illustrates the block diagram of the proposed visual speech recognition approach.

Figure 1. Block diagram of the proposed visual-only speech recognition technique.

The first video processing step is the segmentation of facial motion. The facial movement in each video recording is represented using a 2D grayscale image, the spatial-temporal template (STT). The STT contains both the spatial and the temporal information of the facial movement (Yau et al. 2006). The STT is pre-processed using the discrete stationary wavelet transform (SWT) to reduce the small variations in facial movement between different repetitions of the same utterance, and is then represented by the SWT approximation sub-image. Analyzing the pixel values directly is difficult because each image consists of a large number of pixels; an image of size 240 x 240 has 57,600 pixels. Further, the pixel values are sensitive to changes in scale, rotation and translation of the mouth in the images. This research therefore proposes to represent the SWT approximation sub-image using 49 Zernike moments. The authors have previously demonstrated that Zernike moments are robust features suitable for representing the STT (Yau et al. 2006). The Zernike moment features are classified into one of the commands using a hidden Markov model (HMM) classifier.
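As a rough illustration of this processing chain (and not the authors' implementation), the Python sketch below accumulates thresholded frame differences into an STT-like motion image, takes a single-level stationary wavelet approximation and computes Zernike moment magnitudes. The libraries (OpenCV, PyWavelets, mahotas) and all parameter values (difference threshold, wavelet, moment degree and radius) are assumptions made for illustration only.

import cv2
import numpy as np
import pywt
import mahotas

def compute_stt(frames, diff_threshold=25):
    # Fold a sequence of mouth-region frames into one grayscale spatial-temporal
    # template: later movements are written with higher intensity, so the image
    # encodes where and when motion occurred.
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    stt = np.zeros(gray[0].shape, dtype=np.float32)
    for t in range(1, len(gray)):
        moving = cv2.absdiff(gray[t], gray[t - 1]) > diff_threshold
        stt[moving] = t / (len(gray) - 1)      # normalised time stamp of the motion
    return (stt * 255).astype(np.uint8)

def stt_features(stt, degree=12, radius=120):
    # Single-level stationary wavelet transform of the STT; keep the approximation
    # sub-image, then describe it with Zernike moment magnitudes (degree 12 is
    # assumed here, as it yields 49 moments, the feature count used in the paper).
    (approx, _detail), = pywt.swt2(stt.astype(np.float32), "haar", level=1)
    approx = cv2.normalize(approx, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return mahotas.features.zernike_moments(approx, radius, degree=degree)

The resulting 49-dimensional feature vector is then passed to the HMM classifier.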

2.2 Utterance Segmentation from Video

One of the challenges in recognizing speech from video clips is the segmentation of individual utterances. A number of audio-visual speech recognition techniques proposed in the literature segment the utterances using audio signals (Potamianos et al. 2003; Foo and Dong 2002). In situations where the audio signals are not available or are highly corrupted by noise, video-based segmentation is required. This section describes a technique to segment individual utterances from video data that contain multiple utterances. Figure 2 shows the proposed framework to segment and classify voice commands from video.

Video input → Mouth activity detection → Segmented utterance → Voice command recognition → Identified command

Figure 2. A framework showing the segmentation and recognition of voice commands from video

This research proposes to segment utterances from video clips based on mouth motion. The individual utterances (words or phonemes) can be segmented from the video by detecting mouth activity. In applications that involve isolated command words or phones, a short silence period is present between utterances. Mouth movements are produced while the user is pronouncing an utterance, whereas mouth activity is minimal during the silence period that separates two consecutive commands. The end of an utterance corresponds to the start of the silence period, and the end of the silence period indicates the start of the subsequent utterance. The mouth activity detection stage can be implemented using the spatial-temporal template (STT) approach: an STT is computed from a moving time window that slides across the video data. The video frames contained in a particular time window are assumed to lie within the silence period if no mouth movement is detected in that window, whereas frames within a time window that shows appreciable mouth movement are segmented as an utterance. The segmented utterance can then be fed into the recognition sub-system described in Section 2.1 to be identified as one of the commands. Figure 3 shows the proposed segmentation algorithm, based on a moving time window (shaded grey) applied to a sequence of video frames to detect the start and end of utterances.

Figure 3. A moving time window (shaded) sliding along the video frames (mouth images) over time (frame number) to segment the individual utterances
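To make the windowed detection concrete, the sketch below (a minimal illustration, not the authors' implementation) computes a per-frame mouth-motion energy from frame differences and groups frames whose windowed mean energy exceeds a threshold into utterance segments; the window length, thresholds and minimum segment length are assumed values.

import cv2
import numpy as np

def mouth_motion_energy(frames, diff_threshold=25):
    # Per-frame motion energy: fraction of mouth-region pixels whose intensity
    # changes appreciably with respect to the previous frame.
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    energy = [0.0]
    for prev, curr in zip(gray[:-1], gray[1:]):
        energy.append(float((cv2.absdiff(curr, prev) > diff_threshold).mean()))
    return np.array(energy)

def segment_utterances(energy, window=10, activity_threshold=0.02, min_frames=8):
    # Slide a time window along the energy signal; frames whose windowed mean
    # energy exceeds the threshold are treated as speech activity. Runs of
    # active frames become (start_frame, end_frame) utterance segments, and the
    # low-activity runs in between are taken as the silence periods.
    half = window // 2
    active = [energy[max(0, t - half):t + half + 1].mean() > activity_threshold
              for t in range(len(energy))]
    segments, start = [], None
    for t, is_active in enumerate(active):
        if is_active and start is None:
            start = t
        elif not is_active and start is not None:
            if t - start >= min_frames:
                segments.append((start, t - 1))
            start = None
    if start is not None and len(active) - start >= min_frames:
        segments.append((start, len(active) - 1))
    return segments

Each returned (start, end) frame pair can then be passed through the STT feature extraction of Section 2.1 and on to the classifier.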

3. METHODOLOGY AND SYSTEM VALIDATION

Experiments were conducted to test the proposed visual speech recognition technique. Video data were recorded using an inexpensive web camera in a typical office environment. This was done with a view to building a practical system using low-resolution video recordings in a realistic environment (as opposed to a video corpus recorded in a noise-free, studio-like environment). The camera focused on the mouth region of the speaker and was kept stationary throughout the experiment. The following factors were kept constant during the recording of the videos: the window size and view angle of the camera, the background and the illumination. 14 English vowels and consonants with different facial movements were used in the experiments. 280 video files (240 x 240 pixels) were recorded and stored as AVI files with a frame rate of 30 frames per second. The proposed recognition algorithm described in Section 2.1 was applied to the AVI files. The phonemes were manually segmented. One HMM was created and trained for each utterance; each class is modeled using a left-right HMM with three states, one Gaussian mixture component per state and a diagonal covariance matrix. The leave-one-out method was used to evaluate the performance of the proposed approach. The average recognition rate of the proposed system is 88.2%. The results indicate that the proposed technique can reliably recognize English phonemes.

To evaluate the proposed video segmentation technique of Section 2.2, video recordings consisting of multiple utterances will be used. The method detects consecutive frames with appreciable mouth activity and segments those frames as an utterance. The authors propose to validate the results of the video segmentation using audio signals: the detected start and end frames of utterances can be overlaid on the synchronized audio signals to verify the accuracy of the video segmentation technique.
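For reference, a left-right three-state HMM of the kind described above could be set up as in the sketch below, here using the hmmlearn library; this is an assumed implementation choice rather than the authors' code, and each utterance is assumed to be represented as a short sequence of feature vectors.

import numpy as np
from hmmlearn import hmm

def left_right_hmm(n_states=3):
    # Three-state left-right HMM with one diagonal-covariance Gaussian per state.
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                            n_iter=20, init_params="mc", params="mct")
    model.startprob_ = np.array([1.0, 0.0, 0.0])   # always start in the first state
    model.transmat_ = np.array([[0.5, 0.5, 0.0],   # self-loop or move forward only
                                [0.0, 0.5, 0.5],
                                [0.0, 0.0, 1.0]])
    return model

def train_models(train_data):
    # train_data: {command_label: list of (T x D) feature sequences}.
    models = {}
    for label, seqs in train_data.items():
        model = left_right_hmm()
        model.fit(np.vstack(seqs), lengths=[len(s) for s in seqs])
        models[label] = model
    return models

def classify(models, feature_seq):
    # Return the command whose HMM assigns the highest log-likelihood.
    return max(models, key=lambda label: models[label].score(feature_seq))

Leave-one-out evaluation then amounts to holding out one recording at a time, training the per-class models on the remaining recordings and checking whether classify recovers the held-out label.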

4. SUMMARY

This paper describes a voice control method for speech-based HCI using video, without evaluating audio signals. The proposed approach recognizes utterances from mouth images. The proposed system is evaluated on English vowels and consonants and encouraging results are obtained. The paper also presents a video segmentation framework to detect the start and end of each utterance in a sequence of commands. For future work, the authors intend to incorporate this segmentation concept into the proposed voice control system. Further, the investigation shall be extended from an English-speaking environment to other languages, e.g., German and Mandarin. Silent speech-based HCI is useful for controlling machines in noisy environments. For example, such a system may be implemented for in-vehicle control based on voice commands such as "on", "off" and isolated digits. Such systems can also be used to help disabled people control computers. Future applications cover robotics and defense tasks involving voice-less communication.

REFERENCES

Arjunan, S. P. et al., 2006. Unspoken Vowel Recognition Using Facial Electromyogram. Proceedings of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS'06), New York, pp. 2191-2194.

Feng, J. et al., 2006. A longitudinal evaluation of hands-free speech-based navigation during dictation. International Journal of Human-Computer Studies, Vol. 64, pp. 553-569.

Foo, S. W. and Dong, L., 2002. Recognition of Visual Speech Elements Using Hidden Markov Models. Lecture Notes in Computer Science, Vol. 2532, pp. 607-614.

Hazen, T. J., 2006. Visual Model Structures and Synchrony Constraints for Audio-Visual Speech Recognition. IEEE Transactions on Speech and Audio Processing, Vol. 14, No. 3, pp. 1082-1089.

Potamianos, G. et al., 2003. Recent Advances in the Automatic Recognition of Audiovisual Speech. Proceedings of the IEEE, Vol. 91, No. 9, pp. 1306-1324.

Summerfield, A. Q. et al., 1987. Some preliminaries to a comprehensive account of audio-visual speech perception. Hearing by Eye: The Psychology of Lipreading.

Yau, W. C. et al., 2006. Visual Speech Recognition Method Using Translation, Scale and Rotation Invariant Features. IEEE International Conference on Advanced Video and Signal Based Surveillance, Sydney, Australia.
