Voiceless Speech Recognition Using Dynamic Visual Speech Features


Wai C. Yau, Dinesh K. Kumar, Sridhar P. Arjunan
School of Electrical and Computer Engineering, RMIT University
GPO Box 2476V, Melbourne, Victoria 3001, Australia
[email protected], [email protected]

VisHCI 2006

Introduction – Voiceless Speech Recognition

Motivations for voiceless systems:
- suitable for users with speech impairment
- does not require the user to make a sound
- invariant to audio noise

Possible non-acoustic speech modalities include:
1. Visual / speechreading – images/video
2. Facial muscle activity – EMG signals
   - NTT DoCoMo, "Voiceless Recognition Without the Voice", May 1, 2004 issue of CIO Magazine
3. Vocal cord movement – electroglottograph (EGG)

Vision-based methods are the least intrusive and are non-invasive.


Related Work

- Appearance-based visual features, derived from the image pixels in the region surrounding the mouth, are used for speechreading in (Potamianos et al. 2004, Liang et al. 2002)
- Shape-based visual features, based on the geometric shape of the mouth and lips, are proposed in (Petajan 1984, Adjoudani et al. 1996)
- A combination of shape and appearance information through active appearance models is used in (Matthews et al. 1998)
- Dynamic visual speech features using the optical flow method are proposed by (Pentland et al. 1989)

The Proposed Technique

The aims of this paper:
- investigate the feasibility of using dynamic visual speech features (visible facial motion) to identify English consonants
- focus on consonant recognition, because consonants are harder to 'hear' and easier to 'see'
- propose a visual system that is
  - suitable for classifying isolated consonants
  - speaker dependent
  - easy to train for different users
  - designed so that the camera can be attached to a headset – ROI extraction is not performed, to reduce computation

The Proposed Approach


Visual Speech Model

- The recognition units are visemes
- The number of visemes may vary with factors such as the geographical location, culture, and age of speakers; hence it is difficult to determine a viseme model suitable for all users
- The viseme model of the MPEG-4 standard is adopted so that the proposed system can be coupled with any MPEG-4-compliant facial animation system


Segmentation of Dynamic Visual Speech Features

- Dynamic visual speech information – facial movement – is represented using motion history images (MHI)
- An MHI is a spatio-temporal template (Bobick & Davis 2001):
  - intensity values indicate 'when' the movements occurred
  - pixel locations indicate 'where' the movements happened
- Step 1: compute the difference of frames (DOF)
- Step 2: convert the DOFs into binary images
- Step 3: multiply the binary images by a linear ramp of time, then take the union of all the binary images, retaining the maximum value at each pixel (see the sketch below)
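A minimal sketch of the three MHI steps, assuming a Python/NumPy setting in which the utterance is supplied as a list of equally sized grayscale frames; the motion threshold value and the frame format are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def motion_history_image(frames, threshold=25):
    """Build one MHI for a whole utterance of N+1 grayscale frames (uint8 arrays)."""
    num_diffs = len(frames) - 1
    mhi = np.zeros_like(frames[0], dtype=np.float32)
    for t in range(num_diffs):
        # Step 1: difference of consecutive frames (DOF)
        dof = np.abs(frames[t + 1].astype(np.int16) - frames[t].astype(np.int16))
        # Step 2: threshold the DOF into a binary motion mask
        binary = (dof >= threshold).astype(np.float32)
        # Step 3: weight the mask by a linear ramp of time and keep the maximum
        # (union), so more recent motion overwrites earlier motion
        ramp = (t + 1) / num_diffs
        mhi = np.maximum(mhi, binary * ramp)
    return mhi  # values in [0, 1]: brighter pixels moved more recently
```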


Segmentation of Dynamic Features

- Removes static elements and preserves short-duration facial movements
- Computationally simple
- Invariant, within limits, to skin colour
- The intensity values of the MHI are normalized to [0, 1] to reduce the overall variation in speaking speed
- Preprocessing of the MHI: discrete stationary wavelet transform (SWT), as sketched below
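A minimal sketch of the preprocessing step, assuming PyWavelets (pywt) for the level-1 Haar SWT; the slides name SWT and the Haar wavelet but do not specify a library.

```python
import numpy as np
import pywt

def preprocess_mhi(mhi):
    # Normalize intensities to [0, 1] to reduce speaking-speed variation
    mhi = mhi.astype(np.float32)
    if mhi.max() > 0:
        mhi = mhi / mhi.max()
    # Level-1 SWT with the Haar wavelet; keep only the approximation sub-image
    (approx, (_h, _v, _d)), = pywt.swt2(mhi, wavelet='haar', level=1)
    return approx
```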


Zernike Moment Features

- Advantages of Zernike moments:
  - simple rotational property
  - outperform other moment features in terms of sensitivity to noise and image representation (Teh & Chin 1988)
- Computed by projecting the image function onto the orthogonal Zernike polynomials
- The MHI is mapped to a unit circle for the computation of the Zernike moments


Zernike Moment Features

- The absolute values of the Zernike moments are rotation invariant
- 49 absolute values of Zernike moments are used as features (see the sketch below)
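A minimal sketch of the feature-extraction step, assuming the mahotas library; Zernike moments up to order 12 give exactly 49 magnitude values, and mapping the square image onto a circle with radius equal to half the image width is an illustrative choice rather than a detail from the slides.

```python
import numpy as np
import mahotas.features

def zernike_features(approx_image):
    # Map the image onto a circle for the radial Zernike polynomials
    radius = approx_image.shape[0] // 2
    # Absolute Zernike moments up to order 12: 49 rotation-invariant values
    feats = mahotas.features.zernike_moments(approx_image, radius, degree=12)
    return np.asarray(feats)
```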


Neural Network Classifier

- A multilayer perceptron (MLP) artificial neural network (ANN) trained with the back-propagation (BPN) learning algorithm is used (see the sketch below)
- Characteristics of the supervised ANN classifier:
  - generalization ability
  - does not require information on the statistical properties of the data
  - fault tolerance
  - may be suboptimal, but is an easy tool to implement as a first step
  - a trained ANN has fast classification speed
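A minimal training sketch, assuming scikit-learn's MLPClassifier as a stand-in for the back-propagation-trained MLP; the hidden-layer size, learning rate, and iteration count are assumptions, since the slides do not give the network architecture.

```python
from sklearn.neural_network import MLPClassifier

def train_classifier(train_features, train_labels):
    # One hidden layer; 'sgd' performs gradient descent on back-propagated errors
    clf = MLPClassifier(hidden_layer_sizes=(50,), solver='sgd',
                        learning_rate_init=0.01, max_iter=2000, random_state=0)
    clf.fit(train_features, train_labels)   # train_features: (n_samples, 49)
    return clf
```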


Methodology

- Tested on 9 English consonants based on the MPEG-4 viseme model
- Video recording and processing:
  - recorded with a low-resolution camera in an office environment
  - single speaker – frontal view of the mouth
  - each consonant was repeated 20 times
  - a total of 180 true-colour AVI files (30 frames/sec)
  - one MHI (240 x 240 pixel resolution) generated from each AVI file
  - level-1 SWT with the Haar wavelet was applied to the MHIs, and the approximation image is used for analysis


Methodology

- Feature extraction:
  - 49 Zernike moments computed from each SWT approximation image of an MHI
- Classification:
  - 90 MHIs were used to train the ANN and the remaining 90 MHIs were used to test it
  - experiments were repeated over 10 trials through random sub-sampling of the training and testing data (see the sketch below)
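A minimal sketch of the evaluation protocol (10 random sub-sampling trials with a 90/90 split), again assuming scikit-learn; the stratified split and the classifier settings are illustrative assumptions rather than details from the slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def evaluate(features, labels, n_trials=10):
    scores = []
    for trial in range(n_trials):
        # 90 samples for training, 90 for testing, re-drawn in every trial
        X_tr, X_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.5, stratify=labels, random_state=trial)
        clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000,
                            random_state=trial).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))      # per-trial recognition rate
    return np.mean(scores), np.std(scores)        # mean and standard deviation
```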

Results

- Mean recognition rate: 84.7%
- Standard deviation: 2.8%


Discussion

- Albeit very preliminary, for pure speechreading the results obtained are encouraging
- The results demonstrate that the proposed dynamic/motion features are suitable for consonant recognition
- One possible reason for misclassification is occlusion of the articulator movements; the MPEG-4 viseme model does not take the visibility of the speech articulators into account
- At this stage the proposed technique is designed only for isolated phones, not continuous and spontaneous speech (the speech segmentation and co-articulation problems are thus avoided)
- The proposed dynamic features can be combined with 'static' visual features for speechreading

Conclusion

- We propose a new method to recognize consonants from dynamic visual speech features by combining MHI, SWT, Zernike moments, and a neural network classifier
- The results suggest that dynamic features can be used to distinguish a number of English consonants reliably
- Such a system could be used
  - to drive computerized machinery in noisy environments
  - to help disabled people use a computer/machine
  - for voiceless communication

Future Work

- Design an enhanced visual speech model
- Increase the dataset and the number of speakers
- Determine an optimal classifier for the proposed system
- Incorporate ROI detection and extraction
- Extend the system to word recognition tasks

Thank you. Questions?
