A Compact Representation of Visual Speech Data Using Latent Variables

Ziheng Zhou, Member, IEEE, Xiaopeng Hong, Member, IEEE, Guoying Zhao, Senior Member, IEEE, and Matti Pietikäinen, Fellow, IEEE

The authors are with the Center for Machine Vision Research, Department of Computer Science and Engineering, University of Oulu, P.O. Box 4500, FI-90014 Oulu, Finland. E-mail: {ziheng.zhou, xiaopeng.hong, guoying.zhao, mkp}@ee.oulu.fi.

Abstract—The problem of visual speech recognition involves the decoding of the video dynamics of a talking mouth in a high-dimensional visual space. In this paper, we propose a generative latent variable model to provide a compact representation of visual speech data. The model uses latent variables to separately represent the interspeaker variations of visual appearances and those caused by uttering within images, and incorporates the structural information of the visual data through placing priors of the latent variables along a curve embedded within a path graph.

Index Terms—Visual speech recognition, compact representation, latent variable

1 INTRODUCTION

It is known that human speech perception is a bimodal process that makes use of information not only from what we hear (acoustic) but also from what we see (visual) [16]. In machine vision, visual speech recognition (VSR), sometimes also referred to as automatic lip-reading, is the task of recognizing utterances by analyzing the visual recordings of a speaker's talking mouth without any acoustic input. Although visual information cannot in itself provide normal speech intelligibility, it may be sufficient within a particular context when the utterances to be recognized are limited. In such a case, VSR can be used to enhance natural human-computer interaction through speech, especially when audio is not accessible or is severely corrupted. In this paper, we consider the problem of VSR in such a constrained scenario.

The key question to answer in VSR is how to characterize the highly dynamic process of uttering in a high-dimensional visual data space. As for many computer vision problems, the high dimensionality of the image data leads us to the common assumption that the actual dimension of the underlying structure of the visual speech (VS) data is significantly smaller than that of the observed image sequences. Therefore, the goal of this work is to learn a compact representation for the visual speech data.

In acoustic speech recognition, techniques such as vocal-tract normalization [13] and maximum likelihood linear transformations [8] have been developed to counter the variability in the acoustic signal among different speakers. In the visual domain, we face the same problem of speaker dependence. There are two major sources of variations in images: the interspeaker variations of visual appearances of the mouth region and the variations of the mouth's shape and texture caused by speaking an utterance. The former contain information about the speakers' identities, which is irrelevant to the problem of VSR, while the latter contain information that can be used to characterize the utterance. It is therefore desirable that the representation be designed such that these two sources of variations can be modeled separately. Moreover, since utterances are recorded as image sequences, the representation also has to preserve the structure of the visual data; that is, if two images are observed at nearby positions in an utterance, their representations should also be close in the latent space.


In this paper, we propose a latent variable model (LVM) to learn the compact representation. The model is generative in the sense that an observed image sequence is assumed to be generated from one shared latent speaker variable (LSV) and a sequence of latent utterance variables (LUVs). The former accounts for the interspeaker variations of visual appearances and the latter for the variations caused by uttering. We model the structure of image sequences of the same utterance by a path graph and incorporate the structural information by using the low-dimensional curve embedded within the graph as prior knowledge on the locations of the LUVs in the latent space. In this way, we can impose soft constraints to penalize values of LUVs that contradict the modeled structure.

The proposed method is tested on two data sets that contain typical preprocessing errors and varying speaking speeds in the speaker-independent test scenario. It is compared with other compact representations, and the results show the effectiveness of our method.

2 BACKGROUND

2.1 Visual Speech Recognition

Most previous studies on VSR have focused on two separate tasks: 1) how to design a set of compact and informative visual features for recognition and 2) how to model the dynamic process of uttering in the feature space. A comprehensive survey can be found in [20].

For the first task, principal component analysis (PCA) [2], [6] and the 2D discrete cosine transform (DCT) [9], [17] were applied to the gray-scale pixel values of the normalized mouth region to obtain visual features. Matthews et al. [15] implemented the active appearance model (AAM) to calculate compact representations of both shapes and appearances of a talking mouth. Lan et al. [11] compared various features used in previous studies. In [12], [21], a compact representation named "HiLDA" was introduced for VSR. Visual features extracted from successive frames were concatenated to form multiframe vectors that contain temporal information. The vectors corresponding to the same viseme were grouped together and linear discriminant analysis was performed on the groups to compute the HiLDA features. Saenko et al. [23] proposed to use articulatory features (AFs). They trained SVM classifiers for the AFs and used the intermediate scores for VSR.

For the second task, hidden Markov models (HMMs) [22] were widely used to characterize the dynamics in the feature space [2], [6], [9], [15], [17]. In [23], the more general dynamic Bayesian networks (DBNs) were constructed for the task. From a different perspective, Zhao et al. [25] calculated spatiotemporal local texture features directly from image sequences and trained SVMs on the features for classification.

In our previous work [27], we proposed to use a path graph to represent the structure of an image sequence of a given utterance. A mapping was learned such that a reference sequence could be projected onto the curve embedded in the path graph. It was then applied to other sequences for classifying the utterance. Since the model was trained on one image sequence of one particular speaker, it generalized poorly on data from other speakers. In this work, we use a path graph to model the structure of all the visual data within an utterance instead of one single sequence.


The structural information is incorporated into our model by placing priors of LUVs along the embedded curve. Compared with the above-mentioned methods, our model identifies two different sources of variations in images, those from the visual appearances of various speakers (which are irrelevant to the problem of VSR) and those caused by uttering, and tries to separate them explicitly. Moreover, the model takes the structure of image sequences into account when searching for low-dimensional representations of visual speech data.

2.2 Latent Variable (LV) Models

Latent variable models have been widely used to tackle various computer vision problems, such as object recognition [7], human pose estimation [10], [24], face recognition (FR) [14], and speaker verification [4]. The proposed model is closely related to the one proposed in [14] in the sense that they share the idea of using two separate LVs to represent the interpersonal variations of visual appearances and the remaining variations caused, for example, by the process of uttering in VSR or by the various poses and illumination conditions in FR. What makes our model differ from theirs is that ours is designed to generate a sequence of images describing a dynamic process instead of a single image. To do that, we carefully construct the prior distributions of the LUVs to reflect the connections between images.

This work is also related to the research done in [10], [24]. In those works, values of LVs for the observed training data were directly calculated through spectral embedding such that the LVs were able to preserve the geometric information of the training data points. We also want to build into the LUVs information that reflects the structure of the data points. Instead of using distances directly measured in the image space based on some metric, we use a path graph to model the intrinsic structure of the visual speech data within an utterance. The LUVs describing the talking motions are assumed to be situated around the low-dimensional curve embedded within the graph. In this way, we are able to impose soft constraints on the LUVs to incorporate the structural information in the LV space.

3 PROPOSED LATENT VARIABLE MODEL

We assume that an observed image sequence of a mouth speaking an utterance will be generated from some low-dimensional latent variables by a noisy process. Our goal is to learn models that describe the sequence-generating process and to use them for accurate predictions of the unknown data within the low-dimensional latent variable space. In particular, given an image sequence of length $T$, $X = \{x_t\}_{t=1}^{T}$, we consider the images $x_t$ to have been generated from a latent speaker variable $h$ and a sequence of latent utterance variables $\{w_t\}_{t=1}^{T}$ in the linear form

$$x_t = \mu + F h + G w_t + \epsilon_t. \quad (1)$$

LSV $h$ represents the interspeaker variations of visual appearances of the mouth area and LUV $w_t$ accounts for the dynamic changes of the shapes and textures of a talking mouth during the process of uttering. Both variables are multivariate and continuous. Matrix $F$ is a factor matrix whose columns span the interspeaker subspace, while matrix $G$ contains a basis of the subspace describing the variations caused by uttering. Vector $\mu$ is the global mean and $\epsilon_t$ are independently distributed noise terms with zero mean and diagonal covariance $\Sigma$. In terms of conditional probabilities, the proposed model can also be described as

$$p(x_t \mid h, w_t) = \mathcal{N}(x_t \mid \mu + F h + G w_t, \Sigma), \quad (2)$$
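To make the generative process in (1)-(2) concrete, the following minimal NumPy sketch samples a $T$-frame sequence from hypothetical parameters; the latent dimensions (19 and 6) follow the paper's experimental settings, but the image size, parameter values, and variable names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: D-pixel images, a 19-dim speaker subspace,
# a 6-dim utterance subspace, and a T-frame sequence.
D, dim_h, dim_w, T = 1024, 19, 6, 20

mu = rng.normal(size=D)                  # global mean
F = rng.normal(size=(D, dim_h))          # interspeaker factor matrix
G = rng.normal(size=(D, dim_w))          # utterance factor matrix
sigma2 = 0.01 * np.ones(D)               # diagonal of the noise covariance Sigma

h = rng.normal(size=dim_h)               # one latent speaker variable per sequence
W = rng.normal(size=(T, dim_w))          # one latent utterance variable per frame

# Eq. (1): x_t = mu + F h + G w_t + eps_t, applied to all frames at once.
X = mu + F @ h + W @ G.T + rng.normal(scale=np.sqrt(sigma2), size=(T, D))
print(X.shape)  # (T, D): one generated image (as a pixel vector) per frame
```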

Fig. 1. (a) A graphical model showing the distributions of images $\{x_t\}$ of a $T$-frame sequence conditioned on the shared variable $h$ and a set of variables $\{w_t\}$. (b) Prior distributions of $\{w_t\}$ constructed along the curve embedded within a path graph. Here, the curve is plotted in 3D for the purpose of illustration. However, it could have a higher dimension, e.g., six in our experiments. Each prior is defined as a Gaussian with an isotropic covariance matrix and the red shaded region spans $\pm\sigma$ around the means (red dots).

where $\mathcal{N}(z \mid a, B)$ denotes a Gaussian distribution in $z$ with mean vector $a$ and covariance matrix $B$. Following [14], we define a simple prior on $h$ as $p(h) = \mathcal{N}(h \mid 0, I)$.

As illustrated in Fig. 1a, the proposed model allows the LUVs $w_t$ to be independent. Such a structure provides us with efficient, closed-form solutions for recognition and training, as will be shown shortly. However, it somewhat contradicts the fact that the model is built for a sequence of images describing a continuous process rather than a single image, and therefore the $w_t$ should preserve the sequential structure of the observed images. To correlate the $w_t$, we carefully construct Gaussian priors that impose soft constraints on their locations. To incorporate information about the sequential structure, the Gaussian means are placed along a curve that can be used to characterize the dynamic process of uttering in the latent variable space. In this paper, we choose the continuous curve embedded within a path graph to build $p(w_t)$, as illustrated in Fig. 1b. Mathematically, the prior $p(w_t)$ can be written as

$$p(w_t) = \mathcal{N}\!\left(w_t \,\Big|\, f\!\left(\frac{t-1}{T-1}\right), \sigma^2 I\right), \quad (3)$$

where $f$ denotes the embedded curve defined on $[0, 1]$ and $\sigma^2$ is a constant that controls the variances. In the next section, we will explain our choice in detail.

3.1 Path Graph Modeling

We use a path graph $P_n$ to represent the underlying structure of image sequences that record a particular utterance. Here, $n$ is the number of nodes. As shown in Fig. 2a, the nodes are connected as a chain. Instead of considering the nodes as the observed images, as we did in our previous work [27], we let them represent data points (either observed or unobserved) generated during the process of speaking the utterance.

Our goal is to search for low-dimensional representations of the $n$ nodes that best reflect the structure of $P_n$. This can be achieved by solving the problem of mapping the graph onto a line so that connected points stay as close together as possible [1]. Let $J \in \{0, 1\}^{n \times n}$ be the adjacency matrix such that $J_{ij} = 1$ if nodes $i$ and $j$ are connected, and $0$ otherwise. Let $y = [y_1, y_2, \ldots, y_n]^T$ be a map. We can obtain $y$ by

$$y = \arg\min_{y^T y = c} \sum_{i,j} (y_i - y_j)^2 J_{ij}, \quad (4)$$

where $c$ is a constant. According to [3], $y$ can be calculated from the eigenvectors of the graph Laplacian $L$, which is defined as $L = D - J$, where $D$ is a diagonal matrix with the $i$th diagonal entry computed as $D_{ii} = \sum_{j=1}^{n} J_{ij}$.


Based on the definition of $L$, there are $n-1$ eigenvectors with nonzero eigenvalues. Here, we use $y_k$ to denote the map (or eigenvector) with the $k$th smallest eigenvalue. Moreover, as illustrated in Figs. 2b, 2c, 2d, 2e, 2f, and 2g, each map $y_k$ lies on a sine wave $f_k^n$ such that

$$y_k^u = f_k^n\!\left(\frac{u-1}{n-1}\right), \qquad f_k^n(s) = \sin\!\left(\frac{(n-k)k\pi s}{n} + \frac{(n+k)\pi}{2n}\right), \quad (5)$$

where $y_k^u$ is the projected value of node $u = 1, 2, \ldots, n$. In such a way, we can obtain an $r$-dimensional representation for node $u$ as $[f_1^n(\frac{u-1}{n-1}), f_2^n(\frac{u-1}{n-1}), \ldots, f_r^n(\frac{u-1}{n-1})]^T$, where $r \le n-1$. It can be seen that such representations of all the nodes lie on the continuous and deterministic curve $f: [0, 1] \to \mathbb{R}^r$:

$$f(s) = \left[f_1^n(s), f_2^n(s), \ldots, f_r^n(s)\right]^T. \quad (6)$$

Based on this finding, we consider the embedded curve $f$ as being capable of describing the continuous process of speaking the utterance and assume that all other points on the curve (besides those corresponding to the nodes) can also be used to represent data points generated in the process. We also assume that the space of LUVs coincides with the $r$-dimensional space and that the LUVs are located around $f$ in that space. Based on these assumptions, we place $p(w_t)$ along $f$, as defined in (3) and illustrated by Fig. 1b, to penalize values of $w_t$ that contradict our assumptions.

Fig. 2. (a) A path graph and (b)-(g) sine waves (dashed lines) that form the first to sixth dimensions of the curve embedded within the path graph $P_{15}$. The black dots show the values of the eigenvectors.
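The construction in (4)-(6) and the prior means of (3) can be sketched in a few lines of NumPy. The settings below ($n = 15$, $r = 6$, a $T = 20$ frame sequence) follow the paper, but the helper names are illustrative and the curve simply evaluates (5)-(6) as written above.

```python
import numpy as np

def path_graph_laplacian(n):
    """Laplacian L = D - J of the path graph P_n used in eq. (4)."""
    J = np.zeros((n, n))
    idx = np.arange(n - 1)
    J[idx, idx + 1] = J[idx + 1, idx] = 1.0    # chain connections
    D = np.diag(J.sum(axis=1))
    return D - J

def embedded_curve(s, n, r):
    """Evaluate the curve f(s) of eq. (6), whose kth dimension is the
    sine wave f_k^n(s) of eq. (5); s is an array of values in [0, 1]."""
    s = np.atleast_1d(s)[:, None]              # (S, 1)
    k = np.arange(1, r + 1)[None, :]           # (1, r)
    return np.sin((n - k) * k * np.pi * s / n + (n + k) * np.pi / (2 * n))

n, r, T = 15, 6, 20
L = path_graph_laplacian(n)                                    # graph structure of P_15
prior_means = embedded_curve(np.arange(T) / (T - 1), n, r)     # f((t-1)/(T-1)) in eq. (3)
print(L.shape, prior_means.shape)                              # (15, 15) and (20, 6)
```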

3.2 Posterior Estimate

Given model parameters $\theta = \{\mu, F, G, \Sigma\}$ and an image sequence $X = \{x_1, x_2, \ldots, x_T\}$, we want to calculate the posterior probability of the latent variables, $p(h, W \mid X)$, where $h$ is the shared LSV and $W = \{w_t\}_{t=1}^{T}$ the corresponding LUVs. Following the trick in [14], we combine the generative equations (see (1)) for all the $T$ images:

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_T \end{bmatrix} = \begin{bmatrix} \mu \\ \mu \\ \vdots \\ \mu \end{bmatrix} + \begin{bmatrix} F & G & 0 & \cdots & 0 \\ F & 0 & G & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ F & 0 & 0 & \cdots & G \end{bmatrix} \begin{bmatrix} h \\ w_1 \\ w_2 \\ \vdots \\ w_T \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T \end{bmatrix},$$

or, giving names to these composite vectors and matrices,

$$\tilde{x} = \tilde{\mu} + A y + \tilde{\epsilon}. \quad (7)$$

Let $f_t = f\!\left(\frac{t-1}{T-1}\right)$. The above equation can be rewritten in terms of probabilities as

$$p(\tilde{x} \mid y) = \mathcal{N}(\tilde{x} \mid \tilde{\mu} + A y, \tilde{\Sigma}), \quad (8)$$

$$p(y) = \mathcal{N}(y \mid m, C), \quad (9)$$

where

$$\tilde{\Sigma} = \begin{bmatrix} \Sigma & 0 & \cdots & 0 \\ 0 & \Sigma & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Sigma \end{bmatrix}, \qquad m = \begin{bmatrix} 0 \\ f_1 \\ f_2 \\ \vdots \\ f_T \end{bmatrix}, \qquad C = \begin{bmatrix} I & 0 & \cdots & 0 \\ 0 & \sigma^2 I & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 I \end{bmatrix}.$$

We can then calculate

$$p(y \mid \tilde{x}) = \mathcal{N}\!\left(y \,\Big|\, \hat{\Sigma}\left\{A^T \tilde{\Sigma}^{-1}(\tilde{x} - \tilde{\mu}) + C^{-1} m\right\}, \hat{\Sigma}\right), \quad (10)$$

where

$$\hat{\Sigma} = \left(C^{-1} + A^T \tilde{\Sigma}^{-1} A\right)^{-1}. \quad (11)$$

It can be seen that (11) provides us with a closed-form solution to the posterior $p(h, W \mid X)$.
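Under the stated Gaussian assumptions, (10)-(11) is a standard linear-Gaussian computation. The NumPy sketch below builds the composite matrix $A$ and the prior quantities for one sequence and returns the posterior mean of $y = [h; w_1; \ldots; w_T]$; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def posterior_mean(X, mu, F, G, Sigma_diag, prior_means, sigma2):
    """Posterior mean of y = [h; w_1; ...; w_T] for one sequence (eqs. (7)-(11)).

    X: (T, D) images, prior_means: (T, r) curve points f((t-1)/(T-1))."""
    T, D = X.shape
    dim_h, r = F.shape[1], G.shape[1]

    # Composite loading matrix A: row block t is [F, 0, ..., G, ..., 0].
    A = np.zeros((T * D, dim_h + T * r))
    for t in range(T):
        A[t * D:(t + 1) * D, :dim_h] = F
        A[t * D:(t + 1) * D, dim_h + t * r:dim_h + (t + 1) * r] = G

    x_tilde = (X - mu).reshape(-1)                       # x~ - mu~
    Sinv = np.tile(1.0 / Sigma_diag, T)                  # diagonal of Sigma~^{-1}
    m = np.concatenate([np.zeros(dim_h), prior_means.reshape(-1)])
    Cinv = np.concatenate([np.ones(dim_h), np.full(T * r, 1.0 / sigma2)])

    # Eq. (11): posterior covariance; eq. (10): posterior mean.
    Sigma_hat = np.linalg.inv(np.diag(Cinv) + A.T * Sinv @ A)
    return Sigma_hat @ (A.T @ (Sinv * x_tilde) + Cinv * m)
```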



3.3 Learning Parameters

For a particular utterance, we aim to learn the model parameters $\theta = \{\mu, F, G, \Sigma\}$ based on the observed training data. The learning is carried out through the expectation-maximization (EM) algorithm [5]. Let $\mathcal{X}$ denote the training set that consists of $N$ image sequences, where each sequence $X_i$ contains $T_i$ images $\{x_{ij}\}_{j=1}^{T_i}$. During each EM iteration, the E-step updates the posterior distribution of all the corresponding latent variables conditioned on $\mathcal{X}$ using the existing model parameters. The posterior can be calculated as the product of the posteriors conditioned on the data of individual speakers in $\mathcal{X}$. Moreover, these posteriors can be measured following the same strategy as we used to infer (10). In the M-step, we update $\theta$ by maximizing the expectation of the logarithm of the joint probability of $\mathcal{X}$ and all the corresponding latent variables under the posterior measured in the E-step. As shown in [14], the model parameters can be updated as

$$\mu = \frac{1}{\sum_i T_i} \sum_{i,j} x_{ij}, \quad (12)$$

$$B = \left(\sum_{i,j} (x_{ij} - \mu)\,\mathrm{E}[z_{ij}]^T\right)\left(\sum_{i,j} \mathrm{E}\!\left[z_{ij} z_{ij}^T\right]\right)^{-1}, \quad (13)$$

$$\Sigma = \frac{1}{\sum_i T_i} \sum_{i,j} I \circ \left\{(x_{ij} - \mu)(x_{ij} - \mu)^T - B\,\mathrm{E}[z_{ij}](x_{ij} - \mu)^T\right\}, \quad (14)$$

where $B = [F, G]$, $z_{ij} = [h_i^T, w_{ij}^T]^T$, and $I \circ D$ denotes the operation that sets the off-diagonal elements of matrix $D$ to zero. Here, $h_i$ is the LSV learned for the speaker in sequence $X_i$. For sequences from the same speaker, their $h_i$ remain the same.

Fig. 3 shows how the model interpreted an image sequence after performing 10 iterations of the EM algorithm. Figs. 3a, 3b, 3c, 3d, and 3e visualize the global mean, the original image sequence, the visual appearance represented by the maximum a posteriori (MAP) estimate of the LSV, the variations caught by the MAP estimate of the LUVs, and the corresponding reproduced image sequence. The dimensions of the LSV and LUVs were set to 19 and 6, respectively. It can be seen that the LSV well captured the visual characteristics of the mouth region of the speaker and the LUVs the variations caused by speaking the utterance within a space of only six dimensions.

Fig. 3. (a) Global mean $\mu$, (b) an image sequence of the utterance "excuse me," (c) the image of $\mu + F\hat{h}$, (d) images of $G\hat{w}_t$ (positive values displayed in red and negative in green), and (e) the images of $\mu + F\hat{h} + G\hat{w}_t$. Here, $\hat{h}$ and $\{\hat{w}_t\}$ are the MAP estimates of the latent variables.
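As a rough illustration of the M-step in (12)-(14), the sketch below updates $\mu$, $B = [F, G]$ and the diagonal $\Sigma$ from sufficient statistics $\mathrm{E}[z_{ij}]$ and $\mathrm{E}[z_{ij} z_{ij}^T]$ that an E-step (for example, one built on the posterior sketch above) would provide. It is a schematic under those assumptions, not the authors' implementation.

```python
import numpy as np

def m_step(X_all, Ez, Ezz):
    """One M-step update following eqs. (12)-(14).

    X_all: (M, D) all training images stacked over sequences and frames,
    Ez:    (M, q) posterior means E[z_ij] with z_ij = [h_i; w_ij],
    Ezz:   (M, q, q) posterior second moments E[z_ij z_ij^T]."""
    M, D = X_all.shape
    mu = X_all.mean(axis=0)                              # eq. (12)
    Xc = X_all - mu

    B = (Xc.T @ Ez) @ np.linalg.inv(Ezz.sum(axis=0))     # eq. (13), B = [F, G]

    # Eq. (14): keep only the diagonal of the residual covariance.
    Sigma_diag = np.einsum('md,md->d', Xc, Xc) / M \
                 - np.einsum('dq,mq,md->d', B, Ez, Xc) / M
    return mu, B, Sigma_diag
```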

3.4 Classification

Suppose that there are in total $M$ utterances $\{U_m\}_{m=1}^{M}$ to be recognized and that we have trained models $\{\mathcal{M}_m\}_{m=1}^{M}$ for all of them. Given an unknown image sequence, our goal is to find out which utterance it belongs to. Since the LSV is irrelevant to our problem, we focus on decoding the utterance identities from the LUVs.

Let the unknown sequence have $T$ frames and let $\hat{W}^m = \{\hat{w}_1^m, \hat{w}_2^m, \ldots, \hat{w}_T^m\}$ be the MAP estimate of the LUVs based on $\mathcal{M}_m$. We need to calculate a score $\gamma_m$ that quantifies the possibility of the sequence belonging to the class. Since we are dealing with a sequence of vectors, any standard technique for modeling multivariate time series can be used to do the job. In this work, we propose to use a rather simple approach. Since we have incorporated into our model strong prior knowledge about the LUVs, namely that they are located around the embedded curve $f$ in the latent space, we would expect $\hat{W}^m$ to show the same characteristic if the sequence belonged to $U_m$.

We simply compute $\gamma_m$ as the sum of the cross correlations between the trajectories of $\hat{W}^m$ and $f$ along each dimension. Recall that the projection of $f$ along each dimension is just a sine wave. Mathematically, the score $\gamma_m$ can be written as

$$\gamma_m = \sum_{t=1}^{T} \left(\hat{w}_t^m\right)^T f\!\left(\frac{t-1}{T-1}\right). \quad (15)$$

Note that we have tried techniques such as HMMs and dynamic time warping and found that the proposed simple method performed the best.
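A minimal sketch of the scoring rule in (15), assuming a helper map_estimate_luv(X, model) that returns the $T \times r$ MAP trajectory of the LUVs under an utterance model; both the helper and the model objects are placeholders introduced here for illustration.

```python
import numpy as np

def utterance_score(W_hat, prior_means):
    """Eq. (15): sum of cross correlations between the MAP LUV trajectory
    W_hat (T, r) and the curve points f((t-1)/(T-1)) stored in prior_means (T, r)."""
    return float(np.sum(W_hat * prior_means))

def classify(X, models, map_estimate_luv, prior_means):
    """Pick the utterance model whose MAP LUVs best follow the embedded curve."""
    scores = [utterance_score(map_estimate_luv(X, m), prior_means) for m in models]
    return int(np.argmax(scores))
```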

4 EXPERIMENTAL DETAILS

4.1 Data

We trained and tested the proposed latent variable model on two visual speech data sets. The first is the OuluVS database [25]. It consists of 10 daily-use short utterances, listed in Table A-1, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.173, each read by 20 subjects up to nine times. The video data were preprocessed in two different ways. In one way, we manually located the two eyes and resized and rotated the images to fix the eye positions. The mouth region was then cropped from each image. In the other way, we preprocessed the video data automatically following [25]. Fig. 4 gives some typical examples of the manually and automatically preprocessed mouth images. It can be seen that the latter are clearly more difficult due to the imperfect localization and rescaling of the mouth region.

In [23], Saenko et al. collected a small VS data set containing three subjects speaking utterances that could be used to control a radio system. Moreover, the utterances were spoken at various speeds. Such a database would be useful when testing techniques for building a practical VSR system. Unfortunately, it has not been made publicly available. We therefore collected our own data set, named "RadioVS," using the same data-collection protocol as in [23]. The data set includes 10 subjects speaking 20 phrases (see Table A-2, available in the online supplemental material). Each phrase was repeated three times. Subjects were asked to clearly enunciate the utterances during the first repetition and to speak successively faster during the second and third repetitions. We name the data collected during the three repetitions "slow," "medium," and "fast," respectively. During preprocessing, the eyes were located manually, the images were rotated and resized, and the mouth region was cropped for recognition.

4.2 Parameters

During the learning stage, we had to initialize the model parameters $\Sigma$, $F$, and $G$ in the EM algorithm. In our experiments, we used the sample variances measured from the training data to initialize the diagonal elements of $\Sigma$. Following [14], we used the principal components of the interspeaker scatter matrix computed from the training data to initialize $F$. The initial value of $G$ was computed as the least-squares solution to the linear regression of the Gaussian means in (3) on the global-mean-removed training data. Let $x_{ij}$ denote the $j$th image of the $i$th sequence in the training data. Its corresponding Gaussian mean $f_{ij}$ can be computed as $f_{ij} = f\!\left(\frac{j-1}{T_i-1}\right)$. Let $\tilde{X} = [x_{ij} - \mu]$ and $Y = [f_{ij}]$. We initialized $G$ with $\left((YY^T)^{-1} Y \tilde{X}^T\right)^T$.

In addition to the model parameters, we had to set the hyperparameters $n$ and $\sigma^2$ to run the system. Recall that $n$ is the number of nodes in the path graph and $\sigma^2$ the constant in the definition of the Gaussian prior. These two parameters were decided based on cross validation on the training data. We set $n = 15$ in all our experiments, while the value of $\sigma^2$ varied depending on the input visual data.
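The least-squares initialization of $G$ described above can be written in a few lines. The sketch assumes the training images are stacked as rows of a matrix with the global mean already removed and the matching curve points as rows of a second matrix; the function and argument names are illustrative.

```python
import numpy as np

def init_G(X_centered, Y_means):
    """Least-squares regression of mean-removed images on the Gaussian means:
    G = ((Y Y^T)^{-1} Y X~^T)^T, with images and curve points stored as rows.

    X_centered: (M, D) global-mean-removed images x_ij - mu,
    Y_means:    (M, r) curve points f((j-1)/(T_i-1))."""
    Y = Y_means.T                                      # (r, M), columns are the f_ij
    return np.linalg.solve(Y @ Y.T, Y @ X_centered).T  # (D, r)
```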

Fig. 4. Examples of the manually (a) and automatically (b) preprocessed visual speech data.


Fig. 5. (a) Overall recognition rates against the dimensionality of LUVs, and (b) recognition rates on individual speakers.

4.3 Comparison

We compared our model with other techniques for compact representation of visual speech data, including the combination of the 2D discrete cosine transform and principal component analysis [9], [17], the active appearance model [11], [15], HiLDA upon DCT and AAM [12], [21], and the articulatory features [23]. The AFs were decoded by DBNs and the others by HMMs. For PCA, we set the number of retained principal component coefficients to 75. We hand-labeled 40 mouth images for each speaker and built a separate AAM for tracking each speaker's mouth. Moreover, we manually initialized the AAM tracker on the first frame of each image sequence to maximize the fitting performance. For HiLDA, we used the viseme classes defined in Table A-3, available in the online supplemental material. Finally, we chose to use the three AFs and the "whole-word" DBN models with asynchrony as described in [23]. The parameters of the HMMs and DBNs were manually tuned to maximize the performance.
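As an illustration of the DCT+PCA style of baseline (not the authors' exact pipeline), one common recipe keeps a block of low-frequency 2D-DCT coefficients of the normalized mouth image and then reduces them with PCA; a sketch using SciPy and scikit-learn, with placeholder data and an assumed block size:

```python
import numpy as np
from scipy.fftpack import dctn
from sklearn.decomposition import PCA

def dct_features(mouth_images, block=12):
    """Keep the top-left block x block low-frequency 2D-DCT coefficients per frame."""
    feats = [dctn(img, norm='ortho')[:block, :block].ravel() for img in mouth_images]
    return np.asarray(feats)

# mouth_images: iterable of 2D grayscale mouth-region arrays (random placeholders here).
mouth_images = np.random.rand(100, 32, 48)
pca = PCA(n_components=75)                 # 75 coefficients retained, as in Section 4.3
features = pca.fit_transform(dct_features(mouth_images))
print(features.shape)                      # (100, 75)
```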

5 EXPERIMENTS

VSR experiments were conducted in the speaker-independent setting, that is, the system was trained on a data set that did not include the test speaker. In our experiments, we performed leave-one-speaker-out cross validation: in each turn, data from one speaker were used for testing and data from the other speakers for training.

We first tested our system on the manually preprocessed OuluVS database. The proposed latent variable model was used to learn a compact representation from the high-dimensional raw pixel values. Here, we set $\dim(h) = 19$ (the number of speakers in the training data set minus one) and $\sigma^2 = 0.005$. To show how the dimension of the latent utterance variables affected the system performance, we gradually increased the dimension; the results are shown in Fig. 5a. (Note that a recognition rate is defined as the total number of correctly recognized sequences divided by the total number of test sequences and will be used to report results in the rest of this paper.) It can be seen that the performance increased rapidly at the beginning and became steady after the dimension reached six. Recall that increasing the dimension is equivalent to using more sine waves with higher frequencies to model the structure of image sequences within an utterance. This indicates that using sine waves with frequencies higher than a certain level does not contribute much to the overall performance. In the following experiments, we fixed the dimensionality of the LUVs at six.

Fig. 5b shows the recognition rates for individual speakers. Here we denote the $i$th speaker by S$i$. Our system achieved the highest recognition rate on S3 and the lowest on S4. We analyzed the results for these two speakers and found that the MAP estimate of the LSV, $\hat{h}$, could not capture the personal appearance of S4, as illustrated in Fig. 6, causing the failure of the LVM to generate images representing S4. On the contrary, it can be seen that the trained model well characterized the mouth appearance of S3. Since there are only 20 speakers in the OuluVS database, we would expect the learned interspeaker subspace to become more representative once more speakers were included in the training set.

Fig. 6. Comparisons between the true mean images and the images generated from the MAP estimate of the LSV, $\hat{h}$, for speakers S3 and S4.

We also singled out individual dimensions of the LUVs to check their contributions to the overall VSR performance. Fig. 7 shows the results. Keep in mind our assumption that the LUVs are located around the curve $f$. Along each dimension, the projected curve is a sine wave, and we index the dimensions according to the frequencies, from the lowest to the highest. It can be seen that the first dimension (D[1]) significantly outperformed the others. We also combined the three dimensions with the highest rates (D[1, 5, 6]), and their performance was merely 6 percent better than that of D[1]. However, when we took all six dimensions (D[1-6]) into account, the performance jumped by more than 30 percent, showing the necessity of the dimensions despite their individually poor performance.

Fig. 7. Contributions of individual dimensions of LUVs (or their combinations) to the overall VSR performance. Here, "D" stands for "dimension" and the digits in the brackets give the indices of the used dimension(s).

Table 1 gives the confusion matrix for the 10 utterances in the database. The number at the intersection of the $i$th row and $j$th column gives the percentage of the $i$th utterance being classified as the $j$th by the LVM. The biggest confusion occurs between the third utterance, "Hello," and the eighth, "Thank you."

Fig. 8 shows the comparative results of various compact representations of visual speech data on the OuluVS database. On the manually preprocessed data, the proposed LVM achieved the highest recognition rate.

TABLE 1 Confusion Matrix Showing the Recognition Results for the 10 Utterances in the OuluVS Database


Fig. 8. VSR results on the OuluVS database.

TABLE 2 Parameters Used for Computing the LBP and LBP-TOP Features


Fig. 9. VSR results of various compact representations of visual speech data on the RadioVS data set.

TABLE 3 VSR Results of the LVM Using the LBP and LBP-TOP Features as Input on the OuluVS Database

Fig. 10. VSR results of the proposed LVM on the RadioVS data set.

The HiLDA features performed significantly better than the rest of the features. The AAM outperformed DCT+PCA, consistent with the findings reported in [11]. On the more challenging automatically preprocessed data, all representations suffered a significant loss in recognition rate. It can be seen that the AAM and AAM+HiLDA features are the most robust against the preprocessing errors. However, a significant amount of manual work was required to build and initialize the speaker-specific AAMs to extract satisfactory features.

We also investigated the use of local spatial and spatiotemporal texture features instead of raw pixel values. To do that, we extracted the spatial LBP [18] and spatiotemporal LBP-TOP [26] features. Table 2 gives the parameters we used. Note that the LBP-TOP features were extracted on a per-frame basis: for each suitable image in a sequence, we extracted features from the volume formed by the current image together with its two neighboring frames. Table 3 reports the recognition rates obtained by the proposed LVM. It can be seen that the two local features are more resilient to the preprocessing errors. The results also show that the temporal information, which characterizes the connections between frames, could significantly boost the performance.

Besides the OuluVS database, we also conducted experiments on the RadioVS data set. Recall that each utterance was repeated three times by a speaker at three different speeds. We took the data with one particular speed out of the training set during training and conducted tests on the data with that speed in the test set. We also trained and tested the systems using data of all speeds. In this way, we checked how speed affected the VSR performance. Fig. 9 shows the results. The bars show the recognition rates obtained when the training data with the speed associated with the test data were removed during training, while the black lines give the corresponding results when data of all speeds were used for training. It can be seen that the LVM achieved the highest recognition rates and was relatively insensitive to the various speeds. On the contrary, the HiLDA features showed the most sensitivity; their corresponding left bars were substantially higher than the other two.

Fig. 10 shows the results of the LVM using the LBP and LBP-TOP features as input. The LBP-TOP features once again outperformed the other two inputs, showing that VSR in the speaker-independent scenario could benefit from information characterizing the temporal evolution of the shapes and textures of a talking mouth.

Fig. 11. Cumulative match characteristics for the classification of visemes in context using various compact representations.

6 BEYOND THIS WORK

6.1 Toward Continuous Visual Speech Recognition (CVSR)

Our ultimate goal is continuous visual speech recognition. Although the proposed compact representation was tested through the task of classifying a limited number of phrases and short sentences, this does not prevent us from using the technique in future research on CVSR. To test its potential, we conducted a preliminary experiment on the task of classifying visemes in context (VICs), similar to the triphones defined in acoustic speech recognition, using the proposed LVM.

The experiment was conducted in the speaker-independent setting on the manually preprocessed OuluVS database. All videos were linearly interpolated to a frame rate of 100 fps [12], [21]. The spatial raw pixel values and spatiotemporal LBP-TOP features were used as input. For comparison, we chose the two HiLDA features and trained conventional three-state HMMs for each VIC for classification. The viseme classes are defined in Table A-3, available in the online supplemental material.

Experimental results are shown in Fig. 11 by means of cumulative match characteristics [19]. The curves report the percentages of the VICs that are correctly recognized in the top $k$ matches, where $k$ is the rank on the abscissa. It can be seen that the proposed compact representation achieved results competitive with the state-of-the-art visual features and classifiers. The results show the potential of our compact features for the problem of CVSR. However, developing a fully functioning CVSR system based on our method is beyond the scope (and page limit) of this paper and will be our focus in the future.
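The cumulative match characteristic used above is simply the fraction of test items whose correct class appears among the top-$k$ scored candidates; a small sketch, assuming a score matrix with one row of class scores per test sample:

```python
import numpy as np

def cumulative_match_curve(scores, true_labels):
    """CMC: entry k-1 is the fraction of samples whose true class is ranked
    within the top k by the score matrix (higher score = better match).

    scores: (N, M) class scores, true_labels: (N,) integer class indices."""
    order = np.argsort(-scores, axis=1)                       # classes sorted best-first
    ranks = np.argmax(order == true_labels[:, None], axis=1)  # rank of the true class
    return np.cumsum(np.bincount(ranks, minlength=scores.shape[1])) / len(true_labels)
```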

6.2 Model Extension

In practice, the visual appearance of the talking mouth depends on many factors, including, for example, the emotional state of the speaker, the pose of the speaker, and the background environment. Although the models described in this paper were developed for frontal-view, neutral-expression data, the use of generative latent variable models (GLVMs) gives us great power and flexibility to tackle such large variations. Li et al. [14] demonstrated the use of GLVMs for tackling large pose variations in face recognition. In their work, a face image $x$ was assumed to be generated from a latent variable $h$ representing the identity of the individual and another LV responsible for the other variations in face images. Given another image $x'$ of the same individual but with a large pose change, the corresponding $h'$ was tied to $h$, assuming that $x$ and $x'$ share the same underlying $h$ but are generated by different processes corresponding to the two poses.

In the case of VSR, we can adopt the above strategy to counter variations in the visual speech data. Suppose, for instance, that we have two videos recording the same utterance. For some reason (e.g., variations of pose or emotional state), we cannot generate both of them using the model described in this paper. We may assume that the videos share the same LUV space but are generated by different processes. We cannot tie the LUVs directly, since the images in the two videos may not be aligned. Let one video be the reference and the other the target. One possible solution is to interpolate the target into an image sequence of the same length as the reference such that each image and its correspondence in the reference are phonemically aligned. By phonemically aligned, we mean that the two images are placed at the same position in the interval of the corresponding phoneme. After that, we can tie the LUVs. In this way, we would be able to extend the proposed model to cope with such variations.
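The alignment step sketched above amounts to resampling the target sequence to the reference length before tying the LUVs. A minimal linear-interpolation sketch is given below; true phoneme-level alignment, which the extension actually calls for, would additionally require phoneme boundaries and is not shown.

```python
import numpy as np

def resample_sequence(target, ref_length):
    """Linearly interpolate a (T, D) feature/image-vector sequence to ref_length frames."""
    T = len(target)
    src = np.linspace(0.0, 1.0, T)
    dst = np.linspace(0.0, 1.0, ref_length)
    return np.stack([np.interp(dst, src, target[:, d])
                     for d in range(target.shape[1])], axis=1)
```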

7 CONCLUSIONS

We have described a generative latent variable model that provides a compact representation for high-dimensional visual speech data. The model explicitly represents the interspeaker variations in images and those caused by uttering, using low-dimensional latent speaker variables and latent utterance variables, respectively. A path graph is used to model the structure of image sequences of the same utterance, and its embedded curve is used to characterize the dynamic process of speaking. The curve is then used as prior knowledge when building the prior distributions for the LUVs. Experimental results have shown the effectiveness of the proposed method.

REFERENCES

[1] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Proc. Advances in Neural Information Processing Systems, pp. 585-591, 2001.
[2] C. Bregler and Y. Konig, "'Eigenlips' for Robust Speech Recognition," Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 669-672, 1994.
[3] F. Chung, Spectral Graph Theory (CBMS Regional Conference Series in Mathematics), Am. Math. Soc., 1996.
[4] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.
[5] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood for Incomplete Data via the EM Algorithm," J. Royal Statistical Soc. B, vol. 39, pp. 1-38, 1977.
[6] S. Dupont and J. Luettin, "Audio-Visual Speech Modeling for Continuous Speech Recognition," IEEE Trans. Multimedia, vol. 2, no. 3, pp. 141-151, Sept. 2000.
[7] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale Invariant Learning," Proc. IEEE Int'l Conf. Computer Vision and Pattern Recognition, pp. 264-271, 2003.
[8] M. Gales, "Maximum Likelihood Linear Transformations for HMM-Based Speech Recognition," Computer Speech & Language, vol. 12, no. 2, pp. 75-98, 1998.
[9] J. Gowdy, A. Subramanya, C. Bartels, and J. Bilmes, "DBN Based Multi-Stream Models for Audio-Visual Speech Recognition," Proc. IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, pp. 993-996, 2004.
[10] A. Kanaujia, C. Sminchisescu, and D. Metaxas, "Spectral Latent Variable Models for Perceptual Inference," Proc. 11th IEEE Int'l Conf. Computer Vision, pp. 1-8, 2007.
[11] Y. Lan, R. Harvey, B. Theobald, E. Ong, and R. Bowden, "Comparing Visual Features for Lipreading," Proc. Int'l Conf. Auditory-Visual Speech Processing, pp. 102-106, 2009.
[12] Y. Lan, B. Theobald, R. Harvey, E. Ong, and R. Bowden, "Improving Visual Features for Lip-Reading," Proc. Int'l Conf. Auditory-Visual Speech Processing, pp. 142-147, 2010.
[13] L. Lee and R. Rose, "A Frequency Warping Approach to Speaker Normalization," IEEE Trans. Speech Audio Processing, vol. 6, no. 1, pp. 49-60, Jan. 1998.
[14] P. Li, Y. Fu, U. Mohammed, J. Elder, and S. Prince, "Probabilistic Models for Inference about Identity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 144-157, Jan. 2012.
[15] I. Matthews, T. Cootes, J. Bangham, S. Cox, and R. Harvey, "Extraction of Visual Features for Lipreading," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, Feb. 2002.
[16] H. McGurk and J. MacDonald, "Hearing Lips and Seeing Voices," Nature, vol. 264, no. 5588, pp. 746-748, 1976.
[17] A. Nefian, L. Liang, X. Pi, X. Liu, and K. Murphy, "Dynamic Bayesian Networks for Audio-Visual Speech Recognition," EURASIP J. Applied Signal Processing, vol. 2002, no. 1, pp. 1274-1288, 2002.
[18] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution Gray Scale and Rotation Invariant Texture Classification with Local Binary Patterns," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971-987, July 2002.
[19] P. Phillips, H. Moon, S. Rizvi, and P. Rauss, "The FERET Evaluation Methodology for Face-Recognition Algorithms," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1090-1104, Oct. 2000.
[20] G. Potamianos, C. Neti, and G. Gravier, "Recent Advances in the Automatic Recognition of Audio-Visual Speech," Proc. IEEE, vol. 91, no. 9, pp. 1306-1326, Sept. 2003.
[21] G. Potamianos, C. Neti, G. Iyengar, A. Senior, and A. Verma, "A Cascade Visual Front End for Speaker Independent Automatic Speechreading," Int'l J. Speech Technology, vol. 4, pp. 193-208, 2001.
[22] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, Feb. 1989.
[23] K. Saenko, K. Livescu, J. Glass, and T. Darrell, "Multistream Articulatory Feature-Based Models for Visual Speech Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1700-1707, Sept. 2009.
[24] Y. Tian, L. Sigal, H. Badino, F. De la Torre Frade, and Y. Liu, "Latent Gaussian Mixture Regression for Human Pose Estimation," Proc. Asian Conf. Computer Vision, vol. 3, pp. 679-690, 2010.
[25] G. Zhao, M. Barnard, and M. Pietikäinen, "Lipreading with Local Spatiotemporal Descriptors," IEEE Trans. Multimedia, vol. 11, no. 7, pp. 1254-1265, Nov. 2009.
[26] G. Zhao and M. Pietikäinen, "Dynamic Texture Recognition Using Local Binary Patterns with an Application to Facial Expressions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 6, pp. 915-928, June 2007.
[27] Z. Zhou, G. Zhao, and M. Pietikäinen, "Towards a Practical Lipreading System," Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 137-144, 2011.

ACKNOWLEDGMENTS

This research was supported by the Academy of Finland and Infotech Oulu.

