OuluVS2: a multi-view audiovisual database for non-rigid mouth motion analysis

Iryna Anina, Ziheng Zhou, Guoying Zhao and Matti Pietikäinen
Center for Machine Vision Research, University of Oulu, Finland
{isavelie, zhouzh, gyzhao, mkp}@ee.oulu.fi

Abstract— Visual speech constitutes a large part of our non-rigid facial motion and contains important information that allows machines to interact with human users, for instance, through automatic visual speech recognition (VSR) and speaker verification. One of the major obstacles to research on non-rigid mouth motion analysis is the absence of suitable databases. Those available for public research either lack a sufficient number of speakers or utterances or are restricted to constrained viewpoints, which limits their representativeness and usefulness. This paper introduces a newly collected multi-view audiovisual database for non-rigid mouth motion analysis. It includes more than 50 speakers uttering three types of utterances and, more importantly, thousands of videos simultaneously recorded by six cameras from five different views spanning from the frontal to the profile view. Moreover, a simple VSR system has been developed and tested on the database to provide baseline performance.

I. INTRODUCTION

Visual speech constitutes a large part of our non-rigid facial motion and contains important information that allows machines to interact with human users, for instance, through automatic visual speech recognition [1], [2] and speaker verification [3], [4], [5]. On one hand, speech perception is a bimodal process that makes use of information related both to what we hear (acoustic) and what we see (visual) [6]. There is clear evidence that visual cues of speech play an important role in automatic speech recognition when the audio is corrupted or even inaccessible [1], [2]. On the other hand, visual speech contains unique information related to a speaker's identity and may serve as an ideal facial biometric, since it allows us to conduct the liveness detection that is critical to a verification system. Despite the apparent motivation, the problem of non-rigid mouth motion analysis remains relatively under-studied. One of the major obstacles is the lack of suitable databases available for public research. By 'suitable', we mean that the database should include a relatively large number of speakers for sufficient training and testing; various types of utterances, such as digits, phrases and sentences, for building various applications; and, last but not least, multiple views of talking mouths that represent the large variation we may encounter in real-world situations, since we cannot assume that users will face the video camera all the time during their interaction with machines. At the moment, there are only a few audiovisual databases [2] available for public research and none of them satisfies the above criteria.

This work was supported by Infotech Oulu, University of Oulu and the Academy of Finland.

Motivated by the demand for representative datasets, we collected a multi-view audiovisual database, named OuluVS2, for non-rigid mouth motion analysis. It contains videos of more than 50 speakers speaking various types of utterances, simultaneously recorded by six cameras from five different views, and will be made available for download in the near future. This paper provides a comprehensive introduction to the OuluVS2 database. Moreover, we conducted VSR experiments in a setting that has been widely used in previous studies [7], [8], [9], [10], [11]. The results reported in this paper can serve as baseline performance for future research. They show that the best VSR performance does not come from the frontal-view videos, somewhat contradicting our expectation that the frontal view would provide the most useful information for VSR. Such results may give us insight into future studies of non-rigid mouth motion analysis.

The rest of this paper is organized as follows: Section 2 briefly describes the audiovisual databases available for public research. Section 3 introduces the newly collected multi-view database, including the details of data collection, video format and preprocessing. Section 4 presents the VSR experiments conducted on part of the OuluVS2 database, together with a discussion of the results. Finally, Section 5 concludes the paper.

II. BACKGROUND

In this section, we briefly describe the publicly available audiovisual databases. The XM2VTSDB database, designed for person authentication, contains 295 speakers pronouncing two sequences of digits and one phonetically balanced sentence [12]. The AVLetters database, which includes ten speakers uttering isolated letters of the English alphabet, was created for lipreading purposes [13]. The AVLetters2 dataset [14], a higher-definition version of AVLetters, contains 5 speakers uttering the isolated letters A-Z seven times. The OuluVS database was created for VSR [7]. It includes 20 speakers uttering 10 short English phrases of everyday use. The CUAVE database, geared toward audio-visual speech recognition, includes 36 speakers [15]. Speakers pronouncing isolated digits and connected-digit sequences were framed from the frontal view and then from both profile views. Simultaneous speech from pairs of speakers and speaker movement are special features of the CUAVE database.


The AVICAR database was recorded in a moving car [16]. The 100 speakers involved uttered isolated digits and letters, phone numbers and randomly chosen TIMIT sentences [17]. The database consists of videos recorded simultaneously by four cameras from different views. However, since the cameras were located in a lateral array in front of the speaker in a limited space, the actual angles between the views are unknown, and all of them appear near-frontal. The Grid audio-visual corpus [18] involved 34 speakers in the recording, but due to technical reasons the video data of only 13 speakers is available. Each speaker pronounced 1000 synthetic sentences constructed by combining a small number of keywords. Table I summarizes the above databases. It lists the number of subjects, the types of utterances and the camera views of each database. The OuluVS2 database presented in this paper is included in the same table for comparison.

III. THE OULUVS2 DATABASE

A. Utterances

There were three phases in each data collection session. In phase 1, a subject was asked to utter ten digit sequences continuously. Each sequence consisted of ten randomly generated digits and was repeated three times during recording. The speaking speed was left up to the subject. The ten digit sequences were generated only once and stayed the same for all subjects. In phase 2, the subject pronounced ten short English phrases of daily use, such as "Hello" and "Nice to meet you". The same set of phrases was used in the OuluVS database [7], which has been widely used for VSR studies. Every phrase was uttered three times. In phase 3, the subject was asked to read ten randomly chosen TIMIT sentences [17]. Every sentence was read only once. A separate set of sentences was generated for every subject. Table II shows examples of the utterances used in the different phases of data collection.

TABLE II
EXAMPLE UTTERANCES USED IN DIFFERENT PHASES OF DATA COLLECTION

Phase 1: digit sequences
  1735162667
  4029185904
  1907880328

Phase 2: phrases
  Thank you
  Have a good time
  You are welcome

Phase 3: TIMIT sentences [17]
  Military personnel are expected to obey government orders.
  Chocolate and roses never fail as a romantic gift.
  Agricultural products are unevenly distributed.

B. Subject Population

Among the 53 subjects taking part in our data collection there were 40 males and 13 females. Fig. 1 shows sample images of the speakers framed from the different views. Most of the participants were university students and staff. There were no native English speakers among them. Based on appearance, the subjects can be grouped into the following five types: European, Chinese, Indian/Pakistani, Arabian and African. The distribution of these types within the OuluVS2 subject population is shown in Fig. 2.

TABLE I
SUMMARY OF THE PUBLICLY AVAILABLE AUDIO-VISUAL DATABASES

Database         Subj.   Utterances                                                         Views
XM2VTSDB [12]    295     continuous digits, sentences                                       frontal
AVLetters [13]   10      isolated letters                                                   frontal
CUAVE [15]       36      isolated digits, continuous digits                                 frontal & profile
AVICAR [16]      100     isolated digits, continuous digits, isolated letters, sentences    four near-frontal views
Grid [18]        34      sentences                                                          frontal
AVLetters2 [14]  5       isolated letters                                                   frontal
OuluVS [7]       20      phrases                                                            frontal
OuluVS2          53      continuous digits, phrases, sentences                              five views: frontal, 30°, 45°, 60°, profile

Fig. 1. Sample images of the speakers framed from different views.

Fig. 2. The distribution of appearance types within the OuluVS2 subject population.

C. Data Collection

Fig. 3 illustrates the setup for the data collection. As can be seen, a subject was asked to sit on a chair in front of six cameras, facing HD camera 1 and the high-speed (HS) camera (0°). The other four HD cameras were located at 30°, 45°, 60° and 90° (profile view) to the subject's right-hand side. Digits, phrases and TIMIT sentences were shown to the subject on a computer monitor located slightly to the left, behind the frontal cameras. Subjects were asked to keep their head still and their facial expression neutral during recording. Nevertheless, natural uncontrolled head movements and changes of body position can still be found in the recorded videos. The recording was made under ordinary office conditions with mixed lighting (professional studio lighting intermixed with ordinary office illumination and natural daylight falling through the window) and possible background sounds (e.g., human conversations). For the video and audio recording, five GoPro Hero3 Black Edition cameras were used (video resolution 1920×1080, 30 fps, audio bit rate 128 kbps). The "fisheye" effect was reduced by using the "narrow" recording mode. From the frontal view, a 100 fps HS video was also recorded using a PixeLink PL-B774U camera (resolution 640×480); it can be used to investigate the influence of the video frame rate. At the beginning of every recording session, we placed a white cardboard in front of the subject and projected a periodic flash light onto it. By doing so we obtained a periodic signal in front of all six cameras for a short period. Frames recorded within this signal were later used to synchronize the videos from the different views. In our case, audio synchronization across all six cameras was not possible due to the absence of a microphone in the HS camera.

D. Synchronization

Videos produced in the same session by different cameras were synchronized using the periodic flash-light signal recorded at the beginning of every video. A semi-automatic procedure was developed to synchronize the HD videos. We first manually located the flash-light spot and roughly marked the period during which the white cardboard appeared in every video sequence. The average grayscale signal was then calculated over a small (5 × 5) neighborhood around the center of the located light spot for every frame. Fig. 4 shows the resulting signal curve for every camera view. It can be seen that all the curves contain rapid changes caused by the flash light being periodically switched on and off. After matching these periodic signals we obtained the time shifts (in frames) between the videos recorded from different views. The matching was performed by minimizing the following heuristic function:

h = \sum_t \left| f_t^i - f_{t+\Delta t}^j \right|    (1)

where f_t^i is the grayscale signal calculated for the i-th video, f_t^j is that of the j-th video, and \Delta t is the time shift between the two videos. The synchronization results were also checked by eye at the end. The same approach can be used to synchronize the HS video with the HD videos, taking the difference in frame rate into account. This synchronization is planned for a later stage of the database preprocessing work, which is ongoing.
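To make the matching step concrete, the following minimal Python sketch, which is not the authors' implementation, estimates the time shift between two cameras by minimizing Eq. (1) over a range of candidate shifts. The neighborhood size follows the 5 × 5 window described above, while the function names, the normalization by overlap length and the search range are assumptions introduced for illustration.

import numpy as np

def mean_lightspot_signal(frames, cx, cy, half=2):
    """Average grayscale over a (2*half+1) x (2*half+1) window around the light-spot center."""
    return np.array([f[cy - half:cy + half + 1, cx - half:cx + half + 1].mean()
                     for f in frames])

def estimate_shift(f_i, f_j, max_shift=150):
    """Return the shift (in frames) minimizing h = sum_t |f_i[t] - f_j[t + shift]|."""
    best_shift, best_h = 0, np.inf
    for shift in range(-max_shift, max_shift + 1):
        # Align f_i[t] with f_j[t + shift] and keep only the overlapping part.
        if shift >= 0:
            a, b = f_i, f_j[shift:]
        else:
            a, b = f_i[-shift:], f_j
        n = min(len(a), len(b))
        if n == 0:
            continue
        # Mean absolute difference: Eq. (1) normalized so different overlaps are comparable.
        h = np.abs(a[:n] - b[:n]).sum() / n
        if h < best_h:
            best_h, best_shift = h, shift
    return best_shift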


Fig. 3. OuluVS2 recording system setup.

Fig. 4. An example of the gray-level change at the center of the light spot used for video synchronization. The periodic parts correspond to the flash light being switched on and off and mark the same moments in time.

Fig. 5. Example of the synchronous preprocessed images of a talking mouth. All images are resized to have the same height for the purpose of illustration.

E. Video Preprocessing

Preprocessing of the HD videos was conducted semi-automatically. Given a video recorded in one recording session, we used its audio to locate all the utterances and cropped out the corresponding video segments. For each video segment, we first detected a bounding box for the face in the first frame [19]. The image region within the box was cropped out across the sequence. We then calculated SURF features from the images, matched the feature points and estimated the image transformation matrix for each pair of consecutive images [20]. After that, all images were aligned to the first one. We next performed facial landmark (e.g., eye corners, nose tip and lip corners) localization using the method described in [21], checked the detected facial points by eye and manually marked the landmarks if they were far from their true positions. Finally, images containing the talking mouth were cropped out according to a fixed height-width ratio. Fig. 5 shows an example of the synchronous cropped images of a talking mouth; the images are resized to the same height for the purpose of illustration. All the HS videos have been roughly segmented in time and converted to MP4 format. The original raw videos are too large (about 30 GB per video) but can be made available on demand. The above procedure will be used to preprocess these videos in the future.
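As an illustration of these steps, the sketch below, which is not the authors' code, mirrors the pipeline with freely available OpenCV components: Viola-Jones face detection in the first frame, keypoint-based registration (ORB is used as a stand-in for SURF [20], which requires the non-free opencv-contrib build), and a fixed-ratio mouth crop around lip-corner coordinates that are assumed to come from the landmark detector [21], which is omitted here. For brevity, each frame is registered directly to the first one instead of chaining consecutive-pair transforms, and the sketch assumes a face and enough keypoint matches are found.

import cv2
import numpy as np

def crop_face_track(frames):
    """Detect the face in the first frame and crop the same box across the sequence."""
    gray0 = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    x, y, w, h = cascade.detectMultiScale(gray0, 1.1, 5)[0]   # assumes at least one detection
    return [f[y:y + h, x:x + w] for f in frames]

def align_to_first(frames):
    """Register every frame to the first one via keypoint matching and an affine warp."""
    orb = cv2.ORB_create(1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    ref_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    kp_ref, des_ref = orb.detectAndCompute(ref_gray, None)
    aligned = [frames[0]]
    for f in frames[1:]:
        kp, des = orb.detectAndCompute(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), None)
        matches = matcher.match(des, des_ref)
        src = np.float32([kp[m.queryIdx].pt for m in matches])
        dst = np.float32([kp_ref[m.trainIdx].pt for m in matches])
        M, _ = cv2.estimateAffinePartial2D(src, dst)          # RANSAC by default
        h, w = ref_gray.shape
        aligned.append(cv2.warpAffine(f, M, (w, h)))
    return aligned

def crop_mouth(frame, lip_left, lip_right, height_width_ratio=0.8, margin=0.3):
    """Crop a mouth region of fixed height/width ratio around hypothetical lip-corner points."""
    (x1, y1), (x2, y2) = lip_left, lip_right
    width = int((x2 - x1) * (1 + 2 * margin))
    height = int(width * height_width_ratio)
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    x0, y0 = max(cx - width // 2, 0), max(cy - height // 2, 0)
    return frame[y0:y0 + height, x0:x0 + width]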

TABLE III
VSR EXPERIMENT DATA DETAILS

Camera View   Aspect Ratio   Downsampled Resolution   Subjects involved   Videos involved
0°            1:0.8          56×45                    52                  1560
30°           1:0.8          56×45                    51                  1530
45°           1:0.9          56×50                    51                  1525
60°           1:1            56×56                    52                  1560
90°           1:1.25         45×56                    46                  1374
mixed         1:1            56×56                    52                  1560

IV. VSR EXPERIMENTS

In this work, we considered VSR as the application for testing. A simple VSR system was developed and tested in an experimental setting that has been widely used in previous VSR studies [7], [8], [9], [10], [11]. The purpose was simply to provide some baseline performance (with results for the frontal view comparable to previous work) rather than to achieve the state of the art. Moreover, we wanted to test each view under this simple system and carry out a comparison that might give us insight into designing better systems for analyzing non-rigid mouth motion. Following [7], the task was to recognize ten phrases using only visual information. To do so, we first normalized the videos such that images from the same view had the same dimensions. The image sizes, together with some other details, can be found in Table III. Due to data-collection errors (e.g., a subject positioning him/herself such that the talking mouth was not included in the recorded video sequences), not all the videos were used in the experiments. The number of videos used for training varied from 1560 (frontal and 60° views) down to 1374 (profile view). For feature extraction, our VSR system computed 2D DCT features from each image and performed PCA to reduce the feature dimension to 100 [22]. For recognition, a whole-word hidden Markov model (HMM) [23] was constructed for each phrase. The numbers of states and Gaussian mixtures were determined empirically to maximize recognition performance. The experiments were designed to be speaker independent, using leave-one-speaker-out cross validation. In other words, we chose one speaker as the test subject and used the corresponding data only for testing; system training was carried out using data from all the other speakers. We first conducted VSR experiments for every single view. After that, all the views were mixed together: during training, we randomly sampled the view for every phrase of every subject and picked the corresponding video for the training corpus, and the test data were chosen in the same way. The recognition results for the different views are shown in Fig. 6.
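A minimal sketch of such a baseline is given below, under the assumption that the mouth images are already cropped and normalized as in Table III. It is not the authors' system: it uses scipy, scikit-learn and hmmlearn, keeps a 16 × 16 block of low-frequency DCT coefficients before PCA, and fits a single Gaussian per HMM state, whereas the paper tunes the numbers of states and mixtures empirically. For leave-one-speaker-out evaluation, train_models would be called once per held-out speaker on the remaining speakers' data.

import numpy as np
from scipy.fftpack import dct
from sklearn.decomposition import PCA
from hmmlearn.hmm import GaussianHMM

def dct_features(mouth_images, keep=16):
    """Low-frequency 2D DCT coefficients (keep x keep block) for every frame."""
    feats = []
    for img in mouth_images:                                  # img: 2-D grayscale array
        c = dct(dct(img.astype(float), axis=0, norm="ortho"), axis=1, norm="ortho")
        feats.append(c[:keep, :keep].ravel())
    return np.array(feats)                                    # shape: (n_frames, keep * keep)

def train_models(train_videos, train_labels, n_states=5, n_pca=100):
    """Fit PCA on all training frames, then one whole-phrase HMM per class."""
    all_frames = np.vstack([dct_features(v) for v in train_videos])
    pca = PCA(n_components=n_pca).fit(all_frames)
    models = {}
    for label in set(train_labels):
        seqs = [pca.transform(dct_features(v))
                for v, l in zip(train_videos, train_labels) if l == label]
        X, lengths = np.vstack(seqs), [len(s) for s in seqs]
        hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        hmm.fit(X, lengths)                                   # concatenated sequences + lengths
        models[label] = hmm
    return pca, models

def classify(video, pca, models):
    """Return the phrase whose HMM yields the highest log-likelihood for the video."""
    obs = pca.transform(dct_features(video))
    return max(models, key=lambda lab: models[lab].score(obs))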

Fig. 6. Recognition results for different views.

It is interesting to see that the best recognition rate (47%), 6% higher than that of the frontal view, comes from the 60° view, and that the second best (42%), 1% higher than that of the frontal view, comes from the profile view, which somewhat contradicts our expectation that the frontal or close-to-frontal views should provide the most useful information for VSR. For the test involving all the views, it is no surprise that the resulting recognition rate (23%) is significantly lower than those of the single views. It shows that the standard way of extracting features and constructing classifiers cannot cope with the large variations of mouth appearance caused by camera-view changes. The experimental results highlight the need for more research effort to better understand visual speech, especially under various views, so as to develop more effective methods to model such variations.

V. CONCLUSIONS

We have presented our newly recorded multi-view audiovisual database named OuluVS2. It contains thousands of videos of more than 50 speakers speaking three types of utterances, recorded simultaneously by six cameras from five different views, which makes OuluVS2 an appropriate corpus for non-rigid mouth motion analysis. In addition, the synchronized multi-view videos may also be useful for research on 3D face analysis and synthesis. We have given details about the speaker population, the utterances, the data collection, the video synchronization and the video preprocessing. Baseline VSR experiments have been carried out on the database using a widely adopted setting. The recognition results show that the best VSR performance does not come from the frontal view or the close-to-frontal views. They highlight the need for more research effort to better understand visual speech, especially under various camera views. The OuluVS2 database will be available for download(1). Synchronization information and preprocessed data will also be provided soon. Based on the database, our future research will focus on developing models that cope with the large variations caused by camera-view changes and on applying these models to visual-only or audiovisual speech recognition and speaker verification.

(1) http://www.cse.oulu.fi/CMV/Downloads

REFERENCES

[1] G. Potamianos, C. Neti, J. Luettin, and I. Matthews, "Audio-visual automatic speech recognition: an overview," in Issues in Audio-Visual Speech Processing, G. Bailly, E. Vatikiotis-Bateson, and P. Perrier, Eds. MIT Press, 2004.
[2] Z. Zhou, G. Zhao, X. Hong, and M. Pietikäinen, "A review of recent advances in visual speech decoding," Image and Vision Computing, vol. 32, no. 9, pp. 590–605, 2014.
[3] N. A. Fox, R. Gross, J. F. Cohn, and R. B. Reilly, "Robust biometric person identification using automatic classifier fusion of speech, mouth, and face experts," IEEE Transactions on Multimedia, vol. 9, no. 4, pp. 701–714, 2007.

[4] R. Jiang, A. H. Sadka, and D. Crookes, "Multimodal biometric human recognition for perceptual human-computer interaction," IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 40, no. 6, pp. 676–681, 2010.
[5] X. Liu and Y.-M. Cheung, "Learning multi-boosted HMMs for lip-password-based speaker verification," IEEE Transactions on Information Forensics and Security, vol. 9, no. 2, pp. 233–246, 2014.
[6] H. McGurk and J. MacDonald, "Hearing lips and seeing voices," Nature, vol. 264, no. 5588, pp. 746–748, 1976.
[7] G. Zhao, M. Barnard, and M. Pietikäinen, "Lipreading with local spatiotemporal descriptors," IEEE Trans. Multimedia, vol. 11, no. 7, pp. 1254–1265, 2009.
[8] Z. Zhou, G. Zhao, and M. Pietikäinen, "Towards a practical lipreading system," in IEEE Int. Conf. Comput. Vis. Pattern Recognition (CVPR), 2011, pp. 137–144.
[9] E.-J. Ong and R. Bowden, "Learning sequential patterns for lipreading," in British Mach. Vis. Conf. (BMVC), 2011, pp. 1–10.
[10] Y. Pei, T.-K. Kim, and H. Zha, "Unsupervised random forest manifold alignment for lipreading," in IEEE Int. Conf. Comput. Vis. (ICCV), 2013, pp. 129–136.
[11] Z. Zhou, X. Hong, G. Zhao, and M. Pietikäinen, "A compact representation of visual speech data using latent variables," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 1, pp. 181–187, 2014.
[12] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, "XM2VTSDB: The extended M2VTS database," in Int. Conf. Audio- and Video-Based Biometric Person Authentication (AVBPA), vol. 964, 1999, pp. 965–966.
[13] I. Matthews, T. Cootes, J. A. Bangham, S. Cox, and R. Harvey, "Extraction of visual features for lipreading," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 2, pp. 198–213, 2002.
[14] S. Cox, R. Harvey, Y. Lan, J. Newman, and B. Theobald, "The challenge of multispeaker lip-reading," in Int. Conf. Auditory-Visual Speech Process. (AVSP), 2008, pp. 179–184.
[15] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, "CUAVE: A new audio-visual database for multimodal human-computer interface research," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 2, 2002, pp. 2017–2020.
[16] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang, "AVICAR: Audio-visual speech corpus in a car environment," in Annu. Conf. Int. Speech Commun. Assoc. (INTERSPEECH), 2004, pp. 380–383.
[17] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Communication, vol. 9, no. 4, pp. 351–356, Aug. 1990.
[18] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, no. 5, pp. 2421–2424, 2006.
[19] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in IEEE Int. Conf. Comput. Vis. Pattern Recognition (CVPR), vol. 1, 2001, pp. 511–518.
[20] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," Computer Vision and Image Understanding (CVIU), vol. 110, no. 3, pp. 346–359, 2008.
[21] X. Zhu and D. Ramanan, "Face detection, pose estimation, and landmark localization in the wild," in IEEE Int. Conf. Comput. Vis. Pattern Recognition (CVPR), 2012, pp. 2879–2886.
[22] J. N. Gowdy, A. Subramanya, C. Bartels, and J. Bilmes, "DBN based multi-stream models for audio-visual speech recognition," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 1, 2004, pp. 993–996.
[23] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.