Automated lip feature extraction for liveness verification in audio-video authentication

Girija Chetty and Michael Wagner
Human Computer Communication Laboratory
School of Information Sciences and Engineering
University of Canberra, Australia
[email protected]

Abstract

In this paper we propose an automated lip feature extraction technique for liveness verification in audio-video speaker authentication systems. Faces are detected using the red and blue colour components of the image, combined with geometric constraints and comparisons with an average face. The lip region is determined using derivatives of the hue and saturation functions, combined with geometric constraints. The algorithm was tested on the VidTIMIT database. It has been integrated into an audio-video authentication system which verifies the liveness of the audiovisual stream and has achieved verification equal error rates of less than 1% with visual feature vectors comprising both eigenlips and lip-region-of-interest measurements.

Keywords: lip detection, audio-video authentication, liveness verification, replay attack

1

Introduction

Audiovisual authentication, based simultaneously on the voice and face of a person, offers a number of advantages over both single-mode speaker verification and single-mode face verification [1], including enhanced robustness against variable environmental conditions. One of the most significant advantages of combined face-voice recognition is its decreased vulnerability to replay attacks, which could take the form of presenting either a voice recording to a single-mode speaker verification system or a still photograph to a single-mode face verification system. Replay attacks on combined face-voice recognition systems are normally much more difficult. Nevertheless, since most proposed combined audiovisual verification systems verify a person's face statically, these systems remain vulnerable to replay attacks that present pre-recorded audio together with a still photograph. To resist such attacks, audio-video authentication should include verification of the "liveness" of the audio-video data presented to the system. Although there has been much published research on liveness for other biometrics, for example fingerprints, research on liveness verification in face-voice authentication has been very limited [2].

Lip motion is intrinsic to speech production, and the inclusion of dynamic lip features along with acoustic speech features not only reflects static and dynamic acoustic and visual characteristics but also provides proof of liveness. Matsui and Furui [3] showed that there is a significant correlation between the acoustics of speech and the corresponding facial movements, and that the acoustic dynamic features of a speaking face are inextricably linked to the visible articulatory movements of the teeth, tongue and lips. Therefore, using dynamic visible features of the lip region for liveness verification in audiovisual person authentication is a natural, simple and effective way to address replay attacks, compared with more complex methods based on 3D depth and pose estimation or on the challenge-response techniques described in [4] and [2]. The dynamic lip features are determined by a lip tracking system, which locates the lips in the video sequence and then extracts the dynamic lip parameters. These features are subsequently fused synchronously with the acoustic features extracted from the corresponding audio in order to ascertain liveness and to establish the person's identity.

The detection and tracking of facial features such as lips is a fundamental and challenging problem in computer vision, with applications in face identification systems, model-based coding, gaze detection, human-computer interaction, teleconferencing and other areas. Detecting lips in a face requires solving two problems: spatial segmentation, that is, finding and tracking the lips; and recognition, that is, estimating the relevant features of the lip configuration for classification.

The main objective of most methods proposed to date for lip feature extraction, particularly in the audio-visual speech recognition domain, is to assist a speech recognition task by including the visual modality. Although some of their conclusions can be extended to a speaker verification context, little can be said about the liveness aspects of the results of previous experiments. However, with appropriate adaptation of some of the above-mentioned techniques, liveness verification can be made more powerful. One simple and effective adaptation is to implement multimodal fusion with synchronous, early integration of acoustic and moving-lip-region feature vectors before classification. This requires building client and background models from synchronised audio and dynamic visual features, as well as appropriate interpolation of the feature vectors acquired at different rates. With this approach, the liveness of an audio-video sample can be ascertained more accurately in the verification phase by differentiating a "live" client sample from a "fake" sample consisting of an audio signal in conjunction with a still photograph.

The literature on such liveness experiments and on the error rates achieved is sparse. Preliminary experiments on liveness verification using manual lip tracking were reported in [5]. Subsequently, we have conducted further experiments with an automated lip feature extraction technique, which have yielded much improved liveness verification equal error rates. The speaker recognition aspects of these experiments are reported in a parallel paper [6]. The current paper reports the details of the automatic lip detection and feature extraction technique developed for the liveness verification experiments.
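To make the early-integration idea concrete, the following minimal Python sketch interpolates visual lip feature vectors (captured at the video frame rate) up to the acoustic frame rate and concatenates them with the MFCC vectors. The rates, array shapes and function names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of synchronous early fusion: visual lip features (roughly 25 Hz
# for VidTIMIT video) are linearly interpolated up to the 50 Hz acoustic frame
# rate and concatenated with the MFCC vectors.
import numpy as np

def fuse_audio_visual(mfcc, visual, audio_rate=50.0, video_rate=25.0):
    """mfcc: (n_audio_frames, 8) acoustic vectors; visual: (n_video_frames, d) lip features."""
    t_audio = np.arange(mfcc.shape[0]) / audio_rate
    t_video = np.arange(visual.shape[0]) / video_rate
    # Interpolate each visual feature dimension onto the audio frame times.
    visual_up = np.column_stack(
        [np.interp(t_audio, t_video, visual[:, k]) for k in range(visual.shape[1])]
    )
    return np.hstack([mfcc, visual_up])   # one fused vector per audio frame
```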

2

Face and Lip Detection

Face detection is done in the first frame of the video sequence, and the tracking of the lip region in subsequent frames is done by projecting the markers from the first frame. This is followed by measurements on the lip region boundaries based on pseudo-hue edge detection and tracking. The advantage of facial feature extraction based on an alternative colour space is that it is simpler and more powerful than methods based on deformable templates, snakes and pyramid images [7]. Moreover, the method can be easily extended to the detection and tracking of facial features in multiple-face images or in images with natural and complex backgrounds.

Figure 1: A skin colour sample and the Gaussian distribution in red-blue chromatic space

The face detection scheme consists of three image processing stages. The first stage is to classify each pixel in the given image as a skin or non-skin pixel. The second stage is to identify different skin regions in the image through connectivity analysis. The last stage is to determine, for each of the identified skin regions, whether it represents a face. This is done using two criteria: the aspect ratio (height to width) of the skin-coloured blob, and template matching with an average face at different scales and orientations.

A statistical skin colour model was generated by means of supervised training, using a set of skin-colour regions obtained from a coloured-face database. A total of 10,000 skin samples from 100 colour images were used to determine the colour distribution of human skin in chromatic colour space. The extracted skin samples were filtered using a low-pass filter to reduce the effect of noise. Red-blue colour vectors were formed by concatenating the rows of the red and blue chromatic pixel matrices Cr and Cb:

x = (Cr,11, …, Cr,1m, Cr,21, …, Cr,nm, Cb,11, …, Cb,1m, Cb,21, …, Cb,nm)^T

The colour histogram reveals that the distribution of skin colour for different people is clustered in this chromatic colour space and can be represented by a Gaussian model N(µ, C) with mean vector µ = E[x] and covariance matrix C = E[(x-µ)(x-µ)^T]. One such skin sample and the Gaussian distribution N(µ, C) fitted to our data in red-blue chromatic space are shown in Figure 1. With this Gaussian skin colour model, we can obtain the "skin likelihood" for any pixel of an image.
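As a concrete illustration of this stage, the sketch below fits a Gaussian model N(µ, C) to labelled (Cr, Cb) skin samples and evaluates the skin likelihood for every pixel of an image. The per-pixel two-dimensional form of the model and the function names are our own simplification, not the authors' code.

```python
# Minimal sketch (assumed per-pixel form): Gaussian skin model in (Cr, Cb) space.
import numpy as np

def fit_skin_model(samples):
    """samples: (N, 2) array of [Cr, Cb] values taken from labelled skin pixels."""
    mu = samples.mean(axis=0)              # mean vector mu = E[x]
    cov = np.cov(samples, rowvar=False)    # covariance C = E[(x-mu)(x-mu)^T]
    return mu, cov

def skin_likelihood(cr, cb, mu, cov):
    """cr, cb: 2-D chromatic component images; returns a per-pixel likelihood map."""
    x = np.stack([cr, cb], axis=-1) - mu   # centre each pixel's (Cr, Cb) vector
    maha = np.einsum('...i,ij,...j->...', x, np.linalg.inv(cov), x)
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * maha)      # Gaussian density N(mu, C) at each pixel
```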

Figure 2: Face detection by skin colour in red-blue chromatic space

The skin probability image thus obtained is thresholded to obtain a binary image. The morphological segmentation involves the image processing steps of erosion, dilation and connected component analysis to isolate the skin-coloured blob. By applying the aspect-ratio rule and template matching with an average face, it is ascertained that the skin-coloured blob is indeed a face. Figure 2 shows an original image from the VidTIMIT database and the corresponding skin-likelihood and skin-segmented images obtained by the statistical skin colour analysis and morphological segmentation.
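The following sketch illustrates the thresholding and morphological stage described above using OpenCV. The probability threshold, structuring-element size and aspect-ratio bounds are assumed values, and the template-matching check against an average face is omitted.

```python
# Illustrative sketch of thresholding, erosion/dilation and connected-component
# analysis to isolate a face-like skin blob (parameter values are assumptions).
import cv2
import numpy as np

def segment_face_blob(skin_likelihood, thresh=0.5, ar_range=(1.0, 2.0)):
    binary = (skin_likelihood > thresh).astype(np.uint8)
    kernel = np.ones((5, 5), np.uint8)
    binary = cv2.erode(binary, kernel)       # remove isolated skin pixels
    binary = cv2.dilate(binary, kernel)      # restore the main blob
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    for i in range(1, n):                    # label 0 is the background
        x, y, w, h, area = stats[i]
        if ar_range[0] <= h / w <= ar_range[1]:   # height-to-width rule
            return (x, y, w, h)              # candidate face bounding box
    return None
```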

3

Lip Region Key Point Extraction

Once the face region is localised as shown in Figure 2, the lip region is detected using hue-saturation thresholding. It is easier to detect the lips in hue/saturation colour space because hue/saturation colour for the lip region is fairly uniform for a wide range of lip colours and fairly constant under varying illumination conditions and different human skin colours [8]. We derive a binary image B from the hue and saturation images H and S using thresholds H0 = 0.4 and S0 = 0.125 such that Bij = 1 for Hij > H0 and Sij > S0, and Bij = 0 otherwise.
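A minimal sketch of this thresholding rule, assuming an RGB face crop with values in [0, 1] and using scikit-image for the colour conversion:

```python
# B_ij = 1 iff H_ij > H0 and S_ij > S0, with the thresholds quoted in the text.
import numpy as np
from skimage.color import rgb2hsv

def lip_binary_mask(face_rgb, h0=0.4, s0=0.125):
    hsv = rgb2hsv(face_rgb)                  # hue and saturation in [0, 1]
    hue, sat = hsv[..., 0], hsv[..., 1]
    return ((hue > h0) & (sat > s0)).astype(np.uint8)
```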

Figure 3: Lip region localisation using hue-saturation thresholding

Figure 3 shows lip region-of-interest localisation based on hue-saturation thresholding. To derive the lip dimensions and key points within the lip region of interest, we detect edges by combining pseudo-hue colour and intensity information. The pseudo-hue matrix P is computed from the red and green image components as Pij = Rij / (Gij + Rij) and then normalised to the range [0, 1]. Pseudo-hue is higher for lips than for skin. In addition, luminance is a good cue for lip key point extraction, and we compute a luminance matrix L, which is also normalised to the range [0, 1]. We then compute the following gradient matrices from P and L: Rmid = ∇y(P ∇xL), where the matrices P and ∇xL are multiplied element by element; Rtop = ∇y(P − L); and Rlow = ∇y(P + L). Rmid has high values for the middle edge of the lips, Rtop has high values for the top edge of the lips, and Rlow has large negative values for the lower edge of the lip, as shown in Figure 4. From these observations it is possible to obtain a robust estimate of the key points in the lip region.

The horizontal mid-line is determined by finding the maximum value of Rmid over several central columns; the row index of that maximum determines the horizontal mid-line ym, as shown in Figure 5. To detect the left and right corners of the lips, we consider the pseudo-hue along the horizontal mid-line. Pseudo-hue is quite high around the lip corners, and the corners appear clearly when we compute the gradient of the pseudo-hue image ∇xP along the horizontal mid-line. The maxima and minima of this gradient over the cross-section, with appropriate geometric constraints, determine the outer left lip corner (olc-left), inner left lip corner (ilc-left), outer right lip corner (olc-right) and inner right lip corner (ilc-right), which are shown in Figure 5. The two edges of the upper lip are determined by two zero-crossings, again with appropriate constraints, of the gradient Rtop taken along the vertical mid-line of the lips, and the two edges of the lower lip are determined by two zero-crossings of the gradient Rlow taken along the vertical mid-line of the lips. Consequently, the upper lip width (ULW) and the lower lip width (LLW) can be determined.

An illustration of the geometry of the extracted features and measured key points is shown in Figure 6. The major dimensions extracted are the inner lip width (w1), outer lip width (w2), inner lip height (h1), outer lip height (h2), the upper and lower lip widths (h3 and h4), and the distance h5 = h2/2 between the mid-horizontal line and the upper lip. The extracted key points from the lip region are used for constructing the audiovisual fusion vector. The method was applied to the video sequences showing a speaking face for all subjects in the database; across all sequences, the lip region of the face was identified correctly with more than 95% accuracy.
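The sketch below illustrates the pseudo-hue and gradient computations described above. The derivative operator (np.gradient), the normalisation and the simplified mid-line search are assumptions standing in for the paper's exact procedure and geometric constraints.

```python
# Pseudo-hue / luminance gradient maps used for lip edge and key point detection.
import numpy as np

def lip_gradients(r, g, luminance):
    """r, g: red and green channels of the lip ROI; luminance: intensity image."""
    p = r / (g + r + 1e-8)                       # pseudo-hue P = R / (G + R)
    p = (p - p.min()) / (np.ptp(p) + 1e-8)       # normalise P to [0, 1]
    l = (luminance - luminance.min()) / (np.ptp(luminance) + 1e-8)
    r_mid = np.gradient(p * np.gradient(l, axis=1), axis=0)  # high on the middle lip edge
    r_top = np.gradient(p - l, axis=0)                       # high on the top lip edge
    r_low = np.gradient(p + l, axis=0)                       # strongly negative on the lower edge
    return p, r_mid, r_top, r_low

def horizontal_midline(r_mid, centre_cols):
    """Row index of the maximum of R_mid over a few central columns of the ROI."""
    return int(np.argmax(r_mid[:, centre_cols].max(axis=1)))
```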

4

Audio-Visual Integration

The multimodal person authentication database VidTIMIT, developed by Sanderson [9], was used for conducting all the experiments in this study. The VidTIMIT database consists of video and corresponding audio recordings of 43 people (19 female and 24 male) reciting short sentences selected from the NTIMIT corpus. The data were recorded in 3 sessions, with a mean delay of 7 days between Sessions 1 and 2, and of 6 days between Sessions 2 and 3. The mean duration of each sentence is around 4 seconds, or approximately 100 video frames. A broadcast-quality digital video camera in a noisy office environment was used to record the data. The video of each person is stored as a sequence of JPEG images with a resolution of 512 × 384 pixels (columns × rows), with the corresponding audio provided as a monophonic, 16-bit, 32 kHz WAV file.

The experiments were undertaken with different dimensions of the audio-visual vector, obtained from the fusion of the acoustic feature vector and the visual feature vector. The structure of the acoustic feature vector was kept constant for all the experiments and comprises 8 mel-frequency cepstral coefficients (MFCC) [10]. The MFCC acoustic vectors were obtained by pre-emphasising the audio signal and processing it using a 30 ms Hamming window with one-third overlap, yielding a frame rate of 50 Hz. An acoustic feature vector was determined for each frame by warping 512 spectral bands into 30 mel-spaced bands and computing the 8 MFCCs. Cepstral mean normalisation [10] was performed in order to compensate for varying environmental noise and channel conditions.
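The acoustic front end described above can be approximated with librosa as follows; the FFT size, pre-emphasis coefficient and any windowing details beyond those quoted in the text are assumptions.

```python
# Hedged sketch: 8 MFCCs, 30 ms Hamming window, 50 Hz frame rate, 30 mel bands,
# cepstral mean normalisation, for 32 kHz VidTIMIT audio.
import librosa
import numpy as np

def mfcc_features(wav_path, sr=32000):
    y, _ = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y)              # pre-emphasis (assumed coefficient 0.97)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=8, n_mels=30,
        n_fft=1024,                                 # assumed FFT size (~512 positive bands)
        win_length=int(0.030 * sr),                 # 30 ms analysis window
        hop_length=int(0.020 * sr),                 # 20 ms hop -> 50 Hz frame rate
        window="hamming",
    )
    return mfcc - mfcc.mean(axis=1, keepdims=True)  # cepstral mean normalisation
```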

Figure 4: Detecting lip boundaries using pseudo-hue/luminance images

Figure 5: Key lip point extraction from lip-edge images

Figure 6: Illustration of geometric features and measured key points for different lip openings

5

Liveness Verification Experiments

In order to examine the improvement in liveness verification error rates achieved with automatic extraction of lip features, three experiments were conducted. In the first experiment, the visual feature vector comprised the 6 lip parameters h1, h2, h3, h4, w1 and w2 described in the previous section. In the second experiment, 10 eigenlip projections of the entire lip region, based on principal component analysis as proposed in [11] and [12], were used. Before performing principal component analysis of the lip region images, each image was normalised by translation, rotation and scaling operations [5]. For the third experiment, the geometric lip parameters and the eigenlip coefficients were combined to form a 16-element video feature vector.

A 10-Gaussian mixture model for each client was obtained from the first two sessions of the 24 male and 19 female subjects in the VidTIMIT database [9]. The third sessions were used for testing. Further details of the visual feature vectors, Gaussian client models and the testing procedure are described in [6]. In order to obtain suitable thresholds to distinguish live recordings from fake recordings, detection error tradeoff (DET) curves and equal error rates (EER) were determined. The testing process with clients' live recordings and simulated replay attacks was repeated with the different visual feature vectors of experiments I-III.

The DET curves and the EERs achieved for the liveness verification experiments are shown in Figures 7, 8 and 9. The DET curve for experiment I in Figure 7 represents a visual vector consisting of the 6 lip parameters. The three curves show the error rates for the best speaker, all speakers and the worst speaker; the resulting EER is about 4.8%. The DET curve for experiment II in Figure 8 represents a visual vector consisting of the 10 eigenlip projections. Here an EER of 2.2% is achieved for all speakers, with the best speaker's EER being 1.8% and the worst speaker's EER being 3.6%. The DET curve for experiment III, shown in Figure 9, is obtained with a visual vector combining the 6 lip parameters (experiment I) and the 10 eigenlip projections (experiment II); for this combined visual vector, the EER for all speakers is 0.8%, with the best-speaker EER being 0.5% and the worst-speaker EER being 1.9%.
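To illustrate how the combined visual vector and the client models of experiment III fit together, the hedged sketch below projects normalised lip-region images onto 10 eigenlips, concatenates them with the 6 geometric parameters, trains a 10-component Gaussian mixture per client on the fused audio-visual vectors, and scores a test recording by its average log-likelihood. All function names, the diagonal covariance choice and the scoring details are our assumptions, not the paper's implementation.

```python
# Sketch of the 16-element visual vector and per-client GMM scoring for liveness.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def build_visual_vectors(lip_images, lip_geometry, n_eigenlips=10):
    """lip_images: (frames, h*w) flattened, normalised lip ROIs;
       lip_geometry: (frames, 6) array of h1, h2, h3, h4, w1, w2."""
    pca = PCA(n_components=n_eigenlips)
    eigenlip_coeffs = pca.fit_transform(lip_images)     # 10 eigenlip projections
    return np.hstack([lip_geometry, eigenlip_coeffs])   # 16-element visual vector

def train_client_model(av_vectors, n_components=10):
    """av_vectors: (frames, dim) fused MFCC + visual vectors for one client."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(av_vectors)

def liveness_score(client_gmm, test_av_vectors):
    """Mean per-frame log-likelihood; a threshold tuned at the EER separates
       live recordings from audio-plus-still-photograph replay attacks."""
    return client_gmm.score(test_av_vectors)
```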

Figure 7: DET curves and EER values for Experiment I

Figure 8: DET curves and EER values for Experiment II

Figure 9: DET curves and EER values for Experiment III

6

Conclusions

The results of the liveness verification experiments have shown that an audio-video feature vector consisting of a combination of MFCC acoustic features, eigenlip projections and the geometric lip parameters determined by the proposed automated lip feature extraction technique achieves liveness verification equal-error rates of less than 1%. The technique is a simple and powerful method of verifying liveness and thwarting replay attacks. Further experiments will investigate automated lip-feature extraction techniques for visual environments with multiple faces and complex backgrounds.

7

References

[1] Stork, D. and M. Hennecke, "Speechreading by man and machine: data, models and systems", Springer-Verlag, Berlin, 1997.

[2] Frischholz, R. and A. Werner, "Avoiding replay attacks in a face-recognition system using head pose estimation", Proceedings IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG'03), 2003.

[3] Matsui, T. and S. Furui, "Concatenated phoneme models for text-variable speaker recognition", Proceedings International Conference on Acoustics, Speech and Signal Processing (ICASSP-1993), 391-394, 1993.

[4] Choudhury, T., B. Clarkson, T. Jebara and A. Pentland, "Multimodal person recognition using unconstrained audio and video", Proceedings International Conference on Audio- and Video-Based Biometric Person Authentication (AVBPA-1999), 1999.

[5] Chetty, G. and M. Wagner, ""Liveness" verification in audio-video authentication", Proceedings International Conference on Spoken Language Processing (ICSLP-2004), 2004.

[6] Chetty, G. and M. Wagner, "Liveness verification in audio-video speaker authentication", accepted for the Australian International Conference on Speech Science and Technology (SST-04), 2004.

[7] Eveno, N., A. Caplier and P.Y. Coulon, "Jumping Snakes and Parametric Model for Lip Segmentation", Proceedings International Conference on Image Processing, Barcelona, Spain, 2003.

[8] Matthews, I., T. Cootes, J. Bangham, S. Cox and R. Harvey, "Extraction of visual features for lipreading", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198-213, 2002.

[9] Sanderson, C. and K.K. Paliwal, "Fast features for face authentication under illumination direction changes", Pattern Recognition Letters, vol. 24, pp. 2409-2419, 2003.

[10] Atal, B., "Effectiveness of linear prediction characteristics of the speech waveform for automatic speaker identification and verification", Journal of the Acoustical Society of America, vol. 55, pp. 1304-1312, 1974.

[11] Turk, M. and A. Pentland, "Eigenfaces for recognition", Journal of Cognitive Neuroscience, vol. 3, pp. 71-86, 1991.

[12] Bregler, C. and Y. Konig, ""Eigenlips" for robust speech recognition", Proceedings International Conference on Acoustics, Speech and Signal Processing (ICASSP-1994), 1994.
