
Performance Enhancement in Lip Synchronization Using MFCC Parameters

MAHESH GOYANI
Department of Computer Engineering, Sardar Patel University, Vallabh Vidya Nagar – 388 120, Anand, Gujarat, India
[email protected], maheshgoyani.co.in

NARENDRA PATEL
Department of Computer Engineering, Sardar Patel University, Vallabh Vidya Nagar – 388 120, Anand, Gujarat, India
[email protected]

MUKESH ZAVERI
Department of Computer Engineering, Sardar Vallabhbhai National Institute of Technology, Surat – 395 001, Gujarat, India
[email protected]

Abstract
Many multimedia applications and entertainment industry products such as games, cartoons and film dubbing require speech-driven face animation and audio-video synchronization. An Automatic Speech Recognition (ASR) system alone does not give good results in a noisy environment. An Audio-Visual Speech Recognition system plays a vital role in such harsh environments because it uses both audio and visual information. In this paper, we propose a novel approach with enhanced performance over the traditional methods reported so far. Our algorithm combines acoustic and visual parameters to achieve better results. We have tested the system for English using MFCC and LPC parameters of the speech. Lip parameters such as lip width and lip height are extracted from the video, and both the acoustic and visual parameters are used to train a neural network. The system gives nearly 100% recognition for vowels.

Key Words: ASR, Phoneme, Viseme, Speech parameters, Neural Network.

1. Introduction

Lip synchronization is the process of mapping speech to lip motion. The process is divided into four main stages. In the first stage, lip features are extracted using standard algorithms based on color, edge and motion information. In the second stage, the audio is processed and the signal parameters are extracted. In the third stage, the speech parameters are presented to a classifier in two sets, a training set and a testing set; the trained classifier identifies the given audio signal. In the last stage, the animation is carried out. The problem of mapping a speech signal to lip shape information can be solved at several different levels, depending on the speech analysis that is used [1]. There are three processing levels of speech, and they are application specific. The first level is signal level processing. The signal level concentrates on the physical relationship between the shape of the vocal tract and the sound that is produced. This method uses a large set of audio-visual parameters to train the mapping, and many algorithms are available for such a mapping, such as Vector Quantization (VQ), Neural Networks (NN) and the Gaussian Mixture Model (GMM). The second level is phoneme level processing. A phoneme is the basic unit of acoustic speech [1],[2]. The speech phrase is first segmented into a sequence of phonemes, and a mapping is then found for each phoneme in the speech signal using a lookup table that contains one viseme for each phoneme. A viseme is the visual representation of a phoneme [1],[2],[3]. The phoneme level approach is language dependent. The third level is word level processing. This model works at the phrase or word level and is more concerned with context in the speech signal. A Hidden Markov Model can be used to represent the acoustic state transitions in a word. This approach is computationally expensive.


We have used the signal level approach because it is simple, language independent and suitable for real-time implementation. As a classifier, we have used a neural network.

2. Materials and Methods

2.1. Acoustic feature extraction

Human speech can be modeled by speech parameters such as Mel Frequency Cepstrum Coefficients (MFCC), Linear Predictive Codes (LPC) and LSP. MFCC and LPC are the most widely used parameters in the area of speech processing [1], [3], [4], and we have employed them in our research. LPCs are derived on the assumption that the speech signal is linear in nature, while MFCCs are derived on the assumption that the signal is logarithmic in nature [1], [4], [5].

2.1.1. Mel frequency cepstrum coefficients (MFCC)

Mel-Frequency Cepstrum Coefficients (MFCC) form an audio feature extraction technique that extracts parameters from speech in a way similar to the human hearing system [5], [6]. They are commonly used in automatic speech recognition because they take the characteristics of the human auditory system into consideration. Additionally, these coefficients are robust and reliable with respect to variations across speakers and recording conditions. The speech signal is first divided into time frames consisting of an arbitrary number of samples. In most systems the frames overlap, which smooths the transition from frame to frame. Each time frame is then windowed with a Hamming window to eliminate discontinuities at the edges [1],[4]. The filter coefficients w(n) of a Hamming window of length N are computed according to the formula

w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1, \qquad w(n) = 0 \text{ otherwise,}

where N is the total number of samples and n is the current sample. After the windowing, the Fast Fourier Transform (FFT) is calculated for each frame to extract the frequency components of the time-domain signal; the FFT is used to speed up the processing. The logarithmic Mel-scaled filter bank is then applied to the Fourier-transformed frame. This scale is approximately linear up to 1 kHz and logarithmic at greater frequencies [7]. The relation between speech frequency f (in Hz) and the Mel scale can be established as

\text{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right).

The last step is to calculate the Discrete Cosine Transform (DCT) of the outputs from the filter bank. The DCT orders the coefficients according to significance, whereby the 0th coefficient is excluded since it is unreliable [5]. The overall procedure of MFCC extraction (pre-emphasis and framing, FFT and magnitude, Mel filter bank, logarithm, DCT/IFFT, truncation) is shown in Fig. 1.

Fig. 1. MFCC derivation
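For illustration only (this is not the authors' implementation), the pipeline of Fig. 1 can be sketched in NumPy/SciPy as below. The 8 kHz sampling rate is implied by the 128-sample (16 ms) frames used in this work; the FFT size, frame overlap and number of Mel filters are assumptions.

```python
# Minimal MFCC sketch of the Fig. 1 pipeline (illustrative assumptions noted above).
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=8000, frame_len=128, overlap=64, n_filters=26, n_ceps=12):
    # Pre-emphasis to boost high frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Split into overlapping frames and apply a Hamming window.
    step = frame_len - overlap
    n_frames = 1 + (len(signal) - frame_len) // step
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = np.stack([signal[i * step : i * step + frame_len] * window
                       for i in range(n_frames)])

    # Power spectrum of each frame via the FFT.
    n_fft = 256
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular Mel filter bank: linear below ~1 kHz, logarithmic above.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log filter-bank energies, then DCT; drop the unreliable 0th coefficient.
    log_energies = np.log(power @ fbank.T + 1e-10)
    return dct(log_energies, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]
```

Delta and delta-delta coefficients (used later to build the 42-dimensional feature vector) can be obtained by differencing these coefficients across consecutive frames.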


For each speech frame, a set of MFCCs is computed. This set of coefficients is called an acoustic vector; it represents the phonetically important characteristics of speech and is very useful for further analysis and processing in speech recognition. We have taken audio of 2 seconds, which gives approximately 128 frames, each containing 128 samples (window size = 16 ms). We have used the first 20 frames, which give a good estimation of the speech. The forty-two MFCC parameters comprise twelve original coefficients, twelve delta (first order derivative) coefficients, twelve delta-delta (second order derivative) coefficients, three log energy values and three 0th parameters.

2.1.2. Linear predictive codes (LPC)

It is desirable to compress a signal for efficient transmission and storage: a digital signal is compressed before transmission for efficient utilization of channels on wireless media. For medium or low bit rate coders, LPC is most widely used [8]. LPC calculates a power spectrum of the signal and is used for formant analysis [9]. LPC is one of the most powerful speech analysis techniques and has gained popularity as a formant estimation technique [10]. When the speech signal is passed through the speech analysis filter to remove the redundancy in the signal, a residual error is generated as output. It can be quantized with a smaller number of bits than the original signal, so instead of transferring the entire signal we can transfer this residual error together with the speech parameters to regenerate the original signal. A parametric model is computed based on least mean squared error theory; this technique is known as linear prediction (LP). By this method, the speech signal is approximated as a linear combination of its p previous samples. The LPC coefficients obtained in this way describe the formants; the frequencies at which the resonant peaks occur are called the formant frequencies [11], as shown in Fig. 2. Thus, with this method, the locations of the formants in a speech signal are estimated by computing the linear predictive coefficients over a sliding window and finding the peaks in the spectrum of the resulting LP filter. We have excluded the 0th coefficient and used the next ten LPC coefficients, as in the sketch below.
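A minimal sketch, assuming the standard autocorrelation method with the Levinson-Durbin recursion (an implementation detail not specified in the paper), of obtaining ten LPC coefficients from a windowed frame:

```python
# Illustrative LPC sketch (not the authors' code): autocorrelation + Levinson-Durbin.
import numpy as np

def lpc(frame, order=10):
    """Return LPC coefficients a[1..order] of the all-pole model
    s[n] ~= -sum_k a[k] * s[n-k]; the 0th coefficient (always 1) is dropped,
    as in the paper. Assumes a non-silent frame."""
    # Autocorrelation of the frame up to the required lag.
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:])
                  for k in range(order + 1)])

    # Levinson-Durbin recursion.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i-1:0:-1])) / err   # reflection coefficient
        a[1:i] = a[1:i] + k * a[i-1:0:-1]                 # update previous coefficients
        a[i] = k
        err *= (1.0 - k * k)                              # updated prediction error
    return a[1:]                                          # ten predictor coefficients
```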

Fig. 2. Filter response and speech generation

In speech generation, during vowel sounds the vocal cords vibrate harmonically, so quasi-periodic signals are produced, while in the case of consonants the excitation source can be considered random noise [12]. The vocal tract works as a filter, which is responsible for the speech response. This biological phenomenon of speech generation can easily be converted into an equivalent mechanical model: a periodic impulse train and random noise serve as the excitation sources, and a digital filter plays the role of the vocal tract.

2.2. Extraction of lip features

Our proposed algorithm detects the lip region using color and motion information. It detects the region of the given video that is dominated by red and that has maximum motion. The detected lip region may include other parts of the mouth, so after detecting it, the green and blue levels of the image pixels are used to separate the lips from the other parts. The lip region is dominated by red while its green and blue levels are almost the same, and the difference between red and green is higher than elsewhere. This technique is known as red exclusion because only the green and blue color components are used for the separation; it yields a clean lip region (a rough sketch is given below). The lips are elliptical in shape, as shown in Fig. 3.
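A rough sketch of the color tests described in this paragraph; the thresholds are illustrative assumptions, not values from the paper, and the strict red-exclusion variant thresholds only the green and blue channels:

```python
# Illustrative lip-region segmentation by color (assumed thresholds).
import numpy as np

def lip_mask(rgb_frame, rg_margin=25, gb_margin=15):
    """rgb_frame: HxWx3 uint8 image of the detected mouth region."""
    r = rgb_frame[:, :, 0].astype(np.int32)
    g = rgb_frame[:, :, 1].astype(np.int32)
    b = rgb_frame[:, :, 2].astype(np.int32)

    red_dominant = (r - g) > rg_margin             # red clearly above green
    green_blue_close = np.abs(g - b) < gb_margin   # green and blue nearly equal
    return red_dominant & green_blue_close         # boolean lip mask
```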

Fig. 3. Geometric feature points of the lips

The lip image is scanned to detect the maximum and minimum coordinate values in the horizontal and vertical directions. The minimum and maximum coordinates found by horizontal scanning give the x coordinates of points P1 and P2. The y coordinates of P1 and P2 are determined by minimizing the correlation with the template coefficients shown in Fig. 4, along the two vertical stripes found previously (x = P1.x, P2.x).


The coefficient mask is moved over the vertical strip at the x coordinate of feature point P1 and over the strip at P2. In each strip, the position with minimum correlation gives the y coordinate of the corresponding feature point, so by combining this x and y information, P1 and P2 can be located on the detected lips. The x coordinates of P3 and P6 are determined by finding the perpendicular bisector of the line segment P1P2. The minimum and maximum coordinate values in the vertical direction may not give the exact y coordinates of P3 and P6, so we apply a Canny edge detector to the detected lip region. To determine the y coordinates of P3 and P6, the edge map is scanned along the line P3P6, keeping only the first and the last nonzero pixels encountered. We then take the mean of the pixels on the line P3P6 and on the vertical strips immediately to its left and right; the local maxima of the gradient of these mean pixels are used to detect feature points P4 and P5. Lip width = P2.x - P1.x and lip height = P6.y - P3.y. Using the above algorithm (a simplified sketch follows Fig. 4), we have measured the lip width and lip height for various speakers. The coefficient mask is:

0 0 0 0 0
0 0 0 0 0
0 0 2 0 0
0 2 2 2 0
2 2 2 2 2
2 2 2 2 2
0 2 2 2 0
0 0 2 0 0
0 0 0 0 0
0 0 0 0 0

Fig. 4. Coefficient mask for the detection of P1.y and P2.y
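A simplified sketch of the final measurement, assuming the boolean lip mask from the previous step and omitting the correlation and edge-map refinement described above:

```python
# Illustrative lip width/height measurement from a boolean lip mask (simplified).
import numpy as np

def lip_width_height(mask):
    ys, xs = np.nonzero(mask)               # coordinates of lip pixels
    p1x, p2x = xs.min(), xs.max()           # left / right corners (P1, P2)
    mid_x = (p1x + p2x) // 2                # perpendicular bisector of P1P2
    column = ys[xs == mid_x]                # lip pixels on the bisector
    p3y, p6y = column.min(), column.max()   # upper / lower midpoints (P3, P6)
    return p2x - p1x, p6y - p3y             # lip width, lip height
```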

2.3. Audio visual mapping

After the acoustic and lip features are extracted, they are given to the classifier in two sets, a training set and a testing set. A multilayer feed-forward neural network with the back-propagation algorithm is a common choice for classification and pattern recognition [13],[14]. The Hidden Markov Model, the Gaussian Mixture Model and Vector Quantization are some of the other techniques for mapping acoustic features to visual speech [1]; a neural network is one of the good choices among them. A genetic algorithm (GA) can be used with the neural network to improve performance by optimizing the parameter combination.

Fig. 5. Structure of employed neural network

We have used a three-layer feed-forward back-propagation neural network with an input, a hidden and an output layer, as shown in Fig. 5. The acoustic features of the speech are the LPC and MFCC parameters. In our experiment, we have used m neurons in the input layer, n neurons in the hidden layer and p neurons in the output layer. The value of m is 10 when the input to the neural network is the LPC parameters, and 12 or 42 in the case of MFCC parameters. The number of neurons in the hidden layer (n) is found through experiments. The value of p is 5 because we have trained and tested the neural network for five vowels. We have trained the network for 500 epochs with a goal of 0.001, using the MATLAB functions 'traincgf' for training and 'learnwh' for learning; testing showed that this combination gives the fastest convergence. For speech-to-lip-movement animation we can use either the standard MPEG-4 face model or any parametric model that maps speech to lip motion; we have used our own parametric model for facial animation [15]. Once the neural network is trained, it gives the lip width and lip height for any unknown audio, and these parameters are presented to the parametric model to carry out the animation.
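As a rough, non-authoritative sketch of this classification stage (the paper trains the network in MATLAB with 'traincgf'/'learnwh'; the scikit-learn stand-in, its solver, activation and train/test split below are assumptions):

```python
# Illustrative vowel classifier analogous to the three-layer network described above.
# X: acoustic vectors (e.g. 42 MFCC features per sample); y: labels 'A','O','U','E','I'.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def train_vowel_classifier(X, y, n_hidden=12):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,),  # one hidden layer of n neurons
                        activation='tanh',               # assumed sigmoid-like activation
                        max_iter=500,                    # 500 training epochs
                        tol=1e-3)                        # loose analogue of the 0.001 goal
    net.fit(X_train, y_train)
    print('test accuracy:', net.score(X_test, y_test))
    return net
```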

3. Results and Discussion

The simulation results for the LPC and MFCC parameters are shown in Table 1 and Table 2 for various numbers of neurons in the hidden layer, for male and female speakers. The results show that with LPC parameters the system gives the best accuracy with 12 neurons in the hidden layer for the male speaker and 18 neurons for the female speaker. Similarly, the 12- and 42-parameter MFCC features give their best results with a minimum of 42 and 6 neurons in the hidden layer respectively, for both male and female speakers. The tables also show that as we increase the number of MFCC features, the accuracy increases.


Table 1: Accuracy measurement (%) for the male speaker

Hidden layer neurons    LPC      MFCC (12 par.)    MFCC (42 par.)
5                       60.95    52.38             95.00
6                       64.76    62.86             100.00
7                       71.42    60.00             100.00
8                       68.57    45.71             100.00
12                      79.04    65.71             100.00
18                      65.71    61.90             100.00
24                      53.33    48.47             99.04
26                      61.90    61.90             100.00
42                      63.80    66.67             99.04

Table 2: Accuracy measurement (%) for the female speaker

Hidden layer neurons    LPC      MFCC (12 par.)    MFCC (42 par.)
5                       91.11    64.44             98.00
6                       88.89    64.44             100.00
7                       86.67    66.67             100.00
8                       91.11    51.11             100.00
12                      86.67    55.56             100.00
18                      97.78    53.33             97.78
24                      95.56    51.11             97.78
26                      95.56    53.33             100.00
42                      91.11    80.00             97.78

The confusion matrix for vowel accuracy is shown in Table 3. We have tested the system against 1000 inputs of each vowel and obtain close to a 100% recognition rate for vowels. The results are derived with 42 MFCC parameters.

Table 3: Vowel recognition (recognized counts per expected vowel)

Expected    A       O      U      E       I
A           1000    0      0      0       0
O           1       995    2      2       0
U           0       2      998    0       0
E           0       0      0      1000    0
I           0       0      0      0       1000
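As a small worked example, the per-vowel and overall recognition rates follow directly from the counts in Table 3:

```python
# Recognition rates computed from the Table 3 confusion matrix.
import numpy as np

confusion = np.array([[1000,   0,   0,    0,    0],   # expected A
                      [   1, 995,   2,    2,    0],   # expected O
                      [   0,   2, 998,    0,    0],   # expected U
                      [   0,   0,   0, 1000,    0],   # expected E
                      [   0,   0,   0,    0, 1000]])  # expected I

per_vowel = confusion.diagonal() / confusion.sum(axis=1)   # e.g. O -> 0.995
overall = confusion.trace() / confusion.sum()              # 4993 / 5000 = 0.9986
print(per_vowel, overall)
```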

Snapshots of our animation model are shown in Fig. 6. The input speech is presented to the parametric model, which creates the animation according to the lip height and width returned by our neural network.

Fig. 6. Lip animation

4. Conclusions

Speech sounds vary according to context, and hence one-to-one phoneme-to-viseme mapping is a most difficult task; it is out of scope to generalize the same model to any language. The results compared in the section above show the superiority of our results over those derived by other researchers. Vowels are voiced speech and can be considered a quasi-periodic source of excitation, while consonants (unvoiced speech) resemble random noise. In our experiment, we obtain close to a 100% recognition rate for vowels. The results show that the MFCC parameters give a very good response for both male and female speakers. As the human voice is nonlinear in nature, Linear Predictive Codes are not a good choice for speech estimation.


MFCCs are derived from a logarithmically spaced filter bank, combined with the concept of the human auditory system, and hence give a better response than the LPC parameters.

References

[1] Goranka Zoric (2005): Automatic lip synchronization by speech signal analysis, Master Thesis, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb.
[2] Tsuhan Chen, Ram Rao (1998): Audio-visual integration in multimodal communication, Proc. IEEE, Vol. 86, Issue 5, pp. 837-852.
[3] Goranka Zoric, Igor S. Pandzic (2005): A real time lip sync system using a genetic algorithm for automatic neural network configuration, Proc. IEEE International Conference on Multimedia & Expo (ICME 2005).
[4] Lei Xie, Zhi-Qiang Liu (2006): A comparative study of audio features for audio to visual conversion in MPEG-4 compliant facial animation, Proc. of ICMLC, Dalian.
[5] Andreas Axelsson, Erik Bjorhall (2003): Real time speech driven face animation, Master Thesis, The Image Coding Group, Dept. of Electrical Engineering, Linkoping University, Linkoping.
[6] Xuewen Luo, Ing Yann Soon, Chai Kiat Yeo (2008): An auditory model for robust speech recognition, ICALIP International Conference on Audio, Language and Image Processing, pp. 1105-1109.
[7] Alfie Tan Kok Leong (2003): A music identification system based on audio content similarity, Thesis of Bachelor of Engineering, Division of Electrical Engineering, The School of Information Technology and Electrical Engineering, The University of Queensland, Queensland.
[8] Lahouti, F., Fazel, A.R., Safavi-Naeini, A.H., Khandani, A.K. (2006): Single and double frame coding of speech LPC parameters using a lattice-based quantization scheme, IEEE Transactions on Audio, Speech and Language Processing, Vol. 14, Issue 5, pp. 1624-1632.
[9] R.V. Pawar, P.P. Kajave, S.N. Mali (2005): Speaker identification using neural networks, Proceedings of World Academy of Science, Engineering and Technology, Vol. 7, ISSN 1307-6884.
[10] Alina Nica, Alexandru Caruntu, Gavril Toderean, Ovidiu Buza (2006): Analysis and synthesis of vowels using MATLAB, IEEE Conference on Automation, Quality and Testing, Robotics, Vol. 2, pp. 371-374.
[11] B. P. Yuhas, M. H. Goldstein Jr., T. J. Sejnowski, R. E. Jenkins (1990): Neural network models of sensory integration for improved vowel recognition, Proc. IEEE, Vol. 78, Issue 10, pp. 1658-1668.
[12] Ovidiu Buza, Gavril Toderean, Alina Nica, Alexandru Caruntu (2006): Voice signal processing for speech synthesis, IEEE International Conference on Automation, Quality and Testing, Robotics, Vol. 2, pp. 360-364.
[13] Syed Ayaz Ali Shah, Azzam ul Asar, S.F. Shaukat (2009): Neural network solution for secure interactive voice response, World Applied Sciences Journal, 6(9), pp. 1264-1269, ISSN 1818-4952.
[14] Chengliang Li, Richard M. Dansereau, Rafik A. Goubran (2003): Acoustic speech to lip feature mapping for multimedia applications, Proceedings of the Third International Symposium on Image and Signal Processing and Analysis, Vol. 2, pp. 829-832.
[15] Narendra Patel, Pradip Patel, Mukesh Zaveri (2009): Parametric model based facial animation synthesis, International Conference on Emerging Trends in Computing, Kamaraj College of Engg. & Tech., Tamil Nadu, India.
