Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 132 (2018) 1385–1393

www.elsevier.com/locate/procedia

International Conference on Computational Intelligence and Data Science (ICCIDS 2018)

Block Energy Based Visual Features Using Histogram of Oriented Gradient for Bimodal Hindi Speech Recognition

Prashant Upadhyaya*, Omar Farooq, M. R. Abidi

Department of Electronics Engineering, Aligarh Muslim University, Aligarh, 202001, Uttar Pradesh, India

Abstract

In this paper, a new set of visual speech features based on the Histogram of Oriented Gradients (HOG) is proposed to improve the robustness of bimodal Hindi speech recognition. For extracting the visual features, the energy per block is calculated using HOG by finding the gradient magnitude of the pixels in each cell in both the x and y directions over the region of interest (ROI). The advantage of the proposed scheme is that it reduces the dimensionality of the visual feature vectors while retaining the full information of the lip region. For a comparative study, four sets of visual features are extracted from the AMUAV corpus: Set A (two-dimensional Discrete Cosine Transform features, 2D-DCT), Set B (two-dimensional Discrete Wavelet Transform followed by DCT, 2D-DWT-DCT), Set C (static HOG) and Set D (dynamic HOG). Standard Mel Frequency Cepstral Coefficients (MFCC) together with their static and dynamic features were used as the baseline. A maximum improvement in WRA (%) of 12.73% over the baseline features is reported using the proposed feature sets.

© 2018 The Authors. Published by Elsevier Ltd.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/3.0/)
Peer-review under responsibility of the scientific committee of the International Conference on Computational Intelligence and Data Science (ICCIDS 2018).

Keywords: MFCC; Histogram of Oriented Gradient; AMUAV corpus; DCT; DWT; Feature Extraction

1. Introduction

In today's technological world, speech is considered a prominent interface between humans (users) and machines in almost all human-machine interaction (HMI) systems. One of the biggest examples is the automatic speech recognition (ASR) system, which is used to control these devices using speech as an input. Mel Frequency Cepstral Coefficients (MFCC) are the most popular acoustic features, obtained using a Mel-filter bank [1]. However, the weakness of these modern ASR systems is observed when the speech signal gets corrupted [2]. Factors such as environmental noise, background noise and the microphone used to capture the speech signal degrade the performance of an ASR system [1, 2]. One of the main challenges in improving the performance of ASR systems is to find robust features for real-time applications.

To overcome the problem of ASR in adverse background conditions, automatic visual speech recognition is the latest technique, which enables the use of two modalities, audio and visual, simultaneously [2, 3]. Usually, both the modalities

* Corresponding author. Tel.: +91-945-116-9395. E-mail address: [email protected]

1877-0509 © 2018 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of the scientific committee of the International Conference on Computational Intelligence and Data Science (ICCIDS 2018).
10.1016/j.procs.2018.05.066



are used for communicating. The bimodal approach is the most recent technique used in ASR to mitigate the effect of noise. In an audio-visual automatic speech recognition (AVASR) system, visual features extracted from the lip region are concatenated with the audio features [2, 4]. The advantage of this scheme is that the visual and audio features are orthogonal to each other [3]. Thus, if one feature stream gets corrupted due to the presence of noise, there is no effect on the other. Therefore, the improvement obtained by adding the visual information to the audio information is helpful when the system performance is measured in a noisy environment [2, 3, 4]. Initially, the McGurk effect demonstrated the use of the bimodal nature of speech to improve speech intelligibility [5]. However, this work by McGurk and MacDonald does not completely explain how visual information helps to increase the performance of audio speech recognition. The work in [6] shows that the robustness of an ASR system is increased using mouth height and width as visual features. The comparative performance of audio-visual speech recognition using the IBM ViaVoice database is reported in [7]. For extracting the visual features, a lip region of 64×64 pixels is selected (the region of interest, i.e., lip ROI), which results in a 4096-dimensional vector. For dimensionality reduction, linear discriminant analysis followed by a maximum likelihood linear transform (LDA-MLLT) was applied to select the final 41 visual features. The best results for visual speech recognition were obtained using Discrete Cosine Transform (DCT) based visual features.

Another experiment, on audio-visual phoneme recognition using the VidTIMIT database, is evaluated in [8]. In that work, two appearance-based visual features, DCT and Discrete Wavelet Transform (DWT), are compared. Initially, the 100 highest-energy DCT and DWT coefficients were extracted as visual features. For dimensionality reduction, a linear discriminant transform was applied on the feature vectors to reduce the feature dimension to 30. For the audio features, 13 MFCC coefficients were chosen as the baseline, and ∆ and ∆-∆ features were computed for both audio and visual features. The total dimensionality of the audio-visual feature vector was 180. However, only visual speech recognition results were presented. The authors of [8] claimed that it is inappropriate to compare visual phoneme features using a language model that is built on a phoneme basis. From their experimental results, they show that the DCT features outperform the DWT features. The best performance was obtained for visual features extracted from the intermediate frequency region, which contains more information about the lip movements [8]. Patrick Lucey et al. show the dependency of the audio-visual speech recognition model on phoneme-viseme grouping using the CUAVE audio-visual database [9]. In their experiment they show that the inter-class confusability of phonemes with respect to visemes increases as the signal-to-noise ratio (SNR) decreases. Further, it was reported that this inter-class confusability is mainly due to the phonetically unbalanced distribution of the phonemes.

In this paper, a new method for extracting visual features using the Histogram of Oriented Gradients (HOG) descriptor for a phonetically balanced bimodal Hindi speech recognition model is presented. HOG is a scale-invariant feature transform technique which represents image content in the form of a histogram of local pixel gradient orientations [10]. Audio features are extracted using standard MFCC features.
Further, the comparative performance in terms of word recognition accuracy (WRA) under a noisy background for the SNR range 20 dB to 0 dB using babble noise is presented. Performance using both static and dynamic visual HOG features is evaluated. This paper is organized as follows: Section 2 gives the significance of the Hindi language. In Section 3, a literature survey of visual speech recognition systems is reported. Section 4 presents the extraction of visual features using the HOG descriptor for the bimodal Hindi speech recognition system, along with the experimental results. Finally, Section 5 covers the conclusion.

2. Significance of Hindi language

Hindi is the fourth most spoken language [11]. As per the 2017 KPMG report, Hindi is the most important among the Indic languages [12]. Fig. 1(a) shows the primary and Indian language users (in millions); Indian language users are Indic language literates who prefer their primary language over English to read, write and converse with each other. Fig. 1(b) shows the total number of internet users (in millions) in India who prefer an Indian language over English. This statistical analysis shows the significance of the Hindi language over English for the Indian market. There is a difference in the number of phonemes and the phonetic transcription of English and Hindi. In Hindi, 64 phones are present, of which 39 are common to English [13]. Thus, a model developed for English may not give optimal results when applied to Hindi. Therefore, to increase the number of users and earn profit, most technologies are now enabling a Hindi speech interface for controlling/commanding the system. However, the performance of such





Fig. 1. Indian language users (in millions): (a) the number of Indic language users in India and (b) a comparative analysis of the total number of internet users for Indian versus English language.

systems degrades when they are operated under noisy conditions. Hence, to increase the system accuracy under noisy conditions, a Hindi audio-visual system can be used. However, very little research on audio-visual Hindi speech has been carried out, due to the non-availability of a large Hindi audio-visual database [4, 14, 15]. Therefore, to motivate researchers to work on Hindi audio-visual speech recognition applications, the development of a phonetically balanced Hindi audio-visual speech corpus named Aligarh Muslim University Audio Visual (AMUAV) [4] has been carried out at Aligarh Muslim University, Aligarh, India. Fig. 2 shows sample speakers from the AMUAV database. The bimodal Hindi speech database (AMUAV) is being developed at Aligarh Muslim University, Aligarh, India for research purposes. The AMUAV corpus is a continuous Hindi speech database with high-quality audio and video that contains more than 100 speakers. Each speaker in the AMUAV corpus recorded 10 sentences, of which two sentences are common to all speakers. A high-quality Sony Handycam full HD 60p camcorder was used as the recording device. Recordings have been made in realistic conditions for testing robust audio-visual schemes. The video was recorded at 640×380 resolution at 25 frames per second (fps) in full colour. The audio was recorded in 16-bit stereo at 44.1 kHz. The Hindi sentences used in the AMUAV corpus are phonetically balanced.

3. Feature Extraction Methods for Visual Speech

Visual speech processing can be described as the process of converting the information present in the lip region into a set of feature vectors [2]. These features contain compact information about the visible speech. In general, the initial step in visual speech recognition is face localization, i.e., the speaker's face is extracted from the recorded video [3]. This is done by applying a skin model technique, which is useful in face localization [2, 3, 4]. After face localization, lip segmentation is carried out. The lip region, which contains the information about the visible speech, is considered as the region of interest (ROI). The main aim is to extract compact features which retain important information about the shape of the lip motion and are helpful in identifying speech information. Popular techniques such as the Discrete Cosine Transform (DCT) [3, 4, 8], Principal Component Analysis (PCA) [2, 4], Discrete Wavelet Transform (DWT) [4, 8, 14], Active Appearance Models (AAM) [7, 16], Local Binary Patterns (LBP) [17] and Histograms of Oriented Gradients (HOG) [18, 19, 20] are presented in the literature for extracting the visual information. Recently, visual features using the HOG descriptor have shown improvement [18, 19, 20] over the other techniques [3, 4, 16, 17]. In [18], accent classification is performed using the MOBIO (English language) database, which consists of continuous speech and isolated word utterances from 56 subjects. It is reported that, while uttering a foreign language (L2), the unfamiliar phonemes of the L2 language are replaced by phonemes present in the mother language (L1). This misarticulation creates a large gap from the actual phonetic configuration of L2, resulting in higher word error rates. Further, the authors performed experiments on visual speech. For their research work, five appearance-based descriptors, 2D-DCT, PCA, DWT, HOG and LBP, were selected, with an HMM classifier for classification.
Experimental results show that the best accuracy score of 71.6% for accent classification was obtained using HOG-based features, which essentially provide local edge orientation information of the images. Another work, by Karel Palecek [19], investigates lip reading using a HOG descriptor.



Fig. 2. Twenty sample speakers from the Aligarh Muslim University Audio Visual (AMUAV) corpus.

Fig. 3. The complete procedure for extracting the ROI (lip region) from a given frame and the techniques used to extract the visual features from the ROI.
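Fig. 3 summarizes the ROI extraction pipeline detailed in Section 4: Viola-Jones face detection [21] followed by binary-image lip localization [22]. As a rough illustration only, the sketch below substitutes OpenCV's stock Haar-cascade face detector and a simple geometric lower-face crop for the authors' template matching and black-pixel counting steps; the crop fractions and the `lip_roi` helper are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a face-detection + lip-ROI cropping step (not the authors'
# exact method). Requires opencv-python and numpy.
from typing import Optional

import cv2
import numpy as np

# Stock Viola-Jones frontal-face cascade shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def lip_roi(frame_bgr: np.ndarray) -> Optional[np.ndarray]:
    """Return a grayscale lip ROI from a video frame, or None if no face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])    # largest detected face
    # Assumed heuristic: the mouth lies in the lower third, central half,
    # of the face bounding box.
    return gray[y + 2 * h // 3: y + h, x + w // 4: x + 3 * w // 4]
```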

In [19], HOG features are extracted using three orthogonal planes (TOP) spanning the x, y and t axes. Gradients are computed along the xy, xt and yt planes and the resulting feature vectors are concatenated to form the final HOG-TOP visual features. Finally, PCA was used for dimensionality reduction and the resulting feature vector was fed to an HMM classifier. Experimentally, it was shown that visual features extracted using HOG-TOP outperform other existing visual feature techniques such as DCT, PCA and LBP-TOP on the entire database. Thus, the literature provides adequate evidence of the advantage of spatio-temporal HOG features over other descriptors.

4. Proposed Methodology for Audio-visual Feature Extraction from the AMUAV corpus

For this research work, 100 speakers from the AMUAV corpus are selected. Initially, the speech signal was downsampled to 16 kHz for extracting the audio features. Audio feature extraction is carried out using Mel Frequency Cepstral Coefficients (MFCC), which are the most popular acoustic features used for speech recognition [1, 2], given as:

$\mathrm{Mel}(f_m) = 2595 \log_{10}\left(1 + \dfrac{f}{700}\right)$   (1)

To include the temporal evolution of MFCC, additional ∆ and ∆-∆ features are computed. Finally, a thirty-nine-dimensional feature vector is used for audio. Generally, audio features are extracted at a rate of 100 frames/sec and visual features are extracted





Fig. 4. HOG visual feature extraction from the AMUAV corpus.

at a rate of 25 or 30 frames/sec. Therefore, to synchronize the audio and visual streams, interpolation is performed at the visual end to match the audio frames [2, 4]. For this research work, visual features are extracted at a rate of 25 frames/sec. Interpolation is performed after ROI estimation instead of after frame extraction, which saves both memory and computational time. For feature fusion, the early integration technique [4] is applied, i.e., audio and visual features are concatenated and then passed through the classifier. For extracting the face image, the template matching method [21] is used, which returns the four coordinate points of the face in the image. The extracted face frame is then converted into a binary image and the lip region is determined by counting black pixels in the binary ROI [22]. The procedure for extracting the ROI from the recorded speaker frames is shown in Fig. 3. After extracting the ROI, three sets of visual features are extracted from the given ROI. In the first case, 2D-DCT is applied on the ROI and the first 12 coefficients are selected as the visual features. In the second case, 2D-DCT is applied on the wavelet-approximated image (using a one-level Haar wavelet) and the first 12 coefficients are selected as the 2D-DWT-DCT based features [4]. The DCT coefficient $X_{ij}$ of an M-by-N matrix Y is computed as:

$X_{ij} = a_i a_j \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} Y_{m,n} \cos\left(\dfrac{\pi(2m+1)i}{2M}\right) \cos\left(\dfrac{\pi(2n+1)j}{2N}\right)$   (2)

where

$a_i = \begin{cases} \dfrac{1}{\sqrt{M}}, & \text{if } i = 0 \\ \sqrt{\dfrac{2}{M}}, & \text{otherwise} \end{cases}$ and $a_j = \begin{cases} \dfrac{1}{\sqrt{N}}, & \text{if } j = 0 \\ \sqrt{\dfrac{2}{N}}, & \text{otherwise} \end{cases}$   (3)
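As a concrete illustration of the Set A and Set B visual features described above, the following sketch applies an orthonormal 2D DCT (Eqs. (2)-(3)) to a lip ROI, optionally preceded by a one-level Haar wavelet approximation. The paper does not specify the scan order for selecting the "first 12 coefficients"; taking them from the top-left low-frequency corner, as done here, is an assumption, and scipy and PyWavelets are assumed as the numerical back end.

```python
# Minimal sketch of the Set A / Set B visual features, under the assumptions
# stated above. The ROI is a 2-D grayscale array.
import numpy as np
from scipy.fftpack import dct
import pywt


def dct2(block: np.ndarray) -> np.ndarray:
    """Orthonormal 2D DCT-II, matching Eqs. (2)-(3)."""
    return dct(dct(block, axis=0, norm="ortho"), axis=1, norm="ortho")


def set_a_features(roi: np.ndarray, n_coeffs: int = 12) -> np.ndarray:
    """Set A: 2D-DCT of the lip ROI, keep the first n_coeffs coefficients."""
    coeffs = dct2(roi.astype(np.float64))
    k = int(np.ceil(np.sqrt(n_coeffs)))           # small low-frequency corner
    return coeffs[:k, :k].ravel()[:n_coeffs]


def set_b_features(roi: np.ndarray, n_coeffs: int = 12) -> np.ndarray:
    """Set B: one-level Haar DWT approximation followed by 2D-DCT."""
    approx, _ = pywt.dwt2(roi.astype(np.float64), "haar")   # LL sub-band
    coeffs = dct2(approx)
    k = int(np.ceil(np.sqrt(n_coeffs)))
    return coeffs[:k, :k].ravel()[:n_coeffs]


if __name__ == "__main__":
    roi = np.random.rand(64, 64)                  # stand-in for a lip ROI frame
    print(set_a_features(roi).shape, set_b_features(roi).shape)   # (12,) (12,)
```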

Finally, in the third case, HOG features are extracted from the given ROI. The visual features are obtained by finding the local orientation information produced by the lip ROI image I(x, y). The ROI image of dimension M × N is divided into cells $C_{xy}$ of size C × C pixels, i.e., small spatial regions, where C = 8. For each pixel, the edge gradients $I_x(x, y)$ and $I_y(x, y)$ in the x and y directions and the orientation $\theta(x, y)$ are calculated for each ROI, given as:

$I_x(x,y) = \dfrac{\partial I(x,y)}{\partial x}$ and $I_y(x,y) = \dfrac{\partial I(x,y)}{\partial y}$   (4)

$\theta(x,y) = \arctan\dfrac{I_y(x,y)}{I_x(x,y)}$   (5)

Finally, the gradient magnitude A(x, y) and nine orientation bins of 20° each, evenly spaced over 0°–180°, are calculated from each of the local cells:

$A(x,y) = \sqrt{I_x(x,y)^2 + I_y(x,y)^2}$   (6)
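The per-cell computation of Eqs. (4)-(6) can be sketched as follows. The derivative operator and the exact binning scheme are not specified in the paper; this sketch assumes a central-difference gradient, uses arctan2 rather than arctan to avoid division by zero, and weights each orientation bin by the gradient magnitude, which is the usual HOG convention [10].

```python
# Sketch of Eqs. (4)-(6) for one C x C cell, under the assumptions above.
import numpy as np


def cell_histogram(cell: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Return the 9-bin gradient-orientation histogram of a single cell."""
    iy, ix = np.gradient(cell.astype(np.float64))       # Eq. (4)
    theta = np.degrees(np.arctan2(iy, ix)) % 180.0      # Eq. (5), unsigned
    magnitude = np.hypot(ix, iy)                        # Eq. (6)
    hist = np.zeros(n_bins)
    bin_idx = np.minimum((theta // (180.0 / n_bins)).astype(int), n_bins - 1)
    np.add.at(hist, bin_idx.ravel(), magnitude.ravel()) # magnitude-weighted vote
    return hist


if __name__ == "__main__":
    cell = np.random.rand(8, 8)                         # C = 8 as in the text
    print(cell_histogram(cell))                         # nine bin values
```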

To improve the invariance to illumination, blocks of 2 × 2 cells with 50% overlap with the neighbouring blocks are used [18]. For each cell, the gradient magnitude corresponding to the 9 bins is calculated. Hence, for each block, (4 × 9) bins



Fig. 5. (a) Word recognition accuracy (in %), (b) insertion error (in %), (c) deletion error (in %) and (d) substitution error (in %) for audio-only (MFCC) and audio-visual (Set A, B, C and Set D) features at different levels of SNR for the noisy speech signal.

of histogram are concatenated to form the feature dimension. Further, to capture the complete information about the ROI, the energy per block, where each block consists of four cells, is computed. For dimensionality reduction, singular value decomposition (SVD) is applied on the feature set, and finally 12 visual feature coefficients are selected. Fig. 4 shows the procedure for extracting the HOG features from the given ROI. Also, to capture the temporal information, delta features are computed from the static visual HOG features. The early integration technique is used for feature fusion. Finally, each set of visual features (12 feature vectors) is concatenated with the 39 audio features, resulting in a 51-dimensional audio-visual feature vector. Thus, four sets of audio-visual features, Set A (MFCC + DCT), Set B (MFCC + 2D-DWT-DCT), Set C (MFCC + static HOG) and Set D (MFCC + ∆-HOG), are used for the comparative study of visual features.

4.1. Experimental Results

The experiment was performed using Matlab on the Windows 10 platform with HTK version 3.4.1. The bimodal Hindi speech model was trained using 70% of the data from the AMUAV corpus, and the remainder was used for testing. For the testing phase, noisy speech signals, generated by injecting babble noise from the NOISEX database at different levels of SNR, are used as input. For word recognition, a recognizer based on a continuous density hidden Markov model (CDHMM) with two Gaussian mixtures (GMM) per state was implemented. A bigram language model with a 3-state left-right HMM and a one-state silence model was used for modelling the phonemes. The Baum-Welch re-estimation algorithm was used for training the HMM models. Performance is computed in terms of word recognition accuracy (WRA), defined as:

$\mathrm{WRA}(\%) = 100(\%) - \mathrm{WER}(\%)$   (7)
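A minimal end-to-end sketch of the proposed Set C / Set D visual front end and early-integration fusion is given below. The precise definitions of the block energy, of the SVD-based reduction to 12 coefficients per frame, and of the interpolation used to synchronize the 25 fps visual stream with the 100 frames/sec audio stream are not spelled out in the paper; the choices below (sum of squared block-histogram entries, projection onto the top 12 right singular vectors of the utterance-level feature matrix, and linear interpolation) are assumptions, and the 39-dimensional MFCC + ∆ + ∆-∆ audio features are assumed to be computed separately by an external front end.

```python
# Hedged sketch of the block-energy HOG pipeline and early fusion, under the
# assumptions stated above. Requires numpy and scikit-image.
import numpy as np
from skimage.feature import hog


def block_energy_hog(roi: np.ndarray) -> np.ndarray:
    """Per-frame feature: energy of each 2x2-cell, 9-bin HOG block."""
    h = hog(roi, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), feature_vector=False)
    # h has shape (blocks_y, blocks_x, 2, 2, 9); sum of squares per block.
    return np.sum(h ** 2, axis=(2, 3, 4)).ravel()


def reduce_to_12(frames: np.ndarray, dim: int = 12) -> np.ndarray:
    """SVD-based reduction of an (n_frames, n_blocks) matrix to dim columns."""
    _, _, vt = np.linalg.svd(frames - frames.mean(axis=0), full_matrices=False)
    return frames @ vt[:dim].T


def delta_features(static: np.ndarray) -> np.ndarray:
    """Set D (delta-HOG): first-order temporal differences of static features."""
    return np.gradient(static, axis=0)


def upsample_visual(visual: np.ndarray, n_audio_frames: int) -> np.ndarray:
    """Linear interpolation from 25 fps visual frames to the audio frame rate."""
    src = np.linspace(0.0, 1.0, len(visual))
    dst = np.linspace(0.0, 1.0, n_audio_frames)
    return np.stack([np.interp(dst, src, col) for col in visual.T], axis=1)


def early_fusion(audio_39: np.ndarray, visual_12: np.ndarray) -> np.ndarray:
    """Concatenate per-frame audio (39-dim) and visual (12-dim) features."""
    visual_sync = upsample_visual(visual_12, len(audio_39))
    return np.hstack([audio_39, visual_sync])    # 51-dim audio-visual vectors


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rois = rng.random((50, 64, 64))              # 2 s of 25 fps lip ROIs
    static = reduce_to_12(np.stack([block_energy_hog(r) for r in rois]))
    audio = rng.random((200, 39))                # 2 s of MFCC+delta frames
    print(early_fusion(audio, static).shape)     # (200, 51)
```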





where the Word Error Rate (WER) is given by

$\mathrm{WER}(\%) = \dfrac{Del + Sub + Ins}{N} \times 100$   (8)

where N is the number of words used in the test, Del is the number of deletions, Sub is the number of substitutions and Ins is the number of insertion errors. As shown in Fig. 5(a), under the clean environment, audio-only ASR using MFCC as the baseline feature performed better than the audio-visual feature sets. For the clean condition, no improvement in WRA (%) is obtained using visual features. Thus, in the clean condition, for the audio features extracted from the speech signal (observation data Y), the most likely sequence of words $\hat{W}$ is found through the following Bayesian decision:

$\hat{W} = \arg\max_{W}\{P(W \mid Y)\} = \arg\max_{W}\{P(Y \mid W)\,P(W)\}$   (9)

where P(W) is the prior probability of observing some specified word sequence W and is determined by the language model. The likelihood P(Y | W) is the probability of observing speech data Y for a given word sequence W and is determined by the acoustic model. Therefore, the addition of visual features in the clean condition does not provide any complementary information to increase the likelihood of the word sequence. On the contrary, adding visual features in the clean condition increases the confusability of phonemes, thereby reducing the WRA (%). A similar trend is seen at 20 dB SNR, i.e., audio-only features performed better than audio-visual features. But as the SNR decreases (15 dB to 0 dB) the performance of the audio-only system decreases, and the integration of visual features therefore provides additional information to identify the more likely word sequence under noisy conditions. As shown in Fig. 5(a), the improvement in WRA (%) using Set A and Set B is achieved for the SNR range 10 dB to 0 dB, whereas for Set C and Set D improvement is obtained for the SNR range 15 dB to 0 dB. The maximum improvements in WRA (%) of 6.38%, 7.81%, 8.47% and 12.73% over the baseline features are reported using Set A, B, C and Set D respectively at 0 dB SNR.

Fig. 5(b)-(d) shows the error distribution for babble noise for the different feature extraction techniques. Fig. 5(b) shows the insertion error rate (%) for audio-only and audio-visual features. It can be concluded from Fig. 5(b) that, for the noisy speech signal, the insertion error increases sharply from 10 dB to 0 dB SNR. The addition of visual features has, however, lowered the insertion error of the audio-visual ASR system. In Fig. 5(c), the deletion error rate (%) is higher for the audio-visual feature sets compared to the baseline features. Similarly, in Fig. 5(d), the substitution error rate (%) for the baseline features is higher when compared with the audio-visual features. Therefore, when the total error (Del + Sub + Ins) rate is compared, the audio-only ASR system has higher error rates than the audio-visual ASR system for the SNR range 15 dB to 0 dB. Hence, it can be concluded that the addition of visual features improves the robustness of the ASR system in noisy environments. Regarding the comparison among the audio-visual features, the proposed HOG-based visual features outperform all the other visual feature techniques, as shown in Fig. 5(a). A maximum improvement in WRA for the proposed feature set (Set C) of 5.52% and 3.84% with respect to Set A and Set B is reported at 5 dB SNR. However, the dynamic information in Set D outperforms Set C at 0 dB SNR only.

5. Conclusion

This paper reports bimodal Hindi speech techniques for increasing the robustness of an ASR system under noisy background conditions. A novel approach for extracting visual features using block-energy-based visual HOG features is reported. These features are extracted by finding the gradient magnitude of the pixels in each cell in both the x and y directions using the HOG transform. This retains all the information about the lip region. The advantage of the proposed scheme is that it calculates the edge gradient for each of the local cells using nine orientation bins, which preserves the information of the lip shapes. The proposed visual feature technique has shown a maximum improvement in terms of WRA of 12.73% over the baseline features.
For comparative analysis, the performance on noisy speech using babble noise from the NOISEX database is evaluated for different signal-to-noise ratios (SNR: 30 dB to 0 dB) using the AMUAV corpus. Four sets of visual features have been reported, and selecting only 12 visual feature dimensions along with the 39 audio features has shown an increase in the word recognition rate. The proposed visual feature



has shown improvement over all the other sets. Hence, it can be concluded that the performance of an audio-visual ASR system is dependent on the type of visual features used along with the audio features. We intend to extend this work in future research so that more optimal results can be obtained using HOG descriptors. With the increasing availability of smartphones at affordable prices and high-quality video streaming over low-cost, high-speed internet, there is an additional opportunity to improve ASR systems under noisy conditions. As the performance of ASR degrades when operated in noisy environments, the addition of visual features along with audio features can improve the overall performance. However, a limitation of the audio-visual speech recognition model is that it requires a front view of the speaker's face for extracting the visual features. Any movement by the speaker can result in missing ROI frames, which can degrade the performance of the ASR. The multi-view camera facilities available in future smartphones may remove this limitation as well.

Acknowledgements

The authors would like to acknowledge the Institution of Electronics and Telecommunication Engineers (IETE) for sponsoring the research fellowship during this period of research.

References

[1] Li, Jinyu, Li Deng, Yifan Gong, and Reinhold Haeb-Umbach. (2014) "An overview of noise-robust automatic speech recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (4): 745–777.
[2] Stewart, Darryl, Rowan Seymour, Adrian Pass, and Ji Ming. (2014) "Robust audio-visual speech recognition under noisy audio-video conditions." IEEE Transactions on Cybernetics 44 (2): 175–184.
[3] Potamianos, Gerasimos, Chalapathy Neti, Guillaume Gravier, Ashutosh Garg, and Andrew W. Senior. (2003) "Recent advances in the automatic recognition of audiovisual speech." Proceedings of the IEEE 91 (9): 1306–1326.
[4] Upadhyaya, Prashant, Omar Farooq, Musiur Raza Abidi, and Priyanka Varshney. (2015) "Comparative study of visual feature for bimodal Hindi speech recognition." Archives of Acoustics 40 (4): 609–619.
[5] McGurk, Harry, and John MacDonald. (1976) "Hearing lips and seeing voices." Nature 264 (5588): 746–748.
[6] Chen, Tsuhan. (2001) "Audio visual speech processing." IEEE Signal Processing Magazine, pp. 9–31.
[7] Matthews, Iain, Gerasimos Potamianos, Chalapathy Neti, and Juergen Luettin. (2014) "A comparison of model and transform-based visual features for audio-visual LVCSR." In IEEE International Conference on Multimedia and Expo (ICME), pp. 825–828.
[8] Ahmad, Nasir, Sekharjit Datta, David Mulvaney, and Omar Farooq. (2008) "A comparison of visual features for audio visual automatic speech recognition." In Proceedings of Acoustics'08, Paris, France, pp. 6445–6449.
[9] Lucey, Patrick, Terrence Martin, and Sridha Sridharan. (2004) "Confusability of phonemes grouped according to their viseme classes in noisy environments." In Australian International Conference on Speech Science and Technology, pp. 265–270.
[10] Dalal, Navneet, and Bill Triggs. (2005) "Histograms of oriented gradients for human detection." IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005) 1: 886–893.
[11] IPA – The Phonetic Representation of Language. Available: http://www.internationalphoneticalphabet.org
[12] Indian languages – Defining India's Internet. A study by KPMG in India and Google, April 2017.
Available: https://home.kpmg.com/in/en/home/insights/2017/04/indian-language-internet-users.html
[13] Kumar, Mohit, Nitendra Rajput, and Ashish Verma. (2004) "A large-vocabulary continuous speech recognition system for Hindi." IBM Journal of Research and Development 48 (6): 703–715.
[14] Biswas, Astik, Prakash Kumar Sahu, and Mahesh Chandra. (2015) "Audio visual isolated Hindi digits recognition using HMM." International Journal of Computational Vision and Robotics 5 (3): 320–334.
[15] Mishra, A. N., Mahesh Chandra, Astik Biswas, and S. N. Sharan. (2013) "Hindi phoneme-viseme recognition from continuous speech." International Journal of Signal and Imaging Systems Engineering 6 (3): 164–171.
[16] Palecek, Karel, and Josef Chaloupka. (2013) "Audio-visual speech recognition in noisy audio environments." In 36th International Conference on Telecommunications and Signal Processing (TSP), pp. 484–487.
[17] Zhao, Guoying, Mark Barnard, and Matti Pietikainen. (2009) "Lipreading with local spatiotemporal descriptors." IEEE Transactions on Multimedia 11 (7): 1254–1265.
[18] Georgakis, Christos, Stavros Petridis, and Maja Pantic. (2016) "Discrimination between native and non-native speech using visual features only." IEEE Transactions on Cybernetics 46 (12): 2758–2771.
[19] Palecek, Karel. (2016) "Lipreading using spatiotemporal histogram of oriented gradients." In 24th European Signal Processing Conference (EUSIPCO), pp. 1882–1885.
[20] Déniz, Oscar, Gloria Bueno, Jesús Salido, and Fernando De la Torre. (2011) "Face recognition using histograms of oriented gradients." Pattern Recognition Letters 32 (12): 1598–1603.
[21] Viola, Paul, and Michael J. Jones. (2004) "Robust real-time face detection." International Journal of Computer Vision 57 (2): 137–154.





[22] Upadhyaya, Prashant, Omar Farooq, M. R. Abidi, and Priyanka Varshney. (2014) "Performance Evaluation of Bimodal Hindi Speech Recognition under Adverse Environment." In Proceedings of the 3rd International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA), Springer, pp. 347–355.
