2015 1st International conference on futuristic trend in computational analysis and knowledge management (ABLAZE 2015)
An Efficient Text Dependent Speaker Recognition using Fusion of MFCC and SBC

K V Krishna Kishore, MIEEE
Department of CSE, Vignan’s University, Guntur, Andhra Pradesh, India
[email protected]

Syed Sharrefaunnisa
M.Tech, Department of CSE, Vignan’s University, Guntur, Andhra Pradesh, India
[email protected]

Venkatramaphanikumar S, MIEEE
Department of CSE, Vignan’s University, Guntur, Andhra Pradesh, India
[email protected]
Abstract— In this paper, an efficient approach for the recognition of a speaker from text dependent speech is presented. Speaker recognition/verification systems suffer from a wide variety of problems. In the proposed approach, features are extracted with two methods, Mel Frequency Cepstral Coefficients (MFCC) and wavelet subband based cepstral coefficients (SBC), and these features are then fused through concatenation to give optimum performance. The concatenated feature set is more reliable for discriminating an impostor from a genuine speaker, and is classified with a support vector machine classifier. Performance of the proposed approach is validated on a self-generated corpus of 300 samples from 20 individuals. The proposed method outperforms other existing methods.

Keywords— Speaker Identification, MFCC, SBC, SVM, Concatenation, Fusion

I. INTRODUCTION

Biometric methods identify a person from physical traits such as fingerprints, voice, iris, hand geometry, handwriting, retina, veins, and face. Biometric applications are becoming popular in highly secure identification and personal verification solutions, and many technologies are emerging in biometrics, including enterprise-wide network security infrastructures, homeland security, law enforcement, secure electronic banking, online financial transactions, health, retail sales, and social services. Speech recognition is a promising area of research, used for voice-based call routing at customer call centers, voice dialing on mobile phones, etc. The objectives of speech recognition systems are to identify persons accurately after eliminating noise from the speech with filters, and to adapt to the environment, such as the speaker’s speech rate and accent. Speaker recognition or identification methods are generally complex, as they require detailed knowledge of signal processing and statistical modeling. Speaker recognition methods such as phonetics, Gaussian Mixture Models (GMM), Hidden Markov Models, Vector Quantization (VQ), Support Vector Machines (SVM) and Verbal Information Verification work on features extracted from various sources: 1) spectral frequencies, 2) prosodic features, 3) spectro-temporal features, 4) source features of the voice, and 5) high-level features. The proposed work addresses feature extraction with robust methods, feature normalization, and score normalization to achieve high accuracy.

Each frame of the audio signal, taken after windowing the speech corpus, is converted into a feature vector of a given size; an entire speech sample is therefore represented by a variable-length sequence of feature vectors. The outline of the paper is as follows: Section II describes a literature survey of speaker recognition systems. Section III presents the feature extraction algorithms. Section IV describes classification of the features with the SVM classifier. Section V evaluates the proposed method, tabulates the experimental results, and compares them with existing methods.

II. LITERATURE SURVEY

One of the promising areas of research in the field of speech recognition systems is speaker-dependent and speaker-independent recognition in human computer interaction. In recent years, demand for voice recognition in applications such as public surveillance systems and fraud detection has been growing, which has driven interest in both text dependent and text independent speaker recognition systems. The performance of speaker recognition systems is limited by several factors: noise from the environment, distortions generated by the surrounding acoustics, the types and positions of the microphones used, low quality signals caused by limited bandwidth, spoofing, etc. Speech recognition is mainly carried out in two stages: the first is effective extraction of features with appropriate methods, and the second is classification of the features. A good number of algorithms are available for extracting features from the speech signal; the most popular approaches are MFCC [1], Linear Prediction Coding (LPC) [1], Eigen-FFT [2], and Sub band based Cepstral coefficients (SBC) [3]. MFCC features give good results but are not immune to noise, so Sub band based Cepstral parameters, which embed denoising or enhancement at the feature extraction stage, are considered to improve on MFCC. The accuracy of a speaker recognition system is also greatly affected by low level features such as the duration of voiced fragments, power, pitch, and formants. MFCC is the most widely used feature in speech recognition but is sensitive to
978-1-4799-8433-6/15/$31.00© 2015 IEEE
noise [11]. Essentially, the DCT step in the calculation of MFCC decorrelates the filter bank energies. It has been known that the wavelet transform is a better decorrelator in coding applications. Gaussian mixture densities are typically used to model the classes, and the degree to which this modeling assumption is satisfied depends on the transformation that performs the decorrelation.

Speaker recognition is a challenging task. Most existing algorithms have concentrated only on speech recognition [4], [6]; methods for speaker recognition aimed at authentication have not been as well developed. This work proposes a system for text dependent speaker recognition that targets improved results through fusion of MFCC and SBC features. The proposed architecture is shown in Figure 1 below.

In this paper, text dependent speech samples are taken as inputs. Preprocessing is performed on the training utterances; windowing is one of the preprocessing techniques applied to the speech samples before feature extraction. Feature extraction is done with MFCC and SBC, and the extracted MFCC/SBC features are saved as feature vectors. The fundamental acoustic features extracted from the speech signal that relate to intensity are used in speaker recognition. Features obtained through mathematical transforms, of which Mel-Frequency Cepstrum Coefficients (MFCC) and Linear Prediction-based Cepstral Coefficients (LPCC) are varieties, are also used in speaker identification algorithms. From these features, a subset of equal importance is selected, modified, and used for speaker recognition. The features extracted by the two methods are fused with an augmentation (concatenation) rule. The fused feature vectors are input to the training methods that build the model database; a Support Vector Machine (SVM) is used to train the model. The performance of the model is then evaluated with test samples.

Fig 1: Architecture of the Proposed Method. (Training phase: training speech signals → preprocessing (de-noising) → feature extraction with MFCC and SBC → fusion → model training → model database. Testing phase: speech signals → preprocessing (de-noising) → feature extraction with MFCC and SBC → classifier fusion → speaker id.)

III. FEATURE EXTRACTION

The process of text dependent speaker recognition has mainly three phases: feature extraction, training based on the extracted features, and classification of speakers. For feature extraction, MFCC (Mel-Frequency Cepstral Coefficients) and SBC (Subband based Cepstral coefficients) are used.

3.1 MFCC Algorithm
Mel-Frequency Cepstral Coefficients (MFCC) are widely used for language, speech and speaker recognition. MFCC imitates the human ear in the generation of its cepstral components, and analysis of the MFCC components depends largely on the size of the frames. For speaker recognition, only ten to twelve features are used: because of the properties of the Discrete Cosine Transform (DCT), most of the signal energy is compacted into the first few coefficients, so the signal can be processed with fewer coefficients. Algebraic measures such as mean, median and standard deviation, and distributive measures such as min and max, are used in the generation of the MFCC feature set in this work. The process of acquiring the Mel Frequency Cepstrum Coefficients is as follows: denoise the speech signal acquired in the data collection phase; enhance the signal and obtain its spectrum with the Fourier Transform; map the power spectrum onto the Mel scale; take the logarithm of the powers at the Mel frequencies; apply the Discrete Cosine Transform to the log Mel powers; finally, the resulting coefficients in the time domain are the Mel Frequency Cepstrum Coefficients. The algorithm divides each speech sample into frames, computes the MFCCs of each frame, and stores them in a matrix; the coefficients are represented in frames with a constant sampling rate.
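As a rough illustration of the MFCC pipeline just outlined, the sketch below frames the signal, windows it, takes the magnitude spectrum, applies a triangular Mel filter bank, and decorrelates the log energies with a DCT. The frame size matches the paper's 256 samples at 8 kHz, but the hop size, the number of filters, and the simplified filter-bank construction are our assumptions, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=20, n_ceps=12):
    # 1) frame blocking with overlap, plus a Hamming window per frame
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len] * np.hamming(frame_len))
    frames = np.array(frames)
    # 2) magnitude spectrum of each frame via the FFT
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 3) triangular Mel filter bank, with centers evenly spaced on the Mel scale
    n_bins = spec.shape[1]
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    # 4) log filter bank energies, 5) DCT to decorrelate; keep first n_ceps
    energies = np.log(spec @ fb.T + 1e-10)
    n = energies.shape[1]
    dct_basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                        2 * np.arange(n) + 1) / (2 * n))
    return energies @ dct_basis.T   # one n_ceps-dim vector per frame

rng = np.random.default_rng(0)
feats = mfcc(rng.standard_normal(8000))   # 1 s of noise at 8 kHz
print(feats.shape)                        # → (61, 12)
```

Each row of the returned matrix is the MFCC vector of one frame, matching the paper's description of storing per-frame coefficients in a matrix.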
In the frame blocking step, the continuous speech signal is blocked into frames, with neighboring frames separated by M samples. The first frame consists of the first N samples; the second frame begins M samples after the first, overlapping it by N−M samples. In continuation, the following frames overlap the first frame by N−2M, N−3M samples, and so on, until all of the speech signal is covered. Discontinuities at the start of each frame are reduced by windowing.

Fig 2: Block Diagram of MFCC (frame blocking → windowing → FFT → Mel-frequency wrapping; spectrum → Mel spectrum → Mel cepstrum → cepstrum).

In the next stage, the Fast Fourier Transform is applied to transform the signal samples from the time domain into the frequency domain. The Fast Fourier Transform (FFT) is a fast version of the Discrete Fourier Transform, reducing the number of computations from (MN)^2 to MN·log(MN). Over a set of N samples it is defined as

    F(k) = (1/N) Σ_{x=0}^{N−1} f(x) e^{−j2πkx/N},  where k = 0, 1, 2, …, N−1.

For each signal tone’s actual frequency value on the frequency scale, a pitch value is defined on the Mel scale. The Mel scale is approximately linear for tone frequency values below 1000 Hz. At last, the logarithmic Mel spectrum is transformed back into the time domain, and the output is the set of Mel Frequency Cepstrum Coefficients. The Mel cepstrum coefficients are real, so they are transformed into the time domain by applying the Discrete Cosine Transform, because the DCT works with real numbers only and has optimized time complexity.

3.2 SBC Algorithm
The computational process of cepstrum coefficients with Sub Band Coding (SBC) is closely related to MFCC: as in MFCC, filter bank energies are computed and then decorrelated. In SBC, however, the filter bank energies are calculated with the wavelet packet transform instead of the FFT, and the low pass filters of the wavelet packet tree take the place of the Mel filter bank; the decorrelating cosine transform is the same as in MFCC. The properties of wavelets have made SBC one of the best feature extraction algorithms. The subband cepstral parameters are computed as

    SBC(i) = Σ_{j=1}^{L} log(S_j) cos(i(j − 0.5)π/L),

where i = 1, …, n, n and L are the number of SBC parameters and the total number of frequency bands respectively, and S_j is the energy of the j-th subband. In SBC, feature extraction is done similarly to MFCC but with these different methods. The sampling rate used at sample collection and conversion was 8000 Hz and the frame size was 256 samples.

IV. CLASSIFICATION WITH SVM KERNELS

Kernel functions are used to handle data that is not linearly separable; the kernel is selected according to the requirements of the model.

A. Linear Kernel
The linear kernel is the simplest kernel; it is defined as an inner product with an optional constant c: k(x, y) = x^T y + c.

B. Gaussian Kernel
The Gaussian kernel is a radial basis function (RBF) kernel.

C. Polynomial Kernel
Polynomial kernels are mostly used to handle normalized data in the training phase.

In the proposed work, the Gaussian kernel is used.

V. EXPERIMENTS AND RESULTS

Performance of the proposed work is evaluated on a speech corpus developed by our volunteers. Twenty trained volunteers were involved in developing the corpus, and each class consists of 15 speech samples. The sampling rate used for conversion is 8000 Hz. Experiments are evaluated on the corpus using 5 samples per class for training; the remaining samples are used for performance evaluation.

    Methods                            Recognition Rate
    MFCC+GMM                           91%
    SBC+GMM                            88%
    MFCC+SBC+SVM with AVG Fusion       92.5%
    MFCC+SBC+SVM with Augmentation     96.5%
Table 1: Performance of Proposed Method with 5 Samples

The proposed method gives a 96.5% recognition rate on the given corpus with 5 training samples. Performance of the proposed method is compared with other existing methods and the results are plotted in Fig 3.
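The SBC computation of Section 3.2 — subband energies from a wavelet packet tree, decorrelated by a cosine transform — can be sketched as below. A hand-rolled Haar wavelet packet tree stands in for the paper's (unspecified) wavelet filters, and the tree depth and coefficient count are illustrative assumptions.

```python
import math

def haar_step(x):
    # one level of the Haar transform: low-pass (approximation) and high-pass (detail)
    lo = [(x[i] + x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    hi = [(x[i] - x[i + 1]) / math.sqrt(2) for i in range(0, len(x), 2)]
    return lo, hi

def wavelet_packet_energies(frame, levels=3):
    # full wavelet packet tree: split every node, return leaf subband energies
    nodes = [frame]
    for _ in range(levels):
        nxt = []
        for node in nodes:
            lo, hi = haar_step(node)
            nxt.extend([lo, hi])
        nodes = nxt
    return [sum(v * v for v in node) for node in nodes]   # L = 2**levels bands

def sbc(frame, n_ceps=8, levels=3):
    # SBC(i) = sum_{j=1..L} log(S_j) * cos(i * (j - 0.5) * pi / L)
    S = wavelet_packet_energies(frame, levels)
    L = len(S)
    return [sum(math.log(S[j] + 1e-10) * math.cos(i * (j + 0.5) * math.pi / L)
                for j in range(L))            # j is 0-indexed, hence (j + 0.5)
            for i in range(1, n_ceps + 1)]

# one 256-sample frame of a 440 Hz tone at 8 kHz, as in the paper's frame size
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(256)]
coeffs = sbc(frame)
print(len(coeffs))   # 8 cepstral-like coefficients for the frame
```

The cosine sum plays the same decorrelating role as the DCT in MFCC; only the source of the band energies differs.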
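The kernels listed in Section IV, together with the augmentation (concatenation) rule used to fuse the MFCC and SBC features, can be written as plain functions; the sample vectors below are made-up illustrations, not values from the corpus.

```python
import math

def linear_kernel(x, y, c=0.0):
    # k(x, y) = x^T y + c
    return sum(a * b for a, b in zip(x, y)) + c

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)), a radial basis function kernel
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree=2, c=1.0):
    # k(x, y) = (x^T y + c)^degree
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

# fusion by augmentation: concatenate the per-sample MFCC and SBC vectors
mfcc_vec = [0.5, -1.2, 0.3]         # hypothetical MFCC features of one sample
sbc_vec = [0.9, 0.1]                # hypothetical SBC features of the same sample
fused = mfcc_vec + sbc_vec          # 5-dimensional fused feature vector
print(len(fused), gaussian_kernel(fused, fused))
```

An SVM trained on such fused vectors with the Gaussian kernel corresponds to the "MFCC+SBC+SVM with Augmentation" rows of the result tables.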
Fig 3: Comparative Study of various methods (recognition rates of MFCC+GMM, SBC+GMM, MFCC+SBC+SVM with AVG Fusion, and MFCC+SBC+SVM with Augmentation, plotted on a scale of 82% to 98%).

Performance of the proposed method is also evaluated on the corpus with 4 training samples. Performance of the proposed method against the other methods is shown in Table 2.

    Methods                            Recognition Rate
    MFCC+GMM                           89.5%
    SBC+GMM                            87.5%
    MFCC+SBC+SVM with AVG Fusion       91%
    MFCC+SBC+SVM with Augmentation     95%
Table 2: Performance of Proposed Method with 4 Samples

Performance of the proposed method is compared with other existing methods and the results are plotted in Fig 4.

Fig 4: Comparative Study of various methods with 4 samples (recognition rates plotted on a scale of 82% to 96%).

VI. CONCLUSION

This paper describes speaker recognition with text dependent speech signals. Features are extracted using MFCC and SBC and then concatenated, and the concatenated feature set is classified with an SVM. Performance of the proposed method is evaluated on a corpus of 20 speakers with 15 samples each. The proposed method gives recognition rates of 96.5% and 95% with 5 and 4 training samples, respectively, and outperforms other existing text dependent speaker recognition methods.

REFERENCES
1) Sadaoki Furui, "Cepstral Analysis Technique for Automatic Speaker Verification", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-29, no. 2, 1981.
2) Ruhi Sarikaya, Bryan L. Pellom and John H. L. Hansen, "Wavelet Packet Transform Features with Application to Speaker Identification", Proc. of IEEE Nordic Signal Processing Symp., Vigso, pp. 81-84, 1998.
3) K. R. Scherer, "Adding the Affective Dimension: A New Look in Speech Analysis and Synthesis", in Proc. International Conf. on Spoken Language Processing, pp. 1808-1811, 1996.
4) Hongbin Suo, Ming Li, Ping Lu and Yonghong Yan, "Automatic Language Identification with Discriminative Language Characterization Based on SVM", IEICE Transactions on Information and Systems, vol. E91-D, no. 3, pp. 567-575, 2008.
5) Hakan Melin, "Databases for Speaker Recognition: Activities in COST250 Working Group 2", pp. 1-8.
6) K. V. Krishna Kishore and P. Krishna Satish, "Emotion Recognition in Speech Using MFCC and Wavelet Features", Proceedings of International Conference on Advance Computing, pp. 842-847, 2013.
7) Herbert Gish and Michael Schmidt, "Text-Independent Speaker Identification", IEEE Signal Processing Magazine, pp. 18-35, 1994.
8) Khaled T. Assaleh and Richard J. Mammone, "New LP-Derived Features for Speaker Identification", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 630-639, 1994.
9) Douglas A. Reynolds and Richard C. Rose, "Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, 1995.
10) Ravi P. Ramachandran, Mihailo S. Zilovic and Richard J. Mammone, "A Comparative Study of Robust Linear Predictive Analysis Methods with Applications to Speaker Identification", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 2, 1995.
11) Manas A. Pathak and Bhiksha Raj, "Privacy-Preserving Speaker Verification and Identification Using Gaussian Mixture Models", IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 2, 2013.
12) Wu Zunjing and Cao Zhigang, "Improved MFCC-Based Feature for Robust Speaker Identification", Tsinghua Science and Technology, vol. 10, no. 2, pp. 158-166, 2005.
13) Seiichi Nakagawa, Longbiao Wang and Shinji Ohtsuka, "Speaker Identification and Verification by Combining MFCC and Phase Information", IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1085-1095, 2012.
14) Roberto Togneri and Daniel Pullella, "An Overview of Speaker Identification: Accuracy and Robustness Issues", IEEE Circuits and Systems Magazine, pp. 23-61, 2011.
15) Tomi Kinnunen, Evgeny Karpov and Pasi Fränti, "Real-Time Speaker Identification and Verification", IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 277-288, 2006.
16) Venkatramaphanikumar S and K. V. Krishna Kishore, "An Efficient Multimodal Person Authentication System Using Gabor and Sub Band Coding", Proceedings of International Conference on Computational Intelligence and Computing Research, pp. 637-341, 2013.