2011 3rd Computer Science and Electronic Engineering Conference (CEEC)
University of Essex, UK
Age Estimation Based on Speech Features and Support Vector Machine

Davood Mahmoodi
Department of Electrical Engineering and Robotics, Shahrood University of Technology, Shahrood, Iran
Email: [email protected]

Ali Soleimani
Department of Electrical Engineering and Robotics, Shahrood University of Technology, Shahrood, Iran
Email: [email protected]

Hossein Marvi
Department of Electrical Engineering and Robotics, Shahrood University of Technology, Shahrood, Iran
Email: [email protected]

Farbod Razzazi
Department of Electrical Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran

Mehdi Taghizadeh
Department of Electrical Engineering, Kazerun Branch, Islamic Azad University, Kazerun, Iran
Email: [email protected]

Marzieh Mahmoodi
Department of Biostatistics, Shiraz University of Medical Sciences, Shiraz, Iran
Email: [email protected]
Abstract- Age estimation based on human speech features is an interesting subject in Automatic Speech Recognition (ASR) systems. Some work on speaker age estimation exists in the literature, but more is needed, especially for Persian speakers. In age estimation, as in other speech processing systems, we encounter two main challenges: finding an appropriate procedure for feature extraction, and selecting a reliable method for pattern classification. In this paper we propose an automatic age estimation system that classifies Persian speakers into 6 age groups. Perceptual Linear Predictive (PLP) coefficients and Mel-Frequency Cepstral Coefficients (MFCC) are extracted as speech features, and a Support Vector Machine (SVM) is used for classification. Furthermore, the effects of varying the kernel function parameter, the frame length, the number of MFCC coefficients, and the PLP order on system performance are evaluated, and the results are compared.
Keywords- age estimation; SVM; Automatic Speech Recognition (ASR); PLP; MFCC; kernel function.

I. INTRODUCTION

In speech-based communication, age is an important factor for everyone, especially at a first meeting, in choosing how to treat the other person appropriately [1]. Human speech carries not only the semantics of the spoken phrases but also features that provide speaker-dependent non-verbal information, such as the identity, gender, emotional state, or age of the speaker. By extracting these features, we can adapt our speaking style to the person we are talking to. People do this routinely in everyday telephone communication. Some companies also need an automatic age estimation system to play suitable queue music for the different age groups of their customers.

There are some approaches in ASR for identifying dialogues with angry or dissatisfied speakers, but few studies use information about speaker characteristics such as age and gender, even though this topic has many useful applications. Notable examples are adapting the waiting-queue music, offering age-dependent advertisements to callers in the waiting queue, collecting statistics on the age distribution of a caller group, and changing the speaking habits of the text-to-speech module of the ASR system [2].

This paper is organized as follows. After the introduction, Section 2 describes the characteristics of the corpus used in this paper. Section 3 describes the features extracted for speech processing. Section 4 explains the principles of linear and nonlinear support vector machines (SVM). Finally, our experimental results are given in Section 5, followed by the conclusion in Section 6.

978-1-4577-1301-9/11/$26.00 ©2011 IEEE

II. CORPUS DETAILS

The speech samples in this research were selected from the FARSDAT database [3]. An overall block diagram of its main components is shown in Fig. 1.

Fig. 1. Main components of the FARSDAT speech database.
FARSDAT is a Farsi speech database uttered by 304 native Iranian speakers who differ in age, gender, dialect, and educational level. The most frequent words were derived by word-count processing of daily newspaper text about 7,100,000 characters long. 403 sentences (type 1) were built from the 1000 most frequent words to cover Farsi allophones in as many left and right phonetic contexts as possible, and 2 sentences (type 2) containing all Farsi phonemes (except /f/) were built to allow dialect comparison among speakers.

Each of the 304 speakers uttered twenty sentences in two sessions; the resulting 6080 utterances were segmented and labeled manually. The average speaking rate is 4.2 syllables per second. The speech was recorded in the acoustic booth of the Linguistics Laboratory of the University of Tehran. A SONY cardioid dynamic microphone with a frequency response of 80 Hz to 16 kHz and an effective output level of 61.8 dBm was used. The signal-to-noise ratio (SNR) of FARSDAT is about 31 dB. The speech was captured with Sound Blaster hardware cards installed in four 80486 IBM microcomputers and sampled at 22.05 kHz with 16-bit resolution. FARSDAT is a first step toward producing Farsi speech databases to support original and advanced research in speech sciences and technologies.

III. SPEECH PROCESSING AND FEATURE EXTRACTION

Speech processing is one of the main parts of every speech recognition system: the raw speech signal is converted into some type of parametric representation for subsequent processing. This procedure is called feature extraction. Depending on the type of processing, appropriate features must be extracted from the available database to achieve the best result.

Most previous works used MFCC as speech features [1], [2], [4], [5], [6]. In this paper both PLP and MFCC are used as features, and the effect of varying the number of MFCC coefficients and the PLP order is evaluated in order to attain the best recognition accuracy of the system.

A. MFCC

The MFCC is an effective speech feature that discards much of the information in the speech signal and represents the signal spectrum in a concise form. In the MFCC procedure, the speech signal is first divided into frames using time windows of equal length, which may or may not overlap. The MFCCs of each frame are then computed and taken as the features of that frame. Each sample thus yields a matrix of MFCC vectors of equal dimension. Because utterances differ in length, the number of vectors varies from sample to sample. To deal with this problem, the MFCC vectors of all frames are averaged, and the average is used as the speech feature vector [7]. Therefore, each sample is represented by a d-dimensional MFCC vector that forms the input to the recognition system, where d is the number of MFCC coefficients.

A detailed diagram for deriving the MFCCs of an acoustic signal is given in Fig. 2 [8].

Fig. 2. Procedure of MFCC.

B. PLP

Although MFCC has become a standard speech feature for recognition applications, PLP outperforms it under some conditions [9]. PLP approximates the auditory spectrum and leaves out information that is not very important for speech recognition. Ignoring such information gives better recognition results, because fewer distracting features remain to interfere with statistical generalization. Fig. 3 shows a block diagram of PLP speech analysis, which was first proposed by Hynek Hermansky [10].

Fig. 3. Block diagram of PLP by Hynek Hermansky.

IV. SUPPORT VECTOR MACHINE

SVM is a binary pattern classifier based on structural risk minimization (SRM) theory [11]. It has also shown good performance on multi-class pattern classification problems [12]. Fig. 4 shows an example of a linear SVM with the maximal margin and the optimal separating hyperplane between data from two classes. Let X = {x_1, ..., x_n} denote the training data set of two classes. An indicator vector y is defined as:

$$ y_i = \begin{cases} 1 & \text{if } x_i \text{ is in class 1} \\ -1 & \text{if } x_i \text{ is in class 2} \end{cases} \tag{1} $$

and the decision function is:

$$ d(x) = \operatorname{sign}(w^T x + b) \tag{2} $$

where w is the weight vector and b is the bias.
Fig. 4. Linear SVM using maximum margin.
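As a minimal sketch of the linear decision function d(x) = sign(w^T x + b) of Eq. (2), the following Python function evaluates it for plain lists of numbers. The function name and the convention of returning +1 when the score is exactly zero are illustrative assumptions, not part of the paper.

```python
def linear_decision(w, b, x):
    """Evaluate d(x) = sign(w^T x + b) for a linear SVM.

    w and x are equal-length sequences of floats, b is the bias.
    Returns +1 (class 1) or -1 (class 2). A score of exactly zero is
    mapped to +1 here, which is an arbitrary choice for this sketch.
    """
    score = sum(wk * xk for wk, xk in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Example with a hypothetical hyperplane x1 - x2 = 0.
print(linear_decision([1.0, -1.0], 0.0, [2.0, 1.0]))
print(linear_decision([1.0, -1.0], 0.0, [1.0, 2.0]))
```

The first call lies on the positive side of the hyperplane and the second on the negative side, so they receive opposite labels.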
The main idea of SVM is to maximize the margin between the hyperplane and the closest training vectors. Therefore, the optimal separating hyperplane can be obtained by solving the following quadratic problem:
$$ \min_{w,b} \ \frac{1}{2} w^T w \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 \tag{3} $$

The solution to the above constrained problem is given by the saddle points of the Lagrangian. Applying the KKT conditions to the primal Lagrangian yields the dual Lagrangian, as follows:

$$ \max_{\alpha} \ L_d = -\frac{1}{2} \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{4} $$

where the α_i are the Lagrange multipliers, and C, which is chosen by the user, sets the trade-off between maximizing the margin and minimizing the errors.

Subsequently, w and b are obtained as follows:

$$ w = \sum_{i=1}^{n_{sv}} \alpha_i y_i x_i \tag{5} $$

$$ b = y_i - w^T x_i \tag{6} $$

where n_sv is the number of input vectors that lie on the margin lines, which are called support vectors.

Real-world classification problems typically involve data that are not linearly separable. In this case, to obtain an optimized decision function, a kernel-based transformation is used that maps the input space to a higher-dimensional feature space in which the training set is separable, as shown in Fig. 5 [13]. The kernel is estimated using the expansion of the separating hyperplane based on the dual problem function and is defined by the inner product of two vectors φ(x_i) and φ(x_j), as follows:

$$ K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j) \tag{7} $$

Fig. 5. Nonlinear SVM: Mapping data from input space (left), into the higher dimensional feature space (right) [13].

Finally, the kernel-based decision function has the form:

$$ d(x) = \operatorname{sign}\left( \sum_{i=1}^{n_{sv}} \alpha_i y_i K(x_i, x) + b \right) \tag{8} $$

where, for any support vector x_j with label y_j,

$$ b = y_j - \sum_{i=1}^{n_{sv}} \alpha_i y_i K(x_i, x_j) \tag{9} $$

The most widely used kernel functions are the Gaussian Radial Basis Function (RBF):

$$ K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right) \tag{10} $$

and the polynomial kernel:

$$ K(x_i, x_j) = (x_i^T x_j + 1)^p \tag{11} $$

where γ and p are kernel function parameters.

In this paper the Gaussian RBF is used as the kernel function, and the performance of the system is evaluated for various values of γ.

V. EXPERIMENTAL RESULTS

The ages of the speakers in the FARSDAT database range from 15 to 73 years. In this work we divided the samples into 6 age classes, as follows:
a) 15-25 years: class 1
b) 26-35 years: class 2
c) 36-45 years: class 3
d) 46-55 years: class 4
e) 56-65 years: class 5
f) 66-73 years: class 6

To preserve the generality of the system, the ratio of male to female speakers and the number of utterances per speaker are selected randomly.

To equalize the level of the recorded utterances, all speech samples are first normalized. PLP or MFCC features are then extracted from each sample as feature vectors for SVM classification. LIBSVM [14] is used to train and test the SVM.

From the available data, 160 samples are selected randomly from each age class; about 62% of them are used as training data and the remaining samples as test data for the SVM classifier. Experimentally, and in agreement with [15], the Gaussian RBF was found to perform better than the polynomial kernel function, so the Gaussian RBF is used in training and testing, and the performance of the system is evaluated for various values of the kernel parameter γ.

After normalization, PLP is extracted as the first feature. Various PLP orders are compared to evaluate the accuracy of the system. Fig. 6 shows the error rate of the system for PLP orders of 16, 18, and 20 as the RBF kernel parameter gamma is swept.

Fig. 6. Plot of error rate versus RBF kernel parameter gamma, using PLP with 3 various orders as the feature vector. (Axes: Gamma, 5 to 25; Error Rate (%), 8.5 to 11.5. Curves: PLP order = 16, 18, 20.)

In the best case, an error rate of 8.84% is achieved with a PLP order of 20. Increasing the PLP order reduces the error rate, but only slightly.
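To make the role of gamma concrete, the Gaussian RBF kernel and the kernel-based decision function described in Section IV can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names, the toy support vectors, and the multiplier values are hypothetical, and in the paper the trained quantities come from LIBSVM rather than from hand-set values.

```python
import math


def rbf_kernel(xi, xj, gamma):
    """Gaussian RBF kernel: exp(-gamma * ||xi - xj||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)


def kernel_decision(support_vectors, labels, alphas, b, x, gamma):
    """Kernel decision function: sign(sum_i alpha_i * y_i * K(x_i, x) + b).

    Returns +1 or -1; a score of exactly zero is mapped to +1, which is
    an arbitrary choice for this sketch.
    """
    score = b + sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score >= 0 else -1


# Hypothetical toy model: one support vector per class.
toy_support_vectors = [[0.0, 0.0], [2.0, 0.0]]
toy_labels = [1, -1]
toy_alphas = [1.0, 1.0]

print(kernel_decision(toy_support_vectors, toy_labels, toy_alphas, 0.0, [0.1, 0.0], gamma=1.0))
print(kernel_decision(toy_support_vectors, toy_labels, toy_alphas, 0.0, [1.9, 0.0], gamma=1.0))
```

A larger gamma makes the kernel value fall off faster with distance, so each support vector influences a smaller neighbourhood. This is one reason the error rate depends so strongly on the value of gamma being swept.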
TABLE I
RESULTS OF MINIMUM ERROR RATES FOR DIFFERENT PLP ORDERS YIELDED FROM FIG. 6

PLP order    Gamma    Error Rate
16           11       9.91%
18           10.8     9.21%
20           9        8.84%

Table I lists the minimum error rates and the corresponding optimum gamma values obtained from Fig. 6. The optimum gamma values are large, and the minimum error rate is relatively acceptable.
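The experimental protocol of this section (six age classes, 160 samples per class with about 62% used for training, and a sweep over gamma that keeps the minimum error) can be sketched as follows. The helper names, the fixed random seed, and the rounding of 62% of 160 to 99 training samples are assumptions made for illustration. `error_rate_for` is a hypothetical callable that would wrap SVM training and testing for one gamma value.

```python
import random

# Age ranges of the six classes as listed in Section V (inclusive bounds).
AGE_CLASSES = [(15, 25), (26, 35), (36, 45), (46, 55), (56, 65), (66, 73)]


def age_to_class(age):
    """Map a speaker age in years to its class index 1..6."""
    for index, (low, high) in enumerate(AGE_CLASSES, start=1):
        if low <= age <= high:
            return index
    raise ValueError("age outside the 15-73 range covered by the classes")


def split_class_samples(samples, train_fraction=0.62, seed=0):
    """Randomly split the samples of one class into training and test lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def sweep_gamma(error_rate_for, gammas):
    """Evaluate every gamma and return (best_gamma, minimum_error)."""
    best_error, best_gamma = min((error_rate_for(g), g) for g in gammas)
    return best_gamma, best_error


train, test = split_class_samples(range(160))
print(len(train), len(test))
```

With 160 samples per class this split gives 99 training and 61 test samples per class, which is consistent with the stated "about 62%". The exact counts used in the paper are not given.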
The accuracy of the system is also evaluated with MFCC as the second feature. This work shows that, to attain optimum performance of the age estimation system, selecting an appropriate frame length is an important factor in calculating the MFCC coefficients.

Therefore, MFCC coefficients are extracted from each sample with various frame lengths, and the error of the system is examined by sweeping the parameter γ of the kernel function, which yields Fig. 7.

Fig. 7. Plot of error rate versus parameter gamma of the RBF kernel function, using a fixed MFCC order, with various frame lengths. (Axes: Gamma, 0 to 0.04; Error Rate (%). Curves: frame length = 10, 15, 20, 25, 30, 35, 40, 45 ms.)

As Fig. 7 shows, the system is very sensitive to variations in gamma, which calls for a careful choice of this parameter. Moreover, if a small frame length is selected, a reasonable response is not achieved. If the frame length exceeds 25 ms, the curves differ very little and the errors are almost the same.

According to the results of Fig. 7, a frame length of 25 ms gives the best result. For the next step we therefore fix the frame length at 25 ms and assess the performance of the system while changing the number of MFCC coefficients. To do this, 13, 24, and 39 MFCC coefficients are extracted from the samples. Fig. 8 shows the system error for the fixed frame length and varying numbers of MFCC coefficients.

Fig. 8. Plot of error rate versus parameter gamma of the RBF kernel function, using various MFCC coefficients, with frame length of 25 ms. (Axes: Gamma, 0 to 0.04; Error Rate (%). Curves: MFCC Coeff = 13, 24, 39.)

As shown in Fig. 8, using 39 MFCC coefficients gives much better results than the other settings. Considering Fig. 7 and Fig. 8, selecting 39 MFCC coefficients and a frame length of 25 ms and sweeping the gamma values yields an error rate of 5.89%, which is the best result in this work.

Compared to [6], which uses the same corpus, and to other previous works that use other corpora [1], [2], [4], [5], [16], this is a considerable result.

Fig. 9 shows a bar graph of the error rate between each pair of age classes. The error rate between classes 2 and 3 is very high, while the error rate between classes 1 and 6 is lower than the others. This result is plausible, because in human speech communication, too, telling the ages of two speakers apart is easier when the age gap between them is large.

Fig. 9. Bar graph of the error rate between each pair of age classes. (Axis: Error Rate (%).)
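The paper does not state how the pairwise error rates of Fig. 9 were computed. One natural reading is the one-against-one scheme that LIBSVM uses for multi-class problems, in which a binary SVM is trained for each pair of classes and the final label is chosen by voting. The sketch below illustrates that scheme under this assumption. The function names, the dictionary layout of the pairwise classifiers, and the tie-break toward the smallest class label are hypothetical choices for this example.

```python
from collections import Counter
from itertools import combinations


def class_pairs(n_classes):
    """List all unordered pairs of class labels 1..n_classes."""
    return list(combinations(range(1, n_classes + 1), 2))


def one_vs_one_predict(pairwise_classifiers, x):
    """Predict a class by majority vote over pairwise binary classifiers.

    pairwise_classifiers maps a pair (a, b) to a callable that returns
    either a or b for the input x. Ties go to the smallest class label.
    """
    votes = Counter(classify(x) for classify in pairwise_classifiers.values())
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


# Six age classes give 6 * 5 / 2 pairwise problems.
print(len(class_pairs(6)))
```

Under this reading, each bar in Fig. 9 would correspond to one of the 15 class pairs.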
TABLE II
RESULTS OF MINIMUM ERROR RATES YIELDED FROM VARIATIONS OF GAMMA IN SVM PROBLEM, FOR DIFFERENT MFCC COEFFICIENTS AND FRAME LENGTHS

MFCC    Frame Length    Gamma    Error Rate
13      30 ms           0.030    10.90%
24      30 ms           0.015    7.33%
39      30 ms           0.008    6.17%
13      25 ms           0.029    11.11%
24      25 ms           0.005    7.33%
39      25 ms           0.008    5.89%

Table II lists the minimum error rates of the system for various numbers of MFCC coefficients with fixed frame lengths of 25 and 30 ms. The best result in our evaluation is achieved with 39 MFCC coefficients, a frame length of 25 ms, and γ = 0.008.
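The two settings compared in Table II, frame length and number of coefficients, enter the pipeline at the framing and averaging steps of Section III. The sketch below shows only those two steps: converting a frame length in milliseconds to samples at the FARSDAT sampling rate of 22.05 kHz, cutting a signal into frames, and averaging per-frame feature vectors into one d-dimensional vector per utterance. The function names are illustrative assumptions, the MFCC or PLP computation itself is omitted, and non-overlapping frames are assumed by default because the paper does not state the frame shift.

```python
SAMPLE_RATE_HZ = 22050  # FARSDAT sampling rate from the corpus description


def frame_length_in_samples(frame_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a frame length in milliseconds to a whole number of samples."""
    return int(round(frame_ms * sample_rate_hz / 1000.0))


def split_into_frames(signal, frame_len, hop_len=None):
    """Cut a signal into equal-length frames, dropping any partial tail.

    hop_len defaults to frame_len, which gives non-overlapping frames.
    """
    hop = frame_len if hop_len is None else hop_len
    last_start = len(signal) - frame_len
    return [signal[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def average_vectors(vectors):
    """Average equal-length per-frame feature vectors into a single vector."""
    count = len(vectors)
    dimension = len(vectors[0])
    return [sum(v[k] for v in vectors) / count for k in range(dimension)]


print(frame_length_in_samples(25))
```

Because the per-frame vectors are averaged, every utterance ends up with a vector of the same dimension d regardless of its duration, which is what allows a fixed-size SVM input.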
VI. CONCLUSION

In this paper an acoustic age estimation system using speech features and SVM was proposed. Two speech features were used: MFCC and PLP. To obtain the best results, various PLP orders and numbers of MFCC coefficients, with different frame lengths, were examined. The best result, an error rate of 5.89%, was achieved using 39 MFCC coefficients as the feature vector with a frame length of 25 ms. A Gaussian RBF kernel function was applied for the nonlinear SVM, and the performance of the system was evaluated for various values of gamma. The experimental results show that the algorithm compares favorably with similar works.
REFERENCES

[1] M. Nishimoto, et al., "Subjective age estimation using speech sounds: Comparison with facial images," in Systems, Man and Cybernetics. SMC 2008. IEEE International Conference on, 2008, pp. 1900-1904.
[2] T. Bocklet, et al., "Age and gender recognition for telephone applications based on GMM supervectors and support vector machines," in Acoustics, Speech and Signal Processing. ICASSP 2008. IEEE International Conference on, Las Vegas, NV, 2008, pp. 1605-1608.
[3] M. Bijankhan and M. Sheikhzadegan, "FARSDAT-the Farsi spoken language database," 1994, pp. 826-829.
[4] F. Metze, et al., "Comparison of four approaches to age and gender recognition for telephone applications," 2007, pp. IV-1089-IV-1092.
[5] L. Cerrato, et al., "Subjective age estimation of telephonic voices," Speech Communication, vol. 31, pp. 107-112, 2000.
[6] S. Naini and M. Homayounpour, "Speaker age interval and sex identification based on jitters, shimmers and mean MFCC using supervised and unsupervised discriminative classification methods," in Signal Processing, 8th International Conference on, Beijing, 2006.
[7] C. H. Lee, et al., "Automatic recognition of animal vocalizations using averaged MFCC and linear discriminant analysis," Pattern Recognition Letters, vol. 27, pp. 93-101, 2006.
[8] B. Milner and X. Shao, "Clean speech reconstruction from MFCC vectors and fundamental frequency using an integrated front-end," Speech Communication, vol. 48, pp. 697-715, 2006.
[9] I. Mporas, et al., "Comparison of speech features on the speech recognition task," Journal of Computer Science, vol. 3, pp. 608-616, 2007.
[10] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87, pp. 1738-1752, 1990.
[11] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[12] A. Ganapathiraju, et al., "Applications of support vector machines to speech recognition," Signal Processing, IEEE Transactions on, vol. 52, pp. 2348-2355, 2004.
[13] D. Mahmoodi, et al., "FPGA Simulation of Linear and Nonlinear Support Vector Machine," Journal of Software Engineering and Applications, vol. 5, No. 4, pp. 320-328, 2011.
[14] C. C. Chang and C. J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, p. 27, 2011.
[15] H. T. Lin and C. J. Lin, "A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods," Department of Computer Science and Information Engineering, National Taiwan University, 2003.
[16] R. M. Hecht, et al., "Age Verification Using a Hybrid Speech Processing Approach," in INTERSPEECH-2009, Brighton, 2009, pp. 184-187.