Speaker Profiling for Forensic Applications

Amir Hossein Poorjam
Thesis submitted for the degree of Master of Science in Electrical Engineering, option Embedded Systems and Multimedia
Thesis supervisor: Prof. dr. ir. Hugo Van hamme
Assessors: Prof. dr. ir. Dirk Van Compernolle, Prof. dr. ir. Marc Moonen
Mentor: Dr. ir. Mohamad Hasan Bahari
Academic year 2013 – 2014
© Copyright KU Leuven
Without written permission of the thesis supervisor and the author it is forbidden to reproduce or adapt in any form or by any means any part of this publication. Requests for obtaining the right to reproduce or utilize parts of this publication should be addressed to Departement Elektrotechniek, Kasteelpark Arenberg 10 postbus 2440, B-3001 Heverlee, +32-16-321130 or by email
[email protected]. A written permission of the thesis supervisor is also required to use the methods, products, schematics and programs described in this work for industrial or commercial use, and for submitting this publication in scientific contests.
Preface

I would like to express my special appreciation and thanks to my promotor, Prof. Dr. Hugo Van hamme, for offering me the opportunity to do research in his group. I am grateful to him for his constant consideration and for always guiding me down the right path. I would also like to express my sincere gratitude to my daily supervisor, Dr. Mohamad Hasan Bahari, for his continuous support of my research. His thoughtful insights in this field have inspired me greatly throughout this thesis work. Hasan's good advice, support and friendship have been invaluable on both an academic and a personal level, for which I am extremely grateful. He has also provided me with the materials necessary to perform research on speaker profiling.

My thanks go to the members of the PSI Speech group at ESAT, who provided me with facilities in the speech lab to complete my thesis. I would also like to thank Prof. Dr. Dirk Van Compernolle for his advice on my first thesis presentation and during the speech group presentation sessions. My thanks go to him and the other members of the jury for reading this thesis.

Words cannot express how grateful I am to my parents for supporting me spiritually throughout my life and for providing me the opportunity to study abroad, and also to my mother-in-law and father-in-law for all their kindness and support. My special and sincere appreciation goes to my beloved wife, who has been my support at all moments. I lovingly dedicate this thesis to her, who supported me each step of the way.

I would like to thank my previous supervisor during my Bachelor's studies, Prof. Dr. Jahangir Bagheri, who has always encouraged me during my studies and motivated me to continue my higher education in this field. I learned a lot from his knowledge and personality.

Finally, my thanks go to all my lovely friends: Ali Charkhi, Mostafa Yaghobi, Dr. Hadi Aliakbarian, Taha Mirhoseini, Dr. Majid Hosseinzadeh, Reza Sahraeian, Milad Yavari, Saeed Reza Toghi, Hasan Farrokhzad, Rahim Khanizad and all other new and old friends for accompanying, encouraging and helping me during my study in Leuven.

Amir Hossein Poorjam
Contents

Preface
Abstract
List of Abbreviations and Symbols
1 Introduction
2 Automatic Speaker’s Age Estimation from Spontaneous Telephone Speech
  2.1 Introduction
  2.2 System Description
  2.3 Experimental Setup
  2.4 Results and Discussion
  2.5 Conclusion
3 Automatic Speaker’s Height Estimation from Spontaneous Telephone Speech
  3.1 Introduction
  3.2 System Description
  3.3 Experimental Setup
  3.4 Results and Discussion
  3.5 Conclusion
4 Automatic Speaker’s Weight Estimation from Spontaneous Telephone Speech
  4.1 Introduction
  4.2 System Description
  4.3 Experimental Setup
  4.4 Results and Discussion
  4.5 Conclusion
5 Automatic Smoker Detection from Spontaneous Telephone Speech
  5.1 Introduction
  5.2 System Description
  5.3 Experimental Setup
  5.4 Results and Discussion
  5.5 Conclusions
6 Multitask Speaker Profiling
  6.1 Introduction
  6.2 System Description
  6.3 Experimental Setup
  6.4 Results and Discussion
  6.5 Conclusion
7 Conclusion
  7.1 Summary and Contributions
  7.2 Future Direction
A Artificial Neural Networks (ANNs)
  A.1 Multilayer Perceptron Neural Networks
  A.2 Regression and Classification using MLPs
  A.3 Limitations of ANNs
B Least Squares Support Vector Machines (LSSVM)
  B.1 Least Squares Support Vector Machines for Classification
  B.2 Least Squares Support Vector Machines for Regression
C Logistic Regression
  C.1 Logistic Regression for Binary Classification
  C.2 MLE in the Logistic Regression Model
  C.3 Advantages and Limitations of the Logistic Regression
Bibliography
Abstract

Speech signals convey important paralinguistic information such as the age, gender, body size, language, accent and emotional state of speakers. Automatic identification of speaker traits and states has a wide range of forensic, commercial and medical applications in real-world scenarios. This thesis proposes a novel approach for the automatic estimation of four forensically important components of speaker profiling systems, namely speaker age, height, weight and smoking habit, from spontaneous telephone speech signals. In this method, each utterance is modeled using the i-vector framework, which is based on factor analysis on Gaussian Mixture Model (GMM) mean supervectors, and the Non-negative Factor Analysis (NFA) framework, which is based on a constrained factor analysis on GMM weight supervectors. Artificial Neural Networks (ANNs) and Least Squares Support Vector Regression (LSSVR) are then employed to estimate the age, height and weight of speakers from given utterances. Various classification techniques such as ANNs, Logistic Regression (LR), the Naive Bayesian Classifier (NBC), Gaussian Scoring (GS) and Von Mises-Fisher Scoring (VMF) are also utilized to perform smoking habit detection. Since GMM weights provide complementary information to GMM means, a score-level fusion of the i-vector-based and the NFA-based recognizers is considered for the speaker age estimation and smoking habit detection tasks to improve performance.

In addition, inspired by the human learning system, in which related tasks are learned in interaction with each other, a multitask speaker profiling approach is proposed to evaluate the correlated tasks simultaneously and, consequently, to boost the accuracy of speaker age, height, weight and smoking habit estimation. To this end, a hybrid architecture involving the score-level fusion of the i-vector-based and the NFA-based recognizers is proposed. ANNs are then employed to provide an appropriate architecture to share the learned information among all tasks while they are learned in parallel. The proposed method differs from previous speaker profiling approaches in two major respects. First, the information in both GMM means and weights is employed through a score-level fusion of the i-vector-based and the NFA-based recognizers. Second, by applying multitask learning, correlated tasks, which are usually investigated in isolation, are evaluated simultaneously and in interaction with each other. The suggested approach is evaluated on telephone speech signals of the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) corpora. Experimental results over 1194 utterances show the effectiveness of the proposed method in automatic speaker profiling.
List of Abbreviations and Symbols

General Abbreviations

ANNs : Artificial Neural Networks
AUC : Area Under the ROC Curve
BFG : Broyden-Fletcher-Goldfarb
CC : (Pearson) Correlation Coefficient
CGF : Fletcher-Reeves Conjugate Gradient
GMM : Gaussian Mixture Model
GS : Gaussian Scoring
LM : Levenberg-Marquardt
LR : Logistic Regression
LSSVM : Least Squares Support Vector Machines
LSSVR : Least Squares Support Vector Regression
MAE : Mean Absolute Error
MAP : Maximum A Posteriori
MFCC : Mel-Frequency Cepstrum Coefficient
MLP : Multilayer Perceptron
MTL : Multitask Learning
NBC : Naive Bayesian Classifier
NFA : Non-negative Factor Analysis
NIST : National Institute of Standards and Technology
RBF : Radial Basis Function
ROC : Receiver Operating Characteristic
SGR : Subglottal Resonance
SRE : Speaker Recognition Evaluation
STL : Single-Task Learning
UBM : Universal Background Model
VMF : Von Mises-Fisher Scoring
VTL : Vocal Tract Length
General Symbols and Definitions

x_i : ith utterance
o : Acoustic vector
y_i : Label of the ith utterance
y_i^A : Age label of the ith utterance
y_i^H : Height label of the ith utterance
y_i^W : Weight label of the ith utterance
y_i^S : Smoking habit label of the ith utterance
ŷ : Estimated label
λ : Parameters of the UBM
M : GMM mean supervector
T : Subspace matrix (i-vector framework)
u : The UBM mean supervector
v : i-vector
w : GMM weight supervector
L : Subspace matrix (NFA framework)
b : The UBM weight supervector
r : NFA vector
N : Total number of test samples
F0min : Lowest fundamental frequency of the voice
θ : Parameters of the logistic regression model
Ψ : Covariance matrix
Cllr,min : Minimum Log-Likelihood-Ratio Cost
Chapter 1
Introduction

Speech signals convey important information about speakers, such as age, gender, body size, language, accent and emotional state. Speaker profiling refers to extracting such information about a speaker from his/her speech pattern. Automatic identification of speaker characteristics has a wide range of forensic, commercial and medical applications in real-world scenarios.

Forensics is one of the most important areas of application for speaker profiling, where it can give cues to the identities of unknown speakers. Police investigators continuously look for technologies that enhance investigative techniques; hence, speaker profiling is considered an important investigative tool for police work. In some forensic scenarios, a voice recording of the criminal act is available, e.g. a threat call or a blackmail call. Police inspectors may have a list of suspects, but no recording of a suspect that can be compared with the voice of the unknown speaker, and they might lose time by checking all suspects. Since age and physical characteristics are important factors when forming a picture of an unknown speaker, it can be beneficial in the early stages of a police investigation to rank suspects according to objective criteria such as gender, age and body size. This task falls into the automatic speaker profiling category. Identification of speaker characteristics can also be performed by human listeners, in which case the recorded voice sample is presented to a wide public to find suspects [74]; this is outside the scope of this study.

Speaker profiling is also used in other applications, such as improving service quality in dialog systems, categorizing large music databases with potentially unknown artists, protecting children in web environments, interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of a speaker, and adapting waiting-queue music to offer the most suitable advertisements to callers in the waiting queue [73, 77, 116, 101, 38, 91, 90, 62]. A variety of traits of a speaker can be inferred from his/her speech. In this study, however, only four characteristics are investigated, namely speaker age, height, weight and smoking habit, which can be considered the most important traits from the forensic point of view.

In automatic speaker profiling, we need features in the voice pattern that provide cues
to speakers' characteristics. Most of the voices used in speaker profiling are speech samples; however, other voices such as laughs or screams may also be forensically important. Various acoustic features, such as fundamental frequency, sound pressure level, voice quality, distribution of spectral energy, amplitude, pitch, formants, the vocal tract length warping factor, jitter, shimmer and speech rate, have been demonstrated to have a (weak or strong) correlation with aspects of speakers' characteristics. For instance, segment duration, sound pressure level range and cepstral features were reported as the acoustic features that correlate most with the age of speakers [89, 1, 99, 69, 112], and speech rate was reported as an acoustic feature with a significant correlation with the weight of speakers [104]. However, the relations between these acoustic features and speaker traits are usually influenced by other factors, such as language, gender, speech context, smoking habits, level of intoxication, body size, channel conditions and emotional state, which are typically not tractable in real-world situations [21, 87, 12, 13].

Modeling speech utterances with Gaussian Mixture Model (GMM) mean supervectors has been considered an effective approach in speaker recognition systems [25]. However, due to the high-dimensional nature of these vectors, a robust model based on GMM mean supervectors is not easily obtained when limited data are available. Efforts towards effective dimensionality reduction of GMM mean supervectors, such as weighted-pairwise principal component analysis (WPPCA) based on the nuisance attribute projection technique, have improved the performance of speaker recognition systems [38]. Recent advances using the i-vector framework, based on factor analysis for GMM mean adaptation and decomposition, have effectively increased the accuracy of speaker profiling systems [33].
The i-vector framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, can be effectively substituted for GMM mean supervectors in speaker profiling tasks [9, 83]. In addition, various studies show that GMM weights carry complementary information to GMM means [6, 62, 115, 11, 7, 8]. A recently introduced framework, named Non-negative Factor Analysis (NFA), is based on the adaptation and decomposition of GMM weights and yields a new low-dimensional utterance modeling approach.

In this study, novel approaches for four forensically important speaker profiling tasks, based on the i-vector and NFA frameworks, are proposed in two steps. In the first step, new techniques for speaker age, height and weight estimation as well as smoking habit detection are proposed independently. This step provides baselines for the next phase of the experiments, in which a new method is proposed to investigate the correlated tasks simultaneously in the form of a multitask learning approach. The goal of this study is to improve the performance of the above-mentioned speaker profiling tasks. To demonstrate the effectiveness of the proposed methods, a large corpus of speech samples consisting of the National Institute of Standards and Technology (NIST) 2008 and 2010 speaker recognition evaluation (SRE) databases is utilized.

In Chapter 2, a new approach to automatic speaker age estimation is proposed. This approach is based on a hybrid architecture of the i-vector and NFA frameworks. The distinction of the proposed method from previous speech-based
methods is that it employs the information in the GMM weights in conjunction with the information in the GMM means, through a score-level fusion of the i-vector-based and the NFA-based estimators, in order to enhance the accuracy of age estimation. Two different function approximation methods, namely least squares support vector regression (LSSVR) and artificial neural networks (ANNs), are utilized and compared in this chapter.

The proposed methods for speaker height and weight estimation based on the i-vector framework are described in detail in Chapter 3 and Chapter 4, respectively. The goal there is to investigate the effectiveness of the i-vector framework in the estimation of speakers' body size. In these chapters, height and weight estimation is performed by training LSSVR and ANNs on the i-vectors.

Various speech analysis systems, such as speaker gender detection, age estimation, intoxication-level recognition and emotional state identification, are influenced by smoking. Due to the importance of an automatic smoking habit detection system and of analyzing the effects of smoking on speech signals in forensic applications, automatic smoker detection from spontaneous telephone speech signals is proposed, to my knowledge for the first time, in Chapter 5. In this method, each utterance is modeled using the i-vector and NFA frameworks. Then, various classification algorithms are employed to detect smokers. Finally, score-level fusion of the i-vector-based and the NFA-based recognizers is considered to boost the classification accuracy.

Inspired by the human learning system, in which related tasks are learned in interaction with each other, a multitask learning (MTL) approach is proposed in Chapter 6 to improve the performance of the speaker profiling system. Using an MTL approach, this study aims at evaluating the correlated tasks simultaneously and in interaction with each other.
In addition, that chapter explores MTL as an approach to improve the accuracy of recognizers by sharing the learned information between related tasks. Finally, Chapter 7 concludes this thesis and suggests possible directions for future work.

In this study, various classification and regression techniques are employed. Among them, ANNs, LSSVR and logistic regression (LR) are the most important and most commonly used, and their concepts and relations are therefore described in more detail. In order to maintain the integrity of the content of the chapters, they are elaborated in Appendix A, Appendix B and Appendix C, respectively.
Chapter 2
Automatic Speaker’s Age Estimation from Spontaneous Telephone Speech

2.1 Introduction
Speaker age estimation, as an important component of speaker profiling, can be utilized in forensic applications to direct police investigations. However, the range of its applications is not limited to forensic cases, since it can also be effectively employed in commercial, medical and educational settings. Service customization, protection of children in web environments, adaptation of waiting-queue music to offer the most suitable advertisements to callers in the waiting queue, and human-computer interaction systems are examples of applications of automatic speaker age estimation in other fields. This wide range of applications has attracted much research attention and encouraged precise investigation of automatic age estimation.

Like other speaker profiling tasks, automatic age estimation from speech signals involves two problems: first, finding an appropriate utterance modeling procedure that extracts the acoustic features most relevant to the age of the speaker, and second, providing an appropriate function approximator or classifier that estimates the age of the speaker as accurately as possible. A reliable automatic speaker age estimator therefore requires a large corpus of utterances with corresponding age labels, containing voices uttered by speakers covering a wide range of ages. In addition, there is a difference between the perceived age and the calendar age of a speaker: the perceived age is modified by factors such as drinking and smoking habits and physiological condition [21, 84]. The authors of [93] showed that the correlation between calendar age and perceived age is 0.88, whereas the correlation reported in [79] was 0.77. This issue makes the problem of automatic age estimation more challenging.

As the speech production system is modified with aging, speech is affected in numerous ways. Many acoustic features such as fundamental frequency, sound
pressure level, voice quality, distribution of spectral energy, amplitude and speech rate are modified with aging [85, 2, 113, 87]. These age-dependent features can be used in automatic age estimation. Schötz and Müller investigated the correlation of numerous acoustic features with the age of speakers and reported two features, namely segment duration and sound pressure level range, as the acoustic features that correlate most with age [89]. Ajmera et al. applied the discrete cosine transform to the cepstral coefficients and showed that the coefficients corresponding to the lower modulation frequencies provide the best discrimination of age [1]. In further studies, other features such as pitch, energy, formants, the vocal tract length warping factor and speaking rate were added to the cepstral features at the frame or utterance level to improve performance [99, 69, 112]. However, the relations between these acoustic features and age are usually influenced by other factors, such as language, gender, speech context, smoking habits, level of intoxication, body size, channel conditions and emotional state, which are typically not tractable in real-world situations [21, 87, 12, 13].

Furthermore, many studies have focused on the classification of speakers into age groups by utilizing techniques such as Gaussian Mixture Model (GMM) mean supervector and Support Vector Machine systems [19, 66, 28, 105], nuisance attribute projection [37], parallel phoneme recognizers [70], maximum mutual information training [58] and anchor models [37, 58]. These techniques were mostly taken from speaker verification and language identification applications. By combining various classification methods, significant improvements in the accuracy of speaker age classification have been reported in [75, 105, 69, 20, 58].
In a study by Bocklet et al., the ages of children in pre-school and primary school were effectively estimated by modeling speech signals with GMM mean supervectors and a support vector regression (SVR) approach [19]. Although GMM mean supervectors are effective in speaker age estimation, they are high-dimensional vectors, so obtaining a robust model is not straightforward, especially when limited data are available. Dobry reduced the dimension of GMM mean supervectors by means of weighted-pairwise principal component analysis (WPPCA) based on the nuisance attribute projection technique, and enhanced the performance of age estimation [38].

In the field of speaker recognition, recent advances using the i-vector framework [33] have considerably increased classification accuracy [9]. The i-vector framework provides a compact representation of an utterance in the form of a low-dimensional feature vector, and GMM mean supervectors can be effectively substituted by i-vectors in speaker age estimation [9, 10]. Various studies demonstrate that although GMM weights, which have a lower dimension than Gaussian mean supervectors, convey less information than GMM means, they contain information complementary to the GMM means [6, 62, 115, 11]. Bahari et al. have recently introduced a new framework based on factor analysis for GMM weight adaptation and decomposition [7, 34]. In this method, named non-negative factor analysis (NFA), the applied factor analysis is constrained such that the adapted GMM weights are non-negative and sum to unity. This method, which yields a new low-dimensional utterance representation, was successfully
applied to speaker and language/dialect recognition [7, 8].

In this chapter, a novel approach for speaker age estimation based on a compound architecture of the i-vector and NFA frameworks is proposed. This architecture consists of two subsystems based on the i-vectors and the NFA vectors. To improve the performance of the proposed speaker age estimation, score-level fusion of the i-vector-based and the NFA-based function approximators is also considered. The advantage of this method over previous age estimation methods is that information in both the GMM means and the GMM weights is employed to enhance the accuracy of age estimation. To select an accurate regression approach for this problem, two different function approximation approaches, namely least-squares support vector regression (LSSVR) and artificial neural networks (ANNs), are compared. In this research, the effectiveness of the proposed method is investigated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. Experimental results confirm the effectiveness of the proposed approach compared with the results of a baseline obtained from the same database [9].

The rest of the chapter is organized as follows. In Section 2.2, the problem of automatic age estimation is formulated and the proposed approach is described. Section 2.3 explains the experimental setup. The evaluation results are presented and discussed in Section 2.4. Section 2.5 concludes the chapter.
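As a concrete illustration of the score-level fusion idea, the minimal sketch below combines the outputs of an i-vector-based and an NFA-based age estimator as a convex combination, with the fusion weight chosen to minimize mean absolute error on held-out data. All numbers and function names here are synthetic placeholders, not values or code from the thesis:

```python
import numpy as np

def fuse_scores(y_ivec, y_nfa, alpha):
    """Score-level fusion: convex combination of the two estimators' outputs."""
    return alpha * np.asarray(y_ivec) + (1.0 - alpha) * np.asarray(y_nfa)

def tune_alpha(y_ivec, y_nfa, y_true, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the fusion weight minimizing mean absolute error on held-out data."""
    maes = [np.mean(np.abs(fuse_scores(y_ivec, y_nfa, a) - y_true)) for a in grid]
    return float(grid[int(np.argmin(maes))])

# Toy illustration with synthetic age estimates (placeholder numbers)
y_true = np.array([25.0, 40.0, 60.0])
y_ivec = np.array([28.0, 38.0, 55.0])   # i-vector-based estimates
y_nfa = np.array([22.0, 45.0, 63.0])    # NFA-based estimates
alpha = tune_alpha(y_ivec, y_nfa, y_true)
fused = fuse_scores(y_ivec, y_nfa, alpha)
```

Because the search grid contains alpha = 0 and alpha = 1, the fused estimate can never do worse than the better individual system on the data used for tuning; in practice the weight would be tuned on a development set separate from the evaluation data.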
2.2 System Description
In this section, after the problem formulation, the main constituents of the proposed method are described.
2.2.1 Problem Formulation
In the speaker age estimation problem, we are given a set of training data $D = \{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ denotes the $i$th utterance and $y_i \in \mathbb{R}$ denotes the corresponding chronological age. The goal is to design an estimation function $g$ such that, for an utterance of an unseen speaker $x_{tst}$, the estimated age $\hat{y} = g(x_{tst})$ approximates the actual age as well as possible in some predefined sense.
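As a sketch of what such a function $g$ might look like, the following trains a least-squares support vector regressor with an RBF kernel by solving the LSSVM dual linear system directly. The data are a synthetic one-dimensional toy problem rather than i-vectors, and all hyperparameter values are illustrative only:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T)
    return np.exp(-gamma * d2)

def lssvr_fit(X, y, gamma_reg=100.0, gamma_rbf=0.5):
    """Train LSSVR by solving its dual linear system:
    [[0, 1^T], [1, K + I/gamma_reg]] [b; alpha] = [0; y]."""
    n = len(y)
    K = rbf_kernel(X, X, gamma_rbf)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma_reg
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]          # bias b, dual weights alpha

def lssvr_predict(X_train, b, alpha, X_new, gamma_rbf=0.5):
    return rbf_kernel(X_new, X_train, gamma_rbf) @ alpha + b

# Toy example: 1-D feature, smooth target with a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(80)
b, alpha = lssvr_fit(X, y)
y_hat = lssvr_predict(X, b, alpha, X)
```

Unlike standard SVR, LSSVR replaces the inequality constraints with equalities, so training reduces to one linear solve instead of a quadratic program.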
2.2.2 Utterance Modeling
The first step in speaker age estimation is converting variable-duration speech signals into fixed-dimensional vectors suitable for regression algorithms, which is performed by fitting a GMM to the acoustic features extracted from each speech signal. The utterance is then characterized by the parameters of the obtained GMM. Since the available data are limited, we cannot accurately fit a separate GMM to a short utterance, especially in the case of GMMs with a high number of Gaussian components. Thus, parametric utterance adaptation techniques should be applied to adapt a universal background model (UBM) to the characteristics of the utterances in the training and testing databases. In this chapter, the i-vector framework for adapting the UBM means and the NFA framework for adapting the UBM weights are applied.

Universal Background Model and Adaptation

Consider a UBM with the following likelihood function for the data $O = \{o_1, \ldots, o_t, \ldots, o_T\}$:

$$p(o_t \mid \lambda) = \sum_{c=1}^{C} \pi_c \, p(o_t \mid \mu_c, \Sigma_c), \qquad \lambda = \{\pi_c, \mu_c, \Sigma_c\}, \quad c = 1, \ldots, C, \tag{2.1}$$
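A UBM is a large GMM of exactly this form, typically trained with EM on pooled data from many speakers. The toy sketch below fits a small diagonal-covariance GMM with plain EM; the 2-D synthetic "acoustic vectors" and tiny component count are placeholders for the MFCC features and large mixtures used in practice:

```python
import numpy as np

def fit_diag_gmm(O, C=2, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM (a miniature 'UBM') to acoustic
    vectors O of shape (T, F) with plain EM, as in equation (2.1)."""
    rng = np.random.default_rng(seed)
    T, F = O.shape
    pi = np.full(C, 1.0 / C)                       # mixture weights pi_c
    mu = O[rng.choice(T, C, replace=False)]        # means mu_c
    var = np.tile(O.var(axis=0), (C, 1)) + 1e-3    # diagonal covariances
    for _ in range(n_iter):
        # E-step: responsibilities P(c | o_t, lambda)
        log_p = (-0.5 * (((O[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(-1)
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        Nc = gamma.sum(axis=0)
        pi = Nc / T
        mu = (gamma.T @ O) / Nc[:, None]
        var = (gamma.T @ (O ** 2)) / Nc[:, None] - mu ** 2 + 1e-6
    return pi, mu, var

# Toy 'acoustic vectors': two well-separated clusters in 2-D
rng = np.random.default_rng(1)
O = np.vstack([rng.normal(-3, 0.5, (300, 2)), rng.normal(3, 0.5, (300, 2))])
pi, mu, var = fit_diag_gmm(O, C=2)
```
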
where $o_t$ is the acoustic vector at time $t$, $\pi_c$ is the mixture weight of the $c$th mixture component, $p(o_t \mid \mu_c, \Sigma_c)$ is a Gaussian probability density function with mean $\mu_c$ and covariance matrix $\Sigma_c$, and $C$ is the total number of Gaussian components in the mixture. The parameters of the UBM, $\lambda$, are estimated on a large amount of training data from speakers of different ages.

The i-vector Framework

One effective method for speaker age estimation involves adapting the UBM means to the speech characteristics of the utterance. The adapted GMM means are then extracted and concatenated to form Gaussian mean supervectors. This method has been shown to provide a good level of performance [38, 19]. Recent progress in this field, however, has found an alternative way of modeling GMM mean supervectors that provides superior recognition performance [13]. Since the Gaussian components of the UBM are adapted independently of each other, some components are not updated when training samples are limited [56]. This problem can be alleviated by linking the Gaussian components together using the Joint Factor Analysis (JFA) framework [94]. In the JFA framework, each utterance is represented by a supervector $M$, a speaker- and channel-dependent vector of dimension $C \cdot F$, where $C$ is the total number of mixture components in a feature space of dimension $F$. In JFA, it is assumed that $M$ can be decomposed into two supervectors:

$$M = s + c, \tag{2.2}$$

where $s = u + Vy + Dz$ is a speaker-dependent supervector and $c = Ux$ is a channel-dependent supervector. $s$ and $c$ are independent and normally distributed. $u$ is the speaker- and channel-independent supervector, $V$ defines a lower-dimensional speaker subspace, $U$ is a lower-dimensional channel subspace, and $D$ defines a speaker subspace. $y$ and $z$ are factors in the speaker subspaces, and $x$ is a channel-dependent factor in the channel subspace. The vectors $x$, $y$ and $z$ are random variables with standard normal distributions $N(0, I)$ which are jointly estimated.

In the JFA framework, some information about speakers can be found in the channel factor. This information can be utilized in speaker identification [32]. This
fact resulted in a new utterance modeling approach, referred to as the i-vector framework or total variability modeling [33, 32]. This method comprises both speaker variability and channel variability. Channel compensation procedures such as within-class covariance normalization (WCCN) can be further applied to compensate for the residual channel effects in the speaker factor space [51]. The i-vector framework assumes that each utterance possesses a speaker- and channel-dependent GMM supervector whose mean, $M$, can be decomposed as

$$M = u + Tv, \tag{2.3}$$
where $u$ is the speaker- and channel-independent mean supervector of the UBM, $T$ spans a low-dimensional subspace (400 dimensions in this work), and $v$ contains the factors that best describe the utterance-dependent mean offset $Tv$. The vector $v$ is treated as a hidden variable with a standard normal prior, and the i-vector is its maximum-a-posteriori (MAP) point estimate. For a sequence of $L$ frames $O = \{o_1, o_2, \ldots, o_L\}$ and a UBM $\theta_{UBM}$ of $C$ mixture components, the centralized first-order Baum-Welch statistics are given by

$$\hat{F}_c = \sum_{t=1}^{L} P(c \mid o_t, \theta_{UBM})(o_t - m_c), \tag{2.4}$$

and the i-vector for a given utterance is calculated as

$$v = (I + T'\Sigma^{-1}N(O)T)^{-1}\, T'\Sigma^{-1}\hat{F}(O), \tag{2.5}$$
where m_c is the mean of the c-th UBM mixture component, P(c | o_t, θ_UBM) is the posterior probability of the c-th mixture component, F̂(O) is a concatenation of all F̂_c's, N(O) is the diagonal matrix of dimension (CF × CF) whose diagonal blocks contain the zeroth-order Baum-Welch statistics N_c(O) = Σ_t P(c | o_t, θ_UBM), and Σ is a covariance matrix. The subspace matrix T is estimated via maximum likelihood on a large training dataset. An efficient procedure for training T and for MAP estimation of the i-vectors can be found in [57]. The i-vector takes its name from "intermediate vector", as it is an intermediate representation between an acoustic feature vector and a supervector [32]. In the total variability modeling approach, i-vectors are low-dimensional representations of an audio recording that can be used for classification and estimation purposes.

The NFA Framework

The NFA is a new framework for adaptation and decomposition of GMM weights based on a constrained factor analysis [7]. The basic assumption of this method is that, for a given utterance, the adapted GMM weight supervector can be decomposed as

w = b + Lr
(2.6)
where b is the UBM weight supervector (a 2048-dimensional vector in this study), L is a matrix of dimension (C × ρ) spanning a low-dimensional subspace, and r is a low-dimensional vector that best describes the utterance-dependent weight offset Lr.
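The decomposition in equation 2.6 can be illustrated with a small numerical sketch (toy dimensions and random values, not the thesis data): as long as the columns of L are orthogonal to the all-ones row vector h, any offset Lr leaves the adapted weights summing to one.

```python
import numpy as np

rng = np.random.default_rng(0)
C, rho = 8, 3                        # toy sizes; the thesis uses C = 2048

b = np.full(C, 1.0 / C)              # stand-in UBM weight supervector (sums to 1)
L = rng.standard_normal((C, rho))
L -= L.mean(axis=0)                  # make every column orthogonal to h = [1,...,1]
r = 1e-3 * rng.standard_normal(rho)  # small utterance-dependent factor

w = b + L @ r                        # adapted GMM weights (equation 2.6)
print(w.sum())                       # stays 1 because hL = 0
print(np.all(w > 0))                 # a small offset keeps the weights positive
```

Note that positivity is only guaranteed here because the offset is small; the constrained estimation described next is what enforces it in general.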
2. Automatic Speaker's Age Estimation from Spontaneous Telephone Speech

In this framework, neither the subspace matrix L nor the subspace vector r is constrained to be non-negative. However, unlike the i-vector framework, the factor analysis applied to estimate the subspace matrix L and the subspace vector r is constrained such that the adapted GMM weights are non-negative and sum up to one. The procedure for calculating L and r is a two-stage algorithm similar to Expectation-Maximization. In the first step, L is assumed to be known and r is updated; in the second step, r is assumed to be known and L is updated. In the Expectation-step, given an utterance O, a maximum likelihood estimate of the vector r is obtained by solving the following constrained optimization problem:

max_r  γ̄′(O) log(b + Lr)

(2.7)

subject to
  h(b + Lr) = 1 : equality constraint
  b + Lr > 0 : inequality constraint
where h is a row vector of dimension C with all elements equal to 1, γ̄ = Σ_t [γ_{1,t}, ..., γ_{C,t}]′, and γ_{c,t} is the occupation count for class c and frame t. In the case of a square full-rank L, this constrained optimization problem can be solved analytically as follows:

r(O) = L⁻¹ ( (1/τ) γ̄(O) − b )

(2.8)

where τ denotes the total occupation count (the total number of frames).
However, since L is not square full-rank, this constrained optimization does not have an analytical solution and should be solved using iterative optimization approaches. These methods are time-consuming for a large number of utterances. Relaxing the constraints and converting this constrained optimization into an unconstrained optimization problem decreases the computation time. Since the UBM weights sum up to 1, hb + hLr = 1, which implies hLr = 0. This constraint holds for any r if h is orthogonal to all columns of L. In the Maximization-step, L is therefore calculated such that hL = 0 holds. If any of the C inequality constraints in equation 2.7 is violated, the cost function cannot be evaluated. This violation can be prevented by controlling the step size of the maximization procedure. The exception is when an element of γ̄(O) is equal to zero; substituting the zero components of γ̄(O) by very small positive values eliminates this problem as well. Now that the constrained optimization problem has been converted into an unconstrained one, various optimization techniques such as the gradient ascent algorithm can be used to calculate the maximum likelihood estimate of r. The gradient ascent algorithm has the following form:

r_i = r_{i−1} + α_E ∇f(r_{i−1})
(2.9)
∇f(r) = L′ ( γ̄(O) / (b + Lr) )

(2.10)
where i is the index of gradient ascent iterations, α_E is the learning rate, and ∇ denotes the gradient operator; the division in equation 2.10 is element-wise.
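As a concrete sketch of this Expectation-step (toy dimensions and random statistics in place of real occupation counts, and simple step-halving standing in for the step control described above):

```python
import numpy as np

rng = np.random.default_rng(1)
C, rho = 8, 2
gamma_bar = rng.random(C) * 50 + 1.0   # toy occupation counts, all positive
b = np.full(C, 1.0 / C)                # stand-in UBM weight supervector
L = rng.standard_normal((C, rho))
L -= L.mean(axis=0)                    # hL = 0, so the equality constraint holds

def f(r):                              # objective of equation 2.7
    return gamma_bar @ np.log(b + L @ r)

def grad(r):                           # gradient of equation 2.10 (element-wise division)
    return L.T @ (gamma_bar / (b + L @ r))

r, alpha_E = np.zeros(rho), 1e-4
for _ in range(300):                   # plain gradient ascent (equation 2.9)
    step = alpha_E * grad(r)
    while np.any(b + L @ (r + step) <= 0):
        step *= 0.5                    # control the step to keep b + Lr > 0
    if f(r + step) < f(r):
        alpha_E *= 0.5                 # shrink the learning rate on overshoot
        continue
    r = r + step

print(f(r) >= f(np.zeros(rho)))        # the likelihood did not decrease
print(abs((b + L @ r).sum() - 1.0) < 1e-9)
```

The step-halving loop plays the role of "controlling the steps of the maximization approach": every accepted iterate keeps the adapted weights strictly positive and summing to one.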
If a Gaussian distribution is considered for the prior of r, the objective function in equation 2.7 and its gradient (given in equation 2.10) are modified to the following forms:

f(r) = γ̄′(O) log(b + Lr) − (1/(2δ²)) r′r

(2.11)

∇f(r) = L′ ( γ̄(O) / (b + Lr) ) − r/δ²

(2.12)
where δ is the standard deviation of the prior distribution, which forces r to have small elements. In order to keep w non-negative, a small value is selected for the variance of the Gaussian prior. In the Maximization-step, r is assumed to be known for all utterances in the training database. Thus, L can be calculated by solving the following constrained optimization problem:

max_L  Σ_s γ̄′(O(s)) log[b + Lr(O(s))]

(2.13)

subject to
  h(b + Lr(O(s))) = 1 : equality constraint
  b + Lr(O(s)) > 0 : inequality constraint
To solve this constrained optimization problem, iterative optimization techniques should be employed. As in the Expectation-step, all equality constraints in equation 2.13 can be reduced to the single constraint hL = 0. By controlling the step size, violation of the inequality constraints can also be avoided. Solving this optimization problem using the projected gradient algorithm [97] results in the following equations:

L_i = L_{i−1} + α_M P ∇f(L_{i−1})

(2.14)

∇f(L) = Σ_s ( γ̄(O(s)) / (b + Lr(O(s))) ) r′(O(s))

(2.15)

P = I − (1/C) h′h

(2.16)
where i is the index of gradient ascent iterations, α_M is the learning rate, and I is the identity matrix of dimension C. The subspace matrix L is estimated over a large training dataset. The obtained subspace vectors representing the utterances in the training and test datasets are used to estimate the age of speakers in this chapter. This low-dimensional utterance representation approach has been successfully applied to speaker and language/dialect recognition tasks [8].
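The projection matrix of equation 2.16 is what preserves the orthogonality constraint during the M-step: for any raw gradient G, the projected update PG satisfies h(PG) = 0, so hL = 0 carries over from one iteration to the next. A minimal check (toy sizes):

```python
import numpy as np

C = 6
h = np.ones((1, C))                   # row vector of ones (dimension C)
P = np.eye(C) - (h.T @ h) / C         # projection matrix of equation 2.16
G = np.random.default_rng(2).standard_normal((C, 4))  # an arbitrary raw gradient

print(np.allclose(h @ (P @ G), 0))    # the projected gradient is orthogonal to h
```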
2.2.3
Function Approximation
In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least-squares support vector regression (LSSVR), are employed for the i-vector-based and the NFA-based function approximation. In addition, an ANN is used to perform the score-level fusion.

Artificial Neural Networks

A multilayer perceptron (MLP) is a supervised, feed-forward neural network which is widely applied to regression problems due to its ability to approximate complex nonlinear functions from input data [50, 54]. An MLP is usually trained with a derivative-based optimization algorithm such as back-propagation. Different training methods have been suggested over the last decades [50, 86, 46, 65] to enhance the training speed, provide more memory-efficient methods and improve convergence properties.

A feed-forward neural network has a layered structure: an input layer, one or more hidden layers and an output layer. The input layer consists of sensory nodes, and the number of input neurons equals the dimension of the data. The hidden layers consist of computational nodes; since there is no general rule for choosing an appropriate number of hidden neurons, it has to be selected by a trial-and-error procedure. The output layer calculates the outputs of the network. The activation functions commonly used in feed-forward neural networks are the logistic, hyperbolic tangent and linear functions. Selecting appropriate activation functions for the hidden layers and the output layer depends on the application. In function approximation problems, a linear function should be considered as the activation function for the output layer, while all types of activation functions can be chosen for the hidden layers.
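The layered structure just described can be sketched as a forward pass (a hypothetical minimal NumPy implementation, not the Matlab toolbox used in this thesis): logistic-sigmoid hidden layers followed by a linear output layer, as required for function approximation.

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of an MLP with logistic-sigmoid hidden layers
    and a linear output layer (regression setting)."""
    a = x
    for W, c in zip(weights[:-1], biases[:-1]):
        a = 1.0 / (1.0 + np.exp(-(a @ W + c)))   # logistic hidden layer
    return a @ weights[-1] + biases[-1]          # linear output layer

# Toy three-layer network: 400-dim input (an i-vector), 200 hidden neurons, 1 output
rng = np.random.default_rng(3)
weights = [rng.standard_normal((400, 200)) * 0.01,
           rng.standard_normal((200, 1)) * 0.01]
biases = [np.zeros(200), np.zeros(1)]
y = mlp_forward(rng.standard_normal((5, 400)), weights, biases)
print(y.shape)   # one age estimate per input utterance: (5, 1)
```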
The concept and relations of MLPs are explained in more detail in Appendix A.

In this study, numerous network architectures consisting of different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms are trained. The trained networks are then tested on the validation data, and based on the obtained results, the best network architectures are selected to be evaluated on the test data. The networks are implemented, trained and tested using the Matlab Neural Network Toolbox, version 6.0.2.

Least Squares Support Vector Regression

Support vector regression (SVR) is a function approximation approach developed as the regression version of the widely known support vector machine (SVM) classifier [96]. Using nonlinear transformations, SVMs map the input data into a higher-dimensional space in which a linear solution can be calculated. They also keep only the subset of samples that is most relevant for the solution and discard the rest, which makes the solution as sparse as possible. While SVMs perform the classification task by determining the maximum-margin hyperplane separating two classes, SVR
carries out the regression task by finding the optimal regression hyperplane such that most training samples lie within an ε-margin around it [96, 100]. In this study, we use the least squares version of support vector regression (LSSVR). While SVR solves a quadratic program with linear inequality constraints, which results in high algorithmic complexity and memory requirements, LSSVR involves solving a set of linear equations by replacing the inequality constraints of classical SVR with equality constraints [100], which speeds up the calculations. This simplicity comes at the expense of sparseness: all samples contribute to the model, and consequently the model often becomes unnecessarily large.

The concept and relations of the LSSVR are explained in more detail in Appendix B.

In this chapter, linear and radial basis function (RBF) kernels are used to approximate g(x). For the LSSVR with RBF kernels, K-fold cross-validation is used to tune the smoothing parameter of the kernels. The LSSVR models are trained and tested using the LS-SVMlab1.8 Toolbox [31, 100] in the Matlab environment.
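The linear system underlying LSSVR can be sketched directly (a minimal NumPy illustration of the standard LSSVR dual formulation, not the LS-SVMlab code used in this thesis): training reduces to solving one set of linear equations, and every training sample receives a nonzero coefficient, illustrating the loss of sparseness.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvr_fit(X, y, gamma=1e4, sigma=1.0):
    """Solve the LSSVR system [[0, 1'], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]            # alpha (dense: no sparseness), bias b

def lssvr_predict(Xtr, alpha, b, Xtst, sigma=1.0):
    return rbf_kernel(Xtst, Xtr, sigma) @ alpha + b

X = np.linspace(-1, 1, 10).reshape(-1, 1)
y = X[:, 0] ** 2                      # toy smooth target function
alpha, b = lssvr_fit(X, y)
pred = lssvr_predict(X, alpha, b, X)
print(np.max(np.abs(pred - y)) < 0.1)  # near-interpolation of the training set
```

The large regularization constant gamma makes the fit nearly interpolating here; in practice gamma and the kernel width sigma are the hyper-parameters tuned by cross-validation.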
2.2.4
Training and Testing
The proposed age estimation approach is depicted in Figure 2.1. As illustrated in the figure, each utterance of the training, development and test sets is mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. During the training phase, the obtained i-vectors and NFA vectors of the training set are used as features, together with their corresponding age labels, to train model-1 and model-2, respectively. During the development phase, the trained models estimate the ages of the utterances in the development set: the i-vectors and NFA vectors of the development set are applied to the trained model-1 and model-2, respectively, and the outputs of the two models are concatenated to form a two-dimensional vector of estimated ages. This vector, along with the corresponding age labels of the development set, is used to train model-3, which fuses the results. Finally, during the testing phase, the trained models estimate the ages of utterances of unseen speakers (the utterances of the test set). The i-vectors and NFA vectors of the test set are applied to the trained model-1 and model-2, respectively, the outputs are concatenated into a two-dimensional vector of estimated ages, and this vector is applied to the trained model-3 to estimate the age of the test utterances.
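The fusion stage can be sketched as follows (a hypothetical minimal example: the per-model age estimates are synthetic, and ordinary linear least squares stands in for the three-layer fusion network, model-3):

```python
import numpy as np

rng = np.random.default_rng(4)
true_age = rng.uniform(20, 70, size=200)          # synthetic development-set labels
est_ivec = true_age + rng.normal(0, 8, 200)       # stand-in model-1 (i-vector) outputs
est_nfa = true_age + rng.normal(0, 10, 200)       # stand-in model-2 (NFA) outputs

# Concatenate the two estimates into a two-dimensional vector per utterance,
# then fit the fuser (here: linear least squares instead of the MLP model-3).
A = np.column_stack([est_ivec, est_nfa, np.ones(200)])
w, *_ = np.linalg.lstsq(A, true_age, rcond=None)

fused = A @ w
mae_fused = np.mean(np.abs(fused - true_age))
mae_ivec = np.mean(np.abs(est_ivec - true_age))
print(mae_fused < mae_ivec)                       # fusing the two estimators helps here
```

Because the two estimators make partly independent errors, even this linear fuser reduces the error of the better single model, mirroring the motivation for score-level fusion in this chapter.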
2.3 2.3.1
Experimental Setup Corpus
The National Institute of Standards and Technology (NIST) has held annual or biannual speaker recognition evaluations (SRE) for the past two decades. With each SRE, a large corpus of telephone (and more recently microphone) conversations
Figure 2.1: Block diagram of the proposed speaker age estimation approach. (U.M. stands for utterance modeling)
are released along with an evaluation protocol. These conversations typically last 5 minutes and originate from a large number of participants for whom additional metadata are recorded, including age, height, weight, language and smoking habits. The NIST databases were chosen for this work due to the large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2). For the purpose of automatic speaker age estimation, telephone recordings from the common protocols of the recent NIST 2008 and 2010 SRE databases are used. The speakers of the NIST 2008 and 2010 SRE databases are pooled to create a dataset of 1445 speakers, which is then divided into two disjoint parts such that 80% and 20% of all speakers are assigned to the training and testing sets, respectively.
Figure 2.2: The age histograms of telephone speech utterances of the training and testing datasets for male and female speakers.
The age histograms of the training and testing datasets for male and female speakers are depicted in Figure 2.2. The training set is further divided into two disjoint parts such that 20% of the training data is set aside as the development set. Since there are several utterances from each speaker in the dataset, the development set was selected such that no speaker had utterances in both the training and development sets. Thus, of all 6080 utterances, 3194 are assigned to the training set, 1692 to the development set, and 1194 to the testing set.
2.3.2
Performance Metric
In order to evaluate the effectiveness of the proposed system, the mean-absolute-error (MAE) of the estimated age and the Pearson correlation coefficient (CC) between the actual and estimated speakers' ages are used. MAE is defined as:

MAE = (1/N) Σ_{i=1}^{N} |f_i − y_i|
(2.17)
where f_i is the i-th estimated age, y_i is the i-th actual age, and N is the total number of test samples. Although MAE is a helpful performance metric in regression problems, it is limited in some respects, especially in the case of a test set with a skewed distribution. Therefore, we also use the correlation coefficient, which is computed as:

CC = (1/(N−1)) Σ_{i=1}^{N} ((f_i − f̄)/s_f) ((y_i − ȳ)/s_y)
(2.18)
where f̄ and s_f denote the sample mean and standard deviation of the estimates, and ȳ and s_y those of the actual ages.
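The two metrics follow directly from the definitions in equations 2.17 and 2.18 (a small self-contained sketch with made-up ages):

```python
import numpy as np

def mae(f, y):
    """Mean absolute error of equation 2.17."""
    f, y = np.asarray(f, float), np.asarray(y, float)
    return np.mean(np.abs(f - y))

def pearson_cc(f, y):
    """Pearson correlation coefficient of equation 2.18 (N-1 normalization)."""
    f, y = np.asarray(f, float), np.asarray(y, float)
    fz = (f - f.mean()) / f.std(ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    return np.sum(fz * yz) / (len(f) - 1)

estimated = [25.0, 34.0, 41.0, 58.0]   # hypothetical age estimates
actual = [22.0, 35.0, 45.0, 60.0]      # hypothetical true ages
print(mae(estimated, actual))          # 2.5
print(pearson_cc(estimated, actual))   # close to 1 for these well-correlated values
```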
2.4
Results and discussion
In this section, the proposed speaker age estimation approach is evaluated. The acoustic feature vector is a 60-dimensional vector consisting of 20 Mel-frequency cepstral coefficients (MFCCs), including energy, appended with their first- and second-order derivatives. MFCCs are obtained by taking the cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in i-vector-based speaker recognition systems. Wiener filtering, feature warping [80] and voice activity detection [68] have also been included in the front-end processing to obtain more reliable features.

In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least squares support vector regression (LSSVR), were used for model-1 and model-2. In each experiment, the same type of function approximation was used for both models: when an MLP (as model-1) was trained on the i-vectors, an MLP with a similar architecture was trained on the NFA vectors as model-2; likewise, when an LSSVR was employed as model-1, a similar LSSVR was used as model-2.

The artificial neural networks used in model-1 and model-2 were trained using different numbers of hidden layers and hidden neurons, various learning algorithms and a variety of activation functions. Based on the results obtained on the development set, the best network architectures were selected for further experiments. Since the experiments for male and female speakers were performed separately, two network architectures, namely a three-layer NN and a four-layer NN, were employed for each gender. For male speakers, the three-layer NN consisted of 200 hidden neurons, and the four-layer NN was composed of 400 neurons in the first hidden layer and 200 neurons in the second hidden layer.
For female speakers, the three-layer NN had 150 hidden neurons, and the four-layer NN was composed of 100 neurons in the first hidden
Table 2.1: The results of speaker age estimation using different utterance modeling methods (the i-vector and the NFA frameworks) and different function approximation techniques (LSSVR and MLPs). CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age in years.

                                 Male                          Female
                         i-vector        NFA           i-vector        NFA
Function Approximation   CC     MAE    CC     MAE      CC     MAE    CC     MAE
LSSVR (RBF)              0.68   8.73   0.41   10.71    0.77   7.68   0.62   9.55
LSSVR (Linear)           0.71   8.52   0.44   10.79    0.79   7.57   0.63   9.90
Three-Layer NN           0.71   8.52   0.38   11.05    0.79   6.10   0.68   8.16
Four-Layer NN            0.73   8.21   0.38   10.70    0.82   7.27   0.54   10.14

∗ The bold numbers in the table indicate the best results.
layer and 400 neurons in the second hidden layer. The preferred activation function for the hidden layers was the logistic sigmoid, and in order to perform regression, a linear activation function was used for the output layers. Among the various training algorithms described in Section A.2.1 of Appendix A, the "scaled conjugate gradient" and "one step secant back-propagation" algorithms were applied for the networks of the male and female speakers, respectively. The networks were trained to minimize the mean-absolute-error between the desired and estimated outputs. To attenuate the effect of random initialization, each experiment was repeated 10 times, and the most frequently observed result is reported.

The other function approximation approach used for model-1 and model-2 was LSSVR. Two different kernels, namely linear and radial basis function (RBF) kernels, were used. The hyper-parameters of the RBF kernel were tuned using 10-fold cross-validation. After optimization of the hyper-parameters, the models were trained separately on the i-vectors and the NFA vectors.

The mean-absolute-error (MAE) and the Pearson correlation coefficient (CC) between the estimated and actual ages are presented in Table 2.1, which lists the results of using LSSVR and ANNs as function approximation methods for model-1 and model-2 before the score-level fusion. As the table shows, for this problem the LSSVR with a linear kernel outperforms the LSSVR with an RBF kernel. It also shows that using a four-layer NN for both model-1 and model-2 yields more accurate results than using an LSSVR or a three-layer NN. It can further be inferred from the table that the i-vector framework, which is based on the Gaussian means, estimates age more accurately than the NFA framework, which is based on the Gaussian weights.

Experimental studies show that although GMM weights convey less information than GMM means, the two are complementary. For instance, Li et al. improved speaker age group recognition performance by performing score-level fusion of classifiers based on GMM weights and GMM means [62]. Zang et al. applied
Table 2.2: The results of the proposed speaker age estimation after score-level fusion, along with the relative improvements (R.I.) in CC and MAE compared with the results of the i-vector-based models. CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age in years.

                                  Male                           Female
Function Approximation   CC     R.I.    MAE    R.I.      CC     R.I.    MAE    R.I.
LSSVR (RBF)              0.71   4.2%    7.86   9.9%      0.82   6.1%    6.30   17.9%
LSSVR (Linear)           0.76   6.6%    6.97   18.2%     0.85   7.1%    5.92   21.8%
Three-Layer NN           0.74   4.1%    7.21   15.4%     0.82   3.6%    6.30   -3.2%
Four-Layer NN            0.75   2.7%    7.19   12.4%     0.85   3.5%    6.12   15.8%

∗ The bold numbers in the table indicate the best results.
GMM weight adaptation in conjunction with GMM mean adaptation in a large-vocabulary speech recognition system to improve the word error rate [115]. In [11], feature-level fusion of i-vectors, GMM mean supervectors and GMM weight supervectors was applied to improve the accuracy of accent recognition. Therefore, in this study, score-level fusion of the i-vector-based and the NFA-based estimators was applied to enhance the accuracy of age estimation.

The fusion procedure, as described in Section 2.2.4, was performed by training a three-layer NN on the outputs of model-1 (the i-vector-based model) and model-2 (the NFA-based model) on the development dataset. The fusion network (model-3) consisted of 5 hidden neurons with a logistic activation function in the hidden layer and one linear neuron at the output layer. The network was trained with the "gradient descent with momentum and adaptive learning rate back-propagation" algorithm.

Table 2.2 presents the results of speaker age estimation after score-level fusion, along with the relative improvements in CC and MAE compared with the results of the i-vector-based models. The results show that the score-level fusion of the i-vector-based and the NFA-based estimators improves the accuracy of automatic speaker age estimation. This improvement was most pronounced when the outputs of the linear LSSVR models were fused. When the male and female data were pooled, the correlation coefficient was 0.82. The speaker age estimation results reported in [9] are considered as the baseline, since they were obtained on the same databases as this study. The minimum MAE of the baseline system for male and female speakers was 7.63 and 7.61 years, respectively. The proposed method improved the MAE of the baseline for males and females by 8.6% and 22.2%, respectively.
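The relative improvements quoted above follow from the usual definition (a one-line check; the 8.6% figure matches up to rounding of the intermediate MAEs):

```python
def relative_improvement(baseline, value):
    # positive when the new system lowers the error relative to the baseline
    return 100.0 * (baseline - value) / baseline

print(relative_improvement(7.63, 6.97))   # male MAE: about 8.6-8.7 %
print(relative_improvement(7.61, 5.92))   # female MAE: about 22.2 %
```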
2.5
Conclusion
In this chapter, a novel approach for speaker age estimation based on a hybrid architecture of the i-vector and the NFA frameworks was proposed. This architecture consisted of two subsystems based on the i-vectors and the NFA vectors. To perform the age estimation, two different function approximation approaches, namely LSSVR and ANNs, were used and compared in this study. The score-level fusion of the i-vector-based and the NFA-based estimators was also considered to improve the performance. The effectiveness of the proposed method was investigated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. The obtained results demonstrated that employing the information in the GMM weights in conjunction with the information in the GMM means, in the form of score-level fusion of the i-vector-based and the NFA-based estimators, not only decreased the mean-absolute-error between actual and estimated age, but also improved the Pearson correlation coefficient between estimated and actual age, compared with the i-vector framework alone. The relative improvements in CC after score-level fusion for males and females were 6.6% and 7.1%, respectively. The proposed method improved the MAE of the baseline (obtained on the same databases) for males and females by 8.6% and 22.2%, respectively, which reflects the effectiveness of the proposed method for automatic speaker age estimation.
Chapter 3
Automatic Speaker's Height Estimation from Spontaneous Telephone Speech

3.1
Introduction
Speaker body size (height/weight) estimation is an interesting, important and challenging task in forensic, medical and commercial applications. In forensic scenarios, estimating a suspect's body size can direct investigations to find cues in judicial cases. In service customization, body size estimation may help users receive services suited to their physical condition. This wide range of applications has motivated researchers to look for acoustic cues that are beneficial for speaker body size estimation. In this chapter, we focus on speaker height estimation; the next chapter is devoted to speaker weight estimation.

Experimental studies have found different acoustic cues for speaker height estimation [104, 48]. However, the relation of these acoustic cues to speaker height is usually complex and affected by many other factors such as speech content, language, gender, weight, emotional condition, and smoking and drinking habits. Furthermore, in many practical cases we have no control over the available speech duration, content, language, environment, recording device and channel conditions. Therefore, height estimation from speech signals is a very challenging task.

Previous studies have investigated the correlation between a person's speech signal and his/her height. In experiments conducted by Van Dommelen and Moxness, the ability of listeners to estimate the height of speakers from their voices was examined, and significant correlations between estimated and actual heights of male speakers were reported [104]. In studies on speech-driven automatic height estimation, considerable effort has been devoted to identifying acoustic features of speech that convey information about speaker height.
For example, [104] and [48] analyzed the correlation between speaker height and formant frequencies, based on the assumption from speech production theory that a person's vocal tract length (VTL) is correlated with his/her height. Recently, Arsikere et al. proposed a new
algorithm based on the assumption of a uniform tube model of the subglottal system to estimate speaker height from the subglottal resonances (SGRs) [4, 5]. In other studies, Pellom and Hansen performed height group recognition by applying Mel-frequency cepstral coefficients (MFCCs) to train height-dependent Gaussian mixture models; a maximum a posteriori classification rule was then used to assign each audio file to one of several height groups [81]. However, this text-independent approach does not estimate the actual height of a speaker, which can be achieved using regression techniques. Ganchev et al. applied a large set of openSMILE audio descriptors and performed support vector regression to estimate the height of a test speaker [44].

In this chapter, a new speech-based method for automatic height estimation based on i-vectors, instead of the raw acoustic features utilized in previous studies, is proposed [83]. One effective approach to speaker recognition involves modeling speech utterances with GMM mean supervectors [25]. Although GMM mean supervectors are effective, it is difficult to obtain a robust model when limited data are available, due to the high-dimensional nature of these vectors. In the field of speaker recognition, recent advances using the i-vector framework [33] have increased classification accuracy considerably. The i-vector is a compact representation of an utterance in the form of a low-dimensional feature vector. To select an accurate regression approach for this problem, two different function approximation approaches, namely LSSVR and ANNs, are compared. The effect of the kernel in LSSVR is also investigated. Evaluation on the NIST 2008 and 2010 SRE corpora shows the effectiveness of the proposed approach. The rest of the chapter is organized as follows.
In Section 3.2 the problem of automatic height estimation is formulated and the proposed approach is described. Section 3.3 explains the experimental setup. The evaluation results are presented and discussed in Section 3.4. The chapter ends with conclusions in Section 3.5.
3.2
System Description
3.2.1
Problem Formulation
In the speaker height estimation problem, we are given a set of training data D = {x_i, y_i}_{i=1}^{N}, where x_i ∈ R^p denotes the i-th utterance and y_i ∈ R denotes the corresponding height. The goal is to design an estimation function g such that, for an utterance of an unseen speaker x_tst, the estimated height ŷ = g(x_tst) approximates the actual height as well as possible in some predefined sense.
3.2.2
Utterance Modeling
The first step in speaker height estimation is converting variable-duration speech signals into fixed-dimensional vectors suitable for regression algorithms. This is performed by fitting a GMM to the acoustic features extracted from each speech signal; the parameters of the obtained GMMs characterize the corresponding utterances.
Due to limited data, we are not able to accurately fit a separate GMM to a short utterance, especially in the case of GMMs with a high number of Gaussian components. Thus, parametric utterance adaptation techniques are applied to adapt a UBM to the characteristics of the utterances in the training and testing databases. In this chapter, the i-vector framework is applied to adapt the UBM means. The UBM and the method of UBM mean adaptation using the i-vector framework are explained in Section 2.2.2 of Chapter 2.
3.2.3
Function Approximation
In this study, two different function approximation approaches, namely artificial neural networks (ANNs) and least-squares support vector regression (LSSVR), are employed; both are described in Section 2.2.3 of Chapter 2.
3.2.4
Training and Testing
The proposed height estimation approach is depicted in Figure 3.1. During the training phase, each utterance is mapped onto a high-dimensional vector using the i-vector framework described in Section 2.2.2. The obtained vectors of the training set are then used as features, with their corresponding height labels, to train an estimator approximating the function g. During the testing phase, the utterance modeling approach is applied to extract a high-dimensional vector from an unseen test utterance, and the estimated height is obtained using the trained regression function.
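This train/test pipeline can be sketched end-to-end with synthetic stand-ins (hypothetical 400-dimensional "i-vectors" and closed-form ridge regression in place of the LSSVR/ANN estimators of Section 3.2.3):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 500, 400                                    # utterances x i-vector dimension
X_train = rng.standard_normal((N, p))              # synthetic i-vectors
w_true = rng.standard_normal(p) * 0.05
height = 172.0 + X_train @ w_true + rng.normal(0, 2, N)  # synthetic heights (cm)

# Training phase: fit a ridge regressor g: i-vector -> height (closed form)
lam = 1.0
A = np.column_stack([X_train, np.ones(N)])
w = np.linalg.solve(A.T @ A + lam * np.eye(p + 1), A.T @ height)

# Testing phase: map an unseen utterance to an i-vector, then apply g
x_test = rng.standard_normal((1, p))
y_hat = np.column_stack([x_test, np.ones(1)]) @ w
print(y_hat.shape)                                 # a single height estimate
```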
3.3 3.3.1
Experimental Setup Database
For this work, the NIST SRE databases were chosen due to the large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2). For height estimation, telephone recordings from the common protocols of the recent NIST 2008 and 2010 SRE databases are used for training and testing, respectively. The core protocol, short2-short3, of the 2008 database contains 3999 telephone recordings of 1236 speakers whose height is known. Similarly, the extended core-core protocol of the 2010 database contains 5792 telephone segments from 445 speakers. The height histograms of the male and female speakers of the NIST 2008 and 2010 SRE databases are depicted in Figure 3.2. The training set is further divided into two disjoint parts such that 25% of the training data is considered as the development set. Since there are several utterances from each speaker in the
3. Automatic Speaker’s Height Estimation from Spontaneous Telephone Speech
Figure 3.1: Block diagram of the proposed speaker height estimation approach in training and testing phases.
data set, the development set was selected such that there was no speaker who had utterances in both training and development sets.
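The speaker-disjoint split described above can be sketched as follows (a minimal illustration; the function and variable names are hypothetical):

```python
import random

def speaker_disjoint_split(utterances, dev_fraction=0.25, seed=0):
    """Split a list of (utterance_id, speaker_id) pairs so that no speaker
    contributes utterances to both the training and the development set
    (the split is made over speakers, not over utterances)."""
    speakers = sorted({spk for _, spk in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_dev = max(1, round(dev_fraction * len(speakers)))
    dev_speakers = set(speakers[:n_dev])
    train = [u for u in utterances if u[1] not in dev_speakers]
    dev = [u for u in utterances if u[1] in dev_speakers]
    return train, dev

# Toy data: 8 speakers with 3 utterances each.
utts = [(f"utt{i}", f"spk{i % 8}") for i in range(24)]
train, dev = speaker_disjoint_split(utts)
```

Splitting over speaker identities rather than individual utterances is what guarantees the disjointness property stated above.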
3.3.2 Performance Metric
In order to evaluate the effectiveness of the proposed system, the mean absolute error (MAE) of the speakers' estimated height and the Pearson correlation coefficient (CC) between the actual and estimated heights are used; both are described in Chapter 2, Section 2.3.2. In some of the literature on estimating speakers' body size, researchers evaluate system performance by the mean absolute error between actual and estimated values alone. Although MAE is a helpful performance metric in regression problems, it is limited in some respects, especially for a test set with a skewed distribution, which is the case in this problem. For instance, consider the most basic estimator, whose output is simply the average height of the training data. When a test set with a skewed distribution is applied to this basic estimator, the MAE may fall in an acceptable range depending on the variance of the data, yet the CC would be zero. For this reason, the Pearson correlation coefficient is the preferred performance metric for this problem.
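The contrast between the two metrics can be reproduced in a few lines. This is a self-contained illustration with made-up heights; for a constant predictor the Pearson CC degenerates (its denominator vanishes), which the guard below maps to zero, matching the discussion above.

```python
import math

def mae(actual, estimated):
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / len(actual)

def pearson_cc(actual, estimated):
    n = len(actual)
    ma, me = sum(actual) / n, sum(estimated) / n
    cov = sum((a - ma) * (e - me) for a, e in zip(actual, estimated))
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual))
    se = math.sqrt(sum((e - me) ** 2 for e in estimated))
    if sa == 0 or se == 0:      # constant predictor: CC degenerates
        return 0.0
    return cov / (sa * se)

actual = [165.0, 172.0, 180.0, 188.0]                 # toy heights, cm
mean_pred = [sum(actual) / len(actual)] * len(actual)  # "basic" estimator
print(mae(actual, mean_pred))        # modest MAE despite no predictive power
print(pearson_cc(actual, mean_pred)) # zero correlation
```

The mean predictor attains a reasonable-looking MAE while carrying no information about individual speakers, which is exactly what the CC exposes.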
Figure 3.2: The height histogram of telephone speech utterances for the NIST 2008 and NIST 2010 databases.
3.4 Results and discussion
In this section, the proposed speaker height estimation approach is evaluated. The acoustic feature vector used in this study is a 60-dimensional vector consisting of 20 Mel-frequency cepstral coefficients (MFCCs), including energy, appended with their first- and second-order derivatives. MFCCs are obtained by a cosine transform of the real logarithm of the short-term energy spectrum expressed on a mel-frequency scale [80]. This type of feature is very common in i-vector-based speaker recognition systems. Voice activity detection [68], feature warping [80] and Wiener filtering have also been applied in the front-end processing to obtain more reliable features.
In this study, ANNs and LSSVR were utilized as function approximation techniques. ANNs were trained using different numbers of hidden layers and hidden neurons, various learning algorithms and different activation functions. Based on the results obtained on the development set, the best network architecture was then selected for evaluation on the test data. Two network architectures, namely a three-layer NN and a four-layer NN, were employed. The three-layer NN had 10 hidden neurons; the four-layer NN had 20 neurons in the first hidden layer and 5 neurons in the second. The activation function for the hidden layers was the logistic sigmoid, and, to perform regression, a linear activation function was used for the output layer. Among the various training algorithms described in Section A.2.1 of Appendix A, the "BFGS quasi-Newton backpropagation" algorithm was employed. To attenuate the effect of random initialization, the training and testing phases of each experiment were repeated 20 times. Networks were trained to minimize the mean absolute error between the desired and estimated outputs. The networks were implemented, trained and tested using the Matlab Neural Network toolbox, version 6.0.2.
To investigate the effect of the kernel in LSSVR, two different kernels, namely linear and radial basis function (RBF) kernels, were used. The hyper-parameters of the RBF kernel were tuned using 5-fold cross-validation. After optimization of the hyper-parameters, the model was trained. The LSSVR models were implemented using the LS-SVMlab1.8 toolbox [31, 100] in the Matlab environment.

Table 3.1: Results of speaker height estimation using ANNs and LSSVR. CC is the Pearson correlation coefficient between actual and estimated height. The bold numbers indicate the best results.

Function Approximation  | CC (Male) | CC (Female)
LSSVR (RBF kernel)      | 0.30      | 0.23
LSSVR (Linear kernel)   | 0.41      | 0.40
Three-Layer NN          | 0.35      | 0.36
Four-Layer NN           | 0.36      | 0.35

The results of automatic height estimation using LSSVR and ANNs as function approximation methods for male and female speakers are reported in Table 3.1. The results indicate that LSSVR with a linear kernel outperforms the ANNs in speaker height estimation, and that the linear kernel is more effective than the RBF kernel for this problem. In this case, the correlation coefficients for male speakers, female speakers, and the pooled male and female data are 0.41, 0.40 and 0.60, respectively.
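To illustrate the kernel choice, the sketch below implements kernel ridge regression, a close relative of LSSVR (both solve a regularized linear system in the kernel matrix). The kernel names mirror those in Table 3.1; `gamma` and `lam` are hypothetical hyper-parameters of the kind tuned here by cross-validation, and the data are synthetic.

```python
import numpy as np

def kernel_ridge(Xtr, ytr, Xte, kernel="linear", gamma=0.1, lam=1e-2):
    # Kernel ridge regression: solve (K + lam*I) alpha = y over the
    # training kernel matrix, then predict with k(x_test, x_train) @ alpha.
    def K(A, B):
        if kernel == "linear":
            return A @ B.T
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)  # RBF kernel
    alpha = np.linalg.solve(K(Xtr, Xtr) + lam * np.eye(len(Xtr)), ytr)
    return K(Xte, Xtr) @ alpha

# Toy 1-D data with a linear target; the linear kernel extrapolates it well.
Xtr = np.array([[0.0], [1.0], [2.0], [3.0]])
ytr = np.array([0.0, 1.0, 2.0, 3.0])
pred = kernel_ridge(Xtr, ytr, np.array([[4.0]]), kernel="linear")
```

On genuinely linear structure the linear kernel generalizes directly, while the RBF kernel interpolates locally; a mismatch of this kind is one plausible reading of the linear kernel's advantage in Table 3.1.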
The scatter plots of the estimates for male speakers, female speakers, and the pooled male and female data are shown in Figures 3.3(a), 3.3(b) and 3.3(c), respectively. The mean absolute error (MAE) of estimation is 6.2 cm and 5.8 cm for male and female speakers, respectively. As stated before, the MAE is limited in some respects, particularly for a test set with a skewed distribution, which is the case in this height estimation task. This limitation is highlighted by considering a basic estimator whose output is the average height of the training data. When the test set was applied to this estimator, the MAE for male and female speakers was 6.54 cm and 5.9 cm, respectively, yet the measured CC for both males and females was zero. For this reason, the correlation coefficient is the preferred performance metric for this problem, as it reflects the performance of the estimators in a more tangible way. Although the obtained MAE is satisfactory and the correlation coefficient is fairly strong when male and female data are pooled, the CC within the male and female groups requires improvement. Unfortunately, there are no published results on the same database for comparison. However, results published on other datasets indicate the typical range of performance in automatic speaker height estimation. In [5], the reported CCs of speaker height estimation on the TIMIT database, using a method based on sub-glottal resonances [4], are 0.12, 0.21 and 0.71 for male speakers, female speakers and the pooled male and female data, respectively. In [81], the CCs of speaker height estimation for the male and female speakers of the TIMIT database, using a GMM-based approach, are 0.39 and 0.31, respectively. The obtained results seem reasonable considering that the test set in this study consists of spontaneous telephone speech signals and that the number of test-set speakers (3999 telephone recordings of 1236 speakers) is considerably larger than in [5] and [81].
3.5 Conclusion
In this chapter, a novel approach for automatic speaker height estimation based on the i-vector framework was proposed. In this method, each utterance was modeled by its corresponding i-vector. Then, ANNs and LSSVR were employed to estimate the height of a speaker from a given utterance. The proposed method was trained and tested on the telephone speech signals of the NIST 2008 and 2010 SRE corpora, respectively. Evaluation results showed the effectiveness of the proposed method in speaker height estimation.
[Scatter plots of estimated height (Output) versus actual height (Target): (a) male speakers, R = 0.408; (b) female speakers, R = 0.394; (c) pooled data, R = 0.594.]
Figure 3.3: The scatter plot of height estimation for (a): male speakers, (b): female speakers, and (c): when the male and female data were pooled together.
Chapter 4
Automatic Speaker's Weight Estimation from Spontaneous Telephone Speech

4.1 Introduction
The acoustic features of speakers have been postulated to convey information such as age, gender, body size and emotional state. Estimation of speaker height was investigated in the previous chapter; in this chapter, we focus on speaker weight estimation. Automatic speaker weight estimation is one aspect of speaker profiling systems and has a wide range of forensic, medical and commercial applications in real-world scenarios. Since the size of various components of the sound production system, such as the vocal folds and the vocal tract, may be related to the overall weight or height of a speaker, researchers in speaker recognition have been motivated to investigate whether features of the acoustic signal provide cues to the body size of the speaker. For instance, the authors of [41, 76] found a relationship between formants and the length of the vocal tract, based on the source-filter theory. Since the vocal tract is part of the speaker's body, this feature can be used to estimate the body size of a speaker [61]. The mean fundamental frequency (f0) of the voice has also been reported as a feature that is (negatively) correlated with body size: females and children have a higher f0, while males (who are taller and heavier) have a lower one [30, 72]. However, estimating speaker weight from the voice is not a straightforward problem, and studies of the relationship between various acoustic features and body size make the complexity of the issue apparent. For instance, when the relation between fundamental frequency (f0) and body size was investigated within male and female speakers separately, no correlation was found between f0 and the body size of adult humans [60, 59, 29, 103]. The lowest fundamental frequency of the voice (F0min) is another feature determined by the mass and length of the vocal folds [30]. Investigating this feature, researchers have found no correlation between F0min and body size in adult human speakers [60, 59, 29, 103]. Fitch found formant dispersion (the average difference between adjacent pairs of formant frequencies) to be a reliable feature that correlates with both vocal tract length and body size in macaques [42]. However, only a weak relation between formant parameters and the body size of human adults is reported in a study conducted by Gonzalez [48]. The reason for this weak correlation is that, in humans, the vocal folds grow independently of the rest of the head and body at puberty; this is more evident in males than in females [47, 78]. Gonzalez studied the correlation between formant frequencies and body size in human adults [48]. He calculated the formant parameters by means of a long-term average analysis of running speech signals uttered by 91 speakers. In this experiment, the Pearson correlation coefficients between formants and weight for male and female speakers were reported to be 0.33 and 0.34, respectively [48]. In research conducted by Van Dommelen and Moxness [104], the ability of listeners to judge speakers' weight from speech samples was investigated. They reported a significant correlation between estimated weight (judged by listeners) and actual weight for male speakers only. In addition, they performed a regression analysis involving several acoustic features such as f0, formant frequencies, energy below 1 kHz, and speech rate. The results showed that the only parameter with a significant correlation with male speakers' weight was the speech rate, and they concluded that the speech rate of male speakers is a reliable predictor for weight estimation. Modeling speech utterances with GMM mean supervectors has been demonstrated to be an effective approach to speaker recognition and has attracted the attention of researchers [25].
However, GMM mean supervectors are high-dimensional, and obtaining a reliable model is difficult when limited data are available. Recently, utterance modeling using the i-vector framework has considerably increased accuracy in classification and regression problems in the field of speaker profiling [33, 83, 9]. The i-vector represents an utterance as a compact, low-dimensional feature vector. In this chapter, a novel approach to automatic speaker weight estimation based on i-vectors, instead of raw acoustic features, is proposed. In the proposed method, each utterance is modeled by its corresponding i-vector. To select an accurate regression approach for this problem, two different function approximation approaches, namely LSSVR and ANNs, are compared. The effect of the kernel in LSSVR on speaker weight estimation is also investigated. The proposed method is evaluated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. Experimental results confirm the effectiveness of the proposed approach. The rest of the chapter is organized as follows. The problem of automatic weight estimation is formulated and the proposed approach is described in Section 4.2. Section 4.3 explains the experimental setup. The evaluation results are presented and discussed in Section 4.4. Conclusions are drawn in Section 4.5.
4.2 System Description
In this section, the problem formulation and the main constituents of the proposed method are described.
4.2.1 Problem Formulation
In the speaker weight estimation problem, we are given a set of training data D = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^p denotes the i-th utterance and y_i ∈ R denotes the corresponding weight. The goal is to design an estimation function g such that, for an utterance of an unseen speaker x_tst, the estimated weight ŷ = g(x_tst) approximates the actual weight as well as possible in some predefined sense.
4.2.2 Utterance Modeling
By fitting a GMM to the acoustic features extracted from each speech signal, variable-duration speech signals are converted into fixed-dimensional vectors suitable for regression algorithms. The parameters of the obtained GMMs characterize the corresponding utterance. Due to limited data, we are not able to accurately fit a separate GMM to a short utterance, especially in the case of GMMs with a high number of Gaussian components. Thus, to adapt a UBM to the characteristics of the utterances in the training and testing databases, parametric utterance adaptation techniques are applied. In this chapter, the i-vector framework is applied to adapt the UBM means. The UBM and the methods of UBM mean adaptation using the i-vector framework are explained in Chapter 2, Section 2.2.2.
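For intuition about UBM mean adaptation, the snippet below sketches classical relevance-MAP adaptation from one utterance's frames. This is not the i-vector extractor of Section 2.2.2, but it builds on the same zeroth- and first-order sufficient statistics; the relevance factor `r` and all toy shapes are assumptions.

```python
import numpy as np

def map_adapt_means(X, ubm_means, ubm_weights, ubm_vars, r=16.0):
    # Relevance-MAP adaptation of diagonal-covariance UBM means.
    # X: (T, d) frames; ubm_means: (C, d); ubm_weights: (C,); ubm_vars: (C, d).
    diff = X[:, None, :] - ubm_means[None, :, :]                    # (T, C, d)
    logp = -0.5 * (np.log(2 * np.pi * ubm_vars)[None]
                   + diff ** 2 / ubm_vars[None]).sum(-1)            # (T, C)
    logp += np.log(ubm_weights)[None]
    post = np.exp(logp - logp.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)        # frame-level responsibilities
    n_c = post.sum(0)                         # zeroth-order statistics
    f_c = post.T @ X                          # first-order statistics (C, d)
    alpha = (n_c / (n_c + r))[:, None]        # data-dependent adaptation weight
    ex = f_c / np.maximum(n_c, 1e-10)[:, None]
    return alpha * ex + (1 - alpha) * ubm_means

# Toy demo: a one-component "UBM" at the origin adapted towards frames near 5.
rng = np.random.default_rng(1)
frames = rng.normal(loc=5.0, scale=1.0, size=(200, 2))
adapted = map_adapt_means(frames, np.zeros((1, 2)), np.array([1.0]), np.ones((1, 2)))
```

The adapted mean is pulled from the UBM prior towards the utterance's data in proportion to how many frames the component explains, which is the same shrinkage idea the i-vector model captures in a low-dimensional subspace.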
4.2.3 Function Approximation
In this study, two different function approximation approaches are employed, namely artificial neural networks (ANNs) and least-squares support vector regression (LSSVR); both are described in Chapter 2, Section 2.2.3.
4.2.4 Training and Testing
The proposed weight estimation approach is depicted in Fig. 4.1. During the training phase, each utterance is mapped onto a high-dimensional vector using the i-vector framework described in Section 2.2.2. The obtained vectors of the training set are then used as features, together with their corresponding weight labels, to train an estimator approximating the function g. During the testing phase, i-vector extraction is applied to obtain a vector from an unseen test utterance, and the estimated weight is obtained using the trained regression function.
[Block diagram: training utterances (x_1, y_1), ..., (x_N, y_N) are each passed through utterance modeling and used to train the regression; in the testing phase, an unseen utterance x_tst is modeled and mapped to the estimated weight ŷ.]
Figure 4.1: Block diagram of the proposed speaker weight estimation approach in training and testing phases.
4.3 Experimental Setup

4.3.1 Database
Since the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) databases contain a large number of speakers, and because the total variability subspace requires a large amount of development data for training, the NIST SRE databases were selected for this study. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was selected from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2). For automatic speaker weight estimation, telephone recordings from the common protocols of the NIST 2008 and 2010 SRE databases are used. The speakers of the NIST 2008 and 2010 SRE databases are pooled to create a dataset of 1445 speakers, which is then divided into two disjoint parts: 80% of the speakers for training and 20% for testing. The weight histograms of the training and testing datasets for male and female target speakers are depicted in Fig. 4.2. The training set is further divided into two disjoint sets, with 20% of the training data held out as a development set. Since there are several utterances from each speaker in the data set, the development set was selected such that no speaker had utterances in both the training and development sets. Thus, of all 6080 utterances, 3194 are used for the training set, 1692 for the development set, and 1194 for the testing set.
[Histograms of the number of utterances versus weight for the training and testing sets, for male and female speakers.]
Figure 4.2: The weight histograms of the telephone speech utterances in the training and testing datasets for male and female speakers.
4.3.2 Performance Metric
In order to evaluate the effectiveness of the proposed system, we use the Pearson correlation coefficient (CC) between the actual and estimated weights and the mean absolute error (MAE), which are described in Chapter 2, Section 2.3.2. MAE is a helpful performance metric in regression problems. However, it is limited in some respects, especially for a test set with a skewed distribution, which is the case in this problem. For instance, consider the most basic estimator, whose output is the average weight of the training data. When a test set with a skewed distribution is applied to this basic estimator, the MAE may fall in an acceptable range depending on the variance of the data, yet the CC would be zero. For this reason, the Pearson correlation coefficient is the preferred performance metric for this problem.
4.4 Results and discussion
In this section, the proposed speaker weight estimation approach is evaluated. The acoustic feature vector consists of 20 Mel-frequency cepstral coefficients (MFCCs), including energy, appended with their first- and second-order derivatives, forming a 60-dimensional acoustic feature vector. This type of feature is very common in i-vector-based speaker recognition systems. To obtain more reliable features, Wiener filtering, speech activity detection [68] and feature warping [80] were applied in the front-end processing. In this study, ANNs and LSSVR were utilized as function approximation techniques. ANNs were trained using different numbers of hidden layers and hidden neurons, as well as various learning algorithms and activation functions. Based on the results obtained on the development set, the best network architecture was then selected for evaluation on the test data. Since the experiments for male and female speakers were performed separately, two network architectures, namely three-layer and four-layer neural networks, were employed for each gender. For male speakers, the three-layer NN had 450 hidden neurons, and the four-layer NN had 250 neurons in the first hidden layer and 400 neurons in the second. For female speakers, the three-layer NN had 400 hidden neurons, and the four-layer NN had 250 neurons in the first hidden layer and 400 neurons in the second. The activation function of the hidden layers was the logistic sigmoid, and, to perform regression, a linear activation function was used for the output layers. Among the various training algorithms described in Section A.2.1 of Appendix A, the "scaled conjugate gradient" and "gradient descent with adaptive learning rate backpropagation" algorithms were employed for male and female weight estimation, respectively.
Networks were trained to minimize the mean absolute error between the desired and estimated outputs. To attenuate the effect of random initialization, each experiment was repeated 10 times and the most frequently observed result was reported. The networks were implemented, trained and tested using the Matlab Neural Network toolbox, version 6.0.2. In this study, in order to approximate the function g with LSSVR, two different kernels, namely linear and radial basis function (RBF) kernels, were used. The hyper-parameters of the RBF kernel were tuned using 10-fold cross-validation. After optimization of the hyper-parameters, the model was trained. The LSSVR models were implemented using the LS-SVMlab1.8 toolbox [31, 100] in the Matlab environment.

Table 4.1: Results of speaker weight estimation using MLPs and LSSVR. CC is the Pearson correlation coefficient between actual and estimated weight; MAE is the mean absolute error between actual and estimated weight, in kg. The bold numbers indicate the best results.

Function Approximation  | Male CC | Male MAE | Female CC | Female MAE
LSSVR (RBF kernel)      | 0.39    | 12.10    | 0.25      | 10.06
LSSVR (Linear kernel)   | 0.42    | 12.11    | 0.30      | 9.77
Three-Layer NN          | 0.52    | 11.44    | 0.36      | 9.06
Four-Layer NN           | 0.50    | 11.45    | 0.39      | 9.00

The results of using LSSVR and ANNs as function approximation methods for male and female speakers are reported in Table 4.1. The results indicate that, for the speaker weight estimation problem, LSSVR with a linear kernel outperforms LSSVR with an RBF kernel. The table also shows that an MLP is more effective in automatic speaker weight estimation than LSSVR. The proposed method is more effective in weight estimation for male speakers than for female speakers. The mean absolute error of estimation is 11.4 kg, 9.0 kg and 9.9 kg for male
and female speakers and when the male and female data were pooled together, respectively. As mentioned earlier, the MAE is limited in some respects, especially for a test set with a skewed distribution, which is the case in this task. This limitation is highlighted by considering a basic estimator whose output is the average weight of the speakers in the training data. When the test set was applied to this estimator, the MAE for male speakers, female speakers and the pooled male and female data was 13.0 kg, 9.7 kg and 12.4 kg, respectively, yet the measured CC in all three cases was zero. For this reason, the correlation coefficient is the preferred performance metric for this task, as it reflects the performance of the estimators in a more meaningful way. The reported CCs for automatic speaker weight estimation based on the formant parameters of running speech uttered by 91 speakers are 0.33 and 0.34 for male and female speakers, respectively [48]. In the proposed automatic speaker weight estimation, which is based on the i-vector framework, the correlation coefficients between actual and estimated weights for male speakers, female speakers and the pooled male and female data are 0.52, 0.39 and 0.59, respectively. The obtained results seem reasonable considering that the test set in this study consists of spontaneous speech signals and that the number of test-set speakers is considerably larger than in [48]. It can be concluded that automatic speaker weight estimation using i-vectors is more effective than estimation based on raw acoustic features.
4.5 Conclusion
In this chapter, a novel approach for automatic speaker weight estimation based on the i-vector framework was proposed. In this method, each utterance was modeled by its corresponding i-vector. Then, ANNs with different architectures and LSSVR with different kernels were employed to estimate the weight of a speaker from a given utterance. The proposed method was trained and tested on the telephone speech signals of the NIST 2008 and 2010 SRE corpora. Evaluation results showed the effectiveness of the proposed method in speaker weight estimation.
Chapter 5
Automatic Smoker Detection from Spontaneous Telephone Speech

5.1 Introduction
Cigarette smoking is a habit that can be inferred from a speaker's voice. Smoker detection, as a component of speaker profiling systems and behavioral informatics, is scrutinized in this chapter. The effect of smoking habit on different speech analysis systems, such as speaker gender detection, age estimation, intoxication-level recognition and emotional state identification, shows the importance of an automatic smoking habit detection system and motivates the analysis of the effects of smoking on speech signals. Experimental studies show that many acoustic features of the speech signal, such as fundamental frequency, jitter and shimmer, are influenced by cigarette smoking [98, 109, 21, 45, 49]. For example, Gonzalez and Carpi studied the early effects of smoking on voice parameters. They reported differences between the perturbation parameters (smoothed pitch perturbation quotient and jitter) of early smokers and non-smokers of both genders. They found an effect of smoking on the amplitude and frequency tremor intensity indices for male subjects, and on the fundamental frequency parameters (highest, lowest and average fundamental frequencies) of early-stage female smokers. They also reported a correlation between the number of cigarettes smoked per day and the values of the fundamental frequency in female smokers and the frequency tremor intensity index in male smokers [49]. The effect of long-term smoking on the fundamental frequency of male and female smokers and non-smokers has been studied in [98]. In that study, different effects of smoking habits on the fundamental frequency of male and female smokers were reported: unlike for females, the differences between the fundamental frequencies of male smokers and non-smokers were significant. Although experimental studies reveal the effect of smoking on different acoustic
characteristics of speech, the relation between these acoustic cues and a speaker's smoking habit is usually complex and affected by many other factors, such as speaker age, gender, emotional condition and drinking habits [88, 12, 13]. Furthermore, in many practical cases there is no control over the available speech duration, content or language. These issues make smoking habit detection very challenging for both humans and machines [19, 62, 88]. Technical factors such as the available speech duration, environment, recording device and channel conditions also influence the estimation accuracy. In other words, in a typical practical scenario, the quality of the available speech signal and the recording conditions are not controlled, and the duration of the speech signal may vary from a few seconds to several hours. In this study, an automatic smoker detection system for spontaneous telephone speech signals is proposed. To my knowledge, this is the first work on this topic, so the system performance cannot be compared with any baseline. However, state-of-the-art techniques developed within the speaker and language recognition fields are adopted and applied to reach a reasonable result. One effective approach to speaker recognition involves modeling speech recordings with GMM mean supervectors and using them as features in support vector machines (SVMs) [25]. Similar SVM techniques have been successfully applied to different speech processing tasks such as speaker age estimation [38, 19]. While effective, GMM mean supervectors are of high dimensionality, resulting in high computational cost and difficulty in obtaining a robust model in the context of limited data.
In the field of speaker and language recognition, recent advances using the i-vector framework [33, 35], which provides a compact representation of an utterance in the form of a low-dimensional feature vector, have increased classification accuracy considerably. i-vectors have also successfully replaced GMM mean supervectors in speaker age estimation [9]. Bahari et al. recently introduced a new framework for the adaptation and decomposition of GMM weights based on a factor analysis similar to that of the i-vector framework [7, 34]. In this method, named non-negative factor analysis (NFA), the factor analysis is constrained such that the adapted GMM weights are non-negative and sum to unity. This method, which yields a new low-dimensional utterance representation, has been successfully applied to speaker and language/dialect recognition [8]. In this chapter, a hybrid architecture involving the NFA and i-vector frameworks for smoking habit detection is proposed. This architecture consists of two subsystems, based on i-vectors and on NFA vectors, and score-level fusion of the i-vector-based and NFA-based recognizers is considered to improve classification accuracy. The effectiveness of the proposed method is evaluated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora. Experimental results confirm the effectiveness of the proposed approach. The rest of the chapter is organized as follows. In Section 5.2, the problem of automatic smoker detection is formulated and the proposed approach is described. Section 5.3 explains the experimental setup. The evaluation results are presented and discussed in Section 5.4. The chapter ends with conclusions in Section 5.5.
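Score-level fusion of the two subsystems can be as simple as a convex combination of their output scores. The sketch below is a minimal, hypothetical illustration; in practice the fusion weight would be tuned on development data, and the example scores are made up.

```python
def fuse_scores(scores_a, scores_b, w=0.5):
    # Convex score-level fusion of two recognizers' outputs; w is a
    # hypothetical weight that would be tuned on a development set.
    return [w * a + (1 - w) * b for a, b in zip(scores_a, scores_b)]

ivec_scores = [0.9, 0.2, 0.6]  # e.g. i-vector subsystem smoker scores (toy)
nfa_scores = [0.7, 0.4, 0.8]   # e.g. NFA subsystem smoker scores (toy)
fused = fuse_scores(ivec_scores, nfa_scores, w=0.6)
```

Because the two subsystems model complementary GMM parameters (means versus weights), even this simple linear fusion can improve on either recognizer alone.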
5.2 System Description
5.2.1 Problem Formulation
In the smoking habit estimation problem, we are given a set of training data D = {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^d denotes the i-th utterance and y_i denotes the corresponding smoking habit. The goal is to approximate a classifier function g such that, for an utterance of an unseen speaker x_tst, the probability that the estimated output falls in the correct class is maximal. That is, the estimated label ŷ = g(x_tst) is as close as possible to the true label, as evaluated by some performance metrics (see Section 5.3.2).
5.2.2 Utterance Modeling
The first step in smoking habit detection is converting variable-duration speech signals into fixed-dimensional vectors suitable for classification algorithms, which is performed by fitting a GMM to the acoustic features extracted from each speech signal. The parameters of the obtained GMMs characterize the corresponding utterance. When limited data are available, fitting a separate GMM to a short utterance cannot be done accurately, especially for GMMs with a high number of Gaussians. Hence, to adapt a UBM to the characteristics of the utterances in the training and testing databases, parametric utterance adaptation techniques are usually applied. In this chapter, the i-vector framework is applied to adapt the UBM means and the NFA framework to adapt the UBM weights. The UBM and the methods of UBM adaptation using the i-vector and NFA frameworks are explained in Chapter 2, Section 2.2.2.
5.2.3
Classifiers
In order to find suitable matches between the utterance modeling schemes and the recognizers in the smoking habit detection problem, five different classifiers are employed, which are described in the following subsections.

Logistic Regression (LR)

Logistic regression (LR) is a widely used classification method [18], which assumes that

y_i ~ Bernoulli(f(θ'x_i + θ_0))    (5.1)

where the y_i are independent, θ is a vector with the same dimension as x, θ_0 is a constant, and f(·) is the logistic function, defined as

f(u) = 1 / (1 + e^(−u))    (5.2)

The output of the logistic function, as shown in Figure 5.1, is a value between zero and one. In the problem of smoker detection, we intend to model the probability that a speaker is a smoker given his/her speech. That is, P(Smoker | x_i) = f(θ'x_i + θ_0), where
5. Automatic Smoker Detection from Spontaneous Telephone Speech
Figure 5.1: Logistic function
x_i is the feature vector corresponding to the i-th utterance. The vector θ and the constant θ_0 are the model parameters, which are found through maximum likelihood estimation (MLE), and the prime denotes transpose. The concept and relations of logistic regression, as well as the MLE procedure to estimate the parameters of the model, are explained in more detail in Appendix C.
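A minimal numeric sketch of the model above; the parameter values are made up for illustration and are not trained MLE estimates:

```python
import numpy as np

def logistic(u):
    """The logistic function f(u) = 1 / (1 + exp(-u)) of Eq. (5.2)."""
    return 1.0 / (1.0 + np.exp(-u))

def smoker_probability(x, theta, theta0):
    """P(Smoker | x) = f(theta' x + theta0), the LR model above."""
    return logistic(theta @ x + theta0)

# Toy illustration with hypothetical (untrained) parameters.
theta = np.array([0.5, -0.25, 0.1])
theta0 = -0.2
x = np.array([1.0, 2.0, 3.0])
p = smoker_probability(x, theta, theta0)   # a value in (0, 1)
```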
5.2.4
Multilayer Perceptrons (MLPs)
A multilayer perceptron, labeled as MLP in this chapter, is described in Chapter 2, Section 2.2.3. The only difference is that when an MLP is employed to perform a classification task, a logistic function is typically chosen as the activation function for the output layer, while in regression problems it should be a linear function. Appendix A explains the concept and relations of ANNs in more detail. In this study, numerous network architectures consisting of different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms are trained. The trained networks are then tested on the validation data, and based on the obtained results, the best network architectures are selected to be evaluated on the test data. Networks are implemented, trained and tested using the Matlab Neural Network toolbox, version 6.0.2.

Naive Bayesian Classifier (NBC)

Bayesian classifiers are probabilistic classifiers based on Bayes' theorem and the maximum a posteriori hypothesis. They predict class membership probabilities, i.e., the probability that a given test sample belongs to a particular class. That is,

P(C_l | x) = P(x | C_l) P(C_l) / P(x)    (5.3)

where C_l is the label of the l-th class. Since P(x) is the same for all classes, the denominator can be ignored in the calculations. However, calculating P(x | C_l) requires large training samples and is computationally expensive. The Naive Bayesian classifier (NBC) is a special case of Bayesian classifiers, which assumes class conditional independence to decrease the computational cost and the training data requirement [114]. Due to this assumption, each P(x_i | C_l) can be determined independently. That is,

P(x | C_l) = ∏_{i=1}^{N} P(x_i | C_l)    (5.4)
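With Gaussian class-conditional densities, the NBC of Equations 5.3 and 5.4 can be sketched with scikit-learn's GaussianNB; the two-class data below are synthetic stand-ins for utterance-level vectors, not the thesis setup:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)

# Synthetic two-class data: GaussianNB assumes class-conditionally
# independent Gaussian features, i.e. Eq. (5.4) with Gaussian P(x_i | C_l).
X_smoker = rng.normal(loc=0.5, size=(100, 10))
X_nonsmoker = rng.normal(loc=-0.5, size=(100, 10))
X = np.vstack([X_smoker, X_nonsmoker])
y = np.array([1] * 100 + [0] * 100)

nbc = GaussianNB().fit(X, y)

# Class membership probabilities P(C_l | x) for one test sample;
# the two posteriors sum to one.
post = nbc.predict_proba(X[:1])
```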
In this study, the class distributions are assumed to be Gaussian.

Gaussian Scoring (GS)

This classification approach, labeled as GS in this chapter, assumes that each class has a Gaussian distribution and that a full covariance matrix is shared across all classes [67]. In this method, the score of the test vector x_test for the l-th class is obtained as follows:

score_l = x'_test Ψ^(−1) x̄_l − (1/2) x̄'_l Ψ^(−1) x̄_l,    l = 1, 2    (5.5)

where the prime denotes transpose, x̄_l is the mean of the vectors for the l-th class in the training dataset, and Ψ is the common covariance matrix shared across all classes. Since Equation 5.5 is linear in x_test, this is a linear classifier.

Von-Mises-Fisher Scoring (VMF)

This classification approach, labeled as VMF in this chapter, is based on the simplified Von-Mises-Fisher distribution [95], which is defined as

f(x | x̄, κ) = C_d(κ) exp(κ x̄'x)    (5.6)

where x̄ is the mean, κ is the spread of the probability mass around the mean, C_d(κ) is the normalization constant, d is the dimension of x, and the prime denotes transpose. In this equation, x̄'x is the cosine similarity between the mean and x. In this method, each class is modeled using a Von-Mises-Fisher distribution. If the mean of the vectors for the l-th class, x̄_l, is defined as

x̄_l = (Σ_{i=1}^{N_l} x_i) / ‖Σ_{i=1}^{N_l} x_i‖    (5.7)

where N_l is the number of utterances for each class, the VMF score of the test vector, x_test, for the l-th class is calculated as a dot product of the test vector with the class model mean as follows:

score_l = x'_test x̄_l    (5.8)
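The GS and VMF scores reduce to a few lines of linear algebra. A sketch on synthetic data follows (a single illustrative class; in practice Ψ is the covariance shared across all classes, and the data are utterance-level vectors):

```python
import numpy as np

def gaussian_score(x_test, class_mean, cov_inv):
    """Linear Gaussian score of Eq. (5.5):
    x' Psi^-1 mean_l - 0.5 * mean_l' Psi^-1 mean_l."""
    return (x_test @ cov_inv @ class_mean
            - 0.5 * class_mean @ cov_inv @ class_mean)

def vmf_mean(class_vectors):
    """Unit-norm class mean of Eq. (5.7)."""
    s = class_vectors.sum(axis=0)
    return s / np.linalg.norm(s)

def vmf_score(x_test, class_mean):
    """Dot-product score of Eq. (5.8)."""
    return x_test @ class_mean

rng = np.random.default_rng(2)
train_class = rng.normal(size=(50, 4))   # synthetic vectors of one class
x_test = rng.normal(size=4)

psi_inv = np.linalg.inv(np.cov(train_class, rowvar=False))
gs = gaussian_score(x_test, train_class.mean(axis=0), psi_inv)
vm = vmf_score(x_test, vmf_mean(train_class))
```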
5.2.5
Training and Testing
The proposed smoking habit estimation approach is depicted in Figure 5.2. During the training phase, each utterance is mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. The
Figure 5.2: Block-diagram of the proposed smoker detection approach in training and testing phases.
obtained vectors of the training set are then used as features, together with their corresponding smoking habit labels, to train a classifier. During the testing phase, the utterance modeling approaches are applied to extract a high-dimensional vector from an unseen test utterance, and the smoking habit is recognized using the trained classifier.
5.3
Experimental Setup
5.3.1
Database
For the purpose of smoker detection, telephone recordings from the common protocols of the recent National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) databases are used, due to the large number of speakers and because the total variability subspace requires a considerable amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2). The speakers of the NIST 2008 and 2010 SRE databases are pooled together to create a dataset of 1445 speakers. They are then divided into three disjoint parts such that 60%, 20% and 20% of all speakers are used for training, development and testing, respectively. Thus, of all 6080 utterances, 3194 utterances are considered for the training set, 1692 utterances for the development set, and 1194 utterances for the testing set. The smoking habit histograms of male and female utterances (there might be multiple utterances from each speaker) of the training, development and testing databases are depicted in Figure 5.3(a) and Figure 5.3(b), respectively. As depicted in these figures, the problem deals with an unbalanced dataset,
Figure 5.3: The smoking habit histogram of telephone speech utterances for training, development and testing datasets for (a): male speakers and (b): female speakers.
which can make the classification problem more difficult. The effect of the imbalance in the database is alleviated by taking the distribution of each class in the training set into consideration during the training phase. In any classification problem, we would ideally want a well-balanced test set, where all affecting parameters are kept the same across the different categories. In the case of smoking habit detection, we would like a test set such that all parameters, such as speech content, language, speaker age, alcohol consumption and ethnicity of the utterances, are the same in the smoking and non-smoking categories. However, forming such an ideal test set of many utterances is usually very difficult and expensive. Therefore, as in many speech technology classification studies [64, 23, 55, 90], the applied test set is formed by randomizing over all other factors.
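For concreteness, a speaker-disjoint 60/20/20 split can be sketched as follows. This is a simplified illustration; the actual partitioning of the NIST speakers may differ in details such as stratification:

```python
import numpy as np

def speaker_disjoint_split(utterance_speakers, seed=0):
    """Split utterance indices into train/dev/test so that no speaker
    appears in more than one partition, with roughly a 60/20/20
    speaker split (the proportions used in this chapter)."""
    rng = np.random.default_rng(seed)
    speakers = np.unique(utterance_speakers)
    rng.shuffle(speakers)
    n = len(speakers)
    tr = set(speakers[:int(0.6 * n)])
    dv = set(speakers[int(0.6 * n):int(0.8 * n)])
    parts = {'train': [], 'dev': [], 'test': []}
    for i, spk in enumerate(utterance_speakers):
        key = 'train' if spk in tr else 'dev' if spk in dv else 'test'
        parts[key].append(i)
    return parts

# Toy example: 10 speakers with 3 utterances each.
spk_ids = [s for s in range(10) for _ in range(3)]
parts = speaker_disjoint_split(spk_ids)
```

Splitting by speaker rather than by utterance ensures the classifier is evaluated on speakers it never saw during training.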
5.3.2
Performance Metric
Two performance metrics, namely the minimum log-likelihood-ratio cost (Cllr,min) and the area under the ROC curve (AUC), are considered to evaluate the effectiveness of the proposed method. In this section, the applied performance measures are described in brief.
Minimum Log-Likelihood Ratio Cost

The Log-Likelihood Ratio Cost (Cllr) is a performance measure for classifiers with soft, probabilistic decision outputs. Since this performance measure is independent of the prior distribution of the classes, it is application-independent. This method has been selected for use in the NIST SRE, and was initially developed for binary classification problems such as speaker recognition. It was further utilized in language and dialect recognition problems [24, 8, 6], which are multi-class classification tasks. Cllr is defined as:

Cllr = −(1/N) Σ_{i=1}^{N} w_i log2(P_i)    (5.9)

where P_i is the posterior probability for the true class of the i-th utterance, w_i is a weighting factor to normalize the class proportions, and N is the total number of test samples. Cllr has the sense of a cost: a classifier with a smaller Cllr is a better classifier. For binary classifiers, the reference value of Cllr is log2(2) = 1, which indicates a useless classifier that extracts no information from the voice samples. Cllr < 1 is an indication of a useful classifier, which can be expected to make decisions with a lower average cost than decisions based on the prior. Cllr,min represents the minimum possible Cllr which can be achieved for an optimally calibrated system [106, 24]. In this study, the FoCal Multiclass Toolkit [22] is utilized to calculate Cllr,min.
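Equation 5.9 can be computed directly from the posteriors assigned to the true classes. A minimal sketch follows; note the thesis uses the FoCal toolkit for Cllr,min, while this plain Cllr illustration omits the calibration step:

```python
import numpy as np

def cllr(true_class_posteriors, weights=None):
    """Log-likelihood-ratio cost of Eq. (5.9): the weighted average of
    -log2 of the posterior assigned to each sample's true class."""
    p = np.asarray(true_class_posteriors, dtype=float)
    w = np.ones_like(p) if weights is None else np.asarray(weights, float)
    return -np.mean(w * np.log2(p))

# A binary classifier that always outputs the prior 0.5 gives the
# reference value Cllr = 1 (a useless classifier); posteriors
# concentrated on the true class drive Cllr below 1.
```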
Area Under the ROC Curve (AUC)

Classifiers can also be evaluated by comparing the areas under their Receiver Operating Characteristic (ROC) curves (AUCs). The ROC curve is a widely used approach to measure the efficiency of classifiers. In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 − specificity) for different operating points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. A classifier with perfect discrimination has a ROC curve that passes through the upper left corner (100% sensitivity, 100% specificity). Therefore, the closer the ROC curve is to the upper left corner, the higher the overall accuracy of the test [117]. The area under the ROC curve takes a value between 0 and 1: this value is 1 for a perfect classifier, and 0.5 for a default system (posterior equal to the prior).
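The AUC can be sketched via its rank interpretation, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. This is a simplified illustration, not the evaluation code used in the thesis:

```python
import numpy as np

def auc_from_scores(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney)
    formulation; ties between a positive and a negative count half."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# A perfect classifier reaches AUC = 1; a system that cannot
# separate the classes sits at 0.5.
```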
Table 5.1: The Cllr,min of applying different classifiers (Multilayer Perceptrons (MLP), Logistic Regression (LR), Von-Mises-Fisher Scoring (VMF), Gaussian Scoring (GS) and Naive Bayesian Classifier (NBC)) over the i-vector and the NFA frameworks

                      Cllr,min
Utterance Modeling    MLP     LR      VMF     GS      NBC
i-vector              0.86    0.86    0.90    0.98    0.93
NFA                   0.92    0.90    0.91    0.97    0.98

Table 5.2: The AUC of applying different classifiers (Multilayer Perceptrons (MLP), Logistic Regression (LR), Von-Mises-Fisher Scoring (VMF), Gaussian Scoring (GS) and Naive Bayesian Classifier (NBC)) over the i-vector and the NFA frameworks

                      AUC
Utterance Modeling    MLP     LR      VMF     GS      NBC
i-vector              0.70    0.74    0.51    0.56    0.66
NFA                   0.63    0.68    0.65    0.59    0.56

5.4
Results and Discussion
This section presents the results of the proposed smoking habit detection approach. The acoustic features consist of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first and second order derivatives, forming a 60-dimensional acoustic feature vector. MFCCs are obtained using a cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in state-of-the-art i-vector-based speaker recognition systems. To obtain more reliable features, Wiener filtering, speech activity detection [68] and feature warping [80] have been applied in the front-end processing. The obtained Cllr,min and AUC of applying the different classifiers over the i-vector-based and the NFA-based recognizers are listed in Tables 5.1 and 5.2, respectively. As these tables show, MLPs and LR yielded more accurate results than the other applied classifiers. The MLPs used in this study were trained using different numbers of hidden layers and hidden neurons, various learning algorithms and a variety of activation functions. Based on the results obtained on the development set, the best network architecture was then selected for further experiments. The resulting architecture was a three-layer neural network with 150 hidden neurons. Linear and logistic activation functions were selected for the hidden layer and the output layer, respectively. The network was trained using the “Gradient descent with momentum and adaptive learning rate back-propagation” training algorithm. The networks were trained using the Matlab Neural Network toolbox. This toolbox offers only two performance functions, namely mean-square-error and mean-absolute-error, to be minimized while the networks are trained. Based on the results obtained on the development set, the mean-absolute-error was selected as the better performance function to minimize in this problem. To attenuate the effect of random initialization, each experiment was repeated 10 times, and the most frequently observed result was reported. The tables also show that the i-vector framework, which is based on the Gaussian means, is more accurate in smoking habit detection than the NFA framework, which is based on the Gaussian weights. Different studies show that GMM weights, which entail a lower dimension compared with Gaussian mean supervectors, carry less, yet complementary, information relative to GMM means [62, 115, 11]. For example, Zang et al. applied GMM weight adaptation in conjunction with mean adaptation in a large vocabulary speech recognition system to improve the word error rate [115]. Li et al. investigated the application of GMM weight supervectors in speaker age group recognition and showed that score-level fusion of classifiers based on GMM weights and GMM means improves recognition performance [62]. In [11], feature-level fusion of i-vectors, GMM mean supervectors and GMM weight supervectors is applied to improve the accuracy of accent recognition. Therefore, to enhance the smoking habit detection accuracy, a score-level fusion of the i-vector and the NFA recognizers was applied using MLPs and LR. The fusion was performed by training a logistic regression on the outputs of the logistic regressions, and a three-layer neural network on the outputs of the MLPs, on the development dataset. To this end, as illustrated in Figure 5.4, each utterance of the training, development and testing sets was mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. During the training phase, the obtained i-vectors and NFA vectors of the training set were used as features, with their corresponding smoking habit labels, to train model-1 and model-2, respectively.
During the development phase, the obtained i-vectors and NFA vectors of the development set were applied to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 were then concatenated. This vector, along with the corresponding smoking habit labels of the development set, was used to train model-3, which fuses the scores. Finally, during the testing phase, the obtained i-vectors and NFA vectors of the test set were applied to the trained model-1 and model-2, respectively. The scores of model-1 and model-2 were then concatenated and applied to the trained model-3 to detect the labels of the test speakers. The Cllr,min and AUC of the MLP and LR classifiers after score-level fusion are compared in Table 5.3. It can be seen from the table that the score-level fusion of the i-vector-based and the NFA-based recognizers significantly improves the accuracy of smoking habit detection. This improvement is more evident when an MLP is used. Using an MLP, the relative improvements in Cllr,min obtained by the proposed fusion method compared with the i-vector and the NFA frameworks are 3.5% and 9.8%, respectively. The relative improvements in AUC compared with the i-vector and the NFA frameworks are 6.6% and 16%, respectively. The ROC curves of the proposed fusion for male and female speakers are illustrated in Figure 5.5(a) and Figure 5.5(b), respectively.
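The fusion scheme above can be sketched in a few lines. The snippet below uses synthetic scores and scikit-learn's LogisticRegression as the fusion model (model-3), standing in for the Matlab models of the thesis:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic stand-ins for the per-utterance scores of the two
# subsystems (model-1: i-vector-based, model-2: NFA-based) on the
# development set, plus the true smoking habit labels.
n_dev = 200
y_dev = rng.integers(0, 2, size=n_dev)
scores_ivec = y_dev + rng.normal(scale=1.0, size=n_dev)   # noisy scores
scores_nfa = y_dev + rng.normal(scale=1.5, size=n_dev)

# Score-level fusion: concatenate the subsystem outputs and train the
# fusion model (model-3) on the development data, as in Figure 5.4.
fusion_features = np.column_stack([scores_ivec, scores_nfa])
fuser = LogisticRegression().fit(fusion_features, y_dev)

# At test time, the subsystem scores of an unseen utterance are
# stacked the same way and passed through the trained fuser.
test_scores = np.array([[1.2, 0.8]])
fused_posterior = fuser.predict_proba(test_scores)[0, 1]
```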
Figure 5.4: Block diagram of the proposed smoker detection approach for score-level fusion of the i-vector-based recognizer (model-1) and the NFA-based recognizer (model-2). (U.M. stands for utterance modeling.)
5.5
Conclusions
In this chapter, an approach for automatic smoking habit detection from spontaneous telephone speech signals was proposed. In this approach, each utterance was modeled using the i-vector and the NFA frameworks, by applying factor analysis to the GMM means and weights, respectively. For each utterance modeling method, five different classifiers, namely MLP, LR, VMF, GS and NBC, were employed to find suitable matches between the utterance modeling schemes and the classifiers. Furthermore, score-level fusion of the i-vector-based and the NFA-based recognizers was performed to improve the classification accuracy. The proposed method was evaluated on telephone speech signals of speakers whose smoking habits were known, drawn from the NIST 2008 and 2010 SRE databases. The results show that the AUC for both male and female speakers after the score-level fusion is 0.75, and the Cllr,min for
Table 5.3: The comparison between the Cllr,min and AUC of the MLP and LR classifiers after score-level fusion of the i-vector-based and the NFA-based recognizers, along with the relative improvements (R.I.) in Cllr,min and AUC compared with the i-vector framework.

Classifiers    Cllr,min    R.I.     AUC     R.I.
LR             0.84        2.3%     0.75    1.3%
MLP            0.83        3.5%     0.75    6.6%

Figure 5.5: The ROC curves of the proposed method for (a): male speakers (AUC = 0.75), and (b): female speakers (AUC = 0.75).
male and female speakers after the score-level fusion are 0.84 and 0.82, respectively.
Chapter 6
Multitask Speaker Profiling
6.1
Introduction
A simple approach to estimating speaker traits is to investigate each trait as a single task, in isolation, ignoring the task relatedness. Previous studies have mostly focused on independent evaluation of speaker profiling tasks as single tasks. However, there might be meaningful relations between some characteristics of a speaker; in other words, some traits of a speaker are influenced by other characteristics. For instance, the perceived age of smokers is different from that of non-smokers of the same calendar age [21]. In [110], the authors demonstrated that emotional behaviors depend strongly on gender; that is, males and females express different emotional behaviors. There might also be a relation between the body size (weight/height) of a speaker and his/her age. Thus, providing other characteristic or behavioral information of a speaker in the form of a multitask learning (MTL) approach might improve the accuracy of single-task speaker profiling. In contrast to single-task learning (STL) systems, in which each task is learned in isolation, in MTL related tasks are learned simultaneously by extracting and utilizing appropriate shared information across the tasks. MTL has recently attracted a lot of attention in the machine learning community, since learning multiple regression and classification tasks simultaneously and in parallel makes it possible to model the mutual information between the tasks and consequently improves the recognition performance of the individual tasks. This concept is inspired by human learning, in which tasks are usually learned in interaction with several related tasks. Extensive experimental and theoretical studies have demonstrated the effectiveness of multitask learning in improving performance relative to learning each task in isolation [3, 14, 27, 40, 102]. In experiments conducted by Lu et al.
[63], the performance of automatic speech recognition in noise, as the main task, was improved by including speech enhancement and gender recognition as additional tasks. Weninger et al. also significantly improved the recognition of speaker traits and states by applying MTL to this problem [111]. In [110, 108], the performance of an automatic emotion recognition system was improved by considering gender information. This improvement is due to the fact that males and females
Figure 6.1: The scatter plots of (a): age and height, (b): age and weight, and (c): height and weight of speakers in NIST 2008 and 2010 SRE databases for both genders.
express their emotions in different manners, and an MTL system can better discriminate these differences than a single-task automatic emotion recognition system. The authors in [92] applied MTL to improve the accuracy of single-task speaker classification by providing the speakers' height, age and race information. To implement an MTL system, a neural network with hidden layers is an appropriate approach, as it provides a shared representation among the tasks. In multilayer perceptrons (MLPs), each task is associated with a target, which means that implementing MTL with a neural network requires adding extra targets corresponding to the new tasks. By providing additional tasks, the network learns internal representations that minimize the errors of the main and extra tasks. The learned information is thus shared when tasks are learned in parallel, which means that a network learns several related tasks more efficiently than when each task is learned in isolation [26]. Considering automatic estimation of height, weight, age and smoking habit as single-task problems, training classifiers and regressors to predict multiple related variables simultaneously can improve the performance of speaker profiling. The MTL improvement rests on mechanisms that depend on the related tasks [27]; that is, if the tasks are related, MTL finds the underlying patterns better than STL, otherwise STL may learn better. Figure 6.1 illustrates the correlation between different traits of the speakers in the NIST 2008 and 2010 SRE databases. The correlation is a useful representation of relationships between variables. The correlations between the speakers' height and weight, height and age, and weight and age are 0.538, −0.103 and 0.157, respectively. However, related tasks should be correlated at the representation level and not necessarily at the output level [26]; that is, the internal representations (which are useful for the different tasks) should be correlated.
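The reported trait correlations are plain Pearson coefficients; a sketch with synthetic data (not the NIST metadata):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic paired traits: weight loosely depends on height, mimicking
# the kind of positive height/weight correlation reported above.
height = rng.normal(172, 9, size=500)                 # cm
weight = 0.8 * height + rng.normal(0, 12, size=500)   # kg, noisy relation

# Pearson correlation coefficient between the two traits.
cc = np.corrcoef(height, weight)[0, 1]
```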
Caruana discussed various criteria for tasks to be related to each other in [27]. In this chapter, the impact of multitask learning on the performance of speaker profiling systems, by means of providing additional information about a speaker, is investigated. For the purpose of comparison, the results obtained in the previous chapters are considered as baselines. The classifiers and estimators are then trained to predict
multiple related variables simultaneously. To this end, in a series of experiments, one task is considered as the main task. Then, by introducing additional tasks one by one, the impact of applying MTL on the performance of the speaker profiling system is evaluated. In this study, a hybrid architecture involving the i-vector and the NFA frameworks is proposed. This method is based on the score-level fusion of the i-vector-based and the NFA-based recognizers, which utilizes the information in the GMM means and weights. For this problem, artificial neural networks (ANNs) are employed to capture the underlying patterns between the input and output data. Evaluation on the NIST 2008 and 2010 SRE corpora shows the effectiveness of the proposed approach. After this introduction, the problem of multitask speaker profiling is formulated and the proposed approach is described in Section 6.2. Section 6.3 explains the experimental setup. The results are presented and discussed in Section 6.4. The chapter concludes in Section 6.5.
6.2
System Description
In this section, the problem formulation and the main constituents of the proposed method are described.
6.2.1
Problem Formulation
In multitask speaker profiling, we are given a set of training data D = {x_i, y_i}_{i=1}^{N}, where x_i ∈ R^p denotes the i-th utterance and y_i denotes a vector composed of a combination of two or more of the age, height, weight and smoking habit labels. The goal is to design an estimator or classifier g such that, for an utterance of an unseen speaker x_tst, the estimated labels ŷ = g(x_tst) approximate the actual labels as well as possible in some predefined sense.
6.2.2
Utterance Modeling
Converting variable-duration speech signals into fixed-dimensional vectors, to make them suitable for classification/regression algorithms, is the first step in multitask speaker profiling. This procedure is performed by fitting a GMM to the acoustic features extracted from each speech signal. The parameters of the obtained GMM characterize the corresponding utterance. Since the available data is limited, an accurate adaptation of a separate GMM for a short utterance is not possible, especially in the case of GMMs with a large number of Gaussian components. Therefore, parametric utterance adaptation techniques should be applied to adapt a UBM to the characteristics of the utterances in the training and testing databases. In this chapter, the i-vector framework is applied for adapting the UBM means and the NFA framework for adapting the UBM weights. The UBM and the methods of UBM adaptation using the i-vector and the NFA frameworks are explained in Chapter 2, Section 2.2.2.
6. Multitask Speaker Profiling
6.2.3
Function Approximations and Classifiers
In MTL, related tasks are learned simultaneously by extracting and utilizing appropriate shared information across the tasks. Implementing multitask speaker profiling necessitates an architecture which provides a shared representation between the tasks. Since the hidden layers of MLPs provide this property, MTL in this study is performed with ANNs.

Artificial Neural Networks

To implement an MTL system, a neural network with hidden layers is an appropriate approach, as it provides a shared representation among the tasks. The layered structure of feedforward neural networks provides the ability to share the learned information while the tasks are learned in parallel. In a neural network, there are one or more hidden layers of computational nodes, which provide a shared representation between the tasks. The number of hidden units should be large enough to capture the underlying pattern between the input and output data, and an appropriate number of hidden neurons should be selected by a trial-and-error procedure. Appendix A explains the concept and relations of ANNs in more detail. In MLPs, each task is associated with a target, which means that implementing MTL with a neural network requires adding extra targets corresponding to the new tasks. By providing additional tasks, the network learns internal representations that minimize the errors of the main and extra tasks. The learned information is thus shared when tasks are learned in parallel, which means that a network learns several related tasks more efficiently than when each task is learned in isolation [26]. To perform MTL, numerous dynamic and static network architectures, consisting of different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms, are trained. The trained networks are then tested on the validation data, and based on the obtained results, the best network architectures are selected to be evaluated on the test data.
Networks are implemented, trained and tested using Matlab Neural Network toolbox, version 6.0.2.
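A minimal sketch of the shared-representation idea, using scikit-learn's MLPRegressor on synthetic data in place of the Matlab toolbox used in the thesis: stacking one target column per task makes the hidden layer a representation shared across the tasks.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(5)

# Synthetic utterance-level vectors and three related targets
# (standing in for age, height, weight): the targets share a common
# linear structure, mimicking related tasks.
X = rng.normal(size=(300, 20))
w = rng.normal(size=(20, 3))
Y = X @ w + rng.normal(scale=0.1, size=(300, 3))

# One hidden layer shared by all three output targets; this is the
# MTL structure described above (sizes here are illustrative only).
mtl_net = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000,
                       random_state=0).fit(X, Y)

# One forward pass yields all task estimates simultaneously.
predictions = mtl_net.predict(X[:2])   # shape (2, 3), one column per task
```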
6.2.4
Training and Testing
In this study, the experiments are divided into two categories based on the relatedness of the tasks. The first category proposes MTL for speaker age and smoking habit estimation. The second proposes MTL for speaker height, weight and age estimation. These two proposed multitask speaker profiling methods are depicted in Figure 6.2 and Figure 6.3, respectively. The procedure of training and testing is the same for both categories. As illustrated in the figures, each utterance of the training, development and testing sets is mapped onto a high-dimensional vector using one of the utterance modeling approaches described in Section 2.2.2. During the training phase, the obtained i-vectors and NFA vectors of the training set are used as features, with their corresponding labels, to train model-1 and model-
Figure 6.2: Block diagram of the proposed multitask speaker profiling approach for speaker age estimation and smoking habit detection, in the training, development and testing phases. U.M. stands for utterance modeling. y_tr^A and y_tr^S represent the training labels corresponding to age and smoking habit, respectively, and ŷ^A and ŷ^S represent the estimated age and smoking habit, respectively, after applying a test sample x_tst.
2, respectively. Model-1 is therefore an i-vector-based model and model-2 an NFA-based model, and both models perform MTL. During the development phase, the trained models simultaneously estimate the requested labels of the utterances of the development set. To this end, the obtained i-vectors and NFA vectors of the development set are applied to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 are then concatenated and, along with the corresponding labels of the development set, are used to train model-3, which fuses the results. Finally, during the testing phase, the trained models estimate the labels of the utterances of unseen speakers (the utterances of the test set). This is performed by applying the obtained i-vectors and NFA vectors of the test set to the trained model-1 and model-2, respectively. The outputs of model-1 and model-2 are then concatenated and applied to the trained model-3 to estimate the labels of the test utterances.
Figure 6.3: Block diagram of the proposed multitask speaker profiling approach for speaker age, height and weight estimation, in training and testing phases. U.M. stands A , y H and y W represent the training labels corresponding for utterance modeling. ytr tr tr to age, height and weight, respectively, and yˆA , yˆH and yˆW represent the estimated age, height and weight, respectively, after applying a test sample xtst .
6.3 Experimental Setup

6.3.1 Corpus
For the purpose of MTL speaker profiling, telephone recordings from the common protocols of the National Institute of Standards and Technology (NIST) 2008 and 2010 Speaker Recognition Evaluation (SRE) databases are used, due to their large number of speakers and because the total variability subspace requires a large amount of development data for training. The development data set used to train the total variability subspace and the UBM includes over 30,000 speech recordings and was sourced from the NIST 2004-2006 SRE databases and the LDC releases of Switchboard 2 Phase III and Switchboard Cellular (Parts 1 and 2). The speakers of the NIST 2008 and 2010 SRE databases are pooled to create a dataset of 1445 speakers, which is then divided into three disjoint parts such that 60%, 20% and 20% of the speakers are used for training, development and testing, respectively. Thus, of all 6080 utterances, 3194 are assigned to the training set, 1692 to the development set, and 1194 to the testing set.
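The split above is made over speakers, not utterances, so that no speaker appears in more than one set. A minimal sketch of such a speaker-disjoint 60/20/20 split (the data structure and function name are hypothetical, for illustration only):

```python
import random

def split_speakers(utt_by_speaker, seed=0):
    """Split speakers (not utterances) 60/20/20 so that the training,
    development and test sets are disjoint in speakers."""
    speakers = sorted(utt_by_speaker)
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    cut1, cut2 = int(0.6 * n), int(0.8 * n)
    parts = speakers[:cut1], speakers[cut1:cut2], speakers[cut2:]
    # expand each speaker part back into its utterances
    return [[u for s in part for u in utt_by_speaker[s]] for part in parts]

# toy example: 10 speakers with 3 utterances each
data = {f"spk{i}": [f"spk{i}_utt{j}" for j in range(3)] for i in range(10)}
train, dev, test = split_speakers(data)
print(len(train), len(dev), len(test))  # 18 6 6
```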
6.3.2 Performance Metric
In order to evaluate the effectiveness of the proposed MTL system, and for the purpose of comparison, the systems are evaluated with the same performance metrics used for the single-task systems in previous chapters. The minimum log-likelihood-ratio cost (Cllr,min) and the area under the ROC curve (AUC) are employed to evaluate the performance of MTL in smoking habit detection; these measures are described in Chapter 5, Section 5.3.2. To evaluate the performance of MTL in speaker height, weight and age estimation, the Pearson correlation coefficient (CC) is used, which is described in Chapter 2, Section 2.3.2.
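Two of these metrics are simple to compute directly; the sketch below shows the Pearson correlation coefficient and the AUC via the rank (Mann-Whitney) formulation. Cllr,min is omitted here, since its minimization step requires a calibration procedure beyond a short example.

```python
import numpy as np

def pearson_cc(y_true, y_pred):
    """Pearson correlation coefficient between actual and estimated values."""
    a = np.asarray(y_true, float)
    b = np.asarray(y_pred, float)
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive sample scores higher than a randomly chosen negative one."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

print(pearson_cc([160, 170, 180, 190], [162, 168, 183, 188]))
print(auc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]))  # perfect separation -> 1.0
```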
6.4 Results and discussion
In this section, the proposed multitask speaker profiling approach is evaluated. The acoustic feature vector is a 60-dimensional vector consisting of 20 Mel-Frequency Cepstrum Coefficients (MFCCs), including energy, appended with their first- and second-order derivatives. MFCCs are obtained by taking the cosine transform of the real logarithm of the short-term energy spectrum represented on a mel-frequency scale [80]. This type of feature is very common in i-vector-based speaker recognition systems. Wiener filtering, feature warping [80] and voice activity detection [68] have also been applied in the front-end processing to obtain more reliable features. In this chapter, model-1 is an i-vector-based model and model-2 is an NFA-based model, both performing MTL, and ANNs were used to train the models.

In the first series of experiments, the impact of applying MTL to speaker age, height and weight estimation was evaluated. To this end, in each experiment, one task was considered as the main task. Then, by introducing additional tasks one by one, the impact of MTL on the performance of the speaker profiling system was investigated. In the second series of experiments, the effect of applying MTL to speaker age and smoking habit estimation was examined.

To find the best network architectures for model-1 and model-2, numerous dynamic and static network architectures with different numbers of hidden layers and hidden neurons, various activation functions and a variety of training algorithms were trained. The trained networks were then tested on the validation data, and based on the obtained results, the best network architectures were selected. Accordingly, to perform MTL for speaker height, age and weight estimation, a four-layer NN with 100 and 150 neurons in the first and second hidden layers, respectively, was selected for model-1 and model-2.
To perform MTL for speaker age and smoking habit estimation, a four-layer NN with 200 neurons in the first hidden layer and 300 neurons in the second hidden layer was trained for model-1 and model-2. The fusion procedure, as described in Section 6.2.4, was performed by training a three-layer NN on the outputs of model-1 and model-2 on the development dataset. The fusion network (model-3) consisted of 10 hidden neurons with a logistic activation function in its hidden layer. The training algorithm used to train the network was "Gradient descent with momentum and adaptive learning rate back-propagation". The networks were trained using the Matlab Neural Network toolbox. This toolbox offers only two performance functions to be minimized during training, namely mean-square-error and mean-absolute-error. Based on the results obtained on the development set, mean-absolute-error was selected as the better performance function to minimize for this problem. To attenuate the effect of random initialization, the training and testing phases of each experiment were repeated 10 times, and the most frequently observed result was reported.

Table 6.1: The comparison between single-task and multitask speaker profiling for speaker height and age estimation. CC is the Pearson correlation coefficient between actual and estimated height/age.

Method of Estimation     CC Height (Male / Female)   CC Age (Male / Female)
Baselines (STL)          0.36 / 0.36                 0.75 / 0.85
MTL (Height & Age)       0.36 / 0.37                 0.75 / 0.86
Relative improvement     0.0% / 2.7%                 0.0% / 1.2%

Table 6.1 shows the results of applying MTL to speaker age and height estimation after the score-level fusion. It should be noted that although in Chapter 3 the best results for speaker height estimation were 0.41 for males and 0.40 for females with LSSVR, and in Chapter 2 the best result for male speaker age estimation was 0.76 with LSSVR, the results obtained with MLPs are used as the baselines here, since ANNs were used in the MTL systems. It can be seen from this table that MTL slightly improved the performance of age and height estimation for female speakers.

The results of applying MTL to speaker weight and age estimation after the score-level fusion are presented in Table 6.2. Except for male age estimation, a positive impact of MTL on the performance of speaker age and weight estimation is evident. Unfortunately, applying MTL to speaker height and weight estimation, as reported in Table 6.3, did not improve the performance. This might be due to the fact that related tasks should be correlated at the representation level and not necessarily at the output level [26]. One hypothesis is therefore that the speaker height and weight estimation tasks are not correlated at the level of representation, and hence are not related tasks. Table 6.4 lists the results of applying MTL to speaker height, weight and age estimation after the score-level fusion.
Except for male height and male age estimation, MTL had a positive impact on the accuracy of the other tasks. This improvement was most evident for speaker weight estimation. The correlation coefficients between actual and estimated age, height and weight when the male and female data were pooled together are 0.82, 0.74 and 0.60, respectively, which shows an improvement in performance for height and weight, since the corresponding values for the single-task age, height and weight estimations were 0.82, 0.60 and 0.59, respectively.

Table 6.2: The comparison between single-task and multitask speaker profiling for speaker weight and age estimation. CC is the Pearson correlation coefficient between actual and estimated weight/age.

Method of Estimation     CC Weight (Male / Female)   CC Age (Male / Female)
Baselines (STL)          0.52 / 0.39                 0.75 / 0.85
MTL (Weight & Age)       0.54 / 0.41                 0.75 / 0.87
Relative improvement     3.7% / 4.8%                 0.0% / 2.3%

Table 6.3: The comparison between single-task and multitask speaker profiling for speaker height and weight estimation. CC is the Pearson correlation coefficient between actual and estimated height/weight.

Method of Estimation     CC Height (Male / Female)   CC Weight (Male / Female)
Baselines (STL)          0.36 / 0.36                 0.52 / 0.39
MTL (Height & Weight)    0.36 / 0.36                 0.52 / 0.39
Relative improvement     0.0% / 0.0%                 0.0% / 0.0%

Table 6.4: The comparison between single-task and multitask speaker profiling for speaker height, weight and age estimation. CC is the Pearson correlation coefficient between actual and estimated height/weight/age.

Method of Estimation         CC Height (M / F)   CC Weight (M / F)   CC Age (M / F)
Baselines (STL)              0.36 / 0.36         0.52 / 0.39         0.75 / 0.85
MTL (Height, Weight & Age)   0.36 / 0.39         0.56 / 0.41         0.75 / 0.86
Relative improvement         0.0% / 7.7%         7.1% / 4.8%         0.0% / 1.2%

In Table 6.5, the results of applying MTL to speaker age estimation and smoking habit detection are presented. By comparing the obtained results with the baselines, we can conclude that the proposed MTL approach for smoker detection and age estimation improves the results of smoking habit detection. However, the performance of age estimation was not improved by this MTL approach. The ROC curves of the MTL smoker detection for male and female speakers after the score-level fusion and when age information is considered are illustrated in Figure 6.4(a) and Figure 6.4(b),
Table 6.5: The comparison between single-task and multitask speaker profiling for speaker age estimation and smoking habit detection. Cllr,min is the minimum log-likelihood-ratio cost, AUC is the area under the ROC curve, and CC is the Pearson correlation coefficient between actual and estimated age.

Method of Estimation         CC Age (Male / Female)   Smoker Detection (Cllr,min / AUC)
Baselines (STL)              0.75 / 0.85              0.83 / 0.75
MTL (Smoking habits & Age)   0.75 / 0.85              0.81 / 0.79
Relative improvement         0.0% / 0.0%              2.4% / 5.1%

Figure 6.4: The ROC curves of the proposed MTL smoker detection after the score-level fusion and when age information is considered, for (a): male speakers (AUC = 0.78), and (b): female speakers (AUC = 0.795).
respectively. By applying the proposed MTL method, the AUC for male and female speakers after the score-level fusion is 0.78 and 0.79, respectively. The obtained Cllr,min values for male and female speakers are 0.80 and 0.83, respectively.
6.5 Conclusion
In this chapter, the performance of multitask learning in speaker profiling was evaluated. In the proposed method, each utterance of the NIST 2008 and 2010 SRE databases was modeled using the i-vector and the NFA frameworks, obtained by applying factor analysis to the GMM means and weights, respectively. A hybrid architecture involving the score-level fusion of the i-vector-based and the NFA-based recognizers was proposed. In each series of experiments, one speaker profiling task was considered as the main task and extra tasks were added to it one by one. An MTL approach for speaker age, height and weight estimation and an MTL approach for speaker age and smoking habit estimation were performed separately. By comparing the results of MTL with the best results obtained in previous chapters for each single task, it can be concluded that related tasks can improve the performance of the main task, whereas unrelated tasks cannot. The experimental results show that the performance of speaker weight estimation and smoker detection improved when age information was provided; for height estimation, this improvement occurred only for female speakers. On the other hand, performing MTL for speaker height and weight estimation did not improve the performance of either the height or the weight estimation. Furthermore, when MTL was applied to estimate age, height and weight simultaneously, an improvement in performance was observed for speaker weight estimation, as well as for female height and age estimations. The relative improvement in CC for speaker weight estimation compared with the baselines was 7.1% for males and 4.8% for females. This improvement was about 1.2% for female age estimation and 7.7% for female height estimation. Cllr,min and AUC for MTL smoking habit detection also showed relative improvements of 2.4% and 5.1%, respectively.
Chapter 7
Conclusion

7.1 Summary and Contributions
In this thesis, novel approaches for four forensically important tasks of automatic speaker profiling, namely the estimation of speaker age, height, weight and smoking habit from spontaneous telephone speech signals, have been proposed. In the speaker height and weight estimation tasks, utterances were modeled using the i-vector framework. The i-vector framework, which is based on factor analysis for GMM mean adaptation and decomposition, provides a compact representation of an utterance in the form of a low-dimensional feature vector, and can effectively replace GMM mean supervectors in speaker profiling tasks [9]. In the speaker age estimation and smoking habit detection tasks, utterances were modeled using both the i-vector and the NFA frameworks. The NFA framework also provides a low-dimensional utterance representation, and is based on a constrained factor analysis on GMM weights such that the adapted weights are non-negative and sum to unity. This framework was shown to provide less information than the i-vector framework. However, utilizing the NFA vectors in conjunction with the i-vectors in speaker profiling systems can enhance the performance, thanks to the complementary information provided by the NFA framework.

In Chapter 2, a new approach for age estimation from spontaneous telephone speech signals based on a hybrid architecture of the NFA and the i-vector frameworks was proposed. This architecture consisted of two subsystems based on the i-vectors and the NFA vectors. ANNs and LSSVR were applied to perform regression. This method is distinguished from previous age estimation methods in that it exploits the available information in both Gaussian means and Gaussian weights through a score-level fusion of the i-vector-based and the NFA-based subsystems. The effectiveness of the proposed method was investigated on spontaneous telephone speech signals of the NIST 2008 and 2010 SRE corpora.
Experimental results (as represented in Table 7.1) demonstrated that the score-level fusion of the i-vector-based and the NFA-based estimators decreased the mean-absolute error and enhanced the Pearson correlation coefficient between estimated and actual age compared with the i-vector framework. The proposed method improved the MAE of
Table 7.1: The relative improvements (R.I.) in CC and MAE of the proposed speaker age estimation after score-level fusion compared with the i-vector framework. CC is the Pearson correlation coefficient between actual and estimated age, and MAE is the mean-absolute error between actual and estimated age.

Function Approximation   Male: R.I. CC / R.I. MAE   Female: R.I. CC / R.I. MAE
LSSVR (RBF)              4.2% / 9.9%                6.1% / 17.9%
LSSVR (Linear)           6.6% / 18.2%               7.1% / 21.8%
Three-Layer NN           4.1% / 15.4%               3.6% / -3.2%
Four-Layer NN            2.7% / 12.4%               3.5% / 15.8%

Table 7.2: The relative improvements (R.I.) in Cllr,min and AUC of the proposed smoking habit detection after score-level fusion compared with the i-vector framework.

Classifiers   R.I. in Cllr,min   R.I. in AUC
LR            2.3%               1.3%
MLP           3.5%               6.6%
the baseline [9] (provided on the same databases) for males and females by 8.6% and 22.2%, respectively, which reflects the effectiveness of the proposed method in automatic speaker age estimation.

In Chapter 3 and Chapter 4, new methods for automatic speaker body size (height and weight) estimation from spontaneous telephone speech signals were introduced. In this research, the effectiveness of the i-vector framework in estimating speakers' body size was investigated for the first time. In these methods, each utterance was modeled by its corresponding i-vector. Then, ANNs and LSSVR were employed to estimate the body size of a speaker from a given utterance. The proposed methods were trained and tested on the telephone speech signals of the NIST 2008 and 2010 SRE databases. Evaluation results demonstrated the effectiveness of the proposed methods in automatic speaker height and weight estimation.

Chapter 5 introduced, for the first time, automatic smoking habit detection from spontaneous telephone speech signals, based on utterance modeling with the i-vector and the NFA frameworks. For each utterance modeling method, five different classifiers, namely MLP, LR, VMF, GS and NBC, were employed. Furthermore, to improve the classification accuracy, a score-level fusion of the i-vector-based and the NFA-based recognizers was performed. The proposed method was evaluated on telephone speech signals drawn from the NIST 2008 and 2010 SRE databases. The relative improvements in Cllr,min and AUC obtained by the proposed score-level fusion compared with the i-vector framework are presented in Table 7.2, which reflects the effectiveness of the proposed method in smoking habit detection.

The final and important contribution of this thesis was presented in Chapter 6, where the impact of multitask learning on the performance of speaker profiling systems was investigated. In the proposed method, each utterance of the NIST 2008 and 2010 SRE databases was modeled using the i-vector and the NFA frameworks. Due to task relatedness, an MTL for speaker age, height and weight estimation and an MTL for speaker age and smoking habit estimation were performed in separate experiments. Experimental results show that providing age information improves the performance of speaker weight estimation and smoker detection; however, age information can only enhance the performance of female height estimation. On the other hand, performing MTL for speaker height and weight estimation has no positive impact on the performance of either the height or the weight estimation. Moreover, applying MTL to estimate age, height and weight simultaneously results in performance improvements for speaker weight estimation, female height estimation and female age estimation. The relative improvements in CC obtained in the multitask age, height and weight estimation compared with the baselines are presented in Table 7.3. In Table 7.4, the relative improvements in Cllr,min and AUC for smoking habit detection and the relative improvement in CC for age estimation obtained in the multitask smoking habit and age estimation are presented.

Table 7.3: The relative improvements in CC for age, height and weight estimations obtained in a multitask age, height and weight estimation, compared with the baselines.

Tasks in MTL Speaker Profiling   R.I. in CC (Male / Female)
Age estimation                   0.0% / 1.2%
Height estimation                0.0% / 7.7%
Weight estimation                7.1% / 4.8%

Table 7.4: The relative improvements (R.I.) in Cllr,min and AUC for smoking habit detection and the R.I. in CC for age estimation obtained in a multitask smoking habit and age estimation, compared with the baselines.

Tasks in MTL Speaker Profiling   R.I. in CC Age   R.I. in Cllr,min   R.I. in AUC
Age estimation                   0.0%             —                  —
Smoker detection                 —                2.4%               5.1%

The proposed approach has two major distinctions from previous speaker profiling methods. First, the available information in both GMM means and weights was employed through a score-level fusion of the i-vector-based and the NFA-based recognizers and estimators. Second, by applying MTL, correlated tasks, which were usually investigated in isolation, were evaluated simultaneously and in interaction with each other.
7.2 Future Direction
Approaches to improving the performance of speaker profiling systems can, from a machine learning point of view, be categorized into two groups: first, modifying the input data, and second, modifying the methods of inferring paralinguistic information from the speech signals. Feature-level fusion is an approach which provides complementary information to the recognizers. Providing additional related features, postulated to correlate with the target attribute, can enhance the performance of speaker profiling tasks. Subglottal resonance frequencies, for instance, are known to correlate with physical height [5, 4], so a feature-level fusion (with normalization) of the i-vectors and subglottal resonance frequencies might improve the accuracy of height estimation.

Modifying the methods of inferring paralinguistic information from speech signals can be done with different machine learning algorithms. In this study, several methods such as LSSVR, ANNs, LR, NBC, GS and VMF were employed. For each speaker profiling task, one should look for a classifier or an estimator that matches the utterance modeling scheme. The fusion of different classifiers and estimators has also been shown to improve the performance of speaker recognition systems. The success of employing MTL for speaker age, height, weight and smoking habit estimation was observed in this thesis. This idea can be extended to other speaker profiling tasks, such as accent, language and dialect recognition, to improve the performance of speaker profiling systems.
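The feature-level fusion idea above can be sketched as follows: each feature stream is normalized per dimension before concatenation, so that streams with very different scales (e.g. i-vector components versus resonance frequencies in Hz) contribute comparably. All names and dimensions here are hypothetical illustrations, not a tested configuration.

```python
import numpy as np

def feature_level_fusion(ivecs, extra, eps=1e-8):
    """Fuse two feature streams by z-normalizing each per dimension and
    concatenating them, so neither stream dominates by scale. `extra` might
    hold, e.g., subglottal resonance frequencies (hypothetical example)."""
    def znorm(F):
        return (F - F.mean(axis=0)) / (F.std(axis=0) + eps)
    return np.hstack([znorm(ivecs), znorm(extra)])

# toy data: 100 utterances, 400-dim "i-vectors" plus 3 resonance features
fused = feature_level_fusion(np.random.randn(100, 400) * 5,
                             np.random.randn(100, 3) * 50 + 500)
print(fused.shape)  # (100, 403)
```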
Appendices
Appendix A

Artificial Neural Networks (ANNs)

In this appendix, concepts and equations related to artificial neural network theory are reviewed in brief. A more extensive explanation of this topic can be found in [53, 43, 36, 100].
A.1 Multilayer Perceptron Neural Networks
After the introduction of the neural network concept, many researchers successfully applied neural networks to a variety of applications. A significant improvement was obtained after the introduction of the multilayer perceptron in conjunction with the back-propagation technique for learning the interconnection weights from input/output patterns. While a single perceptron can realize only a linear decision boundary in the input space, a multilayer perceptron is able to construct a nonlinear decision boundary. A single neuron, as depicted in Figure A.1, can be modeled as a nonlinear element: its output is a nonlinear activation function applied to a weighted sum of the input signals. That is,

o_j = ϕ( Σ_{i=1}^{n} w_i x_i + b )    (A.1)
where b is a bias or threshold, and the nonlinear function is typically of the saturation type, e.g. tanh(·). A multilayer perceptron (shown in Figure A.2) is constructed by adding one or more hidden layers, and can be modeled as

y = W tanh(Vx + B)    (A.2)

where x ∈ R^m is the input vector, y ∈ R^{p_y} is the output, B ∈ R^{n_h} is the bias vector, V ∈ R^{n_h × m} is the interconnection matrix of the hidden layer, and W ∈ R^{p_y × n_h} is the interconnection matrix of the output layer.
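The forward pass of Equation A.2 is a few lines of matrix arithmetic; a minimal sketch with arbitrary random weights (all dimensions here are illustrative):

```python
import numpy as np

def mlp_forward(x, V, B, W):
    """Single-hidden-layer MLP of Equation A.2: y = W tanh(Vx + B)."""
    return W @ np.tanh(V @ x + B)

rng = np.random.default_rng(1)
m, nh, py = 4, 5, 2            # input dim, hidden units, output dim
V = rng.normal(size=(nh, m))   # hidden-layer interconnection matrix
B = rng.normal(size=nh)        # hidden-layer bias vector
W = rng.normal(size=(py, nh))  # output-layer interconnection matrix
x = rng.normal(size=m)
print(mlp_forward(x, V, B, W).shape)  # (2,)
```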
Figure A.1: The structure of a single neuron.
Figure A.2: The structure of a multilayer perceptron (MLP).
The activation functions commonly used in feedforward neural networks are the logistic, hyperbolic tangent and linear functions. The logistic function, whose output lies between 0 and 1, takes the following form:

ϕ(x) = 1 / (1 + exp(−ax)),  a > 0    (A.3)

The general form of the hyperbolic tangent function, whose output takes values between −1 and 1, is defined as:

ϕ(x) = (1 − exp(−2x)) / (1 + exp(−2x))    (A.4)

The linear function is defined as:

ϕ(x) = x    (A.5)
Selecting an appropriate activation function for each layer depends on the application. For regression problems, a linear function is employed in the output layer, while for classification problems the output layer can take logistic or hyperbolic tangent functions. Depending on the degree of the problem's nonlinearity, all types of activation functions can be used in the hidden layers.

By the universal approximation theorem [54], a feedforward network with a single hidden layer containing a finite number of neurons is sufficient to compute a uniform approximation for a given training set and its desired outputs. The number of hidden neurons has to be chosen by a trial-and-error procedure, since there is no general rule to calculate the best number of hidden units. Barron [15] demonstrated that, under certain conditions, MLPs with one hidden layer have an approximation error of order of magnitude O(1/n_h), which is independent of the dimension of the input data. Barron also showed that models based on MLPs can handle larger-dimensional inputs better than polynomial expansions, since the approximation error of polynomial expansions is of order of magnitude O(1/n_x^{2/m}), where m is the dimension of the input data and n_x is the number of terms in the expansion.
A.2 Regression and Classification using MLPs
The back-propagation (BP) algorithm, which can be considered an extension of the LMS algorithm [52], was the first method for training MLPs. In this algorithm, we are given a set of training data {x_i, y_i}_{i=1}^{N}, where N denotes the number of training samples, and the goal is to minimize the residual squared-error cost function over the unknown interconnection weights by means of a steepest-descent local optimization algorithm. That is,

min_{θ ∈ R^p} J(θ) = (1/N) Σ_{i=1}^{N} ‖y_i − f(x_i; θ)‖_2^2    (A.6)

where θ ∈ R^p is the vector of p unknown interconnection weights. The gradient can be computed for any number of hidden layers by means of the recursive equations of the generalized delta rule. Different training algorithms have been suggested over the last decades [86, 46, 65] to enhance the training speed, provide more memory-efficient methods and offer better convergence properties. Some commonly used algorithms are described briefly in Section A.2.1.

As illustrated in Figure A.3, the input patterns are propagated along a forward path towards the output layer. The error calculated at the output layer is then back-propagated along a backward path towards the input layer. During this procedure, the values of the interconnection weights are modified. One important issue to take into consideration during the training phase is avoiding over-fitting. To tackle this problem, one should set aside a part of the training data as a validation set, which is used to decide when to stop training; otherwise, the network starts memorizing the patterns instead of generalizing.

For function approximation problems, one can utilize the model defined in Equation A.2, which contains a hidden layer of neurons with tanh activation functions and an output layer with a linear activation function. For classification problems, the output layer can take tanh or logistic sigmoid activation functions. Alternatively, the model can be modified in such
Figure A.3: The structure of a single hidden layer feedforward neural network with error back propagation. The solid lines represent the forward paths and the dotted lines indicate the error back-propagation paths.
a way that a sign function is applied to the output of the regression model (Equation A.2). That is,

y = sign[ W tanh(Vx + B) ]    (A.7)

In binary classification with the tanh activation function, the desired output takes either −1 or +1 for the network to be trained. With the logistic sigmoid activation function, the desired output takes either 0 or 1. In multiclass classification, additional outputs should be considered to represent the various classes.
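The steepest-descent minimization of the cost in Equation A.6 via the generalized delta rule can be sketched for a tiny 1-4-1 network approximating a simple target function. This is an illustrative toy implementation, not the toolbox training used in the thesis; the learning rate, seed and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
# Tiny 1-4-1 MLP (Eq. A.2) trained by back-propagation on the cost of Eq. A.6.
X = np.linspace(-1, 1, 40)[:, None]
Y = X ** 2                                  # target function to approximate
V = rng.normal(size=(4, 1))                 # hidden-layer weights
B = np.zeros(4)                             # hidden-layer biases
W = rng.normal(size=(1, 4))                 # output-layer weights (linear)

J0 = (((np.tanh(X @ V.T + B) @ W.T) - Y) ** 2).sum(axis=1).mean()  # initial cost

lr = 0.1
for _ in range(2000):
    H = np.tanh(X @ V.T + B)                # hidden activations, shape (N, 4)
    Y_hat = H @ W.T                         # linear output layer (regression)
    E = Y_hat - Y                           # residuals
    J = (E ** 2).sum(axis=1).mean()         # Eq. A.6 cost
    # generalized delta rule: propagate the error backwards
    gW = 2 * E.T @ H / len(X)
    dH = (E @ W) * (1 - H ** 2)             # tanh'(u) = 1 - tanh(u)^2
    gV = 2 * dH.T @ X / len(X)
    gB = 2 * dH.mean(axis=0)
    W, V, B = W - lr * gW, V - lr * gV, B - lr * gB
print(J)  # final cost, well below the initial cost J0
```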
A.2.1 Training Algorithms
Most training algorithms are categorized into three classes: steepest descent, quasi-Newton, and conjugate gradient. The method of steepest descent makes a linear approximation of the cost function when updating the weights; in other words, it only uses first-order information about the error surface. Including a momentum term in the update equation is an attempt at using second-order information about the error surface; however, this introduces an additional parameter to be tuned, which makes the training process more complex. In conjugate-gradient and quasi-Newton methods, on the other hand, the supervised training of a multilayer perceptron is treated as a problem in numerical optimization. In these methods, higher-order information about the error surface is used in the training process, which leads to an improvement in the rate of convergence. Some commonly used training algorithms are briefly described as follows:
• The Levenberg-Marquardt (LM) algorithm uses step-size damping by regularizing the Hessian matrix and exhibits fast training in comparison with BP [50].
• In the Fletcher-Reeves conjugate gradient training algorithm (CGF), the search direction is computed from the new gradient and the previous search direction, based on the Fletcher-Reeves variation of the conjugate gradient method [86].
• BFGS quasi-Newton backpropagation is a quasi-Newton method for backpropagation that converges in few iterations but requires more computation per iteration [46].
• The scaled conjugate gradient backpropagation (SCG) algorithm updates weight and bias values according to the scaled conjugate gradient method. In this algorithm, backpropagation is used to calculate the derivatives of the performance with respect to the weight and bias variables. The scaled conjugate gradient is based on conjugate directions, but does not perform a line search at each iteration [71, 36].
• The gradient descent with momentum and adaptive learning rate backpropagation (GDX) algorithm updates weight and bias values according to gradient descent with momentum and an adaptive learning rate. For each iteration, if the performance decreases toward the goal, the learning rate is increased. If the performance increases by more than a pre-defined factor, the learning rate is reduced and the change that increased the performance is not made. This algorithm is usually much slower than the other methods; however, algorithms that converge too fast might overshoot the minimum-error point [36].
• The gradient descent with adaptive learning rate backpropagation (GDA) algorithm updates weights and biases in accordance with gradient descent with an adaptive learning rate. At each iteration, the learning rate is increased if the performance decreases toward the goal [82, 36].
• The one step secant backpropagation (OSS) algorithm updates weights and biases in accordance with the one step secant method.
In each subsequent iteration, the search direction is computed from the new gradient and the previous steps and gradients [17].

For networks with, for instance, more than about one thousand interconnection weights, it might be better to apply conjugate gradient methods, since they do not require storing large matrices.
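The accept/reject logic of the GDX-style rule described above can be sketched on a one-dimensional cost function. The constants below (momentum, increase/decrease factors, maximum performance increase) are typical defaults assumed for illustration, not values taken from the thesis experiments.

```python
def gdx_minimize(cost, grad, w, lr=0.5, mom=0.9, inc=1.05, dec=0.7,
                 max_inc=1.04, steps=100):
    """Gradient descent with momentum and adaptive learning rate (GDX-style):
    if the cost rises by more than the factor `max_inc`, the step is rejected
    and the learning rate reduced; if the cost falls, the learning rate is
    increased."""
    v, prev = 0.0, cost(w)
    for _ in range(steps):
        v_new = mom * v - lr * grad(w)      # momentum update
        w_new = w + v_new
        c = cost(w_new)
        if c > prev * max_inc:              # overshoot: discard step, damp lr
            lr *= dec
            v = 0.0
        else:                               # accept the step
            w, v = w_new, v_new
            if c < prev:                    # progress: raise lr
                lr *= inc
            prev = c
    return w

# minimize f(w) = w^2 starting from w = 3
w_star = gdx_minimize(lambda w: w * w, lambda w: 2 * w, w=3.0)
print(w_star)  # converges to the minimum at 0.0
```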
A.3 Limitations of ANNs
In the previous section, the ability of ANNs to approximate any linear or nonlinear relation between input and output data, given an appropriate architecture and weights, was introduced as an important property. However, MLPs have limitations in data analysis, which are described as follows:
• the final result depends highly on the initialization,
A. Artificial Neural Networks (ANNs) • training phase should be repeated several hundreds of times, • training process of the ANN has a stochastic nature, • the available dataset should be relatively large for effective ANN training, • the training technique has a "black box" nature, • there is a risk of overfitting, • training of ANN can take many hours or even days of CPU time. The above mentioned limitations have encouraged researchers to introduce alternative approaches and techniques such as SVM or LSSVM which don’t have such limitations in data analysis. The concept of LSSVM is elaborated in Appendix B.
Appendix B
Least Squares Support Vector Machines (LSSVM)

In this appendix, the concepts and equations of Least Squares Support Vector Machines (LSSVM) for both classification and function approximation problems are briefly explained. A more extensive treatment of this topic can be found in [100]. The SVM is a machine learning algorithm suitable for classification and regression [16]. SVM was initially developed for classification, but the theory was extended to function approximation by Drucker et al. [39]. A detailed description of the SVM theory for both classification and regression is provided in [107]. LSSVM is a simplified version of the standard SVM algorithm for classification and function estimation which retains the advantages and attributes of the original SVM theory [100]. LSSVM exhibits excellent generalization performance and has low computational cost [100]. While SVMs solve nonlinear classification and function approximation problems by means of quadratic programming, which results in high algorithmic complexity and memory requirements, LSSVM involves solving a set of linear equations [100], which speeds up the calculations. This simplicity is achieved at the expense of sparseness: all samples contribute to the model, and consequently the model often becomes unnecessarily large.
B.1 Least Squares Support Vector Machines for Classification
If the training set is considered as $\{x_i, y_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ is the $i$th input and $y_i \in \{-1, +1\}$ is the $i$th label, a linear classifier is constructed as follows:

$$y(x) = \operatorname{sign}\!\left[w^T \varphi(x) + b\right] \qquad (B.1)$$

where $w$ is the vector of model variables, $b$ is the bias term, and $\varphi(\cdot): \mathbb{R}^p \to \mathbb{R}^{n_h}$ is the mapping function which maps the input space onto the high-dimensional feature space. The goal of LSSVM is to minimize the cost function defined in Equation B.2 such that $y_i[w^T \varphi(x_i) + b] = 1 - e_i$:

$$\min_{w,b,e} J_P(w,e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \qquad (B.2)$$
where $\gamma$ is the regularization parameter. In general, $w$ might become infinite dimensional. The Lagrangian of the above-mentioned optimization problem is given by the following equation:

$$\mathcal{L}(w,b,e;\alpha) = J_P(w,e) - \sum_{i=1}^{N} \alpha_i \left\{ y_i \left[w^T \varphi(x_i) + b\right] - 1 + e_i \right\} \qquad (B.3)$$
where the $\alpha_i$ values are the Lagrange multipliers, which, owing to the equality constraints, can be positive or negative. The conditions for optimality of Equation B.3 are given by Equation B.4:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w} &= 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i) \\
\frac{\partial \mathcal{L}}{\partial b} &= 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \\
\frac{\partial \mathcal{L}}{\partial e_i} &= 0 \;\rightarrow\; \alpha_i = \gamma e_i \\
\frac{\partial \mathcal{L}}{\partial \alpha_i} &= 0 \;\rightarrow\; y_i \left[w^T \varphi(x_i) + b\right] - 1 + e_i = 0
\end{aligned} \qquad (B.4)$$
After eliminating $w$ and $e$ from the above conditions (Equation B.4), the following linear system is obtained:

$$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix} \qquad (B.5)$$

where $1_v = [1, \cdots, 1]^T$, $y = [y_1, \cdots, y_N]^T$, $e = [e_1, \cdots, e_N]^T$, $\alpha = [\alpha_1, \cdots, \alpha_N]^T$, and the kernel function can be defined as:

$$\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j) \qquad (B.6)$$

When, for instance, the radial basis function (RBF) is used as the kernel, we have $K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / \sigma^2)$. Now, the LSSVM classifier in the dual space takes the form
$$y(x) = \operatorname{sign}\!\left[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b\right] \qquad (B.7)$$
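As a concrete illustration, the linear system of Equation B.5 and the decision rule of Equation B.7 can be implemented in a few lines of NumPy. This is a sketch, not the LS-SVMlab toolbox used in this work; the function names, the choices $\gamma = 10$ and $\sigma = 1$, and the toy XOR data are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2)

def lssvm_classifier_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the linear system of Equation (B.5) for b and the alphas."""
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = y          # top row:    [0, y^T]
    M[1:, 0] = y          # left block: [y, Omega + I/gamma]
    M[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(M, rhs)
    return sol[0], sol[1:]        # b, alpha

def lssvm_classifier_predict(Xq, Xtr, ytr, alpha, b, sigma=1.0):
    """Dual-space decision rule of Equation (B.7)."""
    return np.sign(rbf_kernel(Xq, Xtr, sigma) @ (alpha * ytr) + b)

# toy XOR problem, which an RBF-kernel LSSVM separates
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b, alpha = lssvm_classifier_fit(X, y)
pred = lssvm_classifier_predict(X, X, y, alpha, b)
```

Note that, consistent with the loss of sparseness discussed above, every training sample receives a nonzero $\alpha_i$.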
B.2 Least Squares Support Vector Machines for Regression
The derivation of the LSSVM formulation for nonlinear function approximation is similar to the LSSVM classifier case. The model in the primal weight space is considered as

$$y(x) = w^T \varphi(x) + b \qquad (B.8)$$

where $x \in \mathbb{R}^p$ and $y \in \mathbb{R}$. The optimization problem in the primal weight space can be formulated as in Equation B.9, such that $y_i = w^T \varphi(x_i) + b + e_i$:

$$\min_{w,b,e} J_P(w,e) = \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^2 \qquad (B.9)$$
where $\gamma$ is the regularization parameter. In general, $w$ might become infinite dimensional. The Lagrangian of the above-mentioned optimization problem is given by the following equation:

$$\mathcal{L}(w,b,e;\alpha) = J_P(w,e) - \sum_{i=1}^{N} \alpha_i \left\{ w^T \varphi(x_i) + b + e_i - y_i \right\} \qquad (B.10)$$
where the $\alpha_i$ values are the Lagrange multipliers. The conditions for optimality of Equation B.10 are given by Equation B.11:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w} &= 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i \varphi(x_i) \\
\frac{\partial \mathcal{L}}{\partial b} &= 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i = 0 \\
\frac{\partial \mathcal{L}}{\partial e_i} &= 0 \;\rightarrow\; \alpha_i = \gamma e_i \\
\frac{\partial \mathcal{L}}{\partial \alpha_i} &= 0 \;\rightarrow\; w^T \varphi(x_i) + b + e_i - y_i = 0
\end{aligned} \qquad (B.11)$$
After eliminating $w$ and $e$ from the above conditions (Equation B.11), the following linear system is obtained:

$$\begin{bmatrix} 0 & 1_v^T \\ 1_v & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \qquad (B.12)$$

where $1_v = [1, \cdots, 1]^T$, $y = [y_1, \cdots, y_N]^T$, $\alpha = [\alpha_1, \cdots, \alpha_N]^T$, and the kernel function can be defined as:

$$\Omega_{ij} = \varphi(x_i)^T \varphi(x_j) = K(x_i, x_j) \qquad (B.13)$$

Now, the LSSVM model for function approximation in the dual space takes the form
$$y(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b \qquad (B.14)$$
where $\alpha_i$ and $b$ are the solution to the linear system (Equation B.12). It is worth mentioning that when the radial basis function (RBF) is used as the kernel, there are just two parameters $(\gamma, \sigma)$ to be tuned, which is fewer than for standard SVMs. Unlike ANNs, LSSVM has a global and unique solution, which is an important property. As mentioned before, LSSVM involves solving a set of linear equations, which speeds up the calculations. However, this simplicity is achieved at the expense of sparseness: all samples contribute to the model, and consequently the model often becomes unnecessarily large.
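The regression counterpart (Equations B.12 and B.14) admits an equally short NumPy sketch. As before, this is an illustration rather than the toolbox implementation; the function names, the values $\gamma = 100$ and $\sigma = 0.5$, and the sine toy problem are assumptions.

```python
import numpy as np

def rbf_kernel_1d(x1, x2, sigma=0.5):
    """K(x_i, x_j) = exp(-(x_i - x_j)^2 / sigma^2) for scalar inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-d2 / sigma ** 2)

def lssvm_regression_fit(x, y, gamma=100.0, sigma=0.5):
    """Solve the linear system of Equation (B.12) for b and the alphas."""
    N = len(y)
    M = np.zeros((N + 1, N + 1))
    M[0, 1:] = 1.0        # top row:    [0, 1_v^T]
    M[1:, 0] = 1.0        # left block: [1_v, Omega + I/gamma]
    M[1:, 1:] = rbf_kernel_1d(x, x, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(M, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]        # b, alpha

def lssvm_regression_predict(xq, xtr, alpha, b, sigma=0.5):
    """Dual-space model of Equation (B.14)."""
    return rbf_kernel_1d(xq, xtr, sigma) @ alpha + b

# approximate a sine and evaluate at the training points
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x)
b, alpha = lssvm_regression_fit(x, y)
y_hat = lssvm_regression_predict(x, x, alpha, b)
```

With a large $\gamma$ the residuals $e_i = \alpha_i / \gamma$ become small and the model nearly interpolates the training data.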
Appendix C
Logistic Regression

In this appendix, the concepts and equations related to logistic regression are briefly reviewed. A more extensive explanation of this topic can be found in [18].
C.1 Logistic Regression for Binary Classification
Logistic regression (LR) is a widely used classification method [18], which assumes that

$$y_i \sim \operatorname{Bernoulli}\!\left(f(w^T x_i + w_0)\right) \qquad (C.1)$$

where the $y_i$ are independent, $w$ is a vector with the same dimension as $x$, $w_0$ is a constant, and $f(\cdot)$ is the logistic function, defined as:

$$f(z) = \frac{1}{1 + e^{-z}} \qquad (C.2)$$

The output of the logistic function, as shown in Figure C.1, takes a value between zero and one.
Figure C.1: Logistic function

In binary classification problems, we intend to model the probability of a certain label given its features. That is, $P(y \mid x_i) = f(w^T x_i + w_0)$, where $x_i$ is the feature vector corresponding to the $i$th sample. The vector $w$ and the constant $w_0$ are the model parameters, which are found through maximum likelihood estimation (MLE). The MLE in the logistic regression model for binary cases is described in the next section.
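A short numerical sketch of Equations C.1 and C.2 follows; the parameter values $w$ and $w_0$ below are arbitrary illustrations, not fitted values.

```python
import numpy as np

def logistic(z):
    """f(z) = 1 / (1 + exp(-z)), Equation (C.2)."""
    return 1.0 / (1.0 + np.exp(-z))

# the output always lies strictly between zero and one
z = np.linspace(-6.0, 6.0, 101)
p = logistic(z)

# P(y = 1 | x) = f(w^T x + w0) for some illustrative parameters
w, w0 = np.array([1.5, -0.5]), 0.2
x = np.array([0.8, 1.0])
prob = logistic(w @ x + w0)
```

The symmetry $f(-z) = 1 - f(z)$ is visible in the sampled values, and $f(0) = 0.5$ marks the decision boundary.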
C.2 MLE in the Logistic Regression model
In order to estimate the parameters of the logistic regression model, maximum likelihood estimation (MLE) is used. Let us consider $\alpha_i = f(w^T x_i)$. The MLE for the parameters $w$ is:

$$w_{MLE} = \arg\max_{w} P(D \mid w) \qquad (C.3)$$

where

$$P(D \mid w) = \prod_{i=1}^{N} P(y_i \mid x_i, w) = \prod_{i=1}^{N} \alpha_i^{y_i} (1 - \alpha_i)^{1 - y_i} \qquad (C.4)$$
where each $y_i$ is a Bernoulli random variable. Since the likelihood is a product of many factors, it is easier to work with its logarithm. Taking the negative log-likelihood gives:

$$L(w) = -\log P(D \mid w) = -\sum_{i=1}^{N} \left[ y_i \log \alpha_i + (1 - y_i) \log(1 - \alpha_i) \right] \qquad (C.5)$$
Now, in order to maximize the likelihood, Equation C.5 should be minimized with respect to $w$:

$$\frac{\partial L(w)}{\partial w_j} = -\frac{\partial}{\partial w_j} \sum_{i=1}^{N} \left[ y_i \log \alpha_i + (1 - y_i) \log(1 - \alpha_i) \right] \qquad (C.6)$$

For a single sample with $\alpha = f(w^T x)$,

$$\frac{\partial}{\partial w_j} \log \alpha = \frac{\partial}{\partial w_j} \log \frac{1}{1 + e^{-w^T x}} = \frac{x_j \, e^{-w^T x}}{1 + e^{-w^T x}} = x_j (1 - \alpha) \qquad (C.7)$$

and

$$\frac{\partial}{\partial w_j} \log(1 - \alpha) = \frac{\partial}{\partial w_j} \left[ -w^T x - \log(1 + e^{-w^T x}) \right] = -x_j + x_j (1 - \alpha) = -\alpha x_j \qquad (C.8)$$

By substituting Equations C.7 and C.8 in Equation C.6, and writing $\alpha = (\alpha_1, \ldots, \alpha_N)^T$ and $y = (y_1, \ldots, y_N)^T$:

$$\frac{\partial L(w)}{\partial w_j} = \sum_{i=1}^{N} (\alpha_i - y_i) x_{ij}, \qquad \nabla_w L = A^T (\alpha - y) \qquad (C.9)$$

where $x_i = (x_{i1}, \ldots, x_{id})^T$ and $A$ is the design matrix, defined as:

$$A = \begin{bmatrix} x_{1,1} & \cdots & x_{1,d} \\ x_{2,1} & \cdots & x_{2,d} \\ \vdots & \ddots & \vdots \\ x_{N,1} & \cdots & x_{N,d} \end{bmatrix}$$
Since $\alpha_i$ is a nonlinear function of $w$, Equation C.9 cannot simply be set to zero to find the optimum. However, if the Hessian matrix is computed and shown to be positive semidefinite, it can be concluded that the function $L$ in Equation C.5 is convex, so Newton's method can be utilized to find the MLE. The Hessian is

$$\frac{\partial^2 L(w)}{\partial w_j \partial w_k} = \sum_{i=1}^{N} x_{ij} \frac{\partial \alpha_i}{\partial w_k} = \sum_{i=1}^{N} x_{ij} x_{ik} \, \alpha_i (1 - \alpha_i) = z_j^T B z_k, \qquad \nabla_w^2 L = A^T B A \qquad (C.10)$$

where $z_j = (x_{1j}, \ldots, x_{Nj})^T$ and $B$ is a diagonal matrix defined as:

$$B = \begin{bmatrix} \alpha_1 (1 - \alpha_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \alpha_N (1 - \alpha_N) \end{bmatrix}$$
Since each $\alpha_i$ is positive and smaller than one, the matrix $B$ is positive semidefinite, and

$$\nabla_w^2 L = A^T B A = A^T B^{1/2} B^{1/2} A = (B^{1/2} A)^T (B^{1/2} A) \qquad (C.11)$$
Consequently, the Hessian is positive semidefinite and $L$ is a convex function. Now, Newton's method, which in this setting is called iteratively reweighted least squares (IRLS), can be applied. Newton's method for $d$-dimensional data is defined as

$$w_{t+1} = w_t - H^{-1} g = (A^T B A)^{-1} A^T B \left[ A w_t - B^{-1} (\alpha - y) \right] = (A^T B A)^{-1} A^T B z_t \qquad (C.12)$$

where $H$ and $g$ are the Hessian matrix and the gradient vector, respectively, both evaluated at $w_t$, and $z_t = A w_t - B^{-1} (\alpha - y)$. Equation C.12 is the solution of the weighted least squares problem:

$$\arg\min_{w} \sum_{i=1}^{N} b_i \left( z_i - w^T x_i \right)^2 \qquad (C.13)$$

where the weights are $b_i = \alpha_i (1 - \alpha_i)$, the diagonal entries of $B$.
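The IRLS update of Equation C.12 can be sketched directly in NumPy. This is an illustration under stated assumptions, not a reference implementation: the function names and the toy data are hypothetical, a leading column of ones in the design matrix absorbs $w_0$, and a tiny ridge term and a floor on the weights are added purely for numerical safety.

```python
import numpy as np

def sigmoid(z):
    """The logistic function f of Equation (C.2)."""
    return 1.0 / (1.0 + np.exp(-z))

def irls_logistic(A, y, n_iter=25):
    """Iteratively reweighted least squares, Equation (C.12).

    A is the N x (d+1) design matrix whose first column of ones
    absorbs the intercept w0.
    """
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        alpha = sigmoid(A @ w)               # alpha_i = f(w^T x_i)
        b = alpha * (1.0 - alpha)            # IRLS weights b_i
        B = np.diag(b)
        # working response z_t = A w_t - B^{-1} (alpha - y),
        # with a floor on b to avoid division blow-up
        z = A @ w - (alpha - y) / np.maximum(b, 1e-10)
        # weighted least-squares step; tiny ridge for numerical safety
        H = A.T @ B @ A + 1e-8 * np.eye(A.shape[1])
        w = np.linalg.solve(H, A.T @ B @ z)
    return w

# toy 1-D problem with overlapping classes, so the MLE is finite
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])
A = np.column_stack([np.ones_like(x), x])
w = irls_logistic(A, y)
```

On separable data the MLE does not exist (the weights diverge), which is why the toy labels above overlap.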
C.3 Advantages and Limitations of the Logistic Regression
Logistic regression, which is also a foundation of more complex methods like neural networks, has the advantage of being interpretable. Considering $w^T x_i = w_0 + w_1 x_{i1} + w_2 x_{i2} + \cdots + w_d x_{id}$, the coefficients give information about how the individual variables affect the probability and which of them is more influential in the model. For instance, a positive coefficient $w_j$ means that the probability of a sample belonging to a certain class increases as the variable $x_j$ increases. Another advantage of logistic regression is the small number of parameters, which makes parameter estimation easier in a statistical sense. The number of parameters is $(d+1)$, where $d$ is the dimension of the input data; that is, the number of parameters increases only linearly with the dimension of the data. This method is also computationally efficient for estimating the parameters of the model. On the other hand, the performance of this method depends on the problem, which means that it does not necessarily show the best performance.
Bibliography

[1] J. Ajmera and F. Burkhardt. Age and gender classification using modulation cepstrum. In Proc. Odyssey, 2008.
[2] J. D. Amerman and M. M. Parnell. Speech timing strategies in elderly adults. Journal of Voice, 20:65–67, 1992.
[3] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817–1853, 2005.
[4] H. Arsikere, G. Leung, S. Lulich, and A. Alwan. Automatic height estimation using the second subglottal resonance. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 3989–3992, March 2012.
[5] H. Arsikere, G. Leung, S. Lulich, and A. Alwan. Automatic estimation of the first three subglottal resonances from adults' speech signals with application to speaker height estimation. Speech Communication, 55(1):51–70, 2013.
[6] M. H. Bahari. Automatic Speaker Characterization: Automatic Identification of Gender, Age, Language and Accent from Speech Signals. PhD thesis, KU Leuven – Faculty of Engineering Science, Belgium, May 2014.
[7] M. H. Bahari, N. Dehak, and H. Van hamme. Gaussian mixture model weight supervector decomposition and adaptation. Internal report, Speech Group, 2013.
[8] M. H. Bahari, N. Dehak, H. Van hamme, L. Burget, A. Ali, and J. Glass. Non-negative factor analysis of Gaussian mixture model weight adaptation for language and dialect recognition. Audio, Speech, and Language Processing, IEEE/ACM Transactions on, 22(7):1117–1129, July 2014.
[9] M. H. Bahari, M. McLaren, H. Van hamme, and D. Van Leeuwen. Age estimation from telephone speech using i-vectors. In Proc. Interspeech, pages 506–509, 2012.
[10] M. H. Bahari, M. McLaren, H. Van hamme, and D. Van Leeuwen. Speaker age estimation using i-vectors. Journal of Engineering Applications of Artificial Intelligence, 34:99–108, 2014.
[11] M. H. Bahari, R. Saeidi, H. Van hamme, and D. van Leeuwen. Accent recognition using i-vector, Gaussian mean supervector and Gaussian posterior probability supervector for spontaneous telephone speech. In Proceedings ICASSP 2013, pages 7344–7348, 2013.
[12] M. H. Bahari and H. Van hamme. Speaker age estimation and gender detection based on supervised non-negative matrix factorization. In Proc. IEEE Workshop on Biometric Measurements and Systems for Security and Medical Applications (BIOMS), pages 1–6, 2011.
[13] M. H. Bahari and H. Van hamme. Speaker age estimation using hidden Markov model weight supervectors. In 11th IEEE Int. Conf. Information Science, Signal Processing and their Applications (ISSPA), pages 517–521, 2012.
[14] B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. Journal of Machine Learning Research, 4:83–99, 2003.
[15] A. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
[16] D. Basak, S. Pal, and D. Patranabis. Support vector regression. Neural Information Processing – Letters and Reviews, 11(10):203–224, 2007.
[17] R. Battiti. First- and second-order methods for learning: Between steepest descent and Newton's method. Journal of Neural Computation, 4(2):141–166, 1992.
[18] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[19] T. Bocklet, A. Maier, and E. Noth. Age determination of children in preschool and primary school age with GMM-based supervectors and support vector machines regression. In Proc. Text, Speech and Dialogue, pages 253–260, 2008.
[20] T. Bocklet, G. Stemmer, V. Zeissler, and E. Noth. Age and gender recognition based on multiple systems – early vs. late fusion. In Proc. 11th Annual Conference of the International Speech Communication Association, pages 2830–2833, 2010.
[21] A. Braun and T. Rietveld. The influence of smoking habits on perceived age. In Proc. ICPhS, volume 95, pages 294–297, 1995.
[22] N. Brummer. FoCal Multi-class: Toolkit for Evaluation, Fusion and Calibration of Multi-class Recognition Scores, 2007.
[23] N. Brummer, L. Burget, J. H. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. van Leeuwen, P. Matejka, P. Schwarz, and A. Strasheim. Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006. IEEE Trans. Audio, Speech, and Lang. Process., 15(7):2072–2084, 2007.
[24] N. Brummer and D. Van Leeuwen. On calibration of language recognition scores. In IEEE Odyssey Speaker and Language Recognition Workshop, pages 1–8, 2006.
[25] W. Campbell, D. Sturim, and D. Reynolds. Support vector machines using GMM supervectors for speaker verification. IEEE Signal Processing Letters, 13(5):308–311, 2006.
[26] R. Caruana. Multi-task learning. Journal of Machine Learning, 28:41–75, 1997.
[27] R. Caruana. Multitask Learning. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, 1997.
[28] C.-C. Chen, P.-T. Lu, M.-L. Hsia, J.-Y. Ke, and O.-C. Chen. Gender-to-age hierarchical recognition for speech. In Circuits and Systems (MWSCAS), 2011 IEEE 54th International Midwest Symposium on, pages 1–4, 2011.
[29] J. R. Cohen, T. H. Crystal, A. S. House, and E. P. Neuburg. Weighty voices and shaky evidence: a critique. Journal of the Acoustical Society of America, 68:1884–1886, 1980.
[30] C. Darwin. The Descent of Man and Selection in Relation to Sex. London: Murray, 1871.
[31] K. De Brabanter, P. Karsmakers, F. Ojeda, C. Alzate, J. De Brabanter, K. Pelckmans, B. De Moor, J. Vandewalle, and J. A. K. Suykens. LS-SVMlab1.8 toolbox. http://www.esat.kuleuven.be/sista/lssvmlab/.
[32] N. Dehak. Discriminative and Generative Approaches for Long- and Short-Term Speaker Characteristics Modeling: Application to Speaker Verification. PhD thesis, Ecole de Technologie Superieure de Montreal, Montreal, QC, Canada, 2009.
[33] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet. Front-end factor analysis for speaker verification. IEEE Trans. Audio, Speech, and Lang. Process., 19(4):788–798, 2011.
[34] N. Dehak, O. Plchot, M. H. Bahari, and H. Van hamme. GMM weights adaptation based on subspace approaches for speaker verification. In Speaker Odyssey, the Speaker and Language Recognition Workshop. Submitted, 2014.
[35] N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, and R. Dehak. Language recognition via i-vectors and dimensionality reduction. In Proc. Interspeech, pages 857–860, 2011.
[36] H. Demuth, M. Beale, and M. Hagan. Neural Network Toolbox User's Guide, 2009.
[37] G. Dobry, R. Hecht, M. Avigal, and Y. Zigel. Dimension reduction approaches for SVM based speaker age estimation. In Proc. Interspeech, pages 2031–2034, 2009.
[38] G. Dobry, R. M. Hecht, M. Avigal, and Y. Zigel. Supervector dimension reduction for efficient speaker age estimation based on the acoustic speech signal. IEEE Trans. Audio, Speech, and Lang. Process., 19(7):1975–1985, 2011.
[39] H. Drucker, C. J. C. Burges, L. Kaufman, A. Smola, and V. Vapnik. Support vector regression machines. Advances in Neural Information Processing Systems, pages 155–161, 1997.
[40] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
[41] G. Fant. Acoustic Theory of Speech Production. The Hague: Mouton, 1960.
[42] T. W. Fitch. Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. Journal of the Acoustical Society of America, 102:1213–1222, 1997.
[43] J. A. Freeman and D. M. Skapura. Neural Networks: Algorithms, Applications and Programming Techniques. Addison-Wesley Publishing Company, 1991.
[44] T. Ganchev, I. Mporas, and N. Fakotakis. Audio features selection for automatic height estimation from speech. Artificial Intelligence: Theories, Models and Applications, Lecture Notes in Computer Science, 6040:81–90, 2010.
[45] H. R. Gilbert and G. G. Weismer. The effects of smoking on the speaking fundamental frequency of adult women. Journal of Psycholinguistic Research, 3(3):225–231, 1974.
[46] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Emerald, 1981.
[47] U. G. Goldstein. An articulatory model for the vocal tracts of growing children. PhD thesis, Massachusetts Institute of Technology, 1980.
[48] J. Gonzalez. Formant frequencies and body size of speaker: a weak relationship in adult humans. Journal of Phonetics, 32:277–287, 2004.
[49] J. Gonzalez and A. Carpi. Early effects of smoking on the voice: A multidimensional study. Med. Sci. Monit., pages 649–656, 2004.
[50] M. T. Hagan and M. Menhaj. Training feed-forward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks, 5(6):989–993, 1994.
[51] A. Hatch, S. Kajarekar, and A. Stolcke. Within-class covariance normalization for SVM-based speaker recognition. In Proc. Interspeech, volume 4, 2006.
[52] S. Haykin. Adaptive Filter Theory. Prentice Hall, 1996.
[53] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, New Jersey, 2nd edition, 1999.
[54] K. Hornik. Approximation capabilities of multilayer feedforward networks. Journal of Neural Networks, 4(2):251–257, 1991.
[55] M. Ichino, N. Komatsu, W. Jian-Gang, and Y. W. Yun. Speaker gender recognition using score level fusion by AdaBoost. In Proc. of 11th International Conference on Control Automation Robotics & Vision (ICARCV), pages 648–653, 2010.
[56] P. Kenny, G. Boulianne, and P. Dumouchel. Eigenvoice modeling with sparse training data. IEEE Transactions on Speech and Audio Processing, 13(3):345–354, 2005.
[57] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel. A study of interspeaker variability in speaker verification. IEEE Trans. Audio, Speech, and Lang. Process., 16(5):980–988, 2008.
[58] M. Kockmann, L. Burget, and J. Cernocky. Brno University of Technology system for Interspeech 2010. In Proc. 11th Annual Conference of the International Speech Communication Association, pages 2822–2825, 2010.
[59] H. J. Kunzel. How well does average fundamental frequency correlate with speaker height and weight? Journal of Phonetica, 46:117–125, 1989.
[60] N. J. Lass and W. S. Brown. Correlational study of speakers' heights, weights, body surface areas, and speaking fundamental frequencies. Journal of the Acoustical Society of America, 63:1218–1220, 1978.
[61] N. J. Lass and M. Davis. An investigation on speaker height and weight identification. Journal of the Acoustical Society of America, 60:700–703, 1976.
[62] M. Li, K. J. Han, and S. Narayanan. Automatic speaker age and gender recognition using acoustic and prosodic level information fusion. Computer Speech and Language, 27(1):151–167, 2013.
[63] Y. Lu, F. Lu, S. Sehgal, S. Gupta, J. Du, C. H. Tham, P. Green, and V. Wan. Multitask learning in connectionist speech recognition. In Proc. of 10th Australian International Conference on Speech Science & Technology, Sydney, Australia, 2004.
[64] B. Ma, H. M. Meng, and M. W. Mak. Effects of device mismatch, language mismatch and environmental mismatch on speaker verification. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages 298–301, 2007.
[65] D. J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
[66] D. Mahmoodi, A. Soleimani, H. Marvi, F. Razzazi, M. Taghizadeh, and M. Mahmoodi. Age estimation based on speech features and support vector machine. In 3rd Computer Science and Electronic Engineering Conference, pages 60–64, 2011.
[67] D. Martinez, O. Plchot, L. Burget, O. Glembek, and P. Matejka. Language recognition in i-vectors space. In Proceedings of Interspeech, Firenze, Italy, pages 861–864, 2011.
[68] M. McLaren and D. van Leeuwen. A simple and effective speech activity detection algorithm for telephone and microphone speech. In Proc. NIST SRE Workshop, 2011.
[69] H. Meinedo and I. Trancoso. Age and gender classification using fusion of acoustic and prosodic features. In Proc. INTERSPEECH, pages 2818–2821, 2010.
[70] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Muller, R. Huber, B. Andrassy, J. Bauer, et al. Comparison of four approaches to age and gender recognition for telephone applications. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–1089, 2007.
[71] M. F. Moller. A scaled conjugate gradient algorithm for fast supervised learning. Journal of Neural Networks, 6:525–533, 1993.
[72] E. S. Morton. On the occurrence and significance of motivation-structural rules in some bird and mammal sounds. American Naturalist, 111:855–869, 1977.
[73] C. Müller, editor. Application of Speaker Classification in Human Machine Dialog Systems, volume 4343. Springer-Verlag Berlin Heidelberg, 2007.
[74] C. Müller, editor. Speaker Classification in Forensic Phonetics and Acoustics, volume 4343. Springer-Verlag Berlin Heidelberg, 2007.
[75] C. Muller and F. Burkhardt. Combining short-term cepstral and long-term pitch features for automatic recognition of speaker age. In Proc. INTERSPEECH, pages 2277–2280, 2007.
[76] J. Muller. The Physiology of the Senses, Voice and Muscular Motion with Mental Faculties. London: Walton and Maberly, 1848.
[77] Y. Muthusamy, E. Barnard, and R. Cole. Reviewing automatic language identification. Signal Processing Magazine, IEEE, 11(4):33–41, 1994.
[78] V. E. Negus. The Comparative Anatomy and Physiology of the Larynx. Hafner, New York, 1949.
[79] G. Neiman and J. A. Applegate. Accuracy of listener judgments of perceived age relative to chronological age in adults. Folia Phoniatr Logop, 42:327–330, 1990.
[80] J. Pelecanos and S. Sridharan. Feature warping for robust speaker verification. pages 213–218, 2001.
[81] B. L. Pellom and J. H. L. Hansen. Voice analysis in adverse conditions: the centennial olympic park bombing 911 call. In Proc. of the 40th Midwest Symposium on Circuits and Systems, 1997.
[82] V. P. Plagianakos, D. Sotiropoulos, and M. N. Vrahatis. An improved backpropagation method with adaptive learning rate. Technical report, University of Patras, Department of Mathematics, 1998.
[83] A. H. Poorjam, M. H. Bahari, V. Vasilakakis, and H. Van hamme. Height estimation from speech signals using i-vectors and least-squares support vector regression. In Proc. 37th International Conference on Telecommunications and Signal Processing, 2014.
[84] L. A. Ramig and R. Ringel. Effects of physiological aging on selected acoustic characteristics of voice. Journal of Speech, Language and Hearing Research, 26(1):22–30, 1983.
[85] W. J. Ryan. Acoustic aspects of the aging voice. Journal of Gerontology, 27:256–268, 1972.
[86] L. E. Scales. Introduction to Non-Linear Optimization. Springer-Verlag, 1985.
[87] S. Schotz. Perception, Analysis and Synthesis of Speaker Age. PhD thesis, Department of Linguistics and Phonetics, Centre for Languages and Literature, Lund University, 2006.
[88] S. Schotz. Perception, analysis and synthesis of speaker age, volume 47. Citeseer, 2006.
[89] S. Schotz and C. Muller. A study of acoustic correlates of speaker age. Speaker Classification II, Lecture Notes in Computer Science, 4441:1–9, 2007.
[90] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. Müller, and S. Narayanan. Paralinguistics in speech and language – state-of-the-art and the challenge. Computer Speech & Language, 27(1):4–39, 2013.
[91] B. Schuller, S. Steidl, A. Batliner, A. Vinciarelli, K. Scherer, F. Ringeval, M. Chetouani, F. Weninger, F. Eyben, E. Marchi, et al. The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism. In Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association, 2013.
[92] B. Schuller, M. Wöllmer, F. Eyben, G. Rigoll, and D. Arsic. Semantic speech tagging: Towards combined analysis of speaker traits. In Audio Engineering Society Conference: 42nd International Conference: Semantic Audio, July 2011.
[93] T. Shipp and H. Hollien. Perception of the aging male voice. Journal of Speech, Language and Hearing Research, 12(4):703–710, 1969.
[94] S. Shum, N. Dehak, R. Dehak, and J. Glass. Unsupervised speaker adaptation based on the cosine similarity for text-independent speaker verification. In Proc. Odyssey, 2010.
[95] E. Singer, P. Torres-Carrasquillo, D. Reynolds, A. McCree, F. Richardson, N. Dehak, and D. Sturim. The MITLL NIST LRE 2011 language recognition system. Speaker Odyssey 2012, pages 209–215, 2012.
[96] A. Smola and B. Scholkopf. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.
[97] J. A. Snyman. Practical Mathematical Optimization: An Introduction to Basic Optimization Theory and Classical and New Gradient-Based Algorithms, volume 97. Springer Science+Business Media, 2005.
[98] D. Sorensen and Y. Horii. Cigarette smoking and voice fundamental frequency. Journal of Communication Disorders, 15(2):135–144, 1982.
[99] W. Spiegl, G. Stemmer, E. Lasarcyk, V. Kolhatkar, A. Cassidy, B. Potard, S. Shutn, Y. Song, P. Xu, P. Beyerlein, J. Harnsberger, and E. Noth. Analyzing features for automatic age estimation on cross-sectional data. In Proc. INTERSPEECH, pages 2923–2926, 2009.
[100] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, 2002.
[101] D. C. Tanner and M. E. Tanner. Forensic Aspects of Speech Patterns: Voice Prints, Speaker Profiling, Lie and Intoxication Detection. Lawyers and Judges Publishing, 2004.
[102] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Proc. of CVPR, pages 762–769, 2004.
[103] W. A. Van Dommelen. Speaker height and weight identification: a re-evaluation of some old data. Journal of Phonetics, 21:337–341, 1993.
[104] W. A. Van Dommelen and B. H. Moxness. Acoustic parameters in speaker height and weight identification: sex-specific behaviour. Language and Speech, 38:267–287, 1995.
[105] C. van Heerden, E. Barnard, M. Davel, C. van der Walt, E. van Dyk, M. Feld, and C. Muller. Combining regression and classification methods for improving automatic speaker age recognition. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 5174–5177, 2010.
[106] D. van Leeuwen and M. H. Bahari. Calibration of probabilistic age recognition. In Proc. Interspeech, Portland, USA, 2012.
[107] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 2000.
[108] D. Ververidis and C. Kotropoulos. Automatic speech classification to five emotional states based on gender information. In Proc. of 12th European Signal Processing Conference, pages 341–344, 2004.
[109] I. Vincent and H. Gilbert. The effects of cigarette smoking on the female voice. Logopedics Phoniatrics Vocology, 37(1):22–32, 2012.
[110] T. Vogt and E. Andre. Improving automatic emotion recognition from speech via gender differentiation. In Proc. of Language Resources and Evaluation Conference, 2006.
[111] F. Weninger, E. Marchi, and B. Schuller. Improving recognition of speaker states and traits by cumulative evidence: Intoxication, sleepiness, age and gender. In Proc. INTERSPEECH, 2012.
[112] M. Wolters, R. Vipperla, and S. Renals. Age recognition for spoken dialogue systems: Do we need it? In Proc. INTERSPEECH, pages 1435–1438, 2009.
[113] S. A. Xue and D. Deliyski. Effects of aging on selected acoustic voice parameters: Preliminary normative data and educational implications. Journal of Educational Gerontology, 21:159–168, 2001.
[114] R. Yager. An extension of the naive Bayesian classifier. Information Sciences, 176(5):577–588, 2006.
[115] X. Zhang, K. Demuynck, and H. Van hamme. Rapid speaker adaptation in latent speaker space with non-negative matrix factorization. Speech Communication, 2013.
[116] M. A. Zissman and K. M. Berkling. Automatic language identification. Speech Communication, 35(1):115–124, 2001.
[117] M. Zweig and G. Campbell. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin. Chem., 39(4):561–577, 1993.