IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 58, NO. 2, FEBRUARY 2011
Automatic Detection of Pathological Voices Using Complexity Measures, Noise Parameters, and Mel-Cepstral Coefficients

Julián D. Arias-Londoño*, Student Member, IEEE, Juan I. Godino-Llorente, Senior Member, IEEE, Nicolás Sáenz-Lechón, Víctor Osma-Ruiz, and Germán Castellanos-Domínguez
Abstract—This paper proposes a new approach to increase the amount of information extracted from speech, aiming to improve the accuracy of a system developed for the automatic detection of pathological voices. The paper addresses the discrimination capabilities of 11 features extracted using nonlinear analysis of time series. Two of these features are based on conventional nonlinear statistics (largest Lyapunov exponent and correlation dimension), two are based on recurrence and fractal-scaling analysis, and the remaining seven are based on different estimations of the entropy. Moreover, this paper uses a strategy based on combining classifiers to fuse the nonlinear analysis with the information provided by classic parameterization approaches found in the literature (noise parameters and mel-frequency cepstral coefficients). The classification was carried out in two steps, using first a generative and later a discriminative approach. Combining both classifiers, the best accuracy obtained is 98.23% ± 0.001.

Index Terms—Combining classifiers, Gaussian mixture models (GMMs), nonlinear analysis, pathological voices, support vector machines (SVMs).
Manuscript received February 23, 2010; revised June 8, 2010 and September 27, 2010; accepted August 22, 2010. Date of publication October 21, 2010; date of current version January 21, 2011. This work was supported by the Spanish Ministry of Education under Grant TEC2006-12887-C02 and by the Convocatoria de apoyo a doctorados nacionales 2007—COLCIENCIAS. Asterisk indicates corresponding author.
*J. D. Arias-Londoño is with the Department ICS, Universidad Politécnica de Madrid, Madrid 28031, Spain, and also with GC&PDS, Universidad Nacional de Colombia, Manizales, Colombia (e-mail: [email protected]).
J. I. Godino-Llorente, N. Sáenz-Lechón, and V. Osma-Ruiz are with the Department ICS, Universidad Politécnica de Madrid, Madrid 28031, Spain (e-mail: [email protected]; [email protected]; [email protected]).
G. Castellanos-Domínguez is with the Department of Electrical, Electronic, and Computational Engineering, Universidad Nacional de Colombia, Manizales, Colombia (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBME.2010.2089052

I. INTRODUCTION

Research on automatic systems to assess voice disorders has received considerable attention in the past few years due to its objectivity and noninvasive nature. Much of the work done in this area is based on the use of acoustic parameters: amplitude and frequency perturbation parameters, noise parameters, and mel-frequency cepstral coefficients (MFCC) [3]. However, several researchers have pointed out that voice production involves some nonlinear processes that cannot be characterized
by the mentioned measures. Such behavior is due to the nonlinear pressure flow in the glottis, the delayed feedback of the mucosal wave, the nonlinear stress–strain curves of vocal fold tissues, and the nonlinearities associated with vocal fold collision [4]. Titze et al. [5] introduced a qualitative classification of speech sounds corresponding to sustained vowels, taking into account their nonlinear dynamics. The authors established three classes: Type I sounds are nearly periodic, Type II sounds do not have a single dominant period, and Type III sounds are irregular and aperiodic. Normal voices can usually be classified as Type I and, sometimes, as Type II, whereas voice disorders commonly lead to any of the three classes [6]. Besides, the conventional perturbation parameters (such as shimmer and jitter) are defined only for nearly periodic voice signals, and thus their usefulness is limited for Type II and III signals [7]. In this sense, some researchers have been interested in applying nonlinear time series analysis to disordered speech, attempting to characterize the nonlinear phenomena and to evaluate the discriminative capabilities of these measures for the detection of pathological voices (see [7] and references therein). The nonlinear analysis of time series is derived from the theory of dynamical systems, and is usually carried out using two statistics: the largest Lyapunov exponent (LLE) and the correlation dimension (CD). LLE attempts to quantify the sensitivity to initial conditions of the underlying system [8]. CD was developed to quantify the geometry (self-similarity) of the state space of the underlying system [8]. Previous works have investigated the behavior of LLE and CD for the characterization of pathological voices. In [9], CD was used to describe the complexity of speech signals uttered by normal speakers and by patients with vocal polyps. From each utterance, the CD was estimated using frames of 200 ms.
The authors demonstrated that CD values from normal and pathological speakers have statistically significant differences, concluding that nonlinear analysis can be used as a supplementary method to evaluate and detect laryngeal pathologies. Zhang and Jiang [10] used the CD to discriminate among the three types of speech signals in the aforementioned classification by Titze et al. [5]. Their database contained different types of pathologies, but speech signals with strong glottal pulse noise were excluded. Unlike the previous work, the estimation of the CD was carried out on frames of 500 ms. Again, the authors concluded that CD tends to increase from Type I to Type III signals, but no classification rate was presented. Similar studies [11]–[15] used CD to characterize pathological
0018-9294/$26.00 © 2010 IEEE
voices before and after a clinical treatment, leading to similar conclusions. In [16], CD and LLE, along with other complexity measures, were used to characterize speech recordings extracted from the Massachusetts Eye and Ear Infirmary (MEEI) voice disorders database [17]. The authors performed different classification experiments using support vector machines (SVM) with a polynomial kernel. Eighty percent of the speech recordings were employed to train the SVM and the remaining 20% to validate the system; only one run of the training was used to estimate the accuracy. The authors obtained an accuracy of up to 94.4% using CD alone. In [1], the LLE was used to differentiate between normal voices and patients with unilateral laryngeal paralysis, and statistically significant differences were found between both groups. Although LLE and CD have shown certain discrimination capabilities, such nonlinear statistics require the dynamics of speech to be purely deterministic, and this assumption is inadequate, since randomness due to turbulence is an inherent part of speech production [6]. Besides, these measures have been used under the assumption of either the presence of chaos or a completely random behavior. However, in [18], Pincus demonstrated that there exist stochastic processes with CD equal to zero and that, in general, it is not valid to infer the presence of an underlying deterministic system from the convergence of the algorithms designed to estimate these measures. In the case of LLE, many analyses rely on the fact that a system containing at least one positive Lyapunov exponent is generally defined as chaotic, whereas a system with no positive exponents is regular. From this assumption, other authors concluded that an irregular phonation presents chaotic dynamics [7].
Nevertheless, the sign of the Lyapunov exponent does not present statistically significant differences [19] and, furthermore, using LLE to differentiate between normal and pathological voices leads to positive values for both classes [1], [7]. There are also numerical and algorithmic problems associated with the calculation of nonlinear measures for speech signals, casting doubt on the reliability of such tools for developing pathological voice detection systems [6]. To overcome these restrictions, the literature reports a set of features based on information theory. Such measures attempt to quantify the signal complexity without making assumptions about the nature of the signal (i.e., deterministic or stochastic). This idea is in concordance with the fact that the time series generated by biological systems most likely contain both deterministic and stochastic components; therefore, both approaches may provide complementary information about the underlying dynamics [20]. The most common measure used in this context is the approximate entropy (AE) [18], [21], together with other measures derived from it, such as the sample entropy (SE) [22] and the Gaussian kernel approximate entropy (GAE) [23]. AE is a regularity statistic that quantifies the unpredictability of the fluctuations in a time series, and reflects the likelihood that similar patterns of observations will not be followed by additional similar observations [24]. This class of measures provides a better parameterization of the nonlinear behavior [25], but its use in the context of pathological speech has not been extensively explored. In [26] and [27], AE was
applied to quantify the effects of radiotherapy in patients with laryngeal cancer, concluding that AE can be used to differentiate healthy subjects from patients who underwent radiotherapy. Again, in [28], AE was used jointly with a scaling parameter to classify between normal and pathological voices. The authors concluded that AE is an effective tool to classify vocal fold disorders, but no results were provided in terms of performance. In [29], another set of entropy-based measures was used for the automatic detection of pathological voices. The study used two voice disorder databases. Among the features used were the Shannon entropy, the first- and second-order Renyi entropies, the correlation entropy, and CD. The results showed a very high classification accuracy using the MEEI database (99.6%). However, the accuracy obtained casts some doubt, due to a possible bias in the estimation of the Shannon entropy. Each single feature provided a detection error above 40%, except for CD and the Shannon entropy, indicating an important contribution of these two features to the final accuracy. However, for the second database used in [25], the Shannon entropy yields a detection error of around 43%, which seems inconsistent. In addition, the Shannon entropy must theoretically be equal to the first-order Renyi entropy; therefore, their combination becomes redundant. Moreover, and again using the MEEI database, the accuracy obtained with the first-order Renyi entropy is very different from that obtained with the Shannon entropy, but such a difference does not appear with the second database. Additionally, the Shannon entropy for normal voices and patients with nodules was also estimated in [30] and [31], and the values of the parameters and the classification accuracy obtained were very different from those reported in [29].
On the other hand, the lengths of the normal and pathological recordings in the MEEI database are very different (normal voices are around 3 s long, whereas pathological ones are around 1 s long). Bearing in mind that the Shannon entropy is the only feature in [29] estimated using the whole recording, the conclusion is that the results could be biased by the different lengths of the recordings. In [6], Little et al. characterized the deterministic and stochastic dynamics of speech. The deterministic behavior was characterized by a measure from recurrence analysis, and the stochastic components by means of fractal-scaling analysis. This approach reached an accuracy of 91.8% in detecting pathological voices. No comparison with approximate entropy measures was given. In [32], Little et al. applied the same analysis to healthy speakers and to pathological speakers with unilateral vocal fold paralysis evaluated pre- and postsurgery. The measures based on recurrence and fractal-scaling analysis were compared with CD and conventional perturbation parameters. The study used a small database (17 pathological and 11 normal speakers). No classification results were given, but the authors concluded that the nonlinear methods are more stable and reproducible than conventional perturbation parameters. On the other hand, the estimation of complexity measures requires the reconstruction of a state space (i.e., the embedding attractor) from a time series. From a pattern recognition point of view, complexity measures such as AE, SE, and GAE use a nonparametric estimate of the probability mass function of the embedding attractor, obtained by a Parzen-window method with
a Gaussian or rectangular kernel [33]. They only attempt to quantify the divergence of the trajectories of the attractor, but do not take into account the directions of divergence. In this paper, we used a discrete hidden Markov model (DHMM) to estimate a nonparametric density function of the attractor. The aim is to characterize the divergence of the trajectories and its directions in the state space in terms of the transitions between the regions provided by the DHMM. This scheme was used to estimate two empirical entropy (EE) measures [34]. The discriminative capabilities of the proposed features are studied and compared throughout this paper, both individually and as a complement to different nonlinear features. The results are compared with those obtained using noise parameters and MFCC [35] together. Noise parameters have proven to be reliable for detecting the presence of voice disorders, since most voices present some degree of noise in the presence of pathology. The harmonics-to-noise ratio (HNR) [36], the normalized noise energy (NNE) [37], and the glottal-to-noise excitation ratio (GNE) [38] have been widely used both to evaluate voice quality and to detect voice disorders. Moreover, in order to improve the accuracy of the automatic detection of pathological voices, a two-step strategy has been followed, combining generative and discriminative classifiers fed with the aforementioned nonlinear measures and a classic analysis based on fusing noise parameters and MFCC.

II. METHODOLOGY

Fig. 1. General scheme of the system developed for the automatic detection of pathological voices.

Fig. 1 depicts a block diagram with the overall scheme of the system used in this paper. Prior to classification, the speech signal was divided into frames. On one hand, each window was parameterized following a classic approach based on the NNE, GNE, HNR, and 12 MFCC. These features were used before in [38] for the same task, leading to good results, and are used in this paper as a baseline for comparison.
In this case, the speech signal was framed and windowed using 40-ms Hamming windows with a 50% frame shift. Therefore, the feature vector extracted for each frame is built up by concatenating 12 fast Fourier transform (FFT)-based MFCC and three noise parameters: HNR, NNE, and GNE. On the other hand, the embedding attractor extracted from each speech frame was parameterized using nonlinear analysis. For each embedding attractor, a set of 11 complexity measures was estimated. The values of the points in the attractor were normalized into the [0, 1] interval; because the complexity measures are based on the distances among different points of the attractor, this normalization allows the bounds of these measures to be controlled. A first decision about the presence or absence of pathology for each speech signal was taken from the outputs given by a generative classifier. Such a classifier was based on Gaussian mixture models (GMM), which have previously been used for the same task with good results [39]. Each voice recording was characterized with as many vectors as frames extracted from it. Following the scheme depicted in Fig. 1, one set of GMM (i.e., one GMM for pathological and another for normal voices) was trained using the complexity measures, and another set of GMM was trained using the aforementioned combination of noise parameters and MFCC. The first decision about the presence or absence of pathology for each speaker was taken by establishing a decision threshold over the average of the scores given to each speech signal by each classifier. Finally, for each speaker, the outputs of both classifiers were combined using a discriminative approach based on SVM. The final decision was taken by establishing a threshold over the overall output given by the SVM. In order to allow comparisons using different feature subsets, the evaluation was carried out following the methodology presented in [3].

A. Embedding

Prior to the estimation of the nonlinear features, an embedding attractor has to be reconstructed. The embedding attractor is the starting point needed to estimate the nonlinear measures.
The state-space reconstruction is based on the time-delay embedding theorem [8], which can be stated as follows: given a dynamic system with a d-dimensional solution space and an evolving solution h(t), let x be some observation x(h(t)). Let us also define the lag vector (with dimension m, m > 2d + 1, and common time lag τ)

x(t) ≡ (x_t, x_{t−τ}, …, x_{t−(m−1)τ}).

Then, under very general conditions, the space of vectors x(t) generated by the dynamics contains all the information of the space of solution vectors h(t). The mapping between them is smooth and invertible. This property is referred to as diffeomorphism, and this kind of mapping is referred to as an embedding. The embedding theorem establishes that, when there is only a single sampled quantity from a dynamical system, it is possible to reconstruct a state space that is equivalent to the original (but unknown) state space composed of all the dynamical variables [8]. The points in the state space form trajectories, and the set of trajectories obtained from a time series is known as the attractor. In this paper, the embedding dimension m was chosen using the improved version of the false neighbors method proposed in [40], and the time delay τ by using the first minimum of the auto-mutual information function [8].

Fig. 2. 3-dimensional state spaces reconstructed by using the time-delay embedding theorem. The attractors were reconstructed using frames of 200 ms. (a) Normal voice (file DMA1NAL.NSP) with close trajectories. (b) Normal voice (file PCA1NAL.NSP) with separate trajectories. (c) Pathological voice (file JRF30AN.NSP) with separate trajectories. (d) Pathological voice (file EED07AN.NSP) with no clear dynamic behavior.

Fig. 2 shows four examples of 3-dimensional embedding attractors of speech signals belonging to the MEEI database [17]. Significant differences can be observed between the attractors in Fig. 2(a) and Fig. 2(d), but the differences are not so clear between the attractors in Fig. 2(b) and Fig. 2(c).

B. Parameterization

In order to take into account possible changes in the nonlinear dynamics of the speech, the signal was parameterized following a short-time procedure. With this approach, the dynamic information in the speech signal can be characterized by the evolution along time of the estimated complexity measures. This is quite important because, in a real physiological system, changes in the nonlinear dynamics may indicate states of pathophysiological dysfunction [7]. In this framework, the window length is an important variable to set, because it is linked to the number of points used to reconstruct the state space. In order to provide a reliable estimation of the nonlinear measures, a large enough number of points must be used. In the context of nonlinear analysis of time series, the number of points needed to reconstruct the attractor has been established at around 10^CD [41], [42]. In this paper, the length of the frame is based on previous experiments with complexity measures using the same database. In [43], the frame size was selected taking into account that the mean value of the CD was estimated as 3.1. Thus, the number of points used to reconstruct the attractor must be around 1500, corresponding to 60 ms. The final window length was selected as the size that reported the best detection accuracy between normal and pathological voices in a set of experiments. The frames used are 55 ms long with an overlap of 50%, and were extracted using rectangular windows instead of more complex ones, since complexity measures lack the spectral leakage problems presented by FFT-based parameters. This kind of window has been used in other works employing complexity measures for the same task [16]. From each frame, LLE, CD, and nine entropy-based complexity measures were estimated.

1) Largest Lyapunov Exponent: LLE is a measure of the separation rate of infinitesimally close trajectories of the attractor [1]. In other words, LLE measures the sensitivity to initial conditions of the underlying system. Considering two trajectories of the state space with an initial separation δx_0, the divergence is [8]

|δx(t)| ≈ exp(λt) |δx_0|    (1)

where λ is the Lyapunov exponent. The LLE can be defined as

λ = lim_{t→∞} (1/t) ln(|δx(t)| / |δx_0|).    (2)
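The delay-embedding step itself is a short computation. The following is a minimal NumPy sketch (the function name and the example values of m and τ are ours for illustration; in the paper, m is chosen with the false-neighbors method and τ from the first minimum of the auto-mutual information function):

```python
import numpy as np

def delay_embed(x, m, tau):
    """Reconstruct the state space of a scalar series x by time-delay
    embedding: each row is a lag vector built from m samples spaced tau apart."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau          # number of reconstructed points
    if n <= 0:
        raise ValueError("series too short for the requested (m, tau)")
    return np.column_stack([x[i * tau: i * tau + n] for i in range(m)])

# Example: embed a sampled sine wave in a 3-D state space with lag 5;
# a nearly periodic signal yields a closed curve as its attractor.
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 25)
attractor = delay_embed(signal, m=3, tau=5)
print(attractor.shape)  # (190, 3)
```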
There exist different algorithms to calculate the LLE. In order to allow comparisons, two algorithms widely used in the literature have been employed throughout this paper. The first one, described in [1], is based on the Wolf algorithm [44], but adjusted to speech signals. The second one was proposed in [2], theoretically with better results. Hereafter, the LLE estimated using the Wolf algorithm will be called LLE1, and the one estimated with the second approach, LLE2.

2) Correlation Dimension: CD is a measure of the dimensionality (geometry) of the space occupied by a set of random points. In order to characterize CD, it is necessary to define the correlation sum (CS) for a set of points x ∈ Ψ, where Ψ is the embedding space. CS is the fraction of all possible pairs of points that are closer than a given distance r in a particular norm. CS is given by [8]

C(r) = Σ_{i=1}^{N} C_i^m(r)    (3)

where

C_i^m(r) = (2 / (N(N−1))) Σ_{j=i+1}^{N} Θ(r − ‖x_i − x_j‖)    (4)

N is the number of points in Ψ, Θ is the Heaviside function, and ‖·‖ is a norm defined in any consistent metric space. CD is defined in the limit of an infinite amount of data (N → ∞) and for small r, and can be expressed as

CD = lim_{r→0} lim_{N→∞} d(N, r),   d(N, r) = ∂ ln C(r, N) / ∂ ln r.    (5)
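A minimal sketch of the correlation sum (3)-(4) and of the slope d(N, r) in (5) follows. Note that this is a Grassberger–Procaccia-style slope fit rather than the Takens estimator the paper actually uses, and all function names are ours:

```python
import numpy as np

def correlation_sum(attractor, r):
    """C(r) from (3)-(4): the fraction of all point pairs of the embedded
    attractor that are closer than r (Euclidean norm, self-matches excluded)."""
    X = np.asarray(attractor, dtype=float)
    N = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    iu = np.triu_indices(N, k=1)                                 # pairs with j > i
    return 2.0 * np.count_nonzero(d[iu] < r) / (N * (N - 1))

def cd_slope(attractor, r_values):
    """Estimate d(N, r) in (5) as the least-squares slope of ln C(r) vs ln r."""
    C = np.array([correlation_sum(attractor, r) for r in r_values])
    mask = C > 0
    return np.polyfit(np.log(r_values[mask]), np.log(C[mask]), 1)[0]

# Sanity check: points spread along a line occupy a one-dimensional set,
# so the estimated slope should be close to 1.
line = np.column_stack([np.linspace(0, 1, 200)] * 2)
print(cd_slope(line, np.logspace(-1.5, -0.5, 8)))  # close to 1
```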
The CD is commonly calculated using the Grassberger–Procaccia algorithm [45]. However, in this paper, the CD was calculated using the Takens estimator, since it is computationally more efficient and obtains estimates closer to the real values than the Grassberger–Procaccia algorithm [46].

3) Entropy-Based Complexity Measures: Entropy is a measure of the uncertainty of a random variable. Let X be a discrete random variable with alphabet χ and probability
mass function p(x) = Pr{X = x}, x ∈ χ. The Shannon entropy H(X) is defined by [47]

H(X) = −Σ_{x∈χ} p(x) log p(x).    (6)
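Equation (6) is simple to evaluate for an empirical symbol distribution; a minimal sketch (in nats; the function name is ours):

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Shannon entropy (in nats) of the empirical distribution of a
    discrete symbol sequence: H(X) = -sum_x p(x) ln p(x)."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# A uniform four-symbol source attains the maximum entropy ln 4
print(round(shannon_entropy(['a', 'b', 'c', 'd']), 4))  # 1.3863
```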
If instead of a random variable we have a sequence of n random variables (i.e., a stochastic process), the process can be characterized by a joint probability mass function Pr{X1 = x1, …, Xn = xn} = p(x1, x2, …, xn). Under the assumption that the limit exists, the rate at which the joint entropy grows with n is defined by [47]

H(X) = lim_{n→∞} (1/n) H(X1, X2, …, Xn) = lim_{n→∞} (1/n) Hn.    (7)

If the random variables are independent, but not identically distributed, the entropy rate is given by

H(X) = lim_{n→∞} (1/n) Σ_{i=1}^{n} H(Xi).    (8)
On the other hand, let the state space be partitioned into hypercubes of content ε^m, and let the state of the system be measured at intervals of time δ. Besides, let p(k1, …, kn) denote the joint probability that the state of the system is in the hypercube k1 at t = δ, k2 at t = 2δ, and so on up to kn at t = nδ. The Kolmogorov–Sinai entropy (HKS) is then [20]

HKS = −lim_{δ→0} lim_{ε→0} lim_{n→∞} (1/(nδ)) Σ_{k1,…,kn} p(k1, …, kn) log p(k1, …, kn)    (9)

measuring the mean rate of creation of information [20]. For stationary processes, it can be shown that [20]

HKS = lim_{δ→0} lim_{ε→0} lim_{n→∞} (H_{n+1} − H_n).    (10)
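In practice, HKS is approached through finite-order statistics computed directly from data, as the estimators described next do. The basic recipe (count how often length-m patterns that match within a tolerance r still match when extended by one sample) can be sketched as follows. This is a sample-entropy-style estimate −ln(A/B); the function name, the Chebyshev norm, and the default parameters are our illustrative choices:

```python
import numpy as np

def conditional_entropy_estimate(x, m=2, r=0.2):
    """Finite-order estimate in the spirit of (10): -ln(A/B), where B counts
    pairs of length-m templates closer than r (Chebyshev norm, self-matches
    excluded) and A counts the same pairs extended to length m + 1."""
    x = np.asarray(x, dtype=float)

    def count_matches(length):
        # the same number of templates (len(x) - m) is used for both lengths
        T = np.array([x[i: i + length] for i in range(len(x) - m)])
        d = np.abs(T[:, None, :] - T[None, :, :]).max(-1)  # Chebyshev distance
        iu = np.triu_indices(len(T), k=1)                  # exclude self-matches
        return np.count_nonzero(d[iu] < r)

    B, A = count_matches(m), count_matches(m + 1)
    return -np.log(A / B) if A > 0 and B > 0 else np.inf

# A nearly periodic series is far more predictable than white noise,
# so it should yield the lower estimate.
rng = np.random.default_rng(0)
periodic = np.sin(np.linspace(0, 20 * np.pi, 300))
noise = rng.uniform(-1, 1, 300)
print(conditional_entropy_estimate(periodic), conditional_entropy_estimate(noise))
```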
Numerically, only entropies of finite order n can be computed. However, some methods have been proposed in an attempt to estimate HKS [20]. One of them is the AE. AE is a measure of the average conditional information generated by diverging points on the trajectory [20], [22]. AE is defined as a function of the CS (4). For fixed m and r, AE is given by

AE(m, r) = lim_{N→∞} [Φ^{m+1}(r) − Φ^m(r)]    (11)

where

Φ^m(r) = (1/(N−m+1)) Σ_{i=1}^{N−m+1} ln C_i^m(r).    (12)

A first modification of AE, presented in [22] and called SE, was developed to obtain a measure more independent of the signal length than AE. SE is given by

SE(m, r) = lim_{N→∞} −ln [Γ^{m+1}(r) / Γ^m(r)].    (13)

The difference between Γ and Φ is that the former does not compare the embedding vectors with themselves (it excludes self-matches). The advantage of this fact is that the estimator is unbiased [20].

Another modification of AE, presented in [23] and called Gaussian kernel approximate entropy (GAE), replaces the Heaviside function with a Gaussian-kernel-based function, with the aim of suppressing the discontinuity of the auxiliary function over the CS (rectangular kernel); in this way, nearby points receive greater weight than distant ones. In this case, the Heaviside function is replaced by

d_G(x_i, x_j) = exp(−‖x_i − x_j‖² / (10 r²)).    (14)

By using (14), the estimation of GAE is carried out in the same way as for AE [see (11) and (12)]. Moreover, in order to evaluate its behavior, the SE estimation was also modified using a Gaussian kernel; from now on, this modification is called Gaussian kernel sample entropy (GSE). Besides the number of points used to estimate this class of measures, it is necessary to set the threshold r. The value of r was fixed to r = rc · std(signal), where std(·) is the standard deviation operator [22]. The parameter rc was chosen equal to 0.35, according to previous experiments reported in [43].

4) Measures Based on Recurrence and Fractal-Scaling Analysis: Considering that both deterministic and stochastic components are combined in speech [6], the deterministic component can be characterized by a recurrence measure. Let B(x(ti), r) be a closed ball with radius r > 0 containing an embedded data point x(ti). Excluding temporal correlations, let tr = tj − ti be the recurrence time, where tj is the instant at which the trajectory first returns to the same ball. Let R(t) be the normalized histogram of the recurrence times estimated for all embedded points of a reconstructed attractor; the recurrence probability density entropy (RPDE) can be expressed as

RPDE = −(Σ_{i=1}^{t_max} R(i) ln R(i)) / ln t_max    (15)

where t_max is the maximum recurrence time in the attractor.

On the other hand, the stochastic component can be characterized by means of detrended fluctuation analysis (DFA) [6], which calculates the scaling exponent in nonstationary time series. First, the time series x(t) is integrated

y(n) = Σ_{t=1}^{n} x(t)    (16)
for n = 1, 2, …, N, where N is the number of samples in the signal. Then, y(n) is divided into windows of length L samples. A least-squares straight-line approximation is fitted to each window, and the root-mean-square error is calculated for every window at every time scale:

F(L) = [ (1/L) Σ_{n=1}^{L} (y(n) − an − b)² ]^{1/2}    (17)

where a and b correspond to the straight-line parameters. This process is repeated over the whole signal for different window sizes L, and a log–log graph of L against F(L) is constructed. A straight line on this graph indicates self-similarity expressed as
F(L) ∝ L^β. The DFA measure then corresponds to a sigmoidal normalization of the scaling exponent β [6].

5) Hidden Markov Entropy Measures: A Markov chain is a random process {X(t)} that can take a finite number k of values at certain moments of time (t0 ≤ t1 ≤ t2 ≤ ···). The values of the stochastic process change with known probabilities called transition probabilities. The particularity of this stochastic process is that the probability of changing to another state depends only on the current state of the process; this is known as the Markov condition. When such probabilities do not change with time and the initial probability of each state is also constant, the Markov chain is stationary. Let {X(t)} be a stationary Markov chain with initial distribution π and transition matrix A. Then, the entropy rate is given by [47]

H(X) = −Σ_{ij} π_i A_{ij} log A_{ij}.    (18)

In view of (18), it is possible to observe that the entropy measure is a sum of the individual Shannon entropies of the transition probability distribution of each state, weighted by the initial probability of the corresponding state. Some processes can be seen as a Markov chain whose outputs are random variables generated from probability functions associated with each state. Such processes are called hidden Markov processes (HMP), since the states of the Markov process cannot be identified from its output (the states are "hidden"). In this case, it is not possible to obtain a closed form for the entropy rate [47], [48]. An HMP can also be understood as a Markov process with noisy observations [48]. Therefore, in the same way as in (18), it is possible to establish an entropy measure of the HMP as the entropy of the Markov process plus the entropy generated by the noise in each state of the process. We call this measure EE. If a DHMM is used to represent a stochastic process, the noise is modeled by means of discrete distributions, and it is possible to obtain a probability mass function for the noise in each state. Denoting the actual state of the process at time t as St, a DHMM can be characterized by the following parameters [49]:
1) π = {πi}, i = 1, 2, …, k: the initial state distribution, where πi = p(S0 = i) is the probability of starting at the ith state;
2) A = {Aij}, 1 ≤ i, j ≤ k: the set of transition probabilities among states, where Aij = p(St+1 = j | St = i) is the probability of reaching the jth state at time t + 1, coming from the ith state at time t;
3) B = {Bij}, i = 1, 2, …, k, j = 1, 2, …, ϕ: the observation symbol probability distribution, where Bij = p(ot = υj | St = i), ot is the output at time t, υj are the symbols that can be associated to the output, and ϕ is the total number of symbols.
All parameters are subject to the standard stochastic constraints [49].

Using this definition, the EE, HE, can be defined as

HE = HMC + Hg    (19)

where HMC is the entropy due to the Markov process (18), and Hg is the Shannon entropy due to the noise. By replacing both entropies, HE can be written as

HES = −Σ_{i=1}^{k} πi ( Σ_{j=1}^{k} Aij log Aij + Σ_{j=1}^{ϕ} Bij log Bij ).    (20)
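Given fitted DHMM parameters, (20) is a direct computation; a minimal sketch (the function name is ours, and training the DHMM itself, e.g., via Baum–Welch, is omitted):

```python
import numpy as np

def empirical_entropy(pi, A, B):
    """HES from (20): per-state Shannon entropies of the transition rows A[i]
    and emission rows B[i], weighted by the initial state distribution pi."""
    pi, A, B = (np.asarray(v, dtype=float) for v in (pi, A, B))

    def row_entropy(M):
        P = np.where(M > 0, M, 1.0)          # zero-probability terms contribute 0
        return -(M * np.log(P)).sum(axis=1)  # one entropy value per state

    return float(pi @ (row_entropy(A) + row_entropy(B)) + 0.0)  # +0.0 avoids -0.0

# A deterministic chain with deterministic emissions carries no uncertainty
print(empirical_entropy([0.5, 0.5], [[0, 1], [1, 0]], [[1, 0], [0, 1]]))  # 0.0
```

With uniform transition and emission rows, the value reaches the maximum log k + log ϕ discussed in the text.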
If, instead of the Shannon entropy, the Renyi entropy [42] is used, (20) becomes

HER = Σ_{i=1}^{k} πi ( (1/(1−α)) log Σ_{j=1}^{k} Aij^α + (1/(1−α)) log Σ_{j=1}^{ϕ} Bij^α )    (21)
where α > 0 and α ≠ 1 is the entropy order. In this paper, α = 2 is used, since it is the most common Renyi entropy [47]. The uncertainty is maximum (and equal to log k + log ϕ) if all states in the Markov chain are equally likely to be reached from any other state (all directions are equally probable) and all observation symbols are equally probable in each state of the process. In other words, no trajectory stands out in the state space. The minimum value (zero) corresponds to the case where only one state can be reached from each state in the Markov chain (excluding self-transitions) and only one observation symbol is likely to be emitted in each state. In that case, there is a single evident trajectory in the state space, and its dispersion is zero; the initial probability does not influence the EE, since only one state sequence is probable in the Markov chain. The hidden Markov entropy measures used in this paper were estimated using a DHMM with six states and a codebook of 32 words. These values were set after experiments varying them in the ranges [5, 10] and [16, 256], respectively.

C. Classification

1) Gaussian Mixture Models: The central idea of the GMM is to estimate the probability density function of a dataset by means of a set of weighted Gaussian functions. The model can be expressed as follows [50]:

$$p(x|\zeta) = \sum_{i=1}^{M} c_i\, p_i(x) \qquad (22)$$
where ζ ∈ {ζn, ζp} indicates the class to model, pi(x), i = 1, . . . , M, are the component densities, and ci, i = 1, . . . , M, are the component weights. Each component density is an n-variate Gaussian function; therefore, the different components act together to model the overall pdf. The most common method for estimating the parameters of the model is the expectation-maximization (EM) algorithm. For each class to be recognized, the parameters of a different GMM are estimated. The evaluation is then made by calculating, for each GMM, the a posteriori probability of an observation. The score given to each sequence is obtained by calculating the logarithm of the ratio between the likelihoods given by both models (called the log-likelihood ratio). 2) Support Vector Machines: An SVM is a two-class classifier. The problem of approaching a SVM [51] is analogous to
solving the problem of finding a linear function that satisfies

f(x) = ⟨w, x⟩ + b, with w ∈ χ, b ∈ ℝ (23)

where χ corresponds to the space of the input patterns x. The function f(·) is calculated by solving an optimization problem over sums of a kernel function. The support vector algorithm looks for the hyperplane that separates the two classes with the largest margin. With a radial basis function kernel, training implies adjusting the aperture of the kernel γ and a penalty parameter C. The output given by the SVM for each speech sample can be interpreted as the likelihood that the sample belongs to a specific class. Henceforth, the log-likelihood (likelihood in the log domain) will be called "score."

3) Fusing Generative and Discriminative Classifiers: The classification of normal/pathological voices was carried out in two steps, following first a generative approach based on GMM and, later, a discriminative approach based on SVM (see Fig. 1). As commented earlier, the SVM was fed with the scores given by two GMM-based classifiers supplied with two parameterization approaches: 1) nonlinear measures and 2) noise parameters combined with MFCC. For each speech signal, and for each parameterization approach, both the likelihood of being normal and the likelihood of being pathological were calculated using GMM. A score was obtained by subtracting the log-likelihood of being normal from the log-likelihood of being pathological. Later, the intermediate decisions about the presence or absence of pathology given by both statistical models were used as a new feature space, and an SVM-based classifier was trained to detect the presence of pathology.

The SVM and GMM classifiers were chosen on the basis of their modeling capabilities. The nonlinear mapping carried out by the SVM maximizes the generalization capabilities of the classifier. In addition, the possibility of choosing different basis functions, from a priori knowledge of the problem domain, allows this system to adapt better to a given problem [52]. On the other hand, the GMM fits the distribution of the observed data by means of a set of weighted Gaussian functions. The advantages of using GMM are that they are computationally inexpensive and capable of modeling complex statistical distributions [53]. These structures have been used independently for the detection of pathological voices and have proven to be very reliable in comparison with others used in the state of the art [39].

The advantage of combining outputs from different classifiers, instead of fusing features, is that the structure of the feature space used to feed each classifier is much simpler. Furthermore, even if one of the classifiers yields a better performance, the sets of speech recordings misclassified by each would not necessarily overlap; therefore, the combination of their outputs could improve the overall performance [54]. Moreover, the parameterization using complexity measures and MFCC requires different window lengths (55 and 40 ms, respectively) and different windows (rectangular and Hamming, respectively), which complicates the use of the whole set of features in a single feature space, encouraging the use of a classifier combination strategy.

III. EXPERIMENTS AND RESULTS

A. Corpus of Speakers

Testing has been carried out with the MEEI voice disorders database [17]. Due to the different sampling rates of the recordings stored in the database, downsampling with prior half-band filtering was carried out, when needed, adjusting every utterance to a 25-kHz sampling rate. A resolution of 16 bits was used for all the recordings. The recordings contain the sustained phonation of the /ah/ vowel from patients with a variety of voice pathologies; they were previously edited to remove the beginning and ending of each utterance, eliminating the onset and offset effects. A subset of 173 pathological and 53 normal speakers was taken, according to those enumerated in [55].

B. Experimental Setup

The methodology proposed in [3] has been used to evaluate the system. The generalization abilities have been tested following a cross-validation scheme with ten different sets for training and validation. The results are presented giving the following rates: true positive rate (tp) (or sensitivity: the ratio between the pathological files correctly classified and the total number of pathological files), false negative rate (fn) (the ratio between the pathological files wrongly classified and the total number of pathological files), true negative rate (tn) (or specificity: the ratio between the normal files correctly classified and the total number of normal files), and false positive rate (fp) (the ratio between the normal files wrongly classified and the total number of normal files). The overall accuracy is the ratio between the hits of the system and the total number of files. Receiver operating characteristic (ROC) curves were used to represent graphically the performance of the proposed architecture. The ROC [56] reveals the diagnostic accuracy expressed in terms of sensitivity and 1-specificity (i.e., fp).
In addition, the area under the ROC curve (AUC) was calculated, summarizing the expected performance of the system in a single scalar [56].

C. Results

Table I shows a statistical analysis of the nonlinear features described in Section II. Initially, and in order to compare with previous results in the state of the art, they were estimated using frames of 200 ms. Table I shows that the maximum embedding dimension for this database has been estimated as 7. This is a high value compared with those found in previous works (usually 2 or 3) [1], [26]. However, in [1], the embedding dimension was set to 3 for all voices because the mean value of the correlation dimension is around 3, whereas in [26] the embedding dimension was simply assumed to be 2. Nevertheless, other works that used algorithms to estimate the embedding dimension have employed values as high as m = 11, even for normal recordings [57].
TABLE I STATISTICS OF THE NONLINEAR FEATURES (WITH 200 MS FRAMES)
TABLE II ACCURACY USING COMPLEXITY MEASURES AND A GMM DETECTOR
TABLE III ACCURACY USING DIFFERENT FEATURE SETS AND A GMM DETECTOR
Fig. 3. Distributions of CD and two different estimations of LLE for normal and pathological voices. (a) LLE estimated using the algorithm in [1]. (b) LLE estimated using the algorithm in [2]. (c) CD calculated using the Takens estimator.

Fig. 4. ROC curves using different feature sets: (a) with the feature sets enumerated in Table II and (b) using the noise parameters combined with MFCC, the complexity measurements, and fusing both parameterization approaches.
On the other hand, Table I shows large differences between the values obtained with the two algorithms used to calculate the LLE. The first delivered positive and negative values around zero, while the second provided only positive values. Since many normal and some pathological voices present attractors with close trajectories, one might expect LLE to be zero for these voices, but the second algorithm never estimates values equal to zero. Fig. 3 shows the distributions of CD and both estimates of LLE. The line inside each box marks the median, whiskers mark 1.5 times the interquartile range from the ends of the box, and "+" symbols mark outlying points. Since the notches in the box plots do not overlap, we can conclude with 95% confidence that the true medians differ between normal and pathological voices. These results agree with those found in the literature using different databases [7]. The values of CD present clearer differences between normal and pathological voices. However, since the main interest of this paper is to establish the discriminative capabilities of the nonlinear features, we performed the
experiments calculating the features over short-time windows to train the GMM-based detectors. Table II shows the sensitivity, specificity, accuracy, and AUC obtained independently for each of the nonlinear measures. The experiments were carried out using different numbers of Gaussians in the GMM (from 2 to 6), and the results shown in Table II are the best obtained for each feature. Although Fig. 3 showed that the medians are statistically different, Table II shows that LLE does not provide good discrimination capabilities. The best classification accuracy is obtained with HES. Table III shows the classification accuracy obtained using different sets of features. The first set corresponds to the more classical nonlinear features (LLE and CD); only the LLE1 algorithm was used, because it behaved better than LLE2. The second set comprises the entropy measures based on AE; the third set corresponds to the recurrence and fractal-scaling analysis features together; the fourth, to the hidden Markov entropy measures; and, for the sake of comparison, the fifth set corresponds to the noise parameters and MFCC. Fig. 4(a) plots the ROC curves for the configurations reported in Table III. In all cases, the classification was carried out with a GMM. The best performances are those obtained using:
Fig. 5. Probability density functions and cumulative distributions for true and false scores obtained with: (a) training set and (b) testing set.
TABLE IV CLASSIFICATION ACCURACY OBTAINED USING A CLASSIFIER COMBINATION STRATEGY. RESULTS WITH TRAINING AND TESTING SUBSETS
1) the hidden Markov entropy measures and 2) the noise parameters combined with MFCC. Fusing the whole set of nonlinear features, the performance diminished with respect to that obtained using the hidden Markov entropy measures alone. For this reason, the features based on hidden Markov entropy measures were preferred for the further analysis combining classifiers. Again, for the sake of comparison, Fig. 4(b) shows the ROC curves for the system trained using noise parameters combined with MFCC, complexity measures, and the fusion of both strategies. There is a clear improvement in the performance of the system when complementing the nonlinear measures with the noise parameters combined with MFCC. Table IV shows the best classification accuracy fusing both classifiers, each fed with a different feature set. The GMM provide a new feature space that was used, for each speaker, to take the final decision about the presence or absence of pathology. Combining both classifiers with a SVM, the error decreased by 44.47% with respect to the minimum error obtained using only complexity measures, representing an absolute reduction of 2.21% in the final error. In addition, for the sake of comparison, Table IV explores the possibility of using a third GMM classifier instead of the proposed SVM: although the results are similar in terms of accuracy, the confidence intervals of the results using the SVM are close to zero, evidencing better stability. Fig. 5 shows the distributions of the normal and pathological scores given by the final SVM-based detector, as well as the false positive rate and false negative rate curves, which correspond to the cumulative sums of the distributions of normal and pathological scores, respectively. Fig. 5(a) shows the graphics obtained with the training set, and Fig. 5(b) shows those obtained with the testing set.
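The two-stage combination discussed above (per-class GMMs producing log-likelihood ratio scores, fused by an RBF SVM) can be sketched as follows. This is an illustrative sketch using scikit-learn on synthetic data; the feature values, model sizes, and hyperparameters (γ, C) are assumptions, not the paper's actual features or settings.

```python
# Sketch of the two-stage detector: per-class GMMs yield a log-likelihood
# ratio (LLR) score per parameterization, and an RBF SVM fuses the scores.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def gmm_llr(X_norm, X_path, X, M=3):
    """Fit one GMM per class; return log p(x|pathological) - log p(x|normal)."""
    g_n = GaussianMixture(n_components=M, random_state=0).fit(X_norm)
    g_p = GaussianMixture(n_components=M, random_state=0).fit(X_path)
    return g_p.score_samples(X) - g_n.score_samples(X)

n = 200
y = np.r_[np.zeros(n), np.ones(n)]       # 0 = normal, 1 = pathological
score_columns = []
for shift in (1.5, 2.0):                 # one pass per synthetic "feature set"
    Xn = rng.normal(0.0, 1.0, size=(n, 4))   # normal class
    Xp = rng.normal(shift, 1.0, size=(n, 4)) # pathological class
    score_columns.append(gmm_llr(Xn, Xp, np.vstack([Xn, Xp])))
S = np.column_stack(score_columns)       # new 2-D feature space of GMM scores

# The RBF SVM trained on the score space takes the final decision
svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(S, y)
print(svm.score(S, y))                   # training accuracy on the toy data
```

In a faithful reproduction, each GMM would of course be scored on held-out recordings within the cross-validation scheme rather than on its own training data.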
The similarity of the score distributions in both cases confirms the stability of the system, given by the confidence intervals shown in Table IV.
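The evaluation rates of Section III-B (sensitivity, specificity, accuracy) and the AUC can be computed directly from detector scores. A minimal sketch with hypothetical score values (the threshold of 0 corresponds to equal log-likelihoods for both classes):

```python
import numpy as np

def detection_rates(scores, labels, threshold=0.0):
    """Rates as defined in Section III-B. labels: 1 = pathological,
    0 = normal; decide 'pathological' when score > threshold."""
    decisions = scores > threshold
    path, norm = labels == 1, labels == 0
    tp = np.mean(decisions[path])        # sensitivity
    fn = 1.0 - tp
    tn = np.mean(~decisions[norm])       # specificity
    fp = 1.0 - tn
    acc = np.mean(decisions == (labels == 1))
    return tp, fn, tn, fp, acc

def auc(scores, labels):
    """AUC via the rank (Mann-Whitney) statistic: the probability that a
    randomly chosen pathological score exceeds a normal one."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

# Hypothetical detector scores for four pathological and four normal files
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([2.1, 1.3, 0.4, -0.2, -1.5, -0.7, 0.1, -2.0])
print(detection_rates(scores, labels))
print(auc(scores, labels))
```

Sweeping the threshold over the score range and plotting sensitivity against 1-specificity yields the ROC curves of Fig. 4.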
IV. CONCLUSION

The methodology for nonlinear analysis of speech signals proposed in this paper reveals valuable and complementary information to detect pathological voices. The use of complexity measures in this context showed reliable results, especially using hidden Markov entropy measures. The characterization of the embedding space carried out by the DHMM takes into account additional information about the transitions between different regions of the state space along the trajectories of the attractor, improving the discrimination capabilities of the EEs. The use of a classifier combination strategy has proven to be a valuable alternative for fusing information from the different phenomena involved in speech production. The use of an SVM as the final detector allows a classification accuracy of 98.23% with a very narrow confidence interval (approximately 0), representing an improvement with respect to other works found in the state of the art. Although the conventional nonlinear statistics (CD and LLE) showed statistically significant differences between normal and pathological voices, their usefulness for the automatic detection of pathologies still remains unclear. Experiments with new data must be carried out to clarify this aspect. Regarding future work, the proposed methodology could be fused with additional features providing complementary information, e.g., features based on biomechanical parameters or extracted from the characterization of the mucosal wave.

REFERENCES

[1] A. Giovanni, M. Ouaknine, and J.-M. Triglia, "Determination of largest Lyapunov exponents of vocal signal: Application to unilateral laryngeal paralysis," J. Voice, vol. 13, no. 3, pp. 341–354, 1999.
[2] R. Hegger, H. Kantz, and T. Schreiber, "Practical implementation of nonlinear time series methods: The TISEAN package," Chaos, vol. 9, no. 2, pp. 413–439, 1999.
[3] N. Sáenz-Lechón, J. I. Godino-Llorente, V. Osma-Ruiz, and P. Gómez-Vilda, "Methodological issues in the development of automatic systems for voice pathology detection," Biomed. Signal Process. Control, vol. 1, no. 2, pp. 120–128, 2006.
[4] I. R. Titze, The Myoelastic Aerodynamic Theory of Phonation. Iowa City, IA: National Center for Voice and Speech, 2006.
[5] I. R. Titze, R. J. Baken, and H. Herzel, "Evidence of chaos in vocal fold vibration," in Frontiers in Basic Science, I. R. Titze, Ed. San Diego, CA: Singular Publishing Group, 1993, pp. 143–188.
[6] M. A. Little, P. E. McSharry, S. J. Roberts, D. A. Costello, and I. M. Moroz, "Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection," Biomed. Eng. Online, vol. 6, no. 23, 2007.
[7] J. J. Jiang, Y. Zhang, and C. McGilligan, "Chaos in voice, from modeling to measurement," J. Voice, vol. 20, no. 1, pp. 2–17, 2006.
[8] H. Kantz and T. Schreiber, Nonlinear Time Series Analysis, 2nd ed. Cambridge, U.K.: Cambridge University Press, 2004.
[9] J. J. Jiang and Y. Zhang, "Nonlinear dynamic analysis of speech from pathological subjects," Electron. Lett., vol. 38, no. 6, pp. 294–295, 2002.
[10] Y. Zhang and J. J. Jiang, "Nonlinear dynamic analysis in signals typing of pathological human voices," Electron. Lett., vol. 39, no. 13, pp. 1021–1023, 2003.
[11] J. MacCallum, L. Cai, L. Zhou, Y. Zhang, and J. Jiang, "Acoustic analysis of aperiodic voice: Perturbation and nonlinear dynamic properties in esophageal phonation," J. Voice, vol. 23, no. 3, pp. 283–290, 2009.
[12] M. L. Meredith, S. M. Theis, J. S. McMurray, Y. Zhang, and J. Jiang, "Describing pediatric dysphonia with nonlinear dynamic parameters," Int. J. Pediatr. Otorhinolaryngol., vol. 72, no. 12, pp. 1829–1836, 2008.
[13] Y. Zhang, J. Jiang, L. Biazzo, and M. Jorgensen, "Perturbation and nonlinear dynamic analysis of voices from patients with laryngeal paralysis," J. Voice, vol. 19, no. 4, pp. 519–528, 2004.
[14] Y. Zhang, C. McGilligan, L. Zhou, M. Vig, and J. Jiang, "Nonlinear dynamic analysis of voices before and after surgical excision of vocal polyps," J. Acoust. Soc. Am., vol. 115, no. 5, pp. 2270–2277, 2004.
[15] Y. Zhang and J. J. Jiang, "Acoustic analysis of sustained and running voices from patients with laryngeal pathologies," J. Voice, vol. 22, no. 1, pp. 1–9, 2008.
[16] G. Vaziri, F. Almasganj, and R. Behroozmand, "Pathological assessment of patients' speech signals using nonlinear dynamical analysis," Comput. Biol. Med., vol. 40, no. 1, pp. 54–63, 2010.
[17] Massachusetts Eye and Ear Infirmary, Voice Disorders Database, Version 1.03. Lincoln Park, NJ: Kay Elemetrics Corp., 1994 [CD-ROM].
[18] S. M. Pincus, "Approximate entropy as a measure of system complexity," Proc. Natl. Acad. Sci. USA, vol. 88, pp. 2297–2301, 1991.
[19] A. Serletis, A. Shahmordi, and D. Serletis, "Effect of noise on estimation of Lyapunov exponents from a time series," Chaos, Solitons Fractals, vol. 32, no. 2, pp. 883–887, 2007.
[20] M. Costa, A. Goldberger, and C. Peng, "Multiscale entropy analysis of biological signals," Phys. Rev. E, vol. 71, pp. 021906-1–021906-18, 2005.
[21] I. A. Rezek and S. J. Roberts, "Stochastic complexity measures for physiological signal analysis," IEEE Trans. Biomed. Eng., vol. 45, no. 9, pp. 1186–1191, Sep. 1998.
[22] J. S. Richman and J. R. Moorman, "Physiological time-series analysis using approximate entropy and sample entropy," Am. J. Physiol. Heart Circ. Physiol., vol. 278, pp. H2039–H2049, 2000.
[23] L.-S. Xu, K.-Q. Wang, and L. Wang, "Gaussian kernel approximate entropy algorithm for analyzing irregularity of time series," in Proc. 4th Int. Conf. Mach. Learn. Cybern., 2005, pp. 5605–5608.
[24] K. K. L. Ho, G. B. Moody, C.-K. Peng, J. E. Mietus, M. G. Larson, D. Levy, and A. L. Goldberger, "Predicting survival in heart failure case and control subjects by use of fully automated methods for deriving nonlinear and conventional indices of heart rate dynamics," Circulation, vol. 96, pp. 842–848, 1997.
[25] L. A. Fleisher, S. M. Pincus, and S. H. Rosenbaum, "Approximate entropy of heart rate as a correlate of postoperative ventricular dysfunction," Anesthesiology, vol. 78, no. 4, pp. 683–692, 1993.
[26] K. Manickam, C. Moore, T. Willard, and N. Slevin, "Quantifying aberrant phonation using approximate entropy in electrolaryngography," Speech Commun., vol. 47, no. 3, pp. 312–321, 2005.
[27] C. Moore, K. Manickam, T. Willard, S. Jones, N. Slevin, and S. Shalet, "Spectral pattern complexity analysis and the quantification of voice normality in healthy and radiotherapy patient groups," Med. Eng. Phys., vol. 26, no. 4, pp. 291–301, 2004.
[28] B. S. Aghazadeh, H. Khadivi, and M. Nikkhah-Bahrami, "Nonlinear analysis and classification of vocal disorders," in Proc. 29th Int. IEEE EMBS Conf., 2007, pp. 6199–6202.
[29] P. Henríquez, J. B. Alonso, M. A. Ferrer, C. M. Travieso, J. I. Godino-Llorente, and F. Díaz-de-María, "Characterization of healthy and pathological voice through measures based on nonlinear dynamics," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 6, pp. 1186–1195, Aug. 2009.
[30] C. Maciel, J. Pereira, and D. Stewart, "Identifying healthy and pathologically affected voice signals," IEEE Signal Process. Mag., vol. 27, no. 1, pp. 120–123, Jan. 2010.
[31] P. Scalassara, M. Dajer, J. Marrara, C. Maciel, and J. Pereira, "Analysis of voice pathology evolution using entropy rate," in Proc. 10th IEEE Int. Symp. Multimedia, 2008, pp. 580–585.
[32] M. A. Little, D. A. Costello, and M. L. Harries, "Objective dysphonia quantification in vocal fold paralysis: Comparing nonlinear with classical measures," J. Voice, to be published, DOI: 10.1016/j.jvoice.2009.04.004.
[33] D. Woodcock and I. T. Nabney, A New Measure Based on the Renyi Entropy Rate Using Gaussian Kernels, Aston University, U.K., 2006.
[34] J. D. Arias-Londoño, J. I. Godino-Llorente, G. Castellanos-Domínguez, N. Sáenz-Lechón, and V. Osma-Ruiz, "Complexity analysis of pathological voices by means of hidden Markov entropy measures," in Proc. 31st Int. IEEE EMBS Conf., 2009, pp. 2248–2251.
[35] X. Huang, A. Acero, and H. W. Hon, Spoken Language Processing. Englewood Cliffs, NJ: Prentice-Hall PTR, 2001.
[36] G. de Krom, "A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals," J. Speech Hear. Res., vol. 36, no. 2, pp. 254–266, 1993.
[37] H. Kasuya, S. Ogawa, K. Mashima, and S. Ebihara, "Normalized noise energy as an acoustic measure to evaluate pathologic voice," J. Acoust. Soc. Am., vol. 80, no. 5, pp. 1329–1334, 1986.
[38] D. Michaelis, T. Gramss, and H. W. Strube, "Glottal-to-noise excitation ratio. A new measure for describing pathological voices," Acustica/Acta Acustica, vol. 83, pp. 700–706, 1997.
[39] J. I. Godino-Llorente, P. Gómez-Vilda, and M. Blanco-Velasco, "Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters," IEEE Trans. Biomed. Eng., vol. 53, no. 10, pp. 1943–1953, Oct. 2006.
[40] L. Cao, "Practical method for determining the minimum embedding dimension of a scalar time series," Physica D, vol. 110, no. 1–2, pp. 43–50, 1997.
[41] R. Carvajal, N. Wessel, M. Vallverdú, P. Caminal, and A. Voss, "Correlation dimension analysis of heart rate variability in patients with dilated cardiomyopathy," Comput. Meth. Programs Biomed., vol. 78, no. 2, pp. 133–140, 2005.
[42] M. Ding, C. Grebogi, E. Ott, T. Sauer, and J. A. Yorke, "Estimating correlation dimension from a chaotic time series: When does plateau onset occur?" Physica D, vol. 69, no. 3–4, pp. 404–424, 1993.
[43] J. D. Arias-Londoño, J. I. Godino-Llorente, and G. Castellanos-Domínguez, "Short time analysis of pathological voices using complexity measures," in Proc. 3rd Adv. Voice Funct. Assess. Int. Workshop, 2009, pp. 93–96.
[44] A. Wolf, J. B. Swift, H. L. Swinney, and J. A. Vastano, "Determining Lyapunov exponents from a time series," Physica D, vol. 16, no. 3, pp. 285–317, 1985.
[45] P. Grassberger and I. Procaccia, "Characterization of strange attractors," Phys. Rev. Lett., vol. 50, no. 5, pp. 346–349, 1983.
[46] S. Borovkova, R. Burton, and H. Dehling, "Consistency of the Takens estimator for the correlation dimension," Ann. Appl. Probab., vol. 9, no. 2, pp. 376–390, 1999.
[47] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2006.
[48] M. Rezaeian, "Hidden Markov process: A new representation, entropy rate and estimation entropy," arXiv:cs/0606114v2, 2006.
[49] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[50] D. Reynolds, T. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digit. Signal Process., vol. 10, pp. 19–41, 2000.
[51] V. Vapnik, "An overview of statistical learning theory," IEEE Trans. Neural Netw., vol. 10, no. 5, pp. 988–1000, Sep. 1999.
[52] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Hoboken, NJ: Wiley-Interscience, 2000.
[53] D. A. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digit. Signal Process., vol. 10, pp. 19–41, 2000.
[54] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, "On combining classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, Mar. 1998.
[55] V. Parsa and D. Jamieson, "Identification of pathological voices using glottal noise measures," J. Speech, Lang. Hear. Res., vol. 43, no. 2, pp. 469–485, 2000.
[56] T. Fawcett, "ROC graphs: Notes and practical considerations for researchers," HP Laboratories, Palo Alto, CA, 2004.
[57] R. Nicollas, R. Garrel, M. Ouaknine, A. Giovanni, B. Nazarian, and J.-M. Triglia, "Normal voice in children between 6 and 12 years of age: Database and nonlinear analysis," J. Voice, vol. 22, no. 6, pp. 671–675, 2008.
Authors' photographs and biographies not available at the time of publication.