Probabilistic Speaker Identification with dual Penalized Logistic Regression Machine

Tomoko Matsui and Kunio Tanabe
The Institute of Statistical Mathematics, Tokyo, Japan
{tmatsui,tanabe}@ism.ac.jp

Abstract

This paper investigates a probabilistic speaker identification method based on dual Penalized Logistic Regression Machines (dPLRMs). The machines employ kernel functions which map an acoustic feature space to a higher-dimensional space, as is the case with Support Vector Machines (SVMs). Nonlinearity in discriminating each speaker is implicitly handled in this space. While SVMs maximize the margin between two classes of data, dPLRMs maximize a penalized likelihood of a logistic regression model for multi-class discrimination. dPLRMs provide a probability estimate for each identification decision. We show that the performance of dPLRMs is competitive with that of SVMs through text-independent speaker identification experiments in which speech data recorded by 10 male speakers in four sessions are analyzed.


1. Introduction

Speaker recognition techniques are widely applied not only to access control for information service systems but also to such problems as speaker detection in speech dialogue and speaker indexing in large audio archives. The demand for higher-accuracy techniques has been increasing. The conventional method for text-independent speaker recognition is based on the Gaussian mixture model (GMM) [1]. In this approach, utterance variation is well captured by a mixture of a well-chosen number of Gaussian distributions. However, the maximum likelihood (ML) estimates of the mixture distribution obtained with the expectation-maximization algorithm tend to be unreliable, especially when the amount of training data is relatively small. In such a case, an additional fine-tuning process for the estimated mixture is needed to gain higher discriminative power [2]. SVMs [3] have been successfully applied to various pattern recognition problems: handwritten digit recognition, face detection, text categorization and speaker recognition [4-8]. Recently, Tanabe [9,10] proposed dPLRMs based on the penalized logistic regression model with a specific penalty term for bringing about the induction-generalization capacity of the machine. The model yields a certain duality which leads intrinsically to kernel regressors, as is the case with the quadratic programming models in SVMs; hence the "dual" in dPLRMs. We applied dPLRMs to speaker identification problems and showed that the dPLRM-based method performed better than the conventional GMM-based method in our text-independent experiments with 10 male speakers [11]. We list several distinctions between dPLRMs and SVMs in Table 1.

Table 1. Distinctions between dPLRMs and SVMs

dPLRM:
- originally designed for multi-class discrimination and can handle multi-class situations at once.
- provides a probabilistic estimate.
- forms discrimination boundaries depending on the whole set of training sample vectors.
- needs to solve a simple matrix nonlinear equation.

SVM:
- designed for two-class discrimination and needs special techniques for multi-class discrimination, such as a pairwise coupling method.
- gives a confidence index ('margin').
- forms a discrimination boundary in the middle of the margin formed by the support vectors.
- needs to solve a series of inequality-constrained maximization problems for multi-class situations.

In this paper, we compare dPLRMs with SVMs in text-independent speaker recognition experiments. In section 2, we briefly sketch dPLRMs. In section 3, we introduce our speaker identification procedure. In section 4, we show the speaker identification experimental results. In section 5, we demonstrate its significance as a probabilistic identifier.

2. Dual penalized logistic regression machine

Let $x_j$ be a column vector of size $n$ and let $c_j$ take a value in the finite set $\{1, 2, \ldots, K\}$ of classes. The learning machine dPLRM is fed a finite number of training data $\{(x_j, c_j)\}_{j=1,\ldots,N}$ and produces a conditional multinomial distribution $p^*(x)$ of $c$ given $x \in R^n$, where $p^*(x)$ is a predictive probability vector whose $k$-th element $p_k^*(x)$ indicates the probability of $c$ taking the value $k$.

For mathematical convenience, we code the class data $c_j$ by the unit column vector $e_{c_j}$, where $e_k \equiv (0, \ldots, 1, \ldots, 0)^t$ denotes the $k$-th unit column vector of size $K$, and define a $K \times N$ constant matrix $Y$ by

$Y \equiv [y_1; \ldots; y_N] \equiv [e_{c_1}; \ldots; e_{c_N}]$    (1)

whose $j$-th column vector $y_j \equiv e_{c_j}$ indicates the class to which the data $x_j$ is attached.

While an SVM determines a single-valued dichotomous discriminative function

$f(x) \equiv v\,k(x)$    (2)

where $v$ is a row vector of size $N$, we introduce a multi-valued polychotomous function

$F(x) \equiv V\,k(x)$    (3)

mapping $R^n$ into $R^K$, where $V$ is a $K \times N$ parameter matrix which is to be estimated by using the training data set $\{(x_j, c_j)\}_{j=1,\ldots,N}$, $k(x)$ is a map from $R^n$ into $R^N$ defined by

$k(x) \equiv (K(x_1, x), \ldots, K(x_N, x))^t,$    (4)

and $K(x, x')$ is a certain positive definite kernel function. Then we define a model for the multinomial probabilistic predictor $p(x)$ by

$p(x) \equiv \hat{p}(F(x)) \equiv (\hat{p}_1(F(x)), \ldots, \hat{p}_K(F(x)))^t,$    (5)

where

$\hat{p}_k(F(x)) \equiv \frac{\exp(F_k(x))}{\sum_{i=1}^{K} \exp(F_i(x))}$

is the logistic transform. Under this model assumption, the negative-log-likelihood function $L(V)$ for $p(x)$ is given by

$L(V) \equiv -\sum_{j=1}^{N} \log(p_{c_j}(x_j)) = -\sum_{j=1}^{N} \log(\hat{p}_{c_j}(V\,k(x_j))),$    (6)

which is a convex function (see [9,10]). This objective function $L(V)$ is of a discriminative nature, and if the kernel function is appropriately chosen, the map $F(x)$ can represent a wide variety of functions, so that the resulting predictive probability $p(x)$ can be expected to be close to reality. A predictive vector $p^*(x)$ could be obtained by putting $p^*(x) = \hat{p}(V^{**}k(x))$, where $V^{**}$ is the ML estimate which minimizes the function $L(V)$ with respect to $V$.
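To make (5) and (6) concrete, the following toy sketch evaluates the logistic transform column-wise and accumulates the negative log-likelihood; the values of $F(x_j)$ and the 0-indexed class labels are illustrative assumptions, not taken from the paper.

```python
# Toy illustration of the logistic transform (5) and the negative
# log-likelihood (6). F values are arbitrary; K = 3 classes, N = 2 data.
import numpy as np

F = np.array([[2.0, 0.1],   # columns are F(x_1), F(x_2)
              [0.5, 1.2],
              [0.1, 0.3]])
c = np.array([0, 1])        # true classes c_1, c_2 (0-indexed here)

E = np.exp(F - F.max(axis=0))                # subtract column max for stability
P = E / E.sum(axis=0)                        # P[k, j] = p_hat_k(F(x_j)), eq. (5)
L = -np.log(P[c, np.arange(len(c))]).sum()   # L(V), eq. (6)
print(P.round(3), L)
```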

However, over-learning problems could occur with $V^{**}$ when the number of training data is limited. In order to deal with this problem, a penalty term is introduced and the negative-log-penalized-likelihood

$PL(V) \equiv L(V) + \frac{\delta}{2}\,\|\Lambda^{1/2} V K^{1/2}\|_F^2$    (7)

is minimized to estimate $V$, where $\|\cdot\|_F$ is the Frobenius norm. The penalty term is intended to reduce the effective freedom of the variable $V$. The matrix $\Lambda$ is a $K \times K$ positive definite matrix. A frequent choice of $\Lambda$ is given by

$\Lambda = \frac{1}{N} Y Y^t$    (8)

which equilibrates a possible imbalance of classes in the training data. The matrix $K$ is the $N \times N$ constant matrix given by

$K = [K(x_i, x_j)]_{i,j=1,\ldots,N}.$    (9)

The $\delta$ is a regularization parameter and can be determined by the empirical Bayes method.

Due to the introduction of the specific quadratic penalty in (7), the minimizer $V^*$ of $PL(V)$ is a solution of the neat matrix equation

$\nabla PL \equiv (P(V) - Y + \delta \Lambda V)K = 0,$    (10)

where $P(V)$ is a $K \times N$ matrix whose $j$-th column vector is the probability vector $p(x_j) \equiv \hat{p}(V\,k(x_j))$. The matrix $Y$ is given in (1). The minimizer $V^*$, which gives the probabilistic predictor $p^*(x) \equiv \hat{p}(V^*k(x))$, is iteratively computed by the following algorithm.

Algorithm: Starting with an arbitrary $K \times N$ matrix $V^0$, we generate a sequence $\{V^i\}$ of matrices by

$V^{i+1} = V^i - \alpha_i \Delta V^i, \quad i = 0, 1, 2, \ldots$    (11)

where $\Delta V^i$ is the solution of the linear matrix equation

$\sum_{j=1}^{N} ([p^i(x_j)] - p^i(x_j)(p^i(x_j))^t)\,\Delta V^i\,(k(x_j)(k(x_j))^t) + \delta \Lambda\,\Delta V^i K = (P(V^i) - Y + \delta \Lambda V^i)K,$    (12)

in which $p^i(x_j) \equiv \hat{p}(V^i k(x_j))$ and $[p]$ denotes the diagonal matrix formed from the vector $p$. The detailed algorithm for estimation is given in [9-12]. Note that we only need to solve an unconstrained optimization of the strictly convex function $PL(V)$ or, equivalently, to solve the simple matrix nonlinear equation (10).
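As a numerical illustration, the following sketch fits $V$ by plain gradient descent on $\nabla PL$ from (10), with the simplifying assumption $\Lambda = I$ instead of (8), in place of the Newton-type update (11)-(12); the function and variable names, and the step scaling, are ours, not the paper's.

```python
# A minimal sketch of dPLRM training under simplifying assumptions:
# Lambda = I (instead of (8)) and plain gradient descent on nabla-PL (10)
# in place of the Newton-type update (11)-(12).
import numpy as np

def polynomial_kernel(X1, X2, s=5):
    """Polynomial kernel (13): K(x, x') = (x^t x' + 1)^s."""
    return (X1 @ X2.T + 1.0) ** s

def softmax_columns(F):
    """Column-wise logistic transform (5)."""
    E = np.exp(F - F.max(axis=0, keepdims=True))   # stabilized exponentials
    return E / E.sum(axis=0, keepdims=True)

def train_dplrm(X, c, n_classes, s=5, delta=7.7e-5, step=1.0, iters=100):
    """X: (N, n) training vectors; c: length-N labels in {0, ..., n_classes-1}."""
    N = X.shape[0]
    Kmat = polynomial_kernel(X, X, s)              # Gram matrix (9)
    Y = np.zeros((n_classes, N))                   # indicator matrix (1)
    Y[c, np.arange(N)] = 1.0
    V = np.zeros((n_classes, N))
    for _ in range(iters):
        P = softmax_columns(V @ Kmat)              # P(V) in (10)
        grad = (P - Y + delta * V) @ Kmat          # nabla-PL with Lambda = I
        V -= step * grad / N                       # step scaling is our choice
    return V

def predict_proba(V, X_train, x, s=5):
    """Predictive probability vector p*(x) = p_hat(V k(x))."""
    k_x = polynomial_kernel(X_train, x[None, :], s)   # k(x) from (4), shape (N, 1)
    return softmax_columns(V @ k_x)[:, 0]
```

Rescaling the feature values into a small interval, as done in Section 4 below, keeps the Gram entries well-conditioned for this kind of computation.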

3. dPLRM-based speaker identification

The training data set $\{(x_j, c_j)\}_{j=1,\ldots,N}$, which covers all the speakers' data, is collected. The class data $\{c_j\}$ are converted into the matrix $Y$. The key matrix $V^*$ is estimated by dPLRM. Finally, the predictor $p^*(x)$ is obtained. For testing, the predictive probability $p^*(x'_i)$ is calculated for each data point $x'_i$. Then we sum up the log-probability for each class over the samples and choose the class attaining the maximum as the speaker who uttered the testing data.
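A sketch of this decision rule follows, reusing the hypothetical predict_proba function from the Section 2 sketch.

```python
# Frame-level identification: sum log-probabilities over test frames and
# pick the speaker attaining the maximum (reuses predict_proba above).
import numpy as np

def identify_speaker(V, X_train, X_test, s=5, eps=1e-12):
    scores = np.zeros(V.shape[0])             # one score per speaker class
    for x in X_test:                          # one feature vector per frame
        p = predict_proba(V, X_train, x, s)   # p*(x'_i) over candidate speakers
        scores += np.log(p + eps)             # accumulate log-probabilities
    return int(np.argmax(scores))             # identified speaker index
```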

4. Experiments and results

The performance of our method is compared with that of the SVM-based method in text-independent speaker identification experiments.

4.1. Data and system description

The data for training and testing have been collected from 10 male speakers. Each speaker utters several sentences and words. The duration of each sentence utterance is approximately four seconds, and that of each word utterance one second. Although the texts are common to all speakers, the sentences used for testing are different from those used for training. The utterances of the same set of sentences and words were recorded for testing in three sessions (T1 to T3) over six months and sampled at 16 kHz. A feature vector of 26 components, consisting of 12 mel-frequency cepstral coefficients plus normalized log energy and their first derivatives, is derived once every 10 ms over a 25.6 ms Hamming-windowed speech segment. For training, the utterances of three sentences recorded three months earlier than Session T1 (12 seconds of speech in total) are used for estimating $V^*$. The following polynomial function is used as the kernel function:

$K(x, x') = (x^t x' + 1)^s$    (13)
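For concreteness, a hedged front-end sketch follows; the paper does not name its feature-extraction tool (HTK [13] would be a natural choice), so the librosa calls, the use of c0 as a stand-in for normalized log energy, and the per-utterance rescaling are our assumptions.

```python
# Approximate sketch of the 26-dimensional front-end: 12 MFCCs plus a
# log-energy term and their first derivatives, 25.6 ms Hamming window,
# 10 ms shift, 16 kHz sampling.
import numpy as np
import librosa

def extract_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,     # c0 + 12 cepstra
                                n_fft=512, win_length=410, # ~25.6 ms at 16 kHz
                                hop_length=160,            # 10 ms shift
                                window='hamming')
    delta = librosa.feature.delta(mfcc)                    # first derivatives
    feats = np.vstack([mfcc, delta]).T                     # (n_frames, 26)
    # Affine rescaling into [-0.5, 0.5], as described in the next paragraph
    lo, hi = feats.min(axis=0), feats.max(axis=0)
    return (feats - lo) / (hi - lo) - 0.5
```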

The parameters $\alpha$ and $\delta$ in dPLRM are experimentally set to 1.0 in (11) and 7.7e-5 in (7), respectively. In order to execute effective computation with 32-bit precision, the data are affinely transformed so that all the elements of the feature vectors lie in the interval [-0.5, 0.5]. The estimation step (11) is terminated at the 10th iteration. Then, the utterances of the five sentences and the five words in Sessions T1, T2 and T3 are individually tested. The number of cases is 150 for each of the sentence and word data. In the SVM-based method, the SVMlight software is used [14] with the polynomial kernel function (13). For each speaker, a one-versus-rest classifier is trained, and the speaker who attains the largest positive confidence index averaged over the test speech is regarded as the speaker who uttered it.

4.2. Results

Table 2 lists the numbers of sentences and words identified correctly for each session and the accuracy rates with confidence intervals at a confidence level of 90%, averaged over Sessions T1, T2 and T3. For both the dPLRM- and SVM-based methods, $s$ = 5, 7, 9 in (13) were tried. The confidence intervals are calculated under the assumption that the accuracy rates follow the binomial distribution. Our method performed better than the SVM-based method for both sentence and word speech. dPLRMs can capture speaker characteristics with a small amount of data and effectively discriminate each speaker.

Table 2. The numbers of correctly identified cases for each session and the accuracy rates (with the confidence intervals at a confidence level of 90%) averaged over Sessions T1, T2 and T3. ([ ] is the power of the polynomial function.)

Sentence speech
Method       T1   T2   T3   Average (%)
dPLRM [5]    50   46   50   97.3 (95.3, 99.3)
dPLRM [7]    48   48   50   97.3 (95.3, 99.3)
dPLRM [9]    50   48   50   98.7 (97.3, 100.0)
SVM [5]      48   47   50   96.7 (94.0, 98.7)
SVM [7]      48   47   50   96.7 (94.0, 98.7)
SVM [9]      48   47   50   96.7 (94.0, 98.7)

Word speech
Method       T1   T2   T3   Average (%)
dPLRM [5]    40   44   43   84.7 (80.0, 89.3)
dPLRM [7]    41   45   43   86.0 (81.3, 90.7)
dPLRM [9]    46   42   45   88.7 (84.7, 92.7)
SVM [5]      42   43   43   85.3 (80.7, 90.0)
SVM [7]      42   43   44   86.0 (81.3, 90.7)
SVM [9]      42   44   44   86.7 (82.7, 91.3)
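The paper does not state the exact interval construction beyond the binomial assumption; the normal-approximation sketch below therefore reproduces the reported intervals only approximately.

```python
# 90% confidence interval for an accuracy rate under a binomial assumption,
# using the normal approximation (z = 1.645); the paper's exact construction
# is not specified, so reported intervals are matched only approximately.
import math

def accuracy_ci90(correct, total, z=1.645):
    p = correct / total
    se = math.sqrt(p * (1.0 - p) / total)
    return 100 * p, 100 * max(p - z * se, 0.0), 100 * min(p + z * se, 1.0)

# dPLRM [9], sentence speech: 50 + 48 + 50 = 148 of 150 correct
print(accuracy_ci90(148, 150))   # ~ (98.7, 97.1, 100.0); Table 2 reports (97.3, 100.0)
```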

5. Discussion

dPLRMs give a frame-wise probability estimate and SVMs a frame-wise confidence index. Fig. 1 shows the waveform of the word "MOICHIDO" spoken by the identified speaker, and the frame-wise dPLRM probability estimates and SVM confidence indices for the 10 candidate speakers. The graph at the bottom shows the frame-wise entropy

$H_i = -\sum_{k=1}^{K} p_k^*(x_i) \log p_k^*(x_i)$    (14)

calculated from the dPLRM probability estimates. It can be used as a measure of the discriminatory power of the frame data. For the identified speaker, the segments from frame 24 through 50 and from frame 91 through 109 have higher probability estimates and lower entropy values for dPLRM, and can be regarded as key parts which substantially contain the discriminative information. The SVM also shows the same tendency, with higher positive confidence indices for these segments. With dPLRMs, the probability estimates are small for all speakers and the entropy values are large for the beginning and ending frames, which do not contain speech information. Thus, we can use dPLRMs as a probabilistic identifier of speech segments containing the discriminative information.

Figure 1. Comparison of dPLRM probability estimates (left side) and SVM confidence indices (right side): the waveform of the word "MOICHIDO" spoken by the identified speaker (top row), the frame-wise dPLRM probability estimates / SVM confidence indices for the speaker and the other 9 speakers (2nd to 11th rows), and the entropy calculated from the dPLRM probability estimates (12th row). The x-axis indicates frame numbers.
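The entropy trace at the bottom of Figure 1 can be computed directly from the frame-wise probability estimates; a minimal sketch, with array names of our choosing:

```python
# Frame-wise entropy (14) from dPLRM probability estimates; low-entropy frames
# are candidates for the key discriminative segments discussed above.
import numpy as np

def frame_entropy(P, eps=1e-12):
    """P: (K, T) matrix whose column i is p*(x_i) over the K speakers."""
    return -(P * np.log(P + eps)).sum(axis=0)   # H_i for each of the T frames

# e.g. flag potentially discriminative frames with a (hypothetical) threshold tau:
# keep = frame_entropy(P) < tau
```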

6. Conclusions

This paper investigated a speaker identification method based on dPLRMs. We conducted experiments with training speech of 12-second duration and showed that the dPLRM-based method was competitive with the conventional SVM-based method. It was also shown that dPLRMs can be used as a probabilistic identifier of speech segments carrying the discriminative information of speakers. The evaluation of dPLRMs on a larger dataset is treated in a forthcoming paper. The probability estimates of dPLRMs could also be used in speaker verification and end-point detection, with modification of the probabilistic model (5) for each objective. These are left for future study.

7. References

[1] D. A. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, 17, pp. 91-108, 1995.
[2] G. R. Doddington, M. A. Przybocki, A. F. Martin and D. A. Reynolds, "The NIST Speaker Recognition Evaluation - Overview, Methodology, Systems, Results, Perspective," Speech Communication, 31, pp. 225-254, 2000.
[3] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
[4] M. Schmidt and H. Gish, "Speaker Identification via Support Vector Classifiers," Proc. ICASSP, Atlanta, 1996.
[5] M. Schmidt, "Identifying Speakers With Support Vector Networks," Proc. Interface, Sydney, 1996.


[6] S. Fine, J. Navratil and R. A. Gopinath, "A Hybrid GMM/SVM Approach to Speaker Identification," Proc. ICASSP, Salt Lake City, 2001.
[7] W. M. Campbell, "A Sequence Kernel and its Application to Speaker Recognition," NIPS-14, pp. 1157-1163, 2001.
[8] V. Wan and S. Renals, "SVMSVM: Support Vector Machine Speaker Verification Methodology," Proc. ICASSP, Hong Kong, 2003.
[9] K. Tanabe, "Penalized Logistic Regression Machines: New methods for statistical prediction 1," ISM Cooperative Research Report 143, pp. 163-194, 2001.

[10] K. Tanabe, "Penalized Logistic Regression Machines: New methods for statistical prediction 2," Proc. IBIS, Tokyo, pp. 71-76, 2001.
[11] T. Matsui and K. Tanabe, "Speaker Identification with dual Penalized Logistic Regression Machine," Proc. Odyssey, Toledo, 2004.
[12] K. Tanabe, "Penalized Logistic Regression Machines and Related Linear Numerical Algebra," KOKYUROKU 1320, Research Institute for Mathematical Sciences, Kyoto University, pp. 239-249, 2003.
[13] http://htk.eng.cam.ac.uk, the hidden Markov model toolkit (HTK).
[14] http://www.cs.cornell.edu/People/tj/svm_light/, the support vector machine software SVMlight.
