2010 Asia Pacific Conference on Circuits and Systems (APCCAS 2010) 6–9 December 2010, Kuala Lumpur, Malaysia

The Affine Transform and Feature Fusion for Robust Speaker Identification in the Presence of Speech Coding Distortion

Robert W. Mudrowsky and Ravi P. Ramachandran
Department of Electrical and Computer Engineering, Rowan University
[email protected], [email protected]

Sachin S. Shetty
Department of Electrical and Computer Engineering, Tennessee State University
[email protected]


Abstract—For security in wireless, voice over IP and cellular telephony applications, there is an emerging need for speaker identification (SID) systems to be robust to speech coding distortion. This paper examines the robustness issue for the 8 kilobits/second ITU-T G.729 codec. The SID system is trained on clean speech and tested on the decoded speech of the G.729 codec. To mitigate the performance loss due to mismatched training and testing conditions, five features are considered and two approaches are used. Four of the five features are based on linear prediction analysis and the other is the mel frequency cepstrum. The first method is feature compensation based on the affine transform, which maps the features from the test scenario to the train scenario. The second method is feature fusion based on the arithmetic combination of probabilities generated by the vector quantizer classifier. Combining the affine transform with the fusion of four features gives the best identification success rate (ISR) of 83.2%. The best performing single feature achieves an ISR of 70.5% without the affine transform and 77.4% with the affine transform.

I. INTRODUCTION

This paper deals with enhancing the robustness of text-independent, closed set speaker identification (SID) systems [1]. Closed set SID is the case when the speaker is known a priori to belong to a set of M speakers. In text-independent systems, there is no restriction on the sentence or phrase to be spoken. Speaker identification is a pattern recognition problem consisting of feature extraction and classification. The task is highly successful if the conditions for training and testing are the same (known as matched conditions) [2]. Studies have shown that SID performance degrades when the training and testing conditions are not the same (known as mismatched conditions) [2]. The focus of this paper is the case when the SID system is trained on clean speech and tested on lossy compressed speech produced by the 8 kilobits/second ITU-T G.729 codec [3]. The G.729 is a prevalent coding standard in wireless and voice over IP applications. For achieving remote access security using voice over IP and cellular telephony, SID systems that are robust to G.729 distortion are therefore of considerable importance. In previous work, the focus is on the mel frequency cepstrum (MFCC) and its first and second order derivatives as the robust features [4][5][6].

The features are either derived from the bitstream [4] or from the decoded speech [5], with the performance being better for the latter approach. The approach in [5] also accomplishes waveform compensation based on the quantized linear prediction (LP) information to further augment performance. In this paper, feature compensation using the affine transform and feature fusion using two techniques (decision level fusion and probability fusion) are used to accomplish robustness to mismatched training (clean speech) and testing (G.729 decoded speech) conditions. The overall system consists of four components, namely, (1) feature extraction for ensuring speaker discrimination, (2) feature compensation using the affine transform that maps the test feature vectors to the space reflecting the training condition, (3) a vector quantizer (VQ) classifier and decision logic for identifying the speaker and (4) feature fusion to get a more robust identification. The five features considered [6][7][8] are the linear predictive cepstrum (CEP), the adaptive component weighted cepstrum (ACW), the postfilter cepstrum (PFL), the mel frequency cepstrum (MFCC) and the line spectral frequencies (LSF). Both the affine transform and fusion are crucial to improving the performance over the case of using only a single feature.

II. FEATURE EXTRACTION

Linear predictive analysis results in a stable all-pole model 1/A(z) of order p, where

A(z) = 1 - \sum_{n=1}^{p} a(n) z^{-n}    (1)

The autocorrelation method of LP analysis gives rise to the predictor coefficients a(n) for n = 1 to p. The LSF features lsf(n) are the angles (between 0 and \pi) of the alternating unit circle roots of F(z) and G(z) [6], where

F(z) = A(z) + z^{-(p+1)} A(z^{-1})
G(z) = A(z) - z^{-(p+1)} A(z^{-1})    (2)
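
As an illustration of Eq. (2), the minimal Python sketch below (our construction, not code from the paper) obtains the LSFs as the angles of the unit circle roots of F(z) and G(z) using numpy root finding; a production front end would typically use a dedicated LPC-to-LSF routine.

```python
import numpy as np

def lsf_from_lpc(a):
    """LSFs from predictor coefficients a(1..p), where A(z) = 1 - sum a(n) z^{-n}."""
    a = np.asarray(a, dtype=float)
    # Coefficients of A(z) in powers of z^{-1}, padded with a trailing zero.
    A = np.concatenate(([1.0], -a, [0.0]))
    # z^{-(p+1)} A(z^{-1}) is simply the padded coefficient vector reversed.
    F = A + A[::-1]                      # symmetric polynomial of Eq. (2)
    G = A - A[::-1]                      # antisymmetric polynomial of Eq. (2)
    angles = np.concatenate((np.angle(np.roots(F)), np.angle(np.roots(G))))
    # Keep the p angles strictly between 0 and pi (drop the trivial roots at z = +/-1).
    return np.sort(angles[(angles > 1e-8) & (angles < np.pi - 1e-8)])
```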

The predictor coefficients a(n) are converted to the LP cepstrum clp(n) (n ≥ 1) by an efficient recursive relation [6]

clp(n) = a(n) + \sum_{i=1}^{n-1} \left( \frac{i}{n} \right) clp(i) \, a(n-i)    (3)
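
The recursion in Eq. (3) translates directly into code. The following Python sketch (illustrative only; the function name and interface are our own) computes the first n_coeff LP cepstral coefficients from the predictor coefficients, with a(n) taken as zero beyond the model order.

```python
import numpy as np

def lp_cepstrum(a, n_coeff=None):
    """LP cepstrum clp(1..n_coeff) from predictor coefficients a(1..p) via Eq. (3)."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    n_coeff = n_coeff or p
    clp = np.zeros(n_coeff)
    for n in range(1, n_coeff + 1):
        acc = a[n - 1] if n <= p else 0.0          # a(n), zero beyond the model order
        for i in range(1, n):
            if n - i <= p:
                acc += (i / n) * clp[i - 1] * a[n - i - 1]
        clp[n - 1] = acc
    return clp
```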

Since clp(n) is of infinite duration, the CEP feature vector of dimension p consists of the components clp(1) to clp(p), which are the most significant due to the decay of the sequence with increasing n.

The first step in developing the ACW cepstrum [7] is to perform a partial fraction expansion of the LP function 1/A(z) to get

\frac{1}{A(z)} = \sum_{n=1}^{p} \frac{r_n}{1 - q_n z^{-1}}    (4)

where q_n are the poles of A(z) and r_n are the corresponding residues. The variations in r_n are removed by forcing r_n = 1 for every n. Hence, the resulting transfer function is a pole-zero type of the form

\frac{N(z)}{A(z)} = \sum_{n=1}^{p} \frac{1}{1 - q_n z^{-1}} = \frac{1}{A(z)} \sum_{n=1}^{p} \prod_{\substack{i=1 \\ i \neq n}}^{p} (1 - q_i z^{-1}) = \frac{p \left( 1 - \sum_{n=1}^{p-1} b(n) z^{-n} \right)}{1 - \sum_{n=1}^{p} a(n) z^{-n}}    (5)

Applying the recursion in Eq. (3) to b(n) and a(n) results in two cepstrum sequences cb(n) and clp(n), respectively. The ACW cepstrum is cacw(n) = clp(n) - cb(n) [7].

The postfilter is obtained from A(z) and its transfer function is given by

H_{pfl}(z) = \frac{A(z/\beta)}{A(z/\alpha)}    (6)

where 0 < \beta < \alpha \le 1. The cepstrum of H_{pfl}(z) is the PFL cepstrum, which is equivalent to weighting the LP cepstrum as cpfl(n) = clp(n)[\alpha^n - \beta^n] [8]. The ACW feature cacw(n) and the PFL feature cpfl(n) are taken from n = 1 to p.

The success of the MFCC is due to the perceptually based filter bank processing of the Fourier transform of the speech followed by cepstral analysis using the DCT [6]. The magnitude of the Fourier transform of the speech is logarithmically smoothed using a mel spaced filter bank. The DCT of the filter bank outputs yields the MFCC (denoted by cmfcc(n) for n = 1 to p), which forms an essentially decorrelated and compact representation of the speech spectrum.
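
For concreteness, the sketch below (our construction; names and the numpy-based pole factoring are assumptions, with the Eq. (3) recursion inlined as a helper) derives the ACW and PFL cepstra from the predictor coefficients. It builds the numerator N(z) of Eq. (5) from the poles of A(z), normalizes it to read off b(n), and applies the PFL weighting of Eq. (6).

```python
import numpy as np

def acw_pfl_cepstra(a, alpha=1.0, beta=0.9):
    """ACW and PFL cepstra of dimension p from predictor coefficients a(1..p)."""
    def cepstrum(c, n_out):
        # Recursion of Eq. (3) applied to an arbitrary coefficient set c(1..len(c)).
        out = np.zeros(n_out)
        for n in range(1, n_out + 1):
            acc = c[n - 1] if n <= len(c) else 0.0
            for i in range(1, n):
                if n - i <= len(c):
                    acc += (i / n) * out[i - 1] * c[n - i - 1]
            out[n - 1] = acc
        return out

    a = np.asarray(a, dtype=float)
    p = len(a)
    q = np.roots(np.concatenate(([1.0], -a)))      # poles q_n of 1/A(z)
    # N(z) = sum_n prod_{i != n} (1 - q_i z^{-1}), coefficients in powers of z^{-1}.
    N = sum(np.poly(np.delete(q, n)) for n in range(p))
    N = np.real(N) / np.real(N[0])                 # N(z)/p = 1 - sum b(n) z^{-n}
    b = -N[1:]                                     # b(1), ..., b(p-1)
    clp, cb = cepstrum(a, p), cepstrum(b, p)
    cacw = clp - cb                                # ACW cepstrum
    n = np.arange(1, p + 1)
    cpfl = clp * (alpha ** n - beta ** n)          # PFL weighting from Eq. (6)
    return cacw, cpfl
```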

III. AFFINE TRANSFORM

The affine transform accomplishes feature compensation by mapping a feature vector x derived from test speech to a feature vector y in the region of the p-dimensional space occupied by the training vectors. This forces a better match between training and testing conditions and, in effect, compensates for the distortion of the test speech (in this case, the G.729 codec). Figure 1 gives a pictorial representation.

Fig. 1. Pictorial view of the affine transform

The transform relating x and y is given by Eq. (7) as

y = Ax + b    (7)

where A is a p by p matrix and y, x and b are column vectors of dimension p. Expanding Eq. (7) gives

\begin{bmatrix} y(1) \\ y(2) \\ y(3) \\ \vdots \\ y(p) \end{bmatrix} =
\begin{bmatrix} a_1^T \\ a_2^T \\ a_3^T \\ \vdots \\ a_p^T \end{bmatrix}
\begin{bmatrix} x(1) \\ x(2) \\ x(3) \\ \vdots \\ x(p) \end{bmatrix} +
\begin{bmatrix} b(1) \\ b(2) \\ b(3) \\ \vdots \\ b(p) \end{bmatrix}    (8)

where a_m^T is the row vector corresponding to the mth row of A. The affine transform parameters A and b are determined from the training data only. Let y^{(i)} be the feature vector for the ith frame of the training speech utterance. Let x^{(i)} be the feature vector for the ith frame of the training speech utterance passed through the distortion encountered during testing (in this case, the clean utterance is compressed and decoded by the G.729 codec). By using a number of training speech utterances, N sets of vectors are collected, namely, y^{(i)} and x^{(i)} for i = 1 to N. A squared error function is formulated as

E(m) = \sum_{i=1}^{N} \left[ y^{(i)}(m) - a_m^T x^{(i)} - b(m) \right]^2    (9)

where a_m^T is the mth row of A and y^{(i)}(m) and b(m) are the mth components of y^{(i)} and b, respectively. Minimization of E(m) with respect to a_m^T and b(m) results in the system of equations

\begin{bmatrix} \sum_{i=1}^{N} x^{(i)} x^{(i)T} & \sum_{i=1}^{N} x^{(i)} \\ \sum_{i=1}^{N} x^{(i)T} & N \end{bmatrix}
\begin{bmatrix} a_m \\ b(m) \end{bmatrix} =
\begin{bmatrix} \sum_{i=1}^{N} y^{(i)}(m) x^{(i)} \\ \sum_{i=1}^{N} y^{(i)}(m) \end{bmatrix}    (10)

The function E(m) is minimized for m = 1 to p. Hence, p different systems of equations of dimension (p + 1), as described by Eq. (10), are solved. The left hand square matrix of Eq. (10) is independent of m and hence needs to be computed only once.
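
Since the left hand matrix of Eq. (10) is shared by all p systems, the estimation can also be written as a single linear least squares problem over frame-paired features. The Python sketch below (our illustration; variable names are assumptions) estimates A and b from paired decoded/clean training frames and then applies Eq. (7) to test vectors.

```python
import numpy as np

def fit_affine(X, Y):
    """Estimate A (p x p) and b (p,) of Eq. (7) from paired frames.

    X : (N, p) feature vectors of the G.729 decoded training speech.
    Y : (N, p) feature vectors of the corresponding clean training speech.
    """
    N = X.shape[0]
    X1 = np.hstack([X, np.ones((N, 1))])             # append 1 to absorb b
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)       # (p+1, p) solution
    A = W[:-1, :].T                                  # rows a_m^T of A
    b = W[-1, :]                                     # offset vector b
    return A, b

def apply_affine(A, b, X_test):
    """Map test vectors toward the training (clean) space: y = A x + b."""
    return X_test @ A.T + b
```

The normal equations of this least squares problem, taken one output component m at a time, are exactly the systems of Eq. (10).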

IV. VQ CLASSIFIER AND DECISION LOGIC

A vector quantizer (VQ) classifier [8] is used to render a decision as to the identity of a speaker. For training, the speech data for each speaker is converted to a training set of feature vectors, which is then used to design a VQ codebook based on the Linde-Buzo-Gray (LBG) algorithm [9]. No affine transform is applied to the training feature vectors. There will be M codebooks, one pertaining to each of the M speakers.

The VQ system for processing a test speech utterance is shown in Figure 2. A test utterance from one of the speakers is converted to a set of test feature vectors. The affine transform is applied to each test feature vector. Consider a particular affine transformed test feature vector. This vector is quantized by each of the M VQ codebooks. For a given codebook, the quantized vector is the codevector closest to the affine transformed test feature vector in terms of the squared Euclidean distance. There are M different distances recorded, one for each codebook. This process is repeated for every affine transformed test feature vector. The distances are accumulated over the entire set of feature vectors such that d(i) is the accumulated distance for codebook i. The codebook that renders the smallest accumulated distance identifies the speaker. When many utterances are tested, the identification success rate (ISR) is the number of utterances for which the speaker is identified correctly divided by the total number of utterances tested.

Fig. 2. Vector quantizer block diagram
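
The scoring procedure of Fig. 2 can be sketched as follows (illustrative Python; the codebooks are assumed to come from an LBG design step that is not shown here).

```python
import numpy as np

def identify_speaker(test_vectors, codebooks):
    """Return the index of the codebook with the smallest accumulated distance.

    test_vectors : (T, p) affine transformed test feature vectors.
    codebooks    : list of M arrays, each (64, p), one VQ codebook per speaker.
    """
    d = np.zeros(len(codebooks))
    for i, cb in enumerate(codebooks):
        # Squared Euclidean distance from every test vector to every codevector.
        dists = ((test_vectors[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        d[i] = dists.min(axis=1).sum()       # accumulate the per-frame minimum
    return int(np.argmin(d)), d              # identified speaker and d(i) values
```

The returned array d holds the accumulated distances d(i) that are converted to probabilities for fusion in the next section.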



V. FEATURE FUSION

Since five different features are used, each with a separate VQ classifier, an ensemble system [10] results, which naturally leads to the investigation of fusion. Decision level fusion is the simplest technique and involves taking a majority vote over the decisions of the different features to get a final decision. For probability fusion, the distances d(i) of the five single feature classifiers depicted in Fig. 2 are first converted to probabilities p(i) by the equations

Total = \sum_{j=1}^{M} d(j)

p(i) = \frac{Total - d(i)}{(M - 1) \, Total}    (11)

where M is the number of speakers. The probabilities satisfy \sum_{i} p(i) = 1, and p(i) represents the probability that the set of test feature vectors came from speaker i. For speaker i, the p(i) generated by a subset of or all of the five different features are added. The maximum added probability identifies the speaker.
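
A minimal sketch of Eq. (11) and the subsequent probability fusion (our naming; the per-feature distance arrays would come from the VQ scoring sketch above):

```python
import numpy as np

def distances_to_probabilities(d):
    """Eq. (11): convert accumulated distances d(1..M) to probabilities p(1..M)."""
    d = np.asarray(d, dtype=float)
    total = d.sum()
    return (total - d) / ((len(d) - 1) * total)

def probability_fusion(distance_lists):
    """Add the per-feature probabilities and identify the speaker.

    distance_lists : list of length-M distance arrays, one per feature.
    """
    fused = sum(distances_to_probabilities(d) for d in distance_lists)
    return int(np.argmax(fused)), fused
```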

VI. EXPERIMENTAL PROTOCOL

Ten sentences from each of the 38 speakers from the New England dialect of the TIMIT database are used for the experiments. The speech in this database is clean and is first downsampled from 16 kHz to 8 kHz. The speech is preemphasized by the nonrecursive filter 1 - 0.95z^{-1}. The speech is divided into frames of 30 ms duration with a 20 ms overlap. Energy thresholding is performed over all frames of an utterance to determine the relatively high energy speech frames. The feature vectors are computed only in these high energy frames. For the LP analysis, the autocorrelation method [6] is used to get a 12th order LP polynomial A(z). The LP coefficients are converted into 12 dimensional LSF, CEP, ACW and PFL feature vectors. For the PFL feature, \alpha = 1 and \beta = 0.9 (see Eq. (6)). A 12 dimensional MFCC feature vector is computed in each high energy frame.

The VQ classifier is trained using the 12 dimensional feature vectors. A separate classifier is used for each feature. The codebooks for each speaker are of size 64. For each speaker in the database, there are 10 sentences. The first five are used for training the VQ classifier (clean speech).

The remaining five sentences are individually used for testing, thereby giving 190 test cases. The test speech is compressed by the 8 kilobits/second ITU-T G.729 codec [3] and the decoded speech is used for speaker identification.

The affine transform parameters are determined from all 190 training utterances. These parameters, A and b, are different for each feature. Consider a training utterance. The feature vectors of this clean utterance are first computed. The utterance is then passed through the G.729 codec and the feature vectors of the decoded speech are computed. The vectors are matched up such that y^{(i)} and x^{(i)} correspond to the ith frame of the clean utterance and the G.729 utterance, respectively. Following the same procedure, the feature vectors of every clean utterance and its corresponding G.729 utterance are matched up. Then, the entire set of N clean and G.729 vector pairs is used to calculate the affine transform parameters A and b using Eq. (10).
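
For reference, the front end of this section (preemphasis, 30 ms frames with a 20 ms overlap, energy thresholding) might be sketched as below; the specific threshold of half the mean frame energy is our assumption, since the paper only states that the relatively high energy frames are retained.

```python
import numpy as np

def high_energy_frames(speech, fs=8000, frame_ms=30, overlap_ms=20, thresh=0.5):
    """Preemphasize, frame the speech and keep the relatively high energy frames."""
    x = np.append(speech[0], speech[1:] - 0.95 * speech[:-1])   # 1 - 0.95 z^{-1}
    flen = int(fs * frame_ms / 1000)                            # 240 samples at 8 kHz
    hop = flen - int(fs * overlap_ms / 1000)                    # 80-sample frame advance
    frames = np.array([x[i:i + flen]
                       for i in range(0, len(x) - flen + 1, hop)])
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > thresh * energy.mean()]              # assumed threshold rule
```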

VII. RESULTS

The first experiment compares the performance of the five individual features with and without the affine transform. Table I gives the identification success rate (ISR) results. The MFCC shows the greatest improvement.

TABLE I
IDENTIFICATION SUCCESS RATE (%) FOR INDIVIDUAL FEATURES

Feature   No Affine Transform   With Affine Transform
CEP       70.5                  77.4
ACW       62.1                  70.0
PFL       66.8                  74.2
LSF       63.2                  74.2
MFCC      51.6                  75.3

Decision level fusion and probability fusion using all five features and all possible subsets of the features were attempted. Table II shows the six best cases when the affine transform and decision level fusion are used. For these cases, the ISR obtained without using the affine transform is also given. The CEP, MFCC and PFL features play a very important role in fusion.

TABLE II
IDENTIFICATION SUCCESS RATE (%) FOR DECISION LEVEL FUSION

Features              No Affine Transform   With Affine Transform
All five              70.5                  81.1
CEP/PFL/LSF/MFCC      71.1                  81.1
CEP/ACW/PFL/MFCC      70.0                  80.5
CEP/PFL/MFCC          71.1                  79.5
CEP/LSF/MFCC          68.9                  79.5
PFL/LSF/MFCC          68.4                  79.5

With probability fusion, there are 18 of 26 total cases for which the ISR is greater than or equal to 80%. In fact, using all five features and all possible combinations of four features leads to an ISR greater than or equal to 80%. Table III shows the best combinations of 2, 3 and 4 features and the case of using all 5 features with the affine transform and probability fusion. Again, the ISR obtained without using the affine transform is also given. Combining the affine transform and probability fusion is the best technique. Also, probability fusion leads to a higher ISR than decision level fusion.

TABLE III
IDENTIFICATION SUCCESS RATE (%) FOR PROBABILITY FUSION

Features              No Affine Transform   With Affine Transform
All five              75.3                  81.1
CEP/ACW/PFL/MFCC      80.0                  83.2
CEP/PFL/MFCC          77.4                  82.6
CEP/PFL               74.2                  83.2

VIII. SUMMARY AND CONCLUSIONS

In a VQ based speaker identification system, the use of the affine transform and probability fusion gives the best results. It is important to use the CEP/ACW/PFL/MFCC combination, which gives the best ISR of 83.2%. The CEP is the best performing single feature, achieving an ISR of 70.5% without the affine transform and 77.4% with the affine transform.

IX. REFERENCES

[1] J. P. Campbell, "Speaker recognition: A tutorial", Proc. IEEE, vol. 85, pp. 1437–1462, September 1997.
[2] R. J. Mammone, X. Zhang and R. P. Ramachandran, "Robust speaker recognition - A feature based approach", IEEE Signal Proc. Mag., vol. 13, pp. 58–71, September 1996.
[3] ITU-T, "Recommendation G.729 - Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)", January 2007.
[4] A. Moreno-Daniel, B.-H. Juang and J. A. Nolazco-Flores, "Robustness of bit-stream based features for speaker verification", IEEE Int. Conf. on Acoustics, Speech and Signal Proc., pp. I-749–I-752, March 2005.
[5] A. McCree, "Reducing speech coding distortion for speaker identification", IEEE Int. Conf. on Spoken Language Proc., September 2006.
[6] T. F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, Prentice Hall PTR, 2002.
[7] K. T. Assaleh and R. J. Mammone, "New LP-derived features for speaker identification", IEEE Trans. on Speech and Audio Proc., vol. 2, pp. 630–638, October 1994.
[8] M. S. Zilovic, R. P. Ramachandran and R. J. Mammone, "Speaker identification based on the use of robust cepstral features obtained from pole-zero transfer functions", IEEE Trans. on Speech and Audio Proc., vol. 6, pp. 260–267, May 1998.
[9] Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design", IEEE Trans. on Comm., vol. COM-28, pp. 84–95, January 1980.
[10] R. Polikar, "Ensemble based systems in decision making", IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, 2006.
