VTLN Using Different Warping Functions for Template Matching

Maulik C. Madhavi¹, Shubham Sharma² and Hemant A. Patil¹
{maulik_madhavi, hemant_patil}@daiict.ac.in, [email protected]

¹Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar.
²Indian Institute of Science, Bangalore.

Friday, 03.07.2015

Presented At: PReMI 2015 - Warsaw University of Technology, Warsaw, Poland, June 30 - July 3, 2015


Vocal Tract Length Normalization (VTLN)

- Speech variation → variation in vocal tract length [1].
- Vocal tract length:
  - Adult female: 13 cm
  - Adult male: 17.5 cm
- Formant frequencies of female speakers are higher than those of male speakers:

  $$F_n = \frac{(2n-1)\,v}{4L}, \quad n = 1, 2, 3, \dots$$

  where $v$ is the velocity of sound and $L$ is the length of the vocal tract.
- Length of the vocal tract: $L = L_A$ for speaker A and $L = L_B$ for speaker B. Let

  $$\alpha_{AB} = \frac{L_B}{L_A} \;\Rightarrow\; F_A = \alpha_{AB}\,F_B .$$

- Speaker-pair-specific warping factor: $\alpha_{AB}$.
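As a quick numerical illustration of the uniform-tube formula, the sketch below evaluates the first three formants for the two vocal tract lengths quoted on the slide and checks the scaling between them. The speed of sound (35000 cm/s) and the helper name `formants` are assumptions introduced here, and the warping factor is taken as α_AB = L_B / L_A.

```python
def formants(length_cm, n_formants=3, v_cm_per_s=35000.0):
    """Resonances F_n = (2n - 1) v / (4 L) of a uniform tube closed at one end.

    v_cm_per_s is an assumed speed of sound (about 350 m/s).
    """
    return [(2 * n - 1) * v_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]

L_A = 13.0   # speaker A: adult female vocal tract length (cm)
L_B = 17.5   # speaker B: adult male vocal tract length (cm)

F_A = formants(L_A)
F_B = formants(L_B)
alpha_AB = L_B / L_A   # speaker-pair-specific warping factor

for n, (fa, fb) in enumerate(zip(F_A, F_B), start=1):
    print(f"F{n}: A = {fa:7.1f} Hz, B = {fb:7.1f} Hz, ratio = {fa / fb:.3f}")
```

Every formant ratio F_A / F_B equals the same factor α_AB, which is why a single scaling of the frequency axis can align the two speakers under this idealized model.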

[1] L. Lee and R. Rose, “A frequency warping approach to speaker normalization,” IEEE Trans. on Speech & Audio Process., vol. 6, no. 1, pp. 49-60, 1998.


Frequency Warping Functions (1/4)

- Limitation of speaker-pair-specific warping:
  - The warping factor must be estimated for every speaker pair.
  - This motivates universal frequency warping.
- Universal non-linear warping:
  - Log warping
  - Piecewise-log warping
  - Mel scale warping
  - Bark scale warping
  - ERB (Equivalent Rectangular Bandwidth) warping


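Frequency Warping Functions (2/4)

- Log-warping: replace $\omega$ by $e^{f}$ in

  $$F_A(\omega) = F_B(\alpha_{AB}\,\omega)$$

  $$F_A(\omega) = F_A(e^{f}) = F_B(\alpha_{AB}\,\omega) = F_B(\alpha_{AB}\,e^{f}) = F_B(e^{f + \log\alpha_{AB}})$$

  $$\Rightarrow\; f_A(f) = f_B(f + \log\alpha_{AB})$$

- Using the shift-invariance property of the Fourier transform:

  $$Z_A(\Omega) = \mathrm{FT}\{f_A(f)\} = \mathrm{FT}\{f_B(f + \log\alpha_{AB})\} = \mathrm{FT}\{f_B(f)\}\,e^{j\Omega\log\alpha_{AB}} = Z_B(\Omega)\,e^{j\Omega\log\alpha_{AB}}$$

  $$\Rightarrow\; |Z_A(\Omega)| = |Z_B(\Omega)|$$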
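The shift-invariance argument can be checked numerically. The sketch below is an idealized illustration only: it uses a random vector as a stand-in for a log-warped spectrum and a circular shift in place of the translation by log α_AB, both of which are assumptions made for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

f_B = rng.random(64)        # hypothetical log-warped spectrum of speaker B
shift = 5                   # stands in for log(alpha_AB), in samples

# f_A(k) = f_B(k + shift), implemented as a circular shift (idealization).
f_A = np.roll(f_B, -shift)

Z_A = np.fft.fft(f_A)
Z_B = np.fft.fft(f_B)

# The shift only changes the phase; the magnitudes coincide.
same_magnitude = np.allclose(np.abs(Z_A), np.abs(Z_B))
same_transform = np.allclose(Z_A, Z_B)
print(same_magnitude, same_transform)
```

With real speech the shift is not circular and the spectra are not exact translates of each other, so the magnitudes agree only approximately.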

[1] S. Umesh, L. Cohen, N. Marinovic, D. Nelson, “Frequency-warping in speech,” 4th Int. Conf. on Spoken Lang. Process. (ICSLP), vol. 1, pp. 414-417, 1996.


Frequency Warping Functions (3/4)

- Piecewise-log warping:
  - The speaker-dependent warping factor is a function of the frequency band [1].
  - 5 frequency bands (in Hz): [100, 240), [240, 550), [550, 1280), [1280, 3000), [3000, 7000).
- Mel-warping:
  - Stevens and Volkmann [2].
  - Frequency stimuli correspond to distance along the basilar membrane of the guinea pig.

  $$f_{mel} = 2595\,\log_{10}\left(1 + \frac{f_{Hz}}{700}\right)$$

[1] S. Umesh, L. Cohen, N. Marinovic, D. Nelson, “Frequency-warping in speech,” 4th Int. Conf. on Spoken Lang. Process. (ICSLP), vol. 1, pp. 414-417, 1996.
[2] D. O'Shaughnessy, Speech Communication: Human and Machine. Universities Press, 1987.
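A minimal sketch of the Mel mapping and its inverse follows. It uses the commonly quoted constant 2595 in the O'Shaughnessy formula; the function names `hz_to_mel` and `mel_to_hz` are illustrative and not taken from the slides. The inverse is what a warped-domain sampling scheme needs to map equally spaced Mel values back to Hz.

```python
import math

def hz_to_mel(f_hz):
    """Mel value for a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)

for f in (100.0, 1000.0, 4000.0, 8000.0):
    m = hz_to_mel(f)
    print(f"{f:7.1f} Hz -> {m:8.2f} mel -> {mel_to_hz(m):7.1f} Hz")
```

By construction of the scale, 1000 Hz maps to approximately 1000 mel.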


Frequency Warping Functions (4/4)

- Bark scale warping:
  - Zwicker, Flottorp and Stevens in 1956.
  - Loudness is perceived as higher at higher frequencies [1].
  - Within a certain frequency band (critical band), loudness is constant.
  - Frequency subdivision ~ filterbank in the human cochlea.

  $$f_{Bark} = 13\,\tan^{-1}(0.00076\,f_{Hz}) + 3.5\,\tan^{-1}\left(\left(\frac{f_{Hz}}{7500}\right)^{2}\right)$$

- ERB-warping:
  - Critical bandwidth: constant for centre frequencies < 500 Hz [2].

  $$f_{ERB} = 11.17\,\ln\left|\frac{f_{kHz} + 0.312}{f_{kHz} + 14.675}\right| + 43, \quad f_{kHz} = \frac{f_{Hz}}{1000}$$
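The Bark and ERB mappings can be sketched in the same way. The function names below are illustrative assumptions. Note that in the Moore and Glasberg (1983) formula the constants 0.312 and 14.675 apply to frequency in kHz, so the ERB helper converts from Hz first.

```python
import math

def hz_to_bark(f_hz):
    """Bark value: 13 atan(0.00076 f) + 3.5 atan((f / 7500)^2), f in Hz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def hz_to_erb_rate(f_hz):
    """ERB-rate: 11.17 ln|(f + 0.312) / (f + 14.675)| + 43, with f in kHz."""
    f_khz = f_hz / 1000.0
    return 11.17 * math.log(abs((f_khz + 0.312) / (f_khz + 14.675))) + 43.0

for f in (0.0, 500.0, 1000.0, 4000.0, 8000.0):
    print(f"{f:7.1f} Hz -> {hz_to_bark(f):6.2f} Bark, {hz_to_erb_rate(f):6.2f} ERB-rate")
```

Both scales start near zero at 0 Hz and grow monotonically, roughly linearly at low frequencies and logarithmically at high frequencies.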

[1] E. Zwicker, G. Flottorp, and S. S. Stevens, “Critical band width in loudness summation,” J. Acoust. Soc. Am., vol. 29, no. 5, pp. 548-557, 1957.
[2] B. C. J. Moore, B. R. Glasberg, “Suggested formulae for calculating auditory-filter bandwidths and excitation patterns,” J. Acoust. Soc. Am., vol. 74, no. 3, pp. 750-753, 1983.


Discrete Implementation (1/3)

- Non-uniform DFT:
  - Non-uniform frequency samples.
  - Samples are equally spaced in the warped domain.
  - They are mapped back to the frequency domain.

Fig. A schematic representation to describe frequency warping (warping function; axes: frequency in Hz and warped frequency).


Discrete Implementation (2/3)

- For an N-point non-uniform DFT, the warped frequency axis is divided into N equally spaced points.
- For log warping:

  $$\Delta v = \frac{\log(f_s/2) - \log(1)}{N} = \frac{\log(f_s/2)}{N}$$

  Non-uniform DFT:

  $$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n\, e^{k\Delta v}/f_s}, \quad k = 0, 1, \dots, N-1,$$

  where $f_s$ is the sampling frequency.
- In general, $e^{k\Delta v}$ is replaced by $f_w^{-1}(k\Delta v)$, where $f_w^{-1}$ is the inverse mapping to Hz and $f_w$ can be $f_{mel}$, $f_{Bark}$ or $f_{ERB}$:

  $$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n\, f_w^{-1}(k\Delta v)/f_s}, \quad k = 0, 1, \dots, N-1.$$

- Piecewise log-warping: different numbers of points in the 5 frequency bands: 34, 47, 84, 141 and 206 points, respectively.
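The non-uniform DFT described on this slide can be sketched directly from its definition. The code below is a plain O(N²) illustration rather than an optimized implementation. The function names, the sampling frequency of 16 kHz and N = 512 are assumptions for the example (512 happens to equal the sum of the piecewise-log point counts 34 + 47 + 84 + 141 + 206).

```python
import numpy as np

def warped_freqs(inv_warp, v_max, num_points):
    """Frequencies f_k = inv_warp(k * dv), dv = v_max / N, k = 0..N-1.

    inv_warp maps a warped-domain value back to Hz (e.g. np.exp for log warping).
    """
    dv = v_max / num_points
    return np.array([inv_warp(k * dv) for k in range(num_points)])

def nonuniform_dft(x, freqs_hz, fs):
    """X(k) = sum_n x(n) exp(-j 2 pi n f_k / fs) for arbitrary frequencies f_k."""
    n = np.arange(len(x))
    kernel = np.exp(-2j * np.pi * np.outer(freqs_hz, n) / fs)
    return kernel @ np.asarray(x, dtype=float)

fs = 16000.0   # assumed sampling frequency (Hz)
N = 512        # assumed number of warped-frequency points

# Log warping: warped axis runs from log(1) = 0 to log(fs / 2).
log_freqs = warped_freqs(np.exp, np.log(fs / 2.0), N)

x = np.random.default_rng(0).standard_normal(N)
X_log = nonuniform_dft(x, log_freqs, fs)
print(log_freqs[0], log_freqs[-1], X_log.shape)
```

For Mel, Bark or ERB warping, only `inv_warp` and `v_max` change: `inv_warp` becomes the inverse of the chosen scale and `v_max` its value at fs/2. When the frequencies are the uniform grid k·fs/N, the same routine reduces to the ordinary DFT, which gives a simple correctness check.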


Discrete Implementation (3/3)

Fig. Plots of different warping schemes (log scale, piecewise-log, Mel scale, Bark scale and ERB scale); x-axis: warped frequency samples, y-axis: frequency in Hz.


Feature Extraction

- Pre-emphasis and windowing: 25 ms frame, 100 Hz frame rate, Hamming window.
- Subframing to remove source harmonics: averaging across the subframes removes the harmonics.
- Non-uniform frequency warping: warping in the frequency domain via the non-uniform DFT.
- Processing chain: Speech signal → Pre-emphasis and windowing → Subframing → Averaged autocorrelation → Non-uniform DFT → Log-magnitude → DCT → Features.

Fig.: Block diagram to illustrate the procedure of feature extraction.
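The block diagram can be turned into a rough end-to-end sketch. Only the 25 ms frame, the 100 Hz frame rate (10 ms hop) and the Hamming window come from the slide. The pre-emphasis coefficient, subframe length and hop, number of cepstral coefficients and all function names are assumptions chosen for illustration, so this should be read as one plausible reading of the diagram and not as the authors' implementation.

```python
import numpy as np

def frame_features(frame, freqs_hz, fs, sub_len, sub_hop, n_ceps):
    """One frame: averaged subframe autocorrelation -> non-uniform DFT
    -> log magnitude -> DCT-II (first n_ceps coefficients)."""
    windowed = frame * np.hamming(len(frame))
    # Autocorrelation (lags 0..sub_len-1) of each subframe, then average.
    acfs = []
    for s in range(0, len(windowed) - sub_len + 1, sub_hop):
        sub = windowed[s:s + sub_len]
        acfs.append(np.correlate(sub, sub, mode="full")[sub_len - 1:])
    avg_acf = np.mean(acfs, axis=0)
    # Non-uniform DFT of the averaged autocorrelation at the warped frequencies.
    lags = np.arange(sub_len)
    spectrum = np.exp(-2j * np.pi * np.outer(freqs_hz, lags) / fs) @ avg_acf
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    # DCT-II written out explicitly to avoid extra dependencies.
    m = len(log_mag)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(m) + 0.5) / m)
    return basis @ log_mag

def extract_features(signal, fs, freqs_hz, frame_ms=25.0, hop_ms=10.0,
                     preemph=0.97, sub_ms=10.0, sub_hop_ms=2.5, n_ceps=13):
    """Feature matrix (frames x n_ceps); preemph, sub_ms, sub_hop_ms and
    n_ceps are assumed values, not taken from the slides."""
    signal = np.asarray(signal, dtype=float)
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    sub_len = int(round(fs * sub_ms / 1000.0))
    sub_hop = int(round(fs * sub_hop_ms / 1000.0))
    starts = range(0, len(emphasized) - frame_len + 1, hop)
    return np.array([frame_features(emphasized[s:s + frame_len], freqs_hz, fs,
                                    sub_len, sub_hop, n_ceps) for s in starts])

fs = 8000.0                                           # assumed sampling rate
freqs = np.exp(np.arange(64) * np.log(fs / 2) / 64)   # 64 log-warped points
signal = np.random.default_rng(0).standard_normal(8000)
features = extract_features(signal, fs, freqs)
print(features.shape)
```

Swapping `freqs` for Mel, Bark or ERB spaced frequencies changes the warping while leaving the rest of the chain untouched.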


Experimental Setup

- Dynamic Time Warping (DTW) [1] template matching.
- Efficient for dynamically misaligned patterns.
- Evaluation metrics:
  - Precision @ N:
    - Precision over the top N results, where N is the number of correct occurrences of the word.
    - Range: 0 to 1 (0% to 100%).
    - Ideal value: 1 (100%).
  - Equal Error Rate (EER):
    - Error rate at which the False Acceptance and False Rejection rates are the same.
    - Range: 0 to 1 (0% to 100%).
    - Ideal value: 0 (0%).

[1] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Trans. on Acoustics, Speech, & Signal Process., vol. 26, no. 1, pp. 43-49, 1978.
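For readers unfamiliar with DTW, a minimal sketch of the alignment cost between two feature sequences follows. The Euclidean local distance, the unweighted three-way step pattern and the normalization by total length are simplifying assumptions. Sakoe and Chiba describe several step patterns and slope constraints that are not reproduced here.

```python
import numpy as np

def dtw_distance(a, b):
    """Length-normalized DTW cost between sequences a (Ta x D) and b (Tb x D)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ta, tb = len(a), len(b)
    # Pairwise Euclidean distances between all frames of a and b.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((ta + 1, tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                  acc[i, j - 1],      # deletion
                                                  acc[i - 1, j - 1])  # match
    return acc[ta, tb] / (ta + tb)

template = np.array([[0.0], [1.0], [2.0], [3.0]])
stretched = np.array([[0.0], [0.0], [1.0], [2.0], [2.0], [3.0]])  # same shape, slower
shifted = template + 5.0

print(dtw_distance(template, stretched), dtw_distance(template, shifted))
```

The time-stretched copy aligns to the template at zero cost, while the offset copy does not, which is the behaviour that makes DTW suitable for dynamically misaligned patterns.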


Experimental Results on TIMIT

- TIMIT := TIMIT Acoustic-Phonetic Continuous Speech Corpus.
- 50 spoken words are taken.
- No. of occurrences: 30-50 (total ~1800 instances).

Fig.: DET curves for template matching experiments for TIMIT database (curves: MFCC, piecewise-log, log, Mel, Bark, ERB; x-axis: false alarm probability in %, y-axis: miss probability in %).

Table. % Precision and % Equal Error Rates (EER) obtained by different feature sets for TIMIT database.

  Feature set             Precision (%)   EER (%)
  MFCC                    36.69           18.68
  Log-warping             42.80           14.47
  Piecewise log-warping   36.89           17.44
  Mel-warping             38.67           17.01
  Bark-warping            41.88           15.71
  ERB-warping             42.85           15.18

- Inference: the performance of warped features is better than that of MFCC.
- In terms of EER: Log > ERB > Bark > Mel > Piecewise-log > MFCC.
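The two reported metrics can be computed from a list of detection scores and ground-truth labels. The sketch below assumes that lower scores mean better matches (as with DTW distances), that both true and false trials are present, and that the EER is read off at the threshold where the two error rates are closest. The function names and the toy data are illustrative and are not the experimental data.

```python
import numpy as np

def precision_at_n(scores, labels):
    """Fraction of true occurrences among the N best-scoring trials,
    where N is the number of true occurrences (label 1)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = int(labels.sum())
    top = np.argsort(scores)[:n]          # lowest distances first
    return labels[top].sum() / n

def equal_error_rate(scores, labels):
    """Error rate at the threshold where false acceptance ~= false rejection."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = None, None
    for t in np.unique(scores):
        accept = scores <= t
        far = np.mean(accept[labels == 0])    # false acceptance rate
        frr = np.mean(~accept[labels == 1])   # false rejection rate
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # toy DTW distances
toy_labels = [1, 1, 0, 1, 0, 0]               # 1 = true occurrence of the word
p_at_n = precision_at_n(toy_scores, toy_labels)
eer = equal_error_rate(toy_scores, toy_labels)
print(p_at_n, eer)
```

On real data the two error rates rarely cross exactly at a sampled threshold, so interpolating between neighbouring thresholds gives a smoother EER estimate.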


Experimental Results on TIDIGITS

- TIDIGITS := Texas Instruments digit database.
- zero (0), one (1), … nine (9).
- 20 speakers.
- Total ~200 instances.

Table. % Precision and % Equal Error Rates (EER) obtained by different feature sets for TIDIGITS database.

  Feature set             Precision (%)   EER (%)
  MFCC                    49.48           24.59
  Log-warping             54.46           21.33
  Piecewise log-warping   55.59           19.57
  Mel-warping             52.38           21.11
  Bark-warping            56.53           19.19
  ERB-warping             56.65           19.11

Fig.: DET curves for template matching experiments for TIDIGITS database.

- Inference: the performance of warped features is better than that of MFCC.


Conclusions

- VTLN via universal warping.
- Different non-uniform frequency warping functions.
- Performance for template matching task.
- Warping-based features improve the performance.
- Future work: perform the analysis on a large dataset and under noisy channel conditions.

Acknowledgements

- Department of Electronics and Information Technology (DeitY), New Delhi, India.
- Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar.
