Comparative Evaluation of Speech Parameterizations for Speech Recognition

Iosif Mporas, Todor Ganchev, Mihalis Siafarikas, Theodoros Kostoulas
Wire Communications Laboratory, University of Patras, 26500 Rion-Patras, Greece
{imporas, tganchev, msiaf, tkost}@wcl.ee.upatras.gr

Abstract

In this work, we present a comparative evaluation of the practical value of some recently proposed speech parameterizations on the speech recognition task. Specifically, in a common experimental setup we evaluate recent discrete wavelet-packet transform (DWPT)-based speech features against traditional techniques, such as the Mel-frequency cepstral coefficients (MFCC) and the perceptual linear predictive (PLP) cepstral coefficients, which presently dominate the speech recognition field. The relative ranking of eleven sets of speech features is presented.

1. Introduction

Contemporary speech recognition technology is based on statistical analysis of speech performed through powerful pattern recognition techniques, such as hidden Markov models (HMM) and dynamic programming. One problem that has not been solved with sufficient elegance is speech parameterization, whose task is to represent the information carried by the speech signal in a compact form, so that it can be efficiently utilized by the HMM classifier. In [4], it was demonstrated that the psychophysically inspired MFCC outperform RCC [1], LPC [2], LPCC [3], and other features on the task of speech recognition. This success of MFCC, combined with their robust and cost-effective computation, turned them into a standard choice in speech recognition applications. Similar studies [6] have shown that the PLP [5] features outperform MFCC in specific conditions, but generally no large gap in performance was found between them. Other speech features, such as the perceptual linear prediction adaptive component weighting (ACW) cepstral coefficients [7] and various wavelet-based features (SBC [8], WPF [9], etc.), although presenting a reasonable solution for the same tasks, did not gain widespread practical use, often due to their relatively sophisticated computation.

Most often, when new speech features are proposed, they are contrasted only to MFCC or PLP, and rarely to a larger number of other competitive parameters. Some exceptions are [6, 10, 11], where the authors consider three or more other speech features in addition to the method they propose. The lack of comparison to multiple known methods creates a particular difficulty for developers when they have to choose speech features for the needs of a given speech recognizer. Usually, their first choice falls on the MFCC, since the latter are known to provide relatively good performance and are straightforward to implement. The selection of alternative speech features is somewhat complicated by the lack of large-scale comparisons, especially as concerns wavelet packet-based speech features. In the present work, we employ the Sphinx-III speech recognizer [12] and the TIMIT speech database [13] to evaluate a number of recent DWPT- and discrete Fourier transform (DFT)-based speech parameterization approaches. In a common experimental setup, we identify the relative ranking of the corresponding sets of speech features.

2. Speech Parameterization Techniques

In the present work, we consider the following relatively less-studied speech parameterization techniques: SBC [8], WPF [9], WPSR [10], and HFCC-E [11]. In addition, the well-known LFCC [4], MFCC [14], and PLP [5] are employed as reference points. In the following, we briefly outline how the original speech parameterizations were adjusted to a unified bandwidth and common settings.

2.1. MFCC speech features (MFCC-FB40)

Here, the MFCC implementation of Slaney [14], which is the default for the Sphinx-III speech recognizer, is considered. In brief, the frequency range [133, 6855] Hz is covered by a filter-bank of 40 lin-log equal-area filters. Specifically, the first 13 filters have a fixed spacing of 66.7 Hz, and the next 27 filters, whose centres reside beyond 1000 Hz, are logarithmically spaced. For the purpose of fair comparison, we accepted the MFCC frequency range of [133, 6855] Hz as binding for all other methods.
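The lin-log spacing can be illustrated with a short sketch (Python; the spacing constants are those of Slaney's Auditory Toolbox, while the rounding of the end-points is our own):

import numpy as np

# Boundary frequencies of the MFCC-FB40 filter-bank: 42 lin-log spaced points.
# Filter i is a triangle over (edges[i], edges[i+2]) centred at edges[i+1],
# normalized to unit area (equal-area filters).
f0, d_lin, r_log = 133.33, 66.67, 1.0711703
edges = np.empty(42)                                # 40 filters need 42 edges
edges[:13] = f0 + np.arange(13) * d_lin             # linear part, up to ~933 Hz
edges[13:] = edges[12] * r_log ** np.arange(1, 30)  # log part

print(round(edges[0]), round(edges[-1]))   # ~133 and ~6854 Hz, i.e. the quoted
                                           # [133, 6855] Hz range up to rounding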

2.2. LFCC speech features (LFCC-FB40)

The LFCC [4] are computed following the methodology of the MFCC, with the only difference that the Mel-frequency warping step is skipped. Thus, the desired frequency range is covered by a filter-bank of 40 equal-width and equal-height linearly spaced filters, each with a bandwidth of 164 Hz, spanning the range [133, 6857] Hz.

2.3. PLP speech features (PLP-FB19)

The PLP parameters [5] rely on a Bark-spaced filter-bank of 24 filters covering the frequency range [0, 15500] Hz. Here, this filter-bank was adapted to the desired frequency range by discarding the first (lowest-frequency) filter and all filters whose centre frequencies reside beyond 6855 Hz. This modification led to a filter-bank of 19 filters covering the frequency range [100, 6400] Hz, which is the closest feasible implementation of the desired frequency range.
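A minimal sketch of the Bark warping and of this trimming follows. The uniform placement of the 24 original centres on the Bark axis is our assumption; Hermansky's exact layout differs slightly, so the surviving-filter count printed by the sketch need not equal the 19 quoted above.

import numpy as np

def hz_to_bark(f):
    return 6.0 * np.arcsinh(f / 600.0)   # Hermansky's Bark approximation

def bark_to_hz(z):
    return 600.0 * np.sinh(z / 6.0)

# 24 hypothetical centres, uniform on the Bark axis over [0, 15500] Hz
centres = bark_to_hz(np.linspace(hz_to_bark(0.0), hz_to_bark(15500.0), 24))

# Drop the lowest filter and every centre beyond 6855 Hz
kept = centres[1:][centres[1:] <= 6855.0]
print(len(kept), round(kept[0]), round(kept[-1]))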

2.4. HFCC-E of Skowronski & Harris

Skowronski and Harris [11] introduced the human factor cepstral coefficients (HFCC-E) to decouple the filter bandwidth from the filter spacing. In earlier MFCC implementations [4, 14], these were dependent variables: the filter bandwidth was determined by the centres of the neighbouring filters. Another difference is that in HFCC-E the filter bandwidth is derived from the equivalent rectangular bandwidth (ERB), which is based on critical bands rather than the Mel scale. The filter bandwidth is further scaled by a constant, which Skowronski and Harris labelled the E-factor. Assuming a sampling frequency of 12500 Hz, Skowronski & Harris proposed an HFCC-E filter-bank composed of 29 Bark-warped equal-height filters, which cover the frequency range [0, 6250] Hz. In the present work, we achieved the desired frequency range by discarding the first two filters and adding a new one at the end. This modification led to a filter-bank of 28 filters in the frequency range [125, 6844] Hz. In addition to the original filter-bank, two other filter-banks covering the same frequency range were designed: (1) with 23 filters and (2) with 40 filters. Throughout this work, the E-factor is set equal to one.
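The bandwidth rule can be sketched as follows; the polynomial is Moore and Glasberg's ERB approximation, which, to our understanding, is the one adopted in [11] (the Bark-spaced placement of the centre frequencies is omitted here):

def erb_bandwidth(fc_hz, E=1.0):
    """ERB-derived filter bandwidth in Hz, scaled by the E-factor."""
    fc = fc_hz / 1000.0                        # the polynomial expects kHz
    return E * (6.23 * fc**2 + 93.39 * fc + 28.52)

for fc in (125.0, 1000.0, 6844.0):             # example centre frequencies
    print(fc, round(erb_bandwidth(fc), 1))     # ~40.3, ~128.1, ~959.5 Hz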

2.5. SBC of Sarikaya & Hansen (SBC S&H)

In [8], considering the stressed-speech monophone recognition problem, Sarikaya & Hansen performed a wavelet packet decomposition of the frequency range [0, 4000] Hz such that the 24 frequency subbands obtained follow approximately the Mel scale. The proposed wavelet packet tree assigns more subbands to the low and mid frequencies, while keeping a roughly log-like distribution of the subbands across frequency. The SBC speech features make use of Daubechies' 32-tap orthogonal filters. To adjust the SBC filter-bank to the desired frequency range, we made the following two modifications: the initial two subbands were discarded, and six new subbands with a bandwidth of 500 Hz each were added at the end of the original frequency range. This preserved the Mel-like frequency warping and led to an actual frequency range of [125, 7000] Hz covered by 28 frequency subbands, which is the closest feasible implementation of the desired bandwidth.
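To make the pipeline concrete, here is a minimal sketch of SBC-style features: wavelet-packet subband log-energies followed by a DCT. We substitute a uniform depth-5 tree for the Mel-like tree of [8], which we do not reproduce; 'db16' denotes Daubechies' 32-tap orthogonal filters in PyWavelets. The WPF and WPSR features of the following subsections differ mainly in the decomposition tree and the choice of wavelet.

import numpy as np
import pywt
from scipy.fftpack import dct

def sbc_like(frame, wavelet="db16", level=5, n_ceps=13):
    # Wavelet packet decomposition of one speech frame
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet,
                            mode="symmetric", maxlevel=level)
    # Leaf nodes in frequency order, one per subband
    leaves = wp.get_level(level, order="freq")
    energies = np.array([np.mean(node.data ** 2) for node in leaves])
    # Log-compress and decorrelate with a DCT, as for cepstral features
    return dct(np.log(energies + 1e-10), norm="ortho")[:n_ceps]

frame = np.random.randn(256)        # one 16 ms frame at 16 kHz
print(sbc_like(frame).shape)        # (13,)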

2.6. WPF of Farooq & Datta (WPF F&D)

Considering the phoneme recognition task, Farooq & Datta [9] performed a wavelet packet decomposition of the frequency range [0, 8000] Hz such that the 24 frequency subbands obtained closely follow the Mel scale. In order to obtain features with emphasis on the lower frequency subbands, the authors used a Daubechies wavelet filter of order 12. To adjust the WPF filter-bank to the desired frequency range, we discarded the first and the last subbands, which led to 22 subbands covering the range [125, 7000] Hz. This is the closest feasible fit to the desired frequency range.

2.7. WPSR of Siafarikas et al.

The WPSR of Siafarikas et al. [10] were initially developed for the needs of speaker recognition, but here they are adapted to the speech recognition task. In contrast to the SBC and WPF speech features, which are based on the Mel scale, the WPSR wavelet packet tree builds on the concept of critical bands. The Battle-Lemarié polynomial spline wavelet of order 5 was employed as the basis function. In the original design, the authors used 66 filters to cover the frequency range [0, 4000] Hz. To adapt this filter-bank to the speech recognition task, it was modified to have smoothly increasing frequency resolution as follows: resolution 31.25 Hz for the range [0, 1000] Hz, resolution 62.5 Hz for [1000, 2500] Hz, and resolution 125 Hz for [2500, 4000] Hz. These changes introduced two extra subbands in the range [0, 4000] Hz, bringing the total to 68. The desired frequency range was then implemented by discarding the first four subbands and adding a number of subbands at the end, in two different ways: (i) 23 new subbands with a resolution of 125 Hz each were added, giving a total of 87 subbands covering the range [125, 6875] Hz; (ii) 12 new subbands with a resolution of 250 Hz each were added, leading to a total of 76 subbands covering the frequency range [125, 7000] Hz. Consequently, these two versions of the WPSR, WPSR125 and WPSR250 respectively, differ only in the upper part of the frequency range.
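The subband bookkeeping above is easy to verify with a few lines of arithmetic (no assumptions beyond the resolutions quoted in the text):

# (low, high, resolution) per region of the modified base filter-bank
base = [(0, 1000, 31.25), (1000, 2500, 62.5), (2500, 4000, 125.0)]
n_base = sum(int((hi - lo) / res) for lo, hi, res in base)  # 32 + 24 + 12 = 68

n_wpsr125 = n_base - 4 + 23   # drop 4 lowest (first 125 Hz), add 23 x 125 Hz
n_wpsr250 = n_base - 4 + 12   # drop 4 lowest, add 12 x 250 Hz
print(n_base, n_wpsr125, n_wpsr250)   # 68 87 76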

3. Experimental Setup

The speech parameterizations of interest were evaluated on the TIMIT speech recognition corpus [13], utilizing the standard division into training and test data sets. The open-source Sphinx-III speech recognizer [12] was employed, and continuous acoustic models were trained. A standard phoneme set [13] of 38 monophones plus silence was used. Thus, 39 context-independent 3-state HMMs were initially trained, with each state modelled by a mixture of 8 Gaussians. Context-independent untied models were trained using 15 iterations of the Baum-Welch (BW) algorithm. For each triphone with at least 8 occurrences, a context-dependent untied HMM was trained using multiple iterations of the BW algorithm, with the convergence ratio set to 0.02. The automatically derived decision trees were pruned to 1000 senones. Finally, context-dependent tied HMMs were trained on the triphone states identified with senones from the pruned trees. After the acoustic model for each feature set was built, it was force-aligned against the training set's transcriptions. The force-aligned transcriptions were then used to build the final acoustic model with the setup described above. The acoustic model was built with a feature vector of dimensionality 39, consisting of 13 static coefficients and their first and second derivatives. No automatic gain control or variance normalization was applied. The default window length of 25.625 milliseconds with 100 frames per second was used. A trigram language model, trained on all TIMIT sentences, was built using the CMU Language Model Toolkit [15].
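The composition of the 39-dimensional feature vectors can be sketched as follows; the simple frame-wise differences below are a stand-in for Sphinx-III's internal delta scheme, which may weight neighbouring frames differently.

import numpy as np

def add_deltas(ceps):
    """ceps: (n_frames, 13) static features -> (n_frames, 39)."""
    d1 = np.gradient(ceps, axis=0)   # first time derivative, frame-wise
    d2 = np.gradient(d1, axis=0)     # second time derivative
    return np.hstack([ceps, d1, d2])

print(add_deltas(np.random.randn(100, 13)).shape)   # (100, 39)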

4. Experimental results

All speech features were processed in a uniform manner. The MFCC are considered as the baseline speech parameters. The word error rate (WER) and the sentence error rate (SER), in percentages, are presented in Table 1 for language weights (LW) of 9.5 and 12.0. The total number of words in the test subset was 14553, and the number of sentences 1680. The number 16 in brackets after the designation of the DWPT-based speech features denotes that these features utilize only the first 16 milliseconds of the speech frame. This is forced by the requirement of the DWPT analysis that the number of input samples be an exact power of 2. Thus, the DWPT-based features utilized only the first 256 samples of each speech frame (see the sketch after Table 1). All speech features evaluated here outperformed the baseline MFCC, which was an expected outcome and confirms the results reported by the corresponding authors. However, from a practical point of view, it is more interesting to investigate the ordering and the actual improvement these speech features provide when compared to the baseline.

Table 1. Results for window 25.625 ms

                    LW=9.5              LW=12.0
Feature         WER(%)  SER(%)      WER(%)  SER(%)
SBC S&H (16)     6.24    21.31       6.91    21.96
WPSR125 (16)     6.33    21.79       6.99    21.61
WPSR250 (16)     6.45    21.55       7.22    21.67
WPF F&D (16)     6.78    22.86       7.47    23.10
LFCC-FB40        6.94    23.45       7.47    22.98
HFCC-FB23        8.19    27.32       8.84    27.86
HFCC-FB40        8.69    28.21       9.06    28.39
HFCC-FB28        8.71    28.93       8.70    28.87
PLP-FB19         9.02    29.40       9.20    28.39
MFCC-FB40        9.03    29.88       9.34    28.93
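The arithmetic behind the 16 ms effective frame is as follows (a sampling frequency of 16 kHz is assumed, as for TIMIT):

fs = 16000
n_frame = round(0.025625 * fs)                # 410 samples per analysis frame
n_dwpt = 2 ** (n_frame.bit_length() - 1)      # largest power of two <= 410
print(n_frame, n_dwpt, 1000.0 * n_dwpt / fs)  # 410 256 16.0 (ms)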

As the results show, the lowest error rates were achieved for the SBC, followed by the WPSR125 and WPSR250, the WPF, and next by the DFT-based speech features. An interesting observation is that the LFCC-FB40, which relies on equal-bandwidth filters with linear spacing of the central frequencies, outperformed the HFCC, PLP, and MFCC, which all possess frequency warping inspired by the human auditory system. The superior results of the DWPT-based speech features can be explained by: (1) the balanced time-frequency resolution these wavelet packet trees provide, compared to the uniform frequency resolution of the DFT-based ones; and (2) the basis functions, which are a more suitable choice for the analysis of non-stationary speech signals than the cosine functions. Comparing the results in Table 1, we can observe only a slight increase of the error rates when the language weight is increased from 9.5 to 12. This corresponds to a speed-up of the decoder operation by about 1.5 times. For the purpose of fair comparison, we repeated all experiments with the DFT-based speech features for a window size of 16 milliseconds, which corresponds to the effective frame size that the DWPT-derived speech features utilize. The errors of word substitutions (WS), deletions (WD), and insertions (WI) for LW=9.5 are presented in Table 2. As the results in Table 2 show, the DWPT-derived speech features retained their superiority. With small exceptions, the ordering of the speech features remained the same as in Table 1. Summarizing the results presented in Table 2, we can see that the SBC parameters demonstrated a relative reduction of the WER by more than 20% and 30% when compared to the baseline MFCC and PLP, respectively (the arithmetic is sketched after the table).

Table 2. Results for window 16 ms (LW=9.5)

Feature       WS    WD    WI    WER(%)  SER(%)
SBC S&H      597   194   117     6.24    21.31
WPSR125      596   212   113     6.33    21.79
WPSR250      592   218   128     6.45    21.55
WPF F&D      619   207   161     6.78    22.86
LFCC-FB40    635   223   152     6.94    23.45
HFCC-FB40    736   173   194     7.58    25.54
HFCC-FB28    759   176   183     7.68    26.01
HFCC-FB23    764   179   189     7.78    25.66
MFCC-FB40    733   167   247     7.88    27.14
PLP-FB19     868   150   295     9.02    29.41
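The relative WER reductions quoted above follow directly from Table 2:

wer = {"SBC S&H": 6.24, "MFCC-FB40": 7.88, "PLP-FB19": 9.02}
for base in ("MFCC-FB40", "PLP-FB19"):
    red = 100.0 * (wer[base] - wer["SBC S&H"]) / wer[base]
    print(base, round(red, 1))   # 20.8% vs MFCC, 30.8% vs PLP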

A pair-wise t-test (level of significance p=0.05) was performed for all speech features; the absolute t-values are given in Table 3. Pairs whose t-values fall below the significance threshold of 1.98 are not statistically different.

Table 3. T-test for the 16 ms window (LW=9.5)

            WPSR   WPSR   WPF    LFCC   HFCC   HFCC   HFCC   MFCC   PLP-
             125    250   F&D    FB40   FB40   FB28   FB23   FB40   FB19
SBC S&H     0.50   1.19   2.96   3.89   6.71   7.10   8.00   8.21  12.92
WPSR125            0.71   2.53   3.48   6.38   6.79   7.69   7.91  12.70
WPSR250                   1.88   2.82   5.80   6.22   7.10   7.35  12.20
WPF F&D                          0.86   3.93   4.36   5.09   5.41  10.25
LFCC-FB40                               3.18   3.64   4.34   4.69   9.64
HFCC-FB40                                      0.47   0.94   1.39   6.21
HFCC-FB28                                             0.44   0.90   5.69
HFCC-FB23                                                    0.50   5.51
MFCC-FB40                                                           4.91
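The text does not spell out how the statistic was computed; a natural reading is a paired t-test over the per-sentence error counts of two systems. A sketch with hypothetical data follows (the Poisson rates are invented for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
errs_a = rng.poisson(0.54, 1680)   # hypothetical per-sentence word errors
errs_b = rng.poisson(0.68, 1680)   # for two systems on the 1680 test sentences

t, p = stats.ttest_rel(errs_a, errs_b)
print(abs(t) > 1.98)   # True if the pair differs significantly at p=0.05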

5. Conclusion

In conclusion, we would like to highlight that the comparative evaluation of some recent speech parameterization techniques presented in this work suggests that the widely used Mel-frequency cepstral coefficients might not be the most appropriate choice for speech parameterization when maximization of the absolute speech recognition performance is required. We believe our work could considerably benefit developers of speech recognizers, since it saves the duplication of effort in implementing and comparing different speech parameterization techniques.

6. Acknowledgement

This work was supported by the MoveOn project (IST-2005-034753).

7. References

[1] Oppenheim, A.V. "A speech analysis-synthesis system based on homomorphic filtering". JASA 45:458-465, 1969.
[2] Atal, B.S., Hanauer, S.L. "Speech analysis and synthesis by linear prediction of the speech wave". JASA 50(2):637-655, 1971.
[3] Atal, B.S. "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification". JASA 55(6):1304-1312, 1974.
[4] Davis, S.B., Mermelstein, P. "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences". IEEE Trans. on ASSP 28(4):357-366, 1980.
[5] Hermansky, H. "Perceptual linear predictive (PLP) analysis for speech". JASA 87(4):1738-1752, 1990.
[6] Kim, D.S., Lee, S.Y., Kil, R.M. "Auditory processing of speech signals for robust speech recognition in real-world noisy environments". IEEE Trans. on SAP 7(1):55-69, 1999.
[7] Assaleh, K.T., Mammone, R.J. "Robust cepstral features for speaker identification". ICASSP'94, Adelaide, Australia, Vol. 1, pp. 129-132, 1994.
[8] Sarikaya, R., Hansen, J.H.L. "High resolution speech feature parameterization for monophone-based stressed speech recognition". IEEE Signal Processing Letters 7(7):182-185, 2000.
[9] Farooq, O., Datta, S. "Mel filter-like admissible wavelet packet structure for speech recognition". IEEE Signal Processing Letters 8(7):196-198, 2001.
[10] Siafarikas, M., Ganchev, T., Fakotakis, N. "Wavelet packets based speaker verification". Odyssey 2004, Toledo, Spain, pp. 257-264, 2004.
[11] Skowronski, M.D., Harris, J.G. "Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition". JASA 116(3):1774-1780, Sept. 2004.
[12] Lee, K.-F., Hon, H.-W., Reddy, R. "An overview of the SPHINX speech recognition system". IEEE Trans. on ASSP 38(1):35-45, 1990.
[13] Garofolo, J. "Getting started with the DARPA-TIMIT CD-ROM: An acoustic phonetic continuous speech database". NIST, Gaithersburg, MD, 1988.
[14] Slaney, M. "Auditory Toolbox. Version 2". Technical Report #1998-010, Interval Research Corporation, 1998.
[15] "The CMU-Cambridge Statistical Language Modelling Toolkit, v2". http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
