Telecommun Syst
DOI 10.1007/s11235-011-9660-8

Building accurate and robust HMM models for practical ASR systems

Juraj Kačur · Gregor Rozinaj

Faculty of Electrical Engineering and Information Technology, Slovak University of Technology, Bratislava, Slovakia
e-mail: [email protected]

© Springer Science+Business Media, LLC 2011

Abstract In this article the training aspects relevant to building robust and accurate HMM models for large vocabulary recognition systems are discussed and adjusted, namely: speech features, training steps, and the tying options for context dependent (CD) phonemes. The well known MASPER training scheme is assumed as the basis for building the HMM models. First, the incorporation of voicing information and its effect on classical extraction methods like MFCC and PLP is shown, together with the derivative features, where the relative error reductions reach up to 50%. Next, the suggested enhancement of the standard training procedure by introducing garbled speech models is presented and tested on real data; as will be shown, it brings more than a 5% drop in the error rate. Finally, the options for tying states of CD phonemes using decision trees and phoneme classification are adjusted, tested, and explained.

Keywords Speech recognition · Hidden Markov models · Speech features · Model training · MASPER

1 Introduction

For a couple of decades great effort has been spent on building and employing automatic speech recognition (ASR) systems in areas like information retrieval systems, dialog systems, etc., but only as the technology has evolved have other applications, like dictation systems or even automatic transcription of natural speech [1], begun to emerge.

These advanced systems should operate in real time, must be speaker independent, achieve high accuracy even in the presence of additive and convolutional noises and changing environments, and support dictionaries containing several hundreds of thousands of words. These strict requirements can currently be met by HMM models of tied CD phonemes with multiple Gaussian mixtures, a technique known since the 1960s [2]. Although this statistical concept is mathematically tractable, it unfortunately does not completely reflect the underlying physical process, and soon after its creation there were many attempts to alleviate this. Nowadays the classical concept of HMM has evolved into areas like hybrid solutions with neural networks and training strategies other than maximum likelihood (ML) or maximum a posteriori probability (MAP), which minimize recognition errors by means of corrective training, by maximizing mutual information [3], or by constructing large margin HMMs [4]. Furthermore, a few methods have been designed and tested to relax the first order Markovian restriction, e.g. by explicitly modelling time duration (Levinson, 1986), splitting states into more complex structures [5], or using double [6] or multilayer HMM structures. A last but by no means least issue is the construction of robust and accurate feature extraction methods. Again, this matter is not fully solved, and various popular features and techniques exist, such as Mel frequency cepstral coefficients (MFCC), perceptual linear prediction (PLP) features, the concept of RASTA filtering [7], time frequency filtering (TIFFING) [8], Gammatone cepstral coefficients (GTCC) [9], zero-crossing peak amplitude (ZCPA) [10], etc. Despite the huge variety of advanced solutions in ASR, many of them are either not general enough (being specially designed for a certain environment) or rather impractical for real-life deployment.


Thus, at present, most practically employed systems are based on continuous context independent (CI) or tied CD HMM models of phonemes with multiple Gaussian mixtures trained by ML or MAP criteria. As there is no analytical solution for the optimal setting of HMM parameters given real data, an iterative training process must be employed [3]. Unfortunately, this gives no guarantee of reaching the global maximum [11], thus much effort is devoted to the training phase, in which many procedures are selectively applied in stages. Only mature and complex systems allow convenient, efficient and flexible training of HMM models; the most famous are HTK and SPHINX, which are regarded as standard tools for building robust and accurate models for large vocabulary, practical systems. This article discusses and explains the major stages of building speaker independent continuous density HMM (CDHMM) models using the professional database MOBILDAT-SK [12] and the standard training scheme called MASPER [13].

The rest of the article is organized as follows. First, the standard training method is used with several base speech feature extraction methods, namely MFCC and PLP, and a couple of auxiliary parameters, mostly dynamic ones. Further, the measure of voicing, which plays a role in differentiating some pairs of consonants, is added, as well as the pitch itself. Each setting and feature is tested and its merit numerically assessed and compared to the original one. In the second section the focus is on the training process itself, where the reference recognition system REFREC [14], or rather its multilingual version MASPER, is used as the basis and introduced in brief. The core part presents an enhancement of the standard training procedure that incorporates background models of garbled speech; several structures of those models are designed and tested, as well as the ways in which they are optimally trained. The third part deals with the process of tying HMM states of CD phonemes using both language information (classification of phonemes) and statistical similarities of the data. Setting optimal tying options, understanding the underlying physical process, and striking the right balance between accuracy and generality may bring an additional increase in overall accuracy. Next, the training and testing environments (executed tests) are presented in brief, along with the professional database. The article is concluded by summarizing results and findings. The presented article should thus give the reader an insight into how to adjust and build both robust and accurate HMM models using standard methods and systems on a professional database. Further, it should suggest what may be, and what probably is not, relevant in building HMM models for practical applications, i.e. which parts the designer should be particularly careful with.
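As noted above, there is no closed-form ML solution for HMM parameters, so they are re-estimated iteratively. As a reminder, the classical Baum-Welch (EM) update for the mean of Gaussian mixture component m in state j can be sketched as follows (a standard textbook form, not a formula specific to MASPER or HTK):

$$\hat{\mu}_{jm} = \frac{\sum_{t=1}^{T} \gamma_{jm}(t)\,\mathbf{o}_t}{\sum_{t=1}^{T} \gamma_{jm}(t)},$$

where $\mathbf{o}_t$ is the observation at time $t$ and $\gamma_{jm}(t)$ is the posterior probability of occupying that mixture component at time $t$ under the current model. Each iteration is guaranteed not to decrease the training likelihood, but convergence may be to a local maximum only, which is why the staged training procedures discussed below matter.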

2 Feature extraction methods and their performance

One of the first steps in the design of an ASR system is to decide which feature extraction technique to use. It should be noted at the outset that this task is not yet completely solved and a lot of effort is still being spent in this area. The aim is to simulate the human auditory system, describe it mathematically, simplify it for practical handling, and optionally adapt it for correct and simple use with the selected types of classification methods. A good feature should be sensitive to differences in sounds that humans perceive as different and should be "deaf" to those which are unheeded by our auditory system. It was found [15] that the following differences are audible: different locations of formants in the spectra, different widths of formants, and the non-linearly perceived intensity of signals. On the other hand, the following aspects do not play a role in perceiving differences: the overall tilt of the spectra, filtering out frequencies lying under the first formant frequency, removing frequencies above the 3rd formant frequency, and narrow band-stop filtering. Furthermore, features should be insensitive to additive and convolutional noises, or at least should represent them in such a way that these distortions are easy to locate and suppress in the feature space. Finally, when using CDHMM models it is required, for feasibility purposes, that the elements of the feature vectors be linearly independent so that a single diagonal covariance matrix can be used.

Unfortunately, there is no feature that ideally meets all the requirements mentioned above. Many basic speech features have been designed so far, but currently MFCC and PLP [16] are the most widely used in CDHMM ASR systems. In most cases these basic features aim to mimic the static part of the spectra as it is perceived by humans. Apart from the static features, it was soon discovered that changes in time [17], represented by delta and acceleration parameters, play an important role in modelling the evolution of speech (see the regression formula below). This notion was further evolved by introducing the concept of modulation frequency and RASTA filtering [7]. Dynamic features are particularly important when using HMMs, as HMMs lack a natural time duration modelling capability (the time spent in a single state follows a geometric distribution). The overall energy or the zero cepstral coefficient, together with their derivatives, also carry valuable discriminative information, thus most systems use them. More details on both the MFCC and PLP methods can be found elsewhere [9, 16, 18]; in the following let us mention some basic facts and results achieved both for the basic features alone and with auxiliary coefficients.
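For reference, the delta coefficients mentioned above are commonly computed with the regression formula used by HTK-like tools (given here in its usual textbook form; the window length $\Theta$ is a configuration choice, not a value reported in this article):

$$d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\,(c_{t+\theta} - c_{t-\theta})}{2\sum_{\theta=1}^{\Theta} \theta^{2}},$$

where $c_t$ is a static coefficient at frame $t$; the acceleration coefficients are obtained by applying the same formula to the deltas.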

Table 1 WERs and relative improvements for MFCC, PLP and auxiliary features averaged over different HMMs for digit strings and application words tests

        Static      Relative improvements related to previous step
        WER [%]     C0 [%]    Delta [%]    Acceleration [%]    Cepstral mean subtraction [%]
PLP     40.3        11.1      61.47        52.1                23.27
MFCC    43.6        20.64     61.36        48.15               52.66
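For clarity, the word error rates in Table 1 are presumably computed in the standard way from the minimum edit distance between the recognized and reference word sequences:

$$\mathrm{WER} = \frac{S + D + I}{N}\times 100\%,$$

where $S$, $D$ and $I$ are the numbers of substituted, deleted and inserted words, respectively, and $N$ is the number of words in the reference.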

2.1 MFCC vs. PLP

MFCC and PLP both represent a kind of cepstrum and are thus better at dealing with convolutional noises. However, it has been reported that at lower SNRs they are sometimes outperformed by other methods, e.g. TIFFING [8], J-RASTA [19], ZCPA [10], etc., as they couple signals non-linearly with additive noises. This well known rapid deterioration in accuracy at lower SNRs, which is not related solely to PLP or MFCC, is compensated for by matching the training and deployment environments; this is why huge professional databases are recorded in real environments. On the other hand, in a clean environment (above 30 dB) MFCC and PLP provide some of the best results [10].

The computation of MFCC undergoes the following processing. The speech signal is first modified by a high-pass, so-called pre-emphasis, filter to suppress the low-pass character of speech caused by the lip radiation into open space. Prior to the FFT computation a Hamming window is applied, and the frequency axis in Hz is warped onto the Mel scale to mimic the critical bands at different frequencies. Next, equally spaced triangular windows with 50% overlap are applied to simulate a filter bank. Finally, the logarithm and the DCT are applied, producing a static feature frame. The logarithm not only acts as a tool to produce a (real) cepstrum but also suppresses high intensities in favor of low ones, as the human auditory system does. In addition, the zero cepstral coefficient is used to estimate the overall log energy.
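To make the processing chain concrete, the following is a minimal NumPy sketch of the MFCC steps just described (pre-emphasis, Hamming windowing, FFT, Mel-warped triangular filter bank, logarithm, DCT). The sampling rate, frame length, pre-emphasis constant and filter count are common defaults chosen for illustration, not parameters reported by the authors.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=256, hop=128, n_filters=26, n_ceps=13):
    # 1) Pre-emphasis: HP filter compensating the low-pass lip-radiation effect
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) Framing and Hamming windowing prior to the FFT
    n_frames = 1 + (len(sig) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(frame_len)
    # 3) Power spectrum
    spec = np.abs(np.fft.rfft(frames, frame_len)) ** 2
    # 4) Triangular Mel-spaced filters with 50% overlap (a filter-bank model)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, frame_len // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # 5) Log filter-bank energies, then DCT-II to decorrelate them
    logfb = np.log(np.maximum(spec @ fbank.T, 1e-10))
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return logfb @ dct.T  # one row per frame; coefficient 0 acts like C0
```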

Unlike MFCC, the original PLP calculation follows these steps: (1) computation of the FFT, preceded by Hamming windowing; (2) frequency warping onto the Bark scale; (3) smoothing of the Bark-scaled spectrum by a window simulating the critical-band effect of the auditory system; (4) sampling of the smoothed Bark spectrum at approximately 1 bark intervals to simulate a filter bank; (5) equal-loudness weighting of the sampled frequencies, approximating hearing sensitivity; (6) transformation of energies into loudness by raising each frequency magnitude to the power 0.33; (7) calculation of the linear prediction (LP) coefficients from the warped and modified spectrum (an all-pole model of speech production); and finally (8) derivation of cepstral LP coefficients from the LPC, as if the logarithm and inverse FFT were calculated.

In both the MFCC and PLP cases, the DCT or FFT applied in the last step of the computation minimizes the correlation between elements and thus justifies the use of diagonal covariance matrices. Furthermore, to take full advantage of cepstral coefficients, a cepstral mean subtraction is usually applied in order to suppress possible distortions inflicted by various transmission channels or recording devices. Finally, we should mention the liftering of the cepstra, which emphasises its middle part so that the shapes of the spectra most relevant for recognition are amplified (lower-index coefficients approximate the overall spectral tilt, while higher-index coefficients reflect details and are prone to contain noise) [15]. However, this appealing option has no real effect when using CDHMMs with Gaussian mixtures having diagonal covariance matrices only: it is simple to show that the liftering operation is completely canceled out when computing the Gaussian pdf.
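To make the cancellation argument explicit (our reconstruction of this standard argument, not a derivation reproduced from the paper): liftering rescales each cepstral coefficient as $c'_i = w_i c_i$ with fixed weights $w_i$. ML training on the liftered features then yields rescaled parameters $\mu'_i = w_i\mu_i$ and $\sigma'_i = w_i\sigma_i$ for every diagonal-covariance Gaussian, so each term of the exponent is unchanged:

$$\frac{(c'_i - \mu'_i)^2}{\sigma'^2_i} = \frac{w_i^2 (c_i - \mu_i)^2}{w_i^2 \sigma_i^2} = \frac{(c_i - \mu_i)^2}{\sigma_i^2},$$

while the changed normalization factor $\prod_i w_i$ is common to all models and therefore does not affect the ranking of likelihoods, i.e. the recognition result.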


All the above-mentioned features and auxiliary settings were tested and evaluated on the MOBILDAT-SK database in terms of recognition error. Two tests were performed on the test-set portion of the database, digit strings and application words, and their results were averaged. The training was based on the MASPER training procedure (presented later in the text) using the HTK system. Table 1 shows the averaged word error rates (WER) for the PLP and MFCC features as scored in the application words and digit strings tests. The relative improvements achieved by adding the zero cepstral coefficient, including delta and acceleration coefficients, and applying cepstral mean subtraction are also shown. These results were calculated and averaged over different HMM models, i.e. CI and tied CD phoneme models with multiple Gaussian mixtures.

From these tests one can conclude that slightly better results are obtained by the PLP method in both cases, once regarding only the static features (43.6% vs. 40.3% WER in favor of PLP) and again using all the above-mentioned static and dynamic auxiliary parameters and modifications (5.24% vs. 5.08% WER in favor of PLP). Further, we investigated step by step the significance of the auxiliary features and modification techniques. Let us begin with the zero cepstral coefficient (C0), whose incorporation brought relative improvements over the basic PLP (11.1%) and MFCC (20.64%) vectors. As can be seen, the improvement is much more relevant for MFCC; we can interpret this as PLP providing a more complete representation of the static speech frame than MFCC (from the point of view of recognition accuracy), because the additional information was not as beneficial. Next, the inclusion of delta coefficients brought the averaged error down relatively by 61.36% for MFCC and 61.47% for PLP (relative to the basic vectors plus C0). If this drop is further expressed as a relative drop per single difference coefficient (assuming all are equally important), one delta coefficient on average causes a 4.72% WER drop for MFCC and 4.73% for PLP. Next, the acceleration coefficients were tested; their inclusion resulted in a 48.15% WER drop for MFCC and a 52.1% drop for PLP, relative to the previous setting (basic vector + C0 + delta). Again, the incorporation of dynamic (acceleration) coefficients was more beneficial for PLP. Expressed per single acceleration coefficient, this amounts on average to a 3.7% WER drop for MFCC and 4% for PLP. Finally, cepstral mean subtraction was tested for both methods, where it brought substantially improved results, on average by 52.66% for MFCC and 23.27% for PLP. As can be seen, the benefit of this operation is far greater for MFCC than for PLP. This reveals that PLP is less sensitive to cepstral mean subtraction, probably because it applies non-linear operations (the 0.33 power of the spectrum, the calculation of the all-pole spectrum) before the signal is transformed by the logarithm and the cepstral features are calculated. Interestingly, both dynamic feature sets proved more significant for PLP than for MFCC in relative numbers, whereas for the additional C0 (a static feature) it was just the opposite. All this may suggest that PLP itself is better at extracting static features for speech recognition, as the information contained in C0 and removed by cepstral mean subtraction is not as vital for it, unlike MFCC.

2.2 Voicing and the pitch as speech features

As shown in the previous subsection, apart from the base static features that aim to estimate the magnitude-modified and frequency-warped spectra, the dynamic features reflecting time evolution, like delta and acceleration coefficients, proved rather valuable as well. As a consequence of their construction and aim, these parameters eliminate any evidence of voicing contained in the signal, even though voicing carries discriminative information that may play a role in recognition accuracy. Namely, this information is vital for discriminating between some pairs of voiced and unvoiced consonants (t–d, etc.). To alleviate this drawback we substituted the least significant static features with this kind of information, derived from the average magnitude difference function (AMDF). To evolve the concept further, the pitch was tested in separate experiments as well. In order to verify and assess the merit of such a modification, a series of experiments was executed using the professional database and a training scheme for building robust HMM models.

As already outlined, incorporating parameters that assess the voicing of a particular signal may bring additional discriminative information into the recognition process. For example, in Slovak one classification of consonants is according to whether they exist in voiced/unvoiced pairs. In the group where pairs exist, each consonant is matched into a pair according to the mechanism by which it is produced and perceived: an unvoiced consonant is always matched to a voiced one, the only difference in their production being the absence of vocal chord activity, which obviously cannot be observed after PLP or MFCC processing. Some typical pairs of voiced and unvoiced consonants are: z in the word zero and s as it appears in the word sympathy, p (peace) and b (bees), d (divide) and t (tick), etc. As can be seen, distinguishing between them may be crucial, so such information is potentially beneficial. On the other hand, it must be said that there are many cases (at least in Slovak) where in real conversation a voiced consonant from a pair may take the form of its unvoiced counterpart and vice versa. These two contradictory effects must therefore be tested and assessed, to see which one prevails.

As for the selection of a proper method for estimating the degree of voicing in a signal, several methods exist, which also detect the periodicity as a side product. These algorithms range from simple ones in the time domain, like AMDF and autocorrelation, through spectral ones like harmonic spectra and the real cepstrum, to methods operating on the residual signal after inverse filtering. A good method should be accurate, robust against additive noise, and easy to compute, and its outcome should be easy to interpret and gain invariant. In our experiments we opted for the AMDF method, as it provides good results even at lower SNRs, has a simple and fast implementation, its output has a straightforward interpretation, the detected pitch comes as a by-product, and moreover it is the basis for more complex methods like YIN [20]. The AMDF for the i-th frame is defined as

$$f_i(k) = \frac{1}{N}\sum_{n=0}^{N-1} \left| s_i(n) - s_i(n+k) \right|,$$

where $s_i(n)$ denotes the n-th sample of the i-th frame, $k$ is the lag, and $N$ is the number of terms in the summation.
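As an illustration, the following sketch computes an AMDF-based voicing degree and pitch estimate for a single frame, assuming the definition above. The 50-400 Hz pitch search band and the gain-invariant valley-depth voicing measure are our illustrative choices, not necessarily the exact configuration used by the authors.

```python
import numpy as np

def amdf(frame, k):
    # Average magnitude difference at lag k, per the definition above
    n = len(frame) - k
    return np.mean(np.abs(frame[:n] - frame[k:k + n]))

def voicing_and_pitch(frame, fs=8000, f_lo=50.0, f_hi=400.0):
    # Search lags covering the assumed pitch band; the frame must be
    # longer than the largest lag (fs / f_lo samples).
    lags = np.arange(int(fs / f_hi), int(fs / f_lo) + 1)
    vals = np.array([amdf(frame, k) for k in lags])
    k_best = lags[np.argmin(vals)]
    # A deep AMDF valley relative to its mean level indicates strong
    # periodicity; the ratio is gain invariant, as required of a good
    # voicing feature.
    voicing = 1.0 - vals.min() / (vals.mean() + 1e-12)
    return voicing, fs / k_best  # (degree of voicing, pitch in Hz)
```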
