Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg
Knowledge-Rich Model Transformations for Speaker Normalization in Speech Recognition
Mats Blomberg, Daniel Elenius
Dept of Speech, Music and Hearing, CSC, KTH, Stockholm
Abstract In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF-Star children database, joint estimation of warping factor and variance scaling lowers the recognition error rate compared to warping factor alone.
Introduction Mismatch between training and test data is a major cause of speech recognition errors. Adaptation is one way to reduce this mismatch. If, however, the adaptation utterances are unknown, adaptation has to be performed in an unsupervised manner. This is less effective, and the performance gain is not as high as for supervised adaptation. One explanation for this lies in the fact that conventional data-driven adaptation algorithms impose few constraints on the properties of the updated models, which makes the process sensitive to recognition errors. Another view of the mismatch problem is that very large amounts of training data are required to cover the different speaker characteristics in current state-of-the-art recognition systems. A considerable reduction should be possible if some of these properties could be inserted artificially. A hypothesis in this paper is that including known theory of speech production in the training and adaptation procedures can provide a solution to these problems. In adaptation, the knowledge could be used to constrain the updated models to be realistic. The second problem, missing speaker characteristics in the training corpus, could be approached by predicting their acoustic features and inserting them into the trained models. In this way, the models are artificially extended to a larger training population.
Likelihood-maximisation-based estimation of explicit speaker or environment properties has been shown to be a powerful tool for speech recognition when no adaptation data are available (Sankar and Lee, 1996). This is performed by optimizing a small number of parameters to maximize the recognition score of an utterance. The parameters control the transformation of the acoustic features of either the incoming utterance or the trained models. One advantage of this approach over common speaker adaptation techniques, e.g. MAP or MLLR, is the low number of parameters controlling the adaptation. If only one parameter is used and the likelihood function is smooth enough to allow sparse sampling, searching over the whole dynamic range of the parameter is practically possible. A well-known example is Vocal Tract Length Normalization (VTLN) (Lee and Rose, 1996), where the effect of different lengths of the supra-glottal acoustic tube is often modeled by a single frequency warping parameter, which expands or compresses the frequency axis of the input utterance or the trained models. VTLN has proven to be efficient both within adult speakers and, especially, for children using models trained on adult speakers. In the latter case, with large mismatch between training and test data, VTLN can reduce the errors by around 50% (e.g. Potamianos and Narayanan, 2003; Elenius and Blomberg, 2005). The objective of this paper is to investigate a few other speech properties and to study how they can be combined with VTLN. A requirement for successful transformation of a particular feature is that it should raise the discriminability between the correct and the incorrect identities. Furthermore, the transformation should produce realistic models, which suggests an approach based on speech production theory. This paper looks into a few candidate properties. We modify phone models of an HMM-based recogniser using transforms related to speaker characteristics.
The transformations are evaluated in unsupervised test utterance adaptation in the challenging task of recognising children’s speech using models trained on adult speech.
Studied speaker characterization features

Vocal Tract Length
The VTLN technique used in this work is based on a Gaussian distribution assumption and a linear transformation. The new feature distributions of the models are obtained by multiplying the mean and covariance arrays by a transformation matrix. Pitz and Ney (2005) have shown how this can be done efficiently in an MFCC (Mel Frequency Cepstrum Coefficients) feature representation, and that technique is used in the current work. An advantage of this approach is that the transformation is applied to the models, not to the input utterance. This facilitates phoneme-dependent warp factors, in contrast to input utterance warping, where the whole utterance is normally warped uniformly, since its phonetic identity is unknown.

Piece-wise linear warping function
The warping function applied in this report is a 2-segment piece-wise linear function with two free parameters: the warping factor and the upper warp cut-off frequency. The latter is defined as the projection of the break point onto a line with slope 1.0, as in HTK (Young et al., 2005). One motivation for optimising the cut-off frequency is that it might capture some aspect of different scale factors for individual formants. Expanding the frequency scale makes the highest cepstral coefficients invalid, and they have to be excluded from recognition (Blomberg and Elenius, 2008).

Model variance
Besides having shorter vocal tracts, an additional source of the higher error rate for children than for adults is their higher variability (Potamianos and Narayanan, 2003). Due to growth during childhood, there is large inter-speaker variability in physical size and, accordingly, in acoustic properties between individuals. Differences are also caused by the developing acquisition of articulation skill and consistency. Intra-speaker variability has also been observed to be larger than for adults, possibly due to less established articulation patterns.
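As an illustration, the 2-segment piece-wise linear warping function described under "Piece-wise linear warping function" above might be sketched as follows. This is a minimal sketch following the HTK convention for the upper cut-off; the function name and the handling of only the upper break point are ours, not part of the original system.

```python
import numpy as np

def piecewise_linear_warp(f, alpha, f_cut, f_max=7600.0):
    """Two-segment piece-wise linear frequency warping.

    f     : frequency in Hz (scalar or array)
    alpha : warping factor (slope of the lower segment)
    f_cut : upper warp cut-off, defined as the projection of the break
            point onto the line with slope 1.0 (HTK convention)
    f_max : upper edge of the analysis band
    """
    f = np.asarray(f, dtype=float)
    # Break point on the source frequency axis: for alpha > 1 the lower
    # segment reaches f_cut at f_cut/alpha; for alpha <= 1 it is at f_cut.
    f0 = f_cut / alpha if alpha > 1.0 else f_cut
    # Lower segment: slope alpha. Upper segment: maps [f0, f_max] onto
    # [alpha*f0, f_max] so the whole band is preserved.
    return np.where(
        f <= f0,
        alpha * f,
        alpha * f0 + (f_max - alpha * f0) / (f_max - f0) * (f - f0),
    )
```

The function is continuous at the break point and maps the band edges onto themselves, so no frequency range is lost or duplicated.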
For these reasons, models trained on adult speech are expected to have narrower distributions than models trained on child speech.
Score maximization of the model variances compensates not only for this variability difference but also for the mismatch in mean values. Accordingly, the maximum-likelihood point is not expected to correspond to the true variability ratio between adults and children; the deviation from the mean values is likely to be of more importance.

Voice source spectral tilt
Studies on the voice source of children have found that spectral tilt differs from that of adults (Iseli, Shue and Alwan, 2006). Compensation for this effect could be performed by modifying the parameters of a voice source model. In this work we perform a coarse approximation by adding a bias to the mean of the first static cepstral coefficient, C1, for voiced sounds. The variances remain unchanged, as do all delta and acceleration parameters. The transform is phoneme-dependent, since only voiced phone models are modified.
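The two model-domain transforms just described, global variance scaling and a C1 mean bias for voiced models, can be sketched minimally as below, assuming diagonal-covariance Gaussians. All names and the layout of the parameter vectors are illustrative, not taken from the authors' implementation.

```python
import numpy as np

def scale_variances(variances, rho):
    """Multiply all diagonal variances of a model by a common factor rho."""
    return rho * np.asarray(variances, dtype=float)

def bias_c1_mean(means, bias, voiced, c1_index=1):
    """Add a bias to the mean of the first static cepstral coefficient (C1)
    for voiced phone models only; all other mean components, including the
    delta and acceleration parts, are left unchanged."""
    means = np.asarray(means, dtype=float).copy()
    if voiced:
        means[c1_index] += bias
    return means
```

In a grid search, rho and bias would be the free parameters optimised per utterance, applied to every Gaussian term of every (voiced, for the bias) phone model.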
Experiment

Speech data
The task for the experiments is digit string recognition. Children's recordings were taken from the Swedish part of the PF-Star Children's Speech Corpus (Batliner, Blomberg, D'Arcy, Elenius and Giuliani, 2002). This consists of 198 children from 4 to 8 years old. Each child was aurally prompted for 10 3-digit strings. The adult speakers were taken from SpeeCon (Großkopf, Marasek, v. d. Heuvel, Diehl and Kiessling, 2002). The number of digits per speaker was equal to that in PF-Star, but the string length varied between 5 and 10 digits. Both corpora were recorded through the same type of directional head-set microphone. Training and evaluation were conducted using separate data sets of 60 speakers in both corpora, resulting in a training and test data size of 1800 digits, except for the children's test data, whose size was 1650 digits due to the failure of some children to produce all the three-digit strings. A more detailed description of the speech data is presented in Blomberg and Elenius (2007).

Acoustic processing
Speech was divided into overlapping segments at a frame rate of 100 Hz, using a 25 ms Hamming window. Static, delta and acceleration features of MFCCs and normalised log energy
were computed from a 38-channel mel filterbank in the frequency range 0-7600 Hz. Training and transformation were performed on 18 MFCCs, while testing was done after removing the upper six coefficients, as mentioned above.
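For concreteness, the framing step (100 Hz frame rate, 25 ms Hamming window) can be sketched as below. The 16 kHz sampling rate is our assumption, inferred from the 0-7600 Hz analysis band; it is not stated in the paper.

```python
import numpy as np

def frame_signal(x, fs=16000, frame_rate=100, win_ms=25):
    """Split a signal into overlapping Hamming-windowed frames."""
    hop = fs // frame_rate               # 160 samples between frame starts
    win = fs * win_ms // 1000            # 400-sample (25 ms) window
    n_frames = (len(x) - win) // hop + 1 if len(x) >= win else 0
    w = np.hamming(win)
    return np.stack([w * x[i * hop : i * hop + win] for i in range(n_frames)])
```

Each frame would then be passed through the 38-channel mel filterbank and the cepstral transform to yield the 18 MFCCs used for training.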
Table 1. Baseline results for non-normalized training and test data.

Train   Test    WER (%)
Adult   Adult    1.4
Adult   Child   33.0
Child   Child    6.8

Phone model specification
Speech was modelled using phoneme-level HMMs (Hidden Markov Models) consisting of 3 states each. Word-internal triphone models were used, where the transition matrix was shared for all contexts of a particular phone. The feature observation probability distribution for each state was modelled using a GMM (Gaussian Mixture Model) of 32 terms. Training and recognition experiments were conducted using HTK 3.3. Separate software was developed for the model transformation algorithm.
Performed experiments
The baseline experiments use the VTLN algorithm implemented in HTK, with the same analysis and acoustic conditions as in the other experiments. HTK applies VTLN to the input utterance. The speaker parameters were estimated for each utterance by a joint grid search for score maximisation, with the search range of each parameter quantized into 10 steps. A full 4-dimensional search is not feasible due to the extensive amount of computation required; we have therefore restricted the search to pair-wise full search, in which one of the parameters is always the frequency warping factor. The reason for always including the warping factor is its proven importance for child speech recognition using adult models. In order to determine whether a parameter needs to be estimated for each utterance or whether estimation over a larger number of utterances suffices, we also compare the error rates of utterance-based and test-set-based optimisation of the parameters.
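The pair-wise grid search described above can be sketched as follows. Here score() stands in for the recogniser's utterance score, which we do not reproduce; the function name and interface are illustrative.

```python
import itertools

def pairwise_grid_search(score, warp_values, other_values):
    """Exhaustive 2-D grid search over the warp factor paired with one
    other speaker parameter; returns the score-maximising pair."""
    return max(itertools.product(warp_values, other_values),
               key=lambda pair: score(*pair))
```

With the paper's quantisation of each search range into 10 steps, one pairing costs 100 score evaluations per utterance, against 10^4 for a full 4-dimensional search.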
Table 2. Error rates for various combinations of speaker factor optimization. The search range for each parameter is indicated in the table head. 'Opt' indicates that the parameter is utterance-estimated; plain values denote a constant setting over the whole test set. An asterisked value is the test-set average from the corresponding per-utterance optimisation run; absence of an asterisk indicates a default value.

Warp factor    Warp cut-off    Variance factor   C1 bias      WER (%)
(1.0 to 1.7)   (1000 to 7600)  (1.0 to 3.0)      (-8 to +10)
HTK            5700            -                 -            15.2
Opt            5700            1.0               0.0          12.8
Opt            Opt             1.0               0.0          13.3
Opt            5700            Opt               0.0          11.0
Opt            5700            1.0               Opt          12.7
Opt            4632*           1.0               0.0          13.0
Opt            5700            1.58*             0.0          11.0
Opt            5700            1.0               -0.34*       12.7
Opt            4632*           1.58*             -0.34*       10.8
1.25*          4632*           1.58*             -0.34*       11.8

Results
The results of the baseline and of the experiments with the proposed approach are displayed in Table 1 and Table 2, respectively. Without any transformation of the adult models, the word error rate is quite high, 33.0%, which is understandable given the low age of the children. The standard VTLN algorithm in HTK roughly halves the error rate, to 15.2%.
This is further reduced, to 12.8%, by model-based VTLN transformation. Unexpectedly, joint likelihood optimisation of warp factor and warp cut-off frequency increases the error to 13.3%. Combined warp factor and variance scaling search lowered the 12.8% error rate of single warp factor optimisation to 11.0%, a relative reduction of 14%. The error rate of combined C1 and warp factor optimisation differs very little from that of the warp factor alone. Locking the parameters to their average estimates over the test set resulted in little difference in error rate compared to utterance optimisation, except for the warping factor, where it raised the error to 11.8%.
Discussion
It is interesting to note the positive contribution of variance scaling. A probable explanation is that it reduces the effect of mismatch in the mean values rather than in the variances. In any case, the result shows empirically that likelihood maximisation of the model variances is able to improve recognition performance. Optimising the warping cut-off frequency and the chosen representation of voice source spectral tilt did not improve the results. The warp cut-off frequency had, in fact, a slight negative impact on recognition performance. Regarding spectral tilt, a more detailed and accurate voice source model than a C1 bias should work better. A general problem is that the transformations can raise the score of incorrect identities more than that of the correct identity; less realistic transformations of incorrect identities are not penalised. One way to penalise them could be to assign probabilities to the transform parameter values. Optimisation of the proposed speaker parameters for each utterance instead of for the whole test set turned out to be of little value, except for the warping factor. It is probable that speaker-specific parameter estimation over more than one utterance would have been a better choice. Phoneme-specific settings might also be important. Joint optimisation of the warping factor and each of the other speaker features seems to have had little or no advantage, but this is regarded as specific to the particular features used.

Acknowledgements
This work was financed by the Swedish Research Council.
Conclusions Although the inclusion of the proposed transforms results in only modest recognition improvement, we believe that the approach of using explicit transformations to extend the properties of a given training population is a promising candidate for combining knowledge of speech production with data-driven training and adaptation. More work is required to develop realistic and accurate transforms for different speaker characteristics and speech styles. The results may give insight into speech relations important for speech recognition, and the demand for efficient transformations may inspire closer connections between the speech recognition and phonetic-acoustic research fields.
References
Batliner, A., Blomberg, M., D'Arcy, S., Elenius, D. and Giuliani, D. (2002) The PF_STAR Children's Speech Corpus. Proc. Interspeech, 2761-2764.
Blomberg, M. and Elenius, D. (2007) Vocal tract length compensation in the signal and model domains in child speech recognition. Proc. Swedish Phonetics Conference, TMH-QPSR, 50(1), 41-44.
Blomberg, M. and Elenius, D. (2008) Investigating Explicit Model Transformations for Speaker Normalization. ISCA Workshop on Speech Analysis and Processing for Knowledge Discovery, Aalborg, Denmark.
Elenius, D. and Blomberg, M. (2005) Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year old Children. Proc. Interspeech, 2749-2752.
Großkopf, B., Marasek, K., v. d. Heuvel, H., Diehl, F. and Kiessling, A. (2002) SpeeCon speech data for consumer devices: Database specification and validation. Proc. LREC.
Iseli, M., Shue, Y.-L. and Alwan, A. (2006) Age- and gender-dependent analysis of voice source characteristics. Proc. ICASSP, 389-392.
Lee, L. and Rose, R. C. (1996) Speaker Normalization using efficient frequency warping procedures. Proc. ICASSP, 353-356.
Pitz, M. and Ney, H. (2005) Vocal Tract Normalization Equals Linear Transformation in Cepstral Space. IEEE Trans. Speech and Audio Proc., Vol. 13, No. 5.
Potamianos, A. and Narayanan, S. (2003) Robust Recognition of Children's Speech. IEEE Trans. Speech and Audio Proc., 603-616.
Sankar, A. and Lee, C.-H. (1996) A Maximum-Likelihood Approach to Stochastic Matching for Robust Speech Recognition. IEEE Trans. Speech and Audio Proc., Vol. 4, No. 3.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Moore, G., Odell, J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P. (2005) The HTK Book. Cambridge University Engineering Department.