DELETED INTERPOLATION AND DENSITY SHARING FOR CONTINUOUS HIDDEN MARKOV MODELS

X.D. Huang, Mei-Yuh Hwang, Li Jiang, and Milind Mahajan

Microsoft Corporation, One Microsoft Way, Redmond, Washington 98052, USA
ABSTRACT

As one of the most powerful smoothing techniques, deleted interpolation has been widely used in both discrete and semi-continuous hidden Markov model (HMM) based speech recognition systems. For continuous HMMs, most smoothing techniques are carried out on the parameters themselves, such as Gaussian mean or covariance parameters. In this paper, we propose to smooth the probability density values instead of the parameters of continuous HMMs. This allows us to use most of the existing smoothing techniques for both discrete and continuous HMMs. We also point out that our deleted interpolation can be regarded as a parameter sharing technique. We further generalize this sharing to the probability density function (PDF) level, in which each PDF becomes a basic unit and can be freely shared across any Markov state. For a wide range of dictation experiments, deleted interpolation reduced the word error rate by 11% to 23% over other simple parameter smoothing techniques like flooring. Generic PDF sharing further reduced the error rate by 3%.
1. INTRODUCTION
To improve the performance of acoustic models in speech recognition, it is often necessary to combine well-trained general models (such as speaker-independent or context-independent models) with less well-trained but more detailed models (such as speaker-dependent or context-dependent models). Deleted interpolation (DI) has been successfully used for this purpose in both discrete and semi-continuous HMM based speech recognizers [1, 2].

In discrete HMMs, the output probability values in the output probability distributions are identical to the model parameters. However, in continuous HMMs, the PDF parameters (e.g. Gaussian means and covariances) and the values of the PDF are different. Most smoothing techniques, such as Maximum a Posteriori (MAP) estimation, smooth only model parameters such as Gaussian means, variances, and mixture coefficients [3]. Smoothing techniques invented for discrete HMMs may not be applied directly. In this paper, we point out that we can carry out smoothing on the probability value rather than on the parameter itself, as is conventionally done for continuous HMMs. This enables us to use most of the existing smoothing techniques for both discrete and continuous HMMs. As we can carry out smoothing in both parameter and probability space, we significantly increase the choice of applicable smoothing techniques for continuous HMMs. We further point out that our deleted interpolation can be regarded as a special parameter sharing technique, where well-trained mixture PDFs are shared across less well-trained Markov states to combat the problem of limited training data. In general, each individual PDF in the mixture of any Markov state can be regarded as a basic unit. We can use these basic units to form any mixture function for modeling any Markov state. Thus, we advance from state sharing (senones [4]) to PDF sharing, which increases the freedom to balance the amount of training data against the number of detailed parameters needed to capture acoustic variability.

We experimented with combining the continuous PDFs of the context-dependent (CD) and context-independent (CI) phone models using deleted interpolation instead of MAP-based parameter smoothing. For a wide range of speech dictation experiments, deleted interpolation reduced the word error rate by 11% to 23% over other simple smoothing techniques like flooring. A two-level generic shared-PDF system further reduced the error rate by 3%.

This paper is organized as follows. We first discuss deleted interpolation and how it can be used to smooth the density values of continuous HMMs. We then discuss a generic PDF sharing scheme, followed by experimental results that illustrate the utility of these techniques in Whisper [5]. Finally, we summarize our major findings.
2. DELETED INTERPOLATION AND PDF SHARING
2.1 Deleted Interpolation for PDFs

Deleted interpolation was first used to smooth a less well trained, but more detailed, discrete output probability distribution with a better trained, but less detailed, one [1]. The interpolation weights are often estimated using cross-validation data with the EM algorithm, to maximize the probability of the model generating the unseen data. In discrete HMMs, output probability distributions can be interpolated directly, as the value of the output distribution and the output probability parameter are identical. Conventionally, smoothing has been considered parameter smoothing; thus, for models where the model parameters and the model probability values are different, smoothing is usually carried out on the model parameters. In this paper, we generalize deleted interpolation smoothing to any stochastic model and point out that smoothing can be carried out in either the parameter or the probability space. For example, for HMMs with continuous PDFs, the interpolated PDF P_i^{DI}(.) can be expressed in terms of a more detailed but less well trained PDF P_i^{CD}(.) and a less detailed but better trained PDF P_i^{CI}(.) as follows:

    P_i^{DI}(.) = λ_i P_i^{CD}(.) + (1 − λ_i) P_i^{CI}(.)        (1)

where P_i^{DI}(.) is the mixture function after deleted interpolation for CD Markov state i; P_i^{CD}(.) is the more detailed but less well trained mixture, and P_i^{CI}(.) is the corresponding CI mixture. The interpolation weight λ_i could be state-dependent, senone-dependent, or phone-dependent if we share interpolation weights at different levels. Since the PDFs in continuous HMMs are usually trained with Maximum Likelihood Estimation (MLE), the interpolation weights should be trained using a cross-validation technique; otherwise, the weight for the more detailed density mixture P_i^{CD}(.) would almost always be set to one by an MLE algorithm. For cross validation, the training data is normally divided into M parts, and a set of P_i^{CD-j}(.) and P_i^{CI-j}(.) models is trained (by a traditional algorithm such as the forward-backward or Viterbi algorithm) from each combination of M−1 parts, with the deleted part serving as the unseen data used to estimate a set of interpolation weights {λ_i}. These M sets of interpolation weights are then averaged to obtain the final weights.

Figure 1 illustrates the procedure for estimating λ_i, where each i represents a CD senone:

Step 1: Initialize λ_i with a guessed estimate.

Step 2: Update λ_i by the following formula:

    λ_i' = (1 / Σ_{j=1..M} N_i^j) Σ_{j=1..M} Σ_{k=1..N_i^j} [ λ_i P_i^{CD-j}(X_i^{jk}) / ( λ_i P_i^{CD-j}(X_i^{jk}) + (1 − λ_i) P_i^{CI-j}(X_i^{jk}) ) ]

where P_i^{CD-j}(.) is the mixture PDF for CD senone i estimated on the entire training corpus except part j, the deleted part; similarly, P_i^{CI-j}(.) is the mixture PDF for the corresponding CI state, estimated on the same M−1 parts of data. N_i^j indicates the number of data points in part j that have been aligned with senone i by the Viterbi forced alignment and the model, and X_i^{jk} indicates the k-th data point in this set of aligned data.

Step 3: If the new value λ_i' is sufficiently close to the previous value λ_i, stop. Otherwise, go to Step 2.

Figure 1: Estimation of one interpolation weight for each CD senone i.

The deleted interpolation procedure described above can be applied after each forward-backward or Viterbi training iteration. Then, for the next iteration of training, the learned interpolation weights can be used as in equation (1) to compute the forward-backward paths or the Viterbi maximum path.
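As a concrete illustration, the Figure 1 loop can be sketched in Python. The density functions and data below are illustrative stand-ins (simple 1-D Gaussians), not the paper's trained senone models:

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density, a stand-in for the mixture PDFs P_i^{CD-j}, P_i^{CI-j}."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def estimate_di_weight(parts, p_cd, p_ci, lam=0.5, tol=1e-6, max_iter=200):
    """EM estimate of one deleted-interpolation weight (the Figure 1 procedure).

    parts: list of M lists of data points aligned with this senone;
    p_cd[j], p_ci[j]: density functions trained with part j deleted.
    """
    n_total = sum(len(part) for part in parts)
    for _ in range(max_iter):
        acc = 0.0
        for j, part in enumerate(parts):
            for x in part:
                cd = lam * p_cd[j](x)
                ci = (1.0 - lam) * p_ci[j](x)
                acc += cd / (cd + ci)      # posterior count for the CD branch
        new_lam = acc / n_total            # Step 2: normalized posterior counts
        if abs(new_lam - lam) < tol:       # Step 3: convergence check
            return new_lam
        lam = new_lam
    return lam

# Toy example with M = 2 parts: the CD model matches the data better,
# so the learned weight should favour the CD density (lam > 0.5).
parts = [[0.1, -0.2, 0.05], [0.0, 0.15, -0.1]]
p_cd = [lambda x: gaussian_pdf(x, 0.0, 0.05)] * 2   # sharp, well matched
p_ci = [lambda x: gaussian_pdf(x, 0.0, 1.0)] * 2    # broad, less matched
lam = estimate_di_weight(parts, p_cd, p_ci)
print(round(lam, 3))
```

Because every data point is more likely under the CD density in this toy setup, the EM update pushes the weight toward the CD branch, exactly the MLE behaviour the cross-validation split is meant to counteract on real data.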
2.2 Generic PDF Sharing

Examining the interpolated mixture function in (1), we can see that the interpolated function is in fact a mixture of CD and CI PDFs. Deleted interpolation augments each CD senone with the PDFs of its corresponding CI senone; the CD PDFs and the CI PDFs are weighted by the deleted interpolation weights, on top of the original mixture weights. That is, suppose

    P_i^{CD}(.) = Σ_a w_{ia}^{CD} f_{ia}^{CD}(.)
    P_i^{CI}(.) = Σ_b w_{ib}^{CI} f_{ib}^{CI}(.)

where f(.) is a probability density function, for example a Gaussian density. Then the interpolated density value for senone i is:

    P_i^{DI}(.) = Σ_a (λ_i w_{ia}^{CD}) f_{ia}^{CD}(.) + Σ_b ((1 − λ_i) w_{ib}^{CI}) f_{ib}^{CI}(.)
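The identity above — interpolation is just a larger mixture whose weights are λ_i w^{CD} and (1 − λ_i) w^{CI} — can be checked numerically. The Gaussian components below are illustrative stand-ins for one senone's densities:

```python
import math

def gauss(x, mean, var):
    # 1-D Gaussian density used as a stand-in component f(.)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture(components, x):
    # components: list of (weight, mean, var) tuples
    return sum(w * gauss(x, m, v) for w, m, v in components)

# Hypothetical CD and CI mixtures for one senone (each weight set sums to 1).
cd = [(0.7, 0.0, 0.1), (0.3, 0.5, 0.2)]
ci = [(0.4, -0.3, 0.5), (0.6, 0.2, 0.4)]
lam = 0.8

# Interpolated value computed two ways: equation (1) vs. one merged mixture
# whose weights are lam * w_CD and (1 - lam) * w_CI.
x = 0.25
via_interp = lam * mixture(cd, x) + (1 - lam) * mixture(ci, x)
merged = [(lam * w, m, v) for w, m, v in cd] + \
         [((1 - lam) * w, m, v) for w, m, v in ci]
via_merged = mixture(merged, x)

assert abs(via_interp - via_merged) < 1e-12
# The merged weights still sum to 1, so the result is a valid mixture density.
assert abs(sum(w for w, _, _ in merged) - 1.0) < 1e-12
```

This is why the interpolated model can be implemented with no new machinery: the decoder simply evaluates a mixture with more components and rescaled weights.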
Because all the CD senones of the same base phone share the same CI density components, there is no significant increase in computational complexity or in the number of free parameters, compared with the noninterpolated model. This two-level (CD senone and CI senone) density DI scheme can be extended to multiple levels when the decision-tree based senones [4] are used. For example, in the following figure, senone c, e, f can share the same set of PDFs in b, in addition to having their own PDFs; similarly, senone h and i can share the set of PDFs in g.
[Figure: a decision tree of senones a–i; senones c, e, and f share the set of PDFs in b, and senones h and i share the set of PDFs in g.]
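One way to realize the multi-level sharing in the figure is to let each senone own a set of PDF indices and also inherit the shared set of an ancestor node. The pool layout and index values below are purely illustrative:

```python
# Hypothetical pools of density indices into a global PDF table.
# Senones c, e, f share b's set; senones h, i share g's set, as in the figure.
shared_sets = {"b": [0, 1, 2], "g": [3, 4]}
own_pdfs = {"c": [5], "e": [6], "f": [7], "h": [8], "i": [9]}
parent = {"c": "b", "e": "b", "f": "b", "h": "g", "i": "g"}

def mixture_pdfs(senone):
    """PDF indices available to a senone: its own plus the shared ancestor set."""
    return own_pdfs[senone] + shared_sets[parent[senone]]

print(mixture_pdfs("c"))  # senone c mixes its own PDF with b's shared set
```

Each senone then only needs its own mixture weights over this index list, so adding a sharing level adds weights but no new Gaussian parameters.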
We can further generalize this sharing scheme to the level of each individual PDF. Each PDF in the continuous HMMs can be treated as a basic unit to be shared across any Markov state. At this extreme, there is no need for senones or shared states any more, and the shared PDFs become acoustic kernels that can be used to form any mixture function for any Markov state with appropriate mixture weights. Parameter sharing is thus advanced from a phone unit (generalized triphones [6]), to a Markov state unit (senones), to a density component unit. Parameter sharing at a finer granularity provides more flexibility to capture different acoustic phonetic variations.

3. EXPERIMENTAL RESULTS

For our comparative study, we selected the 5000-word Wall Street Journal continuous speech, speaker-independent task with the bigram language model developed by Lincoln Lab as our target. The acoustic training corpus consisted of 284 speakers from the WSJ0 and WSJ1 corpora. Results reported here are on the 1992 5000-word development set (si_dev5) and the Nov92 evaluation set. In addition, we also include results on our internal speaker-dependent isolated dictation task.

The acoustic model consisted of 42 context-independent phones (without deletable stops, the flap, and the phone TS). Each phonetic model had the 3-state Bakis topology without any skip arcs. Lexicon pronunciations were obtained from Carnegie Mellon University [7]. The MFCC coefficients were normalized with an augmented mean normalization procedure. Detailed Whisper system descriptions can be found in [5].

3.1 Deleted Interpolation for Speaker-Independent Speech Recognition

The baseline acoustic model consisted of 6000 context-dependent(1) decision-tree based senones for modeling position-dependent and context-dependent within-word and cross-word triphones [7]. The HMM output probability density function at each CD and each CI senone was a mixture of 20 Gaussian probability density functions. Diagonal covariance was used for each mixture component. This baseline system used simple flooring techniques for smoothing the variances and mixture weights. Its word error rates on the Nov92 set and the si_dev5 set were 5.3% and 6.6% respectively, as listed in Table 1.

The DI scheme estimated an interpolation weight for each CD senone as illustrated in Figure 1, with M = 2. For simplicity, the cross-validation data was simulated from the trained model instead of taken from the real data. That is, P_i^{CD-1}(.) was trained on the second part of the training data, and X_i^{1k} was sampled according to the model P_i^{CD-1}(.); P_i^{CD-1}(.) and X_i^{1k} computed the first set of interpolation weights, as Figure 1 illustrates. Similarly, X_i^{2k} was sampled according to P_i^{CD-2}(.), which was trained on the first part; P_i^{CD-2}(.) and X_i^{2k} computed the second set of interpolation weights. These two derived sets of weights were averaged to obtain the final estimate of the interpolation weights. We combined Viterbi training and DI weight learning for two iterations. This interpolation gave us more than 10% error rate reduction, as indicated in Table 1.
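The simulated two-part cross-validation can be sketched as follows. The densities, sample counts, and means below are toy stand-ins for the trained senone models, not the paper's actual parameters:

```python
import math
import random

random.seed(0)

def gauss(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_weight(data, p_cd, p_ci, lam=0.5, iters=100):
    """EM re-estimation of one interpolation weight on one deleted part."""
    for _ in range(iters):
        post = [lam * p_cd(x) / (lam * p_cd(x) + (1 - lam) * p_ci(x))
                for x in data]
        lam = sum(post) / len(post)
    return lam

# Toy "half-trained" CD models (each trained on the other part) and a broad CI model.
p_cd_1 = lambda x: gauss(x, 0.05, 0.1)   # stands in for P_i^{CD-1}
p_cd_2 = lambda x: gauss(x, -0.05, 0.1)  # stands in for P_i^{CD-2}
p_ci = lambda x: gauss(x, 0.0, 1.0)

# Simulated cross-validation data, sampled from the models themselves.
x1 = [random.gauss(0.05, math.sqrt(0.1)) for _ in range(200)]
x2 = [random.gauss(-0.05, math.sqrt(0.1)) for _ in range(200)]

lam1 = em_weight(x1, p_cd_1, p_ci)
lam2 = em_weight(x2, p_cd_2, p_ci)
lam_final = 0.5 * (lam1 + lam2)  # average the two derived sets of weights
print(round(lam_final, 2))
```

Because the simulated data comes from the CD models themselves, the averaged weight leans toward the CD densities, while the CI term retains the residual mass that guards against unseen data.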
System                          Nov92    si_dev5
baseline CHMMs                  5.3%     6.6%
one DI weight per CD senone     4.7%     5.9%
% error reduction               11%      11%

Table 1: Experimental results on the 5K WSJ dictation task using a bigram, without/with deleted interpolation.

(1) The decision trees were constructed using discrete HMMs for the sake of simplicity.
3.2 Deleted Interpolation for Speaker-Dependent Speech Recognition
With deleted interpolation, we can use a small number of mixture components for CD senones to capture context variations, and a relatively large number of mixture components for CI senones to capture generalization and achieve robustness on new data. Despite the large number of CI mixture components, the CI parameters can still be well trained because the number of CI phones is usually small. To verify this, we used our entry-level Whisper isolated dictation engine to conduct preliminary experiments. We used 5 Gaussian densities for each CI senone and 1 for each of the 1000 CD senones in the system. Without deleted interpolation, the word error rate was 8.7% on our 5,000-word speaker-dependent isolated dictation test set, without any language model. With one learned interpolation weight for each CD senone, the error rate was reduced to 6.7%, a 23% error rate reduction.
3.3 Two-level Shared Gaussians

As a first attempt at generic PDF sharing, we selected the speaker-independent WSJ task again and decided on the following two-level Gaussian sharing between the 6000 CD senones and the 126 CI senones: each CI senone had 20 Gaussian components, and each CD senone had 40 Gaussian components, of which 20 were tied to the corresponding CI Gaussians. This was essentially the same as the DI system listed in Table 1, except that there were 40 mixture weights for every CD senone and no interpolation weight was needed. In other words, each CD senone had full freedom to learn its own 40 mixture weights, including the 20 weights for the CI Gaussians. We observed an additional 3% error rate reduction compared with the standard DI system listed in Table 1.

3.4 Generic Shared Gaussians

As we pointed out, PDFs can be shared across any Markov state without involving senones. A unified mapping for Markov state and PDF sharing thus provides the maximum flexibility for speech recognition. We experimented with clustering PDFs to build a unified senone and PDF mapping structure. Our experiments (using various distortion metrics, including [8]) indicated that there was no significant error reduction, but computational complexity (mainly memory) could be modestly reduced under the same word error rate constraint.

4. SUMMARY

We proposed an algorithm that extends the concept of deleted interpolation to the probability domain for continuous HMMs. Various experimental results demonstrated that this could reduce word error rates by 11% to 23%. We further generalized our parameter sharing technique to shared PDFs. Our preliminary experiments indicated that the shared-PDF model could reduce word error rates by an additional 3%. The shared-PDF scheme opens up many possible parameter sharing structures that include senones and generalized triphones as special cases. We believe that this architecture could be used to build compact yet accurate acoustic models for advanced speech recognition [9, 10].

REFERENCES

[1] Jelinek, F. and Mercer, R., "Interpolated Estimation of Markov Source Parameters from Sparse Data," Proc. of the Workshop on Pattern Recognition in Practice, Amsterdam, North-Holland, 1980.
[2] Huang, X., Hidden Markov Models for Speech Recognition, Edinburgh University Press, 1990.
[3] Gauvain, J. and Lee, C., "Maximum a Posteriori Estimation of Multivariate Gaussian Mixture Observations of Markov Chains," IEEE Transactions on Speech and Audio Processing, Vol. 2, No. 2, April 1994.
[4] Hwang, M.Y., Huang, X., and Alleva, F., "Predicting Unseen Triphones with Senones," IEEE International Conference on Acoustics, Speech, and Signal Processing, 1993.
[5] Huang, X., Acero, A., Alleva, F., Hwang, M.Y., Jiang, L., and Mahajan, M., "Microsoft Windows Highly Intelligent Speech Recognizer: Whisper," IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[6] Lee, K.F., "Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, April 1990.
[7] Huang, X., Alleva, F., Hwang, M., and Rosenfeld, R., "An Overview of the Sphinx-II Speech Recognition System," Proc. of the ARPA Human Language Technology Workshop, March 1993.
[8] Young, S.J. and Woodland, P.C., "The Use of State Tying in Continuous Speech Recognition," Proc. of EuroSpeech, 1993.
[9] Takahashi, S. and Sagayama, S., "Four-level Tied-Structure for Efficient Representation of Acoustic Modeling," IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.
[10] Dugast, C., Beyerlein, P., and Haeb-Umbach, R., "Application of Clustering Techniques to Mixture Density Modeling for Continuous Speech Recognition," IEEE International Conference on Acoustics, Speech, and Signal Processing, 1995.