HIDDEN MARKOV MODELS USING SHARED AND GLOBAL VECTOR LINEAR PREDICTORS

Bruce A. Maxwell
Phillip C. Woodland
Cambridge University, Cambridge, England, UK
ABSTRACT

It has been shown that the use of non-shared vector linear predictors can improve recognition rates compared to standard HMMs [11]. Vector linear predictors adjust the probability of the current observation based upon previous or following observations at certain offsets in time, improving the HMM's ability to model speech data. This paper develops the theory and implementation of arbitrarily shared vector linear prediction for hidden Markov models in the area of speech recognition. By sharing predictors between multiple states, the parameters are more robustly estimated from the same amount of data. For most offsets, globally shared predictors provide more accurate recognition on both the training and test data sets than equivalent HMMs without predictors.
1. INTRODUCTION

One of the most successful speech models is the Hidden Markov Model (HMM), a statistical process model based upon the Markov chain [8]. The work described in this paper is an attempt to improve an HMM's ability to model speech, with the aim of providing greater discrimination between individual models and improved recognition. Two basic assumptions underlie the functioning of an HMM. The first of these assumptions is that the observation vector produced by an HMM state is independent of the vectors produced by other states [8]. The second is that the individual elements of an observation vector are also independent of one another [8]. Clearly, these assumptions do not fit with our knowledge of speech, which shows high correlation in both the time and frequency domains [4]. Some common modifications that relax the independence assumptions are the use of covariance matrices and of derivatives of the speech signal. Covariance matrices account for some of the correlation between the elements of a single observation vector, and derivatives incorporate neighbouring observation vectors, accounting for some of the time correlation in a speech signal. Both modifications tend to improve recognition results [4] [1]. Vector linear prediction allows explicit modeling of the time correlation between observation vectors at arbitrary offsets. When a predictor is used, the mean
vector of the state is no longer static, but fluctuates depending upon previous or future observation values at specified offsets $q$. The predictors themselves can be either diagonal or full matrices. The motivation for using prediction is that by explicitly modeling both the within-frame correlation with covariance matrices and the between-frame correlation with predictors, a better model with greater discriminatory power will result. Initial work in this area was undertaken by Wellekens [9], who developed a method similar to more general vector linear prediction whereby the Gaussian distribution of the current frame of an HMM is modified, or predicted, by information from the previous frame (no experimental results were reported). Similar work by Brown [4] on what he termed a conditional Gaussian distribution also operated on the previous frame, but he reported that it produced significantly worse results than his baseline HMMs. Paradoxically, Brown also reported that the conditional Gaussian method produced a significantly higher average log probability per frame¹ than the baseline HMMs during training, despite the poor recognition results. The vector linear prediction method used in these experiments is an extension of the work of Woodland [11]. Kenny et al. [5] developed a method equivalent to Woodland's, but with slightly differing parameters and re-estimation formulae, as they associated an output distribution with each state transition rather than with each state. Like Brown's, the experiments by Kenny et al. also showed poor results; however, both the experiments run by Woodland and the results given herein show that vector linear prediction can reduce the error rate on both test and training sets. As a predictor requires a large number of parameters to be re-estimated, the ability to share a predictor between a number of states can be useful in obtaining more robust estimates of the parameters. Woodland, for example, found that using two full predictors for each state actually increased the error rate on the test set while decreasing the error rate on the training set, a combination which suggests the parameters were inadequately trained. By sharing predictors, the parameters can be better estimated from the same size training set, possibly leading to better recognition. Using a single global predictor for all states of an HMM also results
in significant memory savings with respect to the size of each HMM data structure. To examine the effects of sharing predictors among HMM states, the Hidden Markov Model Toolkit (HTK) developed at Cambridge University was modified to allow arbitrary sharing of predictors. The results obtained showed performance on the test and training sets to be between that of the baseline model without prediction and HMMs with non-shared predictors when diagonal covariance matrices were used. However, for this task, on HMMs with full covariance matrices, the shared predictors slightly outperformed the non-shared predictors on both the training and test data.

¹ The average log probability per frame is often used as an indicator of how well an HMM matches the training sequences.
2. PREDICTOR MODEL
The mathematical model used to develop the predictors is based upon the work of Woodland. Output distributions are associated with each state. The output $Y_t^i$ at time $t$ from state $i$ with $P$ predictors at offsets $q_p$ is modeled using (1).
$$Y_t^i = \mu_0^i + \sum_{p=1}^{P} A_p^i \left( O_{t+q_p} - \mu_{q_p}^i \right) \qquad (1)$$
A standard HMM can be obtained from this equation by setting all $A_p$ to the zero matrix. The predictor model has three basic parts: the standard mean, $\mu_0$; a set of offset means, $\mu_{q_p}$, which are the means of the observations $q_p$ frames, or time steps, away from the current observation; and the set of predictors, $A_p$. The $O_{t+q_p}$ are the observations at offset $q_p$ from time $t$. Note that sharing the $A_p$ between a number of states does not affect the calculation of the output probabilities. The error between the actual observation $O_t$ and the output of state $i$ is given by (2).

$$E_t^i = O_t - Y_t^i \qquad (2)$$

Given the covariance matrix or vector $S_i$ for state $i$, the probability of seeing observation $O_t$ in state $i$ is given by (3), where $E_t^i$ is given by (2) and $d$ is the dimensionality of the observation vectors.

$$p(O_t) = \frac{1}{(2\pi)^{d/2}\,|S_i|^{1/2}} \exp\!\left(-\tfrac{1}{2}\,(E_t^i)^t\, S_i^{-1}\, E_t^i\right) \qquad (3)$$
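As an illustrative aid (not part of the original toolkit), the following Python/NumPy sketch shows how the predicted output (1), prediction error (2), and output log-probability (3) could be computed for a single state. All array and function names here are hypothetical, and the caller is assumed to ensure that $t+q_p$ stays inside the utterance.

```python
import numpy as np

def state_log_prob(obs, t, offsets, mu0, mu_off, A, S):
    """Log-probability of observation obs[t] for one state, per (1)-(3).

    obs     : (T, d) array of observation vectors
    offsets : list of P integer offsets q_p (e.g. [-3])
    mu0     : (d,) static state mean
    mu_off  : (P, d) offset means mu_{q_p}
    A       : (P, d, d) predictor matrices A_p (full, or diagonal stored as full)
    S       : (d, d) state covariance matrix
    """
    d = obs.shape[1]
    # Predicted mean: mu0 plus the predictor contributions from the
    # observations at the specified offsets, as in (1).
    y = mu0.copy()
    for p, q in enumerate(offsets):
        y += A[p] @ (obs[t + q] - mu_off[p])
    # Prediction error (2) and Gaussian log-density (3).
    e = obs[t] - y
    sign, logdet = np.linalg.slogdet(S)
    return -0.5 * (e @ np.linalg.solve(S, e) + logdet + d * np.log(2 * np.pi))
```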
3. RE-ESTIMATION EQUATIONS
The maximum likelihood (ML) re-estimation equations for the linear prediction model can be determined using the methods of Baum and his colleagues [3] [2] [6]. This approach defines an appropriate auxiliary function and differentiates it with respect to the unknown parameters, setting each resulting equation equal to zero. Simple manipulation then leads to a formula for each parameter. For a complete derivation of the training equations for shared predictors, see [7].
The changes to the HMM equations outlined in (1) do not affect the transition probability calculations. Because of this, the standard result for training transition probabilities applies to the HMM model with prediction (see [2], [6], or [8] for further discussion). The re-estimation equations for the state means and the offset means used in (1) are given by (4), where $\gamma_i(t)$ is the probability of being in state $i$ at time $t$ and $q_p$ is the offset from the current frame. When $q_p = 0$ the mean is a state mean.

$$\hat{\mu}_{q_p}^i = \frac{\sum_T \gamma_i(t)\, O_{t+q_p}}{\sum_T \gamma_i(t)} \qquad (4)$$

In general, the re-estimated covariance matrix shared between all of the states in the set $I_p$, for observation vectors at offsets $r$ and $s$, can be shown to be (5) [6] [11] [7].

$$C_{rs}^{I_p} = \frac{\sum_{i \in I_p} \sum_T \gamma_i(t)\, (O_{t+r} - \hat{\mu}_r^i)(O_{t+s} - \hat{\mu}_s^i)^t}{\sum_{i \in I_p} \sum_T \gamma_i(t)} \qquad (5)$$

Once these covariance matrices have been found, the new estimates for the $P$ predictor matrices $\hat{A}_p^{I_p}$ are found by solving the matrix equation given by (6), where $B_I$, $R_I$, and $Z_I$ are defined as in (7) and (8).
$$R_{I_p} = Z_{I_p} B_{I_p}^t \qquad (6)$$

$$R_I = \begin{bmatrix} C_{10}^I \\ \vdots \\ C_{P0}^I \end{bmatrix} \qquad
Z_I = \begin{bmatrix} C_{11}^I & \cdots & C_{1P}^I \\ \vdots & \ddots & \vdots \\ C_{P1}^I & \cdots & C_{PP}^I \end{bmatrix} \qquad (7)$$

$$B_I = \begin{bmatrix} \hat{A}_1^I & \cdots & \hat{A}_P^I \end{bmatrix} \qquad (8)$$
Note that even when the predictors are shared over a set of states, the offset means $\hat{\mu}_{q_p}$ used to calculate the $C_{rs}^{I_p}$ are specific to each state over which the predictor, and the $C_{rs}^{I_p}$, is shared. In theory, the offset means could be shared among several states without affecting the form of the re-estimation equations, but this alternative was not pursued. Confirming the consistency of the shared predictor re-estimation formulae with previous work, the above equations reduce to the non-shared case developed previously in [11] and [5] when the set of shared states, $I_p$, consists of only a single state. The covariance matrix re-estimation equation follows the form of (5), with the $(O_{t+r} - \hat{\mu}_r^i)$ terms replaced by (2), producing (9), where $\hat{S}_{I_s}$ is the re-estimated covariance matrix, which may be shared over the states in the set $I_s$ [6].

$$\hat{S}_{I_s} = \frac{\sum_{i \in I_s} \sum_T \gamma_i(t)\, \hat{E}_t^i (\hat{E}_t^i)^t}{\sum_{i \in I_s} \sum_T \gamma_i(t)} \qquad (9)$$
Note that the $\hat{E}_t$ are based upon the re-estimated values $\hat{\mu}_0$, $\hat{\mu}_{q_p}$, and $\hat{A}_p$. The terms of an expanded version of (9) can be substantially simplified and re-written in terms of (7) and (8).

$$\hat{S}_{I_s} = \hat{C}_{00}^{I_s} - B_{I_p} R_{I_s} - (B_{I_p} R_{I_s})^t + B_{I_p} Z_{I_s} B_{I_p}^t \qquad (10)$$

If the sets $I_p = I_s$ and each consists of a single state, this equation simplifies to the one given by Woodland [11]. As the sets $I_p$ and $I_s$ are not necessarily the same, care must be taken during re-estimation of the predictors $\hat{A}_p^{I_p}$ and covariance matrices $\hat{S}_{I_s}$ that the parameters from the correct states are used. Note also that the matrices $R_I$ and $Z_I$ are not necessarily equivalent in (10) and (6). The predictor matrix $B_{I_p}$, on the other hand, must consist of the re-estimated predictors as calculated by (6). If the simplified form of the training equations given by (10) is used to calculate $\hat{S}_{I_s}$, then $I_s \subseteq I_p$ is required in order for $\hat{S}_{I_s}$ to be properly re-estimated. Both the predictor and covariance matrices can be either full or diagonal, and the matrix elements of the re-estimation equation given by (10) must be adjusted accordingly depending upon the combination used. For a complete description see [7].
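To make the re-estimation procedure concrete, here is a minimal sketch assuming a single utterance, full matrices, predictors and covariance shared globally over all states (so $I_p = I_s$), and state posteriors $\gamma_i(t)$ already computed by a forward-backward pass. All names are illustrative rather than the HTK implementation, and boundary frames where an offset falls outside the utterance are simply skipped.

```python
import numpy as np

def reestimate_shared(obs, gamma, offsets):
    """One re-estimation pass for predictors shared over all states.

    obs     : (T, d) observations; gamma : (T, n_states) state posteriors
    offsets : list of P offsets q_p.
    Returns new offset means (4), stacked predictors B = [A_1 ... A_P] from
    (5)-(8), and the shared covariance S from (10).
    """
    T, d = obs.shape
    n_states = gamma.shape[1]
    P = len(offsets)
    qs = [0] + list(offsets)              # index 0 is the current frame

    # (4): gamma-weighted means, one per state and offset.
    mu = np.zeros((n_states, P + 1, d))
    for i in range(n_states):
        for k, q in enumerate(qs):
            ts = np.array([t for t in range(T) if 0 <= t + q < T])
            w = gamma[ts, i]
            mu[i, k] = (w[:, None] * obs[ts + q]).sum(axis=0) / w.sum()

    # (5): cross-covariances C_rs pooled over the shared set of states,
    # using the state-specific offset means.
    C = np.zeros((P + 1, P + 1, d, d))
    norm = 0.0
    for i in range(n_states):
        for t in range(T):
            if all(0 <= t + q < T for q in qs):
                g = gamma[t, i]
                dev = [obs[t + q] - mu[i, k] for k, q in enumerate(qs)]
                for r in range(P + 1):
                    for s in range(P + 1):
                        C[r, s] += g * np.outer(dev[r], dev[s])
                norm += g
    C /= norm

    # (6)-(8): solve R = Z B^t for the stacked predictors B = [A_1 ... A_P].
    R = np.vstack([C[p, 0] for p in range(1, P + 1)])                # (P*d, d)
    Z = np.block([[C[r, s] for s in range(1, P + 1)]
                  for r in range(1, P + 1)])                         # (P*d, P*d)
    B = np.linalg.solve(Z, R).T                                      # (d, P*d)

    # (10): re-estimated shared covariance from the same statistics.
    S = C[0, 0] - B @ R - (B @ R).T + B @ Z @ B.T
    return mu, B, S
```

Because this sketch pools the statistics over every state, it corresponds to the globally shared case examined in the experiments below; restricting the pooling loops to a subset of states would give the more general shared case.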
4. DATABASE AND METHODS
The data set used to evaluate shared linear prediction was a British English E-set (`B', `C', `D', `E', `G', `P', `T', and `V') which forms part of a spoken alphabet data set collected and distributed by British Telecom Research Laboratories. The database contains three utterances from each of 104 speakers (54 male and 50 female). The data was recorded in a quiet room at a sampling rate of 20 kHz and then endpointed semi-automatically. For these experiments, the first two utterances by each speaker were used as the training data (1634 utterances) and the third utterance as the test data (824 utterances). The data was originally analysed by a 27-channel Mel-scaled filter-bank at a 100 Hz frame rate. Twelve Mel-frequency cepstral coefficients and their first differentials were then calculated, producing a 24-element observation vector to characterize each frame. The data was then normalized so that the average within-state covariance matrix was the identity matrix, as described in [10]. The means were also adjusted to be, on average, zero. All of the HMMs used in these experiments were strictly left-to-right with no skips. The eight consonants were each modeled by an HMM with four emitting states, and a single HMM with seven emitting states was used for the common vowel. Each utterance, therefore, was modeled by 11 states, with the last 7 being shared among all words. The HMMs were initialized using a clustering algorithm and uniform segmentation to provide an initial estimate of the means and covariances.
Cov. Type    Log Prob    Train Err %    Test Err %
diagonal       -30.9        10.83          12.86
full           -27.7         2.88           7.65

Table 1. Average Log Probability per Frame and Recognition Results for the Baseline HMM Set
Cov     Pred        Log Prob    Train Err %    Test Err %
diag    gbl diag      -25.7        7.65           8.37
diag    ind diag      -25.1        6.04           7.77
diag    gbl full      -21.6        6.70           7.89
full    gbl diag      -22.6        2.45           5.34
full    ind diag      -22.1        3.12           5.46
full    gbl full      -16.1        1.96           5.70

Table 2. Average Log Probability per Frame and Recognition Results for Different Combinations of Covariance and Predictor Types
Multiple iterations of Baum-Welch re-estimation on the individual phonemes were then used to further estimate the means, covariances, and transition probabilities. Neither of these initial phases affected or trained the predictors or the offset means, which were initialized to zero. Finally, embedded Baum-Welch re-estimation was used for four iterations to train the HMMs with the predictor model implemented as outlined above.
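As a rough illustration of the feature preparation described above, the fragment below builds 24-element observation vectors from 12 cepstral coefficients and applies a global whitening step. It is only a simplified stand-in: the differential computation is not specified in the paper, and the actual normalization makes the average within-state covariance the identity matrix [10], which needs state-level alignments rather than the pooled statistics used here.

```python
import numpy as np

def make_observations(mfcc):
    """Append first differentials to 12 MFCCs, giving 24-element frames.

    mfcc : (T, 12) array of Mel-frequency cepstral coefficients at 100 Hz.
    A simple central difference stands in for the actual differential.
    """
    delta = np.zeros_like(mfcc)
    delta[1:-1] = 0.5 * (mfcc[2:] - mfcc[:-2])
    delta[0], delta[-1] = delta[1], delta[-2]      # copy at the boundaries
    return np.hstack([mfcc, delta])                # (T, 24)

def whiten(obs_list):
    """Normalize features to zero mean and identity covariance (globally)."""
    X = np.vstack(obs_list)
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Inverse square root of the covariance via its eigendecomposition.
    w, V = np.linalg.eigh(cov)
    W = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return [(o - mean) @ W for o in obs_list]
```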
5. EXPERIMENTAL RESULTS
Two baseline HMM sets were developed for comparison purposes, both using a single 24-dimensional observation stream and a single Gaussian distribution per vector element per state. One model used a diagonal covariance matrix, the other a full covariance matrix. The results for the test and training sets are given in Table 1.² As can be seen, the full covariance matrix significantly improves recognition of both the training and test data. To compare the effects of sharing a diagonal or full predictor among all states of a given HMM, an HMM set was trained and tested for each of the four possible covariance/predictor combinations. Two HMM sets were also trained with non-shared predictors for the diagonal covariance/diagonal predictor and full covariance/diagonal predictor cases. To compare with the findings of Woodland [11], a single predictor was used with an offset of -3. The results for all of these cases are given in Table 2. It is apparent from these results that a shared predictor does provide significant improvement over the baseline models, but not as much improvement as non-shared predictors. When a full covariance matrix is used, however, the shared diagonal predictor actually has a lower error rate than the non-shared diagonal predictors on both the test and training data. Presumably, this is because the shared predictors (and, therefore, the modifications to the covariance matrices) are better estimated due to the greater number of training observations applied to the re-estimation of the predictor parameters. These results demonstrate that a shared predictor is an efficient means of improving the HMM's ability to model and discriminate between speech classes.

² All of the average log probabilities per frame given in the tables of results are one iteration behind the HMM set for which the results are reported.
Offset    Log Prob    Train Err %    Test Err %
  -8        -29.8        9.67          10.68
  -7        -29.3        9.55          10.80
  -6        -28.8        8.81           9.95
  -5        -28.1        8.14           8.98
  -4        -27.1        7.65           8.50
  -3        -25.7        7.65           8.37
  -2        -22.9        9.00           9.47
  -1        -13.9       16.02          16.03
  +1        -13.7       12.79          15.66
  +2        -22.9        8.51           9.47
  +3        -25.7        7.28           8.13
  +4        -27.2        7.04           7.77
  +5        -28.1        7.16           8.37
  +6        -28.8        7.41           8.86
  +7        -29.3        7.53           8.01
  +8        -29.6        7.71           9.10

Table 3. Average Log Probability per Frame and Recognition Results for HMM Sets with Diagonal Covariance Matrices and Diagonal Globally Shared Predictors at Different Offsets
To examine the effect of using different offsets with a single predictor, experiments were performed on HMMs with a diagonal covariance matrix and a shared diagonal predictor for all offsets between -8 and +8. The specific values are listed in Table 3. For this task and HMM structure, the offsets from +3 to +7 provide the best recognition results.³ Overall these findings are similar to those reported by Brown, Kenny et al., and Woodland. Note that in the case of a ±1 predictor, especially the -1 case, the average log probability, usually an indicator of how well the model fits the data, is extremely high compared to the other offsets. Brown hypothesized that the poor results were due to a serious modeling problem. An alternative hypothesis can be based on the proposition that all speech utterances are highly correlated between adjacent frames. A ±1 offset therefore makes the model `too good,' in the sense that it is no longer particularly discriminating: each model fits a large subset of the utterances with a high probability, making the recognition task more likely to fail. This hypothesis is discussed further in [7]. Several sets of predictor pairs were also chosen and tested, but no pair of predictors outperformed the better single-predictor results given above. For the full results and analysis see [7].
6. CONCLUSIONS
The recognition results clearly show that the use of shared linear predictors can reduce both the test and training set error rates compared to HMMs without predictors. Furthermore, HMMs using a shared predictor can produce lower test set error rates than HMMs with non-shared predictors in cases where the data set is insufficient for robust estimation of the individual predictor parameters. The results also show that the predictor offsets must be carefully chosen to maximize the recognition results. In the E-set task, the best results were obtained with forward-looking predictors at offsets between +3 and +7, probably due to the nature of the task, in which the initial frames are relatively more important for discriminating between the models. Overall, shared vector linear prediction has been shown to be an efficient way of increasing performance without significantly increasing memory use or computational load.

³ Results are not shown in Table 3 for offset 0, as it would result in an infinite output probability for each state because the variance of the Gaussian distribution would be zero.
REFERENCES
[1] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Speech recognition with continuous-parameter hidden Markov models. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition. Morgan Kaufmann Publishers, Inc., 1990.
[2] L. E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process. Inequalities, 3:1-8, 1972.
[3] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164-171, 1970.
[4] Peter F. Brown. The acoustic-modeling problem in automatic speech recognition. Technical Report RC 12750, IBM Thomas J. Watson Research Center, 1987.
[5] Patrick Kenny, Matthew Lenning, and Paul Mermelstein. A linear predictive HMM for vector-valued observations with applications to speech recognition. IEEE Trans. on Acoustics, Speech, and Signal Processing, 38(2), February 1990.
[6] Louis A. Liporace. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Trans. Information Theory, 28(5):729-734, 1982.
[7] Bruce A. Maxwell. Hidden Markov models using shared and global vector linear predictors. Master's thesis, Cambridge University, August 1992.
[8] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Alex Waibel and Kai-Fu Lee, editors, Readings in Speech Recognition. Morgan Kaufmann Publishers, Inc., 1990.
[9] C. J. Wellekens. Explicit time correlation in hidden Markov models for speech recognition. In Proceedings ICASSP-87, pages 384-386, 1987.
[10] P. C. Woodland and D. R. Cole. Optimising hidden Markov models using discriminative output distributions. In Proceedings ICASSP-91, pages 545-548, 1991.
[11] Philip C. Woodland. Hidden Markov models using vector linear predictors and discriminative output distributions. In Proceedings ICASSP-92, 1992.