Pattern Recognition 33 (2000) 1749-1758
An improved maximum model distance approach for HMM-based speech recognition systems

Q.H. He, S. Kwong*, K.F. Man, K.S. Tang

South China University of Technology, People's Republic of China
Department of Computer Science, City University of Hong Kong, 83 Tatchee Ave, Kowloon, Hong Kong, People's Republic of China

Received 28 January 1999; received in revised form 21 June 1999; accepted 21 June 1999
Abstract

This paper proposes an improved maximum model distance (IMMD) approach for HMM-based speech recognition systems based on our previous work [S. Kwong, Q.H. He, K.F. Man, K.S. Tang, A maximum model distance approach for HMM-based speech recognition, Pattern Recognition 31 (3) (1998) 219-229]. It defines a more realistic model distance for HMM training and utilizes the limited training data in a more effective manner. Discriminative information contained in the training data is used to improve the performance of the recognizer, and the HMM parameter adjustment rules are derived in detail. Theoretical and practical issues concerning this approach are also discussed and investigated in this paper. Both isolated word and continuous speech recognition experiments show that a significant error reduction can be achieved by IMMD when compared with the maximum model distance (MMD) criterion and with other training methods using the minimum classification error (MCE) and the maximum mutual information (MMI) approaches. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
1. Introduction

Hidden Markov models (HMMs) have proven to be one of the most successful statistical modeling methods in the area of speech recognition, especially for continuous speech recognition [1,2]. The most difficult problem in HMM application is how to obtain an HMM for each basic recognition unit (which may be a subword, word or phrase) from limited training data. Besides the maximum likelihood (ML) estimation approach, other approaches have been proposed to solve the training problem of HMMs, such as the maximum mutual information (MMI) criterion [3,4], the minimum discrimination information (MDI) criterion [5] and the minimum classification error (MCE) criterion [6,7]. The MMI approach assumes that a word model is given and attempts to find the set of HMMs in which the sample average of the mutual information with respect to the given word model is maximized.
* Corresponding author. Tel.: +852-2788-7704; fax: +852-2788-8614. E-mail address: [email protected] (S. Kwong).
The MDI approach is performed by joint minimization of the discrimination information measure over all probability densities (PDs) of the source that satisfy a given set of moment constraints, and all PDs of the model from the given parametric family. The expected performance of the MDI approach for HMMs is as yet unknown, since it has not been fully implemented or studied, and no simple robust implementation of the procedure is known. Both the MMI and MDI modeling approaches aim indirectly at reducing the error rate of the recognizer. In either case, however, it is difficult to show theoretically that the modeling approach results in a recognition scheme that minimizes the probability of error [8]. MCE differs from distribution estimation approaches, such as ML, in that the recognition error is expressed in the computational steps in such a way that it leads to the minimization of recognition errors. It was claimed that, in general, a significant reduction of the recognition error rate can be achieved by MCE compared with the traditional ML method [7]. The main problem of this approach is how to select a proper error function that incorporates the recognition operation and performance in a functional form.
We proposed a maximum model distance (MMD) criterion for training HMMs in Ref. [9], which can automatically focus on those training data that are important for discriminating acoustically similar words from each other. Both speaker-dependent and multi-speaker experiments demonstrated that the MMD approach can significantly reduce the number of recognition errors compared with the ML approach. The main disadvantage of MMD is that it pays the same attention to all competitive words during the estimation of the model parameters of a labeled word, which does not reflect reality: different competitive words have different impacts in the recognition phase of the system. Through further study, we adopted a more suitable model distance definition, which makes different competitive words play different roles in the training phase. The HMM parameter adjustment formulas are derived in Section 2, where theoretical and practical issues concerning this approach in speech recognition are also investigated. Theoretical analysis and experimental results demonstrate that IMMD is superior to MMD in terms of recognition rate.
2. The improved maximum model distance approach

For simplicity of notation, we assume that the task is to recognize a vocabulary of M acoustic units (a unit refers to any legitimate lexical unit such as a phoneme, subword, word or phrase), and that an HMM is constructed for each unit. Let $\Lambda = \{\lambda_l,\ l = 1, 2, \ldots, M\}$ be the model set, where $\lambda = (\pi, A, B)$ represents an HMM with N states.
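For concreteness, the sketch below shows one way such a model set could be held in code. The class name, the random initialisation and the scaled forward recursion are illustrative assumptions for this sketch, not part of the paper's formulation.

```python
import numpy as np

class DiscreteHMM:
    """Minimal discrete HMM lambda = (pi, A, B) with N states and K symbols.
    Names follow the paper's notation; the random initialisation is illustrative."""

    def __init__(self, n_states, n_symbols, seed=None):
        rng = np.random.default_rng(seed)
        self.pi = rng.dirichlet(np.ones(n_states))             # initial state probabilities
        self.A = rng.dirichlet(np.ones(n_states), n_states)    # a_ij, each row sums to 1
        self.B = rng.dirichlet(np.ones(n_symbols), n_states)   # b_j(k), each row sums to 1

    def log_likelihood(self, obs):
        """log P(O | lambda) via the scaled forward recursion (cf. Eq. (22) in Appendix A)."""
        alpha = self.pi * self.B[:, obs[0]]
        log_p = np.log(alpha.sum())
        alpha /= alpha.sum()
        for o_t in obs[1:]:
            alpha = (alpha @ self.A) * self.B[:, o_t]   # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij b_j(o_{t+1})
            log_p += np.log(alpha.sum())
            alpha /= alpha.sum()                        # rescale to avoid underflow
        return log_p

# One model per vocabulary unit: the model set Lambda = {lambda_l, l = 1..M}
model_set = [DiscreteHMM(n_states=6, n_symbols=64, seed=l) for l in range(21)]
```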
2.1. Maximum model distance approach

For any pair of HMMs $\lambda_l$ and $\lambda_h$, Juang and Rabiner [10] proposed a probabilistic distance measure

$$D(\lambda_l, \lambda_h) = \lim_{T_l \to \infty} \frac{1}{T_l}\{\log P(O^l|\lambda_l) - \log P(O^l|\lambda_h)\}, \qquad (1)$$

where $O^l = (o_1^l o_2^l \cdots o_{T_l}^l)$ is a sequence of observation symbols generated by $\lambda_l$. Petrie's limit theorem [11] guarantees the existence of such a distance measure and ensures that $D(\lambda_l, \lambda_h)$ is nonnegative. We generalized this definition to any observation sequence of finite length,

$$D(\lambda_l, \lambda_h) = \frac{1}{T_l}\{\log P(O^l|\lambda_l) - \log P(O^l|\lambda_h)\}, \qquad (2)$$

and furthermore defined a model distance measure $D(\lambda_l, \Lambda)$ between model $\lambda_l$ and the model set $\Lambda$ as

$$D(\lambda_l, \Lambda) = \frac{1}{T_l}\left\{\log P(O^l|\lambda_l) - \log\left[\frac{1}{M-1}\sum_{h=1,\,h\neq l}^{M} P(O^l|\lambda_h)\right]\right\}. \qquad (3)$$

The maximum model distance (MMD) criterion is to find the entire model set $\Lambda$ such that the model distance is maximized:

$$\Lambda_{\mathrm{MMD}} = \arg\max_{\Lambda} \sum_{l=1}^{M} D(\lambda_l, \Lambda). \qquad (4)$$

2.2. Limitation of $D(\lambda_l, \Lambda)$

Usually, the classifier/recognizer operates under the following decision rule:

$$C(O) = C_l \quad \text{iff} \quad P(O|\lambda_l) = \max_h P(O|\lambda_h). \qquad (5)$$

In Eq. (3), all competitors of word $\lambda_l$ are considered with the same level of importance, which in general is not consistent with the decision rule used in the recognition phase. Assume that O is labeled as $\lambda_o$; the competitors of $\lambda_o$ can be classified into two clusters, $S_1 = \{\lambda_h : P(O|\lambda_h) \geq P(O|\lambda_o)\}$ and $S_2 = \{\lambda_h : P(O|\lambda_h) < P(O|\lambda_o)\}$. If $S_1$ is not empty, then an incorrect decision is made. The goal of any classifier design is to achieve a minimum error probability. Therefore, different competitors of $\lambda_o$ should have different impacts in the training phase. If the influences of the competitors on the system performance are considered in the training (learning) mode of a recognizer, then the competitors of $\lambda_o$ in $S_1$ should be treated more seriously. In other words, utterance O should have more influence on the models in $S_1$ than on those in $S_2$. The aim is to reduce the size of $S_1$ to zero, which finally makes the decision correct.

2.3. The improved MMD approach
When we take the above considerations into account and combine them with the basic principle of discriminative training [12], a new definition of $D(\lambda_l, \Lambda)$ that relaxes the shortcoming of the definition in Eq. (3) is given as follows:

$$D(\lambda_l, \Lambda) = \frac{1}{T_l}\left\{\log P(O^l|\lambda_l) - \log\left[\frac{1}{M-1}\sum_{h\neq l} \bigl(P(O^l|\lambda_h)\bigr)^{\eta}\right]^{1/\eta}\right\}, \qquad (6)$$

where $\eta$ is a positive number. When $\eta$ approaches $\infty$, the term in the bracket becomes $\max_{h,\,h\neq l} P(O^l|\lambda_h)$, i.e. only the top competitor is considered. When searching for the classifier parameter set $\Lambda$, one can realize different weight distributions among the competitors of $\lambda_l$ by varying the value of $\eta$:

$$D(\Lambda) = \sum_{l=1}^{M} D(\lambda_l, \Lambda) = \sum_{l=1}^{M} \frac{1}{T_l}\log P(O^l|\lambda_l) - \sum_{l=1}^{M}\frac{1}{T_l}\frac{1}{\eta}\log\sum_{h\neq l}\bigl(P(O^l|\lambda_h)\bigr)^{\eta} - \frac{1}{\eta}\sum_{l=1}^{M}\frac{1}{T_l}\log\frac{1}{M-1}. \qquad (7)$$

Since $D(\Lambda)$ is a smooth and differentiable function of the model parameter set $\Lambda$, traditional optimization procedures such as gradient schemes can be used to find the optimal solution of Eq. (4). The parameter adjustment rule is

$$\tilde{\Lambda}_{n+1} = \Lambda_n + \varepsilon_n\, U_n\, \nabla D(\Lambda)\big|_{\Lambda=\Lambda_n}, \qquad (8)$$

where $\tilde{\Lambda}$ is used to distinguish from $\Lambda$, which satisfies the stochastic constraints on the HMM parameters, such as $\sum_j a_{ij} = 1$ ($i = 1, 2, \ldots, N$). $\varepsilon_n$ is a small positive number that satisfies certain stochastic convergence constraints [13], $U_n$ can be an identity matrix or a properly designed positive-definite matrix, and $\nabla D(\Lambda)$ is the gradient vector of the target function with respect to the parameter set $\Lambda$. From Eq. (7), we get

$$\frac{\partial D(\Lambda)}{\partial \lambda_l} = \frac{1}{T_l\, P(O^l|\lambda_l)}\frac{\partial P(O^l|\lambda_l)}{\partial \lambda_l} - \sum_{i\neq l}\frac{R^l_i}{T_i\, P(O^i|\lambda_l)}\frac{\partial P(O^i|\lambda_l)}{\partial \lambda_l}, \qquad (9)$$

where

$$R^l_k = \frac{P^{\eta}(O^k|\lambda_l)}{\sum_{h\neq k} P^{\eta}(O^k|\lambda_h)},$$

and this weight is what makes the difference between the IMMD and the MMD approaches. If we let $P_{lk} = P(O^k|\lambda_l)$, then $R^l_k$ can be rewritten as $R^l_k = (P_{lk}/P_{kk})^{\eta} / \sum_{h\neq k}(P_{hk}/P_{kk})^{\eta}$. It can be seen that $P_{lk}/P_{kk}$ has a close relationship with the model distance $D(\lambda_l, \lambda_k)$ (refer to Eq. (2)): the term $P_{lk}/P_{kk}$ can be interpreted as the similarity between models $\lambda_l$ and $\lambda_k$ measured on the observation sequence $O^k$. Based on these observations, the term $R^l_k$ can be explained as a relative similarity measure between $\lambda_l$ and $\lambda_k$ against all competitors of $\lambda_k$. If $\lambda_l$ is more similar to $\lambda_k$ than all other competitors of $\lambda_k$, then the probability that $\lambda_l$ mis-recognizes $O^k$ is high. Also, if $R^l_k$ is larger than the other $R^h_k$ ($h \neq l, k$), then $O^k$, labeled as $\lambda_k$, has a large influence on the model parameter adjustment of $\lambda_l$, which benefits the distinguishing ability of $\lambda_l$. Thus, this training procedure automatically focuses on those training data that are important for distinguishing between acoustically similar words.

It can also be seen that when $\eta$ approaches $\infty$, $R^l_k$ is equal to either 0 or 1: when $O^k$ makes $\lambda_l$ the top competitor of $\lambda_k$, then $R^l_k = 1$; in all other cases $R^l_k$ equals 0:

$$R^l_k = \begin{cases} 1, & P(O^k|\lambda_l) = \max_h P(O^k|\lambda_h), \\ 0, & P(O^k|\lambda_l) \neq \max_h P(O^k|\lambda_h), \end{cases} \qquad (h = 1, 2, \ldots, M,\ h \neq k).$$

This means that during the re-estimation of the model parameters, IMMD not only improves the ability of $\lambda_l$ to distinguish its own tokens from others, but also reduces the possibility of $\lambda_l$ becoming the top competitor of its competitors. This will improve the discriminative ability of the entire model set $\Lambda$. When $\eta$ approaches 0, $R^l_k$ approaches $1/(M-1)$, and IMMD degenerates to MMD. At this point, we can summarize the relationship between IMMD and MMD as follows.

• Both IMMD and MMD consider the discrimination effect of the competitive tokens on the parameter estimation of model $\lambda_l$. MMD treats all competitive tokens at the same level, whereas IMMD weights the contribution of each competitive token to $\lambda_l$ by $R^l_k$ (its relative influence). This is in fact a more reasonable way to utilize the given data than the MMD approach.
• MMD is a special case of IMMD, obtained by setting $\eta$ to 0.
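As an illustration of how the weighting factor behaves, the short sketch below evaluates $R^l_k$ from per-model log-likelihoods in the log domain for numerical stability. The function name and the example numbers are assumptions made for this sketch, not values from the paper.

```python
import numpy as np

def competitor_weights(log_p, k, eta):
    """R_k^l = P^eta(O^k|lambda_l) / sum_{h != k} P^eta(O^k|lambda_h)
    (cf. the definition following Eq. (9)).

    log_p[h] = log P(O^k | lambda_h) for every model h; k is the label of O^k.
    Returns an array R with R[k] = 0 and R[l] = R_k^l for l != k,
    evaluated in the log domain to avoid underflow of P^eta.
    """
    log_p = np.asarray(log_p, dtype=float)
    scores = eta * log_p                 # log P^eta(O^k | lambda_h)
    scores[k] = -np.inf                  # the labelled model is not its own competitor
    scores -= scores.max()               # log-sum-exp stabilisation
    w = np.exp(scores)
    return w / w.sum()

# Behaviour at the two extremes discussed above (hypothetical log-likelihoods of O^k, k = 1):
log_p = np.array([-110.0, -100.0, -105.0, -120.0])
print(competitor_weights(log_p, k=1, eta=5.0))    # mass concentrates on the top competitor
print(competitor_weights(log_p, k=1, eta=1e-6))   # ~uniform 1/(M-1), i.e. the MMD special case
```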
In the case of a discrete HMM with N states and K distinct observation symbols, for model $\lambda_l$ we can derive the following parameter adjustment rules by using a proper positive-definite matrix sequence $U_n$ (for the details of the derivation, please refer to Appendix A):

$$\tilde{\pi}_i^{\,n+1} = \pi_i^n + \varepsilon_n\left[\gamma_1^l(i) - \sum_{h\neq l} R^l_h\,\gamma_1^h(i)\right], \qquad i = 1, 2, \ldots, N, \qquad (10a)$$

$$\tilde{a}_{ij}^{\,n+1} = a_{ij}^n + \varepsilon_n\left[\xi_{ij}^l - \sum_{h\neq l} R^l_h\,\xi_{ij}^h\right], \qquad i, j = 1, 2, \ldots, N, \qquad (10b)$$

$$\tilde{b}_j^{\,n+1}(k) = b_j^n(k) + \varepsilon_n\left[\gamma_{jk}^l - \sum_{h\neq l} R^l_h\,\gamma_{jk}^h\right], \qquad j = 1, 2, \ldots, N,\ k = 1, 2, \ldots, K, \qquad (10c)$$

where $\tilde{x}$ is used to distinguish from the $x$ that satisfies the stochastic constraints of the HMM: (i) $\sum_i \pi_i = 1$, (ii) $\sum_j a_{ij} = 1\ \forall i$, (iii) $\sum_k b_i(k) = 1\ \forall i$. Here $\gamma_1^h(i) = P(q_1 = i\,|\,O^h, \lambda_l)/T_h$ is the expected frequency of being in state i of $\lambda_l$ at time $t = 1$ in $O^h$, normalized by $T_h$; $\xi_{ij}^h = (1/T_h)\sum_{t=1}^{T_h-1} P(q_t = i, q_{t+1} = j\,|\,O^h, \lambda_l)$ is the expected number of transitions from state i to state j of $\lambda_l$ in $O^h$, normalized by $T_h$; and $\gamma_{jk}^h = (1/T_h)\sum_{t=1}^{T_h} P(q_t = j\,|\,O^h, \lambda_l)\,\delta(o_t, v_k)$ is the expected number of times of being in state j of $\lambda_l$ and observing symbol $v_k$ in $O^h$, normalized by $T_h$.

Comparing Eq. (10) with the corresponding equations in Ref. [1], the difference is the use of an adaptive weighting factor for each acoustically similar utterance of $O^l$ instead of the constant factor $1/(M-1)$. The advantages of this adaptive weighting factor have been discussed above. To enforce the stochastic constraints of the HMM, the quantities in Eq. (10) should be normalized with $x_i = \tilde{x}_i / \sum_i \tilde{x}_i$.

In the case of a continuous HMM with N states, each state output distribution is a finite mixture of the form

$$b_j(o) = \sum_{k=1}^{K} c_{jk}\,N(o, \mu_{jk}, \Sigma_{jk}), \qquad 1 \leq j \leq N, \qquad (11)$$

where $o$ is the observation vector being modeled and $c_{jk}$ is the mixture coefficient for the kth mixture in state j, satisfying the stochastic constraint

$$\sum_{k=1}^{K} c_{jk} = 1, \qquad c_{jk} \geq 0, \qquad 1 \leq j \leq N. \qquad (12)$$
$N(\cdot)$ is any log-concave or elliptically symmetric density. Usually $N(\cdot)$ denotes a normal distribution with mean vector $\mu_{jk} = [\mu_{jkl}]_{l=1}^{L}$ and covariance matrix $\Sigma_{jk}$ for the kth mixture component in state j. For the sake of simplicity, $\Sigma_{jk}$ is assumed to be diagonal, i.e. $\Sigma_{jk} = \mathrm{diag}([\sigma^2_{jkl}]_{l=1}^{L})$. Since $P(O|\lambda) = \sum_{i}\sum_{j} \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)$, based on Eq. (11) we get

$$\frac{\partial P(O|\lambda)}{\partial c_{jk}} = \frac{P(O|\lambda)}{c_{jk}}\sum_{t=1}^{T}\gamma_t(j,k), \qquad (13)$$

$$\frac{\partial P(O|\lambda)}{\partial \mu_{jkl}} = \frac{P(O|\lambda)}{\sigma^2_{jkl}}\sum_{t=1}^{T}\gamma_t(j,k)\,(o_{tl} - \mu_{jkl}), \qquad (14)$$

$$\frac{\partial P(O|\lambda)}{\partial \sigma_{jkl}} = \frac{P(O|\lambda)}{\sigma_{jkl}}\sum_{t=1}^{T}\gamma_t(j,k)\left[\frac{(o_{tl} - \mu_{jkl})^2}{\sigma^2_{jkl}} - 1\right], \qquad (15)$$

where $\gamma_t(j,k)$ is the probability of being in state j at time t with the kth mixture component accounting for $o_t$, i.e.

$$\gamma_t(j,k) = \left[\frac{\alpha_t(j)\,\beta_t(j)}{\sum_{j}\alpha_t(j)\,\beta_t(j)}\right]\left[\frac{c_{jk}\,N(o_t, \mu_{jk}, \Sigma_{jk})}{\sum_{k}c_{jk}\,N(o_t, \mu_{jk}, \Sigma_{jk})}\right]
= \begin{cases} \dfrac{1}{P(O|\lambda)}\,\pi_j\,\beta_1(j)\,c_{jk}\,N(o_1, \mu_{jk}, \Sigma_{jk}), & t = 1, \\[2mm] \dfrac{1}{P(O|\lambda)}\,\displaystyle\sum_{i}\alpha_{t-1}(i)\,a_{ij}\,\beta_t(j)\,c_{jk}\,N(o_t, \mu_{jk}, \Sigma_{jk}), & t > 1. \end{cases} \qquad (16)$$
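The mixture occupancy of Eq. (16) can be computed directly from the forward and backward variables. The sketch below is one possible vectorised implementation; the argument names and shapes are assumptions of this sketch, and it assumes diagonal-covariance Gaussians as in the text.

```python
import numpy as np

def mixture_occupancy(alpha, beta, c, mu, var, obs):
    """gamma_t(j,k) of Eq. (16): probability of being in state j at time t with
    mixture component k accounting for o_t.

    alpha, beta : (T, N) forward/backward variables for this utterance
    c           : (N, K) mixture weights;  mu, var : (N, K, D) diagonal Gaussians
    obs         : (T, D) observation vectors
    Returns gamma of shape (T, N, K).
    """
    # State occupancy gamma_t(j) = alpha_t(j) beta_t(j) / sum_j alpha_t(j) beta_t(j)
    state_post = alpha * beta
    state_post /= state_post.sum(axis=1, keepdims=True)

    # Diagonal-Gaussian log densities log N(o_t; mu_jk, var_jk) for every (t, j, k)
    diff = obs[:, None, None, :] - mu[None, :, :, :]                       # (T, N, K, D)
    log_pdf = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None]).sum(axis=-1)

    # Within-state mixture posterior c_jk N(.) / sum_k c_jk N(.)
    # (exp may underflow for very long feature vectors; acceptable for a sketch)
    comp = c[None] * np.exp(log_pdf)                                       # (T, N, K)
    comp /= comp.sum(axis=2, keepdims=True)
    return state_post[:, :, None] * comp
```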
For model $\lambda_l$, from Eqs. (8) and (9) we can derive the following parameter adjustment rules by designing a proper positive-definite matrix sequence $U_n$:

$$\tilde{c}_{jk}^{\,n+1} = c_{jk}^n + \varepsilon_n\left[\bar{\gamma}^l(j,k) - \sum_{h\neq l} R^l_h\,\bar{\gamma}^h(j,k)\right], \qquad (17)$$

$$\tilde{\mu}_{jkl}^{\,n+1} = \mu_{jkl}^n + \varepsilon_n\left[\Delta\bar{o}^{\,l}_l - \sum_{h\neq l} R^l_h\,\Delta\bar{o}^{\,h}_l\right], \qquad (18)$$

$$\tilde{\sigma}_{jkl}^{\,n+1} = \sigma_{jkl}^n + \varepsilon_n\,\sigma_{jkl}^n\left[\upsilon^l_{jkl} - \sum_{h\neq l} R^l_h\,\upsilon^h_{jkl}\right], \qquad (19)$$

where

$$\bar{\gamma}^h(j,k) = \frac{1}{T_h}\sum_{t=1}^{T_h}\gamma^h_t(j,k), \qquad
\Delta\bar{o}^{\,h}_l = \frac{1}{T_h}\sum_{t=1}^{T_h}\gamma^h_t(j,k)\,(o^h_{tl} - \mu_{jkl}), \qquad
\upsilon^h_{jkl} = \frac{1}{T_h}\sum_{t=1}^{T_h}\gamma^h_t(j,k)\left[\frac{(o^h_{tl} - \mu_{jkl})^2}{\sigma^2_{jkl}} - 1\right].$$
The re-estimation formulas for $a_{ij}$ and $\pi_i$ are identical to those used for discrete observation densities (Eq. (10)).

2.4. Computational complexity analysis

HMMs with continuous probability density functions are used in our experiments, and their computational complexity is analyzed here. Because of the weight factor $R^l_k$, IMMD needs extra computation to evaluate $R^l_k$: when calculating the contributions of $O^k$ to $\lambda_l$, $R^l_k$ must be estimated first. According to the definition of $R^l_k$, $M-1$ forward calculations are needed to compute $P(O^k|\lambda_h)$ ($h = 1, \ldots, M$, $h \neq l$). At first sight, IMMD therefore appears to have a much higher computational complexity than MMD. However, this is not the case, since the computation of $P(O^k|\lambda_h)$ ($h = 1, \ldots, M$) is also needed in MMD. MMD requires M calculations of the forward and backward variables in order to re-estimate the entire model set $\Lambda$, but each model is trained sequentially. IMMD re-estimates the model set parameters simultaneously: M calculations of the forward and backward variables are needed for one re-estimation of the entire model set $\Lambda$, and only two additions and one multiplication are needed to calculate one $R^l_k$. There are $M(M-1)$ values of $R^l_k$ to be computed for one re-estimation of the model set; thus, a total of $2M(M-1)$ additions and $M(M-1)$ multiplications are required to calculate all $R^l_k$. In addition, extra multiplications of the order of MNKD are needed during the model parameter estimation for those $R^l_k$ in the set $S_1$, where D is the dimension of the feature vector. In total, on the order of MNKD extra basic operations are needed for IMMD to re-estimate the model
parameters compared with MMD, which is only a small portion of the entire computation requirement, as explained below. Besides the calculations of the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$, on the order of $MNKD\bar{T}$ calculations are required to re-estimate the parameters of an HMM, where $\bar{T}$ is the average length of a feature vector sequence (in frames). Therefore, the additional computational cost of IMMD is only about $1/N\bar{T}$ of that of MMD, which can be ignored, because $N\bar{T}$ is usually larger than 100 under typical settings such as $N = 5$ and $\bar{T} = 30$. If the calculations of the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$ are also taken into account, the share of the additional calculation is less than $1/N\bar{T}$ of the total. In conclusion, the additional computational complexity of IMMD over MMD is not a serious problem.

In principle, IMMD uses not only the labeled data but also all the competitive data to estimate the parameters of model $\lambda_l$. It has a much higher computational complexity than the ML approach, about M times that of ML, where M is the number of models or basic recognition units, which is usually not a small number; for example, there are 60 distinct phonemes in the TIMIT database. Fortunately, several approaches can be used to reduce the computation of IMMD without adversely affecting its performance. For example, the method used in Ref. [1] gives a hybrid training of ML and MMD. Here we give another method, based on the fact that most of the tokens are recognized correctly during the training procedure and the top likelihood of a token measured on the model set $\Lambda$ is much higher than the others:

(1) Initialize the model set $\Lambda$, then perform the following operations repeatedly until the re-estimation procedure converges.
(2) Define the competitive model set $\Omega_l$ of $O^l$: if $\log P(O^l|\lambda_h) > \log P(O^l|\lambda_l) - \ell$, then model $\lambda_h$ is a competitive model of token $O^l$ (a sketch of this step is given after this list).
(3) Calculate the contribution of every token $O^l$ to its own model and to every model in $\Omega_l$. Usually, the size of $\Omega_l$ is much smaller than M.
(4) Re-estimate $\lambda_l$ with the adjustment equations of IMMD.

To save computation time, steps 3 and 4 can be repeated several times after each execution of step 2, which saves the time spent defining the competitive model set $\Omega_l$ of each $O^l$. The competitive model set $\Omega_l$ of each $O^l$ is usually defined through the forward calculation or the Viterbi algorithm, and step 2 needs M forward or Viterbi calculations to define $\Omega_l$ for each $O^l$. Adopting the above application procedure of IMMD, the computational complexity of IMMD is reduced to 4-5 times that of ML, as shown in our experiments.
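A minimal sketch of step 2 above, assuming the threshold $\ell$ is given as a log-likelihood margin; the function name, the `margin` parameter and the example numbers are illustrative assumptions.

```python
import numpy as np

def competitive_set(log_p, label, margin):
    """Step 2 of the procedure above (a sketch): model lambda_h is a competitor of
    token O^l if log P(O^l|lambda_h) > log P(O^l|lambda_label) - margin.

    log_p[h] = log P(O^l | lambda_h); `margin` plays the role of the threshold in the text.
    """
    log_p = np.asarray(log_p, dtype=float)
    keep = log_p > log_p[label] - margin
    keep[label] = False                   # the labelled model itself is not a competitor
    return np.flatnonzero(keep)

# Example: with a margin of 20 only two of the five models survive as competitors.
print(competitive_set([-300.0, -250.0, -262.0, -410.0, -255.0], label=1, margin=20.0))
```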
2.5. Extension to multiple observation sequences

Left-to-right HMMs are commonly used in speech recognition; hence, we cannot make reliable estimates of all model parameters from a single observation sequence, because the transient nature of the states within the model allows only a small number of observations for any state. To have sufficient data for reliable estimation of all model parameters, one has to use multiple observation sequences. The modification of the re-estimation procedure is straightforward and is stated as follows. Let $O^l = [O^l_1, O^l_2, \ldots, O^l_{C_l}]$ be the training data labeled to model $\lambda_l$, i.e. the training data of $\lambda_l$ consist of $C_l$ observation sequences. The distance $D(\lambda_l, \Lambda)$ is redefined as
$$D(\lambda_l, \Lambda) = \frac{1}{C_l}\sum_{c=1}^{C_l}\frac{1}{T_{lc}}\left\{\log P(O^l_c|\lambda_l) - \log\left[\frac{1}{M-1}\sum_{h\neq l}\bigl(P(O^l_c|\lambda_h)\bigr)^{\eta}\right]^{1/\eta}\right\}. \qquad (20)$$

Thus the modified re-estimation formulas for model $\lambda_l$ become

$$\tilde{\pi}_i^{\,n+1} = \pi_i^n + \varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\gamma_1^{lc}(i) - \sum_{h\neq l}\frac{1}{C_h}\sum_{c=1}^{C_h} R^l_{hc}\,\gamma_1^{hc}(i)\right], \qquad i = 1, 2, \ldots, N, \qquad (21a)$$

$$\tilde{a}_{ij}^{\,n+1} = a_{ij}^n + \varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\xi_{ij}^{lc} - \sum_{h\neq l}\frac{1}{C_h}\sum_{c=1}^{C_h} R^l_{hc}\,\xi_{ij}^{hc}\right], \qquad i, j = 1, 2, \ldots, N, \qquad (21b)$$

$$\tilde{c}_{jk}^{\,n+1} = c_{jk}^n + \varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\bar{\gamma}^{lc}(j,k) - \sum_{h\neq l}\frac{1}{C_h}\sum_{c=1}^{C_h} R^l_{hc}\,\bar{\gamma}^{hc}(j,k)\right], \qquad (21c)$$

$$\tilde{\mu}_{jkl}^{\,n+1} = \mu_{jkl}^n + \varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\Delta\bar{o}^{\,lc}_l - \sum_{h\neq l}\frac{1}{C_h}\sum_{c=1}^{C_h} R^l_{hc}\,\Delta\bar{o}^{\,hc}_l\right], \qquad (21d)$$

$$\tilde{\sigma}_{jkl}^{\,n+1} = \sigma_{jkl}^n + \varepsilon_n\,\sigma_{jkl}^n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\upsilon^{lc}_{jkl} - \sum_{h\neq l}\frac{1}{C_h}\sum_{c=1}^{C_h} R^l_{hc}\,\upsilon^{hc}_{jkl}\right], \qquad (21e)$$

where all the intermediate variables are defined as before, but measured on $\lambda_l$ with $O^h_c$ instead of $O^h$.
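For illustration, Eq. (21a) could be implemented as below, averaging the per-sequence statistics of each model before taking the gradient step. The dictionary-based data layout and the final renormalisation (the $x_i = \tilde{x}_i/\sum_i \tilde{x}_i$ step mentioned earlier) are assumptions of this sketch.

```python
import numpy as np

def pi_update_multi(pi, gamma1, R, label, lr):
    """Eq. (21a) as a sketch: pi update when each model has several training sequences.

    pi          : (N,) current initial-state probabilities of lambda_label
    gamma1[h]   : list over the C_h sequences of model h of the normalised
                  time-1 occupancies gamma_1^{hc}(i) measured on lambda_label, each (N,)
    R[h]        : list of per-sequence weights R_{hc}^{label}, for competitor models h only
    lr          : step size epsilon_n
    """
    own = np.mean(gamma1[label], axis=0)                       # (1/C_l) sum_c gamma_1^{lc}(i)
    comp = sum(np.mean([r * g for r, g in zip(R[h], gamma1[h])], axis=0)
               for h in gamma1 if h != label)
    new_pi = np.maximum(pi + lr * (own - comp), 1e-12)
    return new_pi / new_pi.sum()                               # re-impose sum_i pi_i = 1
```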
3. Experimental results

To evaluate the performance of IMMD, isolated word recognition and continuous phoneme recognition experiments were carried out on the TIMIT database.
Table 1
Experimental results of isolated word recognition (in error rate)

              MCE      MMI      ML       MMD      IMMD     Error reduction
Closed test   2.03%    0.82%    2.14%    1.73%    0.84%    51.44%
Open test     2.39%    2.38%    2.43%    2.06%    1.69%    17.96%
3.1. Isolated word recognition

For the isolated word experiment, a database of 21 words was used: "all, an, ask, carry, dark, don't, greasy, had, in, like, me, oily, rag, she, suit, that, to, wash, water, year, your". These words are extracted from sentences SA1 and SA2 of TIMIT, the most common sentences in TIMIT; each has 630 utterances spoken by 630 speakers from eight major dialect regions of the United States. In our experiments, 244 utterances of each sentence were used, 160 for HMM training and the other 84 for the open test. All utterances were parameterized using 12 mel-frequency cepstral coefficients (MFCC) and 12 delta-cepstral coefficients. For each word, the training data were extracted from the 488 utterances according to the time-aligned transcription provided by TIMIT. Twenty-one context-independent word models were used, each with six states and three mixtures per state.

Table 1 shows the experimental results of the recognizers trained with IMMD, MMD and ML. For comparison, the error rates of recognizers trained with MCE and MMI are also listed in the left columns of Table 1. The results indicate that the proposed improved maximum model distance approach is superior to the original MMD, especially on the closed testing set, achieving a 17.96% error reduction for the open test and 51.44% for the closed test. Meanwhile, the experiment provides further support for the conclusion, reached in Ref. [9] on the basis of discrete HMMs, that MMD is superior to ML. All these results are expected, because IMMD utilizes more of the discriminative information in the given training data than MMD does, and MMD utilizes the training data more effectively than ML does. MMI has a closed-test performance similar to that of IMMD, but its open-test performance is only close to that of ML. Although the HMMs trained by IMMD approximate the distribution of the training data with high precision, they cannot provide the same improvement on open, unseen tokens because of the limitation of the training data. This is a common problem for any statistically optimal method: a mismatch between the finite training data and the infinite test data always exists, which remains an open problem for improving the robustness of HMM-based recognizers.
Table 2
Experimental results of continuous phoneme recognition

Training method    1-%Corr    1-%Acc
MCE                13.87%     19.19%
MMI                14.34%     19.41%
ML                 14.81%     19.78%
MMD                14.02%     18.95%
IMMD               13.15%     18.04%
Err. reduction      6.21%      4.80%
3.2. Continuous phoneme recognition

The continuous phoneme recognition experiment recognizes all the phonemes of the TIMIT database, which consists of 60 phonemes excluding silence. In the TIMIT database no silence appears between any two sequential words, so silence was not considered in the experiments. The end-points of utterances are marked by the labeled time transcription in TIMIT. The experiment uses 600 phonetically balanced utterances (containing 23 147 phonemes) for testing, 400 of which are used for training (containing 15 688 phonemes). All utterances were parameterized using 12 mel-frequency cepstral coefficients (MFCC) and 12 delta-cepstral coefficients. Sixty context-independent phone models were trained, each with three states and five mixtures per state. The acoustic models were trained using a bootstrapping technique that iterates two steps: the first uses the existing set of phone segments to train the acoustic models of the recognition system, and the second uses these acoustic models to perform a forced recognition segmentation of the training utterances with the phonetic transcription given, resulting in a new set of phone segments. The initial segmentation of the training utterances was a simple uniform segmentation.
Let Cor, Del, Ins, Sub and W be, respectively, the number of correctly recognized phonemes, deletions, insertions, substitutions and the total number of phonemes in the test speech, with Cor = W - Del - Sub. The percentage of phonemes correctly recognized is %Corr = (Cor/W) x 100% and the accuracy is %Acc = ((Cor - Ins)/W) x 100%.

Table 2 shows the results of the recognizers trained with IMMD, MMD, ML, MCE and MMI. In recording the errors, only the top recognized phoneme string is used. The results again show that IMMD is superior to MMD, achieving a 6.21% error reduction in correct recognition rate and a 4.80% error reduction in accuracy. The results in Table 2 also suggest that IMMD is an attractive alternative for HMM training when compared with other algorithms such as ML, MMI and MCE.

3.3. Effect of the weight factor η

It was stated earlier that different values of η affect the distribution of the weighting factors $R^l_k$ ($k = 1, 2, \ldots, M$, $k \neq l$), which control the contributions of each competitive utterance of word l and finally affect the performance of the recognizer. We investigated the influence of η on isolated word recognition; the results are shown in Table 3.

Table 3
The experimental results as a function of η

η                      0.06    0.12    0.25    0.5     1.0     1.5     2.5     5.0
Closed Rec. rate (%)   98.77   98.89   99.02   99.07   99.16   99.04   99.06   98.92
Open Rec. rate (%)     97.68   97.74   97.74   97.80   97.91   97.85   97.80   97.68

The setting η = 1.0 gives the best performance, and the performance degrades whenever η moves away from 1.0 in either direction. A reasonable explanation is that when η equals 1.0, each utterance provides a natural contribution to its competitive models, which gives the best weighting of the training data. Another observation is that the recognition rate changes only slowly as η changes. We checked the likelihoods of the competitive models and found that the likelihood of the top competitive model is usually much higher than that of the other competitive models. This means that only a few $R^l_k$ ($l = 1, 2, \ldots, M$, $l \neq k$) have useful values, i.e. values far above 0.0, while most $R^l_k$ are close to 0.0. Therefore, an utterance affects only a few of its top competitive models. Within the investigated range of η, the influence of each utterance on its competitive models changes only slightly for the values of η listed in Table 3. Therefore, only slight changes in system performance were observed in the experiments as η moves away from 1.0, but the trend of the change is predictable.
4. Conclusion

We have shown that the maximum model distance (MMD) criterion is superior to maximum likelihood because it uses some discriminative information to train each HMM [9]. However, MMD regards all competitive models as equally important when considering their contributions to the model re-estimation procedure. This is not entirely realistic, since some competitive models may not be real competitors because their likelihoods are much lower than that of the labeled model. To obtain the best performance, different competitors should be paid different levels of attention according to their competitive ability against the labeled model. This paper gives a solution to this problem by proposing a more reasonable HMM model distance; we call the resulting method the improved MMD (IMMD). The HMM parameter re-estimation formulas were derived, from which we conclude that the improved MMD approach utilizes the training data more effectively than MMD. In fact, MMD is a special case of IMMD obtained by letting the weight factor η approach 0. The computational complexity of IMMD was also discussed, and we showed that it is comparable to that of MMD. Both the isolated word and continuous speech recognition experiments showed that a significant error reduction can be achieved by the proposed approach. In isolated word recognition, IMMD provided a 51.44% error reduction on the closed test and 17.96% on the open test; in continuous phoneme recognition, IMMD decreased the closed test error by 6.21% and the open test error by 4.80%.

To address the limitations of HMMs, many extended models have been proposed in recent years, such as the segmental model [14,2] and the stochastic trajectory model [15]. Although IMMD is designed for standard HMM training, it can easily be extended to handle these extended HMM models.
Acknowledgements

This work is supported in part by the City University of Hong Kong Strategic Grant 7000754, City University
of Hong Kong Direct Allocation Grant 7100081 and the National Natural Science Foundation of China Project 69881001.
Appendix A
A.1. Notations

By setting $\alpha_1(i) = \pi_i\,b_i(o_1)$, the forward variable $\alpha_t(i) = P(o_1 o_2 \cdots o_t, q_t = i\,|\,\lambda)$ ($1 \leq t \leq T$) can be calculated with

$$\alpha_{t+1}(j) = \sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1}), \qquad 1 \leq t \leq T-1,\ 1 \leq j \leq N. \qquad (22)$$

Similarly, setting $\beta_T(i) = 1\ \forall i$, the backward variable $\beta_t(i) = P(o_{t+1} o_{t+2} \cdots o_T\,|\,q_t = i, \lambda)$ can be calculated with

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j), \qquad T-1 \geq t \geq 1,\ 1 \leq i \leq N. \qquad (23)$$

Then

$$P(O|\lambda) = \sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j) \qquad (24a)$$
$$\phantom{P(O|\lambda)} = \sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i). \qquad (24b)$$

Two other probability variables are used in deriving Eq. (10):

$$\gamma_t(i) = P(q_t = i\,|\,O, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{P(O|\lambda)} \qquad (25)$$

and

$$\xi_t(i,j) = P(q_t = i, q_{t+1} = j\,|\,O, \lambda) = \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O|\lambda)}. \qquad (26)$$

A.2. Derivation of Eq. (10)

To derive Eq. (10), the key problem is to calculate $\partial P(O|\lambda)/\partial\lambda$ according to Eqs. (8) and (9). Differentiating Eqs. (24a) and (24b), we get

$$\frac{\partial P}{\partial \pi_i} = b_i(o_1)\,\beta_1(i) = \frac{\alpha_1(i)\,\beta_1(i)}{\pi_i} = \frac{P(O|\lambda)\,\gamma_1(i)}{\pi_i}, \qquad (27)$$

$$\frac{\partial P}{\partial a_{ij}} = \sum_{t=1}^{T-1}\alpha_t(i)\,b_j(o_{t+1})\,\beta_{t+1}(j) = \frac{1}{a_{ij}}\sum_{t=1}^{T-1}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j) = \frac{P(O|\lambda)}{a_{ij}}\sum_{t=1}^{T-1}\xi_t(i,j), \qquad (28)$$

$$\frac{\partial P}{\partial b_{jk}} = \frac{1}{b_{jk}}\left[\sum_{t=1}^{T-1}\sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)\,\delta(o_{t+1}, v_k) + \pi_j\,b_j(o_1)\,\beta_1(j)\,\delta(o_1, v_k)\right]$$
$$\phantom{\frac{\partial P}{\partial b_{jk}}} = \frac{1}{b_{jk}}\left[\sum_{t=1}^{T-1}\alpha_{t+1}(j)\,\beta_{t+1}(j)\,\delta(o_{t+1}, v_k) + \alpha_1(j)\,\beta_1(j)\,\delta(o_1, v_k)\right] = \frac{P(O|\lambda)}{b_{jk}}\sum_{t=1}^{T}\gamma_t(j)\,\delta(o_t, v_k). \qquad (29)$$

From Eq. (9), we get

$$\frac{\partial D(\Lambda)}{\partial \pi^l_i} = \frac{1}{T_l\,P(O^l|\lambda_l)}\frac{\partial P(O^l|\lambda_l)}{\partial \pi^l_i} - \sum_{h\neq l}\frac{R^l_h}{T_h\,P(O^h|\lambda_l)}\frac{\partial P(O^h|\lambda_l)}{\partial \pi^l_i}
= \frac{1}{\pi^l_i}\left[\frac{\gamma^l_1(i)}{T_l} - \sum_{h\neq l}\frac{R^l_h\,\gamma^h_1(i)}{T_h}\right]
= \frac{1}{\pi^l_i}\left[\bar{\gamma}^l_1(i) - \sum_{h\neq l} R^l_h\,\bar{\gamma}^h_1(i)\right], \qquad (30)$$

where $\bar{\gamma}_1(i) = \gamma_1(i)/T$ is the expected frequency of being in state i at time $t = 1$ in O, normalized by the length of the observation sequence;

$$\frac{\partial D(\Lambda)}{\partial a^l_{ij}} = \frac{1}{T_l\,P(O^l|\lambda_l)}\frac{\partial P(O^l|\lambda_l)}{\partial a^l_{ij}} - \sum_{h\neq l}\frac{R^l_h}{T_h\,P(O^h|\lambda_l)}\frac{\partial P(O^h|\lambda_l)}{\partial a^l_{ij}}
= \frac{1}{a^l_{ij}}\left[\frac{1}{T_l}\sum_{t=1}^{T_l-1}\xi^l_t(i,j) - \sum_{h\neq l}\frac{R^l_h}{T_h}\sum_{t=1}^{T_h-1}\xi^h_t(i,j)\right]
= \frac{1}{a^l_{ij}}\left[\bar{\xi}^l_{ij} - \sum_{h\neq l} R^l_h\,\bar{\xi}^h_{ij}\right], \qquad (31)$$

where $\bar{\xi}_{ij} = (1/T)\sum_{t=1}^{T-1}\xi_t(i,j)$ is the expected number of transitions from state i to state j in O, normalized by the length of the observation sequence; and

$$\frac{\partial D(\Lambda)}{\partial b^l_{jk}} = \frac{1}{T_l\,P(O^l|\lambda_l)}\frac{\partial P(O^l|\lambda_l)}{\partial b^l_{jk}} - \sum_{h\neq l}\frac{R^l_h}{T_h\,P(O^h|\lambda_l)}\frac{\partial P(O^h|\lambda_l)}{\partial b^l_{jk}}
= \frac{1}{b^l_{jk}}\left[\frac{1}{T_l}\sum_{t=1}^{T_l}\gamma^l_t(j)\,\delta(o^l_t, v_k) - \sum_{h\neq l}\frac{R^l_h}{T_h}\sum_{t=1}^{T_h}\gamma^h_t(j)\,\delta(o^h_t, v_k)\right]
= \frac{1}{b^l_{jk}}\left[\bar{\gamma}^l_{jk} - \sum_{h\neq l} R^l_h\,\bar{\gamma}^h_{jk}\right], \qquad (32)$$

where $\bar{\gamma}_{jk} = (1/T)\sum_{t=1}^{T}\gamma_t(j)\,\delta(o_t, v_k)$ is the expected number of times of being in state j and observing symbol $v_k$ in O, normalized by the length of the observation sequence.

If we design the positive-definite matrix $U_n$ as a diagonal matrix in which the elements corresponding to $\pi_i$, $a_{ij}$ and $b_j(k)$ are $\pi^n_i$, $a^n_{ij}$ and $b^n_j(k)$, respectively (which ensures that $U_n$ is positive definite), then the adjustment rules are obtained by substituting Eqs. (30)-(32) into Eq. (8):

$$\tilde{\pi}_i^{\,n+1} = \pi_i^n + \varepsilon_n\left[\bar{\gamma}^l_1(i) - \sum_{h\neq l} R^l_h\,\bar{\gamma}^h_1(i)\right], \qquad i = 1, 2, \ldots, N, \qquad (33a)$$

$$\tilde{a}_{ij}^{\,n+1} = a_{ij}^n + \varepsilon_n\left[\bar{\xi}^l_{ij} - \sum_{h\neq l} R^l_h\,\bar{\xi}^h_{ij}\right], \qquad i, j = 1, 2, \ldots, N, \qquad (33b)$$

$$\tilde{b}_{jk}^{\,n+1} = b_{jk}^n + \varepsilon_n\left[\bar{\gamma}^l_{jk} - \sum_{h\neq l} R^l_h\,\bar{\gamma}^h_{jk}\right], \qquad j = 1, 2, \ldots, N,\ k = 1, 2, \ldots, K. \qquad (33c)$$
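A compact sketch of Eqs. (33a)-(33c) with the diagonal $U_n$ described above, followed by the renormalisation $x_i = \tilde{x}_i/\sum_i\tilde{x}_i$ from the main text. The accumulator layout and the function name are assumptions of this sketch.

```python
import numpy as np

def immd_discrete_step(pi, A, B, stats_own, stats_comp, R, lr):
    """One gradient step of Eqs. (33a)-(33c), followed by renormalisation.

    stats_own  : (gamma1, xi, gamma_b) accumulated on the labelled data O^l
    stats_comp : dict h -> the same triple accumulated on competitor data O^h
    R          : dict h -> weight R_h^l
    gamma1 (N,), xi (N,N), gamma_b (N,K): normalised occupancy/transition/emission counts.
    """
    def combine(idx):
        # Bracketed term of Eq. (33): own statistics minus R-weighted competitor statistics
        return stats_own[idx] - sum(R[h] * stats_comp[h][idx] for h in stats_comp)

    def renorm(x, axis=None):
        # Re-impose the stochastic constraints (x_i = x~_i / sum_i x~_i)
        x = np.maximum(x, 1e-12)
        return x / x.sum(axis=axis, keepdims=axis is not None)

    pi_new = renorm(pi + lr * combine(0))
    A_new = renorm(A + lr * combine(1), axis=1)
    B_new = renorm(B + lr * combine(2), axis=1)
    return pi_new, A_new, B_new
```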
References

[1] Y. Gotoh, M.M. Hochberg, H.F. Silverman, Efficient training algorithms for HMMs using incremental estimation, IEEE Trans. Speech Audio Process. 6 (6) (1998) 539-548.
[2] M. Ostendorf, V.V. Digalakis, O.A. Kimball, From HMMs to segment models: a unified view of stochastic modeling for speech recognition, IEEE Trans. Speech Audio Process. 4 (5) (1996) 360-378.
[3] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer, Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, April 1986, pp. 49-52.
[4] Nam Soo Kim, Chong Kwan Un, Deleted strategy for MMI-based HMM training, IEEE Trans. Speech Audio Process. 6 (3) (1998) 299-303.
[5] Y. Ephraim, A. Dembo, L.R. Rabiner, A minimum discrimination information approach for hidden Markov modeling, IEEE Trans. Inform. Theory 35 (5) (1989) 1001-1003.
[6] W. Chou, C.H. Lee, B.H. Juang, F.K. Soong, A minimum error rate pattern recognition approach to speech recognition, Int. J. Pattern Recognition Artif. Intell. 8 (1) (1994) 5-31.
[7] Biing-Hwang Juang, Wu Chou, Chin-Hui Lee, Minimum classification error rate methods for speech recognition, IEEE Trans. Speech Audio Process. 5 (3) (1997) 257-265.
[8] Y. Ephraim, L.R. Rabiner, On the relations between modeling approaches for speech recognition, IEEE Trans. Inform. Theory 36 (2) (1990) 372-380.
[9] S. Kwong, Q.H. He, K.F. Man, K.S. Tang, A maximum model distance approach for HMM-based speech recognition, Pattern Recognition 31 (3) (1998) 219-229.
[10] B.H. Juang, L.R. Rabiner, A probabilistic distance measure for hidden Markov models, AT&T Tech. J. 64 (2) (1985) 391-408.
[11] T. Petrie, Probabilistic functions of finite state Markov chains, Ann. Math. Statist. 40 (1) (1969) 97-115.
[12] L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993 (Chapter 5).
[13] P.C. Chang, B.H. Juang, Discriminative template training for dynamic programming speech recognition, Proceedings of the ICASSP-92, Vol. I, San Francisco, March 1992, pp. 493-496.
[14] M. Ostendorf, S. Roukos, A stochastic segment model for phoneme-based continuous speech recognition, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (1989) 1857-1869.
[15] Y.F. Gong, Stochastic trajectory modeling and sentence searching for continuous speech recognition, IEEE Trans. Speech Audio Process. 5 (1) (1997) 33-44.
About the Author—S. KWONG received his B.Sc. degree and M.A.Sc. degree in electrical engineering from the State University of New York at Buffalo, USA, and the University of Waterloo, Canada, in 1983 and 1985, respectively. In 1996, he obtained his Ph.D. from the University of Hagen, Germany. From 1985 to 1987, he was a diagnostic engineer with Control Data Canada, where he designed diagnostic software to detect manufacturing faults in the VLSI chips of the Cyber 430 machine. He later joined Bell Northern Research Canada as a Member of Scientific Staff, where he worked on both the DMS-100 voice network and the DPN-100 data network projects. In 1990, he joined the City University of Hong Kong as a lecturer in the Department of Electronic Engineering. He is currently an Associate Professor in the Department of Computer Science. His research interests are in genetic algorithms, speech processing and recognition, data compression and networking.

About the Author—QIANHUA HE received the B.S. degree from Hunan Normal University, Changsha City, China, in 1987, the M.S. degree from Xi'an Jiaotong University, Xi'an City, China, in 1990, and the Ph.D. degree from the South China University of Technology (SCUT), Guangzhou City, China, in 1993. From May 1994 to April 1996 he was a research assistant, and from July 1998 to June 1999 a senior research assistant, at the City University of Hong Kong. He is now an associate professor in the Department of Electronic Engineering of SCUT, where he teaches graduate courses and does research work in speech processing and evolutionary algorithms.
About the Author—DR. TANG obtained his B.Sc. from the University of Hong Kong in 1988, and both his M.Sc. and Ph.D. from the City University of Hong Kong in 1992 and 1996, respectively. Prior to his doctoral programme, which started in September 1993, he had worked in Hong Kong industry for over five years. He joined the City University of Hong Kong as a Research Assistant Professor in 1996. He is a member of the IFAC Technical Committee on Optimal Control (Evolutionary Optimisation Algorithms) and a member of the Intelligent Systems Committee of the IEEE Industrial Electronics Society. His research interests include evolutionary algorithms and chaotic theory.

About the Author—K.F. MAN was born in Hong Kong and obtained his Ph.D. in Aerodynamics from Cranfield Institute of Technology, UK, in 1983. After some years working in the UK aerospace industry, he returned to Hong Kong in 1988 to join the City University of Hong Kong, where he is currently an Associate Professor in the Department of Electronic Engineering. He is also a Concurrent Research Professor with the South China University of Technology, Guangzhou, China. Dr. Man is an Associate Editor of the IEEE Transactions on Industrial Electronics and a member of the Administrative Committee of the IEEE Industrial Electronics Society. He serves on the IFAC technical committees on Real-time Software Engineering and on Algorithms and Architectures for Real-time Control. His research interests include active noise control, chaos and nonlinear control systems design, and genetic algorithms.