Frame Discrimination Training of HMMs for Large Vocabulary Speech Recognition

D. Povey & P.C. Woodland
Cambridge University Engineering Department
Trumpington Street, Cambridge CB2 1PZ, UK
email: {dp10006, pcw}@eng.cam.ac.uk

Abstract

This report describes the implementation of a discriminative HMM parameter estimation technique known as Frame Discrimination (FD) for large vocabulary speech recognition, and reports improvements in accuracy over ML-trained and MMI-trained models. Features of the implementation include an algorithm called the Roadmap algorithm, which selects the most important Gaussians for a given input frame without calculating every Gaussian probability in the system; a new distance measure between Gaussians based on overlap, which is used in the Roadmap algorithm; and an investigation of improvements to the Extended Baum-Welch formulae. Frame Discrimination estimation is found to give error rates at least as good as MMI with considerably less computational effort.

1 Introduction

Discriminative HMM parameter re-estimation techniques, for example Maximum Mutual Information (MMI), have been widely reported in the literature to improve recognition results, but there have been relatively few reports of the application of these techniques to large vocabulary speech recognition; see, for example, [5, 8, 9, 10]. A good part of the reason for this is the extra computational effort involved in MMI training. In [8, 5], the use of recognition lattices as an approximation to MMI training was reported, which resulted in a considerable speedup relative to a more naive implementation, but training still took 15 times longer than conventional Maximum Likelihood (ML) training.

A discriminative criterion called Frame Discrimination (FD) was developed in [2]. Its efficient implementation for large vocabulary speech recognition (LVCSR) is reported here. To implement FD efficiently, the Roadmap algorithm was developed, which finds the Gaussians in the HMM set that best match an input vector (i.e., have the highest probability), while only testing a fraction of the Gaussians in the HMM set (in the region of 1-10%). This is done by setting up links, or "roads", between Gaussians and navigating among them using a hill-climbing algorithm (a toy illustration of this idea is given at the end of this introduction). The links are set up using a new distance measure based on the overlap of Gaussians. In re-estimating the HMMs the Extended Baum-Welch (EBW) formulae are used, and improvements to these formulae are proposed and tested here.

The rest of the report is structured as follows: Section 2 introduces the FD objective function; Section 3 details the optimisation approach used; Section 4 describes the Roadmap algorithm; and Section 5 describes an experimental evaluation of FD on speech recognition tests.
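The Roadmap algorithm itself is described in Section 4. As a rough illustration of the hill-climbing idea mentioned above, the following minimal Python sketch walks a precomputed neighbour graph (the "roads") between diagonal-covariance Gaussians, scoring only the Gaussians it actually visits. The graph construction, the overlap-based distance and the choice of starting point are all omitted, and the function names are placeholders, so this is an assumption-laden sketch rather than the algorithm of Section 4.

    import numpy as np

    def log_gauss(x, mean, var):
        # Log-density of a diagonal-covariance Gaussian evaluated at frame x.
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

    def hill_climb(x, means, variances, neighbours, start):
        # Follow links ("roads") uphill in likelihood: only the Gaussians that
        # are actually visited are scored, not the whole HMM set.
        current = start
        best = log_gauss(x, means[start], variances[start])
        improved = True
        while improved:
            improved = False
            for g in neighbours[current]:
                score = log_gauss(x, means[g], variances[g])
                if score > best:
                    current, best, improved = g, score, True
        return current, best

The construction of the links using the overlap-based distance measure is described in Section 4.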

2 The FD objective function

The FD objective function is related to the MMI objective function, which was first proposed in [12]. The MMI objective function is the posterior probability of the speech transcription, given the speech data:

$$\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log \frac{p_\lambda(\mathcal{O}_r \mid \mathcal{M}_{w_r})\, P(w_r)}{\sum_{\hat{w}} p_\lambda(\mathcal{O}_r \mid \mathcal{M}_{\hat{w}})\, P(\hat{w})} \qquad (1)$$

where $w_r$ is the word sequence corresponding to training utterance $r$, and $\mathcal{M}_{w_r}$ is the composite HMM corresponding to the word sequence $w_r$. $P(w)$ is the probability of the word sequence $w$, as given by the language model. The MMI objective function may be rewritten in terms of the transcription model $\mathcal{M}^{\mathrm{num}}_r$ and the general model of speech production $\mathcal{M}^{\mathrm{den}}$ (which may be the same as the model used in the speech recogniser), as follows:

$$\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \left( \log p_\lambda(\mathcal{O}_r \mid \mathcal{M}^{\mathrm{num}}_r) - \log p_\lambda(\mathcal{O}_r \mid \mathcal{M}^{\mathrm{den}}) \right) \qquad (2)$$

The model $\mathcal{M}^{\mathrm{num}}_r$ is known as the numerator model, and $\mathcal{M}^{\mathrm{den}}$ as the denominator model, because the subtraction of logs may be considered a division. The FD objective function is an altered form of Equation 2 where the model $\mathcal{M}^{\mathrm{den}}$ has been replaced by a model $\mathcal{M}^{\mathrm{FD}}$, which allows a superset of the state sequences allowed in $\mathcal{M}^{\mathrm{den}}$. The hope is that, by allowing these extra state sequences, the alignment of a given speech frame to the states of the model $\mathcal{M}^{\mathrm{FD}}$ will be less dependent on the context of the speech frame, and more typical of the assignment of states to that frame in the language at large:

$$\mathcal{F}_{\mathrm{FD}}(\lambda) = \sum_{r=1}^{R} \left( \log p_\lambda(\mathcal{O}_r \mid \mathcal{M}^{\mathrm{num}}_r) - \log p_\lambda(\mathcal{O}_r \mid \mathcal{M}^{\mathrm{FD}}) \right) \qquad (3)$$

In this report, and in [2], the particular form of frame discrimination used is zero memory frame discrimination. $\mathcal{M}^{\mathrm{FD}}$ is a zero memory Markov chain, whose output PDF consists of a weighted sum of all the PDFs in the HMM set, so that

$$p_\lambda(\mathcal{O}_r \mid \mathcal{M}^{\mathrm{FD}}) = \prod_{t=1}^{T_r} \sum_{j \in \mathcal{M}^{\mathrm{FD}}} P(j)\, b_j(\mathbf{o}^r_t),$$

where $\mathbf{o}^r_t$ are the vectors of speech data, $T_r$ is the length of utterance $r$, and $b_j$ is the output PDF of state $j$. The notation $j \in \mathcal{M}^{\mathrm{FD}}$ indicates summation over all the states in $\mathcal{M}^{\mathrm{FD}}$, i.e., all states in the HMM set. $P(j)$ is the prior probability of observing state $j$. The prior probability of each state is set proportionally to its occupation count as calculated by the forward-backward algorithm for a previous iteration of ML training.
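As an illustration, the denominator term of Equation 3 under this zero-memory model can be computed frame by frame as a prior-weighted sum over all state PDFs. The following Python sketch assumes, for brevity, a single diagonal Gaussian per state rather than the mixture output PDFs actually used; the function name is a placeholder.

    import numpy as np

    def fd_denominator_loglik(frames, priors, means, variances):
        # log p(O_r | M_FD) for the zero-memory model described above:
        # each frame contributes log sum_j P(j) b_j(o_t), independently of
        # the other frames.
        #   frames          : (T, d) observation vectors of one utterance
        #   priors          : (J,) state priors P(j)
        #   means, variances: (J, d) parameters of each state's output PDF
        total = 0.0
        for x in frames:
            log_b = -0.5 * np.sum(np.log(2.0 * np.pi * variances)
                                  + (x - means) ** 2 / variances, axis=1)
            a = np.log(priors) + log_b
            total += a.max() + np.log(np.sum(np.exp(a - a.max())))  # stable log-sum-exp
        return total

The inner sum runs over every state in the HMM set, which is exactly the cost that the Roadmap algorithm of Section 4 is designed to reduce.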

3 Extended Baum-Welch (EBW) re-estimation

3.1 The EBW formulae

To optimise the parameters of HMMs when using discriminative criteria such as MMI or FD, the EBW re-estimation formulae can be used. The EBW algorithm for rational objective functions was introduced in [1] and extended in [4] for the continuous density HMMs considered here. The re-estimation formulae presented below have been found to work well in practice, although they can only be proved to converge when a very large value of the constant $D$ is used, which in turn leads to very small changes in the model parameters on each iteration. In the following, counts and other functions of the alignment will be given a superscript num or den, to indicate whether they pertain to the numerator models or the denominator model. The update equations for the mean vector $\mu_{jm}$ of the $m$'th mixture component of state $j$, and the corresponding variance vector $\sigma^2_{jm}$, are as follows:

$$\hat{\mu}_{jm} = \frac{\left\{\theta^{\mathrm{num}}_{jm}(\mathcal{O}) - \theta^{\mathrm{den}}_{jm}(\mathcal{O})\right\} + D\,\mu_{jm}}{\left\{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm}\right\} + D} \qquad (4)$$

$$\hat{\sigma}^2_{jm} = \frac{\left\{\theta^{\mathrm{num}}_{jm}(\mathcal{O}^2) - \theta^{\mathrm{den}}_{jm}(\mathcal{O}^2)\right\} + D\left(\sigma^2_{jm} + \mu^2_{jm}\right)}{\left\{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm}\right\} + D} - \hat{\mu}^2_{jm} \qquad (5)$$

where $\theta_{jm}(\mathcal{O})$ represents the sum of the vectors of training data weighted by the probability of occupying that mixture component, i.e.:

$$\theta_{jm}(\mathcal{O}) = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma^r_{jm}(t)\, \mathbf{o}^r_t,$$

and $\theta_{jm}(\mathcal{O}^2)$ is a similar sum of squared input values. $\gamma^r_{jm}(t)$ is the probability of occupying mixture component $m$ of state $j$ at time $t$, and $\gamma_{jm}$ is the count of the number of times mixture component $m$ of state $j$ is occupied, i.e.:

$$\gamma_{jm} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma^r_{jm}(t).$$
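A direct transcription of Equations 4 and 5 for a single mixture component might look as follows. This is a minimal sketch: the statistics are assumed to have been accumulated already (e.g. as NumPy vectors), and the choice of the constant $D$ is not addressed.

    def ebw_gaussian_update(theta_num, theta_den, theta2_num, theta2_den,
                            gamma_num, gamma_den, mu, var, D):
        # Equations (4) and (5): EBW mean and diagonal-variance updates for
        # one mixture component, from numerator and denominator statistics.
        #   theta*  : occupancy-weighted sums of data and of squared data
        #   gamma*  : scalar occupancy counts
        #   mu, var : current mean and diagonal variance
        #   D       : smoothing constant
        denom = (gamma_num - gamma_den) + D
        mu_new = ((theta_num - theta_den) + D * mu) / denom                     # Eq. (4)
        var_new = ((theta2_num - theta2_den) + D * (var + mu ** 2)) / denom \
                  - mu_new ** 2                                                 # Eq. (5)
        return mu_new, var_new

As noted above, taking $D$ very large gives provable convergence but arbitrarily small parameter changes per iteration.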

3.2 Mixture Weight Updates

The formula used for continuous EBW updates is similar to the update for discrete output probabilities originally put forward in [1]. It is as follows:

$$\hat{c}_{jm} = \frac{c_{jm}\left\{\frac{\partial \mathcal{F}}{\partial c_{jm}} + C\right\}}{\sum_{m'} c_{jm'}\left\{\frac{\partial \mathcal{F}}{\partial c_{jm'}} + C\right\}} \qquad (6)$$

where the derivatives $\frac{\partial \mathcal{F}}{\partial c_{jm}}$ are given by:

$$\frac{\partial \mathcal{F}}{\partial c_{jm}} = \frac{\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm}}{c_{jm}} \qquad (7)$$

However, these are not the values commonly used in Equation 6 when performing EBW re-estimation. Merialdo [13] found, while performing gradient optimisation for discriminative training of discrete HMM systems, that the gradients were excessively dominated by low-valued parameters, due to the division by $c_{jm}$ in Equation 7. He therefore improved convergence by using the alternative formula:

$$\frac{\partial \mathcal{F}}{\partial c_{jm}} \approx \frac{\gamma^{\mathrm{num}}_{jm}}{\sum_{m'} \gamma^{\mathrm{num}}_{jm'}} - \frac{\gamma^{\mathrm{den}}_{jm}}{\sum_{m'} \gamma^{\mathrm{den}}_{jm'}} \qquad (8)$$

This equation differs considerably from Equation 7. The most we can say is that the sign of the derivatives calculated both ways is likely to be the same, assuming the total numerator and denominator occupancies for the state are roughly equal. In experiments reported in [3], this approximation dramatically improved the rate of convergence for discrete HMMs. A look at Equations 6 and 8 shows why they are effective: the value of the approximation to the derivative as calculated in Equation 8 is normalised to lie between -1 and 1, which means that the same value of $C$ will be appropriate for all mixtures. A problem encountered in practice with these altered update equations is that, during training, the objective function starts to decrease again after increasing to near its maximum [3]. This is not surprising, since even the sign of the derivative in the approximation of Equation 8 may differ from the actual value: i.e., although Equation 8 may give good results, it is not a good approximation to the derivative.
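For concreteness, the following minimal Python sketch applies Equation 6 with the approximate derivative of Equation 8 to the mixture weights of a single state. The accumulation of the occupancies and any guard against zero total occupancy are omitted, and $C$ is assumed large enough to keep all the numerator terms positive.

    import numpy as np

    def ebw_weight_update(gamma_num, gamma_den, c, C):
        # Equation (6) with Merialdo's approximate derivative, Equation (8),
        # for the M mixture weights of one state.
        #   gamma_num, gamma_den : per-component occupancies, shape (M,)
        #   c                    : current weights, shape (M,), summing to one
        #   C                    : smoothing constant
        dF = gamma_num / gamma_num.sum() - gamma_den / gamma_den.sum()   # Eq. (8), in [-1, 1]
        unnorm = c * (dF + C)            # C must keep these terms positive
        return unnorm / unnorm.sum()     # Eq. (6): renormalise to sum to one

Because the approximate derivative lies between -1 and 1, a single value of $C$ (for example $C > 1$) serves all mixtures, which is the practical advantage noted above.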

3.3 New Mixture Weight Updates

A new set of mixture weight update equations was developed. These equations are based on heuristics about how the occupancies are expected to change as the mixture weights are changed. In the following explanation, the mixture weights will be denoted $c_m$, the state $j$ being assumed constant. It is clear from the way HMMs work that increasing the value of a mixture weight will tend to increase its occupancy, and decreasing it will tend to decrease the occupancy. If it was known in advance what the effect of changing the mixture weights $c_m$ would be on the occupancies $\gamma^{\mathrm{num}}_m$ and $\gamma^{\mathrm{den}}_m$, then gradient descent could be performed efficiently based on this knowledge, without actually calculating the new occupancies in the normal way, e.g., by the Forward-Backward algorithm. Of course, this is not possible, because it cannot be known exactly what the new occupancies will be. But it is possible to make non-infinitesimal updates by estimating limits on the change of the occupancies as the mixture weights change. The limits that were estimated were:

The occupancy of a mixture with initial weight $c_m$, initial occupancy $\gamma_m$ and final weight $\hat{c}_m$ is bounded by $\gamma_m$ and $\gamma_m \frac{\hat{c}_m}{c_m}$. This is true for both numerator and denominator occupancies ($\gamma^{\mathrm{num}}_m$ and $\gamma^{\mathrm{den}}_m$).

From these limits on the occupancies, an update rule is derived as follows. We will consider the variation in the mixture weights of one state only, so that the parameter set $\lambda$ becomes a vector of mixture weights $\mathbf{c}$. Consider the function $G(\hat{\mathbf{c}}) = \mathcal{F}(\hat{\mathbf{c}}) - \mathcal{F}(\mathbf{c})$, where $\mathbf{c}$ is the initial set of parameters and $\hat{\mathbf{c}}$ is the updated set. It is clear that if we ensure $G(\hat{\mathbf{c}}) > 0$, we guarantee an increase in the objective function: $\mathcal{F}(\hat{\mathbf{c}}) > \mathcal{F}(\mathbf{c})$. The value of $G(\hat{\mathbf{c}})$ may be expressed as the line integral:

$$G(\hat{\mathbf{c}}) = \int_{\mathbf{c}}^{\hat{\mathbf{c}}} \sum_m \frac{\partial \mathcal{F}}{\partial c'_m}\, dc'_m$$

We can choose to integrate along any path between $\mathbf{c}$ and $\hat{\mathbf{c}}$ along which $\mathcal{F}$ is defined: i.e., any path that preserves the sum-to-one constraint on the mixture weights. For convenience, we will choose the path corresponding to the straight line between $\log \mathbf{c}$ and $\log \hat{\mathbf{c}}$, which can be mapped on to the space in which $\mathcal{F}$ is defined by taking logs. The values $\frac{\partial \mathcal{F}}{\partial c'_m}$ are given by $\frac{\gamma^{\mathrm{num}}_m(\mathbf{c}') - \gamma^{\mathrm{den}}_m(\mathbf{c}')}{c'_m}$, so:

$$G(\hat{\mathbf{c}}) = \int_{\mathbf{c}}^{\hat{\mathbf{c}}} \sum_m \frac{\gamma^{\mathrm{num}}_m(\mathbf{c}') - \gamma^{\mathrm{den}}_m(\mathbf{c}')}{c'_m}\, dc'_m$$

We only have bounds on these values as the weights $c'_m$ change; however, if we set the numerator occupancies at the bound $\min\!\left(\gamma^{\mathrm{num}}_m, \gamma^{\mathrm{num}}_m \frac{c'_m}{c_m}\right)$ and the denominator occupancies at the bound $\max\!\left(\gamma^{\mathrm{den}}_m, \gamma^{\mathrm{den}}_m \frac{c'_m}{c_m}\right)$, giving the function

$$H(\hat{\mathbf{c}}) = \int_{\mathbf{c}}^{\hat{\mathbf{c}}} \sum_m \frac{\min\!\left(\gamma^{\mathrm{num}}_m, \gamma^{\mathrm{num}}_m \frac{c'_m}{c_m}\right) - \max\!\left(\gamma^{\mathrm{den}}_m, \gamma^{\mathrm{den}}_m \frac{c'_m}{c_m}\right)}{c'_m}\, dc'_m \qquad (9)$$

then $H(\hat{\mathbf{c}}) \le G(\hat{\mathbf{c}})$. This is proved by a case split between the $c_m$ that are increasing and those that are decreasing. The path along which we are integrating corresponds to a straight line between $\log \mathbf{c}$ and $\log \hat{\mathbf{c}}$, so each value $c'_m$ is either increasing or decreasing, and does not alternate between the two. In order to prove that $H(\hat{\mathbf{c}}) \le G(\hat{\mathbf{c}})$, it is sufficient to prove that, for all $m$, for all valid sets of mixture weights $\mathbf{c}'$ on the path, and as the step $\delta c'_m$ approaches zero,

$$\left\{\min\!\left(\gamma^{\mathrm{num}}_m, \gamma^{\mathrm{num}}_m \tfrac{c'_m}{c_m}\right) - \max\!\left(\gamma^{\mathrm{den}}_m, \gamma^{\mathrm{den}}_m \tfrac{c'_m}{c_m}\right)\right\} \delta c'_m \;\le\; \left\{\gamma^{\mathrm{num}}_m(\mathbf{c}') - \gamma^{\mathrm{den}}_m(\mathbf{c}')\right\} \delta c'_m \qquad (10)$$

For those $c_m$ that are increasing, the inequality of Equation 10 can be proved from the facts that $\delta c'_m > 0$ and $c'_m \ge c_m$, together with our estimated bounds on the occupancies. Similar reasoning holds for those $c_m$ that are decreasing.

We can obtain a closed form for $H(\hat{\mathbf{c}})$ by integrating. Since each element of the summation in Equation 9 only depends on one mixture weight $c'_m$, $H(\hat{\mathbf{c}})$ can be written as:

$$H(\hat{\mathbf{c}}) = \sum_m \int_{c_m}^{\hat{c}_m} \frac{\min\!\left(\gamma^{\mathrm{num}}_m, \gamma^{\mathrm{num}}_m \tfrac{c'_m}{c_m}\right) - \max\!\left(\gamma^{\mathrm{den}}_m, \gamma^{\mathrm{den}}_m \tfrac{c'_m}{c_m}\right)}{c'_m}\, dc'_m$$

Integrating over each of the $c'_m$, and noting that …, which is unchanged by integration w.r.t. …, we …
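The closed form is not reproduced here, but the lower bound of Equation 9 can also be evaluated numerically along the chosen path. The Python sketch below does this under the bounds reconstructed above; it is an illustration of the idea (a non-negative value would guarantee, under those bounds, that a proposed weight change does not decrease the objective function), not the closed-form update rule derived in this section.

    import numpy as np

    def lower_bound_gain(c, c_hat, gamma_num, gamma_den, steps=1000):
        # Numerical evaluation of Equation (9): a pessimistic bound H(c_hat)
        # on the change in the objective function when the mixture weights of
        # one state move from c to c_hat.  The path is the straight line
        # between log(c) and log(c_hat); the unknown occupancies are replaced
        # by the bounds min(g_num, g_num*c'/c) and max(g_den, g_den*c'/c).
        log_c, log_c_hat = np.log(c), np.log(c_hat)
        H, prev = 0.0, c
        for s in range(1, steps + 1):
            cur = np.exp(log_c + (log_c_hat - log_c) * s / steps)
            ratio = cur / c
            integrand = (np.minimum(gamma_num, gamma_num * ratio)
                         - np.maximum(gamma_den, gamma_den * ratio)) / cur
            H += float(np.sum(integrand * (cur - prev)))  # rectangle rule, per component
            prev = cur
        return H

Carrying out the integration analytically, as the derivation above begins to do, removes the need for the numerical loop.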