ACCURACY VERSUS COMPLEXITY IN CONTEXT DEPENDENT PHONE MODELING *

Wei Xu, Jacques Duchateau, Kris Demuynck, Ioannis Dologlou, Patrick Wambacq, Dirk Van Compernolle (1), Hugo Van hamme (1)

Katholieke Universiteit Leuven - ESAT - PSI
Tel: +32 16 321861; fax: +32 16 321723
e-mail: [email protected]

(1) Lernout & Hauspie Speech Products, Belgium
* This research work was supported by Lernout & Hauspie Speech Products, Belgium.
ABSTRACT

This paper presents two different directions for building HMM models that give sufficient acoustic resolution while fitting in limited user resources. Both amount to scaling down acoustic models built with tied-gaussian HMMs: the total number of gaussians is reduced by pairwise merging, and the number of gaussians per state is reduced by selecting gaussians with the so-called occupancy criterion. Experiments carried out on the WSJ recognition task show that after scaling down, no further training is needed as long as the number of gaussians or the number of gaussians per state is reduced by no more than a factor of three. This is an advantage, as retraining cannot be executed by the final system user.
1. INTRODUCTION
In acoustic modeling with continuous HMMs, the output probability density functions are modeled as weighted sums of gaussian density functions: the output probability of state j for frame X is given by

    P_j(X) = \sum_{i=1}^{N} \lambda_{ij} N_i(X)

where λij is the weight for gaussian density i in state j, Ni(X) is the probability of gaussian density i, and N is the number of gaussian densities. In semi-continuous HMMs, there is a single set of gaussian densities that is shared by all probability density functions. For each state j in our system, only the most important gaussians are selected and most of the λij are assumed to be zero.
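As an illustration of this sparse evaluation, the following minimal Python sketch computes Pj(X) over a shared gaussian pool with a sparse per-state weight set; the function names and data layout are our own, not those of the system described here.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance gaussian at frame x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def state_output_prob(x, pool_means, pool_vars, weights):
    """P_j(x) = sum_i lambda_ij * N_i(x). `weights` is a sparse dict
    {gaussian index: lambda_ij}: most lambda_ij are zero, so the shared
    pool is evaluated only at the selected indices."""
    return sum(w * np.exp(gaussian_logpdf(x, pool_means[i], pool_vars[i]))
               for i, w in weights.items())
```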
Merging [1][2] and splitting [3] of gaussians has attracted the interest of several research groups in the past years. The reason is that, due to the limited resources of user platforms, market requirements are usually quite severe in terms of computational complexity and memory space. It is therefore desirable to give users the possibility to adjust the complexity and the size of the recognizer to their hardware capacity. In this paper, we investigate a general way to considerably reduce the size of SC-HMM based acoustic models. The total number of gaussians is lowered through merging based on a likelihood distance. The number of gaussians per state is reduced via the "occupancy criterion": mixture weights λij that are based on too few observations are removed. No retraining is needed for model size reductions of up to a factor of three. The algorithm is not only simple but also fast and accurate. Thus the user is given the possibility to shape the recognizer according to predetermined specifications.

The remainder of this paper is organized as follows: in section 2, the reduction algorithms are described; section 3 presents the speech recognition task and the experimental results; and section 4 gives the conclusion.
2. REDUCING THE SC-HMM

2.1. Reducing the Gaussian Pool
In semi-continuous density HMMs, all states are modeled as mixtures over a single large set of pdfs. Merging the gaussians in this big pool is a straightforward way to reduce the size of the acoustic model. The choice of distance criterion is important: the Euclidean distance, which is often used in speech processing, is not appropriate for evaluating similarities between gaussians. Therefore a distance is introduced that expresses the decrease in likelihood of the resulting gaussian set when two gaussians are merged into a single density. With d denoting the dimension of the acoustic space, this distance can be expressed as follows:
    \mathrm{Dis} = (n_1 + n_2) \sum_{i=1}^{d} \log(\sigma_i) - n_2 \sum_{i=1}^{d} \log(\sigma_{2i}) - n_1 \sum_{i=1}^{d} \log(\sigma_{1i})
where n1 and n2 represent the number of data points allocated to N1 and N2 respectively, σ1i and σ2i denote the i-th elements of their diagonal covariances, and σi denotes the i-th element of the diagonal covariance of the merged density. Note that this formula only applies to gaussians with diagonal covariance.
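A minimal Python sketch of this distance (our own illustration, not the paper's implementation; it uses the covariance of the merged density, which is given by the moment-matching formulas below):

```python
import numpy as np

def merge_loss(n1, mu1, var1, n2, mu2, var2):
    """Likelihood decrease when gaussians N1 and N2 (means and diagonal
    covariances as vectors, frame counts n1 and n2) are merged into one:
    Dis = (n1+n2)*sum(log s_i) - n2*sum(log s_2i) - n1*sum(log s_1i)."""
    f1, f2 = n1 / (n1 + n2), n2 / (n1 + n2)
    # diagonal covariance of the merged density (moment matching)
    var = f1 * var1 + f2 * var2 + f1 * f2 * (mu1 - mu2) ** 2
    return ((n1 + n2) * np.sum(np.log(var))
            - n2 * np.sum(np.log(var2))
            - n1 * np.sum(np.log(var1)))
```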
After merging, the sum of the frame counts associated with each gaussian is assigned to the resulting density. The i-th elements of the mean µ and the covariance σ of the new gaussian are calculated according to the following formulas:

    \mu_i = f_1 \mu_{1i} + f_2 \mu_{2i}
    \sigma_i = f_1 \sigma_{1i} + f_2 \sigma_{2i} + f_1 f_2 (\mu_{1i} - \mu_{2i})^2
    f_1 = \frac{n_1}{n_1 + n_2}, \qquad f_2 = \frac{n_2}{n_1 + n_2}
Merging not only decreases the total number of gaussians but also changes the tying of the model’s parameters.
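The merge step and a greedy pairwise reduction of the pool could then look as follows (a sketch reusing merge_loss from above; the brute-force O(N^2) pair search is for illustration only and would be far too slow for a 20k-gaussian pool):

```python
def merge_pair(n1, mu1, var1, n2, mu2, var2):
    """Moment-matched merge of two diagonal-covariance gaussians,
    following the formulas above; the merged density keeps the
    summed frame count n1 + n2."""
    f1, f2 = n1 / (n1 + n2), n2 / (n1 + n2)
    mu = f1 * mu1 + f2 * mu2
    var = f1 * var1 + f2 * var2 + f1 * f2 * (mu1 - mu2) ** 2
    return n1 + n2, mu, var

def reduce_pool(gaussians, target_size):
    """Greedily merge the pair with the smallest likelihood loss until
    the pool reaches target_size. `gaussians` is a list of (n, mu, var)
    tuples; merge_loss is defined in the previous sketch."""
    gaussians = list(gaussians)
    while len(gaussians) > target_size:
        i, j = min(((a, b) for a in range(len(gaussians))
                    for b in range(a + 1, len(gaussians))),
                   key=lambda p: merge_loss(*gaussians[p[0]], *gaussians[p[1]]))
        gaussians[j] = merge_pair(*gaussians[i], *gaussians[j])
        del gaussians[i]
    return gaussians
```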
2.2. Reducing the Mixture Size per State
The reduction of the number of gaussians per state makes the observation pdf calculations more efficient. There are different methods for tying the parameters of the model; here the selection criterion from previously published work [4] is implemented. It can be summarized as follows: let the occupancy of a state be defined as the number of observations assigned to that state. The occupancy of a gaussian in a state is then the number of observations with which the weight for that gaussian is estimated, and equals the product of the weight of the gaussian and the state occupancy. Gaussians whose occupancy in a state is smaller than a predefined value, the Occupancy Threshold (OT), are removed from that state.
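In sketch form (our own illustration; renormalizing the surviving weights, and keeping the heaviest gaussian when all fall below the threshold, are our assumptions rather than details given in the paper):

```python
def prune_state_mixture(weights, state_occupancy, occupancy_threshold):
    """Remove gaussians whose occupancy in this state, i.e.
    lambda_ij * state occupancy, falls below the Occupancy Threshold.
    `weights` maps gaussian index -> lambda_ij for one state."""
    kept = {i: w for i, w in weights.items()
            if w * state_occupancy >= occupancy_threshold}
    if not kept:  # our safeguard: keep the single heaviest gaussian
        i = max(weights, key=weights.get)
        kept = {i: weights[i]}
    total = sum(kept.values())
    # renormalize so the state pdf stays a proper mixture (our assumption)
    return {i: w / total for i, w in kept.items()}
```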
3. EXPERIMENTS
3.1. Design of the Experiments

The proposed method for acoustic model reduction is evaluated on the speaker independent Wall Street Journal (WSJ) recognition task. The standard bigram language model provided by Lincoln Laboratory for the 20k word open vocabulary is used. The results, word error rates (WER), are given on the November 92 evaluation test set with non-verbalized punctuation, which contains 333 sentences for the 20k word task.

The signal processing yields mean-normalized Mel scale cepstra (12 parameters) and log energy, each with first and second time derivatives, resulting in 39 parameters in total. Our acoustic modeling is gender independent and based on a phone set with 45 phones, without specific function word modeling. The context dependent models are designed using decision trees, with one tree per context independent state [5]. Initially there are 20,567 gaussians in the model. There are 1991 tied states in total, each modeled as a mixture of 460 gaussians on average. The acoustic models are estimated on the SI-84 WSJ0 database, which contains approximately 14 hours of speech. In our SC-HMM system, the evaluation of the pdfs is sped up with FRoG [5, 6]. In the experiments below, no cross-word assimilation rules are used to adapt the phonetic descriptions to the neighboring words. A time-synchronous beam search algorithm is used.

Since the optimal language model parameters depend on the reduction rate, they must be readjusted. For that purpose, five experiments are first performed using a small search beam with a limited number of hypotheses. Thereafter, the parameters corresponding to the optimum over these experiments are used in experiments with a considerably larger search beam.

The original model with 20,567 gaussians is trained with two passes of Viterbi training. The experimental results are shown in table 1. All reduction algorithms are applied to the models resulting from this training. The following notation is used in all tables: TNG stands for the total number of gaussians in the model, GpS for the average number of gaussians per state, and OT for the occupancy threshold.

  TNG     GpS   Beam width   WER
  20567   460   small beam   14.04%
  20567   460   large beam   13.54%

Table 1: Results of the original model
The experiments of model reduction are carried out as follows: starting from the original model, first the total number of gaussians is reduced, and then the number of gaussians per state. The total number of gaussians was set to 20k (no reduction), 15k, 10k, 8k and 5k respectively, and the number of gaussians per state was reduced from 460 down to 39.

It has been established experimentally that the results are rather insensitive to the ordering of the reduction processes, i.e. whether the global number of gaussians is reduced first and then the number of gaussians per state, or vice versa. Table 4 gives an overview: it shows the results as a function of the average number of gaussians per state and of the global number of gaussians, both for the small and the large search beam. From table 4, we can see that the performance drops fairly slowly as the global number of gaussians is reduced from 20k to 5k, or as the number of gaussians per state is reduced from 460 to 60.

[Figure 1 (not reproduced): word error rate versus the average number of gaussians per state (about 40 to 200), with one curve per global pool size (5k, 8k, 10k, 15k and 20k gaussians).]

Figure 1: Experiments with equal computational complexity (dashed line)
We also attempt to establish the relationship between the complexity of the acoustic model evaluation and the recognition accuracy. The model evaluation consists of two parts. The first part is the calculation of all gaussians (reduced with the FRoG system), which is assumed to be approximately linear (with an offset) in the number of gaussians. The second part is the calculation of the weighted sum for the mixture of gaussians that models a state, which is assumed to be approximately linear (again with an offset) in the number of gaussians per state. Note that the smaller the models are, the more time is needed for the search; this search complexity has not been taken into account, because it largely depends on the search algorithm, which varies a lot between systems.
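This two-part cost model can be written down directly; in the sketch below all four coefficients are placeholders to be fitted to profiling measurements, not values from the paper:

```python
def eval_cost(total_gaussians, gaussians_per_state,
              gauss_offset, gauss_slope, mix_offset, mix_slope):
    """Approximate per-frame cost of acoustic model evaluation:
    a linear term (with offset) in the number of pooled gaussians,
    plus a linear term (with offset) in the gaussians per state.
    All four coefficients are hypothetical and must be fitted."""
    return (gauss_offset + gauss_slope * total_gaussians +
            mix_offset + mix_slope * gaussians_per_state)
```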
In order to establish the relation between the complexity and the accuracy of the models, the recognition process is carried out (using a small search beam) for the different situations in table 4, while a profiling system keeps track of the accumulated time spent in the routines that evaluate the acoustic models. The results are shown in figure 1; some interpolated values have been added between the discrete set of explicitly calculated points. Four variables describe each experiment: the total number of gaussians, the average number of gaussians per state, the error rate, and the time needed for the evaluation of the model. In figure 1, experiments with the same number of gaussians are connected with a thick line, and experiments that need the same time for model evaluation are connected with dashed lines. The model specifications for optimal recognition accuracy versus evaluation speed can easily be read off.

3.2. Further Training of Models

To investigate the importance of further training on the reduced models, an additional Viterbi training step is applied; the results obtained with the large beam search are shown in the column labeled "training" of table 2.

  No.   TNG   GpS   no train.   training   improv.
  1     20k   100   13.56%      13.11%     0.45%
  2     20k   60    15.28%      14.18%     1.10%
  3     15k   43    16.84%      15.15%     1.69%
  4     8k    291   15.40%      14.76%     0.64%

Table 2: Results with further training
From these experiments, we see that small relative reductions (a factor of 3 or 4), whether in the number of gaussians (row No. 4) or in the number of gaussians per state (row No. 1), only lead to small improvements (about 0.5% absolute) after further training. On the contrary, large relative reductions in the number of gaussians per state (a factor of 8 to 10, starting from 460 gaussians per state; rows No. 2 and No. 3) make further training more essential, yielding error rate improvements of about 1.5% absolute. It is shown experimentally that the efficiency of the reduction algorithm depends on the initial choice of the parameters of the system. To investigate this, trained models with 20k gaussians and 100 gaussians per state (reduced with OT=5) were chosen as the reference, and the number of gaussians per state was reduced from 100 to 73, 67 and 53. The results obtained using the large beam search, both with and without further training, are shown in table 3. Note that in this case no considerable improvement can be achieved by further training.
  OT     TNG: 20k          15k               10k               8k                5k
  none   460 14.04/13.54   400 14.44/13.63   327 15.65/15.08   292 15.67/15.40   229 16.00/15.61
  2      157 13.98/13.41   143 14.02/13.49   125 15.24/14.58   115 15.45/15.08    97 15.77/15.88
  5      100 14.64/13.56    93 14.39/13.93    84 15.22/14.83    79 15.24/14.73    69 15.59/15.45
  7       80 15.45/14.39    75 15.08/14.24    70 16.02/15.12    66 15.61/14.89    58 15.84/15.59
  8       72 16.18/15.03    69 15.56/14.76    64 16.50/15.40    61 15.84/15.20    54 15.88/15.40
  9       66 16.41/15.15    63 16.07/15.08    59 16.27/15.31    57 16.00/15.28    51 15.81/15.33
  10      55 17.46/15.88    54 17.03/15.47    51 16.69/15.73    50 16.69/15.73    45 16.60/15.90
  12      51 18.61/16.43    50 17.44/15.84    48 17.07/16.25    47 17.01/16.02    43 16.82/15.97
  14      44 20.56/18.23    43 18.54/16.84    42 18.06/16.78    41 17.35/16.43    39 17.10/16.09

Table 4: Results of the reduction of the global number of gaussians (TNG, columns) and of the number of gaussians per state via the occupancy threshold (OT, rows). Each cell gives the resulting GpS followed by the WER (%) for the small and the large search beam (small/large).
  No.   TNG   GpS   no train.   training   improv.
  1     20k   73    13.59%      13.72%     -0.13%
  2     20k   67    13.98%      13.86%     0.12%
  3     20k   53    14.85%      13.98%     0.87%

Table 3: Further training with an appropriate original model
4. CONCLUSION
The size reduction of acoustic models along two different directions has been investigated extensively in this paper. It was found that the number of gaussians and the number of gaussians per state may each be reduced by up to a factor of three without retraining: after such reductions, additional model training brings no further improvement. Moreover, the relation between the reduction of the models and the computational complexity was studied in depth and illustrated in figure 1.
REFERENCES

1. W. Xu, J. Duchateau, K. Demuynck, and I. Dologlou. A new approach to merging gaussian densities in large vocabulary continuous speech recognition. In Proc. IEEE Benelux Signal Processing Symposium, pages 231-234, Leuven, Belgium, March 1998.
2. J. Simonin, S. Bodin, D. Jouvet, and K. Barkova. Parameter tying for flexible speech recognition. In Proc. International Conference on Spoken Language Processing, volume II, pages 1089-1092, Philadelphia, U.S.A., October 1996.
3. D. Willet and G. Rigoll. A new approach to generalized mixture tying for continuous HMM-based speech recognition. In Proc. EUROSPEECH, volume III, pages 1175-1178, Rhodes, Greece, September 1997.
4. J. Duchateau, K. Demuynck, D. Van Compernolle, and P. Wambacq. Improved parameter tying for efficient acoustic model evaluation in large vocabulary continuous speech recognition. In Proc. International Conference on Spoken Language Processing, volume V, pages 2215-2218, Sydney, Australia, December 1998.
5. J. Duchateau, K. Demuynck, and D. Van Compernolle. Fast and accurate acoustic modelling with semi-continuous HMMs. Speech Communication, 24(1):5-17, April 1998.
6. K. Demuynck, J. Duchateau, and D. Van Compernolle. Reduced semi-continuous models for large vocabulary continuous speech recognition in Dutch. In Proc. International Conference on Spoken Language Processing, volume IV, pages 2289-2292, Philadelphia, U.S.A., October 1996.