Sausage-net-based Minimum Phone Error Training for Continuous Phone Recognition

Jiang-Chun Chen1, Chun-Jen Lee2, Shuo-Pin Hsu1, J.-S. Roger Jang1

1 Department of Computer Science, National Tsing Hua University
2 Telecommunication Labs., Chunghwa Telecom Co., Ltd.

{jtchen, eddy, jang}@cs.nthu.edu.tw, [email protected]
Abstract. This paper describes a discriminative training approach to continuous phone recognition. Minimum phone error (MPE) training has recently been widely used to enhance the performance of large vocabulary continuous speech recognition, but few studies have applied it to continuous phone recognition. In this paper, we explore a flexible combination of the sausage net with MPE training. Furthermore, a more effective method of MPE weight update is introduced. The best experimental result in this study indicates that our approach achieves a 7% error rate reduction compared to the baseline system, demonstrating the advantages of the proposed approach for MPE training.

Keywords: minimum phone error, discriminative training, continuous phone recognition, sausage, speech recognition.
1. Introduction

Discriminative training approaches such as maximum mutual information (MMI) [12] have demonstrated improvements over conventional maximum likelihood estimation (MLE) on several speech recognition tasks, since MLE only aims at maximizing the likelihood of the training data and disregards the discrimination among confusing classes during model training. Recently, minimum classification error (MCE) [2] and minimum phone error (MPE) [3][10][11][13] have become successful criteria for achieving minimum error rate on the training data. In contrast to the MCE criterion, which aims at minimizing sentence classification error, MPE tries to maximize phone-level accuracy. MPE has been shown to be effective for large vocabulary continuous speech recognition (LVCSR). However, few studies have applied it to phone recognizers, even though phone recognition is a basic step in speech recognition. Studies on attribute detection [14] demonstrate the importance of the phone unit for speech recognition, which motivates us to investigate the use of MPE for improving phone recognition.

For LVCSR, a two-pass recognizer based on a word lattice is a commonly used technique for improving speech recognition. The word lattice embeds detailed information about the search space, especially when n-best recognition is adopted [5][6][1]. For MPE training, the lattice also contains essential information for better
performance. However, the larger the lattice is, the more likely it is that unrelated miscellaneous information is introduced, which may mislead the optimization process of MPE training and cause deterioration in performance. To eliminate this miscellaneous information, we propose a novel approach that combines a composite phone lattice with a sausage net for MPE training. The proposed approach exhibits more flexibility and requires less time for model refinement. Moreover, the common MPE weight calculation is modified so that more effective training is achieved. The flowchart of the proposed approach is shown in Fig. 1: starting from MLE-trained HMMs, sausage-based lattice generation produces an n-best composite phone lattice, which is then used for MPE training with the extended MPE weight.

Fig. 1. The flowchart of the proposed approach in the MPE training procedure.

The rest of this paper is organized as follows. Section 2 introduces the baseline system of continuous phone recognition. Section 3 explains the various techniques used in our approach. Section 4 presents the experimental results and Section 5 gives concluding remarks.
2. Baseline System
2.1. Continuous Phone Recognizer

Previous work on continuous phone recognition for TIMIT has been discussed in the literature [8][9]. For comparison on the same basis, configuration settings similar to [9] are used in this study. The MLE acoustic models used here include 1364 right-context-dependent bi-phone HMMs (CDHMM) and 48 context-independent HMMs (CIHMM). To maintain the balance between model complexity and available training data, decision-tree-based state tying is adopted for the CDHMMs. The ratio of the number of rich models to poor models is about 1:4. Each phone model has three states, and each state has 16 Gaussian mixtures for CIHMM and 8 mixtures for CDHMM. Instead of the general binary mixture-splitting approach, we increase the mixtures one by one and delete mixtures with small variances during the process, as sketched below. Our experiments demonstrated that this "one by one" approach achieves a recognition rate of 71.69%, which is 2.89% better than that of binary splitting (68.8%). MFCCs of 39 dimensions are used, with cepstral mean subtraction.
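As an illustration only (not the authors' implementation), the following sketch grows a one-dimensional Gaussian mixture one component at a time: after each split of the heaviest component, a few EM passes are run, and components whose variance falls below an assumed floor are deleted. VAR_FLOOR, N_EM_PASSES, and the splitting perturbation are assumed values.

```python
import numpy as np

VAR_FLOOR = 1e-3     # assumed variance floor for pruning a mixture
N_EM_PASSES = 5      # assumed number of EM passes after each split

def em_pass(x, w, mu, var):
    """One EM pass for a 1-D Gaussian mixture (weights w, means mu, variances var)."""
    p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    r = p / p.sum(axis=1, keepdims=True)          # responsibilities, shape (N, K)
    nk = r.sum(axis=0)
    w = nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / nk
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

def grow_mixtures(x, target_num_mix):
    """Increase mixtures one by one instead of binary splitting."""
    w, mu, var = np.ones(1), np.array([x.mean()]), np.array([x.var()])
    while len(w) < target_num_mix:
        k = int(np.argmax(w))                     # split only the heaviest component
        eps = 0.2 * np.sqrt(var[k])
        mu = np.append(mu, mu[k] + eps); mu[k] -= eps
        var = np.append(var, var[k])
        w[k] /= 2.0; w = np.append(w, w[k])
        for _ in range(N_EM_PASSES):
            w, mu, var = em_pass(x, w, mu, var)
        keep = var > VAR_FLOOR                    # delete mixtures with small variances
        w, mu, var = w[keep] / w[keep].sum(), mu[keep], var[keep]
    return w, mu, var
```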
As mentioned in [9], the same 160 test sentences are used for the outside test. The recognition net is also organized context-dependently for the CDHMM, and the recognition output is mapped onto the 39 commonly used phones of the CMU phone set. The achieved accuracy is 71.69% for CDHMM and 64.53% for CIHMM, which is comparable to previous work with similar settings [8]. The experimental results will be discussed in Section 4.

2.2. MPE Training for Phone Recognition

The objective function of MPE can be expressed as follows:

$$F_{\mathrm{MPE}}(\lambda) = \sum_{r=1}^{R} \sum_{s} P_\lambda^\kappa(s \mid O_r)\, A(s, s_r), \qquad (1)$$
where $\kappa$, $O_r$, and $s_r$ represent a scaling factor, the $r$-th speech training sentence, and the correct transcription of $O_r$, respectively; $A(s, s_r)$ represents the raw phone transcription accuracy of the sentence $s$ given $s_r$; and $P_\lambda^\kappa(s \mid O_r)$ is the scaled posterior probability of the sentence $s$ given the HMM model $\lambda$:

$$P_\lambda^\kappa(s \mid O_r) = \frac{P_\lambda^\kappa(O_r \mid s)\, P(s)^\kappa}{\sum_u P_\lambda^\kappa(O_r \mid u)\, P(u)^\kappa}, \qquad (2)$$
where $P_\lambda^\kappa(O_r \mid s)$ represents the scaled probability of the speech $O_r$ given the sentence $s$, determined by the acoustic model, and $P(s)^\kappa$ represents the scaled probability of the sentence $s$, determined by the language model. MPE tries to estimate a new parameter set $\lambda$ by maximizing the objective function $F_{\mathrm{MPE}}(\lambda)$ in Eq. (1). By using a weak-sense auxiliary function, the mean $\mu_{jm}$ and variance $\sigma_{jm}^2$ of mixture $m$ of state $j$ of $\lambda$ can be re-estimated as follows:

$$\hat{\mu}_{jm} = \frac{\{\theta_{jm}^{\mathrm{num}}(O) - \theta_{jm}^{\mathrm{den}}(O)\} + D_{jm}\,\mu_{jm}}{\{\gamma_{jm}^{\mathrm{num}} - \gamma_{jm}^{\mathrm{den}}\} + D_{jm}}, \qquad (3)$$

$$\hat{\sigma}_{jm}^2 = \frac{\{\theta_{jm}^{\mathrm{num}}(O^2) - \theta_{jm}^{\mathrm{den}}(O^2)\} + D_{jm}(\sigma_{jm}^2 + \mu_{jm}^2)}{\{\gamma_{jm}^{\mathrm{num}} - \gamma_{jm}^{\mathrm{den}}\} + D_{jm}} - \hat{\mu}_{jm}^2. \qquad (4)$$
In Eqs. (3) and (4), $D_{jm}$ is a Gaussian-specific smoothing constant, whereas $\theta_{jm}(O)$ and $\theta_{jm}(O^2)$ are the sums of $O$ and $O^2$, respectively, weighted by the probability of being in mixture $m$ of state $j$; the superscripts num and den denote statistics accumulated from the numerator and denominator lattices. More details of the parameter-update formulas can be found in [3].
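For concreteness, the following is a minimal numpy transcription of Eqs. (3) and (4) for a single Gaussian. The γ (occupancy) and θ (weighted first- and second-order) statistics are assumed to have been accumulated from the numerator and denominator lattices beforehand, and $D_{jm} = E \cdot \gamma^{\mathrm{den}}_{jm}$ with $E \approx 2$ is one common heuristic from [3].

```python
import numpy as np

def mpe_update(mu, var, gamma_num, gamma_den,
               theta_o_num, theta_o_den, theta_o2_num, theta_o2_den, E=2.0):
    """Extended Baum-Welch mean/variance update of Eqs. (3) and (4)."""
    D = E * gamma_den                              # Gaussian-specific smoothing constant
    denom = (gamma_num - gamma_den) + D
    mu_new = ((theta_o_num - theta_o_den) + D * mu) / denom              # Eq. (3)
    var_new = ((theta_o2_num - theta_o2_den)
               + D * (var + mu ** 2)) / denom - mu_new ** 2              # Eq. (4)
    return mu_new, np.maximum(var_new, 1e-6)       # floor keeps the variance positive
```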
3. The Proposed Approach
3.1. N-Best Composite Lattice

For the task of LVCSR, MPE training utilizes the complete word lattice to gather enough information. However, for the task of continuous phone recognition, the phone lattice contains much more phone-level information, even including some unrelated miscellaneous information. To prune this extra information, a composite phone lattice of n-best hypotheses is adopted. For n decoded hypotheses, the duplicated phones with the same time alignment among the n hypotheses are removed, and the remaining phones are combined into a smaller lattice, as shown in Fig. 2. The dotted path indicates the correct phone sequence and the solid paths indicate the other n−1 hypotheses.
Fig. 2. A typical example of a composite phone lattice.
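As a sketch of this construction (the (phone, start, end) tuple representation is an assumption for illustration, not the authors' data structure), arcs carrying the same phone label with the same time alignment are merged by using them as dictionary keys, so the n hypotheses share their common segments:

```python
def build_composite_lattice(nbest):
    """nbest: list of hypotheses, each a list of (phone, start, end) tuples.
    Returns an adjacency dict mapping each unique arc to its successor arcs."""
    succ = {}
    for hyp in nbest:
        for cur, nxt in zip(hyp, hyp[1:]):
            succ.setdefault(cur, set()).add(nxt)   # identical arcs merge here
        succ.setdefault(hyp[-1], set())            # final arc of this hypothesis
    return succ

# Example: two hypotheses sharing the initial silence and "w" segments.
h1 = [("sil", 0, 10), ("w", 10, 25), ("eh", 25, 40)]
h2 = [("sil", 0, 10), ("w", 10, 25), ("uh", 25, 40)]
lattice = build_composite_lattice([h1, h2])
# lattice[("w", 10, 25)] == {("eh", 25, 40), ("uh", 25, 40)}
```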
3.2. Sausage-based Lattice Pruning

Conversion from a lattice into a sausage net was proposed by Mangu et al. [7]. A sausage net not only contains distilled information from the lattice, but also represents the confusion consensus hypothesis in a more compact form. In addition to the n-best composite lattice, we also convert a complete lattice into a sausage lattice, as shown in Fig. 3, where the dotted path indicates the correct phone sequence. The compact size of the sausage lattice requires less MPE training time. Moreover, to eliminate potentially useless links in the sausage net, a pruning algorithm is adopted, summarized as follows (see the sketch after this list):

1. Set tavg as the average time duration of all links.
2. Set savg as the average time-normalized log-probability of all links.

Then, for each link li:

3. Drop li if the time-normalized log-probability si of li is less than the threshold savg.
4. Drop li if the time duration ti of li is less than the threshold tavg.
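The sketch below is a literal transcription of steps 1–4, assuming each link exposes its time duration and time-normalized log-probability (the Link type is illustrative only):

```python
from dataclasses import dataclass

@dataclass
class Link:
    phone: str
    duration: float      # time duration t_i
    norm_logprob: float  # time-normalized log-probability s_i

def prune_sausage(links):
    t_avg = sum(l.duration for l in links) / len(links)       # step 1
    s_avg = sum(l.norm_logprob for l in links) / len(links)   # step 2
    # steps 3 and 4: a link survives only if it clears both average thresholds
    return [l for l in links
            if l.norm_logprob >= s_avg and l.duration >= t_avg]
```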
After the pruning process, the remaining sausage net is traversed to enumerate the possible hypotheses, which are again composed into a composite lattice. Finally, to update the model parameters, the MPE objective function is maximized based on the derived sausage-based composite lattice. The pros and cons of using a sausage net will be discussed in Section 4.
Fig. 3. A typical example of the sausage net. The symbol “-” indicates the skippable transition.
3.3. Extended MPE Weight

Given an arc in a lattice, the extended Baum-Welch algorithm updates the model parameters with either a positive or a negative contribution, depending on the MPE weight of the arc. Traditionally, MPE training assigns a single MPE weight to all arcs of a chain. As shown in Fig. 4, the dotted arcs ow and m belong to the same chain and share the same MPE weight. Such a chain with multiple arcs may cause a problem in phone recognition, since the accuracy and the MPE weight of a phone may conflict. For instance, in Fig. 4 the MPE weight of the dotted chain is negative (which gives both arc ow and arc m a negative contribution), while arc m has positive accuracy. Similarly, in the dashed chain in Fig. 4, the MPE weight is positive (which gives both arc er and arc n a positive contribution), while arc n has negative accuracy. To eliminate such contradictions in the confusion chain, we propose the following algorithm (see the sketch after Fig. 4). For an arc q in a chain with multiple arcs:

1. Skip the negative contribution of arc q if the accuracy of q is greater than c but the MPE weight of q is negative.
2. Skip the positive contribution of arc q if the accuracy of q is less than −c but the MPE weight of q is positive.

The value of the parameter c is empirically set to 0.5 in this study. The corresponding experimental results are discussed in the next section.
Fig. 4. Two kinds of chains in a simple phone lattice. A(phone) and W(phone) denote the accuracy and the MPE weight of a phone, respectively.
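A minimal sketch of the two skip rules, with c = 0.5 as in the text; the per-arc accuracy and the shared chain weight are assumed to be available from the MPE statistics:

```python
C = 0.5  # empirically chosen threshold from the text

def arc_contribution(chain_mpe_weight, arc_accuracy):
    """Effective MPE weight for one arc of a multi-arc chain."""
    if chain_mpe_weight < 0 and arc_accuracy > C:    # rule 1: skip negative update
        return 0.0
    if chain_mpe_weight > 0 and arc_accuracy < -C:   # rule 2: skip positive update
        return 0.0
    return chain_mpe_weight

# With the Fig. 4 values, arc m (W = -0.021, A = 0.4) keeps its negative weight
# since 0.4 < c; an arc with A > 0.5 on the same chain would be skipped.
```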
4. Experimental Results

Our experiments are based on the TIMIT corpus. Following the default settings in [9], we take 3,696 sentences as the training data and 160 sentences as the test data. The acoustic analysis is performed at a 10 ms frame rate using a 20 ms Hamming window. Each frame contains 39 spectral feature coefficients, including 12 MFCCs and 1 log energy, together with their delta and double-delta values (an illustrative front-end sketch is given below).

To verify the performance of the proposed lattice pruning approach, a comparison with the traditional complete-lattice approach is shown in Fig. 5. The y-axis represents the absolute phone error rate and the x-axis represents the number of iterations. Our approaches clearly outperform the traditional approach using a complete lattice, in particular the sausage-based approach, where about 1% absolute phone error rate is reduced at the fourth iteration. In addition, our approach requires less training time due to the pruned and smaller lattice.
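For illustration only, the sketch below reproduces a comparable 39-dimensional front end with librosa (a stand-in toolkit that postdates the original work); here c0 approximates the log-energy term, and cepstral mean subtraction is applied as in Section 2.1.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=16000):
    """12 cepstra + energy term with deltas and double deltas (39 dims)."""
    y, _ = librosa.load(wav_path, sr=sr)
    hop, win = int(0.010 * sr), int(0.020 * sr)    # 10 ms shift, 20 ms window
    c = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=26,
                             hop_length=hop, n_fft=win, window="hamming")
    c -= c.mean(axis=1, keepdims=True)             # cepstral mean subtraction
    feats = np.vstack([c, librosa.feature.delta(c),
                       librosa.feature.delta(c, order=2)])
    return feats.T                                 # shape: (num_frames, 39)
```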
Fig. 5. Phone error rates versus numbers of iterations for different kinds of lattices (complete, n-best composite, and sausage-based) based on CIHMM.
The extended MPE weight is compared with the original calculation in Fig. 6. The proposed modification also improves the performance of MPE training. In particular, the phone error rate decreases steadily, suggesting the potential for further improvement. A more general and robust approach is under development and will be proposed in the immediate future.
Fig. 6. Phone error rates versus numbers of iterations for the two methods of calculating the MPE weight (original and extended) based on CIHMM.
Based on the results of the previous experiments, the performance of our CIHMM and CDHMM systems is shown in Fig. 7. The phone error rates of both systems are reduced, but the CIHMM system shows the larger error reduction, in particular at the fourth iteration, where a 7% error rate reduction over the baseline system is achieved. For the CDHMM system, we conjecture that the complicated tied-state technique makes MPE training less effective.
Fig. 7. Phone error rates versus numbers of iterations for different context types of HMMs (CIHMM and CDHMM).
5. Conclusions

In this paper, we have proposed several techniques for a discriminative phone recognition system. The extended MPE weight calculation provides a more refined way of performing MPE training, and a more efficient lattice generation approach is proposed. Experimental results demonstrate the feasibility of the proposed approaches. Immediate future work will focus on feature-level optimization such as fMPE [4]. Furthermore, the robust acoustic features used in attribute detection [14] can also be included to achieve better performance in phone recognition.
References

[1] Bach-Hiep Tran, Frank Seide, Volker Steinbiss, "A Word Graph Based N-Best Search in Continuous Speech Recognition", ICSLP 1996.
[2] Biing-Hwang Juang, Wu Chou, Chin-Hui Lee, "Minimum Classification Error Rate Methods for Speech Recognition", IEEE Transactions on Speech and Audio Processing, 1997.
[3] D. Povey, "Discriminative Training for Large Vocabulary Speech Recognition", Ph.D. thesis, Cambridge University, 2004.
[4] D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, G. Zweig, "fMPE: Discriminatively Trained Features for Speech Recognition", ICASSP 2005.
[5] Frank K. Soong, Eng-Fong Huang, "A Tree-Trellis Based Fast Search for Finding the N-Best Sentence Hypotheses in Continuous Speech Recognition", ICASSP 1991.
[6] Jung-Kuei Chen, Frank K. Soong, "An N-Best Candidates-Based Discriminative Training for Speech Recognition Applications", IEEE Transactions on Speech and Audio Processing, 1994.
[7] L. Mangu, E. Brill, A. Stolcke, "Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks", Computer Speech and Language, 2000.
[8] L. F. Lamel, J. L. Gauvain, "High Performance Speaker-Independent Phone Recognition Using CDHMM", EUROSPEECH 1993.
[9] K.-F. Lee, H.-W. Hon, "Speaker-Independent Phone Recognition Using Hidden Markov Models", IEEE Transactions on Acoustics, Speech, and Signal Processing, 1989.
[10] P. C. Woodland, D. Povey, "Large Scale Discriminative Training for Speech Recognition", ISCA ITRW ASR 2000.
[11] D. Povey, P. C. Woodland, "Minimum Phone Error and I-smoothing for Improved Discriminative Training", ICASSP 2002.
[12] V. Valtchev, J. J. Odell, P. C. Woodland, S. J. Young, "MMIE Training of Large Vocabulary Recognition Systems", Speech Communication, 1997.
[13] Jen-Wei Kuo, "An Initial Study on Minimum Phone Error Discriminative Learning of Acoustic Models for Mandarin Large Vocabulary Continuous Speech Recognition", thesis, 2005.
[14] Jinyu Li, Yu Tsao, Chin-Hui Lee, "A Study on Knowledge Source Integration for Candidate Rescoring in Automatic Speech Recognition", ICASSP 2005.