Improved Syllable Based Acoustic Modeling by Inter-syllable Transition Model for Continuous Chinese Speech Recognition

Hao Chao, Wenju Liu
Institute of Automation, Chinese Academy of Sciences, Beijing, China
{hchao, lwj}@nlpr.ia.ac.cn

Abstract

Accurately modeling the acoustic variabilities caused by co-articulation is important in continuous speech recognition. Recent research indicates that syllable units model intra-syllable co-articulation effects better than sub-syllable units do. However, most continuous Mandarin speech recognition systems use context dependent phones or Initial/Finals (IFs) as the basic acoustic unit, because it is difficult to collect sufficient data to train longer units. Here we present a syllable based approach comprising two steps. First, context independent syllable based acoustic models are trained, initialized from intra-syllable IFs based diphones to alleviate training data sparsity. Second, we capture inter-syllable co-articulation effects by incorporating inter-syllable transition models into the recognition system. Experimental results show that acoustic models based on the presented approach are effective in improving recognition performance.

Index Terms: speech recognition, modeling unit selection, co-articulation

1. Introduction

One of the most important problems in continuous speech recognition is to accurately model the acoustic variabilities caused by co-articulation in order to obtain high recognition accuracy. These variabilities are affected by context [1], so many continuous Chinese speech recognition systems use context-dependent Initial/Final models, such as IFs based triphones, as the elementary acoustic units of speech [2, 3]. Chinese is a syllable-centric language: each Chinese character can be phonetically represented by a syllable, and each Chinese syllable has the Initial-Final structure. Even with tone considered, there are only about 200 units in the Initial/Finals set. Because of these properties of Chinese, the complexity of context-dependent Initial/Final models is low, and such models can be robustly trained when reasonable amounts of training data are available [4]. Furthermore, context-dependent Initial/Final models can capture the short-term contextual effects that cause variations in the way a particular Initial/Final is produced. Though previous research has shown that using context-dependent Initial/Finals as the elementary modeling units can achieve good performance [2, 3], some problems are difficult to solve with IFs based acoustic modeling. First, co-articulation effects have a long time span and cannot be well captured by IFs units, which model speech segments of short duration [5]. Second, when IFs are the basic units, words are treated as sequences of IFs; in this case, pronunciation variation can only be represented in terms of Initial/Final level substitutions, deletions, and insertions. Furthermore, higher-level dependencies (such as the syllable) cannot be exploited effectively.

In view of these drawbacks, we can use longer-length acoustic units that include temporal dependencies. Words and syllables are the most obvious candidates. As mentioned earlier, Chinese is a syllabic language and each Chinese syllable has a fixed Initial-Final structure. The total number of Chinese syllables is therefore less than 1,300 (with tone considered), far fewer than the number of English syllables (over 30,000) [6]. Furthermore, in Chinese every basic semantic unit (a Chinese character) has one or more corresponding single-syllable pronunciations, so a syllable based pronunciation lexicon is easy to construct. Finally, recent research [7] has shown that syllables play an important role in human perception and speech production. Because of these properties, we believe the syllable is a suitable acoustic modeling unit for Chinese speech recognition. However, the use of syllable acoustic units suffers from training data sparsity. Because there are far more syllables than Initial/Finals, the number of syllable units with little or no acoustic training data inevitably increases, and the model parameters of these units become unreliable. Furthermore, when context-dependent (CD) syllables are used as acoustic units, the total number of acoustic models (with tone considered) is very large (about 1058 × 1058 × 1058), and a very large training corpus is needed; when context-independent (CI) syllables are used, the inter-syllable co-articulation effect cannot be captured. In this paper, we initialize the CI syllable based acoustic models using intra-syllable IFs based diphones to solve the problem of training data unevenness (as suggested by [5]), and adopt the parametric trajectory segment model [6] as an inter-syllable transition model to compensate for inter-syllable co-articulation effects. In the next section we describe the details of the CI syllable based models. In Section 3 we describe the inter-syllable transition model. Section 4 introduces the training process of the models. In Section 5, experimental results are presented and discussed. Conclusions and future work are given in Section 6.

2. Context Independent Syllable Based Model

The use of syllable acoustic units suffers from the problem of training data unevenness, so it is inappropriate to train syllable level models with flat initialization strategies. As mentioned earlier, most Chinese syllables have the Initial-Final structure, so we can use the parameters of well-trained IFs level models to initialize the syllable level models. To initialize the syllable level models, intra-syllable IFs based diphones must be trained first. Context expansion of the IFs based diphones does not cross Chinese syllable boundaries. As mentioned previously, each Chinese syllable has the Initial-Final structure, so for each Chinese syllable there are two intra-syllable IFs based diphones corresponding to the

978-1-4244-4199-0/09/$25.00 ©2009 IEEE


syllable. One is the Initial of the syllable, whose right neighbor is the Final of the syllable; the other is the Final of the syllable, whose left neighbor is the Initial of the syllable. Each IFs level model is represented by a continuous mixture density HMM with 3 emission states, and each syllable level model by a continuous mixture density HMM with 6 emission states: the number of states in a syllable level model is the sum of the numbers of states in an Initial model and a Final model. The topologies of the IFs level and syllable level models are shown in Figure 1. For each syllable level model, we obtain the initial state parameters from the corresponding IFs based diphones. A Chinese syllable such as ‘yan1’ (with tone considered) consists of an Initial ‘y’ and a Final ‘an1’. The IFs based diphones corresponding to this syllable are ‘y+an1’ and ‘y-an1’: the diphone ‘y+an1’ denotes the context dependent version of the Initial ‘y’ whose right neighbor is the Final ‘an1’, and the diphone ‘y-an1’ denotes the context dependent version of the Final ‘an1’ whose left neighbor is the Initial ‘y’. States 1-3 of the syllable ‘yan1’ are then initialized from the IFs based diphone ‘y+an1’, and states 4-6 from the IFs based diphone ‘y-an1’ (Figure 1). A few Chinese syllables (such as ‘ai1’) contain only a Final and no Initial. In this case we add a virtual Initial ‘s1’ to the syllable, turning it into ‘s1ai1’; the rest of the process is the same as for syllables with the complete Initial-Final structure. As can be seen from the above, to build a syllable based recognition system we must first train intra-syllable CD IFs level models. In training the IFs based diphone models, a decision tree is used for state tying; therefore, even if some syllables have little training data, the diphones corresponding to these syllables can be trained well.
Syllable level models initialized in this manner achieve the same recognition performance as the IFs based diphones even without further training. Syllables contain more temporal and spectral correlation information than Initial/Finals, so an improvement in recognition accuracy can be achieved when we further train the syllable level models that have adequate training data in the speech corpus.
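The state-copying initialization described above can be sketched as follows. This is an illustrative sketch only: the `HMM` class and helper function are hypothetical stand-ins, not the HTK-based tooling actually used in this work.

```python
# Minimal sketch: build a 6-state syllable HMM by copying the emission-state
# parameters of two 3-state intra-syllable diphone HMMs. The class and field
# names here are illustrative assumptions, not HTK's actual API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class HMM:
    name: str
    states: List[dict] = field(default_factory=list)  # one GMM per emission state

def init_syllable_model(initial_diphone: HMM, final_diphone: HMM, syllable: str) -> HMM:
    """States 1-3 come from the Initial's diphone (e.g. 'y+an1'),
    states 4-6 from the Final's diphone (e.g. 'y-an1')."""
    assert len(initial_diphone.states) == 3 and len(final_diphone.states) == 3
    return HMM(name=syllable, states=initial_diphone.states + final_diphone.states)

# Example for the syllable 'yan1' from the text (dummy GMM placeholders):
y_an1 = HMM("y+an1", [{"gmm": i} for i in range(3)])
an1_y = HMM("y-an1", [{"gmm": i + 3} for i in range(3)])
yan1 = init_syllable_model(y_an1, an1_y, "yan1")
```

Because the copied states are unchanged, such a model scores observations exactly as the two diphones would, which is why recognition performance matches the diphone system before any re-estimation.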

Figure 1: Initialization of the syllable ‘yan1’

3. Inter-Syllable Transition Model

Though the syllable level models described in Section 2 achieve good performance in modeling intra-syllable co-articulation effects and long-term temporal dependencies in speech, they cannot model inter-syllable co-articulation effects because they are context independent. Therefore, we build inter-syllable transition models to capture inter-syllable co-articulation effects.

3.1. Description of Inter-Syllable Transition Model

For each pair of adjacent syllables, we extract a fixed-length feature vector sequence located symmetrically across the boundary between the two syllables to train the corresponding inter-syllable transition model. If we built an inter-syllable transition model for each syllable pair, there would be about 1,119,364 (1058 × 1058) models; since the training data is limited, we would suffer from training data sparsity. In view of the fact that the Initial and Final nearest to an inter-syllable transition have the greatest effect on the acoustic variability of the transition segment [1], we instead construct one model for each Final-Initial pair. For example, the syllable pair ‘ning-gu’ (without tone considered) corresponds to the inter-syllable transition model ‘ing-g’, where ‘ing’ is the Final of the syllable ‘ning’ and ‘g’ is the Initial of the syllable ‘gu’. There are only 24 Initials and 37 Finals (in our dictionary), so only 888 inter-syllable transition models are needed. To account for the silence syllable ‘sil’, 61 additional models must be constructed. The total number of models is 949, far less than 1,119,364. In the training process, the syllable sequences used to train the inter-syllable transition models are first generated by forced alignment; thereafter, syllable sequences are obtained from the training utterances using a slightly modified Viterbi algorithm. Once the inter-syllable transition models have been trained, a speech signal can be represented by a sequence of syllable based models together with the inter-syllable transition models that overlay it (Figure 2). The modified Viterbi algorithm is used to find the best sequence of syllable based models and inter-syllable transition models.
The only modification to the Viterbi algorithm is that an additional transition score is computed when the search moves from one syllable into the next; this score is added to the total score. The transition score is obtained by matching the fixed-length feature vector sequence located symmetrically across the syllable boundary against the inter-syllable transition model corresponding to the syllable pair. The weight of the transition score is adjusted manually. For a test utterance, the syllable string associated with the highest-scoring sequence of syllable based models and inter-syllable transition models is taken as the recognition output.

Figure 2: Sequence of syllable based models and inter-syllable transition models representing a speech signal
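The boundary-scoring step of Section 3.1 might be sketched as follows. All names here are illustrative placeholders (the actual transition likelihood comes from the PSM of Section 3.2), not the authors' implementation:

```python
# Sketch of the Viterbi modification: when the search crosses a syllable
# boundary, a weighted transition score for the fixed-length boundary
# segment is added to the path score. `transition_logprob` stands in for
# the PSM log-likelihood; `weight` is the manually adjusted scale factor.
def boundary_segment(frames, boundary_t, half_len):
    """Fixed-length window placed symmetrically across the boundary frame."""
    return frames[boundary_t - half_len : boundary_t + half_len]

def scored_transition(path_score, frames, boundary_t, half_len,
                      transition_logprob, weight=1.0):
    """Add the weighted transition score to the accumulated path score."""
    seg = boundary_segment(frames, boundary_t, half_len)
    return path_score + weight * transition_logprob(seg)
```

Only the best-path recursion at syllable-exit states needs this extra term; within-syllable transitions are scored by the standard Viterbi recursion.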

3.2. Parametric Trajectory Segment Model

Since the acoustic characteristics in the transition region between two syllables are not stationary [8], we adopt the parametric trajectory segment model (PSM) to deal with the non-stationary trajectories of feature vector sequences in transition regions. In a PSM, polynomial functions of time capture the dynamic characteristics of the mean trajectory. The PSM is defined as


Y = ZB + E    (1)

where Y is a T × D feature matrix for T frames of D-dimensional feature vectors, Z is a T × (R+1) design matrix for an R-th order trajectory model that maps segments of different durations onto the range [0, 1], B is an (R+1) × D parameter matrix, and E is a T × D residual matrix. If model \alpha has training data \{Y_1, Y_2, ..., Y_K\}, where Y_k is a T_k × D feature matrix, the joint probability density function is

P(Y_1, Y_2, ..., Y_K \mid B_\alpha, \Sigma_\alpha) = \prod_{k=1}^{K} P(Y_k \mid B_\alpha, \Sigma_\alpha) = \prod_{k=1}^{K} \prod_{t=1}^{T_k} f(y_{k,t})    (2)

and the maximum likelihood estimates of the PSM parameters B_\alpha and \Sigma_\alpha are

\hat{B}_\alpha = \Big[ \sum_{k=1}^{K} Z_k^T Z_k \Big]^{-1} \Big[ \sum_{k=1}^{K} Z_k^T Y_k \Big]    (3)

\hat{\Sigma}_\alpha = \frac{ \sum_{k=1}^{K} (Y_k - Z_k \hat{B}_\alpha)^T (Y_k - Z_k \hat{B}_\alpha) }{ \sum_{k=1}^{K} T_k }    (4)

The likelihood of an observation vector sequence Y against model \alpha (with mean B_\alpha and covariance \Sigma_\alpha) can then be evaluated as

P(Y \mid \alpha) = \prod_{t=1}^{T} f(y_t)    (5)

where f(y_t) is the likelihood of the observation vector at time t against model \alpha:

f(y_t) = \frac{1}{(2\pi)^{D/2} |\Sigma_\alpha|^{1/2}} \exp\Big\{ -\frac{1}{2} (y_t - z_t B_\alpha)^T \Sigma_\alpha^{-1} (y_t - z_t B_\alpha) \Big\}    (6)

where z_t is the t-th row of the matrix Z:

z_t = \Big[ 1, \; \frac{t-1}{T-1}, \; ..., \; \Big( \frac{t-1}{T-1} \Big)^R \Big]    (7)

4. Training of Acoustic Models

4.1. Speech Corpora

The database provided by the Chinese National Hi-Tech Project 863 for Mandarin LVCSR system development is used in our experiments. The training set contains 48,373 sentences from 83 male speakers, and the test set contains 240 sentences from 6 male speakers. All speech in the corpora is read speech.

4.2. Feature Extraction

Feature extraction is carried out at a frame rate of 10 ms. The features are Mel-frequency cepstral coefficients (MFCCs): 12 cepstral coefficients plus log-energy, with their corresponding first- and second-order time derivatives, giving a 39-element feature vector. Feature extraction is implemented using HTK [9].

4.3. Acoustic Modeling

In this section we construct a baseline system using IFs based triphones and a syllable based system, in order to compare the performance of IFs units with that of syllable units. Before the syllable based system can be constructed, we must first train intra-syllable CD IFs level models, which are used to initialize the syllable level models. Each emission state of these models is represented by a Gaussian mixture model with 16 mixture components. The inter-syllable transition models are trained on training data that has been force-aligned using the syllable based models.

4.3.1. Model training of baseline system

The baseline system uses IFs based triphones as basic acoustic units. The models are continuous density HMMs with 3 emission states. Context independent models are trained first, and the triphone parameters are then initialized from them. There are 24 Initials and 166 Finals (with tone considered) in our training corpora, and the total number of triphones is very large (about 202,946), so parameter tying is performed using a decision tree [2]; after tying, the number of triphones is 39,895.

4.3.2. Intra-syllable CD IFs level model training

The intra-syllable CD IFs level models are IFs based diphones, which are trained to initialize the parameters of the syllable based models. There are 1,057 syllables in our training corpora, so the number of intra-syllable CD models is 2,114. These models are also continuous density HMMs with 3 emission states, and their training process is similar to that of the triphones.

4.3.3. Syllable based model training

Syllable based models are continuous density HMMs with 6 emission states. As introduced in Section 2, a syllable consists of two IFs based diphones, and the emission states of the syllable based models and the IFs based diphone models are all represented by Gaussian mixture models with 16 mixture components. As a result, the parameters of each syllable based model can be initialized from the parameters of the two corresponding diphones. After initialization, further training is performed with the Baum-Welch algorithm. Due to uneven training data, syllables with little or no acoustic training data in our corpora are not further trained; only syllables with sufficient training data are re-estimated. A threshold on the frequency of occurrence of syllable units in the training corpora controls the number of syllable units that are re-estimated, and we adjust this threshold to achieve the best performance.

4.3.4. Inter-syllable transition model training

When the syllable based recognition system has been constructed, forced alignments of all training utterances are generated using HTK [9]. We then extract inter-syllable boundary segments from the forced alignments to train the 949 inter-syllable transition models, using equations (3) and (4). An iterative training procedure is then applied, alternately segmenting all training utterances into syllable sequences using the modified Viterbi algorithm introduced in Section 3.1 and


updating the parameters of all syllable based models and inter-syllable transition models. When the order R of the polynomial in the PSM increases from 0 to 2, recognition accuracy improves significantly; further increases in the order bring no obvious improvement [10]. Moreover, a higher polynomial order requires more parameters to be estimated, so we set R = 2. The weight of the transition score and the fixed length of the feature vector sequence extracted to train the inter-syllable transition models are both tuned to obtain the best performance.
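The closed-form PSM estimation used for the transition models, equations (3) and (4), can be sketched as follows. This is an illustrative numpy implementation under the paper's definitions, not the authors' code:

```python
# Sketch of the closed-form PSM estimates of equations (3)-(4).
# Each segment of length T is mapped to a design matrix whose t-th row is
# [1, (t-1)/(T-1), ..., ((t-1)/(T-1))^R], per equation (7).
import numpy as np

def design_matrix(T, R):
    tau = np.arange(T) / max(T - 1, 1)             # (t-1)/(T-1) for t = 1..T
    return np.vander(tau, R + 1, increasing=True)  # T x (R+1)

def fit_psm(segments, R=2):
    """segments: list of (T_k x D) feature matrices for one transition model.
    Returns (B, Sigma): trajectory coefficients and residual covariance."""
    Zs = [design_matrix(Y.shape[0], R) for Y in segments]
    ZtZ = sum(Z.T @ Z for Z in Zs)
    ZtY = sum(Z.T @ Y for Z, Y in zip(Zs, segments))
    B = np.linalg.solve(ZtZ, ZtY)                  # equation (3)
    resid = [Y - Z @ B for Z, Y in zip(Zs, segments)]
    Sigma = sum(E.T @ E for E in resid) / sum(Y.shape[0] for Y in segments)  # (4)
    return B, Sigma
```

Because the design matrix normalizes time to [0, 1], segments of different durations contribute to the same polynomial coefficients, which is what lets boundary segments of varying length share one model.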

5. Experimental Results

We compare the performance of the baseline system and the syllable based recognizer. Table 1 shows the recognition accuracy of the different models. Because the syllable based models are initialized from the IFs based diphone models, the performance of the syllable recognition system is about the same as that of the IFs based diphone system even when re-estimation of the syllable models is not performed. The accuracy of the diphone system is lower than that of the triphone system, which can be attributed to its failure to capture inter-syllable co-articulation effects. Re-estimation of the syllable models captures intra-syllable co-articulation effects and long-term temporal dependencies better than triphones do, so the performance of the syllable system improves.

Table 1: Experiment results with different models

Model                              Corr    Sub    Del   Ins   WER
Triphone                           85.15   14.56  0.29  0.48  15.33
Diphone                            84.74   14.79  0.48  0.16  15.42
Syllable (without re-estimation)   84.74   14.79  0.48  0.19  15.45
Syllable (with re-estimation)      86.26   13.39  0.35  0.22  13.96

In Section 4.3.3 we set different thresholds to obtain the best performance. The threshold controls the number of syllable units that are re-estimated. Table 2 shows the performance of the syllable system with different thresholds. If the threshold is too small, many syllable based models with insufficient training data are re-estimated and performance deteriorates. Conversely, if the threshold is too large, many syllable based models with sufficient training data are not re-estimated, and performance also deteriorates. Accordingly, recognition accuracy improves as the threshold increases from 100 to 400, but declines as the threshold increases further.

Table 2: Experiment results with different thresholds

Threshold   Corr    Sub    Del   Ins   WER
100         85.88   13.74  0.38  0.22  14.34
200         86.17   13.45  0.38  0.22  14.05
300         86.14   13.48  0.38  0.25  14.12
400         86.26   13.39  0.35  0.22  13.96
500         86.17   13.48  0.35  0.22  14.05
600         86.07   13.58  0.35  0.22  14.15
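As a reading aid (our note, not from the paper): the percentage columns in Tables 1 and 2 are consistent with the usual word-error-rate definition, where Corr = 100 − Sub − Del and WER ≈ Sub + Del + Ins, with small discrepancies in a few rows due to rounding of the individual columns.

```python
# Relation between the table columns, all expressed as percentages of
# the reference units: WER = Sub + Del + Ins (up to per-column rounding).
def wer_pct(sub: float, dele: float, ins: float) -> float:
    """Word error rate from substitution, deletion, insertion percentages."""
    return sub + dele + ins
```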

Table 3 shows the recognition performance of the syllable system with the inter-syllable transition model, together with that of the plain syllable system and the triphone system. Compared with the syllable system, the syllable system with the transition model achieves better performance; however, the gain in recognition performance is not large. This seems to indicate that most co-articulation in Chinese speech is intra-syllable co-articulation.

Table 3: Experiment results with different models

Model                            Corr (%)
Triphone                         85.15
Syllable                         86.26
Syllable with transition model   86.80

6. Conclusion

In this paper we constructed syllable based acoustic models to capture intra-syllable co-articulation effects and incorporated inter-syllable transition models into the recognition system to capture inter-syllable co-articulation effects. The results show that syllable based models combined with inter-syllable transition models achieve better recognition performance. In the future, more sophisticated models are needed to capture the inter-syllable co-articulation effect. Meanwhile, we will examine the possibility of applying this research to spontaneous-style speech corpora.

7. Acknowledgements

This work was supported in part by the China National Natural Science Foundation (No. 60675026, No. 60121302 and No. 90820011), the 863 China National High Technology Development Project (No. 20060101Z4073, No. 2006AA01Z194), and the National Grand Fundamental Research 973 Program of China (No. 2004CB318105).

8. References

[1] X. Y. Zhou, B. Wang, Y. F. Yang, X. Q. Li, "The influence of coarticulation on syllable perception in utterance," Acta Psychologica Sinica (in Chinese), 35(3):340-344, 2003.
[2] S. Gao, et al., "Acoustic modeling for Chinese speech recognition: a comparative study of Mandarin and Cantonese," in Proc. ICASSP, pp. 1261-1264, 2000.
[3] J. Li, et al., "Improved context-dependent acoustic modeling for continuous Chinese speech recognition," in Proc. EuroSpeech, pp. 1617-1620, 2001.
[4] H. Wu, X. H. Wu, "Context dependent syllable acoustic model for continuous Chinese speech recognition," in Proc. EuroSpeech, pp. 1713-1716, 2007.
[5] A. Sethy and S. Narayanan, "Split-lexicon based hierarchical recognition of speech using syllable and word level acoustic units," in Proc. ICASSP, pp. 772-776, 2003.
[6] A. Acero, X. D. Huang, H. Hon, Spoken Language Processing, Prentice Hall, 2000.
[7] A. Ganapathiraju, J. Hamaker, J. Picone, M. Ordowski, G. R. Doddington, "Syllable-based large vocabulary continuous speech recognition," IEEE Trans. Speech Audio Processing, 9(4):358-366, 2001.
[8] L. Deng, M. Aksmanovic, D. Sun, and C. F. J. Wu, "Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states," IEEE Trans. Speech Audio Processing, 2(4):507-520, 1994.
[9] S. Young, et al., The HTK Book, Cambridge University, 2002.
[10] Y. Y. Zhang, The Study of Parametric Trajectory Models and Its Applications in Confidence Measures (in Chinese), Ph.D. thesis, Institute of Automation, Chinese Academy of Sciences, 2003.


