
IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 5, NO. 2, MARCH 1997

Complete Recognition of Continuous Mandarin Speech for Chinese Language with Very Large Vocabulary Using Limited Training Data


TABLE I THE PHONOLOGICAL HIERARCHY OF MANDARIN SYLLABLES, WHERE THE NUMBER INSIDE EVERY BRACKET INDICATES THE TOTAL NUMBER OF THAT KIND OF UNITS IN MANDARIN CHINESE

Hsin-Min Wang, Tai-Hsuan Ho, Rung-Chiung Yang, Jia-Lin Shen, Bo-Ren Bai, Jenn-Chau Hong, Wei-Peng Chen, Tong-Lo Yu, and Lin-Shan Lee

Abstract— This correspondence presents the first known results of complete recognition of continuous Mandarin speech for the Chinese language with very large vocabulary but very limited training data. Various acoustic and linguistic processing techniques were developed, and a prototype system of a continuous speech Mandarin dictation machine has been successfully implemented. The best recognition accuracy achieved is 92.2% for finally decoded Chinese characters. Index Terms— Mandarin speech, recognition, very large vocabulary, continuous speech, tone recognition, hidden Markov modeling, Chinese language model

I. INTRODUCTION

This correspondence presents the first known results of complete recognition of continuous Mandarin speech for the Chinese language (i.e., recognition of the syllables, tones, Chinese characters, words, and sentences) with very large vocabulary but using only very limited training data. Although some isolated-syllable-based [1], [2] or isolated-word-based [3], [4] large-vocabulary Mandarin speech recognition systems have been successfully developed, a continuous-speech-based system of this kind has never been reported before but has been highly desired. Only with continuous speech can the desired speed, convenience, and naturalness of man–machine communications be achieved. Also, because it is not feasible to ask a new user to produce much training speech before being able to use the system, relatively limited training data becomes a very natural constraint.

In Mandarin Chinese, there exist at least 80 000 commonly used words, each composed of one to several characters, and more than 10 000 commonly used characters, all produced as monosyllables. However, the total number of phonologically allowed different syllables is only 1345. Also, Mandarin Chinese is a tonal language, in which each syllable is assigned a tone, and there are a total of four lexical tones plus one neutral tone. It has been found that the vocal tract parameters for Mandarin speech are only slightly influenced by the tones, and the tones can be separately recognized primarily using the pitch contour information. When the differences among the syllables caused by tones are disregarded, the total of 1345 different tonal syllables is reduced to only 416 base syllables (i.e., the syllable structures independent of the tones). It is therefore helpful to recognize the tones and base syllables separately.

Based on the above considerations, the block diagram of the core speech recognition system presented in this paper is shown in the lower half of Fig. 1, while the peripherals in the upper half of the figure can be included in the overall architecture of the complete prototype system to be described later in Section VII. In the acoustic processor, the tones and base syllables are separately recognized using two different sets of models, and a multiple candidate searching technique based on a concatenated syllable matching algorithm is used to integrate the two separate base syllable and tone recognizers. In the linguistic decoder, a lexicon is used to construct the word lattice, and a Chinese language model is used to find the final output characters, words, and sentences, because every tonal syllable is in general shared by many (e.g., 10 000/1345 on average) possible homonym characters.

Manuscript received October 25, 1994; revised November 9, 1995. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Kuldip K. Paliwal.
H.-M. Wang, J.-L. Shen, B.-R. Bai, W.-P. Chen, and T.-L. Yu are with the Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan, R.O.C. (e-mail: [email protected]).
T.-H. Ho, R.-C. Yang, and J.-C. Hong are with the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
L.-S. Lee is with the Department of Electrical Engineering and the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C. (e-mail: [email protected]).
Publisher Item Identifier S 1063-6676(97)00756-6.

II. PRELIMINARIES

The speech database used here was produced by four speakers, two male and two female. The results reported here are averages over the four. Each speaker uttered one set of phonetically balanced continuous Mandarin sentences covering all the 416 base syllables and different tones (200 sentences, or 1514 syllables) for training, and another set of short paragraphs taken from different articles (142 sentences, or 1451 syllables in total) for testing. All the recorded materials were obtained in a laboratory environment and digitized at a sampling frequency of 16 kHz. For base syllable recognition, 14 cepstral and 14 delta-cepstral coefficients, delta-energy, and delta-delta-energy are used as feature parameters to form a feature vector with dimension 30, while for tone recognition the pitch period and the energy together with their first- and second-order delta-coefficients are used instead to form a feature vector with dimension 6.

The hidden Markov models (HMM’s) used in this paper are left-to-right continuous HMM’s (CHMM’s) with two transitions only.
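The two feature vectors described in this section can be assembled as in the following minimal sketch. The paper does not state how the delta coefficients are computed, so the symmetric difference used in `delta` is an assumption, and the function names `delta`, `syllable_features`, and `tone_features` are hypothetical. Only the composition of the vectors (14 + 14 + 1 + 1 = 30 and 2 + 2 + 2 = 6) is taken from the text.

```python
def delta(frames):
    """First-order delta of a sequence of per-frame vectors.

    Assumed formula: symmetric difference (next - previous) / 2, with the
    nearest frame reused at the edges. The paper does not give its formula.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def syllable_features(cepstra, energy):
    """30-dim vectors: 14 cepstral + 14 delta-cepstral + delta-energy
    + delta-delta-energy (the raw energy itself is not included)."""
    d_cep = delta(cepstra)
    d_e = delta([[e] for e in energy])
    dd_e = delta(d_e)
    return [c + dc + de + dde for c, dc, de, dde in zip(cepstra, d_cep, d_e, dd_e)]


def tone_features(pitch_period, energy):
    """6-dim vectors: pitch period and energy with their first- and
    second-order deltas."""
    base = [[p, e] for p, e in zip(pitch_period, energy)]
    d1 = delta(base)
    d2 = delta(d1)
    return [b + x + y for b, x, y in zip(base, d1, d2)]


# Toy input: 5 frames of made-up values, only to check the dimensions.
toy_cepstra = [[float(t + k) for k in range(14)] for t in range(5)]
toy_energy = [0.0, 1.0, 4.0, 9.0, 16.0]
toy_pitch = [100.0] * 5

syl_feats = syllable_features(toy_cepstra, toy_energy)
tone_feats = tone_features(toy_pitch, toy_energy)
```

The toy arrays are illustrative values, not data from the experiments.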
The CHMM’s obtained from the speaker-independent training data are used as the initial models and further trained by the modified segmental K-means training algorithm [5], [6] using the continuous training data, in which the CHMM’s after each iteration are linearly interpolated with the initial CHMM’s. The recognition algorithm used here is an N-best beam search algorithm [6].

1063–6676/97$10.00 © 1997 IEEE

III. BASE SYLLABLE RECOGNITION

Conventionally, each of the 416 Mandarin base syllables mentioned above can be decomposed into an INITIAL/FINAL format very similar to the consonant/vowel relations in other languages. Here, “INITIAL” is the initial consonant part of a syllable, and “FINAL” is the vowel (or diphthong) part, including an optional medial or nasal ending. There are a total of 22 INITIAL’s and 41 FINAL’s in Mandarin. However, these 63 (22 + 41) INITIAL/FINAL’s can be further decomposed into even smaller acoustic units. It was found that a total of 33 phonelike units (PLU’s) is enough to transcribe the 416 Mandarin base syllables. The phonological hierarchy of a Mandarin syllable is shown in Table I, where the relationships among tonal syllables, base syllables and tones, INITIAL/FINAL’s, and PLU’s are shown. The 33 PLU’s of Mandarin Chinese with their IPA (International Phonetic Alphabet) representations are listed in Table II(a), in which the corresponding simplified symbols used in this research (SPA: Simplified Phonetic Alphabet) are also listed for reference, while the 22 INITIAL’s and 41 FINAL’s in SPA are listed in Table II(b) and (c), respectively. It can be seen that an INITIAL is always a PLU, while a FINAL may contain one, two, or three PLU’s in general. That is, a Mandarin base syllable is composed of two to four PLU’s.

Fig. 1. Overall architecture of the prototype system, with the core of the recognition system depicted in the lower half.

TABLE II THE (a) 33 PLU’S, (b) 22 INITIAL’S, AND (c) 41 FINAL’S OF MANDARIN CHINESE, WHERE BOTH INTERNATIONAL PHONETIC ALPHABET (IPA) AND SIMPLIFIED PHONETIC ALPHABET (SPA) ARE LISTED IN (a) FOR REFERENCE, BUT ONLY SPA IS USED IN (b) AND (c)

Fig. 2. Part of an example utterance. $x$, $y$, and $z$ are possible beginning points, where $y-1$ and $z-1$ are possible ending points corresponding to $x$.

Special efforts were made in selecting the most appropriate subsyllabic acoustic units for base syllable recognition, considering the condition of very limited training data and the special monosyllabic characteristics of Mandarin Chinese. A general observation on continuous Mandarin speech is that the coarticulation effects within a syllable are much more significant than those across syllables, and within a syllable the acoustic characteristics of the INITIAL are highly dependent on the FINAL, but those of the FINAL are much less dependent on the INITIAL. With these observations, a first approach is to assume both the coarticulation effects across syllables and the dependence of the FINAL on the preceding INITIAL within a syllable to be negligible. FINAL’s are therefore just taken as context-independent units. In a second approach, it can be further assumed that the dependence of an INITIAL on the following FINAL is primarily determined by the beginning phone of the FINAL only. As shown in Table II(c), the 41 FINAL’s are divided into eight groups according to their beginning phones. The FINAL’s in the same group are thus assumed to have the same influence on their preceding INITIAL’s. In this way, the 22 INITIAL’s can be expanded to 113 context-dependent (CD) INITIAL’s, as shown in Table III. Based on assumptions similar to those mentioned above for INITIAL/FINAL’s, i.e., the coarticulation effects across syllables assumed negligible and so on, the 33 PLU’s can be expanded to 511 context-dependent (CD) PLU’s.
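The expansion of INITIAL’s into context-dependent INITIAL’s can be illustrated with a small sketch: each INITIAL is paired with the beginning-phone group of every FINAL it actually combines with, and the distinct pairs are the CD-INITIAL’s. The syllable inventory and group labels below are toy, hypothetical values (the real ones are in Tables II and III, which are not reproduced here), and `expand_cd_initials` is an illustrative name. One plausible reading of why only 113 of the 22 × 8 = 176 conceivable pairs appear is that many INITIAL/group combinations do not occur among the 416 base syllables.

```python
def expand_cd_initials(syllables, final_group):
    """Return the sorted, de-duplicated (INITIAL, FINAL-group) pairs.

    syllables   : iterable of (initial, final) pairs that occur in the language
    final_group : maps each FINAL to the group of its beginning phone
    """
    return sorted({(initial, final_group[final]) for initial, final in syllables})


# Toy inventory (hypothetical symbols, not the paper's SPA tables).
toy_syllables = [
    ("b", "a"), ("b", "an"), ("b", "i"), ("b", "ian"),
    ("m", "a"), ("m", "u"),
]
toy_final_group = {"a": "a-", "an": "a-", "i": "i-", "ian": "i-", "u": "u-"}

cd_initials = expand_cd_initials(toy_syllables, toy_final_group)

# Upper bound if every INITIAL could precede every one of the eight groups.
upper_bound = 22 * 8
```

In the toy inventory, two INITIAL’s expand to four CD-INITIAL’s because FINAL’s sharing a beginning phone collapse into one context.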



TABLE III THE 113 CONTEXT-DEPENDENT (CD) INITIAL’S, IN WHICH THE VERTICAL SCALE IS THE 22 INITIAL’S, AND THE HORIZONTAL SCALE IS THE BEGINNING PHONES OF THE FOLLOWING FINAL’S IN THE 8 GROUPS IN TABLE II(c)


TABLE IV RESULTS OF RECOGNITION OF (a) BASE SYLLABLES AND (b) TONES

TABLE V THE 23 CAREFULLY CHOSEN CONTEXT-DEPENDENT TONE MODELS. FOR EXAMPLE, 1-(2) IN THE FIRST COLUMN IS THE TONE 1 MODEL AT SENTENCE BEGINNING FOLLOWED BY TONE 2, AND (1)-2 IN THE SECOND COLUMN IS THE TONE 2 MODEL AT SENTENCE END PRECEDED BY TONE 1.

Acoustic models for all kinds of units mentioned above, including the 416 base syllable (BS) models, 33 context-independent (CI) PLU (33 CI-PLU) models, 511 context-dependent (CD) PLU (511 CD-PLU) models, 22 CI-INITIAL and 41 CI-FINAL (22 CI-I/41 CI-F) models, and 113 CD-INITIAL and 41 CI-FINAL (113 CD-I/41 CI-F) models, are considered in the experiments. Each PLU model is represented by a CHMM with two states, each INITIAL model by one with three states, each FINAL model by one with four states, and each syllable model by one with seven states. The results for three to five mixtures per state are listed in Table IV(a), together with the total number of states needed for each set of models in the last column. It is obvious from Table IV(a) that the recognition accuracies of all kinds of subsyllabic models except the 33 CI-PLU models are much superior to those of the BS models. The base syllable models apparently suffer from inefficient utilization of the training data, because too many states (2912) are required. On the other hand, smaller units are not always better choices if the context dependency is not carefully considered. The 33 CI-PLU models are a good example: they have the smallest number of states and the worst performance among all subsyllabic models. However, even though the context dependency is carefully considered in the 511 CD-PLU models, the recognition accuracies achieved are in fact inferior to those of the 22 CI-I/41 CI-F and 113 CD-I/41 CI-F models, probably because too many states (1022) are needed for our very limited training data. It seems that the 503 states for the 113 CD-I/41 CI-F models turn out to be a good number for the limited training data used in this case, and the major context dependency and coarticulation effects have been properly taken care of in these models; thus the highest accuracies are achieved.
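The state totals quoted above follow directly from the number of models in each set and the number of states per model type. The short check below reproduces them; the dictionary and function names are illustrative only, while the counts are taken from the text.

```python
# States per model type, as stated in the text.
STATES_PER_UNIT = {"PLU": 2, "INITIAL": 3, "FINAL": 4, "SYLLABLE": 7}

# Number of models of each type in every model set considered.
MODEL_SETS = {
    "416 BS": {"SYLLABLE": 416},
    "33 CI-PLU": {"PLU": 33},
    "511 CD-PLU": {"PLU": 511},
    "22 CI-I/41 CI-F": {"INITIAL": 22, "FINAL": 41},
    "113 CD-I/41 CI-F": {"INITIAL": 113, "FINAL": 41},
}


def total_states(units):
    """Sum of (number of models) x (states per model) over all unit types."""
    return sum(STATES_PER_UNIT[kind] * count for kind, count in units.items())


state_totals = {name: total_states(units) for name, units in MODEL_SETS.items()}

for name, total in state_totals.items():
    print(f"{name:18s} {total:5d} states")
```

With the roughly 1500 training syllables per speaker mentioned in Section II, the gap between 2912 and 503 states gives a sense of how much less training data each state sees in the base syllable models.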
Furthermore, for the 113 CD-I/41 CI-F models, the recognition accuracy does not change significantly when the mixture number is increased from three to five, probably because the very limited training data is insufficient to adequately reestimate a large number of parameters. In all the following experiments, the 113 CD-I/41 CI-F models with three mixtures are adopted for base syllable recognition.

IV. TONE RECOGNITION

It is well known that tone behavior is very complicated in continuous Mandarin speech, although there are only 5 (4 + 1) different tones in Mandarin. It is therefore very important to select a minimum number of appropriate context-dependent models such that they can adequately describe the very complicated tone behavior but at the same time can be well trained by the limited training data. First, assume that each possible tone concatenation combination needs a context-dependent model; then a total of 175 models would be needed, i.e., $5^3$ (in the middle of a sentence) + $5^2$ (at the end of a sentence) + $4 \times 5$ (at the beginning of a sentence, because the neutral tone never appears at the beginning of a sentence) + 5 (isolated models). In practice, however, this number can be significantly reduced if special characteristics of the tone behavior are carefully considered. For example, both Tones 1 and 2 end high at a similar level, and both Tones 3 and 4 end low at a similar level. This makes the influence of Tones 1 and 2 (or Tones 3 and 4) on the following tones very similar, and this was, in fact, observed empirically. It was also observed empirically that the influence of a neutral tone on the tone behavior of adjacent syllables on both sides is insignificant and can be ignored. When all such phenomena are considered, many models can be merged and the total number of models can be reduced to 23, as listed in Table V.

Fig. 3. Partial list of a simplified example word lattice, where each circle is a monocharacter word, while each ellipse is a polycharacter word.

Fig. 4. Three-stage word classification algorithm.

Some experiments on tone recognition were performed. The five context-independent (CI) tone (five CI-T) models and both the 175 and 23 context-dependent (CD) tone (175 CD-T and 23 CD-T) models derived above were tested, with each tone model represented by a CHMM with seven states. The results are listed in Table IV(b), with the total number of states needed for each set of models also listed in the last column. It is obvious from Table IV(b) that, for the five CI-T models, the recognition accuracy in general increases with the mixture number, because the context dependency is not included here and, thus, a larger number of mixtures is needed to model the complicated coarticulation effects. On the other hand, for the context-dependent models, the recognition accuracy increases only slightly or even degrades as the mixture number increases, because of the very limited training data. Between the two sets of context-dependent models, the 23 CD-T models better model the tone behavior, apparently because they need a proper number of states (161) for the limited training data but at the same time include the major context dependency for tones in continuous speech. It should be noted that the improvements from context-independent models to context-dependent models turn out to be relatively limited here, probably also due to the limited training data. In the following experiments, the 23 CD-T models with three mixtures are used.
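The count of 175 exhaustive context-dependent tone models can be checked by enumerating the contexts described in this section. This is only a counting sketch: the variable names are illustrative, Tone 5 is used here to denote the neutral tone, and the 23 merged models of Table V are not reconstructed because the table itself is not available in the text.

```python
from itertools import product

TONES = [1, 2, 3, 4, 5]      # 5 stands for the neutral tone (labeling assumed here)
LEXICAL_TONES = [1, 2, 3, 4]  # the neutral tone never begins a sentence

# (previous, current, next) for a syllable in the middle of a sentence
middle = list(product(TONES, TONES, TONES))
# (previous, current) for a syllable at the end of a sentence
end = list(product(TONES, TONES))
# (current, next) for a syllable at the beginning of a sentence
beginning = list(product(LEXICAL_TONES, TONES))
# isolated syllables
isolated = [(t,) for t in TONES]

n_exhaustive = len(middle) + len(end) + len(beginning) + len(isolated)

# Each tone model is a seven-state CHMM.
STATES_PER_TONE_MODEL = 7
tone_states = {n: n * STATES_PER_TONE_MODEL for n in (5, 23, n_exhaustive)}
```

The exhaustive set needs 1225 states against 161 for the 23 merged models, which is the trade-off the section describes.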

V. TONAL SYLLABLE RECOGNITION

When tones and base syllables are separately recognized as described above, a major problem is that the segmentation of the unknown utterances into syllables may become quite different in the two recognizers. The problem gets even worse when any insertion or deletion occurs in one of the recognizers, in which case the resulting tone and base syllable sequences will be out of synchronization, and such errors can propagate over a long span. A concatenated syllable matching (CSM) algorithm, summarized below, was therefore developed for synchronizing the base syllable and tone recognition. For a given test utterance, all possible syllable beginning frames can first be obtained by picking all the dips in the energy contour, such as $x$, $y$, and $z$ in Fig. 2. The possible ending frames, such as $y-1$ and $z-1$, corresponding to each beginning frame such as $x$ can then be found using the estimated minimum and maximum durations of a syllable, $D_{\min}$ and $D_{\max}$, as in Fig. 2. With the beginning and ending points of a syllable estimated as above, the accumulated score at an ending point such as $y-1$ in Fig. 2 is then determined by the following dynamic programming recursion:

$$T[y-1] = T[x-1] + \max_{1 \le i \le 1345} \left[ w \times S_{bs}(x, y-1) + (1-w) \times S_{t}(x, y-1) \right] \tag{1}$$

where $T[u]$ is the accumulated score at a point $u$; $S_{bs}(u, v)$ and $S_{t}(u, v)$ are the scores for the corresponding base syllable and tone for a tonal syllable $i$ (i.e., one out of the 1345, including a base syllable and a tone) matched with the utterance section $(u, v)$; and $w$ is a weighting factor. For any utterance section, the base syllable recognition was performed using the N-best beam search algorithm mentioned above because subsyllabic units were adopted, while for tone recognition the conventional Viterbi search algorithm was used. At the end of the utterance, the most probable tonal syllable sequence can then be easily obtained by backtracking through the entire utterance. In this way, the recognition of base syllables and tones can be performed syllable by syllable in complete synchronization, with the advantages that not only is extra information such as energy contour dips applied, but the information in base syllable and tone recognition is also properly integrated.

Furthermore, a very efficient technique was developed to generate multiple candidates of the tonal syllables to construct a tonal syllable lattice for the linguistic processor to work on. In the syllable matching process mentioned above, for any possible utterance section, not only the most likely tonal syllable but all the N-best tonal syllable candidates are stored for backtracking. Therefore, as soon as the most likely syllable segmentation is obtained, the top N tonal syllable candidates for each utterance section are obtained simultaneously. The obvious advantage of such a multiple candidate searching technique is that it is suitable for real-time implementation, because no further computation is needed. In fact, as will be shown below, it also provides higher accuracy, i.e., a tonal syllable lattice including a higher percentage of correct tonal syllables is produced.

TABLE VI RESULTS FOR TONAL SYLLABLE RECOGNITION BY THE CONCATENATED SYLLABLE MATCHING (CSM) ALGORITHM

The experimental results are listed in Table VI. The best recognition accuracy achieved here for tonal syllables is 81.4%, while those for base syllables and tones are 88.7% and 91.6%, respectively. Note that these recognition accuracies for both base syllables and tones are better than those obtained in Table IV (88.3% and 89.8%); the accuracy for tones in particular is significantly improved, because here the information in base syllable and tone recognition is integrated and extra information such as energy contour dips is applied. The top N recognition accuracies are also listed in Table VI. Note that although the top one recognition accuracy achieved here is not very high (81.4% for tonal syllables), the top five recognition accuracy of 97.1% is in fact reasonable as long as a good linguistic processor is applied.

Fig. 5. User interface of the prototype system.

Fig. 6. Overall picture of the complete prototype system, Golden Mandarin (III) 3.0 workstation version.

VI. LINGUISTIC PROCESSING
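The tonal syllable lattice handed to the linguistic processor comes from the concatenated syllable matching recursion in (1). The following is a minimal sketch of that recursion under several assumptions that are not in the paper: scores are treated as log-likelihoods supplied by two hypothetical callables `score_bs` and `score_t`; when several beginning points lead to the same ending point, the best one is kept (the maximization over beginning points is implied by the dynamic programming but not written out in (1)); and only the single best candidate per section is stored, whereas the system described above keeps the N best.

```python
def csm_decode(starts, n_frames, d_min, d_max, syllables, score_bs, score_t, w=0.5):
    """Best synchronized tonal-syllable segmentation of one utterance.

    starts    : candidate syllable beginning frames (energy dips), including 0
    syllables : list of (base_syllable, tone) pairs, i.e. tonal syllables
    score_bs  : score_bs(base, x, end) -> log-score of the base syllable model
    score_t   : score_t(tone, x, end)  -> log-score of the tone model
    Returns (total_score, [(x, end, (base, tone)), ...]).
    """
    starts = sorted(set(starts))
    # A section may end just before any later beginning point, or at the last frame.
    ends = sorted({s - 1 for s in starts if s > 0} | {n_frames - 1})
    T = {-1: 0.0}   # accumulated score; T[-1] is the empty prefix
    back = {}       # end frame -> (beginning frame, best tonal syllable)

    for end in ends:
        for x in starts:
            if not (d_min <= end - x + 1 <= d_max) or (x - 1) not in T:
                continue

            def combined(syl):
                base, tone = syl
                return w * score_bs(base, x, end) + (1 - w) * score_t(tone, x, end)

            best_syl = max(syllables, key=combined)   # max over i in (1)
            total = T[x - 1] + combined(best_syl)
            if end not in T or total > T[end]:        # assumed max over x
                T[end] = total
                back[end] = (x, best_syl)

    if (n_frames - 1) not in T:
        return None, []
    path, end = [], n_frames - 1
    while end >= 0:                                   # backtracking
        x, syl = back[end]
        path.append((x, end, syl))
        end = x - 1
    return T[n_frames - 1], path[::-1]


# Toy example with made-up log-scores for a 6-frame utterance.
bs_table = {("ma", 0, 2): -1.0, ("ba", 0, 2): -3.0, ("ma", 3, 5): -4.0, ("ba", 3, 5): -1.0}
t_table = {(1, 0, 2): -3.0, (3, 0, 2): -1.0, (1, 3, 5): -1.0, (3, 3, 5): -2.0}

score, path = csm_decode(
    starts=[0, 3], n_frames=6, d_min=2, d_max=4,
    syllables=[("ma", 1), ("ma", 3), ("ba", 1)],
    score_bs=lambda base, x, end: bs_table.get((base, x, end), -10.0),
    score_t=lambda tone, x, end: t_table.get((tone, x, end), -10.0),
)
```

Because base syllable and tone scores are combined for the same section `(x, end)`, the two recognizers can never drift apart in their segmentation, which is the synchronization property the algorithm is designed for.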

With the top N tonal syllables recognized and a tonal syllable lattice obtained as described above, this lattice is first transformed into a word lattice via a lexical access process in the linguistic processor, as in the lower half of Fig. 1. Because every tonal syllable is in general shared by many homonym characters, and it is possible for a character to form either a monocharacter word or polycharacter words after combining with adjacent characters as mentioned previously, such a word lattice can be very large and complicated, especially when the top N tonal syllables are included and N is large. A partial list of a simplified example word lattice is shown in Fig. 3. In Fig. 3, each tonal syllable has only one or two candidates but, in practice, there can be many more. With such complicated word lattices, the linguistic decoder certainly requires a very powerful Chinese language model to find the most probable output characters, words, and sentences.

A word-class-based Chinese language model was thus developed and used in this research. A three-stage hierarchical word classification algorithm was first specially designed to cluster the words into classes, with different strategies used in each stage as shown in Fig. 4, and the language model was then developed based on these word classes. In the first stage, all the words in the lexicon are grouped based on their linguistic features according to the parts-of-speech (POS) assigned by linguists. For example, if a word can have three different POS’s, e.g., Active-Verb-B, Proper-Noun-A, and Proper-Noun-C, it will be grouped with the words having exactly these three POS’s. In this way, about 200 POS’s were used and about 950 initial classes were obtained from a lexicon of roughly 100 000 words. In the second stage, the words in the same class, which are believed to have similar syntactic behavior, are further grouped into smaller classes based on their statistical behavior. In other words, the words having similar statistical word-co-occurrence features with adjacent words in a corpus are further grouped into even smaller classes. In the third stage, however, in order to avoid too strict a classification, some classes obtained in the second stage but with different POS features can be merged together, even though they were separated in the first stage, if the statistical similarity measure between them so indicates. In general, with the above procedure, words clustered in the same class have very similar syntactic and semantic behavior, because both their POS features and their statistical co-occurrence relations are very similar. Furthermore, for rarely used words with insufficient statistical information, the classification is still satisfactory because the POS features have been carefully used. This is why the language models developed based on this word classification algorithm turn out to be very reliable and robust, with significantly reduced model parameters at scalable size. The training corpus used in training the language model includes texts from newspapers, articles from magazines, and parts of various novels, with a total of 18 384 664 characters. It was found that around 980 word classes give very good results.

In our system, a word-class-based Chinese language model derived from the word classes obtained above was used. Viterbi search based on this language model was performed in the linguistic decoder to find the optimal character or word sequence (a word is composed of one to several characters) which maximizes the following sentence probability:

$$P(S) = p_b(c_1) \times p(w_1 \mid c_1) \times \prod_{1 \le k \le K-1} \left[ p(c_{k+1} \mid c_k) \times p(w_{k+1} \mid c_{k+1}) \right] \times p_e(c_K) \tag{2}$$

where the sentence $S$ has $K$ words, i.e., $S = w_1 w_2 \cdots w_K$; $c_k$ is the word class containing the $k$th word $w_k$; and $p_b(c)$ and $p_e(c)$ are the probabilities that a word belonging to class $c$ appears at the beginning and end of a sentence, respectively.

Several improved approaches were further applied in the linguistic decoder. First, the acoustic recognition scores for the tonal syllable candidates can be integrated into (2) as follows:

$$P(S) = p_b(c_1) \times p(w_1 \mid c_1) \times p_a(w_1) \times \prod_{1 \le k \le K-1} \left[ p(c_{k+1} \mid c_k) \times p(w_{k+1} \mid c_{k+1}) \times p_a(w_{k+1}) \right] \times p_e(c_K) \tag{3}$$

where $p_a(w)$ represents the acoustic recognition score for the word $w$, which is the product of the acoustic recognition scores of all the component tonal syllables of the word $w$. Second, the acoustic recognition scores for the tonal syllables can be further weighted by other parameters such as the tonal syllable bigram and unigram before they are applied in (3). Also, the tonal syllable candidates with original acoustic recognition scores lower than the best score of the top one tonal syllable candidate by a given threshold can simply be discarded, and only the rest of the tonal syllable candidates are used to construct the word lattice. Furthermore, some of the tone sandhi phenomena in continuous Mandarin speech have been included here. For example, a simple tone sandhi rule is that when a syllable of Tone 3 is followed by another of Tone 3, the first one will be pronounced as Tone 2. Therefore, when the obtained tonal syllable lattice includes any tonal syllable pair with a Tone 2 followed by a Tone 3, the first syllable is always considered to be possibly of Tone 3 as well.

TABLE VII FINAL RECOGNITION ACCURACIES FOR CHINESE CHARACTERS

The final results for character accuracy with different numbers (N) of syllable candidates included in the tonal syllable lattice are listed in the last row of Table VII. The best recognition accuracy for characters achieved here is 92.2%, when 20 tonal syllable candidates (or 98.9% of correct tonal syllables) are included in the tonal syllable lattice. However, the recognition accuracy does not change significantly when the number of tonal syllables included in the word lattice is increased from 10 to 20 (92.1% to 92.2%), apparently because the percentage of correct tonal syllables included increases only very slightly (98.7% to 98.9%).
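The sentence probabilities in (2) and (3) are products of many small factors, so a decoder would normally evaluate them as sums of logarithms. The sketch below scores one fixed word sequence in that way; it does not perform the Viterbi search over the word lattice. The function name, the dictionary-based probability tables, and the toy words and classes are all hypothetical, and passing `p_acoustic=None` gives (2) while passing a table of word-level acoustic scores gives (3).

```python
import math


def sentence_log_prob(words, word_class, p_begin, p_end, p_trans, p_word, p_acoustic=None):
    """Log of the sentence probability in (2), or (3) when p_acoustic is given.

    word_class : word -> class c_k
    p_begin    : class -> p_b(c);  p_end : class -> p_e(c)
    p_trans    : (c_k, c_k+1) -> p(c_k+1 | c_k)
    p_word     : (word, class) -> p(w | c)
    p_acoustic : word -> p_a(w), product of its tonal-syllable scores
    """
    classes = [word_class[w] for w in words]

    def word_term(k):
        lp = math.log(p_word[(words[k], classes[k])])
        if p_acoustic is not None:
            lp += math.log(p_acoustic[words[k]])
        return lp

    total = math.log(p_begin[classes[0]]) + word_term(0)
    for k in range(len(words) - 1):
        total += math.log(p_trans[(classes[k], classes[k + 1])]) + word_term(k + 1)
    return total + math.log(p_end[classes[-1]])


# Toy two-word sentence with made-up probabilities (powers of 1/2 for easy checking).
toy_words = ["w1", "w2"]
toy_class = {"w1": "N", "w2": "V"}
toy_tables = dict(
    p_begin={"N": 0.5},
    p_end={"V": 0.25},
    p_trans={("N", "V"): 0.5},
    p_word={("w1", "N"): 0.25, ("w2", "V"): 0.5},
)

lp_text_only = sentence_log_prob(toy_words, toy_class, **toy_tables)
lp_with_acoustic = sentence_log_prob(
    toy_words, toy_class, p_acoustic={"w1": 0.5, "w2": 0.25}, **toy_tables
)
```

Because each factor depends only on the current and previous word class, the same terms can serve as arc scores in a Viterbi search over the word lattice, which is how the text describes the decoder.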

VII. SYSTEM INTEGRATION AND CONCLUSION

With all the key techniques presented above, a prototype system of a continuous speech Mandarin dictation machine for the Chinese language with very large vocabulary but very limited training data has been successfully implemented on a Sparc 20 workstation as a software-only system. The prototype system was named the Golden Mandarin (III) 3.0 workstation version, as a continuation of the Golden Mandarin prototype system series [1], [2]. The overall system architecture is shown in Fig. 1, where the upper half includes peripherals such as the front-end signal processor and the user interface. Figs. 5 and 6 show the user interface of the prototype system and an overall picture of the complete system, respectively. Recognition errors can be corrected either by choosing the desired tonal syllables or characters from the candidate lists shown in a window using the mouse, or directly by voice. The on-line learning functions for both the acoustic and language models previously developed in Golden Mandarin (II) [2] are also integrated in this system. The overall CPU time needed to recognize an utterance on the Sparc 20 workstation is about 1.3 times the length of the speech utterance.

REFERENCES

[1] L.-S. Lee et al., “Golden Mandarin (I): A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary,” IEEE Trans. Speech Audio Processing, vol. 1, pp. 158–179, Apr. 1993.
[2] L.-S. Lee et al., “Golden Mandarin (II): An improved single-chip real-time Mandarin speech dictation machine for Chinese language with very large vocabulary,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1993, vol. 2, pp. 503–506.
[3] J.-K. Chen, F. K. Soong, and L.-S. Lee, “Large vocabulary word recognition based on tree-trellis search,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1994, vol. 2, pp. 137–140.
[4] H.-W. Hon et al., “Toward large vocabulary Mandarin Chinese speech recognition,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 1994, vol. 1, pp. 545–548.
[5] L. R. Rabiner, J. G. Wilpon, and B. H. Juang, “A segmental K-means training procedure for connected word recognition,” AT&T Tech. J., vol. 65, pp. 21–31, May–June 1986.
[6] H.-M. Wang, R. Lyu, J.-L. Shen, and L.-S. Lee, “An initial study on large-vocabulary continuous Mandarin speech recognition with limited training data based on sub-syllabic models,” in Proc. Int. Computer Symp., Hsinchu, Taiwan, R.O.C., 1994, pp. 1140–1145.