Speech Synthesis for Error Training Models in CALL

Xin Zhang¹, Qin Lu², Jiping Wan¹, Guangguang Ma¹, Tin Shing Chiu², Weiping Ye¹, Wenli Zhou¹, and Qiao Li¹

¹ Department of Electronics, Beijing Normal University
[email protected]
² Department of Computing, Hong Kong Polytechnic University
Abstract. A computer assisted pronunciation teaching (CAPT) system is a fundamental component of a computer assisted language learning (CALL) system. A speech recognition based CAPT system often requires a large amount of speech data to train the incorrect phone models in its speech recognizer, but collecting incorrectly pronounced speech data is labor-intensive and costly. This paper reports an effort to train the incorrect phone models using synthesized speech data. A special formant speech synthesizer is designed to transform correctly pronounced phones into incorrect ones by modifying their formant frequencies. Within a Chinese Putonghua CALL system for native Cantonese speakers learning Mandarin, a small experimental CAPT system is built with a recognizer trained on synthetic speech data. Evaluation shows that a CAPT system using synthesized data can perform as well as, or even better than, one using real data, provided that the synthetic data set is large enough.

Keywords: training data preparation, computer aided language learning, speech synthesis, formant modification.
1 Introduction

With the rapid development of computing technologies such as speech processing, multimedia and the internet, computer-assisted language learning (CALL) systems are becoming more comprehensive [1]. A computer assisted pronunciation teaching (CAPT) system, as the basic component of a CALL system, gives feedback on a learner's mispronunciations in order to improve his pronunciation. With an embedded automatic speech recognizer, a CAPT system is able to score the pronunciation quality of the learner. A simple CAPT system, which gives a single evaluation score based on different speech features, can only tell how well a learner performs, but not how he can improve. More sophisticated error model based CAPT systems can give suggestions on how to improve the learner's pronunciation [2][3][4].

A more comprehensive speech recognizer in CAPT needs acoustic models for both the correctly pronounced phones and the incorrectly pronounced ones. Consider a native Chinese speaker learning the English word “flag” ([flæg]). Since [æ] is not a Chinese phone, the learner is most likely to utter it either as [fleg] or [flag]. Having both error phone models as well as the correct model, the recognizer can tell not only how wrong
the actual pronunciation is, as a score, but also which error model was most likely pronounced. As a result, the system can give a learner concrete suggestions such as “put your tongue a little downward” to move the pronunciation from [fleg] towards [flæg]. Obviously, a comprehensive CAPT system like this is much better than one which can only give a simple score without specific suggestions. Yet a comprehensive CAPT needs a large amount of incorrectly pronounced speech data to train the error models, obtained through extensive recording followed by error type clustering using either a manual or an automatic classification method. Without enough training data, the error models face a data sparseness problem. Collecting incorrectly pronounced speech data is labor-intensive and costly, and incorrectly pronounced speech data is much more difficult to obtain than correct data [5].

Instead of exhaustive manual recording and classification to minimize the effect of the data sparseness problem, this paper proposes an alternative method to produce incorrect speech data through formant speech synthesis. Analysis of a small set of samples of correct/incorrect phone pairs yields the formant relationships between these pairs. According to the obtained formant relationships, the formant frequencies in the vocal tract filter of the correct phone are mapped into those of the incorrect ones. In this way, all the incorrect speech data are obtained from their correct counterparts. The characteristics of the speakers and other environmental variations of the original correct speech are preserved by keeping the original formant bandwidths and the LPC residual. A small experimental system is built for demonstration and evaluation of the proposed method. Experiments show that the model trained on synthetic incorrect speech performs as well as its prerecorded counterpart, and with enough generated synthetic data, it can actually outperform the prerecorded counterpart.

The rest of the paper is organized as follows. The basic design idea and methodology are explained in Section 2. Experiments and results are detailed in Section 3. Section 4 is the conclusion.
2 Design and Methodology

2.1 Basic Design Idea of the Formant Synthesizer

Fig. 1 shows a general speech synthesizer. The synthesis process can be modeled as a vocal tract filter, a linear time-invariant system, excited by the glottal sound as its input signal; ŝ(n) is the synthesized speech. The vocal tract filter can be characterized by formant frequencies and formant bandwidths. The synthesizer for error phone production in this work uses a vocal tract filter specified by the first four formant frequencies F1 to F4 and the first four bandwidths B1 to B4.

To excite any vocal tract, a sound source is needed. In this paper, the linear prediction (LPC) residual is used as the sound source. By extracting the LPC residual from the prerecorded speech material, the pitch and voice of the speaker remain the same. The following is a brief description of linear prediction analysis; a more detailed treatment can be found in the literature [9]. The sampled speech waveform s(n) can be approximated by another sequence ŝ(n) through linear prediction from the past p samples of s(n):
Fig. 1. The basic structure of the speech synthesizer

    ŝ(n) = Σ_{k=1}^{p} a_k s(n − k)    (1)

where p is the prediction order. The linear prediction coefficients a_k can be determined by minimizing the mean squared prediction error. The LPC residual is:

    e(n) = s(n) − ŝ(n)    (2)
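As a concrete illustration of Eqs. (1) and (2), the following minimal sketch performs the autocorrelation-method LPC analysis and residual extraction described above. The prediction order p = 18 is the value used later in Section 2.2; the function names and the use of NumPy/SciPy are our own assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc_autocorrelation(frame: np.ndarray, p: int = 18) -> np.ndarray:
    """Estimate LPC coefficients a_1..a_p by the autocorrelation method (Eq. 1)."""
    # Autocorrelation values r(0)..r(p) of the (windowed) speech frame.
    r = np.array([frame[: len(frame) - k] @ frame[k:] for k in range(p + 1)])
    # Minimizing the mean squared prediction error yields the Toeplitz
    # normal equations R a = r, solved here directly.
    return solve_toeplitz(r[:p], r[1 : p + 1])

def lpc_residual(frame: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Inverse-filter the frame to get the residual e(n) = s(n) - s_hat(n) (Eq. 2)."""
    # The analysis (inverse) filter is A(z) = 1 - sum_k a_k z^{-k}.
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
```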
Formants are the resonances of the vocal tract; they are manifested in the spectral domain by energy maxima at the resonant frequencies. The frequencies at which the formants occur depend primarily on the shape of the vocal tract, which is determined by the positions of the articulators (tongue, lips, jaw, etc.). The relationship, or ratio, between adjacent formants is called the formant structure. Previous research demonstrates that the quality (correct or incorrect) of a phone is mainly determined by the formant frequency structure, whereas the sound source, formant bandwidths, absolute values of formant frequencies and other speech parameters mostly depend on the speaker, the speaker's state and emotion, and the environmental conditions [6]. A reasonable inference is that by modifying only the formant structure, a correct phone can be synthesized into an incorrect phone while the other speech characteristics are kept unchanged. Consequently, a phone model trained on such synthesized incorrect speech data is likely to be as robust to variations in speaker and other conditions as its correct counterpart.

Table 1 shows the first four formant frequency values of two different phones, [a:] and [e], read by two different speakers, A and B. The ratios of the corresponding formant frequencies between the two phones are also listed. Table 1 shows that the formant structures of both speakers are very similar for the same phonemes even though the absolute formant values differ considerably. In fact, the ratios of the two phonemes from the two speakers are almost the same. This suggests that the formant frequency ratio of different phones obtained from a small set of speech data can be applied to generate a larger set of speech data.
Table 1. The first four formant frequency values (Hz) of vowels [a:] and [e] recorded by two speakers; Ratio is Fi([e])/Fi([a:]), i = 1, 2, 3, 4

                 Speaker A                         Speaker B
         F1      F2      F3      F4        F1      F2      F3      F4
[a:]     802     1126    2933    3483      1128    1632    3453    4520
[e]      588.4   1970    2893    3500      800     2213    3143    4345
Ratio    0.734   1.749   0.986   1.005     0.71    1.36    0.91    0.96
This invariance of the formant ratio is the key to building the incorrect speech synthesizer. By scaling the formant frequencies by the ratio of the incorrect phone to the correct one, a correct phone can be synthesized into an incorrect one. The principle in this work is to synthesize a large amount of incorrect speech data from correct speech data using ratios obtained from a much smaller set of sample data. The characteristics of speakers, emotions and other variations are kept unchanged from the original correct speech data.
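The ratio invariance in Table 1 can be checked directly from the tabulated values; the following snippet is only a sketch reproducing the Ratio rows.

```python
import numpy as np

# First four formant frequencies (Hz), taken from Table 1.
formants = {
    ("A", "a:"): [802.0, 1126.0, 2933.0, 3483.0],
    ("A", "e"):  [588.4, 1970.0, 2893.0, 3500.0],
    ("B", "a:"): [1128.0, 1632.0, 3453.0, 4520.0],
    ("B", "e"):  [800.0, 2213.0, 3143.0, 4345.0],
}

for spk in ("A", "B"):
    # Per-formant ratio Fi([e]) / Fi([a:]) for this speaker.
    ratio = np.array(formants[(spk, "e")]) / np.array(formants[(spk, "a:")])
    print(spk, np.round(ratio, 3))
# A [0.734 1.75  0.986 1.005]
# B [0.709 1.356 0.91  0.961]
```

Both speakers yield nearly the same ratio vector despite very different absolute formant values, which is what justifies applying ratios learned from a few sample speakers to a larger data set.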
2.2 Auto-synthesis Program Procedures

Some preparation is needed before auto-synthesis. Firstly, the common types of pronunciation errors are identified through expert analysis and summarization. This requires a good understanding of the two dialects and their phonetic differences so that conceptual error models can be established. Based on the analysis result, correct and incorrect speech data are then compared to identify the relationships between their first four formant frequencies. As a result, the modification ratio for each formant can be determined. After this preparation, the prerecorded correct speech data and the corresponding formant ratios are input to the auto-synthesizer to generate the incorrect speech data one by one, as shown in the block diagram in Fig. 2.

The proposed auto-synthesis has three steps. Step One is the LPC analysis. The LPC coefficients of the prerecorded correct speech data are computed by the autocorrelation method. The LPC residual of the prerecorded correct speech data is extracted by inverse filtering, and the first four formant frequencies of the same data are determined by solving the LPC equation [7]. Here the prediction order is p = 18. Step Two is the formant frequency modification. The formant frequencies of the prerecorded correct speech data are multiplied by the predetermined ratios to obtain the modified formant frequencies for the incorrect synthetic speech data. A new vocal tract filter for the incorrect synthetic speech is built using the modified formant frequency values. Step Three is the synthesis. The LPC residual is used to excite the new vocal tract filter, producing the synthesized incorrectly pronounced speech data. A code sketch of these three steps is given after Fig. 2.

The proposed method has two main advantages. Firstly, it maintains the characteristics of the speakers in the training data by keeping the original LPC residual and formant bandwidths. Multiplying the formant frequencies only by a ratio also preserves the differences in formant location caused by different vocal tract sizes. Secondly, the relationship between formant frequencies and the type of pronunciation error, derived from a small set of speech data, can be used to modify other speech data outside this small set.
Fig. 2. Block-diagram of auto-synthesis program
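The block diagram can be summarized in code. The sketch below follows the three steps under stated assumptions: formant frequencies and bandwidths are read off the complex roots of the LPC polynomial (cf. [7]), and the modified vocal tract filter is rebuilt as a cascade of second-order resonators that reuse the original bandwidths, as the paper prescribes. It reuses lpc_autocorrelation and lpc_residual from the earlier sketch; framing, windowing and overlap-add are omitted, and all names are ours.

```python
import numpy as np
from scipy.signal import lfilter

def formants_from_lpc(a: np.ndarray, fs: float, n_formants: int = 4):
    """Step One (cont.): formant frequencies/bandwidths from the LPC poles."""
    # Poles of 1/A(z) are the roots of z^p - a_1 z^{p-1} - ... - a_p.
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[roots.imag > 0]                  # one root per conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)   # pole frequency in Hz
    bws = -np.log(np.abs(roots)) * fs / np.pi      # pole bandwidth in Hz
    order = np.argsort(freqs)
    # Take the lowest four resonances; a real formant tracker would also
    # discard poles with implausibly wide bandwidths.
    return freqs[order][:n_formants], bws[order][:n_formants]

def vocal_tract_filter(freqs, bws, fs):
    """Step Two: all-pole filter with one second-order resonator per formant."""
    den = np.array([1.0])
    for f, bw in zip(freqs, bws):
        r = np.exp(-np.pi * bw / fs)               # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs               # pole angle from frequency
        den = np.convolve(den, [1.0, -2.0 * r * np.cos(theta), r * r])
    return den                                     # denominator D(z) of 1/D(z)

def synthesize_error_phone(frame, ratios, fs, p=18):
    """Steps One to Three: analyze, scale F1..F4 by K1..K4, re-excite."""
    a = lpc_autocorrelation(frame, p)              # from the earlier sketch
    residual = lpc_residual(frame, a)              # original sound source kept
    freqs, bws = formants_from_lpc(a, fs)
    freqs = freqs * np.asarray(ratios, dtype=float)  # e.g. a row of Table 3
    den = vocal_tract_filter(freqs, bws, fs)       # bandwidths left unchanged
    return lfilter([1.0], den, residual)           # Step Three: synthesis
```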
3 Experiment and Results

The incorrectly pronounced speech data synthesized in this experiment is used in rEcho [4], a Chinese Putonghua CAPT system that teaches native Cantonese speakers to speak Mandarin. Three syllables/characters that are typically confusing for native Cantonese speakers, “ge [kɤ]”, “he [xɤ]”, and “xi [ɕi]”, are chosen for the experiment [8]. The types of incorrectly pronounced syllables for each choice are shown in Table 2. The corresponding syllables/characters are embedded in carefully chosen sentences for the purpose of pronunciation error detection. Table 3 shows the three sentences containing these phones: (1) “李大[哥]卖面包 (li3 da4 ge1 mai4 mian4 bao1)” with [kɤ], (2) “阿力[喝]咖啡 (a1 li4 he1 ka1 fei1)” with [xɤ], and (3) “阿美有[习]题答案 (a1 mei3 you3 xi2 ti2 da2 an4)” with [ɕi]. Another six sentences with the incorrect pronunciations shown in Table 2 can then be constructed.
Table 2. Types of incorrectly pronounced syllables

Syllable   Error type 1   Error type 2
[kɤ]       [kε]           [kɔ]
[xɤ]       [xε]           [xɔ]
[ɕi]       [ʂi]           [si]
Table 3. Sentences and modification ratios (Ki is the ratio applied to formant Fi)

Sentence                                    Modification   K1     K2    K3   K4
李大哥卖面包 (li da ge mai mian bao)          [kɤ]→[kε]      1      1.5   1.5  1
                                            [kɤ]→[kɔ]      0.9    0.8   2    0.9
阿力喝咖啡 (a li he ka fei)                   [xɤ]→[xε]      1      1.5   1.5  1
                                            [xɤ]→[xɔ]      0.9    0.8   2    0.9
阿美有习题答案 (a mei you xi ti da an)         [ɕi]→[ʂi]      0.67   0.7   1    1
                                            [ɕi]→[si]      1.4    1.2   1    1
The aim of the experiments is to examine the performance of the synthesized incorrect speech data when applied to the CAPT system. Two speech training databases are used to compare the acoustic models trained on synthesized data with those trained on prerecorded data. The prerecorded speech database is from a group of 20 male and 20 female speakers who are native Mandarin speakers able to imitate Cantonese-accented speech. They are requested to utter each of the 9 sentences twice. When asked to pronounce the incorrect phones, they are instructed to follow the error models and types summarized by the experts. The synthetic speech database is obtained by modifying the correct speech data in the prerecorded database. In other words, the prerecorded database contains all 9 sentences, 3 with correct pronunciations and 6 with incorrect pronunciations, whereas the synthetic database contains only the 6 sentences with incorrect phonemes. There are 80 samples of each sentence in both databases. The modification ratios used by the algorithm to generate the synthesized database are shown in Table 3, where K1~K4 are the modification ratios of F1~F4, respectively.

3.1 Evaluation by Open Test

In this section, the performance of the CAPT using the prerecorded and the synthesized speech data is evaluated on prerecorded testing data. For each sentence, the training sets of the prerecorded and synthesized models comprise 60 sentences each, randomly selected from the prerecorded and synthesized databases. The test sets for the two models are the same, comprising 20 prerecorded sentences not contained in the training set. Fig. 3 shows the evaluation result expressed as character error rate (cer). [kɤ], [kε] and [kɔ] represent the 1 correct and 2 synthetic incorrect versions of the 1st sentence “李大哥卖面包 (li3 da4 ge1 mai4 mian4 bao1)”. [xɤ], [xε], [xɔ] and [ɕi], [ʂi], [si] represent the 2nd and 3rd sentences respectively. Fig. 3 shows that with an equal amount of training sentences, the models trained on synthetic sentences perform worse overall than their prerecorded counterparts. Among the 3 sentences, [xɤ] and [ɕi] in the 2nd and 3rd sentences are relatively more difficult to synthesize because of the fricative consonants in them. Therefore, the 2nd and the 3rd sentences have higher cer than the 1st sentence.
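The paper does not spell out how cer is computed; a standard definition is the character-level edit distance normalized by the reference length, sketched below. The function is our assumption, and the example sentence is from Table 3.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # d[i][j] = edit distance between reference[:i] and hypothesis[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n] / m

print(cer("李大哥卖面包", "李大哥卖包"))  # one deletion over 6 characters = 0.167
```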
[Figure: bar chart of error rate (%) for [kɤ], [kε], [kɔ], [xɤ], [xε], [xɔ], [ɕi], [ʂi], [si], comparing prerecorded (pre) and synthesized (syn) training data]

Fig. 3. Error rates using prerecorded data for testing
It is understandable that the CAPT performs better using prerecorded real data for training than using an equal amount of synthesized data. However, synthetic phones are artificially synthesized with a ratio; by deliberately varying the ratio, more synthesized sentences can be generated to enlarge the training set, a liberty prerecorded data does not offer. To further investigate the effect of the size of the synthesized data set on the performance of the CAPT, another set of experiments is conducted using the sentence “阿力喝咖啡 (a1 li4 he1 ka1 fei1)” with different sizes of synthesized data. Results are shown in Fig. 4. The system trained on 80 synthesized sentences gives a cer of 55% for the error phone [xε], yet when the synthesized data size is increased to 640, the cer is reduced to 11.1%, which is basically the same as with the prerecorded data. In the case of [xɔ], the synthetic cer decreases from 33% with 80 training sentences to 10% with 640 sentences, which is even better than the prerecorded counterpart of 20%. Generally speaking, the more synthesized data are used, the better the cer values become; in fact, cer decreases quite rapidly as the number of synthetic training sentences increases. The results also indicate that in a system with limited real speech data for training, synthesized speech data is not only useful, it can even outperform real data provided that sufficient synthesized data are used. The additional synthesized data are obtained by adjusting the formant ratios by up to 10% in both directions.
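How the extra ratio variants are sampled within the ±10% range is not specified; one plausible scheme, uniform perturbation of each of K1~K4, is sketched below with names of our own choosing.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def perturb_ratios(base_ratios, n_variants, max_dev=0.10):
    """Enlarge the synthetic set: vary each formant ratio by up to +/-10%."""
    base = np.asarray(base_ratios, dtype=float)
    # Uniform multiplicative deviations within [1 - max_dev, 1 + max_dev].
    factors = rng.uniform(1.0 - max_dev, 1.0 + max_dev,
                          size=(n_variants, base.size))
    return base * factors  # one row of K1..K4 per extra synthetic utterance

# e.g. ratio sets for 640 variants of [xɤ]->[xε], base ratios from Table 3:
ratio_sets = perturb_ratios([1.0, 1.5, 1.5, 1.0], n_variants=640)
```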
Fig. 4. cer (%) of the 2nd sentence with more synthetic training sentences (pre = prerecorded, sys = synthesized; columns give the number of training sentences); recovered data:

        80pre   80sys   160sys   320sys   640sys
[xɤ]    10      11.1    5.6      6        11.1
[xε]    0       55      30       5        5
[xɔ]    20      35      30       20       10
3.2 Test by Native Cantonese Speakers

In this section, the prerecorded and the synthesized data are each used to train the CAPT system. The test data set is recorded by 4 male and 3 female native Cantonese speakers who cannot speak Chinese Putonghua well. This closely resembles the real working conditions of a CAPT system.
Fig. 5. cer (%) of native Cantonese speaker testing (pre = prerecorded, sys = synthesized; columns give the number of training sentences); recovered data:

        80pre   80sys   160sys   320sys   640sys
[ʂi]    0       33      50       33.3     0
[si]    17      33      25       0        0
Fig. 5 shows the cer for the 3rd sentence, “阿美有习题答案 (a1 mei3 you3 xi2 ti2 da2 an4)”. In this sentence, the syllable [ɕi] (习) is incorrectly pronounced by all the native Cantonese speakers, who mispronounce it as [si] or [ʂi]; Fig. 5 therefore shows the cer of [si] and [ʂi] only. With the same number of training speech samples (80), the model trained on synthetic speech performs worse than the prerecorded model by 33% in cer. As the synthetic training data increases, the cer of the synthetic speech trained model decreases rapidly. The model trained on 640 synthetic sentences has a cer of 0 for both [si] and [ʂi], which is better than the system using the prerecorded data. These experiments show that, for error model training in a CAPT system, synthesized data can substitute for prerecorded speech provided that the training set is reasonably large. The test conducted in this subsection is more convincing than the test in Section 3.1: the test speech data in Section 3.1 were recorded by people who speak Chinese Putonghua (Mandarin) very well and produced the incorrect pronunciations on instruction, whereas the test speech data in this subsection were recorded by native Cantonese speakers who cannot pronounce the words correctly.
4 Conclusion

This paper presented a novel approach that uses synthesized data for error model training in a CAPT system. Results show that, given a reasonably large amount of synthesized speech data derived from correct and incorrect speech samples, a CAPT system can give performance comparable to its prerecorded counterpart. This points the way towards more comprehensive CAPT systems, which require a large amount of training data, especially incorrectly pronounced data. More experiments can be conducted on the appropriate size of the synthetic data set. Other phonetic features may also be investigated for their effect on synthesis.
Acknowledgement

This project is partially funded by the Hong Kong Polytechnic University (Grant No.: A-PF84).
References

1. Bailin, A.: Intelligent Computer-Assisted Language Learning: A Bibliography. Computers and the Humanities 29, 375–387 (1995)
2. Bernstein, J., Cohen, M., Murveit, H., Ritschev, D., Weintraub, M.: Automatic evaluation and training in English pronunciation. In: ICSLP 1990, Kobe, Japan, pp. 1185–1188 (1990)
3. Ronen, O., Neumeyer, L., Franco, H.: Automatic detection of mispronunciation for language instruction. In: Eurospeech 1997 (1997)
4. Zhou, W., et al.: A computer aided language learning system based on error trend grouping. In: IEEE NLP-KE 2007, pp. 256–261 (2007)
5. Anderson, O., Dalsgaard, P., Barry, W.: On the use of data-driven clustering technique for identification of poly- and mono-phonemes for four European languages. In: ICASSP 1994 (1994)
6. Klatt, D.H.: Software for a cascade/parallel formant synthesizer. Massachusetts Institute of Technology, Cambridge, Massachusetts (October 1979)
7. McCandless, S.S.: An algorithm for automatic formant extraction using linear prediction spectra. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-22(2) (April 1974)
8. Wang, L.: How Cantonese Speakers Learn Putonghua (广东人怎样学习普通话). Peking University Press (1997)
9. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall International, Inc., Englewood Cliffs (1993)