Automatic Detection of Tone Mispronunciation in Mandarin

Li Zhang 1,2,*, Chao Huang 2, Min Chu 2, Frank Soong 2, Xianda Zhang 1, Yudong Chen 2,3

1 Tsinghua University, Beijing, 100084
2 Microsoft Research Asia, Beijing, 100080
3 Communication University of China, 100024

[email protected], {chaoh, minchu, frankkps}@microsoft.com, [email protected]
Abstract. In this paper we present our study on detecting tone mispronunciations in Mandarin. Both template-based and HMM-based approaches are investigated. Schematic templates of pitch contours are shown to be impractical because of the large inter- and even intra-speaker variation of pitch range. Statistical Hidden Markov Models (HMMs) are therefore used to generate a Goodness of Pronunciation (GOP) score, and detection is performed against an optimized threshold. To deal with the discontinuity of F0 in speech, multi-space distribution (MSD) modeling is used to build the HMMs. Under the MSD-HMM framework, the detection performance of different choices of features, HMM types and GOP measures is evaluated.
1 Introduction
During the past few years, much progress has been made in the area of computer-assisted language learning (CALL) systems for non-native language learning. Automatic speech recognition (ASR) technology, with a properly defined Goodness of Pronunciation (GOP) measure, has been applied to grade pronunciation quality at both the speaker and sentence levels [4][5][6]. To improve the quality of feedback, precise knowledge of the mispronunciation is required, and methods for detecting mispronunciations in a speaker's utterance have been developed and have achieved good performance [7][8][9]. CALL systems for Mandarin are also desired. Moreover, the proficiency test of Mandarin for Chinese speakers (PSC) has become more and more popular recently. Phonetic experts are required as judges during the test, which makes the test costly, time-consuming and not fully objective, so automatic assessment of Mandarin is highly desirable. Mandarin is a tonal language, and compared with finals and initials, tones are more difficult to pronounce correctly because they are more easily influenced by the dialect of the speaker. In PSC, the goodness of tone pronunciation *
* Li Zhang joined this work as a visiting student at MSRA.
is one of the most important factors used to grade the test taker. However, previous works [1][2][3] did not take the assessment of tone pronunciation into consideration, and earlier studies mostly focused on rating at the sentence and speaker levels. In contrast to existing works, we focus on the automatic detection of tone mispronunciation. In this paper, an analysis of tone mispronunciations occurring in PSC is first given in Section 2. Based on this analysis, two kinds of detection methods are investigated: template-based methods and statistical models. Section 3 describes the template methods. Section 4 investigates the statistical methods based on multi-space distribution Hidden Markov Models (MSD-HMMs) under different setups, such as model types, feature combinations and GOP measures. Section 5 presents the experimental results and analysis. Conclusions are given in Section 6.
2 Some Analysis of Tone Mispronunciations in PSC
There are five tones in Mandarin Chinese, of which the first four are widely used. Although many factors affect the characteristics of tones, pitch is considered the most discriminative. In phonetics, the pitch range is usually divided into 5 levels, numbered 1 to 5 with level 5 the highest. The traditional description of the Mandarin tones is shown in Table 1.

Table 1. Tones of Mandarin

Tone    Tone Symbol   Tone Description
Tone1   55            high level
Tone2   35            high rising
Tone3   214           low falling
Tone4   51            high falling
According to the experts' experience in PSC, the rules for evaluating tone pronunciations are also based on the 5-level description. Common tone mispronunciations are discussed below. Typical examples from our database are plotted in Figure 1; pitch is extracted with the Entropic Signal Processing System (ESPS).

The first tone (55): it is sometimes pronounced unevenly, as shown in Figure 1a. Another representative mistake is that the pitch is not high enough, e.g. 44 or even lower. Such mistakes are common among speakers from some dialect regions of China.

The second tone (35): it is easily confused with the third tone when the descending part of the contour lasts long enough to be perceived, as shown in Figure 1b.

The third tone (214): a rising trend is required at the end of the pitch contour for isolated syllables, a requirement that is relaxed in continuous speech. Some speakers ignore this requirement and pronounce it as 21 (Figure 1c). Others pronounce it as 35, which turns it into the second tone (Figure 1d). Another mistake is a beginning value so high that it is comparable with, or even higher than, the ending value, e.g. 313 or 413; Figure 1e shows an example of 413.

The fourth tone (51): the beginning is sometimes not pronounced high enough. Speakers whose pitch ranges are not wide enough are more likely to make this mistake, as shown in Figure 1f.

[Figure 1: F0 (Hz) versus time (ms) contours of six mispronounced syllables: (a) gang1, (b) cai2, (c) lin3, (d) ou3, (e) kui3, (f) chi4.]

Fig. 1. F0 contours of examples with tone mispronunciations

Based on this analysis, we present two kinds of methods for the detection of tone mispronunciations.
3 Automatic Detection Methods Based on Templates
The experts' judgements agree with most of the results from our analysis of pitch contours, so it seems that analyzing pitch contours alone could accomplish the detection. The third tone is the most prone to be pronounced incorrectly: it covers more than 80% of the mistakes or defective pronunciations in our database, and its mispronunciations take many variant forms. Therefore, we use tone 3 as the case study for the template method.
3.1 Methods Based on Five-Level Description
The most typical mispronunciations of tone 3 are 21, 35, 313, 413 and so on. One way to detect them is first to partition the test speaker's pitch range into five levels according to his pitch contours, then to describe each pitch contour in terms of these levels. If the description is not 214, the pronunciation is judged incorrect.
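As a sketch only, the five-level template check might look like the following. The level partitioning used here (linear quantization of the syllable's own voiced range) is a hypothetical stand-in for the expert 5-level scale, chosen to illustrate the idea rather than to implement it faithfully:

```python
import numpy as np

def five_level_template_check(f0, expected="214"):
    """Quantize a pitch contour into a 5-level scale and compare it
    with an expected tone description (e.g. "214" for tone 3).

    Hypothetical sketch: the speaker's pitch range would in practice
    be estimated from more than a single syllable.
    """
    voiced = f0[f0 > 0]                               # keep voiced frames only
    lo, hi = voiced.min(), voiced.max()
    levels = []
    # One level digit per position in the expected description.
    for seg in np.array_split(voiced, len(expected)):
        rel = (seg.mean() - lo) / (hi - lo + 1e-9)    # 0..1 within the range
        levels.append(str(1 + int(rel * 4.999)))      # map to level 1..5
    desc = "".join(levels)
    return desc, desc == expected
```

A perfectly flat contour, for example, quantizes to level 1 throughout and is rejected against the "214" template.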
This method seems reasonable and very simple. However, it is very difficult to partition the pitch range into five levels: the levels are only relative, and the regions of neighboring levels always overlap. Figure 2 shows the ranges of the 1st and 5th levels for a male speaker who speaks Mandarin very well. Surprisingly, in this case the 5th level, decided by the range of the first tone (55), even overlaps the 1st level, decided by the ending values of the fourth tone (51). The range within a single level is also very wide. These facts make the partition too difficult to implement.

[Figure 2: F0 (Hz) versus time (ms) contours of correct pronunciations of tone1 (left, with upper/lower limits of the 5th level) and tone4 (right, with upper/lower limits of the 5th and 1st levels).]

Fig. 2. Examples of pitch contours for tone1 and tone4 from a single speaker
3.2 Methods Based on Pitch Value Comparison
Given the large overlap in the 5-level quantization of the pitch range shown in the last subsection, we can instead check the relative pitch values within a given tone contour, as shown in Figure 3. For example, we can detect mispronunciations such as 21 and 35 simply by counting the durations of the ascending and descending parts of a contour, and detect mistakes such as 313 and 413 by comparing the values of the beginning and ending parts. However, this is also impractical: first, there are usually elbows at the beginning and end of a pitch contour that make the detection of the beginning and ending segments unreliable, and duration estimation based on such segmentation is even less reliable; second, there are too many threshold parameters to predefine or tune, some of them beyond an expert's perceptual resolution. Compared with the template methods above, statistical methods such as HMMs require fewer predefined thresholds, and feature combinations beyond pitch can be applied flexibly in such a framework.
Fig. 3. Flow chart for the template method
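The duration-counting checks of this template method can be sketched as follows. The frame-count thresholds and decision labels are illustrative assumptions, not the paper's actual parameters; tuning such thresholds reliably is exactly what makes the method impractical:

```python
import numpy as np

def relative_contour_check(f0, min_rise_frames=10, min_fall_frames=10):
    """Classify a tone-3 contour by comparing the durations of its
    falling and rising parts and its beginning/ending pitch values.

    Illustrative sketch of the pitch-value-comparison template.
    """
    voiced = f0[f0 > 0]
    diffs = np.diff(voiced)
    rise = int((diffs > 0).sum())       # frames where pitch ascends
    fall = int((diffs < 0).sum())       # frames where pitch descends
    if rise < min_rise_frames:
        return "21-like"                # no rising tail at the end
    if fall < min_fall_frames:
        return "35-like"                # no initial dip, pure rise
    if voiced[0] >= voiced[-1]:
        return "313/413-like"           # start comparable to or above end
    return "214-like"
```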
4 Automatic Detection Methods Based on Statistical Models
In this section, statistical methods based on HMMs are discussed. In brief: syllables pronounced by golden speakers are used to train speaker-independent HMMs with the phone set of [10], in which each tonal syllable is decomposed into an initial and a tonal final. After forced alignment, a GOP score is computed within the aligned segment, and detection is finally performed with a threshold. Detailed descriptions of the choices of model types, feature vectors and GOP measures are given in the following subsections.
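The pipeline just described can be summarized in code. Here `align_syllable` and `segment_gop` are hypothetical stand-ins for the MSD-HMM forced aligner and GOP scorer, which are not shown:

```python
def detect_mispronunciation(features, syllable, threshold,
                            align_syllable, segment_gop):
    """Sketch of the HMM-based detector: forced alignment locates the
    tonal-final segment, a GOP score is computed over it, and a tuned
    threshold makes the accept/reject decision.

    align_syllable and segment_gop are caller-supplied placeholders
    for the aligner and scorer; returns True when the tone is flagged
    as mispronounced.
    """
    start, end = align_syllable(features, syllable)     # forced alignment
    score = segment_gop(features[start:end], syllable)  # e.g. Eq. (5)
    return score < threshold
```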
4.1 Experimental Corpora
The corpus contains 100 speakers, each reading 1267 syllables. We chose 80 speakers (38 male and 42 female) as the training set; the remaining 20 speakers (10 female and 10 male) are used for testing. From each of these 20 speakers, 100 syllables are randomly selected, giving 2000 syllables in total for expert evaluation. An expert certified at the national level scores each pronunciation as correct or not and points out where the problem lies, e.g. a tone mispronunciation if there is one.
4.2 Choice of Models
Studies have indicated that F0-related features can greatly improve tone recognition accuracy. However, F0 has no observation in unvoiced regions, and several methods have been proposed to deal with this [11][12][14]. Among them, the multi-space distribution (MSD) approach, first proposed by Tokuda [13] for speech synthesis, has also achieved good performance in tonal-language speech recognition [14]. MSD assumes that the observation space consists of multiple subspaces with different priors, and a different distribution form (discrete or continuous pdf) can be adopted for each subspace. We adopt two kinds of models to handle the discontinuity of the F0 feature: MSD models, in which the observation space is modeled with two probability spaces, a discrete one for the unvoiced part and a continuous pdf for the voiced F0 contour; and conventional models, in which random noise is interpolated into the unvoiced regions to keep the F0-related features continuous throughout the utterance. Their tone recognition error rates are compared in our experiments. The acoustic feature vector contains 39-dimensional spectral features (MFCC-E-D-A-Z) and 5-dimensional F0-related features, consisting of logarithmic F0, its first- and second-order derivatives, pitch duration and long-span pitch [15]. In the MSD models, we separate the features into two streams: one is MFCC-E-D-A-Z, the other the 5-dimensional F0-related features; in the conventional models a single 44-dimensional stream is used. Table 2 compares their performance: MSD models outperform conventional models in both the monophone and triphone cases, so MSD models are more suitable for the detection of tone mispronunciation. In addition, tied-state triphone models recognize tones better than monophone models. We also compare their performance on the detection task below.
Table 2. Comparison of tone recognition error rate between conventional HMMs and MSD-HMMs

Model Type              Model Size                            Tone Error Rate (%)
HMMs, monophones        187*3 states, 16 mixtures/state       18.05
MSD-HMMs, monophones    187*3 states, 16 mixtures/state       8.85
HMMs, triphones         1500 tied states, 16 mixtures/state   15.85
MSD-HMMs, triphones     1506 tied states, 16 mixtures/state   7.81
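The two-space MSD emission described above can be illustrated with a minimal single-Gaussian sketch. Real MSD-HMM states use Gaussian mixtures trained by EM; this is only the frame-likelihood idea:

```python
import numpy as np

def msd_log_likelihood(f0_frame, w_unvoiced, w_voiced, mean, var):
    """Frame log-likelihood under a two-space MSD state: a discrete
    probability w_unvoiced for the unvoiced subspace, and a Gaussian
    over log-F0 (weighted by w_voiced) for the voiced subspace.

    Minimal sketch; w_unvoiced + w_voiced should sum to 1.
    """
    if f0_frame <= 0:                    # unvoiced frame: discrete subspace
        return np.log(w_unvoiced)
    x = np.log(f0_frame)                 # voiced frame: continuous subspace
    log_gauss = -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return np.log(w_voiced) + log_gauss
```

Because the unvoiced subspace carries its own probability mass, no artificial noise has to be interpolated into unvoiced regions, which is the advantage over the conventional models above.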
4.3 Choice of Features
Spectral features such as MFCC can improve tone recognition accuracy in ASR systems, but the most discriminative feature for tone is F0. To check whether MFCCs are beneficial to the detection of tone mispronunciation, we use two kinds of feature vectors in our experiments: the F0-related features alone, and their combination with MFCC-E-D-A-Z. Since pitch ranges vary greatly across speakers, normalization is needed. Two normalizations are compared: dividing each pitch value by the average of the nonzero values within the syllable, and taking the logarithm of F0.
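The first normalization, dividing by the mean of the nonzero pitch values within a syllable, might be sketched as follows (unvoiced frames, F0 = 0, are left at zero):

```python
import numpy as np

def normalize_f0(f0):
    """Divide voiced F0 values by the syllable's mean voiced F0.

    Sketch of the syllable-level normalization described above;
    unvoiced frames (F0 == 0) stay zero.
    """
    voiced = f0 > 0
    out = np.zeros_like(f0, dtype=float)
    out[voiced] = f0[voiced] / f0[voiced].mean()
    return out
```

After this step the voiced contour fluctuates around 1.0 regardless of the speaker's absolute pitch range.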
4.4 Choice of Goodness of Pronunciation Measures
We compare three types of scores for tone pronunciation: recognition scores, log-likelihood scores and log-posterior probability scores, all computed within the HMM paradigm.

Recognition Scores. A simple measure of tone mispronunciation is based directly on tone recognition results: if the tone is recognized correctly, its pronunciation is judged correct; otherwise it is judged to contain mistakes or defects. Such a measure is highly dependent on the pronunciation quality of the training data used to build the HMMs. For example, if speakers often mispronounce tone A as tone B and such cases are frequent in the training data, the decoding result may still be "correct" even for the wrong pronunciations.

Log-likelihood Scores. The log-likelihood score for each tonal segment is defined as

    l_i = \frac{1}{d_i} \sum_{t=t_i}^{t_i+d_i-1} \log p(y_t \mid tone_i)    (1)

where

    p(y_t \mid tone_i) = \sum_{j=1}^{J_i} p(y_t \mid final_j, tone_i) P(final_j \mid tone_i)    (2)

Here p(y_t | final_j, tone_i) is the likelihood of the current frame with observation vector y_t, and P(final_j | tone_i) is the prior probability of final_j given that its corresponding tone is tone_i. d_i is the duration in frames of the tonal final, t_i is its starting frame index, and J_i is the total number of final phones carrying tone_i. Normalization by the number of frames in the segment eliminates the dependency on duration.

Log-posterior Probability Scores. Log-posterior probability scores perform better than log-likelihood scores in most CALL systems [6]; we adapt the formula to our case. First, for each frame within the segment corresponding to tone_i, the frame-based posterior probability p(tone_i | y_t) of tone_i given the observation vector y_t is computed as

    p(tone_i \mid y_t) = \frac{p(y_t \mid tone_i) P(tone_i)}{\sum_{m=1}^{4} p(y_t \mid tone_m) P(tone_m)}    (3)

    = \frac{\sum_{j=1}^{J_i} p(y_t \mid final_j, tone_i) P(final_j \mid tone_i) P(tone_i)}{\sum_{m=1}^{4} \sum_{j=1}^{J_m} p(y_t \mid final_j, tone_m) P(final_j \mid tone_m) P(tone_m)}    (4)

where P(tone_m) is the prior probability of tone_m. The log-posterior probability score for the tone_i segment is then defined as

    \rho_i = \frac{1}{d_i} \sum_{t=t_i}^{t_i+d_i-1} \log p(tone_i \mid y_t)    (5)
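A minimal sketch of how Eqs. (1), (3) and (5) turn per-frame tone likelihoods into segment-level GOP scores. It simplifies away the sum over final phones in Eqs. (2) and (4) by assuming the per-tone likelihoods p(y_t | tone_m) are already given:

```python
import numpy as np

def tone_gop_scores(frame_loglik, priors):
    """Segment-level GOP scores from per-frame tone log-likelihoods.

    frame_loglik: array of shape (T, 4), log p(y_t | tone_m) over the
    T frames of the tonal final (a simplification of Eq. (2)).
    priors: prior probabilities P(tone_m).
    Returns duration-normalized log-likelihood scores l_i (Eq. 1)
    and log-posterior scores rho_i (Eq. 5) for each of the 4 tones.
    """
    log_joint = frame_loglik + np.log(priors)          # log p(y_t|tone)P(tone)
    log_norm = np.logaddexp.reduce(log_joint, axis=1)  # log sum over tones
    log_post = log_joint - log_norm[:, None]           # Eq. (3), in the log domain
    l = frame_loglik.mean(axis=0)                      # Eq. (1)
    rho = log_post.mean(axis=0)                        # Eq. (5)
    return l, rho
```

Working in the log domain with `logaddexp` avoids underflow when frame likelihoods are very small.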
5 Experiments and Results
The training and testing databases for detection are the same as in Section 4.1. MSD models are used to model golden pronunciations of tones. The performance of different model types, feature vectors and GOP measures is evaluated.
5.1 Performance Measurement
To evaluate the performance of different setups, four decision types are defined:
– Correct Acceptance (CA): a tone pronounced correctly and detected as correct;
– False Acceptance (FA): a tone pronounced incorrectly but detected as correct;
– Correct Rejection (CR): a tone pronounced incorrectly and detected as incorrect;
– False Rejection (FR): a tone pronounced correctly but detected as incorrect.
Given a threshold, the rates of all these decision types can be computed. The scoring accuracy (SA), defined as CA + CR, is plotted as a function of FA over a range of thresholds to evaluate the performance of a detection system. SA-FA curves for the different setups are plotted in the next sections.
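Computing the points of an SA-FA curve from per-token GOP scores and expert labels might look like this. The acceptance rule (score at or above a threshold) is an assumption consistent with the GOP measures above, and rates are reported as fractions of all tokens:

```python
import numpy as np

def sa_fa_curve(scores, labels, thresholds):
    """Compute (FA, SA) pairs over a range of thresholds.

    scores: GOP score per token (higher = better pronunciation).
    labels: 1 if the tone was pronounced correctly, 0 otherwise.
    A token is accepted when its score is at or above the threshold.
    SA = (CA + CR) / N and FA = FA-count / N, as fractions of all tokens.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n = len(scores)
    points = []
    for th in thresholds:
        accept = scores >= th
        ca = np.sum(accept & (labels == 1))    # correct, accepted
        cr = np.sum(~accept & (labels == 0))   # incorrect, rejected
        fa = np.sum(accept & (labels == 0))    # incorrect, accepted
        points.append((fa / n, (ca + cr) / n))
    return points
```

Sweeping `thresholds` over the observed score range traces out the SA-FA curves plotted in Figures 4-7.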
5.2 Results for Different Model Types
Basic setups and tone recognition error rates of all models in our experiments are listed in Table 3.

Table 3. Tone recognition error rate of different model sets

Model Type    Feature Vector                    Tone Error Rate (%)
Monophones    LogF0(5)                          11.20
Monophones    MFCC-E-D-A-Z, LogF0(5)            8.85
Monophones    MFCC-E-D-A-Z, Normalized F0(5)    7.75
Triphones     LogF0(5)                          9.25
Triphones     MFCC-E-D-A-Z, LogF0(5)            7.10
Triphones     MFCC-E-D-A-Z, Normalized F0(5)    6.05
We first compare the performance of monophone and tied-state triphone models. In all the experiments below, the 5-dimensional logF0-related features are the same as in [15], abbreviated "LogF0(5)". Table 3 shows that triphone models always outperform monophone models for tone recognition; evidently the context information modeled by triphones helps recognition.
[Figure 4: three SA-FA curve panels: (a) monophones vs. triphones with LogF0(5); (b)-(c) monophone and triphone models with LogF0(5)+MFCC-E-D-A-Z and Normalized F0(5)+MFCC-E-D-A-Z.]

Fig. 4. Scoring accuracy versus false acceptance for monophones and triphones
The detection performance of these models is shown in Figure 4, with log-posterior probability scores as the GOP measure. Triphone models achieve slightly better detection performance than monophone models. In our setup, triphones model the context between initial and final, and a tone is also related to its initial, which may explain why triphone models are beneficial for tone detection. However, the effect of the initial on the tone is very limited compared with that of the final, which is why the advantage is small.
5.3 Results for Different Features
In this section, we evaluate the performance of different features. The basic setups and tone recognition error rates of the models are given in Table 3, and log-posterior probability scores are again used as the GOP measure. Figure 5 indicates that spectral features such as MFCC, while useful for tone recognition as shown in Table 3, do not appear to benefit detection. The detection reference is provided by the phonetic experts; we infer that the experts judge the quality of tone pronunciation mainly
[Figure 5: two SA-FA curve panels comparing monophone and triphone models with LogF0(5) alone and with LogF0(5)+MFCC-E-D-A-Z.]

Fig. 5. Scoring accuracy versus false acceptance for different features
[Figure 6: two SA-FA curve panels comparing LogF0(5)+MFCC-E-D-A-Z against Normalized F0(5)+MFCC-E-D-A-Z for monophone and triphone models.]

Fig. 6. Scoring accuracy versus false acceptance for different processing of F0
based on pitch contours, as discussed in Section 2, which are consistent with the F0-related features; this is probably why MFCC features contribute little to improving the agreement. Table 3 and Figure 6 show that normalization of F0 is more effective than its logarithm for both recognition and detection. The reason is that normalization compresses the pitch values into a smaller range than the logarithm does, making the models more independent of the speaker.
5.4 Results for Different GOP Measures
The GOP measure is another key factor for detection. In this section, we compare the performance of the three measures described in Section 4.4.
[Figure 7: two SA-FA curve panels comparing recognition scores, log-likelihood scores and log-posterior probability scores for monophone and triphone models.]

Fig. 7. Scoring accuracy versus false acceptance for different GOP measures
Figure 7 shows that log-posterior probability scores achieve the best performance among these measures. The assumption behind posterior scores is that the better a speaker pronounces a tone, the more likely that tone is to dominate the remaining tones; the experimental results indicate that this assumption is reasonable.
6 Conclusions
In this paper we presented our study on the automatic detection of tone mispronunciations in Mandarin. After a subjective evaluation of the tone mispronunciations occurring in PSC, two approaches to automatic detection were presented: template-based and HMM-based. Templates, whether based on the 5-level schematic characterization of a tone or on relative comparisons within a pitch contour, proved impractical. Statistical MSD-HMMs are shown to be more effective and flexible than the template-based approach. Within the HMM framework, different feature combinations, model types and GOP measures were compared on both recognition and mispronunciation detection. We observed that MFCC is not as effective for mispronunciation detection as for recognition, and that normalizing the fundamental frequency within a segment is more useful than taking its logarithm. Among the various GOP measures, the log-posterior probability score shows the best performance.
References

1. Chen J.-C., Jang J.-S. R., Li J.-Y. and Wu M.-C.: Automatic Pronunciation Assessment for Mandarin Chinese. In Proc. ICME, pp. 1979-1982, 2004
2. Wei S., Liu Q.S., Hu Y., Wang R.H.: Automatic Pronunciation Assessment for Mandarin Chinese with Accent. NCMMSC8, pp. 22-25, 2005 (in Chinese)
3. Dong B., Zhao Q.W., Yan Y.H.: Analysis of Methods for Automatic Pronunciation Assessment. NCMMSC8, pp. 26-30, 2005 (in Chinese)
4. Franco H., Neumeyer L., Digalakis V., and Ronen O.: Combination of machine scores for automatic grading of pronunciation quality. Speech Communication, vol. 30, pp. 121-130, 2000
5. Witt S.M. and Young S.J.: Computer-assisted pronunciation teaching based on automatic speech recognition. In Language Teaching and Language Technology, Groningen, The Netherlands, April 1997
6. Neumeyer L., Franco H., Digalakis V. and Weintraub M.: Automatic Scoring of Pronunciation Quality. Speech Communication, 30: 83-93, 2000
7. Ronen O., Neumeyer L. and Franco H.: Automatic Detection of Mispronunciation for Language Instruction. In Proc. European Conf. on Speech Commun. and Technology, pp. 645-648, Rhodes, 1997
8. Menzel W., Herron D., Bonaventura P., Morton R.: Automatic detection and correction of non-native English pronunciations. In Proc. of InSTIL, Scotland, pp. 49-56, 2000
9. Witt S.M. and Young S.J.: Performance measures for phone-level pronunciation teaching in CALL. In Proc. Speech Technology in Language Learning 1998, Marholmen, Sweden, May 1998
10. Huang C., Chang E., Zhou J.-L., and Lee K.-F.: Accent Modeling Based on Pronunciation Dictionary Adaptation for Large Vocabulary Mandarin Speech Recognition. In Proc. ICSLP 2000, Vol. III, pp. 818-821, Oct. 2000
11. Chang E., Zhou J.-L., Di S., Huang C., and Lee K.-F.: Large Vocabulary Mandarin Speech Recognition with Different Approaches in Modeling Tones. In Proc. ICSLP 2000
12. Hirst D. and Espesser R.: Automatic Modelling of Fundamental Frequency Using a Quadratic Spline Function. Travaux de l'Institut de Phonétique d'Aix-en-Provence, 15, pp. 75-85, 1993
13. Tokuda K., Masuko T., Miyazaki N., and Kobayashi T.: Multi-space Probability Distribution HMM. IEICE Trans. Inf. & Syst., E85-D(3): 455-464, 2002
14. Wang H.L., Qian Y., Soong F.K.: A Multi-Space Distribution (MSD) Approach to Speech Recognition of Tonal Languages. Accepted by ICSLP 2006
15. Zhou J.-L., Tian Y., Shi Y., Huang C., Chang E.: Tone Articulation Modeling for Mandarin Spontaneous Speech Recognition. In Proc. ICASSP 2004, pp. 997-1000