A Distinctive Feature Based Method For Evaluating The Phonetic Transcription Of A Non-Native Speech Database∗

Jinsong Zhang1,2, Dongning Wang1, Wen Cao1, Ziyu Xiong3

1 Center of Studies of Chinese as a Second Language, 2 College of Information Science, Beijing Language and Culture University, Beijing
3 Institute of Linguistics, Chinese Academy of Social Sciences, Beijing

[email protected], [email protected], [email protected], [email protected]

Abstract—For studies of second language acquisition and computer aided pronunciation training, an L2 Chinese speech database of Japanese learners has been collected and phonetically annotated twice by two independent groups of labelers. As both annotations contain errors, appropriate methods are needed to screen out inconsistencies for rechecking. This paper presents a multi-step procedure for the problem: first, screen out checking candidates based on statistical distributional analyses of inconsistent transcriptions; then, analyze and merge inconsistent phoneme transcriptions based on phonetic knowledge; finally, use a phonetic feature based analyzer to order the remaining inconsistent label pairs for rechecking. The most dissimilar label pairs are assigned the highest rechecking priority. Preliminary experimental results showed that the procedure helped to generate a meaningful candidate list and priority ordering for rechecking.

Keywords-inter-language database, inter-labeler agreement, distinctive features

I. INTRODUCTION

For the study of Chinese as a second language and the development of computer aided Chinese pronunciation training (CAPT) technology, we started a long-term data collection plan to build a large-scale inter-Chinese speech database [1], which will consist of speech by speakers from a number of L1 language backgrounds, with a few hundred speakers per language. The database will provide detailed phonetic annotations for part of the data, describing mispronunciations at the segmental and suprasegmental levels. As the first step, an inter-Chinese speech database of 100 Japanese learners has been collected, including isolated syllables and continuous speech. We have also designed an annotation convention defining a number of diacritics, each indicating an erroneous articulation tendency such as tongue advancing or backing [1]. The diacritics can be attached to the canonical phonetic labels to indicate mispronunciations, and several diacritics may be combined to represent a complex sound variation.

Phonetic annotation is a difficult, time-consuming and error-prone procedure, since it is performed manually, with humans analyzing non-native sounds and converting them into the corresponding phonetic symbols. We need to consider seriously how to guarantee the reliability of the annotations, how to detect the labels that most likely have problems, and how to deal with those problematic labels. To study these issues and the feasibility of the designed annotation convention, pilot annotations have been carried out on the phonetic segments of a part (17 speakers) of the Japanese inter-Chinese speech by two groups of annotators. Answers to these questions are sought from the two independent sets of phonetic labels.

∗ Supported by the China MOE Project 07JJD740060 of Key Research Institute of Humanities and Social Sciences at Universities.

ISBN 978-1-4244-6245-2/10/$26.00 ©2010 IEEE

The most commonly used criterion for evaluating the reliability of two sets of phonetic annotations is the percentage agreement of the two sets of labels [2-4]. It is calculated as the ratio of the number of identical labels to the number of all labels, and offers a simple and direct measure of the percentage of consistent labels. One problem is that it does not take into account agreement occurring by chance, which depends strongly on the number of labels in the annotation. An improved criterion is Cohen's kappa coefficient [3], which excludes chance agreement explicitly. However, if the label set is very large, the chance probability is rather small, so that Cohen's kappa approaches percentage agreement. In our case, as the number of labels for mispronunciations is large, Cohen's kappa should behave similarly to percentage agreement.

The percentage agreement of the two sets of phonetic labels provides only an evaluation of overall inter-labeler reliability, and cannot tell us which of the mismatched labels are more problematic. This is because the evaluation is based on a binary “same-or-different” judgment that takes no account of inter-label correlations. For example, if there are two [ph] segments and the two pairs of labels are [ph]-[p] and [ph]-[f], the mismatches are treated as the same kind of error, although the phonetic difference of [ph]-[p] can be regarded as smaller than that of [ph]-[f]. If there is a rechecking procedure, the mismatch [ph]-[f] should be given a higher priority than [ph]-[p]. In other words, an ordering of all mismatched label pairs helps to improve rechecking efficiency.

To attain such a goal, we adopt the idea of feature based label agreement estimation, first proposed by Cucchiarini [5] to evaluate agreement between phonetic transcriptions of Dutch utterances. It converts the binary judgment of two labels into a continuous degree of agreement. The features were derived from perception experiments on a 10-point scale, in which the subjects' task was to reproduce the sounds and indicate the degree of articulatory dissimilarity. Oller & Ramsdell [6] employed a similar method in the evaluation of the annotation of pathological voices. In this paper, we apply a similar idea to compare mismatched labels of non-native speech, in order to sort them into a sequence with different priorities for rechecking.

The remaining sections are arranged as follows: Section II gives a brief description of our inter-Chinese speech database of Japanese learners and the annotation results. Section III introduces rule-based modifications of mismatched labels. Section IV presents our distinctive feature based method and experimental results of sorting mismatched phoneme label pairs. Finally, Section V briefly summarizes the results.
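The two overall reliability criteria discussed in the Introduction, percentage agreement and Cohen's kappa, can be illustrated with a minimal Python sketch. The function names and the toy label sequences are assumptions made for illustration only; they are not taken from the database or from the cited tools.

```python
from collections import Counter


def percentage_agreement(labels_a, labels_b):
    """Ratio of identical labels to all labels for two aligned annotations."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotations must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percentage_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each draws independently from their own label distribution.
    p_chance = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy example (hypothetical labels for four segments).
annotation_a = ["ph", "ph", "p", "f"]
annotation_b = ["ph", "p", "p", "f"]
print(percentage_agreement(annotation_a, annotation_b))
print(cohens_kappa(annotation_a, annotation_b))
```

With many distinct labels the chance term becomes small, which is why kappa then stays close to percentage agreement, as argued above.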

II. INTER-CHINESE SPEECH DATABASE

The Japanese part of our large-scale Chinese L2 speech database (referred to as the BLCU inter-Chinese speech corpus) contains data from more than 100 speakers [7]. Among them, the continuous speech of 17 Japanese speakers (8 males and 9 females) has been phonetically annotated at the segment level. Each speaker read the same set of 301 sentences of daily use. Six post-graduate students majoring in phonetics acted as annotators, divided into two groups. The speech data was annotated twice, independently by the two groups, with each annotator labeling blocks of 200 consecutive utterances on a rotating basis. Table I shows some statistics of the annotated database.



TABLE I. JAPANESE L2 INTER-CHINESE DATABASE STUDIED.

Text                                   301 utterances
Speaker                                8 males, 9 females
Number of utterances                   4,631
Number of phonemes                     64,190
Average length per utterance           13.9 phonemes
Number of annotators                   6
Number of annotations per utterance    2

A set of diacritics was designed for the following erroneous articulation tendencies: raising, lowering, advancing, backing, lengthening, shortening, centralizing, rounding, spreading, labiodentalizing, laminalizing, devoicing, voicing, insertion, deletion, stopping, fricativizing, nasalizing, retroflexing, etc. [1]. Annotating means attaching one or several diacritics to the canonical phoneme labels to indicate mispronunciation tendencies. Table II gives some examples of diacritics.

TABLE II. EXAMPLE OF DIACRITICS OF PHONETIC ANNOTATING.

Error tendency     Diacritic    Annotation ex.
Tongue raising     ^            a{^}
Tongue lowering    !            u{!}
Tip advancing      +            e{+}n
Tip backing        -            n{-}
Lengthening        :            z{:}
Shortening         ;            p{;}

An investigation of label agreement has been carried out for the two sets of annotations in [7]. All annotation labels can be classified into four groups in view of consistency:

• Consistent correct (CC) phonemes: regarded as correct by both annotators.
• Consistent mispronunciation (CM): annotated with the same mispronunciation label by both annotators.
• Inconsistent mispronunciation (IM): regarded as mispronunciations by both annotators but annotated with different labels.
• Warning mispronunciation (WM): regarded as a mispronunciation by only one of the two annotators.

Figure 1 shows the percentage distributions of the four groups of phoneme labels for each pair of annotators. Overall, 80.7% (CC+CM) of the annotation is composed of consistent labels. Inconsistent labels account for the remaining 19.3% (WM+IM): about 2.8% are IM phonemes, treated as mispronunciations by both annotators but annotated differently, and the other 16.5% are WM phonemes, regarded as correct by one of the two annotators. Since our aim is to generate a single version of the annotation for further studies, the inconsistent WM and IM parts are our major concern here. They are first processed by rule based label modifications, then compared using the distinctive feature based method.

[Figure 1: percentages of WM, IM, CM and CC labels for each annotator pair and overall.]

Figure 1. Percentage distributions of phoneme labels
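The classification of label pairs into the CC, CM, IM and WM groups can be sketched as follows. This is a minimal illustration under one assumption that is not stated in the paper: a label counts as a mispronunciation exactly when it carries at least one diacritic in braces. The function names are hypothetical.

```python
from collections import Counter


def is_mispronounced(label: str) -> bool:
    """Assumption: a label marks a mispronunciation iff it contains a {...} diacritic."""
    return "{" in label


def classify(label_a: str, label_b: str) -> str:
    """Assign one pair of labels for the same segment to CC, CM, IM or WM."""
    wrong_a, wrong_b = is_mispronounced(label_a), is_mispronounced(label_b)
    if not wrong_a and not wrong_b:
        return "CC"  # both annotators kept the canonical label
    if wrong_a and wrong_b:
        # Both marked a mispronunciation: consistent only if the labels match.
        return "CM" if label_a == label_b else "IM"
    return "WM"  # only one annotator marked a mispronunciation


def group_percentages(pairs):
    """Percentage of label pairs falling into each of the four groups."""
    if not pairs:
        raise ValueError("no label pairs given")
    counts = Counter(classify(a, b) for a, b in pairs)
    return {g: 100.0 * counts[g] / len(pairs) for g in ("CC", "CM", "IM", "WM")}


# Toy example with one pair per group (labels follow the diacritic convention).
example_pairs = [("a", "a"), ("a{^}", "a{^}"), ("e{!}", "e{^}"), ("ao", "{a-}o")]
print(group_percentages(example_pairs))
```

Applied to every aligned phoneme of two annotations, `group_percentages` would give the kind of distribution plotted in Figure 1.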

III. RULE BASED LABEL MODIFICATION

Inconsistent label pairs were summarized and their inconsistencies were checked. Table III shows part of the label pairs in the IM data with token frequencies higher than 2.0%. The parts within the braces are diacritics indicating mispronunciation tendencies, and those outside the braces are the canonical forms of Chinese phonemes.

TABLE III. EXAMPLES OF INCONSISTENT LABELS IN IM DATA

[Table III: columns Label (first annotation), Label (second annotation), Frequency and Percentage. Its rows include the pairs i{e-}ng{+} vs. ing{+}, q{;} vs. q{sh}, sh{x} vs. sh{sh} and r{dl} vs. r{l}; the remaining rows and the frequency and percentage values are not recoverable.]

Observation of the labels revealed that a considerable number of label pairs could be merged because, according to phonetic knowledge or statistical rules, merging them would have little influence on the use of the data for CAPT studies. Accordingly, five rounds of merging were applied to the pilot annotations:


Round 1: Labels of subordinate phonetic phenomena are deleted when major ones exist. For example, an error of tongue advancing for sound “ing” is usually accompanied by a deletion of “e” component, so an explicit annotation of “e” deletion as done in the label “i{e-}ng{+}” is not necessary. It indicates the same error as the label “ing{+}”.





Round 2: Labels of phonetic variations are merged when their discriminations are usually ignored by native Chinese speakers. For example, a flap variation of /r/ sound is easily recognized as an /l/ sound by Chinese speakers, so that it is not necessary to differentiate the two labels: “r{dl}” and “r{l}”.



Round 3: Labels indicating different degrees of the same mispronunciations are merged, e.g. both “sh{x}” and “sh{sh}” denote that /sh/ is realized as /x/ more or less, so that they can be merged as “sh{sh}”.



Round 4: Mispronunciations at different dimensions such as timing and articulation place can be combined. For example, “q{;} vs. q{sh}” can be combined as “q{;,sh}”.

Round 5: The least frequent label pairs are ignored based on a statistical view. Here, those with only one token appearance were modified to their canonical forms.
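Some of the merging rounds above lend themselves to a small sketch. The Python code below illustrates an equivalence mapping in the spirit of Rounds 2 and 3, the combination rule of Round 4 and the singleton rule of Round 5. It is only an illustration under several assumptions: the direction of the "r{dl}" to "r{l}" merge, the `DIMENSION` table and all function names are hypothetical, and `canonical` simply strips braces, so deletion labels such as "{a-}o" would need extra handling.

```python
import re
from collections import Counter

# Equivalences in the spirit of Rounds 2 and 3. "sh{x}" -> "sh{sh}" follows the
# text; the direction chosen for "r{dl}" -> "r{l}" is an assumption.
EQUIVALENT = {"sh{x}": "sh{sh}", "r{dl}": "r{l}"}

# Hypothetical assignment of diacritics to dimensions, used for Round 4.
DIMENSION = {";": "timing", ":": "timing", "sh": "place", "x": "place"}

# A canonical phoneme followed by exactly one diacritic group, e.g. "q{;}".
SIMPLE_LABEL = re.compile(r"([a-z]+)\{([^{},]+)\}")


def canonical(label):
    """Remove every {...} group, e.g. 'i{e-}ng{+}' -> 'ing' (simplification)."""
    return re.sub(r"\{[^}]*\}", "", label)


def normalize(label):
    """Map a label to its merged representative, if one is defined."""
    return EQUIVALENT.get(label, label)


def combine(label_a, label_b):
    """Round 4: combined label, or None when the rule does not apply."""
    a = SIMPLE_LABEL.fullmatch(label_a)
    b = SIMPLE_LABEL.fullmatch(label_b)
    if not a or not b or a.group(1) != b.group(1):
        return None
    dim_a, dim_b = DIMENSION.get(a.group(2)), DIMENSION.get(b.group(2))
    if dim_a is None or dim_b is None or dim_a == dim_b:
        return None  # unknown diacritic, or both errors lie on the same dimension
    return f"{a.group(1)}{{{a.group(2)},{b.group(2)}}}"


def drop_singletons(pairs):
    """Round 5: label pairs seen only once are reset to their canonical forms."""
    counts = Counter(pairs)
    return [
        (canonical(a), canonical(b)) if counts[(a, b)] == 1 else (a, b)
        for a, b in pairs
    ]


print(combine("q{;}", "q{sh}"))
```

Keeping the rules in small tables makes each merging decision explicit and easy to review by a phonetician.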



After these five rounds of modifications, IM and WM decrease to 0.6% and 14.8% respectively, as shown in Figure 2. The total number of label pair types decreases from 1,050 to 310, and these remaining pairs should be rechecked. To facilitate the task, a distinctive feature based method is described in the following section to order the priorities of these label pairs for rechecking.





[Figure 2: percentages of WM and IM labels in the original annotation and after each of rounds 1 to 5.]

Figure 2. Size of WM and IM after each modification

IV. DISTINCTIVE FEATURE BASED LABELS’ AGREEMENT

A. Calculation

The agreement score of two phonetic labels can be calculated with the formula below:

$$A(P, Q) = 1 - \sum_{i=1}^{n} \omega_i \, | p_i - q_i | \qquad (1)$$

Here, $P = (p_1, \dots, p_n)$ and $Q = (q_1, \dots, q_n)$ denote two n-dimensional feature vectors, and $\omega_i$ is the weight of the i-th feature. As Finals in Chinese can be compounds of several phonemes, the features are assigned to the component phonemes instead of directly to the Finals themselves. The agreement between compounds is then calculated from the averaged distances of all individual phonemes. If there are insertions or deletions in the two aligned compounds, as in the case of a monophthong aligned with a diphthong, the distance of an insertion or deletion is set to 0.5. If an agreement score is calculated as lower than 0, it is floored to 0.

The features defined in this study are based on the classic distinctive features described in [9]. Consonants and vowels are each represented by three features. For consonants, they are “manner of articulation”, “place of articulation” and “VOT” (voice onset time). For vowels, they are “height of the tongue”, “frontness of the tongue” and “rounding of the lips”. Each feature is multi-valued, ranging from 0 to 1. According to the principle stated in [10], the values of each feature are equally spaced: if a feature has 2 different values, they are 0 and 1; if it has 3, they are 0, 0.5 and 1; and so on. Table IV and Table V show the feature configurations of the consonants and the vowels respectively.

TABLE IV. CONFIGURATION OF CONSONANTS

[Table IV: Manner, Place and VOT values for the consonants b[p], p[ph], m[m], f[f], d[t], t[th], n[n], ng[ŋ], l[l], g[k], k[kh], h[x], zh[tʂ], ch[tʂh], sh[ʂ], r[ʐ]/[ɻ], j[tɕ], q[tɕh], x[ɕ], z[ts], c[tsh], s[s]; the numeric values are not recoverable.]

TABLE V. CONFIGURATION OF VOWELS

[Table V: Frontness, Height and Rounding values for 16 vowels, including [a], [i], [y] and [o]; the remaining vowel symbols and the numeric values are not recoverable.]

The values of the feature “manner of articulation” are assigned subjectively according to sonority levels, as stated in [10]. Compared to the features proposed in [6], “VOT” is used instead of “voicing”. The reason is that aspiration is phonemically important in Mandarin Chinese, and using VOT makes it possible to distinguish both “voicing” and “aspiration”. Specifically, the VOT value is 0 for voiced phonemes, 0.5 for unaspirated plosives and affricates, and 1 for aspirated plosives and affricates. For the other features, values were approximately assigned according to the positions of the main articulators.
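A minimal Python sketch of the calculation in this subsection is given below, together with the priority ordering it is meant to support. Several points are assumptions rather than the authors' settings: the weights are taken to be equal and to sum to 1, the agreement of compounds is read as one minus the average per-phoneme distance, and the manner and place values in `FEATURES` are hypothetical (only the VOT values 0.5 and 1 for unaspirated and aspirated plosives come from the text). The scores it produces are therefore illustrative only.

```python
from typing import Optional, Sequence

GAP_DISTANCE = 0.5  # distance of an insertion or deletion, as stated in the text


def distance(p: Sequence[float], q: Sequence[float],
             weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of absolute feature differences, i.e. the sum in Eq. (1)."""
    if len(p) != len(q):
        raise ValueError("feature vectors must have the same length")
    if weights is None:
        # Assumption: equal weights that sum to 1.
        weights = [1.0 / len(p)] * len(p)
    return sum(w * abs(a - b) for w, a, b in zip(weights, p, q))


def agreement(p, q, weights=None) -> float:
    """A(P, Q) = 1 - sum_i w_i |p_i - q_i|, floored at 0."""
    return max(0.0, 1.0 - distance(p, q, weights))


def compound_agreement(aligned_a, aligned_b, weights=None) -> float:
    """Agreement of two aligned phoneme sequences; None marks a gap."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("sequences must be aligned and non-empty")
    total = 0.0
    for p, q in zip(aligned_a, aligned_b):
        if p is None or q is None:
            total += GAP_DISTANCE
        else:
            total += distance(p, q, weights)
    # Assumed reading: one minus the average distance over aligned positions.
    return max(0.0, 1.0 - total / len(aligned_a))


# Hypothetical (manner, place, VOT) vectors; only the plosive VOT values
# follow the text.
FEATURES = {
    "p": (0.0, 0.0, 0.5),   # unaspirated plosive
    "ph": (0.0, 0.0, 1.0),  # aspirated plosive
    "f": (0.5, 0.25, 1.0),  # fricative, all values assumed
}

pairs = [("ph", "p"), ("ph", "f")]
# Lowest agreement first: these pairs get the highest rechecking priority.
ranked = sorted(pairs, key=lambda pair: agreement(FEATURES[pair[0]], FEATURES[pair[1]]))
print(ranked)
```

Under these assumed values the pair [ph]-[f] is ranked before [ph]-[p], which matches the priority argued for in the Introduction.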

B. Results

Agreement scores of all 310 label pairs resulting from the rule based label modifications were calculated with the feature based method described above. The scores range from 0 to 1.0, with most lying around 0.80. Figure 3 is the histogram of the agreement scores; the x-axis denotes ranges of scores, and the y-axis denotes the number of label pairs. To judge whether the scores are meaningful for ordering the label pairs, a subjective evaluation was carried out by an annotator, who rated the similarity of each of the 310 label pairs on a 5-point scale. The correlation coefficient between the subjective scores and the agreement scores was 0.75, significant at the 0.01 level. This indicates that the feature based label comparison method yields meaningful results for processing phonetic annotations of non-native speech.

Figure 3. Histogram of agreement score

Three example label pairs are given in Table VI, together with the calculated agreement scores. In the first label pair, one labeler judged that the sound /h/ was pronounced as a labiodental sound, while the other judged it to be a bilabial sound. The two labels represent sounds that are articulatorily close, and the agreement score of 0.93 conforms well to the fact that the two sounds are similar. In the second example, the /e/ sound was judged as “tongue raised” by one annotator and as “tongue lowered” by the other. The contradictory judgments indicate a confusing situation, reflected by a comparatively lower agreement score. In the third example, the first annotator decided that /ao/ was realized as /o/, while the other regarded it as correct. The lowest agreement score indicates that the two labels are rather different, and that the pair should be rechecked with the highest priority.

TABLE VI. EXAMPLE AGREEMENT SCORES.

Phoneme pairs      Agreement score
h{f} vs h{w}       0.93
e{!} vs e{^}       0.7
{a-}o vs ao        0.45

V. SUMMARY

This paper describes our approach to handling narrow phonetic annotations of a non-native speech database, which includes inter-labeler reliability evaluation based on percentage agreement, analyses of label pair distributions, rule-based label modifications for the purpose of CAPT, and a distinctive feature based method for computing agreement scores between pairs of phonetic labels. These efforts aim to evaluate the reliability of manual phonetic annotations, screen out problematic labels for an extra round of rechecking, and order the mismatched labels by rechecking priority to improve efficiency. Experimental results showed that the proposed procedure first reduced the number of label pair types from 1,050 to 310, and that the calculated agreement scores of the 310 label pairs had a correlation coefficient of 0.75 with a subjective evaluation. The work finally helped to generate a reliable annotation for further studies from the two sets of phonetic annotations.

This study on processing phonetic labels serves as a pilot and an important step toward developing a large-scale inter-Chinese speech database, which will include more speakers from more L1 language backgrounds. We hope the experience gained from this pilot study will prove useful in that future work.

ACKNOWLEDGEMENTS

We greatly appreciate the hard annotation work of the student annotators from the Center of Studies of Chinese as a Second Language at BLCU.

REFERENCES

[1] W. Cao, J. Zhang, “The Establishment of a CAPL Inter-Chinese Corpus and Its Labeling”, Proc. of NCMMSC, 2009.
[2] B. Eisen, H. G. Tillman, Ch. Draxler, “Consistency of judgments in manual labeling of phonetic segments: the distinction between clear and unclear cases”, Proceedings ICSLP 92, Banff, 1992, 871-874.
[3] J. M. Kessens, M. Wester, C. Cucchiarini & H. Strik, “The selection of pronunciation variants: comparing the performance of man and machine”, Proceedings ICSLP 98, 6, 2715-2718.
[4] M. A. Pitt, “The Buckeye corpus of conversational speech: labeling conventions and a test of annotator reliability”, Speech Communication, 45, 89-95.
[5] C. Cucchiarini, “Assessing annotation agreement: methodological aspects”, Clinical Linguistics & Phonetics, 2, 131-155.
[6] D. K. Oller, H. L. Ramsdell, “A weighted reliability measure for phonetic annotation”, Journal of Speech, Language, and Hearing Research, 2006, 49, 1391-1411.
[7] W. Cao, D. Wang, J. Zhang & Z. Xiong, “Developing a Chinese L2 Speech Database of Japanese Learners With Narrow Phonetic Labels for Computer Assisted Pronunciation Training”, to appear in Interspeech 2010.
[8] 䖍 ि ⧎㧘ᦡᢥ. “ᣣᧄੱѻ ↢᥉ㅢ䆱 r ჿᲣ๺ l ჿᲣ⊛㖸ؐ ⠨ ኤ”, Proc. of NCMMSC, Huangshan, China, 2009.
[9] P. Ladefoged, M. Halle, Discussion Notes: Some Major Features of the International Phonetic Alphabet, 1988.
[10] D. A. Burquest, L. P. David, Phonological analysis: a functional approach. Dallas, TX: Summer Institute of Linguistics, 1998.