Identification of foreign-accented French using data mining techniques

1 downloads 4408 Views 146KB Size Report
Identification of foreign-accented French using data mining techniques. Bianca Vieru-Dimulescu, Philippe Boula de Mareüil, Martine Adda-Decker. LIMSI-CNRS ...
Identification of foreign-accented French using data mining techniques Bianca Vieru-Dimulescu, Philippe Boula de Mareüil, Martine Adda-Decker LIMSI-CNRS, BP 133, 91403 Orsay, France [email protected], [email protected], [email protected]

Abstract The goal of this study is to automatically differentiate foreign accents in French (Arabic, English, German, Italian, Spanish and Portuguese). We took advantage of automatic alignment into phonemes of non-native French recordings to measure vowel formants, consonant duration and voicing, prosodic cues as well as pronunciation variants combining French and foreign acoustic units (e.g. a rolled ‘r’). We retained 62 features which we used for training a classifier to discriminate speakers’ foreign accents. We obtained over 50% correct identification by using a cross-validation method including unseen data (a few sentences from 36 speakers). The best automatically selected features are robust to corpus change and make sense with regard to linguistic knowledge.

data mining techniques in order to sort out which cues contribute to identifying the origin of a given foreign accent. After a description of our corpus (Section 2), Section 3 presents a set of accent-specific cues which are subdivided into linguistic subsets related to vowel formants, consonant duration and voicing, prosodic cues, pronunciation variants derived from non-standard alignments with French and foreign acoustic units. In Section 4, we compare the contribution of each of these subsets of phonetic cues to identify 6 foreign accents in French, by applying a datadriven method (see Figure 1). Classification algorithms are trained on already classified data from which patterns are extracted by machine learning. They are then tested on unseen data. Results obtained with the best attributes (i.e. the most discriminating characteristics) provided by selection algorithms are finally reported and compared to human perception.

Index Terms: foreign accent, non-native French

2.

Corpus

1. Introduction The paralinguistic literature usually refers to variation in speech that correlates with factors like speaker gender, age, emotional state or origin. What are the cues that enable us to recognise the mother tongue (L1) of a non-native speaker of our own language (L2)? Can machines model human perception in a similar task? This study aims at finding an economical set of features that can be measured and permit an automatic system to identify the origin of 6 foreign accents in French: Arabic, English, German, Italian, Portuguese and Spanish. In [1], native French listeners proved capable of quickly identifying them in more than 50% of cases. Requested to comment on their responses, they claimed they had based their decisions on pronunciation traits such as the ‘r’ articulation. Perceptual studies listed a number of accent-specific features. Among the clues that contribute to an impression of Spanish-accented English, we can cite factors affecting [s]~[z] and [b]~[v] consonants especially [2, 3]. Voiced stops may index Arabic-accented [4] or French-accented [5] English. In German-accented English and English-accented German [6] as well as in Spanish-accented Italian and Italian-accented Spanish [7], prosody was found to play an important role. A question that arises is: how to benefit from these findings to automatically identify different accents? Most automatic speech recognition (ASR) studies on foreign accents focus on error rates. Fewer studies approach accent identification. Let us mention studies on Mandarin-, German- and Turkish-accented English [8], Arabic- and Vietnamese-accented English [9]. In this paper, we present results obtained combining linguistic knowledge, speech processing techniques, in particular automatic alignment and

We recorded 72 speakers of Arabic, English, German, Italian, Portuguese or Spanish mother tongue (12 per language). Among other things, they read two texts: a 400word text from the PFC project (Phonology of Contemporary French [10]) and the IPA text “The North Wind and the Sun” which contains 125 words in the French translation. We divided our speakers into two groups: 36 train speakers and 36 test speakers (6 per language). All the speakers were European or came from Arabic-speaking countries. The Spanish speakers were neither Catalan nor Latin-American. On average, train speakers (as many males as females who were all students) were 24 years old, had lived in France (in the Paris region) for 15 months and had started to study French at the age of 17. They were employed in a perceptual test, and their origin was correctly identified in 52% of cases, which was three times better than chance (17%) [1]. In comparison, a similar test on French regional accents yielded 43% correct identification [11]. For each foreign accent, the most frequent answer was the right one. The speakers’ degree of accentedness evaluated by the same 25 listeners was comparable across the different linguistic origins. It was rated 2.7 on average on a 0–5 scale (where 0 is no accent and 5 is very strong accent). For the test, we recorded another 36 speakers under the same conditions as the first group of speakers. Speakers were 27 year-old students who had lived in France for 21 months and had started to learn French at the age of 15, on average. Hence, the two groups were comparable in age, duration of residence in the Paris region and age of acquisition of French.

Figure 1: Data flow between recording and classification.

3.

Defining a Set of Linguistic Cues

3.1. Acoustic measures using standard alignment With the help of automatic alignment derived from the LIMSI speech recognition system [12], all the texts were segmented into phonemes by using context-independent acoustic models. Previous work showed the validity of the approach on standard and non-standard French [11,12,13].

3.1.1. Vowel formants Acoustic analyses were carried out on the basis of the segmented train corpus. Formant frequencies were measured on vowels (over 500 vowels per speaker) using a script written for the PRAAT software (www.fon.hum.uva.nl/praat/). The first 5 formants as well as fundamental frequency (F0) values were measured every 10 ms with the standard options of PRAAT. Filters were designed for the first two formants, adapted to each vowel to discard aberrant formant values with respect to reference values in an average range of ±500 Hz [13]. We also retained the sole vowels that were voiced on more than half their duration before averaging the values. Only 5.5% of vowels were rejected. The formant values were then normalised with the help of Nearey’s log-mean procedure [14], the purpose of which is to reduce the effect of vocal tract differences. The fronting of /u/ in English is noticeable, as is the backing of /y/ in Spanish and Italian speakers. Among /e/s, the closest one to /i/ is that of Arabs. The Arabic /e/~/i/ merger is notorious and stigmatised through caricatures such as Itats-Inis (“United States”). As for the schwa vowel, it is most closed among Portuguese speakers, and fronted among Spanish and Italian speakers. In the following we concentrate on the most distant vowels from their French counterparts: /e/, /i/, /u/, /y/ and /´/.

3.1.2. Consonant duration and voicing Consonant duration and voicing were also measured on the train data. The voicing ratio is defined for each phoneme as the number of voiced frames divided by the total number of frames (every 10 ms). We observe that speakers of Arabic, English and German have the longest /p/, /t/ and /k/ and that voiced stops (/b/, /d/, /g/) are devoiced in English and

German speakers. In their mother tongues, voiceless stops are often aspirated — that is, their voice onset time (VOT) is long, which impacts their L2 productions. A partial devoicing of /v/ and /Z/ is also measured for English speakers. Spanish speakers of French very much devoice the /z/ fricatives (which they seem to equate to /s/) and produce the shortest non-native /b/s and /v/s. Also, the ‘r’s produced by Italian speakers are shorter and more voiced than they are for other speakers. We retain the duration and voicing of /p/, /t/, /k/, /b/, /d/, /g/, /v/, /“/ and only the voicing of /s/, /z/, /S/, /Z/ for further experiments.

3.1.3. Final schwa and related prosodic cues On the train data, we notice that Italian speakers of French realise more final schwas of polysyllabic words than all other speakers. We also measured the lengthening of the vowel preceding a pronounced schwa together with the ∆F0 (in semitones) between the average pitch of the penultimate vowel and that of the final schwa. We notice that Italians lengthen the supposedly stressed syllable and lower the pitch on the final schwa, whereas native Germans speaking French display a rising F0 contour on the final syllable. In addition to these schwa-related cues we measured the duration standard deviation of vocalic intervals (∆V) put forth by Ramus [15] and Grabe’s pairwise variability index (PVI) on vowels [16]. These two parameters are high for Italian train speakers, in accordance with their tendency to lengthen stressed syllables.

3.2. Non-standard alignments 3.2.1. Variants using French acoustic units So far, we used a pronunciation dictionary in which each entry was assigned one or several standard French pronunciation(s). We saw that non-native speakers frequently produce variants that deviate markedly from the canonical form(s), and our acoustic measures suggest that such deviations may be common to speakers of a given mother tongue. We defined a set of 20 non-native French variants: /b/[v], /v/[b], /s/[z], /z/[s], /Z/[j], /Z/[S], /S/[tS], /b/[p], /d/[t], /g/[k], /v/[f], /“/[l], /“/[w], /l/[w], /o/[ç], /e/[ε], /y/[u], /y/[i], added to nasal vowels — /A)/, /E)/ and /ç)/ possibly

denasalised and followed by nasal appendices ([n] or [m] before p/b). For each type of variant, a specific pronunciation dictionary was generated and a distinct alignment was produced accordingly. In each pair (e.g. /b/[v]), the first phoneme could be identified as the second one. Building such a kind of mapping table enables a binary, categorical approach to foreign-accented speech completing formant, F0 and duration-based analyses. For instance, Italian train speakers’ /y/ aligned as [u] and /“/ aligned as a liquid are consistent with previous measurements. As far as Spaniards are concerned, their tendency to pronounce [v] for /b/ and [s] for /z/ is confirmed — as well as [j] for /Z/ and [tS] for /S/. Additionally, it turns out that after nasal vowels native speakers of Romance languages speaking French show many nasal appendices, which is well audible for ourItalian and Spanish train speakers. These 20 additional variants are taken into account in the test.

3.2.2. Variants using foreign acoustic units The previous variants were designed to evaluate confusions between French phonemes made by non-native speakers. We also included phonemes or allophones as [r], [®], [lÚ], [U], [s∞], [∆] and [B], which may be particular to speakers of certain origins [17]. The [r], [s∞], [∆] and [B] xenophones introduced into the French acoustic models were borrowed from Spanish, for which we had extensively trained models at our disposal. As for the [®], [lÚ] and [U] xenophone, they were taken from English. On our train data, Italian speakers of French produce more rolled ‘r’s than other speakers. Also, Spaniards’ tendency to realise an apical [s∞] shows up in the high proportion of this variant when French is spoken. Among Spanish speakers too, the palatal fricative [∆] and the approximant [B] (instead of /b/ or /v/) are often preferred to French phonemes. To sum it up, the following variants including xenophones were kept in the next section: /“/[r], /“/[®], /l/[lÚ], /u/[U], /s/[s∞], /Ћ/[∆], /b/[B], /v/[B].

4.

Accent classification

On the train data, relevant cues seem to be the Arabic /e/~/i/ merger, the English and German aspirated or devoiced stops, the Italian rolled [r] and prosody, the Portuguese schwa pronunciation, the Spanish [s∞] (possibly substituting the French /z/) and /b/~/v/ confusions. The aim of this section is to assess whether this holds true on new speakers and a new corpus. It is twofold: first, it consists in disentangling the relative importance of vowels, consonants and prosody analysed with different techniques; second, it considers which attributes are automatically selected by machine learning.

4.1. Classification based on linguistic sets of cues We first looked to the contribution to accent classification of the classes of cues referred to as follows: the first two Formants of vowels, duration and voicing of Consonants, Prosody (including schwa-related phenomena), French variants and Xenophones. We used the Weka data mining software [18], in which 24 implemented algorithms are suited for our type of data. All these algorithms were applied on our train corpus, the 20 best of which were kept in the following experiments (e.g. Bayesian Networks, Logistic

Regression Models, Multilayer Perceptrons, Support Vector Machines, C4.5, Random Forest). In Experiment 1, train and test speakers are different, but the text they read is the same. In Experiment 2, not only are speakers different, the material is also different. In Experiment 3, a leave-one-out method is used to estimate the accuracy rate of the learning scheme on all our data (72 speakers reading the PFC and the IPA texts). This crossvalidation method allows speaker-independent tests using one speaker at a time for testing and the rest for training. Since performance differs to a large extent according to the type of experiment and the subset of cues (see Figures 2 and 3), we decided to average classification accuracies among techniques in Table 1. In addition, averaging results makes them more comparable with the results of the perceptual test involving 25 subjects. Table 1. % correct identification based on linguistic subsets (e.g. French variants), the best 10, 15 and all attributes (see text). Attributes

Best algorithm

Formants Consonants Prosody Fr. variants Xenophones

Exp1 44 50 39 53 58

Exp2 39 28 28 47 42

Exp3 50 35 28 72 56

all best 15 best 10

69 72 58

56 47 50

81 78 71

Average over 20 algorithms Exp1 Exp2 Exp3 36 27 43 39 19 26 26 16 28 36 32 57 37 30 43 48 52 46

35 37 38

59 63 57

Table 1 shows results in terms of correct identification averaged across the 20 algorithms as a function of the set of attributes used. It also shows the results of the best algorithm, but this one is not always the same for all experiments (Exp1, Exp2, Exp3) and all sets of attributes. Some algorithms may outperform others in some cases. In this table, we readily notice how corpus-dependent Formants and Consonants are. If we train and test our scheme on formants extracted from the PFC text reading, we obtain 36% correct identification on average, instead of only 27% on the IPA text reading. We are at chance level (16%) for Prosody and Consonants when the train and test texts are different. For Consonants, results of Experiment 3 are even worse than in Experiments 1, whereas for all other cues the leave-one-out results are the best. French variants and Xenophones seem to be more robust to corpus change: the average performance loss is only 4% and 7% respectively, between Experiments 1 and 2. Nevertheless, some vowel formants and consonant duration characteristics may be of help to identify our 6 foreign accents. Selection algorithms are run in order to establish which attributes are most relevant for our task.

4.2. Attribute selection In Section 3, we depicted 62 attributes which may characterise foreign-accented French. As some attributes may be irrelevant, redundant or corpus-dependent, we performed automatic attribute selection on our train data. This selection aims to delete unsuitable attributes and improve the performance of learning algorithms, all attributes being considered as equally important in this selection [19].

We ran 15 attribute selection algorithms implemented in Weka (e.g. Information Gain, Principal Components Analysis). Since algorithms result in different rankings and numbers of selected attributes, we applied a kind of ROVER (Recogniser Output Voting Error Reduction) to sort out the ones which have the best average score (see Equation 1).

90

NaiveBayes

MultilayerPerceptron

SVM

C4.5

Logistic Regression

80 70 60 50 40 30 20 10 0

score =

n

p ∑ (C − ranki ) n i =1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62

(1)

90 80 70

where ranki is the attribute rank in the i-th algorithm, p is the number of algorithms which select this attribute, n the number of algorithms used (here 15), and C is a constant here set to 100.

60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62

The 10 best attributes (and the next 5 ones within brackets) present in most algorithms are: • the first formant of /e/ and the first two formants of /´/; • the duration of /“/ (and /v/); • percentages of nasal appendices, /b/[v], /b/[p], /d/[t], /“/[l] (and /v/[b] , /z/[s] , /e/[ε]); • percentages of /“/[r] (and /v/[B]) xenophones. Results obtained with the best 10 and 15 attributes are shown in the bottom of Table 1. On average, algorithms using 15 attributes yield 37% (resp. 52%) correct identification when train and test texts are different (resp. the same). The latter figure is the one obtained by humans in a similar classification task [20]. Above 15 attributes, results are quite stable on average, as we can see in Figure 2. For example, the average performance in terms of correct identification remains around 60% for Exp3. Performance does not improve proportionally with the number of attributes, because the more attributes are retained, the more irrelevant or correlated they may be [19]. Of course the best algorithm performs better than the average: using 10 attributes and cross-validation, for instance, we obtain 71% correct identification with the best algorithm (Support Vector Machines) and only 57% on average.

Figure 2. Algorithm performance as a function of the number of attributes: best performance (top) and average performance (bottom). Dotted lines indicate 10 and 15 attributes. The results of this algorithm (SVM) and others quoted in 4.1 (Bayesian Networks, Logistic Regression Models, Multilayer Perceptrons, C4.5) are plotted in Figure 3. The best algorithm changes with the number of attributes used for classification. Nonetheless, on the whole, the most robust algorithms seem to be Logistic Regression Models and SVM, followed by Multilayer Perceptrons.

90 80 70 60 50 40 30 20 10 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62

Figure 3. Performance of 5 classification algorithms as a function of the number of attributes in Exp1 (top), Exp2 (middle) and Exp3 (bottom). The C4.5 algorithm does not perform best, but its results appear among the best ones and are directly interpretable. Figure 4 displays the output of the C4.5 decision tree learner using 10 attributes to classify test speakers (50% correct identification in Experiment 1 and 33% in Experiment 2). It is of note that the identification of the Portuguese accent in French is based only on the first two (normalised) formants of /´/. If the same algorithm is run using all attributes or only 15 attributes the same cues are used for isolating the Portuguese accent in French. Indeed, few features characterise Portuguese-accented French, which is rather unmarked and was often mistaken in [1].

Figure 4: Decision tree resulting from the C4.5 learner implemented in Weka (J48) using 10 attributes. F1@ and F2@ stand for the first two normalised formants of the schwa. The groups given by the decision tree are the same as in the clustering resulting from listeners’ answers in the perception test performed on the train speakers [1]: English and German accents on the one hand, Spanish and Italian accents on the other hand are gathered. In the decision tree of Figure 4, Spanish and Italian speakers of French are distinguished on the /b/[v] pronunciation (Spaniards tending to have more /b/s aligned as [v] than Italians). Only 4 German speakers and 1 non-German speaker (the second number in the

German leaf) are recognised as German. Therefore, the German accent model is poor, in comparison to the Spanish or Portuguese accent models especially, which are learned from all and only Spanish and Portuguese speakers respectively. An explanation may be that German speakers in the train data were judged to have the weakest accent by native French speakers [1]. Table 2 corresponds to the confusion matrix averaged across 20 classifiers run using 10 attributes. For almost each linguistic origin, the most frequent answer is the right one (on the diagonal of Table 2). The only exception is Germanaccented French (see above). Indeed, only few features were measured allowing a satisfactory distinction between German and English speakers of French. Italian speakers are recognised best (in 55% of cases, which is better than human subjects’ identification of Italian-accented French in [1]). There is a small bias in favour of Arabic speakers who seem to “attract” identification. As a matter of fact, our Arabic speakers do not show a strong accent, and their age of acquisition of the French language is the lowest one among all our speakers. Table 2. Confusion matrix obtained on the test text by averaging the confusion matrices for the 20 best algorithms using 10 attributes (percentages are computed with respect to 120 responses). answer origin Arabic English German Italian Portuguese Spanish

Arabic English German Italian Portuguese Spanish

34 16 42 17 13 17

17 45 19 12 25 8

7 14 15 2 5 4

2 10 8 55 9 6

30 5 9 7 30 17

9 10 7 8 18 48

Figure 5. Dendogram obtained using a divise clustering algorithm and a Manhattan distance, from the results obtained in the confusion matrix (Table 2). A graphical layout of the results under the form of a tree (or dendrogram) can be obtained by applying a clustering algorithm based on distances between the lines of the confusion matrix. Using a divisive algorithm and a Manhattan distance, we can see in Figure 5 that speakers of Romance languages on the one hand and speakers of Germanic languages on the other hand are regrouped (see also Figure 4 and [1]). As well as Italians, Spaniards are recognized as Romance and English speakers as Germanic in most cases.

5.

Discussion and future work

In this article, we weighed the importance of vowel formants, consonant duration and voicing, prosodic cues, French pronunciation variants, and foreign acoustic units in the classification of Arabic, English, German, Italian, Portuguese and Spanish accents in French. Data mining algorithms were trained on 36 speakers and tested on another 36 speakers of these different origins. The role of prosody proved disappointing in comparison to segments, but the goal there was to gain linguistic knowledge rather than to target high identification scores. This result is interesting in view of prior studies [6, 7]. We then tried to find a concise and corpus-independent set of accent-characteristic features. With 10, 15 or over 60 attributes, we obtained similar results averaged on 20 classification algorithms. The Italian accent in French is correctly identified in more than 50% of cases, and the other accents except German-accented French are well identified in a relative majority of cases. Combining the outputs of different models might improve the identification scores. New experiments including native French speakers and/or limited to utterances of 10 seconds are necessary. New perceptual experiments on the test corpus are also needed to compare human and automatic identification under the same conditions. In the perceptual test we performed [1], listeners unlike classification algorithms succeeded in identifying German speakers. Future work is planned to measure subsegmental features, namely the VOT of stop consonants, expecting that this piece of information is useful for automatic identification. A more general L1/L2 comparison is eventually foreseen.

6.

References

[1] Vieru-Dimulescu, B. and Boula de Mareüil, P., “Perceptual identification and phonetic analysis of 6 foreign accents in French”, Interspeech, Pittsburgh, pp. 411–414, 2006. [2] Magen, H.S., “The perception of foreign-accented speech”, Journal of Phonetics, Vol, 26, pp. 381–400, 1998. [3] Flege, J.E. and Hammond, R., “Mimicry of Nondistinctive phonetic differences between language varieties”, Studies in Second Language Acquisition, Vol, 5, pp. 1–17, 1982. [4] Flege, J.E. and Port, R., “Cross-language phonetic interference: Arabic to English”, Language and Speech, Vol, 24 (2), pp. 125–146, 1981. [5] Flege, J.E., “The detection of French accent by American listeners”, Journal of Acoustical Society of America, Vol, 76 (3), pp. 692–707, 1984. [6] Jilka, M., “Identifying Intonational Foreign Accent with the help of different methods of F0 Generation”, ICPhS, pp. 1457–1450, 1999. [7] Boula de Mareuil, P. and Vieru-Dimulescu, B., “The contribution of prosody to the perception of foreign accent”, Phonetica, Vol, 63, pp. 247–267, 2006. [8] Arslan, L M. and Hansen, J. H. L., “A study of temporal features and frecquency characteristics in American English foreign accent”, Journal of Acoustical Society of America, Vol, 102 (1), pp. 28–40, 1997. [9] Kumpf, K. and King, R.W., “Foreign speaker accent classification using phoneme-dependent accent discrimination models and comparisons with human

perception benchmarks”, Eurospeech, Rhodes, pp. 2323–2326, 1997. [10] Durand, J., Lacks, B. and Lyche, C., “Le projet ‘Phonologie du Français Contemporain’ (PFC)”, La Tribune Internationale des Langues Vivantes, Vol, 33, pp. 3–9, 2003. [11] Woehrling, C. and Boula de Mareuil, P., “Identification of regional accents in French: perception and categorization”, Interspeech, Pittsburgh, pp. 1511–1514, 2006. [12] Adda-Decker, M. and Lamel, L., “Pronunciation variants across system configuration, language and speaking style”, Speech Communication, Vol, 29, pp. 83–98, 1999. [13] Gendrot, C. and Adda-Decker, M., “Impact of duration on F1/F2 formant values of oral vowels: an automatic analysis of large broadcast news corpora in French and German”, Interspeech, Lisbon, pp. 2453–2456, 2005. [14] Disner, S.F., “Evaluation of Vowel Normalization Procedures”, Journal of the Acoustical Society of America, Vol, 67, pp. 253–261, 1980. [15] Ramus, F., Rythme des langues et acquisition du langage, PhD thesis: EHESS, 1999. [16] Grabe, E. and Low, E.L., Durational Variability in Speech and the Rhythm Class Hypothesis, in Laboratory phonology VII, in Gussenhoven, C. & Warner, N. (Eds). Berlin: Mouton de Gruyter, pp. 515–546, 2002. [17] Delattre, P., Comparing the phonetic features of English, French, German and Spanish, Heidelberg: Julius Groos Verlag, 1965. [18] Witten, I.H. and Frank, E., Data Mining: Practical machine learning tools and techniques, 2nd Edition ed, San Francisco: Morgan Kaufmann, 2005. [19] Guyon, I. and Elisseeff, A., “An introduction to variable and feature selection”, Journal of Machine Learning Research, Vol, 3, pp. 1265–1287, 2003.

Suggest Documents