Speech Communication 52 (2010) 847–862 www.elsevier.com/locate/specom
Predicting the phonetic realizations of word-final consonants in context – A challenge for French grapheme-to-phoneme converters

Josafá de Jesus Aguiar Pontes *, Sadaoki Furui

Tokyo Institute of Technology, Department of Computer Science, 2-12-1-W8-77 Ookayama, Meguro-ku, Tokyo 152-8552, Japan

Received 4 December 2009; received in revised form 20 June 2010; accepted 22 June 2010
Abstract

One of the main problems in developing a text-to-speech (TTS) synthesizer for French lies in grapheme-to-phoneme conversion. Automatic converters still produce too many errors in their phoneme sequences to be helpful for people learning French as a foreign language. The prediction of the phonetic realizations of word-final consonants (WFCs) in general, and of liaison in particular ("les haricots" vs. "les escargots"), is one of the main causes of such conversion errors. Rule-based methods have been used to solve these issues, yet the number of rules and their complex interaction make maintenance a problem. In order to alleviate such problems, we propose here an approach that, starting from a database compiled from cases documented in the literature, allows us to build C4.5 decision trees and, subsequently, to automate the generation of the required phonetic rules. We investigated the relative efficiency of this method both for the classification of contexts and for word-final consonant phoneme prediction. A prototype based on this approach reduced Obligatory context classification errors by 52%. Our method has the advantage of sparing us the trouble of coding rules manually, since the rules are already contained in the training database. Our results suggest that predicting the realization of WFCs, as well as context classification, is still a challenge for the development of a TTS application for teaching French pronunciation.

© 2010 Elsevier B.V. All rights reserved.

Keywords: Decision trees; Liaison in French; Post-lexical rules; Speech synthesis; Grapheme-to-phoneme conversions
1. Introduction

Being able to communicate well in a foreign language has become a necessary skill for surviving in the globalized world of the 21st century. Yet, no matter how large one's vocabulary, and how quick one may be in translating ideas into words, if one does not know how to pronounce them properly, one may still fail to achieve one's goal – to be understood. Admittedly, learning the pronunciation of French is a difficult task. One of the main reasons for this lies in the mismatch between the written and the spoken code. Given the complexity of the mappings between sound and written form (phonemes and graphemes), pronunciation
* Corresponding author. Tel.: +81 9084789169; fax: +81 357343481.
E-mail addresses: [email protected] (J. de Jesus Aguiar Pontes), [email protected] (S. Furui).

0167-6393/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.specom.2010.06.007
cannot be deduced from the orthographic form alone. For example, consider the words "vin" ("wine"), "vins" ("wines"), "vingt" ("20"), "il vint" ("he came"), "je vaincs" ("I vanquish"), "en vain" ("in vain"), which despite their different meanings and domains (beverage, number, movement, competition, success rate) are all pronounced the same way: [vɛ̃]. Hence, pronunciation seems rather irregular and unintuitive from the learner's point of view. A particular challenge is learning the pronunciation rules of word-final consonants (henceforth WFCs). The pronunciation of these phonemes/graphemes may depend on the context of the neighboring words or on the morpho-syntactic roles played by them. It may even depend on the intended meaning. Nevertheless, these difficulties could be considerably alleviated with an adequate text-to-speech (TTS) application, one that has been designed with respect to the idiosyncrasies of our target language, French. Of course, this
requires a good model for predicting the pronunciation of the WFCs, a non-trivial problem in the case of French. In the following sections, we will give more details concerning the phonetic realizations of the WFCs, as well as some justification for the need to model them.

1.1. Liaison in French

Liaison is the pronunciation of an otherwise silent final consonant of a word (w1) in certain contexts. It is realized when the following word (w2) starts with a vowel, a (graphic) mute "h" or some glides (Fouché, 1959; Encrevé, 1988). In addition, w1 usually needs to have a close syntactic relation with w2 to allow for or require the realization of liaison. For example, consider the possessive adjective w1 "mes" [me(z)] ("my"), with [z] being the latent consonant. If w2 is a noun starting with a consonant, such as "doigts" [dwa] ("fingers"), the word sequence w1 w2 "mes doigts" (ex. 1)1 is pronounced [medwa], where the latent consonant [z] remains silent (the liaison is not realized). But if w2 is a noun starting with a vowel, such as "amis" [ami] ("friends"), the word sequence "mes amis" (ex. 2) is pronounced [mezami]. In this case, the latent consonant [z] must be pronounced, i.e. the liaison is realized.

Liaisons generally connect the final consonant of a given word (w1) with the initial vowel of the next word (w2), yielding a single phonetic word (sharing a single stress). However, as Encrevé (1988) correctly pointed out, in the speech of certain politicians a pause is sometimes introduced between these two elements, thus breaking the connection. This is typically the case when speakers need extra time to plan the next fragment, i.e. what to say next. This being so, Encrevé suggested using the terms liaison enchaînée and liaison non-enchaînée in order to distinguish between these two types. Our study focuses only on the first type (liaison enchaînée), i.e. a liaison immediately followed by the realization of a vowel, without any interruption.

Liaisons are classified as Obligatory (Ob), Optional (Op), or Forbidden (Fo) (Delattre, 1951). Obligatory liaisons stand for contexts in which a consonant of liaison is to be pronounced, such as in the above sequence "mes amis" (ex. 2). Several studies suggest that for most varieties of French, a set of obligatory liaisons is stable enough to be classifiable (Sanders, 1993). By and large, this allows the obligatory liaisons to be subclassified in a systematic and uniform manner. Optional contexts are cases where the liaison can be realized depending on the circumstances (Pierret, 1994; Durand and Lyche, 2008). For instance, consider the noun w1 "amis" ("friends") followed by the adjective w2 "étrangers" [etʁɑ̃ʒe] ("foreign"). The word sequence "amis étrangers" (ex. 3) can be pronounced either with or without the
1 Please refer to Appendix A for the meaning and the phonetic representation of the examples given throughout this document.
consonant [z], i.e. the liaison. In this particular example, the pronunciation [amizetʁɑ̃ʒe] emphasizes the plurality of the word "amis", and this is generally the case when this sequence is found in isolation. However, the pronunciation [amietʁɑ̃ʒe] is preferred when another element of the context already signals plurality, as is the case when the plural definite article "les" precedes this sequence: "les amis étrangers" [lezamietʁɑ̃ʒe]. Hearing both liaisons pronounced, [lezamizetʁɑ̃ʒe], may sound unnatural to listeners. To prevent this from happening, the first [z] (obligatory liaison) is pronounced, while the second one (optional liaison) is preferably kept silent. This could be handled automatically by using the number of Obligatory [z] liaisons present in a sentence as a feature, ranking Obligatory cases over Optional ones. Finally, Forbidden contexts of liaison are those whose realizations are avoided, even if the graphic word ends with a consonant (Battye et al., 2000). For example, consider the noun w1 "amis" followed by the verb w2 "étudient" [etydi] ("study"). In this case, the sequence "amis étudient" (ex. 4) must be pronounced [amietydi], rather than [amizetydi], as liaisons are not triggered between nouns followed by verbs.

1.2. Other WFC pronunciation variations

Liaison is actually a subset of the more general problem addressed in this paper: WFCs. Apart from variations caused by liaisons, the pronunciation of word-final consonants may vary for other reasons. These variations can be classified as context-dependent and context-independent. The context-dependent variations are found in words whose FC pronunciation can be described in terms of pronunciation rules, such as "six" [si(s)] ("six"), "plus" [ply(s)] ("more/plus") and "tous" [tu(s)] ("all/everyone"). The context-independent variations are found in words whose FC pronunciation cannot be described in terms of such rules, for instance "ananas" [anana(s)] ("pineapple"), "août" [u(t)] ("August") and "jadis" [ʒadi(s)] ("formerly"). The final consonant of these words can be pronounced or not, depending on idiosyncratic or sociolinguistic variables (Stammerjohann, 1976; Verluyten and Hendrickx, 1987). Henceforth we refer to these two categories as the non-liaison context-dependent (NLCD) set and the non-liaison context-independent (NLCI) set, respectively.

Whether or not the final consonant of a word from the NLCI set is phonetically realized is irrelevant here, since both pronunciations are correct. However, for words of the NLCD set, we need to make sure that realization is performed properly, as the meaning may be affected. For example, consider the sentence "j'en veux plus" uttered in a casual setting. If the final -s of the word "plus" is pronounced ([ʒɑ̃vøplys]), the sentence is understood as "I want more" (ex. 5), but if it is not ([ʒɑ̃vøply]), the opposite meaning is conveyed: "that's enough" (ex. 6).

The realization of the FC of a word from the NLCD set varies according to the context and is regulated by
underlying semantic/syntactic/phonetic constraints. For the realization of its final consonant, a word from this set in the position w1 does not require the following word (w2) to start with a vowel, nor does it need to have a close syntactic relationship with w2 (cf. Table 1). For example, consider the French word "os" ("bone(s)"). Depending on whether it is supposed to express the singular or the plural, the -s needs to be pronounced or not: "os" [ɔs] for "bone" vs. "os" [o] for "bones". This word can even be followed by a pause and still have its final consonant realized. This is not possible for liaisons, which assume the existence of w2.

Consider also the difference between liaisons and NLCD concerning the associated word-final graphemes/phones. In the sequence "tous ensemble" (ex. 7), the final consonant -s is realized via the phoneme [s] (NLCD), while in "tous azimuts" (ex. 8), the very same grapheme is realized via a [z] consonant (liaison). This shows how liaison and NLCD phonemes differ from each other, even when the graphemes are the same. On the one hand, the graphic FCs associated with the NLCD words are -c, -f, -q, -s, -t and -x. They can be illustrated via the words/tokens "croc" (n. vs. interj.), "neuf", "cinq", "tous" (pron. vs. adj.), "fait" (n.) and "dix" (pron. vs. adj.) occurring in different contexts. The possible phonemes associated with each of these graphemes are [k] for -c and -q, [f/v] for -f, [s] for -s and -x, and [t] for -t. The reader may consult Goldman et al. (1999) and Côté (2005) to find some of the NLCD words. On the other hand, the graphic FCs of the linking words are -c, -d, -g, -n, -p, -r, -s, -t, -x and -z, which can be demonstrated via the words/tokens "blanc", "grand", "long", "bon", "trop", "premier", "très", "petit", "deux" and "chantez" in different contexts. The associated liaison phonemes are [k] for -c, [t] for -d and -t, [k/g] for -g, [n] for -n, [p] for -p, [ʁ] for -r, and [z] for -s, -x and -z. Examples of related contexts for each of these words/tokens can be found in (Grevisse, 1997; BDL, 2002; Côté, 2005; Robert, 2005).
Table 1
Summary of the differences between liaison and NLCD.

                          Liaison                                     NLCD
Sequence                  w1 + w2                                     w1 + {w2, pause}
FC of w1 (realization)    Silent if uttered in isolation              May be pronounced in isolation
FC of w1 (graphemes)      -c, -d, -g, -n, -p, -r, -s, -t, -x, and -z  -c, -f, -q, -s, -t, and -x
FC of w1 (phonemes)       [k, n, p, ʁ, t, z]                          [f, k, t, s, v]
w2                        Starts with a vowel, a mute "h"             No special constraint
                          or some glides
w1 + w2 relationship      Syntactically close                         No special constraint
Context classification    Ob/Op/Fo                                    Ob/Op/Fo
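For quick reference, the grapheme–phoneme correspondences of Section 1.2 and Table 1 can be collected into two small lookup tables. The sketch below is our illustration, not part of the original system; SAMPA-style ASCII symbols stand in for the IPA phonemes.

```python
# Final-consonant graphemes and the phonemes they may surface as,
# compiled from Section 1.2 and Table 1 (SAMPA-style ASCII symbols).

# Liaison: final graphemes of linking words -> possible liaison phoneme(s)
LIAISON_PHONEMES = {
    "c": ["k"],        # blanc
    "d": ["t"],        # grand
    "g": ["k", "g"],   # long
    "n": ["n"],        # bon
    "p": ["p"],        # trop
    "r": ["R"],        # premier
    "s": ["z"],        # tres
    "t": ["t"],        # petit
    "x": ["z"],        # deux
    "z": ["z"],        # chantez
}

# Non-liaison context-dependent (NLCD) words: final grapheme -> phoneme(s)
NLCD_PHONEMES = {
    "c": ["k"],        # croc
    "f": ["f", "v"],   # neuf
    "q": ["k"],        # cinq
    "s": ["s"],        # tous
    "t": ["t"],        # fait
    "x": ["s"],        # dix
}

def candidate_phonemes(final_grapheme: str) -> set:
    """All phonemes a word-final grapheme may surface as, in either system."""
    return set(LIAISON_PHONEMES.get(final_grapheme, [])) | \
           set(NLCD_PHONEMES.get(final_grapheme, []))

print(candidate_phonemes("s"))   # {'z', 's'}: "tous azimuts" vs. "tous ensemble"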
Like liaisons, the contexts of the words from the NLCD set can be classified into three categories: Obligatory (Ob), Optional (Op) and Forbidden (Fo) contexts. Despite the "similarity" between these classifications, there is a clear difference between the contextual features of liaison and NLCD. In the case of an Obligatory NLCD context, the final consonant of a word w1 must be realized, regardless of the characteristics of w2 (vowel or consonant initial), and regardless of its presence or absence (pause). For instance, consider the word w1 "tous" [tu(s)] ("all/everyone") in the sequence ". . . les connaître tous." (ex. 9). Despite the absence of w2, the final consonant -s must be realized. The classification we propose here is the same, even though a liaison is not possible under such a condition. Just as with liaisons, an Optional NLCD context is a situation where a WFC is realized or not depending on the circumstances. Take for example the word "six" [si(s)] ("six"). The sequence "six pour cent" (ex. 10) can be pronounced in two ways (BDL, 2002). The final consonant -x can either be silent [sipuʁsɑ̃] or pronounced [sispuʁsɑ̃]. Despite the fact that w2 starts with a consonant, the final consonant of w1 can still be pronounced, something which is not possible in liaison contexts (normative French). In a Forbidden NLCD context, a WFC is not realized. Take for example the word "plus" [ply(s)] ("more/plus"). In the sequence "je n'en mange plus" (ex. 11), the final consonant -s should not be pronounced, since the word "plus" is used with a negative meaning. The differences between liaison and NLCD are summarized in Table 1.

1.3. The importance of modeling WFCs

Since very few FCs of isolated words are realized in French, dictionaries hardly ever present their pronunciations. By and large, dictionary pronunciations apply to words in isolation. However, words are used in context, and in this case the pronunciation of a WFC may vary accordingly. Therefore, pronunciations provided by dictionaries are default pronunciations, i.e. for words out of context, and such pronunciations are not enough for building a French pronunciation model for TTS systems.

A study by Mareüil et al. (2003) based on the BREF corpus (which contains 66,500 sentences of read speech) revealed that over 25% of 26,000 words (composing a word list) are possible liaison candidates, which gives an idea of the order of magnitude of the phenomenon. They considered words ending with the graphic consonants -d, -n, -p, -r, -s, -t, -x, and -z as possible liaison candidates. However, they did not take into account words ending with the consonants -c, -f, -g, and -q. We do take these factors into account in the present study, in order to model both liaison and non-liaison context-dependent WFCs.

With a view to determining their relative frequency, i.e. the most frequent ones, Boë and Tubach (1992) conducted a study using 20 h of native speech, showing that the
consonant sounds [z], [t] and [n] account for 99.7% of the produced liaisons, with shares of 50.5%, 30.4%, and 18.9%, respectively.

Due to the phonological complexities of French, predicting the realization of the WFCs is still a challenge for G2P converters. While the task of G2P converters for other languages is basically one of predicting the pronunciation of isolated words, G2P converters for French must also predict the variable pronunciation of WFCs according to the contexts where they are used. A study conducted by Yvon et al. (1998) evaluated the quality of automatic G2P conversions produced by eight different systems. The study revealed that G2P conversions are still problematic in French, since even the best systems are prone to make at least one error in every 10 sentences. The authors of this study consider liaisons and the prediction of the pronunciation of the FCs of French numbers (a subset of NLCD) as two major causes of the problems. In other words, the prediction of the phonemes associated with the WFCs is one of the major problems for French G2P converters.

Unfortunately, there is hardly any literature describing how other TTS systems handle the prediction of liaisons and NLCDs. It seems that the Bell Laboratories TTS system (Sproat, 1998; Tzoukermann, 1998) uses a rule-based method. Within text analysis processing, traces (labels) are appended at the end of each word to indicate a possible liaison candidate. These labels are tested to see whether there is a liaison rule applying to a context or not. Whenever there is, the phonetic sequence is changed accordingly. However, no details are given concerning their method (such as the coding technique and the number of liaison rules). Due to the intricacy of the problem, the number of rules needed to model these phenomena tends to be large, their interaction complex, and the code difficult to maintain. That may be one of the reasons why current models still produce so many errors, i.e. mispredictions, and this is true even if the scope is limited to Obligatory and Forbidden contexts. Taken together, this makes current converters inadequate for teaching purposes. They simply cannot meet the high level of performance required to teach pronunciation correctly.

If it were possible to record pronunciations of every bigram (or longer sequence) containing variable pronunciations (both for Obligatory and Forbidden contexts of liaison and NLCD), then these phenomena could be encapsulated inside these sequences. Since it is not very feasible to prepare such a database, current corpus-based synthesis-by-concatenation methods still require exact phonemic/phonetic transcriptions, at least for French. Hence our motivation to improve the quality of the G2P conversions for WFCs. In addition, we need to provide classification of WFC contexts, given our pedagogical goals. This is necessary because, given some input text, the learner wants to know under what conditions the realization of a WFC is obligatory, optional or prohibited.
1.4. The focus of this study

This study aims at modeling the pronunciation of words whose final consonants may vary, a variation which can be accounted for and controlled by pronunciation rules. This comprises not only liaison, but also non-liaison context-dependent pronunciations. Theoretical phonological explanations of this phenomenon are irrelevant here, i.e. they fall outside the scope of this paper.

The focus of this study is on contexts (Obligatory and Forbidden) as documented in the French phonetic literature. Contexts allowing for more than one possible pronunciation of a WFC (indicated by at least one author) were considered Optional contexts. However, for cases whose frequency of pronunciation of a WFC had been measured and documented in the literature, a threshold of at least 90% realization was assumed necessary to consider a context Obligatory. For example, Mareüil et al. (2003) state that for the sequence determiner + adjective, liaison was observed in 95.5% of the cases, against 60.6% for the sequence determiner + verb past participle. Given these results and the above threshold, a sequence like "un obscur" (ex. 12) is considered an Obligatory context, while "un élu" (ex. 13) can be regarded as optional, with a preference for realizing the liaison. In spite of that, the classification of Optional contexts here aims only at tracing the boundaries between what can be considered obligatory or optional for the purpose of second-language teaching. The complexities involved in the study of Optional contexts, as well as the problem of determining their relative frequency of occurrence with respect to (different) speakers and (different) situations, fall outside the scope of this investigation. Therefore, Optional contexts are not evaluated here. Our goals are linked to linguistic engineering issues and to the pitfalls concerning the production of rules accounting for Obligatory and Forbidden contexts. In addition, the assessment of pedagogical strategies likely to help learners internalize correct pronunciations regarding these phenomena is left for future work.

We propose to investigate how efficiently the pronunciation of WFCs and their context classification can be modeled via the C4.5 decision tree learning algorithm (Quinlan, 1993). Decision trees are used because they allow us to convert structured data sets into the respective sets of hard-coded rules necessary for the implementation of a computer-assisted language learning (CALL) system. With these rules, it is possible to identify the contexts where the WFCs have variable pronunciations. For this purpose, a training database is required. This data can be created on the basis of lexical, phonetic, morpho-syntactic and other properties characterizing different contexts. Taken together, they can be used for populating the database necessary for training the decision trees. Put differently, taken together they allow us to automatically induce the pronunciation rules needed for the production of WFCs.

The remainder of this paper is organized as follows: Section 2 explains our method for data collection; Section 3
describes the training and prediction algorithms; Sections 4 and 5 present the results and discussions related to the evaluation of the proposed method, while conclusions and future perspectives are given in Section 6.

2. The creation of the WFC database

The creation of the database is a manual process which consists of three major steps, as depicted in Fig. 1. Here, we deal with the data selection and clustering, the specification of features (attributes) and values, and the creation of data sets. Each of these steps is explained in the next three subsections.

2.1. Step 1: data selection and clustering

Rules of pronunciation (R) for words with variable pronunciation of their final consonants can be obtained from the French phonetic literature. The reader may consult the references cited in Section 1 under the entry Liaison. Relevant information can also be gleaned under the heading variable pronunciation of final consonants (Côté, 2005), or under special issues of French pronunciation (Goldman et al., 1999; BDL, 2002). Additional information can be found in the dictionaries edited by Corréard et al. (2003) or Robert (2005). Search engines operating on large text corpora labeled with morpho-syntactic information (Corpuseye, 2008) or liaison type (Durand et al., 2005) are also very useful for finding linguistic contexts in which the words discussed here (those allowing for various pronunciations with respect to the final consonant) are used. From these sources, we collected about 2000 contexts of WFC variation in order to build the set of training data. The reader can find some of them on the authors' website (Pontes, 2008). This selection was motivated by our goal to be representative; that is, we wanted to compile only examples (and counter-examples) in which the phonetic variation phenomena clearly occur. Our purpose here, however, is not to describe the pronunciation rules themselves, because they are too numerous and they are already documented in the literature. Instead, our target is to show how to organize this knowledge in a way that is computationally feasible, reducing complexity while increasing the maintainability of the model.

Most of the pronunciation variations (liaison and NLCD) can be identified by contextual features. Contextual features are described in terms of attributes and values
extractable from the contexts where the variations occur. They refer to information like "current lexical word" (w1), "part-of-speech (PoS) of the next word", "initial phoneme of the next word", and so on. Based on phonological/linguistic knowledge, we manually cluster the rules of pronunciation R into sets of rules Ri, with {i | 1 ≤ i ≤ m}. Each Ri is compiled from a number of similar cases with respect to the pronunciation of WFCs, that is, cases sharing contextual features. For instance, the words "six" and "dix" ("six" and "ten") are grouped together into a single set Ri, given the fact that the contextual attributes and values determining the pronunciation of one also hold for the other (BDL, 2002).

The purpose of the index i is not only to list a set of rules sharing contextual features, but also to disambiguate contexts. In this case, the index is used for contextual ranking. Particular/exceptional cases in which a (liaison or NLCD) word is used are likely to have precedence over general cases in which the same word is used. In such a situation, the particular cases should be handled first, by one set of rules, let us say Ri, and the general cases by another, Rt, with {t | i < t ≤ m}. This is because the amount of data required to distinguish exceptions from general cases is likely to increase exponentially (cf. Eq. (2)) if all of them were clustered together (to make a single decision tree).

To predict pronunciation and context classification for the WFCs in French, care needs to be taken with respect to the order in which the sets of rules are tested during the text analysis. This is because the final output may depend on the test order. For example, consider a sentence like "vont-ils | arriver?" (ex. 14), where liaison is Forbidden between the subject pronoun "ils" (w1) and the verb "arriver" (w2).2 Due to the verb–subject pronoun inversion, a liaison between the pronoun and the successor word is not possible (Germain-Rutherford, 2005). Notice that, in ordinary non-inversion contexts, a liaison is required when a subject pronoun ending with a consonant is followed by a vowel-initial verb, such as in "ils arrivent". However, if a test performed first only requires the existence of the sequence subject pronoun + verb and the respect of the liaison constraints (of Table 1), then the algorithm could easily misclassify this context as Obligatory. This situation illustrates an ambiguity that needs to be resolved. This is done in the Bell Laboratories French TTS system by applying overruling techniques (Tzoukermann, 1998). Basically, some modification is applied to the phonetic sequence first and, later on, it may be rewritten by other rules, before producing the final output. This strategy, however, might increase the complexity of the final code, making it difficult for humans to understand it, find bugs, and correct them. In our case, instead of overruling, we establish priorities with respect to the context classification. In order to set up priorities, we recommend clustering the
Fig. 1. The data preparation process.
2 Although there is a liaison between “vont” and “ils”, our attention here is turned to the words “ils” and “arriver”.
sets of rules in the following order: Forbidden > Obligatory > Optional. The preliminary evaluation of known Forbidden contexts reduces the search space for the remaining cases, given that fewer tests are needed to handle them. That being the case, we handle the Obligatory liaison contexts for a problematic word in a different set of rules Rt. The number of rules of the set Rt is reduced (or "simplified") when the Forbidden contexts of this word have been previously tested by a set Ri, with {t | i < t ≤ m}. In the case of example 14, if the pronoun is used in a Forbidden context, then Ri is responsible for identifying and handling it. Ri contains a set of strict tests which filter out the Forbidden contexts of the problematic word. But if the context does not fit any of the tests, the responsibility is transferred to the set Rt, which will take care of the problem. Notice that Rt does not need to perform any of the tests of Ri anymore. In practice, this technique not only reduces the size of the sets of rules, but also controls the order in which the rules are tested/applied. Therefore, setting appropriate priorities between the rule-sets is essential in order to resolve ambiguities.

Handling Forbidden contexts separately in advance, however, may not apply to all situations. The fact that a word w2 starts with an aspirated "h", for instance, does not necessarily imply that a liaison would be possible if this word were replaced by a vowel-initial one. Consider the case of an aspirated "h" at the beginning of a word in the sentence "Ce sont cinq haricots". Since this is not a context for liaison, it would be an error to conclude right from the start that the aspirated "h" of "haricots" classifies this context as Forbidden. The point here is: we need to make sure first whether there is room for liaison or not, in order to use the aspirated "h" as the reason for classifying a context as Forbidden. This means that Forbidden contexts with an aspirated "h" should be processed inside every rule-set in charge of handling potential Obligatory or Optional liaison contexts. This strategy allows the G2P converter to clearly identify in which specific set of rules the aspirated "h" prohibits the liaison. Otherwise, non-liaison contexts could be incorrectly tagged as Forbidden.

Another weak point that deserves special attention when predicting the pronunciation of the WFCs, as well as when classifying contexts, is fixed expressions or locutions. Numerous expressions in French are composed of a fixed sequence of tokens, such as "tout à coup" (ex. 15) of length 3 and "Jeux Olympiques" (ex. 16) of length 2. We define the length of a fixed expression as the number of tokens of which it is composed. A token with variable pronunciation of its final consonant is supposed to be pronounced differently depending on its function, that is, whether it is (part of) a fixed expression or not. For example, the final "t" of "tout" must be pronounced in "tout à coup", while in "le tout | ou la partie" it should remain silent. Since the length of such expressions, as well as the relative position of the token causing a variable pronunciation of its final consonant, may vary, the number of attributes
required to identify them varies accordingly. Hence, the identification of fixed expressions of different lengths would require a huge number of attributes, values and tests if they were all handled together in a single set of rules. In order to reduce this complexity, we suggest splitting fixed expressions into several sets of rules, such as R2, R3, R4 and so on. Every set is compiled from expressions having the same length. This technique allows for a neat arrangement of the data, given the match between the length of the fixed expressions and the number of columns of their respective tables. The arrangement of the data into tables is explained in Section 2.3. An additional requirement is that the greater the length of a fixed expression, the smaller the index i in Ri needs to be, in order to resolve possible ambiguities likely to occur between overlapping expressions of different lengths. Covering the longer contexts first allows shorter contexts to be precisely identified later, among the remaining ones, when fewer features are tested. This strategy permits the disambiguation of overlapping expressions, such as "de plus en plus" (ex. 17) (of length 4) and "en plus" (ex. 18) (of length 2).

The main advantage of creating rule-sets dedicated to handling fixed expressions is that it yields the simplification of other rule-sets dealing with the same words/tokens, but in different contexts. The pronunciation of the final consonant of a token in a fixed expression is normally considered an exception to the general rules associated with such a word/token. Therefore, processing fixed expressions of equal length together has the advantage of reducing the complexity of the general rules, which are handled later.

These are the main ideas required for clustering the data. The clusters resulting from the application of these techniques are represented in Fig. 1 as Step 1. Some of them are shown in Table 2, which presents a brief description of some of the sets of rules as implemented in our prototype system. The description of the other sets and the corresponding examples can be found on the authors' website (Pontes, 2008). Our prototype accounts for about 60 sets of rules (m ≈ 60), each corresponding to one decision tree used for modeling the variable pronunciation of the WFCs. The number of trees may be different for other implementations and data. In our case, however, informal experiments confirmed that this number achieved a good balance between the number of attributes per tree and the number of trees. On the one hand, if m is very small, too many attributes are likely to be used in the decision trees. In this case, the trees would become large, errors would be difficult to find, and the training data would be harder to prepare. On the other hand, if m is large, the same attributes could appear in several trees without reducing their average size.

In summary, this part of the data preparation deals with the systematic arrangement of words or groups of words having variable, context-dependent FC pronunciations. We cluster them into appropriate sets of rules, where each set deals with similar cases of WFC variation, provided there is enough contextual information for the identification of each case.
Table 2
Description of some of our sets of rules (clusters).

i     Set of rules description
1     Verb–object and inversion verb–subject pronouns – prohibited liaisons
2     Contexts of liaison in fixed expressions
10    Plural personal and objective pronouns in verbal contexts
16    Verb–object and inversion verb–subject pronouns – obligatory liaisons
22    The number "neuf" and its compounds
27    The numbers "six" or "dix" and their compounds
31    The word "os"
36    The word "tous" and its contexts
54    The adverbs "bien" and "moins"
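The ordering constraints discussed above — Forbidden-context sets and long fixed expressions receive small indices so that they are consulted before the more general sets covering the same tokens — amount to keeping the rule-sets sorted by their index i. The following is a minimal sketch under our own (hypothetical) class and handler names, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class RuleSet:
    index: int                       # the index i of R_i: smaller = tested earlier
    description: str
    handler: Callable[[dict], str]   # returns WFC information, or "skip"

def always_skip(context: dict) -> str:
    return "skip"                    # placeholder handler for this sketch

# A few of the clusters of Table 2. The prohibited-liaison set R1 and the
# fixed-expression set R2 get the smallest indices, so they are consulted
# before the general sets dealing with the same words (e.g. R16, R36).
registry = sorted(
    [
        RuleSet(16, "verb-object / inverted verb-subject pronouns (obligatory)", always_skip),
        RuleSet(1,  "verb-object / inverted verb-subject pronouns (prohibited)", always_skip),
        RuleSet(2,  "contexts of liaison in fixed expressions", always_skip),
        RuleSet(36, 'the word "tous" and its contexts', always_skip),
    ],
    key=lambda rs: rs.index,
)
print([rs.index for rs in registry])   # [1, 2, 16, 36]
```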
Special attention should be given to cases where ambiguities may occur. Basically, this is done by forcing particular cases to have precedence over general cases, avoiding overruling and its inherent complexities. Given a set of rules, Section 2.2 shows how to specify the features necessary to identify the contexts, allowing the pronunciation of WFCs to be predicted.

2.2. Step 2: specification and classification of attributes

Taking phonological/linguistic knowledge into account, Step 2 (Fig. 1) stands for the manual retrieval and labeling of the attributes and (possible) values needed for the contextual identification of a given rule-set Ri. Examples of possible attributes are "PoS (of the current word)", "final letter (of the current word)", "initial phoneme of next word", etc. We assume here that the features of a lexical G2P and PoS tags are available for this pre-processing. Let Yij be an attribute among the ni attributes (Yi1, Yi2, . . . , Yini) of a set of rules Ri, with {j | 1 ≤ j ≤ ni}. The possible values of the attribute Yij are classified into pij categorical values by assigning a label Lk to each category, with {k | 1 ≤ k ≤ pij}. For example, by applying this operation to the attribute "initial phoneme of next word"
(init_phon_nw), the following categorical values can be obtained: "vowel", "consonant", "aspirated_h", "mute_h" and "other". Note that we do not explicitly include the glides in the list, in order to prevent unnecessary growth of the data size. In practical implementation terms, a glide allowing for liaison can be considered a kind of "mute_h", while an unlinkable one can fall into the same category as an "aspirated_h".

Choosing appropriate labels to represent the meaning of a given value is without any doubt important for their understanding and subsequent use by knowledge engineers (developers/phoneticians). These labels, defined here as categorical labels, contain useful information required for contextual identification. Most of the values can be grouped into very few categories. A category is normally defined by a group of similar elements, such as words, phonemes, letters, PoS, and punctuation marks. Obviously, a word allowing for variable pronunciations of its final consonant can itself be used as a categorical label, provided it is representative for the rule under consideration. Labels like "les" and "des", or "nous" and "vous", might be meaningful. One important point at this step is that every attribute Yij must have one extra categorical label, such as "other", indicating that a given element does not apply to any category of that attribute. This additional label forces the node to split in a decision tree, allowing scrutinized use of the available attributes in order to distinguish each target value on the basis of the context. The more precisely and completely the categories are specified, the better the trees are populated and trained.

Let us take a concrete example to illustrate how the classification of attributes can be implemented. Consider the phonetic constraints associated with the word w1 "neuf" [nœf] ("nine"), whose final-consonant pronunciation depends on the context. Three different types of contexts are identified. All of them are arranged into one set of rules Ri. First, when "neuf" is followed by the word w2 "heures" ("hours"), "ans" ("years") or "autres" ("other"), the final graphic consonant -f is pronounced with the [v] sound (Côté, 2005). For example: "il est neuf heures" (ex. 19), "j'ai neuf ans" (ex. 20) and "neuf autres personnes" (ex. 21). Second, when "neuf" is followed by the word w2 "hommes" ("men") or "enfants" ("children"), the final consonant -f is pronounced with either [f] or [v] (Côté, 2005; Pierret, 1994), as in "neuf hommes travaillent ici" (ex. 22). Third, when "neuf" is followed by a pause or any word other than the ones mentioned, the "normal" lexical pronunciation [nœf] is maintained. Table 3 shows the set of attributes and values required to identify all possible contexts of the word w1 "neuf", accompanied by a next word w2 having any of the following features:

(i) Be a member of the group identified by the categorical label "type_v", which is composed of the words "heures", "ans" and "autres". In this case, the pronunciation of "neuf" is [nœv];
(ii) Be identified by the categorical label "type_f_v", representing either the word "hommes" or "enfants", which may yield the pronunciation [nœf] or [nœv];
(iii) Be covered by the categorical label "other", representing any other word or a pause, which results in the maintenance of the ordinary lexical pronunciation [nœf].

Table 3
Attributes (header) and categorical values (remaining rows) required for identifying the contexts of the word "neuf".

w1       w2
neuf     type_v
Other    type_f_v
         other

If the current word w1 is something other than "neuf", signaled here via the categorical label "other", then a set of rules other than Ri is used to process this context. Concerning the classification and specification of attributes, we use strategies similar to the one presented above to handle all other sets of rules. Phonological (and more generally linguistic) knowledge is essential for determining the possible contexts in which the phenomena addressed here may occur. As indicated in Step 2 of Fig. 1, each set of rules identified in Step 1 generates a table having a set of ni attributes with corresponding categorical values. The next subsection explains how to organize this information in order to generate the training data.

2.3. Step 3: database specification

Step 3 consists in the creation of the WFC training database. Each set of rules Ri corresponds to a table Ti, composed of the attributes Yi1, Yi2, . . . , Yini and one target output attribute called ki. A table Ti is filled by computing, over all attributes, the Cartesian Product of all values of an attribute against all values of the next attribute. Once this is done, we manually append the target attribute ki to the table. These operations are summarized by the following equation:

Ti = append(Yi1 × Yi2 × ⋯ × Yini, ki's),    (1)

where ki is the information required for handling the variable pronunciations, also called WFC information. It can assume four types of values. The first type specifies a set of features required for handling variable pronunciations of a WFC. In practical terms, it is a string composed of six slots separated by a hyphen "-", with the following data in the respective fields:

(i) An index i, representing the set of rules Ri in charge of handling the case;
(ii) A label for the context classification (Ob, Op, Fo);
(iii) A pointer to a function that performs one specific kind of change to the phonetic sequence. In other words, it specifies how to change the lexical pronunciation;3
(iv) A WFC phoneme for the phonetic sequence change;
(v) An extra phoneme associated with this context, such as in the case of vowel coarticulation effects;4

3 Insertion, deletion or replacement of phonemes, considering possible coarticulation effects.
4 Sometimes, the insertion of a liaison consonant is accompanied by a change of the final vowel of the word w1. For example, the sound [ɛ̃] of the word "divin" [divɛ̃] (in isolation) changes to the phoneme [i] in the context "divin Enfant" [divinɑ̃fɑ̃].
(vi) An additional phoneme to be used in case an alternative pronunciation is available.

With this information properly specified, it is possible to perform all the phonetic changes at the word boundaries. The second type of WFC information addresses the cases where no phonological change should be applied to a particular context, implying that the ordinary lexical pronunciation of the current word should remain unchanged. The label "ordinary" is used for this purpose. The third type is meant to convey that the current set of rules Ri does not contribute at all to handling this context, letting another set Rt, with t > i, take care of it. A label such as "skip" is used for this purpose. This label is required because the solution for a given context may not be found in a tree. In that case, the execution pointer needs to "escape" from a leaf node and continue the search in the adjacent trees until the solution is found. Finally, the fourth type refers to a combination of attribute values yielding incorrect grammar, an impossible phonetic combination or an illogical sequence. Since these combinations do not comply with linguistic constraints, the label "invalid" is used to refer to such cases.

This process of table creation is repeated for all m sets of rules. As a result, m tables are generated, each one corresponding to a set of rules Ri. Eq. (2) captures the total number of tuples (#Tuples) contained in this database:
#Tuples = Σ_{i=1}^{m} Π_{j=1}^{ni} pij.    (2)
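Because the first type of WFC information is a flat, six-slot hyphen-separated string, and the other three types are the bare labels "ordinary", "skip" and "invalid", decoding it takes only a few lines. The parser below is a hypothetical sketch of ours; the paper does not publish its own decoding routine.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WFCInfo:
    rule_set: int        # (i)   index of the rule-set R_i handling the case
    context: str         # (ii)  "Ob", "Op" or "Fo"
    change_func: int     # (iii) id of the phonetic-sequence change function
    phoneme: str         # (iv)  WFC phoneme for the sequence change
    coarticulation: str  # (v)   extra phoneme (vowel change); may be empty
    alternative: str     # (vi)  alternative-pronunciation phoneme; may be empty

def parse_wfc_info(value: str) -> Optional[WFCInfo]:
    """Decode a target-attribute value; None for the three special labels."""
    if value in ("ordinary", "skip", "invalid"):
        return None
    # Fields (i)-(iv) are assumed filled; (v) and (vi) may be blank slots.
    i, ctx, func, ph, coart, alt = (slot.strip() for slot in value.split("-"))
    return WFCInfo(int(i), ctx, int(func), ph, coart, alt)

# Example from Table 4: "neuf" followed by a type_v word ("heures", ...)
print(parse_wfc_info("22-Ob-6-v- -"))   # WFCInfo(rule_set=22, context='Ob', ...)
```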
In practice, the implementation of these procedures led to the creation of a database of approximately 60 tables containing a total of around 40,000 tuples. However, since tuples which are linguistically invalid (ki = "invalid") are not supposed to match any input text sequence, we removed them, leaving only about 10,000 entries. Nearly 70% of these were designed to model the liaison phenomenon, while the remaining 30% modeled NLCD pronunciations. We spent approximately four months collecting and developing the present set of data. This is labor-intensive work which requires (1) selecting and collecting appropriate samples for context distinction and (2) identifying the relative importance of lexical, phonological, morpho-syntactic or semantic elements for each set of rules, as well as classifying and labeling the cases.
Table 4
Data required for training the decision tree in charge of predicting the pronunciation of the FC of the word "neuf". The header contains the attribute names, while the remaining rows accommodate the combinations of all possible categorical values.

w1       w2          k22
neuf     type_v      22-Ob-6-v- -
neuf     type_f_v    22-Op-18- - -v
neuf     Other       ordinary
Other    type_v      skip
Other    type_f_v    skip
Other    Other       skip
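The construction of such a table — the Cartesian Product of Eq. (1) followed by the manual tagging of ki — can be reproduced for the "neuf" example as follows. This is our sketch; the ki strings are copied from Table 4.

```python
from itertools import product

# Attribute values from Table 3
w1_values = ["neuf", "Other"]
w2_values = ["type_v", "type_f_v", "Other"]

# Cartesian Product of the attribute values (Eq. (1)): 2 x 3 = 6 rows
rows = list(product(w1_values, w2_values))

# Target attribute k22, tagged manually for each row (cf. Table 4)
k22 = ["22-Ob-6-v- -", "22-Op-18- - -v", "ordinary", "skip", "skip", "skip"]

table_22 = [(w1, w2, k) for (w1, w2), k in zip(rows, k22)]
for w1, w2, k in table_22:
    print(f"{w1:6} {w2:10} {k}")
```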
Table 5
Walkthrough of the data preparation procedures.

Tasks     Input                           Transformation                  Output
Step 1    WFC samples                     Arrangement of similar cases    Sets of rules
Step 2    Sets of rules                   Feature specification           Lists of attributes + values
Step 3    Lists of attributes + values    Cartesian Product + k           WFC database
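As an illustration of the feature specification performed in Step 2, the attribute init_phon_nw of Section 2.2 can be approximated by a small classification function. This is our grapheme-based sketch: the aspirated-"h" list is a tiny illustrative excerpt, and a real system would consult a lexicon for the mute/aspirated distinction.

```python
# Sketch of the Step 2 value classification for the attribute
# "initial phoneme of next word" (init_phon_nw), cf. Section 2.2.
ASPIRATED_H = {"haricots", "hasard", "honte"}   # illustrative, not exhaustive
VOWELS = set("aeiouyâàéèêëîïôùûü")

def init_phon_nw(next_word: str) -> str:
    """Map w2 to one of the categorical labels used to train the trees."""
    if not next_word:
        return "other"                      # pause / end of sentence
    w = next_word.lower()
    if w[0] == "h":
        return "aspirated_h" if w in ASPIRATED_H else "mute_h"
    if w[0] in VOWELS:
        return "vowel"
    if w[0].isalpha():
        return "consonant"
    return "other"                          # punctuation, digits, ...

assert init_phon_nw("amis") == "vowel"
assert init_phon_nw("haricots") == "aspirated_h"
```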
Let us get back to our example with the word "neuf" and implement the procedures described here in Step 3. Considering the attributes w1 and w2 defined in Table 3, the Cartesian Product of their values results in the first two columns of Table 4. Next, each of these rows is manually tagged with the appropriate WFC information, describing how to handle each case. For example, the string "22-Ob-6-v- -" (cf. row 1) stands for the WFC information applying to a context where "neuf" is followed by a word identified by the label "type_v" ("heures", "ans" or "autres"). The details concerning this string are as follows:

(i) The first field is represented by the value 22, which refers to the set of rules R22;
(ii) The second field classifies this context as Obligatory (Ob);
(iii) The third field contains the value 6, which points to a function whose task is to change the original phoneme [f] into [v];
(iv) The fourth field is the consonant phoneme [v], which replaces the original phoneme [f];
(v) The fifth field is empty because no coarticulation effect takes place;
(vi) The sixth field is empty because no alternative pronunciation is possible for this case.

This example illustrates Step 3 as depicted in Fig. 1, where each set of rules Ri generates one table Ti of the training database. The process of data elaboration is summarized in Table 5. This database is the input for training the decision trees, as explained in the next section.

3. Training and prediction algorithms

This section is divided into two parts. First, we explain how to use the data to create the decision trees. Second, we discuss how to connect these trees in order to process a given input text. Each of these algorithms is described in the next subsections.

3.1. Training of a decision tree model

Given training data, one might think for a moment that the problem of predicting the pronunciation of the WFCs
could be resolved by creating a single decision tree. In practice, however, a tree holding all the features may not be capable of applying the rules in the proper order to avoid ambiguities. If we tried to remedy this situation by computing the Cartesian Product over all possible categorical values of all attributes, it would become unfeasible to train such a tree, because of the exponential growth of the data size (cf. Eq. (2)). Thus, instead of creating a single tree, we use the data of each table Ti to train a corresponding decision tree DTi, which is in charge of handling similar contexts. Due to the particularities of each group of rules, every tree requires a specific set of features for determining the variable pronunciations. This technique allows individual trees to focus on just a limited set of features, i.e. only those necessary for handling a limited number of contexts. Fig. 2 depicts the trees trained from the previously specified database. Each of them is able to perform tests over its specific set of attributes at the node levels, while the resulting predictions are available in the leaves. As explained in Section 2.3, whenever a set of rules does not contribute at all to handling a context, the label "skip" is found. The tuples containing this label for ki are informative for the training of the trees, because they differentiate contexts from each other and provide a means of "escape" when a DTi is not able to handle a context. Without this label, it would not be possible to know whether the search for the solution should stop or not after reaching a leaf node.

In addition, the following two settings of the decision tree classifier are important: (1) a tree should be able to hold at least one instance5 per node and (2) the trees should not be pruned. These settings allow for the creation of all possible valid paths (from the root to the leaf nodes), given that the training data has been previously generated by the computation of the Cartesian Product. Note that the finer the level of detail utilized during the training, the better the model will fit the data. The purpose of overfitting the trees to the data is to guarantee that the model achieves the same level of performance as the rules implicit in the training data.

We avoid tree pruning for two reasons. The first is that a single leaf node can contain the prediction of up to six interdependent parameters (cf. Section 2.3). This set is required by the G2P converter for performing the appropriate changes to the phonetic sequence, as well as for classifying the contexts (Ob, Op or Fo). Since all these

5 An instance is a record of a data set from which a model is learned (Kohavi and Provost, 1998).
Fig. 2. The training algorithm.
parameters are produced together for a single prediction, the whole set would become ineffective as soon as any of them contained incorrect information concerning the handling of the current context. For example, even if the FC phoneme were correctly predicted, it would be useless if it were associated with incorrect information concerning how to change the phonetic sequence. Second, note that we used the Cartesian Product to generate all possible combinations of categorical values for the attributes of a given table. Assuming that the data has been correctly specified, the training process should cover all valid possibilities, and consequently the predictions based on this data should be correct. In practice, however, labeling errors might have been unintentionally introduced into the data. For example, it may be that not all possible contexts of a word were considered by the human expert during the procedure of label specification. It is also possible that certain categorical labels contain information that is too general to allow for proper context identification. And finally, human errors may have been produced during the assignment of the target attribute values (ki). Obviously, we tried to minimize these kinds of errors as much as possible by carefully labeling the data.

Although the trees are not pruned, this method does allow predictions concerning the pronunciation of the FCs of unseen words to be generalized. This is possible because decisions concerning the goodness of fit of a word with regard to a set of rules are made not only on the basis of lexical entries, but also on the basis of clusters of words, such as PoS labels. So, even if a word is not explicitly included in the training database, decisions regarding the pronunciation of its FC may be made on the basis of these categories. Our target being education, this research aims at predicting pronunciations as well as possible in the light of the available data. Errors can be progressively reduced by continuously updating/increasing the training data with the most recently available problematic cases. Hence, the performance of this method is related to the extent and the completeness of the database covering the various possibilities of pronouncing WFCs.

The application of the procedures described here to our example with the word "neuf" generates the tree depicted in Fig. 3. Briefly, if the current word w1 is classified as "other" than "neuf", then the output is "skip"; this is what happens for three tuples of the training data (the last rows of Table 4). However, if the value of w1 is "neuf", then a more careful examination concerning the value of w2 is required, as indicated by the remaining nodes. At last, the corresponding WFC information is used by the G2P converter to perform the pronunciation change(s) and classify the context. The remaining trees are trained in a similar manner. Once they are generated, the next step is to connect them in order to allow testing certain features of an input text against these data structures. This is explained in the following subsection.
Fig. 3. Decision tree for the word “neuf”.
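The paper trains C4.5 trees; as a rough stand-in, the sketch below fits an unpruned CART tree (scikit-learn) to the "neuf" table of Table 4, with the two settings discussed above: nodes may hold a single instance, and no pruning is applied. This is our illustration, not the authors' implementation; note that C4.5 splits multiway on categorical attributes, whereas CART makes binary threshold splits on the ordinally encoded values.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Training data for DT_22 (cf. Table 4): categorical attributes w1, w2 -> k22
X = [["neuf", "type_v"], ["neuf", "type_f_v"], ["neuf", "Other"],
     ["Other", "type_v"], ["Other", "type_f_v"], ["Other", "Other"]]
y = ["22-Ob-6-v- -", "22-Op-18- - -v", "ordinary", "skip", "skip", "skip"]

encoder = OrdinalEncoder()              # encode the categorical labels
X_enc = encoder.fit_transform(X)

# No pruning, nodes may hold a single instance: the tree deliberately
# overfits the exhaustively generated table (cf. Section 3.1).
dt_22 = DecisionTreeClassifier(min_samples_leaf=1, ccp_alpha=0.0)
dt_22.fit(X_enc, y)

# Predict the WFC information for "neuf heures": w2 classified as "type_v"
print(dt_22.predict(encoder.transform([["neuf", "type_v"]])))  # ['22-Ob-6-v- -']
```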
3.2. The WFC prediction algorithm

The prediction algorithm stands for the integration of the automatically generated trees into the G2P converter, as well as for the description of the various steps the algorithm goes through during run-time. This integration occurs in the post-lexical analysis routine of the G2P converter. Initially, it is assumed that every word/token ending with any of the following 12 graphic consonants -c, -d, -f, -g, -n, -p, -q, -r, -s, -t, -x, and -z allows in principle various pronunciations, and should therefore be analyzed. To this end, given the relative position of analysis of a word/token, we extract and store in memory several general parameters concerning its context: for example, the current word, some of the neighbouring words, their parts-of-speech, and the lexical phoneme sequences of the current and following words. Then Module 1 (cf. Fig. 4), which is in charge of testing these features against the first tree, is called. Module 1 is the algorithm which performs the following operations: (1) classification of the input parameters concerning the current position of analysis, assigning the same set of labels prepared/defined for rule-set R1 during Step 2, and (2) use of the tree DT1 to predict the pronunciation of the WFC for this context. If the current word/token and context are identified by this tree, then the WFC information (the value of k1) solving this case is used for changing the phoneme sequence and classifying the context. However, if the label "ordinary" is found, no phonological change is necessary. Otherwise, Module 1 does not contribute anything to processing this context. The label "skip" returned by DT1 conveys this information, and the task is transferred to Module 2.
Fig. 4. The prediction algorithm.
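A minimal sketch of this module-chaining control flow (Fig. 4), under the assumption that each module wraps one trained tree and returns the label "skip" for contexts outside its purview (the function names are ours):

```python
def predict_wfc(context, modules):
    """Try the modules in priority order until one handles the context.

    Each module classifies the raw context into its own categorical labels
    and queries its decision tree DT_i; "skip" transfers the task onward.
    """
    for module in modules:
        result = module(context)
        if result == "skip":
            continue                 # R_i does not cover this context
        if result == "ordinary":
            return None              # keep the lexical pronunciation
        return result                # WFC information: change + Ob/Op/Fo
    return None                      # "inapplicable": no module matched

# Example with two toy modules (hypothetical handlers):
mod1 = lambda ctx: "skip"                      # R_1 declines
mod2 = lambda ctx: "22-Ob-6-v- -" if ctx == ("neuf", "heures") else "skip"
print(predict_wfc(("neuf", "heures"), [mod1, mod2]))   # 22-Ob-6-v- -
```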
The next Modules are built in a similar fashion. Each one performs the classification required for its attributes, given the current position of analysis in the input text, followed by the tests checking the familiarity of this context. Thus, every Module has its own mechanisms to identify whether a certain context is within its purview or not. If none of the Modules is able to recognize a context, then the phoneme sequence remains unchanged, as indicated by the word "inapplicable" in Fig. 4. Once the analysis of the current word is finished, the algorithm proceeds to the next word/token ending with one of those consonants, repeating the same process until the end of the input.

4. Evaluation 1: comparison with state-of-the-art G2P converters

We conducted a twofold evaluation of the proposed method. First, we compare the predictions of liaisons produced by our prototype system with those generated by six other text-to-speech synthesizers (Yvon et al., 1998). This part is addressed in this section. Later, in Section 5, we perform a general evaluation considering the pronunciation of WFCs as well as context classifications.

The experiments of the first part were done using a French text corpus extracted from the newspaper Le Monde of January 1987. This data was used only for evaluation purposes. It has approximately 26,000 words/tokens distributed over about 2000 sentences. Roughly, it contains 1500 liaison candidates, of which about 600 are obligatory. Further details concerning this data can be found in (Yvon et al., 1998). Although context classification labels are not available, corresponding phonetic transcriptions accompany this corpus. The phonetic sequences were elaborated by human experts, including the various acceptable pronunciations of the words/tokens of the corpus.

Yvon et al. (1998) used this data to conduct an objective evaluation of G2P conversion for French TTS synthesis in eight systems (A, B, C, D, E, F, G, and H). However, the performance of systems A and E could not be reported due to misalignment problems. This (early) study included global aspects of the G2P conversions. One of the aspects investigated was the nature and diversity of the sources of
4. Evaluation 1: comparison with state-of-the-art G2P converters

We conduct a twofold evaluation of the proposed method. First, we compare the liaison predictions produced by our prototype system with those generated by six other text-to-speech synthesizers (Yvon et al., 1998); this part is addressed in the present section. Later, in Section 5, we perform a general evaluation covering the pronunciation of WFCs as well as context classification.

The experiments of this first part were done using a French text corpus extracted from the newspaper Le Monde of January 1987. This data was used for evaluation purposes only. It comprises approximately 26,000 words/tokens distributed over about 2000 sentences, and contains roughly 1500 liaison candidates, of which about 600 are obligatory. Further details on this data can be found in Yvon et al. (1998). Although context classification labels are not available, the corpus comes with phonetic transcriptions elaborated by human experts, including the various acceptable pronunciations of its words/tokens. Yvon et al. (1998) used this data to conduct an objective evaluation of G2P conversion for French TTS synthesis in eight systems (A, B, C, D, E, F, G, and H); the performance of systems A and E could not be reported due to misalignment problems. This (early) study covered global aspects of G2P conversion, one of which was the nature and diversity of the sources of
errors. They concluded that liaison was one of the major sources of errors among the synthesizers. While their paper specifies the number of errors caused by the misprediction of liaisons, the number of errors caused by the incorrect prediction of NLCD phonemes is left unspecified, and context classification was not part of their evaluation. Thus, to be fair, we can only compare the number of liaison misprediction errors between six of these systems and ours.

The experiments consisted in feeding the corpus to the prototype system as input in order to generate the G2P conversions. Next, the automatic conversions were aligned with the text corpus as well as with the transcribed corpus. For words/tokens with a unique pronunciation, finding a mismatch was as simple as comparing two strings, given that the phonetic sequences were aligned and the phonetic alphabet was identical for both transcriptions. For words/tokens with multiple pronunciations, all the possible pronunciations indicated in the transcribed corpus were generated. A match was identified whenever one of the optional pronunciations of a word/token corresponded to the automatically produced phonetic sequence; if none of them did, a mismatch was marked. Next, a list of all mismatching pronunciations was generated. From this list, we selected the words/tokens ending with one of the graphic consonants considered as possible liaison candidates. This list was then filtered again, separating entries with a liaison error from entries with other types of error. The result was a list of words/tokens containing liaison-related errors.
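The matching criterion just described amounts to the small Python check below: an automatic phonetization counts as correct if it equals any of the reference variants listed for that word/token. Names and sample data are purely illustrative, not taken from the Yvon et al. (1998) corpus.

def is_match(automatic, reference_variants):
    """True if the system output equals one of the acceptable reference
    pronunciations (both sides use the same phonetic alphabet)."""
    return automatic in reference_variants

# e.g. "plus" admits several transcriptions depending on its reading
reference = {"plus": {"plys", "ply", "plyz"}}
assert is_match("ply", reference["plus"])       # match: acceptable variant
assert not is_match("plu", reference["plus"])   # mismatch: marked as error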
Table 6 shows the results of our method compared with those obtained by Yvon et al. (1998). Unfortunately, we lack information on the methods used by these six systems to predict liaison or the pronunciation of WFCs, so we cannot explain why one method is superior to another. In addition, since the types of liaison errors produced by these six systems are unknown, it is not possible to compare them and determine which types of liaison are the most difficult to predict. Nevertheless, the available figures allow us to infer that, with regard to liaisons, our approach is comparable to the 2nd best French G2P converter of this (early) evaluation.

Table 6
Number of liaison errors produced by six previously evaluated systems compared with our prototype system.

System          B    C    D    F    G    Prototype    H
Liaison errors  111  123  76   49   38   34           15
5. Evaluation 2: general evaluation of the method

Another set of experiments has been performed in order to check the performance of our method with regard to its
capacity to predict the pronunciation of WFCs and to classify contexts. Here we compare the outputs produced by our prototype system against those of a commercial text-to-speech synthesizer. We present the experiments and results in the following three subsections.

5.1. Experimental setup

The experiments of this second part of the evaluation were done using a French text corpus with its corresponding audio files, both extracted from NHK (2007) broadcast news of May 2007. The broadcast audio files were produced by native speakers, whose speech represents the standard Parisian French accent ("français de référence"). A total of 16,449 words (expanded form), distributed over 618 sentences extracted from 103 short articles, composed the evaluation text corpus. As before, this corpus was used for evaluation purposes only. The settings for these experiments are summarized in three steps, listed below and explained in turn:

(i) Annotation and analysis of the corpus data;
(ii) Processing and evaluation of the results of a commercial system;
(iii) Processing and evaluation of the results of our prototype system.

First, we listened to the audio files while following the corresponding text and paying attention to the contexts of variable pronunciation of WFCs. During this listening phase, the text was manually labeled and annotated according to the native speakers' reference. The labeling consisted of manually (1) highlighting the variable pronunciations of WFCs, (2) annotating WFC phonemes side by side with their corresponding graphemes in the printed text, and (3) classifying them into Obligatory, Optional, or Forbidden contexts of Liaison or NLCD. It was assumed that the pronunciation of a WFC in an Obligatory context correctly classifies it as Obligatory; similarly, the absence of a WFC phoneme in a Forbidden context correctly classifies it as Forbidden. Table 7 illustrates some of these annotations in the corpus.
Table 7
Sample of the annotated corpus data corresponding to item (i) in Section 5.1.

L'un {Liaison, Fo} des {Liaison, Fo} points {Liaison, Fo} sera de savoir s'il est {Liaison, Op, [t]} approprié pour l'Archipel d'intercepter {Liaison, Fo} des {Liaison, Fo} missiles {Liaison, Fo} visant {Liaison, Fo} les {Liaison, Ob, [z]} États{Liaison, Ob, [z]}Unis. . . Le sommet {Liaison, Fo} du G Huit {NLCD, Ob, [t]} devrait {Liaison, Op, [t]} également {Liaison, Op} être l'occasion de demander {Liaison, Fo} plus {NLCD, Ob, [s]} d'efficacité à diverses {Liaison, Ob, [z]} organisations {NLCD, Fo}
The WFC phonemes uttered by the native speakers appear inside brackets. We noticed that in all the Obligatory contexts, the native speakers actually pronounced the corresponding WFC phoneme; likewise, in all the Forbidden contexts, WFCs were never pronounced, just as expected. The native speech from the broadcast was thus confirmed as a valid reference for this evaluation.

Second, the text corpus was given to one of the best-performing commercial TTS systems, Loquendo (2007). The synthesized speech produced by this system was also labeled and annotated, following the same procedure described above for the native speech. This time, however, errors occurred. Two kinds of errors were identified: replacement of a WFC phoneme, and lack of a phoneme that should have been pronounced. Unfortunately, the commercial TTS system does not output context classifications. To cope with this shortcoming, we adopted two assumptions that allow the categories to be labeled solely on the basis of the phonetic sequences. Given an actual Obligatory context, we assume that a correctly assigned FC phoneme means that this context was predicted as Obligatory. Likewise, we interpret the non-realization of an FC in an actual Forbidden context as a Forbidden context prediction. Note, however, that in both cases the system might have internally identified the context as Optional, the choice made merely coinciding with our expectation. These two assumptions may therefore favor the commercial system to an extent we cannot know, and we consider the context classification results of this system as indicative only. Despite this fragility, the following two inferences are safe: if there is a final consonant phoneme, the context is a non-forbidden one (i.e. Optional or Obligatory); and if there is no final consonant phoneme, the context is a non-obligatory one (i.e. Optional or Forbidden). By and large, this strategy allows us to create the labels required for the evaluation of context classification.

Third, the text corpus was given to the prototype system to automatically generate the G2P conversions, the pronunciations of WFCs, and the respective context classifications. Our prototype system is built on the architecture of the Festival Speech Synthesis System (Black et al., 1999), with the lexical letter-to-sound rules trained by CART decision trees (Black et al., 1998) on Lexique 2, a French dictionary (New et al., 2003). We also use a library of the Cordial software (Synapse, 2006) for PoS tagging. We then applied our method for predicting the pronunciation of WFCs.

The goal of applying these labels to the native speech, the commercial TTS system, and our prototype system was to measure the relative efficiency of the predictions with respect to WFCs under two conditions: (1) prediction of the WFC phoneme and (2) prediction of the context classification. The results for these two aspects are presented and discussed in Sections 5.2 and 5.3.
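The two labeling assumptions and the two safe inferences above can be summarized as the following Python sketch; the function and label names are our own, since the commercial system outputs only phoneme strings.

def infer_context_label(actual_context, wfc_realized):
    """Infer a context classification from the phonetic output alone.
    `actual_context` is the gold label from the annotated corpus;
    `wfc_realized` is True if the system pronounced the final consonant."""
    if actual_context == "Obligatory" and wfc_realized:
        return "Obligatory"   # assumption 1 (may hide an internal "Optional")
    if actual_context == "Forbidden" and not wfc_realized:
        return "Forbidden"    # assumption 2 (same caveat)
    # Only the weaker inferences are safe in the remaining cases:
    return "non-forbidden" if wfc_realized else "non-obligatory"

assert infer_context_label("Obligatory", True) == "Obligatory"
assert infer_context_label("Optional", True) == "non-forbidden"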
5.2. WFC phoneme prediction results
The first aspect evaluated here concerns the prediction of WFC phonemes in Obligatory contexts. Two kinds of errors were identified: replacement of a WFC phoneme, and lack of this phoneme. According to Table 8, the prototype system made 99.3% correct WFC phoneme predictions, while the commercial TTS system scored only 89.6%. However, the 99.3% rate does not take into account the fact that 4.1% of these cases, although the associated WFC phoneme was correct, were based on an incorrect context classification: they were misclassified as Optional instead of Obligatory. While this is a problem for a French phonetics teaching application, which requires correct predictions of both WFC phonemes and context classifications, it would not be a problem for applications where context classification is irrelevant. Concerning errors, Table 8 shows that the prototype system made 0.3% mispredictions involving the replacement of WFC phonemes, against 0.7% for the commercial TTS system. One of the most substantial improvements, however, is the reduction of errors due to the lack of WFC phonemes: while the commercial TTS system produced 9.7% of such errors, the prototype system made only 0.4%, an absolute improvement of 9.3%. Even if we consider our goal to be a teaching application, that is, if we subtract the 4.1% of context misclassifications, we still obtain an overall improvement of 5.2%. It should be noted, however, that the values 0.3%, 0.4%, and 0.7% are merely indicative, given the relatively small size of the test data.

Table 8
Comparison of the commercial TTS and the prototype systems regarding WFC phoneme prediction in Obligatory contexts.

                          Commercial TTS     Prototype system
Correct cons. phoneme     89.6% (619/691)    99.3% (686/691)
Replaced cons. phoneme    0.7% (5/691)       0.3% (2/691)
Lack of cons. phoneme     9.7% (67/691)      0.4% (3/691)

5.3. Context classification results

The second aspect evaluated here concerns the prediction of context classifications. Table 9 summarizes the relevant counts for WFCs and contexts.

Table 9
Summary of relevant counts for WFCs and contexts.

Total number of words/tokens with    Abs.      Rel. (%)
Ob liaisons                          564       6.7
Op liaisons                          506       6.0
Fo liaisons                          5101      60.8
Ob NLCDs                             127       1.5
Op NLCDs                             20        0.2
Fo NLCDs                             527       6.3
variable NLCIs                       2         0.0
invariably silent NLCIs              321       3.8
invariably pronounced NLCIs          1219      14.5
WFCs                                 8387      100.0
Words/tokens                         16,449    –

About 51.0% of the tokens in the corpus ended with one of the 12 graphic consonants (WFCs). Nearly 67.1% of these belonged to Forbidden contexts (either liaison or NLCD). The remaining 32.9% were distributed as follows: 18.4% for words whose final consonants are invariable NLCI, 8.2% for Obligatory contexts, 6.3% for Optional contexts, and 0.0% for variable NLCI words. As an extension of our definition of NLCI, we use the term invariably silent NLCI to designate words
whose FCs are kept silent in any context, such as "Japon", "Paris", and "parlement". Likewise, we call invariably pronounced NLCI those words whose FCs are always pronounced, like "avec", "correct", and "gaz". Although such words are present in the corpus, they fall outside the scope of this investigation.

The results and analysis for automatic context classification are divided into Obligatory and Forbidden contexts, presented in Sections 5.3.1 and 5.3.2, respectively. We decided not to investigate Optional context classifications here because it is impossible to infer the existence of an Optional context from the phonetic sequences produced by the commercial system: these sequences contain only one pronunciation per word, discarding any other possibility for an Optional liaison or NLCD. In the description of our results, we pool the counts of liaisons and NLCDs and ignore NLCIs. This means, for example, that the total number of Obligatory contexts is 564 + 127 = 691, and that the total number of Forbidden contexts is 5101 + 527 = 5628. In addition, although Optional contexts are outside our scope, we do include them in both the non-obligatory and non-forbidden sets. Therefore, the set of non-obligatory contexts includes both Forbidden and Optional liaisons and NLCDs, yielding 5628 + 526 = 6154 cases; similarly, the set of non-forbidden contexts contains both Obligatory and Optional cases (691 + 526), totaling 1217 cases.
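As a quick consistency check, the context-set totals used below can be derived mechanically from the Table 9 counts (liaisons and NLCDs pooled, NLCIs ignored); the short Python snippet is ours, not part of the evaluation code.

counts = {("Ob", "liaison"): 564, ("Op", "liaison"): 506, ("Fo", "liaison"): 5101,
          ("Ob", "NLCD"): 127,    ("Op", "NLCD"): 20,     ("Fo", "NLCD"): 527}

total = lambda label: sum(v for (c, _), v in counts.items() if c == label)
obligatory, optional, forbidden = total("Ob"), total("Op"), total("Fo")

assert obligatory == 691              # 564 + 127
assert forbidden == 5628              # 5101 + 527
assert forbidden + optional == 6154   # non-obligatory set
assert obligatory + optional == 1217  # non-forbidden set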
5.3.1. Classification of obligatory contexts

Concerning the errors produced by our prototype and the commercial system with respect to Obligatory context classification, we speak of a false rejection (FR) when an Obligatory context is misclassified as non-obligatory (i.e. Optional or Forbidden), for example "Jeux Olympiques" (ex. 16) being pronounced as [ZøOl~epik]. A false acceptance (FA) occurs when a non-obligatory context is misclassified as Obligatory, for instance a liaison being introduced between the words "pays" and "hôte" in the context ". . . l'Allemagne, pays hôte. . ." [l'alma›,peizot]. The results for true and false acceptances, as well as for true and false rejections, with respect to the classification of Obligatory contexts are presented in Table 10.
Table 10
Evaluation of native speakers, commercial TTS and prototype systems with respect to classifying Obligatory contexts, correctness being expressed in terms of true acceptances or rejections (TA/TR) and false acceptances or rejections (FA/FR).

Oblig./non-oblig. contexts    TA     FA    FR    TR
Native speaker                691    0     0     6154
Commercial TTS system         622    6     69    6148
Prototype system              658    6     33    6148
A true rejection (TR) here corresponds to a candidate word whose final consonant pronunciation is non-obligatory (cf. ex. 3 and 4), and a true acceptance (TA) to an actual Obligatory context where the FC phoneme has been properly predicted (cf. ex. 2). Remember that measuring context classifications on the basis of the phonetic sequences alone may produce misleading results, as explained in Section 5.1; this strategy is used only for the commercial system, because it does not output context classification labels. This remark also holds for the results presented in Section 5.3.2.

A total of 691 contexts of Obligatory pronunciation of WFCs were present in the evaluation corpus. In terms of recall, the prototype system correctly classified 95.2% of the cases, while the commercial TTS system achieved 90.0%. Precision was about 99% for both methods. The combination of recall and precision, calculated as the F-value with β = 1, shows an absolute improvement of 2.8%. These results are summarized in Table 11.

In addition, we noted that 85% (28/33) of the mispredictions of the prototype system were due to fixed expressions that were not part of the dictionary, such as "Nations Unies" (ex. 23) and "Affaires étrangères" (ex. 24). This suggests that adding a list of such fixed expressions (appearing in Obligatory contexts) has the potential to improve the results. We cautiously say "potential" because the evaluation data set was too small to give a clear idea of how much improvement such a list would bring. On the present evaluation set, the remaining 15% (5/33) of the errors were due either to the incorrect specification and classification of attributes during the modeling phase, or to unavailable features, such as semantic information.

From the text corpus we observed that 67% (465/691) of the Obligatory contexts were due to the articulation of the 10 most frequent liaison words with their contexts: "aux", "ces", "des", "en", "les", "leurs", "sans", "ses",
"son", and "un". Removing these contexts, which were perfectly predicted by both systems, we find that our prototype correctly predicted 85% (193/226) of the remaining cases, while the commercial TTS system managed only 69% (157/226). The proposed context classification model was thus, on average, 16 percentage points more accurate than its competitor at classifying non-trivial Obligatory contexts.

5.3.2. Classification of forbidden contexts

Concerning the errors produced by both systems with respect to Forbidden context classification, a false rejection (FR) occurs when an actual Forbidden context is misclassified as non-forbidden (i.e. Optional or Obligatory), for instance the insertion of a liaison between the words "tout" and "une" in the context ". . . réclame avant tout une stabilité. . ." [ReklAmava˜tutynstabilite]. A false acceptance (FA) occurs when a non-forbidden context is misclassified as Forbidden, for example the word "plus" [plys] ("more") being mispronounced as [ply] in the context "demander plus d'efficacité" [d(E)ma˜deplydefikasite]. The results for true and false acceptances, as well as for true and false rejections, with respect to the classification of Forbidden contexts are presented in Table 12. A true rejection (TR) here corresponds to a candidate word whose final consonant pronunciation is not Forbidden (cf. ex. 8 and 10), while a true acceptance (TA) is a case where the FC has not been realized in an actual Forbidden context (cf. ex. 4).

As can be seen in Table 13, the recall rate for predicting Forbidden contexts was 99.9% for both the commercial TTS and the prototype systems. This rate is very high because the number of incorrect predictions for Forbidden contexts is small compared to the number of words whose final consonants are not pronounced.
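The rates reported in Tables 11 and 13 below follow directly from the confusion counts in Tables 10 and 12, with precision = TA/(TA + FA) and recall = TA/(TA + FR); the following Python snippet is a consistency check of ours, not part of the evaluation code.

def rates(ta, fa, fr):
    precision = ta / (ta + fa)
    recall = ta / (ta + fr)
    f_value = 2 * precision * recall / (precision + recall)  # beta = 1
    return tuple(round(100 * r, 1) for r in (precision, recall, f_value))

print(rates(622, 6, 69))   # commercial, Obligatory -> (99.0, 90.0, 94.3)
print(rates(658, 6, 33))   # prototype,  Obligatory -> (99.1, 95.2, 97.1)
print(rates(5622, 69, 6))  # commercial, Forbidden  -> (98.8, 99.9, 99.3)
print(rates(5622, 5, 6))   # prototype,  Forbidden  -> (99.9, 99.9, 99.9)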
Table 11
Comparison of the commercial TTS and the prototype systems regarding the classification of Obligatory contexts.

Obligatory contexts    Commercial TTS (%)    Prototype system (%)
Precision              99.0                  99.1
Recall                 90.0                  95.2
F-value                94.3                  97.1

Table 12
Evaluation of native speakers, commercial TTS and prototype systems with respect to classifying Forbidden contexts, correctness being expressed in terms of true acceptances or rejections (TA/TR) and false acceptances or rejections (FA/FR).

Forb./non-forb. contexts    TA      FA    FR    TR
Native speaker              5628    0     0     1217
Commercial TTS system       5622    69    6     1148
Prototype system            5622    5     6     1212

Table 13
Comparison of the commercial TTS and the prototype systems concerning the classification of Forbidden contexts.

Forbidden contexts    Commercial TTS (%)    Prototype system (%)
Precision             98.8                  99.9
Recall                99.9                  99.9
F-value               99.3                  99.9

However, an analysis of the types of errors of the prototype system revealed that 45% (5/11) of its mispredictions were due to inaccurate
specification and classification of attributes and values during Step 2, while the remaining 55% (6/11) were due to exceptions not registered in the training database. The prototype's performance was considerably better with respect to the number of false acceptances in Forbidden context classification: 5 against 69 for the commercial TTS system. Directly affected by the number of false acceptances, the precision rate for Forbidden context classification is 1.1% higher in absolute terms, moving from 98.8% to 99.9%, as shown in Table 13. The relatively high number of false acceptances produced by the commercial TTS system was due to its tendency to suppress the final consonants of French numbers and fixed expressions in contexts where they should have been pronounced, such as in "Nations [z] unies", ". . . mars 2008 [t] la date limite. . .", ". . . sommet du G Huit [t] de juin. . .", and ". . . pourparlers à six [s] sur le programme. . .".

These results show that the present method performed better at correctly identifying Forbidden contexts. Yet classification errors still occurred, and these models should be improved, given their relevance for an engine intended to teach pronunciation. There is clearly room for improvement, which could be achieved in various ways, for example by (1) enlarging the training database, feeding it with more exceptional cases concerning fixed expressions, and (2) specifying more carefully the set of attributes and values necessary for context identification and classification. The first type of update is inherently limited, since the number of fixed expressions is unknown; consulting more sources (the phonetic literature) might help to find cases not yet included in the current database. Concerning the second type of update, additional evaluation might help us identify the features and values not properly specified or classified during Step 2 for some rule-sets. Once these problematic cases are identified, corrections may consist of one or more of the following procedures: (1) redefining attributes by allocating their values to different attributes, (2) increasing the number of values of a given attribute, and (3) creating new attributes with corresponding values. Unfortunately, perfection is hard to achieve, since semantic information might be required, and it is sorely lacking. Nevertheless, these results show that developing a TTS application to teach French pronunciation is much more challenging than developing a TTS for other kinds of applications: in addition to predicting correct WFC phonemes for better G2P conversions, correct context classifications are required.

6. Conclusions and future work

This paper proposed a novel approach for modeling the pronunciation of French WFCs by using C4.5 decision trees. To the best of our knowledge, this is the first
approach based on decision trees to model WFC pronunciation variation. Since the long-term goal of this research is to improve the accuracy of TTS for learners of French as a second language, excellent performance is required of the G2P converter. Hence, developing a TTS application for didactic purposes, i.e. teaching French pronunciation, is much more challenging than developing a TTS for other kinds of applications: predicting WFC phonemes correctly is already a challenge, yet correct context classification is an additional requirement if we want to give students explanations concerning pronunciation variation across contexts.

To test these requirements, a two-step evaluation was conducted. The first part of our experiments compared the liaison predictions generated by our prototype system with those generated by six other previously evaluated systems; the proposed method allowed our system to rank second best among these state-of-the-art systems. In the second part of the experiments, the method's relative efficiency was investigated with respect to WFC phoneme prediction and context classification. Concerning WFC phoneme prediction in Obligatory contexts, the commercial TTS system made 89.6% correct predictions, while the prototype system reached 99.3%. However, 4.1% of these Obligatory contexts were misclassified as Optional by the prototype system, which was "helped" by the coincidence that the same WFC phonemes were associated. Regarding the prediction of contexts, the prototype system correctly classified 95.2% of the Obligatory contexts, while the commercial TTS system succeeded in only 90.0% of the cases, both systems showing the same precision rate. In addition, the prototype system improved significantly with respect to the classification of Forbidden contexts, the observed ratio of false acceptances being 5 vs. 69 in favor of the new method. Remember that the two assumptions used for labeling the context classifications of the commercial system may not be perfect; we therefore consider the results of this system in this matter as indicative only. An error analysis also revealed that the quality of the context classification and WFC phoneme prediction results depends on the extension and completeness of the attributes and values specified for the training of the decision trees.

Our method simplifies the modeling of the pronunciation of WFCs. Relying exclusively on linguistic/phonological knowledge to deal with the problem at hand, it dispenses with the complexities of manually coding rules in a programming language. The proposed method offers the following technical advantages: (1) standardization and simplification of the rule generation process, since similar rules are grouped together into rule-sets; (2) automation of the production of rules, given a training database; and (3) ease of maintenance of the rule database, due to the distribution of rules, leading to minimal impact
Table A.1
WFC context examples.

No.  Example/translation                                         Pronunciation
1    "mes doigts" ("my fingers")                                 [medwa]
2    "mes amis" ("my friends")                                   [mezami]
3    "amis étrangers" ("foreign friends")                        [ami(z)etRa˜Ze]
4    "amis étudient" ("friends study")                           [amietydi]
5    "j'en veux plus" ("I want more")                            [Za˜vøplys]
6    "j'en veux plus" ("that's enough")                          [Za˜vøply]
7    "tous ensemble" ("all together")                            [tusa˜sa˜bl]
8    "tous azimuts" ("in all directions")                        [tuzazimu]
9    ". . . les connaître tous." (". . . know all of them.")     [lekOnetRtus]
10   "six pour cent" ("six per cent")                            [si(s)puRsa˜]
11   "je n'en mange plus" ("I don't eat any more of that")       [ZEna˜ma˜Zply]
12   "un obscur" ("a vague")                                     [nOpskyR]
13   "un élu" ("an elected")                                     [(n)ely]
14   "vont-ils arriver?" ("will they arrive?")                   [v¿ tilaRive]
15   "tout à coup" ("all of a sudden")                           [tutaku]
16   "Jeux Olympiques" ("Olympic Games")                         [ZøzOl~epik]
17   "de plus en plus" ("more and more")                         [dEplyza˜ply]
18   "en plus" ("in addition")                                   [a˜plys]
19   "il est neuf heures" ("it is nine o'clock")                 [ilenvR]
20   "j'ai neuf ans" ("I am nine years old")                     [Zenva˜]
21   "neuf autres personnes" ("nine other people")               [nvotRpeRsOn]
22   "neuf hommes travaillent ici" ("nine men work here")        [n(f/v)OmtRavaj(t)isi]
23   "Nations Unies" ("United Nations")                          [nasj¿ zyni]
24   "Affaires étrangères" ("Foreign Affairs")                   [afeRzetRa˜ZeR]
of new rules, by clustering similar rules, and by setting appropriate priorities.

Future work in this field includes: (1) enlargement of the database used by the prototype system; (2) evaluation of our method on a larger amount of test data; (3) investigation of Optional liaison contexts using annotated corpora; (4) improvements of the general G2P conversions; and finally (5) the proposal of an appropriate pedagogy for this phonetic engine.

Acknowledgments

We would like to thank Dr. Michael Zock (LIF-CNRS) for the time spent reading, commenting on and correcting our final version; it helped a lot to improve this document. We would also like to express our gratitude to François Yvon from LIMSI for providing us with the data needed to run the first part of the experiments.

Appendix A.
See Table A.1.
References

Battye, A., Hintze, M.-A., Rowlett, P., 2000. The French Language Today: A Linguistic Introduction, second ed. Routledge, London and New York, pp. 109–112.
BDL – Banque de Dépannage Linguistique, 2002. Office Québécois de la langue française. Gouvernement du Québec.
Black, A.W. et al., 1999. The Festival Speech Synthesis System. University of Edinburgh.
Black, A.W. et al., 1998. Issues in building general letter to sound rules. In: Proc. Third ESCA/COCOSDA Workshop on Speech Synthesis, Australia, pp. 77–80.
Boë, L.J., Tubach, J.P., 1992. De A à Zut, dictionnaire phonétique du français parlé. Université Grenoble 3, Ellug.
Corpuseye, 2008. University of Southern Denmark.
Corréard, M.-H., Grundy, V., Ormal-Grenon, J.-B., Pomier, N., 2003. Oxford-Hachette French Dictionary, third ed. Oxford University Press (CD-ROM).
Côté, M.-H., 2005. Phonologie française. Université d'Ottawa, pp. 58–60, 91–107.
Delattre, P., 1951. Principes de phonétique française à l'usage des étudiants anglo-américains. Middlebury College.
Durand, J., Lyche, C., 2008. French liaison in the light of corpus data. Journal of French Language Studies 18/1, 33–66.
Durand, J. et al., 2005.
Encrevé, P., 1988. La liaison avec et sans enchaînement: phonologie tridimensionnelle et usages du français. Éditions du Seuil, Paris.
Fouché, P., 1959. Traité de prononciation française. Klincksieck, Paris.
Germain-Rutherford, A., 2005. Centre du cyber-@pprentissage, Université d'Ottawa.
Goldman, J.-P., Laenzlinger, C., Wehrli, E., 1999. La phonétisation de "plus", "tous" et de certains nombres: une analyse phono-syntaxique. In: Actes de TALN99, Cargèse, Corse, pp. 165–174.
Grevisse, M., 1997. Le bon usage – grammaire française, 13ème édition, André Goosse, DeBoeck-Duculot, Paris, Louvain-la-Neuve, pp. 45–49.
Kohavi, R., Provost, F., 1998. Glossary of terms. Editorial for the special issue on applications of machine learning and the knowledge discovery process. Machine Learning 30 (2/3).
Loquendo TTS, 2007.
Mareüil, P.B., Adda-Decker, M., Gendner, V., 2003. Liaisons in French: a corpus-based study using morpho-syntactic information. LIMSI-CNRS, Orsay, France; LATTICE, Université Paris VII, France.
New, B., Pallier, C., Brysbaert, M., Ferrand, L., 2003. Lexique 2: a new French lexical database.
NHK, 2007.
Pierret, J.-M., 1994. Phonétique historique du français et notions de phonétique générale. Peeters, Louvain-La-Neuve, pp. 98–101.
Pontes, 2008.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Mateo.
Robert, P., 2005. Le Grand Robert de la langue française: dictionnaire alphabétique et analogique de la langue française, Paris.
Sanders, C., 1993. French Today: Language in its Social Context. Cambridge University Press, p. 263.
Sproat, R., 1998. Multilingual Text-To-Speech Synthesis – The Bell Labs Approach. Kluwer Academic Publishers, Massachusetts, USA, p. 56.
Stammerjohann, H., 1976. Zur Aussprache französischer Endkonsonanten: août, but, cinq, usw. Die Neueren Sprachen 75 (5), 489–502.
Synapse, 2006. Bibliothèque Cordial, étiquetage – lemmatisation, Vol. 11.
Tzoukermann, E., 1998. Text analysis for the Bell Labs French text-to-speech system. In: Proc. ICSLP98, Sydney, Australia, Vol. 5, pp. 2039–2042.
Verluyten, S.P., Hendrickx, C., 1987. La chute des consonnes finales facultatives en français: petite enquête sociolinguistique. Linguistica Antverpiensia 21, 175–189.
Yvon, F. et al., 1998. Objective evaluation of grapheme to phoneme conversion of text-to-speech synthesis in French. Computer Speech and Language 12, 393–410.