An Experimental Constructional Database: The MorTAL Project

Nabil Hathout,1 Fiammetta Namer,2 and Georgette Dal3
1 ERSS, CNRS, and University of Toulouse–Le Mirail
2 LANDISCO, University of Nancy 2
3 SILEX, CNRS, and University of Lille 3
1. Introduction

1.1. MorTAL or: Why a constructional database?

The constructional database1 which is the object of this presentation is part of the MorTAL project, a research project funded for a period of 3 years by the French Ministry of Research.2 The French language lacks specifically constructional databases, no doubt because of the alleged irregularity of the constructional morphology of the language. However, natural language processing (NLP) and information retrieval (IR), in particular, can benefit from this type of resource, for the following uses, among others:

• in document retrieval, to improve the filtering of documents resulting from a query
• in automatic analysis systems based on unification-based grammars (TAG, LFG, HPSG), in order to automatically and dynamically (according to need) obtain the lexical entry of a constructed word C from its base B, and reduce the size of the lexicon used by the grammar
1. As in the theoretical model developed in Lille (France) by D. Corbin and her team, we prefer the term “constructional” to “derivational,” which does not always imply a single notion.
2. MorTAL brings together the co-authors of this article, as well as Christian Jacquemin, who deserves our deepest thanks for his careful rereading of this presentation.
© 2002 Nabil Hathout, Fiammetta Namer, and Georgette Dal. Many Morphologies, ed. Paul Boucher and Marc Plénat, 178–209. Somerville, MA: Cascadilla Press.
• in information retrieval, when the knowledge of the constructional links between lexical units partially solves the problems related to term variants (cf. especially Jacquemin 1999)
• in text comprehension, when the semantic relationship that exists between the constructed words and their bases can be used
Therefore, from an NLP and IR point of view, MorTAL fills a gap. From a theoretical linguistics point of view, it is an ambitious enterprise in that developing a computer program capable of automatically analyzing constructed French words in a way that is linguistically motivated proves that the construction of those words is governed by rules, and thus that the reputation from which French constructional morphology suffers is unjustified. In this way, MorTAL is part of the new scientific paradigm of corpus linguistics in which corpora can validate linguistic hypotheses when they are applied to massive quantities of data, or if necessary, allow the hypotheses to be amended.
1.2. A lexicon enriched by constructional and semantic features: Subject description

The database under development3 is designed as a large lexicon of approximately 70,000 lexical units, essentially combining the major lexical entries that appear in the TLFnome4 and in the Robert électronique (1994). At this time, the suffixes -able, -ité, -et(te), -is(er) and -ifi(er), as well as the prefixes dé- and in-, have been (almost) completely studied and implemented, creating a database of approximately 5,000 lexical entries. Eventually, our database will take the form of a French language lexicon enriched by constructional and semantic features (hereafter LECSF). Each entry will include the following information:

1. the lemma (i.e., the form that conventionally subsumes all its inflectional variants)
2. its grammatical category
3. its constructional analysis presented as a tagged and bracketed schema
4. a repetition of the information from the previous field, presented more clearly; i.e., for affixed and converted words, the entry, then its base, and the base’s base when necessary, etc., until it reaches the entry’s primitive, a unit that cannot be decomposed
5. a gloss that illustrates the semantic results of the application of the most peripheral constructional operation

3. For the first stages, see Dal et al. 1999.
4. The TLFnome is a lexicon of inflected forms compiled at the INaLF and based on the nomenclature of the Trésor de la Langue Française. It presently contains 63,000 lemmas, 390,000 forms, and 500,000 entries. It has been complemented by a second lexicon of 36,400 additional lemmas taken from the index of the TLF.
Some of these fields can be left blank. Indeed, none of the entries that appear in the two reference corpora are rejected a priori; however, among those entries, not all are constructed. For example, all fields of an adjective like inarticulable ‘inarticulatable’ are filled because the adjective is constructed:

(1) inarticulable/ADJ: [ in [[ articul(er) VERBE] able ADJ ] ADJ ],5 (inarticulable, articulable, articuler), “qui n’est pas articulable” ‘that which is not articulatable’

However, the fifth field is left blank for a non-constructed verb such as articul(er) ‘to articulate,’ because the verb does not have a constructed meaning:

(2) articul(er)/VBE: [articul(er) VERBE], (articuler)
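The five fields listed above can be pictured as a small record type. The following is only a minimal sketch in Python: the class name LecsfEntry, its field names, and its two helper methods are illustrative assumptions and are not part of the MorTAL database itself. The two instances reproduce examples (1) and (2).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LecsfEntry:
    """Hypothetical record for one entry; names are assumptions of this sketch."""
    lemma: str                   # field 1: the lemma
    category: str                # field 2: grammatical category (ADJ, VBE, NOM, ...)
    analysis: str                # field 3: tagged and bracketed schema
    family: Tuple[str, ...]      # field 4: entry, its base, the base's base, ... primitive
    gloss: Optional[str] = None  # field 5: left blank for non-constructed words

    def is_constructed(self) -> bool:
        # In this sketch, a constructed word has at least one base besides itself.
        return len(self.family) > 1

    def primitive(self) -> str:
        # The last member of the family is the unit that cannot be decomposed.
        return self.family[-1]


# Example (1): a constructed adjective, all five fields filled.
inarticulable = LecsfEntry(
    lemma="inarticulable",
    category="ADJ",
    analysis="[ in [[ articul(er) VERBE] able ADJ ] ADJ ]",
    family=("inarticulable", "articulable", "articuler"),
    gloss="qui n'est pas articulable",
)

# Example (2): a non-constructed verb, the gloss field is left blank.
articuler = LecsfEntry(
    lemma="articul(er)",
    category="VBE",
    analysis="[articul(er) VERBE]",
    family=("articuler",),
)
```

Representing the fourth field as an ordered tuple keeps the path from the entry down to its primitive explicit, which is what the "blank field" distinction between (1) and (2) relies on.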
Eventually, all of the information described will be linguistically verified. It is for this purpose that the DériF system was developed. However, given the time cost of such a verification, we will also use a second computer program, DéCor, which was developed for this purpose as well. DériF is a system that implements linguistic hypotheses, apart from the restrictions that implementation itself involves. By the end of 2002, DériF will have given us a complete analysis of approximately 20,000 French lexical units.6 DéCor is a learning-based analysis system that pairs formally similar lexical units belonging to the same reference lexicon. By 2002, DéCor will allow us to offer a preliminary analysis for an as-yet uncalculated number of the 50,000 units in our corpus that are not treated by DériF.

The rest of this paper will be devoted to a detailed presentation of these applications. After giving a progress report on the various points of view that exist in the domain of the automatic processing of constructed lexical units, and a brief explanation of DéCor and DériF’s position (section 2), we will discuss these two systems in more depth (section 3), underlining their strengths as well as their weaknesses. In order to compare them, we will then look at how they deal with the specific case of -able suffixation (section 4).

5. ADJ: adjective; VERBE/VBE: verb; NOM: noun; FWD: foreign word.
6. This number corresponds to an estimate of the number of derivatives produced by the constructional operators that we have decided to implement following linguistic rules. Together, they give good coverage of the constructed lexicon. These affixes are the suffixes -(a)tion (liberation), -(at)eur (voyageur ‘voyager’), -able (abordable ‘accessible’), -age (lavage ‘washing’), -aire (bancaire ‘banking’), -al (adjectival), -et(te) (fillette ‘little girl’), -eux (neigeux ‘snowy’), -ifi(er) (acidifier ‘to acidify’), -is(er) (budgétiser ‘to budgetize’), -ité (sévérité ‘severity’), -oir(e) (arrosoir ‘watering can’), and the prefixes dé- (décapsuler ‘to take a cap off’) and in- (inabordable ‘inaccessible’).
2. Automatic processing of constructed lexical units

2.1. State of the art7

Two extreme viewpoints exist in the domain of automatic processing of constructed lexical units: processing based on dictionaries and processing based on rules. Between these two poles, however, there are processing systems which can be referred to as mixed.

The main objective of dictionary-based processing is information retrieval. Among these systems is the work on the French language by Savoy (1993), which offers a complete morphological system. The program performs an inflectional and constructional analysis of untagged words, which are given a non-deterministic treatment (the program produces as many results as there are possible hypotheses). The performance limitations noted by the author are probably due to the fact that the approach is based on dictionary consultations and is meant to perform lemmatization and stemming in a single step.

The product developed at the Rank Xerox Research Center is based on the use of finite-state transducers (cf. Karttunen, Kaplan, and Zaenen 1992). It is an example of a mixed approach that uses both dictionaries and rules.

It is the rule-based processing systems that we will discuss in more detail, since DéCor and DériF are such systems. They can be partitioned according to whether they use statistical rules or linguistic rules. Statistical rules are used mostly in corpus linguistics (Habert et al. 1997). Without going into detail, we can cite as examples the works by Jacquemin (1997) and Xu and Croft (1998). Both work with the English language and are meant to extend IR queries (through the use of morphological variants of complex terms for the former, and co-occurring words in a corpus for the latter). No linguistic features are used in either. Jacquemin uses a list of terms and a corpus to calculate the morphological variants of the multiword terms in the list, while Xu and Croft put together morphological families using only a corpus.
For French, one can refer to the system developed by Grabar and Zweigenbaum (1999), which was designed to construct a morphological database using the SNOMED (Systematized Nomenclature of Human and Veterinary Medicine) medical nomenclature. This system was devised to extend a query through the use of key words. It does not distinguish between inflection and derivation or composition, and works by suffix stripping/adding, then by learning. However, this statistical approach relies on linguistic knowledge. It takes as input a structured resource that includes semantic indications (such as pseudo-synonymy relationships), which guarantees the correctness of the solutions.

Most systems based on linguistic rules fall within a formalism known as two-level morphology (Koskenniemi 1983). Among the systems developed within this formalism, we might mention the finite-state automata that were modeled by Clémenceau (1993). It is nonetheless possible to work within other types of formalisms. For example, Clavier (1996) and Clavier et al. (1996) used regular grammars to recursively (and in a non-deterministic way) analyze suffixed words.

7. For a more detailed state of the art, see Daille et al. 2002 in this book.
2.2. DéCor and DériF’s position in the state of the art

Both DéCor and DériF, as we have mentioned, are based on rules. However, they are not based on the same type of rules. DéCor is among the approaches based on the use of statistical rules. Like the system expounded in Jacquemin 1997 and unlike the one advocated in Grabar and Zweigenbaum 1999, DéCor is totally automatic and requires no linguistic knowledge. Unlike Jacquemin 1997, however, DéCor does not filter variant candidates and does not make use of classification techniques to group the allomorphic variations, the graphic variants, and the spelling errors. In these ways DéCor seems less elaborate than Jacquemin (1997); on the other hand, its implementation is faster and it is easier to use.

DériF is situated among the processes based on recursive linguistic rules which do not rely on the constraints of any particular formal model, or on the need for a lexicon of complex entries. DériF works on a wide scale to automatically analyze lexical units constructed by suffixation and/or prefixation. To a lesser extent, it also analyzes lexical units constructed by composition. The analysis is linguistically motivated since it is based on linguistic hypotheses formulated within the constrained theoretical context of the theory developed in France by the SILEX team. Finally, this deterministic system is also modular, as we will see in section 4.
3. DéCor and DériF: A general presentation

3.1. DéCor

The aim of DéCor is to build a derivational lexicon semi-automatically at reasonable cost. It uses lists of lemmas, just as the Grabar and Zweigenbaum 1999 system does. We have taken the option of completely dissociating the computational analysis of the derived words from their manual validation, so that the validation can be carried out by staff with no particular skills in computer science. The validation may be simple if the available means are limited; otherwise, cross-validation may be performed in order to limit the divergences between the operators’ interpretations.
DéCor is a set of programs designed to pair constructed words with their morphological bases. It has been tuned with the TLFnome, a lexicon of French inflected forms. We could have used other lexicons, such as the ABU lexicon, the BDLEX built by J. Pérennou and his team, the DELAF developed by M. Gross’s team (cf. Silberztein 1993), or the MULTEXT lexicon. We have preferred the TLFnome because it stems from a well-known language dictionary (Trésor de la Langue Française) which we can refer to for linguistic analysis. DéCor can also use lexicons created from tagged and lemmatized corpora (e.g., WinBrill or TreeTagger: Schmid 1994); it is thus able to handle derived lexical units that belong to general language, or that are built with affixes of special-purpose languages (for instance, -ose or -ome, used in the medical vocabulary) for which no TLFnome-like lexicon is available.
3.1.1. The Network Model

DéCor is compatible with the SILEX model; in particular, it is designed to compute pairings that conform to the linguistic descriptions worked out in this theoretical framework. However, the empiricist approach underlying the construction of the system is closer to the Network Model proposed by Bybee (1988, 1995) in order to explain in a unified framework various facts including historical changes in the inflection of English irregular verbs (either from irregular to regular or the other way around), observations on child language acquisition, and results of psycholinguistic experiments. In this model, representations and formal apparatus are reduced to the minimum needed to describe the data. Thus the lexicon is a network of attested words (that is, the words available to the speaker) connected by relations set up according to shared phonological and semantic features. For instance, the words abordable ‘approachable’ and aborder ‘to approach’ are connected by a set of links corresponding to the sharing of the phonological prefix abord- and of the meaning of aborder. A similar set of links exists between activable and activer; abordable and activable are also connected because they share the suffix -able and the semantic properties corresponding to the paraphrase “capable of, fit for or worthy of being...” Moreover, both pairs (abordable–aborder and activable–activer) are clustered in a proportional series (Cruse 1986: 118–133) that can be extended to the whole set of (Xer, Xable)8 pairs.

Bybee’s theory is first and foremost representational. A derivational operator such as -able is regarded as a pattern of connections abstracted from the phonological and semantic properties shared by the elements of the proportional series. The schema, which may be thought of as a prototype of the relation that holds between the two sets of lexical items, is tied to the forms from which it arises. On a procedural level, the formation of words ending with -able may be expressed as a “rule of three” or a proportional analogy (see Becker 1993 for an implementation of these schemas). DéCor’s analysis of the adjective activable may be described as the search for a value of a variable X that makes the proposition “abordable is to aborder as activable is to X” true; in this case, X = activer.

In the network model, connections between lexical units are of different strengths. The closeness of two units (i.e., the degree of their relatedness) depends on the number and type of the semantic links (i.e., the semantic features they share). Phonological similarity does not intervene directly, even if it is usually parallel to the degree of semantic connection. More generally, the lexicon is viewed as a dynamic system where not all forms have the same status; forms used frequently gain in lexical strength while the ones that are used rarely lose in lexical strength. Forms with sufficient lexical strength have their own entry in the lexicon and are regarded as basic. Lexical storage and retrieval of the other forms is relative to more basic forms already in the lexicon. Token frequency plays an explanatory role as well. It accounts for the directionality of morphological relations (they are oriented from the more frequent forms to the less frequent ones) and for the maintenance of irregularities and suppletion. Type frequency is also used in the theory. It is related to the morphological schemas’ productivity: the number of types that participate in a schema determines its ability to be applied to new items.

For computational applications, a number of connectionist systems, based on the network model, have been developed in order to mimic child acquisition of English past tense verbs, and in particular of the irregular forms (cf. Rumelhart and McClelland 1986 and the numerous developments initiated by this seminal work).

8. Informal notation of pairs made of a first group verb and the corresponding adjective ending with -able.
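The “rule of three” mentioned above can be illustrated on written forms. The sketch below is an assumption-laden toy, not DéCor’s actual procedure: it reduces the analogy to a substitution of final character strings, and the function names common_prefix_len and solve_analogy are hypothetical.

```python
from typing import Optional


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest initial string shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def solve_analogy(a: str, b: str, c: str) -> Optional[str]:
    """Find X such that 'a is to b as c is to X', by suffix substitution.

    The pair (a, b) is split after its longest common prefix; the ending of a
    is then replaced by the ending of b in c. Returns None if c does not
    carry the ending of a.
    """
    k = common_prefix_len(a, b)
    ending_a, ending_b = a[k:], b[k:]
    if not c.endswith(ending_a):
        return None
    return c[:len(c) - len(ending_a)] + ending_b


# "abordable is to aborder as activable is to X"
x = solve_analogy("abordable", "aborder", "activable")
print(x)
```

Note that this purely graphemic procedure has no access to the semantic links that the network model treats as primary; it only exploits the assumed correlation between written form and meaning.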
DéCor shows some similarities with these systems. It is learning-based and does not make use of any linguistic knowledge. On the other hand, DéCor differs from these systems in several ways. DéCor is derivational while the others are inflectional; it is symbolic, which makes it possible to describe and motivate the processing operations performed; it is designed to pair derived words with their bases, and not as a model of morphological knowledge acquisition. In the MorTAL framework, DéCor’s aim is to construct a LECSF for NLP, not to develop a faithful implementation of Bybee’s theory. We use corpora of lemmas extracted from the TLFnome, the entries of which do not contain any phonetic or semantic descriptions, nor any frequencies in textual corpora. We only work on lemmas’ written forms, regarded as approximations of phonetic representations, and on grammatical categories, the interface between their semantic and syntactic properties (Corbin 1997). Moreover, we assume there exists a certain correspondence between phonetic representations and semantic ones, and a correlation between the lexical frequency of morphological relations and their validity.
3.1.2. findaffix

DéCor has been built around findaffix, a public domain Unix script, part of the ispell spellchecker developed by G. Kuenning. The spellchecker uses both a form dictionary and a set of mainly inflectional rules. The ispell spellchecker presents an original feature compared with other spellcheckers such as Cordial (Synapse Développement) or Correcteur101 (Machina Sapiens). When possible, it uses its rules to provide the user with a derivational “explanation” for unknown words. For instance, if ispell has to check the word cliticisent [‘to cliticize,’ present, 3rd person plural], and if the user dictionary already includes cliticise [3rd person singular], ispell suggests that this form could be accepted because it may be analyzed as cliticise+nt.

The findaffix script is designed for the compression of dictionary forms. It computes a set of suffixation or prefixation rules that ispell can use to generate one dictionary subset from another. For instance, it can use a suffixation rule (s,s') to yield a form f' = x.s' from a form f = x.s.9 The findaffix script also provides two numerical values for each rule: its frequency, i.e., the number of pairs of forms that can be associated by the rule, and an estimate of the number of bytes that might be saved by means of the rule. For example, when findaffix is applied to a sample of 10,000 inflected forms of verbs whose infinitive ends in -er (an approximation of the first group) present in the TLFnome, the two rules with the highest frequency are:

(3)
/s/2053/19809
s//2053/17756
The first rule adds an -s at the end of a word and the second strips an -s. These rules can be used to generate present 1st and 3rd person singular from 2nd person (chantes ⇒ chante ‘sing(s)’) and vice versa (chante ⇒ chantes), future 3rd person singular from 2nd person (chanteras ⇒ chantera), etc. The rules produced by findaffix correspond to the graphemic regularities of the proportional series that gathers the source forms and the target ones. Rule learning relies on two assumptions:

1. The longer the form, the stronger the graphemic–semantic correspondence is. As a consequence, if two forms share a sufficiently long common prefix (or suffix), they are very likely to be semantically connected.
2. The lexical frequency of a morphological relation is an index of its regularity, the latter being a gauge of the rule’s validity.
9. s, s' are suffixes by their position, not semantically.
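To make the notion of a suffixation rule (s,s') and its frequency concrete, here is a minimal sketch in Python. It is not the findaffix script: the functions parse_rule_line, apply_rule, and learn_suffix_rules are hypothetical, the field order strip/add/frequency/bytes is an assumption read off example (3), and the learning step simply counts, for every ordered pair of forms sharing a common prefix of at least min_stem characters, the pair of endings that distinguishes them.

```python
from collections import Counter
from itertools import permutations
from typing import Iterable, Optional, Tuple

Rule = Tuple[str, str]  # (s, s'): strip the ending s, then add the ending s'


def parse_rule_line(line: str) -> Tuple[str, str, int, int]:
    """Read a line such as '/s/2053/19809' as (strip, add, frequency, bytes).

    The field order is an assumption of this sketch.
    """
    strip, add, freq, saved = line.split("/")
    return strip, add, int(freq), int(saved)


def apply_rule(form: str, rule: Rule) -> Optional[str]:
    """Yield f' = x.s' from f = x.s, or None if f does not end in s."""
    s, s2 = rule
    if not form.endswith(s):
        return None
    return form[:len(form) - len(s)] + s2


def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def learn_suffix_rules(forms: Iterable[str], min_stem: int = 4) -> Counter:
    """Count candidate rules over all ordered pairs of distinct forms.

    A pair contributes only if its common prefix is long enough (assumption 1);
    the resulting counts play the role of lexical frequency (assumption 2).
    The quadratic pairwise loop is only suitable for small samples.
    """
    rules: Counter = Counter()
    for f, g in permutations(sorted(set(forms)), 2):
        k = common_prefix_len(f, g)
        if k >= min_stem:
            rules[(f[k:], g[k:])] += 1
    return rules


sample = ["chante", "chantes", "porte", "portes", "chantera", "chanteras"]
learned = learn_suffix_rules(sample)
print(learned.most_common(2))
```

On this toy sample the two most frequent rules are the addition and the stripping of -s, which mirrors the pattern of example (3) on a much smaller scale.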
3.1.3. The processing of an affix

DéCor uses the rules learned by findaffix to pair derived forms with the forms that serve as their bases. Only attested words from the reference dictionary are considered as possible bases.10 For instance, when processing the -able suffix, DéCor is applied to the dictionary verbs and adjectives ending in -able. We then get interpréter ‘to interpret’ for interprétable ‘interpretable’ and porter ‘to carry’ for portable ‘carryable/portable.’ In addition, DéCor proposes perturber ‘to perturb’ for imperturbable ‘imperturbable’ because perturbable, which is linguistically the base of imperturbable, is not attested in the TLF. In this case, the analysis proposed by DéCor is under-specified since it involves both a prefixation and a suffixation, but does not give their respective scope. It is unclear whether this analysis should be read as:

(4)
* [A im- [A [V perturber] -able]]
* [A [V im- [V perturber]] -able]
* [A im- [V perturber] -able]
Corbin (1980) has clearly proved that word formation in French can do without the so-called parasynthesis rule that appeared at the end of the last century, and that the third hypothesis can readily be rejected. The second one is linguistically impossible (*imperturber is a derivational freak because in- does not operate on verbs when it constructs antonyms).
3.2. DériF

3.2.1. General principles

DériF is the second program used in the experiment described in this paper. The system is characterized by a certain number of principles, including some that have already been mentioned in section 1.2:

• Unlike DéCor, DériF depends on the use of linguistic knowledge. The system adapts the linguistic hypotheses devised within a specific theoretical model developed by Danielle Corbin and her team in the UMR SILEX in France.
• The DériF application works recursively. In other words, as long as it recognizes an affix, the system attempts to analyze it. Thus DériF does not only pair the two linguistic forms, one of which is presumably derived from the other. It structurally decomposes the lexical units submitted until it obtains a primitive.
10. In the network model, the lexicon contains no infralexical units.
• The analyzer processes suffixes as well as prefixes. When a word is both suffixed and prefixed, the breakdown order takes the relative scope of the two affixes into account. For example, despite their superimposable linear forms (in/X/able), incorporable ‘incorporable’ and inadaptable ‘inadaptable’ receive different processing orders. In the first case, -able is applied to the verb incorpor(er) ‘incorporate’ and constitutes the first breakdown. In the second, the prefix in- is applied to adaptable ‘adaptable’ and corresponds to the most peripheral operation.
• The unit entered is a lemma W with a grammatical tag, and the output is three-fold: (1) a bracketed analysis of W with ordered combinations of lexical and infralexical units which form the base and which are also tagged, (2) a constructional micro-family of the analyzed unit, and (3) a gloss assigned to the most peripheral constructional operation.
• The validity of each lexical unit forming the base is calculated in specific modules of the program as presented schematically below, and verified in a reference: the TLFnome plus the Robert Electronique – the same lexicon that serves as a corpus of entries (cf. section 1.2).
3.2.2. Operation

The DériF program is made up of a combination of functions which are activated by a parent application according to the category and the ending of the word analyzed. The order in which the functions are applied reflects the hierarchy between the word’s construction operators. Each function called by the parent application handles the structural and semantic analysis of a particular suffix, after having called the function(s) necessary for the analysis of the prefixes, according to the scope constraints between the affixes. If the application does not detect any known ending, it activates a prefix search on the word to be analyzed, in which case the prefix must apply to a non-suffixed base. Then it displays the results as a triplet made up of (1) a parse tree, (2) the morphological family of the analyzed word, and (3) a gloss expressing the semantic relationship between the analyzed word and its base.

Let us look at how DériF works for the example indéracinable ‘ineradicable’/ADJ. In indéracinable, the application recognizes the adjectival ending able, and calls the Fable function which analyzes the -able suffix. First Fable verifies whether indéracinable is a constructed word in French (unlike stable, for example, which is returned as is to the application). Before executing the actual analysis, Fable calls the function FprefXable, which checks whether indéracinable contains one or more prefixes applied to the suffixed base. FprefXable has a specific link with Fable because of the strong structural and semantic constraints that link the Pi prefix(es) and the S suffix in the sequence Pi-X-S. Thus, in any inXable sequence, the in- prefix operates on Xable, with a few exceptions (e.g., in incorporable, -able operates on incorpor(er) ‘to incorporate’), while the same prefix has no impact on the suffix -ité (e.g., incontournabilité ‘unavoidability’). The analysis functions of the prefixes FprefXS are verified by the corresponding FS functions because of this strong dependency relationship between the prefix pref which operates on the suffixed base XS and the suffix S.

In the case of indéracinable, FprefXable determines that in- operates on déracinable ‘eradicable,’ by associating the gloss “non déracinable” ‘not eradicable’ to in-. This function is recursive, and thus reapplies itself (as long as there are results), examining the word déracinable. This time the attempted analysis fails since dé- does not operate on racinable. Déracinable is therefore the last positive result and is sent to the calling function Fable. At the same time, information relative to the prefix in- is stored in the RES list, in order to record the following features until the program output is ready: (1) the semantic relationship that the affix establishes between the entry word (indéracinable) and the partial result (déracinable), (2) the affix’s position in the hierarchical representation of the morphological operations required for the complete analysis of the word, and (3) the constructed word’s category, as well as that of its base.

When confronted with déracinable, FprefXable deletes -able, calculates the possible verbal (déraciner) and nominal (déracin/déracin-e) allomorphs, then consults the reference lexicon TLFnome, in order to retain only the first hypothetical analysis (since the nouns déracin and déracine are both absent from the reference). The result that is sent to the application is therefore the verb déraciner. At the same time, RES receives the structural information that has just been calculated for -able.
Since RES was initialized with in-, the semantic relationship that -able establishes between déracinable and déraciner is not recorded. At the end of the process, the only gloss that appears is the one corresponding to the action carried out by the most peripheral affix (the one that was analyzed first, and therefore the one that initializes RES). The ending of déraciner cannot be related to a constructional suffix. Consequently, the application tries to activate the recursive function FprefX, which searches for the prefixes that could operate on a non-suffixed base. The function recognizes dé- and determines the base’s category (by checking different hypotheses against the content of the reference). In fact, dé- operates either on a verb or on a noun to produce a verb. In this case, the second solution is retained: the noun (racine) is not prefixed, thus FprefXS is not reapplied. RES records the new structural information that was just calculated with dé-, and the process is finished. In the end, the results stored in RES for the analysis of indéracinable are as follows:
(5) indéracinable/ADJ :: [in- [[dé- [racine NOM ] (er) VERBE] -able ADJ] ADJ ] :: (indéracinable, déracinable, déraciner, racine) :: “non déracinable” ‘not eradicable’
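The recursive breakdown that leads to (5) can be mimicked on a very small scale. The sketch below is a toy under strong assumptions and is not DériF: the four-word LEXICON stands in for the reference lexicon, the functions decompose and analyze are hypothetical, only in-, -able and dé- are handled, and the scope of in- is decided by a simple test (if Xer with X beginning in in- is an attested verb, -able applies first; otherwise in- is the most peripheral operator).

```python
from typing import Dict, List, Optional, Tuple

# Toy reference lexicon (assumption of this sketch): lemma -> category.
LEXICON: Dict[str, str] = {
    "déraciner": "VERBE",
    "racine": "NOM",
    "incorporer": "VERBE",
    "adapter": "VERBE",
}

Step = Tuple[str, str, str]  # (operator, base, category of the base)


def decompose(word: str, cat: str) -> Optional[Step]:
    """Undo the most peripheral operation of word, or return None."""
    if cat == "ADJ" and word.endswith("able"):
        verb = word[:-4] + "er"
        if LEXICON.get(verb) == "VERBE":
            return ("-able", verb, "VERBE")   # e.g. incorporable -> incorporer
        if word.startswith("in"):
            return ("in-", word[2:], "ADJ")   # e.g. indéracinable -> déracinable
        return None                           # e.g. stable: not constructed
    if cat == "VERBE" and word.startswith("dé") and word.endswith("er"):
        if LEXICON.get(word[2:]) == "VERBE":
            return ("dé-", word[2:], "VERBE")
        stem = word[2:-2]
        for noun in (stem, stem + "e"):       # nominal allomorphs of the base
            if LEXICON.get(noun) == "NOM":
                return ("dé-", noun, "NOM")
    return None


def analyze(word: str, cat: str) -> Tuple[List[str], List[str], Optional[str]]:
    """Return (family, operators from most to least peripheral, gloss).

    Only the gloss of the most peripheral operator is kept, and this toy
    only knows how to gloss in- ("non X").
    """
    family, operators = [word], []
    step = decompose(word, cat)
    while step is not None:
        operator, base, base_cat = step
        operators.append(operator)
        family.append(base)
        step = decompose(base, base_cat)
    gloss = f"non {family[1]}" if operators and operators[0] == "in-" else None
    return family, operators, gloss


family, operators, gloss = analyze("indéracinable", "ADJ")
print(family, operators, gloss)
```

With this toy lexicon, incorporable is broken down by -able first and inadaptable by in- first, which is the contrast in processing order described in section 3.2.1; the scope test used here is a simplification chosen for the sketch.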
The preceding presentations show the advantages and inconveniences of each system. In the next section, in order to complete the comparison and justify why the MorTAL project resorts to such different systems, we will apply each system to the same constructional operation, the -able suffixation. (Recall once again that the project is meant to have each system work on complementary zones of the lexicon. The goal of the following confrontation is simply to compare the two systems in situ.)
4. DéCor, DériF, and -able suffixation

4.1. Processing by DéCor

DéCor presents a lower degree of integration than DériF does since it is not, strictly speaking, a morphological analyzer. It is a set of programs to be run successively by the user, according to his or her needs. The processing of a prefix consists of four operations, E1 to E4. For a suffix, four operations are needed as well: E1, E5, E6, and E7. Three additional operations, E2, E3, and E4, are carried out if productive prefixations exist for words of the same category as the derived words (for instance, if the derived units are adjectives), and consequently only if some derived words can be the result of both a prefixation and a suffixation (as in the previous example imperturbable). The remainder of this section presents the operations E1 to E7 (cf. Figure 1).
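The choice of operations just described can be summarized in a few lines. This is only an illustrative sketch: the function decor_operations and its parameters are hypothetical, and the assumption that the operations run in the order of their numbering is ours.

```python
from typing import List


def decor_operations(affix_kind: str, prefixable_category: bool = False) -> List[str]:
    """List the operations to run for one affix.

    affix_kind is "prefix" or "suffix"; prefixable_category states whether
    productive prefixations exist for the category of the derived words.
    """
    if affix_kind == "prefix":
        return ["E1", "E2", "E3", "E4"]
    if affix_kind == "suffix":
        if prefixable_category:
            # E2 to E4 are added to the four suffix operations.
            return ["E1", "E2", "E3", "E4", "E5", "E6", "E7"]
        return ["E1", "E5", "E6", "E7"]
    raise ValueError(f"unknown affix kind: {affix_kind!r}")


# -able forms adjectives, a category open to prefixation (e.g. in-).
print(decor_operations("suffix", prefixable_category=True))
```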
[Figure 1 (flowchart, not reproduced). Source: TLFnome. Operations: E1: grep; E2 and E5: findaffix; E3 and E6: applyaffix; E4 and E7: best-base. Data shown: potential bases; potential prefixed-only bases; derived words; prefixation rules; suffixation rules; prefixed candidates; suffixed candidates; both prefixed and suffixed candidates; selected prefixation rules; prefixed bases; bases.]

Figure 1. The processing of derived words by DéCor
4.1.1. Corpora selection (E1)

DéCor is designed to pair the elements of a list of derived words with those of a list of potential bases. Therefore, the first operation to be carried out is to build these lists according to the affix to be treated, and, for suffixations, according to the possibility that the formation of some derived words also calls into play a prefixation. For instance, in order to pair verbs that include the prefix re- with their bases, one has to extract from the reference lexicon (TLFnome) the list of the verbs that include the graphemic prefix “re” (or “ré” followed by a vowel, as in réaligner, rééditer, réitérer...), and a list of their potential bases, that is, the whole set of verbs. The pairing of adjectives ending in -able derived from verbs is performed on a list of adjectives that end with “able” or its “ible” allomorph, and a list of the TLFnome verbs. The pairing also requires the consideration of the possible prefixations, as for (imperturbable, perturber). In theory, the prefixations should be carried out with rules learned by findaffix on two additional lists: a word corpus containing the corresponding prefixes on the one hand, and a corpus of their derived bases on the other hand (see Corbin forthcoming a for a discussion on the prefixes’ categorizing capability). In practice, for reasons which we will not explain here, for the -able suffix the list of all TLFnome adjectives can be used.

In the following, let W be the set of derived words we are interested in, B the set in which we look for the bases of words that are both prefixed and suffixed, and B' the set in which we look for the bases of words that are only prefixed. These lists are extracted from the reference lexicon on the basis of their categorical and graphemic properties. Since the criteria used are quite approximate, they do not enable us to distinguish true affixes from pseudo-affixes like -ier in peuplier ‘poplar tree,’ which could wrongly be analyzed as derived from peuple ‘people’ (cf. Corbin and Corbin 1991), nor from cases where words “accidentally” include a graphemic affix such as -able in stable ‘stable.’ In addition, we assume that all the words in W are of the same grammatical category.
In the case of morphological operators able to form words of different categories, W is partitioned into categorically homogeneous subsets that are processed separately. In its present version, DéCor does not provide a complete account of operators which apply to words of different categories. For instance, words ending in -able formed on nominal bases (e.g., corvéable, formed on corvée ‘fatigue’) are not properly paired since B only includes verbs. The words in W can be divided into four subsets: Wnd for the non-derived words (préalable ‘preliminary’); Wp for the prefixed-only words (incassable, cassable ‘unbreakable, breakable’); Ws for the suffixed-only words (mangeable, manger ‘able to be eaten, to eat’); Wps for the words that are both prefixed and suffixed (imperturbable, perturber ‘imperturbable, to perturb’). In its present version, DéCor can only deal with the last three subsets; it cannot recognize non-derived words.
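The list extraction described in this section can be sketched as follows. This is only an illustration under stated assumptions, not DéCor's code: the toy lexicon, its (form, category) encoding, and the function name select_corpora are hypothetical.

```python
# Toy stand-in for the reference lexicon: (form, category) pairs (hypothetical data).
LEXICON = [
    ("mangeable", "ADJ"), ("imperturbable", "ADJ"), ("incassable", "ADJ"),
    ("cassable", "ADJ"), ("stable", "ADJ"), ("lisible", "ADJ"), ("grand", "ADJ"),
    ("manger", "V"), ("perturber", "V"), ("casser", "V"), ("lire", "V"),
    ("table", "N"),
]

def select_corpora(lexicon, endings=("able", "ible"), derived_cat="ADJ", base_cat="V"):
    """Build W, B and B' from categorical and graphemic criteria only."""
    # W: words of the derived category that end with the graphemic suffix.
    W = {form for form, cat in lexicon if cat == derived_cat and form.endswith(endings)}
    # B: potential bases for suffixation (here, all verbs).
    B = {form for form, cat in lexicon if cat == base_cat}
    # B': potential prefixed-only bases (here, all adjectives, as for -able).
    B_prime = {form for form, cat in lexicon if cat == derived_cat}
    return W, B, B_prime

W, B, B_prime = select_corpora(LEXICON)
```

Because the criteria are purely graphemic and categorical, stable ends up in W although it is not derived, which is the approximation discussed above.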
4.1.2. Learning the prefixation rules (E2) The words in Wp are, a priori, not distinguishable from the other prefixed words of B' (no matter what the prefix is). In particular, all prefixes that can apply to words of B' can do so to words of Wp. We can therefore use B' as
corpus for learning prefixation rules with findaffix. Notice that when the main processing concerns a suffix such as -able, all bases of words in Wp are necessarily in W. In practice, W does not have enough elements to allow findaffix to learn reliable prefixation rules. For instance, in the case of -able, B' contains 15,249 adjectives while W only has 1,011 elements. (Remember that lexical frequency is one of the two criteria used by findaffix to evaluate the validity of rules.) Let us call P(B') the set of prefixation rules computed at this stage.11
4.1.3. Applying the prefixation rules (E3) For now, DéCor consists of two programs: applyaffix and best-base. The former computes a set of candidate bases for each w ∈W by applying to w a set of prefixation and/or suffixation rules, then filtering the resulting character strings with a set of potential bases (B' or B depending on the individual cases). applyaffix is used in E3 to apply P(B') rules to W elements, then to eliminate the strings that do not belong to B'. Let cp(W) denote the set of candidates yielded by this operation.12 The second program, best-base, is described in section 4.1.4 and section 4.1.7.
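As a rough sketch of what step E3 does, the function below applies prefixation rules to a word and keeps only the strings found in B'. The encoding of a rule as a (stripped prefix, added prefix) pair and the name apply_prefix_rules are assumptions made for the example; they do not reproduce the actual applyaffix program or the findaffix rule format.

```python
def apply_prefix_rules(word, rules, potential_bases):
    """Return {candidate base: [rules that produced it]} for one derived word."""
    candidates = {}
    for strip, add in rules:
        if word.startswith(strip):
            candidate = add + word[len(strip):]
            # Filter the generated string against the set of potential bases.
            if candidate != word and candidate in potential_bases:
                candidates.setdefault(candidate, []).append((strip, add))
    return candidates

# Hypothetical rule set P(B') and potential bases B'.
P = [("in", ""), ("in", "re"), ("a", "")]
B_prime = {"cassable", "touchable", "retouchable", "valable", "stable"}

cp = {w: apply_prefix_rules(w, P, B_prime)
      for w in ["incassable", "intouchable", "avalable", "stable"]}
```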
4.1.4. Best prefixed candidate selection and prefixation rules filtering (E4) In the SILEX model, each word can only have one base for a given morphological operator. However, some constructed sequences may have more than one base, but this multiple structural analysis reveals a structural homomorphy. This is the case, for instance, of inactivable, which can be formed on activable or on inactiv(er).13 Given the approximation that every derived word has only one base (homonymy cases being rare), best-base determines the statistically most likely base of each word w ∈ W; it sorts the candidate bases of w with a comparison function that is specific to the formation type: prefixation (E4), suffixation (E7), or both prefixation and suffixation (E7).
11. The rules are learned with default values for findaffix parameters (m, minimal radical length = 3; M, maximal prefix length = 8), except for the frequency threshold, l, that has been lowered to 5.
12. More precisely, cp(W) is the set of (w,cp(w)) pairs, where w ranges over W and where cp: W→P(W) associates with each element x of W, the set of elements of W that can be connected to x by a rule of P(B'). (P(W) stands for the power set of W.)
13. When derived from activable, inactivable points out the lack of the expected property ‘activable’; when it derives from inactiv(er), inactivable indicates that the referent of its governing noun can be stripped of the expected property ‘active.’ An inactivable substance (1) cannot be made active in the first case, or (2) can be made inactive in the second case. Note that in Bybee’s model, a word has at most one entry in the lexicon and that homonymy simply follows from the semantic links that hold between inactivable and activable, and between inactivable and inactiver.
For prefixed candidates, best-base uses three criteria. It ignores prefixations that involve an addition (such as in/re/9/53 which, for instance, connects intouchable ‘untouchable’ and retouchable ‘retouchable’) or that are used for less than 5% of cp(W) (such as a//106/814 which, for instance, connects avalable ‘swallowable’ and valable ‘valuable,’ but is used only for 3 pairs while there are 334 elements in W that can be paired with a prefixation rule satisfying the first criterion). This threshold is a parameter of best-base that has to be adjusted experimentally according to the corpus in order to eliminate the rules that yield at least one error.14 The third criterion consists in favoring the most frequent rule with respect to cp(W) or, failing that, with respect to P(B') (i.e., for the whole B'). Let bp(W) denote the set of pairs (w,bp(w)) such that w ranges over W, and bp: W→S(W) ∪ {Ø} associates with each element of W the best prefixed candidate that has been selected, or Ø if none has been selected.15 Therefore, bp(w) contains at most one candidate base. Moreover, when a derived word has one such prefixed-only base, the latter can be considered the desired solution, the use of a threshold having eliminated the incorrect prefixations. The best prefixed candidate selection also yields as a by-product an additional criterion for filtering the prefixation rules: the possibility of keeping only those rules which have been used for at least one (w,bp(w)) pair, when w ranges over W (we then get a restriction of P(B') to W). In other words, only prefixations attested for pairs of W elements are retained for the pairing of both prefixed and suffixed bases. Let us call this set of rules PW(B').
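The three criteria can be illustrated with the following sketch. It is a simplification under stated assumptions, not the best-base program: each candidate is assumed to come from a single rule encoded as a (stripped, added) pair, the function name and data are hypothetical, and the threshold is raised to 30% in the toy call only because the example corpus has four words.

```python
from collections import Counter

def best_prefixed(cp, global_freq, threshold=0.05):
    """cp: {word: {candidate: (strip, add)}}; returns {word: candidate or None}."""
    # Criterion 1: ignore prefixations that involve an addition (e.g. in -> re).
    pure = {w: {c: r for c, r in cands.items() if r[1] == ""}
            for w, cands in cp.items()}
    pairable = [w for w, cands in pure.items() if cands]
    usage = Counter(r for w in pairable for r in pure[w].values())
    # Criterion 2: ignore rules used for less than `threshold` of the pairable words.
    frequent = {r for r, n in usage.items() if n >= threshold * len(pairable)}
    best = {}
    for w, cands in pure.items():
        ok = {c: r for c, r in cands.items() if r in frequent}
        # Criterion 3: most frequent rule in cp(W), then most frequent in P(B').
        best[w] = (max(ok, key=lambda c: (usage[ok[c]], global_freq.get(ok[c], 0)))
                   if ok else None)
    return best

cp = {
    "incassable": {"cassable": ("in", "")},
    "inlavable": {"lavable": ("in", "")},
    "intouchable": {"touchable": ("in", ""), "retouchable": ("in", "re")},
    "avalable": {"valable": ("a", "")},
}
bp = best_prefixed(cp, global_freq={("in", ""): 100, ("a", ""): 100}, threshold=0.3)
# Rules that survive for at least one (w, bp(w)) pair: the restriction PW(B').
PW = {cp[w][b] for w, b in bp.items() if b is not None}
```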
4.1.5. Learning the suffixation rules (E5 ) The findaffix script is used for a second time to compute suffixation rules. In its standard version, the script runs on a single set consisting of both the derived words and their candidate bases (which would correspond to W ∪ B). Therefore, the computed rules correspond to relations between one word from W and one from B, but also to relations between two elements of W (e.g., table/vable (achetable, achevable) ‘buyable, finishable’) or between two elements of B (e.g., ter/ver (acheter, achever) ‘to buy, to finish’). In order to reduce this overgeneration, we have modified findaffix so as to distinguish the set of derived words from the set of candidate bases. Let us call S(W,B) the set of rules learned on W and B. 16
14. A new value has to be set up for each new corpus. Let us recall that DéCor is intended for the semi-automatic construction of a LEICS. In particular, the user must define the learning parameters of findaffix. The choice of a filtering threshold for the prefixed candidates selection falls within the same process.
15. S(W) stands for the set of W singletons.
16. For the adjectives ending in -able and derived from verbs, findaffix parameters have been set as follows: m, minimal radical length = 3; M, maximal prefix length = 8; l, frequency threshold = 3.
4.1.6. Applying the prefixation and suffixation rules (E6) In E6, applyaffix is used to pair the words of W with their potential bases both prefixed and suffixed and with their potential bases suffixed only. For the first ones, it applies PW(B') prefixation rules and S(W,B) suffixation rules to every word w ∈ W, then it filters the yielded strings with B.17 Let cps(w) denote the set of candidate bases associated with w by this first operation and cps(W) denote the set of pairs (w,cps(w)) when w ranges over W. For the second ones, only S(W,B) are applied. Let cs(w) denote the set of candidate bases computed by this second operation and cs(W) denote the set of pairs (w,cs(w)) when w ranges over W.
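A minimal sketch of E6, assuming rules are encoded as (stripped, added) string pairs; the helper names and the rule sets below are hypothetical and only illustrate how cps(w) and cs(w) differ.

```python
def apply_suffix(word, rule):
    strip, add = rule
    return word[:len(word) - len(strip)] + add if word.endswith(strip) else None

def apply_prefix(word, rule):
    strip, add = rule
    return add + word[len(strip):] if word.startswith(strip) else None

def candidates(word, prefix_rules, suffix_rules, B):
    """Return (cps, cs): bases reached by prefix+suffix rules, and by suffix rules only."""
    cps, cs = set(), set()
    for s_rule in suffix_rules:
        unsuffixed = apply_suffix(word, s_rule)
        if unsuffixed is None:
            continue
        if unsuffixed in B:              # suffixed-only candidate
            cs.add(unsuffixed)
        for p_rule in prefix_rules:      # both prefixed and suffixed candidate
            base = apply_prefix(unsuffixed, p_rule)
            if base is not None and base in B:
                cps.add(base)
    return cps, cs

PW = [("im", ""), ("in", "")]                 # stands in for PW(B')
S = [("able", "er"), ("eable", "er")]         # stands in for S(W,B)
B = {"perturber", "manger", "transporter"}
```

Since each rule is a plain string operation, applying the suffixation rule before or after the prefixation rule yields the same string, in line with footnote 17.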
4.1.7. Best candidate selection (E7) The only thing left to do now is to choose a solution from bp(w) ∪ cps(w) ∪ cs(w) for each w ∈ W. This last operation is carried out by best-base, which uses sorting functions to compare the candidates. Recall that bp(w) is already filtered and that it contains at most one candidate.
4.1.7.1. When bp(w) is not empty We saw in section 4.1.4 that when bp(w) is not empty, the prefixed candidate must be favored for two reasons. First of all, the use of a threshold (5% for the adjectives ending in -able) guarantees the validity of the pairing, so this solution must be preferred to a suffixed-only one (cf. section 4.1.4). The second reason is that prefixation pairing is a more basic relation than prefixation and suffixation pairing. When both are possible, there always exists a relation of suffixation between the prefixed candidate and the candidate that is both prefixed and suffixed. For instance, intransportable ‘nontransportable’ can be paired with transportable ‘transportable’ (prefixation only) and with transporter ‘to carry’ (both prefixation and suffixation). In this case, best-base chooses transportable because the latter pair can be recovered via the (transportable, transporter) pair.
4.1.7.2. When bp(w) is empty When bp(w) is empty, the solution must be looked for in cps(w) ∪ cs(w). We have tested four strategies to compare these candidates:
1. single sorting
2. combination of two sorting functions, with only one of them penalizing prefixed candidates
3. iteration of single sorting
4. iteration of combination of two sorting functions
17. The order in which the rules are applied does not matter; in the network model, lexical units do not have an internal structure.
4.1.7.2.1. Single sorting Four elementary sorting functions have been used: s1, s2, s3, and s4. They compare the candidates according to three criteria: maximal frequency of the rule used, minimal length of the rule used,18 and lack of prefixation. Table 1 presents the use of these criteria by the four functions. The efficiency of the functions depends on individual corpora. For the adjectives ending in -able and for the nouns ending in -ité ‘-ity,’ s1 does better overall than the other three.
Table 1. Elementary sorting functions, suffixation and prefixation
Sorting function    maximal frequency    minimal length    no membership of cps(w)
s1                  2nd                  3rd               1st
s2                  1st                  2nd               not used
s3                  3rd                  2nd               1st
s4                  2nd                  1st               not used
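One way to read Table 1 is as four lexicographic sort keys. The sketch below encodes them in that way; the Candidate record, its field names, and the figures in the example are assumptions introduced for illustration.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    base: str
    rule_freq: int   # frequency of the rule used
    rule_len: int    # characters stripped + characters added (footnote 18)
    prefixed: bool   # True if the candidate belongs to cps(w)

# Each key lists the criteria in the priority order given by Table 1.
# min() prefers False over True, higher frequency (negated) and shorter rules.
SORT_KEYS = {
    "s1": lambda c: (c.prefixed, -c.rule_freq, c.rule_len),
    "s2": lambda c: (-c.rule_freq, c.rule_len),
    "s3": lambda c: (c.prefixed, c.rule_len, -c.rule_freq),
    "s4": lambda c: (c.rule_len, -c.rule_freq),
}

def best(cands, name):
    """Return the candidate ranked first by the sorting function `name`."""
    return min(cands, key=SORT_KEYS[name])

# Made-up candidates: one frequent prefixed rule, one rarer suffix-only rule.
a = Candidate("a", rule_freq=120, rule_len=9, prefixed=True)
b = Candidate("b", rule_freq=40, rule_len=6, prefixed=False)
```

With these two candidates, only s2 selects the prefixed one, since it is the only function that ranks frequency first and ignores prefixation.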
4.1.7.2.2. Combination of sorting functions The next three strategies improve the results of single sorting with elementary functions. The first one combines two sorting functions, one of them, fs, penalizing cps(w) candidates (as s1 does) while the other, fp, does not (e.g., s2). The combination takes advantage of the strong points of each of these functions. It retains the candidate corresponding to the first condition satisfied:
1. fs and fp choose the same candidate
2. the prefix selected by fp has been also selected by fs at least 5 times (threshold depending on the individual corpora)
3. the prefixation rule of the candidate selected by fp has been retained by this function for at least 3% of the corpus (threshold depending on the individual corpora)
4. the fs candidate is chosen since this function is overall better than fp (for instance, for the nouns ending in -ité, in the 127 cases where s1 and s2 do not give identical solutions, s1 selects the correct base 77 times while s2 does so only 6 times)
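The cascade of four conditions can be sketched as below. The data layout is hypothetical: each choice is a (base, prefix) pair with prefix set to None for suffixed-only candidates, and the two count tables are assumed to have been collected beforehand over the whole corpus. Falling through to the fs candidate when the fp candidate has no prefix is an assumption of this sketch.

```python
def combine(fs_choice, fp_choice, fs_prefix_counts, fp_prefix_counts,
            corpus_size, min_fs_count=5, min_fp_share=0.03):
    """Return the retained (base, prefix) pair following conditions 1 to 4."""
    # 1. Both functions agree.
    if fs_choice == fp_choice:
        return fs_choice
    prefix = fp_choice[1]
    if prefix is not None:
        # 2. fs has itself selected this prefix often enough.
        if fs_prefix_counts.get(prefix, 0) >= min_fs_count:
            return fp_choice
        # 3. fp has retained this prefixation rule for a large enough share of the corpus.
        if fp_prefix_counts.get(prefix, 0) >= min_fp_share * corpus_size:
            return fp_choice
    # 4. Default: trust fs, the better function overall.
    return fs_choice

fs_pick = ("x", None)    # hypothetical suffixed-only candidate
fp_pick = ("y", "in")    # hypothetical prefixed and suffixed candidate
```

Both thresholds are exposed as parameters because, as stated above, they depend on the individual corpora.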
18. Where rule length is the sum of the number of characters stripped and added, and not the minimum of the maximum number of characters stripped or added as in Jacquemin 1997. When two rules have the same length, the one with the smaller number of stripped characters is preferred.
Of course, the combination of sorting functions only makes sense if W may contain derived words that are both prefixed and suffixed. In this case, as with -able adjectives, fs + fp does better than fs or fp alone.
4.1.7.2.3. Iteration of single sorting The efficiency of the single sortings can be improved by reiterating the sorting on more and more precise sets of candidates (that is, sets with fewer and fewer incorrect candidates). This technique reduces the number of rules used to perform the pairings in order to favor the most productive ones as well as the ones that yield single candidates (i.e., such as when bp(w) ∪ cps(w) ∪ cs(w) is a singleton). Before the i-th iteration is carried out, we remove from [bp(w) ∪ cps(w) ∪ cs(w)]i–1 all the candidates that have been obtained with rules that have yielded no best candidate at the iteration i–1. In addition, the frequencies stemming from PW(B') and S(W,B) are replaced by those of the prefixation and suffixation rules of the best candidates of iteration i–1. Single sorting iteration converges rapidly towards a fixed point (at most 4 iterations for the -able and -ité corpora). It only enhances the precision of frequency-based functions like s1 and s2. It is useless to iterate sorting with length-based functions since affix size does not depend on the rule used.
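The iteration can be sketched as a loop that prunes candidates and recomputes rule frequencies until nothing changes. The version below is a simplified illustration: it sorts on rule frequency alone, assumes one rule per candidate, and uses placeholder names (w1, b1, r2, ...) rather than real data.

```python
from collections import Counter

def iterate_sorting(candidates, freq, max_iter=10):
    """candidates: {word: {base: rule}}; freq: {rule: initial frequency}."""
    best = {}
    for _ in range(max_iter):
        # Pick, for each word, the candidate whose rule is currently most frequent.
        best = {w: max(c, key=lambda b: freq.get(c[b], 0))
                for w, c in candidates.items() if c}
        # Frequencies for the next round: rules of the best candidates only.
        winning = Counter(candidates[w][b] for w, b in best.items())
        # Drop candidates whose rule yielded no best candidate in this round.
        pruned = {w: {b: r for b, r in c.items() if r in winning}
                  for w, c in candidates.items()}
        if pruned == candidates and dict(winning) == freq:
            break  # fixed point reached
        candidates, freq = pruned, dict(winning)
    return best

toy = {
    "w1": {"b1": "r2"},
    "w2": {"b2": "r2"},
    "w3": {"b3": "r2", "b3x": "r3"},
}
result = iterate_sorting(toy, {"r2": 8, "r3": 9})
```

In this toy run, w3 is first paired through r3, whose initial frequency is higher; once the frequencies are recomputed from the best candidates, r2 (used three times) takes over and the pairing switches to b3.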
4.1.7.2.4. Iteration of combinations of sorting functions Iteration can also improve the efficiency of the combination of sorting functions. The reduction of the set of candidates is carried out in the same way as for single sorting: elimination of the candidates that have been obtained by rules that have yielded no best candidate at the previous iteration. Again, iteration is only useful for frequency-based functions and if W can include words that are both prefixed and suffixed. As we have just seen, DéCor only uses statistical rules. We will now demonstrate that the rules used by DériF are very different.
4.2. Processing with DériF As mentioned in section 2.2, DériF implements linguistic hypotheses. We will begin here by linguistically characterizing the -able suffixation, before explaining how DériF implements these hypotheses.
4.2.1. General linguistic framework The linguistic hypotheses implemented by DériF are part of a more general theoretical framework intended to calculate jointly the structure and the meaning of constructed lexical units. It is rooted in particular in the constructional morphology model developed by Danielle Corbin and her team at the UMR SILEX at Lille 3 in France.
In this model, constructional operators fulfill certain constraints that can be organized according to a hierarchy:19
• First, phonological constraints, some of which are general. This is the case, for example, of the haplology rule that Corbin and Plénat (1992: 101) define as “la superposition de deux syllabes (ou de deux fragments de syllabes) identiques à la jointure d’une base et d’un suffixe […]” [‘the superposition of two identical syllables (or of two identical fragments of syllables) at the junction of a base and a suffix’]. The suffix -at recurrently forms state nouns from nouns designating human beings according to their social status (cf. artisan → artisanat ‘craftsman → craft industry’, mandarin → mandarinat ‘mandarin → mandarinate’), while the haplology rule renders impossible the formation of such nouns as *magistratat to name the office of magistrate or *candidatat to name the status of candidate (for other examples, cf. Corbin forthcoming b, Roché 1997).
• Next, semantic constraints: since the adjective conique ‘conical’ does not satisfy the semantic constraints that the suffix -erie imposes on the bases it selects, the noun *coni(qu+c)erie is excluded. Briefly stated, conique expresses an objective property, while -erie requires bases expressing nonobjective properties.20
• Finally, categorical constraints: generally, these are the ones that are mentioned first when defining a constructional operator, because they are certainly the easiest to describe. In fact, a given operator does not indiscriminately take any categorical type of base and does not randomly form any categorical type of derivative. However, these categorical restrictions issue from the preceding ones. For example, the prefix in- which forms antonyms of bases cannot be applied to verbs to form verbs (for example, mang-V/*immang-V ‘to eatV/to uneatV’;21 parl-V/*imparl-V ‘to speakV/to unspeakV’) because its semantic role is to state the absence of the property expected of its base. Therefore, the base must express a property, which verbs do not do.
19. For an explanation of the main principles of the model, see Corbin 1991. For a detailed explanation of the latest stage of the model, see Corbin forthcoming b.
20. Regarding -erie, cf. Temple 1996, Dal and Temple 1997.
21. Verbs like inactiv(er) ‘to inactivate’ or insensibilis(er) ‘to desensitize’ do not contravene the rule. Semantic arguments show that they are derived by the conversion of adjectives whose structures already contain the prefix in-; they are respectively constructed from inactif ‘inactive’ and insensible ‘insensitive.’
4.2.2. Linguistic analysis of the suffix -able Given what has just been stated, describing the suffixation by -able in order to then implement it means giving it at least a double characterization: semantic and categorical. (We will not treat phonological constraints in this article because they are not very useful for a constructional analyzer: an attested word is necessarily phonologically possible. On the other hand, it is interesting to look for phonological constraints when dealing with the generation of constructed lexical units.22) The results of -able suffixation are all adjectives. True, the lexicon contains -able nouns (for example, imperméable ‘impermeable/rain coat,’ or portable ‘portable/cell phone or laptop computer’), but these nouns can be analyzed as the results of a conversion applied to adjectives that contain -able in their structure. Imperméable refers to entities that have the salient property of being “impermeable”; portable refers to entities with the salient property of being “portable.” Semantically speaking, -able adjectives express that the referents of the nouns that they modify have latent properties which, in themselves, are assumed to be valid, regardless of whether they have been actualized or not. Thus, a hill can be abordable ‘accessible’ even though it has never been abordée ‘accessed,’ and a politician can be said to be ministrable ‘potentially a minister’ although he has never been a minister. However, a property is only latent if it can be actualized. It can thus be predicted that the properties expressed by -able adjectives should, at some level, bring into play a process that allows them to be actualized. Optimally, this process is expressed by the base of the derivative, which is therefore typically a verb. In fact, the systematic examination of a corpus of 1378 -able adjectives from the TLFnome (1009 derivatives) supplemented by the Robert électronique (+369 derivatives) reveals that the bases of Xable adjectives are verbs in 1350 of the cases (98%). However, the base can be a noun and refer to any of the following (the partition here includes Plénat’s (1988) results):
1. The result of a process susceptible of revealing the given property. In dictionaries, this particular case seems to be largely, though not exclusively, restricted to instances where the base refers to a tax (e.g., congéable ‘susceptible of obtaining a clearance certificate (or congé),’ mainmortable ‘not subject to death duties’; but cf. opéable ‘susceptible of being the object of an “Offre Publique d’Achat,” or takeover bid’).23 Thus, a non-attested adjective (marked by convention with a superscript circle) such as °IVGable ‘one who can undergo an “Interruption Volontaire de Grossesse,” or termination of pregnancy’ is not a problem.
22. The linguistic rules implemented in the development of DériF can be reused to construct a generator; cf. Dal and Namer 2001, Namer and Dal 2000. 23. Opéable (“that which can be the object of an OPA (Offre Publique d’Achat or takeover bid)”) does not appear in either the TLFnome or the RE. However, it is attested in the Petit Robert (1996), and 57 occurrences can be found in Le Monde (1999).
2. The public figure that the referent of the noun that the adjective modifies can become after an election (or nomination): ministrable, papable, °cardinalable ‘susceptible of becoming minister, pope, cardinal.’
3. The means of locomotion which allows one to move (e.g., cyclable ‘where one can travel on a bicycle,’ °rollerable ‘where one can travel on rollerblades’).
4. The property that results from the realization of a process, whether it be an incited property (effroyable ‘horrifying’) or a demonstrated property (charitable ‘charitable’). The possession of the virtual property charitable is in fact only verifiable through the realization of an act of charity.
4.2.3. The implementation of linguistic rules Now let us see how DériF implements the linguistic rules that have just been described. A lexical unit W with an -able ending activates the specific function that breaks down the suffix -able. The reasoning of the analysis of -able derivatives is as follows:
First step: Allomorph search
a. W can be rewritten B'W#able
When it is appropriate to hypothesize that W can be rewritten B'W#able,24 the function starts by calculating the allomorphic variant BW of B'W. The calculation is a function of morpho-phonological properties of B'W (for example, an aperture change in the last vowel of the base, pronunciation maintained unchanged), as illustrated by the following examples of variant pairing:
Table 2. Allomorph pairing rules
W = B'W#able                            BW                             Rule
misér#able ‘miserable’                  misère ‘misery’                éC ↔ èCe
favor#able ‘favorable’                  faveur ‘favor’                 or ↔ eur
objectionn#able ‘objectionable’         objection ‘objection’          tionn ↔ tion
navig#able ‘navigable’                  navigu(er) ‘to navigate’       g ↔ gu
b. W cannot be rewritten B'W#able
When W cannot be rewritten B'W#able, B'W = BW.
24. ‘#’ symbolizes the morpheme separator.
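The four pairing rules of Table 2 can be approximated with regular expressions over the written form. This is a sketch under stated assumptions: the rule list, the function name allomorphs, and the character class used to stand for the consonant C are illustrative choices, and DériF's own conditions are morpho-phonological rather than purely graphemic.

```python
import re

# (pattern on B'W, replacement) pairs; C is approximated as "any non-vowel letter".
ALLOMORPH_RULES = [
    (re.compile(r"é([^aeiouyéèêàâîôû])$"), r"è\1e"),  # misér -> misère
    (re.compile(r"or$"), "eur"),                      # favor -> faveur
    (re.compile(r"tionn$"), "tion"),                  # objectionn -> objection
    (re.compile(r"g$"), "gu"),                        # navig -> navigu(er)
]

def allomorphs(stem):
    """Return the stem itself plus every variant licensed by a pairing rule."""
    variants = [stem]
    for pattern, replacement in ALLOMORPH_RULES:
        if pattern.search(stem):
            variants.append(pattern.sub(replacement, stem))
    return variants

def strip_able(word):
    """Return B'W if the word can be rewritten B'W#able, else None."""
    return word[:-4] if word.endswith("able") and len(word) > 4 else None
```

A caller would then look up each variant in the reference lexicon and keep those that are attested, which is where case b (no variant applies, so B'W = BW) falls out naturally.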
Second step: Base analysis
[The following reasoning is based on BW, possibly an allomorph of B'W.]
a. BW corresponds to an infralexical unit which identifies a base of foreign origin
When BW corresponds to an infralexical unit which identifies a base of foreign origin (for example, séc- in sécable), the program looks it up in a special table that inventories all bases which are absent from the reference and come from Latin, Greek, German, etc. The list includes their translation, sometimes approximate, as well as the grammatical categorization of the translation:
Table 3. Foreign word bases
BW (category = FWD)    Translation of BW     Category of the translation
séc-                   couper ‘to cut’       V
BW is therefore retained in the analysis part of the results displayed (with the tag FWD, “foreign word”), while the translation appears in the second and third parts of the results:
(6) sécable/ADJ: [[séc- FWD] able ADJ], (sécable, séc- ≈ couper), “que l’on peut séc- (≈ couper)” [‘that one can séc- (≈ to cut)’]
The category of translation of the infralexical unit is used to determine the content of the gloss in some cases, as discussed in 2.2, and as illustrated by the contrast between the example sécable and the one below:
(7) épiscopable/ADJ: [[épiscop- FWD] able ADJ], (épiscopable, épiscop- ≈ évêque), “qui peut devenir épiscop- (≈ évêque)” [‘who can become épiscop- (≈ bishop)’]
b. BW is in all capitals
When BW is in all capitals, the function puts forth the hypothesis that this might be an acronym (tagged SIG) with gloss properties comparable to those of a nominal base:
(8) VIPable/ADJ: [[VIP SIG] able ADJ], (VIPable, VIP), “qui peut devenir VIP” [‘who can become a VIP’]
c. BW is recognized in the reference with a NOUN category and a VERB category
When BW is recognized in the reference with a NOUN category and a VERB category, the program chooses the verbal interpretation, excluding
the nominal interpretation, except when the noun refers to a vehicle or when it is a property noun. In the general case, a nominal base construction would in fact be either semantically incorrect when the noun does not fall into one of the semantic types that were inventoried in section 4.2.2, or redundant compared to a verbal base construction when the noun refers to a tax or a process. As an example of the former case, bâtissable ‘buildable’ must be constructed from the long stem of the verb with the denominative form bâtir ‘to build,’ and not from the noun bâtisse ‘building,’ because the latter is not a noun of property, public office, tax, vehicle, or process. The redundancy when the noun refers to a tax or a process is obvious in the case of a process. For example, deriving analysable from the verb analys(er) ‘to analyze’ or the noun analyse ‘analysis’ leads to the expression of the same property since analyser expresses the process of “[f]aire l’analyse de…[to do an analysis of…]” (RE, s.v. analyser). It is less obvious in the case of a tax. However, it does seem that when a tax noun and a homomorphic verb (without taking into account the inflectional marks) are co-occurrent in the attested lexicon, the verb (1) is derived from the noun through conversion and (2) expresses the process of subjecting the referent of its direct object to the tax designated by the root. As examples, cf. patente ‘trading dues’ → patent(er) ‘to impose trading dues,’ taxe ‘tax’ → tax(er) ‘to tax.’ A construction based on a type of tax or on a derived verb leads to the same reference. In this case, the verb-based construction is chosen because it is the most frequent case. In the case where BW corresponds to a noun designating a vehicle or a verb resulting from the conversion of such a noun (for example, camion ‘truck’/camion(er) ‘to truck’; cycle/cycl(er) ‘to cycle’), the programming choice described above is blocked, and both analyses are generated.
When the verb refers to the process of moving by the means of the base’s referent (for example, cycl(er) ‘to cycle,’ canot(er) ‘to canoe’), both analyses lead to redundant references; however, when it refers to the process of transporting by means of the base’s referent (for example, camion(er) ‘to transport by truck,’ voitur(er) ‘to transport by car’), both analyses are legitimate. Thus when camionnable is assigned the structure [[camion NOM] able ADJ], we can predict the reference to be “that which can be a lane for trucks”; and when it is assigned the structure [[[camion NOM] VERBE] able ADJ], we can predict the reference to be “that which can be trucked,” i.e., “that which can be transported by truck.” The same option was adopted for the case in which BW corresponds to a property noun or a verb. This is true for confortable ‘comfortable’ and épouvantable ‘terrible, appalling’ for which confort ‘comfort’ and
confort(er) ‘to comfort,’ épouvante ‘terror, fear’ and épouvant(er) ‘to frighten, to appall’ are valid formal and semantic bases. Although the attested meaning of these two adjectives suggests that they are derived from nouns (a comfortable armchair provides comfort, appalling news arouses fear), there is theoretically no obstacle to their being derived from verbs. Thus, épouvantable could be glossed as “(Prep) that which can be appalled,” confortable as “(Prep) that which can be comforted.”
d. None of the types of cases above corresponds to BW
When none of the types of cases above corresponds to BW, the program then analyzes BW according to its unique category as found in the reference. This last case is the default rule, which applies when all of the others have failed. It is activated for the analysis of non-attested possible words (including neologisms). It follows the widespread hypothesis which supposes that these words are the result of a regular formation process, and therefore come under the most general construction schema. For example, détectable ‘detectable,’ which is attested neither in the TLFnome nor in the RE but which appears 29 times in Le Monde (1999) and 39 times in the Encyclopaedia Universalis CD-ROM, receives by default the following analysis:
(9) détectable/ADJ: [[détecter VERBE] able ADJ], (détectable, détecter), “que l’on peut détecter” [‘that one can detect’]
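The four cases a to d of the second step amount to an ordered decision procedure, which can be sketched as follows. The function name, its arguments, and the noun_type labels are hypothetical; in particular, the semantic type of the noun (vehicle, property, tax, process) is assumed to be supplied by the caller, and the foreign-base table is reduced to the two entries quoted above.

```python
# Excerpt of a foreign-base table: base -> (approximate translation, category).
FOREIGN_BASES = {"séc-": ("couper", "V"), "épiscop-": ("évêque", "N")}

def analyze_base(bw, categories=frozenset(), noun_type=None):
    """Return the list of (base, tag) analyses retained for BW."""
    # a. Infralexical unit of foreign origin.
    if bw in FOREIGN_BASES:
        return [(bw, "FWD")]
    # b. All capitals: hypothesized acronym.
    if bw.isupper():
        return [(bw, "SIG")]
    # c. Attested both as noun and as verb.
    if {"NOM", "VERBE"} <= set(categories):
        if noun_type in ("vehicle", "property"):
            return [(bw, "NOM"), (bw, "VERBE")]   # both analyses are generated
        return [(bw, "VERBE")]                    # verbal interpretation only
    # d. Default rule: the unique category found in the reference.
    return [(bw, cat) for cat in sorted(categories)][:1]
```

The order of the tests mirrors the order of the cases in the text, so the default rule is reached only when the three more specific cases have failed.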
4.3. Evaluation of the results Now that we have subjected a single linguistic operation (-able suffixation) to each program’s system, we will go on to the evaluation phase. We will start by comparing DériF to the theoretical hypotheses.
4.3.1. DériF’s conformity to theoretical hypotheses The main difficulty in the computer simulation of a group of theoretical hypotheses is that the requirements of the implementation sometimes lead to excessive simplification. DériF overcomes this difficulty very well, except in the expression of the gloss which poses additional problems (cf. section 3.3.1.2). This is what we intend to show now by giving the detailed parse tree and the gloss that DériF automatically associates to the units it analyzes.
4.3.1.1. Parse tree In DériF, the parse tree for the described units is presented as a bracketed structure with categorical tags. For example:
(10) biodégradable/ADJ: [[bio NOM] [[dégrad(er) VBE] able ADJ] ADJ]
(11) imperturbable/ADJ: [in [[perturb(er) VBE] able ADJ] ADJ]
(12) interchangeable/ADJ: [[inter [chang(er) VBE] VBE] able ADJ]
Because DériF gives these bracketed representations, it does not reduce the constructed lexical units to simple morpheme concatenations, as a flat representation would (for example, if imperturbable/ADJ were broken down into: in+perturb+able). Rather, it takes into consideration the impact that the constructional operators can have. Thus:
• The structure in (10) assigned to biodégradable shows a compositional operation that combines an infralexical noun (-bi(o)-) and a suffixed adjective (dégradable).
• The two other structures in (11) and (12) show the relative order of the suffixation and prefixation operations: suffixation followed by prefixation for imperturbable (the hierarchy of brackets in (11) takes into account the fact that the prefix in- affects the suffixed adjective perturbable and constitutes the most peripheral affix); prefixation followed by suffixation for interchangeable (in this derivative, the bracket structure shows that the suffix -able affects the prefixed base interchang(er)).
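To make the contrast with a flat representation concrete, the sketch below stores a structure such as (11) or (12) as nested nodes and recovers both the bracketed string and the order of the operations. The Node class and the two functions are hypothetical helpers, not DériF's internal representation.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Node:
    category: str
    base: Union["Node", str]   # a nested structure or a primitive base
    prefix: str = ""
    suffix: str = ""

def brackets(node):
    """Render a node as a bracketed structure with categorical tags."""
    inner = node.base if isinstance(node.base, str) else brackets(node.base)
    parts = [p for p in (node.prefix, inner, node.suffix, node.category) if p]
    return "[" + " ".join(parts) + "]"

def operations(node):
    """List the affixation operations from the innermost to the outermost."""
    inner = [] if isinstance(node.base, str) else operations(node.base)
    if node.prefix:
        inner.append("prefix " + node.prefix + "-")
    if node.suffix:
        inner.append("suffix -" + node.suffix)
    return inner

imperturbable = Node("ADJ", Node("ADJ", Node("VBE", "perturb(er)"), suffix="able"), prefix="in")
interchangeable = Node("ADJ", Node("VBE", Node("VBE", "chang(er)"), prefix="inter"), suffix="able")
```

The two words contain a prefix and the same suffix, yet operations returns them in opposite orders, which is exactly the information a flat in+perturb+able segmentation loses.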
4.3.1.2. Gloss In the theoretical context underlying the linguistic hypotheses that DériF implements, the constructed lexical units receive definitions that are deliberately metalinguistic (for the reasoning behind this choice, see notably Corbin 1993: 148 and Dal 1997: 37–38, 71). For example, in this context, the following is a possible formulation of the constructed meaning of the adjective mangeable ‘able to be eaten’: “says of the referent of the modified noun that it is characterized by the latent property of having applied to it the process expressed by mang-V ‘to eatV’.” Despite this, we decided to formulate the glosses automatically assigned to the DériF inputs in (semi-)natural language so that they can be used in NLP and IR. For example, DériF automatically assigns to mangeable the gloss “( lequel/Que l’)on peut manger ‘that which one can eat’,” which is more workable than the description above. In other words, the glosses in DériF are similar to derivational-type dictionary definitions (Martin 1992: 59–64). Although the glosses only reflect the application of the last constructional operation, in the case of a lexical unit that involves more than one operation, the semantic features assigned to the embedded structures can also be recovered and used. As an example, we will look at the constructed unit inadaptabilité, which involves three operations: in- prefixation which forms the antonym of a base, -able suffixation, and -ité suffixation.
Inadaptabilité is an XADJ-ité/NOM type of lexical unit. Consequently, the corresponding gloss is an instance of the generic gloss “property of that which is XADJ” associated with nouns produced by applying -ité to adjectives. Thus, in our LECSF, the gloss for inadaptabilité is (with the entry derivative’s base in bold):
(13) inadaptabilité/NOM: [ … ], ( … ), “propriété de ce qui est inadaptable” [‘property of that which is inadaptable’]
Inadaptable is an in-XADJ/ADJ type of lexical unit. This type of derivative expresses the absence of an expected property, approximately translated by the gloss “non A”:
(14) inadaptable/ADJ: [ … ], ( … ), “non adaptable”
Finally, adaptable is an XVBE-able/ADJ type of lexical unit. One way of formulating the constructed meaning of the derivative in (semi-)natural language is “that which can be XVBE”:
(15) adaptable/ADJ: [ … ], ( … ), “que l’on peut adapter”
Step by step, the meaning of the constructed word can in this way be reconstructed from the meaning of the primitive. Thus, the relationship between the meaning of inadaptabilité and that of adapt(er) can be represented by the following semi-formal notation:
(16) sens_de(inadaptabilité) = propriété_de_ce_qui_est(non(que_l_on_peut(sens_de(adapter))))
‘meaning_of(inadaptability) = property_of_that_which_is(not(that_which_one_can(meaning_of(to adapt))))’
DériF is therefore compatible with the hypothesis, made by the theoretical analyses it implements, that the meaning of constructed lexical units is compositional with respect to their structure. More importantly, it provides a means of recovering and exploiting this compositionality.
There is, however, a case in which DériF’s glosses are not very effective: when they are presented as disjunctions. The problem exists to varying degrees depending on whether the glosses deal with the meaning of the base or not.
For example, in order to treat circulable ‘able to be circulated’ as well as mangeable ‘able to be eaten’ in the present state of the LECSF, the gloss that is automatically assigned to an adjective with an XVBE-able/ADJ structure is “(lequel/Que l’)on peut XVBE ‘that which one can XVERB’.” Although it is disjunctive, this gloss remains fairly easy to use. On the other hand, at this time, the gloss assigned to an adjective with an XNOM-able/ADJ structure is “qui peut (devenir/subir/être une voie pour/faire l’objet de) XNOM ‘that which can (become/be subjected to/be a lane for/be the object of) XNOUN’.” This kind of gloss is not as easy to use because the choice among its alternatives depends on semantic information associated with a lexical or infralexical base unit (a public office, a tax or process, a means of transportation, a property). Unfortunately, these characteristics of the base are not accessible in our input corpus (TLFnome is made up of lemmas that only carry category tags). They are not to be found in any free-access lexicon either. Because there is no semantic information in the input, the gloss is underspecified whenever it involves the salient semantic property of the base. It is illusory to hope that DériF could be improved in this respect as long as this sort of information is not available.25 To conclude on this point, DériF faithfully reflects the proposed linguistic analyses, as long as they are not too concerned with the meanings of the bases.
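The step-by-step reconstruction illustrated by (13) to (16) can be sketched in a few lines of Python. This is only an illustration, not DériF's implementation: the template table, the analysis table, and the function names gloss and unfolded_gloss are hypothetical, and the templates simply copy the French glosses quoted above.

```python
# Hypothetical gloss templates, one per constructional operation discussed
# in the text (assumption: one fixed template per operation).
GLOSS_TEMPLATES = {
    "-ité": "propriété de ce qui est {base}",
    "in-": "non {base}",
    "-able": "que l'on peut {base}",
}

# Hypothetical analysis table: derivative -> (last operation, base).
# The three entries mirror examples (13), (14) and (15).
ANALYSES = {
    "inadaptabilité": ("-ité", "inadaptable"),
    "inadaptable": ("in-", "adaptable"),
    "adaptable": ("-able", "adapter"),
}


def gloss(word):
    """Gloss reflecting only the last constructional operation."""
    operation, base = ANALYSES[word]
    return GLOSS_TEMPLATES[operation].format(base=base)


def unfolded_gloss(word):
    """Unfold the glosses recursively down to the primitive, as in (16)."""
    if word not in ANALYSES:
        # Not analysed as a constructed word: treat it as the primitive.
        return word
    operation, base = ANALYSES[word]
    return GLOSS_TEMPLATES[operation].format(base=unfolded_gloss(base))


if __name__ == "__main__":
    print(gloss("inadaptabilité"))
    print(unfolded_gloss("inadaptabilité"))
```

The recursion stops at any word missing from the analysis table, which plays the role of the primitive adapt(er). A disjunctive template such as the one for XNOM-able adjectives could not be resolved this way without the semantic features of the base that the text describes as unavailable.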
4.4. DéCor and DériF comparison
Let us now compare the results produced by DéCor and by DériF. First, while DériF gives a three-fold analysis (parsing tree, family, gloss), DéCor only computes the base of the described entry. The information provided by the two systems therefore does not have the same degree of completeness. Let us then focus on what can be compared, and observe the bases retained by each system for the lexical units involving the -able suffixation. The intersection of the corpora treated by the two systems has 836 entries. The two systems propose the same base for 709 of them. In these 85% of cases, at least for the -able corpus, neither of the two systems performs better than the other with regard to the computation of the bases. Table 4 shows how the 127 cases (15% of the corpus) in which the systems do not give the same result divide up.
25. It is conceivable that one could manually add the appropriate semantic features to the constructional bases (whether they be possible or effective). But this solution can only come into play after the fact. We cannot know a priori that a certain affix prefers a certain semantic type of base, while another affix prefers another semantic type. For example, the preference of the suffix -ier for the “pragmatic” character of the bases that it selects, as defined in the hypothesis by Corbin and Corbin (1991), is not self-evident unless it is formulated in those terms.
Table 4. Types of divergence between DéCor and DériF

explanation          entry                          DéCor                     DériF
underspecification   imperturbable                  perturber                 °perturbable
ambiguity            coupable ‘guilty/cuttable’     couper ‘to cut’           culpa/couper
non-derived          friable ‘crumbly’              *frire ‘to fry’           friable
nominal bases        dommageable ‘harmful’          *dominer ‘to dominate’    dommage ‘harm’
rare suffixations    faisable ‘feasible’            *faisander ‘to hang’      faire ‘to do’

number (in the order given in the source): 62, 3, 25, 21, 13
These differences have different explanations:
1. The underspecifications cannot be considered errors because the bases computed by DériF are possible words that are absent from the reference dictionary. These constitute more than half of the differences between the two systems.
2. Ambiguities are not handled by DéCor because of the hypothesis stated in section 4.1.4. We can consider these cases as silences since only one of the solutions is given.
3. The verbal bases DéCor proposes for three non-derived words must be regarded as noise.
4. Rules involving allomorphs that are specific to a single verb, such as boire, buvable ‘to drink, drinkable,’ croire, croyable ‘to believe, believable,’ or faire, faisable, are not retained by findaffix. The correct pairing of these words cannot be performed by DéCor.
Note that for the -ité suffix, the results are not as good as for -able because this derivational operator involves more allomorphic variations. The results of the two systems coincide for only 83% of the entries.
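The comparison above amounts to intersecting the two sets of analysed entries and counting identical bases. The sketch below shows one way to do this in Python. The function name compare_bases and the dictionaries are hypothetical: the divergent pairs are taken from the examples of Table 4, while the mangeable entries are invented toy data added so that the sample contains an agreement. Neither dictionary reproduces actual system output.

```python
def compare_bases(decor, derif):
    """Compare two entry -> base mappings on the entries they share.

    Returns the entries with identical bases, the entries with different
    bases, and the agreement rate over the shared entries.
    """
    shared = sorted(set(decor) & set(derif))
    same = [w for w in shared if decor[w] == derif[w]]
    differ = [w for w in shared if decor[w] != derif[w]]
    rate = len(same) / len(shared) if shared else 0.0
    return same, differ, rate


# Toy sample. Divergent pairs come from the examples of Table 4;
# the "mangeable" pair is an assumption made for the illustration.
decor = {
    "imperturbable": "perturber",
    "friable": "frire",
    "dommageable": "dominer",
    "faisable": "faisander",
    "mangeable": "manger",
}
derif = {
    "imperturbable": "perturbable",
    "friable": "friable",
    "dommageable": "dommage",
    "faisable": "faire",
    "mangeable": "manger",
    "circulable": "circuler",  # not in the DéCor sample, so ignored
}

# Figures reported in the text for the -able corpus.
REPORTED_SHARED = 836
REPORTED_SAME = 709

if __name__ == "__main__":
    same, differ, rate = compare_bases(decor, derif)
    print(same, differ, rate)
    print(REPORTED_SHARED - REPORTED_SAME)               # diverging entries
    print(round(100 * REPORTED_SAME / REPORTED_SHARED))  # agreement in percent
```

With the reported figures, 709 of 836 shared entries gives an agreement of about 85%, and the remaining 127 entries are the divergences classified in Table 4.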
5. Conclusion
The preceding presentation has shown the advantages and drawbacks of each system:
•
Since it does not require any linguistic knowledge, DéCor is easier and faster to implement than DériF. However, this lack of linguistic knowledge, combined with the fact that the pairings proposed are restricted by the reference lexicon, also constitutes the system’s weak point. Some pairings are not valid because the statistics are only reliable up to a certain point, or because a reference lexicon, by definition, offers a finite number of candidates.
•
The linguistic rules that DériF implements constitute both the system’s strength and its weakness. They constitute its strength because they guarantee the quality of the analyses produced, as long as the theoretical analyses done beforehand are pertinent (DéCor’s performance rate is approximately 80–85% depending on the affixes that are processed). They constitute its weakness because updating and implementing these rules takes time, and the process needs to be repeated for each constructional affix implemented (each operator has its own specificity).
Therefore, DéCor and DériF are not simply two systems with distinct advantages and drawbacks. They are, in fact, perfectly complementary. This characteristic explains why MorTAL is based on two systems that are so different. DériF deals with the linguistic quality of the information produced, while DéCor deals with the robustness of the base.
References
ABU: http://cedric.cnam.fr/ABU/DICO/
BDLEX: http://www.irot.fr/SSI/ACTIVITIES/EQ_IHMPT/bdlex.html
Becker, Thomas. 1993. Back-formation, cross-formation, and ‘bracketing paradoxes’ in paradigmatic morphology. In Yearbook of Morphology 1993, ed. Geert Booij and Jaap van Marle, 1–25. Dordrecht: Kluwer Academic Publishers.
Bybee, Joan. 1988. Morphology as lexical organization. In Theoretical morphology: Approaches in modern linguistics, ed. Michael Hammond and Michael Noonan, 119–141. San Diego, CA: Academic Press.
Bybee, Joan. 1995. Regular morphology and the lexicon. Language and Cognitive Processes 10.5: 425–455.
Clavier, Viviane. 1996. Modélisation de la suffixation pour le traitement automatique du français: application à la recherche d’information. Ph.D. dissertation, Grenoble.
Clavier, Viviane, Karine Warren, Geneviève Lallich-Boidin, and Marie-Hélène Stefanini. 1996. Intégration de la morphologie dérivationnelle dans un système distribué d’analyse du français écrit. Actes du colloque “Informatique & Langue Naturelle”, ILN’96, 9–10 oct. 1996, Université de Nantes: 103–120.
Clémenceau, David. 1993. Structuration du lexique et reconnaissance des mots dérivés. Ph.D. dissertation, Université Paris 7.
Corbin, Danielle. 1980. Contradictions et inadéquations de l’analyse parasynthétique en morphologie dérivationnelle. In Théories linguistiques et traditions grammaticales, ed. Anne-Marie Dessaux-Berthonneau, 181–224. Villeneuve d’Ascq: Presses Universitaires de Lille.
Corbin, Danielle. 1991. Introduction. La formation des mots: structures et interprétations. Lexique 10: 7–30.
Corbin, Danielle. 1993. Morphologie et lexicographie: la représentation du sens dans le Dictionnaire dérivationnel du français. In Du lexique à la morphologie: du côté de chez Zwaan, ed. Aafke Hulk, Francine Melka, and Jan Schroten, 63–86. Amsterdam: Rodopi.
Corbin, Danielle. 1997. Décrire un affixe dans un dictionnaire. In Les formes du sens, ed. Georges Kleiber and Martin Riegel, 79–94. Louvain-La-Neuve, Belgique: Duculot.
Corbin, Danielle. forthcoming a. Préfixes et suffixes: du sens aux catégories. Journal of French Linguistic Studies.
Corbin, Danielle. forthcoming b. Le lexique construit. Paris: Armand Colin.
Corbin, Danielle and Pierre Corbin. 1991. Un traitement unifié du suffixe –ier(e). Lexique 10: 61–145.
Corbin, Danielle and Marc Plénat. 1992. Note sur l’haplologie des mots construits. Langue française 96: 101–112.
Cruse, D. Alan. 1986. Lexical semantics. Cambridge: Cambridge University Press.
Daille, Béatrice, Cécile Fabre, and Pascale Sébillot. 2002 (this book). Applications of computational morphology. In Many morphologies, ed. Paul Boucher and Marc Plénat, 210–234. Somerville, MA: Cascadilla Press.
Dal, Georgette. 1997. Grammaire du suffixe –et(te). Paris: Didier Erudition.
Dal, Georgette, Nabil Hathout, and Fiammetta Namer. 1999. Construire un lexique dérivationnel: théorie et réalisations. Actes de la VIe conférence sur le Traitement Automatique des Langues Naturelles (TALN’99), Institut d’Etudes Scientifiques de Cargèse, Corse, 12–17 juillet 1999: 115–124.
Dal, Georgette and Fiammetta Namer. 2001. Génération et analyse automatiques de ressources lexicales construites utilisables en recherche d’informations. In Le traitement automatique des langues pour la recherche d’information, ed. Christian Jacquemin, 423–446. Paris: Hermes Science.
Dal, Georgette and Martine Temple. 1997. Morphologie dérivationnelle et sens des mots construits: les voies de la référence ne sont pas impénétrables. In Advances in morphology, ed. Wolfgang U. Dressler, Martin Prinzhorn and John R. Rennison, 97–110. Berlin: Mouton de Gruyter.
Encyclopaedia Universalis = CD-ROM Encyclopaedia Universalis, version 2.0, Paris, Encyclopaedia Universalis.
Grabar, Natalia and Pierre Zweigenbaum. 1999. Acquisition automatique de connaissances morphologiques sur le vocabulaire médical. Actes de la VIe conférence sur le Traitement Automatique des Langues Naturelles (TALN’99), Institut d’Etudes Scientifiques de Cargèse, Corse, 12–17 juillet 1999: 175–184.
Habert, Benoît, Adeline Nazarenko, and André Salem. 1997. Les linguistiques de corpus. Paris: Armand Colin.
Jacquemin, Christian. 1997. Guessing morphology from terms and corpora. Proceedings, 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’97), Philadelphia, PA: 156–167.
Jacquemin, Christian. 1999. Syntagmatic and paradigmatic representations of term variations. 37th annual meeting of the Association for Computational Linguistics (ACL’99), University of Maryland, June 20–26 1999: 341–348.
Karttunen, Lauri, Ronald M. Kaplan, and Annie Zaenen. 1992. Two-level morphology with composition. Proceedings of the fifteenth International Conference on Computational Linguistics. Coling–92, Nantes, France: 141–148.
Koskenniemi, K. 1983. Two-level model for morphological analysis. Presented at 8th IJCAI Conference, Karlsruhe.
Le Monde = Le Monde sur CD-ROM, SA Le Monde (Paris) – CEDROM-SNi inc. (Montréal), 1999.
Le Petit Robert. Dictionnaire de la langue française. Version électronique du Nouveau Petit Robert. CD-ROM, Paris, Dictionnaires Le Robert / van Dijk 1996.
Martin, R. 1992. Pour une logique du sens, 2nd edition. Paris: Presses Universitaires de France.
MULTEXT: http://www.lpl.univ-aix.fr/projects/multext/
Namer, Fiammetta and Georgette Dal. 2000. GéDériF: Automatic generation and analysis of morphologically constructed lexical resources. Second International Conference on Language Resources and Evaluation (LREC), Athens, Greece, May 31 – June 2, 2000: 1447–1454.
Plénat, Marc. 1988. Morphologie des adjectifs en –able. Cahiers de Grammaire 13: 101–132.
RE = Le Robert électronique DMW, Disque optique compact CD-ROM, Paris, Dictionnaires Le Robert, 1994.
Roché, Michel. 1997. Briard, bougeoir et camionneur: dérivés aberrants, dérivés possibles. Silexicales 1: 241–250.
Rumelhart, David E. and James L. McClelland. 1986. On learning the past tense of English verbs. In Parallel distributed processing, vol. 2, ed. James L. McClelland and David E. Rumelhart, 216–271. Cambridge, MA: MIT Press.
Savoy, Jacques. 1993. Stemming of French words based on grammatical categories. JASIS: Journal of the American Society for Information Science 44.1: 1–9.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK: 44–49.
Silberztein, Max. 1993. Dictionnaires électroniques et analyse automatique de textes: le système INTEX. Paris: Masson.
Temple, Martine. 1996. Pour une sémantique des mots construits. Villeneuve d’Ascq: Presses Universitaires du Septentrion.
WinBrillv0.3 = Souvay, Gilles. 1998. Version 0.3 of Brill’s parser for Windows95/98, INaLF-CNRS, Nancy; information and download: http://jupiter.inalf.cnrs.fr/WinBrill.
Xu, Jinxi and W. Bruce Croft. 1998. Corpus-based stemming using co-occurrence of word variants. ACM Transactions on Information Systems 16.1: 61–81.