Université de Nice - Sophia Antipolis École Doctorale Sciences et Technologies de l'Information et de la Communication
Manuscrit de Thèse
Docteur en Sciences, mention Informatique
Efficient production of linguistic resources: the Victoria Project
Présentée et soutenue par
Lionel NICOLAS
Dirigée par
Pr. Jacques Farré
Soutenue le 16/12/2010 devant un jury composé de
Pr. Jacques Farré (Univ. de Nice SA + CNRS), Directeur de thèse
Pr. Gertjan Van Noord (Univ. de Groningen), Rapporteur
Dr. Carlos Gómez Rodríguez (Univ. de A Coruña), Rapporteur
Pr. Alexis Nasr (Univ. de Aix-Marseille), Rapporteur
Dr. Benoît Sagot (Univ. de Paris 7 + INRIA), Examinateur
Pr. Jean Charles Régin (Univ. de Nice SA + CNRS), Examinateur
Résumé

L'efficacité de la grande majorité des outils utilisés pour le Traitement Automatisé des Langues Naturelles (TALN) dépend directement ou indirectement de la qualité des ressources linguistiques informatisées sur lesquelles ils reposent. Pour des langues internationalement employées telles que le français ou l'espagnol, bien des ressources de référence sont encore dans un état précaire de développement. Pour d'autres langues ayant une communauté moins importante, ces ressources sont souvent inexistantes. Cette situation est la conséquence directe des ambiguïtés et des irrégularités des langues naturelles. Ces dernières rendent leur formalisation complexe, leur description manuelle fastidieuse et leur acquisition automatisée difficile. De nos jours, pour les aspects linguistiques ayant des formalismes de description consensuels, la principale limitation à la création de ressources linguistiques est le coût humain prohibitif induit par leur création et leur amélioration manuelles. Comme le formalise la loi de Zipf, améliorer la qualité et la couverture d'une ressource linguistique devient toujours plus laborieux lorsque l'on compare les efforts investis aux améliorations obtenues. La difficulté est donc moins de savoir comment décrire l'aspect linguistique d'une langue que d'en réaliser une description dont la couverture et la qualité répondent aux besoins d'applications performantes. Construire de telles ressources requiert donc des années d'efforts constants débouchant trop souvent sur des résultats d'une qualité relative et d'une visibilité limitée. L'acquisition et la correction rapides et efficaces de ressources linguistiques sont donc des problèmes peu résolus et d'une importance capitale pour les développements dans le domaine du TALN.
Dans ce contexte, mes recherches ont pour but premier de faciliter la production de ressources linguistiques symboliques ayant trait à l'analyse syntaxique. Elles s'inscrivent dans un projet, appelé Victoria, dont l'objectif est de développer un ensemble de techniques, d'outils et de stratégies pour l'acquisition et la correction de règles morphologiques, de lexiques morpho-syntaxiques et de grammaires lexicalisées. L'application pratique de ces développements nous a permis de créer et/ou d'améliorer des ressources linguistiques pour le français, l'espagnol et le galicien. Plus particulièrement, mes efforts se sont concentrés sur :
– des stratégies pratiques pour minimiser les efforts nécessaires à la création et l'amélioration de ressources linguistiques,
– l'acquisition automatique des règles morphologiques d'une langue à morphologie concaténative,
– la correction semi-automatique de lexiques morpho-syntaxiques à large couverture.
Abstract

The efficiency and linguistic relevance of most tools dedicated to Natural Language Processing (NLP) depend directly or indirectly on the quality and coverage of the digital linguistic resources they rely on. Even for major languages such as Spanish and French, many well-known and widely used linguistic resources are still in an infant state of development. For languages with a smaller speech community, they barely exist. Such a situation is the direct consequence of the ambiguity and irregularities of natural languages, which make their formalization complex, their manual description labor-intensive and their automatized acquisition difficult. Regarding linguistic aspects that have consensual frameworks to describe them, one can consider the efforts required to develop linguistic resources as the main limitation. As formalized by Zipf's law, increasing the quality of a linguistic resource becomes more and more difficult with time, since the efforts are always more demanding when compared with the resulting improvements. In other words, the difficulty for some linguistic levels lies less in how to describe them than in actually achieving a description that has both the coverage and precision required by complex NLP tasks. Building linguistic resources can therefore take years of constant effort which might fail to achieve visible or useful results. Therefore, quick and efficient acquisition and correction of linguistic resources is an unsolved problem of considerable interest for the NLP community.
In this context, my PhD has focused on enhancing the production capacities of symbolic linguistic resources for parsing. It is part of a project, named Victoria, that aims at developing techniques, tools and guidelines to produce or improve morphological rules, wide-coverage morpho-syntactic lexicons and lexicalized grammars. The practical use of these developments has enabled us to create and/or improve linguistic resources for French, Spanish and Galician. More specifically, my efforts have focused on:
– practical guidelines for saving as much effort as possible when creating or improving linguistic resources,
– the automatic acquisition of the morphological rules of languages with a concatenative morphology,
– the semi-automatic correction of wide-coverage morpho-syntactic lexicons.
Préface (Français)

Ce manuscrit a été écrit en accord avec les critères usuels nécessaires à un ouvrage scientifique. Cependant, il présente quelques caractéristiques pouvant surprendre un lecteur expérimenté. Ces dernières sont détaillées et expliquées dans cette préface.
L'écriture de ce manuscrit a été, avant tout, guidée par la volonté de décrire le travail de recherche réalisé durant cette thèse. Elle a aussi été guidée par la volonté de rédiger un document potentiellement utile à des lecteurs néophytes, tels que des étudiants démarrant leur doctorat. Cet objectif secondaire suit l'idée que les chercheurs expérimentés sont plus prompts à lire des articles scientifiques que des manuscrits de thèse, car ces premiers sont plus référencés, disponibles, directs et résumés. Au contraire, les étudiants démarrant leur doctorat cherchent généralement à construire ou améliorer leur compréhension du domaine. Puisqu'un nombre important de sujets et d'outils ont été étudiés et utilisés durant cette thèse, ce document donne l'opportunité de dépeindre plusieurs aspects du TALN au sein d'un contexte cohérent. C'est la raison pour laquelle ce document inclut du contenu, tel que des définitions basiques, pouvant paraître redondant à un lecteur expérimenté. Ce contenu a cependant été séparé de façon à pouvoir être facilement sauté.
En ce qui concerne les définitions, bien que la plupart soient consensuellement partagées par la communauté TALN entière, l'expérience nous a montré que certaines peuvent parfois différer et donner lieu à des malentendus. Les définitions de la majorité des termes techniques, tels qu'ils sont compris et utilisés dans ce document, sont donc données au début de chaque chapitre où ils apparaissent pour la première fois. Toutes les définitions sont alphabétiquement réunies en annexe B.
Une autre particularité du document est la présence de plusieurs bibliographies, avec parfois des entrées dupliquées d'une bibliographie à l'autre. Cette caractéristique est motivée par le fait que cette thèse couvre des sujets différents. Le fait de séparer la bibliographie a donc pour intérêt de simplifier l'étude de chaque sujet. La bibliographie intégrale est néanmoins disponible en annexe
??.
Enfin, une dernière particularité est la présence de traductions en français de certaines parties de ce manuscrit, originellement écrit en anglais. Cette caractéristique a pour but de remplir des critères voulus par le système universitaire français.
Foreword (English)

This manuscript has been written in agreement with the usual criteria required of a scientific publication. Nonetheless, it presents some characteristics that may surprise an expert reader. They are therefore quickly exposed and explained in this foreword.
The writing of this manuscript has been oriented towards the main objective of describing the research work achieved during this thesis. However, it has also been written towards the objective of being a useful document to any neophyte reader, such as students starting their PhD. This secondary objective follows the intuition that expert researchers are more inclined to consult scientific papers than PhD manuscripts, because papers are more widely referenced, more readily available, and more direct and concise. On the other hand, PhD students who start their research usually intend to build their general understanding of the field. Since an important set of subjects and tools have been studied or used during this PhD, this document provides the opportunity to sketch, within a coherent context, an overview of several aspects of the NLP field. Therefore, this document has been written with some content, such as fairly basic definitions, that may seem redundant to an expert reader. These contents have been accordingly separated within sections that can be easily skipped.
Regarding definitions, even though most of them are consensually shared within the NLP community, experience has shown us that they can differ and lead to some misunderstandings. The definitions of most technical terms, as understood and used in this document, are thus provided at the beginning of the chapter where each term is first mentioned. All definitions are also alphabetically gathered in Appendix B.
Another particularity of this document is the presence of several bibliographies, with entries sometimes duplicated from one bibliography to another. This feature is motivated by the fact that this PhD covers several subjects; splitting the bibliography into various smaller ones is thus an effort to ease the study of each subject. Nevertheless, the entire bibliography is still available in Appendix
??.
Finally, one last particularity is the presence of French translations of some parts of this manuscript, which has originally been written in English. This feature aims at fulfilling criteria required by the French academic system.
Table des figures

2.1 Abstract process to upgrade co-interacting resources.
2.2 Generic Spanish common nouns class in a meta-grammar named SPMG.
2.3 Partial example of parse for the Spanish sentence Hasta la victoria siempre / onward to victory.
2.4 Semi-automatic chain of tools for the upgrade of the linguistic resources required to perform parsing.
3.1 Simplified example of a letter tree. The suffixes ed and ing occur on gray nodes combined with the stems us and caus.
3.2 2009 and 2010 results for English with the MC metric.
3.3 2009 and 2010 results for German with the MC metric.
3.4 2009 and 2010 results for Turkish with the MC metric.
3.5 2010 results for English with the EMMA and MC metrics.
3.6 2010 results for German with the EMMA and MC metrics.
3.7 2010 results for Turkish with the EMMA and MC metrics.
4.1 Parse rate of sentences with wildcard (Y axis) according to the suspicion rate of the suspected forms substituted with wildcards in the sentences (X axis).
4.2 Number of sentences successfully parsed after each session.
5.1 Snapshot of the interface dedicated to the edition of the lexicon.
5.2 Merging procedure performed to build the Leffe's first version.
A.1 Sample of English suffix families acquired.
A.2 Sample of English prefix families acquired.
A.3 Sample of German suffix families acquired.
A.4 Sample of German prefix families acquired.
Chapitre 1 - Producing linguistic resources : issues and challenges
Related terms
A linguistic level, also called linguistic aspect, designates a set of closely related linguistic characteristics. For example, morphology is a linguistic level that covers the linguistic characteristics used to form words. Languages are usually described according to several linguistic levels. These levels are usually listed as phonology, morphology, lexicology, syntax, semantics and pragmatics.

Phonology studies the systematic use of sound to encode meaning in any spoken human language. It focuses on the way different sounds function within a given language or across languages to encode meaning.

Morphology studies patterns of word formation and attempts to formulate rules modeling them. It focuses on the way phonemes and syllables are combined to form words.

Lexicology studies words, their natures, their elements, their relations to one another and their meanings. It focuses on words and their characteristics.

Syntax studies the principles and rules for constructing sentences in natural languages. It focuses on the way words can be combined to form syntactically correct sentences.

Semantics studies the meanings of words within particular circumstances and contexts. It focuses on the way words can be combined to form semantically coherent sentences.

Pragmatics studies how the transmission of meaning depends not only on the linguistic knowledge of the speaker and listener, but also on the context of the utterance, knowledge about the status of those involved, the inferred intent of the speaker, and so on. Pragmatics focuses on the way language users are able to overcome apparent ambiguity, since meaning relies on the manner, place, time, etc. of an utterance.

A linguistic resource is a digital database that describes a piece of linguistic knowledge. It usually focuses on a single linguistic aspect for a given language.

A lexical form, also called word form or simply form for short, is a technical term used to designate a word. All forms belong to a lemma.
A lemma is a set of related lexical forms represented by a canonical form. For example, for a verb, the set of its conjugated forms constitutes a lemma. When the term word paradigm is used to designate a set of related forms, the term lemma is usually used as a synonym for canonical form.

A canonical form of a lemma is a lexical form consensually chosen as the representative of the lemma. For example, for verbs, their infinitive is usually used as the canonical form.
1.1 Introduction (Français)

Le Traitement Automatique des Langues Naturelles (TALN) est un domaine de recherche dont l'objectif est de doter les ordinateurs des capacités linguistiques nécessaires à l'analyse et à la génération de langues naturelles. Le TALN couvre un large éventail d'applications telles que la recherche améliorée de documents, les systèmes questions-réponses, la fouille de texte, les résumés automatiques ou encore la traduction. Au cours des dernières décennies, l'intérêt pour ces technologies s'est naturellement renforcé avec le développement des moyens de communication et l'avènement de la société de l'information telle que nous la connaissons aujourd'hui.
Depuis ses débuts dans les années 40-50, ce domaine a mélangé à différents degrés statistiques et linguistique. De nos jours, les méthodes reposant fortement sur les statistiques obtiennent souvent les résultats les plus probants. Cependant, elles sont inexorablement limitées par leur difficulté à prendre en compte les aspects les moins fréquents. De nombreux efforts sont donc aujourd'hui orientés vers le développement d'approches reposant davantage sur la linguistique.
Une approche linguistiquement motivée pour une tâche donnée requiert comme préalable d'avoir une description des aspects linguistiques impliqués dans ladite tâche. Ces descriptions sont généralement formalisées soit explicitement, par le biais de règles désignant ce qui appartient ou non à la langue étudiée, soit implicitement, en annotant des exemples positifs ou négatifs. Ces ensembles de règles ou d'exemples sont le plus souvent réunis sous forme de documents numériques désignés par le terme générique « Ressource Linguistique » (RL).
Puisque l'efficacité et la pertinence de bien des outils dédiés au TALN dépendent directement des RL sur lesquelles ils reposent, l'existence et la disponibilité de RL de qualité est une nécessité stratégique pour le domaine de recherche tout entier. Cependant, bien qu'il existe des RL pour certaines langues ayant une communauté capable de développer de telles ressources, pour la grande majorité des langues elles sont quasiment inexistantes. La situation est telle qu'il n'existe aucune méthode fiable et claire pour déterminer « rapidement » si des RL existent pour une langue donnée et, dans le cas où elles existent, quelles peuvent être les attentes vis-à-vis de ces RL. Des initiatives ayant pour but de lister les RL disponibles, telles que la European Language Resources Association 1 (ELRA), permettent de constater cet état de fait. En effet, une simple consultation de ce catalogue permet de voir que :
– seuls quelques aspects linguistiques d'un nombre restreint de langues sont couverts 2,
1. http://www.elra.info/
2. ou peut-être référencés...
– les informations fournies sont le plus souvent générales et sans métriques consensuelles, i.e., déterminer l'adéquation d'une ressource donnée avec les besoins nécessite l'étude détaillée de la documentation scientifique de ladite ressource.
De plus, lorsque des RL sont disponibles pour une langue donnée, il ne semble pas aventureux de dire que, de façon globale, elles ne répondent pas aux attentes. Bien sûr, l'évaluation de la qualité d'une RL est à mettre en correspondance avec l'application pour laquelle elle a été originellement créée, c.-à-d. que décider si oui ou non une RL répond aux attentes dépend des objectifs fixés originellement. Cependant, un indice sous-entendant la difficulté d'amener une RL à y répondre est le fait que, à notre connaissance, la plupart des RL sont utilisées à des fins de recherche et non pas dans des applications accessibles au grand public. Lorsque l'on prend en considération l'attention reçue à la fois du monde académique et du monde industriel ainsi que les efforts investis dans le développement de RL au cours des dernières décennies, un tel manque indique clairement la difficulté que représentent leur création, leur correction et leur extension. Obtenir ces ressources est pourtant une nécessité absolue pour l'aboutissement de systèmes linguistiquement fondés. De plus, puisque la plupart des outils dédiés au TALN peuvent profiter de données préalablement désambiguïsées par des systèmes linguistiquement fondés tels que des étiqueteurs et des analyseurs syntaxiques, l'obtention de ces ressources présente un intérêt qui va bien au-delà de ces seuls systèmes. La création de RL de grande qualité en termes de couverture, validité et richesse des informations est par conséquent un problème d'une importance considérable pour le TALN.
1.1.1 Ambiguïtés et irrégularités

Contrairement aux langages formels tels que ceux utilisés pour programmer les ordinateurs, les langues naturelles présentent le plus souvent des ambiguïtés et des irrégularités pour chaque aspect linguistique. Par exemple, en anglais :
– la prononciation de i diffère lorsqu'il est utilisé dans les mots write et fit, mais est pourtant la même dans fit et written,
– le suffixe s peut être aussi bien la marque de la troisième personne du singulier d'un verbe que celle du pluriel d'un nom commun, mais tous les pluriels et toutes les troisièmes personnes du singulier ne sont pas marqués d'un s,
– la forme bow peut être aussi bien un verbe qu'un nom commun,
– la phrase John follows the dogs on a bike est syntaxiquement ambiguë car on peut syntaxiquement considérer que on a bike est lié à the dogs 3,
– tout le monde a vécu (au moins) une fois dans sa vie une situation gênante à cause d'une phrase sémantiquement ambiguë.
3. bien que sémantiquement incongru... à part peut-être dans un cirque.
Puisque les ordinateurs sont essentiellement basés sur des systèmes électroniques et mathématiques rigoureux, ces ambiguïtés et irrégularités permanentes peuvent être difficiles à gérer. De façon plus spécifique, ces phénomènes problématiques impactent le développement des RL de deux façons distinctes :
– ils compliquent les formalismes nécessaires à la représentation des différents aspects linguistiques (ainsi que les techniques les mettant en œuvre) ;
– ils limitent les procédés d'acquisition automatique et relèguent la construction de RL de grande qualité à une tâche souvent manuelle ou nécessitant trop de validation humaine.
1.1.2 Formalisation d'une langue par le biais de RL

Chaque langue peut être décrite à travers différents aspects linguistiques inter-reliés. Ces aspects sont consensuellement recensés comme étant la phonologie, la morphologie, la lexicologie, la syntaxe, la sémantique et la pragmatique (voir définitions). Bien entendu, chacune de ces descriptions est loin d'être triviale et requiert des formalismes adaptés. Le fait est que, à ce jour, nous manquons encore de consensus globaux pour un certain nombre d'aspects linguistiques. Ceci est particulièrement vrai pour la sémantique et la pragmatique, mais le nombre important de formalismes syntaxiques distincts utilisés de nos jours est une autre illustration de cette difficulté.
Cette diversité dans la formalisation des langues est aujourd'hui attestée par l'existence de projets internationaux nommés CLARIN [4] et FLARENET [1]. Ces deux projets, reposant sur des financements conséquents et regroupant un nombre important de chercheurs, ont pour principal objectif d'aider à la normalisation du domaine. En effet, alors que le premier a pour objectif de convertir/interconnecter sous une architecture commune les RL et outils existants dédiés au TALN, le second travaille à la formulation de recommandations/stratégies et, par conséquent, à la formulation de standards. Un tel manque de formalismes consensuels est l'une des raisons de la situation actuelle : certaines RL n'existent pas car il n'existe pas de solution claire pour décrire l'aspect linguistique qu'elles traiteraient. À ces considérations, il est important d'ajouter que si la description d'un aspect linguistique donné n'est toujours pas consensuelle, l'évaluation et la comparaison de RL hétérogènes sont d'autant plus problématiques.
Tenter d'expliquer pourquoi des solutions à ces problèmes ne sont toujours pas apparues ne fait pas partie des objectifs de cette thèse. Cependant, il semble clair que bien des solutions employées de nos jours n'auraient pu être envisagées et testées sans les capacités de calcul dont les ordinateurs récents disposent.
1.1.3 Couvrir un niveau linguistique

La description de l'information lexicale est par contre très consensuelle. Elle est même standardisée au travers de plusieurs normes ISO telles que LMF [2] (Lexical Markup Framework). Cependant, bien qu'il existe un consensus (et donc des formalismes le respectant) pour certains niveaux linguistiques, pour un nombre trop important de langues, il est encore fréquent de ne pas trouver les RL de qualité et à large couverture correspondantes.
Pour ces niveaux, la difficulté pour construire les RL n'est plus de savoir comment décrire l'information mais bien d'obtenir une description d'une qualité et d'une couverture suffisantes à leur utilisation dans des tâches TALN complexes. La limitation principale est alors due aux efforts nécessaires à la construction de ces RL. En effet, l'acquisition automatisée est de facto limitée par le nombre important de comportements irréguliers et d'ambiguïtés que comporte chaque niveau. Afin de garantir la qualité et la consistance d'une RL, une validation humaine est presque toujours indispensable. De ce fait, la plupart des RL sont développées de façon plus ou moins manuelle, i.e., de façon plus ou moins automatisée.
Le plus souvent, il est possible d'atteindre une certaine qualité de couverture et de précision en une période de temps raisonnable. Malheureusement, l'amélioration d'une RL devient chaque fois plus difficile avec le temps. Comme le montre la loi de Zipf [5], les langues naturelles ont des tendances (métaphoriquement) logarithmiques. En effet, au sein de chaque niveau, les instances ne sont généralement pas utilisées à la même fréquence et certaines sont bien plus utilisées que les autres. Par conséquent, la description d'un ensemble d'instances fréquentes peut se révéler bien plus bénéfique en termes de couverture que d'investir une somme d'efforts équivalente dans la description d'un ensemble comparable d'instances moins fréquentes.
Par exemple, la Table 1.1 et la Figure 1.1 représentent la couverture lexicale obtenue sur différents corpus d'anglais, d'allemand, de turc et de finlandais 4 lorsque sont décrites les X formes les plus fréquentes. Pour l'anglais, en décrivant les 5000 formes les plus fréquentes, on obtient une couverture de 85.20% alors qu'un effort équivalent pour la description des 5000 formes les plus fréquentes suivantes n'apporte qu'une amélioration de 5.74%.
4. fournis par l'édition 2010 d'un concours annuel [3] dédié à la morphologie.
Figure 1.1 Couverture lexicale obtenue en décrivant les X formes les plus fréquentes (nombre de formes décrites en abscisse, couverture en ordonnée ; courbes pour l'anglais, l'allemand, le finlandais et le turc).
Nb Forms                      5.000    10.000   20.000   40.000   80.000   160.000
English   Lexical coverage    85.20%   90.94%   94.87%   97.31%   98.66%   99.33%
          Improvements        -        +5.74%   +3.93%   +2.44%   +1.35%   +0.67%
German    Lexical coverage    76.80%   82.41%   87.14%   90.98%   93.95%   96.11%
          Improvements        -        +5.61%   +4.73%   +3.84%   +2.97%   +2.16%
Turkish   Lexical coverage    61.28%   69.97%   78.07%   85.04%   90.60%   94.72%
          Improvements        -        +8.69%   +8.10%   +6.97%   +5.56%   +4.12%
Finnish   Lexical coverage    58.54%   66.25%   73.41%   79.70%   85.04%   89.37%
          Improvements        -        +7.71%   +7.16%   +6.29%   +5.34%   +4.33%

Table 1.1 Couverture lexicale et améliorations obtenues en décrivant les X formes les plus fréquentes.
Les efforts pour construire une RL suivent donc une courbe exponentielle, dans le sens où chaque amélioration est bien plus coûteuse à réaliser que la précédente. Par exemple, la Table 1.1 montre, pour les quatre langues, que même en doublant les efforts à chaque étape, l'augmentation en termes de couverture ne cesse de baisser 5.
5. La somme d'efforts nécessaire à l'obtention d'une couverture donnée est d'autant plus difficile à évaluer au préalable.
Si les ressources humaines allouées ne sont pas suffisantes et si les outils
pour l'acquisition ne simplifient pas assez la tâche, le recensement de toutes les instances d'un niveau peut être un travail sans fin qui n'atteindra pas ses objectifs initiaux. Puisque bien des tâches TALN, telles que l'analyse syntaxique, se concentrent sur des ensembles d'instances 6, nombre d'outils leur étant dédiés demandent des RL d'une couverture et d'une qualité importantes pour fonctionner correctement. Par conséquent, si ces dernières n'atteignent pas certains seuils, l'ensemble des efforts investis dans la construction d'une RL aura été à perte car l'outil pour lequel cette RL aura été développée n'obtiendra pas les performances minimales requises.
6. L'analyse syntaxique se concentre sur des séquences de mots.
1.1.4 Sommaire du manuscrit

Le travail de recherche décrit dans ce manuscrit s'attaque à la production de RL à travers différents sujets :
– la création et l'amélioration de RL en général,
– l'acquisition de règles morphologiques,
– la correction de lexiques.
Le chapitre 2 se concentre sur un projet, appelé Victoria, dont ce travail de recherche est une part importante. Il commence par décrire les origines, motivations et objectifs de Victoria et se concentre ensuite sur les stratégies qui ont guidé le travail de recherche réalisé. Ce chapitre permet notamment de comprendre le lien indirect que présentent les différents sujets abordés durant cette thèse au sein de ce projet.
Le chapitre 3 détaille une méthode permettant d'obtenir de façon automatique, à partir de corpus bruts (non annotés), un ensemble de règles morphologiques. Il introduit des observations sur les mécanismes morphologiques concaténatifs, précise la méthode étape par étape et explique pourquoi cette dernière s'applique aisément à bien des langues. Il termine en détaillant les résultats pratiques obtenus sur l'anglais, le turc et le finlandais.
Le chapitre 4 présente une méthode permettant de corriger un lexique grâce à une grammaire. Cette méthode repose sur l'idée d'utiliser deux RL pour corriger l'une grâce à l'autre. Ce chapitre explique comment il est possible de suspecter des entrées d'un lexique d'être mal décrites ou incomplètes, et comment il est possible de générer des hypothèses de correction grâce aux attentes d'une grammaire. Il termine en détaillant des résultats obtenus sur le français.
Enfin, le chapitre 5 résume tous les autres résultats atteints par le projet Victoria. Il détaille donc des travaux et résultats secondaires ayant un lien direct ou indirect avec cette thèse et explique les extensions qui pourraient être considérées dans le futur.
1.2 Introduction (English)

Natural Language Processing (NLP) is a research field that intends to equip computers with the linguistic capacities necessary to process and produce natural languages. NLP covers a large set of applications such as document retrieval, question-answering, information extraction, text mining, document summarization or translation. Over the past decades, the interest in such technologies has naturally increased with the development of communication means.
Since its early days (60 years ago), this field has mixed statistics and linguistics to varying degrees. Nowadays, methods relying strongly on statistics tend to be among the most successful. Nevertheless, these methods will always be limited to some degree by their difficulty in coping with infrequent cases. Many efforts are therefore being directed towards more linguistically-based approaches.
Developing a linguistically-based approach requires obtaining or achieving a description of the linguistic aspects involved in the task. Such descriptions can either be explicit, by listing rules distinguishing what belongs to the language and what does not, or implicit, by annotating positive or negative examples. These sets of rules or examples are usually gathered in digital resources designated by the generic term Linguistic Resource (LR).
The efficiency and linguistic relevance of many NLP tools are directly linked with the quality of the underlying LRs they rely on. The existence and availability of high-quality LRs are therefore an absolute cornerstone for the NLP domain. Even though some LRs are indeed available for widely-spoken languages, such as Spanish and French, for languages with a smaller speech community they barely exist. The situation is such that there is no clear and reliable way to quickly know which LRs exist for a given language and what can be expected of each LR.
Initiatives that intend to inventory the existing LRs, such as the universal catalog of the European Language Resources Association 7 (ELRA), allow one to observe the current situation. Indeed, a quick look at its catalog shows that:
– only a few linguistic aspects of a rather small number of languages are covered 8,
– the information provided is usually general, with no consensual metrics, i.e., evaluating the adequacy of an existing resource with the needs motivating a search requires a detailed study of the related scientific documentation.
7. http://www.elra.info/
8. or maybe referenced...
In addition, when LRs are available for a given language, it does not
seem risky to state that they usually do not fully meet the expectations. Of course, evaluating the quality of an LR is a difficult task that varies according to the application for which the LR has been devised. In other words, stating whether an LR meets the expectations or not should always take into account its initial objectives. However, a direct clue of the difficulty in bringing an LR to meet them is that, to our knowledge, most are essentially used in a research scope and not in public applications. When considering the attention received from both the academic and industrial worlds and the significant efforts invested in LR development during the past decades, such a lack of high-quality and wide-coverage LRs shows how difficult their creation and correction can be. Nevertheless, in order to complete many complex LR-based systems, obtaining these resources is an unavoidable requirement. Since any NLP tool can benefit from data previously disambiguated by LR-based systems such as taggers or syntactic parsers, the interest in obtaining such LRs actually falls beyond the scope of LR-based systems alone. The creation of LRs with a high level of quality in terms of coverage, accuracy and richness is therefore a fundamental issue in the NLP research field.
1.2.1 Ambiguities and Irregularities

Contrary to formal languages, such as the ones used to program computers, the languages spoken by humankind present ambiguities and irregularities at several levels. For example, in English:
– the pronunciation of i differs when used in the words write and fit, but is the same in fit and written,
– the suffix s can either be the mark of the third person singular of a verb or the plural of a common noun, but not all plurals or third persons are marked with s,
– the form bow can either be a verb or a common noun,
– the sentence John follows the dogs on a bike is syntactically ambiguous since relating on a bike to the dogs is syntactically correct 9,
– almost anybody has experienced (at least) once in his life an awkward situation because of some semantically ambiguous talk.
9. Although semantically difficult... except in a circus.
Since our computers are essentially based on more rigorous electronic and mathematical systems, this permanent ambiguity and irregularity proves difficult to handle. More specifically, such problematic phenomena impact LR development in two different ways:
– they complicate the formalisms used to represent a given linguistic level and the techniques to process them;
– they limit automatic acquisition processes and relegate the construction of high-quality LRs to a manual task or, at least, to one requiring human validation.
1.2.2 Formalizing a language through LRs

Every natural language can be described through several interlinked linguistic levels. These levels are usually listed as phonology, morphology, lexicology, syntax, semantics and pragmatics (see definitions). Of course, none of these levels is trivial, and each requires specialized formalisms adapted to its description. The fact is that, nowadays, we still lack a global consensus for modeling most linguistic description levels. This is particularly true for the semantic level, but the large range of available syntactic formalisms is another illustration of this difficulty.
Such diversity in formalizing languages is directly indicated by the existence of two international projects called CLARIN [4] and FLARENET [1]. These two NLP projects, funded with noticeable grants and bringing together an important number of NLP actors, aim essentially at normalizing the domain. Indeed, whereas the first one aims at bringing the existing LRs and NLP tools under a common framework, the second intends to work towards the creation of recommendations/guidelines and, consequently, towards the formulation of standards. The lack of consensual formalisms is one of the main causes of the current situation: some LRs are simply not built because there is no clear solution for describing the corresponding linguistic level. One must also note that if the ways of describing a given linguistic aspect are still not consensual, the evaluation and comparison of heterogeneous LRs produce even more complications.
Trying to explain why research work has not resolved this issue is out of the scope of this thesis. However, this delay can be partially explained by the limited computing capacities available only a few years ago. In previous decades, the computing power made it impossible to imagine or test the complex ambiguity-compliant formalisms used nowadays.
1.2.3 Covering a given linguistic level

As far as lexical information is concerned, morphological and syntactic notions are now mostly consensual, and are even standardized by various ISO norms such as LMF [2] (Lexical Markup Framework). However, despite the fact that a consensus exists (and therefore formalisms) for some levels, it is still difficult to find the corresponding high-quality and wide-coverage LRs for many languages.
For these linguistic levels, the difficulty in building LRs does not lie anymore in how to describe them but in achieving a description that has both
the coverage and accuracy required by complex NLP tasks, i.e., the main limitation comes from the efforts required to develop the LRs. Indeed, automatic acquisition processes are limited since they require, at some point, a choice between ambiguous options or irregular behaviors. Human intervention is thus often necessary to ensure completeness and consistency. As a result, most LRs are developed in a more or less manual fashion, i.e., in a more or less automatized fashion.
Usually, in a reasonable amount of time, one can achieve a certain level of coverage and precision. However, increasing the quality of an LR becomes more and more difficult with time. As clearly demonstrated by Zipf's law [5], people tend to have a (metaphorically) logarithmic behavior when using natural languages, i.e., some instances of a given linguistic level are far more frequently used than others. Consequently, describing a set of frequent instances can be far more beneficial, in terms of coverage, than investing an equivalent amount of effort in describing a comparable set of less frequent ones.
For example, Table 1.2 and Figure 1.2 represent the lexical coverage obtained over different English, German, Turkish and Finnish corpora 10 when describing the X most frequent forms. As one can see, for English, achieving a lexical description of the 5000 most frequent words represents a coverage of 85.20% whereas an equivalent effort to describe the following 5000 most frequent words only brings a 5.74% improvement.
10. provided by the 2010 edition of an annual challenge [3] dedicated to morphology.

Nb Forms                      5.000    10.000   20.000   40.000   80.000   160.000
English   Lexical coverage    85.20%   90.94%   94.87%   97.31%   98.66%   99.33%
          Improvements        -        +5.74%   +3.93%   +2.44%   +1.35%   +0.67%
German    Lexical coverage    76.80%   82.41%   87.14%   90.98%   93.95%   96.11%
          Improvements        -        +5.61%   +4.73%   +3.84%   +2.97%   +2.16%
Turkish   Lexical coverage    61.28%   69.97%   78.07%   85.04%   90.60%   94.72%
          Improvements        -        +8.69%   +8.10%   +6.97%   +5.56%   +4.12%
Finnish   Lexical coverage    58.54%   66.25%   73.41%   79.70%   85.04%   89.37%
          Improvements        -        +7.71%   +7.16%   +6.29%   +5.34%   +4.33%

Table 1.2 Lexical coverage and improvements achieved when describing the X most frequent forms.

The efforts to build an LR follow a somewhat exponential curve, that is, they become heavier and heavier at each step in the resource improvement. For example, Table 1.2 shows that, for the four languages, the improvements keep on decreasing even if the efforts are doubled 11.
11. As a result, how much effort is necessary to attain a given degree of quality can barely be estimated beforehand.
Figure 1.2 Lexical coverage achieved when describing the X most frequent forms (number of forms described on the X axis, coverage on the Y axis; curves for English, German, Finnish and Turkish).

If the workforce is not sufficient and the tools easing the acquisition are not good enough, listing all the instances of a given linguistic level can be an endless task which sometimes fails to achieve the practical results expected. Indeed, many NLP applications, such as syntactic parsing, do not focus on single words but on sequences of words like sentences. They can therefore fail because only a few elements are not covered. Therefore, if a certain level of coverage and quality is not achieved, the whole effort invested in building the LR may turn out to be wasted, since the corresponding NLP tool cannot achieve the minimum performance expected.
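As an illustration, coverage figures of the kind reported in Table 1.2 and Figure 1.2 can be computed with a few lines of code. The sketch below is a minimal, hypothetical example: the file name corpus.txt and the whitespace tokenization are assumptions made for illustration, not the exact setup used for the corpora of the morphology challenge cited above.

# Hypothetical sketch: lexical coverage obtained when the X most frequent
# forms of a corpus are described (cf. Table 1.2 / Figure 1.2).
from collections import Counter

def coverage_of_top_forms(tokens, cutoffs):
    """Return {cutoff: percentage of corpus tokens covered by the
    `cutoff` most frequent forms}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Token frequencies sorted from the most to the least frequent form.
    sorted_freqs = [freq for _, freq in counts.most_common()]
    result = {}
    for cutoff in cutoffs:
        covered = sum(sorted_freqs[:cutoff])
        result[cutoff] = 100.0 * covered / total
    return result

if __name__ == "__main__":
    # "corpus.txt" is an assumed input file; any raw text corpus works.
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = f.read().split()
    for x, pct in coverage_of_top_forms(tokens, [5000, 10000, 20000, 40000]).items():
        print(f"{x:>6} most frequent forms cover {pct:.2f}% of the corpus")

Running such a script on corpora of increasing morphological richness reproduces the trend discussed above: the same describing effort yields smaller and smaller coverage gains as one moves down the frequency list.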
1.2.4 Summary of the manuscript

The research work achieved during this PhD and described in this manuscript directly addresses the production of LRs through several topics:
– the creation and extension of LRs in general,
– the acquisition of morphological rules,
– the correction of lexicons.
Chapter 2 focuses on a project, named Victoria, of which this research is a large part. It starts by detailing the origins, motivations and objectives of Victoria and then concentrates on the guidelines that have motivated the efforts achieved during this PhD. This chapter thus links within a general context the several indirectly related topics which have been investigated.
Chapter 3 exposes a method that allows morphological rules to be automatically computed from raw corpora (unannotated text). It introduces basic observations on concatenative morphology, details the different steps of the method and explains why it can be easily applied to an important number of languages. Finally, it reports and comments on practical results obtained with English, Turkish and Finnish.
Chapter 4 presents a method that allows a lexicon to be corrected thanks to a grammar. This method relies on the concept of using two LRs to correct one another. This chapter details how entries of a lexicon can be suspected to be erroneous and how correction hypotheses are generated thanks to the grammar expectations. Finally, it reports practical results obtained with French.
Finally, chapter 5 summarizes all the other results achieved by the Victoria project. It details secondary works directly or indirectly related to this PhD and exposes the future extensions that should be considered.
Bibliographie
[1] Nicoletta Calzolari and Claudia Soria. Preparing the field for an open resource infrastructure: the role of the FLaReNet network of excellence. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[2] Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, and Claudia Soria. Lexical Markup Framework (LMF). In Proceedings of LREC 2006, Genoa, Italy, 2006.
[3] Sami Virpioja, Mikko Kurimo, and Ville Turunen. Unsupervised morpheme analysis, Morpho Challenge 2009. www.cis.hut.fi/morphochallenge2009, 2009.
[4] Tamás Váradi, Steven Krauwer, Peter Wittenburg, Martin Wynne, and Kimmo Koskenniemi. CLARIN: Common language resources and technology infrastructure. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.
[5] George K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
Chapitre 2 - The Victoria project
Related terms
Parsing, also called syntactic analysis, is an NLP task that consists in checking, with respect to a given formal grammar, the correct syntax of a text and building a data representation of its grammatical structure.
Parsers are NLP tools that perform parsing. Contrary to trained parsers, which learn from examples, symbolic parsers rely on explicit rules.
POS tagging is the process of assigning to each form of a text a descriptive part-of-speech (POS) tag, based on both its definition and its context. A simplified form of this task is the identification of words as nouns, verbs, adjectives, adverbs, etc.
Morphological rules are a linguistic resource that describes how sets of forms are related within a lemma. For example, for a verb, the set of rules describing how to generate its conjugations are morphological rules.
A morpheme is a substring of a form that holds part of the meaning. The global meaning of a lexical form is thus subdivided among the morphemes it contains. They can be either stems or affixes. For example, the English lexical form thinking contains two morphemes, one stem think and one affix ing.
A stem of a form is the substring related to the lemma. It thus holds the greatest part of the semantic meaning, e.g., the stem of the English lexical form thinking is think. It is important to note that, in this document, the stem of a form is considered as the largest substring shared by all the related lexical forms of a given lemma, e.g., the lexical forms manage, managing and managed share the largest substring, and stem, manag.
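To make this notion concrete, the following hypothetical sketch computes a stem in the sense used in this document, i.e., the longest prefix shared by all the forms of a lemma (which is sufficient for purely suffixing examples such as the one above); the function name and the example forms are illustrative assumptions, not part of the tools described later in this thesis.

# Hypothetical sketch: stem as the longest common prefix of a lemma's forms,
# with the leftover substrings interpreted as suffixes.
import os

def stem_and_suffixes(forms):
    """Return (stem, suffixes): the longest common prefix of the given
    forms and the remaining substring of each form ("-" if empty)."""
    stem = os.path.commonprefix(forms)
    suffixes = [form[len(stem):] or "-" for form in forms]
    return stem, suffixes

stem, suffixes = stem_and_suffixes(["manage", "managed", "managing"])
print(stem)      # manag
print(suffixes)  # ['e', 'ed', 'ing']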
An affix is a substring used to create new lexical forms by combining it with a stem at its beginning (prefix), in its middle (infix) or at its end (suffix). For example, the English form thinking contains the suffix ing whereas the English form grandmother contains the prefix grand. In this document, most approaches regarding morphology do not concern infixes. The term affix is therefore used as a shortcut for prefix and suffix.
Morphological inflection describes the process of generating a lexical form of a lemma.
Morphological derivation describes the process of generating a lemma from another one.
A sandhi phenomenon is a modification of the form or sound of a word under the influence of an adjacent word or morpheme.
A lexicon is a linguistic resource that inventories the lexical forms of a given language and associates them with morphological, syntactic or even semantic information.
A lexeme represents a minimal meaningful unit of language, i.e., a given meaning of a lexical form.
A grammar, in NLP, is a linguistic resource that details the syntactic structures of a given language. It thus describes how forms can be combined to form sentences.
HPSG stands for Head-driven Phrase Structure Grammar. It is a highly lexicalized, non-derivational generative type of grammar. This formalism is based on lexicalism in the sense that the lexicon is more than just a list of entries; it is in itself richly structured. Individual entries are marked with types. Types form a hierarchy.
Related publication The publications related to this chapter can be found with references [24, 25] and [26].
2.1 Introduction (Français)

Le travail réalisé durant cette thèse prend part à un projet plus général nommé Victoria, dont l'objectif est d'automatiser la production de ressources linguistiques utiles, entre autres, à l'analyse syntaxique symbolique.
Ce chapitre commence par expliquer les motivations et objectifs de Victoria, les financements et les ressources humaines qui lui ont été consacrés, ainsi que les projets passés ou en cours avec lesquels il partage des points communs. Il se concentre ensuite sur les directives/stratégies à partir desquelles les efforts ont été orientés et se termine en expliquant comment ces dernières ont été appliquées en pratique et quels rôles les travaux décrits dans les prochains chapitres jouent au sein du projet.
2.2 Introduction (English)

The work achieved during this thesis is part of a more general project, named Victoria, that aims at enhancing the production capacities of the linguistic resources necessary to perform, among other tasks, symbolic parsing.
This chapter starts by presenting the motivation, objectives, grants and task-force of the project and introduces past and on-going projects that share some common points with Victoria. It then focuses on the guidelines that have motivated and orientated the efforts achieved so far. Finally, it exposes how these guidelines have been applied in practice while positioning the works described in the following chapters within the context of Victoria.
2.3 Origins of the project
2.3.1 Motivation and objectives

If the global motivation of Victoria had to be summarized in a few words, it would be: producing disambiguated data. Indeed, most NLP tools can benefit from previously disambiguated data. Since they can provide data disambiguated over several levels, tools focusing on disambiguating raw data are an absolute cornerstone for this research field.
More specifically, Victoria aims at enhancing the means to build symbolic parsers by enhancing the production of the three types of LRs required to achieve symbolic syntactic parsing:
– morphological rules,
– morphological and syntactic lexicons,
– lexicalized grammars.
One must note that other types of NLP tasks, such as POS tagging, are also concerned by some of these resources.
As a short-term goal, the work achieved has been oriented towards the improvement of existing LRs for French and the creation and extension of LRs for Spanish. As a medium- and long-term goal, Victoria extends its efforts first to Galician 1 and then to other languages.
1. A co-official language spoken in the North-West of Spain.
2.3.2 Grant and task-force

The Victoria project started in November 2008 thanks to a grant of the Galician government 2 that lasts until the beginning of 2011. It brings together researchers from four different French and Spanish teams:
– the COLE team 3 from the University of Vigo,
– the LyS team 4 from the University of A Coruña,
– the Alpage project 5 from the University Paris 7 and INRIA Paris-Rocquencourt,
– the RL team 6, I3S laboratory, University of Nice Sophia-Antipolis and CNRS.
Since its creation, the average workforce of the project has been moderate: two PhD students receiving punctual help from other members of the project.
2. Project number INCITE08PXIB302179PR.
2.4 Other projects

In this section, we introduce and compare every past or on-going project, small or big, that shares some common points (sometimes remote ones) with ours. The comparison just states the main objectives of each project, because comparing the results of each project, when they are available and described, is a far too difficult task that falls beyond the scope of this document. The main differences that justify Victoria's existence are provided thereafter.
EAGLES [13]
EAGLES is an initiative that intended to provide a wide range of guidelines enhancing re-usability in the NLP research field. These guidelines are thus oriented towards the emergence of standards for a very large scale of language resources and corresponding processing tools.
TEI
TEI is an initiative that develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of guidelines which specify encoding methods for machine-readable texts. Since 1994, the TEI guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the guidelines themselves, the Consortium provides a variety of supporting resources, including resources for learning TEI, information on projects using the TEI, TEI-related publications, and software developed for or adapted to the TEI.
MULTEXT [20, 21]
MULTEXT aimed at developing standards and specifications for the encoding and processing of linguistic corpora, and at developing tools, corpora and linguistic resources embodying these standards. Among other things, it has allowed the creation of a widely used tagset, several tagged corpora and morphological lexicons for Western European languages. This project interacted with the EAGLES and TEI initiatives.
MULTEXT-East [15, 36]
MULTEXT-East is a complementary project to MULTEXT, focusing on the production of the same type of LRs for several Central and Eastern European languages. Contrarily to MULTEXT, its development is still on-going and various features, such as TEI-compliant ones, have been added to the resources.
GENELEX [3]
GENELEX was an industrial project aiming at the definition and construction of generic computational dictionaries. It involved three types of tasks: specification of a generic model for electronic dictionaries, development of pieces of software suited for managing such dictionaries, and creation of large lexical repositories complying with the model.
PAROLE [28]
PAROLE was a project aiming at offering a large-scale harmonized set of core corpora and lexica for all European Union languages. It thus produced several TEI-compliant tagged corpora and EAGLES- and GENELEX-compliant lexicons.
DELPHIN [12]
DELPHIN is an initiative to produce several grammars according to a shared grammatical format based on HPSG and a rigid scheme of evaluation. The project also proposes tools and architectures, such as the LKB framework,
to develop new grammars. Among the sub-objectives, one should note the LinGO grammar matrix [8], which explores the possibility of sharing formalized linguistic knowledge among several resources in different languages. It thus aims at extracting the common components that may be useful in the development of new grammars.
AGFL [1]
AGFL was a project similar to DELPHIN in that it allowed the production of several grammars and of tools to develop new ones. The main difference is the formalism on which the grammars were based: they were affix grammars over a finite lattice.
EDylex [14]
EDylex is a project oriented towards the dynamic acquisition of new lexical entries. Therefore, it intends to develop means to extend and update lexicons with morphological, syntactic and semantic data.
Passage [29, 30]
Passage is a French project that aims at producing syntactic annotations by repeatedly combining the outputs of 10 French parsing systems. Its main aim is to establish consensual syntactic annotations along with a large-scale French treebank implementing them.
CLARIN [41, 42]
CLARIN is a project aiming at establishing an integrated and interoperable research infrastructure of language resources and technology. It thus aims at lifting the current fragmentation, offering a stable, persistent, accessible and extensible infrastructure.
FLARENET [6, 43]
FLARENET is a project that aims at developing a common vision by creating a consensus among major players in the field, identifying priorities and objectives, and providing consensual recommendations. The project is therefore divided into several working groups that work towards the formulation of recommendations/guidelines, ensuring that they are formulated through a consensual bottom-up process in which the relevant scientific, technical, organizational, economic and political aspects and positions are taken into account.
The characteristics differentiating Victoria from the previously cited projects can be summarized as follows.
1. The main objective of Victoria is to produce LRs for three different linguistic levels.
2. Victoria does not intend to develop standards or frameworks embodying such standards.
3. Victoria intends to reduce as much as possible the manual work required to build LRs. As far as we know, Victoria is among the few LR-producing projects that spend the greatest part of their efforts on devising means and tools towards this objective.
4. Regarding languages, Victoria focuses on French, Spanish and Galician.
5. Victoria does not aim at producing guidelines (even though some were indeed produced).
2.5 Existing French, Spanish and Galician LRs for syntactic parsing

Regarding the production of LRs or the improvement of existing ones, Victoria focuses on the French, Spanish and Galician languages. As explained in the previous chapter, there are few linguistic aspects that have consensual frameworks and formalisms. If describing a linguistic aspect is already a problematic issue, evaluating the quality of an LR is an even more complicated subject. As we will see in chapter 3, even for morphological rules (one of the easiest LRs to build), metrics are still evolving with drastic changes. Since we lack consensual means to evaluate LRs, comparing two LRs requires one to:
– list the advantages and shortcomings of the formalisms (if not the same) used to build the LRs,
– establish fair methods of evaluation or, at least, clearly state whether an existing method might favor/disadvantage LRs embodying these formalisms.
Therefore, even when restricting our comparison to three languages, comparing the existing LRs available for syntactic parsing is a task that falls beyond the scope of this PhD. We shall just see what an NLP researcher can expect to find for the three languages and provide pointers to the corresponding LRs.
With regard to symbolic parsing (and NLP in general), the French, Spanish and Galician languages are in different situations.
French is clearly the best covered language among the three. It even benefits from a parsing evaluation campaign named EASy and, therefore, from several updated grammars, lexicons and morphological rules involved in this campaign. A quick look at EASy's scientific publications such as [40] allows us to find most of the existing LRs for this task.
Spanish, although being a large-scale language with one of the biggest speech communities, has only few LRs for syntactic parsing. We found two available grammars, namely SRG [23] and the Freeling Spanish Grammar [4], and four available morphological or morpho-syntactic lexicons, named Spanish Multext [20], USC [2], SRG [23] and ADESSE [18].
Galician's situation regarding symbolic parsing is the same as for most languages: we barely found two morphological lexicons, one named CORGA and another one related to the Freeling initiative [4].
For practical reasons regarding the formalisms we rely on (see sect. 2.7.1.1), we decided to:
– use and extend for French two existing LRs: a lexicon named Lefff [32] for morphological and lexical knowledge and, for grammatical knowledge, a meta-grammar named FRMG [11];
– build new resources for Spanish and Galician and try to transfer as much content as possible from existing LRs.
2.6 Guidelines
Since the average workforce of the project was moderate, a notable effort has been invested in planning how to optimize or increase this workforce. This has led to the unexpected conceptual result of producing a set of guidelines general enough to be used or adapted for the production of other types of LRs and, thus, for other projects with similar goals.
Although some of these guidelines may seem obvious at first (as a wise colleague ironically puts it, finding America just required navigating towards the West), no document clearly stated them when Victoria started, and experience has shown that they can be of interest to other people involved in a similar task [24]. These guidelines aim at tackling the problem of the efforts required to produce LRs by applying two complementary strategies:
1. sharing efforts among the people interested in obtaining those resources,
2. saving manual efforts by automatizing as much as possible the processes of creation and correction.
2.6.1 Enhancing collaborative work
2.6.1.1 Problems limiting collaborative work
If a language receives enough attention from the community, the efforts to describe it by means of LRs can obviously be split among the people
interested in obtaining them. Nevertheless, the greater the workforce is, the more difficult it can be to manage, as it requires finding agreements and solutions on several non-trivial aspects.
Formalisms
Nowadays, it is not rare to find various similar LRs focusing on the same linguistic level for a given language, as shown by the diversity of projects previously introduced. This typically happens for two reasons. First, the data described in an LR generally depends on the application it has been created for; therefore, one can find unrelated but similar LRs covering the same sub-parts of a given level. Second, the way a language is described can change drastically depending on the underlying linguistic theory; therefore, there exist similar LRs that are (partially) incompatible. In both cases, this implies:
– a loss of human work spent formalizing the same knowledge several times,
– a waste of feedback when the users are split over various LRs.
Licenses
The distribution and terms of use of LRs are issues that are both fundamental and problematic/polemic for their life-cycle. Indeed, since LRs are often built manually, they usually have a high cost. This often leads LRs to be distributed under restrictive licenses and/or not to be shared with the public. Distributing an LR under a restrictive license presents the drawbacks of:
– limiting collaborations and thus the workforce,
– reducing the feedback brought by a greater number of users.
Confidence
Federating as many people as possible around a common goal does not make sense if the overall quality of the LR is reduced by some collaborators. Therefore, one usually needs to first demonstrate his or her competence before being granted the right to edit an LR. The resulting pool of candidate collaborators is thus reduced to a smaller number of persons who have both the linguistic and computer skills required for a shared edition of the LR.
Accessibility
Obviously, someone willing to help maintain an LR needs to access it. This simple requirement is sometimes hindered by technical issues (some restrictive non-standardized technologies are required), geographical distance (the LR is not accessible online) or even security-related restrictions (the LR is located on a server with restricted access). Once again, such drawbacks limit the number of possible collaborators.
2.6.2 Guidelines to enhance collaborative work
Formalisms
The formalisms used for developing LRs for a given linguistic aspect should aim at covering as many languages and applications as possible. In particular, general frameworks associated with tools (compilers) capable of converting a general LR into specialized ones are a handy approach. Indeed, they allow experts to develop and maintain specialized modules independently and thus avoid redundant work on common sub-parts while maximizing feedback. For example, one can develop a core lexicon for a language and provide several branches for developing specialized lexicons on zoology, medicine, etc. In addition, the more widely used the framework is, the more chances it has to be regularly maintained and updated.
Licenses
Choosing a license for an LR mostly depends on the final objectives planned for it. For example, an LR intended to be part of a commercial product, if published at all, is generally put under a restrictive license. Nevertheless, many LRs are produced thanks to public funding and have no direct economic purpose. Thus, if the main objective is to bring the LR to a greater level of quality, one should try to maximize feedback and federate people with the skills to collaborate. The licenses used should thus remain as non-restrictive as possible.
Confidence
The main problem when granting somebody edit rights on an LR is that it generally means granting such rights on the whole resource. Whereas some parts of an LR may be simpler to modify than others, such an approach can obviously be risky and unproductive. A simple but effective solution is to bypass this issue by progressively granting edit rights on sub-parts of the LR. Such a scalable approach can easily be achieved by designing interfaces with restrictions according to the confidence level assigned to the user. In addition, using interfaces presents two other handy features. On the one hand, it can prevent editing/typing errors and allows users to focus on the data themselves without worrying about mastering the underlying formalism or technology. On the other hand, interfaces can help control the evolution of LRs by tracing their modifications.
Accessibility
In order to avoid the technical, distance and security troubles mentioned above, one needs to employ standardized, online and public technologies. Web technologies seem an adequate choice for such a task. When used to develop interfaces, they generally constitute an appropriate way to access and edit LRs, with the great advantage of not requiring any particular additional installation.
2.6.3 Guidelines for saving efforts
The previous part focused on the strategy of increasing the workforce available to improve the quality of an LR. However, another useful and complementary strategy is to reduce, as much as possible, the dependence on manual efforts. Such a goal can be pursued along several tracks.
2.6.3.1 Using existing frameworks
Even if the NLP community does not provide stable frameworks for all linguistic levels, most of them have been studied and (partial) solutions have emerged. Since existing frameworks are usually mature and their libraries/codes have a reduced number of errors, a reasonable idea is to always favor the use and extension of existing frameworks over the creation of new ones from scratch.
2.6.3.2 Using existing resources
Existing resources are generally valuable sources of linguistic knowledge when building or extending new LRs. Indeed, spending efforts on describing linguistic knowledge that has already been formalized is counter-productive. Such an approach depends of course on the kind of knowledge one is trying to adapt and on the formalisms (or their underlying linguistic theories) on which the LRs are based. Nevertheless, various practical experiments such as [9] have shown that existing resources usually share common points and, thus, that adapting, even partly, the available existing resources is often an achievable objective. Since related languages share significant parts of their linguistic legacy, such an approach should not be limited to the scope of a single language. Indeed, the proximity between linguistically related languages can sometimes allow formalized knowledge to be transferred, and thus existing LRs describing related languages can be considered as well. This approach is particularly suited for languages with smaller speech communities and limited digital resources. It also facilitates the establishment of the interlingual links required for multilingual tasks.
2.6.3.3 Automatizing correction and extension
LRs are often built with little (or no) computer aid. This causes a common situation in which a resource is developed until a more or less advanced state in which manually finding errors/deficiencies becomes too difficult. Since it can greatly reduce the need for manual work, automatizing the processes of extension and correction should generally be considered in order to enhance the sustainability of LRs. Indeed, even if these
automatized processes have to handle an increasing amount of data, and thus perform an increasing amount of computation to find errors, improving the overall computational capacities is clearly a far easier objective than augmenting the human workforce. Of course, such techniques are specific to each type of linguistic knowledge, and it can be more difficult to develop such processes for some linguistic levels (e.g., semantics) than for others (e.g., morphology). Nevertheless, a theoretical process abstracted from the practical ones described in [35] and [27] can be considered. This process is divided into two steps:
– identifying statistical evidence for missing instances, i.e., text that is not covered by the studied LR,
– generating corrections thanks to a complementary LR.
Identifying shortcomings in a resource
Identifying possible shortcomings in a studied resource can be achieved by studying unexpected/incorrect behaviors of tools relying on that resource. To do so, one first needs to establish what can be considered as an unexpected (incorrect) behavior. For example, for a parser, a parse failure can be considered as an unexpected behavior. Once unexpected behaviors are identified, one must ensure they are not due to:
– some incorrect data given as input,
– some other LR with which the studied LR interacts (e.g., a lexicon and a grammar in a symbolic parser).
The first situation can easily be globally avoided by providing as input corpora considered as linguistically correct (error-free). Some types of corpora, such as law texts or selected journalistic productions, should thus be preferred over low-quality corpora like those composed of emails. The second situation, i.e., when the tool relies on various co-interacting LRs, can be addressed through a global study of the unexpected behaviors. Indeed, natural languages are ambiguous and thus difficult to formalize. Nevertheless, this ambiguity impacts differently the linguistic levels and the LRs describing them. Depending on the state of development of the LRs, it can be truly rare for two co-interacting LRs to be incorrect at the same time on a given element, i.e., many unexpected behaviors are induced by only one resource at a time. In a restricted scope, it is difficult and hazardous to identify a culprit for a given unexpected behavior. However, this can be balanced by a global study of the behaviors when processing an important quantity of text. Indeed, if, among the elements of a given resource, some are always found when unexpected behaviors occur, then such elements can be (statistically) suspected to be incorrectly described within the resource. For example, in [35, 37], the authors identified shortcomings in a lexicon. The tool observed is a syntactic parser, and parse failures are considered as unexpected behaviors of the parser. Each parse failure may be due to deficiencies in the grammar and/or in the lexicon the parser relies on. Determining, for a given parse failure, which resource is the true culprit can be utterly complex. In order to detect incorrect lexical entries, the authors use a fixed-point algorithm which emphasizes the lexical forms that occur more often than expected in non-parsable sentences.
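To make this fixed-point idea concrete, the following minimal sketch (in Python) illustrates this kind of error mining: blame for each parse failure is iteratively redistributed among the lexical forms of the failed sentences, so that forms that keep showing up in failures end up with a high suspicion. It is only an illustration of the general principle, not the exact algorithm of [35]; the corpus representation and the fixed number of iterations are assumptions.

from collections import defaultdict

def mine_suspicious_forms(sentences, parses_ok, iterations=10):
    """Assign a suspicion score to each lexical form: blame for every failed
    sentence is shared among its forms proportionally to their current suspicion."""
    suspicion = defaultdict(lambda: 1.0)        # start with equal suspicion
    for _ in range(iterations):
        blame = defaultdict(float)              # blame accumulated per form
        count = defaultdict(int)                # occurrences per form
        for forms, ok in zip(sentences, parses_ok):
            for f in forms:
                count[f] += 1
            if ok:
                continue                        # parsable sentences assign no blame
            total = sum(suspicion[f] for f in forms)
            for f in forms:
                blame[f] += suspicion[f] / total if total else 1.0 / len(forms)
        # New suspicion: average blame received per occurrence of the form.
        suspicion = defaultdict(lambda: 1.0,
                                {f: blame[f] / count[f] for f in count})
    return sorted(suspicion.items(), key=lambda kv: -kv[1])

# Toy usage: each sentence is a list of forms, paired with a parse success flag.
corpus = [(["the", "cat", "sleeps"], True),
          (["the", "cat", "miaows"], False),
          (["a", "dog", "miaows"], False)]
ranking = mine_suspicious_forms([s for s, _ in corpus], [ok for _, ok in corpus])
print(ranking[0])   # 'miaows' comes out as the most suspicious form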
Generating relevant corrections
The previous error mining step already provides an interesting set of data to orientate the correction of a studied LR, and it can be completed with a correction suggestion step. Consider two different LRs interacting within an NLP tool (e.g., a syntactic lexicon and a grammar combined in a symbolic parser). This tool is designed to try and find a joint match between both resources and the input of the tool (e.g., a parse that is compatible with both the grammar and the lexicon). In other words, one can view each LR as providing a set of possibilities for each element (e.g., lexical unit) within the input. As explained earlier, it may be rare for two resources A and B to be incorrect on a same given element and thus be both responsible for a given unexpected behavior. Therefore, if one of the LRs, say A, is suspected by the error mining step to provide erroneous and/or incomplete information on a given element, it is reasonable to try and rely on the information provided by the other LR, B, for proposing corrections to the dubious entry of A. For example, let us suppose that a verbal entry in a lexicon A is suspected to provide lexical data that is incomplete w.r.t. a given sentence. Using a parser that combines A with a grammar B, it is reasonable to let the grammar decide which syntactic structures are possible for this sentence, by preventing the parser from using the dubious information provided by A. Then, correction proposals for A can be extracted from the lexical data used to build the parses. Of course, among the corrections generated thanks to B, there can be correct and incorrect ones. Therefore, such approaches should generally be semi-automatic (i.e., with manual validation). In addition, semi-automatic approaches are a good compromise to limit both human and machine errors, since most of the updates achieved on the LRs are automatically created while being manually validated. Finally, another highly convenient feature of this approach is the following: if resource B can no longer provide relevant corrections for resource A, and thus does not offer a solution for an unexpected behavior, we can consider the remaining unexpected behaviors as mostly representing shortcomings of resource B, since it does not cover them. Correcting resource A thanks to resource B thus generates useful data to correct resource B, since it reveals shortcomings of resource B. Once resource B has been updated, it can again be used to correct resource A, and so forth. This incremental and sequential method to upgrade both resources is represented in a simplified schema in Figure 2.1.

Figure 2.1 – Abstract process to upgrade co-interacting resources.
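As a rough illustration of this alternating scheme (Figure 2.1), the small driver below sketches how two such semi-automatic techniques could be chained until neither of them yields further validated corrections. The function names are hypothetical placeholders, not part of the Victoria tool chain.

def upgrade_loop(resource_a, resource_b, corpus,
                 improve_a_with_b, improve_b_with_a, max_rounds=5):
    """Alternately improve two co-interacting resources, stopping when a full
    round yields no manually validated correction on either side."""
    for _ in range(max_rounds):
        validated_for_a = improve_a_with_b(resource_a, resource_b, corpus)
        validated_for_b = improve_b_with_a(resource_b, resource_a, corpus)
        if not validated_for_a and not validated_for_b:
            break    # fixpoint reached for this corpus
    return resource_a, resource_b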
2.6.3.4 Using plain text
As previously discussed, the above approach requires an input corpus that is as error-free as possible in order to guarantee that most unexpected behaviors are caused by shortcomings of the LRs and not by errors in the input. If the input data are annotated, only manual annotation can guarantee a good level of quality. But manually annotated data are available only in limited quantities and for a small number of languages, and producing such data contradicts the objective of saving manual work. Therefore, the data used should be raw text, since it is produced daily for most languages and freely available in large quantities on the Internet. In order to guarantee the quality of the data, as stated earlier, only linguistically correct (error-free) texts should be considered.
2.7 Practical applications of the guidelines
We shall now see the choices made with respect to the previously described guidelines. The subjects that directly concern the work achieved during this thesis are only briefly introduced here and fully described in the following chapters.
2.7.1 Enhancing collaborative work
2.7.1.1 Formalisms
Creating formalisms is not one of the objectives of Victoria; all the LRs produced are therefore based on already existing frameworks. The choice of the most suitable frameworks logically went towards the ones developed by one of the teams of the project (the Alpage team). An obvious reason for using these frameworks is to continue developing the existing LRs (especially the French ones). In any case, the guidelines regarding formalisms have mainly been extracted from their properties. They thus directly comply with most of the requirements and present the great advantage of being quickly fixed when facing bugs, uncovered phenomena or missing properties.
Morphological knowledge
Our morphological rules are based on the Alexina framework [33, 34, 9, 32]. As detailed in [32], in this formalism, morphological inflection is modeled as the affixation of a prefix and a suffix around a stem, while sandhi phenomena, which are sometimes conditioned by stem properties, may occur at morpheme boundaries. This formalism, which shares some widespread ideas with the DATR formalism [16], relies on the following concepts.
– The core of a morphological description is a set of inflection classes that can (partly or completely) inherit from one another.
– Each inflection class defines a set of forms. Each form is defined by a morphological tag and by a prefix and a suffix that, together with the stem, constitute a sequence prefix_stem_suffix.
– Sandhi phenomena link the inflected form to the underlying prefix_stem and stem_suffix sequences by applying regular transformations; such rules may use classes of characters (e.g., [:eií:], which can be defined as denoting one of the characters e, i or í, as illustrated in Table 2.1).
– Forms can be controlled by tests over the stem, e.g., a given rule can apply only if a given regular expression matches the stem and/or if another one does not match it.
– Forms can be controlled by variants of the inflection classes (e.g., forms can be selected by one or more flags which complement the name of the class).
Tables 2.1 and 2.2 illustrate this model by showing, respectively, a few sandhi rules and an example of a verbal inflection class. In Table 2.1, the first sandhi rule associates, for example, cochecito with
coche_íto, whereas the second one associates codiguito with codig_íto. In Table 2.2, the attributes tag and synt respectively define the morphological tag and the morphosyntactic flag. The attribute rads indicates that the forms can be generated only if the stem matches the corresponding pattern.

Table 2.1 – A letter class definition (name="enr" letters="e n r") and two sandhi rules.
Table 2.2 – Example of inflection class for Spanish regular first-group verbs.

A morphological description based upon the Alexina framework allows us to generate automatically two tools:
1. an inflection tool that generates all inflected forms of a given lemma according to its morphological class;
2. an ambiguous lemmatization tool that computes, for a given form (associated or not with a category), all possible candidate lemmas (existing or not) that are consistent with the morphological description and have this form among their inflected forms.
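To make the prefix_stem_suffix model more tangible, here is a minimal, self-contained sketch (not the actual Alexina compiler) of how an inflection class with a stem test and simple sandhi rewrites could yield the two tools just mentioned. The class data (a fragment of Spanish first-group verbs) and the rule notation are illustrative assumptions.

import re

# Sandhi rules: regular rewrites applied at morpheme boundaries ('_' marks a boundary).
SANDHI = [(re.compile(r"c_e"), "que"),   # keep /k/ before e: toc_e -> toque (illustrative)
          (re.compile(r"_"), "")]        # default: simply drop the boundary marker

# A toy inflection class: morphological tag -> (prefix, suffix), plus a stem test.
VERB_AR = {
    "stem_test": re.compile(r"ar$"),     # applies only to -ar infinitives (illustrative)
    "forms": {"P1s": ("", "o"), "P3s": ("", "a"), "S1s": ("", "e"), "Kms": ("", "ados")},
}

def apply_sandhi(s):
    for pattern, repl in SANDHI:
        s = pattern.sub(repl, s)
    return s

def inflect(lemma, cls):
    """Inflection tool: generate all inflected forms of a lemma for a given class."""
    if not cls["stem_test"].search(lemma):
        return {}
    stem = lemma[:-2]                     # strip the -ar ending to obtain the stem
    return {tag: apply_sandhi(f"{pre}_{stem}_{suf}")
            for tag, (pre, suf) in cls["forms"].items()}

def lemmatize(form, cls):
    """Ambiguous lemmatization tool: propose every (possibly non-existing) lemma
    whose inflection according to the class would contain the given form."""
    candidates = []
    for tag, (pre, suf) in cls["forms"].items():
        if form.startswith(pre) and form.endswith(suf):
            stem = form[len(pre):len(form) - len(suf)] if suf else form[len(pre):]
            candidates.append((stem + "ar", tag))
    return candidates

print(inflect("cantar", VERB_AR))   # {'P1s': 'canto', 'P3s': 'canta', 'S1s': 'cante', 'Kms': 'cantados'}
print(inflect("tocar", VERB_AR))    # the sandhi rule yields 'toque' for S1s
print(lemmatize("canta", VERB_AR))  # [('cantar', 'P3s')] -- candidate lemmas, existing or not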
Lexical knowledge
Our lexicons are also based on the Alexina framework [33, 34, 9, 32], which is compatible with the LMF standard [17] (Lexical Markup Framework, the ISO/TC37 standard for NLP lexicons). Alexina represents morphologically-related and syntactically-related lexical information in a complete, efficient and readable fashion. The lexical part of the Alexina model is based on a two-level representation distinguishing the description of a lexicon from its use. The intensional level factorizes the lexical information by associating each entry with a lexeme and its lemma, morphological class and initial syntactic information (a deep subcategorization frame, a list of possible restructurations, and other syntactic features such as information on control, attributes, mood of sentential complements, etc.). This intensional level is used for the development of the lexical resources. For example, as detailed in [32], the intensional entry for the French lemma diagnostiquer1 (to diagnose) in a French Alexina-based lexicon named Lefff describes a transitive entry with the following information:
– its morphological class is v-er:std, the class of standard first-conjugation verbs (ending in -er);
– its semantic predicate can be represented by the lemma as is, i.e., diagnostiquer;
– its category is verb (v);
– it has two arguments canonically realized by the syntactic functions Suj (subject) and Obj (direct object); each syntactic function is associated with a list of possible realizations, and the Obj is optional, as shown by the brackets;
– it allows for three different redistributions: %active, %passive and %se_moyen.
The extensional level is automatically generated by compiling each entry of the intensional lexicon, with respect to its morphological class and restructurations, into several entries in the extensional lexicon. The extensional level thus associates each inflected form with a detailed structure that represents its morphological information and (some of) its possible syntactic behaviors. This extensional level is directly used by NLP tools such as parsers. For example, as explained in [32], the only inflected forms of diagnostiquer that are compatible with the passive redistribution are the past participle forms. The (simplified) extensional passive entry for diagnostiqués (diagnosed) is the following (Kms is the morphological tag for past participle masculine plural forms):

diagnostiqués v [pred='diagnostiquer1', @passive,@pers,@Kms];%passive

The original direct object (Obj) has been transformed into the passive subject, and an optional agent (Obl2), realized by a noun phrase preceded by the preposition par (par-sn), has been added. Alexina has already been used to develop morpho-syntactic wide-coverage lexicons for French, Spanish, Slovak, Polish and Kurdish, and has been combined with syntactic parsers based on commonly used grammatical formalisms such as LTAG [10] and LFG [5].
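As a rough illustration of this two-level architecture (and not of the actual Alexina compiler or Lefff syntax), the sketch below expands one intensional-style entry into extensional, form-level entries by enumerating its redistributions; the frame encoding, the realization labels and the simplified passive transformation are assumptions made for the example.

def compile_entry(lemma, category, frame, redistributions, inflected_forms):
    """Expand an intensional entry into extensional (form-level) entries."""
    extensional = []
    for redis in redistributions:
        for form, tag in inflected_forms(lemma, redis):
            new_frame = dict(frame)
            if redis == "%passive":
                # Simplified passive restructuration: the Obj becomes the subject,
                # and an optional par-phrase agent (Obl2) is added.
                new_frame.pop("Obj", None)
                new_frame["Suj"] = frame.get("Obj", frame.get("Suj"))
                new_frame["Obl2"] = "(par-sn)"
            extensional.append((form, category, tag, new_frame, redis))
    return extensional

def forms_of(lemma, redis):
    stem = lemma[:-2]                      # drop the -er ending
    if redis == "%passive":
        return [(stem + "és", "Kms")]      # only past participles allow the passive
    return [(stem + "e", "P3s")]           # a single active form, for brevity

for entry in compile_entry("diagnostiquer", "v",
                           {"Suj": "cln|sn", "Obj": "(cla|sn)"},
                           ["%active", "%passive"], forms_of):
    print(entry)
# ('diagnostique', 'v', 'P3s', {'Suj': 'cln|sn', 'Obj': '(cla|sn)'}, '%active')
# ('diagnostiqués', 'v', 'Kms', {'Suj': '(cla|sn)', 'Obl2': '(par-sn)'}, '%passive')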
Grammatical knowledge
Regarding grammatical knowledge, we rely on a meta-grammar formalism [11, 7] which represents the syntactic rules of a language in a hierarchical structure of classes. The classes at the top of the hierarchy define general concepts such as parts of speech (noun, verb, etc.) and their possible attributes. Classes are then refined while descending towards the bottom of the hierarchy, adding constraints, allowed/forbidden constructions, etc. For example, as explained in [19], Spanish common nouns are described in a Spanish meta-grammar named SPMG as shown in Figure 2.2 (some comments have been added to the content of the class to make it easier to understand). This meta-grammar formalism is theoretically compilable into most commonly used grammar formalisms. In practice, we compile our grammars into hybrid TAG/TIG parsers that produce dependency trees. Figure 2.3 shows an example of a parse for a Spanish sentence.
2.7.1.2 Licenses
So as to maximize feedback and federate people with the skills and the will to collaborate around a common beneficial goal, the availability of the techniques and LRs produced by the Victoria project, in terms of access and modification, is guaranteed by the use of non-restrictive public licenses such as LGPL (Lesser General Public License, http://www.gnu.org/licenses/lgpl.html), LGPL-LR (Lesser General Public License for Linguistic Resources, http://igm.univ-mlv.fr/~unitex/) and CeCILL-C (LGPL-compatible, http://www.cecill.info/).
2.7.1.3 Confidence and Accessibility
In order to prevent editing errors and allow users to focus on the data themselves without worrying about the underlying formalism, we are willing to develop a dedicated query and edition interface for every resource and technique output.
class cnoun{
  N2 >> N; N >> Nc;      %'N2' should be above 'N' which itself should be above 'Nc'
  N2 >> det;             %'N2' should be above 'det'
  det < N;               %'det' should always come before 'N'
  Nc = Ancre;            %'Nc' is an anchor node
  node N : [cat: N];                    %feature structure of node 'N'
  node det : [cat: det, type: subst];   %feature structure of node 'det'
  - nc::agreement; Nc = nc::N;          %requesting checks from other class
  - n::agreement;  N = n::N;            %requesting checks from other class
  %feature equations to restrict possibilities
  node(det).top.number = node(N2).bot.number;   %checking number between nodes 'det' and 'N2'
  node(det).top.gender = node(N2).bot.gender;   %checking gender between nodes 'det' and 'N2'
  node(det).top.wh = node(N2).bot.wh;           %checking if sentence is an interrogative one
  node(Ancre).bot.person = value(3);            %assigning third person
  det => node(N2).bot.sat = value(+);           %if 'det' is available, 'N2' is said to be 'saturated'
  ~ det => node(N2).bot.wh = value(-);          %if 'det' is not available, 'N2' can not be part of an interrogative sentence without a 'det'
}

Figure 2.2 – Generic Spanish common nouns class in a meta-grammar named SPMG.
In order to overcome technical and distance troubles, every dedicated interface is being developed with stable Web technologies supported by most Web browsers without any additional installation. The scalable edition with respect to the user's competences is obtained by developing the interfaces within open-source Content Management Systems (CMS) that provide useful features such as user and group management. So far, the interfaces have been developed thanks to a Java-based technology called Portlet [31], implemented by an open-source CMS named JBoss Portal [22].
2.7.2 Saving efforts
2.7.2.1 Using existing resources
As stated in the guidelines, existing LRs have always been considered when building new LRs. Regarding the transfer of similar LRs describing a related language, the following decisions have been made, depending on the kind of linguistic knowledge. Transferring morphological knowledge has not been considered: indeed, it is not clear whether or not adapting the morphological rules of a given language to a related one is productive. Regarding lexical knowledge, an idea has been considered but not applied yet (see Chapter 5). It relies on the fact that many direct translations are effective between common-rooted languages such as the Romance ones. As regards grammatical knowledge, grammars are abstract and static enough not to evolve much and can thus be adapted from a given language to a related one. This idea has been applied to build a Spanish grammar from a French one (see Chapter 5).

Figure 2.3 – Partial example of parse for the Spanish sentence Hasta la victoria siempre / onward to victory.
2.7.2.2 Automatizing correction and extension
According to the ideas described previously, we have established a sequential chain of tools (see Figure 2.4) which aims at helping to upgrade, from plain text and in a semi-automatic fashion, the LRs required to perform symbolic syntactic parsing. We thus aim at developing four methods that take advantage of the one-to-one interactions between the three types of LRs:
– a method that helps correcting the morphological information of a lexicon thanks to morphological rules;
– a method that helps correcting morphological rules thanks to the morphological information of a lexicon (see Chapter 3);
– a method that helps correcting the syntactic information of a lexicon thanks to a grammar (see Chapter 4);
– a method that helps correcting a grammar thanks to the syntactic information of a lexicon.
Figure 2.4 – Semi-automatic chain of tools for the upgrade of the linguistic resources required to perform parsing.
Bibliographie
[1] Agfl. http://www.delph-in.net/index.php.
[2] Concepción Álvarez, Pilar Alvariño, Adelaida Gil, Teresa Romero, María Paula Santalla, and Susana Sotelo. Avalon, una gramática formal basada en corpus. In Procesamiento del Lenguaje Natural (Actas del XIV Congreso de la SEPLN), pages 132–139, Alicante, Spain, 1998.
[3] Marie-Hélène Antoni-Lay, Gil Francopoulo, and Laurence Zaysser. A Generic Model for Reusable Lexicons: The Genelex Project. Literary and Linguistic Computing, 9(1):47–54, 1994.
[4] Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluis Padró, and Muntsa Padró. Freeling 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48–55, 2006.
[5] Pierre Boullier and Benoît Sagot. Efficient and robust LFG parsing: SxLfg. In Proceedings of IWPT'05, 2005.
[6] Nicoletta Calzolari and Claudia Soria. Preparing the field for an open resource infrastructure: the role of the FLaReNet network of excellence. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[7] Marie-Hélène Candito. Organisation modulaire et paramétrable de grammaires électroniques lexicalisées. PhD thesis, Univ. of Paris 7, 1999.
[8] Ann Copestake and Dan Flickinger. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the second international conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, 2000.
[9] Laurence Danlos and Benoît Sagot. Constructions pronominales dans dicovalence et le lexique-grammaire. In Proceedings of the 27th Lexicon-Grammar Conference, L'Aquila, Italy, 2008.
[10] Éric De La Clergerie, Benoît Sagot, Lionel Nicolas, and Marie-Laure Guénot. FRMG: évolutions d'un analyseur syntaxique TAG du français. In Journées ATALA, 2009.
[11] Éric Villemonte de la Clergerie. From metagrammars to factorized TAG/TIG parsers. In Parsing '05: Proceedings of the Ninth International Workshop on Parsing Technology, pages 190–191, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[15] Tomaz Erjavec. Multext-East version 4: multilingual morphosyntactic specifications, lexicons and corpora. In LREC 2010: proceedings of the seventh international conference on Language Resources and Evaluation, 2010.
[16] Roger Evans and Gerald Gazdar. DATR: a language for lexical knowledge representation. Computational Linguistics, 22(2):167–216, 1996.
[17] Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, and Claudia Soria. Lexical Markup Framework (LMF). In Proceedings of LREC 2006, Genoa, Italy, 2006.
[18] José M. García-Miguel and Francisco J. Albertuz. Verbs, semantic classes and semantic roles in the ADESSE project. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbrücken, Germany, 2005.
[19] Daniel Fernández González. Cadena de Procesamiento Lingüístico para el Español. July 2010.
[20] Nancy Ide and Jean Véronis. Multext: Multilingual text tools and corpora. In Proceedings of the 15th conference on Computational Linguistics - Volume 1, pages 588–592, Morristown, NJ, USA, 1994. Association for Computational Linguistics.
[21] Multext. http://aune.lpl.univ-aix.fr/projects/MULTEXT/.
[22] JBoss Portal. http://jboss.org/jbossportal.
[23] Montserrat Marimon, Núria Bel, Sergio Espeja, and Natalia Seghezzi. The Spanish Resource Grammar: pre-processing strategy and lexical acquisition. In DeepLP '07: Proceedings of the Workshop on Deep Linguistic Processing, pages 105–111, Morristown, NJ, USA, 2007. Association for Computational Linguistics.
[24] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Nieves Fernández Formoso, and Vanesa Vidal Castro. Creating and maintaining language resources: the main guidelines of the Victoria project. In Proceedings of the LRSLM Workshop of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[25] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Elena Trigo, Éric De la Clergerie, Miguel Pardo, Jacques Farré, and Joan Miquel Vergés. Producción eficiente de recursos lingüísticos: el proyecto Victoria. In SEPLN 2009, 25th Edition of the Conference of the Spanish Society for Natural Language Processing, San Sebastian, Spain, September 2009.
[26] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Elena Trigo, Éric De la Clergerie, Miguel Pardo, Jacques Farré, and Joan Miquel Vergés. Towards efficient production of linguistic resources: the Victoria project. In Proceedings of the International Conference RANLP-2009, pages 318–323, Borovets, Bulgaria, September 2009. Association for Computational Linguistics.
[27] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Computer aided correction and extension of a syntactic wide-coverage lexicon. In COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 633–640, Manchester, United Kingdom, August 2008. Association for Computational Linguistics.
[28] Parole.
[29] Patrick Paroubek, Anne Vilnat, Sylvain Loiseau, Olivier Hamon, Gil Francopoulo, and Eric Villemonte de la Clergerie. Large scale production of syntactic annotations to move forward. In CrossParser '08: Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 36–43, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[30] Passage.
[31] Portlet.
[32] Benoît Sagot. The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[33] Benoît Sagot, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of LREC'06, Genoa, Italy, 2006.
[34] Benoît Sagot and Laurence Danlos. Méthodologie lexicographique de constitution d'un lexique syntaxique de référence pour le français. In Proceedings of the workshop Lexicographie et informatique: bilan et perspectives, Nancy, France, 2008.
[35] Benoît Sagot and Éric Villemonte de La Clergerie. Error mining in parsing results. In Proceedings of ACL/COLING'06, pages 329–336, Sydney, Australia, 2006.
[36] Multext-East. http://nl.ijs.si/ME/.
[37] Gertjan van Noord. Error mining for wide-coverage grammar engineering. In Proceedings of ACL 2004, Barcelona, Spain, 2004.
[38] Edward Vanhoutte. An Introduction to the TEI and the TEI Consortium. Literary and Linguistic Computing, 19(1):9–16, 2004.
[39] TEI. http://www.tei-c.org/index.xml.
[40] Tristan Vanrullen, Philippe Blache, and Jean-Marie Balfourier. Constraint-Based Parsing as an Efficient Solution: Results from the Parsing Evaluation Campaign EASy. In Proceedings of LREC 2006 (Language Resources and Evaluation), pages 165–170. LREC, 2006.
[41] Tamás Váradi, Steven Krauwer, Peter Wittenburg, Martin Wynne, and Kimmo Koskenniemi. CLARIN: Common language resources and technology infrastructure. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.
[42] Clarin. http://www.clarin.eu.
[43] Flarenet. http://www.flarenet.eu/.
Chapitre 3
MorphAcq: Unsupervised learning of concatenative morphology based on frequency-related occurrence
Related terms
A morphological family is a set of morphological rules allowing to build all the lexical forms related to a given lemma.
An infix is an affix which is added inside a root morpheme in the formation of a word. It contrasts with affixes attached at the outside of a stem, such as a prefix or a suffix. In a language like English, infixes do not occur since the root morpheme is indivisible.
A letter tree, in this document, is a data structure used to represent various lexical forms and to know which forms share a substring. This structure is composed of nodes and of transitions between nodes that are labeled by a letter. In this structure, lexical forms are introduced letter by letter, i.e., the letters of a lexical form label a path of transitions from the root node.
Related publication
A publication related to this chapter can be found under reference [23].
3.1 Introduction (Français)
Des RLs formalisant un langage (grammaire, lexique, etc.), les règles morphologiques sont parmi les moins difficiles à établir. Comme expliqué auparavant, ces règles sont un élément clé de l'architecture lexicale que nous utilisons. Bien que cela serait moins efficace, nous pourrions toujours construire nos lexiques avec une structure lexicale qui ne nécessite pas une telle RL. Néanmoins, pour des langues ayant une morphologie hautement productive telles que le turc ou le finlandais, une approche sans règles morphologiques présente un défaut majeur. En effet, l'aspect fortement concaténatif et agglutinatif de ces langues implique un nombre exponentiel de formes possibles. Les lexiques les décrivant doivent donc garder un certain degré de factorisation qu'il est impossible de mettre en œuvre sans règles morphologiques permettant de décomposer ou de générer les formes. Puisque la construction de telles ressources nécessite une expertise linguistique, elles ne sont pas disponibles pour bien des langues. Par conséquent, l'acquisition automatisée de la morphologie d'une langue est encore un sujet ouvert en TALN dont l'intérêt est attesté par l'existence d'un concours régulier [19] dédié à cette tâche. Dans ce chapitre, nous présentons une approche permettant d'obtenir automatiquement, à partir d'un corpus brut, une représentation de la morphologie des langues à morphologie concaténative, i.e., une représentation des mécanismes morphologiques ne reposant que sur l'utilisation de préfixes et de suffixes. Cette approche profite de phénomènes observables pour toute langue utilisant l'inflexion ou la dérivation, ces derniers se révélant plus simples à exploiter lorsqu'il s'agit de morphologie concaténative. Parmi ces phénomènes, une probabilité d'occurrence liée à la fréquence d'un lemme pour les formes de ce même lemme, qu'elles soient infléchies ou dérivées, est mise en avant et exploitée intensivement d'une façon jamais considérée auparavant. Puisque cette approche a été réalisée et testée avec des formules relativement simples et avec des variables dont les valeurs se veulent générales, l'application de cette approche à un ensemble divers de langues à morphologie concaténative ne nécessite que peu ou pas d'expertise. L'ensemble fonctionne comme une séquence de filtres raffinant progressivement une liste d'affixes ou de familles morphologiques potentielles. Les principales contributions de ce travail de recherche sont :
1. de mettre en valeur un phénomène de probabilité d'occurrence des formes lié à la fréquence de leur lemme,
2. de présenter un ensemble de filtres suffisamment généraux pour être adaptés à d'autres approches,
3. de décrire et expliquer une combinaison séquentielle de ces filtres ainsi que les résultats qu'elle obtient.
3.2 Introduction (English)
Among the LRs formalizing a language (grammar, lexicon, etc.), morphological rules are considered one of the easiest to build. As explained earlier, these rules are a key component of the lexical framework that we use. Although it would be less efficient, we could still build our lexicons with a different framework that does not require such an LR. However, for many highly productive/concatenative languages such as Turkish or Finnish, morphological rules are an absolute need when building lexicons. Indeed, the highly concatenative and agglutinative aspect of these languages forces their lexicons to keep a certain level of factorization in order to avoid an exponential number of entries. Morphological rules are therefore necessary to decompose or generate the forms. Since the construction of such a resource requires linguistic expertise, morphological rules are still lacking for many languages. The automatized acquisition of morphology is thus an open topic within the NLP field, and its usefulness is attested by the existence of an annual challenge [19] dedicated to this task. In this chapter, we present an approach that allows us to automatically compute, from raw corpora, a data-representative description of the morphology of concatenative languages, i.e., a description of morphological mechanisms that only rely on prefixes and suffixes. Our approach takes advantage of phenomena that are observable for all languages using morphological inflection and derivation but are easier to exploit with concatenative morphologies. Among these phenomena, a frequency-related occurrence of the forms belonging to a same lemma, be they derived or inflected, is highlighted and intensively exploited in a way that has not been considered so far. Since it is implemented with mostly straightforward and parameter-free formulas, applying this approach to a varied set of concatenative languages requires no or only little expert work. The whole approach works as a set of filters that sequentially refines a list of candidate affixes and a list of morphological families. The main contributions of this piece of research are:
1. to highlight a frequency-related phenomenon,
2. to present a set of filters general enough to be adapted to other existing approaches,
3. to describe a sequential combination of these filters and an evaluation of its results.
3.3 Related work
The existing approaches for the automatic acquisition of morphological knowledge can be classified into two types:
1. those that build a morphological analyzer trained to maximize a set of metrics,
2. those that explicitly list morphological rules and apply them.
Within the first type, the methods described in [4] and [8] are the most referenced ones. In [8], the authors introduce for the first time the concept of MDL (Minimum Description Length), which relies on the idea of encoding/factorizing a corpus with a set of morphemes as small as possible, i.e., the better the affixes and stems are identified, the better the corpus will be encoded/factorized. In [4], the authors also start with an MDL approach but eventually use a combination of Maximum Likelihood and Viterbi algorithms to better encode/factorize the forms. It has later been extended in [14] in an attempt to handle allomorphy. In a different manner, in [9], the authors use MDL to first determine a set of candidate stems and then use the remaining substrings of the forms to identify affixes. Candidate affixes are first split into letters and are later agglomerated as affixes according to a metric based on the substrings' frequencies. In [26], as in [1] and [13], the authors describe methods that originate from Harris' approach [11] and its follow-ups [10, 6]. These approaches focus on transition probabilities and letter successor variety, i.e., they detect morpheme boundaries by means of metrics that should increase or drop when considering a position between the last and the first letters of two morphemes. The method described in [7] follows the algorithm described in [13] and corrects important drawbacks, among them a bias towards languages that make an intensive use of the empty suffix, such as English. Within the second type of methods which, as we do, explicitly list morphological rules and apply them, we have inventoried six other methods [17, 2, 3, 18, 22, 5]. In [17], the authors identify morphological rules by means of analogies, e.g., live is to lively what cordial is to cordially. Each analogy receives a weight according to its productiveness, i.e., a weight computed according to the number of times the analogy is indeed validated and the number of times the analogy could apply. In [18], the approach is similar, except that the weight given to a morphological rule is computed from the number of shared candidate stems and the number of letters of the affixes. Conversely, in [2], all possible pairs of words are compared and morphological rules/analogies are identified when the edit distance is under a
certain threshold. The rules are then used to link related forms, and a clustering algorithm is used to group the related forms of a given lemma. In [3], the authors first apply a clustering method, again with thresholds, so as to group forms with similar syntactic behaviors. Morphological rules are then detected by analogy between sets of forms in different clusters. Each morphological rule receives a score computed from the number of common stems. In [22], the authors directly build morphological families called paradigms. This task is achieved in a brute-force fashion controlled by a large set of thresholds. In [5], the authors extend the approach described in [13] by adding several features to better handle compound affixes, related-form occurrence when cutting, and allomorphy. The frequency-related occurrence of related forms is partially implied at some point for forms with no related forms found in the corpus. The method described is similar to ours in the sense that it sequentially refines a list of candidate affixes and gradually improves its quality. As in [13], the method applies mostly to languages that make an intensive use of the empty suffix. Finally, as an interesting additional feature, in [26, 21], the authors intend to automatically guess/optimize the parameters according to a given corpus.
3.4 General definitions and information
Some of the computations are done thanks to letter trees. An affix is said to occur on a given node if it has been combined with a (prefix or suffix) substring of a form and the letters of this substring label a path from the root to this node. An affix combined with n different substrings in n different forms will thus occur on n different nodes (see Fig. 3.1).
As a shortcut, a form is designated in this chapter as frequent if its frequency is above the average frequency computed over all the forms of the input corpus. Since only raw text is provided as input, when referring to the lemmas of the forms, we actually refer to them in an abstract fashion to better explain our approach. It is important to note that, in this document, a morphological rule consists in adding an affix to a given stem with no character deletion or substitution. Linguistic phenomena that can modify the stem, such as allomorphy, are thus acquired as morphological rules different from the ones they are derived from. Although our examples focus on suffixes, the approach applies indifferently to prefixes and suffixes. Finally, all substrings starting a form are marked with a # at their beginning and all substrings ending a form are marked with a # at their end.

Figure 3.1 – Simplified example of a letter tree. The suffixes ed and ing occur on gray nodes, combined with the stems us and caus.
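As a concrete illustration of the letter-tree structure used throughout this chapter, here is a minimal Python sketch of such a trie, with the letter-labeled transitions and the per-node bookkeeping of candidate suffixes exploited by the later filters. It only illustrates the data structure, not the actual MorphAcq implementation; the field names and the suffix-length bound are assumptions.

class Node:
    """A letter-tree node: transitions labeled by letters, plus bookkeeping."""
    def __init__(self):
        self.children = {}      # letter -> Node
        self.suffixes = set()   # candidate suffixes observed below this node
        self.freq = 0           # frequency of the form ending exactly here (0 if none)

def insert(root, form, freq, max_suffix_len=5):
    """Insert a form letter by letter; on every node of its path, record the
    suffix that remains after that node (marked with a final #)."""
    node = root
    for i, letter in enumerate(form):
        node = node.children.setdefault(letter, Node())
        remaining = form[i + 1:] + "#"
        if len(remaining) - 1 <= max_suffix_len:
            node.suffixes.add(remaining)
    node.freq += freq

root = Node()
for form, freq in [("used", 10), ("using", 7), ("caused", 4), ("causing", 3)]:
    insert(root, form, freq)

# The node reached by 'us' carries both ed# and ing#, like the gray nodes of Fig. 3.1.
us_node = root.children["u"].children["s"]
print(sorted(us_node.suffixes))   # ['ed#', 'ing#']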
3.5 Frequency related phenomenon
Since their semantic meanings better match the content of some texts, some lemmas are more frequent than others. When considering a given lemma in a language, the probability for its inflected or derived lexical forms to occur in a text increases with the frequency of the whole lemma. In other words, the more frequently a lemma is used in a corpus, the more probable it is to encounter a diversified sample of its related lexical forms. Consequently, the more frequent a form is, the more chances there are to find its morphologically related forms. For example, in general texts, the various forms related to the lemma to talk are usually easier to encounter than the ones related to the lemma to orate. This phenomenon applies to most kinds of text, be they specialized or general, except those exhaustively describing the lexical forms related to some particular lemmas. Of course, there exist some circumstances which can affect the chances of a given form of a given lemma to occur. For example, the style of the writer and the type of the text affect the ratio of occurrence between the forms of a lemma: in an autobiography, the first person singular is used far more than usual. Nevertheless, these particularities do not alter the fact that the chances of occurrence of the forms belonging to a lemma increase with the frequency of the lemma and, therefore, that the more frequent a form is, the more chances there are to find its morphologically related forms.
3.6 Global overview
The methodology detailed in the following sections can be summarized as follows.
1. Establish an over-covering and naive list of candidate affixes, i.e., substrings that may be affixes.
2. Detect pairs of candidate affixes that seem to be related within some morphological families. For example, for a family with three affixes A, B and C, detect the pairs {A, B}, {B, C} and {A, C}.
3. Build morphological families according to the set of pairs present on a same node. For instance, if the pairs {A, B}, {B, C} and {A, C} are present on a same node, build a family {A, B, C}.
4. Filter incorrect morphological families.
5. Split compound affixes. For example, split the English suffix ingly# into ing#+ly#.
6. Detect which substrings can connect stems and split the compound stems. For instance, detect that the substring - can connect English stems and split the form brother-in-law into brother + in + law.
3.7 Identifying candidate affixes
The first step consists in computing a naive list of candidate affixes from the substrings ending the forms when looking for suffixes, or starting them when looking for prefixes. For example, the form naive generates five candidate suffixes: aive#, ive#, ve#, e# and #. A candidate affix is then kept when it fulfills the following permissive conditions.
1. It occurs in at least one sub-tree that is present at least twice in the whole tree, i.e., it occurs in at least two identical sub-trees.
2. It occurs on at least one node covering a frequent word.
3. It is more likely to be an affix than the substrings it is combined with, i.e., it is combined, on average, with more substrings than the substrings it is combined with are themselves combined with other substrings (we thus previously compute, for every starting and ending substring, the number of substrings it is combined with).
4. It occurs more frequently on nodes with other substrings than it occurs alone.
5. It co-occurs with at least one given other candidate affix more than once.
The original list is thus filtered in a rough fashion in order to save useless computations for the later filters, which are more precise but also more computationally intensive. This also presents the advantage of establishing a first list of candidate affixes without relying on strict criteria such as a maximum length in characters.
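The sketch below illustrates the naive generation of candidate suffixes and two of the permissive conditions listed above (being combined with several distinct stems, and covering at least one frequent form). It is an illustration of the filtering spirit under simplifying assumptions, not the actual set of five tests.

from collections import defaultdict

def candidate_suffixes(forms, freqs):
    """Enumerate suffix candidates (marked with a trailing #) and keep the
    permissive ones: attached to more than one stem and covering a frequent form."""
    avg_freq = sum(freqs.values()) / len(freqs)
    stems_of = defaultdict(set)             # suffix -> set of stems it attaches to
    covers_frequent = defaultdict(bool)     # suffix -> occurs on a frequent form?
    for form in forms:
        for i in range(1, len(form) + 1):
            stem, suffix = form[:i], form[i:] + "#"
            stems_of[suffix].add(stem)
            if freqs[form] >= avg_freq:
                covers_frequent[suffix] = True
    return {s for s, stems in stems_of.items()
            if len(stems) > 1 and covers_frequent[s]}

freqs = {"talked": 9, "talking": 8, "caused": 3, "causing": 2}
print(sorted(candidate_suffixes(freqs, freqs)))
# ['#', 'd#', 'ed#', 'g#', 'ing#', 'ng#'] -- still over-covering, as expected at this stage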
3.8 Identifying pairs of candidate affixes
3.8.1 Leading idea
This filter aims at identifying pairs of related affixes, i.e., pairs of affixes that both occur in one or several same families. It relies on two considerations.
The first is that a morphological family covers at least two affixes (the empty string is considered as an affix: e.g., the English form think has two morphemes, the stem think and the empty suffix #; lemmas covering only one form are not considered, since the form itself can then be taken as the stem and no morphological rule is needed).
The second relies on the frequency-related phenomenon described in sect. 3.5, as well as on a specific characteristic of concatenative morphologies that makes this phenomenon easy to take advantage of. Indeed, contrarily to other kinds of morphologies which make use of infixes and thus modify part of the substring located within the stem, concatenative morphology relies on prefixes and suffixes and does not alter the stem much (linguistic phenomena that modify the stem, such as allomorphy, are acquired as morphological rules distinct from the ones they derive from). When inserting all lexical forms in a tree, the related forms of a lemma follow a same path from the root until they pass the last letter of the stem and then spread into different branches according to their respective affixes. Consequently, all related forms of a lemma occur on a same last common node formalizing the frontier between the stem and the affixes. In order to find the related candidate affixes of a candidate affix a, one only needs to pay attention to the nodes where a occurs. This particularity of concatenative languages makes it easy to exploit the frequency-related phenomenon. Indeed, if a truly belongs to a family f, the more frequent the form form containing a is, the more chances there are for its morphologically related forms to be also present in the corpus. This implies that the more frequent the form form is, the more probable it is for a to occur on the corresponding node together with the other affixes of the family f. For example, if we order all the nodes where the English suffix ing# occurs by the frequency of the forms ending with it, the nodes corresponding to infrequent forms such as orating will be found at the bottom of the ordered list whereas the nodes corresponding to frequent forms such as talking should be found at the top. As explained in sect. 3.5, the different forms of the lemma corresponding to
talking shall be easier to encounter in the corpus than the different forms of the lemma corresponding to orating. Consequently, ing# shall co-occur on the node corresponding to talking with more related suffixes than on the node corresponding to orating. Such a phenomenon should globally apply to most pairs of nodes in the list. Therefore, by sorting by frequency the nodes where a correct candidate affix occurs, we observe a progressively increasing co-occurrence rate with the other affixes of the family. On the other hand, if the candidate affix has no relation with some candidate affixes with which it co-occurs at some point, the corresponding co-occurrence rates shall be chaotic.
3.8.2 Practical application
In order to establish if a candidate affix a presents an increasing co-occurrence rate with another one, a list of the nodes where it occurs is computed and then sorted according to the frequency of the form corresponding to each node. This list is then split into sublists with the condition that the average frequency of a sublist s_i is mult times higher (mult > 1) than that of the previous sublist s_{i-1} (note 5).
We then compute, for each other candidate affix, a co-occurrence rate rate_i over each sublist and a score inc = sum_pos + mult * sum_neg, where sum_pos is the sum of the positive values rate_i - rate_{i-1} and sum_neg the sum of the negative ones. Since mult is superior to 1, if we draw the curve of the rate_i values, the curve has to be globally increasing for inc to be a positive value : because of mult, negative progressions impact inc more than positive ones do. The co-occurrence is therefore considered as increasing if inc is positive. Candidate affixes with no increasing co-occurrence rate with any other candidate are discarded.
5. The first sublist is the set of nodes corresponding to the forms with the lowest frequency.
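The computation of inc can be rendered by the following sketch. This is a rough, hypothetical illustration : the exact rule used to split the node list into sublists is not fully specified above, so the splitting condition below (a new sublist starts when a node's frequency exceeds mult times the current sublist's average) is an assumption, as are the data structures.

    def inc_score(nodes, other_affix, mult=2.0):
        """nodes: (form_frequency, affixes_on_node) pairs for the nodes where the
        target affix occurs, sorted by increasing form frequency.  A positive
        return value means the co-occurrence rate with other_affix is globally
        increasing, i.e. the two candidate affixes are considered related."""
        # 1. Split into sublists (assumed splitting rule, see the lead-in above).
        sublists, cur = [], []
        for freq, affixes in nodes:
            if cur and freq > mult * (sum(f for f, _ in cur) / len(cur)):
                sublists.append([a for _, a in cur])
                cur = []
            cur.append((freq, affixes))
        if cur:
            sublists.append([a for _, a in cur])
        if len(sublists) < 2:
            return 0.0
        # 2. Co-occurrence rate of other_affix over each sublist.
        rates = [sum(other_affix in s for s in sub) / len(sub) for sub in sublists]
        # 3. inc = sum of positive deltas + mult * sum of negative deltas ;
        #    negative progressions are thus penalized more heavily.
        deltas = [r1 - r0 for r0, r1 in zip(rates, rates[1:])]
        return (sum(d for d in deltas if d > 0)
                + mult * sum(d for d in deltas if d < 0))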
3.8.3 Incorrect englobing or englobed pairs
One must note that this filter identifies pairs of incorrect candidate affixes because of correct ones. Indeed, any affix X related to an affix Y often allows the candidate affixes subX and subY, starting with a same substring sub, to be considered as related, since their co-occurrence rate will also be an increasing one. For example, the English suffixes ing# / ed# allow the incorrect ones ming# / med# to be considered as related. We later refer to this type of incorrect pairs as incorrect englobing pairs. The same fact applies for two incorrect affixes Z and W and two correct and related affixes subZ and subW that start with a same substring sub. For example, the pair of English suffixes es# / ed# allows the pair s# / d# to be considered as related. We later refer to this type of incorrect pairs as incorrect englobed pairs.
3.9 Morphological families
As explained in the introduction, the main objective of this research work is to provide a data-representative description of a concatenative morphology. The process is therefore brought to a higher level of description by identifying morphological families. Fortunately, this level presents several useful phenomena to refine the acquired description.
3.9.1 Building morphological families
Once pairs are identified, we recursively process the tree so as to build morphological families according to the pairs present on each node. A basic approach could be to merge together all the pairs found. For example, if the pairs {A, B}, {B, C} and {A, C} are present on a same node, this basic approach would build a family [A, B, C].
Nevertheless, it is not rare for two different families to be present on a same node. For example, the Spanish verbs sentir (to feel) and sentar (to sit) belong to two different families but share the same stem sent. Merging together all the pairs present on a node can thus lead to the construction of incorrect families merging two families together. In order to avoid such a problem, the following approach is applied on each node (a code sketch is given at the end of this subsection) :
1. every candidate affix not included in a family by a previous iteration votes for another candidate affix with which it shares a pair ;
2. a family is built with the pairs of the candidate affix that has received most votes ;
3. if there are still candidate affixes on the node that are not included in a family, the process is iterated.
Two different families present on a same node will not be merged together unless their most popular affixes are the same. For each family built, we record the nodes where it has been identified/built.
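The voting procedure applied on one node can be sketched as follows (hypothetical data structures ; the actual implementation may differ) : each affix not yet assigned to a family votes for one partner it forms a pair with, and a family is built around the affix that collects the most votes.

    from collections import Counter

    def build_families_on_node(affixes, pairs):
        """affixes: candidate affixes present on the node.
        pairs: set of frozensets, each a pair of related candidate affixes.
        Returns the list of families (sets of affixes) built on this node."""
        families, remaining = [], set(affixes)
        while True:
            votes = Counter()
            for a in remaining:
                partners = [b for b in remaining
                            if b != a and frozenset((a, b)) in pairs]
                if partners:
                    votes[partners[0]] += 1   # each affix votes for one partner
            if not votes:
                break
            winner, _ = votes.most_common(1)[0]
            family = {winner} | {a for a in remaining
                                 if frozenset((winner, a)) in pairs}
            families.append(family)
            remaining -= family
        return families

    pairs = {frozenset(p) for p in [("A", "B"), ("B", "C"), ("A", "C"), ("X", "Y")]}
    print(build_families_on_node(["A", "B", "C", "X", "Y"], pairs))
    # -> [{'A', 'B', 'C'}, {'X', 'Y'}]   (order may vary)

Two families sharing a node are kept apart unless their most voted affixes coincide, which mirrors the behavior described above.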
3.10 Cutting forms
Because the cutting algorithm is used in some of the processes described below, we detail it now. Nevertheless, one must note that the set of families provided to the cutting algorithm depends on the step of the approach in which it is used.
Morphological families are used to split every word as prefix(es) + stem + suffix(es). This step is achieved by selecting, for each node, the families that can apply.
A family is said to apply on a node if it covers n (n > 1) substrings occurring on the node (note 6) and thus generates a set of n possible cuts. Each set of possible cuts is thus obtained with a given family applied on a given node. For each form covered by several possible sets of cuts, a choice among the sets is achieved by eliminating them sequentially according to the following three criteria :
1. the greatest number of cuts,
2. the smallest distance from the root for the corresponding node,
3. the largest size for the corresponding family.
The first criterion relies on the idea that the more forms are covered by a family, the more the resulting set of cuts tends to be correct. The second one favors longer affixes over shorter ones. Finally, the third one emphasizes the fact that larger families are also the most accurate ones (see sect. 3.10.1.1). If, after those three steps, more than one set remains, we simply select the first one. Indeed, since the remaining sets cover as many forms, cut on the same node and are all equivalent in size, the competing families are likely to be sub-families of a same bigger, non-discovered family. This selection is sketched in the code example below.
6. We give up on cutting a form with a given family if not even one other form is covered by the family.
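One possible rendering of this sequential elimination (field names are hypothetical) treats the three criteria as a lexicographic ordering :

    def select_cut_set(candidate_sets):
        """candidate_sets: dicts with keys 'cuts' (the proposed cuts),
        'node_depth' (distance of the node from the root) and 'family_size'.
        Returns the retained set of cuts; when candidates remain tied on all
        three criteria, max() keeps the first one encountered."""
        return max(candidate_sets,
                   key=lambda s: (len(s["cuts"]),     # 1. greatest number of cuts
                                  -s["node_depth"],   # 2. smallest distance from root
                                  s["family_size"]))  # 3. largest family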
3.10.1 Filtering morphological families
The generation of morphological families produces four kinds of families :
1. correct complete or incomplete families,
2. incorrect ones brought by incorrect englobing pairs,
3. incorrect ones brought by incorrect englobed pairs,
4. completely incorrect ones brought by completely incorrect pairs, i.e., incorrect pairs that are neither englobing nor englobed.
Various filters have been devised and are used sequentially to remove the incorrect families.
3.10.1.1 Filtering on sub-families
This filter directly addresses the incomplete families and the completely incorrect ones. As explained in sect. 3.5, depending on the lemma, more or less related forms are found in the input corpus and thus more or less complete families are generated. A correct family composed of n affixes shall appear in sub-families with n, n-1, n-2, ..., 1 of its affixes. A family with n affixes is thus kept if validated by the occurrence of at least one family with n-1 of its affixes (note 7).
7. Families with two affixes are automatically validated.
All families validating another one are discarded (essentially sub-families), as well as families that have not been validated (essentially the biggest completely incorrect families). One must notice that if two equivalent families with n+1 affixes sharing n of their affixes are generated, this filter will keep both unless a family with n+2 affixes covering them appears. Also, this filter is not effective against incorrect families brought by incorrect englobing pairs or by incorrect englobed pairs. Indeed, the sub-families of the correct families they are derived from provide the incorrect sub-families necessary to pass this test. It also proved to be less effective on small completely incorrect families since they require fewer sub-families to be validated. These small incorrect families are usually built from infrequent forms with no other related forms present in the corpus.
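The core of this filter can be sketched as follows (a simplified rendering in which families are plain frozensets of affixes) :

    def filter_on_subfamilies(families):
        """families: iterable of frozensets of affixes.  Keeps a family only if
        it is validated by a sub-family with exactly one affix less (families of
        size 2 are validated automatically) and discards validating sub-families."""
        fams = set(families)
        validated, validating = set(), set()
        for fam in fams:
            if len(fam) == 2:
                validated.add(fam)          # automatically validated
                continue
            subs = [g for g in fams if g < fam and len(g) == len(fam) - 1]
            if subs:
                validated.add(fam)
                validating.update(subs)
        # Families that validate another one are discarded (essentially
        # sub-families), as are families that were never validated.
        return [f for f in validated if f not in validating]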
3.10.1.2 Filtering on frequent forms
This filter compensates for the previous one regarding small completely incorrect families. It relies on the idea that morphological families are frequency-independent, i.e., they apply indifferently to frequent or infrequent lemmas. A correct family should thus cover, in at least one of the nodes from which it has been built, one of the forms considered as frequent.
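This filter reduces to a single test, sketched below under the assumption, stated later in sect. 3.13.1, that a frequent form is simply one occurring above the average frequency :

    def passes_frequent_form_filter(building_node_freqs, average_freq):
        """building_node_freqs: frequencies of the forms at the nodes from which
        the family has been built.  The family is kept only if at least one of
        these forms counts as frequent, i.e. occurs above the average frequency."""
        return any(freq > average_freq for freq in building_node_freqs)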
3.10.1.3 Filtering englobing families
This third filter follows the idea that the letters common to all related forms should belong to the stem. It thus tackles the incorrect families brought by incorrect englobing pairs by simply rejecting all families composed of affixes starting with a unique common first letter. For example, if [#, s, ring, red] is an English suffix family acquired with the candidate stem prefer and [r, rs, ring, red] another one acquired with the candidate stem bothe, the first one will be kept whereas the second shall be discarded.
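For suffix families, the test amounts to a one-line check (a sketch ; the symmetric check on the letter adjacent to the stem for prefix families is our assumption, not stated above) :

    def is_englobing_suffix_family(family):
        """family: suffix strings such as '#', 's#', 'ring#', 'red#'.
        A family whose suffixes all start with the same letter is rejected,
        since that letter should belong to the stem."""
        return len({s[0] for s in family}) == 1

    print(is_englobing_suffix_family(["#", "s#", "ring#", "red#"]))    # False, kept
    print(is_englobing_suffix_family(["r#", "rs#", "ring#", "red#"]))  # True, discarded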
3.10.1.4 Filtering dominated families
Dominated families are dominated in the sense that they are never selected by the cutting algorithm because :
- they have fewer affixes than the other families they compete with,
- they only apply on nodes that occur too deep in the tree.
These dominated families are essentially non-filtered sub-families or incorrect ones brought by incorrect englobed pairs.
Indeed, englobed pairs are found deeper in the tree than the correct pairs they are derived from. Consequently, the family built with them cannot cover the other englobed affixes since they are in different sub-trees. For example, let us consider a family [ab, ac, bd, be] built thanks to the pairs {ab, ac}, {bd, be}, {ab, bd}, etc. These pairs will allow the incorrect englobed pairs {b, c} and {d, e} to be identified and the consequent families [b, c] and [d, e] to be built. Those two families shall always be dominated by [ab, ac, bd, be]. We thus run the cutting algorithm with all the families and discard the ones that have never been selected.
3.11 Splitting compound affixes
The affixes of the acquired families can either be singletons, like the English suffixes ing# and ly#, or compounds, like ingly#. If an affix a3 in a family fam1 is to be split into two affixes a1 and a2, we consider that a3 is obtained by refining an affix a1 with a family fam2 containing the affix a2. We thus consider that a1 is the reason for a3 to exist and that a1 provides a context where fam2 can apply. Consequently, there should be other affixes in the family fam1 obtained by refining a1 with affixes of fam2.
Therefore, we list the other affixes in fam1 that could be the reason for a3 to exist by following the idea that they should be more present than absent on the nodes where a3 occurs. For example, on every node where the suffix ingly# occurs, one can also expect to find the suffix ing#. We then add a3 to this list and apply the cutting algorithm on it as if its elements were forms. The family that covers most of the elements of this list, including a3, is selected, a3 is split as a1 + a2 and the process is recursively applied on a2.
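The "more present than absent" test that selects the possible root affixes of a compound affix can be sketched as follows (hypothetical data structures) ; the selected affixes, together with a3 itself, are then handed to the cutting algorithm as if they were forms.

    def possible_roots(a3, family, nodes_of):
        """nodes_of: dict mapping each affix to the set of node identifiers where
        it occurs.  Returns the affixes of `family` that are more present than
        absent on the nodes where the compound affix a3 occurs, i.e. the possible
        root affixes a1 from which a3 could have been derived."""
        a3_nodes = nodes_of[a3]
        return [a1 for a1 in family
                if a1 != a3 and len(a3_nodes & nodes_of[a1]) > len(a3_nodes) / 2]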
3.12 Splitting compound stems
So as to split compound stems, one first needs to determine what substrings can connect them, be it the empty string or not. For example, in order to split the compound stem of the form grand-mother, one needs to identify the substring - as a valid connector. During our experiments, we could observe that these connectors act like double affixes, in the sense that they tend to combine two surrounding stems the same way suffixes are connected to the first stem and prefixes are connected to the second one. We also observed that, if enough data are provided, the most frequent forms tend to be identified along with connectors as fake affixes. For example, in English, the substrings #grand and #first- are identified as prefixes whereas the substrings -based# and man# are identified as suffixes. These fake affixes provide an elegant way to guess the connectors of a given language.
So as to identify the fake affixes and extract the corresponding connectors, we apply the cutting algorithm to all the forms with the final set of morphological families. We then establish two lists of starting and ending substrings corresponding to the combination of all the stems found with the prefixes and suffixes they have been found with. For example, if the English stem appear is found with the prefixes #re and #dis and the suffixes ing# and ed#, the substrings #reappear and #disappear are used as starting substrings and appearing# and appeared# as ending substrings. We then identify all the prefixes containing starting substrings and all the suffixes containing ending substrings. The parts of these fake affixes that do not belong to the starting or ending substrings are considered as candidate connectors. A connector is kept if it is found both in one fake prefix and in one fake suffix. For example, in our experiments over English, the connector - has been found in the fake prefix #first- and in the fake suffix -based#. Finally, the stem of a given form is split if the form combines a starting substring, a connector and an ending substring. For example, the English stem speedboat was split into the two stems speed and boat since it combines the starting substring #speed, the empty connector and the ending substring boat#.
It is important to note that we boldly assume that any prefix used with the compound stem shall be found with the first contained stem and any suffix used with the compound stem shall be found with the second contained stem. This relies on the idea that a compound stem usually refines the semantic concept of the stems it contains, i.e., the frequent combination of the contained stems has permitted the existence of the compound stem itself. The contained stems should therefore be more frequent than the compound stem and, since they are more frequent, they should have been combined with more affixes than the compound stem has.
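One simple reading of the connector-identification step is sketched below (hypothetical ; in particular, "containing" a starting or ending substring is read here as starting or ending with it) :

    def find_connectors(prefixes, suffixes, starting, ending):
        """prefixes / suffixes: acquired affixes written '#first-', '-based#', etc.
        starting / ending: substrings obtained by gluing each stem to the prefixes
        and suffixes it has been found with, e.g. '#reappear', 'appeared#'.
        Returns the substrings kept as connectors."""
        from_prefixes, from_suffixes = set(), set()
        for p in prefixes:
            for s in starting:
                if p.startswith(s):                  # '#first-' starts with '#first'
                    from_prefixes.add(p[len(s):])    # remaining part: '-'
        for suf in suffixes:
            for e in ending:
                if suf.endswith(e):                  # '-based#' ends with 'based#'
                    from_suffixes.add(suf[:len(suf) - len(e)])
        # A connector is kept only if it is found both in a fake prefix
        # and in a fake suffix (the empty string stands for the empty connector).
        return from_prefixes & from_suffixes

    print(find_connectors(prefixes={"#re", "#first-"},
                          suffixes={"ed#", "-based#"},
                          starting={"#first", "#reappear"},
                          ending={"based#", "appeared#"}))
    # -> {'-'}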
3.13 Comparison with related works
In this section, we only focus on unsupervised methods. Methods such as [12, 28, 25, 24] that are related to the subject but make use of an additional source of knowledge are not included in the comparisons.

3.13.1 General discussion on practical interest
Nowadays, there exist morphological analyzers implementing the ideas described in [15] that take as input manually coded morphological rules.
The main interest of unsupervised morphological acquisition is therefore to avoid establishing the corresponding list of rules, be it in an explicit fashion as we do, or in an implicit manner by means of a trained tool that performs morphological analysis.
Many approaches intend to model this task by means of more or less complex formulas with more or less variables. These approaches, devised and tested on the languages understood by their authors, tend to require adaptations when applied to another language. If these adaptations are not trivial, then, in order to know whether the results are relevant or not, the person using the acquisition tool needs both the competence to adapt and tune the tool and an understanding of the morphology of the acquired language. Obviously, such requirements drastically reduce the range of users to a small number of skilled ones. In addition, mathematical/statistical models usually manage to handle the most common cases but fail with infrequent ones. Since morphological knowledge is a knowledge that, at least for the most common cases, can be described by a large range of persons understanding the language acquired, such considerations raise the question of the practical interest of these approaches, i.e., would it not be simpler to manually describe the morphological rules of a language than to adapt a given approach to the acquired language ?
It is important to note that we do not highlight this aspect to criticize other approaches relying on more or less complex formulas with more or less variables : research always follows different paths and no one can predict which path will indeed be the most successful one. Our objective is actually to point out a characteristic that differentiates our approach from the rest : it has been developed with the objective of easing its application to new languages. Such an objective is achieved by following the idea that various phenomena can be exploited without narrowing the search space too strictly at any given moment. The search space is thus narrowed by a succession of filters, each one taking advantage of a given phenomenon. Each filter follows the idea that if the language has a certain aspect, then the filter should at least be able to reduce the search space to a certain degree where its relevance/coherence is less subject to doubt.
In practice, these filters do make use of some formulas, some variables or even some thresholds. However, we intend to model them so as to guarantee their relevance from one language to another. Indeed, if the reader pays attention to the previous parts, he shall surely notice a prudence in every decision we take. This prudence intends to guarantee the direct application of our approach to any concatenative language with no formula adaptation or variable tuning. Some examples :
- no maximum length is set to establish which substrings may be a candidate affix ;
- we usually use simple averages in most of our computations, e.g., a frequent form is just a form occurring above the average ;
- morphological families require at least two affixes (not three or more), and their suffixes shall start with at least two (not three or more) different letters ;
- a form is cut if only one (not two or more) other possibly related form appears ;
- for an affix to be the possible root of another compound affix, it only needs to be more present than absent on the nodes where the compounded affix occurs ;
- no prior knowledge, even fairly global knowledge such as the fact that the hyphen is usually used to compound words, is provided.
3.13.2 Step-by-step comparison
Candidate affixes Other approaches do not build a first list by filtering all possibilities with easy-to-fulfill conditions as we do. Apart from [9], they rather consider the entire set of possibilities and let the following steps handle them, or they directly restrict it with a maximum length in characters. In [9], the authors consider as candidate affixes all substrings that are not part of the candidate stems they detected in a previous step.
Pairs of candidate affixes Only methods that list morphological rules intend to explicitly identify relatedness between candidate affixes. The relatedness is often characterized by a score, and morphological rules are either filtered according to a given threshold [17, 2, 5] or their score does not allow them to correctly compete with other pairs when cutting forms [18].
Morphological families Even for methods that explicitly list morphological rules, building morphological families is not often attempted : we inventoried three methods [8, 3, 22] trying to achieve similar goals. Among these three methods, two [8, 3] report rather small sets of suffixes. On the other hand, [22] exploits, as we do, the concept of sub-families in order to validate bigger families. Nevertheless, the construction of the families and the filtering of completely incorrect families, of correct sub-families and of incorrect ones brought by englobing pairs are all controlled by thresholds. The authors also report that they fail to deal with families brought by englobed pairs.
Cutting forms Regarding methods that explicitly list morphological rules, a cut is usually considered when a possibly related form is also found. The only exception to this statement is the method described in [5], which tries to excuse the absence of a related form according to the frequency of the form studied. The other methods [8, 4, 26, 13, 7] do not explicitly require related affixes or a common stem for two forms. They rather compute, for every possible cut, a score that tends to be higher when the cut is on the frontier between stem and affix. For these methods, the occurrence of other related forms in the corpus is therefore not required, but it does participate in scoring whether a certain substring might be a true stem or not. If various morphological analyses are possible, other approaches either select some analyses (often ambiguous) above a certain threshold [9, 26, 17, 18, 2, 22], or they produce an unambiguous analysis by selecting, as we do, the analysis that maximizes a given criterion or score [4, 13, 7, 5].
Splitting compound affixes As far as we know, except in [5], the other approaches that do intend to handle compound affixes [4, 26, 13, 7, 17, 18, 3] do not identify whether or not an affix a1, which remains after cutting the possibly compound affix a2, might indeed be a root affix from which a2 has been derived. Compound affixes are thus handled by simply cutting shorter affixes when possible and reiterating the process on the remaining substrings. Consequently, most approaches will incorrectly consider the French form parleras ((you) will speak) as to be cut before the last 's', since the French form parlera ((he) will speak) exists and the concatenation of an 's' is a correct morphological rule.
Splitting compound stems The approaches that intend to handle compound stems [4, 26, 18, 3] usually rely, as we do, on the idea that the contained stems should generally be more frequent than the compound stem itself. However, contrarily to us, this decision is usually taken according to the frequencies of the forms involved in the choice. Such a direct decision will fail when facing, for a given corpus, compound stems that are more frequent than their contained stems. For example, basketball is usually more frequent than basket in a corpus related to sport. Except for [18], no other method intends to identify connector strings ; they prefer to rely on the occurrence of smaller forms to split a bigger one. No other method is therefore able to identify, as we do, substrings dedicated to the composition of stems such as the hyphen (note 8) or the o in the French form latinoaméricain (latin-american). Nevertheless, since in [18] the authors actually restrict the scope of the connectors to the suffixes of the first stems, this method does not differ much from the others.
8. Many methods, although unsupervised, consider the compounding effect of the hyphen as basic prior knowledge.
3.13.3 Main differences
As detailed, most of these existing approaches do share similarities with ours. Nevertheless, the main differences can globally be summarized as follows.
- It relies on the sequential application of several complementary strategies taking advantage of different phenomena. A noticeable part of these strategies can separately be adapted to other approaches.
- It highlights the existence, utility and usability of a frequency-related phenomenon that had never been considered before.
- It identifies affixes of any length.
- It builds a clear set of morphological families that can be manually corrected and/or extended. It can therefore be part of a semi-automatic acquisition process to build morphological rules.
- It presents a new method to handle compound stems.
- Its several steps rely on straightforward formulas with no (or few) parameters and no prior assumption on the language acquired, except the fact that it needs to be a concatenative one.
3.14 Evaluation
3.14.1 MorphoChallenge
Regarding evaluation, some approaches in the literature rely on non-consensual evaluation methods. Such a way of proceeding presents two important drawbacks. When these tools are not provided (or at least not thoroughly detailed), comparing results with the previously published methods will always be subject to doubts related to competence and/or even to honesty. When these tools are provided, there might exist a bias in the way the evaluation is performed. Indeed, even for English, some people may consider manage as the stem of the form manages whereas others (note 9) will consider manag as the stem.
Fortunately for the morphology acquisition task, a challenge (MorphoChallenge) focusing on morphological analysis from raw data has been organized every year since 2005. This challenge provides a set of evaluation tools which represent a consensual and reliable way to estimate the quality of an approach. These evaluation tools thus solve the first drawback mentioned above and allow researchers to focus on their approach and not on the validity of the evaluation.
In order to feed these tools, one needs to produce a file containing the morphological analyses of the forms as sequences of morpheme labels. In our evaluations, we directly used the stems and affixes as morpheme labels. As explained on the 2010 edition website [20], since the task involves unsupervised learning, the evaluation tools provided do not expect the algorithms to come up with morpheme labels that match the linguistic ones. However, they expect two forms containing a same morpheme according to the participants' algorithms to also have a morpheme in common according to the gold standard. This method of evaluation elegantly solves the second drawback mentioned above. Nevertheless, it presents characteristics the reader needs to consider when discovering the results :
- The morpheme labels provided by the gold standard are not affix-specific ; there can therefore be several labels for a single affix. For instance, the English suffix s# indicates the plural of a noun or the third person singular of a verb and is thus designated in the gold standard by two different labels, i.e., the evaluation tools expect the analyses to differentiate two affixes with the same spelling. When the morphological analyses are based, as ours are, on the substrings of the stems and affixes found, incorrect pairs of words shall be identified in the morphological analyses, where they have morphemes with identical labels, but not in the gold standard, where they have morphemes with syntactically-motivated and distinct labels. The precision computed by the tools will thus decrease. This phenomenon is known as syncretism.
- In a similar but opposite way, there can exist several affixes for a single label. For example, the English suffixes s# and es# can both represent the third person singular of a verb and are then designated by the same label when they occur in the gold standard, i.e., the evaluation tools expect the morphological analyses to group syntactically equivalent affixes. This lowers the recall since many pairs of words having morphemes with syntactically-motivated and equivalent labels shall be identified in the gold standard but not in the morphological analyses, where they are designated by morphemes with spelling-motivated and thus different labels. This phenomenon is known as allomorphy.
Consequently, even if all the performed cuts were correct, since all morphologies are ambiguous at some point, no method relying on spelling-motivated labels can achieve a perfect score. In addition to the previous two phenomena (syncretism and allomorphy), this sophisticated evaluation method handles two other phenomena : morphophonology and ambiguity.
9. Ourselves included.
Morphophonology occurs when applying a morphological rule alters the surface form of stems or affixes. For example, in the word wives, the stem-final 'f' of wife is modified when the plural suffix is added. A metric should thus penalize a method for not placing wives and wife as forms of the same lexeme.
Ambiguity happens with homonyms. For example, the French form fiche has two possible morphological analyses : one relating it to the verb ficher (to file) and another one relating it to the common noun fiche (card). A metric should thus account for legitimate morphological ambiguity.
Just like most unsupervised approaches, ours is still unable to deal with syncretism, allomorphy and the ambiguity brought by homonyms. To our knowledge, only one unsupervised method intends, to some extent, to handle allomorphy [14]. On the other hand, our approach does handle morphophonology, provided that the phenomenon is regular enough to be acquired as a different morphological family than the one it has been derived from. Nevertheless, such an approach just transfers the problem to allomorphy. For example, the pairs of forms wife/wives, knife/knives or shelf/shelves allow the pair of related affixes fe#, ves# to be created. Consequently, these pairs of forms are analyzed as having the same stem, i.e., our approach does not suffer from morphophonology. But on the other hand, the corresponding suffixes represent a new case of allomorphy for the singular and plural labels.
Regarding morphophonology, the other approaches do not state whether or not they handle it. Our intuition is that the methods that explicitly list morphological rules behave similarly to ours. The ones that build shallow morphological analyzers seem more likely to fail when facing a case of morphophonology. Indeed, since they do not request the occurrence of other related forms, they should favor morphological analyses that involve the most commonly used suffixes, i.e., e# and es# in the case of wife/wives.
3.14.1.1 MorphoChallenge's evaluation metrics
An important change in the 2010 edition has been the adoption, for the 2011 challenge, of a new metric named EMMA [27] instead of the MC metric [16] used so far. This decision has been motivated by the fact that EMMA correlates far better than the MC metric with the performance of the real-world NLP tasks which embed the morphological analyses. This new evaluation metric brings an important change since it barely correlates with the older MC metric. The main reason for this non-correlation is that EMMA presents the same advantages as the MC metric but is not susceptible to two types of gaming that have plagued recent MorphoChallenge competitions : ambiguity hijacking and shared morpheme padding. Indeed, as explained in detail in [27], the MC metric is not robust when ambiguous analyses are provided : they tend to boost recall without harming precision much. The only manner for the MC metric to avoid such gaming would be to prohibit ambiguous analyses. However, such a drastic solution would make it unable to handle legitimate morphological ambiguity.
3.14.2 Results
Our approach has been essentially developed by studying the results produced for French, Spanish and English. Nevertheless, our evaluations have been computed over English, German and Turkish. It is truly important to note that absolutely no threshold or variable has been adjusted from one language to another (note 10).
All results are summarized in the next two sections. In the tables of these sections, r. stands for ranking, P. for precision, R. for recall and T. for type of approach. The different types of approach are labeled with letters : U stands for fully unsupervised, P for unsupervised algorithm with supervised parameter tuning and S for semi-supervised. We comment the results of our approach, labeled MorphAcq, by comparing them with the other fully unsupervised methods. Nevertheless, the results of the methods that are not fully unsupervised are also provided so as to highlight the drastic change brought by the new metric.
10. The mult variable mentioned in sect. 3.8 has been set to 2 for all languages.

3.14.2.1 Results with the MC metric
The 2010 edition of MorphoChallenge has introduced a new semi-supervised contest that allows participants to take advantage of a part of the gold standard. A direct consequence has been a reduced number of participants for the fully unsupervised task. In addition, because some training corpora or gold standards have changed since the last edition, a direct comparison between the 2009 and 2010 results for the unsupervised methods remains subjective. Nevertheless, two reference methods called Morfessor CatMAP and Morfessor Baseline [4] have been used in both editions. The difference between their 2009 and 2010 results thus offers a way to guess what results unsupervised methods participating in the 2009 edition would have obtained in the 2010 edition, and the other way around.
For English (see Table 3.2), MorphAcq ranks fourth over seven unsupervised participants for the 2010 edition. It is mainly dominated by the three different versions of Base Inference, namely Base Inference itself, Iterative Comp. and Aggressive Comp.. Its results, once compared with those of Morfessor Baseline, suggest that, for the MC metric and the English language, it would have competed with the state of the art of the previous edition.
Regarding German (see Table 3.3), MorphAcq ranks fifth over seven unsupervised participants for the 2010 edition. It is once again dominated by the three different versions of Base Inference but also by Morfessor CatMAP. Its results, once compared with those of Morfessor Baseline, suggest that, for the MC metric and the German language, it would have been an average technique in the 2009 edition.
As for Turkish (see Table 3.4), MorphAcq ranks fourth over seven unsupervised participants for the 2010 edition. This time, it is beaten by Iterative Comp., Aggressive Comp. and Morfessor CatMAP. Its results, once compared with those of Morfessor Baseline, suggest that, for the MC metric and the Turkish language, it would have been one of the last techniques in the 2009 edition.
3.14.2.2 Results with the EMMA metric
As explained earlier, the new EMMA metric has brought a noticeable change. As one can observe in the following results, this metric barely correlates with the older MC metric : many leading methods of the previous section are now among the least effective ones. This tends to indicate that these methods were optimized for the previous metric and were thus producing more ambiguous analyses than necessary. They have consequently been penalized by EMMA.
For English (see Table 3.5), MorphAcq ranks sixth over seven unsupervised participants. Nevertheless, the third, fourth and fifth receive almost the same results. It is dominated by Base Inference and Iterative Comp..
Regarding German (see Table 3.6), MorphAcq ranks second over seven unsupervised participants. Only Morfessor CatMAP does better.
As for Turkish (see Table 3.7), MorphAcq ranks third over seven unsupervised participants. Both Morfessor CatMAP and Base Inference score better.
3.14.2.3 Comments on metrics
The surprise created by the new EMMA metric brings more questions than answers. It globally casts doubt on most of the previous results computed with the MC metric, be it from this year's edition or the previous ones. It raises the question of how the methods that were severely penalized by EMMA would have performed on both metrics if they had produced less ambiguous analyses. Indeed, there are methods, such as Morfessor S+W+L, that remain high on both metrics. The common point between these stable methods is that they avoid producing ambiguous analyses. If EMMA had been devised before the deadline of this year's edition, some participants could have chosen to trade the recall boost brought by ambiguous analyses on the MC metric for stability over both metrics.
The EMMA metric being a recent creation, it might well have an undiscovered bias. After all, the gaming problem found with the MC metric lasted for several years. A quick manner to resolve this uncertainty would be for all past participants to take part in the next edition. The new ranking would allow a better comparison between methods, and the study of all the new results would also provide interesting data for detecting possible biases.
In terms of evaluation, this situation illustrates rather clearly the situation of the whole NLP domain : even for a linguistic aspect considered to be among the easiest, evaluation is still uncertain.
3.14.2.4 Comments on the analysis
As we are unable to fully understand Turkish and German, the detailed study of our results mainly focused on English. This study showed that most of our cuts are performed as expected, i.e., right after the last letter common to all related forms. As explained previously, the evaluation tools take into account allomorphy and syncretism. Consequently, our most important loss of recall is our inability to recognize syntactically equivalent affixes and group them under a same label, whereas our main loss of precision is our inability to split a same affix into two syntactically different labels. The same comment should also apply to Turkish and German, even though we are unable to evaluate the impact on the results.
However, the lower recall obtained for both Turkish and German could also be the consequence of a still unidentified drawback. Indeed, whereas a superficial study of the corpora tends to show that these languages rely more on inflection than English does (note 11), the sizes of the biggest families obtained for Turkish (10) and German (13) seem insufficient when compared with the size of the biggest one obtained for English (10). A data sparsity problem or some unconsidered aspect is thus affecting either, or both, the recognition of pairs and/or the construction of bigger morphological families.
11. And are well known as being morphologically rich.
3.14.3 Integration within the semi-automatic chain of tools
The chain of semi-automatic tools described at the end of the previous chapter relies on the idea that two different types of LRs can be used within an NLP tool designed, among other things, to try and find a joint match between both resources that share (in a different manner) a same type of data. Regarding morphological rules and lexicons, two approaches may be developed depending on the type of lexicon used.
1. If the lexicon does not rely at some point on morphological rules, i.e., the lexicon is not synchronized with morphological rules, the tool designed to try and find a joint match between both resources could be a program stating whether two forms are related or not, any disagreement being considered as an unexpected behavior. Adding to the approach the data provided by a lexicon could allow the following improvements :
- it could allow pairs of related candidate affixes to be automatically validated ;
- it could allow families with n (n > 2) affixes to be created and automatically validated, thus avoiding the search for the n-1 sub-families necessary to their validation ;
- it could allow related forms to be grouped as one unique fake form with one fake affix occurring on the frontier with the stem. Such an approach would decrease the data sparsity problem since all sub-families (with a different set of affixes) of a same morphological family would be grouped under the same fake affix. Consequently, the remaining affixes that are still not included in the family would no longer occur on nodes with a disparate set of affixes as before but with a unique fake affix. The relation between the non-included affixes of the family and the family itself, represented by a unique fake affix, should thus be easier to establish.
2. If the lexicon does rely at some point on morphological rules, such as a two-level lexicon of the Alexina framework, both resources will always agree on whether two forms are related or not since they are synchronized. Even so, as for a non-morphologically-synchronized lexicon, adding the data provided by the lexicon would allow related forms to be grouped as one unique fake form with one fake affix and would thus reduce the data sparsity problem.
3.15 Future work
The initial goal of this approach is to acquire from a raw corpus a data-representative description of a concatenative morphology. The current results could be directly enhanced by :
- identifying and, if possible, compensating for the limitations highlighted by the low recall obtained over Turkish and German,
- developing the integration within the semi-automatic chain of tools as described in the previous section.
Unfortunately, for time reasons, both tasks could not be achieved before the redaction of this manuscript.
Regarding the next edition of MorphoChallenge, a fairly interesting feature would be to generate morphological analyses based on syntactically-motivated labels and not spelling-motivated ones as we currently do. Achieving such an improvement could allow us to deal with syncretism, allomorphy and the ambiguity brought by homonyms. Nevertheless, this non-trivial task requires taking advantage of the syntactic contexts of each form in order to automatically infer syntactic classes/behaviors. The study of the research achieved regarding the automatic construction/induction of part-of-speech taggers might provide us with leads towards this objective.
3.16 Conclusion (English)
As confirmed by our experiments and the results presented above, the approach already fulfills its initial goal of acquiring from a raw corpus a data-representative description of a concatenative morphology. As the samples of morphological families provided in appendix A show, anybody interested in building a morphological description of a concatenative language can rely on this approach to guide and ease their efforts. As MorphoChallenge's evaluation tools have pointed out, there are still several aspects that can be improved. However, the sequential combination of filters provides a convenient way to perform upgrades. For a first participation to MorphoChallenge, our approach performs satisfactorily over English, Turkish and German. Even though the approach is recent and in many aspects different from the others, these results confirm its potential.
3.17 Conclusion (Français)
Au vu de nos expériences et des résultats présentés ci-dessus, nous pouvons observer que la méthode remplit déjà son objectif initial de simplifier la construction de règles formalisant les mécanismes concaténatifs d'une langue. Les échantillons de familles morphologiques fournis en annexe A montrent bien que les résultats de cette méthode peuvent servir de point de départ pour la construction d'un ensemble de règles formalisant la morphologie d'une langue concaténative. Comme les outils d'évaluation de MorphoChallenge l'ont montré, bien des aspects restent encore à améliorer. L'architecture basée sur une combinaison séquentielle de filtres devrait normalement nous permettre de réaliser plus aisément ces améliorations. Bien que notre approche soit récente et, en bien des points, différente de l'état de l'art, elle obtient des résultats satisfaisants pour l'anglais, le turc et l'allemand. Ces résultats nous paraissent confirmer son potentiel.
(Each of the following figures reported, for every participating method, its ranking (r.), precision (P.), recall (R.), F-measure (F.) and type of approach (T.), plus, for Figures 3.5 to 3.7, the correspondence of ranks between the two metrics ; the tabular data is not reproduced here.)
Figure 3.2 : 2009 and 2010 results for English with the MC metric.
Figure 3.3 : 2009 and 2010 results for German with the MC metric.
Figure 3.4 : 2009 and 2010 results for Turkish with the MC metric.
Figure 3.5 : 2010 results for English with the EMMA and MC metrics.
Figure 3.6 : 2010 results for German with the EMMA and MC metrics.
Figure 3.7 : 2010 results for Turkish with the EMMA and MC metrics.
Bibliographie

[1] Delphine Bernhard. Simple morpheme labelling in unsupervised morpheme analysis. Pages 873-880, 2008.
[2] Delphine Bernhard. Morphonet : Exploring the use of community structure for unsupervised morpheme analysis. In Multilingual Information Access Evaluation Vol. I, 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Revised Selected Papers. Springer, 2010. To appear.
[3] Burcu Can and Suresh Manandhar. Clustering morphological paradigms using syntactic categories. In CLEF, pages 641-648, 2009.
[4] Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. Helsinki University of Technology, 2005.
[5] Sajib Dasgupta and Vincent Ng. High-performance, language-independent morphological segmentation. In NAACL HLT 2007 : Proceedings of the Main Conference, pages 155-163, 2007.
[6] Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In NeMLaP3/CoNLL '98 : Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, pages 295-298, Sydney, Australia, 1998. The Association for Computational Linguistics.
[7] Vera Demberg. A language-independent unsupervised model for morphological segmentation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 920-927, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[8] John Goldsmith. An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4) :353-371, 2006.
[9] Bruno Golenia, Sebastian Spiegler, and Peter Flach. Ungrade : Unsupervised graph decomposition. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 2009.
[10] Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10(11-12) :371-385, 1974.
[11] Zellig S. Harris. From phoneme to morpheme. Language, 31(2) :190-222, 1955.
[12] Nabil Hathout. Acquisition of the morphological structure of the lexicon based on lexical similarity and formal analogy. In TextGraphs '08 : Proceedings of the 3rd Textgraphs Workshop on Graph-Based Algorithms for Natural Language Processing, pages 1-8, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[13] Samarth Keshava. A simpler, intuitive approach to morpheme induction. In PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, pages 31-35, 2006.
[14] Oskar Kohonen, Sami Virpioja, and Mikaela Klami. Allomorfessor : Towards unsupervised morpheme analysis. In CLEF'08 : Proceedings of the 9th Cross-language Evaluation Forum conference on Evaluating Systems for Multilingual and Multimodal Information Access, pages 975-982, Berlin, Heidelberg, 2009. Springer-Verlag.
[15] Kimmo Koskenniemi. Two-level model for morphological analysis. In IJCAI-83, pages 683-685, Karlsruhe, Germany, 1983.
[16] Mikko Kurimo, Ville Turunen, and Matti Varjokallio. Overview of Morpho Challenge 2008. In CLEF'08 : Proceedings of the 9th Cross-language Evaluation Forum conference on Evaluating Systems for Multilingual and Multimodal Information Access, pages 951-966, Berlin, Heidelberg, 2009. Springer-Verlag.
[17] Jean-François Lavallée and Philippe Langlais. Unsupervised morphology acquisition by formal analogy. In Lecture Notes in Computer Science, 2010. 8 pages.
[18] Constantine Lignos, Erwin Chan, Mitchell P. Marcus, and Charles Yang. A rule-based acquisition model adapted for morphological analysis.
[21] Christian Monson, Kristy Hollingshead, and Brian Roark. Simulating morphological analyzers with stochastic taggers for confidence estimation. In CLEF, pages 649-657, 2009.
[22] Christian Monson, Alon Lavie, Jaime Carbonell, and Lori Levin. Evaluating an agglutinative segmentation model for ParaMor. In SigMorPhon '08 : Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 49-58, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[23] Lionel Nicolas, Jacques Farré, and Miguel A. Molinero. Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, Helsinki, Finland, September 2010.
[24] Hoifung Poon, Colin Cherry, and Kristina Toutanova. Unsupervised morphological segmentation with log-linear models. In HLT-NAACL, pages 209-217, 2009.
[25] Benjamin Snyder and Regina Barzilay. Unsupervised multilingual learning for morphological segmentation. In Proceedings of ACL-08 : HLT, pages 737-745, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[26] Sebastian Spiegler, Bruno Golenia, and Peter Flach. Unsupervised Word Decomposition with the Promodes Algorithm, volume I. Springer Verlag, February 2010.
[27] Sebastian Spiegler and Christian Monson. EMMA : A novel evaluation metric for morphological analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), August 2010.
[28] Nicolas Stroppa and François Yvon. An analogical learner for morphological analysis. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 120-127, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
Chapitre 4
Lexfix : Mining Parsing Results for Lexical Correction
Related terms
A classifier is a trained tool that maps observations about an item to conclusions about the item's target value. It therefore creates a model, often based on a tree or on entropy, to predict the value of a target variable based on several input variables.
Subcategorization frames specify the number and types of arguments of a form. For instance, a mono-transitive verb, like eat, sub-categorizes a subject and an object, e.g. the subject eats the object. A ditransitive verb, like give, sub-categorizes a subject, an indirect object and a direct object, e.g. the subject gives the direct object to the indirect object. Since they denote what arguments a form can have, subcategorization frames are essential to lexicalised grammars because they allow them to discard many incorrect parses.
The realizations of a given subcategorization frame denote the different types of syntactic structure that can match the subcategorization frame. For example, the frame specifying that the verb eat receives an object can take a noun phrase, as in the subject eats the object, as its realization.
PCFG stands for probabilistic context-free grammar. It is a context-free type of grammar in which each production is augmented with a probability. The probability of a parse is then computed according to the probabilities of the productions used in that parse.
An open-class form or lemma belongs to a syntactic class which, as opposed to a closed class, covers an infinite set of lemmas, since open classes constantly acquire new members as languages evolve. For example, in English, open classes are common nouns, proper nouns, adjectives, adverbs and verbs.
A closed-class form or lemma, also called a functional lemma, is a lemma belonging to a syntactic class which, as opposed to an open class, covers a known and defined set of lemmas such as pronouns, clitics, conjunctions, determiners, etc.
Related publications
The publications related to this chapter can be found under references [18, 19, 20, 21, 22] and [23].
4.1 Introduction (Français)
Le développement manuel d'un lexique à la fois précis et couvrant est une tâche laborieuse, complexe et encline à l'erreur. À moins qu'une somme très importante d'efforts soit investie, il n'est pas rare que des lexiques n'atteignent pas leurs objectifs en termes de couverture et/ou de qualité. Dans ce chapitre, nous présentons un ensemble de méthodes combinées dans une chaîne d'outils ayant pour objectif de simplifier la correction et l'extension d'un lexique. Les différents maillons de cette chaîne reposent soit sur un analyseur syntaxique, soit sur un étiqueteur, soit encore sur un classifieur d'entropie. Une fois combinés, ils permettent de détecter des entrées lexicales manquantes, incomplètes ou erronées et de générer des corrections pertinentes. Cette méthode implémente pleinement la méthode abstraite décrite au chapitre 2 (note 1) en s'appuyant sur une grammaire pour corriger un lexique. Elle utilise comme outil combinant les deux types de RL un analyseur syntaxique et considère ses échecs d'analyse comme des comportements non attendus.
Bien que nos exemples et résultats ne portent que sur la langue française, cet ensemble de techniques est indépendant de la langue considérée et des systèmes utilisés, dans le sens où il peut facilement être adapté à la grande majorité des analyseurs syntaxiques, étiqueteurs syntaxiques ou classifieurs d'entropie et, par conséquent, à la grande majorité des langues.
Les principales contributions de ce travail de recherche sont :
1. de détailler une approche semi-automatique combinant séquentiellement plusieurs composants dans l'objectif d'améliorer un lexique ;
2. de présenter un ensemble de composants réalisant des tâches intermédiaires et pouvant être réutilisés à d'autres fins ;
3. d'expliquer pourquoi l'application itérative de cette méthode permet de convertir un corpus brut donné en entrée en un corpus représentatif des manques d'une grammaire.
Dans ce chapitre, nous nous concentrons sur les informations lexicales d'une langue. Lorsque nous nous référons à des informations morphologiques ou syntaxiques, nous sous-entendons donc des informations lexicales liées à la morphologie (catégorie, genre, nombre, etc.) ou liées à la syntaxe (cadres de sous-catégorisation et réalisations de ces cadres).
1. La méthode abstraite a en réalité été établie après avoir créé cette méthode.
4.2 Introduction (English) The manual development of a lexicon that is both accurate and wide-coverage is a labor-intensive, complex and error-prone task requiring human
expert work. Unless very substantial efforts are invested, lexicons usually do not achieve the expected objectives in terms of coverage or quality. In this chapter, we present a set of techniques brought together in a chain of tools that simplifies the correction and extension of a lexicon. The various components of this chain rely either on a symbolic parser, a tagger or an entropy classifier. Together, they allow the detection of missing, incomplete or erroneous entries in a lexicon and the generation of relevant lexical corrections. This approach implements the abstract method described in Chapter 2 (2) by relying on a grammar to correct a lexicon, using a symbolic parser as the tool combining these two types of LRs and considering a parse failure as an unexpected behavior. Although our examples and results are related to French, this set of techniques is system and language independent, i.e., it can be easily adapted to most existing symbolic parsers, taggers or entropy classifiers and can thus be applied to most languages. The main contributions of this piece of research can be summarized as: 1. detailing a semi-automatic approach that sequentially combines several components with the objective of improving a lexicon; 2. presenting a set of components that achieve different sub-tasks and can therefore be re-used for other objectives; 3. explaining why the iterative application of this approach allows a raw corpus given as input to be turned into a corpus representative of grammatical shortcomings. In this chapter, we focus on lexical information. Therefore, when we sometimes refer to morphological and syntactic information, we actually refer to morphologically-related (part-of-speech, gender, case, etc.) and syntactically-related (subcategorization frames and realizations) lexical information.
2. The abstract approach has actually been abstracted from this method.
4.2.1 Related work To our knowledge, the first time a grammatical context was used to automatically infer lexical information was in 1990 [11]. Since then, the approaches related to the subject can be classified into two categories. The first category is composed of methods [16, 3, 4, 13, 24, 17, 5] that intend to automatically upgrade morphological-only entries with syntactically-related information. These methods are mostly the progressive improvement of a same line of work [16, 4, 13, 24, 17]. They rely either on specially devised grammars, non-lexicalised PCFGs or lexicalised PCFGs in order to guess a more or less large scope of lexical information. Many methods classified within the second category [1, 12, 30, 33, 6, 8] are also the progressive improvement of a same line of work [30, 33, 6, 8]. These
methods, globally more recent, all rely on HPSG systems. However, as clearly stated in [8], most approaches are system-independent and can be applied to any system fulfilling some preliminary requirements.
4.3 Global overview The semi-automatic methodology implemented in this chain can be summarized as follows.
1. Parse a large number of raw sentences considered lexically and grammatically valid (law texts, newspapers, etc.), and distinguish the parsed ones from the non-parsable ones.
2. For each non-parsable sentence, determine with the help of a statistical classifier whether the parsing failure might be due to a lack of coverage of the grammar (syntactically non-parsable) or to a shortcoming of the lexicon (lexically non-parsable).
3. Within the lexically non-parsable sentences, detect suspicious forms that correspond to missing, incomplete or erroneous lexical entries.
4. Generate correction hypotheses by studying the expectations of the grammar for those forms when trying to parse the non-parsable sentences in which they occur.
5. Order and rank the acquired correction hypotheses for an easier manual validation.
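The following Python sketch summarizes this chain as a single function. It is only an illustration: every object and method name (parser, classifier, detector, generator, ranker) is a hypothetical stand-in for the corresponding tool of the chain, not its actual interface.

def correction_session(sentences, parser, classifier, detector, generator, ranker):
    # 1. Parse the raw corpus and separate parsable from non-parsable sentences.
    parsable, non_parsable = [], []
    for sentence in sentences:
        (parsable if parser.parses(sentence) else non_parsable).append(sentence)

    # 2. Keep only the failures attributed to the lexicon, i.e. sentences the
    #    classifier predicts to be syntactically covered by the grammar.
    lexical_failures = [s for s in non_parsable if classifier.predict(s) == "parsable"]

    # 3. Detect suspicious forms (missing, incomplete or erroneous entries).
    suspects = detector.suspicious_forms(lexical_failures, parsable)

    # 4. Re-parse each failed sentence with the suspected form replaced by a
    #    wildcard and collect the lexical entries instantiated by the grammar.
    hypotheses = generator.correction_hypotheses(suspects, lexical_failures)

    # 5. Rank the hypotheses so that manual validation can start with the best ones.
    return ranker.rank(hypotheses)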
4.4 Classifying non-parsable sentences Let us suppose we have parsed a large corpus with a deep parser. Some sentences were successfully parsed, some were not. Sentences that were parsed are both lexically and syntactically covered (even if the parses might not be the expected ones). On the contrary, and as a first approximation, the parsing failure of a given sentence can be due either to a lack of grammatical coverage or to a lack of lexical coverage. Since we focus on improving a lexicon thanks to a grammar, we need to collect non-parsable sentences that are covered by the grammar, i.e., sentences for which the parse failure is only a consequence of shortcomings in the lexicon. Since syntactic structures are more frequent and less numerous than words, grammatical shortcomings tend to correspond, contrary to lexical ones, to recurrent patterns in non-parsable sentences. An entropy classifier is thus trained to recognize problematic syntactic patterns and differentiate syntactically covered sentences from non-covered ones. This classifier categorizes non-parsable sentences into two sets:
- syntactically non-parsable sentences, represented by the flag non-parsable, i.e., sentences that are likely to be non-covered by the grammar;
- syntactically covered non-parsable sentences, represented by the flag parsable, i.e., sentences that are likely to be covered by the grammar and thus are non-parsable because of shortcomings in the lexicon.
In order to achieve this task, the classifier is trained with contextual features that are obtained by listing the set of n-grams present in the sentences. The n-grams are built using the sequence of part-of-speech (POS) tags for open-class forms and the forms themselves for closed-class ones. A start-of-sentence element <s> and an end-of-sentence element </s> are also added at the beginning and end of each sequence. The entropy classifier is then trained with various sets of features, each one extracted from a given sentence, and a flag indicating whether the sentence is parsable or not. For example, if we decide to use 3-grams of the parsable sentence Ernesto went to Argentina., tagged as Ernesto/NP went/V to/Prep Argentina/NP ./Punct, the entropy classifier will receive as training:
<s>-NP-V NP-V-Prep V-Prep-NP Prep-NP-Punct NP-Punct-</s>   parsable
The POS tags are obtained thanks to a tagger. Although taggers are not perfect, their errors mostly depend on the set of tags available for each of the forms tagged. Therefore, we follow the idea that these errors are syntactically random enough not to blur the global coherence of the classifier's model. This training presents a bias: among the non-parsable sentences used to train the classifier, some are lexically non-covered but syntactically covered (3). The classifier is thus trained to consider some syntactically covered sentences as syntactically non-covered ones. Nevertheless, since lexical shortcomings are not syntactically recurrent, the training is only randomly impacted, i.e., the bias is randomly distributed among the features. In addition, as for any syntactically covered feature that appears in non-parsable sentences, their occurrences in non-parsable sentences are not recurrent and are naturally balanced by their presence in parsable ones.
3. These sentences are actually the ones we are looking for.
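The sketch below illustrates this feature extraction. The tag names and the open-class test are illustrative assumptions, not the exact tagset used in our experiments; note also that the worked example above keeps the POS of closed-class tokens for readability, whereas the sketch follows the stated rule and keeps their forms.

OPEN_CLASSES = {"NP", "NC", "V", "Adj", "Adv"}   # assumed open-class tags

def contextual_ngrams(tagged_sentence, n=3):
    """tagged_sentence: list of (form, pos) pairs."""
    # POS for open-class tokens, the form itself for closed-class ones.
    items = [pos if pos in OPEN_CLASSES else form
             for form, pos in tagged_sentence]
    items = ["<s>"] + items + ["</s>"]
    return ["-".join(items[i:i + n]) for i in range(len(items) - n + 1)]

sent = [("Ernesto", "NP"), ("went", "V"), ("to", "Prep"),
        ("Argentina", "NP"), (".", "Punct")]
features = contextual_ngrams(sent)
# ['<s>-NP-V', 'NP-V-to', 'V-to-NP', 'to-NP-.', 'NP-.-</s>']
# Each feature set is then paired with the flag "parsable" / "non-parsable"
# to train the maximum-entropy classifier.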
4.5 Detecting lexical shortcomings The following step automatically detects missing, incomplete or erroneous lexical entries. Two complementary techniques are used to pursue this objective. They both identify dubious lexical forms and associate them with the non-parsable sentences for which they are suspected to cause the parsing failure.
4.5.1 Tagger-based detection of missing homonyms In a lexicon, there can be two types of missing entries:
- unknown words, which are easy to detect and usually do not lead to a parse failure since they tend to receive, with the parsing systems that we use, default non-restrictive and ambiguous lexical information;
- missing homonyms, which usually lead a parse to failure since they are hidden by their corresponding known homonyms and their inadequate lexical information. For example, the form bow can either be a common noun or a verb, and their respective lexical information is obviously inadequate for one another. Therefore, if only one homonym is described, the parse of the sentences containing the other one will surely be misled.
The method detailed in this section has been devised to detect missing homonyms. It forces a tagger to consider a known form as unknown in order to call its guesser. Indeed, when facing a known form, most taggers are strongly (if not completely) influenced by the tags associated with the form in their internal lexicon. Thus, when facing a missing homonym of a form, most taggers do not consider the correct missing tag as a potential candidate. In order to obtain such a behavior, we simply bypass the internal lexicon. This allows the tagger to output tags that are compatible with the morphology and the context of the form, including tags that might be missing from the lexicon. Since forms belonging to closed categories are generally well described (and their homonyms correctly included too), only forms belonging to open categories are forced as unknown. Such a process introduces ambiguity on purpose. In order to keep this ambiguity within reasonable limits, we only force one form at a time to be considered as unknown for a given sentence. Thus, to guess POS tags for all words in a sentence, the sentence is entirely tagged several times. Of course, taggers make mistakes, particularly when dealing with unknown forms. A well-known situation is for a tagger to consider an unknown proper noun as a common noun. However, the scope of the process is not restricted to a single sentence but spans an entire corpus. Global computations over a large amount of text allow us to compute a statistical ranking of the suspected missing forms that balances the false positives produced by tagging errors. This ranking takes into account the precision rate $prec_t$ for an unknown form tagged as $t$, as evaluated relatively to the training corpus, and $n_{wt}$, the number of occurrences of the form $w$ tagged as $t$. More precisely, we assign to each couple (form $w$, tag $t$) a score $S_{sc}(w, t)$ defined as follows:
\[ S_{sc}(w, t) = prec_t \cdot \log(n_{wt}) \]  (4.1)
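The sketch below combines the one-form-at-a-time bypass with the score of equation (4.1). The tagger and lexicon objects and all their methods (guess_without_lexicon, unknown_precision, is_open_class, tags) are assumptions standing for "bypass the internal lexicon and call the guesser" and for the tagger's measured precision on unknown forms; they are not the interface of any actual tagger.

import math

def missing_homonym_candidates(sentences, tagger, lexicon, open_class_tags):
    counts = {}                              # (form, tag) -> number of occurrences n_wt
    for sentence in sentences:
        for i, form in enumerate(sentence):
            if not lexicon.is_open_class(form):
                continue                     # closed classes are assumed well described
            # Force only this form to be treated as unknown, one at a time.
            tag = tagger.guess_without_lexicon(sentence, position=i)
            if tag in open_class_tags and tag not in lexicon.tags(form):
                counts[(form, tag)] = counts.get((form, tag), 0) + 1

    # Score of equation (4.1): S(w, t) = prec_t * log(n_wt), where prec_t is
    # the tagger's precision on unknown forms for tag t.
    scores = {(w, t): tagger.unknown_precision(t) * math.log(n)
              for (w, t), n in counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)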
4.5.2 Statistical approach for detecting lexical shortcomings This technique, fully described in [28, 27], relies on the following assumptions:
- the more often a lexical form appears in non-parsable sentences and not in parsable ones, the more likely its lexical entries are to be erroneous or incomplete [31];
- the suspicion rate of a form must be reinforced when the form appears in shorter non-parsable sentences;
- the suspicion rate of a form must be reinforced if the form appears in non-parsable sentences along with other forms that appear in parsable ones.
These assumptions are used in a fix-point computation that, when studying 1-grams, quickly establishes a relevant list of lexical forms suspected to be incorrectly or incompletely described in the lexicon. The advantage of this technique [28] over the previous one is that it intends to detect any type of lexical shortcoming, be it morphologically-related (e.g. missing homonyms) or syntactically-related (e.g. subcategorization frames). However, it directly depends on the quality of the grammar used. Indeed, if a specific form is naturally tied to some syntactic construction that is badly covered by the grammar, this form will be found in more non-parsable sentences and will thus be unfairly suspected. This limitation can be balanced in at least two ways: 1. by excluding from the statistical computation all sentences that are non-parsable because of shortcomings of the grammar (as decided by the classifier defined in the previous section); 2. as described in [28], by combining the parsing results of various parsers with different coverage gaps.
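The deliberately simplified toy below only conveys the flavor of such a fix-point computation on 1-grams; it is not the actual algorithm of [28, 27]. Forms concentrate "blame" from the parse failures they appear in, and their presence in parsable sentences dilutes the suspicion.

from collections import defaultdict

def suspicion_rates(non_parsable, parsable, iterations=10):
    """Sentences are given as lists of forms (1-grams)."""
    suspicion = defaultdict(lambda: 1.0)     # start from a uniform suspicion
    for _ in range(iterations):
        blame = defaultdict(float)
        occurrences = defaultdict(int)
        for sentence in non_parsable:
            if not sentence:
                continue
            total = sum(suspicion[f] for f in sentence)
            for f in sentence:
                # Each failure distributes one unit of blame among its forms,
                # proportionally to their current suspicion; short sentences
                # concentrate the blame on fewer forms.
                blame[f] += suspicion[f] / total
                occurrences[f] += 1
        for sentence in parsable:
            for f in sentence:
                # Appearing in parsable sentences dilutes the suspicion.
                occurrences[f] += 1
        suspicion = defaultdict(lambda: 1.0,
                                {f: blame[f] / occurrences[f] for f in blame})
    return dict(suspicion)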
4.6 Generating lexical correction hypotheses: parsing non-parsable sentences Depending on the quality of the lexicon and the grammar, the probability that both resources are simultaneously erroneous about a specific form in a given sentence can be very low. If a sentence cannot be parsed because of a suspected form, it implies that the lexicon and the grammar could not find an agreement about the form, i.e., the lexical information that the grammar was expecting does not match what the lexicon provides for this form. For the suspected forms detected earlier, we directly assume that the parsing failures of the sentences in which they occur are the consequence of lexical problems regarding these forms. In order to generate lexical corrections, we study the expectations of the grammar for every suspected form in its associated non-parsable sentences. In a metaphorical way, since we believe the lexicon to be wrong, we could say that we ask the grammar its opinion about the suspected forms. To fulfill this goal, we get as close as possible to the set of parses that the grammar would have allowed with an error-free lexicon. Since we believe that the incomplete or erroneous lexical information of a form has restricted its possibilities of being part of a successful parse and has thus led the parsing to a failure, we decrease those lexical restrictions by underspecifying the lexical information of the suspected form, i.e., we dynamically add lexical information to the suspected form. A full underspecification can be simulated in the following way: during the parsing process, each time a piece of lexical information is checked about a suspected form, the lexicon is bypassed and all the constraints are considered satisfied, i.e., the form becomes whatever the grammar wants it to be. This operation is actually achieved by exchanging, in the associated non-parsable sentences, the suspected form with underspecified lexical forms. We shall call these forms that have only few lexical restrictions wildcards. If the suspected form has been correctly suspected, and if it is indeed the unique cause of the parsing failure of some sentences, replacing it by a wildcard allows these sentences to become parsable. In these new parses, the wildcard takes part in grammatical structures that correspond to fully-instantiated lexical entries, i.e., lexical entries that would have allowed the original form to take part in these structures. These instantiated lexical entries are the information used to build lexical corrections.
4.6.1 Generating wildcards As explained in [1], using fully underspecified wildcards introduces too much ambiguity in the parsing process. During our own experiments, it usually led to either: no parse at all because of time or memory constraints, i.e., no correction hypothesis was generated; or too many parses, i.e., too many correction hypotheses were generated. Therefore, ambiguity must be kept within reasonable limits by adding some lexical information to the wildcards. We only add a POS to them and rely upon the parsers' ability to handle underspecified forms, i.e., forms with a POS but no restriction regarding the subcategorization frames. The ambiguity introduced by the wildcards clearly still generates an important number of correction hypotheses. However, as explained in section 4.7, this ambiguity can be easily handled, provided that there are enough non-parsable sentences associated with a given suspected form. In practice, the POS added to a wildcard depends on the kind of lexical shortcoming we are trying to solve, i.e., it is chosen according to the kind of
detection technique that suspected the form. So far, we use the tagger-based detection to validate new POS for a suspected form. Therefore, when using this approach, we generate wildcards with the POS that are given by the tagger and are not present in the lexicon for the form. When using the statistical detection approach, we generate wildcards with the POS present in the lexicon for the suspected form : we try to discover new syntactic structures for the form, without changing its POS.
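A minimal sketch of the wildcard substitution follows. The parser interface (a parse() method accepting per-token constraints and a parse object exposing the entry instantiated for a token) is an assumption; real parsers expose this differently.

def hypotheses_for(form, sentences, candidate_pos, parser):
    """Collect the lexical entries the grammar instantiates for `form`."""
    hypotheses = []
    for sentence in sentences:               # non-parsable sentences containing the form
        for pos in candidate_pos:            # POS chosen by the detection technique
            # Replace the suspected form by a wildcard: only the POS is fixed,
            # subcategorization frames are left unconstrained.
            wildcard_sentence = [
                {"form": tok, "pos": pos, "underspecified": True} if tok == form
                else {"form": tok}
                for tok in sentence
            ]
            for parse in parser.parse(wildcard_sentence):
                # The entry instantiated for the wildcard in the new parse is a
                # correction hypothesis for the original form.
                hypotheses.append(parse.entry_of(form))
    return hypotheses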
4.7 Extracting and ranking corrections The way correction hypotheses are extracted depends on how they are later used during the manual validation. In a first stage [19], the corrections were directly extracted in the output format of the parser. Such an approach has three important drawbacks:
- one first needs to understand the output format of the parser before being able to study the correction hypotheses;
- some parts of the correction might use information that is not easy to relate to the format used by the lexicon (specific tagsets, under- or over-specified information w.r.t. the lexicon, etc.);
- merging results produced by various parsers, which is an efficient solution to tackle some limitations of the process (see Sect. 4.7.2), tends to be more difficult.
We thus developed for each parser a specialized conversion module that extracts the instantiated lexical entry given to the wildcard in a parse and translates it back from the output format of the parser to the format of the lexicon. Natural languages are ambiguous, and so must be the grammars that model them. Thus, even an inadequate wildcard might well lead to new parses and thus provide irrelevant corrections. In order to balance this drawback and prepare an easier manual validation, the correction hypotheses obtained for a given suspected form with a given wildcard are ranked according to the following ideas.
4.7.1 Baseline ranking: single parser mode Within the scope of only one sentence, there is no information that can help differentiate valid correction hypotheses from irrelevant ones. However, by simultaneously considering various sentences that contain the same suspected form, one can observe that erroneous correction hypotheses are randomly scattered. Indeed, irrelevant correction hypotheses are brought by the ambiguity introduced by the wildcards. Since this ambiguity will impact each sentence differently depending on its syntactic structure and the other
lexical forms it contains, the irrelevant correction hypotheses that it might produce have no particular reason to be stable from one sentence to another. On the other hand, correction hypotheses that are proposed for various sentences are more likely to be valid. This idea is the basis of our baseline ranking. Let us consider a given suspected form $w$. First, all correction hypotheses for $w$ in a given sentence form a group of correction hypotheses. This group receives a weight according to its size: the more corrections it contains, the lower its weight, since it is probably related to several permissive syntactic skeletons. Therefore, for each group, we compute a score $P = c^n$ in $]0, 1[$, with $c$ being a numerical constant close to $1$ (e.g. $0.95$) and $n$ the size of the group. Each correction hypothesis $\sigma$ in the group receives the weight $p_{g\sigma} = P/n = c^n/n$, which depends twice on the size $n$ of group $g$. We then sum up all the weights that a given correction hypothesis $\sigma$ has received in all the groups it appears in. This sum is its global score $s_\sigma = \sum_g p_{g\sigma}$. Thus, the best corrections are the ones that appear in many small groups.
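The following sketch is a direct transcription of this baseline ranking. It assumes hypotheses are hashable values (e.g. strings) and that `groups` contains, for each non-parsable sentence, the list of hypotheses produced for the suspected form in that sentence.

def rank_hypotheses(groups, c=0.95):
    scores = {}
    for hypotheses in groups:
        n = len(hypotheses)
        if n == 0:
            continue
        weight = (c ** n) / n          # p = P/n = c^n / n, small groups weigh more
        for hypothesis in hypotheses:
            scores[hypothesis] = scores.get(hypothesis, 0.0) + weight
    # Best corrections are those appearing in many small groups.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)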
4.7.2 Reducing grammar influence: multi-parser mode Just as it does for the statistical detection technique [28], crossing the results obtained with different parsers allows the ranking to be improved. Indeed, most erroneous correction hypotheses depend on the grammar rules used to parse the non-parsable sentences updated with wildcards. Since two parsers with two different grammars usually do not behave the same way, erroneous correction hypotheses are even more scattered. On the opposite, it is natural for grammars describing a same language to find an agreement about how a particular form should be used, which means that relevant correction hypotheses usually remain stable. Corrections can then be considered less relevant if they are not proposed by all parsers. Consequently, we separately rank the corrections for each parser as described in section 4.7.1 and merge the results using a harmonic mean.
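A small sketch of this merge: each parser contributes its own score dictionary, and the harmonic mean strongly penalizes hypotheses not supported by all parsers (a missing hypothesis simply gets a merged score of zero here).

def merge_rankings(per_parser_scores):
    """per_parser_scores: list of dicts {hypothesis: score}, one per parser."""
    all_hypotheses = set().union(*per_parser_scores)
    merged = {}
    for h in all_hypotheses:
        scores = [d.get(h, 0.0) for d in per_parser_scores]
        if any(s == 0.0 for s in scores):
            merged[h] = 0.0            # not proposed by one parser: considered less relevant
        else:
            merged[h] = len(scores) / sum(1.0 / s for s in scores)   # harmonic mean
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)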
4.8 Manual validation of the corrections Thanks to the previous steps, validating the proposed corrections is easier. In addition, since corrections are manually validated, we can afford to group and study together the corrections of a given suspected form with a given wildcard. During this study, one must note that three situations might occur. 1. There are no corrections at all: the form has been unfairly suspected, or the generation of wildcards has been inadequate, or the suspected form is not the only reason for its associated parsing failures.
2. There are some relevant corrections: the form has been correctly detected, the generation of wildcards has been adequate and the form is the only reason for (some of) its associated parsing failures. 3. There are only irrelevant corrections: the ambiguity introduced by the wildcards on the suspected form has opened the path to irrelevant parses providing irrelevant corrections; if the grammar does not fully cover the language, there is absolutely no guarantee that relevant corrections will be generated.
4.9 Comparison with related works In this section, we only pay attention to methods taking raw corpora as input. Methods that extract lexical information from annotated data are therefore not considered. Regarding lexical acquisition, no annual challenge or consensual method of evaluation exists. Since existing methods have been tested on various languages with different parsing systems, taggers, raw corpora or even gold standards, a clear and relevant comparison of their results is impossible. Hence, the following comparisons focus on each method's characteristics, advantages and drawbacks.
4.9.1 Practical interest, semi-automatic vs automatic The main difference between our approach and the others is the fact that we proceed in a semi-automatic fashion whereas the others proceed automatically. Our semi-automatic way of approaching the subject follows two ideas. Since languages evolve slowly, LRs do not need to be regularly rebuilt. Since languages are already ambiguous, unnecessary ambiguity brought by incorrect data should be avoided by all means, especially if this unnecessary ambiguity concerns frequent forms. Indeed, the natural ambiguity of languages can make it difficult to determine whether a given ambiguity is necessary, and it is thus complicated to get rid of the unnecessary ones. According to these two statements, when adding a new entry, we provide a set of information as restricted as possible and we carefully upgrade it with incremental approaches that usually involve shortcoming detection. In order to fulfill our criteria, automatic approaches should therefore provide a tremendous level of precision. However, such an objective seems impossible to attain since all correction hypotheses are generated according to the expectations of a grammar. As explained before, many noisy lexical corrections can be generated. Unless the grammar used completely covers the language,
there is not even a guarantee that a given set of correction hypotheses will indeed contain at least one relevant hypothesis. The results in terms of precision provided by the other approaches are generally computed on a few frequent forms. Frequent forms are both the most important for many NLP tasks and the most suitable forms for lexical acquisition. Indeed, because of their high number of occurrences, incorrect and ambiguous lexical information can noticeably harm the lexicon's quality. In addition, frequent forms are usually, within a given POS, the most syntactically complex forms, and therefore the most difficult forms to correct: when pruning a possibly irrelevant piece of information, there is always a doubt about whether or not it corresponds to a correct but non-obvious infrequent case. On the other hand, their high number of occurrences allows us to better establish which piece of lexical information they are lacking. Regarding practical interest, the results in precision provided by the other (automatic) approaches tend to improve with time. Nevertheless, they still do not represent, for our criteria, a sufficient trade-off to give up on our semi-automatic way of considering the subject. The previous statements need to be balanced by the fact that we do symbolic parsing. Indeed, symbolic parsing, unlike statistical parsing, is weak against incorrect and ambiguous information. In statistical parsing, incorrect lexical entries are usually associated with low probabilities that do not allow them to correctly compete with the other entries. Therefore, they usually do not impact the performance of the parser much. To summarize, we could say that, for our criteria, automatic approaches are to be avoided when dealing with non-probabilistic descriptions of natural languages, but can perfectly well be considered when dealing with statistical descriptions. It is also important to note that most existing automatic approaches can be straightforwardly converted into semi-automatic ones.
4.9.2 Step-by-step comparison 4.9.3 Lexical shortcoming detection Apart from ours, few methods, such as [30, 33, 6, 8], intend to use at some point a preliminary error-mining step to detect erroneous lexical entries. The first three approaches [30, 33, 6] rely on the error-mining approach described in [31] whereas the fourth [8] exploits a more recent one described in [9]. The other methods focus on morphological-only entries or completely unknown forms. As far as we know, the detection technique described in [28, 27] and the tagger-based detection technique have been used by no one else but us [21].
4.9.4 Filtering syntactically non-parsable sentences Apart from ours, no other method intends to filter out syntactically non-parsable sentences. However, this filter has been devised to balance a drawback of the statistical lexical shortcoming detection [28, 27] that also applies to the one described in [31]: if a given form is too often used within an uncovered syntactic structure, it will be unfairly suspected. On the other hand, in [8] the authors use a statistical detection described in [9] that is more robust to this issue.
4.9.5 Wildcard generation The approaches classified in the first category use the same type of wildcards in the sense that they only provide the part-of-speech information. For the methods classified in the second category, the wildcards are sets of fully specified lexical entries that are either pre-established for a given POS or dynamically selected. The scope of the underspecification, and consequently the type of the acquired lexical information, varies from one method to another, i.e., it mostly depends on the parsing system and the lexical framework used. For example, in our case, the two-level lexical framework we use allows our lexicons to be robust against morphologically-related shortcomings. Our correction hypotheses thus focus on syntactically-related information. Another example is [4], where the grammar used is a specially devised one. The scope of the acquired lexical information is therefore more limited.
4.9.6 Hypothesis ranking and validation As explained earlier, our semi-automatic approach allows us to split the ranking of the correction hypotheses according to a given pair of form and wildcard. When considering the lexical corrections, we usually validate a few corrections and, when in doubt about additional corrections, we rerun the process to confirm that the suspected form is still problematic. In addition, by rerunning the process on a form that has been corrected, some non-parsable sentences become parsable and are no longer used to generate corrections. Consequently, the irrelevant corrections that were generated from these sentences, along with the relevant corrections applied in the previous step, are not generated in this new iteration. Let us consider an entry which is lacking two pieces of lexical information: a frequent one and a less frequent one. The search for the frequent one may well generate irrelevant corrections that hide the relevant corrections for the less frequent piece of lexical information. However, by correcting the frequent one first and iterating the process, the irrelevant corrections of the previous step shall disappear in this new iteration and no longer hide the relevant corrections for the infrequent piece of lexical information.
Consequently, the semi-automatic approach allows us to perform such a separation in the classifications and to use the rather simple ranking method described earlier. It is important to note that the semi-automatic approach is the reason why the simple ranking method exists and not the other way around, i.e., we are not proceeding in a semi-automatic fashion because of an overly simple ranking method (4). On the other hand, since the other methods are automatic, they need to produce a global classification and apply a threshold to discard most incorrect correction hypotheses. The methods classified above in the first category apply a diverse set of metrics thoroughly studied in [14]. These metrics intend to evaluate the likelihood for a form to use a given subcategorization frame. Indeed, whereas the most frequent lexical information is easier to identify for a given form, it can be difficult to determine whether a less frequent one is incorrect or whether it actually represents an infrequent use of the form. This feature is especially useful for infrequent suspicious forms with only few associated non-parsable sentences. The threshold used to cut the list of correction hypotheses is usually set manually. Nevertheless, as explained earlier, if the acquired lexical information is used in a statistical description, what matters is for correct information to receive a better probability than incorrect one. We can thus anticipate that the threshold can be chosen in a quite arbitrary fashion without impacting the lexicon's quality much. Except for [1, 12], which do not describe any results, the methods classified in the second category let a maximum-entropy classifier choose the best correction to apply to a given occurrence of a form. The classifier bases its decision on various features regarding either the morphology of the form or the syntactic context it appears in. In [8], the authors add two additional steps. First, a filter that discards lexical information constraining a form to be part of a certain word paradigm whereas the paradigm does not seem to exist. Second, a classification that uses a metric exploited by the methods of the first category. Finally, none of the other methods has reported merging the corrections provided by various parsers.
4. Even though this ranking method would clearly prohibit us from proceeding automatically.
4.10 Results and Discussion We now detail the practical context in which our experiments were performed. We then describe some correction examples and the results achieved. Finally, we list future improvements.
4.10.1 Practical context The French lexicon used and improved is called the Lefff or Lexique des formes fléchies du français / Lexicon of inflected forms of French (5). This wide-coverage morphological and syntactic French lexicon contains more than 600,000 entries and has been built partially automatically [26]. This lexicon is under constant development. During our experiments, this lexicon has been used with two parsers based on two different formalisms and grammars:
- FRMG (French Meta-Grammar) [10] is a grammar generated in a hybrid TAG/TIG form from a more abstract meta-grammar with highly factorized trees [29]. It is then compiled into a parser by the DyALog system [32].
- SxLFG-Fr [2] is a deep non-probabilistic LFG grammar compiled into a parser by sxg.
The raw corpus of sentences of 25 tokens or less was extracted from a French journalistic corpus called Le Monde diplomatique. It contains 280,000 sentences, for a total of 4.3 million words.
5. See http://alpage.inria.fr/~sagot/lefff-en.html.
4.10.2 Examples of corrections Here are some examples of valid corrections found:
- israélien/Israeli, portugais/Portuguese, parabolique/parabolic, pittoresque/picturesque, minutieux/meticulous were missing as adjectives;
- politiques/politic was missing as a common noun;
- revenir/to come back did not handle constructions like to come back from or to come back in;
- se partager/to share did not handle constructions like to share (something) between;
- aimer/to love was described as expecting a mandatory direct object;
- livrer/to deliver did not handle constructions like to deliver (something) to somebody.
4.10.3 Classification of non-parsable sentences Let us recall that we intend to distinguish syntactically covered sentences from non-covered ones. In order to evaluate the relevance of this classification, we kept 5% of all parsable sentences so as to check whether the classifier was correctly classifying them as syntactically covered. Since parsable sentences are both lexically and syntactically covered, we calculate the precision rate with them only. Indeed, for non-parsable sentences, we have no automatic
means to know for sure which non-parsable sentences are lexically covered and syntactically not. As expected, since the quality of the training data improves after each correction session, the precision rate of the classifier rises (see Table 4.1). Indeed, as explained earlier in Sect. 4.4, all non-parsable sentences are categorized in the training data as syntactically non-covered, even those that are in fact only non-parsable because of shortcomings in the lexicon. By correcting lexical shortcomings, the amount of incorrect training data decreases: some training sentences that were incorrectly categorized as syntactically non-covered become parsable and are consequently re-categorized.

Session          0       1       2       3
Precision rate   92.7%   93.8%   94.1%   94.9%

Table 4.1 - Precision of the non-parsable sentence classification.

Since there is no difference when generating the 3-grams of parsable and non-parsable sentences, the results in terms of precision that we obtain for parsable sentences are likely to be similar to the precision rate that we would obtain for non-parsable sentences. In addition, after three correction sessions, 80% of the non-parsable sentences were classified as grammatically non-covered. This sharp contrast with the results in Table 4.1 on parsable sentences is an additional clue that the classifier performs satisfyingly and does not tend to classify all non-parsable sentences as syntactically covered. In the end, given the positive impact of this filtering step on our detection techniques, the small error rate that prevents us from taking into account some non-parsable sentences is not a significant issue. Indeed, since there is no particular reason for a given form to be more frequent than average in these incorrectly classified sentences, such mistakes can simply be balanced by increasing the size of the raw corpus.
4.10.4 Lexical shortcoming detection techniques 4.10.4.1 Tagger-based detection The first tests of this technique were conducted with a rather simple preliminary version. At that time, the technique differed on many points. 1. We were introducing ambiguity for all open-class forms of a sentence at the same time. We now introduce ambiguity for one open-class form at a time. 2. We were applying the technique on the whole corpus, which brings a lot of false positives. Even if there might be true positives in the parsable
sentences as well as in syntactically non-parsable sentences, it is far more interesting to restrict the detection to the lexically non-parsable sentences. 3. We were not considering the precision rate associated with each guessed tag when ranking the suspects. As far as quality is concerned, the results were less convincing than they are now. However, this preliminary version of the technique allowed us to correct 182 lemmas in the lexicon. An advantage of this tagger-based detection is that it only needs to be applied once on a corpus. Indeed, for a given session, the set of sentences that are non-parsable because of shortcomings in the lexicon is a subset of the corresponding set of non-parsable sentences of the previous session. So far, this detection method presents two limitations: its use is limited to short-range lexical information such as POS, and it generates a non-negligible amount of false positives.
4.10.4.2 Statistical detection This technique proved to be relevant from the very beginning and allowed us to correct 72 other lemmas. The number of corrections would actually have been even higher if the detection method alone had not been applied several times before our own experiments [28], i.e., the lexicon had previously been corrected thanks to the detection method only. The new detection results could actually be achieved thanks to some updates on the grammars. The validity of this detection is demonstrated by Figure 4.1, which exhibits a clear correlation between the suspicion rate of a form and the new parse rates obtained with the FRMG parser when substituting the suspected form with wildcards. This detection technique presents the great advantage of detecting all kinds of lexical shortcomings, whereas its main drawback is the underlying link between the quality of the detection and the quality of the grammar used to perform it. One must also note that during a session, some suspected forms can prevent other problematic forms from being corrected; it is thus necessary to run several correction sessions on a same corpus until no fairly suspected form arises.
4.10.5 Correction generation, ranking and manual validation The overall accuracy of the correction hypotheses decreases after each session: there are fewer and fewer lexical errors left to correct. In other words, on the corpus provided as input, the quality of the lexicon
Figure 4.1 - Parse rate of sentences with a wildcard (Y axis) according to the suspicion rate of the suspected forms substituted with wildcards in the sentences (X axis).
reaches the quality of the grammar and the grammar is no longer able to provide corrections. Since we want to improve our lexicon efficiently, we demonstrate the relevance of the whole process by showing the increase of the parsing rate for the two parsers we use (Figure 4.2). Of course, any parsing rate can be straightforwardly increased by introducing random ambiguity in the lexicon. Nevertheless, one must keep in mind that the corrections are manually validated, i.e., the noticeable increases in parsing coverage are mostly due to the improvement of the quality of the lexicon. Regarding the quality of the parser using this updated version of the lexicon, even if we expect it to work better, there is no guarantee of such a thing unless new evaluations are performed. Indeed, even if we only made correct updates to the lexicon, we might well unleash some ambiguities that were not impacting the parser before. All correction sessions but the second one are based on the non-parsable sentence classification, the statistical detection and the correction generation. The second session has been achieved only thanks to the tagger-based detection technique for identifying POS shortcomings (Sect. 4.5.1).
Figure 4.2 - Number of sentences successfully parsed after each session.

Table 4.2 lists the number of lexical forms updated at each session.

Session   nc    adj   verbs   adv   total
1         30    66    1183    1     1280
2         99    694   0       7     800
3         1     27    385     0     413
total     130   787   1568    8     2493

Table 4.2 - Lexical forms updated after each session.

As expected, we were quickly limited by the corpus and the quality of the grammars. Indeed, the lexicon and the grammars have been developed together during the last few years, using this same corpus as a testing corpus. Therefore, on this corpus, there was not a huge gap between the coverage of our grammars and the coverage of our lexicon. Further correction and extension sessions only make sense after grammar improvements or if applied to new corpora. To sum up our results, we have already detected and corrected 254 lemmas corresponding to 2493 forms. The coverage rate (percentage of sentences
for which a full parse is found) has undergone an absolute increase of 3.41% (5141 sentences) for the FRMG parser and 1.73% (2677 sentences) for the SxLFG parser. Those results were achieved within only a few hours of manual work!
4.11 Future work Regarding the classification of non-parsable sentences, we could generate the features for the parsable sentences used during training from the parse outputs themselves instead of using an external tagger. Indeed, since we are only interested in syntactic patterns covered by the grammar, the lexical data expressed in the (possibly ambiguous) parse outputs could be directly used to generate the tags. However, the practical interest of this filter is to balance an issue in our lexical shortcoming detection methods. Since the method described in [9] seems more robust to this issue, the filter might well lose its practical interest. The practical interest of the tagger-based detection is also to be reconsidered. Our experiments showed us that our two-level lexicon is fairly robust against morphological lexical shortcomings, i.e., fully missing entries. Indeed, the tagger-based detection has been devised for detecting missing homonyms. However, whereas it is possible for one form of a given lemma to be hidden by a homonymous form, it is not possible for all the forms of the same lemma. Therefore, detecting missing lemmas before applying our lexical correction process is an achievable objective. Applying methods such as the ones described in [7, 25] would guarantee a good morphological coverage of all the forms and thus question the practical interest of the tagger-based detection. Because the two-level lexical framework used allows our lexicon to be robust against morphological shortcomings, only syntactic information is subject to corrections. Such a characteristic could allow us to factorize suspicious forms into lemmas, regroup the associated non-parsable sentences and, consequently, improve the lexical corrections generated. As regards the ranking of the corrections, we could try to implement a more sophisticated metric such as those studied in [14]. This feature would allow us to iterate the process less for a given suspicious form and also improve the number of corrections for infrequent forms. Regarding corrections, we would like to investigate the following idea. Semantically related lemmas of a same class tend to have similar syntactic behaviors, e.g. to build, to construct, to assemble. This concept is actually used in [15] to cluster semantically-related forms. Nevertheless, this similarity is not systematic. During the manual validation of the corrections, when various lemmas with similar meanings receive a same correction, an LR or a method grouping lemmas according to their semantic meanings
could therefore be used to trigger new corrections or draw attention to other lemmas.
4.11.1 From lexical corrections to grammar corrections The inability of the grammar to provide new relevant lexical corrections for a given corpus can lead to highly interesting data: given a non-parsable sentence, if none of its suspected forms leads to a relevant correction, this sentence can be considered lexically correct w.r.t. the current state of the grammar. This means that it exhibits shortcomings in the grammar, which can help improve it. The iterative application of this approach can therefore turn a corpus provided as input into a corpus representative of the shortcomings of the grammar. Therefore, an iterative process that alternately and incrementally improves both the lexicon and the grammar can be devised. In other words, the grammar can be used to correct the lexicon and, when it cannot anymore, the input corpus can be used to correct the grammar. Once the grammar is updated, the lexicon can again be corrected, and so on. Although it is still not clear how to exploit it, a corpus representative of the shortcomings of the grammar is clearly valuable data that we would like to take advantage of. This is especially important for languages that lack large-scale treebanks, such as French. An additional approach to correct the grammar could also be devised by studying the statistical model built by the maximum-entropy classifier that is used to distinguish syntactically non-parsable sentences from parsable ones.
4.12 Conclusion (English) The process described in this chapter presents four noticeable advantages. First, this process allows a morpho-syntactic wide-coverage lexicon to be improved significantly within a short amount of human time. We showed this result thanks to the improvement of the parsing coverage of parsing systems that rely on such a lexicon, namely the Lefff. Moreover, our technique contributes to the improvement of deep parsing accuracy, which can be seen as a cornerstone for many advanced NLP applications. Second, our process is fed with raw text. This allows many types of text to be used as input, including texts produced daily by journalistic sources as well as technical corpora. Third, although the various components of the process have been developed towards a same objective, they achieve different sub-tasks and can therefore be used for other objectives. Fourth, its iterative semi-automatic application on an input corpus eventually turns it into a corpus representative of the shortcomings of the grammar.
Such valuable data could be a starting point for an automated improvement of grammars for deep parsing.
4.13 Conclusion (Français) Le processus décrit dans ce chapitre présente quatre avantages notables. Premièrement, il permet d'améliorer de façon significative et en un temps raisonnable un lexique morpho-syntaxique à large couverture. Ce résultat est démontré par l'amélioration du taux d'analyse d'analyseurs syntaxiques utilisant un tel type de lexique, à savoir le Lefff. De plus, cette méthode contribue à l'amélioration de l'analyse syntaxique profonde, une des pierres angulaires de bien des applications TALN de pointe. Deuxièmement, le processus prend en entrée n'importe quel texte brut. Il peut donc être utilisé sur des corpus de tous genres, qu'ils soient littéraires, journalistiques ou encore techniques. Troisièmement, bien que les différents maillons de ce processus aient été développés pour accomplir un seul et même objectif, ils réalisent des tâches intermédiaires et peuvent être réutilisés/adaptés à d'autres fins. Enfin, son application itérative sur un même corpus permet de convertir ce dernier en un corpus représentatif des manques grammaticaux d'un analyseur. Ce type de donnée brute, fort appréciable en l'état, pourrait servir de point de départ pour une méthode automatisant la correction des grammaires utilisées pour l'analyse syntaxique profonde.
Bibliographie
[1] Petra Barg and Markus Walther. Processing unknown words in HPSG. In ACL-36 : Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 91-95, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[2] Pierre Boullier and Benoît Sagot. Efficient parsing of large corpora with a deep LFG parser. In Proceedings of LREC'06, 2006.
[3] Michael R. Brent. From grammar to lexicon : Unsupervised learning of lexical syntax. Computational Linguistics, 19(2) :243-262, 1993.
[4] Ted Briscoe and John Carroll. Automatic extraction of subcategorization from corpora. In Proceedings of the fifth conference on Applied natural language processing, pages 356-363, Morristown, NJ, USA, 1997. Association for Computational Linguistics.
[5] Paula Chesley and Susanne Salmon-Alt. Automatic extraction of subcategorization frames for French. In Proceedings of the Language Resources and Evaluation Conference, LREC 2006, 2006.
[6] Kostadin Cholakov, Valia Kordoni, and Yi Zhang. Towards domain-independent deep linguistic processing : Ensuring portability and re-usability of lexicalised grammars. In Coling 2008 : Proceedings of the workshop on Grammar Engineering Across Frameworks, pages 57-64, Manchester, England, August 2008. Coling 2008 Organizing Committee.
[7] Kostadin Cholakov and Gertjan van Noord. Acquisition of unknown word paradigms for large-scale grammars. In Coling 2010 : Posters, pages 153-161, Beijing, China, August 2010. Coling 2010 Organizing Committee.
[8] Kostadin Cholakov and Gertjan van Noord. Using unknown word techniques to learn known words. In EMNLP 2010, October 2010.
[9] Daniël de Kok, Jianqiang Ma, and Gertjan van Noord. A generalized method for iterative error mining in parsing results. In Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks (GEAF 2009), pages 71-79, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[10] Éric De La Clergerie, Benoît Sagot, Lionel Nicolas, and Marie-Laure Guénot. FRMG : évolutions d'un analyseur syntaxique TAG du français. In Journées ATALA, 2009.
[11] Gregor Erbach. Syntactic processing of unknown words. In IWBS Report 131, 1990.
[12] Frederik Fouvry. Lexicon acquisition with a large-coverage unification-based grammar. In EACL, pages 87-90, 2003.
[13] Anna Korhonen, Genevieve Gorrell, and Diana McCarthy. Statistical filtering and subcategorization frame acquisition. In Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, pages 199-206, Morristown, NJ, USA, 2000. Association for Computational Linguistics.
[14] Anna Korhonen and Yuval Krymolowski. On the robustness of entropy-based similarity measures in evaluation of subcategorization acquisition systems. In COLING-02 : proceedings of the 6th conference on Natural language learning, pages 1-7, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[15] Anna Korhonen, Yuval Krymolowski, and Zvika Marx. Clustering polysemic subcategorization frame distributions semantically. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 64-71, Sapporo, Japan, July 2003. Association for Computational Linguistics.
[16] Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235-242, Columbus, Ohio, USA, June 1993. Association for Computational Linguistics.
[17] Cédric Messiant. A subcategorization acquisition system for French verbs. In Proceedings of the ACL-08 : HLT Student Research Workshop, pages 55-60, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[18] Lionel Nicolas, Jacques Farré, and Éric De La Clergerie. Confondre le coupable : corrections d'un lexique suggérées par une grammaire. In Proceedings of TALN'07, Toulouse, France, June 2007.
[19] Lionel Nicolas, Jacques Farré, and Éric De La Clergerie. Mining parsing results for lexical corrections. In Human Language Technologies as a Challenge for Computer Science and Linguistics, 3rd Language & Technology Conference, Poznan, Poland, October 2007. Wydawnictwo Poznańskie Sp. z o. o.
[20] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Mining parsing results for lexical correction : Toward a complete correction process of wide-coverage lexicons. pages 178-191, 2009.
[21] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Computer aided correction and extension of a syntactic wide-coverage lexicon. In COLING '08 : Proceedings of the 22nd International Conference on Computational Linguistics, pages 633-640, Manchester, United Kingdom, August 2008. Association for Computational Linguistics.
[22] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Extensión y corrección semi-automática de léxicos morfo-sintáctico. In SEPLN 2008, 24th Edition of the Conference of the Spanish Society for Natural Language Processing, Madrid, Spain, September 2008.
[23] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Trouver et confondre les coupables : un processus sophistiqué de correction de lexique. In Traitement Automatique des Langues Naturelles (TALN 2009), Senlis, France, June 2009.
[24] Judita Preiss, Ted Briscoe, and Anna Korhonen. A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 912-919, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[25] Benoît Sagot. Automatic acquisition of a Slovak lexicon from a raw corpus. In Lecture Notes in Artificial Intelligence 3658 (Springer-Verlag), Proceedings of TSD'05, pages 156-163, Karlovy Vary, Czech Republic, September 2005.
[26] Benoît Sagot. The Lefff, a freely available, accurate and large-coverage lexicon for French. In 7th international conference on Language Resources and Evaluation (LREC 2010), Valetta, Malta, 2010.
[27] Benoît Sagot and Éric de La Clergerie. Fouille d'erreurs sur des sorties d'analyseurs syntaxiques. Traitement Automatique des Langues, 49(1), 2008.
[28] Benoît Sagot and Éric Villemonte de La Clergerie. Error mining in parsing results. In Proceedings of ACL/COLING'06, pages 329-336, Sydney, Australia, 2006.
[29] François Thomasset and Éric Villemonte de La Clergerie. Comment obtenir plus des méta-grammaires. In Proceedings of TALN'05, 2005.
[30] Tim van de Cruys. Automatically extending the lexicon for parsing. In Proceedings of the eleventh ESSLLI student session, 2006.
[31] Gertjan van Noord. Error mining for wide-coverage grammar engineering. In Proceedings of ACL 2004, Barcelona, Spain, 2004.
[32] Éric Villemonte de La Clergerie. DyALog : a tabular logic programming based environment for NLP. In Proceedings of the 2nd International Workshop on Constraint Solving and Language Processing (CSLP'05), Barcelona, Spain, October 2005.
[33] Yi Zhang and Valia Kordoni. Automated deep lexical acquisition for robust open texts processing. In Proceedings of LREC-2006, pages 275-280, Genoa, Italy, 2006.
Chapitre 5 - Results
Related terms
Cognates are words that have a common etymological origin, e.g. publication in English corresponds to publicación in Spanish. The semantic meanings of cognates may differ as languages develop separately, and cognates can eventually become false friends.
Related publication The publications related to this chapter can be found with references [11, 12] and [13].
5.1 Introduction (Français) Ce chapitre situe l'état de développement de Victoria par rapport à ses objectifs. Il détaille donc sa situation en termes de directives, d'interfaces, de techniques et de RLs et complète les résultats et travaux futurs décrits dans les chapitres précédents. Il est important de noter que les résultats (et travaux futurs) de Victoria ne sont pas tous le fruit de cette thèse. Cependant, puisque les moyens humains alloués étaient limités, la quasi-totalité d'entre eux ont une implication directe ou indirecte avec le travail réalisé pour cette thèse. Étant donné que l'organisation de ce projet est une partie du travail réalisé et que les objectifs de Victoria ont été introduits au chapitre 2, la totalité des résultats est présentée afin de maintenir la cohérence de ce document. Cependant, pour ne pas tout mélanger, les résultats sont donnés en deux temps : dans un premier temps sont détaillés ceux ayant un lien direct avec le travail réalisé durant cette thèse et dans un deuxième temps (en partie 5.5.1) sont décrits brièvement ceux ayant un lien indirect.
5.2 Introduction (English)

This chapter reviews Victoria's state of development with respect to its objectives. It details its situation in terms of guidelines, interfaces, techniques and LRs, and completes the results and future works described in the previous chapters. Obviously, Victoria's results and future works have to be differentiated from those related to this thesis. Nevertheless, since Victoria is a project with a moderate workforce, the work achieved during this thesis had, at some point, a direct or indirect implication in most of them. In addition, because the global organization of the project is indeed a part of this thesis and most subjects have been introduced in chapter 2, not sketching the current situation of Victoria would alter the global coherence of this document. Therefore, so as to avoid claiming credit for results that do not present a direct link with this thesis, indirectly related results are briefly described within section 5.5.1.
5.3 Guidelines

Among the results achieved during this thesis, guidelines are clearly the most ambiguous to evaluate, as they are abstract notions that have not all been demonstrated in practice. Determining whether they are relevant or not can only be based on personal impressions and can thus be subject to caution. Nevertheless, we can state that:
- the transfer of grammatical knowledge from one language to another related one is indeed an efficient manner to start building a new grammar (see sect. 5.6.2.3);
- reusing existing LRs to build a new one is a handy and often achievable approach (see sect. 5.6);
- the use of two co-interacting LRs to correct one another is relevant.

Regarding the other guidelines for which we do not have clear results, we can only point out a scientific paper dedicated to Victoria's guidelines, which has been published in a workshop specialized in the sustainability and life-cycle management of LRs [14]. Although such a publication does not demonstrate that all guidelines are indeed effective, it allows us to at least affirm that they have been considered coherent and useful by specialists of the subject.
5.3.1 Future work

Regarding the transfer of linguistic knowledge from a language to a related one, the following idea has been considered for lexical information. With common-rooted languages, such as French and Spanish, many direct translations are effective. For example, a word ending with -tion in French can often be translated by a word ending with -ción in Spanish. It thus seems possible to apply a basic morphological alignment to translate some words. This concept, similar to that of cognates, is known to be delicate. Nevertheless, such direct translations seem to apply mostly to infrequent words. An explanation of this phenomenon could be that infrequent words might be among the ones that have evolved the least from the root language (Latin in the present case). If verified, such an approach would be of valuable interest since infrequent words are difficult to acquire. Indeed, because of their rare occurrences, tools dedicated to lexical acquisition have difficulties differentiating them from forms with typing errors. In addition, the sparsity of the contexts in which they appear prohibits a proper ranking (if any) of correction hypotheses such as those described in the previous chapter.
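To make the idea concrete, here is a minimal sketch of such a suffix-based alignment for French-to-Spanish candidate translations. The suffix pair -té/-dad, the example words and the helper name propose_cognates are illustrative assumptions and not part of the Victoria tools; real mappings would need validation against a corpus or a bilingual lexicon.

```python
# Hypothetical sketch: propose Spanish cognate candidates for French words
# by rewriting shared Latin-derived suffixes. Candidates must be validated.
SUFFIX_PAIRS = [
    ("ation", "ación"),
    ("tion", "ción"),
    ("té", "dad"),       # assumed pair, e.g. université -> universidad
]

def propose_cognates(french_word):
    """Return candidate Spanish forms obtained by suffix substitution."""
    candidates = []
    for fr_suffix, es_suffix in SUFFIX_PAIRS:
        if french_word.endswith(fr_suffix):
            candidates.append(french_word[: -len(fr_suffix)] + es_suffix)
    return candidates

if __name__ == "__main__":
    for word in ["publication", "administration", "université"]:
        print(word, "->", propose_cognates(word))
```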
5.4 Interfaces

As explained earlier, since interfaces are useful for enhancing collaborative work, Victoria aims at developing a dedicated one for each type of LR. Among the three types of LRs developed, the efforts have been focused on an interface for lexicons. This decision follows the idea that lexicons are, by far, the type of LR that requires the most collaboration and may receive it. Indeed, lexicons have a greater number of entries than grammars and morphological rules do and, since a part of the information for each entry
is simple to edit, they are the type of LR with the widest range of possible collaborators. Currently, the interface (see figure 5.1) already allows us to:
- query enhanced searches based on logical operations over any type of available data;
- select, enable, disable, create or delete entries;
- edit them in a guided fashion that prevents irrelevant modifications;
- trace all changes and, if relevant, cancel them;
- download, in an Alexina format, a file of selected entries;
- access a set of functionalities restrained by means of permissions given to certain groups of users.
5.4.1 Future work

The lexicon interface has not been tested in production yet. Its finalization should concentrate the efforts before considering other kinds of interfaces. Once this interface is achieved, efforts should be directed towards an interface to manipulate morphological rules. This decision has been made according to the idea that morphological rules are easier to describe than grammars; they thus have a larger pool of potential collaborators.
5.5 Techniques

In terms of techniques, Victoria aims at developing a chain to produce and upgrade, in a semi-automatic fashion and starting from plain text, all the LRs required to perform symbolic syntactic parsing. This chain is composed of four methods that take advantage of the one-to-one interactions between LRs. In the two previous chapters, we described two of these methods, namely:
- MorphAcq, a method that helps acquire morphological rules;
- LexFix, a method that helps correct the syntactic information of a lexicon.
5.5.1 Additional results

A third method, developed before Victoria started, helps correct the morphological information of a lexicon thanks to morphological rules. It achieves this task by considering the absence of some lexical forms in a lexicon as unexpected behaviors, and uses the morphological rules to predict hypothetical lemmas for the missing forms. A statistical fixed-point algorithm is then used to rank the hypothetical lemmas according to the number of inflected forms found in the corpus. A complete description of the technique can be found in [3].
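As a rough illustration of that idea (not the actual algorithm of [3], whose details differ), the sketch below alternates between scoring hypothetical lemmas by the attested inflected forms they would explain and redistributing the weight of each form among its candidate lemmas, until the scores stabilize. The input format and the example lemmas are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative fixed-point ranking of hypothetical lemmas: each attested form
# distributes one unit of weight among the lemmas that could have produced it,
# and lemmas supported by many attested forms progressively attract more weight.
def rank_hypothetical_lemmas(form_candidates, iterations=20):
    """form_candidates: dict mapping an attested form to the set of
    hypothetical lemmas predicted for it by the morphological rules."""
    lemma_score = defaultdict(lambda: 1.0)
    for _ in range(iterations):
        new_score = defaultdict(float)
        for form, lemmas in form_candidates.items():
            total = sum(lemma_score[lemma] for lemma in lemmas)
            for lemma in lemmas:
                new_score[lemma] += lemma_score[lemma] / total
        lemma_score = new_score
    return sorted(lemma_score.items(), key=lambda kv: -kv[1])

if __name__ == "__main__":
    candidates = {
        "cantas": {"cantar"},
        "canta": {"cantar", "cantir"},   # "cantir" is a spurious hypothesis
        "cantamos": {"cantar"},
    }
    print(rank_hypothetical_lemmas(candidates))  # "cantar" dominates
```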
Figure 5.1 – Snapshot of the interface dedicated to the edition of the lexicon.
5.5.2 Future work

In addition to the respective extensions described earlier, the two techniques developed during this thesis need to be stabilized. Indeed, because they have been developed as prototypes, their code is far from optimal or easy to manipulate. Regarding the fourth, missing technique, which aims at easing the correction of a grammar thanks to a lexicon, we should develop techniques that take advantage of the raw corpora produced as a side effect of the correction of the lexicon. Indeed, as explained in the previous chapter, when no new relevant lexical correction is produced, the input corpora may be considered as being mostly composed of syntactically non-covered sentences. An alternative approach would be to study the statistical model built by an entropy classifier trained to recognize non-syntactically covered sentences (see previous chapter). This model could be an interesting starting point to guess non-covered syntactic structures. Finally, a third approach could be to apply the method described in [4], which is able to detect problematic n-grams based both on forms and POS.
5.6 Linguistic resources

Regarding LRs, the efforts have been mostly dedicated to Spanish, a little to Galician and only a few to French. Indeed, the French resources we use were carefully developed for years by the Alpage team. Therefore, apart from the lexical corrections reported in the previous chapter, no recent improvements of these LRs are to be related to Victoria. On the other hand, for Spanish and Galician, almost everything still had to be done.
5.6.1 Leffe, léxico de formas flexionadas del español

Among the two Alexina-based morpho-syntactic wide-coverage lexicons created by Victoria, the Spanish lexicon Leffe has overtaken other well-known Spanish lexicons in terms of coverage.
5.6.1.1 Creation and extension

The first version of the Leffe was obtained by merging several existing Spanish lexicons [13]:

The Spanish Multext. Multext is an international project [9] which aims, among other things, at developing standards and specifications for the encoding and processing of linguistic corpora, along with tools, corpora and linguistic resources embodying these standards.
The USC lexicon is a large morphological Spanish lexicon [1] created for PoS tagging by the research group Gramática del Español of the University of Santiago de Compostela (Spain).

ADESSE is a database of Spanish verbs developed at the University of Vigo (Spain) [6], with syntactic and some semantic information for more than 4,000 verbs.

The Spanish Resource Grammar (SRG) is an open-source, multi-purpose, large-coverage and precise grammar for Spanish [10], grounded in the theoretical framework of HPSG. It includes a lexicon describing syntactic information for Spanish in a well-organized hierarchy of syntactic classes.

Figure 5.2 shows how the first version of the Leffe was built. This construction has been successfully achieved by interpreting all the input resources mentioned above (despite their partially incompatible lexical models), converting them into the Alexina format, and finally merging the converted lexicons. Since the Multext and USC lexicons only include morphological information, whereas the SRG and ADESSE lexicons include syntactic information, the merging procedure was achieved by:
1. converting the Multext lexicon into an Alexina-based morphological lexicon and adding some Alexina-specific entries (prefixes, suffixes, named entities, punctuation signs);
2. converting the USC lexicon as well and merging it with the lexicon that had been extracted from Multext, so as to get the morphological basis of the Leffe;
3. converting the ADESSE and SRG lexicons, which are syntactic-only, into the Alexina format;
4. merging the morphological Leffe from step 2 and both verbal syntactic lexicons built during step 3.

All entries that did not receive syntactic information were assigned a default restrictive syntactic description according to their POS. The mapping of the syntactic information in ADESSE to the Alexina framework proved straightforward, since both frameworks describe syntactic information in a similar fashion. On the other hand, the SRG classifies lemmas according to a hierarchy of syntactic classes. A mapping of these classes to the Alexina syntactic descriptions was thus performed. This mapping used the Lefff as a bridge in order to take advantage of the syntactic proximity between Spanish and French. A more detailed description of the merging procedure can be found in [11]. A simplified sketch of this merging pipeline is given below.
Figure 5.2 – Merging procedure performed to build the Leffe's first version (morphological information from Multext and the USC lexicon is merged first; syntactic information from the SRG, transferred through the Lefff (French), and from ADESSE is then merged in to produce the Leffe beta).

Once the first version of the Leffe was built, we used an adaptation to unknown forms of the technique described in Section 4.5.1 so as to extend its coverage [13]. This technique received as input a raw corpus extracted from a subset of the Spanish part of the Europarl corpus (a parallel corpus from the European Parliament proceedings), containing approximately 6 million words. A ranking of suspected missing pairs (form, tag) was obtained, with many false positives. Nevertheless, thanks to this list, we included in the Leffe, at a very minimal cost (the validation was manually performed by one person in two days), nothing less than 1,800 lemmas. We must point out that the original coverage of the Leffe was already high.
                 Lemmas (intensional)   Inflected forms (extensional)
Adjectives                  88                     298
Adverbs                     54                      54
Verbs                       26                   1,693
Common nouns               117                     231
Proper nouns             1,518                   1,518
Total                    1,803                   3,740

Table 5.1 – Lemmas acquired using the tagger-based technique.

Table 5.1 shows, according to their categories, the number of lemmas added to the Leffe. The great majority were proper nouns, since they were very incomplete in the Leffe up to this point. The approximately 1,800 intensional entries added to the Leffe correspond to more than 3,700 inflected forms in the extensional lexicon. For example, we added the verbs abstraer / to abstract and documentar / to document, the adjective francoespañol / Franco-Spanish, the common noun biocarburante / biofuel, the adverb precipitadamente / hastily and the proper noun Niza / Nice.
The few other updates performed on the Leffe until now have been achieved manually.
5.6.1.2 Syntactic features described

The syntactic description used for the Leffe is similar to the one used for the Lefff; the main differences are actually translations of prepositions. For verbs, the Leffe uses the following syntactic functions (defined here in a simplified manner):
- Suj for subjects, e.g. Yo soy de España / I'm from Spain;
- Obj for direct objects, e.g. Hemos soportado muchas dificultades / We have endured many difficulties;
- Obja and Objde for indirect objects introduced by the prepositions a or de, e.g. Entregaron una copa al ganador / They gave a cup to the winner;
- Obl and Obl2 for other (non-cliticizable) arguments, where Obl2 is used for verbs with two oblique arguments, e.g. Hablar de algo con alguien / To speak of something with somebody;
- Att for (subject, object or a-object) attributes and pseudo-objects, e.g. Mi padre es médico / My father is a doctor;
- Loc for locative arguments, e.g. Estoy en casa / I'm at home.

For predicative adjectives and nouns, which can be headed respectively by a copula or a support verb, the same set of functions is used. The argument of a preposition is considered as an Obj. Adverbs may have arguments with the syntactic function Obja (contrariamente a / contrarily to) or Objde (independientemente de / regardless of).

Possible realizations are threefold:
- clitic pronouns: cln for nominative clitics, e.g. yo / I, tú / you; cla for accusative clitics, e.g. lo / it, la / it or her; cld for dative clitics, e.g. le / him or her, te / you;
- direct phrases: sn for noun phrases, e.g. El vestido verde es muy caro / The green dress is very expensive; sa for adjectival phrases, e.g. La niña tiene unos ojos muy bonitos / The girl has very beautiful eyes; sadv for adverbial clauses, e.g. Has resuelto el problema perfectamente / You've solved the problem perfectly; scompl for completive clauses, e.g. Creo que está enfermo / I think he's sick; qcompl for interrogative clauses, e.g. Miguel me contó qué había ocurrido / Miguel told me what had happened;
- prepositional phrases: direct phrases introduced by any preposition, e.g. con-sn, de-sn, para-sinf, etc.
For verbs, the following five redistributions are available:
- %actif: a basic redistribution that does not change the initial subcategorization information;
- %passif: for the standard passive in por / by;
- %passif_impersonnel: for passive impersonal constructions with inverted subject, if any;
- %se_moyen: for modeling constructions such as este coche se vende bien / this car sells well on the basis of the underlying transitive construction for the same verb;
- %ppp_employé_comme_adj: for indicating that the participle of a verb can be used as an adjective and thus that an additional entry should be generated.

For adjectives, two redistributions are defined:
- %adj_impersonnel: when an adjective is the lexical head of an impersonal construction, e.g. Es difícil trabajar / It is difficult to work;
- %adj_personnel: for other cases.
5.6.1.3 Technical information

In [13], we evaluated the first version of the Leffe in terms of morphological and syntactic coverage. Regarding morphological coverage, we tested the Leffe in the context of a real application: a morphological pre-processor [8], developed by the COLE and LYS groups, which is able to benefit from an extensional lexicon. The corpus of raw text we used as input for these tests was obtained from Wikipedia Sources. It includes more than 4,322,000 words after clearing Wikipedia references and foreign expressions. The evaluation took into account how many words were not tagged by the pre-processor and thus remained unknown. It is worth noting that unknown words are the main cause of PoS-tagging errors; such problems can be tackled by relying on (very) large-coverage lexicons. As can be observed in Table 5.2, the Leffe managed to beat the largest existing and available Spanish morphological lexicon in this morphological pre-processing task. A minimal sketch of this unknown-word count is given below.
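This is an illustrative toy version only; the actual pre-processor of [8] performs tokenization, multi-word and proper-noun handling that are ignored here. It simply counts the tokens, and the distinct forms, that the extensional lexicon does not know.

```python
import re

def unknown_word_counts(text, lexicon):
    """Count tokens and unique forms of `text` absent from `lexicon`
    (a set of known extensional forms). Toy tokenization by word characters."""
    tokens = re.findall(r"\w+", text.lower())
    unknown = [t for t in tokens if t not in lexicon]
    return len(unknown), len(set(unknown))

lexicon = {"la", "casa", "es", "grande"}
print(unknown_word_counts("La casa es grande y bonita", lexicon))  # (2, 2)
```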
               Unknown words   Unique unknown words
USC Lexicon        70,026            25,888
Leffe              69,756            24,703

Table 5.2 – Morphological lexical coverage.

In order to estimate the syntactic coverage, we used the notion of expanded intensional entry [11], which can be seen as a defactorized Alexina entry where each subcategorization frame receives only one realization. Each expanded intensional entry describes one fully-specified syntactic behavior. For example, an intensional entry with two arguments and, for each argument, three possible realizations, will generate 3 * 3 = 9 expanded intensional entries.
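As a small worked example of this defactorization (an illustrative sketch; the real Alexina compiler works on its own entry syntax), the number of expanded intensional entries is the sum, over the subcategorization frames of an entry, of the product of the numbers of possible realizations of each argument.

```python
from math import prod

# Each frame maps an argument (syntactic function) to its possible realizations.
def expanded_entries(frames):
    """Number of fully-specified (expanded) entries generated by an
    intensional entry with the given subcategorization frames."""
    return sum(prod(len(reals) for reals in frame.values()) for frame in frames)

# Example: one frame, two arguments, three realizations each -> 3 * 3 = 9.
entry = [{"Suj": ["cln", "sn", "scompl"], "Obj": ["cla", "sn", "scompl"]}]
print(expanded_entries(entry))  # 9
```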
                   Leffe     ADESSE    SRG
Expanded entries   91,507    39,040    42,689

Table 5.3 – Syntactic lexical coverage.

In August 2010, the following technical properties have been computed over the Leffe.

              adjectives   common nouns   proper nouns   adverbs     verbs
Lemmas            28,553         71,005         54,347     4,102     8,027
Forms             96,513        152,738         54,335     4,099   411,460

Table 5.4 – Number of unique lemmas and forms described in the Leffe.

              adjectives   common nouns   proper nouns   adverbs       verbs
Intensional       28,557         71,021         55,140     4,116      15,306
Extensional       97,256        154,243         55,476     4,117   1,213,367

Table 5.5 – Number of intensional and extensional entries in the Leffe.

5.6.2 Additional results

As explained in the introduction, this section briefly describes the results of Victoria that are indirectly related to this PhD.
5.6.2.1 Leffga, léxico de formas flexionadas do galego

The Leffga has been created by adapting to the Alexina framework the Galician morphological lexicon developed by the CORGA (Corpus de Referencia do Galego Actual) project (http://corpus.cirp.es/corga/). The set of features used (subcategorization frames, realizations, redistributions, etc.) is the same as for the Leffe. However, since there was no syntactic lexical information available when building its first version, the Leffga can only be considered as a morphological lexicon at this point. In August 2010, the following technical properties have been computed.
          adjectives   common nouns   adverbs     verbs
Lemmas        16,478         33,622     1,912     6,889
Forms         53,976         71,612     1,918   377,858

Table 5.6 – Number of unique lemmas and forms described in the Leffga.

              adjectives   common nouns   adverbs     verbs
Intensional       16,478         33,622     1,912     6,889
Extensional       78,002         78,463     3,663   558,391

Table 5.7 – Number of intensional and extensional entries in the Leffga.

5.6.2.2 Morphological rules

Two morphological descriptions, of Spanish and Galician, have been developed together with the Leffe and the Leffga. Indeed, morphological rules are a key element of the compiling process that converts the intensional level of an Alexina-based lexicon to the extensional one. Their first versions have both been automatically extracted from existing morphological lexicons. So far, the morphological acquisition method described in Chapter 3 has not been applied to correct or extend them. Since its first version, the Spanish morphological rules have been manually corrected in order to better factorize the description. Surprisingly, such improvements have also allowed us to boost the number of entries in the extensional lexicon of the Leffe by 24%.
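To fix ideas about this compilation step, here is a minimal sketch of how an inflection class could expand an intensional entry into extensional forms. The class name, suffixes and tags are illustrative assumptions, not the actual Spanish rules of the Leffe.

```python
# Illustrative morphological class: pairs (suffix to add, morphological tag)
# applied to the stem obtained by stripping the canonical ending.
SPANISH_AR_CLASS = {
    "strip": "ar",
    "forms": [("o", "pres-1sg"), ("as", "pres-2sg"), ("a", "pres-3sg"),
              ("amos", "pres-1pl"), ("an", "pres-3pl")],
}

def compile_extensional(lemma, infl_class):
    """Expand one intensional (lemma-level) entry into extensional forms."""
    stem = lemma[: -len(infl_class["strip"])]
    return [(stem + suffix, tag) for suffix, tag in infl_class["forms"]]

print(compile_extensional("cantar", SPANISH_AR_CLASS))
# [('canto', 'pres-1sg'), ('cantas', 'pres-2sg'), ...]
```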
In August 2010, the following technical properties have been computed.

                       adj.   com. nouns   pro. nouns   adv.   verbs
Number of classes        36           78           10      1      98
Average forms/class    3.33         2.70          1.2      1   67.58

Table 5.8 – Number of Spanish morphological classes per open syntactic category and average number of Spanish forms generated by each class.

                       adj.   com. nouns   adv.   verbs
Number of classes        27           43      2      46
Average forms/class    5.11         3.51    1.5   81.65

Table 5.9 – Number of Galician morphological classes per open syntactic category and average number of Galician forms generated by each class.
5.6.2.3 SPMG, Spanish Meta-Grammar

SPMG is a Spanish meta-grammar built by taking as a starting point an existing French one named FRMG [5]. We can thus confirm the ease of transferring grammatical knowledge from French to Spanish. Nowadays, SPMG contains 214 classes organized in a hierarchical structure. Once combined with the Leffe, SPMG allows us to parse Spanish sentences and produce dependency trees. Since determining whether a sentence has been correctly parsed requires us to know which parses can be considered correct, evaluating the quality of a parser can only be achieved with a gold standard such as the one provided by the EASy [15] initiative for French. Unfortunately, there is no such gold standard for Spanish. However, a preliminary experiment reported in [7] has been performed to test several technical properties and evaluate, in a global manner, the potential of SPMG:
- Coverage: the number of sentences for which the parser is able to provide at least one parse covering the whole sentence.
- Ambiguity: the average ambiguity rate over all parsed sentences, which corresponds to the average of the ambiguity rates given to the forms of a parsed sentence. The ambiguity rate of a form is computed as the number (minus one) of vertices it receives in the dependency tree built. Ideally, each form should receive only one vertex, and the corresponding average ambiguity should therefore be zero. A small sketch of how these two measures could be computed is given below.
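The following is a hedged illustration of these two measures only; the exact bookkeeping of the FRMG/SPMG toolchain is not reproduced here. A parsed sentence is represented by the number of vertices each of its forms receives in the dependency structure.

```python
def coverage(has_full_parse):
    """Percentage of sentences for which at least one full-span parse exists."""
    return 100.0 * sum(has_full_parse) / len(has_full_parse)

def sentence_ambiguity(vertices_per_form):
    """Average ambiguity rate of a parsed sentence: each form contributes
    (number of vertices it receives) - 1; ideally zero."""
    return sum(v - 1 for v in vertices_per_form) / len(vertices_per_form)

def corpus_ambiguity(parsed_sentences):
    """Average of the per-sentence ambiguity rates over all parsed sentences."""
    return sum(sentence_ambiguity(s) for s in parsed_sentences) / len(parsed_sentences)

# toy example: two sentences, only the first one fully parsed
print(coverage([True, False]))                   # 50.0
print(corpus_ambiguity([[2, 1, 3], [1, 1]]))     # mean of 1.0 and 0.0 -> 0.5
```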
This evaluation has been performed over several small corpora:
- a manually built corpus composed of 123 sentences chosen to test a varied set of syntactic phenomena;
- a corpus of 194 sentences provided by a project named EUROTRA, which aimed at creating a translation system for the seven most used languages of the European Commission;
- a corpus of 282 sentences extracted from the Europarl 96 parallel corpus (http://www.statmt.org/europarl/);
- a corpus of 455 sentences extracted from the Europarl 97 parallel corpus;
- a corpus of 528 sentences extracted from the Corpus del Real Jardín Botánico, a corpus of scientific descriptions of plants;
- a corpus of 335 sentences extracted from the Corpus del Museo de Ciencias Naturales, a corpus of scientific descriptions of animals.

The results are summarized in Table 5.10. These results must be considered with great caution since, even if special care has been taken to complete the lexicon when necessary, many parsing errors might still be caused by the lexicon.
Corpus            Sentences   Coverage (%)   Avg. ambiguity
Corpus Propio           123            100              4.4
EUROTRA                 194            100             5.35
Europarl 96             282          68.71              5.4
Europarl 97             455          67.58              6.1
R. J. Botánico          528            100              4.6
M. C. Naturales         335            100              7.2

Table 5.10 – Coverage and average ambiguity obtained over the test corpora.

In addition, as explained in [7], long sentences were not processed, i.e., the results have probably been computed over rather short sentences.
5.6.3 Future work

Regarding morphological rules, we are willing to increase their coverage by applying the extension of the morphological acquisition method described in Chapter 3. Since adding new rules to a given class also allows us to boost the number of entries in the corresponding extensional lexicon, such future work could have a valuable effect. Concerning lexicons, new sessions of lexical corrections should be performed to improve both the morphological and syntactic lexical information within the Lefff and the Leffe, and the morphological lexical information within the Leffga. Regarding the Leffga, morphological information could also be extended with the information provided by the Galician lexicon of the Freeling initiative [2]. While the methods to correct lexicons are being updated, manual validation of the most frequent forms of the Leffe and the Leffga could also be considered. Regarding the grammars, we should consider adapting the Spanish meta-grammar into Galician, just as we adapted the French meta-grammar into Spanish. If a first meta-grammar for Galician is achieved, we could start completing the Leffga with syntactic lexical information.
Bibliographie

[1] Concepción Álvarez, Pilar Alvariño, Adelaida Gil, Teresa Romero, María Paula Santalla, and Susana Sotelo. Avalon, una gramática formal basada en corpus. In Procesamiento del Lenguaje Natural (Actas del XIV Congreso de la SEPLN), pages 132-139, Alicante, Spain, 1998.
[2] Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluis Padró, and Muntsa Padró. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48-55, 2006.
[3] Lionel Clément, Benoît Sagot, and Bernard Lang. Morphology based automatic acquisition of large-coverage lexica. In Proceedings of LREC'04, pages 1841-1844, May 2004.
[4] Daniël de Kok, Jianqiang Ma, and Gertjan van Noord. A generalized method for iterative error mining in parsing results. In Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks (GEAF 2009), pages 71-79, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[5] Éric De La Clergerie, Benoît Sagot, Lionel Nicolas, and Marie-Laure Guénot. FRMG : évolutions d'un analyseur syntaxique TAG du français. In Journées ATALA, 2009.
[6] José M. García-Miguel and Francisco J. Albertuz. Verbs, semantic classes and semantic roles in the ADESSE project. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbrücken, Germany, 2005.
[7] Daniel Fernández González. Cadena de Procesamiento Lingüístico para el Español. July 2010.
[8] Jorge Graña, Fco. Mario Barcala, and Jesús Vilares. Formal methods of tokenization for part-of-speech tagging. Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, 2002.
[9] Nancy Ide and Jean Véronis. Multext: Multilingual text tools and corpora. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, pages 588-592, Morristown, NJ, USA, 1994. Association for Computational Linguistics.
[10] Montserrat Marimon, Núria Bel, Sergio Espeja, and Natalia Seghezzi. The Spanish Resource Grammar: pre-processing strategy and lexical acquisition. In DeepLP '07: Proceedings of the Workshop on Deep Linguistic Processing, pages 105-111, Morristown, NJ, USA, 2007. Association for Computational Linguistics.
[11] Miguel A. Molinero, Benoît Sagot, and Lionel Nicolas. Building a morphological and syntactic lexicon by merging various linguistic resources. In Proceedings of NODALIDA'09, Odense, Denmark, 2009.
[12] Miguel A. Molinero, Benoît Sagot, and Lionel Nicolas. Construcción y extensión de un léxico morfológico y sintáctico para el español: el Leffe. In SEPLN 2009, 25th Edition of the Conference of the Spanish Society for Natural Language Processing, San Sebastián, Spain, September 2009.
[13] Miguel A. Molinero, Benoît Sagot, and Lionel Nicolas. A morphological and syntactic wide-coverage lexicon for Spanish: The Leffe. In Proceedings of the International Conference RANLP-2009, pages 264-269, Borovets, Bulgaria, September 2009. Association for Computational Linguistics.
[14] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Nieves Fernández Formoso, and Vanesa Vidal Castro. Creating and maintaining language resources: the main guidelines of the Victoria project. In Proceedings of the LRSLM Workshop of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[15] Tristan Vanrullen, Philippe Blache, and Jean-Marie Balfourier. Constraint-Based Parsing as an Efficient Solution: Results from the Parsing Evaluation Campaign EASy. In Proceedings of LREC 2006 (Language Resources and Evaluation), pages 165-170. LREC, 2006.
Chapitre 6 - Conclusion
6.1 Conclusion (English)

In this manuscript, the different research works achieved during this thesis are detailed. At first sight, these works may seem unrelated, but they are complementary parts of a project with the ambitious objective of enhancing the production of disambiguated data that can broadly benefit NLP. This objective is being achieved by developing techniques, tools and guidelines for enhancing the production of the linguistic resources required to perform parsing. The practical use of these developments has enabled us to create and/or improve linguistic resources for French, Spanish and Galician. In this manuscript, the achievements that have concentrated most of the efforts of this thesis have been detailed. First, I have detailed the motivations, objectives, grants and task-force of the Victoria project, and a set of more or less abstract guidelines to save efforts when creating or extending LRs. Among the exposed guidelines, special attention has been paid to demonstrating the interest of using two co-interacting LRs in order to correct one another, and an abstract approach to achieve it has been provided. Second, I introduced and detailed an approach that allows us to automatically compute, from raw corpora, a data-representative description of the morphology of concatenative languages. The approach takes advantage of several phenomena that are observable for many languages but are more easily exploited with concatenative morphologies. Among these phenomena, a frequency-related occurrence of the forms belonging to a same lemma, be it derived or inflected, is highlighted and intensively exploited in a way which had not been considered so far. The practical relevance of this approach has been demonstrated by its capacity to be applied to a varied set of concatenative languages with little or no expert intervention. For a first participation in an annual challenge dedicated to this task, the results obtained confirmed its potential. Third, I presented a semi-automatic method that brings together a set of techniques in order to simplify the correction and extension of a lexicon. Together, its several components allow us to detect missing, incomplete or erroneous entries in a lexicon and to generate relevant lexical corrections. This approach demonstrates the practical interest of using two co-interacting LRs to correct one another, by using, in our case, a grammar to correct a lexicon. The experiments achieved allowed us to correct a noticeable number of entries of a French lexicon in only a few hours of manual work. In addition, repeated applications of the method have the interesting effect of converting the input corpora into corpora representative of the shortcomings of the grammar. Finally, as secondary achievements that are directly or indirectly related to this thesis, I detailed:
- an enhanced interface for the edition of lexicons based on the Alexina framework,
- a set of Spanish morphological rules,
- a set of Galician morphological rules,
- a Spanish morpho-syntactic wide-coverage lexicon,
- a Galician morphological wide-coverage lexicon,
- a Spanish meta-grammar.

Even though this thesis started before the Victoria project itself, its main objective quickly became, as much as possible, the achievement of Victoria's objectives. Consequently, since Victoria has not yet fulfilled all of them, neither has this thesis. However, it did succeed in bringing an abstract idea to an explicit plan with strong bases. Given that most of the theoretical questions have been partially or fully answered, it is now a matter of time before Victoria reaches maturity.
6.2 Conclusion (Français)

Dans ce manuscrit sont décrits les différents travaux de recherche réalisés durant cette thèse. Ces travaux qui, à première vue, ne paraissent pas connectés, sont des parties complémentaires d'un projet dont l'objectif ambitieux est d'améliorer la génération de textes désambiguïsés. Cet objectif, qui bénéficie au TALN de façon globale, est réalisé en développant des techniques, outils et stratégies facilitant la production des ressources linguistiques nécessaires à l'exécution d'analyseurs syntaxiques. L'utilisation pratique de ces développements nous a permis de créer et/ou d'améliorer des ressources linguistiques pour le français, l'espagnol et le galicien. Dans ce manuscrit, je me suis concentré sur les réalisations qui ont représenté la quasi-totalité des efforts accomplis durant cette thèse. Premièrement, j'ai détaillé les motivations, objectifs, financements et ressources humaines du projet Victoria et un ensemble de stratégies plus ou moins abstraites pour réduire les efforts nécessaires à la création et l'extension de RLs. Parmi les stratégies exposées, une attention particulière a été portée sur l'intérêt d'utiliser deux ressources co-interactives afin de corriger l'une à partir de l'autre et une méthode abstraite mettant en œuvre cette idée a été détaillée. Deuxièmement, j'ai introduit une approche permettant d'obtenir automatiquement, à partir d'un corpus brut, une représentation des mécanismes concaténatifs de la morphologie d'une langue. Cette approche profite de phénomènes observables pour toute langue utilisant la flexion ou la dérivation, ces derniers se révélant plus simples à exploiter lorsqu'il s'agit de morphologie concaténative. Parmi ces phénomènes, une probabilité d'occurrence des formes liée à la fréquence de leur lemme est mise en avant et exploitée d'une façon jamais considérée auparavant. L'intérêt de cette approche a été démontré par sa capacité à être appliquée à un ensemble varié de langues avec peu ou pas de modification experte. Les résultats encourageants obtenus pour sa première participation à un challenge dédié à cette tâche confirment sa pertinence et son potentiel. Troisièmement, j'ai présenté une méthode semi-automatique combinant dans une chaîne d'outils un ensemble de sous-méthodes et ayant pour objectif de simplifier la correction et l'extension d'un lexique. Les différents maillons de cette chaîne permettent à la fois de détecter des entrées lexicales manquantes, incomplètes ou erronées et de générer des corrections pertinentes. Cette méthode implémente pleinement la méthode abstraite reposant sur l'idée d'utiliser deux ressources co-interactives afin de corriger une ressource à partir de l'autre ; en l'occurrence, elle s'appuie sur une grammaire pour corriger un lexique. Les expériences réalisées nous ont permis de corriger en peu de temps un nombre important d'entrées sur un lexique à large couverture du français. De plus, l'application itérative de cette méthode sur un même corpus présente l'effet secondaire fort intéressant de le convertir en un corpus représentatif des manques grammaticaux d'un analyseur syntaxique. Finalement, j'ai détaillé des réalisations secondaires directement ou indirectement liées au travail effectué durant ma thèse, à savoir :
- une interface pour l'édition simplifiée d'un lexique basé sur l'architecture Alexina,
- un ensemble de règles morphologiques pour l'espagnol,
- un ensemble de règles morphologiques pour le galicien,
- un lexique morpho-syntaxique à large couverture de l'espagnol,
- un lexique morphologique à large couverture du galicien,
- une méta-grammaire de l'espagnol.

Bien que cette thèse ait démarré avant le projet Victoria lui-même, son principal objectif est rapidement devenu la réalisation d'autant de sous-objectifs que possible. Par conséquent, puisque Victoria n'a pas encore atteint tous ses objectifs, cette thèse non plus. Cependant, elle a réussi à convertir une idée abstraite en un plan explicite avec des bases concrètes. Puisque la plupart des questions théoriques ont été partiellement ou pleinement résolues, ce n'est maintenant plus qu'une question de temps avant que Victoria arrive à maturité.
Related publications

The publications related to this PhD can be found under references [61, 62, 63, 66, 67, 68, 69, 70, 71, 72, 73, 74] and [75].
Bibliographie
[1] Ag. http://www.delph-in.net/index.php.
[2] Concepción Álvarez, Pilar Alvariño, Adelaida Gil, Teresa Romero, María Paula Santalla, and Susana Sotelo. Avalon, una gramática formal basada en corpus. In Procesamiento del Lenguaje Natural (Actas del XIV Congreso de la SEPLN), pages 132-139, Alicante, Spain, 1998.
[3] Marie-Hélène Antoni-Lay, Gil Francopoulo, and Laurence Zaysser. A Generic Model for Reuseable Lexicons: The Genelex Project. Literary and Linguistic Computing, 9(1):47-54, 1994.
[4] Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluis Padró, and Muntsa Padró. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 48-55, 2006.
[5] Petra Barg and Markus Walther. Processing unknown words in HPSG. In ACL-36: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 91-95, Morristown, NJ, USA, 1998. Association for Computational Linguistics.
[6] Delphine Bernhard. Simple morpheme labelling in unsupervised morpheme analysis. pages 873-880, 2008.
[7] Delphine Bernhard. MorphoNet: Exploring the use of community structure for unsupervised morpheme analysis. In Multilingual Information Access Evaluation Vol. I, 10th Workshop of the Cross-Language Evaluation Forum, CLEF 2009, Revised Selected Papers. Springer, 2010. To appear.
[8] Pierre Boullier and Benoît Sagot. Efficient and robust LFG parsing: SxLfg. In Proceedings of IWPT'05, 2005.
[9] Pierre Boullier and Benoît Sagot. Efficient parsing of large corpora with a deep LFG parser. In Proceedings of LREC'06, 2006.
[10] Michael R. Brent. From grammar to lexicon: Unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243-262, 1993.
[11] Ted Briscoe and John Carroll. Automatic extraction of subcategorization from corpora. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 356-363, Morristown, NJ, USA, 1997. Association for Computational Linguistics.
[12] Nicoletta Calzolari and Claudia Soria. Preparing the field for an open resource infrastructure: the role of the FLaReNet network of excellence. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[13] Burcu Can and Suresh Manandhar. Clustering morphological paradigms using syntactic categories. In CLEF, pages 641-648, 2009.
[14] Marie-Hélène Candito. Organisation modulaire et paramétrable de grammaires électroniques lexicalisées. PhD thesis, Univ. of Paris 7, 1999.
[15] Paula Chesley and Susanne Salmon-Alt. Automatic extraction of subcategorization frames for French. In Proceedings of the Language Resources and Evaluation Conference, LREC 2006, 2006.
[16] Kostadin Cholakov, Valia Kordoni, and Yi Zhang. Towards domain-independent deep linguistic processing: Ensuring portability and reusability of lexicalised grammars. In Coling 2008: Proceedings of the Workshop on Grammar Engineering Across Frameworks, pages 57-64, Manchester, England, August 2008. Coling 2008 Organizing Committee.
[17] Kostadin Cholakov and Gertjan van Noord. Acquisition of unknown word paradigms for large-scale grammars. In Coling 2010: Posters, pages 153-161, Beijing, China, August 2010. Coling 2010 Organizing Committee.
[18] Kostadin Cholakov and Gertjan van Noord. Using unknown word techniques to learn known words. In EMNLP 2010, October 2010.
[19] Lionel Clément, Benoît Sagot, and Bernard Lang. Morphology based automatic acquisition of large-coverage lexica. In Proceedings of LREC'04, pages 1841-1844, May 2004.
[20] Ann Copestake and Dan Flickinger. An open-source grammar development environment and broad-coverage English grammar using HPSG. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, 2000.
[21] Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Helsinki University of Technology, 2005.
[22] Laurence Danlos and Benoît Sagot. Constructions pronominales dans Dicovalence et le lexique-grammaire. In Proceedings of the 27th Lexicon-
independent morphological segmentation. In NAACL HLT 2007: Proceedings of the Main Conference, pages 155-163, 2007.
[24] Daniël de Kok, Jianqiang Ma, and Gertjan van Noord. A generalized method for iterative error mining in parsing results. In Proceedings of the 2009 Workshop on Grammar Engineering Across Frameworks (GEAF 2009), pages 71-79, Suntec, Singapore, August 2009. Association for Computational Linguistics.
[25] Éric De La Clergerie, Benoît Sagot, Lionel Nicolas, and Marie-Laure Guénot. FRMG : évolutions d'un analyseur syntaxique TAG du français. In Journées ATALA, 2009.
[26] Éric Villemonte de la Clergerie. From metagrammars to factorized TAG/TIG parsers. In Parsing '05: Proceedings of the Ninth International Workshop on Parsing Technology, pages 190-191, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
[27] Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In NeMLaP3/CoNLL '98: Proceedings of the Joint Conferences on New Methods in Language Processing and Computational Natural Language Learning, pages 295-298, Sydney, Australia, 1998. The Association for Computational Linguistics.
[28] Delphin. http://www.delph-in.net/index.php.
[29] Vera Demberg. A language-independent unsupervised model for morphological segmentation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 920-927, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[32] Tomaž Erjavec. MULTEXT-East version 4: multilingual morphosyntactic specifications, lexicons and corpora. In LREC 2010: Proceedings of the Seventh International Conference on Language Resources and Evaluation, 2010.
[33] Roger Evans and Gerald Gazdar. DATR: a language for lexical knowledge representation. Computational Linguistics, 22(2):167-216, 1996.
[34] Frederik Fouvry. Lexicon acquisition with a large-coverage unification-based grammar. In EACL, pages 87-90, 2003.
[35] Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, and Claudia Soria. Lexical Markup Framework (LMF). In Proceedings of LREC 2006, Genoa, Italy, 2006.
[36] José M. García-Miguel and Francisco J. Albertuz. Verbs, semantic classes and semantic roles in the ADESSE project. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbrücken, Germany, 2005.
[37] John Goldsmith. An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4):353-371, 2006.
[38] Bruno Golenia, Sebastian Spiegler, and Peter Flach. UNGRADE: Unsupervised graph decomposition. In Working Notes for the CLEF 2009 Workshop, Corfu, Greece, September 2009.
[39] Daniel Fernández González. Cadena de Procesamiento Lingüístico para el Español. July 2010.
[40] Jorge Graña, Fco. Mario Barcala, and Jesús Vilares. Formal methods of tokenization for part-of-speech tagging. Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science, 2002.
[41] Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10(11-12):371-385, 1974.
[42] Zellig S. Harris. From phoneme to morpheme. Language, 31(2):190-222, 1955.
[43] Nabil Hathout. Acquisition of the morphological structure of the lexicon based on lexical similarity and formal analogy. In TextGraphs '08: Proceedings of the 3rd Textgraphs Workshop on Graph-Based Algorithms for Natural Language Processing, pages 1-8, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[44] Nancy Ide and Jean Véronis. Multext: Multilingual text tools and corpora. In Proceedings of the 15th Conference on Computational Linguistics - Volume 1, pages 588-592, Morristown, NJ, USA, 1994. Association for Computational Linguistics.
[47] Samarth Keshava. A simpler, intuitive approach to morpheme induction. In PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, pages 31-35, 2006.
[48] Oskar Kohonen, Sami Virpioja, and Mikaela Klami. Allomorfessor: towards unsupervised morpheme analysis. In CLEF'08: Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, pages 975-982, Berlin, Heidelberg, 2009. Springer-Verlag.
[49] Anna Korhonen, Genevieve Gorrell, and Diana McCarthy. Statistical filtering and subcategorization frame acquisition. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 199-206, Morristown, NJ, USA, 2000. Association for Computational Linguistics.
[50] Anna Korhonen and Yuval Krymolowski. On the robustness of entropy-based similarity measures in evaluation of subcategorization acquisition systems. In COLING-02: Proceedings of the 6th Conference on Natural Language Learning, pages 1-7, Morristown, NJ, USA, 2002. Association for Computational Linguistics.
[51] Anna Korhonen, Yuval Krymolowski, and Zvika Marx. Clustering polysemic subcategorization frame distributions semantically. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 64-71, Sapporo, Japan, July 2003. Association for Computational Linguistics.
[52] Kimmo Koskenniemi. Two-level model for morphological analysis. In IJCAI-83, pages 683-685, Karlsruhe, Germany, 1983.
[53] Mikko Kurimo, Ville Turunen, and Matti Varjokallio. Overview of Morpho Challenge 2008. In CLEF'08: Proceedings of the 9th Cross-Language Evaluation Forum Conference on Evaluating Systems for Multilingual and Multimodal Information Access, pages 951-966, Berlin, Heidelberg, 2009. Springer-Verlag.
[54] Jean-François Lavallée and Philippe Langlais. Unsupervised morphology acquisition by formal analogy. In Lecture Notes in Computer Science, 8 pages, 2010.
[55] Constantine Lignos, Erwin Chan, Mitchell P. Marcus, and Charles Yang. A rule-based acquisition model adapted for morphological analysis. In CLEF, pages 658-665, 2009.
[56] Christopher D. Manning. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pages 235-242, Columbus, Ohio, USA, June 1993. Association for Computational Linguistics.
[57] Montserrat Marimon, Núria Bel, Sergio Espeja, and Natalia Seghezzi. The Spanish Resource Grammar: pre-processing strategy and lexical acquisition. In DeepLP '07: Proceedings of the Workshop on Deep Linguistic Processing, pages 105-111, Morristown, NJ, USA, 2007. Association for Computational Linguistics.
[58] Cédric Messiant. A subcategorization acquisition system for French verbs. In Proceedings of the ACL-08: HLT Student Research Workshop, pages 55-60, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[61] Miguel A. Molinero, Benoît Sagot, and Lionel Nicolas. Building a morphological and syntactic lexicon by merging various linguistic resources. In Proceedings of NODALIDA'09, Odense, Denmark, 2009.
[62] Miguel A. Molinero, Benoît Sagot, and Lionel Nicolas. Construcción y extensión de un léxico morfológico y sintáctico para el español: el Leffe. In SEPLN 2009, 25th Edition of the Conference of the Spanish Society for Natural Language Processing, San Sebastián, Spain, September 2009.
[63] Miguel A. Molinero, Benoît Sagot, and Lionel Nicolas. A morphological and syntactic wide-coverage lexicon for Spanish: The Leffe. In Proceedings of the International Conference RANLP-2009, pages 264-269, Borovets, Bulgaria, September 2009. Association for Computational Linguistics.
[64] Christian Monson, Kristy Hollingshead, and Brian Roark. Simulating morphological analyzers with stochastic taggers for confidence estimation. In CLEF, pages 649-657, 2009.
[65] Christian Monson, Alon Lavie, Jaime Carbonell, and Lori Levin. Evaluating an agglutinative segmentation model for ParaMor. In SigMorPhon '08: Proceedings of the Tenth Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pages 49-58, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[66] Lionel Nicolas, Jacques Farré, and Miguel A. Molinero. Unsupervised learning of concatenative morphology based on frequency-related form occurrence. In Proceedings of the PASCAL Challenge Workshop on Unsupervised Segmentation of Words into Morphemes, Helsinki, Finland, September 2010.
[67] Lionel Nicolas, Jacques Farré, and Éric De La Clergerie. Confondre le coupable : corrections d'un lexique suggérées par une grammaire. In Proceedings of TALN'07, Toulouse, France, June 2007.
[68] Lionel Nicolas, Jacques Farré, and Éric De La Clergerie. Mining Parsing Results for Lexical Corrections. In Human Language Technologies as a Challenge for Computer Science and Linguistics: 3rd Language & Technology Conference, Poznań, Poland, October 2007. Wydawnictwo Poznańskie Sp. z o.o.
[69] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Nieves Fernández Formoso, and Vanesa Vidal Castro. Creating and maintaining language resources: the main guidelines of the Victoria project. In Proceedings of the LRSLM Workshop of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[70] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Elena Trigo, Éric De la Clergerie, Miguel Pardo, Jacques Farré, and Joan Miquel Vergés. Producción eficiente de recursos lingüísticos: el proyecto Victoria. In SEPLN 2009, 25th Edition of the Conference of the Spanish Society for Natural Language Processing, San Sebastián, Spain, September 2009.
[71] Lionel Nicolas, Miguel A. Molinero, Benoît Sagot, Elena Trigo, Éric De la Clergerie, Miguel Pardo, Jacques Farré, and Joan Miquel Vergés. Towards efficient production of linguistic resources: the Victoria project. In Proceedings of the International Conference RANLP-2009, pages 318-323, Borovets, Bulgaria, September 2009. Association for Computational Linguistics.
[72] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Mining parsing results for lexical correction: toward a complete correction process of wide-coverage lexicons. pages 178-191, 2009.
[73] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Computer aided correction and extension of a syntactic wide-coverage lexicon. In COLING '08: Proceedings of the 22nd International Conference on Computational Linguistics, pages 633-640, Manchester, United Kingdom, August 2008. Association for Computational Linguistics.
[74] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Extensión y corrección semi-automática de léxicos morfo-sintácticos. In SEPLN 2008, 24th Edition of the Conference of the Spanish Society for Natural Language Processing, Madrid, Spain, September 2008.
[75] Lionel Nicolas, Benoît Sagot, Miguel A. Molinero, Jacques Farré, and Éric de La Clergerie. Trouver et confondre les coupables : un processus sophistiqué de correction de lexique. In Traitement Automatique des Langues Naturelles (TALN 2009), Senlis, France, June 2009.
[76] Parole.
[77] Patrick Paroubek, Anne Vilnat, Sylvain Loiseau, Olivier Hamon, Gil Francopoulo, and Éric Villemonte de la Clergerie. Large scale production of syntactic annotations to move forward. In CrossParser '08: Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, pages 36-43, Morristown, NJ, USA, 2008. Association for Computational Linguistics.
[78] Passage. http://atoll.inria.fr/passage/.
[79] Hoifung Poon, Colin Cherry, and Kristina Toutanova. Unsupervised morphological segmentation with log-linear models. In HLT-NAACL, pages 209-217, 2009.
[80] Portlet. http://en.wikipedia.org/wiki/Portlet.
[81] Judita Preiss, Ted Briscoe, and Anna Korhonen. A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 912-919, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
[82] Benoît Sagot. Automatic acquisition of a Slovak lexicon from a raw corpus. In Lecture Notes in Artificial Intelligence 3658 (Springer-Verlag), Proceedings of TSD'05, pages 156-163, Karlovy Vary, Czech Republic, September 2005.
[83] Benoît Sagot. The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
[84] Benoît Sagot. The Lefff, a freely available, accurate and large-coverage lexicon for French. In 7th International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010.
[85] Benoît Sagot, Lionel Clément, Éric Villemonte de La Clergerie, and Pierre Boullier. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of LREC'06, Genoa, Italy, 2006.
[86] Benoît Sagot and Laurence Danlos. Méthodologie lexicographique de constitution d'un lexique syntaxique de référence pour le français. In Proceedings of the workshop Lexicographie et informatique : bilan et perspectives, Nancy, France, 2008.
[87] Benoît Sagot and Éric de La Clergerie. Fouille d'erreurs sur des sorties d'analyseurs syntaxiques. Traitement Automatique des Langues, 49(1), 2008.
[88] Benoît Sagot and Éric Villemonte de La Clergerie. Error mining in parsing results. In Proceedings of ACL/COLING'06, pages 329-336, Sydney, Australia, 2006.
[89] Benjamin Snyder and Regina Barzilay. Unsupervised multilingual learning for morphological segmentation. In Proceedings of ACL-08: HLT, pages 737-745, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[90] Sebastian Spiegler, Bruno Golenia, and Peter Flach. Unsupervised Word Decomposition with the Promodes Algorithm, volume I. Springer Verlag, February 2010.
[91] Sebastian Spiegler and Christian Monson. EMMA: A novel evaluation metric for morphological analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), August 2010.
[92] Nicolas Stroppa and François Yvon. An analogical learner for morphological analysis. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pages 120-127, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics.
[93] François Thomasset and Éric Villemonte de La Clergerie. Comment obtenir plus des méta-grammaires. In Proceedings of TALN'05, 2005.
[94] Multext-East. http://nl.ijs.si/ME/.
[95] Tim van de Cruys. Automatically extending the lexicon for parsing. In Proceedings of the eleventh ESSLLI student session, 2006.
[96] Gertjan van Noord. Error mining for wide-coverage grammar engineering. In Proceedings of ACL 2004, Barcelona, Spain, 2004.
[97] Edward Vanhoutte. An Introduction to the TEI and the TEI Consortium. Literary and Linguistic Computing, 19(1):9-16, 2004.
[98] TEI. http://www.tei-c.org/index.xml.
[99] Tristan Vanrullen, Philippe Blache, and Jean-Marie Balfourier. Constraint-Based Parsing as an Efficient Solution: Results from the Parsing Evaluation Campaign EASy. In Proceedings of LREC 2006 (Language Resources and Evaluation), pages 165-170. LREC, 2006.
[100] Tamás Váradi, Steven Krauwer, Peter Wittenburg, Martin Wynne, and Kimmo Koskenniemi. CLARIN: Common language resources and technology infrastructure. In European Language Resources Association (ELRA), editor, Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, May 2008.
[103] Éric Villemonte de La Clergerie. DyALog: a tabular logic programming based environment for NLP. In Proceedings of the 2nd International Workshop on Constraint Solving and Language Processing (CSLP'05), Barcelona, Spain, October 2005.
[104] Yi Zhang and Valia Kordoni. Automated deep lexical acquisition for robust open texts processing. In Proceedings of LREC-2006, pages 275–280, Genoa, Italy, 2006.
[105] George K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
Annexe A - Samples of morphological families acquired by MorphAcq
Figure A.6 – Sample of Turkish prefix families acquired
Annexe B - Terminology
An affix is a substring used to create new lexical forms by combining it with a stem at its beginning (prefix), in its middle (infix) or at its end (suffix). For example, the English form thinking contains the suffix ing, whereas the English form grandmother contains the prefix grand. In this document, most approaches regarding morphology do not concern infixes. The term affix is therefore used as a shortcut for prefix and suffix.
A canonical form of a lemma is a lexical form consensually chosen as the representative of the lemma. For example, for verbs, the infinitive is usually used as the canonical form.
A closed-class form or lemma, also called a functional lemma, belongs to a syntactic class which, in contrast to open classes, covers a known and well-defined set of lemmas, such as pronouns, clitics, conjunctions, determiners, etc.
Cognates are words that have a common etymological origin, e.g. publication in English corresponds to publicación in Spanish. The meaning of cognates may differ as languages develop separately, and cognates may eventually become false friends.
A classifier is a trained tool that maps observations about an item to conclusions about the item's target value. During training, it builds a model, often based on decision trees or entropy measures, that predicts the value of a target variable from several input variables.
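As an illustration, the following sketch trains such a classifier; it assumes the scikit-learn library, and the features, forms and labels are invented for the purpose of the example.

```python
# A minimal sketch of a classifier, assuming scikit-learn is available.
# Each word form is described by two input variables (does it end in "ing"?,
# its length) and the target value is whether it is a verb form.
from sklearn.tree import DecisionTreeClassifier

def features(form):
    return [int(form.endswith("ing")), len(form)]

training_forms = ["thinking", "managing", "grandmother", "table", "eating"]
training_labels = [1, 1, 0, 0, 1]  # 1 = verb form, 0 = not a verb form

X = [features(f) for f in training_forms]
clf = DecisionTreeClassifier().fit(X, training_labels)

# Predict the target value of unseen items from their observations.
print(clf.predict([features("walking"), features("chair")]))
```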
A lexical form, also called a word form or simply a form for short, is a technical term designating a word. All forms belong to a lemma.
A grammar, in NLP, is a linguistic resource that details the syntactic structures of a given language. It thus describes how forms can be combined to form sentences.
HPSG stands for Head-driven Phrase Structure Grammar. It is a highly lexicalized, non-derivational generative grammar formalism. It is based on lexicalism in the sense that the lexicon is more than just a list of entries; it is itself richly structured. Individual entries are marked with types, and types form a hierarchy.
An infix is an affix which is added inside a root morpheme in the formation of a word. It contrasts with affixes attached to the outside of a stem, such as a prefix or a suffix. In a language like English, infixes do not occur since the root morpheme is indivisible.
A lemma is a set of related lexical forms represented by a canonical form. For example, for a verb, the set of its conjugated forms constitutes a lemma. When the term word paradigm is used to designate a set of related forms, the term lemma is usually used as a synonym for canonical form.
A letter tree, in this document, is a data structure used to represent various lexical forms and to determine which forms share a substring. This structure is composed of nodes and of transitions between nodes, each transition being labeled by a letter. Lexical forms are inserted letter by letter, i.e., the letters of a lexical form label a path of transitions from the root node.
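The following sketch illustrates this structure in Python; the class and method names are illustrative and do not correspond to the actual implementation used in this work.

```python
# A minimal sketch of a letter tree (trie): transitions are labeled by letters
# and every inserted lexical form labels a path starting at the root node.
class LetterTree:
    def __init__(self):
        self.transitions = {}    # letter -> child node
        self.is_form_end = False

    def insert(self, form):
        node = self
        for letter in form:
            node = node.transitions.setdefault(letter, LetterTree())
        node.is_form_end = True

    def forms_with_prefix(self, prefix):
        """Return all inserted forms sharing the given initial substring."""
        node = self
        for letter in prefix:
            if letter not in node.transitions:
                return []
            node = node.transitions[letter]
        found, stack = [], [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.is_form_end:
                found.append(text)
            for letter, child in current.transitions.items():
                stack.append((child, text + letter))
        return found

tree = LetterTree()
for form in ["manage", "managing", "managed", "thinking"]:
    tree.insert(form)
print(tree.forms_with_prefix("manag"))  # manage, managing, managed (order may vary)
```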
A lexeme represents a minimal meaningful unit of language, i.e., a given meaning of a lexical form.
Lexicology studies words: their nature, their elements, their relations to one another and their meanings. It focuses on words and their characteristics.
A lexicon is a linguistic resource that inventories the lexical forms of a given language and associates them with morphological, syntactic or even semantic information.
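As a purely illustrative sketch, a lexicon can be pictured as a mapping from lexical forms to such information; the entries and field names below are invented and do not reflect any particular lexicon format.

```python
# A minimal sketch of a morpho-syntactic lexicon as an in-memory mapping from
# lexical forms to their lemma and morpho-syntactic information (illustrative).
lexicon = {
    "thinking": {"lemma": "think", "pos": "verb", "features": "present participle"},
    "thinks":   {"lemma": "think", "pos": "verb", "features": "3rd person singular present"},
    "table":    {"lemma": "table", "pos": "noun", "features": "singular"},
}

def lookup(form):
    """Return the information associated with a lexical form, if any."""
    return lexicon.get(form)

print(lookup("thinking"))
```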
A linguistic level, also called a linguistic aspect, designates a set of closely related linguistic characteristics. For example, morphology is a linguistic level that covers the linguistic characteristics used to form words. Languages are usually described according to several linguistic levels, usually listed as phonology, morphology, lexicology, syntax, semantics and pragmatics.
A linguistic resource is a digital database that describes a piece of linguistic knowledge. It usually focuses on a single linguistic aspect of a given language.
A morpheme is a substring of a form that holds part of the meaning. The global meaning of a lexical form is thus subdivided among the morphemes it contains. They can be either stems or affixes. For example, the English lexical form thinking contains two morphemes, one stem think and one affix ing.
Morphological derivation describes the process of generating a lemma from another one.
A morphological family is a set of morphological rules that allows building all the lexical forms related to a given lemma.
Morphological inflection describes the process of generating a lexical form of a lemma.
Morphological rules are a linguistic resource that describes how sets of forms are related within a lemma. For example, for a verb, the set of rules describing how to generate its conjugations are morphological rules.
A morphological rule allows building or deriving a form of a given lemma. It is usually described as string operations that combine the stem or the canonical form of the lemma with affixes. For example, some morphological rules allow building the conjugation of some verbs for a given gender, person and tense, whereas others allow building the plural of nouns.
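The following sketch illustrates this view of morphological rules as string operations; the (strip suffix, add suffix) representation and the rules themselves are invented for the example and are not the format used in this work.

```python
# A minimal sketch of morphological rules as string operations that combine a
# canonical form with affixes to build inflected forms (illustrative rules).
rules = {
    "present participle":  ("e", "ing"),  # manage -> managing
    "past":                ("e", "ed"),   # manage -> managed
    "3rd person singular": ("", "s"),     # manage -> manages
}

def apply_rule(canonical_form, rule):
    strip, add = rule
    stem = canonical_form[:-len(strip)] if strip and canonical_form.endswith(strip) else canonical_form
    return stem + add

for tag, rule in rules.items():
    print(tag, "->", apply_rule("manage", rule))
```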
Morphology studies patterns of word formation and attempts to formulate rules modeling them. It focuses on the way phonemes and syllables are combined to form words.
An open-class form or lemma belongs to a syntactic class which, in contrast to closed classes, covers an open-ended set of lemmas, since such classes constantly acquire new members as languages evolve. For example, in English, open classes are common nouns, proper nouns, adjectives, adverbs and verbs.
Parsers are NLP tools that perform parsing. Contrary to statistical parsers, which are trained on examples, symbolic parsers rely on explicit rules.
Parsing, also called syntactic analysis, is an NLP task that consists in checking, with respect to a given formal grammar, that the syntax of a text is correct, and in building a data representation of its grammatical structure.
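As an illustration, the following sketch parses a short sentence with respect to a toy formal grammar; it assumes the NLTK toolkit is available, and the grammar and sentence are invented for the example.

```python
# A minimal sketch of parsing with respect to a small formal grammar,
# assuming the NLTK toolkit; the grammar covers only this toy sentence.
import nltk

grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N
  VP  -> V NP
  Det -> 'the'
  N   -> 'subject' | 'object'
  V   -> 'eats'
""")

parser = nltk.ChartParser(grammar)
sentence = "the subject eats the object".split()
for tree in parser.parse(sentence):
    print(tree)  # data representation of the grammatical structure
```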
PCFG stands for probabilistic context-free grammar. It is a context-free grammar in which each production is augmented with a probability. The probability of a parse is then computed from the probabilities of the productions used in that parse.
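The following worked example illustrates this computation: the probability of a parse is simply the product of the probabilities of the productions it uses. The productions and their probabilities are invented for the example.

```python
# A minimal worked example of a PCFG parse probability: the product of the
# probabilities of the productions used in the parse (illustrative values).
from math import prod

productions_used = [
    ("S -> NP VP",      1.0),
    ("NP -> Det N",     0.7),
    ("VP -> V NP",      0.5),
    ("Det -> 'the'",    0.6),
    ("N -> 'subject'",  0.3),
    ("V -> 'eats'",     0.2),
    ("Det -> 'the'",    0.6),
    ("N -> 'object'",   0.4),
]

parse_probability = prod(p for _, p in productions_used)
print(parse_probability)
```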
Phonology studies the systematic use of sound to encode meaning in any spoken human language. It focuses on the way different sounds function within a given language or across languages to encode meaning.
Pragmatics studies how the transmission of meaning depends not only on the linguistic knowledge of the speaker and listener, but also on the context of the utterance, knowledge about the status of those involved, the inferred intent of the speaker, and so on. Pragmatics focuses on the way language users are able to overcome apparent ambiguity, since meaning relies on the manner, place, time, etc. of an utterance.
The realizations of a given subcategorization frame denote the different types of syntactic structure that can match the subcategorization frame. For example, the frame specifying that the verb eat receives an object can take a noun phrase as its realization, as in the subject eats the object.
A sandhi phenomenon is a modification of the form or sound of a word under the influence of an adjacent word or morpheme.
Semantics studies the meanings of words within particular circumstances and contexts. It focuses on the way words can be combined to form semantically coherent sentences.
A stem of a form is the substring related to the lemma. It thus holds the greatest part of the semantic meaning, e.g., the stem of the English lexical form thinking is think. It is important to note that, in this document, the stem of a form is considered as the largest substring shared by all the related lexical forms of a given lemma, e.g. the lexical forms manage, managing and managed share the largest common substring, and thus the stem, manag.
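For a suffixing language, this largest shared substring can be computed as the longest common prefix of the forms, as in the following illustrative sketch.

```python
# A minimal sketch of computing the stem as defined above for a suffixing
# language: the largest initial substring shared by all forms of a lemma.
import os

def stem(forms):
    return os.path.commonprefix(forms)

print(stem(["manage", "managing", "managed"]))  # manag
```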
Subcategorization frames specify the number and types of arguments of a form. For instance, a mono-transitive verb, like eat, sub-categorizes a subject and an object, e.g. the subject eats the object. A ditransitive verb, like give, sub-categorizes a subject, a direct object and an indirect object, e.g. the subject gives the direct object to the indirect object. Since they denote what arguments a form can have, subcategorization frames are essential to lexicalized grammars because they allow them to discard many incorrect parses.
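The following sketch illustrates how such frames can be represented as data and used to discard parses whose arguments do not match; the representation is deliberately naive and the frames are invented for the example.

```python
# A minimal sketch of subcategorization frames as data: each verb is mapped to
# the list of arguments it sub-categorizes (illustrative frames).
subcat_frames = {
    "eat":   ["subject", "object"],
    "give":  ["subject", "direct object", "indirect object"],
    "sleep": ["subject"],
}

def compatible(verb, arguments):
    """A lexicalized grammar can discard a parse whose arguments
    do not match the frame of the verb."""
    return subcat_frames.get(verb) == arguments

print(compatible("eat", ["subject", "object"]))   # True
print(compatible("give", ["subject", "object"]))  # False -> parse discarded
```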
Syntax studies the principles and rules for constructing sentences in natural languages. It focuses on the way words can be combined to form syntactically correct sentences.
POS tagging is the process of assigning to each form of a text a descriptive part-of-speech (POS) tag, based on both its definition and its context. A simplified form of this task is the identification of words as nouns, verbs, adjectives, adverbs, etc.
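The following sketch illustrates, in a deliberately naive way, how both the definition of a form (here, a lexicon lookup) and its context (here, the previous tag) can be used to choose a tag; the lexicon and the single context rule are invented for the example and do not constitute a realistic tagger.

```python
# A naive sketch of POS tagging: candidate tags come from a small lexicon and
# a single context rule disambiguates ambiguous forms (purely illustrative).
LEXICON = {
    "the":   {"determiner"},
    "dog":   {"noun"},
    "barks": {"noun", "verb"},   # ambiguous form
}

def tag(tokens):
    tagged, previous = [], None
    for token in tokens:
        candidates = LEXICON.get(token, {"noun"})  # unknown forms default to noun
        if len(candidates) == 1:
            choice = next(iter(candidates))
        elif previous == "noun" and "verb" in candidates:
            choice = "verb"  # context rule: after a noun, prefer the verb reading
        else:
            choice = sorted(candidates)[0]
        tagged.append((token, choice))
        previous = choice
    return tagged

print(tag("the dog barks".split()))
# [('the', 'determiner'), ('dog', 'noun'), ('barks', 'verb')]
```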