Automated Generation of Derivative Relations in the Wordnet Expansion Perspective

Maciej Piasecki, Radosław Ramocki, Marek Maziarz
Wrocław University of Technology, Poland
{maciej.piasecki,radoslaw.ramocki,marek.maziarz}@pwr.wroc.pl

Abstract

We present a machine learning approach to the generation of derivative relations. Instances of derivative relations described in a wordnet are used in the bootstrapping approach to build an analyser of derivational relations. plWordNet derivational relations are presented and the planned semi-automatic wordnet expansion with derivational relations is discussed. Limits to which form-based markers can encode semantic distinctions are analysed and a model of semantic post-filtering of the generated derivational relation instances is presented.

1

Introduction

Derivational relations occur in many languages and mostly encode certain lexico-semantic relations, e.g. diminutives. They are present in many wordnets. Due to their regular character and productivity, there have been attempts to automate wordnet expansion with respect to derivational relations. In existing approaches, missing derivational links are automatically generated by extended morphological analysers, which, however, are based on handcrafted rules. For many languages, including Polish, tools of this type and coverage do not exist. Our idea is to remove this external dependency and to enclose the automated expansion of the wordnet derivational part in a kind of bootstrapping scheme: start with a handful of instances of derivational relations added manually to a wordnet, train a generator of derivatives and use it to boost the wordnet expansion. The system should be open to different relations and trained on wordnet data. It should not only generate pairs of words associated by a formal derivative relation but should also identify the semantic relation that is expressed by a derivational pair.

2

Derivation, Automation and Wordnets

Two problems need to be solved in order to fully automate wordnet expansion with derivational relations: a method must be defined and semi-automated construction of a generator of derivational links must be provided.

2.1

Automated wordnet expansion

A large scale process was run for Czech WordNet (Pala and Hlaváčková, 2007). 10 “main regular derivational relations” were covered. Relation instances were generated on the basis of the handcrafted description of the inflection derivational paradigms implemented in an expanded version of the morphological analyser. The results were manually corrected and non-derivational lemmas were manually deleted. As the applied analyser does not support “changes in stem” (alternations), appropriate modifications had to be introduced manually. Derivational relations link lemmas (“literals”) in Czech WordNet, i.e. if a lemma has more than one lexical unit (LU), the lexico-semantic derivational relation is implicitly extended to all LU pairs pertaining to it. This is in contrast to the semantic ambiguity of suffixes noted in (Fellbaum et al., 2009). The idea of a semi-automated expansion of the wordnet derivational relations was also discussed in (Koeva et al., 2008) for Serbian and Bulgarian. In both wordnets derivational relations link synsets, i.e. lemma pairs are expanded to the synsets to which they belong. So the semantics of these relations seems to be expanded beyond the associations encoded by derivational pairs of LUs. Our approach, presented in Sec. 3, lies in between the two options: a relation on lemmas vs a relation on synsets.

2.2

Generating derivative relations

Works dedicated to derivational morphology learning are relatively rare, in contrast to general methods of morphology learning. Roughly, two groups of methods can be distinguished (Walther and Nicolas, 2011): those aimed at the automated construction of morphological analysers and those aimed at the extraction of morphological models (e.g. segmentation and rules). As we need an analyser of derivatives, we focus on the methods of the first group. The vast majority of them are based on unsupervised learning from large annotated corpora. Combinations of different methods of statistical analysis are used in order to identify affixes, stems and word form families, e.g. (Schone and Jurafsky, 2001). The Minimum Description Length concept is often used in discovering segmentation, e.g. (Kohonen et al., 2009), cf. the overview in (Walther and Nicolas, 2011). Walther and Nicolas (2011) presented corpus-based extraction of derivational rules. Derivative candidates were filtered on the basis of their frequency in the corpus. From a 37.5-million-token corpus of French, 62,158 derivative candidates were extracted, but only 1,511 new derived French lemmas were identified after ranking the candidates. In a manual evaluation of a small sample of 100 lemmas, 42 lemmas and relations were identified as correct, but 43 lemmas were definitely incorrect (many due to foreign words and typos).

However, we can use a limited but manually annotated set of derivational pairs. For limited training data, memory-based learning, e.g. (van den Bosch and Daelemans, 1999), and transformation-based learning, e.g. (Oflazer et al., 2001), have been used. The latter was applied to Polish, but without evaluation. Both approaches are valuable options, but Polish derivational rules can be described in terms of prefixes and suffixes added to the derivative base lemma together with a limited set of internal stem alternations (Rabiega-Wiśniewska, 2009). Polish derivatives are rarely built by simultaneous use of a prefix and a suffix. Transducer-based morphological models can easily cope with suffixation and can be converted into morphological guessers that apply recorded rules to unknown words, e.g. for Polish (Daciuk, 2001) and a large scale Polish guesser called Odgadywacz (Piasecki and Radziszewski, 2008) of high precision and recall. Transducer guessers have problems with alternations (they store exact word parts) and with prefixes (they are mostly built on the basis of a tergo indexes). In Sec. 4 we present solutions to both problems.
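The idea of learning suffix rules from a small set of annotated derivational pairs and applying them to unseen lemmas can be illustrated with a minimal Python sketch. It is not the method of Sec. 4 and not a transducer-based guesser; it is a simplified stand-in under stated assumptions. The function names, the `context` parameter and the relation labels are hypothetical, and the seed pairs are examples quoted in Sec. 3.

```python
from collections import Counter
from os.path import commonprefix  # character-wise common prefix of strings


def learn_suffix_rules(pairs, context=1):
    """Count (relation, base_ending, derivative_ending) rules from seed pairs.

    `context` (an assumption of this sketch) keeps that many characters of the
    shared stem inside the rule, so a rule does not fire on every lemma.
    """
    rules = Counter()
    for base, derivative, relation in pairs:
        stem = commonprefix([base, derivative])
        cut = max(len(stem) - context, 0)
        rules[(relation, base[cut:], derivative[cut:])] += 1
    return rules


def generate_candidates(lemma, rules, min_count=1):
    """Apply each rule whose base ending matches the lemma; most frequent first."""
    candidates = []
    for (relation, base_end, deriv_end), count in rules.items():
        if count >= min_count and lemma.endswith(base_end):
            stem = lemma[:len(lemma) - len(base_end)]
            candidates.append((stem + deriv_end, relation, count))
    return sorted(candidates, key=lambda c: -c[2])


# Seed triples (base, derivative, relation); labels are illustrative only.
seed = [
    ("płot", "płotek", "diminutive"),
    ("lampa", "lampka", "diminutive"),
    ("pisarz", "pisarka", "femininity"),
    ("spawać", "spawacz", "role:agent"),
]
rules = learn_suffix_rules(seed)
print(generate_candidates("kot", rules))  # [('kotek', 'diminutive', 1)]
```

A pair with a stem alternation, such as ucho > uszko, shares only the prefix "u", so the learned rule stores almost the whole word ("ucho" to "uszko") and does not generalise. This is the limitation of exact-string rules mentioned above. Generated candidates would still need verification, since a matching ending does not guarantee that the derivative exists or carries the intended relation.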

3

Derivative Relations in plWordNet

plWordNet is the largest publicly available wordnet of Polish. Version 1.0 was published in 2009 (Piasecki et al., 2009). The plWordNet 2.0 project started in 2009; according to contemporary estimates, it is to reach the size of 140–150 thousand LUs and more than 100,000 synsets by the end of 2012. plWordNet 2.0 has been extended not only in the number of LUs but also in the number of lexico-semantic relations. Among them, derivationally motivated relations were expanded from the two coarse-grained relations defined in plWordNet 1.0 to a sophisticated system described in this section.

Word formation is interconnected with semantics: certain senses correspond to affixes, e.g., the English suffix -er has several different meanings, among them: ‘(male) agent’ (thinker, writer, driver, etc.); ‘instrument’ (opener, printer, pager); ‘experiencer’ (hearer); ‘stimulus’ (pleaser, thriller); ‘patient/theme’ (fryer, keeper, looker, sinker, loaner) or ‘location’ (diner) (Lieber, 2008, pp. 1-2, 17), cf. (Bosch et al., 2008, p. 83). Of course, the same meaning can be carried by many different affixes. The ‘agent’ sense is also represented by -ant/-ent (servant, evacuant, descendant) (Lieber, 2008, pp. 37, 69).

The same holds for Polish. The suffix -ak is similar in some of its functions to English -er, for instance, it has the meanings ‘agent’ (rybak ‘fisher’, wieśniak ‘peasant’, pływak ‘swimmer’) or ‘instrument’ (szczeliniak ‘chisel for making cracks’), but it also has a few other senses: ‘dweller’ (Polak, Słowak), ‘offspring’ (e.g. kociak ‘kitten’, świniak ‘piglet’, kurczak ‘chicken’), ‘emotional markedness’ (dzieciak ‘kid’, łobuziak ‘rascal’) (Grzegorczykowa and Puzynina, 1998).

For plWordNet, we have chosen relations that have clear semantics and are regular or very frequent in Polish, see (Maziarz et al., 2011a, p. 175) and (Maziarz et al., 2011b). In the analysis of the frequencies we followed (Grzegorczykowa and Puzynina, 1998), who based their estimates on the ‘lexicon’ frequencies in (Doroszewski, 1969).

Cross-categorial synonymy is extremely frequent. Semantically, its subtypes are transposition relations: the meanings of the linked words are very close and differ only in their parts of speech. The N-V subtype links deverbal nouns (gerunds) with their bases.
We account for the regular type in -anie, -enie, -cie: pływanie ‘swimming’ < pływać ‘swim (impf)’, napełnienie ‘filling’ < napełnić ‘fill (pf)’, picie ‘drinking’ < pić ‘drink (impf)’ (Grzegorczykowa and Puzynina, 1998, pp. 393-8). The Pact-V subtype connects the active adjectival participle in -ący (pact in the IPIC tagset) with its verb base (pijący ‘drinking’ < pić ‘drink’, pływający ‘swimming’ < pływać ‘swim’). In Polish the active participle may be formed only from imperfective verbs. In many grammars this regular formation is treated as an inflectional form of a given verb, e.g. (Laskowski, 1998, p. 268); however, we follow Saloni and Świdziński (1998) in considering it an adjective due to its inflectional behaviour. The N-Adj type refers to deadjectival nouns in -ość: bladość ‘paleness’ < blady ‘pale’, władczość ‘imperiousness’ < władczy ‘imperious’, małość ‘smallness’ < mały ‘small’; this type is regular.

Markedness (N-N relation) connects related nouns of which one is a marked counterpart of the other, unmarked one. The three most productive Polish subtypes were included in plWordNet.

Diminutives express small size or positive emotional marking and are frequent in Polish. The meaning could be characterised as: ‘X_deriv is a little or pleasant Y_base’. The most popular suffixes are -ek/-ik(-yk), -ko and -ka: płotek ‘little or pleasant fence’ < płot ‘fence’, pałacyk ‘little or pleasant palace’ < pałac ‘palace’, uszko < ucho ‘ear’, lampka < lampa ‘lamp’ (Grzegorczykowa and Puzynina, 1998, pp. 425-6). Some formations are derivatives from the diachronic point of view, but we consider only synchronic formations: e.g., the word młotek ‘hammer’ was derived from młot ‘heavy hammer’ with the suffix -ek, but nowadays it would be ridiculous to say that młotek is a ‘small or pleasant młot’, so this pair is not included in plWordNet.

Augmentatives express grand size and negative emotional marking, and may be paraphrased as: ‘X_deriv is a huge or terrible Y_base’. Augmentative suffixes are, e.g., -uch, -isko(-ysko) or -al: paluch ‘huge or terrible finger’ < palec ‘finger’, ptaszysko ‘huge or terrible bird’ < ptak ‘bird’, nochal < nos ‘nose’ (Grzegorczykowa and Puzynina, 1998).

Young being expresses the youth of the derivative's denotatum and its paraphrase is: ‘X_deriv is a young Y_base’. There are two formants: -ę and -ak, e.g.: małpię ‘young monkey’ < małpa ‘monkey’, świniak ‘piglet’ < świnia ‘pig’ (Grzegorczykowa and Puzynina, 1998, pp. 429-30).

Femininity (N-N) links nouns denoting women with their male counterparts: X_deriv – Y_base is ‘X is a female Y’.
For the suffix -ka (pisarka ‘female writer’ < pisarz ‘writer’) the type is almost fully productive; other popular suffixes are -ini/-yni (bogini ‘goddess’ < bóg ‘god’), -ica (kocica ‘female cat’ < kot ‘cat’), -a (markiza f. ‘marquise’ < markiz m. ‘marquis’) (Grzegorczykowa and Puzynina, 1998, pp. 422-5). plWordNet includes 1745 instances of this relation, see Tab. 1.

Role (N-V) expresses thematic roles of predicate arguments, e.g. agent, object, instrument etc., see Tab. 1. In the case of the most frequent agent subtype, the most popular suffixes are -acz (spawacz ‘welder’ < spawać ‘weld’), -ca (władca ‘ruler’ < władać ‘rule’), -iciel (zbawiciel ‘saviour’ < zbawić ‘save’), -ator (restaurator ‘restorer’ < restaurować ‘restore’); there is also (less frequent) backward (paradigmatic) derivation (szpieg ‘spy’ < szpiegować ‘spy’). Together, suffixal and paradigmatic formations account in (Doroszewski, 1969) for no fewer than 3500 instances (Grzegorczykowa and Puzynina, 1998, pp. 398-416). In Słowosieć the relation occurs 4072 times and is the one most favoured by the linguists.

Role inclusion (V-N), similarly to role, refers to thematic roles of predicate arguments, which here are built into the verb meaning. However, in role inclusion the verb derivative includes its base (a noun argument), whereas in role the noun derivative plays the role of an argument of its base (a predicate, i.e. a verb). This derivation is relatively frequent in plWordNet: 1262 instances. According to (Wróbel, 1998, pp. 577-83) the most frequent subtypes in Polish are instrument and result, e.g. pieprzyć ‘to pepper’ < pieprz ‘pepper’, dziurkować ‘to perforate’ < dziurka ‘hole’, followed by object (kartkować ‘to leaf through’ < kartka ‘a sheet’) and subject (sędziować ‘to referee’ < sędzia ‘referee’); the rest are less productive. These assumptions were confirmed by plWordNet data statistics, see Tab. 1.

State/feature bearer (N-Adj) and state/feature (Adj-N) are both very productive in Polish. The meaning of the relation linking X_N – Y_Adj could be articulated in the following way: X is/has feature Y. The most frequent suffixes (more than 100 lemmas in (Doroszewski, 1969)) are -ec (mędrzec ‘sage’ < mądry ‘wise’), -ka (dziczka ‘rootstock’ < dziki ‘wild’), -ak (dziwak ‘freak’ < dziwny ‘strange’), -ik (okrutnik ‘cruel man’ < okrutny ‘cruel’); together with less frequent suffixes these relations are represented by about 600-800 instances (Grzegorczykowa and Puzynina, 1998, pp. 420-1).
So far we have introduced 219 feature bearer relations into plWordNet.

Inhabitant (N-N) describes X as an ‘inhabitant/dweller of Y’, where Y is the base denotation. Inhabitant names are derived from geographical proper names (for countries, regions, cities, towns, villages and parts of the world) with such frequent suffixes as -anin and -czyk or with paradigmatic backward derivation: Afrykanin ‘African’ < Afryka ‘Africa’, Wietnamczyk ‘Vietnamese’
