Computational Linguistics
Department of Linguistics
Stockholm University

Spring 1998

Brill’s Rule-Based Part of Speech Tagger for Hungarian

Beáta Megyesi

Master's Course in Computational Linguistics
D-level thesis, 10 credits
Advisor: Gunnel Källgren


Abstract

There are two main methodologies for automatic Part of Speech (PoS) tagging: statistical and rule-based. Almost all automatic PoS tagging systems are based on Markov models, where training consists of learning lexical and contextual probabilities. Rule-based taggers, on the other hand, work with rules that have hitherto been constructed by linguists. In spite of the fact that statistical models are less accurate than rule-based models, most existing PoS analysers have been based on a probabilistic model because these systems are very robust and can be automatically trained. The limitations of the rule-based taggers are that they are non-automatic, costly and time-consuming. Brill (1992) presented a rule-based PoS tagger which automatically infers rules from a training corpus based on transformation-based error-driven learning. Brill thereby avoids most of the limitations of traditional rule-based taggers in his system.

In this study, Brill's tagger is tested on Hungarian, which, unlike English, has a very rich morphology, is agglutinative with some inflectional characteristics and has fairly free word order. The tagging software has been trained on a Hungarian corpus consisting of 99 860 word tokens. This corpus has been annotated for part of speech including inflectional properties. The training was done twice, once for part of speech tags only and once for part of speech tags and 'subtags' denoting inflectional properties. New test texts of approximately 2 500 words, as well as the original training corpus, were run through the tagger. Precision was calculated for all test texts, and recall and precision for specific part of speech tags.

The results presented in this work show that the accuracy of tagging of the test texts is 85.12% for part of speech tags alone, 82.45% for part of speech tags with correct and complete subtags, 84.44% for part of speech tags with correct but not necessarily complete subtags, and 87.55% for part of speech tags without considering the correctness of the subtags. The tagger has the most problems with parts of speech belonging to open classes, because of their complicated morphological structure. Grammatical categories, on the other hand, are easier to detect and correctly annotate. It is shown that it is more difficult to derive information from words which are annotated with only PoS tags than from words whose tags include information about the inflectional categories.

It is suggested that for a better evaluation it would be necessary to test the tagger on a large corpus and also to evaluate the different components of the system. Higher accuracy could probably be achieved by using a larger corpus made up of different text types. Also, using a larger lexicon would reduce the number of unknown words. Additionally, accuracy could be greatly improved if the rule generating mechanisms in Brill's system were more flexible, allowing consideration of the different characteristics of Hungarian and other languages. The results show, contrary to Brill's statement that his system is language independent (1992), that Brill's present system has some difficulty with languages whose characteristics are dissimilar from English.


Acknowledgement

Above all, I would like to thank my advisor Gunnel Källgren for her suggestions and encouragement during the preparation of this work. I wish to express my deepest thanks to Klas Prütz for helping me with the training process of the tagger and for his many valuable suggestions. A very big thanks to Jennifer Spenader for reading the paper draft after draft, for her suggestions and comments, and for our many heated discussions concerning this work. I am also grateful to Sofia Gustafsson-Capková for our fruitful meetings discussing our work, and to Christina Ericsson, Beb Lindberg and Sara Rydin for their reading and comments. Thanks to Ljuba Veselinova for helping me with UNIX! I would also like to thank the researchers of the Research Institute for Linguistics, Hungarian Academy of Sciences, especially Júlia Pajzs and Csaba Oravecz for giving me the Hungarian corpora, and Prof. Ferenc Kiefer for giving me the opportunity to participate in the project. Last but not least, I would like to thank my life partner Olav Bandmann for his help and for his patience with me during this work.


Contents

1 INTRODUCTION ................................................. 1
2 BACKGROUND ................................................... 3
  2.1 Types of corpora ......................................... 3
  2.2 Part of speech annotation ................................ 4
  2.3 Evaluation of system efficiency .......................... 6
3 BRILL'S RULE-BASED POS TAGGER ................................ 8
  3.1 Transformation-based error-driven learning ............... 8
  3.2 Learning lexical rules ................................... 10
  3.3 Learning contextual rules ................................ 13
  3.4 Tagging new texts ........................................ 15
  3.5 Some practical information ............................... 15
4 THE HUNGARIAN LANGUAGE ....................................... 16
  4.1 Phonology ................................................ 16
  4.2 Morphology ............................................... 17
    4.2.1 Articles ............................................. 17
    4.2.2 Nouns ................................................ 17
      4.2.2.1 Number ........................................... 18
      4.2.2.2 Person/Possession ................................ 19
      4.2.2.3 Case ............................................. 19
    4.2.3 Pronouns ............................................. 20
    4.2.4 Adjectives ........................................... 22
    4.2.5 Numerals ............................................. 22
    4.2.6 Verbs ................................................ 22
      4.2.6.1 Copula ........................................... 24
      4.2.6.2 Infinitive ....................................... 24
      4.2.6.3 Participles ...................................... 24
      4.2.6.4 Verbal particles ................................. 25
    4.2.7 Postpositions ........................................ 25
    4.2.8 Adverbs .............................................. 26
    4.2.9 Word formation ....................................... 26
      4.2.9.1 Word composition ................................. 26
      4.2.9.2 Derivation of words .............................. 26
  4.3 Syntax ................................................... 27
    4.3.1 Word order ........................................... 27
    4.3.2 The main types of sentence structure ................. 28
    4.3.3 Agreement ............................................ 29
5 THE HUNGARIAN CORPORA ........................................ 31
6 RESULTS AND EVALUATION OF SYSTEM EFFICIENCY .................. 36
  6.1 Testing on the training set .............................. 37
  6.2 Testing on the poem ...................................... 38
  6.3 Testing on the fairy tale ................................ 40
  6.4 Evaluation ............................................... 42
7 DIRECTIONS FOR FUTURE RESEARCH ............................... 47
8 CONCLUSION ................................................... 49
REFERENCES ..................................................... 50
APPENDIX A: THE HUNGARIAN TAGSET ............................... 53
APPENDIX B: LEXICAL RULES ...................................... 58
APPENDIX C: CONTEXTUAL RULES ................................... 75
APPENDIX D: TEXT NORMALISATION ................................. 92
APPENDIX E: OUTPUT ............................................. 120

1 Introduction

There are two main methodologies for automatic Part of Speech (PoS) tagging: statistical (probabilistic) and rule-based. Almost all automatic PoS tagging systems are based on Markov models where training consists of learning lexical probabilities, P(Tag|Word), and contextual probabilities, P(TagX | TagX-1 … TagX-N). These taggers usually take a large manually annotated corpus from which they extract the probabilities. Rule-based taggers, on the other hand, work with rules that have hitherto been constructed by linguists. In terms of accuracy, systems based on the statistical approach tend to reach 95-97% correct analysis using tagsets ranging from a few dozen to about 130 tags (see Charniak 1996; Church 1988; DeRose 1988; Garside et al. 1987; Merialdo 1994, etc.), while rule-based systems, such as Constraint Grammar, can achieve slightly more than 99% correct results (Samuelsson & Voutilainen, 1997). In spite of the fact that statistical models are less accurate than rule-based models, most existing PoS analysers have been based on a probabilistic model because of the robustness and automatic training ability of such systems. The limitations of traditional rule-based taggers are that they are non-automatic, costly and time-consuming.

In 1992 Eric Brill presented a rule-based system which differs from other rule-based systems in that it automatically infers rules from a training corpus, utilising some corpus statistics. It also takes less space than probabilistic taggers. Hence, the system has become a competitive alternative to probabilistic tagging models. The tagger has been trained for tagging English with an accuracy of 97% (Brill, 1994a).

The aim of this study is to test Brill's rule-based part of speech tagger on a Hungarian corpus. In this project, a Hungarian training corpus of 80 000 running words, George Orwell's 1984, has been used. This corpus has been annotated for part of speech (PoS) including inflectional properties (subtags) by the Research Institute for Linguistics at the Hungarian Academy of Sciences. The tagger has been trained on the same material twice, once with PoS and subtags and once with only PoS tags. The question is with what accuracy the rules that Brill's tagger has inferred from the training corpus will be able to tag new unannotated Hungarian texts.

Hungarian, which belongs to the Finno-Ugric branch of the Uralic family, is different from most European languages. It has a very rich morphology, is agglutinative with some inflectional characteristics and has fairly free word order. Thus, it is probable that the tagger trained for Hungarian will have a lower accuracy compared to the result for English. Furthermore, tagging systems of simple construction (for example Church's and DeRose's systems) are usually easier to develop and use for different languages, while more sophisticated tagging systems (such as Brill's system) are in most cases complicated to develop for languages of various types. The purpose of this study is to find out whether Brill's system is immediately applicable, with a high degree of accuracy, to a language which greatly differs in structure from English.

In section two, an overview of corpus types, part of speech annotation, different disambiguation strategies and methods of evaluating system efficiency is presented. Section three describes the theory and practice of the rule-based PoS tagger constructed by Eric Brill. Section four presents a description of the Hungarian language.
It contains a section about Hungarian phonology followed by a description of the morphology of the main parts of speech, such as nouns, pronouns, adjectives, verbs and different verbal particles, postpositions, and adverbs. After that, a short overview of word formation follows. The main characteristics of Hungarian syntax, such as sentence structure, word order and agreement are also described. In section five, the Hungarian corpora which constitute the training corpus of the tagger and the test corpus with their two tagsets (PoS tags with and without inflectional information) are presented.


In section six, the training process of the tagger on the Hungarian corpus, the results of the tagging of different test texts and an evaluation of the system efficiency of the tagger are presented and discussed. Precision and recall rates are shown for different parts of speech. Section seven discusses ways in which the tagger could be developed to improve the tagging of Hungarian and other grammatically similar languages, as well as different limitations and advantages of the tagger.

This work has been carried out within the GRAMLEX COPERNICUS project (no. 621), which aims to advance research on building corpus-linguistic tools (lemmatisers, concordance programmes, lexicons) for four languages: French, Italian, Polish and Hungarian. The Hungarian partners concentrate on the development of a morphological analyser and on making the result of the analysis more usable for lexicographic work. The main task of the Research Institute for Linguistics at the Hungarian Academy of Sciences in this project is to carry on research on homograph disambiguation for Hungarian texts. Testing and developing Brill's rule-based system on Hungarian will hopefully provide an effective tool for tagging new unannotated Hungarian texts.


2 Background

This section presents an outline of important notions concerning corpora, part of speech tagging, different methods for disambiguating words in a corpus and methods used to evaluate system efficiency.

2.1 Types of corpora

A corpus, from a computational linguist's view, is a collection of machine-readable utterances, either spoken or written, taken as a representative sample of a given language, dialect or genre and used for linguistic analysis. There are different ways in which corpora may be categorised, according to the type of texts and the types of annotation.

According to Krenn & Samuelsson (1997) a corpus can be balanced, pyramidal or opportunistic. A balanced corpus consists of 'different text genres of size proportional to the distribution of a certain text type within a language', while pyramidal corpora 'range from very large samples of a few representative genres to small samples of a wide variety of genres' (Krenn & Samuelsson, 1997:64). Opportunistic corpora consist of whatever texts the collector can get.

A corpus can be classified as unannotated or annotated. Unannotated corpora contain raw but in most cases pre-processed texts, i.e. tokenised text with control characters eliminated. (More about tokenisation follows in section 2.2.) Corpus annotation involves a repository of several types of linguistic information. The input to the annotation process is a raw text and the output is an analysed, annotated text. Labels or tags are attached to segments of the text, describing their properties or the relations between them. The information about the text may include the origin and the title of the text, the author's name, the declaration of the writing system, etc. Linguistic information includes part of speech annotation, lemmatisation, parsing, semantics, discourse and text linguistic annotation, phonetic transcription and prosody (McEnery & Wilson, 1996).

A tag may be defined as a code representing some feature or set of features, attached to the segments in a text. The tag can convey single or complex information. According to McEnery & Wilson (1996) there are several requirements on the characteristics of a tag. One is the basic idea that there should be only one part of speech tag per unit. Another is the divisibility of the tag, i.e. each element in a tag should mean something in itself. The third is mnemonicity - the ability to see in an easy way from a tag what it means. A tagset may represent morphological, syntactic and/or phrase structure information, and can incorporate special tags for foreign words, unknown words, idioms, symbols, etc. Thus, the size, contents, complexity and level of annotation of a tagset varies from system to system, as does the way in which a tagset categorises words and phrases.

Tagging can be done manually or automatically. A manually tagged corpus is annotated by a person or a group of people, while an automatically annotated corpus is produced by one or more computer programs whose output may or may not be corrected by human beings. Manual annotation of texts has several disadvantages. It is costly and time-consuming. Often the work is done by a fairly large group of people, which results in the additional problem of inconsistency of annotation that is often difficult to detect and correct. With automatic tagging the number of errors may be larger than with manual tagging, but the errors are consistent and easier to detect.
The drawback of automatic tagging is the problem of disambiguation, which greatly increases the number of errors in tagging. It is of great importance that every annotation (or tag) is clearly defined from the start to avoid inconsistency in tagging, whether it is done manually or automatically.

Corpora can be used, for example, for natural language statistics, machine translation, natural language processing, speech recognition, etc.


Corpora are also used for training, testing and evaluating disambiguation strategies. Large linguistically annotated corpora are used for training speech processing systems and stochastic part of speech taggers and parsers (Krenn & Samuelsson, 1997).

2.2 Part of speech annotation

The most basic type of corpus annotation is part of speech tagging (PoS-tagging), which is necessary in most applications of natural language processing (NLP): speech synthesis, speech recognition, machine translation, etc. PoS-tagging is useful for further linguistic study, for analysing the syntactic structure of the sentences in a text, or for statistical work such as counting the distribution of the different word classes in text corpora. PoS-tagging consists of assigning each word its correct part of speech tag, i.e. the text is annotated with syntactic category at word level, but also with other relevant information such as the inflectional categories of the classes, e.g. number, person and case. Thus, the task of PoS-tagging is attaching appropriate grammatical or morphosyntactic category labels to every word token, and even to punctuation marks, symbols, abbreviations, etc. in a corpus.

Words have traditionally been divided into different word classes on the basis of their meaning, form or syntactic function (e.g. in English, nouns, pronouns, verbs, adverbs, adjectives, prepositions, conjunctions and interjections are generally recognised). The great problem is deciding what criteria (semantic, morphological and/or syntactic) should be used for classifying words into a certain part of speech. The size of the tagset varies from system to system depending on how detailed the categorisation of words is. For instance, one tagset, e.g. the LOB-tagset[1], can make a difference between lexical verbs, modal verbs and irregular verbs, while other systems, such as the Penn-tagset[2], use the same tag (VB) for all verbs and distinguish only the morphosyntactic forms of the verb (McEnery & Wilson, 1996).

Another major problem in automatic PoS-tagging is that words may belong to several parts of speech, i.e. be ambiguous between two or more classes; the word graduate, for example, can be a verb, a noun or an adjective. Contextual (and perhaps prosodic) information is needed to decide to which class the item belongs. Part of speech annotation has to handle the disambiguation of homographs and of known and unknown words, as well as achieve a high rate of correctly tagged words.

A PoS-tagger usually consists of three components: a tokeniser, a morphological classifier and a morphological disambiguator. The tokeniser segments the input text into words and sentences and (sometimes) recognises proper names, acronyms and idioms by means of dictionaries. After tokenisation, the morphological classifier identifies all the possible word classes and morphosyntactic features (e.g. number, gender, case, etc.) of each word according to a lexicon. For non-identified words, the morphological analyser is used to detect whether the lexeme exists in the lexicon or not by guessing from common word endings. The larger the lexicon, the greater the chance that a word can be identified and assigned the appropriate part of speech tag. Then the morphological disambiguator tries to disambiguate each word in the corpus by making the correct decision about the appropriate part of speech tag for the word in a certain context. Thus, automatic part of speech taggers take raw text, one or several lexicons or lists (e.g. for idioms), a morphological analyser and some rules or transition probabilities for disambiguation, and produce a part of speech description of the text.

[1] The LOB-tagset is designed for tagging the LOB corpus (Lancaster/Oslo-Bergen corpus), which is a collection of more than 1 000 000 words of English text taken from different sources (de Marcken, 1990).
[2] The Penn-tagset is designed for the Penn treebank, which is a PoS tagged and parsed corpus consisting mainly of the Brown Corpus and articles from the Wall Street Journal (Marcus, Marcinkiewicz & Santorini, 1993).
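As a rough illustration of the three-stage pipeline just described, consider the following minimal sketch in Python. The toy lexicon, the suffix list and the trivial first-tag disambiguator are invented for illustration only and belong to no particular tagger.

def sketch_pipeline():
    # toy data: a lexicon of known words and a suffix list for guessing
    LEXICON = {"the": ["DET"], "graduate": ["NN", "VB", "JJ"]}
    SUFFIX_GUESSES = {"ly": ["RB"], "ing": ["VBG", "NN"]}

    def tokenise(text):
        # real tokenisers also separate punctuation and recognise names, idioms
        return text.split()

    def classify(word):
        if word.lower() in LEXICON:                  # known word: tags from lexicon
            return LEXICON[word.lower()]
        for suffix, tags in SUFFIX_GUESSES.items():  # unknown word: guess from ending
            if word.endswith(suffix):
                return tags
        return ["NN"]                                # fall back to a default tag

    def disambiguate(candidates):
        # placeholder: a real disambiguator uses rules or transition probabilities
        return [tags[0] for tags in candidates]

    words = tokenise("the graduate smiled knowingly")
    return list(zip(words, disambiguate([classify(w) for w in words])))

print(sketch_pipeline())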


Identifying all possible part of speech tags of a word is regarded as 'an easy problem', while disambiguation is known to be a difficult problem. There are mainly two types of disambiguation strategies frequently used in systems of today: statistical/probabilistic approaches and rule-based methods.

The main goal of the probabilistic methods is, given a sequence of words W1...Wn, to find the sequence of tags T1...Tn that is the most likely tag sequence for these words, i.e. the sequence T1...Tn that maximises P(T1...Tn | W1...Wn). It is possible to estimate P(T1...Tn | W1...Wn) using the probabilities P(Ti | Wi) and P(Ti | Ti-1 Ti-2) (see Krenn & Samuelsson, 1997:97). These two probabilities are called lexical and contextual probabilities, respectively. The lexical probability is the probability of observing a tag given a specific word, P(Ti | Wi). The contextual probability, on the other hand, is the probability of observing a specific tag Ti given the surrounding tags. Contextual probabilities are typically implemented as bigrams or trigrams, i.e. P(Ti | Ti-1) or P(Ti | Ti-1 Ti-2), respectively. The lexical and contextual probabilities (frequency-based information) are often estimated from a correctly annotated (pretagged) training corpus. Thus, the various disambiguation algorithms which have been developed for part of speech taggers based on probabilistic methods decide upon the most probable of the several possible categories for a given word, i.e. they disambiguate categorical ambiguities in a given text based on statistical methods (see Church 1988; Cutting et al. 1992; DeRose 1988; Garside 1987; Merialdo 1994; Weischedel et al. 1993, etc.).

One example of a system based on the probabilistic approach is CLAWS (Constituent Likelihood Automatic Word-Tagging System), which is a PoS tagger for English trained on a manually analysed text (see Garside 1987; Atwell 1993; Prütz 1996; CLAWS, 1998). CLAWS consists of three stages: pre-edit, automatic tag assignment and manual post-edit. In the pre-edit stage the machine-readable text is automatically converted to a format suitable for the tagging program. The text is then passed through the tagging program, which assigns a PoS tag to each word or word combination in the text. CLAWS has a lexicon of words with their possible parts of speech, and a list of lexicalised phrases or idioms. For unknown words which are not in the databases, CLAWS uses various heuristics, including a suffix list of common word suffixes with their possible parts of speech. The disambiguation component of the system selects the most likely tag for each word by calculating the probability of all possible paths through the sequence of ambiguously tagged words on the basis of a matrix of transition probabilities derived from a training corpus. Finally, manual post-editing using a special tag editor may take place to correct the machine output. The CLAWS system has an accuracy of 96-97%, depending on the type of text.

DeRose (1988) created another system, the VOLSUNGA algorithm, which is similar to CLAWS but operates in linear rather than exponential time and space. The VOLSUNGA algorithm is based on dynamic programming, meaning that the main problem can be divided into subproblems and can be solved faster if the best solutions of the subproblems are available. DeRose's system has an accuracy of 96%.
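To make the bigram idea above concrete, here is a minimal sketch of estimating lexical and contextual probabilities from a pretagged corpus and applying them greedily. The four-token corpus and the greedy left-to-right search are toy assumptions; a real system estimates from a much larger corpus and searches over the whole sequence (e.g. with the Viterbi algorithm).

from collections import Counter, defaultdict

tagged = [("the", "DET"), ("dog", "NN"), ("runs", "VB"), (".", ".")]

lex = defaultdict(Counter)      # counts of (word, tag)
ctx = defaultdict(Counter)      # counts of (previous tag, tag)
prev = "<s>"
for word, tag in tagged:
    lex[word][tag] += 1
    ctx[prev][tag] += 1
    prev = tag

def p_lex(tag, word):           # estimate of P(tag | word)
    total = sum(lex[word].values())
    return lex[word][tag] / total if total else 0.0

def p_ctx(tag, prev_tag):       # estimate of P(tag | previous tag)
    total = sum(ctx[prev_tag].values())
    return ctx[prev_tag][tag] / total if total else 0.0

def tag_greedy(words):
    prev, out = "<s>", []
    for w in words:
        tags = lex[w] or {"NN": 1}   # default tag for unknown words
        best = max(tags, key=lambda t: p_lex(t, w) * p_ctx(t, prev))
        out.append((w, best))
        prev = best
    return out

print(tag_greedy(["the", "dog", "runs", "."]))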
In contrast to the probabilistic method, the rule-based approach to disambiguation uses rules created by linguists or by a computer program. The rules may contain a large amount of morphological, lexical and/or syntactic information. They are based on linguistic knowledge about the structure of language in general and of specific languages. With human rule creation there is a large set of manually constructed rules based on a specific grammar, written in a formal notation (see example A below) so that they can be used by the computer for further parsing. Many systems also use the trial-and-error method, i.e. finding sentences where rules have failed in order to manually add further rules to the system for higher accuracy. The drawback of this method is that adding a rule to the system may involve overgeneration, i.e. one extra rule can do more harm than good to the accuracy of general tagging. Rule-based methods are time-consuming and require great knowledge both of languages in general and of the specific language being tagged. TAGGIT, constructed by Greene and Rubin (1971), was one of the first systems which used rules for disambiguation, with an accuracy of 77%.


Another example of a rule-based system is Constraint Grammar (CG), which is a very successful language independent rule-based approach. It has an accuracy of 99.7%, including some ambiguities. The tagger uses a two-level morphological analyser with a big lexicon and a large number of morphological and syntactic descriptions created by linguists. Rules are available for several languages, such as Finnish, English and Swedish (see Samuelsson & Voutilainen, 1997). A rule from CG may look like the one presented in example A. This rule means 'discard a verb interpretation (first line) if the word is preceded by an unambiguous determiner (second line)'.

Example A

@w =0 (V)
(-1C DET).

CG, like many other models, is intended to be language independent once the necessary language specific information is implemented. Unlike TAGGIT or CG, Brill's rule-based tagger automatically creates rules from a training corpus, without human intervention. Brill's tagger will be described in more detail in section 3, but first some explanation of the evaluation of system efficiency follows.

2.3 Evaluation of system efficiency

System efficiency may be evaluated by two metrics for effectiveness: recall and precision. Recall is the number of intended tags retrieved by the tagger divided by the total number of intended tags. Precision is the number of intended tags retrieved divided by the total number of retrieved tags. Consider figure A below for a detailed explanation.

Figure A. The representation of the different sets, where A is the set of intended items and B is the set of retrieved items; x, y and z are the numbers of elements in the different parts of the picture: x in A only, y in the intersection of A and B, and z in B only.

Let A be the set of intended items and let B be the set of retrieved items; x, y and z are the numbers of elements corresponding to the different parts in figure A above. Recall and precision are defined as follows:

recall = y / (y + x) = intended_found / intended_total
precision = y / (y + z) = intended_found / total_found

Recall estimates the probability that the tagger annotates a word with tag T given that the correct annotation for the word is T. Precision estimates the probability that the correct annotation for a word is T given that the tagger annotates the word with T. For example, a recall of 90% means that the system missed 10% of the intended tags, while a precision of 85% means that 15% of what the system retrieved was not a correct result.
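The definitions translate directly into code. The following sketch computes recall and precision for a single tag from parallel lists of gold (intended) and predicted (retrieved) tags; the four-token example data is invented.

def per_tag_scores(gold, predicted, tag):
    intended_total = sum(1 for g in gold if g == tag)
    total_found = sum(1 for p in predicted if p == tag)
    intended_found = sum(1 for g, p in zip(gold, predicted) if g == p == tag)
    recall = intended_found / intended_total if intended_total else 0.0
    precision = intended_found / total_found if total_found else 0.0
    return recall, precision

gold      = ["DET", "NN", "VB", "NN"]
predicted = ["DET", "NN", "NN", "NN"]
print(per_tag_scores(gold, predicted, "NN"))   # recall 1.0, precision 2/3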


The error rate is defined as: error rate = 100% - precision. The effectiveness of PoS taggers is measured in terms of accuracy, i.e. the proportion of correctly assigned tags in a text. The perfect system could be characterised as 100% recall and 100% precision. The accuracy of most systems of today is between 95% and 99%. According to Krenn & Samuelsson (1997) the accuracy depends on the following conditions:

i) the size of the tagset: for example, a small tagset gives good results even when trained on a small corpus;
ii) the size of the training corpus: the bigger the tagset, the less correct the result will be if the size of the training corpus does not increase accordingly;
iii) the type of training and test corpora: if a tagger is trained on a specific training corpus and is tested on another type of text, the result will be less satisfactory than with a similar test corpus;
iv) the type and the completeness of the vocabulary: an incomplete lexicon also results in lower accuracy.

The theoretical background for the evaluation of system efficiency comes from information retrieval research. More information and a good overview of this topic can be found in Krenn & Samuelsson (1997).


3 Brill's rule-based PoS tagger

Eric Brill introduced a PoS tagger in 1992 that is based on rules, or transformations as he calls them, where the grammar is induced directly from the training corpus without human intervention or expert knowledge. The only additional component necessary is a small, manually and correctly annotated corpus - the training corpus - which serves as input to the tagger. The system is then able to derive lexical/morphological and contextual information from the training corpus and 'learns' how to deduce the most likely part of speech tag for a word. Once the training is completed, the tagger can be used to annotate new, unannotated corpora based on the tagset of the training corpus.

The rule-based part of speech tagger can be said to be a hybrid approach, because it first uses statistical techniques to extract information from the training corpus and then uses a program to automatically learn rules which reduce the faults that would be introduced by statistical mistakes (Brill, 1992). The tagger does not use hand-crafted rules or prespecified language information, nor does it use external lexicons or lists of different types. According to Brill (1992) 'there is a very small amount of general linguistic knowledge built into the system, but no language-specific knowledge'. The long-term goal of the tagger is, in Brill's own words, to 'create a system which would enable somebody to take a large text in a language he does not know and with only a few hours of help from a speaker of the language accurately annotate the text with part of speech information' (Brill & Marcus, 1992b:1). In pursuit of this aim, Brill has also developed a parser, consisting of systems which can automatically derive word classes and the bracketing structure of sentences, assign nonterminal labels to the bracketing structure and improve prepositional phrase attachment (see Brill, 1992, 1993a, 1993b, 1995b). In this work the part of speech tagger, which applies the most likely tag to each word, is presented.

The following section gives a description of the main ideas of Brill's tagger and some information about how to train and test it. All information in the following section is based on Brill's articles, listed in the reference list (Brill, 1992; 1993a; 1993b; 1993c; 1994a; 1995a; 1995b). The description below is only functional, disregarding any efficiency aspects considered in the actual software.

3.1 Transformation-based error-driven learning

The general framework of Brill's corpus-based learning is so-called Transformation-based Error-driven Learning (TEL). The name reflects the fact that the tagger is based on transformations or rules, and learns by detecting errors. First, a general description of how TEL works in principle is given; then a more detailed explanation of the specific modules of the tagger follows in subsequent sections.

Roughly, TEL, shown in figure B, begins with an unannotated text as input which passes through the initial state annotator. It assigns tags to the input in some fashion. The output of the initial state annotator is a temporary corpus, which is then compared to a goal corpus which has been manually tagged. Each time the temporary corpus is passed through the learner, the learner produces one new rule, the single rule that improves the annotation the most (compared with the goal corpus), and replaces the temporary corpus with the analysis that results when this rule is applied to it.
By this process the learner produces an ordered list of rules. The tagger uses TEL twice: once in the lexical module, deriving rules for tagging unknown words, and once in the contextual module, deriving rules that improve the accuracy. Both modules use two types of corpora: the goal corpus, derived from a manually annotated corpus, and a temporary corpus whose tags are improved step by step to resemble the goal corpus more and more.
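In outline, the TEL cycle just described can be sketched as below. Rule generation, rule application and scoring are left as parameters because they are module-specific (see sections 3.2 and 3.3); this is a schematic rendering of the learning loop, not Brill's actual implementation.

def tel_learn(temp_corpus, goal_corpus, all_rules, score, apply_rule,
              threshold=1):
    learned = []                          # ordered output list of rules
    while True:
        scored = [(score(r, temp_corpus, goal_corpus), r) for r in all_rules]
        best_score, best_rule = max(scored, key=lambda sr: sr[0])
        if best_score < threshold:
            break                         # no rule improves enough: stop
        learned.append(best_rule)         # rule order matters
        temp_corpus = apply_rule(best_rule, temp_corpus)
    return learned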


Figure B. Error-driven learning module (data marked by thin lines)

In the lexical module of the tagger the goal corpus is a list of words containing information about the frequencies of tags in a manually annotated corpus. The temporary corpus, on the other hand, is a list consisting of the same words as in the goal corpus, tagged in some fashion. In the contextual learning module the goal corpus is a manually annotated running text, while the temporary corpus consists of the same running text as in the goal corpus but with different tags.

A rule consists of two parts: a condition (the trigger and possibly a current tag) and a resulting tag. The rules are instantiated from a set of predefined transformation templates. The templates contain uninstantiated variables and are of the form

if trigger, then change the tag X to the tag Y

or

if trigger, then change the tag to the tag Y

where X and Y are variables. The interpretation of the first type of template is that if the rule triggers on a word with current tag X, then the rule replaces the current tag with the resulting tag Y. The second type means that if the rule triggers on a word (regardless of the current tag), then the rule tags this word with the resulting tag Y.

The set of all permissible rules PR is generated from all possible instantiations of all predefined templates. The rule generating process takes an initially tagged[3] temporary corpus TC0 and finds the rule in PR which gets the best score when applied to TC0 (a high score for a rule means that the temporary corpus produced when applying the rule gives an annotation closer to the goal corpus). Call this rule R1. Then R1 is applied to TC0, producing TC1. The process is now repeated with TC1, i.e. it finds the rule in PR which gets the best score when applied to TC1.

[3] Initial state annotation is done differently depending on which learning module is used. This will be explained in detail later.


This will be rule R2[4], which is then applied to TC1, producing TC2. The process is reiterated, producing rules R3, R4, ... and corresponding temporary corpora TC3, TC4, ..., until the score of the best rule fails to reach above some predetermined threshold value. The sequence of temporary corpora can be thought of as successive improvements, closer and closer to the goal corpus. Note that only the tags differ between the different temporary corpora. The output of the process is the ordered list of rules R1, R2, ..., which will serve to tag new unannotated texts.

The score for a rule R in PR is computed as follows. For each tagged word in TCi the rule R gets a score for that word by comparing the change from the current tag to the resulting tag with the corresponding tag of the word in the goal corpus. How this is computed depends on the module; see sections 3.2 and 3.3 below. A positive score means that the rule improves the tagging of this word, and a negative score means that the rule worsens the tagging. If the condition of the rule is not satisfied, the score is zero. The total score for R, score(R), is then obtained by adding the scores for each word in TCi for that rule. When the total score for each R is obtained, the rule which has the highest score is added to the set of rules which have already been learned. Rules are ordered, i.e. the effect of a later rule depends on the outcome of earlier rules.

The following sections present an overview of the two modules in Brill's system: the lexical module and the contextual module. First, the lexical module uses transformation-based error-driven learning to produce lexical tagging rules. Then the contextual module, also using transformation-based error-driven learning, produces contextual tagging rules. The lexical and the contextual modules have different transformation templates. The triggers in the lexical module depend on the character(s) and the affixes of a word, and some of them depend on the following/preceding word. The triggers in the contextual module, on the other hand, depend on the current word itself and the tags or the words in the context of the current word.

3.2 Learning lexical rules

The ideal goal of the lexical module is to find rules that can produce the most likely tag for any word in the given language, i.e. the most frequent tag for the word in question considering all texts in that language. The problem is to determine the most likely tags for unknown words, given the most likely tag for each word in a comparatively small set of words.

The lexical learner module is weakly statistical. It uses the first half of the manually tagged corpus[5] as well as any additional large unannotated corpus containing the manually tagged corpus with the tags removed. The learner module uses three different lexicons or lists constructed (by the user) from the manually tagged corpus and the large unannotated corpus. The smallwordtaglist, built from the manually annotated corpus, serves as the goal corpus in TEL and contains lines of the form

Word Tag Frequency

It simply contains the frequencies of each Word/Tag pair in the manually annotated corpus. Freq(W,T) denotes the frequency of a word W with a specific tag T, and Freq(W) denotes the frequency of the word W in the manually annotated corpus. These numbers are used by the lexical learner in two ways: Freq(W,T) is used to compute the most likely tag T for the word W (i.e. the one with the highest frequency), and P(T|W) := Freq(W,T) / Freq(W) is the estimated probability that the word W is tagged with the tag T.

It simply contains the frequencies of each Word/Tag pair in the manually annotated corpus. Freq(W,T) denotes the Frequency of a word W with a specific tag T, and Freq(W) denotes the frequency of the word Word in the manually annotated corpus. These numbers are used by the lexical learner in two ways: Freq(W,T) is used to compute the most likely tag T for the word W (i.e. the one with the highest frequency), and

[4] Note that R2 could in principle be the same rule as R1, i.e. the set of rules considered in each iteration is the whole set PR.
[5] The second half of the manually annotated corpus will be used with the contextual rule learner.


The other two lists are called bigwordlist and bigbigramlist and are constructed from the large unannotated corpus. bigwordlist is a list of all words occurring in the unannotated corpus, sorted by decreasing frequency. bigbigramlist is a list of all word pairs (hence the name bigram) occurring in the unannotated corpus. bigbigramlist does not contain the frequencies of word pairs; it only records whether a given word pair occurs in the unannotated corpus or not. Note that once the user has constructed these lists, the lexical learner needs neither the manually annotated corpus nor the unannotated corpus, because the lists contain all the information the learner needs. bigwordlist and bigbigramlist are only used to check the trigger condition, which is completely determined by the data in these two lists.

First, the learner constructs a word list from smallwordtaglist, i.e. the words with tagging information removed. Then the initial state annotator assigns to every word the default most likely tag. The default tags for English are NN and NNP[6], where NNP is assigned to words which start with a capital letter and NN is assigned to words which do not. The word list thus obtained is the initial temporary corpus WL0; it was called TC0 in the general description of TEL above. Once WL0 has been produced by the initial state annotator, the learner generates the set of all permissible rules PR from all possible instantiations of all the predefined lexical templates and computes a score for every rule R in PR (see below). The rule which achieves the best score becomes rule number one in the output. Then the learner transforms WL0 to WL1 by applying this rule. This process is repeated until no rule can be found with a score greater than some threshold value, i.e. compute the new scores for all the rules in PR, pick the one with the best score, output this rule as rule number two and apply it to WL1 to get WL2, etc.

The scoring function is defined as follows. If the rule R has the template 'if trigger then change tag X to tag Y', and w is a word in WLi with current tag X satisfying the trigger condition, then R gets the score P(Y|w) - P(X|w) for the word w. The total score for R is then obtained by adding all the 'word scores':

score(R) := Σ_w (P(Y|w) - P(X|w))

where the sum runs through all w in WLi with current tag X satisfying the trigger condition.

If the rule R has the template 'if trigger then change current tag to tag Y', and w is a word in WLi satisfying the trigger condition, then R gets the score P(Y|w) - P(current tag of w|w) for the word w. The total score for R is again obtained by adding all the 'word scores':

score(R) := Σ_w (P(Y|w) - P(current tag of w|w))

where the sum runs through all w in WLi satisfying the trigger condition.

Note that the score the rule R gets for w is always of the form P(new tag|w) - P(old tag|w). A positive score means that the new tag is more likely than the old tag, while a negative score means that the new tag is less likely than the old tag. The trigger condition is tested using bigwordlist and bigbigramlist, and the estimated probabilities are computed from the frequencies in smallwordtaglist.

[6] In the case of the Hungarian corpus I simply used these default tags as nonterminal tags.
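As an illustration of the first scoring formula, the sketch below scores a single instantiated rule of the form 'change tag X to tag Y if the word has a given suffix' (an fhassuf-type rule; see the template list that follows). The frequency table stands in for smallwordtaglist and contains invented toy data.

from collections import Counter

freq = Counter({("run", "VB"): 8, ("run", "NN"): 2,
                ("walking", "VBG"): 3, ("walking", "NN"): 1})

def p(tag, word):                        # P(tag | word) from smallwordtaglist
    total = sum(c for (w, _), c in freq.items() if w == word)
    return freq[(word, tag)] / total if total else 0.0

def score(rule, word_list):
    x, y, trigger = rule                 # change tag X to tag Y if trigger holds
    return sum(p(y, w) - p(x, w)
               for w, cur in word_list   # word_list holds (word, current tag)
               if cur == x and trigger(w))

word_list = [("run", "NN"), ("walking", "NN")]
has_ing = lambda w: w.endswith("ing")    # an instantiated fhassuf-style trigger
print(score(("NN", "VBG", has_ing), word_list))   # 3/4 - 1/4 = 0.5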


The set of templates used, with the name of the rule type for each template in parentheses, is given below (see also Appendix B):

1. Change the most likely tag to Y if the character Z appears anywhere in the word. (char)
2. Change the most likely tag to Y if the current word has suffix x, |x| ≤ 4[7]. (hassuf)
3. Change the most likely tag to Y if deleting the suffix x, |x| ≤ 4, results in a word. (deletesuf)
4. Change the most likely tag to Y if adding the suffix x, |x| ≤ 4, results in a word. (addsuf)
5. Change the most likely tag to Y if the current word has prefix x, |x| ≤ 4. (haspref)
6. Change the most likely tag to Y if deleting the prefix x, |x| ≤ 4, results in a word. (deletepref)
7. Change the most likely tag to Y if adding the prefix x, |x| ≤ 4, results in a word. (addpref)
8. Change the most likely tag to Y if word W ever appears immediately to the left/right of the word[8]. (goodleft) vs. (goodright)
9. Change the most likely tag from X to Y if the character Z appears anywhere in the word. (fchar)
10. Change the most likely tag from X to Y if the current word has suffix x, |x| ≤ 4. (fhassuf)
11. Change the most likely tag from X to Y if deleting the suffix x, |x| ≤ 4, results in a word. (fdeletesuf)
12. Change the most likely tag from X to Y if adding the suffix x, |x| ≤ 4, results in a word. (faddsuf)
13. Change the most likely tag from X to Y if the current word has prefix x, |x| ≤ 4. (fhaspref)
14. Change the most likely tag from X to Y if deleting the prefix x, |x| ≤ 4, results in a word. (fdeletepref)
15. Change the most likely tag from X to Y if adding the prefix x, |x| ≤ 4, results in a word. (faddpref)
16. Change the most likely tag from X to Y if word W ever appears immediately to the left/right of the word. (fgoodleft) vs. (fgoodright)

[7] |x| ≤ 4 means that the length of the 'prefix' or the 'suffix' must not exceed 4 characters.
[8] These templates are costly to compute, since for each word W we get a template of the simpler type (all templates except numbers 8 and 16). To speed up the learning process a threshold n is given as an argument to the learner, which restricts the use of these templates to the n most frequent words in bigwordlist (i.e. the first n words in bigwordlist). This is the reason bigwordlist is sorted.

Note that the only difference between the first eight and the second eight templates is that the latter change a tag X to another tag Y, while in the first eight templates the new tag will be Y regardless of the current tag.

The lexical rules produced by the lexical learning module are then used to initially tag the unknown words in the contextual training corpus. This will be described in detail below.

3.3 Learning contextual rules

Once the tagger has learned the most likely tag for each word found in the manually annotated training corpus and the method for predicting the most likely tag for unknown words, contextual rules are learned for disambiguation. The learner discovers rules on the basis of the particular environments (or contexts) of word tokens.

The contextual learning process needs an initially annotated text. The input to the initial state annotator is an untagged corpus, a running text which is the second half of the manually annotated corpus with the tagging information of the words removed. The initial state annotator needs a list, a so-called traininglexicon, which consists of a list of words with a number of tags attached to each word. These tags are the tags found in the first half of the manually annotated corpus (the ones used by the lexical module). The first tag is the most likely tag for the word in question; the rest of the tags are in no particular order:

Word Tag1 Tag2 … Tagn

With the help of the traininglexicon, the bigbigramlist (the same as used in the lexical learning module, see above) and the lexical rules, the initial state annotator assigns to every word in the untagged corpus the most likely tag. It tags the known words, i.e. the words occurring in traininglexicon, with the most frequent tag for the word in question. The tags for the unknown words are computed using the lexical rules: each unknown word is first tagged with NN or NNP and then each of the lexical rules is applied in order. The reason why bigbigramlist is necessary as input is that some of the triggers (nos. 8 and 16) in the lexical rules are defined in terms of this list[9]. The annotated text thus obtained is the initial temporary running text RT0[10], which serves as input to the contextual learner.

[9] These triggers do not check if the word before/after the current word is a particular word (in the current context); instead they check if the current word can ever be preceded/followed by a particular word in some context.
[10] RT0 was called TC0 in the general description of TEL above.

Transformation-based error-driven learning is used to learn contextual rules in a similar way as in the lexical learner module. The input to the contextual learner is the second half of the manually annotated corpus (i.e. the goal corpus), the initial temporary corpus RT0 and the traininglexicon (the same one as above). First, the learner generates the set of all permissible rules PR from all possible instantiations of all the predefined contextual templates. Note that PR in the contextual module and PR in the lexical module are different sets of rules, since the two modules have different transformation templates. Here, the triggers of the templates usually depend on the current context. The following are the triggers of the contextual transformation templates:



1. The preceding/following word is tagged with Z. (PREVTAG/NEXTTAG)
2. One of the two preceding/following words is tagged with Z. (PREV1OR2TAG/NEXT1OR2TAG)
3. One of the three preceding/following words is tagged with Z. (PREV1OR2OR3TAG/NEXT1OR2OR3TAG)
4. The preceding word is tagged with Z and the following word is tagged with V. (SURROUNDTAG)
5. The preceding/following two words are tagged with Z and V. (PREVBIGRAM/NEXTBIGRAM)
6. The word two words before/after is tagged with Z. (PREV2TAG/NEXT2TAG)
7. The current word is Z. (CURWD)
8. The preceding/following word is W. (PREVWD/NEXTWD)
9. One of the preceding/following words is W. (PREV1OR2WD/NEXT1OR2WD)
10. The word two words before/after is W. (PREV2WD/NEXT2WD)
11. The current word is Z and the preceding word is V. (LBIGRAM)
12. The current word is V and the following word is Z. (RBIGRAM)
13. The current word is V and the preceding/following word is tagged with Z. (WDPREVTAG/WDNEXTTAG)
14. The current word is V and the word two words before/after is tagged with Z. (WDAND2BFR/WDAND2TAGAFT)
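To illustrate, the sketch below applies one instantiated contextual rule with a PREVTAG trigger ('change tag from X to Y if the preceding word is tagged Z') to a toy sentence; the rule and the data are invented.

def apply_prevtag(tagged, x, y, z):
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == x and out[i - 1][1] == z:
            out[i] = (word, y)           # rules apply left to right, in order
    return out

sent = [("the", "DET"), ("can", "VB"), ("rusted", "VBD")]
print(apply_prevtag(sent, "VB", "NN", "DET"))
# [('the', 'DET'), ('can', 'NN'), ('rusted', 'VBD')]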

For all rules in PR for which the trigger condition is met, the scores on the temporary corpus RT0 are computed. The learner picks the rule with the highest score, R1, which is put on the output list. Then the learner applies R1 to RT0 and produces RT1, on which the learning continues. The process is reiterated, putting one rule (the one with the highest score in each iteration) on the output list[11] in each step, until learning is completed, i.e. no rule achieves a score higher than some predetermined threshold value. A higher threshold speeds up the learning process but reduces the accuracy, because it may eliminate effective low frequency rules.

If R is a rule in PR, the score for R on RTi is computed as follows. For each word in RTi the learner computes the score for R on this word. Then the scores for all words in RTi where the rule is applicable are added, and the result is the total score for R. The score is easy to compute since the system can compare the tags of words in RTi with the correct tags in the goal corpus (the texts are the same). If R is applied to the word w, thereby correcting an error, then the score for w is +1. If instead an error is introduced, the score for w is -1. In all other cases the score for w is 0. Thus, the total score for R is

score(R) := number of errors corrected - number of errors introduced.

There is one difference compared to the lexical learning module, namely that the application of the rules is restricted in the following way: if the current word occurs in the traininglexicon but the new tag given by the rule is not one of the tags associated with the word in the traininglexicon, then the rule does not change the tag of this word.

The rules produced by the contextual learning module, together with the lexical rules and several other files, can then be used as input to the tagger to tag unannotated text, as will be described in the next section.

[11] Recall that the output list is ordered.
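The error-difference score above can be sketched as follows, assuming the temporary and goal corpora are parallel lists of (word, tag) pairs over the same text; this is a schematic rendering, not Brill's implementation.

def contextual_score(rule_apply, temp, goal):
    new = rule_apply(temp)
    corrected = introduced = 0
    for (w, old), (_, new_t), (_, gold_t) in zip(temp, new, goal):
        if old != new_t:                 # the rule changed this word's tag
            if new_t == gold_t:
                corrected += 1           # an error was corrected: +1
            elif old == gold_t:
                introduced += 1          # an error was introduced: -1
            # otherwise both tags are wrong and the score for w is 0
    return corrected - introduced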


3.4 Tagging new texts

To execute the tagger several files are used: a lexicon, the unannotated corpus to be tagged, bigbigramlist, the lexical rule file and the contextual rule file. The tagger is called with the following command:

tagger LEXICON UNTAGGEDCORPUS BIGBIGRAMLIST LEXICALRULEFILE \ CONTEXTUALRULEFILE > Outputfile

bigbigramlist is the same as above; the lexical and contextual rule files are the ones computed by the lexical and the contextual learner modules. The lexicon is built in the same way as traininglexicon above, the only difference being that the whole manually tagged corpus is used. bigbigramlist is needed to check the trigger conditions for the lexical rules.

First, the tagger tags all known words (words occurring in the lexicon) with the most frequently occurring tag for the word in question. Then, the tags for the unknown words are computed using the lexical rules: each unknown word is first tagged with NN or NNP and then each of the lexical rules is applied in order. This completes the lexical part of the tagging. Then, the contextual rules are applied to the whole corpus in order, with the same restriction as in section 3.3. The result is the annotated text.

3.5 Some practical information

The software is distributed freely and is available from ftp.cs.jhu.edu/pub/brill/Programs/RULE_BASED_TAGGER. Information about how to compile the programs and how to train and test the tagger is found in the README files in the directory Docs. Note that all text files used for training and testing, both annotated and unannotated, must be tokenised, which includes separating punctuation from words and placing one sentence per line. Tagged text is of the form word/tag. For example, annotated text taken from the Hungarian test corpus has the following form:

De/KOT keme1nyse1ge/FN_PSe3 volt/IGE_Me3 a/DET legnagyobb/MN_FOK ./.

while unannotated text taken from the same corpus looks as follows:

De keme1nyse1ge volt a legnagyobb .

The normalisation of the Hungarian training and test corpora will be described in section 5, but first a presentation of the Hungarian language follows.


4 The Hungarian language

Hungarian, also called Magyar, is traditionally grouped together with the Ob-Ugric languages (Khanty and Mansi) in the Ugric branch of the Finno-Ugric languages within Uralic. Hungarian is the official language of the Republic of Hungary and has approximately fifteen million speakers, of which four million reside outside of Hungary. In this section a description of Hungarian phonology, morphology and syntax follows. The subsections are based on Benkő & Imre (1972), Rácz (1968), Olsson (1992) and Abondolo (1992).

4.1 Phonology

Hungarian has a rich system of vowels and consonants. The vowel inventory consists of 14 phonemes, among which one can distinguish 5 pairs of short and long counterparts: i - í, o - ó, ö - ő, u - ú, ü - ű. The remaining four are e - é and a - á. Short vowels, if they are marked, take an umlaut (¨), while long vowels are indicated by an acute accent (´) or by a double acute accent (˝), a diacritic unique to Hungarian. Long vowels are usually somewhat tenser than their short counterparts, with two exceptions: e is low while é is higher mid, and á is low whereas a is lower mid and slightly rounded (Abondolo, 1992). Vowel length is independent of prosodic factors such as stress.

The vowels may be interconnected through the laws of vowel harmony, which means that suffixes, which may assume two or three different forms, usually agree in backness with the last vowel of the stem. In other words, the front vs. back alternatives of suffixes are selected according to which vowel(s) the stem contains (Benkő & Imre, 1972). The vocalism of stems, as classified by Abondolo (1992), is inherently back for all stems containing at least one back vowel and for most verbs with the sole vowel i or í. For all other stems the vocalism is front. With regard to vowel harmony, i and í are neutral and can be used with either front or back vowels. Harmony causes the following alternations among suffix combinations: a/e (-ban/-ben 'in'), á/é (-nál/-nél 'at'), ó/ő (-ból/-ből 'from'), u/ü (-ul/-ül 'for, by') and o/e/ö (-hoz/-hez/-höz 'to'). The vowels may even show paradigmatic alternations, as long and short vowels (é vs. e and á vs. a, see example B) alternate in some stems (Benkő & Imre, 1972).

Example B

tehén

tehen-et

fa

fá-t

cow:NOM

cow-ACC

tree:NOM

tree-ACC

‘cow’

‘tree’
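The harmony-governed choice between suffix alternants is essentially deterministic. The sketch below selects the inessive alternant -ban/-ben from the last non-neutral vowel of the stem, following the description above; it is only an illustration, not part of the tagger or of the corpus processing used here, and it ignores lexical exceptions.

    BACK_VOWELS = set("aáoóuú")
    FRONT_VOWELS = set("eéöőüű")
    # i and í are neutral with respect to harmony and are skipped.

    def inessive(stem):
        # Pick -ban (back) or -ben (front) from the last non-neutral vowel.
        for ch in reversed(stem.lower()):
            if ch in BACK_VOWELS:
                return stem + "ban"
            if ch in FRONT_VOWELS:
                return stem + "ben"
        # All-neutral stems default to the front alternant in this sketch;
        # real Hungarian has lexical exceptions that are ignored here.
        return stem + "ben"

    print(inessive("ház"))    # házban  'in the house'
    print(inessive("kert"))   # kertben 'in the garden'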

When building a Hungarian corpus it is usual to delete the accent and the umlaut from the vowel and to mark vowel length as well as the umlaut by numerals following the vowel: 1 denotes the acute accent (e.g. ó -> o1), 2 the umlaut (ü -> u2) and 3 the double acute accent (e.g. ő -> o3). This notation can be useful when automatically deriving rules from a corpus because of the paradigmatic alternations of the long vs. short vowels (see Example B). There are 25 consonants in total, which can be classified according to the manner and the place of articulation, voicing and quantity. The consonants are shown in the table below.
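This numeral coding is a simple character mapping; a minimal sketch follows, where the mapping table covers the vowels discussed in this section and is otherwise an assumption about the convention.

    # Numeral coding of Hungarian vowels:
    # 1 = acute accent, 2 = umlaut, 3 = double acute accent.
    VOWEL_CODES = {
        "á": "a1", "é": "e1", "í": "i1", "ó": "o1", "ú": "u1",
        "ö": "o2", "ü": "u2",
        "ő": "o3", "ű": "u3",
    }

    def encode(text):
        # Replace each accented vowel by its base letter plus a numeral;
        # upper-case vowels keep an upper-case base letter.
        out = []
        for ch in text:
            coded = VOWEL_CODES.get(ch.lower())
            if coded is None:
                out.append(ch)
            elif ch.isupper():
                out.append(coded[0].upper() + coded[1])
            else:
                out.append(coded)
        return "".join(out)

    print(encode("keménysége"))  # keme1nyse1ge
    print(encode("tündöklő"))    # tu2ndo2klo3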


Table A. The Hungarian consonant chart, given with regular orthographic symbols. Phonetic values are given in square brackets; j and ly are pronounced alike.

                 labial/labiodental  dental/alveolar  palatal            velar    glottal
                 -voice  +voice      -voice  +voice   -voice   +voice    -voice +voice  -voice
    Stops        p       b           t       d        ty [t’]  gy [d’]   k      g
    Affricates                       c [ts]  dz       cs [č]   dzs [ǰ]
    Fricatives   f       v           sz [s]  z        s [š]    zs [ž]                   h
    Nasals               m                   n                 ny [n’]
    Laterals                                 l
    Tremulants                               r
    Glides                                            j, ly

Consonant length is distinctive and is independent of vowel length and of prosodic factors such as stress. Each consonant can be pronounced short or long, where the latter has almost double the length of a short consonant and is written by doubling the letter (gg) or the first element of a digraph (ggy) (Benkő & Imre, 1972). Many of the long consonants occur at morpheme boundaries or root-finally in foreign vocabulary (Abondolo, 1992). Assimilation is either full or partial and can be indicated by the orthography. In section 4.2 on part of speech features, the most important morpheme-specific assimilation rules will be presented.

4.2 Morphology

Hungarian is basically agglutinative, i.e. grammatical relations are expressed by means of affixes. To explain the function of the different affixes and how they interact, the following section gives an overview of these for the different parts of speech. The types of homography will also be described under the main categories. (In what follows, the PoS tags used in the Hungarian corpus are given within square brackets.)

4.2.1 Articles

The articles [DET] are invariable for number, person, gender and case. The indefinite article is egy, while the definite article has two forms, a and az, where the first is used before consonants and the latter before vowels, similarly to the English indefinite article. With regard to homography, the form of the definite article az can also be a demonstrative pronoun, and the indefinite article can be homonymous with the numeral ‘one’ (Pajzs, 1996).

4.2.2 Nouns

Every Hungarian noun [FN] may be analysed as a stem followed by three positions in which inflectional suffixes can occur. Thus, nouns are inflected for number, person (possessor) and case, with the relevant suffixes attached in that order (Abondolo, 1992). Any or all of the three inflectional suffix positions may be occupied by a zero suffix, which denotes either singular number (first position), absence of a possessor (second position) or nominative case (third position) (Abondolo, 1987). Thus we have:

Example C
    gyereke-Ø-m-en          gyereke-k-Ø-en       gyereke-i-m-Ø
    child-Ø-1POSS-SUPESS    child-PL-Ø-SUPESS    child-PL-1POSS-Ø
    ‘on my child’           ‘on children’        ‘my children’

There is no grammatical gender. The personal pronoun ő means both ‘he’ and ‘she’. When describing Hungarian, many authors, not without reason, ‘forget’ to mention the complicated system of noun stem alternation. Here, an outline of this system based on the summary by Abondolo (1987) is presented. Noun stems may end in either a consonant or a vowel. All stems with final a or e are lengthened (to á and é) before most suffixes, whether derivational or declensional.

Example D
    lámpa      => lámpá-m       kefe        => kefé-m
    lamp:NOM      lamp-1POSS    brush:NOM      brush-1POSS
    ‘lamp’        ‘my lamp’     ‘brush’        ‘my brush’

There also exists a very special stem form, called the ‘oblique stem’, which occurs with several nouns. This oblique stem differs from the nominative singular in that a stem-final a or e is present, a stem-penultimate o, ö or e is absent, or both. Example E illustrates this with fal ‘wall’, dal ‘song’, gyomor ‘stomach’, nyomor ‘misery’, sátor ‘tent’ and mámor ‘rapture’ (Abondolo, 1987).

Example E. Different nouns have different inflectional patterns based on their oblique stems

    Stem type                 Nominative   Oblique    Accusative
    present a                 fal          fala-      fala-t
    normal                    dal          dal-       dal-t
    absent o                  gyomor       gyomr-     gyomr-ot
    normal                    nyomor       nyomor-    nyomor-t
    absent o and present a    sátor        sátra-     sátra-t
    normal                    mámor        mámor-     mámor-t

There are words whose oblique stems not only have final a or e but also v instead of u, e.g. falu ‘village’, whose oblique stem is falva-, as in the form falva-k ‘villages’. Nominative stems with the long vowels á and é also change to short a and e in the penultimate position when becoming oblique stems, e.g. madár ‘bird’, oblique stem madara-, thus madara-k ‘birds’. This phenomenon may cause problems when automatically identifying stems.

4.2.2.1 Number

The category number is realised as singular and plural. There are two plural suffixes. The suffix -k is preceded by an epenthetic vowel after a consonant-final stem (Olsson, 1992).

Example F
                   SINGULAR   PLURAL
    ‘university’   egyetem    egyetemek
    ‘student’      diák       diákok

The other plural suffix is -i, which is used only when person suffixes are present, as in Examples C and G.


Example G
    gyereke-i-nk-et
    child-PL-1PL-ACC
    ‘our children’

4.2.2.2 Person/Possession

Possession is usually indicated with a personal suffix on the possessed noun. The forms vary for number and person, as shown in Appendix A under possessives. If there is a chain of possessors, the last possessor, closest to the head, takes a dative case marker -nak/-nek in addition to the possessive suffix. Consider the following example.

Example H
    az apá-m               barát-já-nak                a könyv-e
    the father-1SG.POSS    friend-3SG.POSS-DAT/GEN     the book-3SG.POSS
    ‘my father’s friend’s book’

The suffix -já- marks possession by apám ‘my father’ and -nak signals the pending possessive ending -e (Campbell, 1991). Nouns consisting of a stem with a possessive ending followed by a case suffix are in most cases homonymous. For example, the word fej+é+nek with the suffix -é- as a possessive suffix means ‘to the possession of his head’, while the same word with -é- as a paradigmatic alternation of the vowel e means ‘to his head’ (Pajzs, 1996).

4.2.2.3 Case

Hungarian has a complex case system involving 16 to 24 distinct forms which mark that an NP bears some identifiable grammatical or semantic relation to the rest of the sentence. The case suffixes may be classified into two groups, non-local and local. The non-local cases express primary syntactic or adverbial functions, such as subject, direct and indirect object, possessor or instrument. The local cases express concrete spatial and kinetic relations such as interior vs. exterior and stationary vs. moving (Abondolo, 1987). There are different assumptions about the exact number of case suffixes. I count 19; the names of the cases and the forms (allomorphs) of the suffixes are given in Table B with examples (ház ‘house’, öt ‘five’). Unfortunately, there is no space for explaining the function of each case, but hopefully the examples illustrate their functions. Case suffixes are the same both in singular and in plural. The plural suffix always precedes the case suffix, as the examples below show:

Example I

    a háza-k-ban          a háza-i-m-ban
    the house-PL-INESS    the house-PL-1SG-INESS
    ‘in the houses’       ‘in my houses’

Assimilation at the juncture takes place in the instrumental and translative cases, which may cause problems for automatic tagging systems. The initial -v- of the instrumental suffix -val/-vel and the translative suffix -vá/-vé is assimilated to a preceding consonant (Olsson, 1992).

Example J
                  ház ‘house’   öt ‘five’
    Instrumental  ház-zal       öt-tel
    Translative   ház-zá        öt-té

The case suffixes may also occur as stems and take personal suffixes when not attached to a noun. They are then usually regarded as pronouns or adverbs in traditional Hungarian grammar (see sections 4.2.3 and 4.2.8).

Table B. The Hungarian case system, listed with each allomorph of the case suffix and exemplified by the words ház ‘house’ and öt ‘five’

    Case [Tag]                  Suffixes               Example    Translation
    Nominative [NOM]            —                      ház        house
    Accusative [ACC] *          -t, -ot, -et, -öt      házat      house
    Dative-genitive [DAT] **    -nak, -nek             háznak     of the house
    Instrumental [INS]          -(V)al, -(V)el         házzal     with the house
    Essive-modal [SOC]          -stul, -stül           házastul   with the house and its parts
    Translative [FAC]           -(V)á, -(V)é           házzá      into a house
    Causal-final [CAU]          -ért                   házért     for the house
    Illative [ILL]              -ba, -be               házba      into the house
    Sublative [SUB]             -ra, -re               házra      onto the house
    Allative [ALL]              -hoz, -hez, -höz       házhoz     to the house
    Inessive [INE]              -ban, -ben             házban     in the house
    Superessive [SUP]           -n, -on, -en, -ön      házon      on the house
    Adessive [ADE]              -nál, -nél             háznál     at the house
    Elative [ELA]               -ból, -ből             házból     out of the house
    Delative [DEL]              -ról, -ről             házról     from (top of) the house
    Ablative [ABL]              -tól, -től             háztól     from (nearby) the house
    Terminative [TER]           -ig                    házig      as far as the house
    Formal [FOR]                -ként                  házként    as a house
    Temporal [TEM]              -kor                   ötkor      at five

    * The accusative case ending in certain constructions may be zero (-Ø-) if the object is a noun with a possessive personal ending, e.g. eladom a házam/házamat ‘I sell my house.’
    ** The reason for marking the genitive and the dative cases the same is that the dative may mark not only the indirect object but also the possessor.

4.2.3 Pronouns

The use of personal pronouns [NM] is not frequent in Hungarian because it is a pro-drop language. They basically have two cases: nominative and accusative. The singular forms in the nominative are én (‘I’), te (‘you’), and ő (‘he’, ‘she’). The third person plural in the nominative case can be derived from the corresponding singular by adding the plural suffix -k (the same as the plural suffix for nouns) to the singular stem, e.g. ő+k. For the other plural forms there is no such simple connection (mi ‘we’ and ti ‘you’). The personal pronouns in the first and second person singular accusative usually occur without the accusative suffix -t (engem ‘me’, téged ‘you’). The first and second person plural forms in the accusative are, on the other hand, constructed as nominative form + corresponding possessive suffix + accusative marker -t, i.e. the same markers as the possessive suffixes and the accusative for nouns, e.g. mi+nk+et ‘us’ and ti+tek+et ‘you’. The third person accusative can be derived as nominative + (plural ending) + accusative ending, thus ő+t ‘her/him’ and ő+k+et ‘them’. By adding the enclitic (personal) markers to the case endings of nouns, listed in Table B, oblique forms are made. These correspond to prepositional phrases in English, and are often regarded as pronouns in different cases, or as adverbs because of their adverbial function in the sentence. The enclitic markers with examples are shown in Table C.

Table C. The enclitic markers for Hungarian pronouns

    Person   Singular   Plural        Examples - SG        Examples - PL
    1        -m         -unk, -ünk    nek-em ‘to me’       nek-ünk ‘to us’
    2        -d         -tok, -tek    nek-ed ‘to you’      nek-tek ‘to you’
    3        -i, -e     -ik, -ük      nek-i ‘to him/her’   nek-ik ‘to them’

Reflexive pronouns [NM] consist of the word mag ‘pit, nucleus’ and a possessive personal suffix, listed in Appendix A. Besides the reflexive function they also have a non-reflexive function, emphasising the personal pronouns. In the reflexive function the accusative ending -at is ‘obligatory’, except in the first and second person, where the nominative form is common. Possessive pronouns [NM], which serve to express possession, have the following forms:

    enyém ‘mine’       miénk ‘ours’
    tied ‘yours’       tietek ‘yours’
    övé ‘his/hers’     övék ‘theirs’

Possessive pronouns cannot stand together with the noun head (the possessed entity). When the noun head is present a personal pronoun is used. Thus, the use of double markers on the pronoun and the noun head at the same time is not allowed in Hungarian.

Example K

    eny:ém          volt       a könyv     a(z)  (én)  könyv-em
    POSSPRON:1SG    COP:PAST   the book    the   I     book-POSS1SG
    ‘It was my book.’                      ‘my book’

The system of demonstrative pronouns [NM] consists of two categories: pronouns with a front vowel mean ‘near’, in contrast to pronouns with a back vowel, which mean ‘far’, e.g. ez/az ‘this/that’. Case endings are usually added to the pronouns and show both regressive (ez-nek => en-nek) and progressive (az-val => az-zal) assimilation. The demonstrative pronoun cannot take enclitic markers or possessive suffixes and takes the same position as the head nominal.

Example L

    ezek     mögött   a problémá-i-d       mögött
    this-PL  behind   the problem-PL-2SG   behind
    ‘behind these problems of yours’

The interrogative pronouns [KSZ] are based on the stems ki ‘who’, mi ‘what’ and hol ‘where’. Their derived forms take different case suffixes, as in the case of the demonstrative pronouns. Other pronouns, e.g. relative and indefinite pronouns [NM], are compounds of the interrogative pronouns. Relative pronouns are derived from interrogative pronouns by means of the prefix a-. They have personal and non-personal counterparts, singular and plural forms, and full declension. They follow the pattern of Hungarian nouns in case and number. The other types of pronouns can be described in a similar way, as below.

    Pronouns     Prefix           Pronominal stems
    Relative     a-               + ki ‘who’         + hol ‘where’
    Indefinite   vala-            + mi ‘what’        + hova ‘where to’
    Negative     se(n/m)-         + milyen ‘what kind’  + honnan ‘where from’
    Selective    bár- / akár-     + mely ‘which’     + mikor ‘when’
    General      minden-          + mennyi ‘how many’   + meddig ‘where to’

Each pronominal prefix can be added to the pronominal stems. Thus, the indefinite pronoun valami means ‘something’, valaki ‘somebody’, valahol ‘somewhere’, the selective pronoun bárki means ‘anyone’, bármi ‘anything’ and so on. The above-mentioned pronouns also take case and plural suffixes, e.g. valamiben ‘in something’.

4.2.4 Adjectives

Adjectives [MN] can be used as nouns and are then declined fully, i.e. plural and case endings can be added to them. There is also a special case used only by adjectives, the modal-essive, with the forms -en/-an/-on and -leg, e.g. gazdag-on ‘in a rich way’. The comparative is formed by adding a (harmonic vowel) + -bb to the stem. Some comparative forms are suppletive, such as sok ‘many’ vs. több ‘more’. Superlatives are formed by adding the prefix leg- to the comparative form, e.g. rossz ‘bad’, rossz+abb ‘worse’, leg+rossz+abb ‘worst’. In Hungarian, adjectives in attributive position precede their head nouns and do not agree with them. On the other hand, adjectives in predicative position agree in number with the subject.

Example M

    a szép          virág-ok     a virág-ok       szép-ek
    the beautiful   flower-PL    the flower-PL    beautiful-PL
    ‘the beautiful flowers’      ‘The flowers are beautiful.’

4.2.5 Numerals

Numerals [SZN] precede nouns and do not agree with them. When standing without the head noun they carry the same case endings as nouns, pronouns or adjectives, e.g. ötért ‘for five’.

4.2.6 Verbs

Hungarian verbs [IGE] may be analysed as a stem, followed by a tense/mood suffix, followed by a person-and-number suffix, see Example N. The morphological manifestations of the tense/mood and person/number system are interconnected and will be discussed together. In Appendix A, under the Verb [IGE] category, all allomorphs of the different tense/mood and person/number categories are listed. Below, just a few examples of different paradigms follow.

Example N

    ír-Ø-om             ír-t-am
    write-PRS-1SG:DEF   write-PST-1SG:DEF
    ‘I write’           ‘I wrote’

In verb conjugation the personal suffixes play a central role, because free pronominal subjects are only present when there is emphasis on the person, as in Example N above. Personal suffixes express first, second and third person not only on pronouns, nouns and postpositions, to mark personal relations, but also as verbal personal endings and on non-finite forms of the verb. The verbal paradigm is split into two conjugations according to the definiteness of the complement of the verb, the direct object. Thus, each person suffix refers not only to the person and number of the subject, but also to the person or the definiteness (though not the number) of the object. The object is definite if it is a proper name, a noun with a definite article, a noun with a personal ending, or a personal pronoun in the third person. Other pronouns in object position take verbs in the indefinite conjugation (Benkő & Imre, 1972). Thus, there are two first person singular suffixes in the non-past form of the verb ír ‘write’: -k is used with an indefinite direct object and -m is used with definite objects. Note that both suffixes also mark a first person singular subject.

Example O

    ír-ok            egy  könyv-et     ír-om          a    könyv-et
    write-1SGINDEF   a    book-ACC     write:1SGDEF   the  book-ACC
    ‘I write a book.’                  ‘I write the book.’

Hungarian basically distinguishes between two tense features: past and non-past. The present tense suffix is zero (-Ø-) while the past tense is marked by the -t- suffix. Note that -t also marks the accusative case on nouns, pronouns, etc.

Example P

    ír-Ø-ja             a    könyv-et    ír-t-a              a    könyv-et
    write-PRS-3SGDEF    the  book-ACC    write-PST-3SGDEF    the  book-ACC
    ‘He/she is writing the book.’        ‘He/she was writing the book.’

The future tense is made with the auxiliary fog with different personal endings + infinitive (-ni), as in fogok/fogom adni ‘I’ll give’, depending on whether the complement is indefinite or definite (Campbell, 1991). The present tense used with the verbal particle meg- can also express the future, e.g. megírom a könyvet ‘I’ll write the book’. There are three mood categories: indicative, subjunctive, which also functions as imperative, and conditional. In the indicative mood, as mentioned above, the present and past tenses are made by means of personal endings for every person and number, and for definite and indefinite complements. In the imperative/subjunctive mood the marker is -j. In verbs whose root ends in t or a sibilant, the j of the suffix assimilates and is phonetically realised as a long sibilant, which is sometimes indicated in the orthography: mász+j => mássz ‘Climb!’ and mos+ja => mossa ‘She washes it’ (Benkő & Imre, 1972). The conditional form is made by the marker -n- followed by a harmonic vowel, e.g. adnám ‘I’d give’ (definite object) and kérnék ‘I would ask’ (indefinite object) (Campbell, 1991). The rich marking system is complicated by allomorphic variation, where some of the allomorphs are not phonologically motivated but rather serve to avoid homonymy. For example, the first person singular past tense form lát-ta-m ‘I saw’ is used with an indefinite third person object, with no object and with a definite object. The reason for this, according to Olsson (1992), is that the indefinite form would otherwise be identical with the third person plural past tense with an indefinite object, lát-ta-k ‘they saw’, because of the personal suffix -k used with indefinite objects. Selection of the personal suffix is governed by phonological, lexical and tense/mood factors. There are over twenty distinct person suffixes because of the two conjugations and the avoidance of homonymy. There are also words which are homonymous between the past tense and the present tense of different verbs, such as vált ‘he became’ and ‘he changes’ (Pajzs, 1996). Furthermore, verbs in the third person singular are often homonymous with nouns in the nominative case, e.g. vár ‘he waits’ vs. ‘castle’. Sometimes even a conjugated form of a verb and an inflected form of a noun are homonymous, as in várnak ‘they wait’ vs. ‘to the castle’ (Pajzs, 1996).


4.2.6.1 Copula

In Hungarian, the copula [IGE] expresses that something exists. It can signal the existence of something, someone or a place, and also signifies time, weather, a material, an origin, a cause or a purpose. The copula even expresses possession, e.g. ‘to have something’, where many other languages have a transitive verb for this type of construction. The copula, like other verbs, is conjugated according to person/number, tense and mood. In the present indicative third person singular/plural the copula is realised as zero (Ø) if the predicate is nominal, e.g. a pronoun, noun or adjective, and the predicate expresses a profession, thing, state, quality, characteristic, etc. This phenomenon may cause problems in automatic tagging systems because there is no verb in the sentence and the elements may be taken for a single NP. If the predicate is an adverbial and signifies place, time, purpose, etc. or expresses possession, the form in the third person singular is van and in the third person plural vannak. The negation of the singular form van is nincs ‘is not’ and of the plural form vannak is nincsenek ‘are not’. All other forms of the copula are negated by a preposed nem ‘no/not’, e.g. nem vagyok ‘I am not’.

4.2.6.2 Infinitive

Hungarian infinitives [INF], unlike those of most European languages, may be inflected for person. The reason for this is functional, because often there is no other element in the sentence to mark the person. The suffixes are almost identical to the nominal paradigm, except in the third person, both singular and plural, where there is an i instead of the epenthetic j (see Appendix A).

4.2.6.3 Participles

The present, past and ‘future’ suffixes of the participles [MN] are -ó/-ő, -t/-tt, and -andó/-endő, respectively. The suffixes are added to the verb stem and follow the rules of vowel harmony. The present participle form (-ó/-ő) is quite productive and is often used as a noun (Campbell, 1991).

Example Q

    a dolgoz-ó ember      a dolgozó-k
    the work-PART man     the worker-PL
    ‘the working man’     ‘the workers’

The main function of the past participle is to express an antecedent action and the states which result from it. Syntactically, it often behaves as an adjective. Its form is identical with the third person singular past indefinite verb form (-t). The participles are not used as predicates. Instead, there is a structure consisting of the copula and the verbal adverb with the suffix -va/-ve. The difference between participles and verbal adverbs is that the latter are more closely connected with the finite verb of the sentence in time, state or mood than the participle is. Participles modify the noun head while verbal adverbs modify the verb. The use of the verbal adverb is more limited than the use of the participle. The following examples show the difference between them.


Example R
    a szobá-ban      ül-ő          gyerek-ek   játsza-nak
    the room-INESS   sit-PRSPART   child-PL    play-3PL
    ‘The children sitting in the room are playing.’

    a gyerek-ek    a szobá-ban      ül-ve         játsza-nak
    the child-PL   the room-INESS   sit-VERBADV   play-3PL
    ‘The children are playing sitting in the room.’

4.2.6.4 Verbal particles

Hungarian has a very rich system of verbal particles or verb prefixes [IK], which are separable from the stem (Campbell, 1991). They serve to mark direction (le ‘down’, ki ‘out’, etc.) and aspect (meg ‘completed’), and to make verbs transitive. They can be combined with many verbs. In many cases the verb has a different meaning depending on which particle is attached to it, e.g. átad ‘hand over, pass’, elad ‘sell’, etc. Verbal particles may occupy two positions, depending on the emphasis within the sentence. In a neutral sentence or in yes/no questions the particle is a prefix attached to the verb.

Example S

    Péter  ki-megy      a szobá-ból ./?
    Peter  out-go:3SG   the room-ELAT

‘Peter leaves the room. / Does Peter leave the room?’ The particle follows the verb in questions, in negatives and when any part of the sentence is emphasised. Example T

    Péter  nem  megy     ki   a szobá-ból ./?
    Peter  not  go:3SG   out  the room-ELAT

‘Peter doesn’t leave the room. / Doesn’t Peter leave the room?’

Note that in Examples S and T the local relation is marked twice within each structure: once with the verbal prefix and once with the case (elative) of the noun.

4.2.7 Postpositions

Postpositions [NU] follow the head they refer to and express principally local relations, showing a three-way opposition for motion relative to the speaker or other referents. Postpositions may also convey temporal or abstract meaning, e.g. után ‘after’ or ellen ‘against’. They are reduplicated with demonstratives (DEM), as shown in the example below.

Example U

    az   alatt   az  asztal  alatt
    DEM  under   the table   under
    ‘under that table’

The postpositions, like case markers, may occur as stems and take possessive endings (see Appendix A), and in this form they are considered adverbs according to traditional Hungarian grammar. This can be problematic for an automatic tagging system, because the stem of this type of adverb is a postposition and hence will be annotated as a postposition [NU] rather than an adverb [HA].

4.2.8 Adverbs

One type of adverb [HA] has the form case_marker/postposition + personal ending. In some corpora, words consisting of case_marker + personal ending are considered pronouns, while in others they may be considered adverbs, as was mentioned in section 4.2.3. There are also adverbs which are derived from verbal particles expressing local relations, though without possessive endings, such as be ‘in’ vs. bent ‘inside’, ki ‘out’ vs. kint ‘outside’. The local relation is marked twice within a structure: on the adverb and on the case marker of the noun.

Example V

    Bent    van       a szobá-ban.
    inside  COP:3SG   the room-INESS
    ‘He/she/it is inside the room.’

4.2.9 Word formation

This section gives a brief overview of the composition and derivation of words in Hungarian, which can cause problems when automatically analysing texts. This part is based on Benkő & Imre (1972:145-156), where the interested reader may find more information and examples on this topic.

4.2.9.1 Word composition

Compound nouns are very frequent in Hungarian. Two nouns can simply be combined without any formal means, e.g. kávé+ház ‘coffee’+‘house’. Another type of compound noun is one where the first constituent is a participle, e.g. mosó+nő ‘washing’+‘woman’. The participle can also occur as the second member, but then the word is already substantivised and considered a noun, e.g. adó+szedő ‘tax’+‘collector’. These types of words may be confused with adjectives by an automatic tagging system because of their endings, typical of adjectives. Compound adjectives consist of a noun + adjective. The compound may express a similarity, e.g. jég+hideg ‘ice’+‘cold’, or the adjective member may limit a certain range of meaning, e.g. adó+mentes ‘tax’+‘free’. There are also compounds which consist of i) a noun with an adverbial ending or an adverb + noun, ii) a noun with an adverbial ending + non-finite verb form, iii) a noun with a possessive suffix + participle, and iv) a noun with an adverbial ending or an adverb + verb, etc. These may also cause problems in automatic morphological analysis.

4.2.9.2 Derivation of words

Unlike most European languages, Hungarian has a very regular system of derivational suffixes, where a single suffix corresponds to a distinct meaning and its use is regular. One suffix is tied to each important suffix function. Unfortunately, there is no space for describing the whole system with each derivational suffix, so only a few examples of those suffixes which change the category of a word are given.


Derivational suffixes which change the part of speech of a word are very common and productive. Verbs, nouns, adjectives and even adverbs can be further derived. Deverbal verb suffixes, for example, express frequentative-iterative, causative and reflexive meaning. Thus, the suffix -hat/-het means ‘can, is capable’ and ‘may, is possible’, e.g. kime-het ‘he/she may go out’. Denominal verb suffixes are -l, -z, -kodik/-kedik, -lkodik/-lkedik, and -skodik/-skedik, e.g. szolga ‘servant’ => szolgá-l ‘serve’, társ ‘fellow’ => társa-lkodik ‘converse’. Deadjectival verb suffixes are also very productive, such as the suffix -kodik, which coincides with the denominal verb suffixes, e.g. ügyes ‘skilful’ => ügyes-kedik ‘behave skilfully’. Deverbal noun suffixes are also very common, especially the suffix -ás/-és, as in temet ‘bury’ => temet-és ‘burial, funeral’, and the suffix -at/-et, as in talál ‘find, discover, hit’ => talál-at ‘hit, win’. The suffix -ó/-ő for the name of a profession, which is identical with the participial suffix, also belongs to this category. The deadjectival noun suffix -ság/-ség is a very productive one, e.g. alázatos ‘humble’ => alázatos-ság ‘humbleness’. Adjectives may also be derived from verbs and nouns. Deverbal adjective suffixes are, among others, -tlan/-tlen, e.g. árt ‘harm’ => árt-atlan ‘harmless’, and -ós/-ős, e.g. nyúl ‘stretch’ => nyúl-ós ‘stretchy’. Some examples of denominal adjective suffixes are -s, -talan/-tlan, -i, -ú/-ű, -jú/-jű. Note that several derivational suffixes may follow each other within one form, e.g. le-ír-hat-atlan ‘indescribable’.

4.3 Syntax

In the area of Hungarian syntax, the word order, the main types of sentence structure, and agreement will be briefly described.

4.3.1 Word order

In most of the literature, Hungarian is described as a basically free word order language. This fairly free word order holds at the sentence level, however, where the order of the major constituents is free only with respect to grammatical functions and their cases. The word order is pragmatically oriented, with a special position for focused or emphasised constituents before the finite verb. The basic order of the sentence constituents is topic + focus + finite verb + any other items (Abondolo, 1992). Both topicalised and focused elements receive sentence stress. If a sentence contains both, then the focused element is more prominent. Example W shows some possible word orders for the sentence Péter vesz egy könyvet a boltban ‘Peter buys a book at the store’ (glossed in note 16 below). The constituents of the sentence (the subject, the verb, the direct object and the adverbial) can be ordered in 24 ways (4!).

Example W
    Péter vesz egy könyvet a boltban.
    Péter vesz a boltban egy könyvet.
    Péter egy könyvet vesz a boltban.
    Péter egy könyvet a boltban vesz.
    Péter a boltban egy könyvet vesz.
    Péter a boltban vesz egy könyvet.
    …

    Note 16:
    Péter  vesz  egy  könyvet   a    boltban
    Peter  buys  a    book      the  store:INESS
    ‘Peter buys a book at the store’

However, the free word order description is inadequate at the phrase level, and as soon as interrogative pronouns and/or negated complements are involved. Below, just a few examples are given. There is a strict word order within the NP. Determiners and demonstrative pronouns always precede the noun head; in Example W, for instance, the determiner a precedes the noun head könyvet ‘book:ACC’. Qualifiers, like adjectives and participles, occur within an NP after the determiners and demonstratives but precede the noun, as in the example below.

Example X

    az-ok    az   első   csodálatos   karibiai    nyári    nap-ok
    that-PL  the  first  wonderful    Caribbean   summer   day-PL
    ‘those first wonderful Caribbean summer days’

Furthermore, qualifiers may themselves be qualified by preceding adverbial complements.

Example Y

    a zöld      ház-nál        dohányz-ó       fiatal   nyelvész-ek
    the green   house-ADESS    smoke-PRSPART   young    linguist-PL
    ‘the young linguists smoking at the green house’

Since Hungarian shares some typological characteristics of SOV languages (e.g. it is postpositional, the attribute precedes the noun, etc.), it is often described as an SOV language. On the other hand, some researchers maintain that Hungarian is partly SOV, partly SVO. However, these word orders are canonical and only represent the dominant order of simple declarative sentences containing a nominal subject and a nominal object. In the next section, the main types of sentence structure with canonical word order are given. Note that the order of the constituents can be changed according to the information structure.

4.3.2 The main types of sentence structure

The rules for the most basic syntactic forms of sentence structure are taken from Benkő & Imre (1972:86-87) and given below. Grammatical categories which begin with a capital letter represent stem and affix and can be expanded by a subclass with the same name.

1. [Noun]subject + [Adjective + auxiliary_verb]predicate where the auxiliary may be zero.
2. [Noun]subject + [Noun + auxiliary_verb]predicate where the auxiliary may be zero.
3. [Noun]subject + [Copula]predicate
4. [Noun]subject + [noun:case + Copula]predicate
5. [noun:dative + Copula]predicate + [noun:possessive_suffix]subject
6. [Noun]subject + [Verb]predicate
7. [Noun]subject + [Verb + noun:case]predicate
8. [Noun]subject + [Verb + noun:case + noun:case]predicate
9. [Noun]subject + [Verb + Noun + postposition]predicate

[Noun]
1. article + noun_stem:base_suffix:case_suffix where the article may be zero and base_suffix is either a single plural suffix or a possessive personal ending (including the special plural suffix).
2. article + noun_stem:base_suffix where the article may be zero and base_suffix is either a single plural suffix or a possessive personal ending (including the special plural suffix).


3. article + noun_stem:case_suffix where the article may be zero.
4. noun_stem
5. article + noun_stem:base_suffix + postposition where the article may be zero and base_suffix is either a plural suffix or a possessive personal ending (including the special plural suffix).
6. article + noun_stem + postposition where the article may be zero.

[Verb]
1. verb_stem:suffix where suffix is a personal ending, a suffix of time and/or a suffix of mood.
2. verb_stem
3. preverb + verb_stem:suffix where suffix is a personal ending, a suffix of time and/or a suffix of mood.
4. preverb + verb_stem
5. verb_stem:suffix + auxiliary_verb where suffix is a personal ending, a suffix of time and/or a suffix of mood.
6. verb_stem + auxiliary_verb

Additionally, Hungarian is a pro-drop language, which means that the subject position of the verb can be left empty. The subject is implicit, as the personal endings of the verb express the first and the second person, and the third person if the context makes clear who or what the subject is.

4.3.3 Agreement

Generally, syntagmatic relations are marked, if possible, on both members of the construction, which results in redundancy. The following rules illustrate the agreement patterns (Benkő & Imre, 1972).

1. The congruence of noun and substantival pronoun as genitive attribute and qualified noun.

Example Z

    a    te   könyv-ed          a fiú(-nak a)        könyv-e
    DEM  you  book:2SGPOSS      the boy(-DAT DET)    book-3SGPOSS
    ‘your book’                 ‘the boy’s book’

2. The demonstrative pronominal dependent ez/az ‘this/that’ agrees in case and in number with the noun which it qualifies. Example AA

    eb-ből          a könyv-ből      azo-k-at           a könyv-ek-et
    DEM.PRON-ELAT   the book-ELAT    DEM.PRON-PL-ACC    the book-PL-ACC
    ‘from this book’                 ‘those books’

3. Adjectival or numerical attributes are, however, not in congruence with the modified word.

Example BB


    nagy  város-ok     öt    város
    big   city-PL      five  city:SG
    ‘big cities’       ‘five cities’

4. As we have seen in section 4.2.6, the verb agrees with the subject in number and person, and even with the object, so-called object agreement: the relationship between the person of the subject and the person of the object is marked.

Example CC

    Péter  ír-Ø             egy  könyv-et.     Péter  ír-ja           a    könyv-et.
    Peter  write-3SGINDEF   a    book-ACC      Peter  write-3SGDEF    the  book-ACC
    ‘Peter writes a book.’                     ‘Peter is writing the book.’

5. Agreement in number between subject and nominal predicate with zero copula. Example DD

    A diák-ok        szorgalmas-ak.
    the student-PL   diligent-PL
    ‘The students are diligent.’

6. The personal endings of the infinitive refer to that part of the sentence which is in the dative and which is the logical subject.

Example EE

    Péter-nek  kell  tanulni-a         (Én)  nek-em   kell  tanuln-om
    Peter-DAT  must  study-INF3SG      (I)   me-1SG   must  study-INF1SG
    ‘Peter must study.’                ‘I must study.’

7. Verbal particles often agree with the case suffix of the nominal complement of the verb. Example FF

    Rá-megy-ek    a    fű-re.
    PART-go-1SG   the  grass-SUB
    ‘I step onto the grass.’

In the following section, the Hungarian corpora which have been utilised for training and/or testing Brill’s tagger, and their tagset, will be presented.


5 The Hungarian corpora

Two different corpora, both of them opportunistic in the sense that only these were available, were used for training and for testing Brill’s tagger. They were collected and annotated by the Research Institute for Linguistics at the Hungarian Academy of Sciences. The Hungarian corpus used for training the tagger is the novel 1984, written by George Orwell. It consists of 14034 sentences: 99860 tokens including punctuation marks, 80668 words excluding punctuation marks. The corpus used for testing the tagger consisted of two texts that were extracted from the large Hungarian ‘hand’ corpus: a poem, ‘Az én erdőm’ (‘My forest’), written by István Ágh in the 1970s, and a fairy tale, ‘A három hűséges királyleány’ (‘The three faithful princesses’), written by Béla Balázs in 1948. They are modern literary pieces, both without archaic words. The poem consists of 397 tokens and the fairy tale of 2110 tokens, both including punctuation marks. Both corpora, i.e. the test and the training corpus, had to be converted into the format that the tagger requires. Thus, the texts were tokenised and altered to one sentence per line before being used with the tagger. The normalisation of the corpora was done with PCBeta, a production system consisting of actions controlled by a set of rules which are created by the user (Brodda, forthcoming). The rules of the normalisation of the corpora are shown in Appendix D. The training corpus (Orwell’s 1984), as it was obtained from the Hungarian Academy of Sciences, had the following form before normalisation:

______________________________________________________________________
< /p >
%Winston
majdnem[HA]
egy[DET]
tea1scse1sze1nyi[MN]+t[ACC]
o2nt[IGE]+o2tt[Me3] #5.3
maga[NM]=maga1+nak[DAT]
,
o2sszeszed[IGE]+te[TMe3]
minden[NM]
lelkiero3[FN]=lelkiere+je1[PSe3]+t[ACC]
,
s[KOT]
lenyel[IGE]+te[TMe3] #5.1
,
@mint[KOT]
egy[DET]
adag[FN]
orvossa1g[FN]+ot[ACC]
.
______________________________________________________________________
Figure C. An extract from Orwell’s 1984 in the originally annotated version, before normalisation

Each word is written on a new line. In the simplest case, the analysis contains the word token followed by the PoS tag within square brackets ([]). A more complicated analysis appears for an inflected word. In this case, the lemma of the token with its PoS tag is given first, followed by one or several inflectional ending(s) with its subtag(s), each preceded by a plus sign (+). If a word begins with a capital letter, the lemma with an initial lower-case letter, followed by the PoS tag in brackets, is given; then the lemma with the initial capital letter follows, preceded by an equals sign (=). If the lemma and the stem of the token are not the same, the lemma with its PoS tag is given first, followed by an equals sign; then the stem of the token is given, followed by the inflectional ending(s) with their subtags in square brackets. Each inflectional ending is preceded by a plus sign. Note that part of speech tags and tags for inflectional categories are always enclosed in square brackets ([]).

Long vowels and vowels with an umlaut are followed by a numeral: 1 indicates that the vowel is marked with an acute accent, 2 that the vowel is marked with an umlaut and 3 that the vowel is marked with a double acute accent; see also section 4.1. Punctuation marks are not annotated at all: minor delimiters such as commas are followed by two empty lines and major delimiters (e.g. full stops) by six empty lines. Headings are marked within arrowhead brackets, where the heading expression is divided into its components, represented on separate lines preceded by a slash (/) and/or a percent sign. A line may begin with several marks: e.g. a percent sign (%) may indicate that the token is not analysed (for example, foreign proper names are not annotated at all), the ‘at’ sign (@) indicates that there are several analyses, i.e. the word is ambiguous, etc. Sometimes a number preceded by -# or # is given after the analysis; this is the identification number of a disambiguation rule used in the automatic analysis of the corpus.

Forty-five rule files were written (see Appendix D) for normalising the original corpus to the form that Brill’s tagger requires. Basically, the rules eliminated items such as (-)#Number, headings, empty lines, alternative analyses and lemmas. Furthermore, the rules annotated punctuation marks and foreign words which were not annotated, and converted the input to the format Word/Tag1_Tag2_TagN (or just Word/Tag) with one sentence per line. Thus, the output which constitutes the training corpus for PoS tags and for tags denoting inflectional properties (subtags) looks as shown in Figure D.

______________________________________________________________________
Winston/FN majdnem/HA egy/DET tea1scse1sze1nyit/MN_ACC o2nto2tt/IGE_Me3
maga1nak/NM_DAT ,/, o2sszeszedte/IGE_TMe3 minden/NM lelkiereje1t/FN_PSe3_ACC
,/, s/KOT lenyelte/IGE_TMe3 ,/, mint/KOT egy/DET adag/FN orvossa1got/FN_ACC ./.
______________________________________________________________________
Figure D. An extract from Orwell’s 1984 after normalisation with PoS and subtags
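The core of this conversion can be pictured as follows. The sketch handles one token analysis of the original vertical format (Figure C) and produces the Word/Tag1_Tag2 format of Figure D; it is a simplified illustration based on the format description above, not the actual PCBeta rule set, and punctuation and unannotated foreign words would be handled by separate rules.

    import re

    def convert(analysis):
        # One token analysis of the original format, e.g.
        #   'o2nt[IGE]+o2tt[Me3] #5.3'                -> 'o2nto2tt/IGE_Me3'
        #   'maga[NM]=maga1+nak[DAT]'                 -> 'maga1nak/NM_DAT'
        #   'lelkiero3[FN]=lelkiere+je1[PSe3]+t[ACC]' -> 'lelkiereje1t/FN_PSe3_ACC'
        analysis = re.sub(r"\s*-?#\S+$", "", analysis)  # drop rule-id numbers
        analysis = analysis.lstrip("@%")                # drop ambiguity/unanalysed marks
        tags = re.findall(r"\[([^\]]+)\]", analysis)    # collect PoS tag and subtags
        # If a lemma=stem pair is given, the surface form is built from the stem part.
        surface = analysis.split("=", 1)[-1]
        word = re.sub(r"\[[^\]]+\]", "", surface).replace("+", "")
        return word + "/" + "_".join(tags)

    print(convert("o2nt[IGE]+o2tt[Me3] #5.3"))
    print(convert("maga[NM]=maga1+nak[DAT]"))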

For training the tagger on text annotated only with part of speech tags, the tags denoting inflectional properties were removed, as shown in Figure E.

______________________________________________________________________
Winston/FN majdnem/HA egy/DET tea1scse1sze1nyit/MN o2nto2tt/IGE maga1nak/NM
,/, o2sszeszedte/IGE minden/NM lelkiereje1t/FN ,/, s/KOT lenyelte/IGE ,/,
mint/KOT egy/DET adag/FN orvossa1got/FN ./.
______________________________________________________________________
Figure E. An extract from Orwell’s 1984 after normalisation with only PoS tags
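Removing the subtags amounts to keeping only the first tag of each Word/Tag1_Tag2_TagN token; a minimal sketch:

    def strip_subtags(line):
        # 'lelkiereje1t/FN_PSe3_ACC' -> 'lelkiereje1t/FN'; the tag is found
        # after the last slash, so tokens such as './.' are left intact.
        out = []
        for token in line.split():
            word, tag = token.rsplit("/", 1)
            out.append(word + "/" + tag.split("_")[0])
        return " ".join(out)

    print(strip_subtags("Winston/FN tea1scse1sze1nyit/MN_ACC o2nto2tt/IGE_Me3 ./."))
    # Winston/FN tea1scse1sze1nyit/MN o2nto2tt/IGE ./.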

In the test corpus, as it was delivered, each word is written on a new line, in conformity with the training corpus. But here every line consists of four columns, where the first one contains a number or a heading mark (#), and the second one gives information about whether the word is a token or a punctuation mark. The third column consists of the token, followed by a special tag (if there is any) marking whether the word begins with a capital letter or marking the type of punctuation mark. The last column gives the lemma and the PoS and subtags, separated by a backslash (\). Foreign proper names are preceded by #% and are not annotated. Extracts from the test corpus, from the poem and from the fairy tale, are shown in Figure F and Figure G, respectively.

______________________________________________________________________
#2000004009A1GH ISTVA1N:AZ E1N ERDO3M 0031
##/#
1 TOK   Ko2zel BOS     Ko2zel\[HA]
1 TOK   hu1zo1dik      hu1zo1dik\[IGE][e3]
1 TOK   e1letemhez     e1let\[FN][PSe1][ALL]
1 PUNCT ,              ,\WPUNCT
1 TOK   s              s\[KOT]
1 TOK   nem            nem\[HA]
1 TOK   tudom          tud\[IGE][Te1]
1 PUNCT ,              ,\WPUNCT
#/
1 TOK   mekkora        mekkora\[NM]
1 PUNCT ,              ,\WPUNCT
1 TOK   hol            hol\[KSZ]
1 TOK   ve1gzo3dik     ve1gzo3dik\[IGE][e3]
1 PUNCT ,              ,\WPUNCT
1 TOK   fal            fal\[FN]
1 TOK   nekem          nekem\[NM][PSe1]
1 TOK   o3             o3\[NM]
1 PUNCT ,              ,\WPUNCT
1 TOK   e1lo3          e1lo3\[MN]
#/
1 TOK   panaszfalom    panaszfal\[FN][PSe1]
1 PUNCT . EOS          .\SPUNCT
______________________________________________________________________
Figure F. An extract from text 2000004009 A1GH ISTVA1N: AZ E1N ERDO3M in the hand corpus in the originally annotated version

______________________________________________________________________
#2000019002BALA1ZS BE1LA:A HA1ROM HU3SE1GES KIRA1LYLEA1NY 0013
##/#
1 TOK   E1lt BOS           E1l\[FN][ACC]
1 TOK   egyszer            egyszer\[HA]
#%Kandrapura
1 TOK   va1rosa1ban        va1ros\[FN][PSe3][INE]
1 TOK   egy                egy\[DET]
#%Suryakanta
1 TOK   nevu3              nevu3\[MN]
1 TOK   hatalmas           hatalmas\[MN]
1 TOK   kira1ly            kira1ly\[FN]
1 PUNCT ,                  ,\WPUNCT
1 TOK   aki                aki\[NM]
1 TOK   ha1rom             ha1rom\[SZN]
1 TOK   fo3tulajdonsa1ga   fo3tulajdonsa1g\[FN][PSe3]
1 TOK   a1ltal             a1ltal\[NU]
1 TOK   olyan              olyan\[NM]
1 TOK   volt               van\[IGE][Me3]
1 TOK   a                  a\[DET]
1 TOK   fe1rfiak           fe1rfi\[FN][PL]
1 TOK   ko2zo2tt           ko2zo2tt\[NU]
1 TOK   mint               mint\[KOT]
1 TOK   a                  a\[DET]
1 TOK   ko2vek             ko3\[FN][PL]
1 TOK   ko2zt              ko2zt\[NU]
1 TOK   az                 az\[DET]
#%Adamanth
1 TOK   tu2ndo2klo3        tu2ndo2klo3\[MN]
1 PUNCT ,                  ,\WPUNCT
1 TOK   tiszta             tiszta\[MN]
1 TOK   e1s                e1s\[KOT]
1 TOK   keme1ny            keme1ny\[MN]
1 PUNCT . EOS              .\SPUNCT
______________________________________________________________________
Figure G. An extract from text 2000019002 BALA1ZS BE1LA: A HA1ROM HU3SE1GES KIRA1LYLEA1NY in the hand corpus in the originally annotated version

In contrast to the training corpus, the normalisation of the test corpus was considerably easier and faster because of the format and the size of the test corpus. With the help of 14 rules (compared to 45 rules in the case of the training corpus) it was possible to convert the annotated text to the annotated format which Brill’s tagger requires. Mainly, the rules eliminated headings and lemmas, and annotated foreign words, punctuation marks, etc.; see Appendix D. For testing the texts on the tagger, the tags (both PoS and subtags) were removed from the tokens. Extracts from the results are shown in Figure H and Figure I below.

______________________________________________________________________
Ko2zel hu1zo1dik e1letemhez , s nem tudom , mekkora , hol ve1gzo3dik ,
fal nekem o3 , e1lo3 panaszfalom .
______________________________________________________________________
Figure H. An extract from the poem after normalisation, which provides the input for testing Brill’s tagger

______________________________________________________________________ E1lt egyszer Kandrapura va1rosa1ban egy Suryakanta nevu3 hatalmas kira1ly , aki ha1rom fo3tulajdonsa1ga a1ltal olyan volt a fe1rfiak ko2zo2tt mint a ko2vek ko2zt az Adamanth : tu2ndo2klo3 , tiszta e1s keme1ny . ______________________________________________________________________ Figure I. An extract from the fairy tale after normalisation which provides the input for testing Brill’s tagger

In spite of the fact that the tagset of the ‘Orwell’ and the ‘hand’ corpus is the same (see Appendix A), the principles of annotation used in the two corpora differ in some respects. Some categories belong to different levels of annotation. For example, in the ‘hand’ corpus the infinitive verb is annotated with \[IGE][INF] ([Verb][Infinitive]), i.e. [IGE] on the top level and [INF] as a subtag, while in the ‘Orwell’ corpus the same item is annotated only as [INF] on the top level. Another difference is that a word consisting of case_ending + personal_ending is tagged as a pronoun [NM] in the ‘Orwell’ corpus, while the same type of word in the ‘hand’ corpus is tagged as an adverb [HA], sometimes with inflectional tags, sometimes without. Both alternatives are theoretically correct, because these types of words behave morphologically as pronouns but syntactically as adverbs. Without guidelines it is difficult to decide which annotation is correct and which is not in a certain context. At the beginning of this work, the plan was to use the entire correctly annotated ‘hand’ corpus as the test corpus (including the poem and the fairy tale) and to automatically compare it with the test corpus annotated by the tagger. Because of the differences and inconsistencies in the annotation of the training and test corpora this was not possible.


Thus, the largest difficulty with the normalisation has been that information about how the annotation was carried out was missing. It would be helpful for future users to be provided with the guidelines on which the annotation is based. It would also be necessary to suggest principles for the encoding of texts in the same format; the annotation of the ‘hand’ corpus is preferable. The training corpus (Orwell’s 1984) was tagged by a combination of automatic methods (hence the need for disambiguation rules) and manual tagging, though specific information about how this was done is missing from the documentation. Moreover, it is necessary to further develop sets of coding conventions suited for various applications and to maintain compatibility with existing standards as far as possible.


6 Results and evaluation of system efficiency

The tagger has been trained on the same text twice: once with only PoS tags and once with PoS and subtags marking the inflectional properties of the given words. Both times, the correctly annotated and normalised training corpus was randomly divided into two parts of equal size, using a program provided with the tagger. The first half of the manually annotated corpus was used in the lexical learning module, while the second half was used in the contextual learning module. Before the lexical learner could be invoked, three files had to be created: bigwordlist, smallwordtaglist and bigbigramlist (see section 3.2). This was done with the software provided with the tagger. The learning module for lexical rules requires these three files and a threshold value. The threshold value was set to 300, meaning that the learner only used bigram contexts, i.e. the neighbours of the current word, among the 300 most frequent words. The default tag for unknown words with an initial capital letter was NNP, and for other unknown words NN. Note that neither NNP nor NN exists in the Hungarian tagset, although these default tags can be changed in the program to the most frequent tag in the Hungarian corpus (FN). The lexical rules are listed in Appendix B.

The next stage was learning contextual rules for improving the tagging accuracy after tagging all words with their most likely tag. That learning process requires (as was mentioned in section 3.3) several files which had to be created: traininglexicon, lexicallytaggedcorpus and manuallytaggedcorpus. The output of the contextual learner, i.e. the contextual rules, is listed in Appendix C. The total number of lexical and contextual rules with and without subtags is given in Table D below. The lexical learner derived 326 rules based on the 31 PoS tags, while it derived 457 rules based on the much larger tagset consisting of 452 PoS and subtag combinations. Note that if the tagset consists of a large number of frequently occurring tags, the lexical learner necessarily generates more rules, simply to be able to produce all these tags. On the other hand, if only PoS tags (excluding subtags) are used, the first rules score very high in comparison with the scores of the first rules based on PoS and subtags, see Appendix B. Another difference is that the scores decrease faster in the beginning and slower in the end, compared to the rules based on PoS and subtags, resulting in a large number of rules relative to the size of the tagset.

The contextual learner, used to improve the accuracy, derived approximately three times more rules based on the 31 PoS tags than it derived from the text annotated with both PoS and subtags. This is somewhat harder to interpret, since the output of the contextual learner does not contain scores (although there is an internal scoring function, see section 3.3). It seems reasonable that the contextual rule learner more easily finds ‘globally good’ rules, i.e. rules that are better in the long run, since the subtags contain important extra information, for instance about agreement. The conclusion that can be drawn from these facts, and from the fact that the test on the training corpus (the entire Orwell’s 1984, see section 6.1) achieved slightly higher precision using subtags, is that it is probably more difficult to derive information from words which are annotated with only PoS tags than from words whose tags include information about the inflectional categories.

Table D. The total number of lexical and contextual rules for PoS tags and for PoS with subtags

    Tagset                  Lexical rules   Contextual rules   Total
    31 PoS tags             326             565                891
    452 PoS with subtags    457             180                637

The training process (note 17) for PoS tags took 18 hours for the lexical rules and about twenty-four hours for the contextual rules. The lexical training process for PoS and subtags took one week, while the contextual learning process took at least twenty-four hours, possibly much more (the time was not recorded). The main reason for this difference in time between PoS tags alone and PoS with subtags is that the number of permissible rules in the rule generating process tends to increase with the number of tags.

For evaluating the system efficiency, the tagger has been tested on different types of texts, i.e. on the training corpus and on other texts. All test corpora were annotated both with PoS tags and with PoS together with subtags denoting the part of speech and the inflectional properties of the word. Thus, the performance of the whole system has been evaluated against a total of three texts from different domains, which were presented in section 5. The texts were all literary works: one poem (‘Az én erdőm’, written by István Ágh), one fairy tale (‘A három hűséges királyleány’ by Béla Balázs) and Orwell’s 1984, which constituted the training corpus. The analysis of the poem and the fairy tale was manually checked, while the analysis of Orwell’s 1984 was checked automatically using the diff function in UNIX, which compares two texts for differences. Precision was calculated for the entire texts, both for PoS tags and for PoS with subtags, based on all the tags and on the individual part of speech tags. Recall was only calculated for the individual part of speech tags for the poem and the fairy tale. Note that recall is not relevant when considering all the tags, but only when considering specific part of speech categories. Recall of PoS with subtags was not calculated for the poem and the fairy tale, since the frequencies of the individual tags were very low. Remember that the precision for a specific PoS tag T is obtained by

    precision(T) = total number of correctly machine-tagged words with tag T /
                   total number of machine-tagged words with tag T

and the recall by

    recall(T) = total number of correctly machine-tagged words with tag T /
                total number of words with correct tag T

(see also section 2.3). In the following sections the results will be described for each text tested on the tagger. In Appendix E the output of the poem and the fairy tale after testing is shown, where the wrongly annotated tokens are underlined and tokens with a missing inflectional tag are printed in italics.

6.1 Testing on the training set

When testing the rules on the training text, Orwell’s 1984, only the contextual rules are used, since finallexicon contains all the words of the entire text (see also section 3.4), i.e. all words are considered known.

• Orwell’s 1984 consists of 99860 tokens.
• The total number of correctly PoS-tagged tokens is 98432.
• The total number of tokens which have correct and complete PoS and subtags is 98656.
• The total number of tokens which have correct PoS tags irrespective of the correctness of the subtags is 98677.

Note 17: The training was done at the Department of Linguistics, Uppsala University, by Klas Prütz, using an IBM RISC System/6000 Model 43P.
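The precision and recall figures reported in this section follow the definitions given above and can be computed from two parallel tag sequences, for instance as in the following sketch (the list-of-tags representation is an assumption, not the evaluation procedure actually used):

    from collections import Counter

    def per_tag_scores(gold_tags, predicted_tags):
        # precision(T) = correctly tagged T / all words the tagger tagged T
        # recall(T)    = correctly tagged T / all words whose correct tag is T
        correct, retrieved, intended = Counter(), Counter(), Counter()
        for gold, pred in zip(gold_tags, predicted_tags):
            retrieved[pred] += 1
            intended[gold] += 1
            if gold == pred:
                correct[gold] += 1
        tags = set(retrieved) | set(intended)
        return {t: (correct[t] / retrieved[t] if retrieved[t] else None,
                    correct[t] / intended[t] if intended[t] else None)
                for t in tags}

    gold = ["DET", "FN", "IGE", "FN"]
    pred = ["DET", "FN", "FN", "FN"]
    print(per_tag_scores(gold, pred))
    # DET: (1.0, 1.0); FN: (about 0.67, 1.0); IGE: (None, 0.0)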


In this case the difference between the correctly annotated text and the tagger-annotated text was checked automatically; thus neither the correct but incomplete subtags nor the precision and recall for specific PoS tags were calculated. In Table E below, the precision is shown both for the PoS-annotated and for the PoS-with-subtags-annotated text (with and without subtags; after annotation the subtags were removed in order to automatically check the total number of correct PoS tags).

Table E. Precision for Orwell’s 1984, which constituted the training corpus

    Orwell                                                      Precision
    PoS tags                                                    (98432/99860)*100 ≈ 98.6%
    complete PoS and subtags                                    (98656/99860)*100 ≈ 98.8%
    correct PoS tags irrespective of correctness of subtags     (98677/99860)*100 ≈ 98.8%

As mentioned above, these results indicate that the tagger attains slightly better annotation, hence better rules, using PoS and subtags than using only PoS tags. The reason for this could be that marking an inflectional property of a word may give more information about which part of speech the word belongs to.

6.2 Testing on the poem

• The total number of tokens is 397.
• The total number of correctly PoS-tagged words retained by the tagger is 341.
• The total number of words which have correct PoS and subtags but have some subtags missing is 356.
• The total number of words which have correct and fully complete PoS and subtags is 352.
• The total number of words with a correct PoS tag but not necessarily correct subtags is 366.

Thus, precision can be calculated for the PoS-tagged poem and for the PoS-and-subtagged poem, as shown in Table F below. The accuracy is lowest (85.9%) for the PoS-tagged version, while the annotation with PoS and subtags gives a higher accuracy, namely 88.7%. When the missing tags for inflectional categories are excluded from consideration, the accuracy increases from 88.7% to 89.7%. The best result (92.2%) is obtained when only the correct PoS tags of the PoS and subtag combinations are considered, i.e. the subtags are not included.

Table F. Precision for the poem

Poem only PoS tags excluding missing subtags as correct including missing subtags as correct

with PoS tags (341/397)*100 ≈ 85.9% -´´-´´-

with PoS and subtags (366/397)*100 ≈ 92.2% (352/397)*100 ≈ 88.7% (356/397)*100 ≈ 89.7%

For the different parts of speech, precision and recall are given in Table G below. In the first column the specific parts of speech are listed. In the second column the first number gives the precision for a specific PoS tag, which is the quotient of the total number of correctly annotated words with that PoS tag and the total number of words annotated with that PoS tag by the tagger. The parts of speech to which the wrongly annotated words actually belong are also given. For example, one noun (FN 1), one verb (IGE 1) and one adjective (MN 1) were wrongly annotated by the tagger as infinitives (INF). In the third column, recall is shown, i.e. the quotient of the total number of correctly annotated words with a specific PoS tag and the total number of words with that intended tag, followed by the categories which the tagger wrongly assigned instead. For example, one of the 14 infinitives (INF) in the text was analysed by the tagger as a noun (FN 1).

Table G. Precision and recall for different parts of speech in the poem

PoS tag              | Precision (correct/retrieved) | wrongly tagged as this PoS | Recall (correct/intended) | missed, tagged instead as
INF (Infinitive)     | 0.81 (13/16)  | FN 1, IGE 1, MN 1          | 0.93 (13/14)  | FN 1
IGE (Verb)           | 0.81 (42/52)  | MN 5, FN 4, NU 1           | 0.72 (42/58)  | FN 10, MN 4, INF 1, SZN 1
IK (Verbal Particle) | 0.80 (4/5)    | NM 1                       | 0.67 (4/6)    | NM 1, HA 1
FN (Noun)            | 0.75 (70/93)  | MN 11, IGE 10, NM 1, INF 1 | 0.86 (70/81)  | MN 6, IGE 4, INF 1
DET (Determiner)     | 1.0 (25/25)   | ----                       | 1.0 (25/25)   | ----
MN (Adjective)       | 0.76 (35/46)  | FN 6, IGE 4, HA 1          | 0.67 (35/52)  | FN 11, IGE 5, INF 1
NM (Pronoun)         | 0.93 (14/15)  | IK 1                       | 0.88 (14/16)  | FN 1, IK 1
HA (Adverb)          | 0.90 (28/31)  | KOT 1, IK 1, KSZ 1         | 0.88 (28/32)  | KOT 3, MN 1
SZN (Numeral)        | 0.50 (1/2)    | IGE 1                      | 1.0 (1/1)     | ----
NU (Postposition)    | 1.0 (3/3)     | ----                       | 0.75 (3/4)    | IGE 1
KOT (Conjunction)    | 0.88 (22/25)  | HA 3                       | 0.96 (22/23)  | HA 1
KSZ (Interrogative)  | -- (0)        | ----                       | -- (0/1)      | HA 1
ISZ (Interjection)   | -- (0)        | ----                       | -- (0)        | ----
NNP (Default tag)    | -- (0)        | ----                       | -- (0)        | ----

The tagger gave maximum precision and recall for determiners (DET), i.e. it neither assigned wrong tags for that category nor missed any. All postpositions (NU) which the tagger suggested were correct (maximum precision), but it missed one of the four postpositions. Pronouns (NM), adverbs (HA) and conjunctions (KOT) have quite high precision, but the tagger missed some of them. The open part of speech categories, such as nouns (FN), adjectives (MN) and verbs (IGE), have the lowest precision and recall. For the category numerals (SZN) the recall rate is maximal but the precision rate is very low. This result does not say much about that category, however, because its occurrence in the text is minimal.

6.3 Testing on the fairy tale

• The total number of tokens is 2110.
• The total number of correctly PoS tagged tokens retained by the tagger is 1793.
• The total number of tokens which have correct and fully complete PoS and subtags is 1715.
• The total number of tokens which have correct PoS and subtags but are missing some subtags is 1761.
• The total number of tokens without the foreign names (Suryakanta, Balapandita and Razakosa) is 1999.

Thus, precision can be calculated for the PoS tagged fairy tale and for the PoS and subtagged fairy tale, as shown in Table H below. Here, in contrast to the poem, the version with PoS and subtags has the lowest accuracy, no matter whether tags with missing subtags are counted as wrong or as right (85% compared to 81.3% and 83.4%), although the accuracy does increase from 81.3% to 83.4% when the missing tags for inflectional categories are excluded from consideration. In view of the fact that the foreign names Suryakanta, Balapandita and Razakosa (111 occurrences in total) were frequently wrongly annotated by the tagger, the precision was also calculated without those words. When these foreign names are excluded from the total number of machine-tagged words, the accuracy increases considerably: from 85% to 86.6% for the PoS tagged version, and from 81.3% to 85.4% for the PoS and subtagged version excluding missing subtags as correct. On the other hand, the accuracy is consistently higher for the PoS and subtagged version when only counting correct PoS tags, compared to the PoS tagged version.

Table H. Precision for fairy tale

Fairy tale, correct tags in %                               | with PoS tags           | with PoS and subtags
only PoS tags                                               | (1793/2110)*100 ≈ 85%   | (1829/2110)*100 ≈ 86.7%
excluding missing subtags as correct                        | -''-                    | (1715/2110)*100 ≈ 81.3%
including missing subtags as correct                        | -''-                    | (1761/2110)*100 ≈ 83.4%
without foreign names, excluding missing subtags as correct | (1731/1999)*100 ≈ 86.6% | (1708/1999)*100 ≈ 85.4%
without foreign names, including missing subtags as correct | -''-                    | (1754/1999)*100 ≈ 87.7%
without foreign names, only PoS tags                        | -''-                    | (1805/1999)*100 ≈ 90.3%

The precision and recall for different PoS tags are shown in Table I below. As in Table G in section 6.2, the first column lists the different PoS tags. The second column gives the precision together with the categories to which the wrongly annotated words actually belong, while the third column gives the recall together with the categories which the tagger wrongly assigned to the words it missed.

Table I. Precision and recall for different PoS tags in the fairy tale

PoS tag              | Precision (correct/retrieved) | wrongly tagged as this PoS                       | Recall (correct/intended) | missed, tagged instead as
INF (Infinitive)     | 1.0 (14/14)    | ----                                             | 1.0 (14/14)    | ----
IGE (Verb)           | 0.69 (260/377) | FN 89, MN 9, NM 8, HA 5, ISZ 5, KOT 1            | 0.86 (260/302) | FN 28, MN 8, KOT 4, SZN 2
IK (Verbal Particle) | 0.73 (22/30)   | NM 8                                             | 0.88 (22/25)   | NM 2, HA 1
FN (Noun)            | 0.85 (453/534) | IGE 28, MN 22, NM 12, HA 11, ISZ 5, KOT 2, SZN 1 | 0.77 (453/585) | IGE 89, MN 35, HA 7, NM 1
DET (Determiner)     | 1.0 (148/148)  | ----                                             | 1.0 (148/148)  | ----
MN (Adjective)       | 0.69 (116/168) | FN 35, IGE 8, HA 5, ISZ 2, NM 1, KOT 1           | 0.78 (116/148) | FN 22, IGE 9, HA 1
NM (Pronoun)         | 0.94 (118/125) | KSZ 3, HA 2, FN 1, IK 1                          | 0.80 (118/148) | FN 12, IGE 8, IK 8, HA 1, MN 1
HA (Adverb)          | 0.84 (109/129) | FN 7, KSZ 5, KOT 5, MN 1, NM 1, IK 1             | 0.72 (109/152) | KOT 13, FN 11, NU 7, IGE 5, MN 5, NM 2
SZN (Numeral)        | 0.78 (7/9)     | IGE 2                                            | 0.88 (7/8)     | FN 1
NU (Postposition)    | 0.82 (31/38)   | HA 7                                             | 1.0 (31/31)    | ----
KOT (Conjunction)    | 0.91 (178/195) | HA 13, IGE 4                                     | 0.96 (178/186) | HA 3, FN 2, MN 2, IGE 1
ISZ (Interjection)   | 1.0 (5/5)      | ----                                             | 0.20 (5/25)    | NNP 8, FN 5, IGE 5, MN 2
KSZ (Interrogative)  | -- (0)         | ----                                             | -- (0/8)       | HA 5, NM 3
NNP (Default tag)    | -- (0/8)       | ISZ 8                                            | -- (0)         | ----

The categories infinitive (INF) and determiner (DET) have maximum recall and precision; thus the tagger found these categories easiest. For the category postposition (NU) the tagger did not miss any, but seven of the 38 words it tagged as postpositions were actually adverbs (HA). Pronouns (NM) and nouns (FN) have tolerably high precision, but the tagger missed 30 of 148 pronouns and 132 of 453 nouns. The tagger had the most difficulties with adjectives (MN) and verbs (IGE), which were often incorrectly annotated and often missed. These results resemble the precision and recall results for the PoS tags in the poem, where the open categories seem to be the most difficult ones for the tagger to annotate correctly.

6.4 Evaluation

Testing on the training set, i.e. using the same corpus for training and testing, gave the best result (98.6% and 98.8%), as would be predicted. Due to the fact that the tagger learned its rules on the same corpus as the test corpus, the outcome of testing is much better than for the other types of test texts. The results do not give a valid statement about the performance of the system, but rather tell how good or bad the rules are that the system derived from the training set. These results mean that the tagger could not correctly or completely annotate approximately one in every hundred words. One reason for this is the great number of homographs in Hungarian which, according to Pajzs' (1996) examination, average more than 30% of the running words in a corpus consisting of 200 000 running words. Thus, the tagger, based on the lexical and contextual rules, managed to correctly annotate most of the homographs.

In order to get a picture of the tagger's performance, the tagger was tested on two different samples (the poem and the fairy tale) other than the training set. The results are lower than the result obtained from training and testing on the same text. In general, the tagger succeeded better in annotating the poem (85.9%-92.2%) than the fairy tale (81.3%-90.3%) with parts of speech with or without subtags. On the other hand, the poem consists of fewer tokens than the fairy tale, hence its result is not as reliable as the result of the fairy tale. On average, putting together the results from both subcorpora, consisting of 2507 words, we get 85.12% correct results for only part of speech tags, 82.45% for part of speech tags with correct and complete subtags, 84.44% for part of speech tags with correct but not necessarily complete subtags, and 87.55% for part of speech tags without regarding the correctness of the subtags.

These results are not very promising when compared with Brill's results on English test corpora, which have an accuracy of 96.5% trained on 88200 words (Brill, 1995a). The difference in accuracy might depend on i) the type of the training corpus, ii) the types and the size of the test corpus, and iii) the type of language structure, such as morphology and syntax. The corpus which was used to train the tagger on Hungarian consisted of only one text, a work of fiction with 'inventive' language, while Brill used a training corpus consisting of several types of texts (Brill, 1995a). Also, there is a difference between the types and the sizes of the test corpora. In this work very small samples, which greatly differ in type from the training corpus, have been used, while Brill's test corpus consisted of different types of texts (Brill, 1995a). Nevertheless, the most significant difference between the results lies in the type of the language structure.
Hungarian, as was described in section 4, has free word order and a more complicated morphology than English. I argue that the low tagging accuracy for Hungarian mostly depends on the fact that the templates of the learner modules of the tagger, especially the templates of the lexical learner module, are predefined in a way that includes strong language-specific information which does not fit Hungarian and other agglutinative languages. The predefined templates are principally based on the structure of English and perhaps other 'big' European languages, even if Brill maintains that his system is language independent.

Those lexical templates whose triggers depend on the affixes of a word look at only the first or last four characters of a word. In other words, defining a lexical trigger as 'delete/add the suffix x where |x| ≤ 4' is to assert that it is only important to look at the last or first four letters of a word, which is often not enough for correct annotation in Hungarian. For example, the word siessu2nk 'hurry up:1PL' was annotated by the tagger as IGE_t1, i.e. as a verb in present indicative first person plural with indefinite object. The correct annotation would be IGE_Pt1, i.e. a verb in imperative (P) first person plural with indefinite object. Because the tagger was only looking at the last four characters, -u2nk, it missed the necessary information about the imperative -s[19]. Another example concerns derivational suffixes, which give important information about the PoS tag because they often change the category of the word. They follow the stem of the word and may be followed by different inflectional suffixes, see also section 4.2.9.2. For example, the word a1rt:atlan:sa1g:a1t 'harm:less:Deadjectival_noun:ACC' could be wrongly annotated by the tagger because the information about the two derivational suffixes is missed if the word a1rtatlansa1g does not exist in the lexicon.

[19] Note that in imperative/subjunctive mood the marker is -j, but here the -j changes to -ss because the root of the word ends in -t (siet 'hurry'), see also section 4.2.6.

Thus, if the tagger had looked at more than four characters, it would have been possible to reduce the total number of words in the lexicon, and the tagger would have been able to create better and more efficient rules concerning the morphological structure of Hungarian words. This is especially true in the case of the corpora used in this work, since the accentuation of the vowels is encoded with extra characters (numbers), which reduces the effective length of the affixes. In the example above, siessu2nk (siessünk), at most three of the last letters are examined. For Hungarian, the triggers of the templates seem to be unsuccessful because of the Hungarian suffix structure of the open classes, such as the categories Noun, Verb and Adjective, as described in section 4.2. A possible solution could be to give the user the possibility to change the predefined language-specific templates, in some easy way, to ones more suitable for the particular language. One could, for instance, give a template file as an input to the learner. Furthermore, the contextual rules probably work more effectively for strong 'word-order' languages, such as English, than for Hungarian, though this is difficult to show without evaluating the system more thoroughly.
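Returning to the lexical templates, the effect of the |x| ≤ 4 restriction can be illustrated with a small sketch. The function below is my own illustration, not part of Brill's code; it enumerates the suffix strings a lexical template may inspect. With the corpus's digit-based encoding of accented vowels, the four-character window over siessu2nk never reaches the imperative marker -ss-.

```python
def visible_suffixes(word, max_len=4):
    """Return the suffix strings a lexical template with |x| <= max_len can test."""
    return [word[-i:] for i in range(1, min(max_len, len(word)) + 1)]

# 'siessu2nk' encodes siessünk; 'u2' is one vowel written as two characters.
print(visible_suffixes("siessu2nk"))
# ['k', 'nk', '2nk', 'u2nk'] -- the imperative marker -ss- is out of reach
print(visible_suffixes("siessu2nk", max_len=6))
# ['k', 'nk', '2nk', 'u2nk', 'su2nk', 'ssu2nk'] -- now -ss- becomes visible
```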

In order to know which categories the tagger in general failed to identify, precision and recall were calculated for each part of speech category of the poem and the fairy tale together, in the same way as they were calculated for each type of text in sections 6.2 and 6.3. In Table J the results are given.

Table J. Precision and recall for PoS categories of both test texts

PoS tag              | Precision (correct/retrieved) | wrongly tagged as this PoS                              | Recall (correct/intended) | missed, tagged instead as
INF (Infinitive)     | 0.90 (27/30)   | FN 1, IGE 1, MN 1                                       | 0.96 (27/28)   | FN 1
IGE (Verb)           | 0.70 (302/429) | FN 93, MN 14, NM 8, HA 5, ISZ 5, KOT 1, NU 1            | 0.83 (302/360) | FN 38, MN 12, KOT 4, SZN 3, INF 1
IK (Verbal Particle) | 0.74 (26/35)   | NM 9                                                    | 0.84 (26/31)   | NM 3, HA 2
FN (Noun)            | 0.83 (523/627) | IGE 38, MN 33, NM 13, HA 11, ISZ 5, KOT 2, INF 1, SZN 1 | 0.78 (523/666) | IGE 93, MN 41, HA 7, NM 1, INF 1
DET (Determiner)     | 1.0 (173/173)  | ----                                                    | 1.0 (173/173)  | ----
MN (Adjective)       | 0.70 (151/214) | FN 41, IGE 12, HA 6, ISZ 2, NM 1, KOT 1                 | 0.75 (151/200) | FN 33, IGE 14, HA 1, INF 1
NM (Pronoun)         | 0.94 (132/140) | KSZ 3, HA 2, IK 2, FN 1                                 | 0.80 (132/164) | FN 13, IK 9, IGE 8, HA 1, MN 1
HA (Adverb)          | 0.85 (137/160) | FN 7, KOT 6, KSZ 6, IK 2, MN 1, NM 1                    | 0.74 (137/184) | KOT 16, FN 11, NU 7, MN 6, IGE 5, NM 2
SZN (Numeral)        | 0.73 (8/11)    | IGE 3                                                   | 0.89 (8/9)     | FN 1
NU (Postposition)    | 0.83 (34/41)   | HA 7                                                    | 0.97 (34/35)   | IGE 1
KOT (Conjunction)    | 0.91 (200/220) | HA 16, IGE 4                                            | 0.96 (200/209) | HA 4, FN 2, MN 2, IGE 1
ISZ (Interjection)   | 1.0 (5/5)      | ----                                                    | 0.20 (5/25)    | NNP 8, FN 5, IGE 5, MN 2
KSZ (Interrogative)  | -- (0)         | ----                                                    | -- (0/9)       | HA 6, NM 3
NNP (Default tag)    | -- (0/8)       | ISZ 8                                                   | -- (0)         | ----

The tagger managed to find all determiners [DET] and did not give any wrong alternatives either, which is in many ways astonishing, because the definite article az 'the' may be homonymous with the demonstrative pronoun az 'that', and the indefinite article egy 'a/an' has the same form as the numeral 'one'. Interjections [ISZ] also have maximal precision, i.e. all words which have been annotated as interjections are correct, but the tagger missed 20 of the 25 interjections, hence the low recall (20%). 8 of the 25 interjections have been tagged with the default tag NNP, meaning that the word begins with a capital letter and is unknown.

Most of the words annotated with the tag pronoun [NM] are correct, only 8 of 140 were wrong, but the tagger missed some of the pronouns and annotated them mainly as nouns [FN], verbal particles [IK] or verbs [IGE]. These mistakes depend on the fact that pronouns take personal endings in a similar way as nouns and verbs do. When the tagger looks at the word endings (the last four characters of a word) it finds the same endings for pronouns, nouns and verbs. The mix-up with the verbal particles may be the result of the fact that the interrogative pronoun ki 'who' has the same form as the verbal particle ki 'out'.

Conjunctions [KOT] and infinitives [INF] both have high recall (96%) and precision (91% and 90%, respectively). Adverbs are often tagged as conjunctions, perhaps because both are found in the same position in a sentence. Furthermore, conjunctions were confused with verbs four times because of the homographs vagy and mert. Vagy is either the conjunction 'or', or the second person singular present form of the verb van 'be'. Mert may belong to the class of conjunctions, with the meaning 'because', or to the verb class, as the past tense third person singular form meaning 'dare'. These conjunctions can easily be detected because they are often the first word in a sentence or a clause.

For the category adverb [HA] the tagger missed 47 of 184 adverbs and mostly annotated these words as conjunctions [KOT] or nouns [FN]. This result is not surprising because adverbs may appear in the same position as conjunctions, and adverbs and nouns may take the same personal endings, as was described in section 4.2. Concerning precision, the tagger correctly found 137 adverbs but wrongly annotated 23 words belonging mainly to the categories noun [FN], conjunction [KOT] and interrogative [KSZ].

Postpositions [NU] have high recall (97%), meaning that the tagger missed only one. On the other hand, this class has very low precision because the tagger annotated seven adverbs as postpositions. The difference between adverbs and postpositions is, among other things, that the latter category does not take personal endings while adverbs do. Note also that many adverbs have the same stem as postpositions (see section 4.2.8 for more detail).

The open classes, such as nouns [FN], verbs [IGE] and adjectives [MN], are often wrongly annotated and confused by the tagger because of their complicated form, their similar morphological structure and the fairly free word order in Hungarian. Many words belonging to these classes are homographs, especially nouns and verbs, and verbs and adjectives (or participles). For that reason, these categories have both low precision and recall.
Among the closed classes, verbal particles [IK] and numerals [SZN] have the lowest precision (74% and 73%, respectively) - the tagger often gives these categories an incorrect analysis - but a higher recall (84% and 89%, respectively), i.e. the tagger finds many of them. This is because the words belonging to these categories constitute a countable set, many of them being in the lexicon but often ambiguous.

To sum up the results, the tagger has the most difficulties with categories belonging to the open classes because of their morphological structure and homonymy, while grammatical categories are easier to detect and correctly annotate. A complicated and highly developed morphological structure and fairly free word order, i.e. positional relationships being less important, lead to lower accuracy when using Brill's tagger on Hungarian.


7 Directions for future research

Firstly, it has to be pointed out that the results described in section 6 are based on a very small test corpus consisting of approximately 2500 running words. For a better evaluation of the ability of the tagger to annotate Hungarian texts based on the learned lexical and contextual rules, it would be necessary to test the tagger on a large corpus with different types of texts, including fiction, poetry, non-fiction, articles from different newspapers, trade journals, etc. Note that for effective evaluation it is important to have a correctly annotated test corpus, where the tagset and the annotation principles for that corpus are in accordance with the training corpus. The correctly annotated test corpus and the test corpus reannotated by the tagger can then be automatically compared. It would also be interesting to evaluate the components of the system for Hungarian by running only the lexical rules on the test corpus and comparing their accuracy to the accuracy when both sets of rules are applied, i.e. to see how much the contextual rules improve the result.

For higher accuracy it would be necessary to train the tagger on a larger corpus with different types of texts, or even on several corpora, because the likelihood of higher accuracy increases with the size of the training corpus. Unfortunately, there is a trade-off between accuracy and training time; training on a large corpus takes a lot of time and memory, but on the other hand it gives higher accuracy. It also seems reasonable that before lexical training, the default tags NN and NNP should be changed in the program to the most common tag in the Hungarian tagset, e.g. FN (noun), which was not done for the training on the Hungarian corpus[20]. On the other hand, the outcome of modifications like this is hard to predict, since the rule generating process may be very dynamic.

Furthermore, it would be advantageous to create a very large dictionary of the type 'Word Tag1 Tag2 … TagN' (where the first tag is the most frequent tag for that word), listing all possible tags for each word. By using this lexicon, accuracy would be improved in two ways. First, the number of unknown words, i.e. the number of words not present in the training corpus, would be reduced, though no matter how much text the tagger looks at, there will always be a number of words that appear only a few times, according to Zipf's law (frequency is roughly proportional to inverse rank). Secondly, the large dictionary would give more accurate knowledge about the set of possible part of speech tags for a particular word. For example, a template of the type 'Change the most likely tag from X to Y, if ...' would only change tag X to tag Y if tag Y is listed for the particular word (see the sketch after this section's next paragraph). Thus, a large dictionary would reduce annotation errors by licensing better rules, and would increase the speed of the contextual learning.

A limitation of the tagger lies in the fact that it has difficulty handling idioms and lexicalised phrases, because there are restrictions on the length of the segment - only segments one word long can be marked. One possibility could be to construct a tagset with new tags for idioms/lexicalised phrases, where the first part of the tag is the name of the idiom/lexicalised phrase and the second part is a number which indicates the position of the element in the idiom/lexicalised phrase.
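The following is a minimal sketch of how such a 'Word Tag1 Tag2 … TagN' dictionary could be loaded and used to veto tag changes. The file format details and the function names are hypothetical; only the convention stated above, that the first tag listed is the most frequent one, is assumed.

```python
def load_lexicon(path):
    """Read lines of the form 'word Tag1 Tag2 ... TagN' into a dict."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if fields:
                lexicon[fields[0]] = fields[1:]  # fields[1] is the most frequent tag
    return lexicon

def allowed_change(lexicon, word, new_tag):
    """Permit changing a word's tag to new_tag only if the lexicon lists it.

    Unknown words are left unconstrained, since the lexicon says
    nothing about them.
    """
    tags = lexicon.get(word)
    return tags is None or new_tag in tags
```

A contextual learner could consult allowed_change before proposing a transformation, so that rules changing a known word to a tag it can never bear are filtered out early.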
Also, it would be desirable if the system could add the lemma of a word to its tag, since the lemma is often vital for further investigations. The lemma could, for example, stand in the first position of the compound tag. This would probably demand an even larger training corpus. How this notation would affect the learning process and the accuracy of the tagger is difficult to predict, but it would be interesting to try. The lexical learning process would be even more time and memory consuming than it already is, hence this approach might not be feasible. Brill's tagger already requires much memory and processing time, especially for the lexical rule learner; the lexical learning process based on 452 tags (PoS and subtag combinations) took one week.

The above-mentioned suggestions for improvement are applicable by a future conscientious user. Below, some recommendations concerning the improvement of the system itself follow. To get better accuracy it would be worthwhile attempting to change the predefined transformation templates so that they would work better for Hungarian and for other languages with a different morphological structure than English. The templates which limit the accuracy are, in my opinion, the lexical types, namely deleting/adding the suffix/prefix x, where |x| ≤ 4, and the current word has prefix/suffix x, where |x| ≤ 4 (see also section 3.2, templates 2-7 and 10-15). It should be possible for the user to set the maximum length of x before training. For Hungarian, for example, it would be better if x were set to 6 or 8, i.e. if the system could look at the first/last six or eight characters. Often it is not enough to look at the first/last four characters, because the system loses information about other prefixes/suffixes which could give relevant information about a specific part of speech category or a relevant inflectional category, see also section 6.4. An even better improvement would be to let the user supply his/her own template file to the learner/tagger. In today's version of the tagger it is difficult to change the predefined transformation templates.

[20] The default tag NNP occurred eight times in the annotation of the test corpus though the tag does not exist in the Hungarian tagset.
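As a sketch of the recommended flexibility, the generation of candidate affix rules could take the maximum affix length as a user-settable parameter instead of hard-coding four. The rule representation below is illustrative only and does not mirror Brill's internal data structures.

```python
def candidate_affix_rules(word, tag, max_len=4):
    """Generate hassuf/haspref rule candidates for one (word, tag) pair.

    max_len plays the role of the hard-coded |x| <= 4 bound in the
    lexical learner; for Hungarian a value of 6 or 8 would be more apt.
    """
    rules = []
    for i in range(1, min(max_len, len(word)) + 1):
        rules.append((word[-i:], "hassuf", i, tag))
        rules.append((word[:i], "haspref", i, tag))
    return rules

# With max_len=6 the candidate ('ssu2nk', 'hassuf', 6, 'IGE_Pt1'), which
# covers the imperative marker, becomes available to the learner.
print(candidate_affix_rules("siessu2nk", "IGE_Pt1", max_len=6))
```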


8 Conclusion

This work has presented Eric Brill's rule-based PoS tagger, which automatically acquires rules from a training corpus based on transformation-based error-driven learning. The tagger has been trained on a Hungarian corpus, composed of Orwell's 1984 and consisting of 99860 tokens including punctuation marks. The tagset of the training corpus consists of 452 PoS tags including inflectional properties, of which 31 denote different parts of speech. The training was done twice, once for part of speech tags only and once for part of speech and 'subtags' denoting inflectional properties. New test texts with approximately 2500 words, as well as the training corpus, have been tested on the tagger. Precision was calculated for all test texts, and recall and precision for specific part of speech tags.

The results presented in this work show that the accuracy on the test texts was 85.12% for only part of speech tags, 82.45% for part of speech tags with correct and complete subtags, and 84.44% for part of speech tags with correct but not necessarily complete subtags. The best annotation, 87.55%, was attained by tagging a text with PoS and subtags but only considering the correctness of the PoS tag (i.e. without regarding the subtag). The tagger had most difficulties with categories belonging to the open classes because of their complicated morphological structure, while closed-class categories were easier to detect and correctly annotate. It was shown that it is probably more difficult to derive information from words which are annotated with only PoS tags than from words whose tags include information about the inflectional categories.

It was suggested that, to be able to more accurately evaluate Brill's tagger on Hungarian, it is necessary to test the tagger on a larger corpus and also to evaluate different components of the system. For higher accuracy it would be necessary to use a larger corpus which consists of different types of texts, to change the default tags to the most common tag in the Hungarian tagset, and to create a large lexicon of the type 'Word Tag1 Tag2 ... TagN' to reduce the number of unknown words. Furthermore, it was shown that many of the predefined transformation templates of the two learning modules are language specific with the current choice of parameters. They are based on the language structure of English and perhaps other Germanic languages. They do not suit agglutinative languages such as Hungarian, with their complicated morphological structure and fairly free word order. It would be necessary to construct a flexible solution giving the user the possibility to change the predefined transformation templates, or at least their parameters, in a simple way, to better and more expressive ones for a particular language.

Finally, there are many advantages of the tagger compared to other tagging systems, as Brill (1992) declares. Brill's rule-based PoS tagger automatically learns its rules from a comparably small correctly annotated corpus. The learners give a relatively small set of 'meaningful' rules. Furthermore, the tagset and/or corpus genre is easy to change. Also, it is easy to find and implement improvements to the tagger.


References

Abondolo, D. 1987. Hungarian. In The Major Languages of Eastern Europe. Comrie, B. et al. Routledge, London.

Abondolo, D. 1992. Hungarian. In International Encyclopaedia of Linguistics. Volume 2. Bright, W. Oxford University Press.

Atwell, E. 1993. Corpus-Based Statistical Modelling of English Grammar. In Corpus-Based Computational Linguistics. eds. Souter, C. & Atwell, E. pp. 195-214. Rodopi, Amsterdam - Atlanta.

Bencédy, J., et al. 1968. A mai magyar nyelv. Nemzeti Tankönyvkiadó, Budapest.

Benkő, L. & Imre, S. 1972. The Hungarian Language. Akadémiai Kiadó, Budapest.

Brill, E. 1992. A Simple Rule-Based Part of Speech Tagger. In Proceedings of the DARPA Speech and Natural Language Workshop. pp. 112-116. Morgan Kaufmann, San Mateo, California.

Brill, E. 1993a. Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics. pp. 259-265. ACL, Columbus, Ohio, USA.

Brill, E. 1993b. Transformation-Based Error-Driven Parsing. In Proceedings of the Third International Workshop on Parsing Technologies. Tilburg, The Netherlands.

Brill, E. 1993c. A Corpus-Based Approach to Language Learning. Ph.D. Dissertation, Department of Computer and Information Science, University of Pennsylvania.

Brill, E. 1994a. A Report of Recent Progress in Transformation-Based Error-Driven Learning. ARPA-94.

Brill, E. 1994b. Some Advances in Rule-Based Part of Speech Tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.

Brill, E. 1995a. Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging. Very Large Corpora Workshop, 1995.

Brill, E. 1995b. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging. Computational Linguistics, 21:4.

Brill, E. & Marcus, M. 1992a. Automatically Acquiring Phrase Structure Using Distributional Analysis. DARPA Workshop on Speech and Natural Language, 1992.

Brill, E. & Marcus, M. 1992b. Tagging an Unfamiliar Text With Minimal Human Supervision. In Proceedings of the Fall Symposium on Probabilistic Approaches to Natural Language, 1992.

Brill, E. & Resnik, P. 1994. A Rule-Based Approach to Prepositional Phrase Attachment Disambiguation. In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94). Kyoto, Japan.

Briscoe, T. 1994. Prospects for practical parsing of unrestricted text: Robust statistical parsing techniques. In Corpus-based research into language. eds. Oostdijk, N. & de Haan, P. Rodopi, Amsterdam - Atlanta.

Brodda, B. forthcoming. Do corpus work with PcBeta and Be Your Own Computational Linguist. vers. October 1996. Stockholm, Sweden.

Campbell, G. L. 1991. Compendium of the World's Languages. Volume 1. Routledge, London and New York.

Charniak, E. 1996. Statistical Language Learning. Language, Speech and Communication Series. MIT Press.

Church, K. W. 1988. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. In Second Conference on Applied Natural Language Processing. ACL, Austin, Texas.

CLAWS. 1998. http://www.comp.lancs.ac.uk/ucrel/claws

Cutting, D. et al. 1992. A Practical Part of Speech Tagger. In Proceedings, Third Conference on Applied Natural Language Processing. Trento, Italy. pp. 133-140.

de Marcken, C. G. 1990. Parsing the LOB Corpus. In Proceedings, 28th Annual Meeting of the Association for Computational Linguistics. University of Pittsburgh, Pennsylvania, USA.

Dermatas, E. & Kokkinakis, G. 1995. Automatic Stochastic Tagging of Natural Language Texts. Computational Linguistics, 21:2.

DeRose, S. J. 1988. Grammatical Category Disambiguation by Statistical Optimization. Computational Linguistics, 14:1. pp. 31-39.

Garside, R. 1987. The CLAWS word-tagging system. In The Computational Analysis of English. eds. Leech, G., Sampson, G. & Garside, R. Longman, London.

Greene, B. B. & Rubin, G. 1971. Automated Grammatical Tagging of English. Department of Linguistics, Brown University.

Kiss, É. K. 1987. Configurationality in Hungarian. Akadémiai Kiadó, Budapest.

Kokkinakis, S. J. & Kokkinakis, D. 1996. Rule-Based Tagging in Språkbanken (Bank of Swedish). Språkdata/Dept. of Swedish Language, Göteborg University, Sweden.

Krenn, B. & Samuelsson, Ch. 1997. The Linguist's Guide to Statistics. Universität des Saarlandes, Saarbrücken, Germany. http://coli.uni-sb.de

Marcus, M. P., Marcinkiewicz, M. A. & Santorini, B. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19:2.

McEnery, T. & Wilson, A. 1996. Corpus Linguistics. Edinburgh University Press, Edinburgh.

Merialdo, B. 1994. Tagging English Text with a Probabilistic Model. Computational Linguistics, 20:2.

Olsson, M. 1992. Hungarian Phonology and Morphology. Lund University Press.

Pajzs, J. 1996. Disambiguation of suffixal structure of Hungarian words using information about part of speech and suffixal structure of words in the context. COPERNICUS Project 621 GRAMLEX, Work package 3 - Task 3E2. Research Institute for Linguistics, Hungarian Academy of Sciences.

Prütz, K. 1996. Disambiguation Strategies in Automatic Part of Speech Tagging Systems: A Probabilistic and a Rule Based System. Working Papers in Computational Linguistics & Language Engineering. Department of Linguistics, Uppsala University, Sweden.

Rácz, E., et al. 1968. A mai magyar nyelv. Nemzeti Tankönyvkiadó, Budapest.

Roche, E. & Schabes, Y. 1995. Deterministic Part of Speech Tagging with Finite-State Transducers. Computational Linguistics, 21:2.

Samuelsson, Ch. & Voutilainen, A. 1997. Comparing a Linguistic and a Stochastic Tagger. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97). pp. 246-253.

Schütze, H. & Singer, Y. 1994. Part of Speech Tagging Using a Variable Memory Markov Model. In Proceedings of the Association for Computational Linguistics, 1994.

Tompa, J., et al. 1970. A mai magyar nyelv rendszere. Akadémiai Kiadó, Budapest.

Van Guilder, L. 1995. Automated Part of Speech Tagging: A Brief Overview. Handout for LING361, Fall 1995. Georgetown University. http://www.georgetown.edu/cball/ling361/tagging_overview.html

Voutilainen, A. 1995. A Syntax-Based Part of Speech Analyser. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics. Dublin.

Weischedel, R. et al. 1993. Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19:2. pp. 359-382.


Appendix A: The Hungarian tagset

Part of speech tags

DET  Determiner
FN   Noun
NM   Pronoun
KSZ  Interrogative
MN   Adjective
SZN  Numeral
IGE  Verb
INF  Infinitive
IK   Verbal Particle
HA   Adverb
NU   Postposition
KOT  Conjunction
ISZ  Interjection
MSZ  Sentence adverb
MOD  Modifier
USZ  Foreign word

Noun [FN]

Number
PL   Plural

Possessive, the possessed item is singular
Num/Per      | Tag
Singular 1   | PSe1
Singular 2   | PSe2
Singular 3   | PSe3
Plural 1     | PSt1
Plural 2     | PSt2
Plural 3     | PSt3

Possessive, the possessed item is plural
Num/Per      | Tag
Singular 1   | PSe1i
Singular 2   | PSe2i
Singular 3   | PSe3i
Plural 1     | PSt1i
Plural 2     | PSt2i
Plural 3     | PSt3i

POS  Possessive

Case
Tag  | Name
NOM  | Nominative
ACC  | Accusative
DAT  | Dative-genitive
ILL  | Illative
INE  | Inessive
ELA  | Elative
ALL  | Allative
ADE  | Adessive
ABL  | Ablative
SUB  | Sublative
SUP  | Superessive
DEL  | Delative
TER  | Terminative
CAU  | Causal-final
INS  | Instrumental
FAC  | Translative
FOR  | Formal
SOC  | Essive-modal
TEM  | Temporal

Verb [IGE]

Present indicative, indefinite object
Singular: 1 e1, 2 e2, 3 e3; Plural: 1 t1, 2 t2, 3 t3

Present indicative, definite object
Singular: 1 Te1, 2 Te2, 3 Te3; Plural: 1 Tt1, 2 Tt2, 3 Tt3

Past, indefinite object
Singular: 1 Me1, 2 Me2, 3 Me3; Plural: 1 Mt1, 2 Mt2, 3 Mt3

Past, definite object
Singular: 1 TMe1, 2 TMe2, 3 TMe3; Plural: 1 TMt1, 2 TMt2, 3 TMt3

Subjunctive/imperative, indefinite object
Singular: 1 Pe1, 2 Pe2, 3 Pe3; Plural: 1 Pt1, 2 Pt2, 3 Pt3

Subjunctive/imperative, definite object
Singular: 1 TPe1, 2 TPe2, 3 TPe3; Plural: 1 TPt1, 2 TPt2, 3 TPt3

Conditional, indefinite object
Singular: 1 Fe1, 2 Fe2, 3 Fe3; Plural: 1 Ft1, 2 Ft2, 3 Ft3

Conditional, definite object
Singular: 1 TFe1, 2 TFe2, 3 TFe3; Plural: 1 TFt1, 2 TFt2, 3 TFt3

The object is already marked on the verb; the suffix signals that the subject is first person singular and the object is second person.
Tense/Mood: present indicative Ie1; past IMe1; subjunctive IFe1; conditional IPe1

Inflected infinitive forms
Singular: 1 INRe1, 2 INRe2, 3 INRe3; Plural: 1 INRt1, 2 INRt2, 3 INRt3
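Since the compound tags used in this work consist of one of the PoS tags above followed by underscore-separated subtags (e.g. FN_PSe3_ACC), they can be decomposed mechanically. The following is a minimal sketch under that assumption; the function name is my own, and the default tag NNP is deliberately rejected because it does not belong to the Hungarian tagset.

```python
POS_TAGS = {"DET", "FN", "NM", "KSZ", "MN", "SZN", "IGE", "INF",
            "IK", "HA", "NU", "KOT", "ISZ", "MSZ", "MOD", "USZ"}

def split_tag(tag):
    """Split a compound tag like 'FN_PSe3_ACC' into (PoS, [subtags])."""
    parts = tag.split("_")
    if parts[0] not in POS_TAGS:
        raise ValueError("unknown part of speech: " + parts[0])
    return parts[0], parts[1:]

print(split_tag("FN_PSe3_ACC"))  # ('FN', ['PSe3', 'ACC'])
print(split_tag("IGE_TMt3"))     # ('IGE', ['TMt3'])
print(split_tag("INF"))          # ('INF', [])
```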

Appendix B: Lexical rules

There are 18 types of lexical rules, which are instantiated from a set of transformation templates, see also section 3.2. The types with their form and explanation are given in the table below. (Note that rules starting with 'f' are only applied if the current tag matches the specified current tag, while other rules change the tagging regardless of the current tag.)

Type        | Form                               | Explanation
addpref     | Ltrx addpref X Tag Score           | If adding the letters Ltrx of length X to the beginning of a word results in a word, tag it as Tag.
addsuf      | Ltrx addsuf X Tag Score            | If adding the letters Ltrx of length X to the end of a word results in a word, tag it as Tag.
char        | Ltr char Tag Score                 | If the character Ltr appears anywhere in the current word, tag it as Tag.
deletepref  | Ltrx deletepref X Tag Score        | If deleting the letters Ltrx of length X from the beginning of a word results in a word, tag it as Tag.
deletesuf   | Ltrx deletesuf X Tag Score         | If deleting the letters Ltrx of length X from the end of a word results in a word, tag it as Tag.
goodleft    | Word goodleft Tag Score            | If the current word ever appears to the left of the Word, tag it as Tag.
goodright   | Word goodright Tag Score           | If the current word ever appears to the right of the Word, tag it as Tag.
haspref     | Ltrx haspref X Tag Score           | If a word has the prefix Ltrx of length X, tag it as Tag.
hassuf      | Ltrx hassuf X Tag Score            | If a word has the suffix Ltrx of length X, tag it as Tag.
faddpref    | Tag1 Ltrx faddpref X Tag2 Score    | If the current word has tag Tag1 and adding the letters Ltrx of length X to the beginning of the word results in a word, then change Tag1 to Tag2.
faddsuf     | Tag1 Ltrx faddsuf X Tag2 Score     | If the current word has tag Tag1 and adding the letters Ltrx of length X to the end of the word results in a word, then change Tag1 to Tag2.
fchar       | Tag1 Ltr fchar Tag2 Score          | If the character Ltr appears anywhere in the current word and that word is tagged with Tag1, then change Tag1 to Tag2.
fdeletepref | Tag1 Ltrx fdeletepref X Tag2 Score | If the current word has tag Tag1 and deleting the letters Ltrx of length X from the beginning of the word results in a word, then change Tag1 to Tag2.
fdeletesuf  | Tag1 Ltrx fdeletesuf X Tag2 Score  | If the current word has tag Tag1 and deleting the letters Ltrx of length X from the end of the word results in a word, then change Tag1 to Tag2.
fgoodleft   | Tag1 Word fgoodleft Tag2 Score     | If the current word ever appears to the left of the Word and the current word is tagged as Tag1, then change Tag1 to Tag2.
fgoodright  | Tag1 Word fgoodright Tag2 Score    | If the current word ever appears to the right of the Word and the current word is tagged as Tag1, then change Tag1 to Tag2.
fhaspref    | Tag1 Ltrx fhaspref X Tag2 Score    | If a word has the prefix Ltrx of length X and is currently tagged as Tag1, then change Tag1 to Tag2.
fhassuf     | Tag1 Ltrx fhassuf X Tag2 Score     | If a word has the suffix Ltrx of length X and is currently tagged as Tag1, then change Tag1 to Tag2.

For example, the rule

kus hassuf 3 MN 11

means that if the last three characters of a word are kus, the word is annotated with the tag MN (as an adjective). The rule

FN enek fhassuf 4 MN 11

means that if a word is tagged FN (as a noun) and the last four characters of that word are the suffix enek, the tag FN (noun) is changed to the tag MN (adjective). In both rules above, the number in the last column indicates the score which the rule in question achieved during the training process; the scores only serve as place-holders in tagging and can be replaced with any character string.


Lexical rules for PoS tags 1 char FN 3764.37859264918 NN t fchar IGE 1096.7107042863 a goodright FN 430.588443407703 ni hassuf 2 INF 380 NN o fchar FN 299.653846153846 1k hassuf 2 IGE 198 tt hassuf 2 IGE 193.098531746032 3 hassuf 1 MN 178.569298245614 NN e fchar IGE 165 1 hassuf 1 MN 153.952380952381 es hassuf 2 MN 127.666666666667 NN a fchar FN 126.5 ta hassuf 2 IGE 108 os hassuf 2 MN 104.8 lt hassuf 2 IGE 92.9976190476191 NNP n fchar FN 91.6666666666667 FN i fhassuf 1 MN 90.2818181818182 bb hassuf 2 MN 83.6899766899767 tak hassuf 3 IGE 78 va hassuf 2 HA 76.2777777777778 ve hassuf 2 HA 68 et hassuf 2 FN 66.6666666666667 len hassuf 3 MN 58.5 FN nem fgoodright IGE 57.3766840664407 sen hassuf 3 MN 54 IGE az fgoodright FN 52.8033456739339 an deletesuf 2 MN 51 te hassuf 2 IGE 50.2833333333333 ik hassuf 2 IGE 47 t addsuf 1 FN 44.9822169059011 l hassuf 1 FN 45.6698841698842 ni addsuf 2 IGE 40.4343434343434 lan hassuf 3 MN 39.6666666666667 am haspref 2 NM 39.216393442623 NNP e fchar FN 35.8571428571429 IGE a fgoodright FN 33.8825396825397 nk hassuf 2 IGE 32 nu2l hassuf 4 MN 31 IGE volt fgoodleft FN 29.4166666666667 nia hassuf 3 IGE 29 tek hassuf 3 IGE 28 ei hassuf 2 FN 27 3en hassuf 3 MN 25.3333333333333 IGE n fhassuf 1 FN 23 FN nagyon fgoodright MN 23.5714285714286 NNP S-T-A-R-T fgoodright NM 23 leg hassuf 3 HA 22.6666666666667 FN tt fhassuf 2 MN 22.1666666666667 IGE egy fgoodright FN 21.4137931034483 jon hassuf 3 IGE 21 nul hassuf 3 MN 21 IGE it fhassuf 2 FN 20 anak hassuf 4 IGE 19 FN an faddsuf 2 MN 18.2


ot hassuf 2 FN 18 ne1 hassuf 3 IGE 17 lja hassuf 3 IGE 17 1t addsuf 2 FN 17 ne goodright IGE 17 FN De fgoodright HA 16.9716142499955 d hassuf 1 IGE 16.4 IGE A fgoodright FN 21.8857142857143 at hassuf 2 FN 16 IGE re fhassuf 2 FN 16 ugy haspref 3 NM 16 mind haspref 4 NM 15.1666666666667 vala haspref 4 NM 15.0098484848485 IGE a fgoodright FN 15 tam hassuf 3 IGE 15 3s hassuf 2 MN 15 9 char SZN 15 ment addsuf 4 IK 14.6289089089089 o1ra goodleft SZN 14.4332443257677 ie hassuf 2 IGE 14 tja hassuf 3 IGE 14 ban addsuf 3 FN 13 hat hassuf 3 IGE 13 san hassuf 3 MN 13 dt hassuf 2 IGE 12.5 juk hassuf 3 IGE 12 ta1l hassuf 4 IGE 12 2s hassuf 2 MN 12 esek hassuf 4 MN 12 lag hassuf 3 HA 12 tiz haspref 3 SZN 12 sa1g addsuf 4 MN 11.6 lnak hassuf 4 IGE 11 as hassuf 2 MN 11 kus hassuf 3 MN 11 FN enek fhassuf 4 MN 11 FN bben fhassuf 4 MN 11 la1tszott goodleft MN 11 o2nm haspref 4 NM 11 FN mit fgoodright IGE 10.0833333333333 IGE g fhassuf 1 FN 10 ata hassuf 3 FN 10 -e hassuf 2 IGE 10 i1t hassuf 3 IGE 10 FN az fhaspref 2 NM 10 Mi haspref 2 HA 9.53846153846154 se1g addsuf 4 MN 9.01363636363636 ra hassuf 2 FN 9 na1 hassuf 3 IGE 9 iuk hassuf 3 IGE 9 het hassuf 3 IGE 9 znek hassuf 4 IGE 9 tak addsuf 3 IGE 9 1le hassuf 3 MN 9 NM h fchar HA 9 is addsuf 2 HA 9


egym haspref 4 NM 9 magu haspref 4 NM 9 negy haspref 4 SZN 9 mind addpref 4 NM 8.99532710280374 FN u2l fhassuf 3 HA 8.8 le1 hassuf 3 HA 8.05263157894737 sai hassuf 3 FN 8 zi hassuf 2 IGE 8 ne hassuf 2 IGE 8 dom hassuf 3 IGE 8 harm haspref 4 SZN 8 IGE a fgoodright FN 7.75 -e addsuf 2 IGE 7.12820512820513 NNP L fchar FN 7 1i hassuf 2 FN 7 ete hassuf 3 FN 7 IGE re faddsuf 2 FN 7 IGE kis fgoodright FN 7 tse hassuf 3 IGE 7 bek hassuf 3 MN 7 3ek hassuf 3 MN 7 IGE o1k fhassuf 3 MN 7 ily haspref 3 MN 7 1lis hassuf 4 MN 7 nn hassuf 2 HA 7 aki haspref 3 NM 7 FN ma1s fhaspref 4 NM 7 NM p fchar FN 7 1rom hassuf 4 SZN 7 mo2 haspref 3 NU 7 ha goodleft KOT 6.87255373455829 kor deletesuf 3 HA 6.67486338797814 tti hassuf 3 MN 6.6 int hassuf 3 HA 6.55555555555556 tem hassuf 3 IGE 6.5 1e1 hassuf 3 FN 6 eik hassuf 3 FN 6 ka1k hassuf 4 FN 6 IGE inek fhassuf 4 FN 6 Mini haspref 4 FN 6 IGE van fgoodleft FN 6 zem hassuf 3 IGE 6 te1l hassuf 4 IGE 6 IGE ben faddsuf 3 FN 6.66666666666667 i1ti hassuf 4 IGE 6 bbak hassuf 4 MN 6 is deletesuf 2 HA 6 ah haspref 2 HA 6 pen hassuf 3 HA 6 ba1r deletepref 4 NM 6 lyen addsuf 4 NM 6 hus haspref 3 SZN 6 FN ent fhassuf 3 IGE 5.75 FN vala faddpref 4 NM 5.41025641025641 abb addsuf 3 MN 5.33333333333333 FN o2n faddpref 3 NM 5.25


aik hassuf 3 FN 5 j deletesuf 1 IGE 5 ssa hassuf 3 IGE 5 lsz hassuf 3 IGE 5 ne1m hassuf 4 IGE 5 znak hassuf 4 IGE 5 ju2k hassuf 4 IGE 5 jo2n hassuf 4 IGE 5 o1 addsuf 2 IGE 5 INF i fdeletesuf 1 MN 5 FN bban fhassuf 4 MN 5 prol haspref 4 MN 5 ebb addsuf 3 MN 5 ha hassuf 2 HA 5 NM tt fhassuf 2 HA 5 zor hassuf 3 HA 5 lo3l hassuf 4 HA 5 Mag haspref 3 NM 5 FN tt fhassuf 2 MN 4.78571428571429 FN jo2 fhaspref 3 MN 4.6 IGE be fhassuf 2 FN 4.5 FN Eg fhaspref 2 HA 4.5 1s deletesuf 2 MN 4.12121212121212 mi deletepref 2 HA 4.01234567901235 NN i fchar FN 4 NN u fchar FN 4 ai deletesuf 2 FN 4 IGE ba fhassuf 2 FN 4 io1 hassuf 3 FN 4 ntja hassuf 4 FN 4 IGE o2sz fhaspref 4 FN 4 IGE ro1l faddsuf 4 FN 4 IGE volt fgoodleft FN 4 IGE voltak fgoodleft FN 4 IGE u1j fgoodright FN 4 dsz hassuf 3 IGE 4 nja hassuf 3 IGE 4 ant hassuf 3 IGE 4 tsa hassuf 3 IGE 4 zza hassuf 3 IGE 4 dja hassuf 3 IGE 4 1gja hassuf 4 IGE 4 heti hassuf 4 IGE 4 o3 addsuf 2 IGE 4 IGE mi fhassuf 2 MN 4 tan hassuf 3 MN 4 1ak hassuf 3 MN 4 o1s hassuf 3 MN 4 snek deletesuf 4 MN 4 ajta hassuf 4 MN 4 he1r hassuf 4 MN 4 re1g haspref 4 MN 4 dolog goodleft MN 4 IGE egy fgoodright MN 4 nan hassuf 3 HA 4 FN sem fhassuf 3 HA 4


o1ta deletesuf 4 HA 4 u1gy hassuf 4 HA 4 jebb hassuf 4 HA 4 HA z fhassuf 1 NM 4 to3 haspref 3 NM 4 o2ve haspref 4 NM 4 egyi haspref 4 NM 4 benn haspref 4 NM 4 am addpref 2 NM 4 hogy addsuf 4 HA 4 Tiz haspref 3 SZN 4 lio1 hassuf 4 SZN 4 za1z hassuf 4 SZN 4 FN o2tv fhaspref 4 SZN 4 NN . fgoodleft SZN 4 csak deletesuf 4 KOT 4 -t deletesuf 2 USZ 4 Tim haspref 3 USZ 4 le1v addpref 4 NM 3.88235294117647 ki deletesuf 2 NM 3.60714285714286 FN 2lt fhassuf 3 IGE 3.5 FN dt fhassuf 2 MN 3.5 lyo1 hassuf 4 FN 3.33333333333333 etek deletesuf 4 FN 3.33333333333333 MN megs fhaspref 4 IGE 3.33333333333333 FN Az fhaspref 2 NM 3.33333333333333 alatt goodleft FN 3.32142857142857 FN mi faddsuf 2 KOT 3.10087719298246 IGE C fchar FN 3 NM j fchar FN 3 NNP a fchar FN 3 NNP i fchar FN 3 NM T fhaspref 1 FN 3 IGE pa fhaspref 2 FN 3 ze1 hassuf 3 FN 3 ink hassuf 3 FN 3 o3i hassuf 3 FN 3 jai hassuf 3 FN 3 IGE knak fhassuf 4 FN 3 IGE knek fhassuf 4 FN 3 IGE test fhaspref 4 FN 3 1s addsuf 2 FN 3 IGE ig faddsuf 2 FN 3 val addsuf 3 FN 3 o2n addsuf 3 FN 3 kkal addsuf 4 FN 3 IGE elso3 fgoodright FN 3 IGE ege1sz fgoodright FN 3 na hassuf 2 IGE 3 IGE 1t faddsuf 2 FN 4 FN ol fhassuf 2 IGE 3 jen deletesuf 3 IGE 3 rsz hassuf 3 IGE 3 bja hassuf 3 IGE 3 FN nom fhassuf 3 IGE 3 szel hassuf 4 IGE 3


sson hassuf 4 IGE 3 1tok hassuf 4 IGE 3 IGE kal faddsuf 3 FN 3 FN majd fgoodleft IGE 3 li hassuf 2 MN 3 Ro haspref 2 MN 3 u1s hassuf 3 MN 3 1an hassuf 3 MN 3 FN sta fhassuf 3 MN 3 ege haspref 3 MN 3 anok hassuf 4 MN 3 osra hassuf 4 MN 3 keny hassuf 4 MN 3 kete hassuf 4 MN 3 bbre hassuf 4 MN 3 2rke hassuf 4 MN 3 bnek hassuf 4 MN 3 FN nnek fhassuf 4 MN 3 IGE snak fhassuf 4 MN 3 IGE nne1 fhassuf 4 MN 3 FN yire fhassuf 4 MN 3 to2b haspref 4 MN 3 IGE 1n faddsuf 2 MN 3 MN ko2ru2l fgoodleft FN 3 pp hassuf 2 HA 3 hol hassuf 3 HA 3 nte hassuf 3 HA 3 ogy hassuf 3 HA 3 FN Mos fhaspref 3 HA 3 ltal hassuf 4 HA 3 zo2r hassuf 4 HA 3 yett hassuf 4 HA 3 FN nben fhassuf 4 HA 3 NM folytatta fgoodleft HA 3 valahol goodright HA 3 FN gondolta fgoodright HA 3 MN 1ny fdeletesuf 3 NM 3 mag deletepref 3 NM 3 na1 haspref 3 NM 3 nek haspref 3 NM 3 lyen deletesuf 4 NM 3 ro1l haspref 4 NM 3 2 haspref 1 SZN 3 NM k fhaspref 1 SZN 3 ann haspref 3 SZN 3 FN ti1 fhaspref 3 SZN 3 Ne1g haspref 4 SZN 3 FN nap fgoodleft SZN 3 FN tte fhassuf 3 NU 3 HA teleke1p fgoodleft DET 2.99532710280374


Lexical rules for PoS and subtags o char MN 838.393967557783 t hassuf 1 IGE_Me3 659.101902858138 ni hassuf 2 INF 382.05 NN 1 fchar FN 363.165934065934 k hassuf 1 FN_PL 300.05 ta hassuf 2 IGE_TMe3 208.551231527094 1t hassuf 2 FN_PSe3_ACC 202.857142857143 te hassuf 2 IGE_TMe3 192.417741935484 l hassuf 1 FN_INS 181.007142857143 n hassuf 1 FN_INE 177.741666666667 tak hassuf 3 IGE_Mt3 130 1k hassuf 2 IGE_TMt3 128.25 IGE_Me3 a fgoodright FN_ACC 125.326767676768 ra hassuf 2 FN_SUB 113.05 1t addsuf 2 FN_PSe3 110.5 es hassuf 2 MN 99.6 bb hassuf 2 MN_FOK 96.4777777777778 tek hassuf 3 IGE_Mt3 95.1666666666667 nak hassuf 3 FN_DAT 94.1428571428571 g hassuf 1 FN 93.5434782608696 1n hassuf 2 FN_PSe3_SUP 80 va hassuf 2 HA 77.25 kat hassuf 3 FN_PL_ACC 77 ve hassuf 2 HA 76 ik hassuf 2 IGE_e3 74.4 et hassuf 2 FN_ACC 73.6666666666667 re hassuf 2 FN_SUB 72.6666666666667 1s hassuf 2 FN 69 len hassuf 3 MN 66.3866666666667 NN i fchar MN 66.0147058823529 MN a fhassuf 1 FN_PSe3 65 ot hassuf 2 FN_ACC 58 st hassuf 2 FN_ACC 57 FN_INE en fdeletesuf 2 MN_ESS_MOD 57 an deletesuf 2 MN_ESS_MOD 52 eket hassuf 4 FN_PL_ACC 52 ba hassuf 2 FN_ILL 50 ni addsuf 2 IGE 49.5797101449275 nek hassuf 3 FN_DAT 48.75 FN i fhassuf 1 MN 48 nk hassuf 2 IGE_t1 48 m hassuf 1 FN 47.5 on hassuf 2 FN_SUP 46 it hassuf 2 FN_PSe3i_ACC 46 FN e fhassuf 1 FN_PSe3 45 FN_INS 1l fhassuf 2 FN_DEL 44 lan hassuf 3 MN 43 ja hassuf 2 IGE_Te3 42.3 bo1l hassuf 4 FN_ELA 41 FN_INS 2l fhassuf 2 MN_ESS_MOD 39 r hassuf 1 FN 36.025641025641 3 hassuf 1 MN 36 ja1k hassuf 4 IGE_Tt3 35.475 ot addsuf 2 FN 35


1ben hassuf 4 FN_PSe3_INE 35 u1 hassuf 2 MN 34 ei hassuf 2 FN_PSe3i 33 nia hassuf 3 IGE_INRe3 33 MN e fhassuf 1 FN_PSe3 32 be hassuf 2 FN_ILL 32 ie hassuf 2 IGE_INRe3 32 NNP S-T-A-R-T fgoodright MN 31 uk hassuf 2 FN_PSt3 30 1re hassuf 3 FN_PSe3_SUB 29 kal hassuf 3 FN_PL_INS 28.75 e1t addsuf 3 FN 28 1nek hassuf 4 FN_PSe3_DAT 28 3l hassuf 2 FN_ABL 28 MN oz fhassuf 2 FN_ALL 28 FN a fhassuf 1 FN_PSe3 27 ro3l hassuf 4 FN_DEL 27 to1l hassuf 4 FN_ABL 27 leg hassuf 3 HA 24.6666666666667 FN_INS ul fhassuf 2 MN_ESS_MOD 24 u2k hassuf 3 FN_PSt3 24 FN_SUB a1ra fhassuf 4 FN_PSe3_SUB 23.95 NN a fchar FN 23.3333333333333 1nak hassuf 4 FN_PSe3_DAT 23 ig hassuf 2 FN_TER 23 1ja hassuf 3 FN_PSe3 22.5 bo3l hassuf 4 FN_ELA 22 IGE_Me3 at fhassuf 2 FN_ACC 21 o1t hassuf 3 FN_ACC 21 sen hassuf 3 MN_ESS_MOD 21 FN 1 fhassuf 1 IGE_TFe3 21 tam hassuf 3 IGE_TMe1 20.8333333333333 y hassuf 1 FN 20 het hassuf 3 IGE 20 FN_INS kel fhassuf 3 FN_PL_INS 20 1vel hassuf 4 FN_PSe3_INS 20 ne hassuf 2 IGE_Fe3 20 FN_PSt3 j fchar IGE_Tt1 18.6666666666667 S goodright HA 18.4919924337957 san hassuf 3 MN_ESS_MOD 18 ott addsuf 3 IGE 18 tik hassuf 3 IGE_Tt3 18 FN_SUP j fchar IGE_Pe3 18 na hassuf 2 IGE_Fe3 18 kon hassuf 3 FN_PL_SUP 18 e1nt hassuf 4 FN_FOR 17.3026315789474 FN_DAT . fgoodleft IGE_t3 16.4642857142857 lag hassuf 3 HA 16 1t addsuf 2 FN_PSe3 16 FN_PSe3 1k faddsuf 2 FN 16.3 tem hassuf 3 IGE_TMe1 16 FN_ACC tt fhassuf 2 MN 15.8 1ban deletesuf 4 FN_PSe3_INE 15.5 MN a1 fhassuf 2 MN_FAC 15 tad hassuf 3 IGE_TMe2 15 Mi haspref 2 HA 14.6153846153846


FN as fhassuf 2 MN 14 let hassuf 3 FN 14 IGE_TMt3 a fgoodright FN_PL 14 kban hassuf 4 FN_PL_INE 14 sz hassuf 2 IGE_e2 14 ba addsuf 2 FN 14.0378787878788 ukat hassuf 4 FN_PSt3_ACC 14 ekre hassuf 4 FN_PL_SUB 14 ment addsuf 4 IK 13.6241257538471 zi hassuf 2 IGE_Te3 13 enek hassuf 4 MN_PL 13 NN e fchar IGE_TPe3 13 FN_DAT anak fhassuf 4 IGE_Pt3 13 FN_DAT knak fhassuf 4 FN_PL_DAT 13 o1ra goodleft SZN 12.6025706940874 abb addsuf 3 MN 12 3en hassuf 3 MN_ESS_MOD 12 hat hassuf 3 IGE 12 eti hassuf 3 IGE_Te3 12 1be hassuf 3 FN_PSe3_ILL 12 2ket hassuf 4 FN_PSt3_ACC 12 FN_DEL a1l fhassuf 3 IGE_Me2 12 FN_DEL e1l fhassuf 3 IGE_Me2 12 MN od fhassuf 2 IGE_Te2 12 hez hassuf 3 FN_PSe3_ALL 12 NNP 1 fchar FN 11 FN_PSe3 ia fhassuf 2 FN 11 FN_ACC et faddsuf 2 FN 11 FN_INE S-T-A-R-T fgoodright HA 11 zt hassuf 2 FN_ACC 11 MN_ESS_MOD n fdeletesuf 1 FN_SUP 11 ai deletesuf 2 FN_PSe3i 11 zem hassuf 3 IGE_Te1 11 kben hassuf 4 FN_PL_INE 11 kra hassuf 3 FN_PL_SUB 11 bban hassuf 4 MN_FOK_ESS_MOD 11 MN t faddsuf 1 FN 10.7315062388592 sa1g addsuf 4 MN 10.5833333333333 etek hassuf 4 FN_PL 10.5 1le hassuf 3 MN 10 ete hassuf 3 FN_PSe3 10 esek hassuf 4 MN_PL 10 la1tszott goodleft MN_DAT 10 dom hassuf 3 IGE_Te1 10 iban hassuf 4 FN_PSe3i_INE 10 inek hassuf 4 FN_PSe3i_DAT 10 knek hassuf 4 FN_PL_DAT 9.75 bben hassuf 4 MN_FOK_ESS_MOD 9.5 e1rt hassuf 4 FN_CAU 9.38461538461539 IGE_TMt3 o1k fhassuf 3 MN_PL 9.25 le1 hassuf 3 HA 9.05 FN nagyon fgoodright MN 9 cs hassuf 2 FN 9 ha hassuf 2 HA 9 3t hassuf 2 FN_ACC 9 IGE_Me3 yt fhassuf 2 FN_ACC 9


ata hassuf 3 FN_PSe3 9 1val hassuf 4 FN_PSe3_INS 9 FN_ILL j fchar FN_PSe3_ILL 9 eken hassuf 4 FN_PL_SUP 9 ed hassuf 2 IGE_TMe2 9 FN_INE 2n fhassuf 2 FN_PSt3_SUP 9 kig hassuf 3 FN_PL_TER 9 is addsuf 2 HA 8.9 kor addsuf 3 NM 8.86507636473613 o3 addsuf 2 IGE 8.61111111111111 kor deletesuf 3 HA 8.5 o1 addsuf 2 IGE 8 IGE_t3 a fgoodright FN_DAT 8 znek hassuf 4 IGE_t3 8 FN_DAT nem fgoodright IGE_t3 8 zze hassuf 3 IGE_TPe3 8 jen hassuf 3 IGE_Pe3 8 jo2n hassuf 4 IGE_Pe3 8 se1 hassuf 3 MN_FAC 8 FN_PSt3 n fchar IGE_INRt3 8 IGE_Me2 n fchar FN_ADE 8 ink hassuf 3 FN_PSt1i 8 ne1k hassuf 4 IGE_Fe1 8 IGE_Me3 az fgoodright MN 7.88888888888889 mit goodleft KOT 7.80684007707129 int hassuf 3 HA 7.64285714285714 ert hassuf 3 FN_ACC 7.25 FN i fhassuf 1 MN 7 se1g addsuf 4 MN 7.72222222222222 FN 1s fdeletesuf 2 MN 7 NNP e fchar FN 7 p hassuf 1 FN 7 hoz addsuf 3 FN 7 FN_INS ol fhassuf 2 HA 7 pen hassuf 3 HA 7 ogy hassuf 3 HA 7 lo3l hassuf 4 HA 7 ia1t hassuf 4 FN_ACC 7 FN_PSe3_ACC egy fgoodright FN_ACC 7 ta addsuf 2 IGE 7 val deletesuf 3 FN_INS 7 MN c fhassuf 1 SZN 7 tiz haspref 3 SZN 7 3n hassuf 2 FN_SUP 7 FN_SUP t faddsuf 1 FN 7 sak hassuf 3 MN_PL 7 FN_PSe3_DAT nak fdeletesuf 3 MN_DAT 7 FN_DAT 3nek fhassuf 4 MN_DAT 7 ssa hassuf 3 IGE_TPe3 7 lje hassuf 3 IGE_TPe3 7 FN_SUP son fhassuf 3 IGE_Pe3 7 hez deletesuf 3 FN_ALL 7 zel hassuf 3 IGE_e2 7 sa1k hassuf 4 IGE_TPt3 7 j hassuf 1 IGE_Pe2 7 FN_INE in fhassuf 2 FN_PSe3i_SUP 7


IGE_e3 k fdeletesuf 1 IGE_Tt3 6.2 lyen deletesuf 4 NM 6.02941176470588 o1s hassuf 3 MN 6 ajta hassuf 4 MN 6 1lis hassuf 4 MN 6 da hassuf 2 FN 6 FN_ACC a faddsuf 1 FN 6 is deletesuf 2 HA 6 zor hassuf 3 HA 6 FN_INE s fgoodright HA 6 FN_PSe3_SUP - fgoodright HA 6 tan hassuf 3 MN_ESS_MOD 6 9 char SZN 6 1rom hassuf 4 SZN 6 szor addsuf 4 SZN 6 lnak hassuf 4 IGE_t3 6 yik hassuf 3 NM 6 1ak hassuf 3 MN_PL 6 3ek hassuf 3 MN_PL 6 FN_PSt3 . fgoodleft NU 6 1i hassuf 2 FN_PSe3i 6 tsa hassuf 3 IGE_TPe3 6 tse hassuf 3 IGE_TPe3 6 FN_ELA j fchar FN_PSe3_ELA 6 FN_PSe3i_DAT a fgoodleft NM_DAT 6 FN_PL nem fgoodright IGE_e1 6 IGE_t1 j fchar IGE_Pt1 6 ned hassuf 3 IGE_INRe2 6 nod hassuf 3 IGE_INRe2 6 tok deletesuf 3 IGE_t2 6 MN e1 fhassuf 2 FN_POS 6 ira hassuf 3 FN_PSe3i_SUB 6 -e hassuf 2 IGE_Me3_KSZ 6 FN kell fgoodright IGE_INRe1 6 1nk hassuf 3 IGE_Ft1 6 - hassuf 1 FN 5 io1 hassuf 3 FN 5 mus hassuf 3 FN 5 1s addsuf 2 FN 5 FN_TER , fgoodright HA 5 IGE_Te3 c fhaspref 1 FN_PSe3 5 1an hassuf 3 MN_ESS_MOD 5 MN_ESS_MOD t faddsuf 1 IGE 5 IGE_Me3 2t fhassuf 2 SZN 5 NN . fgoodleft SZN 5 IGE_e3 a fgoodright SZN 5 o1n hassuf 3 FN_SUP 5 ugya haspref 4 NM 5 za1k hassuf 4 IGE_Tt3 5 mo2 haspref 3 NU 5 snek deletesuf 4 MN_DAT 5 mit hassuf 3 NM_ACC 5 om deletesuf 2 IGE_Te1 5 3ve1 hassuf 4 MN_FAC 5 MN_PL j fchar IGE_Pt3 5 se1k hassuf 4 IGE_TPt3 5


iben hassuf 4 FN_PSe3i_INE 5 inak hassuf 4 FN_PSe3i_DAT 5 to3 haspref 3 NM_ABL 5 e1m hassuf 3 IGE_TFe1 5 IGE_t1 kell fgoodright IGE_INRt1 5 INF i fdeletesuf 1 MN 4.75 a1t addsuf 3 FN 4.57028985507246 je1k hassuf 4 IGE_Pe3 4.5 ta1n hassuf 4 HA 4.33333333333333 HA t fdeletesuf 1 NM_ACC 4.32857142857143 eni addsuf 3 IGE 4.28571428571429 IGE_Me3 i faddsuf 1 NU 5 MN_ESS_MOD egy fgoodleft HA 4.25 szer addsuf 4 SZN 4.08731645679927 ja1t addsuf 4 FN 4.00613496932515 IGE_e2 1sz fhassuf 3 MN 4 IGE_Me3 egy fgoodright MN 4 MN 1d fhassuf 2 FN 4 os addsuf 2 FN 4.5 FN de fgoodright HA 4.25489444243955 FN_INS ll fhassuf 2 FN 4 FN_PSe3 ka fhassuf 2 FN 4 net hassuf 3 FN 4 FN_ACC zet fhassuf 3 FN 4 de1k hassuf 4 FN 4 MN val faddsuf 3 FN 4 IGE_Me2 a fgoodright FN 4 FN_INE nan fhassuf 3 HA 4 IGE_TMe3 1ta fhassuf 3 HA 4 u1gy hassuf 4 HA 4 yett hassuf 4 HA 4 gyan hassuf 4 HA 4 jebb hassuf 4 HA 4 FN_SUB csak fgoodleft HA 4 FN_PSe3_ACC - fgoodright HA 4 ont hassuf 3 FN_ACC 4 FN a fdeletesuf 1 FN_PSe3 5 ere hassuf 3 FN_PSe3 4 IGE_e2 nek faddsuf 3 IGE 4 FN_PSe3_INS , fgoodright FN_INS 4 FN_PSe3 za fhassuf 2 IGE_Te3 4 1ti hassuf 3 IGE_Te3 4 FN_INE ven fhassuf 3 SZN 4 FN_DAT k fdeletesuf 1 IGE_t3 4 dnak hassuf 4 IGE_t3 4 ki deletesuf 2 NM 4 iek hassuf 3 MN_PL 4 anok hassuf 4 MN_PL 4 sai hassuf 3 FN_PSe3i 4 jai hassuf 3 FN_PSe3i 4 FN_DAT e1s fgoodleft MN_DAT 4 tsen hassuf 4 IGE_Pe3 4 zom hassuf 3 IGE_Te1 4 2z hassuf 2 FN_ALL 4 FN_ILL a fhaspref 1 FN_PSe3_ILL 4 ne1 hassuf 3 IGE_TFe3 4


FN m fdeletesuf 1 FN_PSe1 4 vel haspref 3 NM_INS 4 ze1k hassuf 4 IGE_TPt3 4 FN_DEL j fchar FN_PSe3_DEL 4 benn haspref 4 NM_INE 4 ire hassuf 3 MN_SUB 4 FN_ALL a fgoodleft NM_ALL 4 dd hassuf 2 IGE_TPe2 4 1ik hassuf 3 FN_PSt3i 4 bnak hassuf 4 MN_FOK_DAT 4 bnek hassuf 4 MN_FOK_DAT 4 1tok hassuf 4 IGE_Tt2 4 na1k hassuf 4 IGE_TFt3 4 bbek hassuf 4 MN_FOK_PL 4 e1 deletesuf 2 FN_POS 3.95 MN kell fgoodleft IK 3.92857142857143 FN_ACC lt fhassuf 2 MN 3.6 MN t fdeletesuf 1 FN_ACC 4.73333333333333 va addsuf 2 IGE 3.55849802371542 SZN ege1sz fgoodleft DET 3.55393851497289 rem hassuf 3 IGE_Te1 3.5 IGE_Me3 A fgoodright MN 3.25 vu2l hassuf 4 HA 3.07142857142857 MN sem fgoodleft FN 3.05263157894737 u1s hassuf 3 MN 3 leti hassuf 4 MN 3 keny hassuf 4 MN 3 kete hassuf 4 MN 3 he1r hassuf 4 MN 3 ista hassuf 4 MN 3 IGE_Me3 megv fhaspref 4 MN 3 ebb addsuf 3 MN 3 FN_PL volt fgoodleft MN 3 IGE_TPe3 c fchar FN 3 SZN p fchar FN 3 FN_PSe3_SUP - fchar FN 3 a1s deletesuf 3 FN 3 1tel hassuf 4 FN 3 le1k hassuf 4 FN 3 MN bo3l faddsuf 4 FN 3 MN alatt fgoodleft FN 3 IGE_e2 az fgoodright FN 3 FN_CAU z fchar IGE_Me3 3 IGE_Me3 mi fhaspref 2 HA 3 be1 hassuf 3 HA 3 1je hassuf 3 HA 3 ltal hassuf 4 HA 3 zo2r hassuf 4 HA 3 Most haspref 4 HA 3 egya haspref 4 HA 3 MN_ESS_MOD bizonyos fgoodleft HA 3 FN mindenki fgoodleft HA 3 FN_INS s fgoodright HA 3 FN_PSe3_SUB S-T-A-R-T fgoodright HA 3 FN_FOR S-T-A-R-T fgoodright HA 3 u1t hassuf 3 FN_ACC 3


IGE_Te3 A fchar FN_PSe3 3 MN sa fhassuf 2 FN_PSe3 3 aja hassuf 3 FN_PSe3 3 eje hassuf 3 FN_PSe3 3 fa1k hassuf 4 FN_PL 3 ia1k hassuf 4 FN_PL 3 da1k hassuf 4 FN_PL 3 IGE_TMt3 ban faddsuf 3 FN_PL 3 FN egy fhassuf 3 IGE 3 te addsuf 2 IGE 3 FN_PSe3_SUB egy fgoodright FN_SUB 3 MN_DAT , fgoodleft FN_DAT 3 3ket hassuf 4 FN_PL_ACC 3 ezer deletesuf 4 SZN 3 lio1 hassuf 4 SZN 3 za1z hassuf 4 SZN 3 FN_DAT el fdeletepref 2 IGE_t3 3 znak hassuf 4 IGE_t3 3 le1v addpref 4 NM 3 vik hassuf 3 IGE_Tt3 3 IGE_e3 fel fhaspref 3 IGE_Tt3 3 yis hassuf 3 KOT 3 csak deletesuf 4 KOT 3 FN_DAT nnek fhassuf 4 MN_DAT 3 IGE_Te3 bja fhassuf 3 IGE_TPe3 3 1juk hassuf 4 FN_PSt3 3 NM t fchar NM_ACC 3 yit hassuf 3 NM_ACC 3 NM_ACC s fchar FN_ACC 3 kit hassuf 3 NM_ACC 3 nket hassuf 4 NM_ACC 3 1lom hassuf 4 IGE_Te1 3 nna1 hassuf 4 MN_FAC 3 nne1 hassuf 4 MN_FAC 3 FN 1m fdeletesuf 2 FN_PSe1 3 zek hassuf 3 IGE_e1 3 FN_INS am fhaspref 2 NM_INS 3 smit hassuf 4 MN_ACC 3 FN_DEL k fchar FN_PL_DEL 3 FN_PL_DEL k fhaspref 1 FN_DEL 5 FN_PL_DEL , fgoodright FN_DEL 3 IGE_TMe2 a fgoodright FN_PSe2 3 IGE_t1 sz fhaspref 2 FN_PSt1 3 FN_SUB - fgoodright NM_SUB 3 zz hassuf 2 IGE_Pe2 3 IGE_TFe3 va1lt fgoodleft FN_FAC 3 sd hassuf 2 IGE_TPe2 3 IGE_TFe3 v fchar FN_PSe3_FAC 3 ire deletesuf 3 FN_PSe3i_SUB 3 IGE_INRt3 tu2k fhassuf 4 IGE_TMt1 3 FN_ADE , fgoodright NM_ADE 3 eik hassuf 3 FN_PSt3i 3 1sul hassuf 4 FN_ESS 3 MN_ESS_MOD su2l fhassuf 4 FN_ESS 3 ival deletesuf 4 FN_PSe3i_INS 3 FN_INS ivel fhassuf 4 FN_PSe3i_INS 3


mben deletesuf 4 FN_PSe1_INE 3 khez hassuf 4 FN_PL_ALL 3 FN_PL_INE mi fgoodright FN_PSt1_INE 3 mat deletesuf 3 FN_PSe1_ACC 3 IGE_Me3_KSZ a fchar IGE_KSZ 3 ba1 hassuf 3 MN_FOK_FAC 3 SZN b fchar SZN_INE 3 a-e hassuf 3 IGE_TMe3_KSZ 3 1d deletesuf 2 IGE_TFe2 3 nkre hassuf 4 FN_PSt1_SUB 3 ikat hassuf 4 FN_KIEM_ACC 3 ba1r addsuf 4 KOT 2.99386503067485


Appendix C: Contextual rules

There are several types of contextual rules, which are instantiated from a set of transformation templates (see also section 3.3). The following types are represented among the set of rules for Hungarian; each is given below with its form and an explanation.

CURWD (form: Tag1 Tag2 CURWD Word)
    Change Tag1 to Tag2 if the current word is Word.
LBIGRAM (form: Tag1 Tag2 LBIGRAM Word1 Word2)
    Change Tag1 to Tag2 if the left bigram is the words Word1 Word2, where the current word is Word2.
NEXTBIGRAM (form: Tag1 Tag2 NEXTBIGRAM Tag3 Tag4)
    Change Tag1 to Tag2 if the following two tags are Tag3 and Tag4.
NEXTTAG (form: Tag1 Tag2 NEXTTAG Tag3)
    Change Tag1 to Tag2 if the next tag is Tag3.
NEXT1OR2TAG (form: Tag1 Tag2 NEXT1OR2TAG Tag3)
    Change Tag1 to Tag2 if any of the next two tags is Tag3.
NEXT1OR2OR3TAG (form: Tag1 Tag2 NEXT1OR2OR3TAG Tag3)
    Change Tag1 to Tag2 if any of the next three tags is Tag3.
NEXT2TAG (form: Tag1 Tag2 NEXT2TAG Tag3)
    Change Tag1 to Tag2 if the second following tag is Tag3.
NEXTWD (form: Tag1 Tag2 NEXTWD Word)
    Change Tag1 to Tag2 if the next word is Word.
NEXT1OR2WD (form: Tag1 Tag2 NEXT1OR2WD Word)
    Change Tag1 to Tag2 if any of the next two words is Word.
NEXT2WD (form: Tag1 Tag2 NEXT2WD Word)
    Change Tag1 to Tag2 if the second following word is Word.
PREVBIGRAM (form: Tag1 Tag2 PREVBIGRAM Tag3 Tag4)
    Change Tag1 to Tag2 if the previous two tags are Tag3 and Tag4.
PREVTAG (form: Tag1 Tag2 PREVTAG Tag3)
    Change Tag1 to Tag2 if the previous tag is Tag3.
PREV1OR2TAG (form: Tag1 Tag2 PREV1OR2TAG Tag3)
    Change Tag1 to Tag2 if any of the previous two tags is Tag3.
PREV1OR2OR3TAG (form: Tag1 Tag2 PREV1OR2OR3TAG Tag3)
    Change Tag1 to Tag2 if any of the previous three tags is Tag3.
PREV2TAG (form: Tag1 Tag2 PREV2TAG Tag3)
    Change Tag1 to Tag2 if the second previous tag is Tag3.
PREVWD (form: Tag1 Tag2 PREVWD Word)
    Change Tag1 to Tag2 if the previous word is Word.
PREV1OR2WD (form: Tag1 Tag2 PREV1OR2WD Word)
    Change Tag1 to Tag2 if any of the previous two words is Word.
PREV2WD (form: Tag1 Tag2 PREV2WD Word)
    Change Tag1 to Tag2 if the second word before is Word.
RBIGRAM (form: Tag1 Tag2 RBIGRAM Word1 Word2)
    Change Tag1 to Tag2 if the right bigram is the words Word1 Word2, where the current word is Word1.
SURROUNDTAG (form: Tag1 Tag2 SURROUNDTAG Tag3 Tag4)
    Change Tag1 to Tag2 if the previous tag is Tag3 and the next tag is Tag4.
WDAND2BFR (form: Tag1 Tag2 WDAND2BFR Word1 Word2)
    Change Tag1 to Tag2 if the word is Word2 and the second word before is Word1.
WDAND2TAGAFT (form: Tag1 Tag2 WDAND2TAGAFT Word Tag3)
    Change Tag1 to Tag2 if the word is Word and the second tag after is Tag3.
WDPREVTAG (form: Tag1 Tag2 WDPREVTAG Tag3 Word)
    Change Tag1 to Tag2 if the word is Word and the previous tag is Tag3.
WDNEXTTAG (form: Tag1 Tag2 WDNEXTTAG Word Tag3)
    Change Tag1 to Tag2 if the word is Word and the next tag is Tag3.

Moreover, STAART marks the beginning of the sentence. Thus, the rule ". FN NEXT1OR2TAG STAART" means 'change the tag . to FN if one of the two following "tags" is the beginning of the sentence'.
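As an illustration of how such contextual rules fire (assuming the standard semantics of Brill's contextual templates as explained in the table above, not code from the tagger itself), the following Python sketch applies CURWD- and NEXTTAG-type rules to a tagged sentence:

    # Sketch of two contextual rule types; rule strings as in the lists below.
    def apply_contextual_rule(words, tags, rule):
        f = rule.split()
        from_tag, to_tag, rtype = f[0], f[1], f[2]
        for i, t in enumerate(tags):
            if t != from_tag:
                continue
            if rtype == "CURWD" and words[i] == f[3]:
                tags[i] = to_tag
            elif rtype == "NEXTTAG" and i + 1 < len(tags) and tags[i + 1] == f[3]:
                tags[i] = to_tag
        return tags

    words = ["az", "ember"]
    tags = ["FN", "FN"]
    print(apply_contextual_rule(words, tags, "FN DET CURWD az"))
    # -> ['DET', 'FN']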


Contextual rules for PoS tags STAART . RBIGRAM . STAART FN STAART LBIGRAM STAART STAART KOT , PREVTAG , FN DET WDPREVTAG DET a , FN NEXTWD , . FN NEXT1OR2TAG STAART IGE STAART LBIGRAM STAART STAART HA STAART LBIGRAM STAART STAART DET STAART PREVTAG STAART FN MN PREVTAG MN , IGE NEXTWD , MN STAART LBIGRAM STAART STAART MN FN PREVTAG MN STAART FN NEXT2WD STAART FN IGE PREVTAG IGE NM , CURWD , DET IGE PREV1OR2OR3TAG IGE IGE FN NEXTTAG IGE KOT STAART PREVTAG STAART STAART DET NEXT2WD STAART NM STAART LBIGRAM STAART STAART HA , CURWD , STAART , RBIGRAM , STAART . IGE NEXT1OR2TAG STAART IGE HA NEXTTAG IGE FN DET CURWD az DET NM NEXTTAG DET DET KOT NEXT1OR2OR3TAG DET STAART KOT NEXT2WD STAART FN STAART LBIGRAM STAART STAART - STAART PREVTAG STAART MN DET CURWD a FN , CURWD , , NM NEXTWD , IK IGE PREV1OR2OR3TAG IGE IGE NM NEXTTAG IGE NM IGE PREVTAG IGE FN DET CURWD A DET , CURWD , IGE KOT NEXTTAG IGE DET HA NEXT1OR2OR3TAG DET HA KOT PREV1OR2OR3TAG , HA FN PREV1OR2OR3TAG STAART IGE - CURWD MN , CURWD , NM KOT PREV1OR2OR3TAG , KOT FN PREV1OR2TAG FN IGE FN NEXTTAG IGE , MN NEXTWD , STAART ? RBIGRAM ? STAART DET FN NEXTTAG DET NU FN PREV1OR2OR3TAG STAART KOT FN PREV1OR2OR3TAG DET


MN DET CURWD az . NM NEXTTAG . , HA NEXTTAG , HA KOT PREV1OR2OR3TAG STAART HA IGE PREVTAG IGE IGE HA NEXTTAG IGE MN FN NEXTTAG MN FN NM SURROUNDTAG NM FN NM FN PREV1OR2OR3TAG STAART FN INF PREVTAG INF IGE MN SURROUNDTAG MN IGE MN HA NEXTTAG MN MN DET NEXTTAG MN FN SZN SURROUNDTAG SZN FN STAART ! RBIGRAM ! STAART IGE , CURWD , STAART MN NEXT2WD STAART STAART HA NEXT2WD STAART STAART NM NEXT2WD STAART STAART IGE NEXT2WD STAART MN STAART LBIGRAM STAART STAART MN KOT NEXTTAG MN HA NM NEXTTAG HA NM HA NEXTTAG NM HA KOT CURWD is DET KOT PREV1OR2OR3TAG , FN DET CURWD egy SZN STAART LBIGRAM STAART STAART HA FN PREV1OR2OR3TAG DET - FN NEXTWD FN KOT CURWD hogy INF STAART LBIGRAM STAART STAART HA - CURWD , INF NEXTWD , FN KOT CURWD e1s INF IGE NEXTTAG INF IGE FN NEXTTAG IGE FN KOT CURWD ha . NU NEXTTAG . . MN NEXTTAG . MN NM NEXTTAG MN NM DET NEXTTAG NM FN - CURWD NU FN NEXT1OR2OR3TAG STAART FN MN SURROUNDTAG MN FN IK IGE PREV1OR2OR3TAG STAART KOT MN PREV1OR2TAG MN FN DET CURWD Az . INF NEXTWD . FN IGE NEXTWD volna MN IGE PREVTAG IGE MN FN NEXTTAG MN MN HA NEXTTAG MN . IK NEXTTAG . IGE HA CURWD nem NM - CURWD -


MN FN PREVBIGRAM MN FN DET NM NEXTTAG DET KOT IK CURWD meg , IK NEXTWD , ! FN NEXTWD ! KOT - CURWD ? IGE NEXTWD ? KOT NM NEXTTAG KOT DET HA NEXTTAG DET DET IGE PREV1OR2TAG STAART KOT HA NEXTWD is IGE STAART LBIGRAM STAART STAART , NU NEXTTAG , NM FN PREV1OR2TAG FN IGE FN PREVTAG DET FN IK CURWD fel : FN NEXTWD : FN STAART LBIGRAM STAART STAART FN NU CURWD ellen DET MN NEXTTAG DET MN KOT CURWD e1s DET - CURWD . HA NEXTTAG . FN HA CURWD mikor NM IK CURWD ki MN DET NEXTTAG MN ? FN NEXTWD ? HA FN PREVTAG FN HA NU PREV1OR2OR3TAG FN HA KOT PREV1OR2OR3TAG FN SZN DET CURWD a DET FN NEXTTAG DET NM KOT NEXTTAG NM DET KOT PREV1OR2OR3TAG STAART ; IGE NEXTWD ; NM HA NEXT1OR2OR3TAG IGE - IGE NEXTWD - ! NEXTWD , - NEXTWD , DET NM NEXTTAG IGE HA IK PREVTAG IGE KOT FN NEXTTAG KOT INF IGE NEXTTAG INF HA ; CURWD ; DET IK PREV1OR2OR3TAG IGE FN IGE SURROUNDTAG STAART NM FN NM SURROUNDTAG , FN FN KOT SURROUNDTAG , FN NM FN PREV1OR2OR3TAG DET IGE IK PREVTAG IK NM IGE PREV1OR2TAG STAART IGE INF PREVTAG INF KOT IGE PREVTAG IGE IGE MN SURROUNDTAG MN IGE IGE HA NEXTBIGRAM IGE STAART FN KOT CURWD is


MN KOT CURWD hogy STAART - RBIGRAM - STAART HA DET NEXT1OR2OR3TAG FN NM INF PREVTAG INF " FN NEXTWD " FN NM CURWD e1n MN NM NEXTTAG MN NM MN CURWD valamilyen MN SZN NEXTTAG MN MN FN PREVTAG MN INF FN NEXTTAG INF INF HA NEXTTAG INF DET : CURWD : FN IGE PREVTAG IGE IGE FN PREVBIGRAM IGE IGE IGE FN PREVBIGRAM NM IGE IGE FN PREVBIGRAM DET MN FN HA SURROUNDTAG HA FN HA NM NEXTTAG HA IGE FN SURROUNDTAG STAART IGE IGE HA SURROUNDTAG STAART IGE IGE MN SURROUNDTAG STAART IGE FN SZN SURROUNDTAG SZN FN NM DET NEXTTAG FN DET MN NEXT1OR2OR3TAG FN NM HA NEXTTAG NM HA MN NEYTTAG HA SZN FN PREV1OR2OR3TAG DET INF IGE NEXTTAG INF SZN KOT PREVTAG , INF MN NEXTTAG INF NU NM PREV1OR2OR3TAG STAART MN - CURWD KOT HA CURWD mie1rt DET NM NEXTWD a USZ DET NEXT1OR2OR3TAG FN STAART " RBIGRAM " STAART FN " WDNEXTTAG " FN ? NM NEXTWD ? , SZN NEXTWD , SZN HA NEXT1OR2TAG SZN MN FN NEXTTAG MN IGE KOT WDPREVTAG IGE vagy IK HA NEXTTAG IK HA KOT CURWD is HA IGE PREV1OR2OR3TAG STAART KOT MN NEXT1OR2TAG KOT FN DET LBIGRAM STAART Egy DET INF NEXTTAG DET - ? NEXTTAG ! IGE NEXTTAG ! IGE MN PREVWD a IGE NM NEXTBIGRAM IGE STAART IGE MN SURROUNDTAG MN IGE DET NM NEXTTAG , IGE NM NEXTBIGRAM IGE ?


FN KOT WDNEXTTAG vagy FN IK KOT NEXT1OR2TAG IK KOT IK PREV1OR2OR3TAG STAART IK NM PREVTAG STAART NM DET NEXTBIGRAM IGE FN NU FN NEXTTAG NU FN HA SURROUNDTAG , FN HA KOT CURWD sem IGE DET CURWD a DET HA NEXTTAG DET SZN NM NEXTTAG SZN HA DET NEXTTAG HA STAART ... RBIGRAM ... STAART - HA NEXTTAG HA IK PREV1OR2OR3TAG IK IK HA NEXT1OR2TAG IK ! SZN NEXTTAG ! FN MN SURROUNDTAG MN FN FN IGE NEXTWD volna IGE FN PREVWD volt IGE FN NEXTBIGRAM IGE KOT DET NM NEXTWD az IGE KOT SURROUNDTAG , IGE IGE KOT CURWD sem KOT NM NEXTTAG KOT IGE NU SURROUNDTAG FN IGE NU IGE PREV1OR2OR3TAG FN DET FN PREV1OR2TAG STAART ; FN NEXTWD ; . SZN NEXTWD . " : NEXTWD " IGE FN SURROUNDTAG MN IGE FN NM CURWD minden KOT HA PREV1OR2OR3TAG . MN HA CURWD nem IK NM PREVWD , IGE SZN SURROUNDTAG SZN IGE SZN KOT NEXT1OR2TAG SZN KOT FN NEXTTAG KOT SZN DET NEXT1OR2TAG SZN KOT NU PREV1OR2OR3TAG STAART SZN , CURWD , NM : CURWD : KOT ; PREV1OR2OR3TAG FN FN STAART LBIGRAM STAART STAART : NM NEXTWD : NM DET NEXTTAG NM " MN NEXTWD " FN MN PREVBIGRAM MN KOT IGE MN NEXTBIGRAM FN STAART FN HA PREVBIGRAM FN IGE FN NM NEXTBIGRAM FN NM FN NM CURWD olyan IGE HA NEXT2WD is KOT HA NEXTWD vagy MN NM CURWD valami


IGE KOT WDNEXTTAG is IGE MN KOT CURWD is FN USZ WDNEXTTAG B FN NM FN NEXT1OR2OR3TAG NM HA NU PREVTAG NU IK FN NEXTTAG IK IK NU NEXT1OR2OR3TAG STAART NM NU NEXTTAG NM KOT SZN NEXTTAG KOT SZN IGE NEXTTAG SZN STAART ; RBIGRAM ; STAART MN : CURWD : INF , CURWD , HA " CURWD " DET NU NEXTTAG DET DET KOT NEXTTAG DET ; HA PREV1OR2OR3TAG FN . ROV PREVTAG ROV , " NEXTWD , " . NEXTTAG " FN MN PREVBIGRAM MN , FN HA WDNEXTTAG nem FN FN HA WDNEXTTAG me1g FN FN HA CURWD mindig IGE FN SURROUNDTAG IGE INF FN HA CURWD ha1t FN NM CURWD Minden IGE HA NEXTBIGRAM IGE INF KOT HA WDNEXTTAG amikor IGE HA NM NEXTTAG HA NM HA PREVTAG , DET NM NEXTTAG KOT HA FN PREV1OR2TAG DET IGE NU SURROUNDTAG NU IGE SZN FN PREVTAG SZN NU FN NEXTTAG NU SZN FN NEXTTAG SZN NU NM NEXTTAG NU SZN - CURWD STAART : RBIGRAM : STAART NM ; CURWD ; IK , CURWD , IGE ; CURWD ; HA INF PREVTAG INF INF IGE NEXTTAG INF HA : CURWD : FN ; WDNEXTTAG ; FN DET ; NEXTTAG DET - NM NEXTTAG - MN NEXTTAG , KOT NEXTTAG , " KOT NEXTWD " FN MN NEXTWD va1lt FN IGE PREVWD nem FN IGE SURROUNDTAG HA DET FN IGE SURROUNDTAG KOT DET


IGE FN NEXT1OR2WD volt IGE FN PREV1OR2WD nincs FN HA CURWD csak FN MN SURROUNDTAG HA FN FN HA CURWD re1szben FN NM WDNEXTTAG ne1ha1ny FN FN NM CURWD valami MN IGE CURWD i1rt MN HA NEXTBIGRAM HA IGE IGE HA CURWD Nem KOT HA NEXTWD aze1rt IGE NM SURROUNDTAG , IGE DET NM NEXTTAG . MN NM CURWD olyan NM HA NEXTTAG NM IGE KOT CURWD e1s IK KOT PREVWD u1jra NM KOT NEXT2TAG FN KOT FN PREVTAG MN MN DET LBIGRAM STAART A IK IGE NEXT1OR2OR3TAG KOT IGE HA NEXTBIGRAM IGE KOT INF NM NEXTTAG INF DET HA NEXTWD egy INF KOT NEXTTAG INF MN " CURWD " KOT " NEXTTAG KOT DET " NEXTTAG DET ? INF NEXTTAG ? : IGE NEXTWD : . " NEXTTAG . " DET NEXTWD " " , NEXTTAG " ! NM NEXTTAG ! FN MN PREVWD teljesen FN MN CURWD gyenge1den FN IGE CURWD fa1jt IGE FN PREVWD ke1szu2lt IGE FN PREVWD Az IGE FN PREV1OR2WD van IGE FN PREVBIGRAM MN MN FN HA WDNEXTTAG to2bbe1 FN FN HA CURWD este MN FN WDPREVTAG DET rigo1 MN FN PREVBIGRAM MN MN FN NM WDNEXTTAG egyik FN FN NM WDNEXTTAG ilyen FN MN IGE WDPREVTAG FN adott MN IGE NEXT2TAG INF FN SZN NEXTWD e1v MN HA LBIGRAM STAART Ro2gto2n MN HA NEXTBIGRAM MN IGE MN HA NEXTBIGRAM MN DET KOT HA NEXTWD azonban IGE HA NEXTBIGRAM IGE MN HA NM WDAND2TAGAFT benne FN


IGE NM NEXTWD volt FN KOT WDNEXTTAG Ha FN HA FN SURROUNDTAG HA FN IK HA NEXT1OR2WD hogy FN IK WDNEXTTAG meg FN NM HA WDNEXTTAG Mie1rt HA HA IK CURWD le MN KOT CURWD s IK KOT PREV1OR2TAG DET FN IK CURWD Fel MN SZN CURWD sza1mtalan KOT NM NEXTBIGRAM FN FN NM IK WDNEXTTAG el NM IGE IK LBIGRAM STAART Meg NM FN NEXTTAG MN FN NM CURWD Maga1to1l IGE IK CURWD el INF IGE PREVTAG IK NM KOT NEXTTAG NM DET SZN NEXT1OR2WD ketto3 IGE DET LBIGRAM STAART Az NM IGE PREVTAG HA MN DET LBIGRAM STAART Az NU IGE NEXT1OR2OR3TAG STAART ROV STAART PREVTAG STAART NM " NEXTTAG NM IGE : WDNEXTTAG : IGE IGE " CURWD " FN : WDNEXTTAG : FN FN . WDNEXTTAG . FN : KOT NEXTTAG : - : NEXTWD - " NEXTTAG " IGE NEXTTAG " ! HA NEXTTAG ! HA KOT PREVTAG FN KOT FN PREV2WD hogy FN MN NEXTWD la1tszo1 FN MN NEXTWD lesz FN MN NEXTWD la1tszott FN MN NEXTWD voltak MN FN PREVBIGRAM HA MN FN MN PREVWD nagyon FN MN WDNEXTTAG ke1szu2lo3ben IGE FN MN WDPREVTAG FN folyo1 FN MN WDPREVTAG MN hatalmasabb FN MN WDPREVTAG MN jo2vo3 MN FN SURROUNDTAG DET MN FN MN NEXT1OR2WD csoport FN MN CURWD keme1nyen FN IGE PREVTAG NU IGE MN NEXTBIGRAM FN IGE FN IGE NEXTWD ra1 MN KOT NEXTBIGRAM MN IGE FN IGE NEXT1OR2WD mit FN IGE PREVBIGRAM STAART -


IGE HA NEXTBIGRAM IGE HA IGE FN PREVBIGRAM , IGE IGE MN SURROUNDTAG , IGE IGE KOT NEXTBIGRAM IGE HA FN IGE CURWD hozza1e1rt FN IGE CURWD tudta FN IGE CURWD tesz FN IGE CURWD gondolok IGE FN PREVWD to2rte1nt IGE FN PREVWD olyan IGE FN PREVBIGRAM NU IGE IGE FN CURWD O'Brient FN HA LBIGRAM STAART Esetleg FN HA LBIGRAM STAART Me1g FN HA WDNEXTTAG ma IGE FN HA WDNEXTTAG sebtiben IGE FN HA WDNEXTTAG csattanva IGE FN HA PREVBIGRAM IK IGE FN HA CURWD ma1sfelo3l FN HA CURWD valahonnan MN FN PREVWD u1j MN FN RBIGRAM Diszno1 ! MN FN WDNEXTTAG o2nke1ntes FN MN FN SURROUNDTAG MN IGE MN FN WDAND2BFR STAART u1jbesze1l IGE MN PREVTAG DET MN FN PREV1OR2WD mint IGE MN PREVWD voltak FN NM LBIGRAM STAART Ebbo3l FN NM LBIGRAM STAART Ne1ha1ny FN NM WDNEXTTAG ma1sik FN FN NM WDNEXTTAG e FN FN NM WDNEXTTAG milyen FN FN NM WDPREVTAG HA kinek FN NM CURWD ma1s FN NM CURWD mindenfe1le FN NM CURWD o2nmagad FN NM CURWD magukhoz MN IGE NEXTWD elo3 MN IGE NEXTWD egyetlen MN IGE LBIGRAM STAART kellett MN IGE WDPREVTAG FN vetett MN HA SURROUNDTAG DET MN MN HA CURWD szi1vesen IGE HA PREVWD vannak KOT HA NEXTWD ha KOT HA WDNEXTTAG amint HA KOT HA WDNEXTTAG amikor HA KOT HA PREVBIGRAM , KOT IGE NM WDNEXTTAG azt IGE HA NM CURWD minden IGE NM WDNEXTTAG Mit IGE KOT HA NEXTBIGRAM KOT IGE HA KOT CURWD mint IGE NM NEXTBIGRAM IGE MN DET NM NEXTWD bizonyos


FN KOT WDNEXTTAG sem FN FN KOT CURWD ha FN KOT CURWD s HA MN PREVWD egy HA MN NEXT1OR2WD benne FN NU PREVTAG NU HA KOT WDAND2TAGAFT amikor HA HA FN SURROUNDTAG MN HA FN USZ NEXTWD nevezte1k FN NU CURWD alattuk KOT IGE PREVWD nem IGE KOT CURWD mert NM DET NEXTBIGRAM HA MN HA IK WDNEXTTAG ra1 IGE STAART NU RBIGRAM elo3tt STAART INF FN PREVTAG INF IK HA CURWD egyu2tt IK KOT PREVTAG SZN FN USZ LBIGRAM a duplagondol HA IK WDPREVTAG KOT egyu2tt FN IK WDPREVTAG IGE fel NM MN NEXTTAG NM MN KOT CURWD sem FN IK CURWD ve1gig SZN HA PREVTAG , HA SZN CURWD sokkal KOT FN PREVTAG MN NM MN WDNEXTTAG ma1s IGE HA IGE NEXTWD volna STAART IK RBIGRAM el STAART STAART IGE PREVBIGRAM STAART STAART IGE STAART LBIGRAM STAART STAART IGE FN SURROUNDTAG STAART KOT STAART SZN NEXTBIGRAM STAART STAART STAART NU NEXTBIGRAM STAART STAART STAART INF NEXTBIGRAM STAART STAART INF IGE NEXTTAG INF FN IK LBIGRAM STAART El SZN HA NEXTTAG SZN INF MN NEXTTAG INF SZN NM NEXTTAG SZN NM DET NEXTTAG SZN NU MN NEXTTAG NU KOT IK PREVTAG IGE IK SZN NEXTWD meg IGE DET WDNEXTTAG egy IGE DET HA NEXTTAG NM STAART IK NEXTBIGRAM STAART STAART MSZ - PREV1OR2TAG STAART MN INF PREVTAG INF KOT INF NEXTTAG KOT ? MN NEXTTAG ? IGE KOT NEXTBIGRAM IGE MN ? HA NEXTTAG ? HA IK WDPREVTAG IGE ide ; MN NEXTWD ;


: INF NEXTTAG : : HA NEXTTAG : - KOT NEXTTAG - INF PREVTAG INF INF NM NEXTTAG INF NM FN NEXT1OR2OR3TAG INF , ISZ PREVTAG ISZ FN MN NEXTWD lesznek


Contextual rules for PoS and subtags DET NM NEXTTAG DET DET NM NEXTTAG , DET NM NEXTWD volt DET NM NEXTTAG KOT MN FN SURROUNDTAG DET , IGE_Me3 MN NEXTTAG FN_PL MN_DAT FN_DAT PREV1OR2OR3TAG DET MN FN WDPREVTAG DET u1jbesze1l HA KOT LBIGRAM , amikor IGE_t3 FN_DAT PREV1OR2TAG DET IK KOT PREVWD u1jra MN IGE_Te3 NEXTTAG DET HA IK PREVBIGRAM HA IGE_Me3 FN_ELA FN_PSe3_ELA PREV1OR2OR3TAG FN NM_ALL IK PREVTAG IGE_TMe3 KOT HA CURWD me1giscsak DET NM NEXTTAG NU FN_DAT MN_DAT NEXTWD la1tszo1 FN_INE FN_PSe3_INE PREV1OR2TAG FN FN_ABL FN_PSe3_ABL PREV1OR2TAG FN IGE_t1 FN_PSt1 PREV1OR2TAG NM FN_FOR HA PREV1OR2TAG DET IGE_Me3 FN_CAU PREVTAG DET FN_ILL FN_PSe1_ILL CURWD eszembe IGE_Me3 MN PREVTAG DET IGE_Me3 MN NEXTTAG FN_PL_INS IGE_Me3 MN NEXT1OR2WD volt IGE_Me3 MN NEXTBIGRAM MN FN MN FN SURROUNDTAG MN IGE_Me3 MN FN SURROUNDTAG DET . DET NM NEXTTAG . NU HA CURWD belo3le IGE_Te3 FN_PSe3 NEXT1OR2OR3TAG IGE_Me3 NNP FN NEXT1OR2OR3TAG . IGE FN PREVTAG DET IGE_INRt3 FN_PSt3 PREVTAG MN IGE_Pe3 FN_SUP PREVTAG MN IGE_t1 IGE_INRt1 NEXT1OR2TAG IGE MN FN NEXTTAG NU FN_DAT IGE_t3 PREV1OR2OR3TAG FN_PL HA FN PREVWD a FN_PSe3_INS FN_INS PREVTAG DET FN_ELA FN_PL_ELA SURROUNDTAG , , FN HA CURWD de1luta1n NM_DAT HA NEXT1OR2TAG KOT IGE_Tt1 FN_PSt3 PREVTAG DET IGE_TMe2 IGE PREV1OR2OR3TAG FN IGE_TMt3 FN CURWD bizonyi1te1k NNP FN NEXT1OR2OR3TAG FN FN_DEL HA CURWD ha1tulro1l FN_ADE FN_PSe3_ADE PREVTAG FN IGE_TPt3 FN PREV1OR2TAG DET FN MN CURWD furcsa KOT HA NEXTWD csak


KOT HA WDNEXTTAG amikor NM KOT HA CURWD hacsak KOT HA CURWD ugyancsak FN_DAT IGE_t3 PREVTAG FN_ACC FN_ACC FN NEXTTAG NU MN_ESS_MOD HA CURWD e1ppolyan FN_PSe3 FN PREV1OR2WD egy FN_DAT MN_DAT PREV1OR2TAG IGE_TMe3 IGE_TMt3 FN_PL PREVTAG MN IK KOT SURROUNDTAG HA DET FN_DAT MN_DAT NEXTBIGRAM IGE_Me3 , IGE_TMt3 FN_PL PREVTAG IGE_Me3 MN_PL FN_PL PREVTAG MN FN_ABL FN_PSe3_ABL NEXT1OR2TAG KOT IGE_t3 FN_DAT PREVTAG MN FN_ILL FN_PSe3_ILL CURWD hatalma1ba IGE_t1 FN_PSt1 NEXT1OR2OR3TAG NU IGE_Me3 FN_ACC NEXTTAG IGE_Me3 FN_SUP FN CURWD szexbu3n FN_PSe3 MN CURWD tompa FN_ABL FN_PL_ABL PREV1OR2OR3TAG , IGE_TMe3 NU CURWD fo2lo2tte FN_PSt3_ACC FN_PL_ACC PREV1OR2OR3TAG DET IGE_TMe3 FN PREVTAG DET IGE_TMe3_KSZ IGE_Te3_KSZ PREVTAG HA MN_FOK SZN LBIGRAM STAART To2bb IGE_MN IGE_Me3 CURWD igyekezett FN_INS IK CURWD Fel IGE_Me3 MN NEXTTAG FN_INE IGE_Me3 MN NEXTTAG FN_PL_ACC IGE_Me3 MN NEXT1OR2TAG IGE IGE_Me3 MN NEXT1OR2OR3TAG FN_PSe3_INS IGE_Me3 MN NEXTBIGRAM MN MN MN FN NEXTTAG MN FN LBIGRAM a proletaria1tus FN MN NEXTTAG FN_SUP MN FN NEXTBIGRAM KOT DET FN MN NEXTBIGRAM , MN MN FN CURWD o2nke1ntes FN_PSe3_ACC FN_ACC PREV1OR2TAG NM_ACC FN_PSe3_ACC FN_ACC NEXT1OR2OR3TAG IGE FN MN NEXTTAG FN_ACC FN_PSe3_ACC FN_ACC PREVBIGRAM DET MN KOT HA LBIGRAM , ami1g FN_PSe3_ACC FN_ACC PREV2TAG IK DET NM NEXTTAG ? FN_PL MN_PL CURWD gazdagok FN_ACC FN WDPREVTAG DET dro1t HA KOT WDNEXTTAG ami1g IK HA FN WDPREVTAG MN konyha FN_DAT IGE_t3 NEXTTAG NM_ACC IGE_e3 IGE_Tt3 NEXT1OR2TAG NM_ACC HA KOT CURWD illeto3leg HA IK WDNEXTTAG ra1 IGE IGE_e3 IGE_Tt3 PREV1OR2TAG FN_PL HA NM_INE RBIGRAM benne ,


FN_PSe3_ILL FN_ILL NEXTTAG IGE_TMt3 FN FN_PSe3 CURWD alkalma KOT IGE NEXTTAG ! IGE_Mt3 MN_PL NEXTTAG IGE_Mt3 HA MN PREV1OR2WD sem FN_PSe3_INS FN_INS NEXT1OR2TAG FN FN_PSe3_ILL FN_ILL PREVWD az FN_ELA FN_PSe3_ELA PREV1OR2TAG FN_PL HA NM_SUB PREVWD le HA NM_CAU NEXTTAG IGE_t1 MN HA LBIGRAM STAART Ele1g KOT IGE NEXTTAG . IK KOT NEXTTAG SZN IGE_Te3 MN NEXT1OR2TAG IGE_Mt3 FN_PSe3_INS FN_INS WDPREVTAG MN iro1nia1val FN_PL FN CURWD feste1k FN_ELA FN_PSe3_ELA PREVWD saja1t MN_FOK HA CURWD messzebb HA NM_INE WDAND2TAGAFT benne FN NM MN NEXTWD lesz MN FN_PSe3i CURWD vezeto3i IGE_Te3 FN_PSe3 PREVTAG SZN IGE_TMe2 IGE_Te2 NEXT1OR2TAG INF FN_PSe3_DAT FN_DAT CURWD O1cea1nia1nak FN_INE FN_SUP CURWD emeleten FN_DEL NM_DEL PREVTAG , FN_DEL FN_PSe3_DEL PREV1OR2TAG FN_PL FN SZN CURWD ke1tszer MN_FOK HA CURWD arra1bb FN_SUB FN CURWD Shakespeare FN_ALL FN_PSe3_ALL PREVTAG FN MN_FAC HA PREVTAG , MN_ESS_MOD FN_SUP PREVWD a MN IGE WDPREVTAG HA szabad FN_INE MN_INE WDNEXTTAG ta1volban FN IGE_TMe2 IGE_Te2 RBIGRAM teheted meg HA SZN NEXT1OR2TAG FN_SUB SZN_SUB NEXTWD ugrott FN_PL IGE_Pt2 NEXT1OR2TAG ! MN_FOK HA CURWD elo3bb-uto1bb MN_FAC IGE_TFe3 NEXTBIGRAM , KOT IGE_TPe3 MN PREV1OR2TAG MN IGE_TPe3 IGE_Te3 NEXT1OR2OR3TAG NM IGE_INRt3 FN_PSt3 PREVTAG DET HA MN_ESS_MOD CURWD meztelenu2l FN_KIEM_ACC FN_PSt3i_ACC NEXTTAG , FN KOT LBIGRAM , a1mba1r SZN FN CURWD illat NM_INE HA NEXT1OR2OR3TAG HA HA NM_INE PREVBIGRAM IGE_Me3 IK FN_PSe3_ALL FN_ALL PREV1OR2OR3TAG , FN_DEL FN_PSe3i_DEL PREVTAG FN IGE_Te1 FN_PSe1 PREVTAG DET IGE_TFe3 FN_FAC PREV1OR2TAG MN HA NM CURWD Mindez FN_PSe3i_SUP FN PREV1OR2OR3TAG ,


FN_PSe1 FN PREV1OR2OR3TAG KOT FN IGE_TPe2 NEXTTAG IK NNP FN NEXT1OR2OR3TAG DET IK NM_ILL NEXT1OR2TAG FN_ILL IGE_TMe3 NU LBIGRAM STAART Mo2go2tte IGE_KSZ IGE_Mt3_KSZ PREV1OR2TAG HA IGE_INRt3 FN_PSt3 PREVTAG , FN_PSt3_SUP IGE_Pe3 PREV1OR2OR3TAG KOT FN_PSe3i_INS FN_INS NEXTWD sem FN_PSe1 KOT PREVTAG STAART FN_ADE IGE_Me2 NEXT1OR2OR3TAG DET FN FN_PSe3_SUP LBIGRAM a ha1ta1n FN DET WDNEXTTAG AZ FN IGE_Me3 MN NEXTTAG FN_PSe3i


Appendix D: Text normalisation

Normalisation of the training corpus (Orwell's 1984)

NORMSUT.BAT
@ECHO OFF
pcbeta -#to#.rul orwdis1.txt -#to#ut.txt
pcbeta #bort.rul -#to#ut.txt #bortut.txt
pcbeta stj@{.rul #bortut.txt stj@{ut.txt
pcbeta tomrad.rul stj@{ut.txt trut.txt
pcbeta delim.rul trut.txt delimut.txt
pcbeta fn.rul delimut.txt fnut.txt
pcbeta usz.rul fnut.txt uszut.txt
pcbeta mn.rul uszut.txt mnut.txt
pcbeta ige.rul mnut.txt igeut.txt
pcbeta chap.rul igeut.txt chaput.txt
pcbeta oms0.rul chaput.txt oms0ut.txt
pcbeta oms1.rul oms0ut.txt oms1ut.txt
pcbeta clip1.rul oms1ut.txt clip1ut.txt
pcbeta oms2.rul clip1ut.txt oms2ut.txt
pcbeta clip2.rul oms2ut.txt clip2ut.txt
pcbeta oms3.rul clip2ut.txt oms3ut.txt
pcbeta clip3.rul oms3ut.txt clip3ut.txt
pcbeta oms4.rul clip3ut.txt oms4ut.txt
pcbeta clip4.rul oms4ut.txt clip4ut.txt
pcbeta oms5.rul clip4ut.txt oms5ut.txt
pcbeta oms6.rul oms5ut.txt oms6ut.txt
pcbeta oms7.rul oms6ut.txt oms7ut.txt
pcbeta oms8.rul oms7ut.txt oms8ut.txt
pcbeta oms9.rul oms8ut.txt oms9ut.txt
pcbeta oms10.rul oms9ut.txt oms10ut.txt
pcbeta oms11.rul oms10ut.txt oms11ut.txt
pcbeta oms4.rul oms11ut.txt om114ut.txt
pcbeta oms12.rul om114ut.txt oms12ut.txt
pcbeta oms13.rul oms12ut.txt oms13ut.txt
pcbeta oms14.rul oms13ut.txt oms14ut.txt
pcbeta clip14.rul oms14ut.txt cli14ut.txt
pcbeta clip15.rul cli14ut.txt cli15ut.txt
pcbeta oms15.rul cli15ut.txt oms15ut.txt
pcbeta clip4.rul oms15ut.txt cli41ut.txt
pcbeta oms15.rul cli41ut.txt oms151u.txt
pcbeta clip4.rul oms151u.txt cli42ut.txt
pcbeta clipl.rul cli42ut.txt cliplut.txt
pcbeta oms16.rul cliplut.txt oms16ut.txt
pcbeta oms18.rul oms16ut.txt oms18ut.txt
pcbeta oms18.rul oms18ut.txt oms181.txt
pcbeta oms18.rul oms181.txt oms182.txt
pcbeta oms18.rul oms182.txt oms183.txt
pcbeta clip18.rul oms183.txt cli18ut.txt
pcbeta menrad.rul cli18ut.txt menraut.txt
pcbeta oms17.rul menraut.txt subtag.txt
del -#to#ut.txt
del #bortut.txt


del stj@{ut.txt
del trut.txt
del delimut.txt
del fnut.txt
del uszut.txt
del mnut.txt
del igeut.txt
del chaput.txt
del oms0ut.txt
del oms1ut.txt
del clip1ut.txt
del oms2ut.txt
del clip2ut.txt
del oms3ut.txt
del clip3ut.txt
del oms4ut.txt
del clip4ut.txt
del oms5ut.txt
del oms6ut.txt
del oms7ut.txt
del oms8ut.txt
del oms9ut.txt
del oms10ut.txt
del oms11ut.txt
del om114ut.txt
del oms12ut.txt
del oms13ut.txt
del oms14ut.txt
del cli14ut.txt
del cli15ut.txt
del oms15ut.txt
del cli41ut.txt
del oms151u.txt
del cli42ut.txt
del cliplut.txt
del oms18ut.txt
del oms181.txt
del oms182.txt
del oms183.txt
del cli18ut.txt
del menraut.txt

-#to#.rul
_-#Numeral.Numeral => #Numeral.Numeral in notes
CHARSET
Dig: 48-57
RULES      LC   RC    SC   RS
-#; #;     32   Dig   1    2
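The pcbeta rule files in this appendix share the column layout restored above. Judging from how the rules behave in the pipeline, LC and RC appear to give the required left and right context (32 is the space character, Dig the digit set), and SC and RS the state before and after the rewrite; this reading is an assumption for illustration, not documentation of pcbeta. Under that assumption, the single rule of -#to#.rul could be emulated roughly as follows:

    # Rough Python emulation of "-#; #;" with LC = space, RC = digit:
    # "-#" becomes "#" between a space and a digit, as in the comment
    # "-#Numeral.Numeral => #Numeral.Numeral". An interpretation, not pcbeta.
    import re

    def to_hash(text):
        return re.sub(r"(?<= )-#(?=[0-9])", "#", text)

    print(to_hash("note -#12"))  # -> "note #12"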

#bort.rul
_cut #Numeral.Numeral in notes
STATESET
Cut: 10
CHARSET
Dig: 48-57
RULES      LC    RC   SC   RS
#;;        32    -#   1    2
#;;        Dig   #    2    10

stj{.rul
_*Ltr --> Ltr
_@Ltr --> Ltr
_delete { between # and letter
_delete } between space and letter
_delete | between space and letter
CHARSET
Ltr: 65-90 97-122 å ä ö Å Ä Ö
RULES      LC   RC    SC   RS
#*; #;     #    Ltr   1    2
#@; #;     #    Ltr   1    2
#{ ; #;    #    Ltr   1    2
}; ;       32   #     1    2
|; ;       32   #     1    2

tomrad.rul
_delete every empty line
RULES      LC   RC   SC   RS   MV
#;;        #    #    1    2    0

delim.rul
_tag ./.@ ?/?@, and !/!@ for marking sentence per entry with @
_tag ,/, -/- ;/; (/( )/) and :/:, etc.
CHARSET
Del: /
Ltrdel: 48-57 65-90 97-122 /
RULES            LC   RC        SC   RS
,; ,/,;          #    #         1    2
*; */*;          #    #         1    2
:; :/:;          #    #         1    2
(; (/(;          #    #         1    2
-; -/-;          #    #         1    2
+; +/+;          #    #         1    2
=; =/=;          #    #         1    2
...; .../...@;   #    #         1    2
.; ./.@;         #    #         1    2
?; ?/?@;         #    -Del      1    2
!; !/!@;         #    -Del      1    2
); )/)@;         #    -Del      1    2
"; "/"@;         #    -Ltrdel   1    2
MV -3 5
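delim.rul attaches a tag to every punctuation mark and appends @ to the potentially sentence-final ones so that later steps can put one sentence per line. The same mapping, read off the RULES column above, as a small Python sketch (an illustration, not the pcbeta rule file):

    # Sketch of delim.rul's token-level effect; a trailing @ marks a
    # potential sentence boundary. Mapping copied from the table above.
    PUNCT = {",": ",/,", "*": "*/*", ":": ":/:", "(": "(/(", "-": "-/-",
             "+": "+/+", "=": "=/=", "...": ".../...@", ".": "./.@",
             "?": "?/?@", "!": "!/!@", ")": ")/)@", '"': '"/"@'}

    def tag_punct(token):
        return PUNCT.get(token, token)

    print([tag_punct(t) for t in ["fal", ",", "nem", "."]])
    # -> ['fal', ',/,', 'nem', './.@']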

fn.rul _tag the following words as noun RULES LC RC SC RS %Aaronson; aaronson[FN]=Aaronson; # # 1 2 %Aaronsont; Aaronson[FN]+t[ACC]; %Aaronsonnak; Aaronson[FN]+nak[DAT]; %Amershambe; Amersham[FN]+be[ILL]; %Ampleforth; ampleforth[FN]=Ampleforth; %Ampleforthra; Ampleforth[FN]+ra[SUB]; %Amplefortht; Ampleforth[FN]+t[ACC]; %Ampleforthot; Ampleforth[FN]+ot[ACC]; %Angszoc; angszoc[FN]=Angszoc; %ANGSZOC; angszoc[FN]=ANGSZOC; %Astaro1th; astaro1th[FN]=Astaro1th; %Astaro1thnak; Astaro1th[FN]+nak[DAT]; %Baa1lnak; Baa1l[FN]+nak[DAT]; %Bailey; bailey[FN]=Bailey; %Berkhamsted; berkhamsted[FN]=Berkhamsted; %Berkhamstedbe; Berkhamsted[FN]+be[ILL]; %Brazzaville; brazzaville[FN]=Brazzaville; %Bumstead; bumstead[FN]=Bumstead; %Charrington; charrington[FN]=Charrington; %Charringtont; Charrington[FN]+t[ACC]; %Charringtonnal; Charrington[FN]+nal[INS]; %Charringtonto1l; Charrington[FN]+to1l[ABL]; %Chauser; chauser[FN]=Chauser; %Clement; clement[FN]=Clement; %Colchester; colchester[FN]=Colchester; %Colchesterre; Colchester[FN]+re[SUB]; %Cromwell; cromwell[FN]=Cromwell; %Danes; danes[FN]=Danes; %Danest; Danes[FN]+t[ACC]; %Emmanuel; emmanuel[FN]=Emmanuel; %EMMANUEL; emmanuel[FN]=EMMANUEL; %Gestapo; gestapo[FN]=Gestapo; %Goldstein; goldstein[FN]=Goldstein; %GOLDSTEIN; goldstein[FN]=GOLDSTEIN; %Goldsteint; Goldstein[FN]+t[ACC]; %Goldsteinnek; Goldstein[FN]+nek[DAT]; %Goldsteinnel; Goldstein[FN]+nel[INS]; %Goldsteinto3l; Goldstein[FN]+to3l[ABL]; %Goldsteinro3l; Goldstein[FN]+ro3l[DEL]; %Goldsteinre; Goldstein[FN]+re[SUB]; %Hyde; hyde[FN]=Hyde; %Inprekor; inprekor[FN]=Inprekor; %Internaciona1le1; internaciona1le1[FN]=Internaciona1le1; %Jefferson; jefferson[FN]=Jefferson; %Julia; julia[FN]=Julia; %Julia1t; Julia[FN]=Julia1+t[ACC]; %Julia1e1; Julia[FN]=Julia1+e1[POS]; %Julia1ra; Julia[FN]=Julia1+ra[SUB]; %Julia1ro1l; Julia[FN]=Julia1+ro1l[DEL]; %Julia1val; Julia[FN]=Julia1+val[INS]; %Julia1nak; Julia[FN]=Julia1+nak[DAT];


%Julius; julius[FN]=Julius; %Keleta1zsia; keleta1zsia[FN]=Keleta1zsia; %Keleta1zsia1t; keleta1zsia[FN]=Keleta1zsia1+t[ACC]; %Keleta1zsia1val; keleta1zsia[FN]=Keleta1zsia1+val[INS]; %Keleta1zsia1ban; keleta1zsia[FN]=Keleta1zsia1+ban[INE]; %Keleta1zsia1ra; keleta1zsia[FN]=Keleta1zsia1+ra[SUB]; %Komintern; komintern[FN]=Komintern; %Leopoldville; leopoldville[FN]=Leopoldville; %Mayor; mayor[FN]=Mayor; %O'Brien; o'brien[FN]=O'Brien; %O'Brient; O'Brien[FN]+t[ACC]; %O'Brienen; O´Brien[FN]+en[SUP]; %O'Brienre; O´Brien[FN]+re[SUB]; %O'Briennel; O´Brien[FN]+nel[INS]; %O'Briennek; O'Brien[FN]+nek[DAT]; %O'Brienhez; O'Brien[FN]+hez[ALL]; %O'Briento3l; O'Brien[FN]+to3l[ABL]; %O1cea1nia; o1cea1nia[FN]=O1cea1nia; %Ogilvy; ogilvy[FN]=Ogilvy; %Oliver; oliver[FN]=Oliver; %Ozirisz; ozirisz[FN]=Ozirisz; %Ozirisznak; Ozirisz[FN]+nak[DAT]; %Parsons; parsons[FN]=Parsons; %Parsonst; Parsons[FN]+t[ACC]; %Parsonsnak; Parsons[FN]+nak[DAT]; %Rutherford; rutherford[FN]=Rutherford; %Rutherfordot; Rutherford[FN]+ot[ACC]; %Rutherfordnak; Rutherford[FN]+nak[DAT]; %Shaftesbury; shaftesbury[FN]=Shaftesbury; %Smith; smith[FN]=Smith; %Smithnek; Smith[FN]+nek[DAT]; %Stepney; stepney[FN]=Stepney; %Stepneyben; Stepney[FN]+ben[INE]; %szem[FN]=SZEM+mel[INS]=MEL; szem[FN]=SZEM+MEL[INS]; %Swift; swift[FN]=Swift; %Tillotson; tillotson[FN]=Tillotson; %Tillotsont; Tillotson[FN]+t[ACC]; %Tom; tom[FN]=Tom; %Weeks; weeks[FN]=Weeks; %Wilsher; wilsher[FN]=Wilsher; %Winston; winston[FN]=Winston; %Winstont; Winston[FN]+t[ACC]; %Winstonban; Winston[FN]+ban[INE]; %Winstonra; Winston[FN]+ra[SUB]; %Winstonon; Winston[FN]+on[SUP]; %Winstonnal; Winston[FN]+nal[INS]; %Winstonto1l; Winston[FN]+to1l[ABL]; %Winstonnak; Winston[FN]+nak[DAT]; %Winstonro1l; Winston[FN]+ro1l[DEL]; %Winstonba; Winston[FN]+ba[ILL]; %Winstonbo1l; Winston[FN]+bo1l[ELA]; %Winstonhoz; Winston[FN]+hoz[ALL]; %Winstone1; Winston[FN]+e1[POS]; %Winstone1val; Winston[FN]+e1[PSe3]+val[INS]; %Winstone1kat; Winston[FN]+e1k[PL]+at[ACC]; %Withers; withers[FN]=Withers;


%Witherst; Withers[FN]+t[ACC]; %agitprop; agitprop[FN]; %besze1li1r; besze1li1r[FN]; %besze1li1rt; besze1li1r[FN]+t[ACC]; %besze1li1rja; besze1li1r[FN]+ja[PSe3]; %besze1li1rba; besze1li1r[FN]+ba[ILL]; %besze1li1rto1l; besze1li1r[FN]+to1l[ABL]; %besze1li1rral; besze1li1r[FN]+ral[INS]; %besze1li1rokba; besze1li1r[FN]+ok[PL]+ba[ILL]; %ditechnika1nak; ditechnika[FN]=ditechnika1+nak[DAT]; %egyenebe1det; egyenebe1d[FN]+et[ACC]; %gondolat; gondolat[FN]; %gondolrend; gondolrend[FN]; %Gyu31o2let; gyu3lo2let[FN]=Gyu3lo2let; %gyu31o2let; gyu3lo2let[FN]; %jo1gondol; jo1gondol[FN]; %jo1gondolo1; jo1gondolo1[FN]; %jo1szex; jo1szex[FN]; %nemszeme1ly; nemszeme1ly[FN]; %nemszeme1lyek; nemszeme1ly[FN]+ek[PL]; %o1besze1l; o1besze1l[FN]; %O1besze1l; o1besze1l[FN]=O1besze1l; %o1besze1lt; o1besze1l[FN]+t[ACC]; %o1besze1lnek; o1besze1l[FN]+nek[DAT]; %o1besze1lre; o1besze1l[FN]+re[SUB]; %O1besze1lre; O1besze1l[FN]+re[SUB]; %o1gondol; o1gondol[FN]; %teleke1p; teleke1p[FN]; %teleke1pek; teleke1p[FN]+ek[PL]; %teleke1pet; teleke1p[FN]+et[ACC]; %teleke1pnek; teleke1p[FN]+nek[DAT]; %teleke1pbo3l; teleke1p[FN]+bo3l[ELA]; %teleke1pto3l; teleke1p[FN]+to3l[ABL]; %teleke1ppel; teleke1p[FN]+pel[INS]; %teleke1pben; teleke1p[FN]+ben[INE]; %teleke1phez; teleke1p[FN]+hez[ALL]; %teleke1pen; teleke1p[FN]+en[SUP]; %teleke1pre; teleke1p[FN]+re[SUB]; %teleke1pekkel; teleke1p[FN]+ek[PL]+kel[INS]; %teleke1pekhez; teleke1p[FN]+ek[PL]+hez[ALL]; %teleke1peken; teleke1p[FN]+ek[PL]+en[SUP]; %teleke1pekbo3l; teleke1p[FN]+ek[PL]+bo3l[ELA]; %u1jbesze1lnek; u1jbesze1l[FN]+nek[DAT]; %u1jbesze1lre; u1jbesze1l[FN]+re[SUB]; %u1jbesze1lben; u1jbesze1l[FN]+ben[INE];

usz.rul _tag those words which don’t exist in Hungarian _and/or diverge in morphology and/or its suffix _is separated from the stem by a RULES %arcbu3n; arcbu3n[USZ]; %arcbu3n-nek; arcbu3n[USZ]+-nek[DAT]; %Avenue-n; Avenue[USZ]+-n[SUP]; %B; b[USZ]=B;


LC #

RC #

SC 1

RS 2

%bu3ngondol; bu3ngondol[USZ]; %bu3ngondol-ban; bu3ngondol[USZ]+-ban[INE]; %bu3nstop; bu3nstop[USZ]; %bu3nstop-ban; bu3nstop[USZ]+-ban[INE]; %bu3nstop-nak; bu3nstop[USZ]+-nak[DAT]; %Bu3nstop; bu3nstop[USZ]=Bu3nstop; %duplagondol; duplagondol[USZ]; %duplagondol-ro1l; duplagondol[USZ]+-ro1l[DEL]; %duplagondol-on; duplagondol[USZ]+-on[SUP]; %duplagondol-t; duplagondol[USZ]+-t[ACC]; %duplagondol"-nak; duplagondol"[USZ]+-nak[DAT]; "duplagondol"-nak; "duplagondol"[USZ]+-nak[DAT]; %duplagondol-ban; duplagondol"[USZ]+-ban[INE]; "duplagondol-ban; "duplagondol"[USZ]+-ban[INE]; %elo3rejel; elo3rejel[USZ]; %eml; eml[USZ]; %felu2lbem; felu2lbem[USZ]; %felu2lben; felu2lben[USZ]; %gyors-at; gyors[USZ]+-at[ACC]; %gyorsan-t; gyors[USZ]+an[ESSMOD]+-t[ACC]; %hideg-et; hideg[USZ]+-et[ACC]; %irosz; irosz[USZ]; %irosz-nak; irosz[USZ]+-nak[DAT]; %jus; jus[USZ]; %k-val; k[USZ]+-val[INS]; %kacsabesze1l; kacsabesze1l[USZ]; %kacsabesze1lo3-ro3l; kacsabesze1lo3[USZ]+-ro3l[DEL]; %Katharine; katharine[USZ]=Katharine; %Katharine-t; Katharine[USZ]+-t[ACC]; %Katharine-hez; Katharine[USZ]+-hez[ALL]; %lo1k; lo1k[USZ]; %limpszeszt; limpszeszt[USZ]; %meleg-et; meleg[USZ]+-et[ACC]; %mesmeg; mesmeg[USZ]; %mesmeg-nek; mesmeg[USZ]+-nek[DAT]; %Martin's-in-the-Fieldsnek; Martin's-in-the-Fields[USZ]+-nek[DAT]; %noctis; noctis[USZ]; %noctis-nak; noctis[USZ]+-nak[DAT]; %nt; nt[USZ]; %pa; pa[USZ]; %polik"-ke1nt; polik"[USZ]+-ke1nt[FOR]; "polik"-ke1nt; "polik"[USZ]+-ke1nt[FOR]; %porno1szek; porno1szek[USZ]; %porno1szek-nek; porno1szek[USZ]+-nek[DAT]; %primae; primae[USZ]; %rossz-ra; rossz[USZ]+-ra[SUB]; %Syme; syme[USZ]=Syme; %Syme-mal; Syme[USZ]+-mal[INS]; %Syme-ot; Syme[USZ]+-ot[ACC]; %Syme-ra; Syme[USZ]+-ra[SUB]; %saja1te1let; saja1te1let[USZ]; %saja1te1let-nek; saja1te1let[USZ]+-nek[DAT]; %staccato; staccato[USZ]; %telosz-nak; telosz[USZ]+-nak[DAT]; %te1vid; te1vid[USZ]; %Times; times[USZ]=Times;


%times; times[USZ]; %Times-ba; Times[USZ]+-ba[ILL]; %Times-ban; Times[USZ]+-ban[INE]; %Times-bo1l; Times[USZ]+-bo1l[ELA]; %to1k; to1k[USZ]; %U1ESZ; U1ESZ[USZ]; %U1ESZ-t; U1ESZ[USZ]+-t[ACC]; %uta1nikt; uta1nikt[USZ];

mn.rul _tag for fantasy words with normal inflection RULES LC %duplaplusz; duplaplusz[MN]; # %duplapluszhideg; duplapluszhideg[MN]; %duplapluszjo1; duplapluszjo1[MN]; %duplaplusznemjo1; duplaplusznemjo1[MN]; %egyenlo3; egyenlo3[MN]; %jo1; jo1[MN]; %jo1gondolos; jo1gondolos[MN]; %jo1gondolosan; jo1gondolos[MN]+an[ESSMOD]; %minibo3; minibo3[MN]; %nemhideg; nemhideg[MN]; %nemjo1; nemjo1[MN]; %nemso2te1t; nemso2te1t[MN]; %nemvila1gos; nemvila1gos[MN]; %o1besze1lu2l; o1besze1l[MN]+u2l[ESSMOD]; %O1besze1lu2l; O1besze1l[MN]+u2l[ESSMOD]; %O1gondolo1k; O1gondolo1[MN]+k[PL]; %plusz; plusz[MN]; %pluszhideg; pluszhideg[MN]; %pluszjo1; pluszjo1[MN]; %pluszteljes; pluszteljes[MN]; %sebesse1gesen; sebesse1ges[MN]+en[ESSMOD]; %so2te1t; so2te1t[MN]; %St; st[MN]=St; %Saint; saint[MN]=Saint; %szabad; szabad[MN]; %to2bbese; to2bbes[MN]+e[PSe3]; %u1jbesze1l; u1jbesze1l[MN]; %U1jbesze1l; u1jbesze1l[MN]=U1jbesze1l; %u1jbesze1lu2l; u1jbesze1l[MN]+u2l[ESSMOD]; %U1jbesze1lu2l; U1jbesze1l[MN]+u2l[ESSMOD]; %vila1gos; vila1gos[MN]; %-os; -os[SKEP]; %-es; -es[SKEP]; %-o2s; -o2s[SKEP]; %-osan; -os[SKEP]+an[ESSMOD]; %-esen; -es[SKEP]+en[ESSMOD]; %-o2sen; -o2s[SKEP]+en[ESSMOD]; %elo3; elo3[IK]; %uta1n; uta1n[NU]; %nagyon; nagyon[HA]; %nem; nem[HA];


RC #

SC 1

RS 2

ige.rul _following verbs are incorrect in Orwdis.txt RULES LC RC %aszongya[IGE]; aszongy[IGE]+a[Te3]; # # %aszongya[IGE]=Aszongya; Aszongy[IGE]+a[Te3]; %aszondom[IGE]=Aszondom; Aszond[IGE]+om[Te1]; %aszondom[IGE]; aszond[IGE]+om[Te1]; %hase1rez; hase1rez[IGE]; %nemhase1rzik; namhase1rez[IGE]=nemhase1rz+ik[e3]; %nemfolytatni; nemfolytatni[INF]; %jo1gondolt; jo1gondol[IGE]+t[Me3]; %gondol; gondol[IGE]; %va1g; va1g[IGE]; chap.rul _delete the following lines: _%hi _%/hi _%/div _%div _%type=chapter _%type=part _%n=1/2/3/4/5/6/7/8/9 _%/p _%p _> _< RULES #%hi; ; #%/hi; ; #%div1; ; #%div2; ; #%/div1; ; #%/div2; ; #%type=chapter; ; #%type=part; ; #%n=1; ; #%n=2; ; #%n=3; ; #%n=4; ; #%n=5; ; #%n=6; ; #%n=7; ; #%n=8; ; #%n=9; ; #%/p; ; #%p; ; #>; ; # word%Word/WORD[TAG] _word[TAG]="Word => word%"Word/WORD[TAG] _word[TAG1]=Word+morph[TAG2] => word%Word+morph[TAG1TAG2] _morph[TAG1]=Word+[TAG2]= => word%Word+[[TAG1]TAG2]= _word.[TAG]=Word. => word.%Word./WORD.[TAG] DEFTYP 1: 32 CHARSET Ltr: 48-57 65-90 97-122 . Maj: 65-90 " Ltrplus: 48-57 65-90 97-122 + . STATESET Cut: 10 Paste: 20 RULES [; %[; ]=; ]; #; ; [; [;

LC Ltr Ltr Ltr Ltrplus

RC Maj Maj # Ltr

SC 1 2 10 10

RS 2 10 2 20

MV -4 5 5 5

clip1.rul
_ord%Ord[TAG] => Ord[TAG]
_ord%"Ord[TAG] => "Ord[TAG]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122 .
Majcit: 65-90 "
STATESET
Cut: 10
RULES      LC    RC       SC   RS   MV
#; #;      #     Ltr      1    2    -3
%; %;      Ltr   Majcit   2    10   5


RS 2 2 2 2

MV 5 5 5 5

oms2.rul
_ord%[TAG1]=ord+morph[TAG2] => ord%ord+morph[[TAG1]TAG2]
_ord%[TAG1]=ord+[TAG2] => ord%morph[[TAG1]TAG2]
_morph%[TAG1]=ord => morph%ord[TAG1]
_ord%[TAG1]="ord+morph"[TAG2] => ord%"ord+morph"[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
Plus: +
Ltrplus: 48-57 65-90 97-122 + "
STATESET
Cut: 10
Paste: 20
RULES      LC        RC      SC   RS   MV
%[; %[;    Ltr       Maj     1    2    -4
]=; ];     Ltr       -Plus   2    10   5
[; [;      Ltrplus   Ltr     10   20   5
#; ;       Ltr       #       10   20   5

clip2.rul
_word%word[[TAG1]TAG2] => word[[TAG1]TAG2]
_word%word+morph[[TAG1]TAG2] => word+morph[[TAG1]TAG2]
_word%"word+morph"[[TAG1]TAG2] => "word+morph"[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Ltrcit: 48-57 65-90 97-122 "
STATESET
Cut: 10
RULES      LC    RC       SC   RS   MV
#; #;      #     Ltr      1    2    -3
%; ;       Ltr   Ltrcit   2    10   5

oms3.rul
_word%[TAG1]+morph[TAG2] => wordmorph[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122 "
Maj: 65-90
Del: - =
STATESET
Cut: 10
Paste: 20
RULES      LC    RC     SC   RS   MV
%[; [;     Ltr   Maj    1    2    -4
]+; ];     Ltr   -Del   2    10   5
[; [;      Ltr   Ltr    10   20   5

clip3.rul
_word+morph[[LONGTAG1]TAG2] => word+morph[LONGTAG1TAG2]
_"word+morph1"[[LONGTAG1]TAG2]+-morph2 => word+morph[LONGTAG1TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Ltrcit: 48-57 65-90 97-122 "
Ltrplus: 48-57 65-90 97-122 +
STATESET
Cut: 10
RULES      LC       RC        SC   RS   MV
[[; [;     Ltrcit   Ltr       1    2    -5
]; ;       Ltr      Ltrplus   2    1    5

oms4.rul
_word[LONGTAG1]+morph[TAG2] => wordmorph[[LONGTAG1]TAG2]
_word[LONGTAG1]+-morph[TAG2] => word-morph[[LONGTAG1]TAG2]
_word.%[LONGTAG1]+-morph[TAG2] => word.-morph[[LONGTAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122 .
Ltrdel: 48-57 65-90 97-122 -
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES       LC    RC       SC   RS   MV
.%[; .[;    Ltr   Maj      1    2    -4
]+; ];      Ltr   Ltrdel   2    10   5
[; [;       Ltr   Ltr      10   20   5

clip4.rul
_word+morph[[LONGTAG1]TAG2] => word+morph[LONGTAG1TAG2]
_word+[[LONGTAG1]TAG2]= => word[LONGTAG1TAG2]=
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
[[; [;     Ltr   Ltr   1    2    -5
+[[; [;    Ltr   Ltr   1    2    -5
]; ;       Ltr   Ltr   2    1    5

oms5.rul
_morph[TAG1]=+[TAG2]=+=word => morph[[TAG1]TAG2]=+=word
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES      LC    RC    SC   RS   MV
[; [;      Ltr   Ltr   1    2    -4
]=+; ];    Ltr   [     2    10   5
[; [;      Ltr   Ltr   10   20   5

oms6.rul
_morph[[TAG1]TAG2]=+=word => morph=word[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES      LC    RC   SC   RS   MV
[; [;      Ltr   [    1    2    -4
]=+; ];    Ltr   =    2    10   5
#; ;       Ltr   #    10   20   5

oms7.rul
_morph1[TAG1]=+morph2[TAG2]+=morph3 => morph=word[[TAG1]TAG2]+=morph3
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES      LC    RC    SC   RS   MV
[; [;      Ltr   Ltr   1    2    -4
]=+; ]=;   Ltr   Ltr   2    10   4
[; [;      Ltr   Ltr   10   20   5

oms8.rul
_morph%word[[TAG1]TAG2]+=morph3 => morph%word+=morph3[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
STATESET
Cut: 10
Paste: 20
RULES      LC    RC    SC   RS   MV
[; [;      Ltr   [     1    2    -4
]+=; ];    Ltr   Ltr   2    10   5
#; ;       Ltr   #     10   20   5

oms9.rul
_word[USZ]+-morph[TAG2] => word-morph[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Ltrdel: 48-57 65-90 97-122 "
Maj: 65-90
Del: -
STATESET
Cut: 10
Paste: 20
RULES      LC       RC    SC   RS   MV
[; [;      Ltrdel   Maj   1    2    -4
]+; ];     Ltr      Del   2    10   5
[; [;      Ltr      Ltr   10   20   5

oms10.rul
_Word+=morph[TAG1] => Wordmorph[TAG1]
_Word+=morph=word[[TAG1]TAG2] => Wordmorph[[TAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
#; #;      #     Maj   1    2    5
+=; ;      Ltr   Ltr   2    3    5
=; ;       Ltr   Ltr   3    4    -5
[; [;      Ltr   [     4    10   4

oms11.rul
_word[TAG1]+=morph => wordmorph[TAG1]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
STATESET
Cut: 10
Paste: 20
RULES      LC    RC    SC   RS   MV
[; [;      Ltr   Ltr   1    2    -4
]+=; ];    Ltr   Ltr   2    10   5
#; ;       Ltr   #     10   20   5

oms4.rul
_word[LONGTAG1]+morph[TAG2] => wordmorph[[LONGTAG1]TAG2]
_word[LONGTAG1]+-morph[TAG2] => word-morph[[LONGTAG1]TAG2]
_word.%[LONGTAG1]+-morph[TAG2] => word.-morph[[LONGTAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122 .
Ltrdel: 48-57 65-90 97-122 -
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES       LC    RC       SC   RS   MV
.%[; .[;    Ltr   Maj      1    2    -4
]+; ];      Ltr   Ltrdel   2    10   5
[; [;       Ltr   Ltr      10   20   5


oms12.rul
DEFTYP
1: 32
RULES                    LC   RC   SC
enye1m=et; enye1met;     #    [    1
enye1m=hez; enye1mhez;
o2ve1=hez; o2ve1hez;
o2ve1=nek; o2ve1nek;
o2ve1=be; o2ve1be;
o2ve1=to3l; o2ve1to3l;
o2ve1=t; o2ve1t;
o2ve1=re; o2ve1re;
mie1nk=hez; mie1nkhez;

oms13.rul
_#word[TAG]=# => #word[TAG]#
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
RULES      LC    RC   SC   RS   MV
]=; ];     Ltr   #    1    2    5

oms14.rul
_morph=word[TAG] => word[TAG]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
#; #;      #     Ltr   1    2    -5
=; ;       Ltr   Ltr   2    10   5

clip14.rul
_word[[TAG1]TAG2] => word[TAG1TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
[[; [;     Ltr   Ltr   1    2    -5
]; ;       Ltr   Ltr   2    1    5

clip15.rul
_clip + between letters
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
+; ;       Ltr   Ltr   1    2    -5

oms15.rul
_word[LONGTAG1]+morph[TAG2] => wordmorph[[LONGTAG1]TAG2]
_Word[LONGTAG1]+morph[TAG2] => Wordmorph[[LONGTAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES      LC    RC    SC   RS   MV
[; [;      Ltr   Maj   1    2    -4
]+; ];     Ltr   Ltr   2    10   5
[; [;      Ltr   Ltr   10   20   5

clip4.rul
_word+morph[[LONGTAG1]TAG2] => word+morph[LONGTAG1TAG2]
_word+[[LONGTAG1]TAG2]= => word[LONGTAG1TAG2]=
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
[[; [;     Ltr   Ltr   1    2    -5
+[[; [;    Ltr   Ltr   1    2    -5
]; ;       Ltr   Ltr   2    1    5

oms15.rul
_word[LONGTAG1]+morph[TAG2] => wordmorph[[LONGTAG1]TAG2]
_Word[LONGTAG1]+morph[TAG2] => Wordmorph[[LONGTAG1]TAG2]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
Paste: 20
RULES      LC    RC    SC   RS   MV
[; [;      Ltr   Maj   1    2    -4
]+; ];     Ltr   Ltr   2    10   5
[; [;      Ltr   Ltr   10   20   5

clip4.rul
_word+morph[[LONGTAG1]TAG2] => word+morph[LONGTAG1TAG2]
_word+[[LONGTAG1]TAG2]= => word[LONGTAG1TAG2]=
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
[[; [;     Ltr   Ltr   1    2    -5
+[[; [;    Ltr   Ltr   1    2    -5
]; ;       Ltr   Ltr   2    1    5

clipl.rul
_Numeral-+morph[TAG] => Numeral-morph[TAG]
_"Word"+-morph[TAG] => "Word"-morph[TAG]
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122 "
STATESET
Cut: 10
RULES      LC    RC    SC   RS   MV
-+; -;     Ltr   Ltr   1    2    5
+-; -;     Ltr   Ltr   1    2    5

oms16.rul
_word[TAG] => word/Tagg
_word-[TAG] => word-/Tagg
CHARSET
Ltr: a-z A-Z 48-57 "
RULES      LC    RC    SC   RS   MV
[; /;      Ltr   Ltr   1    2    5
]; ;       Ltr   #     2    1    5

oms18.rul * 4 @set in _ between the tags CHARSET Ltr: a-z A-Z 48-57 " RULES /; /; DET; _DET; FN; _FN; HA; _HA; IGE; _IGE; IK; _IK; INF; _INF; ISZ; _ISZ; KOT; _KOT; KSZ; _KSZ; MN; _MN; NM; _NM; NU; _NU; NUM; _NUM; SZN; _SZN; USZ; _USZ; e1; _e1; e2; _e2; e3; _e3; t1; _t1; t2; _t2; t3; _t3; Te1; _Te1; Te2; _Te2; Te3; _Te3; Tt1; _Tt1; Tt2; _Tt2; Tt3; _Tt3; Me1; _Me1; Me2; _Me2; Me3; _Me3; Mt1; _Mt1; Mt2; _Mt2; Mt3; _Mt3;

LC 0 -_

RC 0 0

SC 1 2

RS 2 3

MV 5 3

Ltr -_

0 0

2 2

3 3

3 3

Ltr -_

0 0

2 2

3 3

3 3

E

0

2

3

3

-_

0

2

3

3

E

0

2

3

3


TMe1; _TMe1; TMe2; _TMe2; TMe3; _TMe3; TMt1; _TMt1; TMt2; _TMt2; TMt3; _TMt3; Fe1; _Fe1; Fe2; _Fe2; Fe3; _Fe3; Ft1; _Ft1; Ft2; _Ft2; Ft3; _Ft3; Pe1; _Pe1; Pe2; _Pe2; Pe3; _Pe3; Pt1; _Pt1; Pt2; _Pt2; Pt3; _Pt3; TFe1; _TFe1; TFe2; _TFe2; TFe3; _TFe3; TFt1; _TFt1; TFt2; _TFt2; TFt3; _TFt3; TPe1; _TPe1; TPe2; _TPe2; TPe3; _TPe; TPt1; _TPt1; TPt2; _TPt2; TPt3; _TPt3; Ie1; _Ie1; IMe1; _IMe1; IFe1; _IFe1; IPe1; _IPe1; INRe1; _INRe1; INRe2; _INRe2; INRe3; _INRe3; INRt1; _INRt1; INRt2; _INRt2; PSe1; _PSe1; PSe2; _PSe2; PSe3; _PSe3; PSt1; _PSt1; PSt2; _PSt2; PSt3; _PSt3; PSe1i; _PSe1i; PSe2i; _PSe2i; PSe3i; _PSe3i; PSt1i; _PSt1i; PSt2i; _PSt2i; PSt3i; _PSt3i; DIS; _DIS; ESS; _ESS; IKEP; _IKEP; KIEM; _KIEM; MOD; _MOD;

-_

0

2

3

3

E

0

2

3

3

-_

0

2

3

3


MUL; _MUL; SKEP; _SKEP; PL; _PL; POS; _POS; FF; _FF; ABL; _ABL; ACC; _ACC; ADE; _ADE; ALL; _ALL; CAU; _CAU; DAT; _DAT; DEL; _DEL; ELA; _ELA; FAC; _FAC; INE; _INE; ILL; _ILL; INS; _INS; FOK; _FOK; FOR; _FOR; NOM; _NOM; SOC; _SOC; SUB; _SUB; SUP; _SUP; TEM; _TEM; TER; _TER;

clip18.rul
@ /_Tagg => /Tagg
RULES      LC   RC   SC   RS   MV
/_; /;     0    0    1    2    5

menrad.rul
_one sentence per line
PARAM
TOT=1000
DEFTYP
1: @
3: 31

oms17.rul
_delete @ in the end of the line
PARAM
TOT=1000
DEFTYP
1: @
3: 31
RULES      LC   RC   SC   RS   MV
@#; ;      0    #    1    2    5


NORMHUT.BAT
_delete all subtags (Word/Tag1_Tag2_TagN => Word/Tag1)
@ECHO OFF
pcbeta huvud.rul oms16ut.txt huvudut.txt
pcbeta menrad.rul huvudut.txt huvmenr.txt
pcbeta oms17.rul huvmenr.txt normhut.txt
del huvudut.txt
del huvmenr.txt
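The Word/Tag1_Tag2_TagN => Word/Tag1 reduction performed by NORMHUT.BAT can be expressed compactly; the following one-regex Python sketch (an illustration, not the pcbeta pipeline) strips every subtag and keeps the bare PoS tag:

    import re

    def strip_subtags(line):
        # Word/Tag1_Tag2_..._TagN -> Word/Tag1 for every token on the line
        return re.sub(r"(/[A-Z0-9]+)(_[A-Za-z0-9]+)+", r"\1", line)

    print(strip_subtags("hu1zo1dik/IGE_e3 fal/FN_PSe3_INS"))
    # -> "hu1zo1dik/IGE fal/FN"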

huvud.rul
_delete all subtags
_input: oms16ut.txt
_output: huvud.txt
DEFTYP
1: 32
CHARSET
Ltr: 48-57 65-90 97-122
Del: /
STATESET
Cut: 10
RULES        LC    RC    SC   RS   MV
FN; FN;      Del   Ltr   1    2    -5
USZ; USZ;
MN; MN;
HA; HA;
IGE; IGE;
DET; DET;
MSZ; MSZ;
MOD; MOD;
KSZ; KSZ;
KOT; KOT;
IK; IK;
NM; NM;
SZN; SZN;
ISZ; ISZ;
INF; INF;
ROV; ROV;
MIN; MIN;
HI; HI;
NU; NU;
NUM; NUM;
AUY; AUY;
!; !;
?; ?;
.; .;
...; ...;
:; :;
,; ,;
-; -;
*; *;
"; ";
(; (;
); );
=; =;
+; +;
#; #;        Ltr   #     2    10   5

menrad.rul
_one sentence per line
PARAM
TOT=1000
DEFTYP
1: @
3: 31

oms17.rul
_delete @ in the end of the line
PARAM
TOT=1000
DEFTYP
1: @
3: 31
RULES      LC   RC   SC   RS   MV
@#; ;      0    #    1    2    5




Normalisation of the test (‘hand’) corpus

NORHAND.BAT
_change the input text to the form ‘Word/Tag1_Tag2_TagN’
@ECHO OFF
pcbeta omv1.rul sagain.txt omv1ut.txt
pcbeta omv2.rul omv1ut.txt omv2ut.txt
pcbeta omv3.rul omv2ut.txt omv3ut.txt
pcbeta omv4.rul omv3ut.txt omv4ut.txt
pcbeta omv5.rul omv4ut.txt omv5ut.txt
pcbeta tomrad.rul omv5ut.txt tomraut.txt
pcbeta omv6.rul tomraut.txt omv6ut.txt
pcbeta omv7.rul omv6ut.txt omv7ut.txt
pcbeta omv8.rul omv7ut.txt omv8ut.txt
pcbeta omv9.rul omv8ut.txt omv91ut.txt
pcbeta omv9.rul omv91ut.txt omv92ut.txt
pcbeta omv9.rul omv92ut.txt omv9ut.txt
pcbeta omv10.rul omv9ut.txt omv10ut.txt
pcbeta omv11.rul omv10ut.txt sagaut.txt
del omv1ut.txt
del omv2ut.txt
del omv3ut.txt
del omv4ut.txt
del omv5ut.txt
del tomraut.txt
del omv6ut.txt
del omv7ut.txt
del omv8ut.txt
del omv91ut.txt
del omv92ut.txt
del omv9ut.txt

_omv1.rul
_#%Word => Word\[USZ]
CHARSET
Ltr: 48-57 65-90 97-122
RULES         LC    RC    SC   RS   MV
#%; ;         #     Ltr   1    2    5
#; \[USZ];    Ltr   #     2    1    5

_omv2.rul _1 PUNCT mark _the rule does not include the ; hence, _1 PUNCT ; ;\TAG => ; ;\TAG CHARSET Del: 32-47 58-64 91-96 Ltr: 65-90 97-122 Tag: \


mark\TAG => mark\TAG

STATESET Cut: 10 RULES #1 ,; ,; .; .; :; :; !; !; ?; ?; &; &;

PUNCT

; #;

TOK

word

LC # 0

RC Del Tag

SC 1 2

RS 2 10

MV -5 -4

_omv3.rul _1

token\[TAG] => word\[TAG]

CHARSET Ltr: 48-57 65-90 97-122 STATESET Cut: 10 RULES #1

LC # Ltr Ltr

RC SC Ltr 1 Ltr 2 [ 3

RS MV 2 -3 3 -5 10 4

TOK

; #; ; ;

BOS

Token\[TAG] => Word\[TAG]

\; \;

omv4.rul _Word

CHARSET Ltr: 48-57 65-90 97-122 Maj: 65-90 STATESET Cut: 10 RULES BOS EOS ; ;

; ;

\; \;

LC Ltr Ltr Ltr

RC SC Maj 1 Maj 1 [ 2

RS MV 2 -3 2 -3 10 4

_omv5.rul
_word_wrong§wordcorrect\[TAG] => wordcorrect\[TAG]
_WordWrong§WordRight\[TAG]
CHARSET
Ltr: 48-57 65-90 97-122
Maj: 65-90
STATESET
Cut: 10
RULES      LC   RC    SC   RS   MV
#; #;      #    Ltr   1    2    -3
§; ;       0    Ltr   2    10   5


_tomrad.rul _eliminate lines containing #/ _or ##/

CHARSET Siff: 0 1 2 3 4 5 6 7 8 9 * RULES #/;; ##/;; ##/#;; #**;;

LC #

RC #

RS 2

MV 0

Siff 1

2

0

#

SC 1

_omv6.rul
_mark\XPUNCT => mark
RULES         LC   RC   SC   RS
\WPUNCT; ;    0    #    1    2
\SPUNCT; ;

_omv7.rul
_tag all punctuation marks
RULES            LC   RC   SC   RS
,; ,/,;          #    #    1    2
:; :/:;          #    #    1    2
(; (/(;          #    #    1    2
.; ./.@;         #    #    1    2
?; ?/?@;         #    #    1    2
!; !/!@;         #    #    1    2
); )/)@;         #    #    1    2
...; .../...@;   #    #    1    2
#-; -/-;         #    #    1    2

_omv8.rul
_delete + and all empty lines
RULES      LC   RC   SC   RS   MV
+; ;       0    0    1    2    5
#;;        #    #    1    2    0

_omv9.rul
_][ => _
RULES      LC   RC   SC   RS   MV
][; _;     0    0    1    2    5

_omv10.rul
_delete [] and change \ with /
RULES      LC   RC   SC   RS   MV
\[; /;     0    0    1    2    3
]; ;       0    0    2    1    5


_omv11.rul
_one sentence per line
PARAM
TOT=1000
DEFTYP
1: @
3: 31
RULES      LC   RC   SC   RS   MV
@#; ;      0    #    1    2    5

SHEADTAG.BAT
_delete the subtags
@ECHO OFF
pcbeta headtag.rul omv10ut.txt headut.txt
pcbeta omv11.rul headut.txt sheadta.txt
del headut.txt

_headtag.rul
_delete the morphological tags
_input files: omv10ut.txt and dikttag.txt
_word/Tag1_Tag2_TagN => word/Tag1
CHARSET
Ltr: 48-57 65-90 97-122 1 2 3
Del: /
DEFTYP
3: 1-255
STATESET
Cut: 10
RULES         LC    RC    SC   RS   MV
USZ_; USZ;    Del   Ltr   1    2    -5
FN_; FN;
MN_; MN;
HA_; HA;
IGE_; IGE;
DET_; DET;
MSZ_; MSZ;
MOD_; MOD;
KSZ_; KSZ;
KOT_; KOT;
IK_; IK;
NM_; NM;
SZN_; SZN;
ISZ_; ISZ;
INF_; INF;
ROV_; ROV;
MIN_; MIN;
HI_; HI;
NU_; NU;
NUM_; NUM;
AUX_; AUX;
#; #;         Ltr   #     2    10   5

_omv11.rul
_one sentence per line
PARAM
TOT=1000
DEFTYP
1: @
3: 31
RULES      LC   RC   SC   RS   MV
@#; ;      0    #    1    2    5

SDELTAG.BAT
_delete tags and put one sentence per line
@ECHO OFF
pcbeta keeprul.rul omv10ut.txt deltagu.txt
pcbeta menrad.rul deltagu.txt sagadel.txt
del deltagu.txt

_keeprul.rul
_delete tags but keep @
STATESET
Cut: 10
RULES      LC   RC   SC   RS   MV
/; ;       0    0    1    2    -3
@; @;      0    #    2    10   4
#; ;       -@   #    2    10   5

_menrad.rul
_one sentence per line and delete @
DEFTYP
1: @
3: 31
RULES      LC   RC   SC   RS   MV
@#; ;      0    #    1    2    5


Appendix E: Output

In texts with only PoS tags, items with a wrong part of speech tag are underlined. In texts with PoS and subtags (i.e. tags which provide information about the inflectional properties of a token), wrongly annotated tokens are underlined. If more than one tag is attached to a token, the wrongly annotated tag is printed in bold. Items which are correctly annotated but missing one or several subtags are italicized.

Texts with PoS tags

Poem: I. Ágh: Az én erdőm

Ko2zel/MN hu1zo1dik/IGE e1letemhez/IGE ,/, s/KOT nem/HA tudom/IGE ,/, mekkora/FN ,/, hol/HA ve1gzo3dik/IGE ,/, fal/FN nekem/NM o3/NM ,/, e1lo3/MN panaszfalom/FN ./. A/DET ritkulo1/FN levelesen/MN a1thullo1/FN fe1nyt/FN elke1pzelem/FN ,/, a/DET to2lgy/FN ,/, bu2kk/FN ,/, szilfalevelek/FN fodrozta/IGE fe1nyt/FN ,/, az/DET eso3uta1ni/FN pocsolya1khoz/INF hasonlo1/MN avart/IGE ,/, vadleso3/MN magaslatait/FN ,/, amelyek/NM tekintetem/FN fo2lvezetne1k/IGE az/DET e1g/FN fa1tylaira/FN ./. Izgat/FN a/DET megoldott/FN rejte1lyu3/MN sza1nto1k/FN uta1n/NU ,/, szeretne1k/IGE odafutni/INF a/DET to2bbiekto3l/MN ,/, szeretne1k/IGE meglesni/INF egy/DET vadat/FN ,/, hajtani/INF ,/, elveszteni/INF e1s/KOT u1jra/HA megtala1lni/INF ./. Szeretne1k/IGE elveszni/INF e1s/KOT si1rni/INF ,/, elo3keru2lni/INF ,/, ra1hajolni/INF a/DET korareggeli/MN tu3zhely/FN melege1re/IGE a/DET hazate1rt/FN hadifoglyok/FN o2ro2me1vel/FN ./. Vonz/FN ,/, mert/KOT titkait/FN sohasem/HA fedte/IGE fo2l/HA elo3ttem/IGE e1s/KOT sohasem/HA hagytak/IGE ,/, hogy/KOT titkait/FN megtala1ljam/FN ./. A/DET bozo1tbo1l/FN kiugro1/MN szarvast/FN ki/IK la1tta/IGE ?/? Igaz/MN ,/, olyan/NM mint/KOT a/DET szappanrekla1mon/FN ,/, elso3/SZN la1bait/FN maga/NM ala1/NU hu1zva/HA ,/, szarvait/MN e1gre/FN vetve/HA szo2kell/FN ,/, repu2l/IGE olyan/NM fo2nse1gesen/MN ,/, hogy/KOT lelo3ve/HA is/KOT u1gy/HA bukik/IGE ,/, ahogy/HA a/DET szabadsa1g/FN nagy/MN vadja1nak/FN bukni/INF illik/IGE ./. La1ttam/IGE a/DET szarvast/FN rabnak/FN ,/, fiatal/MN e1s/KOT megte1pa1zott/MN volt/IGE ,/, mint/KOT akit/NM e1lve/HA fognak/IGE el/IK ,/, mint/KOT akit/NM ero3szakos/MN kezekkel/FN ,/, te1rdekkel/FN maguke1va1/NM tesznek/IGE ./. Nem/HA la1ttam/IGE szabadnak/IGE ./. Ha/KOT nem/HA lettem/IGE olyan/NM szabadonfuto1/MN ,/, szabadon/MN do3zso2lo3/FN ,/, szabadon/MN szenvedo3/FN ,/, ha1t/HA megla1tni/FN a1hi1toztam/INF legala1bb/HA ./. Mert/KOT csak/HA a/DET gya1va/MN nyulat/FN ,/, a/DET sompolygo1/MN ro1ka1t/FN ,/, a/DET sze1tszalado1/MN egereket/FN ismerem/IGE ./. Puska1val/FN menjek/IGE oda/IK ?/? Ma1r/HA csak/HA fegyverrel/FN juthatok/IGE oda/IK ,/, ahova/KOT valamikor/HA u2res/MN ke1zzel/FN akartam/IGE ?/? I1gy/KOT nyerjek/FN biztonsa1got/IGE ?/? Hu1zzon/FN ,/, vezessen/MN ,/, leskelo3dtessen/MN ,/, lo2vessen/MN ?/? Borral/FN e1s/KOT friss/MN vadhu1ssal/FN megki1na1ltasson/IGE ?/? Engedjen/FN bo3ge1st/FN hallani/INF ,/, la1tni/INF hajnali/INF forra1st/FN ,/, megtudni/INF titkot/FN ?/?


Nem/HA kell/IGE nekem/NM a/DET kutata1sbo1l/FN passzio1/FN ,/, vesze1lybo3l/MN se1ta1lgata1s/FN !/! Vagy/KOT nincs/IGE ma1r/HA ez/NM az/DET erdo3/FN ?/? Vagy/KOT tala1n/HA benne/HA vagyok/IGE ?/? S/KOT mindig/HA egyma1sra/NM gondolunk/IGE ,/, amint/KOT ja1rom/SZN o2sve1nyeit/FN egyedu2l/HA ,/, u2res/MN ke1zzel/FN ./. Lendu2l/FN a/DET hab/FN s/KOT a/DET part/FN fala1n/FN ezu2stsarkantyu1s/MN la1ba/FN dobban/FN :/: ne1zd/IGE ,/, fu2rdik/IGE a/DET fekete/MN la1ny/FN ,/, fekete/MN la1ny/FN fehe1r/MN habokban/FN ./. Elszenderu2lt/IGE a/DET bu1/MN szeme1n/FN ,/, hulla1mba/FN hull/IGE ma/HA teste/FN ,/, lelke/FN ,/, hulla1mos/FN haju1/MN vo3lege1ny/FN milyen/NM ero3sen/MN a1to2lelte/IGE ./. De/KOT ne1zd/IGE :/: so2te1t/MN erdo3k/FN ko2zo2tt/NU nagy/MN ,/, ordas/MN fellegek/FN szakadnak/IGE s/KOT jo2n/IGE a1rja/MN za1poro2tt/FN go2rgeteges/IGE hegyi/IGE pataknak/FN ./. Haragos/MN ,/, szennyes/MN a1radat/FN ,/, a/DET medre1t/FN karmok/FN a1ssa1k/IGE ./. Ke1rd/IGE meg/IK a/DET ho2kkent/FN ga1takat/IGE ,/, ga1ncsolja1k/IGE a/DET vizek/FN futa1sa1t/FN ./. Ne/HA essen/MN folt/FN fe1nyes/MN haja1n/FN ,/, iszapos/MN a1r/FN hozza1/NM ne/HA e1rjen/IGE :/: fekete/MN la1ny/FN tala1n/HA ,/, tala1n/HA utolszor/KOT fu2rdik/IGE -/- ho1fehe1ren/FN ./. Fairy tale: B. Balázs: A három hıséges királyleány

BALA1ZS/FN BE1LA/FN :/: A/DET HA1ROM/FN HU3SE1GES/FN KIRA1LYLEA1NY/FN E1lt/IGE egyszer/HA Kandrapura/IGE va1rosa1ban/FN egy/DET Suryakanta/FN nevu3/MN hatalmas/MN kira1ly/FN ,/, aki/NM ha1rom/SZN fo3tulajdonsa1ga/FN a1ltal/NU olyan/NM volt/IGE a/DET fe1rfiak/FN ko2zo2tt/NU mint/KOT a/DET ko2vek/FN ko2zt/NU az/DET Adamanth/FN :/: tu2ndo2klo3/MN ,/, tiszta/MN e1s/KOT keme1ny/MN ./. Tu2ndo2klo3/MN sze1pse1ge/FN minden/NM asszonyt/FN szi1ven/FN harapott/MN mint/KOT a/DET kobra/FN ./. Bu3ntelen/MN tisztasa1ga/MN emberfeletti/FN ero3t/FN adott/IGE neki/IK ./. De/KOT keme1nyse1ge/FN volt/IGE a/DET legnagyobb/MN ./. Lelke1nek/FN fa1ja/FN elo3bbi/MN e1letekben/FN gyo2kerezett/IGE e1s/KOT eze1rt/KOT mozdi1thatatlan/MN vira1gokke1nt/FN termettek/IGE rajta/IK tettek/IGE e1s/KOT szavak/FN ,/, melyek/IGE ezer/SZN e1v/FN o1ta/NU ke1szen/MN voltak/IGE ./. Suryakanta/FN kira1ly/FN csak/HA hullatta/IGE o3ket/NM ,/, de/KOT rajtuk/NM nem/HA va1ltoztathatott/IGE ./. Suryakanta/FN kira1ly/FN uta1lta/IGE a/DET szavakat/FN e1s/KOT minden/NM ke1rde1sre/FN tettel/FN felelt/IGE ./. A/DET ko2nyo2rgo3k/FN e1s/KOT ke1relmezo3k/FN felett/NU ne1ma1n/IGE ne1zett/IGE el/IK mozdulatlan/MN arccal/FN de/KOT mire/HA hazate1rtek/KOT ke1re1su2k/HA teljesi1tve/FN volt/IGE ./. Suryakanta/FN kira1ly/FN u1gy/HA a1llt/IGE az/DET emberek/FN ko2zo2tt/NU mint/KOT az/DET erdo3/FN te1tova/MN hajlongo1/MN no2ve1nyei/FN ko2zo2tt/NU a/DET merev/MN istenszobor/FN ,/, melyben/FN a/DET szellem/FN e1rcbo2rto2nbe/FN za1rva/HA o2ro2k/MN ido3kre/FN egyforma/MN e1s/KOT mozdulatlan/MN ./. Egyszer/HA Suryakanta/IGE kira1ly/FN maga1hoz/NM hi1vatta/IGE felese1ge1t/FN a/DET bu2szke/MN Balapandita1t/FN ./. Mikor/HA Balapandita/IGE a/DET terembe/FN le1pett/IGE azon/NM halk/MN remege1s/FN futott/IGE ve1gig/IK mint/KOT a/DET fu2ves/MN re1ten/FN ha/KOT a/DET felkelo3/FN hold/MN rea1lehel/FN ./.


Mert/KOT Balapandita/FN sze1pse1ge1ben/IGE a/DET hold/FN go2mbo2lyu3se1ge/FN ,/, futo1inda/MN hajla1sa/FN ,/, fu3/FN remege1se/FN ,/, leve1l/FN ko2nnyu3se1ge/FN ,/, elefa1ntorma1ny/MN ve1konyula1sa/FN ,/, me1hek/MN rajza1sa/FN e1s/KOT a/DET papaga1j/FN melle/FN la1gysa1ga/FN egyesu2ltek/IGE mint/KOT egyma1st/NM re1g/MN kereso3/MN szerelmesek/FN ./. -/- Mie1rt/NM hi1vat/IGE uram/FN ,/, Suryakanta/MN kira1ly/FN -/- ke1rdezte/IGE Balapandita/IGE ./. De/KOT Suryakanta/FN nem/HA felelt/IGE hanem/KOT odale1pett/IGE a/DET sze1pcsi1po3ju2ho2z/FN e1s/KOT homlokon/FN cso1kolta/IGE olyan/NM jeleket/FN te1ve/HA mint/KOT a/DET bu1csu1zo1k/FN ./. -/- Uram/IGE Suryakanta/FN -/- szo1lt/IGE Balapandita/IGE akkor/HA -/- homlokon/FN cso1kolsz/IGE e1s/KOT olyan/NM jeleket/FN teszel/IGE mint/KOT a/DET bu1csu1zo1k/FN ./. Vajon/KOT bu1csu1zni/INF akarsz/IGE to3lem/NM te/NM fe1rfitigris/FN ?/? Suryakanta/FN nem/HA felelt/IGE hanem/KOT elso3/SZN minisztere1hez/FN e1s/KOT ifju1sa1ga1nak/FN ta1rsa1hoz/FN fordult/IGE ,/, mondva1n/FN :/: -/- Razakosa/FN ke1szu2lj/FN u1tra/FN ./. De/KOT ki1se1ret/FN ne/HA ko2vessen/MN e1s/KOT fullajta1rok/MN elo3ttu2nk/IGE ne/HA fussanak/IGE ./. E/NM szavakbo1l/FN Balapandita/IGE e1s/KOT Razakosa/FN mege1rtette1k/IGE hogy/KOT Suryakanta/IGE kira1ly/FN az/DET erdo3be/FN indul/IGE valamely/IGE szent/MN liget/MN la1togata1sa1ra/FN ,/, hogy/KOT a1ldozattal/FN ,/, fu2rde1ssel/FN e1s/KOT medita1cio1val/FN tiszti1tsa/IGE lelke1t/FN e1s/KOT ereje1t/FN no2velje/FN ./. Akkor/NM Balapandita/FN hangja1ra/IGE nedves/MN fa1tyolt/IGE bori1tott/MN a/DET fa1jdalom/FN e1s/KOT i1gy/KOT szo1lt/IGE :/: -/- Isten/FN veled/NM Suryakanta/IGE ./. Bizony/HA visszajo2tto2d/IGE u1gy/HA va1rom/SZN ,/, mint/KOT halott/FN az/DET u1jraszu2lete1st/FN ./. Megfestetem/FN ke1pedet/IGE ,/, hogy/KOT a/DET hold/FN vila1gi1tson/FN lelkemre/FN mi1g/HA napja/FN visszate1r/FN ./. De/KOT Suryakanta/FN kira1ly/FN nem/HA felelt/IGE hanem/KOT ne1ma/MN ajakkal/FN e1s/KOT besze1lo3/MN szemmel/FN ne1zte/IGE felese1ge1t/FN akit/NM a/DET hu3se1g/FN szekre1nye1nek/FN neveztek/IGE azta1n/HA elindult/IGE a/DET palota/FN kapuja/FN fele1/NU ./. O3t/NM pedig/KOT ko2vette/IGE Razakosa/IGE elso3/SZN minisztere/FN e1s/KOT ifju1sa1ga1nak/FN ta1rsa/MN ./. Mikor/NM ma1r/HA me1lyen/MN ja1rtak/IGE az/DET erdo3ben/FN ahol/HA a/DET Kadamba/FN fa/FN illata1to1l/FN megdu2ho2do2tt/FN pe1zsmaszarvasok/FN tu1rja1k/IGE magukat/NM a/DET sziromhullato1/MN ja1zminbokrokon/FN keresztu2l/HA ,/, Suryakanta/HA kira1ly/FN hirtelen/MN mega1llt/IGE ./. -/- Mit/NM la1tta1l/IGE uram/FN ?/? -/- ke1rdezte/IGE Razakosa/IGE ./. Suryakanta/FN kira1ly/FN akkor/HA lehajolt/IGE e1s/KOT mikor/NM megint/HA kiegyenesedett/IGE jobb/MN keze1vel/FN egy/DET nagy/MN fekete/MN ki1gyo1/FN torka1t/FN fogta/IGE ,/, bal/MN keze1vel/FN pedig/KOT egy/DET kis/MN nyulat/FN ,/, melyet/FN a/DET ki1gyo1/MN sza1ja1bo1l/FN vett/IGE ki/IK ./. Azuta1n/IGE mind/NM a/DET ketto3t/SZN eldobta/IGE e1s/KOT tova1bb/IK akart/IGE menni/INF ./. Hanem/FN akkor/HA a/DET fekete/MN ki1gyo1/FN fela1gaskodott/IGE elo3tte/NU e1s/KOT i1gy/KOT szo1lt/IGE ./. -/- Mie1rt/HA tesz/IGE ezt/NM velem/NM Suryakanta/MN kira1ly/FN ?/? Nekem/FN a/DET nyulat/FN istenek/FN rendelte1k/FN ta1pla1le1komul/IGE e1s/KOT ez/NM a/DET falat/FN melyet/FN most/HA sza1jambo1l/HA kivette1l/MN e1hhala1lom/FN elo3l/NU volt/IGE utolso1/MN menekve1sem/HA ./.
Mert/KOT tudd/IGE meg/IK o1h/FN Suryakanta/FN kira1ly/FN ,/, hogy/KOT harminc/SZN napig/FN nem/HA ettem/IGE e1s/KOT a/DET hala1l/FN sarka/FN van/IGE fejemen/FN ./. Tudsz/IGE e/NM nekem/NM ma1s/NM ta1pla1le1kot/FN adni/INF a/DET nyu1l/FN helyett/NU ?/? Suryakanta/FN kira1ly/FN mozdulatlan/MN arccal/FN ne1zte/IGE a/DET fekete/MN ki1gyo1t/FN e1s/KOT nem/HA felelt/IGE ,/, hanem/KOT kihu1zta/IGE to3re1t/IGE e1s/KOT izmos/MN barna/MN karja1bo1l/FN kanyari1tott/IGE egy/DET darabot/FN e1s/KOT azt/NM odadobta/IGE a/DET ki1gyo1/MN ele1/NU ./. Hanem/FN a/DET ki1gyo1/HA nem/HA nyu1lt/IGE uta1na/NU ./. -/- Meghalt/IGE -/- mondta/IGE Razakosa/IGE ./. -/- Nem/HA mondtad/IGE ,/, hogy/KOT segi1teni/INF akarsz/IGE rajta/IK ,/, ha1t/HA eleresztette/IGE e1lete1t/FN mert/KOT nem/HA tudta/IGE mire/HA va1rjon/IGE me1g/HA tova1bb/IK szenvedve/IGE ./. Suryakanta/FN kira1ly/FN mozdulatlan/MN arccal/FN ne1zte/IGE a/DET hosszu1/MN ,/, fekete/MN hulla1t/FN ./. Lelke1t/IGE oly/NM keve1sse1/MN illette/IGE a/DET hala1l/FN mint/KOT a/DET vad/MN elefa1ntot/FN az/DET erdei/MN leve1l/FN mely/NM ra1ncos/MN ha1ta1ra/FN hull/IGE ./. E1ppen/HA a1t/IK akart/IGE le1pni/INF rajta/NM ,/, hogy/KOT tova1bbmenjen/FN mikor/FN hirtelen/MN megjelent/IGE elo3tte/NU Ganesa/IGE az/DET elefa1ntarcu1/MN isten/FN ./. O1ria1s/FN fu2lein/FN a1tizzott/IGE a/DET nap/FN ,/, hogy/KOT tu2zes/MN korallbokrokke1nt/FN piroslottak/IGE benne/NM az/DET erek/FN ./. Ro1zsaszi1n/MN suga1r/MN orma1nya1t/FN ta1ncban/FN emelte/IGE e1s/KOT i1gy/KOT szo1lt/IGE :/: -/- Suryakanta/HA o2lte1l/IGE ./. Suryakanta/FN mozdulatlan/MN szemmel/FN ne1zett/IGE Ganesa/IGE kis/MN szemeibe/FN e1s/KOT i1gy/KOT felelt/IGE ./. -/- Az/DET istenek/FN o2ltek/IGE akik/NM terme1szetemet/FN u1gy/HA termelte1k/IGE hogy/KOT uta1ljam/IGE a/DET szavakat/FN e1s/KOT tettekkel/FN besze1ljek/FN ./. Lelkem/FN vira1ga/FN elo3bbi/MN e1letekben/FN gyo2keredzik/IGE e1s/KOT sem/KOT mozdulni/INF sem/KOT va1ltozni/INF nem/HA tud/IGE ./. Akkor/NM Ganesa/FN nevetett/IGE :/: -/- Ha/KOT lelked/FN sem/KOT mozdulni/INF sem/KOT va1ltozni/INF nem/HA tud/IGE ,/, ha1t/HA majd/HA lete1pem/FN sza1ra1ro1l/FN ,/, hogy/KOT mozduljon/IGE e1s/KOT keme1nyse1ge/FN miatt/NU to2bbe1/HA gyilkossa1gba/FN ne/HA esse1k/IGE ./. Ezt/NM mondta/IGE nevetve/FN Ganesa/FN e1s/KOT ro1zsaszi1n/FN hosszu1/MN suga1r/MN orma1nya1val/FN belenyu1lt/FN Suryakanta/FN melle1be/IGE e1s/KOT lelke/FN vira1ga1t/FN lete1pte/IGE sza1ra1ro1l/IGE ./. Avval/FN eltu3nt/IGE ./. A/DET kira1ly/FN melle1hez/FN kapott/IGE mintha/KOT nyi1l/FN fu1rta/IGE volna/IGE a1t/IK e1s/KOT felkia1ltott/IGE :/: -/- O1h/FN jaj/FN nekem/NM !/! Akkor/NM Razakosa/FN kiejtette/IGE keze1bo3l/FN i1ja1t/FN ,/, mert/KOT soha/HA senki/FN me1g/HA Suryakanta/IGE kira1ly/FN jajszava1t/FN nem/HA hallotta/IGE ./. -/- Uram/IGE Suryakanta/FN -/- ke1rdezte/IGE halkan/MN mint/KOT aki/NM a/DET feleletto3l/FN fe1l/FN e1s/KOT a1lomnak/FN szeretne1/IGE hinni/INF o2nmaga1t/NM ./. -/- Uram/IGE ,/, mi/NM baj/FN e1rt/IGE ?/? Suryakanta/FN ijedten/FN fordi1totta/IGE feje1t/FN Razakosa/FN fele1/NU e1s/KOT me1lyen/MN elpirult/IGE ./. -/- Oh/NNP kedves/MN Razakosa/FN -/- mondta/IGE sze1gyenlo3sen/IGE mosolyogva/HA ./. -/- Nagyon/HA fa1j/IGE sebem/IGE ,/, melyet/FN ama/NM ki1gyo1/MN miatt/NU a/DET karomba/FN va1gtam/IGE ./. Akkor/NM Razakosa/FN kiejtette/IGE keze1bo3l/FN tegeze1t/FN e1s/KOT szeme1t/FN ne1ma1n/FN lesu2to2tte/IGE mert/KOT Suryakanta/IGE olyat/NM mondott/IGE ,/,
amit/NM egy/DET ko2zo2nse1ges/MN Ksatria1nak/FN sem/KOT szabad/MN ,/, nemhogy/MN kira1lynak/FN ./. -/- Elejtetted/IGE i1jad/IGE e1s/KOT tegezed/IGE ,/, kedves/MN Razakosa/FN -/- mondta/IGE Suryakanta/IGE e1s/KOT felemelte/IGE sietse1ggel/IGE mint/KOT valami/NM hi1zelgo3/FN szolga/FN ./. -/- Mie1rt/HA su2t/IGE le/IK a/DET szemed/FN Razakosa/FN e1s/KOT mie1rt/HA sa1padsz/IGE el/IK ?/? Tala1n/HA bizony/HA rajtam/FN la1tsz/IGE valami/NM va1ltoza1st/FN ?/? Oh/NNP milyen/NM va1ltoza1s/FN eshetne1k/FN rajtam/IGE ?/? Suryakanta/FN szapora/FN szavainak/FN cso2rge1se/FN u1gy/HA hullott/IGE Razakosa/IGE fu2le1be/FN ,/, mint/KOT egy/DET o2sszeto2rt/MN dra1ga/MN ede1ny/FN cserepeinek/FN cso2ro2mpo2le1se/FN ./. -/- Legyu2nk/IGE vida1mak/IGE -/- mondta/IGE Suryakanta/IGE -/- e1s/KOT siessu2nk/FN felese1gemhez/FN Balapandita1hoz/IGE ,/, mert/KOT szemem/FN e1s/KOT karom/FN u1gy/HA ki1va1nja1k/FN Balapandita1t/IGE ,/, hogy/KOT leva1lnak/FN testemro3l/IGE e1s/KOT hozza1/HA repu2lnek/IGE ha/KOT nem/HA sietek/IGE ./. -/- Uram/IGE Suryakanta/FN kira1ly/FN -/- mondta/IGE Razakosa/IGE -/- a/DET szent/MN liget/MN la1togata1sa1ra/FN indultunk/IGE ,/, hogy/KOT a1ldozzunk/IGE az/DET isteneknek/FN ./. Suryakanta/FN elva1ltozott/IGE arca/FN mint/KOT az/DET ijedt/FN gyermeke1/IGE olyan/NM lett/IGE ./. -/- Igaza1n/HA aze1rt/KOT indultunk/IGE ?/? Jo1/MN ,/, nem/HA ba1nom/IGE ./. Akkor/NM siessu2nk/IGE a/DET szent/MN ligetbe/FN ./. Alig/HA haladtak/IGE ne1ha1ny/NM le1pe1st/FN Suryakanta/FN kira1ly/FN hirtelen/MN visszafordult/IGE e1s/KOT i1gy/KOT szo1lt/IGE ./. -/- A/DET bu2szke/MN Balapandita/FN arcke1pem/IGE elo3tt/NU u2l/IGE e1s/KOT va1r/IGE ./. Gondolatai/FN la1bam/FN ko2re1/NU hurkolo1dnak/IGE e1s/KOT nem/HA eresztenek/IGE tova1bb/IK ./. Kedves/MN Razakosa/FN ke1rlek/IGE ez/NM egyszer/HA te1rju2nk/IGE inka1bb/HA vissza/IK ./. Akkor/NM Razakosa/FN hangos/MN si1ra1ssal/FN feljajdult/IGE :/: -/- O1h/FN Suryakanta/FN ,/, Suryakanta/HA mi/NM lett/IGE belo3led/NU !/! Suryakanta/FN elva1ltozott/IGE szeme/FN hunyorgatva/FN ne1zett/IGE mint/KOT egy/DET gya1va/MN kolduse1/FN ./. -/- Mie1rt/NM jajgat/IGE Razakosa/FN ?/? Tala1n/HA valami/NM va1ltoza1s/FN esett/IGE rajtam/IGE ?/? Mife1le/MN va1ltoza1s/FN eshetne1k/FN rajtam/IGE ,/, holott/IGE e1n/NM vagyok/IGE Suryakanta/FN kira1ly/FN ,/, fe1rfiak/FN ko2zt/NU olyan/NM mint/KOT a/DET ko2vek/FN ko2zt/NU az/DET adamanth/FN ./. E1n/NM vagyok/IGE a/DET mozdi1thatatlan/MN Suryakanta/FN ./. Tova1bb/IK nem/HA besze1lt/IGE a/DET kira1ly/FN hanem/KOT arca/FN ele1/NU csapta/IGE kezeit/FN e1s/KOT hangosan/MN felzokogott/IGE :/: -/- Oh/NNP Razakosa/FN ifju1sa1gomnak/FN ta1rsa/MN ,/, nem/HA vagyok/IGE e1n/NM to2bbe1/HA Suryakanta/MN ,/, se/HA adamanth/IGE a/DET fe1rfiak/FN ko2zo2tt/NU !/! -/- Uram/IGE -/- felelte/IGE Razakosa/IGE -/- kinek/NM szeme1bo3l/FN Suryakanta/FN ko2nnyei/FN elo3hi1vta1k/FN testve1reiket/IGE mint/KOT ahogy/HA egy/DET felbukkano1/MN hangya/FN elo3hi1vja/FN mind/NM a/DET to2bbit/MN ./. -/- Uram/IGE ez/NM Ganesa/FN bu2ntete1se/FN a/DET ki1gyo1e1rt/FN ./. Lete1pte/IGE lelked/FN vira1ga1t/FN sza1ra1ro1l/FN e1s/KOT most/HA mint/KOT szakasztott/FN lo1tusz/IGE hulla1mos/MN tavon/FN inog/IGE e1s/KOT ha1nyko1dik/IGE ,/, re1szeg/MN ta1ncoske1nt/FN ,/, bizonytalanul/MN ./. -/- Jaj/ISZ ,/, jaj/MN Razakosa/MN segi1ts/FN rajtam/IGE ,/, kedves/MN Razakosa/FN !/!
Fogd/IGE meg/IK lelkemnek/IGE lelke1t/FN ,/, hogy/KOT megint/HA a1llando1/MN legyen/IGE ./. Szo1li1tsd/IGE neve1n/FN ,/, hogy/KOT horgonya/FN legyen/IGE a/DET ne1v/FN ./. Oh/NNP mondd/IGE elo3/IK nekem/NM ,/, hogy/KOT ki/IK vagyok/IGE ,/, hogy/KOT azt/NM ko2vessem/MN ,/, mint/KOT vezeto3je1t/IGE a/DET vak/MN ./. -/- Nem/HA tudom/IGE ,/, hogy/KOT ki/IK vagy/KOT Suryakanta/MN kira1ly/FN ./. -/- Hiszen/KOT miniszterem/IGE vagy/KOT e1s/KOT ifju1sa1gomnak/FN ta1rsa/MN !/! -/- Ifju1sa1godnak/IGE ta1rsa/MN vagyok/IGE Suryakanta/IGE :/: fe1rfibara1t/FN ./. Tudom/IGE szavaidat/IGE e1s/KOT tetteidt/FN tudom/IGE ./. Te1ged/NM nem/HA tudlak/IGE ./. -/- Mie1rt/HA szeret/IGE ha1t/HA engem/NM ,/, jaj/FN !/! -/- Szavake1rt/IGE e1s/KOT tetteke1rt/FN szeret/IGE a/DET bara1t/FN ,/, Suryakanta/HA kira1ly/FN ./. Szavak/FN e1s/KOT tettek/IGE mo2ge1/NU asszony/FN tud/IGE szeretni/INF ./. -/- O1h/IGE bu2szke/MN Balapandita/IGE ,/, hu3se1g/FN szekre1nye/FN !/! -/- I1gy/KOT kia1ltott/IGE fel/HA akkor/HA Suryakanta/IGE ./. -/- Arcke1pem/IGE elo3tt/NU u2lsz/IGE e1s/KOT va1rsz/IGE ra1m/NM e1s/KOT szemedben/FN o3rzo2d/IGE re1gi/MN arcomat/FN ./. Szemedbo3l/FN fogom/IGE azt/NM elo3venni/INF ,/, hogy/KOT u1jra/HA magamra/FN o2ltsem/IGE ./. Oh/NNP gyeru2nk/FN Razakosa/IGE ,/, siessu2nk/FN Balapandita1hoz/IGE ./. Akkor/NM Suryakanta/FN kira1ly/FN elindult/IGE hazafele1/HA az/DET erdo3n/FN ,/, o3t/NM pedig/KOT ko2vette/IGE elso3/SZN minisztere/FN e1s/KOT ifju1sa1ga1nak/FN ta1rsa/MN ./. De/KOT ha/KOT Razakosa/FN nem/HA tudta/IGE volna/IGE ,/, hogy/KOT ki/IK le1pked/IGE elo3tte/NU az/DET erdei/MN o2sve1nyen/FN ,/, ja1ra1sa1ro1l/IGE ra1/NM nem/HA ismert/IGE volna/IGE Suryakanta/FN kira1lyra/IGE ./. Mert/KOT aki/NM me1g/HA tegnap/HA u1gy/HA ja1rt/IGE mint/KOT a/DET vad/MN elefa1nt/FN a/DET tango1/MN bozo1tban/FN ,/, az/DET ma1ma/FN u1gy/HA ja1r/IGE mint/KOT az/DET u1jon/FN vedlett/IGE ki1gyo1/IGE ,/, amely/NM bo3re1t/FN fe1ltve1n/FN kanyarogva/FN keru2l/IGE minden/NM kavicsot/FN ./. E1s/KOT aki/NM tegnap/HA u1gy/HA haladt/IGE ce1lja/FN fele1/NU ,/, mint/KOT Krisna/FN kilo3tt/FN nyila/IGE ,/, az/DET te1tova1zva/HA meg-mega1llt/IGE most/HA ,/, jobbra-balra/HA ne1zve/HA ,/, mint/KOT a/DET szerelmes/FN asszony/FN aki/NM elo3szo2r/HA szo2kik/FN kedvese1hez/IGE e1s/KOT la1bperceinek/FN cso2rge1se/FN riasztja/IGE az/DET e1ji/MN csendben/FN ./. -/- O1h/IGE Suryakanta/FN ,/, Suryakanta/FN mi/NM lett/IGE belo3led/NU !/! Ma1r/HA esteledett/IGE mikor/HA a/DET palota/FN ele1/NU e1rtek/IGE ./. Ma1r/HA a/DET nap/FN ura/FN va1ndoru1tja1nak/FN ve1ge1n/FN bete1rt/IGE az/DET alkony/FN bi1bor/FN palota1ja1ba/FN de/KOT onnan/HA is/KOT kiu3zetve1n/IGE a/DET tengerbe/FN vetette/IGE maga1t/NM ./. Ma1r/HA a/DET hold/FN ezu2stvize/FN csurgott/IGE le/IK a/DET palota/FN malachit/FN le1pcso3in/FN e1s/KOT a/DET ta1ncolo1/MN pa1va1k/FN kapkodta1k/IGE la1bukat/FN ,/, hogy/KOT meg/IK ne/HA a1zze1k/IGE ./. Ma1r/HA minden/NM fa1klya/FN kialudt/IGE csak/HA a/DET legbelso3/MN szoba1ban/FN e1gett/IGE egy/DET illatozo1/MN la1mpa/FN ,/, mint/KOT a/DET so2te1t/MN palota1nak/MN vila1gi1to1/FN szi1ve/FN ./. Balapandita/FN virrasztott/IGE Suryakanta/IGE kira1ly/FN arcke1pe/FN elo3tt/NU ./. Mikor/HA a/DET kira1ly/FN remego3/MN ke1zzel/FN fe1lrehu1zta/IGE a/DET fu2ggo2nyt/FN e1s/KOT a/DET szoba1ba/FN le1pett/IGE volna/IGE ,/, a/DET bu2szke/MN Balapandita/FN felugrott/IGE az/DET oroszla1nla1bu1/MN pa1rna1ro1l/FN ./. -/- Ki/IK vagy/KOT te/NM ,/, e1s/KOT hogy/KOT mersz/IGE a/DET kira1lyno3/MN szoba1ja1ba/FN le1pni/INF ?/? 
-/- Kia1ltotta/IGE ha1traszegett/IGE fejjel/FN haragosan/MN ./. -/- Suryakanta/IGE vagyok/IGE ,/, a/DET fe1rjed/FN ./.
Nem/HA ismersz/IGE ?/? -/- Szemtelen/MN csalo1/FN te/NM !/! Suryakanta/FN hangja/FN olyan/NM mint/KOT a/DET bagzo1/FN tigrise1/FN ./. De/KOT a/DET te/NM hangod/IGE mint/KOT a/DET here1lt/FN szolga1ke1/IGE u1gy/HA remeg/FN ./. Akkor/NM a/DET kira1ly/FN egy/DET le1pe1ssel/FN ko2zelebb/MN le1pett/IGE a/DET la1mpa1hoz/FN e1s/KOT i1gy/KOT szo1lt/IGE :/: -/- Oh/NNP sze1pcsi1po3ju3/MN Balapandita/FN ismerj/IGE meg/IK ,/, mert/KOT e1n/NM vagyok/IGE a/DET fe1rjed/FN Suryakanta/IGE kira1ly/IGE ./. Akkor/NM Balapandita/IGE egy/DET le1pe1st/FN ha1trale1pett/IGE e1s/KOT i1gy/KOT szo1lt/IGE :/: Suryakanta/MN arca/FN mint/KOT a/DET Himala1ja/FN szikla1ja/FN keme1ny/MN e1s/KOT mozdulatlan/MN a/DET te/NM arcod/FN puha/MN e1s/KOT va1ltozo1/MN mint/KOT a/DET szi1ne1szeke1/FN ./. Jaj/ISZ !/! -/- Oh/NNP Balapandita/IGE ,/, Balapandita/FN ismerj/IGE meg/IK !/! E1n/NM vagyok/IGE Suryakanta/FN ./. -/- Suryakanta/IGE kira1ly/FN ma/HA elindult/IGE az/DET erdo3be/FN vezekelni/INF e1s/KOT Suryakanta/IGE sza1nde1ka1t/FN soha/HA meg/IK nem/HA ma1si1tja/IGE ./. Jaj/ISZ !/! -/- O1h/IGE Balapandita/FN visszate1rtem/IGE hozza1d/NM ,/, mert/KOT va1gyo1dtam/IGE uta1nad/NU ./. -/- Suryakanta/IGE a/DET ce1l/FN ,/, a/DET mozdulatlan/MN e1s/KOT ve1gso3/MN ./. Nekem/FN kell/IGE o3hozza1/IGE mennem/IGE ./. O3/NM nem/HA jo2n/IGE e1nhozza1m/IGE soha/HA ./. Jaj/ISZ !/! -/- Eljo2ttem/IGE hozza1d/NM ,/, hogy/KOT ta1masztd/IGE meg/IK gyo2kereszakadt/HA ingo1/IGE lelkemet/FN ./. -/- Jaj/ISZ Suryakanta/FN a/DET szikla/FN ,/, melyhez/IGE az/DET e1n/NM lelkem/IGE ingo1/FN hajo1ja1t/FN ko2to2ttem/IGE ./. Akkor/NM a/DET kira1ly/FN me1g/HA egy/DET le1pe1ssel/FN ko2zelebb/MN le1pett/IGE az/DET illatozo1/MN la1mpa1hoz/FN e1s/KOT i1gy/KOT szo1lt/IGE :/: -/- O1h/FN Balapandita/FN ne1zz/IGE ra1m/NM ,/, ne1zzed/IGE a/DET kira1lyi/FN e1kszereket/MN mellemen/FN ./. E1n/NM vagyok/IGE a/DET fe1rjed/FN Suryakanta/IGE ./. Akkor/NM Balapandita/IGE me1g/HA egy/DET le1pe1ssel/FN ha1trale1pett/IGE e1s/KOT i1gy/KOT szo1lt/IGE :/: -/- Ha/KOT te/NM Suryakanta/MN neve1t/FN e1s/KOT kira1lyi/MN e1kszereit/FN viseled/IGE akkor/HA Suryakanta/MN helye1re/FN le1pte1l/IGE ebben/NM az/DET e1letben/FN e1s/KOT Suryakanta/FN kira1ly/FN ,/, a/DET fe1rfitigris/FN ,/, az/DET e1n/NM fe1rjem/FN meghalt/IGE ./. Ezt/NM mondta/IGE Balapandita/FN e1s/KOT bu2szke/MN arca1t/FN kezeinek/FN lo1tusza/FN mo2ge1/NU rejtve/HA si1rt/IGE ./. -/- O1h/IGE Balapandita/FN jaj/IGE nekem/NM !/! Te1ged/NM a/DET hu3se1g/FN szekre1nye1nek/FN hi1vnak/IGE e1s/KOT szerelmed/FN a1llando1sa1ga1ban/FN volt/IGE ingo1/HA lelkem/FN bizodalma/IGE !/! Akkor/NM Balapandita/FN bu2szke1n/FN kiegyenesedett/IGE :/: -/- Bizony/HA holtomig/FN hu3/MN maradok/MN Suryakanta1hoz/FN kinek/NM szi1vemet/FN adtam/IGE ./. Suryakanta1hoz/FN ki/IK a/DET fe1rfiak/FN ko2zt/NU olyan/NM volt/IGE mint/KOT a/DET ko2vek/FN ko2zt/NU az/DET adamanth/FN ,/, de/KOT meghalt/IGE ./. Suryakanta1hoz/IGE akit/NM ez/NM az/DET arcke1p/FN a1bra1zol/IGE ,/, aki/NM Indra1to1l/FN kapta/IGE me1lto1sa1ga1t/IGE ,/, a/DET tu3zto3l/FN heve1t/FN ,/, Jamato1l/MN haragja1t/FN ,/, e1s/KOT Ramato1l/FN a1llhatatossa1ga1t/FN ,/, de/KOT meghalt/IGE ./. Bizony/HA hu3/MN maradok/FN hozza1/NM e1s/KOT el/IK nem/HA hagyom/IGE emle1ke1t/FN valakie1rt/FN akinek/NM arca/FN mint/KOT a/DET szi1ne1szeke1/FN ,/,
hangja/FN mint/KOT a/DET here1lt/FN szolga1ke1/IGE ,/, viselkede1se/FN mint/KOT a/DET terhes/FN asszonyoke1/FN ,/, olyan/NM ./. Ezt/NM mondva1n/FN fe1lrehu1zta/IGE az/DET ajto1/FN arannyal/FN hi1mzett/FN fu2ggo2nye1t/IGE e1s/KOT kikia1ltott/IGE :/: -/- Oh/NNP Cseti/FN ,/, kedves/MN csele1dem/MN e1bredj/FN ./. Fehe1r/MN ruha1t/FN hozz/FN nekem/NM e1s/KOT szeza1mmagokat/FN ke1szi1ts/FN vi1zzel/FN e1s/KOT puha/MN darbha-fu2vet/FN ./. Akkor/NM Cseti/FN megjelent/IGE az/DET aranyfu2ggo2ny/FN nyi1la1sa1ban/FN e1s/KOT a1lomto1l/FN ta1gult/MN szemekkel/FN ne1zett/IGE a/DET szoba1ba/FN e1s/KOT halkan/MN szo1lt/IGE mint/KOT aki/NM fe1l/SZN ,/, hogy/KOT ve1lt/IGE a1lma1bo1l/FN o2nmaga1t/NM fele1breszti/MN :/: O1h/FN bu2szke/MN Balapandita/IGE a/DET fehe1r/MN ruha/FN o2zvegyek/FN ruha1ja/FN e1s/KOT szeza1mmagok/FN puha/MN darbha/IGE fu3vel/FN halotti/MN a1ldozat/FN ./. Mie1rt/HA ki1va1n/IGE mindezt/NM o1h/MN kira1lyno3m/MN mikor/FN ott/HA la1tom/IGE a/DET fe1rfit/FN kinek/NM melle1n/FN a/DET kira1lyi/FN e1kszerek/MN csillognak/FN ?/? -/- A/DET kira1lyi/MN e1kszereket/FN hordja/IGE me1g/HA a/DET test/FN de/KOT a/DET lelket/FN belo3le/NU kite1pte/IGE Yama/FN a/DET hala1l/FN ura/FN ko2te1lhurokkal/FN maga/NM uta1n/NU hu1zva1n/IGE ./. Szo1li1tsd/FN szolga1imat/FN kik/FN atya1m/FN ha1za1bo1l/FN jo2ttek/IGE velem/NM ,/, hogy/KOT engem/NM hozza1/HA visszavigyenek/IGE ./. Atya1m/FN ke1szi1tsen/MN ma1glya1t/MN sza1momra/FN ,/, hogy/KOT a/DET tu3z/FN fu2stje1vel/FN sza1llhassak/IGE a/DET meghalt/IGE Suryakanta/IGE uta1n/NU ,/, akihez/NM o2ro2kke1/HA hu3/MN maradok/FN ./.

Texts with PoS and subtags (tags for inflectional properties)

Ko2zel/MN hu1zo1dik/IGE_e3 e1letemhez/FN_PSe3_ALL ,/, s/KOT nem/HA tudom/IGE_Te1 ,/, mekkora/FN_SUB ,/, hol/HA ve1gzo3dik/IGE_e3 ,/, fal/FN nekem/NM_DAT o3/NM ,/, e1lo3/MN panaszfalom/FN ./. A/DET ritkulo1/MN levelesen/MN_ESS_MOD a1thullo1/MN fe1nyt/FN_ACC elke1pzelem/FN ,/, a/DET to2lgy/FN ,/, bu2kk/FN_PL ,/, szilfalevelek/FN_PL fodrozta/IGE_TMe3 fe1nyt/FN_ACC ,/, az/DET eso3uta1ni/INF pocsolya1khoz/FN_ALL hasonlo1/MN avart/IGE_Me3 ,/, vadleso3/MN magaslatait/FN_PSe3i_ACC ,/, amelyek/NM_PL tekintetem/IGE_TMe1 fo2lvezetne1k/IGE_Fe1 az/DET e1g/FN fa1tylaira/FN_PSe3i_SUB ./. Izgat/FN_ACC a/DET megoldott/MN rejte1lyu3/MN sza1nto1k/FN_PL uta1n/NU ,/, szeretne1k/IGE_Fe1 odafutni/INF a/DET to2bbiekto3l/FN_ABL ,/, szeretne1k/IGE_Fe1 meglesni/INF egy/DET vadat/FN_ACC ,/, hajtani/INF ,/, elveszteni/INF e1s/KOT u1jra/HA megtala1lni/INF ./. Szeretne1k/IGE_Fe1 elveszni/INF e1s/KOT si1rni/INF ,/, elo3keru2lni/INF ,/, ra1hajolni/INF a/DET korareggeli/MN tu3zhely/FN melege1re/FN_PSe3_SUB a/DET hazate1rt/MN hadifoglyok/FN_PL o2ro2me1vel/FN_PSe3_INS ./. Vonz/MN ,/, mert/KOT titkait/FN_PSe3i_ACC sohasem/HA fedte/IGE_TMe3 fo2l/IK elo3ttem/IGE_TMe1 e1s/KOT sohasem/HA hagytak/IGE_Mt3 ,/, hogy/KOT titkait/FN_PSe3i_ACC megtala1ljam/FN ./. A/DET bozo1tbo1l/FN_ELA kiugro1/MN szarvast/FN_ACC ki/IK la1tta/IGE_TMe3 ?/? Igaz/MN ,/, olyan/NM mint/KOT a/DET szappanrekla1mon/FN_SUP ,/, elso3/SZN la1bait/FN_PSe3i_ACC maga/NM ala1/NU hu1zva/HA ,/, szarvait/FN_PSe3i_ACC e1gre/FN_SUB vetve/HA szo2kell/FN ,/, repu2l/IGE olyan/NM fo2nse1gesen/MN_ESS_MOD ,/, hogy/KOT lelo3ve/HA is/KOT u1gy/HA bukik/IGE_e3 ,/, ahogy/HA a/DET szabadsa1g/FN nagy/MN vadja1nak/FN_PSe3_DAT bukni/INF illik/IGE_e3 ./. La1ttam/IGE_TMe1 a/DET szarvast/FN_ACC rabnak/MN_FOK_DAT ,/, fiatal/MN e1s/KOT megte1pa1zott/MN volt/IGE_Me3 ,/, mint/KOT akit/NM_ACC e1lve/HA fognak/IGE_t3 el/IK ,/, mint/KOT akit/NM_ACC ero3szakos/MN kezekkel/FN_PL_INS ,/, te1rdekkel/FN_PL_INS maguke1va1/NM_PSt3_POS_FAC tesznek/IGE_t3 ./. Nem/HA la1ttam/IGE_TMe1 szabadnak/IGE_t3 ./. Ha/KOT nem/HA lettem/IGE_TMe1 olyan/NM szabadonfuto1/MN ,/, szabadon/MN_SUP do3zso2lo3/MN ,/, szabadon/MN_SUP szenvedo3/MN ,/, ha1t/HA megla1tni/INF a1hi1toztam/IGE_TMe1 legala1bb/HA ./. Mert/KOT csak/HA a/DET gya1va/MN nyulat/FN_ACC ,/, a/DET sompolygo1/MN ro1ka1t/FN_ACC ,/, a/DET sze1tszalado1/MN egereket/FN_PL_ACC ismerem/IGE_Te1 ./. Puska1val/FN_PSe3_INS menjek/FN_PL oda/IK ?/? Ma1r/HA csak/HA fegyverrel/FN_INS juthatok/FN_PL oda/IK ,/, ahova/HA valamikor/HA u2res/MN ke1zzel/FN_INS akartam/IGE_TMe1 ?/? I1gy/KOT nyerjek/FN_PL biztonsa1got/FN_ACC ?/? Hu1zzon/FN_SUP ,/, vezessen/MN_ESS_MOD ,/, leskelo3dtessen/MN_ESS_MOD ,/, lo2vessen/MN_ESS_MOD ?/? Borral/FN_INS e1s/KOT friss/MN vadhu1ssal/FN_INS megki1na1ltasson/IGE_Pe3 ?/? Engedjen/IGE_Pe3 bo3ge1st/FN_ACC hallani/INF ,/, la1tni/INF hajnali/MN forra1st/FN_ACC ,/, megtudni/INF titkot/FN_ACC ?/? Nem/HA kell/IGE nekem/NM_DAT a/DET kutata1sbo1l/FN_ELA passzio1/FN ,/, vesze1lybo3l/FN_PSe3_ELA se1ta1lgata1s/FN !/! Vagy/KOT nincs/IGE ma1r/HA ez/NM az/DET erdo3/FN ?/? Vagy/KOT tala1n/HA benne/HA vagyok/IGE_e1 ?/? S/KOT mindig/HA egyma1sra/NM_SUB gondolunk/IGE_t1 ,/, amint/HA ja1rom/IGE_Te1 o2sve1nyeit/FN_PSe3i_ACC egyedu2l/HA ,/, u2res/MN ke1zzel/FN_INS ./.
Lendu2l/MN_ESS_MOD a/DET hab/FN s/KOT a/DET part/FN fala1n/FN_PSe3_SUP ezu2stsarkantyu1s/MN la1ba/FN_PSe3 dobban/MN_FOK_ESS_MOD :/: ne1zd/IGE_TPe2 ,/, fu2rdik/IGE_e3 a/DET fekete/MN la1ny/FN ,/, fekete/MN la1ny/FN fehe1r/MN habokban/FN_PL_INE ./. Elszenderu2lt/IGE_Me3 a/DET bu1/MN szeme1n/FN_PSe3_SUP ,/, hulla1mba/FN_ILL hull/FN ma/HA teste/FN_PSe3 ,/, lelke/FN_PSe3 ,/, hulla1mos/MN haju1/MN vo3lege1ny/FN milyen/NM ero3sen/MN_ESS_MOD a1to2lelte/IGE_TMe3 ./. De/KOT ne1zd/IGE_TPe2 :/: so2te1t/MN erdo3k/FN_PL ko2zo2tt/NU nagy/MN ,/, ordas/MN fellegek/FN_PL szakadnak/IGE_t3 s/KOT jo2n/IGE a1rja/FN_PSe3 za1poro2tt/MN go2rgeteges/MN hegyi/MN pataknak/FN_DAT ./. Haragos/MN ,/, szennyes/MN a1radat/FN ,/, a/DET medre1t/FN_PSe3_ACC karmok/FN_PL a1ssa1k/IGE_TPt3 ./. Ke1rd/IGE_TPe2 meg/IK a/DET ho2kkent/FN_CAU ga1takat/FN_PL_ACC ,/, ga1ncsolja1k/IGE_Tt3 a/DET vizek/IGE_e1 futa1sa1t/FN_PSe3_ACC ./. Ne/HA essen/MN_ESS_MOD folt/FN fe1nyes/MN haja1n/FN_PSe3_SUP ,/, iszapos/MN a1r/FN hozza1/NM_ALL ne/HA e1rjen/IGE_Pe3 :/: fekete/MN la1ny/FN tala1n/HA ,/, tala1n/HA utolszor/SZN fu2rdik/IGE_e3 -/- ho1fehe1ren/MN_ESS_MOD ./. BALA1ZS/FN BE1LA/FN :/: A/DET HA1ROM/FN HU3SE1GES/FN KIRA1LYLEA1NY/FN E1lt/IGE_Me3 egyszer/HA Kandrapura/FN_SUB va1rosa1ban/FN_INE egy/DET Suryakanta/FN nevu3/MN hatalmas/MN kira1ly/FN ,/, aki/NM ha1rom/SZN fo3tulajdonsa1ga/FN_PSe3 a1ltal/NU olyan/NM volt/IGE_Me3 a/DET fe1rfiak/FN_PL ko2zo2tt/NU mint/KOT a/DET ko2vek/FN_PL ko2zt/NU az/DET Adamanth/NNP :/: tu2ndo2klo3/MN ,/, tiszta/MN e1s/KOT keme1ny/MN ./. Tu2ndo2klo3/MN sze1pse1ge/FN_PSe3 minden/NM asszonyt/FN_ACC szi1ven/FN_SUP harapott/MN mint/KOT a/DET kobra/FN_SUB ./. Bu3ntelen/MN tisztasa1ga/FN_PSe3 emberfeletti/MN ero3t/FN_ACC adott/IGE_Me3 neki/NM_DAT ./. De/KOT keme1nyse1ge/FN_PSe3 volt/IGE_Me3 a/DET legnagyobb/MN_FOK ./. Lelke1nek/FN_PSe3_DAT fa1ja/FN_PSe3 elo3bbi/MN e1letekben/FN_PL_INE gyo2kerezett/IGE_Me3 e1s/KOT eze1rt/KOT mozdi1thatatlan/MN vira1gokke1nt/FN_FOR termettek/IGE_Mt3 rajta/NM_SUP tettek/IGE_Mt3 e1s/KOT szavak/FN_PL ,/, melyek/FN_PL ezer/SZN e1v/FN o1ta/NU ke1szen/MN_ESS_MOD voltak/IGE_Mt3 ./. Suryakanta/IGE_TMe3 kira1ly/FN csak/HA hullatta/IGE_TMe3 o3ket/NM_ACC ,/, de/KOT rajtuk/NM_SUP nem/HA va1ltoztathatott/IGE_Me3 ./. Suryakanta/IGE_TMe3 kira1ly/FN uta1lta/IGE_TMe3 a/DET szavakat/FN_PL_ACC e1s/KOT minden/NM ke1rde1sre/FN_SUB tettel/FN_INS felelt/IGE_Me3 ./. A/DET ko2nyo2rgo3k/FN_PL e1s/KOT ke1relmezo3k/FN_PL felett/NU ne1ma1n/FN_PSe3_SUP ne1zett/IGE_Me3 el/IK mozdulatlan/MN arccal/FN_INS de/KOT mire/HA hazate1rtek/IGE_Mt3 ke1re1su2k/FN_PSt3 teljesi1tve/HA volt/IGE_Me3 ./. Suryakanta/IGE_TMe3 kira1ly/FN u1gy/HA a1llt/IGE_Me3 az/DET emberek/FN_PL ko2zo2tt/NU mint/KOT az/DET erdo3/FN te1tova/MN hajlongo1/MN no2ve1nyei/FN_PSe3i ko2zo2tt/NU a/DET merev/MN istenszobor/FN ,/, melyben/FN_PSe3_INE a/DET szellem/FN e1rcbo2rto2nbe/FN_ILL za1rva/HA o2ro2k/MN ido3kre/FN_SUB egyforma/MN e1s/KOT mozdulatlan/MN ./. Egyszer/HA Suryakanta/IGE_TMe3 kira1ly/FN maga1hoz/NM_ALL hi1vatta/IGE_TMe3 felese1ge1t/FN_PSe3_ACC a/DET bu2szke/MN Balapandita1t/FN_ACC ./. Mikor/HA Balapandita/IGE_TMe3 a/DET terembe/FN_ILL le1pett/IGE_Me3 azon/NM halk/MN remege1s/FN futott/IGE_Me3 ve1gig/IK mint/KOT a/DET fu2ves/MN re1ten/FN_SUP ha/KOT a/DET felkelo3/MN hold/MN rea1lehel/FN_INS ./. Mert/KOT Balapandita/IGE_TMe3 sze1pse1ge1ben/FN_PSe3_INE a/DET hold/MN go2mbo2lyu3se1ge/FN_PSe3 ,/, futo1inda/FN hajla1sa/FN_PSe3 ,/, fu3/FN
remege1se/FN_PSe3 ,/, leve1l/FN ko2nnyu3se1ge/FN_PSe3 ,/, elefa1ntorma1ny/FN ve1konyula1sa/FN_PSe3 ,/, me1hek/FN_PL rajza1sa/FN_PSe3 e1s/KOT a/DET papaga1j/IGE_Pe2 melle/FN_PSe3 la1gysa1ga/FN_PSe3 egyesu2ltek/IGE_Mt3 mint/KOT egyma1st/NM_ACC re1g/MN kereso3/MN szerelmesek/FN_PL ./. -/- Mie1rt/HA hi1vat/FN_ACC uram/FN_PSe1 ,/, Suryakanta/IGE_TMe3 kira1ly/FN -/- ke1rdezte/IGE_TMe3 Balapandita/IGE_TMe3 ./. De/KOT Suryakanta/IGE_TMe3 nem/HA felelt/IGE_Me3 hanem/KOT odale1pett/IGE_Me3 a/DET sze1pcsi1po3ju2ho2z/FN_ALL e1s/KOT homlokon/FN_PL_SUP cso1kolta/IGE_TMe3 olyan/NM jeleket/FN_PL_ACC te1ve/HA mint/KOT a/DET bu1csu1zo1k/MN_PL ./. -/- Uram/FN Suryakanta/IGE_TMe3 -/- szo1lt/IGE_Me3 Balapandita/IGE_TMe3 akkor/HA -/- homlokon/FN_PL_SUP cso1kolsz/IGE_e2 e1s/KOT olyan/NM jeleket/FN_PL_ACC teszel/IGE_e2 mint/KOT a/DET bu1csu1zo1k/MN_PL ./. Vajon/KOT bu1csu1zni/INF akarsz/IGE_e2 to3lem/NM_ABL te/NM fe1rfitigris/FN ?/? Suryakanta/IGE_TMe3 nem/HA felelt/IGE_Me3 hanem/KOT elso3/SZN minisztere1hez/FN_PSe3_ALL e1s/KOT ifju1sa1ga1nak/FN_PSe3_DAT ta1rsa1hoz/FN_ALL fordult/IGE_Me3 ,/, mondva1n/FN_PSe3_SUP :/: -/- Razakosa/FN_PSe3 ke1szu2lj/IGE_Pe2 u1tra/FN_SUB ./. De/KOT ki1se1ret/FN_ACC ne/HA ko2vessen/MN_ESS_MOD e1s/KOT fullajta1rok/FN_PL elo3ttu2nk/IGE_t1 ne/HA fussanak/IGE_t3 ./. E/NM szavakbo1l/FN_PL_ELA Balapandita/IGE_TMe3 e1s/KOT Razakosa/FN_PSe3 mege1rtette1k/IGE_TMt3 hogy/KOT Suryakanta/IGE_TMe3 kira1ly/FN az/DET erdo3be/FN_ILL indul/IGE valamely/FN szent/MN liget/FN_ACC la1togata1sa1ra/FN_PSe3_SUB ,/, hogy/KOT a1ldozattal/FN_INS ,/, fu2rde1ssel/FN_INS e1s/KOT medita1cio1val/FN_PSe3_INS tiszti1tsa/IGE_TPe3 lelke1t/FN_PSe3_ACC e1s/KOT ereje1t/FN_PSe3_ACC no2velje/IGE_TPe3 ./. Akkor/NM_TEM Balapandita/IGE_TMe3 hangja1ra/FN_PSe3_SUB nedves/MN fa1tyolt/IGE_Me3 bori1tott/MN a/DET fa1jdalom/FN e1s/KOT i1gy/KOT szo1lt/IGE_Me3 :/: -/- Isten/FN veled/NM_INS Suryakanta/IGE_TMe3 ./. Bizony/HA visszajo2tto2d/MN u1gy/HA va1rom/IGE_Te1 ,/, mint/KOT halott/FN az/DET u1jraszu2lete1st/FN_ACC ./. Megfestetem/IGE_TMe1 ke1pedet/FN_ACC ,/, hogy/KOT a/DET hold/MN vila1gi1tson/FN_SUP lelkemre/FN_SUB mi1g/HA napja/FN_PSe3 visszate1r/FN ./. De/KOT Suryakanta/IGE_TMe3 kira1ly/FN nem/HA felelt/IGE_Me3 hanem/KOT ne1ma/MN ajakkal/FN_PL_INS e1s/KOT besze1lo3/MN szemmel/FN_INS ne1zte/IGE_TMe3 felese1ge1t/FN_PSe3_ACC akit/NM_ACC a/DET hu3se1g/FN szekre1nye1nek/FN_PSe3_DAT neveztek/IGE_Mt3 azta1n/HA elindult/IGE_Me3 a/DET palota/FN kapuja/IGE_Te3 fele1/NU ./. O3t/NM_ACC pedig/KOT ko2vette/IGE_TMe3 Razakosa/FN_PSe3 elso3/SZN minisztere/FN_PSe3 e1s/KOT ifju1sa1ga1nak/FN_PSe3_DAT ta1rsa/FN_PSe3 ./. Mikor/HA ma1r/HA me1lyen/MN_ESS_MOD ja1rtak/IGE_Mt3 az/DET erdo3ben/FN_INE ahol/HA a/DET Kadamba/FN_ILL fa/FN illata1to1l/FN_PSe3_ABL megdu2ho2do2tt/MN pe1zsmaszarvasok/FN_PL tu1rja1k/IGE_Tt3 magukat/NM_PSt3_ACC a/DET sziromhullato1/MN ja1zminbokrokon/FN_PL_SUP keresztu2l/HA ,/, Suryakanta/IGE_TMe3 kira1ly/FN hirtelen/MN mega1llt/IGE_Me3 ./. -/- Mit/NM_ACC la1tta1l/IGE_Me2 uram/FN_PSe1 ?/? -/- ke1rdezte/IGE_TMe3 Razakosa/FN_PSe3 ./. Suryakanta/IGE_TMe3 kira1ly/FN akkor/HA lehajolt/IGE_Me3 e1s/KOT mikor/HA megint/HA kiegyenesedett/IGE_Me3 jobb/MN_FOK keze1vel/FN_PSe3_INS egy/DET nagy/MN fekete/MN ki1gyo1/MN torka1t/FN_PSe3_ACC fogta/IGE_TMe3 ,/, bal/MN keze1vel/FN_PSe3_INS pedig/KOT egy/DET kis/MN nyulat/FN_ACC ,/, melyet/FN_ACC a/DET ki1gyo1/MN sza1ja1bo1l/FN_PSe3_ELA vett/IGE_Me3 ki/IK ./. Azuta1n/HA mind/NM a/DET ketto3t/SZN_ACC eldobta/IGE_TMe3 e1s/KOT tova1bb/IK akart/IGE_Me3 menni/INF ./.
Hanem/FN akkor/HA a/DET fekete/MN ki1gyo1/FN fela1gaskodott/IGE_Me3 elo3tte/NU e1s/KOT i1gy/KOT szo1lt/IGE_Me3 ./. -/- Mie1rt/HA tesz/IGE ezt/NM_ACC velem/NM_INS Suryakanta/IGE_TMe3 kira1ly/FN ?/? Nekem/FN a/DET nyulat/FN_ACC istenek/MN_PL rendelte1k/IGE_TMt3 ta1pla1le1komul/MN_ESS_MOD e1s/KOT ez/NM a/DET falat/FN_ACC melyet/FN_ACC most/HA sza1jambo1l/FN_PSe3_ELA kivette1l/IGE_Me2 e1hhala1lom/IGE_Te1 elo3l/NU volt/IGE_Me3 utolso1/MN menekve1sem/FN ./. Mert/KOT tudd/IGE_TPe2 meg/IK o1h/MN Suryakanta/IGE_TMe3 kira1ly/FN ,/, hogy/KOT harminc/SZN napig/FN_TER nem/HA ettem/IGE_TMe1 e1s/KOT a/DET hala1l/FN sarka/FN_PSe3 van/IGE fejemen/FN_PSe1_SUP ./. Tudsz/IGE_e2 e/NM nekem/NM_DAT ma1s/NM ta1pla1le1kot/FN_ACC adni/INF a/DET nyu1l/FN helyett/NU ?/? Suryakanta/IGE_TMe3 kira1ly/FN mozdulatlan/MN arccal/FN_INS ne1zte/IGE_TMe3 a/DET fekete/MN ki1gyo1t/FN_ACC e1s/KOT nem/HA felelt/IGE_Me3 ,/, hanem/KOT kihu1zta/IGE_TMe3 to3re1t/NM_ABL e1s/KOT izmos/MN barna/MN karja1bo1l/FN_PSe3_ELA kanyari1tott/IGE_Me3 egy/DET darabot/FN_ACC e1s/KOT azt/NM_ACC odadobta/IGE_TMe3 a/DET ki1gyo1/FN ele1/NU ./. Hanem/FN a/DET ki1gyo1/MN nem/HA nyu1lt/IGE_Me3 uta1na/NU ./. -/- Meghalt/IGE_Me3 -/- mondta/IGE_TMe3 Razakosa/FN_PSe3 ./. -/- Nem/HA mondtad/IGE_TMe2 ,/, hogy/KOT segi1teni/INF akarsz/IGE_e2 rajta/NM_SUP ,/, ha1t/HA eleresztette/IGE_TMe3 e1lete1t/FN_PSe3_ACC mert/KOT nem/HA tudta/IGE_TMe3 mire/HA va1rjon/IGE_Pe3 me1g/HA tova1bb/IK szenvedve/HA ./. Suryakanta/IGE_TMe3 kira1ly/FN mozdulatlan/MN arccal/FN_INS ne1zte/IGE_TMe3 a/DET hosszu1/MN ,/, fekete/MN hulla1t/FN_ACC ./. Lelke1t/FN_PSe3_ACC oly/NM keve1sse1/MN_FAC illette/IGE_TMe3 a/DET hala1l/FN mint/KOT a/DET vad/MN elefa1ntot/FN_ACC az/DET erdei/MN leve1l/FN mely/NM ra1ncos/MN ha1ta1ra/FN_PSe3_SUB hull/FN ./. E1ppen/HA a1t/IK akart/IGE_Me3 le1pni/INF rajta/NM_SUP ,/, hogy/KOT tova1bbmenjen/IGE_Pe3 mikor/HA hirtelen/MN megjelent/IGE_Me3 elo3tte/NU Ganesa/FN az/DET elefa1ntarcu1/MN isten/FN ./. O1ria1s/FN fu2lein/FN_PSe3i_SUP a1tizzott/IGE_Me3 a/DET nap/FN ,/, hogy/KOT tu2zes/MN korallbokrokke1nt/FN_FOR piroslottak/IGE_Mt3 benne/HA az/DET erek/FN_PL ./. Ro1zsaszi1n/FN_PSe3_SUP suga1r/FN orma1nya1t/FN_PSe3_ACC ta1ncban/FN_PSe3_INE emelte/IGE_TMe3 e1s/KOT i1gy/KOT szo1lt/IGE_Me3 :/: -/- Suryakanta/IGE_TMe3 o2lte1l/IGE_Me2 ./. Suryakanta/IGE_TMe3 mozdulatlan/MN szemmel/FN_INS ne1zett/IGE_Me3 Ganesa/FN kis/MN szemeibe/FN_ILL e1s/KOT i1gy/KOT felelt/IGE_Me3 ./. -/- Az/DET istenek/MN_PL o2ltek/IGE_Mt3 akik/NM_PL terme1szetemet/FN_ACC u1gy/HA termelte1k/IGE_TMt3 hogy/KOT uta1ljam/FN a/DET szavakat/FN_PL_ACC e1s/KOT tettekkel/FN_PL_INS besze1ljek/FN_PL ./. Lelkem/FN vira1ga/FN_PSe3 elo3bbi/MN e1letekben/FN_PL_INE gyo2keredzik/IGE e1s/KOT sem/KOT mozdulni/INF sem/KOT va1ltozni/INF nem/HA tud/IGE ./. Akkor/NM_TEM Ganesa/FN nevetett/IGE_Me3 :/: -/- Ha/KOT lelked/FN_PSe2 sem/KOT mozdulni/INF sem/KOT va1ltozni/INF nem/HA tud/IGE ,/, ha1t/HA majd/HA lete1pem/FN sza1ra1ro1l/FN_PSe3i_DEL ,/, hogy/KOT mozduljon/IGE_Pe3 e1s/KOT keme1nyse1ge/FN_PSe3 miatt/NU to2bbe1/HA gyilkossa1gba/FN_ILL ne/HA esse1k/IGE_Pe3 ./. Ezt/NM_ACC mondta/IGE_TMe3 nevetve/HA Ganesa/FN e1s/KOT ro1zsaszi1n/FN_PSe3_SUP hosszu1/MN suga1r/FN orma1nya1val/FN_PSe3_INS belenyu1lt/IGE_Me3 Suryakanta/IGE_TMe3 melle1be/FN_PSe3_ILL e1s/KOT lelke/FN_PSe3 vira1ga1t/FN_PSe3_ACC lete1pte/IGE_TMe3 sza1ra1ro1l/FN_DEL ./. Avval/FN_INS eltu3nt/IGE_Me3 ./.
A/DET kira1ly/FN melle1hez/FN_PSe3_ALL kapott/IGE_Me3 mintha/KOT nyi1l/FN fu1rta/IGE_TMe3 volna/IGE_Fe3 a1t/IK e1s/KOT felkia1ltott/IGE_Me3 :/: -/- O1h/FN jaj/IGE_Pe2 nekem/NM_DAT !/! Akkor/NM_TEM Razakosa/FN_PSe3 kiejtette/IGE_TMe3 keze1bo3l/FN_PSe3_ELA i1ja1t/FN_PSe3_ACC ,/, mert/KOT soha/HA senki/FN me1g/HA Suryakanta/IGE_TMe3 kira1ly/FN jajszava1t/FN_PSe3_ACC nem/HA hallotta/IGE_TMe3 ./. -/- Uram/FN Suryakanta/IGE_TMe3 -/- ke1rdezte/IGE_TMe3 halkan/MN_ESS_MOD mint/KOT aki/NM a/DET feleletto3l/FN_PSe3_ABL fe1l/SZN e1s/KOT a1lomnak/FN_DAT szeretne1/IGE_TFe3 hinni/INF o2nmaga1t/NM_ACC ./. -/- Uram/FN ,/, mi/NM baj/FN e1rt/IGE_Me3 ?/? Suryakanta/IGE_TMe3 ijedten/FN_INE fordi1totta/IGE_TMe3 feje1t/FN_PSe3_ACC Razakosa/FN_PSe3 fele1/NU e1s/KOT me1lyen/MN_ESS_MOD elpirult/IGE_Me3 ./. -/- Oh/NNP kedves/MN Razakosa/FN_PSe3 -/- mondta/IGE_TMe3 sze1gyenlo3sen/MN_ESS_MOD mosolyogva/HA ./. -/- Nagyon/HA fa1j/IGE sebem/FN ,/, melyet/FN_ACC ama/NM ki1gyo1/FN miatt/NU a/DET karomba/FN_ILL va1gtam/IGE_TMe1 ./. Akkor/NM_TEM Razakosa/FN_PSe3 kiejtette/IGE_TMe3 keze1bo3l/FN_PSe3_ELA tegeze1t/FN_PSe3_ACC e1s/KOT szeme1t/FN_PSe3_ACC ne1ma1n/FN_PSe3_SUP lesu2to2tte/IGE_TMe3 mert/KOT Suryakanta/IGE_TMe3 olyat/NM_ACC mondott/IGE_Me3 ,/, amit/NM_ACC egy/DET ko2zo2nse1ges/MN Ksatria1nak/FN_PSe3_DAT sem/KOT szabad/MN ,/, nemhogy/HA kira1lynak/FN_DAT ./. -/- Elejtetted/IGE_TMe2 i1jad/FN e1s/KOT tegezed/IGE ,/, kedves/MN Razakosa/FN_PSe3 -/- mondta/IGE_TMe3 Suryakanta/IGE_TMe3 e1s/KOT felemelte/IGE_TMe3 sietse1ggel/FN_INS mint/KOT valami/NM hi1zelgo3/MN szolga/MN ./. -/- Mie1rt/HA su2t/SZN le/IK a/DET szemed/FN_PSe2 Razakosa/FN_PSe3 e1s/KOT mie1rt/HA sa1padsz/IGE_e2 el/IK ?/? Tala1n/HA bizony/HA rajtam/IGE_TMe1 la1tsz/IGE_e2 valami/NM va1ltoza1st/FN_ACC ?/? Oh/FN milyen/NM va1ltoza1s/FN eshetne1k/IGE_Fe1 rajtam/IGE_TMe1 ?/? Suryakanta/IGE_TMe3 szapora/FN_SUB szavainak/FN_PSe3i_DAT cso2rge1se/FN_PSe3 u1gy/HA hullott/IGE_Me3 Razakosa/FN_PSe3 fu2le1be/FN_PSe3_ILL ,/, mint/KOT egy/DET o2sszeto2rt/MN dra1ga/FN ede1ny/FN cserepeinek/FN_PSe3i_DAT cso2ro2mpo2le1se/FN_PSe3 ./. -/- Legyu2nk/IGE_t1 vida1mak/FN_PL -/- mondta/IGE_TMe3 Suryakanta/IGE_TMe3 -/- e1s/KOT siessu2nk/IGE_t1 felese1gemhez/FN_ALL Balapandita1hoz/FN_ALL ,/, mert/KOT szemem/FN_PSe1 e1s/KOT karom/FN_PSe1 u1gy/HA ki1va1nja1k/IGE_Tt3 Balapandita1t/FN_PSe3_ACC ,/, hogy/KOT leva1lnak/IGE_t3 testemro3l/FN_DEL e1s/KOT hozza1/NM_ALL repu2lnek/FN_DAT ha/KOT nem/HA sietek/IGE_e1 ./. -/- Uram/FN Suryakanta/IGE_TMe3 kira1ly/FN -/- mondta/IGE_TMe3 Razakosa/FN_PSe3 -/- a/DET szent/MN liget/FN_ACC la1togata1sa1ra/FN_PSe3_SUB indultunk/IGE_t1 ,/, hogy/KOT a1ldozzunk/IGE_t1 az/DET isteneknek/FN_PL_DAT ./. Suryakanta/IGE_TMe3 elva1ltozott/IGE_Me3 arca/FN_PSe3 mint/KOT az/DET ijedt/MN gyermeke1/FN_POS olyan/NM lett/IGE_Me3 ./. -/- Igaza1n/HA aze1rt/KOT indultunk/IGE_t1 ?/? Jo1/MN ,/, nem/HA ba1nom/IGE_Te1 ./. Akkor/NM_TEM siessu2nk/IGE_t1 a/DET szent/MN ligetbe/FN_ILL ./. Alig/HA haladtak/IGE_Mt3 ne1ha1ny/NM le1pe1st/FN_ACC Suryakanta/IGE_TMe3 kira1ly/FN hirtelen/MN visszafordult/IGE_Me3 e1s/KOT i1gy/KOT szo1lt/IGE_Me3 ./. -/- A/DET bu2szke/MN Balapandita/IGE_TMe3 arcke1pem/FN_PSe1 elo3tt/NU u2l/IGE e1s/KOT va1r/IGE ./. Gondolatai/FN_PSe3i la1bam/FN_PSe1 ko2re1/NU hurkolo1dnak/IGE_t3 e1s/KOT nem/HA eresztenek/MN_PL tova1bb/IK ./.
Kedves/MN Razakosa/FN_PSe3 ke1rlek/IGE_Ie1 ez/NM egyszer/HA te1rju2nk/IGE_Pt1 inka1bb/HA vissza/IK ./. Akkor/NM_TEM Razakosa/FN_PSe3 hangos/MN si1ra1ssal/FN_INS feljajdult/IGE_Me3 :/: -/- O1h/FN Suryakanta/IGE_TMe3 ,/, Suryakanta/IGE_TMe3 mi/NM lett/IGE_Me3 belo3led/NU !/! Suryakanta/IGE_TMe3 elva1ltozott/IGE_Me3 szeme/FN_PSe3 hunyorgatva/HA ne1zett/IGE_Me3 mint/KOT egy/DET gya1va/MN kolduse1/MN_FAC ./. -/- Mie1rt/HA jajgat/FN_ACC Razakosa/FN_PSe3 ?/? Tala1n/HA valami/NM va1ltoza1s/FN esett/IGE_Me3 rajtam/IGE_TMe1 ?/? Mife1le/MN va1ltoza1s/FN eshetne1k/IGE_Fe1 rajtam/IGE_TMe1 ,/, holott/IGE_Me3 e1n/NM vagyok/IGE_e1 Suryakanta/IGE_TMe3 kira1ly/FN ,/, fe1rfiak/FN_PL ko2zt/NU olyan/NM mint/KOT a/DET ko2vek/FN_PL ko2zt/NU az/DET adamanth/FN ./. E1n/NM vagyok/IGE_e1 a/DET mozdi1thatatlan/MN Suryakanta/IGE_TMe3 ./. Tova1bb/IK nem/HA besze1lt/IGE_Me3 a/DET kira1ly/FN hanem/KOT arca/FN_PSe3 ele1/NU csapta/IGE_TMe3 kezeit/FN_PSe3i_ACC e1s/KOT hangosan/MN_ESS_MOD felzokogott/IGE_Me3 :/: -/- Oh/NNP Razakosa/FN_PSe3 ifju1sa1gomnak/FN_DAT ta1rsa/FN_PSe3 ,/, nem/HA vagyok/IGE_e1 e1n/NM to2bbe1/HA Suryakanta/IGE_TMe3 ,/, se/HA adamanth/FN a/DET fe1rfiak/FN_PL ko2zo2tt/NU !/! -/- Uram/FN -/- felelte/IGE_TMe3 Razakosa/FN_PSe3 -/- kinek/NM_DAT szeme1bo3l/FN_PSe3_ELA Suryakanta/IGE_TMe3 ko2nnyei/FN_PSe3i elo3hi1vta1k/IGE_TMt3 testve1reiket/FN_ACC mint/KOT ahogy/HA egy/DET felbukkano1/MN hangya/FN elo3hi1vja/IGE_Te3 mind/NM a/DET to2bbit/FN_PSe3i_ACC ./. -/- Uram/FN ez/NM Ganesa/FN bu2ntete1se/FN_PSe3 a/DET ki1gyo1e1rt/FN_CAU ./. Lete1pte/IGE_TMe3 lelked/FN_PSe2 vira1ga1t/FN_PSe3_ACC sza1ra1ro1l/FN_DEL e1s/KOT most/HA mint/KOT szakasztott/IGE_Me3 lo1tusz/IGE_e2 hulla1mos/MN tavon/FN_SUP inog/IGE e1s/KOT ha1nyko1dik/IGE_e3 ,/, re1szeg/MN ta1ncoske1nt/FN_FOR ,/, bizonytalanul/MN_ESS_MOD ./. -/- Jaj/ISZ ,/, jaj/IGE_Pe2 Razakosa/FN_PSe3 segi1ts/FN rajtam/IGE_TMe1 ,/, kedves/MN Razakosa/FN_PSe3 !/! Fogd/MN meg/IK lelkemnek/FN_DAT lelke1t/FN_PSe3_ACC ,/, hogy/KOT megint/HA a1llando1/MN legyen/IGE_Pe3 ./. Szo1li1tsd/IGE_TPe2 neve1n/FN_PSe3_SUP ,/, hogy/KOT horgonya/FN_PSe3 legyen/IGE_Pe3 a/DET ne1v/FN ./. Oh/NNP mondd/IGE_TPe2 elo3/IK nekem/NM_DAT ,/, hogy/KOT ki/IK vagyok/IGE_e1 ,/, hogy/KOT azt/NM_ACC ko2vessem/FN ,/, mint/KOT vezeto3je1t/FN_PSe3_ACC a/DET vak/MN ./. -/- Nem/HA tudom/IGE_Te1 ,/, hogy/KOT ki/IK vagy/KOT Suryakanta/IGE_TMe3 kira1ly/FN ./. -/- Hiszen/KOT miniszterem/IGE_Te1 vagy/KOT e1s/KOT ifju1sa1gomnak/FN_DAT ta1rsa/FN_PSe3 !/! -/- Ifju1sa1godnak/IGE_t3 ta1rsa/FN_PSe3 vagyok/IGE_e1 Suryakanta/IGE_TMe3 :/: fe1rfibara1t/FN_PSe3_ACC ./. Tudom/IGE_Te1 szavaidat/FN_ACC e1s/KOT tetteidt/IGE_Me3 tudom/IGE_Te1 ./. Te1ged/NM_ACC nem/HA tudlak/IGE_e1 ./. -/- Mie1rt/HA szeret/IGE ha1t/HA engem/NM_ACC ,/, jaj/IGE_Pe2 !/! -/- Szavake1rt/IGE_Me3 e1s/KOT tetteke1rt/FN_CAU szeret/IGE a/DET bara1t/FN ,/, Suryakanta/IGE_TMe3 kira1ly/FN ./. Szavak/FN_PL e1s/KOT tettek/IGE_Mt3 mo2ge1/NU asszony/FN tud/IGE szeretni/INF ./. -/- O1h/FN bu2szke/MN Balapandita/IGE_TMe3 ,/, hu3se1g/FN szekre1nye/FN_PSe3 !/! -/- I1gy/KOT kia1ltott/IGE_Me3 fel/IK akkor/HA Suryakanta/IGE_TMe3 ./. -/- Arcke1pem/FN elo3tt/NU u2lsz/IGE_e2 e1s/KOT va1rsz/IGE_e2 ra1m/NM_SUB e1s/KOT szemedben/FN_INE o3rzo2d/IGE_Te2 re1gi/MN arcomat/FN_ACC ./.
Szemedbo3l/FN_ELA fogom/IGE_Te1 azt/NM_ACC elo3venni/INF ,/, hogy/KOT u1jra/HA magamra/FN_SUB o2ltsem/FN ./. Oh/NNP gyeru2nk/IGE_t1 Razakosa/FN_PSe3 ,/, siessu2nk/IGE_t1 Balapandita1hoz/FN_ALL ./. Akkor/NM_TEM Suryakanta/IGE_TMe3 kira1ly/FN elindult/IGE_Me3 hazafele1/HA az/DET erdo3n/FN_SUP ,/, o3t/NM_ACC pedig/KOT ko2vette/IGE_TMe3 elso3/SZN minisztere/FN_PSe3 e1s/KOT ifju1sa1ga1nak/FN_PSe3_DAT ta1rsa/FN_PSe3 ./. De/KOT ha/KOT Razakosa/FN_PSe3 nem/HA tudta/IGE_TMe3 volna/IGE_Fe3 ,/, hogy/KOT ki/IK le1pked/IGE elo3tte/NU az/DET erdei/MN o2sve1nyen/FN_SUP ,/, ja1ra1sa1ro1l/FN_PSe3_DEL ra1/HA nem/HA ismert/IGE_Me3 volna/IGE_Fe3 Suryakanta/IGE_TMe3 kira1lyra/FN_SUB ./. Mert/KOT aki/NM me1g/HA tegnap/HA u1gy/HA ja1rt/IGE_Me3 mint/KOT a/DET vad/MN elefa1nt/IGE_Me3 a/DET tango1/MN bozo1tban/FN_INE ,/, az/DET ma1ma/FN_PSe3 u1gy/HA ja1r/IGE mint/KOT az/DET u1jon/IGE_Pe3 vedlett/IGE_Me3 ki1gyo1/MN ,/, amely/NM bo3re1t/FN_PSe3_ACC fe1ltve1n/FN_PSe3_SUP kanyarogva/HA keru2l/IGE minden/NM kavicsot/FN_ACC ./. E1s/KOT aki/NM tegnap/HA u1gy/HA haladt/IGE_Me3 ce1lja/FN_PSe3 fele1/NU ,/, mint/KOT Krisna/IGE_Fe3 kilo3tt/IGE_Me3 nyila/FN_PSe3 ,/, az/DET te1tova1zva/HA meg-mega1llt/IGE_Me3 most/HA ,/, jobbra-balra/HA ne1zve/HA ,/, mint/KOT a/DET szerelmes/MN asszony/FN aki/NM elo3szo2r/HA szo2kik/IGE_e3 kedvese1hez/FN_PSe3_ALL e1s/KOT la1bperceinek/FN_PSe3i_DAT cso2rge1se/FN_PSe3 riasztja/IGE_Te3 az/DET e1ji/MN csendben/FN_INE ./. -/- O1h/FN Suryakanta/IGE_TMe3 ,/, Suryakanta/IGE_TMe3 mi/NM lett/IGE_Me3 belo3led/NU !/! Ma1r/HA esteledett/IGE_Me3 mikor/HA a/DET palota/FN ele1/NU e1rtek/IGE_e1 ./. Ma1r/HA a/DET nap/FN ura/FN_PSe3 va1ndoru1tja1nak/FN_PSe3_DAT ve1ge1n/FN_PSe3_SUP bete1rt/IGE_Me3 az/DET alkony/FN bi1bor/FN palota1ja1ba/FN_PSe3_ILL de/KOT onnan/HA is/KOT kiu3zetve1n/FN_PSe3_SUP a/DET tengerbe/FN_ILL vetette/IGE_TMe3 maga1t/NM_ACC ./. Ma1r/HA a/DET hold/MN ezu2stvize/FN_PSe3 csurgott/IGE_Me3 le/IK a/DET palota/FN malachit/FN_PSe3i_ACC le1pcso3in/FN_PSe3i_SUP e1s/KOT a/DET ta1ncolo1/MN pa1va1k/FN_PL kapkodta1k/IGE_TMt3 la1bukat/FN_PSt3_ACC ,/, hogy/KOT meg/IK ne/HA a1zze1k/IGE_TPt3 ./. Ma1r/HA minden/NM fa1klya/FN kialudt/IGE_Me3 csak/HA a/DET legbelso3/MN szoba1ban/FN_INE e1gett/IGE_Me3 egy/DET illatozo1/MN la1mpa/FN ,/, mint/KOT a/DET so2te1t/MN palota1nak/FN_PSe3_DAT vila1gi1to1/MN szi1ve/FN_PSe3 ./. Balapandita/IGE_TMe3 virrasztott/IGE_Me3 Suryakanta/IGE_TMe3 kira1ly/FN arcke1pe/FN_PSe3 elo3tt/NU ./. Mikor/HA a/DET kira1ly/FN remego3/MN ke1zzel/FN_INS fe1lrehu1zta/IGE_TMe3 a/DET fu2ggo2nyt/FN_ACC e1s/KOT a/DET szoba1ba/FN_ILL le1pett/IGE_Me3 volna/IGE_Fe3 ,/, a/DET bu2szke/MN Balapandita/IGE_TMe3 felugrott/IGE_Me3 az/DET oroszla1nla1bu1/MN pa1rna1ro1l/FN_DEL ./. -/- Ki/IK vagy/KOT te/NM ,/, e1s/KOT hogy/KOT mersz/IGE_e2 a/DET kira1lyno3/MN szoba1ja1ba/FN_PSe3_ILL le1pni/INF ?/? -/- Kia1ltotta/IGE_TMe3 ha1traszegett/IGE_Me3 fejjel/FN_INS haragosan/MN_ESS_MOD ./. -/- Suryakanta/IGE_TMe3 vagyok/IGE_e1 ,/, a/DET fe1rjed/IGE_TMe2 ./. Nem/HA ismersz/IGE_e2 ?/? -/- Szemtelen/MN csalo1/MN te/NM !/! Suryakanta/IGE_TMe3 hangja/FN_PSe3 olyan/NM mint/KOT a/DET bagzo1/MN tigrise1/MN_FAC ./. De/KOT a/DET te/NM hangod/IGE_Te2 mint/KOT a/DET here1lt/FN_CAU szolga1ke1/FN_POS u1gy/HA remeg/FN ./. Akkor/NM_TEM a/DET kira1ly/FN egy/DET le1pe1ssel/FN_INS ko2zelebb/MN le1pett/IGE_Me3 a/DET la1mpa1hoz/FN_ALL e1s/KOT i1gy/KOT szo1lt/IGE_Me3 :/:
-/- Oh/NNP sze1pcsi1po3ju3/MN Balapandita/IGE_TMe3 ismerj/IGE_Pe2 meg/IK ,/, mert/KOT e1n/NM vagyok/IGE_e1 a/DET fe1rjed/IGE_TMe2 Suryakanta/IGE_TMe3 kira1ly/FN ./. Akkor/NM_TEM Balapandita/IGE_TMe3 egy/DET le1pe1st/FN_ACC ha1trale1pett/IGE_Me3 e1s/KOT i1gy/KOT szo1lt/IGE_Me3 :/: Suryakanta/IGE_TMe3 arca/FN_PSe3 mint/KOT a/DET Himala1ja/FN_PSe3 szikla1ja/FN_PSe3 keme1ny/MN e1s/KOT mozdulatlan/MN a/DET te/NM arcod/FN_PSe2 puha/MN e1s/KOT va1ltozo1/MN mint/KOT a/DET szi1ne1szeke1/IGE_TFe3 ./. Jaj/ISZ !/! -/- Oh/NNP Balapandita/IGE_TMe3 ,/, Balapandita/IGE_TMe3 ismerj/IGE_Pe2 meg/IK !/! E1n/NM vagyok/IGE_e1 Suryakanta/IGE_TMe3 ./. -/- Suryakanta/IGE_TMe3 kira1ly/FN ma/HA elindult/IGE_Me3 az/DET erdo3be/FN_ILL vezekelni/INF e1s/KOT Suryakanta/IGE_TMe3 sza1nde1ka1t/FN_PSe3_ACC soha/HA meg/IK nem/HA ma1si1tja/IGE_Te3 ./. Jaj/ISZ !/! -/- O1h/FN Balapandita/IGE_TMe3 visszate1rtem/IGE_TMe1 hozza1d/NM_ALL ,/, mert/KOT va1gyo1dtam/IGE_TMe1 uta1nad/NU ./. -/- Suryakanta/IGE_TMe3 a/DET ce1l/FN ,/, a/DET mozdulatlan/MN e1s/KOT ve1gso3/MN ./. Nekem/FN kell/IGE o3hozza1/MN_FAC mennem/IGE_INRe1 ./. O3/NM nem/HA jo2n/IGE e1nhozza1m/FN soha/HA ./. Jaj/ISZ !/! -/- Eljo2ttem/IGE_TMe1 hozza1d/NM_ALL ,/, hogy/KOT ta1masztd/IGE_TPe2 meg/IK gyo2kereszakadt/IGE_Me3 ingo1/MN lelkemet/FN_ACC ./. -/- Jaj/ISZ Suryakanta/IGE_TMe3 a/DET szikla/FN ,/, melyhez/FN_ALL az/DET e1n/NM lelkem/FN_PSe1 ingo1/MN hajo1ja1t/FN_PSe3_ACC ko2to2ttem/IGE_TMe1 ./. Akkor/NM_TEM a/DET kira1ly/FN me1g/HA egy/DET le1pe1ssel/FN_INS ko2zelebb/MN le1pett/IGE_Me3 az/DET illatozo1/MN la1mpa1hoz/FN_ALL e1s/KOT i1gy/KOT szo1lt/IGE_Me3 :/: -/- O1h/FN Balapandita/IGE_TMe3 ne1zz/IGE_Pe2 ra1m/NM_SUB ,/, ne1zzed/IGE_TMe2 a/DET kira1lyi/MN e1kszereket/FN_PL_ACC mellemen/FN_INE ./. E1n/NM vagyok/IGE_e1 a/DET fe1rjed/IGE_TMe2 Suryakanta/IGE_TMe3 ./. Akkor/NM_TEM Balapandita/IGE_TMe3 me1g/HA egy/DET le1pe1ssel/FN_INS ha1trale1pett/IGE_Me3 e1s/KOT i1gy/KOT szo1lt/IGE_Me3 :/: -/- Ha/KOT te/NM Suryakanta/IGE_TMe3 neve1t/FN_PSe3_ACC e1s/KOT kira1lyi/MN e1kszereit/FN_PSe3i_ACC viseled/IGE_TMe2 akkor/HA Suryakanta/IGE_TMe3 helye1re/FN_PSe3_SUB le1pte1l/IGE_Me2 ebben/NM_INE az/DET e1letben/FN_INE e1s/KOT Suryakanta/IGE_TMe3 kira1ly/FN ,/, a/DET fe1rfitigris/FN ,/, az/DET e1n/NM fe1rjem/FN meghalt/IGE_Me3 ./. Ezt/NM_ACC mondta/IGE_TMe3 Balapandita/IGE_TMe3 e1s/KOT bu2szke/MN arca1t/FN_PSe3_ACC kezeinek/FN_PSe3i_DAT lo1tusza/FN_PSe3 mo2ge1/NU rejtve/HA si1rt/IGE_Me3 ./. -/- O1h/FN Balapandita/IGE_TMe3 jaj/IGE_Pe2 nekem/NM_DAT !/! Te1ged/NM_ACC a/DET hu3se1g/FN szekre1nye1nek/FN_PSe3_DAT hi1vnak/IGE_t3 e1s/KOT szerelmed/IGE_TMe2 a1llando1sa1ga1ban/FN_INE volt/IGE_Me3 ingo1/MN lelkem/FN_PSe1 bizodalma/FN_PSe3 !/! Akkor/NM_TEM Balapandita/IGE_TMe3 bu2szke1n/FN_PSe3_SUP kiegyenesedett/IGE_Me3 :/: -/- Bizony/HA holtomig/FN_TER hu3/MN maradok/FN_PL Suryakanta1hoz/FN_ALL kinek/NM_DAT szi1vemet/FN_PSe1_ACC adtam/IGE_TMe1 ./. Suryakanta1hoz/FN_ALL ki/IK a/DET fe1rfiak/FN_PL ko2zt/NU olyan/NM volt/IGE_Me3 mint/KOT a/DET ko2vek/FN_PL ko2zt/NU az/DET adamanth/FN ,/, de/KOT meghalt/IGE_Me3 ./.
Suryakanta1hoz/FN_ALL akit/NM_ACC ez/NM az/DET arcke1p/FN a1bra1zol/IGE ,/, aki/NM Indra1to1l/FN_PL_ABL kapta/IGE_TMe3 me1lto1sa1ga1t/FN_PSe3_ACC ,/, a/DET tu3zto3l/FN_PL_ABL heve1t/FN_PSe3_ACC ,/, Jamato1l/FN_PL_ABL haragja1t/FN_PSe3_ACC ,/, e1s/KOT Ramato1l/FN_PL_ABL a1llhatatossa1ga1t/FN_PSe3_ACC ,/, de/KOT meghalt/IGE_Me3 ./. Bizony/HA hu3/MN maradok/FN_PL hozza1/NM_ALL e1s/KOT el/IK nem/HA hagyom/FN emle1ke1t/FN_PSe3_ACC valakie1rt/FN_CAU akinek/NM_DAT arca/FN_PSe3 mint/KOT a/DET szi1ne1szeke1/IGE_TFe3 ,/, hangja/FN_PSe3 mint/KOT a/DET here1lt/FN_CAU szolga1ke1/FN_POS ,/, viselkede1se/FN_PSe3 mint/KOT a/DET terhes/MN asszonyoke1/FN_POS ,/, olyan/NM ./. Ezt/NM_ACC mondva1n/FN_PSe3_SUP fe1lrehu1zta/IGE_TMe3 az/DET ajto1/FN arannyal/FN_INS hi1mzett/IGE_Me3 fu2ggo2nye1t/FN_PSe3_ACC e1s/KOT kikia1ltott/IGE_Me3 :/: -/- Oh/NNP Cseti/IGE_Te3 ,/, kedves/MN csele1dem/FN e1bredj/IGE_Pe2 ./. Fehe1r/MN ruha1t/FN_ACC hozz/IGE_Pe2 nekem/NM_DAT e1s/KOT szeza1mmagokat/FN_PL_ACC ke1szi1ts/FN vi1zzel/FN_INS e1s/KOT puha/MN darbha-fu2vet/FN_ACC ./. Akkor/NM_TEM Cseti/FN_PSe3 megjelent/IGE_Me3 az/DET aranyfu2ggo2ny/FN nyi1la1sa1ban/FN_PSe3_INE e1s/KOT a1lomto1l/FN_ABL ta1gult/MN szemekkel/FN_PL_INS ne1zett/IGE_Me3 a/DET szoba1ba/FN_ILL e1s/KOT halkan/MN_ESS_MOD szo1lt/IGE_Me3 mint/KOT aki/NM fe1l/SZN ,/, hogy/KOT ve1lt/IGE_Me3 a1lma1bo1l/FN_PSe3_ELA o2nmaga1t/NM_ACC fele1breszti/MN :/: O1h/FN bu2szke/MN Balapandita/IGE_TMe3 a/DET fehe1r/MN ruha/FN o2zvegyek/FN_PL ruha1ja/FN_PSe3 e1s/KOT szeza1mmagok/FN_PL puha/MN darbha/HA fu3vel/FN_INS halotti/MN a1ldozat/FN ./. Mie1rt/HA ki1va1n/IGE mindezt/NM_ACC o1h/MN kira1lyno3m/FN mikor/HA ott/HA la1tom/IGE_Te1 a/DET fe1rfit/FN_ACC kinek/NM_DAT melle1n/FN_PSe3_SUP a/DET kira1lyi/MN e1kszerek/FN_PL csillognak/IGE_t3 ?/? -/- A/DET kira1lyi/MN e1kszereket/FN_PL_ACC hordja/IGE_Te3 me1g/HA a/DET test/FN de/KOT a/DET lelket/FN_ACC belo3le/HA kite1pte/IGE_TMe3 Yama/FN a/DET hala1l/FN ura/FN_PSe3 ko2te1lhurokkal/FN_PL_INS maga/NM uta1n/NU hu1zva1n/FN_PSe3_SUP ./. Szo1li1tsd/IGE_TPe2 szolga1imat/FN_ACC kik/IGE_Tt3 atya1m/FN ha1za1bo1l/FN_PSe3_ELA jo2ttek/IGE_Mt3 velem/NM_INS ,/, hogy/KOT engem/NM_ACC hozza1/NM_ALL visszavigyenek/MN_PL ./. Atya1m/FN ke1szi1tsen/IGE_Pe3 ma1glya1t/FN_PSe3_ACC sza1momra/FN_SUB ,/, hogy/KOT a/DET tu3z/FN fu2stje1vel/FN_PSe3_INS sza1llhassak/MN_PL a/DET meghalt/IGE_Me3 Suryakanta/IGE_TMe3 uta1n/NU ,/, akihez/NM_ALL o2ro2kke1/HA hu3/MN maradok/FN_PL ./.
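
The samples above follow a simple machine-readable convention: each token has the form word/TAG, inflectional subtags are appended to the PoS tag with underscores (e.g. va1rosa1ban/FN_INE is a noun in the inessive case), and accented Hungarian letters are transliterated as vowel plus digit (a1, o2, o3, etc.). As a minimal illustration of how this format can be consumed, the Python sketch below splits a tagged line into (word, PoS, subtags) triples. It is not part of Brill's tagger or of the software used in this study, and the helper name parse_token is hypothetical.

    # Illustrative sketch only: parse the appendix's word/TAG annotation.
    def parse_token(token):
        # rpartition keeps punctuation tokens such as ./. and -/- intact:
        # the word is everything before the last '/', the tag everything after.
        word, _, tag = token.rpartition("/")
        pos, *subtags = tag.split("_")   # 'FN_PSe3_ACC' -> 'FN', ['PSe3', 'ACC']
        return word, pos, subtags

    line = "E1lt/IGE_Me3 egyszer/HA Kandrapura/FN_SUB va1rosa1ban/FN_INE egy/DET kira1ly/FN ./."
    for tok in line.split():
        print(parse_token(tok))
    # ('E1lt', 'IGE', ['Me3']), ('egyszer', 'HA', []), ..., ('.', '.', [])

Triples of this kind are all that is needed for the per-tag evaluation reported earlier: the pos and subtags fields of the tagger output are simply compared against the corresponding fields of the manually annotated key.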