Data-driven Dependency Parsing for Romanian

Mihaela Călăcean

Uppsala University
Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)
Master's thesis in Computational Linguistics, 40 credits
August 27, 2008
Supervisor: Joakim Nivre, Växjö University and Uppsala University

Abstract

This thesis presents the first data-driven dependency parser for Romanian. The parser was trained and tuned using MaltParser, a language-independent parser-generator framework. The Romanian Dependency Treebank, which constitutes the material used for training and evaluation of the parser, is also presented for the first time. The best parsing accuracy, 88.6% measured as Labeled Attachment Score, is obtained using the Nivre arc-eager algorithm together with a feature model specific to Romanian. The results are similar to those reported for other languages using the same system. The different types of errors produced by the parser are classified, leading to directions for improving the parsing accuracy and for future expansion of the treebank.

Sammandrag

This thesis presents the first data-driven dependency parser for Romanian. MaltParser, a language-independent parsing system, was used to train and fine-tune the parser. The Romanian Dependency Treebank served as the training and evaluation material, and is presented here for the first time. Nivre's arc-eager algorithm, together with a feature model specific to Romanian, achieved the highest parsing accuracy, 88.6% (labeled attachment score). The results are on a par with those for other languages parsed with this software. The parsing errors have been analyzed and classified with a view to improving parsing accuracy and possibly extending the treebank in the future.

Acknowledgements

I would like to thank my supervisor, Professor Joakim Nivre, for his encouraging guidance, help and useful comments. I am very grateful to Professor Florentina Hristea from the University of Bucharest for all the details regarding the BALRIC-LING project. I also had a fruitful correspondence with Olga Pustylnikov from the University of Bielefeld. Special thanks go to Beáta Megyesi for all her excellent pieces of advice. I would also like to thank my husband, Gelu, for proofreading this thesis, writing his own Ph.D. thesis and changing his job, all at the same time. Thanks to my son, Sebastian, for not liking it in daycare. I would still be taking courses if not for his tantrums.


Contents

Abstract
Acknowledgements
Introduction
    Purpose
    Outline
The Romanian Dependency Treebank
    Treebank Data
    Dependency Structures in RDT
    Evaluation of RDT
    Comparison with other dependency treebanks
MaltParser
    Architecture of the MaltParser System
        Deterministic Parsing Algorithms
        Learning Algorithm
        Feature Models
    Previous Projects using MaltParser
    Comparison with other dependency parsers
Experiments
    Formatting and Encoding
    Data Sets
    Evaluation Metrics
    Experimental Setup
        Feature Model for Romanian
    Results
    Error Analysis
Conclusion
    Summary
    Future Research
Bibliography
List of Figures
List of Tables

Introduction

In "The Hitchhiker's Guide to the Galaxy", a small, yellow and leech-like fish that you stick in your ear, called the Babel fish, is capable of translating any language into any other. Obviously, no tangible device possesses the properties of this imaginary fish at the present time. As such, it represents perhaps the dream of many people in the field of computational linguistics. Like many other natural language processing (NLP) systems, this odd Babel fish would probably have to undertake a syntactic analysis of the input it receives. This analysis is called parsing, which can be defined as the process of assigning a syntactic representation to a natural language sentence. Parsers, the modules that perform this process, are found in many different NLP applications, ranging from spoken dialogue systems to information extraction and machine translation programs.

The syntactic representation assigned to a sentence through parsing is structured according to a formalism, such as a phrase structure or dependency structure formalism. Constituents and phrases are the building blocks of phrase structure (or constituency) grammar, and parsing a sentence according to this formalism essentially outputs a tree with phrase structure information. This formalism has been very popular, especially for describing English. Dependency grammar, represented by Tesnière (1959), has been attracting many adherents, mainly because of its ability to handle free word order languages. Parsing a sentence according to the dependency structure formalism creates a dependency graph made of lexical nodes (i.e. words) linked by dependencies. A dependency is a relation between two words, one of them being the head or regent and the other the dependent. In this thesis, I will adopt parsing with dependency structures, i.e. dependency parsing.


Data-driven parsing systems have become increasingly popular, given the alleged advantages of shorter development time and better performance. Statistical parsing is often based on non-deterministic parsers that derive a set of candidate analyses ranked according to generative probabilistic methods (Collins et al., 1999). Deterministic parsing has been used with dependency parsing (Nivre and Scholz, 2004), guiding the parser with a classifier trained on annotated data. As the name suggests, these systems require the existence of data resources such as treebanks, that is, corpora annotated with syntactic structures. Although not yet available for all languages, and often rather small in size, many treebanks annotated according to constituency or dependency grammar exist today. In this thesis, I will use MaltParser, one of the data-driven parsing systems currently available for dependency parsing (Nivre et al., 2006a). As training material for the classifier I will use a dependency-annotated treebank called the Romanian Dependency Treebank (RDT).

Purpose

The purpose of this thesis is to use the MaltParser system to train and tune a parser for Romanian, given the Romanian Dependency Treebank. The parser is to be evaluated on test data extracted from the dependency treebank and compared with the results for other languages. A detailed description of RDT will also be presented. After training and tuning the parser, I will also analyze and classify the errors produced by the parsing model. To the best of the author's knowledge, this is the first parser for Romanian ever made, and the first time this dependency treebank has been used for training an NLP application. Moreover, the treebank has never been described in any publicly available resource.


Outline

This thesis is structured as follows. Chapter 2 concentrates on RDT, giving a detailed overview of the annotation scheme, the adopted format and the set of tags and dependency relations used in the treebank, all accompanied by examples and comments. An evaluation of the material is also included in this chapter. The last section of the chapter contains a comparison of RDT with other available dependency treebanks. Chapter 3 describes the architecture of the MaltParser system, offering an overview of the deterministic parsing algorithms available in the system, of the learning algorithm and of the feature model specifications. It also presents previous projects in which the system was used, as well as a comparison with other available dependency parsers. Chapter 4 covers the experiments I have performed in order to train and tune the parser for Romanian, taking into consideration the input format and encoding required by MaltParser, the experimental conditions, the parsing algorithms and the feature model used. This chapter also presents the results, comparing them to those of other parsers trained with the system, and analyzes the errors produced by the parser. Finally, Chapter 5 presents a summary of the thesis and sets some guidelines for future research.

The Romanian Dependency Treebank

The development of NLP tools requires annotated corpora, which are also a useful material in linguistic research. A syntactically annotated corpus, or treebank, may represent dependency structures, phrase structures or possibly even hybrid structures. Although very useful for a multitude of purposes, building such treebanks has the major drawback of requiring considerable financial resources and large amounts of time. A variety of treebanks are already available for many languages, for instance Talbanken05 for Swedish (Nivre et al., 2006b), the Penn Treebank for English (Marcus et al., 1994), the Prague Dependency Treebank for Czech (Hajič, 1998) and Tiger/Negra for German (Brants et al., 2004). However, there are still languages lacking this kind of linguistic resource.

As far as Romanian is concerned, a collection of texts was recently annotated according to the Dependency Grammar formalism within the BALRIC-LING project. BALRIC-LING[1] was an Information Society Technologies (IST) project aimed at raising awareness of Human Language Technologies (HLT) and of possible scientific and industrial applications of linguistic resources in Bulgarian and Romanian. The project especially targeted groups in academia and industry and involved the design, implementation and maintenance of a website with relevant material for HLT and language resources. The project also regularly published a bulletin containing answers to the questions received from those interested. The first part of the RORIC-LING project[2] took into consideration grammatical formalisms and their usage in Romanian, as well as tools for corpus annotation. Within the project, a set of Romanian texts was annotated according to the Dependency Grammar (DG) formalism, using the Dependency Grammar Annotator (DGA).[3] DGA is a tool conceived to facilitate the syntactic annotation of texts within the formal framework of DG. It offers a user-friendly graphical interface where the human annotator can insert sets of dependency types and dependency relations defined by the annotator. The syntactic annotation included in the tool is designed according to the EAGLES[4] recommendations. The annotated texts are saved in XML format, using a minimal set of tags inspired by XCES.[5] The set of annotated Romanian texts was placed under the name of Romanian Dependency Treebank on the TreebankWiki[6] site. The texts are available in both the XML and the eGXL[7] format.

[1] BALkan Regional Information Centers for Awareness and Standardization of LINGuistic Resources and Tools for Advanced Human Language Technology Applications, http://www.larflast.bas.bg/balric/index.html.
[2] RORIC-LING is the Romanian part of the BALRIC-LING project, http://www.phobos.ro/roric/.
[3] Downloadable at http://www.phobos.ro/roric/DGA/dga.html.
[4] Expert Advisory Group on Language Engineering Standards, http://www.ilc.cnr.it/EAGLES96/home.html.
[5] Corpus Encoding Standard for XML, http://www.cs.vassar.edu/~ide/xces-0.2/.
[6] http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/index.php5/Main_Page.
[7] eGXL, http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/index.php5/EGXL.

Treebank Data

The Romanian Dependency Treebank consists of 36,150 tokens (punctuation excluded) and comprises newspaper articles, mostly on political and administrative subjects. It contains 4,042 sentences, with a mean sentence length of 8.94 tokens per sentence. A total of 8,867 word types were found in the treebank, so the type/token ratio (TTR) is as high as 0.245, which indicates a rather frequent repetition of certain types. The texts were chosen so as to offer a representative sample of modern written standard Romanian. However, texts including complex ambiguities were avoided as much as possible, being removed from the corpus.[8]

The strong tradition of applying the Dependency Grammar (DG) formalism in linguistic research on Romanian, and in teaching prescriptive grammar in Romanian schools, justifies the choice of annotation style. For instance, the manuals on Romanian language and literature at the secondary and upper-secondary level describe grammatical elements in the DG formalism. As most linguistic research has traditionally been descriptive and prescriptive, the DG approach naturally fits the development of new linguistic resources for Romanian, such as the present treebank. Dependency structures in RDT take into account head-dependent relations (directed arcs), functional categories (arc labels) and structural categories such as parts of speech. A description of a dependency grammar approach to syntactic analysis with special reference to Romanian can be found in Hristea and Popescu (2003).

The annotation was performed completely manually by a Romanian linguist, using only the graphical interface tool DGA. Since there was only one annotator,[9] the parts of speech (POS) and grammatical functions were used relatively coherently throughout the whole material. The use of DGA made other annotation tools, such as a tokenizer, superfluous, since the texts were automatically stripped of almost all punctuation marks in the annotation phase. The annotated texts are automatically saved by DGA in XML format, following the XCES guidelines (Ide and Romary, 2001); the treebank consists of a total of 92 files in this format. When annotating sentences using DGA, the user needs to input files in the ASCII character encoding, UTF-8 not being supported, and the annotated files are saved in the same encoding. This explains the absence of the five Romanian diacritics (ă, â, î, ș and ț) from the treebank. The absence of the diacritics can lead to a certain degree of ambiguity in the treebank.

[8] The information regarding the choice of texts included in the corpus was given in the Romanian version of the RORIC-LING Bulletin, available at http://www.phobos.ro/roric/Ro/qa16.html.
[9] All the information regarding this phase of the RORIC-LING project was obtained through personal correspondence with Prof. Dr. Florentina Hristea, the coordinator of the RORIC-LING project.


However, this ambiguity can easily be resolved from context. As an example of the annotation, a complete sentence extracted from the file 't3.xml' records, for the word form 'trebuie', the token id 1, the POS tag 'verb', the head token 2 and the dependency type 'predicat'.

The sentences are delimited by opening and closing sentence tags. However, sentence identification numbers are lacking. All the other tags are similarly straightforward. Although the treebank does not come with an interactive search tool, the XML format makes it quite easy to search the treebank using a simple text editor with regular expression support. It would not require much effort either to convert the treebank to other formats for use in other projects. DGA can also be used as a viewing tool for dependency graphs of texts annotated with the DG formalism. Figure 1 shows a sentence annotated with DGA.

Figure 1: The sentence 'La acest efort diplomatic participa premierul britanic Tony Blair' [Lit. 'The British prime-minister Tony Blair is part of this diplomatic effort'] annotated by DGA.

The POS tag set used for annotating RDT is relatively small and rather simple considering the morphosyntactic richness of Romanian. The grammatical function set is exhaustive for the type of texts the corpus consists of. Table 1 presents the POS tag set used in RDT, accompanied by explanations and examples,[10] while Table 2 presents the set of dependency types in RDT.

Figure 2: Dependency graph for the sentence ‘Vor coopera in domeniile politiei si informatiilor’ [Lit. ‘They will cooperate in the areas of police and information’]

The problematic cases were most of the time avoided or simply dismissed. All idiomatic and complex group structures were decomposed into simple elements, sometimes removing difficult and complex structures from the original texts. The representation of elliptical constructions in Romanian was considered unnecessary, mostly because the written language tends to eliminate this kind of construction. Consequently, a shallow syntactic annotation was adopted. All complex-compound sentences (including coordinated sentences) were split into simple sentences, so there are no subordinate clauses in the treebank. Dates and measure phrases do not receive any kind of special annotation. There are no punctuation marks (except for a few hyphens, brackets and slashes in words like 'e-mail' or 'NATO/Rusia' [ro]). Proper names consisting of two or more elements were collapsed into one lexical element (e.g., 'Tony Blair' becomes the lexical unit 'TonyBlair'). The sentence in Figure 2 is an example featuring some of the tricky cases: it has a subject ellipsis and a complex verb group formed of an auxiliary and a main verb.

[10] The POS tags used in RDT are represented by strings having blanks and punctuation characters as delimiters. Because of this, the original tags were mapped to shorter but absolutely equivalent ones.

TAG     PART-OF-SPEECH              EXAMPLE
ADEM    Demonstrative article       cea, cel, cele
ADJ     Adjective                   scurt, puternică, unii
ADJPT   Adjective (participle form) ars, căzute, trecut
ADV     Adverb                      acum, ieri, nu
AHOT    Definite article            lui, -l, -a (enclitic)
ANHOT   Indefinite article          un, o, unui
APOS    Possessive article          a, al, ale
CAUX    Subordinating conjunction   să
CCOO    Coordinating conjunction    și, sau, ori
NO      Numeral                     una, treime, zece
PREP    Preposition                 în, de, pentru
PRDEM   Demonstrative pronoun       aceea, acestuia
PRFX    Reflexive pronoun           se, te, ne
PRON    Pronoun                     el, noi, cine, ce, care
SBS     Noun                        prevederilor, Timișoara
VB      Finite verb                 are, este, constatat
VBAUX   Auxiliary verb              a, au, va, fi
VBIF    Infinitive verb             răspunde, defini, cuprinde
VBNPR   Non-finite form             aparținând, venind
VBPT    Participle verb             așteptat, iluminat

Table 1: The POS tags used in RDT.

The main verb always assumes the role of head for the auxiliary in the dependency structures of RDT. The nouns 'domeniile' and 'informatiilor' are coordinated and governed by the conjunction 'si'. In the prepositional construction 'in domeniile', the preposition 'in' acts as the regent and the noun 'domeniile' as a subordinate. All these cases are questionable. In the complex verb group, for instance, the auxiliary, and not the main verb, bears the active form of the construction; on the other hand, the main verb bears the meaning of the whole verbal construction. Coordinating conjunctions mark a relation among similar elements, so one can question the legitimacy of their government. Finally, in the prepositional construction, the noun carries almost the whole weight of meaning, though being only a subordinate of the preposition. The treebank was intended first of all as a step towards linguistic resources for Romanian and, secondly, for training and evaluating statistical dependency parsers for Romanian.


However, given the short duration of the RORIC-LING project, the size of the treebank was considered too small for developing such applications. Consequently, until now, the treebank has only partially been used for its original purpose.

Dependency Structures in RDT

RDT can be seen as a text, that is to say, a sequence of sentences T = (S1, S2, ..., Sn). A sentence can be described as a sequence of words S = (W1, W2, ..., Wn), and a word or token as a sequence of characters. The dependency structure of a sentence in RDT can be defined as a labeled directed graph G, described as follows. Let D be a finite set of dependency types or arc labels (e.g. subject, attribute, determiner); Table 2 offers a review of all dependency types in use in RDT.

LABEL      DEPENDENCY TYPE
ATADJ      Adjectival modifier
ATADV      Adverbial modifier
ATNO       Numeral modifier
ATPRON     Pronominal modifier
ATSBS      Nominal modifier
ATSBSAPZ   Appositive nominal modifier
ATVB       Verbal modifier
CA         Agent object
CC         Adverbial
CD         Direct object
CI         Indirect object
NMPRED     Predicative
PRED       Predicate
RELAUX     Auxiliary relation
RELCOMP    Relation of comparison
RELCONJ    Conjunctive relation
RELDEM     Demonstrative relation
RELHOT     Definiteness relation
RELINF     Infinitival relation
RELNEG     Negative relation
RELNHOT    Indefiniteness relation
RELPOS     Possessive relation
RELPPOZ    Preposition government
RELRFX     Reflexive relation
SUB        Subject

Table 2: The dependency types used in RDT and their labels

A dependency graph for a sentence S is a structure G = (V, A, L), such that:

• V is a set of nodes, V = {1, 2, ..., n, n+1}, rendering the linear order of the words in the sentence S. For every node i (1 ≤ i ≤ n) there is a corresponding word wi ∈ S and a tag ti designating the part of speech to which the word belongs. There is a special node n+1 which does not correspond to any token of the sentence and which is the root of the dependency graph for a sentence in RDT.

• A is a set of arcs, A = {(i, j) | 1 ≤ i ≤ n+1, 1 ≤ j ≤ n}, where i and j are nodes. The head of the arc (i, j) is i, while j is the dependent. The arc (i, j) is denoted i → j, and the notation i →* j refers to the reflexive and transitive closure of the arc relation A.

• L is a function which takes its values from A and returns values in D, that is, L : A → D. This function ensures that the arcs in A are labeled with dependency types in D.

The graph G is well formed if and only if:

• Any node depends on at most one other node, the head, i.e. if i → j then there is no k such that k ≠ i and k → j. (SINGLE HEAD)

• G is connected, i.e. there is an (undirected) path connecting any two nodes i and j in the graph. (CONNECTED)

• There is no arc of the form i → n+1, i.e. n+1 is a root node. (ROOTED)

• G is projective, i.e. if i → j then i →* k for every k such that i < k < j or j < k < i. (PROJECTIVE)


D = {PRED, SUB, CI, RELPPOZ, ATSBS, ATADJ, ...}
V = {1, 2, 3, 4, 5, 6, 7, 8, 9}
A = {(9,5), (5,1), (5,6), (1,3), (3,2), (3,4), (6,7), (6,8)}
L((9,5)) = PRED
L((5,1)) = CI
L((5,6)) = SUB
L((1,3)) = RELPPOZ
L((3,2)) = ATADJ
L((3,4)) = ATADJ
L((6,7)) = ATADJ
L((6,8)) = ATSBS

Figure 3: Dependency graph for the sentence 'La acest efort diplomatic participa premierul britanic Tony Blair' [Lit. 'The British prime-minister Tony Blair is part of this diplomatic effort']

All the dependency graphs for the sentences in RDT are well-formed, i.e. they are connected, projective, rooted and acyclic, and every node has at most one head. In order to verify that the graphs are projective, I have used NonProj2Proj,[11] version 0.2. One should notice that the dependency relations and the graph described for the sentence above are in accordance with the general principles. Figure 3 shows the representation of the dependency graph for the dependency structure assigned to one sentence in RDT, 'La acest efort diplomatic participa premierul britanic Tony Blair'.

[11] Available at http://w3.msi.vxu.se/~nivre/research/proj/0.2/doc/Proj.html.
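To make the four conditions concrete, below is a minimal sketch in Python (a language not used in the thesis; the function and the head-mapping representation are my own) that checks well-formedness for a graph given as a mapping from each dependent to its head:

    def is_well_formed(heads, n):
        """Check SINGLE HEAD, CONNECTED, ROOTED and PROJECTIVE for a
        sentence with word nodes 1..n and artificial root n + 1, where
        heads[j] is the head of node j (the mapping representation
        itself enforces the SINGLE HEAD property)."""
        root = n + 1
        # ROOTED: no arc of the form i -> n + 1.
        if root in heads:
            return False
        # Every word node must have a head.
        if set(heads) != set(range(1, n + 1)):
            return False
        # CONNECTED and acyclic: every node must reach the root.
        for j in range(1, n + 1):
            seen, k = set(), j
            while k != root:
                if k in seen:
                    return False
                seen.add(k)
                k = heads[k]

        def dominates(i, k):
            # True iff i ->* k (reflexive, transitive closure of the arcs).
            while k != i:
                if k == root:
                    return False
                k = heads[k]
            return True

        # PROJECTIVE: if i -> j, then i ->* k for every k between i and j.
        for j, i in heads.items():
            lo, hi = sorted((i, j))
            if not all(dominates(i, k) for k in range(lo + 1, hi)):
                return False
        return True

    # The graph of Figure 3 (root node 9):
    heads = {5: 9, 1: 5, 6: 5, 3: 1, 2: 3, 4: 3, 7: 6, 8: 6}
    print(is_well_formed(heads, 8))   # True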

THE ROMANIAN DEPENDENCY TREEBANK

18

Evaluation of RDT

For evaluation purposes, I randomly selected 3% of the total number of sentences (i.e., 122 sentences) and manually corrected all the errors, thus creating a gold-standard material. The two files were then compared with the help of the MaltEval[12] program, grouping the output by individual tokens and by POS tags and using the LAS (Labeled Attachment Score) metric.[13] The results presented in Table 3 indicate that the accuracy is 98.9% when the test data is grouped by individual tokens and 97.6% when grouped by the POS tag associated with the individual tokens. The accuracy is computed by dividing the number of tokens with a hit (according to the metric value) by the total number of tokens in the data (Nilsson, 2008). The results show that the quality of the material is more than satisfactory.

Group by   Accuracy
Token      0.989
POS-tag    0.976

Table 3: Accuracy of the RDT
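As a brief illustration, the LAS computation can be sketched as follows (a Python stand-in for MaltEval, not its actual code), where each token record carries its head and dependency type:

    def las(gold, system):
        # A token counts as a hit only if both its head and its
        # dependency type match the gold standard.
        hits = sum(1 for g, s in zip(gold, system)
                   if g["head"] == s["head"] and g["deprel"] == s["deprel"])
        return hits / len(gold)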

Most of the errors I found in the material belonged to one of the following types:

• Dependency type errors

  – Predicate incorrectly classified as subject (e.g., Figure 4)

Token ID                1          2          3              4
Token                   Va         obtine     recunoasterea  eos
POS-tag                 auxiliary  verb       noun           eos
Head                    2          4          2              eos
Dep. type (erroneous)   relaux     SUBJECT    cd             eos
Dep. type (corrected)   relaux     PREDICATE  cd             eos

Figure 4: The sentence 'Va obtine recunoasterea' [Lit. '(It) will obtain the acknowledgment'].

  – Subject incorrectly classified as predicate (e.g., Figure 5)

[12] Available at http://w3.msi.vxu.se/~nivre/research/MaltEval.html.
[13] A thorough explanation of this metric is given in the Evaluation Metrics section.

Token ID                1          2       3       4          5
Token                   Fondurile  sa      fie     blocate    eos
POS-tag                 noun       aux     aux     verb       eos
Head                    4          4       4       5          eos
Dep. type (erroneous)   PREDICATE  relaux  relaux  predicate  eos
Dep. type (corrected)   SUBJECT    relaux  relaux  predicate  eos

Figure 5: The sentence 'Fondurile sa fie blocate' [Lit. 'The funds should be blocked'].

• Head identification errors - failure to correctly disambiguate (e.g., Figure 6)

Token ID            1        2     3            4        5    6
Token               depozit  de    combustibil  nuclear  ars  eos
POS-tag             noun     prep  noun         adj      adj  eos
Head (erroneous)    6        1     2            3        1    eos
Head (corrected)    6        1     2            3        3    eos

Figure 6: The sentence ‘[...] depozitul de combustibil nuclear ars’ [Lit. ‘the waste storage for spent nuclear fuel’].

Even if the annotation of the material was performed by a single annotator, inconsistencies still occur, especially within the annotation scheme. Four of the twenty POS tags and one dependency type appear only in the first 6% of the material, significantly reducing the POS tag set for the rest of the material. For instance, verbs and adjectives in participle form are annotated as such only in the first part of the material. On the other hand, the definite article POS tag is present only in the last 90% of the material. In most cases, definiteness is marked by an enclitic (e.g. indefinite/definite article: un om - omul [a/the man]), which partly accounts for the inconsistency. However, certain proper names in the genitive require the proclitic variant of the article (e.g. 'casa lui Ion' [Ion's house]), suggesting that the texts were probably strictly selected in the beginning of the project. The errors and inconsistencies detected in the material were fixed in the data sets used for the experiments. The corrected version of the treebank will be made available on the TreebankWiki site.


Comparison with other dependency treebanks

Since annotated data is such an important resource for the development of NLP applications, relatively many treebanks are already available, some of which have adopted the DG formalism for annotation. Abeillé (2003) offers an extensive presentation of treebanks in general, both constituency- and dependency-based. Kakkonen (2006) presents a selection of current dependency-based treebanks, also reviewing the methods used for building the resources, the annotation schemes applied, and the tools used (such as POS taggers, parsers and annotation software). On the TreebankWiki website alone, there is a listing of treebanks ranging from Floresta Sintá(c)tica for Portuguese to the Swedish Talbanken05 and the Japanese Verbmobil.

In size (measured in tokens), RDT resembles the Turin University Treebank (TUT) for Italian and the Slovene Dependency Treebank. Looking at the sentence count, RDT is closer to the Spanish Cast3LB, which has more than double its number of tokens. The high number of sentences for such a small token count can be explained by the fact that RDT consists of simple clauses alone. The resemblance with TUT continues at the level of the annotation scheme: TUT has 16 POS tags, very similar to the 20 in use in RDT. However, the similarities stop at a certain level. There are 51 subcategories for TUT's POS tags, as opposed to none in RDT. TUT also includes non-lexical nodes, i.e. traces, in order to deal with movement (deletion, ellipsis and sharing) and long-distance dependencies (Lesmo et al., 2002). These problematic cases, which also appear in Romanian, are simply avoided by excluding them from RDT.

Most treebanks are nowadays annotated at least semi-automatically. For instance, the Tiger Treebank was annotated interactively: a human annotator is presented with a proposal from a parser, which is accepted or rejected after inspection. By contrast, RDT was annotated completely manually, with the aid of a visualization tool, just as, for example, the Basque Dependency Treebank (Aduriz et al., 2003). Manual annotation is often the case when, as for Romanian, tools (e.g. parsers) are missing.


It also requires annotation guidelines and consistency controls. Regarding the format of the treebank, RDT uses the XML-based Corpus Encoding Standard (XCES) format, just as, for example, the METU-Sabancı Turkish Treebank (Atalay et al., 2003). In fact, many treebanks are available in XML format (see Kakkonen (2006) for a review of the formats in use for different treebanks). Table 4 presents an overview of some of the existing treebanks, indicating the name, language and size of each treebank. Many of them are purely dependency-based, while others only have a version in this annotation scheme; this is the case for BulTreeBank, the Alpino Treebank, Cat3LB, Cast3LB, the Tiger Treebank and Talbanken05.

Treebank                            Language     Sentences   Tokens      Reference
Prague Arabic Dependency Treebank   Arabic       3,044       116,800     (Hajič et al., 2004)
BulTreeBank                         Bulgarian    10,000      200,000     (Osenova and Simov, 2004)
Cat3LB                              Catalan      2,800       100,000     (Civit et al., 2006)
Prague Dependency Treebank          Czech        90,000      1,500,000   (Böhmová et al., 2001)
Danish Dependency Treebank          Danish       5,540       100,008     (Kromann, 2003)
Alpino Treebank                     Dutch        7,153       195,069     (van der Beek et al., 2002)
Tiger Treebank                      German       40,000      800,000     (Brants et al., 2004)
Turin University Treebank           Italian      1,500       38,653      (Bosco and Lombardo, 2004)
Floresta Sintá(c)tica               Portuguese   9,368       215,003     (Afonso et al., 2002)
Romanian Dependency Treebank        Romanian     4,042       36,150      RORIC-LING Bulletin
Russian National Treebank           Russian      12,000      180,000     (Boguslavsky et al., 2002)
Slovene Dependency Treebank         Slovene      2,000       30,000      (Džeroski et al., 2006)
Cast3LB                             Spanish      4,000       100,000     (Civit et al., 2006)
Talbanken05                         Swedish      21,571      342,170     (Nivre et al., 2006b)

Table 4: Overview of several dependency treebanks (some of the treebanks only have a dependency version)

A quantitative structure analysis of several dependency treebanks, including RDT, was performed by Pustylnikov and Mehler (2008). Quantitative characteristics such as the in- and out-degree of dependency links, sentence length, depth, depth imbalance, child imbalance, compactness and stratum measurements were computed for each treebank, giving the feature vectors used in the classification and comparison of the treebanks. The results show that RDT and the Alpino Treebank (in its dependency-based version) exhibit the greatest similarity.


To summarize, RDT is a relatively small dependency treebank for Romanian. It was annotated completely manually, with DGA providing the annotator with a friendly graphical interface and a way to save all the annotated texts automatically in XML format. It contains no punctuation marks and consists of simple clauses only. The annotated texts were strictly selected newspaper articles, with complex, ambiguous and problematic constructions most of the time eliminated.

MaltParser

MaltParser is a language-independent, data-driven parser generator. In learning mode, it learns (or generates) a parsing model for a language when presented with a dependency-based treebank for that language. In parsing mode, it assigns a well-formed dependency graph to each sentence of a natural language text, given a previously induced configuration model for the respective language. The current version of the software implements four different parsing algorithms and one learning algorithm. It is very user-friendly in the sense that it allows many parameters to be tuned during learning and parsing. The system offers a set of default feature models, one for each algorithm; feature models specific to a language may also be specified by the user. The input treebank needs to be in a tab-separated format: the CoNLL or the Malt-TAB data format. The system has already been used with competitive results for several languages, limited only by the existence of linguistic resources for a particular language.
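For illustration, the Java releases of MaltParser are typically invoked along the following lines, first inducing a model from a treebank and then parsing new text with it (the file and configuration names here are hypothetical, and the exact options may differ for the version used in this thesis):

    java -jar maltparser.jar -c romanian -i train.conll -m learn
    java -jar maltparser.jar -c romanian -i input.conll -o output.conll -m parse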

Architecture of the MaltParser System

MaltParser was designed in a modularized fashion, as a system composed of relatively independent entities. This architecture offers the flexibility of choosing among existing components or implementing new ones, modifying the settings so that the system produces different parsers given the same training material. There are three principal components in MaltParser:


• Parser - This module generates a dependency graph for a given sentence. The graph is generated according to the deterministic parsing algorithm the user chooses.

• Learner - The Learner is an interface to a machine learning system. The interface acts as a classifier for the training instances sent by the Guide. The current MaltParser implementation uses LIBSVM (Chang and Lin, 2001).

• Guide - For each parser state of the system, the Guide extracts feature vectors according to a feature model: the default model or one specified by the user. During training, the Guide takes a transition from the Parser, constructs a vector of feature values for that particular transition and passes this training instance to the Learner. During parsing, the Parser sends a request to the Guide for each non-deterministic transition; the Guide constructs a vector of feature values for the Parser's request and sends it to the Learner. Moreover, the Guide passes the predicted transition from the Learner to the Parser. In effect, the Guide ensures the independence of the Parser and Learner modules.

The system operates in two phases: the learning (or training) phase and the parsing phase. During learning, the system extracts a set of training instances from a treebank of dependency structures. At every step the Parser takes, the system derives, through the Guide, a set of features from every sentence in the given treebank. These sets of features are saved as training instances, which the Learner classifies according to the settings made by the user for the machine learning system. All these parameters make up a configuration model for a given algorithm, together with a set of option settings which control the behavior of the three modules of the system. During the parsing phase, the Parser invokes the Guide to pass on the transition predicted by the Learner on the basis of the extracted training instances. A detailed description of the architecture of MaltParser is found in Hall (2006) and Nivre et al. (2007b).
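The division of labour between the three modules can be summarized with the following Python-style sketch (my own pseudocode, not MaltParser's actual API; oracle_transition, extract_features and the learner methods are stand-ins for the real components):

    def learn(treebank, parser, guide, learner):
        for sentence, gold_graph in treebank:
            config = parser.initial_config(sentence)
            while not parser.is_terminal(config):
                # The correct transition is derived from the gold-standard graph.
                t = parser.oracle_transition(config, gold_graph)
                features = guide.extract_features(config)   # per the feature model
                learner.add_instance(features, t)           # one training instance
                config = parser.apply(config, t)
        learner.train()                                     # e.g. the LIBSVM classifier

    def parse(sentence, parser, guide, learner):
        config = parser.initial_config(sentence)
        while not parser.is_terminal(config):
            features = guide.extract_features(config)
            t = learner.predict(features)    # relayed to the Parser via the Guide
            config = parser.apply(config, t)
        return parser.dependency_graph(config)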


Deterministic Parsing Algorithms

In the current version of the software, four parsing algorithms are implemented, two belonging to the Nivre family and two to the Covington family. However, given the modularized architecture of the system, any deterministic parsing algorithm can be implemented and used within the MaltParser framework. A parsing algorithm is deterministic in the sense that, given an input, the parser always determines the next action.[14] The deterministic approach is very similar to shift-reduce parsing for context-free languages (Nivre, 2003). Except for Covington's non-projective algorithm, the algorithms produce only projective dependency graphs; however, since RDT contains only projective structures, this does not constitute a problem.

Nivre family

Nivre's algorithm can be run in two versions: Nivre arc-eager and Nivre arc-standard. It has a 'stepwise' approach (McDonald and Nivre, 2007), linking the tokens while taking previous decisions into consideration. The strategy is to look at two tokens at a time and decide whether there is a dependency arc between them; the arc-eager version tries to link the tokens as soon as possible. The two versions of the algorithm use two data structures: a stack (Stack) of processed tokens and a list (List) of remaining input tokens. In the arc-eager version, at every step the parser applies one of the following four transitions: Shift, Reduce, RightArc(l) and LeftArc(l). The Shift transition shifts the first token from the list of input tokens onto the stack. The Reduce transition pops (or reduces) the top token on the stack; the top token on the stack is assigned a head as soon as possible, but if it is reduced without a head, it is left unattached. The RightArc(l) transition adds a dependency arc labeled l such that the first token on the list of input tokens becomes the dependent of the top token on the stack; in addition, the token from the list is pushed onto the stack. The LeftArc(l) transition adds a dependency arc with the label l such that the current token on the stack becomes the dependent of the current token on the list, and pops the current token from the stack.

[14] More on deterministic parsing at http://en.wikipedia.org/wiki/Deterministic_parsing.


Below is the transition sequence for the sentence S, 'TonyBlair participa la acest efort diplomatic' ['TonyBlair is part of this diplomatic effort'], using the arc-eager version of the Nivre algorithm. The set A denotes the set of arcs connecting the token nodes in S; a and l are the functions recording the arcs and their labels. At the beginning, the stack is empty, the list contains the token nodes of the input sentence, and no values have been assigned for the labeled arcs. The derivation terminates when the list becomes empty.

S = TonyBlair1 participa2 la3 acest4 efort5 diplomatic6
A ⊇ {(1 ← 2), (2 → 3), (3 → 5), (4 ← 5), (5 → 6)}

                  (nil, (1,...,6), a0, l0)
SHIFT       →     ((1), (2,...,6), a0, l0)
LA(SUB)     →     (nil, (2,...,6), a1 = a0[1 ← 2], l1 = l0[1 → SUB])
SHIFT       →     ((2), (3,...,6), a1, l1)
RA(CI)      →     ((2,3), (4,5,6), a2 = a1[2 → 3], l2 = l1[3 → CI])
SHIFT       →     ((2,3,4), (5,6), a2, l2)
LA(ATADJ)   →     ((2,3), (5,6), a3 = a2[4 ← 5], l3 = l2[4 → ATADJ])
RA(RELPPOZ) →     ((2,3,5), (6), a4 = a3[3 → 5], l4 = l3[5 → RELPPOZ])
RA(ATADJ)   →     ((2,3,5,6), nil, a5 = a4[5 → 6], l5 = l4[6 → ATADJ])

Several transitions in the sequence above are non-deterministic. However, the parser is guided at all non-deterministic points by the Learner via the Guide. In the arc-standard version, the parser applies only three transitions at every step, having no Reduce in comparison with the arc-eager version. The remaining transitions behave the same, except that RightArc(l) moves the top token of the stack back to the list of remaining input tokens, replacing the current token at the head of the list. There are two options controlling the behavior of the two versions of the algorithm. The root handling option can be set to strict, relaxed, or the default normal.


In the case of strict root handling, the dependents of the root node are attached with the default label only after parsing, and the Reduce transition becomes invalid for unattached tokens. Relaxed root handling is very similar to the strict one, except that the Reduce transition is valid. With normal root handling, the RightArc(l) transition attaches root dependents during parsing, while tokens still unattached afterwards get attached with the default label. The second option concerns post-processing, i.e. a second pass over the input, processing only the unattached tokens. The default for this option is false.
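The four arc-eager transitions can be summarized in a short Python sketch (my own rendering of the description above, not MaltParser code), with the stack, the input list and a set of labeled arcs as the parser state:

    def shift(stack, input_list, arcs):
        stack.append(input_list.pop(0))

    def reduce_(stack, input_list, arcs):
        stack.pop()   # precondition: the top token already has a head

    def right_arc(stack, input_list, arcs, label):
        # The next input token becomes a dependent of the top token
        # and is then pushed onto the stack.
        arcs.add((stack[-1], label, input_list[0]))
        stack.append(input_list.pop(0))

    def left_arc(stack, input_list, arcs, label):
        # The top token becomes a dependent of the next input token
        # and is popped from the stack.
        arcs.add((input_list[0], label, stack[-1]))
        stack.pop()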

Covington family

Covington's incremental algorithm comes in two flavors: projective and non-projective. The first version imposes several restrictions, among them the projectivity of the dependency graphs. The second version allows graphs that are non-projective, but acyclic. The algorithm uses four data structures. Right is the list of (remaining) input tokens, whereas the list Left contains the partially processed tokens from the input list. The two other lists are LeftContext and RightContext: the first contains all the unattached tokens to the left of the first token in the Right list, and the second all the unattached tokens to the right of the first token in the Left list. In the non-projective version, at every step the parser applies one of the following three transitions: Reduce, RightArc(l) and LeftArc(l) (Hall, 2006). Reduce pops the top token from a list when it has been (partially) processed. The RightArc(l) transition adds a right dependency arc with the label l such that the current token on the top of the list becomes the dependent of the next input token. The LeftArc(l) transition adds a dependency arc with the label l such that the current token on the top of the list becomes the head of the next input token. In addition, the RightArc(l) and LeftArc(l) transitions reduce the top token from the list. Covington (2001) and Nivre (2007) offer a thorough description of Covington's parsing algorithm.


The projective version builds only projective graphs, so the parser does not explore all combinations of possible token pairs. The transitions are the same as for the non-projective version, except that the Reduce and RightArc(l) transitions add a DONE flag in order not to permit non-projective arcs. There are two options controlling the behavior of the algorithm. If the option allow shift is set to true, the transition Shift is valid, allowing the parser to skip the remaining tokens in Left; if the option is set to the default false, all tokens in Left must be partially processed before the next token is shifted. If the option allow root is set to the default true, the special root node is treated as a token during parsing, allowing root dependents to be attached with a RightArc(l) transition; if false, root dependents are not attached during parsing. In both cases, unattached tokens are attached to the special root node with the default label after parsing is completed.
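In outline, Covington's strategy can be sketched as follows (a strong simplification in Python, my own; the decide function stands in for the guided choice among LeftArc, RightArc and no attachment):

    def covington(words, decide):
        arcs = []
        for j in range(2, len(words) + 1):    # each new token...
            for i in range(j - 1, 0, -1):     # ...is compared with earlier ones
                action = decide(i, j, arcs)   # guided, non-deterministic choice
                if action == "left-arc":
                    arcs.append((j, i))       # j becomes the head of i
                elif action == "right-arc":
                    arcs.append((i, j))       # i becomes the head of j
        return arcs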

Learning Algorithm

The Learner unit in the MaltParser system is an interface to LIBSVM, a library for support vector machines (SVM)[15] which supports multi-class classification. The software provides an interface which can be integrated into other programs. It comes with a multitude of options, allowing the user to choose the parameters which best fit the material in use. The Learner classifies the data using the kernel functions in LIBSVM during training; it then predicts transitions during parsing using this classification. The data reaching the Learner (i.e., feature values and transitions) is mapped into a numerical representation. A lexicon with all word types and their numerical values is built and input into the system before the Learner is involved in the training or parsing phases (Hall, 2006).

[15] Support vector machines (SVM) are a group of supervised learning methods that can be applied to classification or regression. More on SVM at http://www.support-vector-machines.org/.


LIBSVM comes with many options and parameters. For instance, there are four types of kernel functions, the default being the radial basis function; during the experiments reported in this thesis, however, only the polynomial kernel was used. Other options concern, for instance, the tolerance of the termination criterion, ε (default 0.001), and the penalty parameter C (default 1). All the LIBSVM parameters used in the experiments are described in the Experimental Setup section. A considerable decrease in learning time is obtained by splitting the data into several training sets according to three criteria: column, structure and threshold. The first criterion indicates which input column in the data format specification file should be used, for instance POSTAG or CPOSTAG. The data structure used to split the training data is, for instance, the top token on the stack or in the input for the Nivre family, and the top token on the right or the left for the Covington family. The threshold indicates the minimal value for forming a new training set. However, the decrease in learning time is accompanied by a drop in accuracy.
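For orientation, the corresponding stand-alone LIBSVM call would look roughly as follows (illustrative only, with a hypothetical file name; in this thesis LIBSVM is driven through MaltParser's option interface rather than invoked directly). Here -t 1 selects the polynomial kernel, -e the termination tolerance and -c the penalty parameter:

    svm-train -t 1 -e 0.001 -c 1 training_instances.dat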

Feature Models

The Guide extracts history-based features in order to use them in predicting the next appropriate action at non-deterministic points. The features themselves represent attributes of certain tokens, i.e. values of the grammatical information present in the material and of the partially built dependency structures. In RDT, for instance, only POS tag values and lexical forms can be extracted for the input tokens, as there is no information on coarse-grained POS tags; from the partially built dependency structures, DEPREL (dependency relation) information can be extracted. The features option allows the user to specify which kinds of feature values are to be extracted. The default feature model specifications for each of the four algorithms are XML files that are part of the application data. The user may also construct new specifications fitting the material at hand more closely. The system accepts XML files or tab-separated data files with the suffix .par.


Below is an example of some feature specifications from a tab-separated data file:

POS   INPUT
POS   INPUT   1
DEP   STACK   0   0   0   1
DEP   STACK   0   0   0   -1

Each line contains one feature. Each feature may be specified by seven columns, but at least the first two must be present; if the values for columns 3 to 7 are zero, they can be truncated. The first column defines the type of the feature: POS for the part-of-speech tag, CPOS for the coarse-grained part-of-speech tag, FEAT for grammatical features, DEP for the dependency relation, LEX for the lexical form of the token, and LEM for the lemma of the token.[16] The second column identifies the data structure in the parser configuration, that is, INPUT for the list of input tokens, STACK for the stack of partially processed tokens, and CONTEXT (relevant only for Covington's non-projective algorithm) for the right or left context of a token. The third column targets the (i+1)th token in the data structure specified in the second column: in the example above, the first line refers to the part-of-speech of the next input token, while the second line refers to the part-of-speech of the token immediately after the next input token in the input list. The fourth column indicates the position of the token in the original input string, a positive value referring to a forward (right) position and a negative one to a backward (left) position. The following three columns refer to the partially built graph. The fifth column identifies the head of a certain token. The sixth column refers to the left- or rightmost dependent of a token; the last two lines in the example above extract the dependency relation of the right- and leftmost dependent of TOP (the first token on the STACK). Finally, the seventh column indicates the left- or rightmost sibling of a certain token.

[16] The FEAT, LEM and CPOS types are available only for the tab-separated CoNLL format. The Formatting and Encoding section contains a detailed description of the CoNLL format.

Tokens                          POS   DEP   LEX
S: TOP                          +     +     +
S: TOP+1                        +
I: NEXT                         +           +
I: NEXT+1                       +           +
I: NEXT+2                       +
I: NEXT+3                       +
G: Head of TOP                              +
G: Leftmost dependent of TOP          +
G: Rightmost dependent of TOP         +
G: Leftmost dependent of NEXT         +

Table 5: Default feature model for Nivre's arc-eager algorithm (S = Stack, I = Input, G = Graph)

Table 5 presents the default feature specification for Nivre’s arc-eager algorithm. The tokens are grouped according to their position on the input queue (I) or on the stack (S). Moreover, there are some feature types specified for the graph (G) properties of certain tokens.
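Rendered in the .par notation introduced above, the model in Table 5 would look roughly as follows (my own illustrative transcription based on the column semantics just described, not the distributed specification file):

    POS   STACK
    POS   STACK   1
    POS   INPUT
    POS   INPUT   1
    POS   INPUT   2
    POS   INPUT   3
    DEP   STACK
    DEP   STACK   0   0   0   -1
    DEP   STACK   0   0   0   1
    DEP   INPUT   0   0   0   -1
    LEX   STACK
    LEX   INPUT
    LEX   INPUT   1
    LEX   STACK   0   0   1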

Previous Projects using MaltParser

Being a language-independent system, MaltParser has been used to parse several languages, for instance in the CoNLL-X 2006 and CoNLL 2007 shared tasks on dependency parsing. These shared tasks on multi-lingual dependency parsing gather the results obtained by several parsers trained and tested on the same material for a particular language, using gold-standard annotation and the same dependency format, in order to better compare the parsers and their performance. The two shared tasks thus offer the advantage of exactly the same data sets and the same evaluation metric. In the CoNLL-X task, MaltParser achieved the second best overall score, with top results for several languages. The errors produced by the two best systems, MSTParser and MaltParser, were characterized and reported in McDonald and Nivre (2007).


In the 2007 multi-lingual dependency track, MaltParser obtained the best overall score with the blended parser and the fifth score with the single-parser system. The blended parser is an ensemble system combining the output of six deterministic parsers (Hall et al., 2007). Table 11 gathers the results for all nineteen languages parsed in the two shared tasks (Romanian is placed in the table only for comparison). MaltParser has also been trained on Latin, using the Latin Dependency Treebank (Bamman and Crane, 2007).

Comparison with other dependency parsers

Since so many NLP applications need text processing in the form of parsed text, many parsers are already available. As for dependency parsers, most of them are produced as research projects at different universities, like the MaltParser system, although commercial systems exist too. Buchholz and Marsi (2006) and Nivre et al. (2007a) provide an overview of the multilingual dependency parsers participating in the two CoNLL shared tasks, detailing the approaches used by the systems and reporting and analyzing the results. MaltParser uses a stepwise approach, building a dependency graph by taking decisions at every step or transition of the parser. The contrasting all-pairs approach, which builds a dependency graph by considering every possible pair of tokens in a sentence, is used, for instance, in the MSTParser (MST as in Maximum Spanning Trees) system (McDonald et al., 2005); Jason Eisner's probabilistic dependency parser uses the same approach (Eisner, 1996). Most of the transition-based systems employ a shift-reduce inspired parsing model, as MaltParser does. Other systems using a similar model are, for instance, the DeSR system (Attardi et al., 2007) and the ISBN Dependency Parser (Titov and Henderson, 2007). An alternative to the shift-reduce model is, for example, the LR parser found in the RASP system (Watson and Briscoe, 2007).


Regarding the learning method, SVM is also used by the CaboCha parser (Kudo and Matsumoto, 2002), as in the MaltParser system. Other systems employ different models: for instance, the DeSR system uses perceptron learning, while the ISBN parser uses incremental sigmoid belief networks. Maximum likelihood estimation is applied in the case of the RASP system, and an extension of the Margin Infused Relaxed Algorithm (MIRA) is used for the MSTParser.

System                 Available at
CaboCha                http://chasen.org/~taku/software/cabocha
DeSR                   http://desr.sourceforge.net
Eisner's parser        www.cs.jhu.edu/~jason/papers/narrative
ISBN Parser            http://cui.unige.ch/~titov/idp
Link Grammar Parser    http://www.abisource.com/projects/link-grammar
Minipar                http://www.cs.ualberta.ca/~lindek/minipar.htm
MSTParser              http://sourceforge.net/projects/mstparser
MaltParser             http://w3.msi.vxu.se/~jha/maltparser
RASP                   www.informatics.susx.ac.uk/research/groups/nlp/rasp
Stanford parser        http://nlp.stanford.edu/software/index.shtml

Table 6: Overview of some dependency parsing systems

Table 6 provides an overview of some existing dependency parsers. Most of them are freely available; the Stanford parser provides only a demo. MaltParser is a language-independent system, whereas most of the dependency parsers available right now are language-dependent, mostly developed for English. For instance, the Minipar system is a principle-based parser for English (Lin, 2003). The Link Grammar Parser, as well as the commercial Connexor Machinese Syntax parser, is also limited to specific languages (Pyysalo et al., 2006). Among the trainable parsers, besides MSTParser, the ISBN parser and MaltParser, another system worth mentioning is the Stanford Parser, a trainable statistical parser producing both constituency- and dependency-based parses (Klein and Manning, 2003).


To summarize, MaltParser is a language-independent, data-driven dependency parsing system. It has a modularized architecture, consisting of three relatively independent units: the Parser, the Learner and the Guide. The Parser implements two deterministic algorithm families, Nivre and Covington, each with two versions. The Learner is an interface to LIBSVM, a machine learning system. The Guide makes the connection between the Parser and the Learner; it also extracts history-based feature models. The system comes with a multitude of options controlling the behavior of the different units. It has been used in several previous projects, obtaining encouraging results.

Experiments

This chapter presents the experiments performed during the training and tuning of the parser for Romanian, offering a detailed review of the format and encoding required by MaltParser, the data sets used in the experiments and the experimental setup. The evaluation metrics, the results of the experiments and a comparison with the results for other languages trained with MaltParser are also reviewed here, together with an error analysis of the parser output.

Formatting and Encoding

MaltParser currently supports tab-separated data files, for instance the CoNLL and Malt-TAB data formats. A user-defined format may also be provided, should the user supply the format option to the data format specification file. One can also specify the character encoding of the files. I have chosen to convert the training and testing files used in the experiments into the CoNLL format, since it is straightforward and user-friendly. The sentences in the data files are separated by blank lines, each token starting on a new line. A token is described in ten fields separated by a single tab character. The fields for token id, form, coarse-grained and fine-grained POS tag, head and dependency relation to the head have to contain non-dummy values, whereas the fields for lemma, features, projective head and dependency relation to the projective head may be represented by a dummy value (e.g., an underscore).[17] The conversion from XML to the CoNLL data format was done using a script written in Perl by myself.

[17] More information on the CoNLL data format at http://nextens.uvt.nl/depparse-wiki/DataFormat.


Figure 7 contains a sentence from the Romanian treebank ('La acest efort diplomatic participa premierul britanic Tony Blair' [Lit. 'The British prime-minister Tony Blair is part of this diplomatic effort']) in the CoNLL data format, with specifications for each field. CPOSTAG, POSTAG, FEATS, DEPREL, PHEAD and PDEPREL are abbreviations for coarse-grained part-of-speech tag, part-of-speech tag, features, dependency relation, projective head and dependency relation to the projective head.

ID  FORM        LEMMA  CPOSTAG  POSTAG  FEATS  HEAD  DEPREL    PHEAD  PDEPREL
1   La          _      PREP     PREP    _      5     CI        _      _
2   acest       _      ADJ      ADJ     _      3     ATADJ     _      _
3   efort       _      SBS      SBS     _      1     RELPPOZ   _      _
4   diplomatic  _      ADJ      ADJ     _      3     ATADJ     _      _
5   participa   _      VB       VB      _      9     PRED      _      _
6   premierul   _      SBS      SBS     _      5     SUB       _      _
7   britanic    _      ADJ      ADJ     _      6     ATADJ     _      _
8   TonyBlair   _      SBS      SBS     _      6     ATSBS     _      _

Figure 7: A sentence from RDT in the CoNLL data format

MaltParser requires the training and testing files to be encoded in UTF-8 (Unicode). As the original XML files were encoded in ASCII, this did not pose any problems.
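To make the format concrete, the following is a minimal sketch (my illustration, not part of the experimental pipeline) of reading sentences in the ten-field CoNLL format described above; the field names follow Figure 7.

```python
from typing import Iterator

# The ten CoNLL fields, in order, as described in the text.
FIELDS = ["id", "form", "lemma", "cpostag", "postag",
          "feats", "head", "deprel", "phead", "pdeprel"]

def read_conll(path: str) -> Iterator[list[dict]]:
    """Yield one sentence at a time as a list of token dictionaries.

    Sentences are separated by blank lines; fields by single tabs."""
    sentence = []
    with open(path, encoding="utf-8") as f:  # MaltParser expects UTF-8
        for line in f:
            line = line.rstrip("\n")
            if not line:                     # a blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
            else:
                sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:                             # last sentence, if no trailing blank line
        yield sentence
```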

Data Sets

The total data set was split into a training set, a development or validation set and an evaluation test set. The parsing models were trained on approximately 80% of the data. The validation set was used for preliminary testing and for tuning the parsing models and the feature models specific for Romanian. The test set, unseen during the development phase, was used for a final test run, training on 90% of the data. Table 7 below presents some statistics on the different splits of the data.

On the whole, there were several modifications brought to the original treebank data. First of all, I corrected some annotation errors (see the section on the Evaluation of RDT).

Data set      Tokens   Sentences   T/S
Training      28,918   3,234       8.94
Development    3,651     404       9.03
Test           3,581     404       8.86
Total         36,150   4,042       8.94

Table 7: RDT - training, development and test sets (T/S - average sentence length)
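The 80/10/10 split described above can be sketched as follows; the exact splitting procedure is not documented beyond being random, so a simple seeded shuffle is assumed here.

```python
import random

def split_data(sentences: list, seed: int = 42):
    """Split sentences into ~80% training, ~10% development, ~10% test.

    The random selection mirrors the description in the text; the seed
    value is an assumption made for reproducibility."""
    random.seed(seed)
    shuffled = sentences[:]          # copy, so the original order is kept
    random.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test
```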

Secondly, regarding the original POS tag set and dependency type set used in the annotation of the treebank, it was chiefly the presence of blanks and punctuation characters inside the strings of the original annotations that led to the decision to map the original notations to shorter but fully equivalent symbols. No information was lost in the mapping.

TAG     PART OF SPEECH              EXAMPLE
ADEM    Demonstrative article       cea, cel, cele
ADJ     Adjective                   scurt, puternică, unii
ADV     Adverb                      acum, ieri, nu
AHOT    Definite article            lui, -l, -a (enclitic)
ANHOT   Indefinite article          un, o, unui
APOS    Possessive article          a, al, ale
CAUX    Subordinating conjunction   să
CCOO    Coordinating conjunction    și, sau, ori
NO      Numeral                     una, treime, zece
PREP    Preposition                 în, de, pentru
PRFX    Reflexive pronoun           se, te, ne
PRON    Pronoun                     el, noi, cine, ce, care
SBS     Noun                        prevederilor, Timișoara
VB      Finite verb                 are, este, constatat
VBAUX   Auxiliary verb              a, au, va, fi
VBNPR   Non-finite verb form        aparținând, venind

Table 8: The reduced POS tags used in Data Set 2

Given the inconsistencies regarding the use of some POS tags throughout the material, I decided to create two data sets, differing only in their POS tagset. The first set, called Data Set 1, contains all


the POS tags presented in Table 1. The second set, called Data Set 2, has a reduced tagset (16 out of the initial 20), presented in Table 8.

ID  FORM        CPOSTAG  POSTAG  Original HEAD  Modified HEAD  DEPREL
1   La          PREP     PREP    5              5              CI
2   acest       ADJ      ADJ     3              3              ATADJ
3   efort       SBS      SBS     1              1              RELPPOZ
4   diplomatic  ADJ      ADJ     3              3              ATADJ
5   participa   VB       VB      9              0              PRED
6   premierul   SBS      SBS     5              5              SUB
7   britanic    ADJ      ADJ     6              6              ATADJ
8   TonyBlair   SBS      SBS     6              6              ATSBS

Figure 8: The modification on the dependency graph

The third modification concerns the dependency graphs representing the syntactic structures of the sentences in RDT. For any sentence, there is a special node which does not correspond to any token of the sentence and which is the root of the dependency graph. In the original annotation of the Romanian treebank, a node with index n + 1, where n is the number of tokens in the sentence, was chosen to represent the root of the dependency graphs. I changed the special node from n + 1 to 0, following Nivre et al. (2007b). Since the special node does not correspond to any real token of the sentence, I have considered it a purely conventional notation, with no theoretical or practical consequences. It is, however, more convenient for the human viewer of a sentence with dependency structures to consider a


fixed number, 0, rather than a variable, as the root of a graph. Figure 8 represents the dependency graph of a sentence from RDT, 'La acest efort diplomatic participa premierul britanic Tony Blair' [Lit. 'The British prime-minister Tony Blair is part of this diplomatic effort']. The sentence is also shown in the CoNLL format, and the modification is visible in the two HEAD columns.
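The remapping itself is trivial; a sketch over the token-dictionary representation used earlier (again my illustration, not the actual conversion script):

```python
def remap_root(sentence: list[dict]) -> None:
    """Replace the artificial root index n + 1 by 0, in place.

    n is the number of tokens in the sentence; only tokens attached
    to the artificial root are affected."""
    n = len(sentence)
    for token in sentence:
        if int(token["head"]) == n + 1:
            token["head"] = "0"
```

For the sentence in Figure 8 (eight tokens), this changes the head of participa from 9 to 0 and leaves all other heads untouched.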

Evaluation Metrics

It is relevant for the purpose of this master's thesis to measure how accurately the parsing models trained on RDT behave. One of the common measurements for the accuracy of parsing models is the attachment score, that is, how often a token is parsed correctly. The following three metrics were used in the evaluation:

• Labeled Attachment Score (LAS) represents the percentage of tokens which have been correctly assigned both the head and the dependency label (i.e. dependency type). (Both Correct)

• Unlabeled Attachment Score (UAS) represents the percentage of tokens which have been correctly assigned the head, paying no attention to the assigned dependency label. (Correct Head)

• Label Attachment (LA) represents the percentage of tokens which have been correctly assigned the dependency type label, this time paying no attention to the head. (Correct Label)

MaltEval, a freely available piece of software (Nilsson, 2008), was used in the evaluation.18

18 Available at http://w3.msi.vxu.se/users/jni/malteval/.
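The three metrics are straightforward to compute from parallel gold and predicted (head, label) pairs, as the following sketch shows; the actual measurements in this thesis were produced by MaltEval.

```python
def attachment_scores(gold: list, predicted: list) -> dict:
    """Compute LAS, UAS and LA from parallel lists of (head, label)
    pairs, one pair per token."""
    assert len(gold) == len(predicted)
    n = len(gold)
    las = sum(1 for g, p in zip(gold, predicted) if g == p) / n
    uas = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph) / n
    la = sum(1 for (_, gl), (_, pl) in zip(gold, predicted) if gl == pl) / n
    return {"LAS": las, "UAS": uas, "LA": la}
```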

Experimental Setup

This experiment checks how the different parsing algorithms, feature models and parameter settings of the learning method influence the accuracy of the parsing models. The following setup was used:


• Two data sets based on the Romanian Dependency Treebank

• The LIBSVM learning method

• Two feature models: the default feature model specification for the given parsing algorithm and a specific feature model for Romanian

• Four parsing algorithms: Nivre's arc-eager and arc-standard, and Covington's projective and non-projective

During the experiment, I used the parameter set given as an example on the MaltParser web page for the LIBSVM learning method, which yielded the best parsing accuracy; equivalent settings are sketched below. In this set, γ is set to 0.2 and r to 0.4 for the polynomial kernel of degree 2. The parameter C for the type of SVM is kept at its default value, 1, whereas the tolerance of the termination criterion, ε, is set to 0.1. During the development phase, I also tested other sets of parameters, but with no improvement in the results.

Regarding the graph option group, the root label option 'PRED' instead of the default 'ROOT' was kept constant throughout the experiment. This is the default label used for unattached tokens. After parsing is completed, these tokens are automatically attached to the special root node. For the Nivre family of parsing algorithms, the root handling option was set to relaxed. This means that root dependents are not attached during parsing, but are attached with the default label afterwards. The reduction of unattached tokens is permitted. The post-processing value was set to true, which allows a second pass over the input, processing only the unattached tokens. All these values significantly improved the accuracy of the parser. The option group for the Guide was not modified from the default settings. Splitting the data into several training sets resulted in a significant drop in accuracy, most probably because of the small training set to begin with. Regarding the Covington family, the options allow root and allow shift were kept at their default settings; experiments performed with changed values did not produce an increase in accuracy.
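The kernel settings above can be made concrete with the following sketch, which expresses them through scikit-learn's SVC; this is only an illustration of the parameter values, not the LIBSVM interface that MaltParser actually uses. Scikit-learn's polynomial kernel, (γ⟨x, y⟩ + r)^degree, matches LIBSVM's formulation.

```python
from sklearn.svm import SVC

clf = SVC(
    kernel="poly",
    degree=2,   # polynomial kernel of degree 2
    gamma=0.2,  # the gamma parameter
    coef0=0.4,  # the r parameter
    C=1.0,      # SVM penalty parameter, kept at its default
    tol=0.1,    # tolerance of the termination criterion (epsilon)
)
```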


This experiment uses two data sets containing exactly the same data, with 90% used for training the parsing models and 10% for testing. The two data sets differ only with regard to the POS tagset. The first data set, Data Set 1, has the same tagset as the original material of RDT, shown in Table 1. The second data set, Data Set 2, has the reduced tagset shown in Table 8.

Feature Model for Romanian

MaltParser uses a default feature model specification, included in the system's distribution, for all the algorithms. Instead of the default, an XML file or a tab-separated data file may provide a feature model specified for a particular language. For this experiment, firstly, the default feature models of the parsing algorithms were used. Secondly, I used a feature model specific for Romanian in order to check how it influences the parsing accuracy. I created a base version of the feature model for the Nivre arc-eager parsing algorithm and other versions for the Nivre arc-standard algorithm and the two versions of Covington's algorithm.

Table 9 presents the details of the feature model for Romanian for the Nivre arc-eager algorithm. The part-of-speech (POS), dependency type (DEP) and lexical form (LEX) features19 are attributes of the tokens found either on the current stack S, in the input queue I or in the partially built graph G. TOP refers to the first token on the stack S and NEXT to the first token waiting in the input queue I. This model was tuned by forward and backward feature selection,20 starting from the default model for the Nivre arc-eager algorithm and by looking at feature models specific for other languages trained and tuned with MaltParser in other projects. The number of features in this model is 22, most of them (13) concerning the part-of-speech of the tokens taken into consideration.

19 There is no information available in RDT on the lemma, the coarse-grained part-of-speech tag or specific morphosyntactic features of the tokens.

20 Adding features to and removing features from a given model.

The model draws its 22 POS, DEP and LEX features from the following tokens:

S: TOP
S: TOP-1
S: TOP-2
S: TOP+1
I: NEXT
I: NEXT-1
I: NEXT+1
I: NEXT+2
I: NEXT+3
G: Head of TOP
G: Leftmost dependent of TOP
G: Rightmost dependent of TOP
G: Leftmost dependent of NEXT
G: Right sibling of the leftmost dependent of TOP

Table 9: Feature model for Romanian - Nivre arc-eager algorithm (S=Stack, I= Input, G = Graph)

Lexical form features also occur in the model, even though they have many distinct values and thus require a large amount of training data, which is not available in the present experiment. Using fewer lexical features in the model, however, resulted in a drop in parsing accuracy. This is probably because RDT has a rather high TTR indicating a frequent repetition of certain types (see the section on Treebank Data). The parsing accuracy improved by using a specific feature model for Romanian for all the algorithms in the system.
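As an illustration of how such history-based features are assembled, the sketch below collects a handful of POS, DEP and LEX attributes from a parser configuration. The Token class and the feature subset shown are purely illustrative; they are not MaltParser's internal representation and cover only a few of the 22 features in Table 9.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    form: str                      # LEX attribute
    pos: str                       # POS attribute
    deprel: Optional[str] = None   # DEP label in the partially built graph

def extract_features(stack: list, queue: list) -> dict:
    """Collect attributes of TOP, NEXT and a neighbour in the input queue."""
    top = stack[-1] if stack else None
    nxt = queue[0] if queue else None
    nxt1 = queue[1] if len(queue) > 1 else None   # NEXT+1
    return {
        "POS(TOP)":    top.pos if top else None,
        "POS(NEXT)":   nxt.pos if nxt else None,
        "POS(NEXT+1)": nxt1.pos if nxt1 else None,
        "DEP(TOP)":    top.deprel if top else None,
        "LEX(TOP)":    top.form if top else None,
        "LEX(NEXT)":   nxt.form if nxt else None,
    }
```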

Results

Table 10 presents the labeled and unlabeled attachment scores and the label attachment for this experiment. The specific feature model for Romanian achieves the best accuracy for every algorithm, because it gathers and uses adequate information to determine the next parser action. The highest score is a LAS of 88.6% and a UAS of 92%, achieved by using the Nivre arc-eager algorithm.


Parsing Algorithm          Feature Model   LA     LAS    UAS
Nivre arc-eager            default         0.902  0.879  0.915
                           rom_aE.par      0.908  0.886  0.920
Nivre arc-standard         default         0.898  0.878  0.913
                           rom_aS.par      0.901  0.879  0.916
Covington non-projective   default         0.903  0.881  0.913
                           rom_cov.par     0.908  0.883  0.916
Covington projective       default         0.904  0.880  0.914
                           rom_cov.par     0.908  0.883  0.916

Table 10: Parsing accuracy for four parsing algorithms, two feature models (FM), LIBSVM learning method and Data Set 1. The accuracy is measured in label attachment (LA) and labeled and unlabeled attachment score (LAS and UAS).

Covington projective and non-projective, together with the Nivre arc-eager algorithm, produce the highest LA score of 90.8%.

Language    LAS    UAS        Language     LAS    UAS
Arabic      74.75  84.21      Greek        74.21  80.66
Basque      74.99  80.61      Hungarian    78.09  81.71
Bulgarian   87.41  91.72      Italian      82.48  86.26
Catalan     87.74  92.20      Japanese     91.65  93.10
Chinese     86.92  90.54      Portuguese   87.60  91.22
Czech       77.22  82.35      Romanian     88.60  92.00
Danish      84.77  89.80      Slovene      70.30  78.72
Dutch       78.59  81.35      Spanish      81.29  84.67
English     85.81  86.77      Swedish      84.58  89.50
German      85.82  88.76      Turkish      79.24  85.04

Table 11: Overview of the results obtained with MaltParser for several languages in the CoNLL-X Shared Task and the CoNLL 2007 Shared Task

All the scores for Romanian are above 87%, so the results are encouraging for a first project dealing with parsing Romanian text. Table 11 presents an overview of the results obtained using MaltParser 0.4 for several languages in the CoNLL-X Shared Task and the CoNLL 2007 Shared Task.


The results were obtained using the Nivre arc-eager algorithm with a specific feature model for each language and with different settings for the SVM learner. In comparison, the scores for Romanian are very similar to those obtained for configurational languages like English, Italian and Catalan. Languages with rich morphology and flexible word order, for instance Czech, have lower results. Given that Romanian can be considered a language characterized by flexible word order and a relatively rich morphology, these results are rather unexpected. However, as presented in the second chapter, the texts in RDT were strictly selected, eliminating, for instance, sentences with flexible word order and, in general, simplifying complex structures. This may influence the results positively. The small training set size did not constitute a disadvantage. On the contrary, as Hall et al. (2007) observe, text selection and linguistic properties seem to be more important than, for example, the size of the material at hand.

When taking into consideration the results obtained with the default version of the feature model, the highest LAS of 88.1% is achieved by Covington's non-projective parsing algorithm. Given that non-projective structures do not occur in the material, this result is rather surprising. Covington's algorithm builds a graph by systematically trying to link each new token to all the tokens preceding it, some links being permissible only under certain constraints (Nivre, 2006; Covington, 2001). This approach may have an early advantage due to the very strategy of the algorithm. In fact, looking at the output of the parser, there are only two non-projective dependency graphs: the algorithm allows this kind of graph, but this does not mean that the parser outputs only non-projective structures. However, this advantage is lost when the feature model specific for the material is used with the Nivre arc-eager algorithm.

Data Set 1 was at the base of the results presented in Table 10. Another experiment was performed using Data Set 2. The results are very similar and are not presented here in a table. Regarding, for instance, the Nivre arc-eager algorithm, there was no difference at all when using the default feature model. When using the specific feature model for Romanian, the


accuracy measured in LAS decreases slightly, from 88.6% to 88.5%. On the other hand, the accuracy measured in LA shows significantly better results. This can be explained by the use of a smaller but more consistent POS tagset in the second data set, which improves the assignment of the dependency type label.

Error Analysis

An error analysis of the parser output seems rewarding, given that it could set a path for future improvement of the parsing accuracy.

Figure 9: Parsing accuracy relative to sentence length

The length of a sentence is one of the factors influencing the error rate: longer sentences tend to contain more errors than shorter ones. Figure 9 presents the parsing accuracy of the Nivre arc-eager and Covington non-projective parsing algorithms relative to sentence length. The graph presents


the LAS results obtained with the test data separated into groups of size 10 (that is, 1-10, 11-20 and so on). The performance of the system is almost indistinguishable between the two algorithms for shorter sentences of up to 30 tokens. However, the arc-eager algorithm performs better for longer sentences. Looking closer at the parser outcome for the long sentences, it is easy to observe that the arc-eager inference algorithm takes almost all the decisions correctly, except when having to deal with dependents located far away from their head, especially dependents (adverbials, most often as prepositional constructs) of the main verb or the root. However, this did not produce error propagation. This is partly due to the rich feature representation, which guides the parser correctly even after an error, producing fragmented parses.

Continuing with length properties, it is very interesting to look at the precision and recall of the Nivre arc-eager algorithm with regard to the distance between a dependent and its head (i.e. the dependency length). The length of a dependency, or an arc, is the absolute difference between the index of a token and that of its head. Precision measures how many of the predicted arcs of a certain length are correct, relative to all predicted arcs of that length. Recall measures how many of the arcs of a certain length in the gold standard are correctly predicted. In general, precision and recall are calculated according to the following formulae:

    Precision = CorrectCount / ParserCount

    Recall = CorrectCount / GoldStandardCount
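The grouped measurements of Table 12 can be sketched as follows; arcs are (dependent, head) index pairs, head 0 denotes the artificial root, and the grouping mirrors the table. This is an illustration only; the actual figures were computed with MaltEval.

```python
from collections import Counter

def group(length: int, to_root: bool) -> str:
    """Map an arc to the length groups of Table 12."""
    if to_root:
        return "to root"
    if length <= 2:
        return str(length)
    return "3-6" if length <= 6 else "7-..."

def pr_by_length(gold: list, predicted: list) -> dict:
    """Precision and recall per dependency-length group.

    gold and predicted are lists of (dependent, head) arcs; since each
    dependent has exactly one head, an arc is correct iff it occurs in both."""
    parser, gold_std, correct = Counter(), Counter(), Counter()
    for arcs, counter in ((predicted, parser), (gold, gold_std)):
        for dep, head in arcs:
            counter[group(abs(dep - head), head == 0)] += 1
    for dep, head in set(gold) & set(predicted):
        correct[group(abs(dep - head), head == 0)] += 1
    return {g: {"precision": correct[g] / parser[g] if parser[g] else 0.0,
                "recall": correct[g] / gold_std[g] if gold_std[g] else 0.0}
            for g in set(parser) | set(gold_std)}
```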

Table 12 presents the precision and recall for grouped dependency lengths. Arcs of length 7 or more (like, for instance, the arc between an adverbial and its head, usually the main verb of a sentence) are harder to predict, with a precision of around 60%. The tendency is quite clear: the further away from the head, the lower the precision of the arcs; conversely, the closer to the head, the higher the precision.

Graph factors are also important to consider. It is interesting to measure the accuracy for tokens grouped by their distance to the root of

Grouped Dependency Length   Precision (%)   Recall (%)
to root                     99.50           99.50
1                           95.92           96.32
2                           92.34           89.66
3-6                         83.59           81.89
7-...                       60.78           71.26

Table 12: Precision and Recall of Grouped Dependency Length (Nivre arceager algorithm)

the graph, in order to observe at what level of the graph the errors are found. This distance is equal to the number of arcs on the path from the dependent up to the root.
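Computing this distance amounts to following head pointers upwards, as in the sketch below, which uses the convention that the predicate (the token whose head is the artificial root 0) is at distance zero and assumes the graph is a well-formed tree.

```python
def distances_to_root(heads: list[int]) -> list[int]:
    """heads[i] is the head index of token i + 1; 0 denotes the root node."""
    def depth(token: int) -> int:
        d = 0
        while heads[token - 1] != 0:   # climb head pointers to the predicate
            token = heads[token - 1]
            d += 1
        return d
    return [depth(i) for i in range(1, len(heads) + 1)]
```

For the modified sentence in Figure 8, distances_to_root([5, 3, 1, 3, 0, 5, 6, 6]) returns [1, 3, 2, 3, 0, 1, 2, 2].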

Figure 10: Arc precision and recall for distance to root

Figure 10 shows the precision and the recall of the Nivre arc-eager algorithm for distances of zero to thirteen. For distance zero (that is, all the tokens without a head, or, in our case, the predicates), precision and recall are the same, almost reaching one hundred percent. This indicates that when calculating precision and recall, the denominator is the same, that is, the total number of arcs in the system's output is


equal to the total number of arcs in the gold standard for arcs of distance zero.

Linguistic factors, such as parts of speech and dependency types, are also noteworthy. Table 13 presents the accuracy for several parts of speech, obtained using the Nivre arc-eager algorithm. The table presents the percentage of tokens grouped under a certain part of speech for which both the head and the dependency relation to the head are predicted correctly. Predicative verbs have almost one hundred percent accuracy. The fact that RDT contains only simple sentences, with a single verb acting as predicate (which is also the root of the sentence), surely contributes to this result. At the other end, prepositions and coordinating conjunctions have the lowest accuracy. Prepositions occur most often in adverbial structures as dependents of verbs (predicative and unpredicative) and sometimes at a long distance from their heads, which explains the lower accuracy. Many errors concerning prepositions were also caused by the system's tendency to overpredict root dependents.

Part of speech              LAS (%) (head + label correct)
Adjectives                  95.10
Adverbs                     90.80
Coordinating conjunctions   65.20
Nouns                       89.10
Predicative verbs           99.50
Prepositions                69.20
Pronouns                    89.10

Table 13: Accuracy for several parts of speech

Figure 11 exemplifies the kind of errors the parser produces for coordinating conjunctions when using the Nivre arc-eager algorithm. In the sentence 'Painea carnea si cartofii nu sunt indicate', the coordinating conjunction si ('and') governs three nouns: painea ('the bread'), carnea ('the meat') and cartofii ('the potatoes'). The conjunction is one of the root's dependents, acting as subject for the predicate sunt ('are') of the sentence. However, as the figure shows, the parser fails to predict these dependencies and


dependency relations: the conjunction becomes a dependent of the first noun, painea, the arc being labeled as a nominal modifier.

Figure 11: Predicted graph for the sentence 'Painea carnea si cartofii nu sunt indicate' [Lit. 'Bread, meat and potatoes are not advisable']

This kind of error occurs for all coordinated structures. This is partly due to the lack of punctuation in RDT; without it, the coordinated structure is rather difficult to process even for a human. It is likely that the missing commas in this kind of structure are partly responsible for misleading the parser.

Figure 12: Precision and recall errors for several dependency relations

Another linguistic factor worth taking into consideration is the dependency relation. Figure 12 presents precision and recall for tokens grouped according to their dependency relations, using the Nivre arc-eager algorithm on the test data. The root category, or the predicate (PRED), has almost one


hundred percent precision and recall. Indirect objects (CI) and adverbials (CC) have the lowest precision. Looking closer at the sentences containing these dependency relations in the test data, it is clear that the system's tendency to overpredict root dependents is one of the main causes of error in this case: many indirect objects and adverbials are automatically attached to the root of the sentence when they are, in fact, dependents of, for instance, unpredicative verbs.

Figure 13: Predicted graph for the sentence 'care este organism independent' [Lit. 'which is an independent organism']

Regarding the subject (SUB) and the predicative (NMPRED) relations, the system often produces an interesting mix-up. Figure 13 presents the predicted parse for the sentence 'care este organism independent' ('which is an independent organism'). Both the pronoun care ('which') and the noun organism ('organism') are root dependents. The arc from the pronoun to the predicate este ('is') should be labeled with the subject relation, while the arc from the noun to the predicate should carry the predicative relation. However, the system predicts the labels exactly the other way around. According to the word order of Romanian, this relation sequence is perfectly permissible, too; moreover, similar sequences exist in RDT, so the system predicts according to what it has learned from the training data. The parser is thus only partly wrong. In this particular case, however, the predicted relation sequence is incorrect, since the pronoun care may be attached to the root only by the subject relation.

Other specific errors appear in connection with multiword units. Looking at the output of the system, the parses of prepositional expressions like, for instance, de la ('from'), in legatura cu ('about') and in conformitate cu


('according to') contain systematic errors. Figure 14 exemplifies the errors produced by the parser for a sentence containing a multiword unit: the second preposition in the expression in legatura cu is incorrectly attached to the root and labeled as an indirect object, whereas the gold standard proposes the noun legatura as the head of the preposition cu and labels the arc between the two tokens with the nominal modifier relation. From another perspective, Nivre and Nilsson (2004) observe that the recognition of multiword units improves parsing accuracy, with respect both to the units themselves and to the surrounding syntactic structures. These units should be taken into consideration for an improvement of the parsing accuracy.

Figure 14: Predicted graph for the sentence ‘sa formuleze o pozitie definitiva in legatura cu aceasta situatie’ [Lit. ‘In order to formulate a definitive position on this situation’]

To summarize, the experiments were performed on relatively small training and test sets, converted into the CoNLL format and encoded in UTF-8. The experiments used two data sets, four parsing algorithms, the LIBSVM learning method and two feature model specifications. The best labeled attachment score of 88.6% is similar to the results obtained in other projects using MaltParser. Length, graph and linguistic factors were considered in the error analysis. As the linguistic properties of the material at hand seem to play a major role in obtaining better accuracy, the quality of the data needs to be constantly improved.

Conclusion

Summary

Within the frame of this thesis, the MaltParser system was used to train, tune and test a first parser for Romanian, using the Romanian Dependency Treebank. At a closer look, RDT proved to be a relatively small treebank, containing around 36,000 tokens. The strong tradition of using dependency grammar approaches in linguistic research on Romanian determined the choice of framework. The treebank assembles strictly selected newspaper texts, standing as a sample of standard written Romanian. It was annotated completely manually, using DGA, a graphical interface. It comes in XML format, following the XCES guidelines. The files were encoded in ASCII. The treebank contains no punctuation marks, and the texts consist only of simple sentences, complex constructions being avoided. Twenty POS tags were found in the treebank, most of the time used coherently. Twenty-five dependency types were set as labels for the arcs of the dependency graphs. The EOS token represented the special root node of all dependency graphs. After evaluating the treebank by randomly selecting 3% of the material, it was clear that the quality of the material was satisfactory, although some inconsistencies were found.

The MaltParser system was used for training and testing the parser. I converted the treebank from the XML format to the tab-separated CoNLL format required by MaltParser, using the UTF-8 encoding. I made some changes to the material due to the inconsistencies found. For instance, the special root node was modified to 0, a change concerning all the sentences


in the material. As evaluation metrics, I used Label Attachment (LA), Labeled Attachment Score (LAS) and Unlabeled Attachment Score (UAS). For the tuning of the parser, I randomly selected 80% of the data for training and 10% for development testing. For the final testing, I trained on 90% of the data and tested on the unseen 10%. I performed the experiment using all four algorithms and the LIBSVM system, with both the default feature specification for each parsing model and a feature model specific for Romanian. I created two data sets, one of them containing fewer POS tags in order to eliminate some inconsistencies; however, the results did not differ significantly between the data sets. I also performed an error analysis, classifying the errors according to length, graph and linguistic properties.

For a first parser for Romanian, the results of around 88% accuracy measured in LAS, using the Nivre arc-eager algorithm, are extremely encouraging. However, the results are clearly influenced by the linguistic properties of the training material, for instance the existence of simple clauses alone. Compared to the numbers obtained for other languages, they are very close to the results obtained for configurational languages, indicating the importance of the material used for training and testing.

Future Research

The results obtained in this thesis can be considered promising, inviting a continuation on the same track and leaving a lot of space for improvement. Since MaltParser is a data-driven system, it requires a relatively large amount of training data. The parser trained in this project can be used to process new Romanian text which, after correction by a human expert, could be used to create a larger data set. However, a new set of annotation guidelines, together with extended and refined morphosyntactic information, needs to be put in place before expanding the current treebank. Punctuation marks should also be taken into account in a future extended version of the treebank. Since the error analysis showed


that multiword units have a rather high error rate, these need to be considered too before annotating new material. Furthermore, the selection of the texts should be looser, allowing at least complex sentences.

Bibliography

Abeillé, A. (2003). Treebanks: Building and Using Parsed Corpora. Springer.

Aduriz, I., Aranzabe, M., Arriola, J., Atutxa, A., de Ilarraza, A., Garmendia, A., and Oronoz, M. (2003). Construction of a Basque dependency treebank. Proc. of the 2nd Workshop on Treebanks and Linguistic Theories (TLT), pages 201–204.

Afonso, S., Bick, E., Haber, R., and Santos, D. (2002). Floresta sintá(c)tica: a treebank for Portuguese. Proc. of the Third Intern. Conf. on Language Resources and Evaluation (LREC), pages 1698–1703.

Atalay, N., Oflazer, K., and Say, B. (2003). The annotation process in the Turkish treebank. Proc. of the 4th Intern. Workshop on Linguistically Interpreted Corpora (LINC).

Attardi, G., Dell'Orletta, F., Simi, M., Chanev, A., and Ciaramita, M. (2007). Multilingual Dependency Parsing and Domain Adaptation using DeSR. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 1112–1118.

Bamman, D. and Crane, G. (2007). The Latin dependency treebank in a cultural heritage digital library. Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 33–40.

Boguslavsky, I., Chardin, I., Grigorieva, S., Grigoriev, N., Iomdin, L., Kreidlin, L., and Frid, N. (2002). Development of a Dependency Treebank for Russian and its Possible Applications in NLP. Proceedings of the 3rd



International Conference on Language Resources and Evaluation, pages 987–991.

Böhmová, A., Hajič, J., Hajičová, E., and Hladká, B. (2001). The Prague Dependency Treebank: A Three-Level Annotation Scenario. In Abeillé, A. (ed.): Treebanks: Building and Using Syntactically Annotated Corpora.

Bosco, C. and Lombardo, V. (2004). Dependency and relational structure in treebank annotation. COLING 2004 Recent Advances in Dependency Grammar, pages 1–8.

Brants, S., Dipper, S., Eisenberg, P., Hansen-Schirra, S., König, E., Lezius, W., Rohrer, C., Smith, G., and Uszkoreit, H. (2004). TIGER: Linguistic Interpretation of a German Corpus. Research on Language & Computation, 2(4):597–620.

Buchholz, S. and Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. Proc. of the Tenth Conference on Computational Natural Language Learning, pages 189–210.

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Civit, M., Marti, M., and Bufi, N. (2006). Cat3LB and Cast3LB: From Constituents to Dependencies. Lecture Notes in Computer Science, 4139:141.

Collins, M., Ramshaw, L., Hajič, J., and Tillmann, C. (1999). A statistical parser for Czech. Proceedings of the 37th Conference of the Association for Computational Linguistics, pages 505–512.

Covington, M. (2001). A fundamental algorithm for dependency parsing. Proceedings of the 39th Annual ACM Southeast Conference, pages 95–102.

Džeroski, S., Erjavec, T., Ledinek, N., Pajas, P., Žabokrtský, Z., and Žele, A. (2006). Towards a Slovene dependency treebank. Proc. of the Fifth Intern. Conf. on Language Resources and Evaluation (LREC).


Eisner, J. (1996). An empirical comparison of probability models for dependency grammar. Institute for Research in Cognitive Science, University of Pennsylvania, Technical Report IRCS-96-11.

Hajič, J. (1998). Building a syntactically annotated corpus: The Prague Dependency Treebank. Issues of Valency and Meaning, pages 106–132.

Hajič, J., Smrž, O., Zemánek, P., Šnaidauf, J., and Beška, E. (2004). Prague Arabic dependency treebank: Development in data and tools. Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, pages 110–117.

Hall, J. (2006). MaltParser – An Architecture for Inductive Labeled Dependency Parsing. Licentiate thesis, School of Mathematics and System Engineering, Växjö University.

Hall, J., Nilsson, J., Nivre, J., Eryiğit, G., Megyesi, B., Nilsson, M., and Saers, M. (2007). Single Malt or Blended? A Study in Multilingual Parser Optimization. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 933–939.

Hristea, F. and Popescu, M. (2003). A Dependency Grammar Approach to Syntactic Analysis with Special Reference to Romanian. In Hristea, F. and Popescu, M. (eds.): Building Awareness in Language Technology.

Ide, N. and Romary, L. (2001). A common framework for syntactic annotation. Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 306–313.

Kakkonen, T. (2006). Dependency Treebanks: Methods, Annotation Schemes and Tools. arXiv preprint cs.CL/0610124.

Klein, D. and Manning, C. (2003). Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference.


Kromann, M. (2003). The Danish Dependency Treebank and the DTAG treebank tool. Proceedings of TLT 2003 (2nd Workshop on Treebanks and Linguistic Theories, Växjö, 2003).

Kudo, T. and Matsumoto, Y. (2002). Japanese dependency analysis using cascaded chunking. International Conference on Computational Linguistics, pages 1–7.

Lesmo, L., Lombardo, V., and Bosco, C. (2002). Treebank development: the TUT approach. Proceedings of the International Conference on Natural Language Processing, Mumbai, India.

Lin, D. (2003). Dependency-based evaluation of Minipar. In Treebanks: Building and Using Parsed Corpora.

Marcus, M., Kim, G., Marcinkiewicz, M., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. (1994). The Penn Treebank: Annotating predicate argument structure. ARPA Human Language Technology Workshop, pages 114–119.

McDonald, R. and Nivre, J. (2007). Characterizing the errors of data-driven dependency parsing models. Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).

McDonald, R., Pereira, F., Ribarov, K., and Hajič, J. (2005). Non-projective dependency parsing using spanning tree algorithms. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 523–530.

Nilsson, J. (2008). User guide for MaltEval 1.0 (beta). Technical report, Växjö University.

Nivre, J. (2003). An efficient algorithm for projective dependency parsing. Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 149–160.


Nivre, J. (2006). Constraints on non-projective dependency parsing. Eleventh Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 73–80.

Nivre, J. (2007). Incremental non-projective dependency parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 396–403. Association for Computational Linguistics.

Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., and Yuret, D. (2007a). The CoNLL 2007 Shared Task on Dependency Parsing. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 915–932.

Nivre, J., Hall, J., and Nilsson, J. (2006a). MaltParser: A data-driven parser-generator for dependency parsing. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006).

Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kübler, S., Marinov, S., and Marsi, E. (2007b). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(02):95–135.

Nivre, J. and Nilsson, J. (2004). Multiword Units in Syntactic Parsing. In Dias, G., Lopes, J. G. P. and Vintar, S. (eds.): MEMURA 2004 - Methodologies and Evaluation of Multiword Units in Real-World Applications.

Nivre, J., Nilsson, J., and Hall, J. (2006b). Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006).

Nivre, J. and Scholz, M. (2004). Deterministic dependency parsing of English text. Proceedings of the International Conference on Computational Linguistics (COLING), pages 64–70.

Osenova, P. and Simov, K. (2004). BTB-TR05: BulTreeBank Stylebook. BulTreeBank Project Technical Report Nr. 5. Technical report, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences.

Pustylnikov, O. and Mehler, A. (2008). Towards a Uniform Representation of Treebanks: Providing Interoperability for Dependency Tree Data. Proceedings of the First International Conference on Global Interoperability for Language Resources (ICGL 2008).

Pyysalo, S., Ginter, F., Pahikkala, T., Boberg, J., Järvinen, J., and Salakoski, T. (2006). Evaluation of two dependency parsers on a biomedical corpus targeted at protein–protein interactions. International Journal of Medical Informatics, 75(6):430–442.

Tesnière, L. (1959). Éléments de syntaxe structurale. C. Klincksieck.

Titov, I. and Henderson, J. (2007). Fast and Robust Multilingual Dependency Parsing with a Generative Latent Variable Model. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 947–951.

van der Beek, L., Bouma, G., Malouf, R., and van Noord, G. (2002). The Alpino dependency treebank. Computational Linguistics in the Netherlands (CLIN).

Watson, R. and Briscoe, T. (2007). Adapting the RASP System for the CoNLL07 Domain-Adaptation Task. Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL, pages 1170–1174.


List of Figures

1. The sentence 'La acest efort diplomatic participa premierul britanic Tony Blair' [Lit. 'The British prime-minister Tony Blair is part of this diplomatic effort'] annotated by DGA
2. Dependency graph for the sentence 'Vor coopera in domeniile politiei si informatiilor' [Lit. 'They will cooperate in the areas of police and information']
3. Dependency graph for the sentence 'La acest efort diplomatic participa premierul britanic Tony Blair' [Lit. 'The British prime-minister Tony Blair is part of this diplomatic effort']
4. The sentence 'Va obtine recunoasterea' [Lit. '(It) will obtain the acknowledgment']
5. The sentence 'Fondurile sa fie blocate' [Lit. 'The funds should be blocked']
6. The sentence '[...] depozitul de combustibil nuclear ars' [Lit. 'the waste storage for spent nuclear fuel']
7. A sentence from RDT in the CoNLL data format
8. The modification on the dependency graph
9. Parsing accuracy relative to sentence length
10. Arc precision and recall for distance to root
11. Predicted graph for the sentence 'Painea carnea si cartofii nu sunt indicate' [Lit. 'Bread, meat and potatoes are not advisable']
12. Precision and recall errors for several dependency relations
13. Predicted graph for the sentence 'care este organism independent' [Lit. 'which is an independent organism']
14. Predicted graph for the sentence 'sa formuleze o pozitie definitiva in legatura cu aceasta situatie' [Lit. 'In order to formulate a definitive position on this situation']

List of Tables

1. The POS tags used in RDT
2. The dependency types used in RDT and their labels
3. Accuracy of the RDT
4. Overview of several dependency treebanks (some of the treebanks only have a dependency version)
5. Default feature model for Nivre's arc-eager algorithm (S=Stack, I=Input, G=Graph)
6. Overview of some dependency parsing systems
7. RDT - training, development and test sets (T/S - average sentence length)
8. The reduced POS tags used in Data Set 2
9. Feature model for Romanian - Nivre arc-eager algorithm (S=Stack, I=Input, G=Graph)
10. Parsing accuracy for four parsing algorithms, two feature models (FM), LIBSVM learning method and Data Set 1. The accuracy is measured in label attachment (LA) and labeled and unlabeled attachment score (LAS and UAS)
11. Overview of the results obtained with MaltParser for several languages in the CoNLL-X Shared Task and CoNLL 2007 Shared Task
12. Precision and Recall of Grouped Dependency Length (Nivre arc-eager algorithm)
13. Accuracy for several parts of speech