Tamil Dependency Parsing: Results Using Rule Based and Corpus Based Approaches

Loganathan Ramasamy and Zdeněk Žabokrtský
Institute of Formal and Applied Linguistics (ÚFAL)
Faculty of Mathematics and Physics, Charles University in Prague
{ramasamy,zabokrtsky}@ufal.mff.cuni.cz

Abstract. Very few attempts at dependency parsing for Tamil have been reported in the literature. In this paper, we report results obtained for Tamil dependency parsing with rule-based and corpus-based approaches. We designed an annotation scheme partially based on the Prague Dependency Treebank (PDT) and manually annotated Tamil data (about 3000 words) with dependency relations. For the corpus-based approach, we used two well known parsers, MaltParser and MSTParser; for the rule-based approach, we implemented a series of linguistic rules (for resolving coordination, complementation, predicate identification and so on) to build dependency structures for Tamil sentences. Our initial results show that both rule-based and corpus-based approaches achieve an accuracy of more than 74% on the unlabeled task and more than 65% on the labeled task. Rule-based parsing accuracy dropped considerably when the input was tagged automatically.

Keywords: Tamil, Dependency parsing, Syntax, Clause boundaries

1 Introduction

The most important resource in Natural Language Processing (NLP) research is data, especially data annotated with linguistic descriptions. Much of the success of NLP in the present decade can be attributed to data driven approaches, which discover rules from data, as opposed to traditional rule based paradigms. The availability of annotated data such as the Penn Treebank [1] and of parallel corpora such as Europarl [2] spurred the application of statistical techniques [4], [5], [3] to tasks such as Part Of Speech (POS) tagging, syntactic parsing and Machine Translation (MT), producing state of the art results compared to their rule based counterparts. Unfortunately, only English and very few other languages have the privilege of such rich annotated data. In this paper, we take up the dependency parsing task for Tamil, a language for which no annotated data is available, and report our initial results of applying rule based and corpus based techniques to this task. For the rule-based approach, the rules (such as rules for coordination and main predicate identification) have been crafted after studying Tamil data in general. The rules often rely on morphological cues to identify governing or dependent nodes. For the corpus-based approach, we used two popular dependency parsers: MaltParser and MSTParser. For the purpose of experimentation, both approaches have been tested on a small syntactically annotated corpus. The annotation is partially based on the popular Prague Dependency Treebank (PDT) [15]. In Section 2, we review previous attempts on this topic and similar work on other Indian languages. Section 3 introduces some important aspects of the Tamil language. Section 4 proposes the annotation scheme, Section 5 introduces the rule-based and corpus-based methods, and the remaining sections present experimental evaluations and discussion.

2 Related Work

Tamil syntactic parsing is little discussed in the literature, though there is some recent work on developing morphological analyzers and POS taggers for Tamil. This is evident from the scarce number of publications that have appeared on this topic. The situation is better for other major Indian languages such as Hindi and Telugu, where there is active research on dependency parsing ([7], [8], [9]) and on developing annotated treebanks. One such effort is a large scale dependency treebank [10] for Telugu (aimed at 1 million words), whose development currently stands [11] at around 1500 annotated sentences. In 2009, an NLP tools contest dedicated to parsing Indian languages (Hindi, Telugu and Bangla) was held as part of the ICON 2009 conference; unfortunately, it did not include Tamil. Two prior works [12], [13] discussed Tamil syntactic parsing. [12] used a morphological analyzer and heuristic rules to identify phrase structures. [13] built a dependency analyzer for spoken Tamil utterances, using the relative position of words to identify semantic relations. The current paper differs from [12] in building dependency trees rather than phrase structure trees, and it gives a more comprehensive treatment of Tamil dependency parsing (in terms of annotation scheme, parsing approaches and experimentation) than the other two papers. A recent paper [14] used a machine learning approach to dependency parsing of Tamil; since no results were reported there, we are not able to compare our results with theirs.

3 General Aspects of the Tamil Language

Tamil is a South Indian language that belongs to the Dravidian family of languages. Other major languages in the Dravidian family include Telugu, Malayalam and Kannada. The main features of Tamil include agglutination, relatively free word order, head finality and subject-verb agreement. Below we touch briefly on these features.

Morphology. Tamil is an agglutinative language [6] with a rich set of morphological suffixes which can be added one after another (mainly) to noun and verb stems. Tamil morphology is mainly concatenative, and derivations are also possible by means of adjectivalization, adverbialization and nominalization. In general, Tamil morphology can be represented [6] as [stem (+affix)^n]. Though there are only eight basic POS categories, with no restriction on how many morphs can be glued together, Tamil morphology poses significant challenges to POS tagging and parsing.

Head Final and Relatively Free Word Order. Tamil is a head final language, meaning that the head of a phrasal category always occurs at the end of the phrase or constituent. Modifiers and other co-constituents always precede the phrasal head. For example, a postposition is the head of a postpositional phrase and is modified by noun phrases. There are very few (identifiable) exceptions, such as the subject of a sentence occurring after the finite verb (head); in most cases, the head final rule is preserved. Tamil is a Subject Object Verb (SOV) language and the word order is relatively free. Within a clause, phrases can be moved to almost any position except the position of the clause head, which should always be a verb. Consider the Tamil sentence 'appA enakku puTTakam kotuTTAr' (Father gave me a book).¹ The free word order nature of Tamil for this sentence is shown below:

appA enakku puTTakam kotuTTAr      S-O2-O1-V
enakku appA puTTakam kotuTTAr      O2-S-O1-V
puTTakam enakku appA kotuTTAr      O1-O2-S-V
puTTakam appA enakku kotuTTAr      O1-S-O2-V
....                               ....
enakku puTTakam kotuTTAr appA      O2-O1-V-S
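The scrambling above can be generated mechanically. A small sketch, using the transliterated tokens of the example sentence and the verb-final constraint stated in the text (the S-final afterthought variant is left out):

```python
from itertools import permutations

# The three argument words (S, O2, O1) of the paper's example sentence;
# the finite verb is kept clause-final, as the text requires.
args = ["appA", "enakku", "puTTakam"]
verb = "kotuTTAr"

variants = [" ".join(p) + " " + verb for p in permutations(args)]
for v in variants:
    print(v)
```

All six verb-final orders of the example sentence are grammatical; only the position of the clause head (the verb) is fixed.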

Agreement. There are two kinds of verbs in Tamil: finite and non finite verbs. Finite verbs usually occur sentence-finally and act as the main verb of the sentence. Finite verbs agree with their subject in number and gender; this is accomplished by coding the number and gender of the subject in the finite verb as suffixes. Both finite and non-finite verbs can take subjects, but only finite verbs code their agreement with subjects. The explicit coding of subject-verb agreement in finite verbs makes the presence of subjects optional in certain situations, and the subject can be partially inferred from finite verb affixes. Table 1 shows that the total number of subjects is less than the total number of finite and non-finite verbs. This implies that some subjects are shared between finite and non finite verbs in sentences, or that some verbs do not take subjects at all. This illustrates that it is not always possible to identify subjects just by knowing the agreement markers.

¹ We have transliterated the Tamil script into Latin script. The transliterated form is used throughout the paper.

Table 1. Counts of subjects and verbs

Category          #Num
Sentences          204
Subjects           332
Finite verbs       203
Non finite verbs   220

4 Annotation Scheme

In this section, we describe the data and propose an annotation scheme for Tamil dependency structures. We collected Tamil data from a news website.² We picked news articles from the website randomly, so as to have our data as diverse as possible. The raw Tamil data was in UTF-8 encoding; we transliterated it into Latin script for ease of processing during annotation and for handling the data programmatically. As of now, the corpus taken for annotation comprises 2961 words. Annotation is divided into two parts: (i) POS tagging of the data and (ii) assigning relations and dependency structure to words. Tagging the data with POS is necessary as our rule based dependency parser often uses the POS tags of words to predict dependency relations.

4.1 POS Tagging

As is the case for any morphologically rich language, providing a fixed tagset for a language such as Tamil is a complex task. Many tagsets have been proposed in the literature, and many methodologies based on finite state machines, Hidden Markov Models (HMM) and Support Vector Machines (SVM) have been proposed for tackling the morphology and tagging of Tamil. The main question here is: what kind of POS tagset is required for our task, a simple POS tagset or a fine grained morphological tagset? Considering the rich suffix base Tamil has [6], we decided to use detailed morphological tags instead of purely categorical ones. For morphologically rich languages, morphological cues can to some extent help in identifying syntactic relationships between words. Unfortunately, no standard exists for the morphological tagging of Tamil. Our manual morphological tagging does not mark every linguistic aspect within a word; rather, our aim is to add to the tag the information needed for identifying syntactic relationships. For example, we tag the word pUkkatai (flower shop) as a single noun rather than as a compound of the two separate nouns pU (flower) and katai (shop). Thus the tag of that word will be "3n_nn", where nn indicates that the word is a noun and 3n indicates that the noun is 3rd person neuter gender. For tagging verbs, to distinguish lexical and auxiliary verbs, we add the prefixes "mv" and "aux" respectively. For example, the lexical verb patiTTAn (read he) is tagged as "mv_pst_3sm_f". In the tag, "pst_3sm" indicates that the verb is a past tense verb whose subject is 3rd person singular masculine; "f" indicates that the verb is finite. Knowing whether a verb is finite or non finite greatly helps in identifying the main predicate of the sentence. In Table 2, Corpus 1 contains the 2961 words that are both tagged and syntactically annotated, while Corpus 2 contains Corpus 1 plus the remaining words of the corpus, which are tagged but not syntactically annotated. The table also provides some insight into how many tags a single token can take. From Table 2, it is clear that more than 90% of the tokens take only a single tag. Tokens with 2 tags vary between 5-8%, and the remaining counts are almost negligible.

² http://www.dinamani.com

Table 2. Number of tags assigned to tokens

                     Corpus 1         Corpus 2
Tagset size          296              459
Lexical verb tags    120              194
Auxiliary verb tags  31               44
# of words           2961             8421
Unique tokens        1634             3747
1 tag count          1534 (93.88%)    3427 (91.46%)
2 tag count          92 (05.63%)      284 (07.58%)
3 tag count          8 (00.49%)       33 (00.88%)
4 tag count          0 (00.00%)       3 (00.08%)
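The tag format described above is positional and underscore-separated. The following sketch decodes a few example tags; the feature names (`verb_type`, `finite`, `agreement`) and the fallback behaviour are our assumptions, and only the example tags themselves come from the text:

```python
# Sketch: decode the paper's morphological tags into features.
# Examples from the text: "3n_nn" (3rd person neuter noun) and
# "mv_pst_3sm_f" (lexical verb, past, 3rd sg. masculine, finite).

def decode_tag(tag):
    parts = tag.split("_")
    feats = {"raw": tag}
    if parts[0] in ("mv", "aux"):          # lexical vs auxiliary verb prefix
        feats["verb_type"] = "lexical" if parts[0] == "mv" else "auxiliary"
        feats["finite"] = parts[-1] == "f" # trailing "f" marks a finite verb
    elif parts[-1].startswith("nn"):       # noun tags end in nn, nnp, nnc, ...
        feats["category"] = "noun"
        feats["agreement"] = parts[0]      # e.g. "3n" = 3rd person neuter
    return feats

print(decode_tag("mv_pst_3sm_f"))
print(decode_tag("3n_nn"))
```

A real tag decoder would need the full tag inventory; this only illustrates why the finiteness bit ("_f") is cheap to read off for the predicate-finding rule described later.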

4.2 Dependency Annotation

Our annotation scheme is partially based on the Prague Dependency Treebank (PDT) [15], [16], a large scale dependency annotation project for the Czech language. The PDT is annotated on three layers: the morphological layer (m-layer), the analytical layer (a-layer) and the tectogrammatical layer (t-layer). In the m-layer, a sentence is annotated at the word level, marking the POS and identifying the lemma of each word; this layer roughly corresponds to the morphological tagging of a sentence. In the a-layer, the sentence is represented as a rooted tree, where each node corresponds to a word in the sentence. Each dependent node is connected with a governing node via an edge which holds the relationship between the two nodes. In PDT style annotation, the edges are not labeled; rather, each (analytical) node is assigned an analytical function (afun) which acts as the relation with which it connects to its governing node. In the t-layer, a tree represents the deep syntactic structure of a sentence; its nodes correspond to autosemantic words, which have lexical meaning. Words corresponding to prepositions, determiners and other function words are not represented in the t-layer. There is also one more layer, called the w-layer, which simply represents the input sentence or tokens (prior to the m-layer). Our annotation involves only the m-layer and the a-layer. In the m-layer, words are annotated with morphological tags as described in the previous subsection (POS tagging). As of now, for the a-layer we have defined 19 analytical functions, 13 of which come from the PDT. The list of analytical functions is given in Table 3. The a-layer annotation is performed using the TrEd annotation tool [17] after parsing the text with the rule-based parser (Section 5.1). A sample annotation is shown in Figure 1.
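A minimal way to hold such an a-layer tree is a parent-index array plus an afun per token. The sentence and labels below follow Figure 1; the exact attachments are our reading of the figure, and the container itself is ours:

```python
# Sketch of an a-layer-style representation: each token stores its parent's
# 1-based position (0 = technical root) and its analytical function (afun).
tokens = [
    ("katawTa",      3, "Atr"),   # 1
    ("aTimuka",      3, "Atr"),   # 2
    ("Atciyil",      8, "NR"),    # 3
    ("latcumi",      5, "Atr"),   # 4
    ("pirAnEsh",     8, "Sb"),    # 5
    ("Talaimaic",    7, "Atr"),   # 6
    ("ceyalALarAka", 8, "Atr"),   # 7
    ("iruwTAr",      0, "Pred"),  # 8
]

def children(of):
    """Return the word forms whose parent index equals `of`."""
    return [w for w, p, _ in tokens if p == of]

print(children(0))   # tokens attached to the technical root
print(children(8))   # dependents of the main predicate iruwTAr
```

Only the finite verb hangs from the technical root (afun Pred); everything else attaches below it, mirroring the head final character of the language.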

Fig. 1. Sample annotation using the TrEd tool (sentence: katawTa aTimuka Atciyil latcumi pirAnEsh Talaimaic ceyalALarAka iruwTAr; the finite verb iruwTAr, tagged mv_pst_3sh_f, is the Pred, and pirAnEsh is the Sb)

Atr, Adv, AuxA, AdjCl. Atr is used to mark nodes in an attribute relationship to their governing nodes. In Figure 2a), the word walla (good) is in an attribute relation to its governing node maniTan (human). In noun-noun combinations, the first noun is in an attribute relationship to the second noun (the head). Similarly, Adv and AuxA are used to mark adverbials and determiners, which modify verbs and nouns. Adjectival clauses in Tamil are equivalent to relative clauses in English. Though adjectival heads act as modifiers to the following nouns, they still retain their verbal properties and take arguments on their left. Adjectival clause heads are marked with AdjCl.

Coord, AuxC, AuxX. Coordination is one of the complex phenomena in Tamil. [6] discusses only one type of coordination, which uses a morphological suffix as the coordination device. But coordination can be marked morphologically with -um ("and"), with a separate word maRRum ("and"), with commas, or with a combination of any of the previous three. These devices can coordinate individual words, phrases and clauses of the same categorical status. The role which the coordination node takes depends on the type of conjuncts it coordinates. Here we list the most common coordination patterns we identified in the corpus. If A, B, C and D are elements to be conjoined via coordination, then the patterns for coordination (and, or) are listed in Table 4 (see also Figure 3).

Table 3. Analytical functions for a-layer annotation

AFUN    Description                  Example
AdjCl   Adjectival clauses           inRu wataipeRRa pOttiyil (today take-place-which match-loc)
Adv     Adverbial                    inimEl watakkATu (hereafter happen-will-not-it)
Atr     Attribute                    iwTiya aracu (Indian government)
AuxA    Determiners                  iwTap paiyan (this boy)
AuxC    Conjuncts                    rAman maRRum cITA (Ram and Sita)
AuxK    Terminal punctuation         –
AuxP    Postposition                 pUmikkuk kIzE (earth-dat under)
AuxV    Auxiliary                    cAppAtu cAppittu vittAn (food eat-pst leave-pst-3sm)
AuxX    Comma                        –
CondCl  Conditional clause           avar ennai azaiTTAl (he-honorific me-acc call-if)
Coord   Coordination node            rAman maRRum cITA (Ram and Sita)
InfCl   Infinitive clause            puTTakam kotukka rAman ... (book give-inf Ram ...)
NR      Dependency not defined       –
Obj     Direct object                rAman oru puTTakam kotuTTAr (Ram a book give-3s-honorific)
PObj    Postpositional object        rAmanaip paRRi (Ram-acc about)
Pred    Predicate of the sentence    ramEsh paNam kotuTTAn (Ramesh money give-pst-3sm)
Sb      Subject                      ramEsh paNam kotuTTAn (Ramesh money give-pst-3sm)
VbpCl   Verbal participle clause     kAcu kotuTTu mittAy (money give-pst candy)
VFin    Finite verb (not predicate)  paricu kotuTTAn enRu kURinAn (gift give-pst-3sm that say-pst-3sm)

Table 4. Coordination patterns

S.No  Pattern                        Coordination type  Coordination head
1     ... A-um ... B-um ...          and (um)           B
2     ... A-um , ... B-um ...        and (um)           ,
3     ... A, ... B, ... C maRRum D   and (maRRum)       maRRum
4     ... A-O ... B-O ...            or (O)             B
5     ... A-O allaTu ... B-O ...     or (O)             allaTu
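A sketch of how patterns 1 and 3 of Table 4 might be detected on a token sequence. The function name and the plain suffix test for -um are our assumptions; a real implementation would consult morphological tags rather than string endings:

```python
# Sketch: locate a coordination head using two of the Table 4 patterns:
# the separate word maRRum (pattern 3) and the clitic -um on both
# conjuncts (pattern 1). Returns the 0-based index of the head, or None.

def find_coord_head(words):
    if "maRRum" in words:                 # pattern 3: A, B, C maRRum D
        return words.index("maRRum")
    um = [i for i, w in enumerate(words) if w.endswith("um")]
    if len(um) >= 2:                      # pattern 1: ... A-um ... B-um ...
        return um[-1]                     # the last -um conjunct heads the group
    return None

print(find_coord_head(["rAman", "maRRum", "cITA"]))        # maRRum at index 1
print(find_coord_head(["rAmanum", "cITAvum", "vawTanar"])) # last -um at index 1
```

Real text mixes all the devices of Table 4 (including commas), which is part of why Coord turns out to be the hardest label in the experiments below.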

Fig. 2. Illustration of the Atr, Adv, AuxA and AdjCl analytical functions on the w-, m- and a-layers: (a) walla maniTan (good man), (b) awTa walla maniTan (that good man), (c) wanRAka patiTTa paiyan (the boy who studied well)

Fig. 3. Illustration of coordination: (a) A-um, B-um (A, B), (b) A, B, C maRRum D (A, B, C and D), (c) A-O allaTu B-O (A or B)

The main issue in coordination is that, when elements are conjoined, the coordinating node should assume the role of the conjoined elements and should attach to the node where the conjoined elements themselves would have attached. At present, during annotation, the category of the conjoined elements is not reflected in the coordinating node.

5 Approaches to Parsing

In this section, we describe the two approaches (rule based and corpus based) we have used to parse Tamil sentences.

5.1 Rule Based Parsing

Algorithm 5.1 returns an array containing the parent of each word token, which is equivalent to the unlabeled edges of the dependency tree. Each element of this array is an integer representing the position of the parent word in the sentence; the array index corresponds to the word position of the child node. We have defined four main procedures inside the algorithm, which update the parent (p) and is_parent_available (ipa) arrays whenever the linguistic rules defined inside them are satisfied. For instance, the procedure identify_main_predicate looks for the main predicate of the sentence. If it is found, the parent of that node is updated (to 0 in this case; 0 is the technical root of the tree). Once the parent is found, finding a parent again for this node is prohibited via the ipa array, which sets a boolean flag once the parent of a node is found. Each procedure implements a set of linguistic rules which assign parents to words in the sentence. The predicate finding rule, for instance, is very simple: if the last word of the sentence is a finite verb, i.e. its morphological tag ends with '_f', then the parent of that node is 0, the technical root of the tree.

Algorithm 5.1: GetUnlabeledEdges(words, mtags)
  comment: set parent (p) and is_parent_available (ipa) arrays to 0
  comment: size(p) = size(ipa) = size(words)
  p <- 0
  ipa <- 0
  identify_main_predicate(p, ipa, words, mtags)
  resolve_coordination(p, ipa, words, mtags)
  identify_trivial_parents(p, ipa, words, mtags)
  process_complements(p, ipa, words, mtags)
  return (p)

  procedure set_parent(node, parent)
    acyclicity <- check_acyclicity()
    if not acyclicity and ipa[node] != 1
      then p[node] <- parent
           ipa[node] <- 1
    return
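The predicate rule just stated can be sketched as follows. The array conventions mirror Algorithm 5.1, except that we use -1 (rather than 0) for "no parent yet" to keep it distinct from the technical root; the function and variable names are ours:

```python
# Sketch of the main-predicate rule: if the sentence-final token carries a
# finite-verb tag (ending in "_f"), attach it to the technical root (0) and
# lock it via the ipa array so later procedures cannot re-parent it.

def identify_main_predicate(p, ipa, words, mtags):
    last = len(words) - 1
    if mtags[last].endswith("_f") and ipa[last] == 0:
        p[last] = 0      # parent = technical root of the tree
        ipa[last] = 1    # parent found; node is locked

words = ["ramEsh", "paNam", "kotuTTAn"]
mtags = ["3sm_nnp", "3n_nn", "mv_pst_3sm_f"]
p, ipa = [-1] * 3, [0] * 3
identify_main_predicate(p, ipa, words, mtags)
print(p, ipa)
```

This is why Pred is the easiest label in the evaluation: the rule only needs the finiteness bit of the final tag.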

The procedure resolve_coordination tries to locate the coordination head and the conjuncts. If the coordination head is found, the conjuncts are located and their parent is set to the coordination head. Setting the parent of the coordination head itself is the complicated part: the procedure tries to assign the parent based on the conjoined elements. This procedure implements the coordination rules specified in Table 4. The procedure identify_trivial_parents tries to locate modifiers such as adjectives, determiners, cardinals and ordinals, and sets their parent to the immediate phrasal head following the modifier. In the case of postpositions, which act as the PP head, the task is to identify the noun phrase preceding them and attach it as a child of the postpositional head. The procedure process_complements takes care of complementation in Tamil. Tamil uses various devices to perform complementation, of which the most common are non finite clauses, certain lexical verbs such as 'en' (say), and a few nouns and postpositions used as complementizers.
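A sketch of the identify_trivial_parents idea for modifiers: attach each modifier to the first noun head on its right. The tag names used here are hypothetical simplifications of the paper's tagset:

```python
# Sketch: modifiers (adjectives, determiners, cardinals, ordinals) attach
# to the nearest following noun head, per the head final rule. Tag names
# (adj, det, crd, ord) are illustrative, not the paper's exact inventory.

MODIFIER_TAGS = {"adj", "det", "crd", "ord"}

def identify_trivial_parents(p, ipa, words, mtags):
    for i, tag in enumerate(mtags):
        if tag in MODIFIER_TAGS and ipa[i] == 0:
            for j in range(i + 1, len(words)):
                if mtags[j].endswith("nn"):   # first noun head to the right
                    p[i] = j + 1              # parent positions are 1-based
                    ipa[i] = 1
                    break

words = ["awTa", "walla", "maniTan"]          # "that good man" (Figure 2b)
mtags = ["det", "adj", "3sm_nn"]
p, ipa = [-1, -1, -1], [0, 0, 0]
identify_trivial_parents(p, ipa, words, mtags)
print(p)   # both modifiers attach to maniTan at position 3
```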

The presence of a complementizer indicates that some clause ends at that point. If there is an explicit complementizer, it acts as the head of the clausal predicate preceding it; the complementizer then attaches itself to the appropriate node that follows it. In the same procedure, we also perform clause boundary identification and attach arguments to the clausal predicates. A clausal predicate signals the end of a clause. Based on morphological tags, we identify these clause boundaries and attach the arguments of each clause to its clausal predicate. Suppose there are three clauses inside a sentence: the procedure identifies the 3 clausal predicates and attaches the arguments that belong to them. In Tamil, clauses usually occur in linear order, so the arguments always appear to the left of a clausal predicate; thus the arguments of clause 2 start immediately at the clause 1 boundary and end before the second clausal predicate. Once the clausal predicates and their arguments are attached, the clausal predicates are attached either to complementizers, or to nouns (in the case of adjectival clauses), or to other clauses. At present, the procedure works only if the clauses are in linear order; if there is an embedded clause, the procedure often attaches wrong arguments to the clausal predicates. The rule-based parsing system is implemented within the TectoMT framework [18], a highly modular NLP system for Machine Translation and other related NLP tasks.

Label Assignment. Once the unlabeled edges are identified, this procedure assigns labels to nodes. The procedure again makes use of the morphological tags of parents and their children. For example, the children of a postpositional node are assigned the PObj label, and if the case of a node is accusative and its parent is a verb, the node is assigned Obj. This procedure labels all the nodes present in the sentence.
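The two label rules quoted above can be sketched as a tag-driven lookup. The postposition tag "pp" and the accusative marker "acc" inside the tag string are our assumptions about the tagset; only the PObj/Obj rules themselves come from the text:

```python
# Sketch of the label-assignment pass: read the label off the morphological
# tags of a node and its parent. Only the two rules stated in the text are
# shown; everything else falls back to NR (dependency not defined).

def assign_label(node_tag, parent_tag):
    if parent_tag.startswith("pp"):                     # parent is a postposition
        return "PObj"
    if "acc" in node_tag and parent_tag.startswith(("mv", "aux")):
        return "Obj"                                    # accusative under a verb
    return "NR"

print(assign_label("acc_3n_nn", "mv_pst_3sm_f"))  # accusative noun under verb
print(assign_label("3n_nn", "pp"))                # noun under postposition
```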
All tasks of the rule-based parser have been implemented as modules inside the TectoMT system. The evaluation of this parser is given in the next section.

5.2 Corpus Based Approach

For the corpus based approach we use two well known parsers, MaltParser [20] and MSTParser [19]. The evaluation of these parsers on our data is given in the next section.

6 Experiments and Results

We manually tagged 8421 tokens with morphological tags. Of these, 2961 tokens are both morphologically tagged and dependency annotated. Refer to Section 4 and Table 2 for more information about the nature of the data. We name the corpora as follows: the 2961 token corpus is called C1, the 8421 token corpus C2, and the 5500 token difference (C2-C1) C3.

Experiments for Rule Based (RB) parsing and Corpus Based (CB) parsing were done with different settings. Though a direct comparison cannot be made, we can find some similarities in the performance on individual labels. The main input for RB parsing is a sentence and its morphological tags. Two experiments were conducted for RB parsing; in both, the RB approach was tested on the whole dataset (2961 tokens). In the first experiment, morphological tags for the input tokens were assigned automatically by the TnT tagger [21] trained on C3 (which tagged C1 with 72.34% accuracy). In the second experiment, the input tokens were provided to the RB parser with gold standard morphological tags. Table 5 (left) shows the accuracy of the RB parser; the column Auto POS corresponds to the first experiment and Manual POS to the second. We can see the sharp decline in performance (both unlabeled and labeled) when the input tokens are not given gold standard tags. Individual label performance is given in Table 6. The only label with high performance in both experiments is Pred, the main predicate of the sentence. Since AuxK marks sentence-terminal punctuation, it was not included. The worst performing label in both experiments is Coord. The precision of the subject label Sb is also low, due to the difficulty of identifying the right subject when more than one nominative qualifies for subject status. The performance of Sb may further decrease when the subject of a clause occurs outside its boundary in the case of embedded clauses.
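The unlabeled and labeled accuracies reported in Table 5 are standard attachment scores. A sketch of the computation; the parallel-list input format is ours, not a format used by either parser:

```python
# Sketch: unlabeled attachment score counts correct parents; labeled
# attachment score additionally requires the correct afun. Each token is
# a (parent_position, afun) pair; gold and predicted lists are parallel.

def attachment_scores(gold, pred):
    assert len(gold) == len(pred)
    uas = sum(g[0] == s[0] for g, s in zip(gold, pred)) / len(gold)
    las = sum(g == s for g, s in zip(gold, pred)) / len(gold)
    return uas * 100, las * 100

gold = [(3, "Atr"), (3, "Sb"), (0, "Pred")]
pred = [(3, "Atr"), (3, "Obj"), (0, "Pred")]   # one mislabeled edge
print(attachment_scores(gold, pred))
```

With all parents correct but one label wrong, the unlabeled score is 100% while the labeled score drops to two thirds, which mirrors the UAS/LAS gap visible throughout Table 5.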

Table 5. Parsing accuracy: a) rule based, b) corpus based

a) Rule based
            Auto POS   Manual POS
Unlabeled   71.94      84.73
Labeled     61.70      79.13

b) Corpus based
            MaltParser  MSTParser
Unlabeled   75.03       74.92
Labeled     65.69       65.69

For CB parsing, we divided the corpus C1 into two parts: a training set (2008 tokens) and a test set (953 tokens). Both the Malt and MST parsers were trained on the same training set and evaluated on the same test set. Table 5 (right) shows the accuracy of both parsers; they perform almost identically on both the labeled and unlabeled tasks. In terms of individual label performance (Table 7), labeled precision is high for Sb and low for Coord. Some individual labels show the same level of performance in both RB and CB parsing, for instance Sb and Coord. As a general remark, the current experiments for both the RB and CB tasks were done with a very small amount of data, yet we were able to correlate the individual performance of some labels. As observed from the labeled precision of both the RB and CB tasks, some labels are easily predictable (Pred) and some, such as Coord and Sb, are not. As far as non-projective structures are concerned, the study [22] found that 14% of Hindi treebank structures are non-projective; for Tamil, non-projectivity may arise in embedded clauses.
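Non-projectivity can be checked directly on a parent array: a tree is projective if, for every edge, the head (transitively) dominates every word lying between the head and its child. A sketch, using a 1-based parent array of our own design (0 = root):

```python
# Sketch: projectivity test on a dependency tree given as a 1-based
# parent array. An edge head-child is projective iff every position
# strictly between them is dominated (transitively) by the head.

def is_projective(parents):
    def dominates(head, node):
        while node != 0:
            if node == head:
                return True
            node = parents[node - 1]    # climb to the parent
        return head == 0                # only the root dominates everything
    for child in range(1, len(parents) + 1):
        head = parents[child - 1]
        lo, hi = min(head, child), max(head, child)
        for k in range(lo + 1, hi):
            if not dominates(head, k):
                return False            # word k escapes the edge's span
    return True

print(is_projective([2, 0, 2]))      # simple projective tree
print(is_projective([3, 4, 0, 3]))   # edges 3->1 and 4->2 cross
```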

Table 6. Precision/Recall of the rule based parser: a) Auto POS, b) Manual POS

        a) Auto POS        b) Manual POS
Afun    Precision  Recall  Precision  Recall
AdjCl    48.31   78.18      81.94   98.33
Adv      43.75   72.41      87.27   97.96
Atr      68.26   79.93      87.08   97.18
AuxA     78.95  100.00     100.00  100.00
AuxC     68.89   48.44      87.50   67.74
AuxK    100.00   99.03      99.03  100.00
AuxP     50.88   61.70      82.54  100.00
AuxV     50.00   39.39      69.44  100.00
AuxX     44.92  100.00      52.54  100.00
CondCl   50.00   16.67      57.14  100.00
Coord    33.33   26.92      59.09   41.94
InfCl    75.68   66.67      93.33  100.00
NR       57.17   53.99      73.64   73.37
Obj      48.28   65.12      81.05   73.33
PObj     49.06   66.67      70.37   97.44
Pred     96.06   96.06      97.07   98.03
Sb       44.73   74.55      58.91   94.93
VFin     48.00   80.00      91.18  100.00
VbpCl    53.52   61.29      81.08  100.00

Table 7. Precision/Recall of corpus based parsing: a) MaltParser, b) MSTParser

        a) MaltParser      b) MSTParser
Afun    Precision  Recall  Precision  Recall
AdjCl    71.43   71.43      76.00   90.48
Adv      57.89  100.00      50.00  100.00
Atr      91.46   90.55      84.79   93.40
AuxA     91.67   84.62      92.31   92.31
AuxC    100.00   84.62      46.67   58.33
AuxK    100.00   96.72     100.00   96.72
AuxP     53.57  100.00      50.00  100.00
AuxV    100.00   26.67      66.67   28.57
AuxX     40.74  100.00      50.00  100.00
CondCl       –    0.00          –    0.00
Coord    50.00   25.00      20.00   16.67
InfCl    82.35   70.00      83.33   75.00
NR       41.28   75.32      42.86   73.94
Obj      38.46   66.67      51.72   78.95
PObj     56.25   52.94      44.44   47.06
Pred     96.61   96.61      80.60   91.53
Sb       59.38   55.88      70.73   56.31
VFin     77.78   87.50      71.43   62.50
VbpCl    68.42   56.52      78.26   75.00

7 Conclusion and Future Work

In this paper, we reported our initial experiments with dependency parsing for Tamil using both rule based and corpus based approaches. For the rule based approach, the labeled accuracy reached 79% when input tokens were provided with gold morphological tags and declined to 61% when tested in a real world scenario. For the corpus based approach, the labeled accuracy stood at around 65% (for both the Malt and MST parsers) and the unlabeled accuracy at around 75%. From the experiments, we observed that both the rule based and corpus based approaches have difficulty identifying coordination nodes (Coord), while both perform well in identifying the main predicate of the sentence for unseen cases. Tagging accuracy also had an impact on performance (as shown by the rule based parser). The current experiments were done with very small data; more insights can be gained and accuracy improved with more data. In the future, we plan to add more annotated corpora to our experiments.

Acknowledgement. The research leading to these results has received funding from the European Commission's 7th Framework Programme (FP7) under grant agreement n° 238405 (CLARA), and from grant MSM 0021620838. We would also like to thank the anonymous reviewers for their useful comments.

References

1. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a Large Annotated Corpus of English: the Penn Treebank. Comput. Linguist. 19, 313–330 (1993)
2. Koehn, P.: Europarl: A Parallel Corpus for Statistical Machine Translation. In: MT Summit (2005)
3. Koehn, P., Och, F.J., Marcu, D.: Statistical Phrase-Based Translation. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 48–54. Association for Computational Linguistics (2003)
4. Ratnaparkhi, A.: A Maximum Entropy Model for Part-Of-Speech Tagging. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 133–142 (1996)
5. Collins, M.: Head-Driven Statistical Models for Natural Language Parsing. Comput. Linguist. 29, 589–637 (2003)
6. Lehmann, T.: A Grammar of Modern Tamil. Pondicherry Institute of Linguistics and Culture (PILC), Pondicherry, India (1989)
7. Bharati, A., Sangal, R.: Parsing Free Word Order Languages in the Paninian Framework. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 105–111. Association for Computational Linguistics (1993)
8. Bharati, A., Gupta, M., Yadav, V., Gali, K., Sharma, D.M.: Simple Parser for Indian Languages in a Dependency Framework. In: Proceedings of the Third Linguistic Annotation Workshop (ACL-IJCNLP 2009), pp. 162–165. Association for Computational Linguistics (2009)

9. Nivre, J.: Parsing Indian Languages with MaltParser. In: Proceedings of the ICON09 NLP Tools Contest: Indian Language Dependency Parsing, pp. 12–18 (2009)
10. Begum, R., Husain, S., Dhwaj, A., Sharma, D., Bai, L., Sangal, R.: Dependency Annotation Scheme for Indian Languages. In: Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP), Hyderabad, India (2008)
11. Vempaty, C., Naidu, V., Husain, S., Kiran, R., Bai, L., Sharma, D., Sangal, R.: Issues in Analyzing Telugu Sentences towards Building a Telugu Treebank. In: Gelbukh, A. (ed.) CICLing 2010, LNCS 6008, pp. 50–59 (2010)
12. Saravanan, K., Ranjani, P., Geetha, T.V.: Syntactic Parser for Tamil. In: Proceedings of the Sixth Tamil Internet 2003 Conference, Chennai, India (2003)
13. Janarthanam, S., Nallasamy, U., Ramasamy, L., Santhoshkumar, C.: Robust Dependency Parser for Natural Language Dialog Systems in Tamil. In: Proceedings of the 5th Workshop on Knowledge and Reasoning in Practical Dialogue Systems (IJCAI KRPDS-2007), pp. 1–6, Hyderabad, India (2007)
14. Dhanalakshmi, V., Anand Kumar, M., Rekha, R.U., Soman, K.P., Rajendran, S.: Grammar Teaching Tools for Tamil Language. In: Technology for Education Conference (T4E 2010), pp. 85–88, India (2010)
15. Hajič, J.: Building a Syntactically Annotated Corpus: The Prague Dependency Treebank. In: Issues of Valency and Meaning, pp. 106–132. Karolinum, Prague (1998)
16. The Prague Dependency Treebank 2.0, http://ufal.mff.cuni.cz/pdt2.0/
17. Tree Editor TrEd, http://ufal.mff.cuni.cz/~pajas/tred/
18. Žabokrtský, Z., Ptáček, J., Pajas, P.: TectoMT: Highly Modular MT System with Tectogrammatics Used as Transfer Layer. In: Proceedings of the Third Workshop on Statistical Machine Translation (StatMT '08), pp. 167–170. ACL (2008)
19. McDonald, R., Crammer, K., Pereira, F.: Online Large-Margin Training of Dependency Parsers. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 91–98. ACL (2005)
20. Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kübler, S., Marinov, S., Marsi, E.: MaltParser: A Language-Independent System for Data-Driven Dependency Parsing. Natural Language Engineering 13, 95–135 (2007)
21. Brants, T.: TnT - A Statistical Part-of-Speech Tagger. In: Proceedings of the Sixth Conference on Applied Natural Language Processing (ANLP-2000), Seattle (2000)
22. Mannem, P., Chaudhry, H., Bharati, A.: Insights into Non-projectivity in Hindi. In: Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pp. 10–17. ACL (2009)