DEPENDENCY PARSER FOR TAMIL USING MACHINE LEARNING APPROACH
A PROJECT REPORT Submitted by
DHIVYA R CB.EN.P2CEN09009 in partial fulfillment for the award of the degree of MASTER OF TECHNOLOGY IN COMPUTATIONAL ENGINEERING AND NETWORKING
Center for Excellence in Computational Engineering and Networking AMRITA SCHOOL OF ENGINEERING, COIMBATORE
AMRITA VISHWA VIDYAPEETHAM COIMBATORE 641 112
JULY 2011
AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING, COIMBATORE-641 112.
BONAFIDE CERTIFICATE
This is to certify that the project report entitled "DEPENDENCY PARSER FOR TAMIL USING MACHINE LEARNING APPROACH" submitted by DHIVYA R (Reg. No.: CB.EN.P2CEN09009) in partial fulfillment of the requirements for the award of the Degree of Master of Technology in COMPUTATIONAL ENGINEERING AND NETWORKING is a bonafide record of the work carried out under my guidance and supervision at Amrita School of Engineering, Coimbatore.
Dr. K P Soman,
Project Guide,
Professor & Head, CEN.
This project report was evaluated by us on ……………
INTERNAL EXAMINER
EXTERNAL EXAMINER
AMRITA VISHWA VIDYAPEETHAM AMRITA SCHOOL OF ENGINEERING, COIMBATORE
DEPARTMENT OF COMPUTATIONAL ENGINEERING AND NETWORKING
DECLARATION

I, DHIVYA R (Reg. No. CB.EN.P2CEN09009), hereby declare that this project report, entitled "DEPENDENCY PARSER FOR TAMIL USING MACHINE LEARNING APPROACH", is a record of the original work done by me under the guidance of
Dr. K P SOMAN, Head of the Department, Department of
COMPUTATIONAL ENGINEERING AND NETWORKING, Amrita School of Engineering, Coimbatore, and that this work has not formed the basis for the award of any degree/diploma/associateship/fellowship or a similar award to any candidate in any university, to the best of my knowledge.
DHIVYA R
Place:
Date:

COUNTERSIGNED
Dr. K P Soman
Head of the Department
Computational Engineering and Networking
Acknowledgement

First and foremost, I thank God for bestowing his blessings on me, which has enabled me to complete the project with astounding success. I would like to thank my parents for their support and belief, which have led to the success of the project. I pay obeisance to our divine mother, Satguru Sri Mata Amritanandamayi Devi; she has been the leading light and the motivating force behind all my aspirations and accomplishments. Very special thanks to Prof. K P Soman, Head of the Department, Computational Engineering and Networking, and my guide, for his constant encouragement and consistent support. I would like to express my heartfelt gratitude to Mr. M Anand Kumar and Mrs. V Dhanalakshmi, Research Associates, Centre for Excellence in Computational Engineering and Networking, who inspired me, provided consistent support and timely help in every possible way, and constantly monitored the progress to make this project successful. I would like to thank the project coordinator Dr. M Sabarimalai Manikandan, Assistant Professor, Department of Electronics and Communication Engineering, and all the other staff members of the CEN department who have helped me. Finally, I would like to thank all the staff members and friends who have helped, guided and encouraged me directly or indirectly during the project work.
Abstract

Dependency parsing is one of the most important tasks in Natural Language Processing: the task of analyzing the dependency structure, that is, the relationships between the words in a sentence. For a free word order language like Tamil, a dependency parser suits best for extracting the relations between the words of a sentence. In this project, the MALT tool is used to obtain the dependency tree structure. The features for the parser include the part-of-speech tag and the chunk tag. Part-of-speech tags are identified using the SVM tool; for the chunking process, the CRF tool is used to create a model and to tag the data. The model created for the parser differs from the existing system in that it includes the chunk tag, and the tag set used is also different. The parsed data is given to the GraphViz tool to obtain the dependency tree structure. The developed model gives satisfactory results.
Table of Contents

Abstract
Table of Contents
List of Figures
List of Tables
CHAPTER 1. INTRODUCTION
  1.1 Motivation
  1.2 Problem Statement
  1.3 Issues in Dependency Parsing
  1.4 Objectives
  1.5 Theoretical Background
    1.5.1 Parsing
    1.5.2 Types of Parsing
  1.6 Contribution
  1.7 Thesis Overview
CHAPTER 2. LITERATURE SURVEY
  2.1 Various Works for Dependency Parser
  2.2 Works for the Clause Boundary Identification
  2.3 Summary
CHAPTER 3. DEPENDENCY PARSING
  3.1 Introduction
  3.2 Dependency Graph
  3.3 Classes of Dependency Parsing
  3.4 Data-driven Dependency Parsing
    3.4.1 Machine Learning Approach
    3.4.2 Transition-based Dependency Parsing
    3.4.3 Graph-based Dependency Parsing
  3.5 Grammar-based Dependency Parsing
    3.5.1 Context-free Dependency Parsing
    3.5.2 Constraint-based Dependency Parsing
    3.5.3 Comparison of Transition-based with Graph-based Approaches
  3.6 Types of Dependency Parsing
    3.6.1 Projective Dependency Parsing
    3.6.2 Non-projective Dependency Parsing
  3.7 Applications of Dependency Parsing
  3.8 Summary
CHAPTER 4. METHODOLOGY
  4.1 Introduction
  4.2 Modifications to the Existing System
CHAPTER 5. IMPLEMENTATION
  5.1 Introduction
  5.2 POS Tagging
    5.2.1 POS Tagger
    5.2.2 POS Tag set
    5.2.3 Explanation of the POS Tag set
    5.2.4 SVM Tool
  5.3 Chunking
    5.3.1 Chunk Tag set
    5.3.2 Explanation of the Tag set
    5.3.3 CRF Tool
  5.4 Dependency Parsing
    5.4.1 Dependency Tags
    5.4.2 Explanation of the Dependency tag sets
    5.4.3 Dependency Head and Dependency Relation
    5.4.4 MALT Tool
  5.5 Learning Algorithm for MALT Tool
    5.5.1 Shift Reduce Parser
    5.5.2 Example for SR Parser
    5.5.3 Classifier based Parsing
    5.5.4 SVM for MALT
  5.6 Tree Generator
  5.7 Corpus Development
  5.8 Sample Output
  5.9 Results
  5.10 Summary
CHAPTER 6. CONCLUSION AND FUTURE WORK
REFERENCES
List of Figures

3.1 An example of dependency graph
3.2 An example of labeled dependency graph
3.3 Example of projective dependency parsing
3.4 Example of non-projective dependency parsing
4.1 Block diagram of the proposed method
5.1 Output of tree generator
5.2 GUI for the developed parser system
5.3 Sample output from the parser for a simple sentence in example 1
5.4 Sample output from the parser for a simple sentence in example 2
5.5 Sample output from the parser for a simple sentence in example 3
5.6 Sample output from the parser for a complex sentence in example 4
5.7 Sample output from the parser for a complex sentence in example 5
5.8 Sample output from the parser for a complex sentence in example 6
5.9 Sample output from the parser for a more complex sentence in example 7
List of Tables

5.1 POS tag set
5.2 Chunk tag set
5.3 Dependency tag set 1
5.4 Dependency tag set 2
5.5 Shift Reduce parser actions
CHAPTER 1
INTRODUCTION

1.1 Motivation

Machine translation is becoming a very important task. This is because many important resources such as news, weather reports, annual reports and technical books are available in English, but people who know only the local language cannot use these resources even though they are available. Human translation is an expensive and time-consuming job, so we turn to machine translation. Machine translation, however, requires parsed output for translation. Parsing gives the structural analysis of a sentence. There exist tools like the Stanford parser which give dependency information about a sentence, but that tool gives the dependency structure only for English; there are no such efficient tools for Tamil. Since Tamil is a free word order language, dependency parsing suits it best for obtaining the structural information of a sentence. Because dependency parsing gives the subject and object information of a sentence, it is very helpful in machine translation. Dependency parsing is also a very useful tool for sentence structure identification. For a language like English, it can easily be said that the sentence structure follows the SVO (Subject Verb Object) format. Though the sentence structure for Tamil is SOV (Subject Object Verb), sentences need not follow that format. Tamil follows a free word order, so a sentence can be written in any pattern; even if the word order is changed, the exact meaning of the sentence is still conveyed. In such free word order languages, dependency parsing gives the sentence structure, and sentence structure identification is itself an important application of dependency parsing.

Clause boundary identification is a very important task in natural language processing and serves as a key process in machine translation. If a sentence is too long, it can be split based on its clauses, which makes translation much easier; that is why clause boundary identification plays a key role. The clauses in a sentence may be joined using connectives, or may be embedded within other clauses. Clauses connected using connectives can be split easily. Connectives are words connecting two clauses; they are mostly conjunctions and, in rare cases, prepositions. Sometimes two clauses are separated only by commas, in which case translation is easy. In the other cases, identifying the clauses in the sentence itself becomes a complex task. Clause boundary identification plays a major role in machine translation in the case of embedded clauses, and that is why we have added one more tag to our tag set and identified the clause boundary. Dependency parsing is also useful in information retrieval and relation extraction: it tells us how the words in a sentence are related and what type of relationship exists between them. Considering these various uses, an efficient dependency parsing tool needs to be developed, and that is the motivation behind the many research attempts at developing one.
1.2 Problem Statement

The task of dependency parsing is to obtain the dependency tree structure for a given input sentence. This tree tells us how the words in the sentence are related and what kind of relationship exists between them. From the tree structure, the head and the modifier of each word in the sentence can be identified.
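To make the head/modifier terminology concrete, the fragment below (an illustrative sketch, not part of this report's implementation) encodes a dependency tree for the transliterated Tamil sentence "raaman puththagam padiththaan" ("Raman read a book"); the label names SUB, OBJ and ROOT are placeholders, not the project's actual tag set.

    # Heads are 1-based token indices; head 0 denotes the artificial root.
    sentence = ["raaman", "puththagam", "padiththaan"]
    heads = [3, 3, 0]                    # both nouns depend on the verb
    relations = ["SUB", "OBJ", "ROOT"]   # placeholder relation labels

    for i, word in enumerate(sentence):
        head = "ROOT" if heads[i] == 0 else sentence[heads[i] - 1]
        print(f"{word} --{relations[i]}--> {head}")

Here the verb padiththaan is the root of the tree, and the two nouns are its dependents (modifiers).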
1.3 Issues in Dependency Parsing

A dependency parser identifies the structure of a sentence. Deciding upon the tag set to be used to parse the sentences is considered the main issue of this project. The Stanford parser uses the Penn Treebank tag set for parsing English sentences; that tag set contains 64 tags and is very fine-grained, but the shallow parser used for machine translation does not require all of that information, so deciding upon the tag set to be used is a big issue. There also exist various other parsing tools, of which MST and MALT are the major parsers; deciding which parser to use was the other issue in this project.
1.4 Objectives

The main scope of this project is to generate the tree structure for a sentence. The dependency tree should contain:
• The head-dependent relation
• The type of relation that exists between the head and the dependent
• The root of the sentence
The aim is to develop an efficient parser that gives all the above information. To develop an efficient parser, we need an efficient POS tagger and chunker so that there won't be any problem in parsing.
1.5 Theoretical Background

1.5.1 Parsing

Parsing is an important process in Natural Language Processing (NLP) and Computational Linguistics (CL), used to understand the syntax and semantics of natural language sentences with respect to a grammar. Parsing is the automatic analysis of text according to a grammar; technically, the term refers to the practice of assigning syntactic structure to a text. A parser is a computational system which processes an input sentence according to the productions of the grammar and builds one or more constituent structures, called parse trees, which conform to the grammar. Parsing, or more formally syntactic analysis, is the process of analyzing a text, made of a sequence of tokens, to determine its grammatical structure with respect to a given (more or less) formal grammar. Parsing is a standard technique in the field of natural language processing: it means taking an input and producing some sort of structure for it. Before a syntactic parser can parse a sentence, it must be supplied with information about each word in the sentence.

Parsing may be defined as the process of assigning structural descriptions to sequences of words in a natural language (or to sequences of symbols derived from word sequences). Put another way, a parser accepts as input a sequence of words in some language together with an abstract description of the possible structural relations that may hold between words or word sequences in the language, and produces as output zero or more structural descriptions of the input as permitted by the structural rule set. There will be zero descriptions if either the input sequence cannot be analyzed by the grammar (i.e., it is ungrammatical) or the parser is incomplete (i.e., it fails to find all of the structure the grammar permits). There will be more than one description if the input is ambiguous with respect to the grammar, i.e., if the grammar permits more than one analysis of the input.

In English, countable nouns have only two inflected forms, singular and plural, and regular verbs have only four inflected forms: the base form, the -s form, the -ed form, and the -ing form. The case is not the same for many other languages, such as Finnish, Turkish and Quechua, and Dravidian languages like Tamil, which may have hundreds of inflected forms for each noun or verb. Here an exhaustive lexical listing is simply not feasible; for such languages, one must build a word parser that uses the morphological system of the language to compute the part of speech and inflectional categories of any word.

Statistical methods mainly focus on semantics, whereas structural methods focus on syntax. Though statistical parsing gives better performance through N-gram probabilities and a large vocabulary size, it has disadvantages such as lack of support for free word order and for long-term relationships. Structural parsing provides solutions to some extent, but it is very tedious for a large-vocabulary corpus. To accommodate both syntax and semantics, a structural component has to be involved in the statistical approach. To add the structural component and balance the vocabulary size, Lexicalized and Statistical Parsing (LSP) is to be employed with a phrase or dependency structure language model. To maintain long-term relationships in complex and large sentences, a phrase structure language model is not suitable; a large training set is also needed to leverage the performance. When dependency relations are applied among words, direct relationships can be established.

Chunking can be seen as the basic task in partial parsing. Partial parsing was introduced as a response to the difficulties of full traditional parsing, and is described as a set of techniques for recovering syntactic information efficiently and reliably from unrestricted text by sacrificing completeness and depth of analysis. Among the critiques of full parsing (and in favor of partial parsing), the most important are that full parsing is not sufficiently robust for many NLP applications and that it does not identify good parse trees in noisy surroundings. Recent progress in full statistical parsing (see for instance the CoNLL 2007 shared task on multilingual dependency parsing) shows that full parsing is not only robust but also capable of giving good results for many different languages.
1.5.2 Types of Parsing

The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in essentially two ways:

Top-Down Parser

Top-down parsing can be viewed as an attempt to find leftmost derivations of an input stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand sides of grammar rules. The parser starts from the start symbol S and works down to reach the input. The advantage of this strategy is that it never wastes time exploring trees that cannot result in an S (root), since it begins by generating just those trees; this means it also never explores subtrees that cannot find a place in some S-rooted tree. The disadvantage is that, while it does not waste time on trees that do not lead to an S, it does spend considerable effort on S trees that are not consistent with the input. This weakness arises from the fact that top-down parsers generate trees before ever examining the input. Recursive descent parsers and LL parsers are examples of this type of parser.

Bottom-up Parser

Alternatively, a parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. Another term used for this type of parsing is shift-reduce parsing. The advantage of this method is that it never suggests trees that are not at least locally grounded in the input. The major disadvantage is that trees that have no hope of leading to an S, or of fitting in with any of their neighbors, are generated with wild abandon. LR parsers and operator precedence parsers are examples of this type of parser.

Another important distinction is whether the parser generates a leftmost or a rightmost derivation: LL parsers generate a leftmost derivation, while LR parsers generate a rightmost derivation (although usually in reverse). As summarized above, each of these two architectures has its own advantages and disadvantages.
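The top-down strategy can be made concrete with a few lines of code. The following is a minimal recursive-descent recognizer for a toy English grammar; the grammar and lexicon are invented for illustration and are unrelated to the Tamil tag sets used later in this report.

    # Toy grammar: S -> NP VP; NP -> Det N | N; VP -> V NP | V
    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["Det", "N"], ["N"]],
        "VP": [["V", "NP"], ["V"]],
    }
    LEXICON = {"Det": {"the"}, "N": {"boy", "book"}, "V": {"reads"}}

    def parse(symbol, tokens, pos):
        """Return the set of positions reachable after deriving `symbol`."""
        if symbol in LEXICON:                           # terminal category
            if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
                return {pos + 1}
            return set()
        results = set()
        for production in GRAMMAR[symbol]:              # expand top-down
            positions = {pos}
            for rhs in production:
                positions = {q for p in positions for q in parse(rhs, tokens, p)}
            results |= positions
        return results

    tokens = "the boy reads the book".split()
    print(len(tokens) in parse("S", tokens, 0))         # True: sentence derived from S

Note how the recognizer expands the start symbol S before ever looking at the input, which is exactly the source of the inefficiency discussed above.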
1.6 Contribution

The most important work in this project is the creation of the training data set, which plays a key role in determining the efficiency of the parser: the more accurate the training data, the more accurate the parser. Around 2000 sentences of different sentence structures have been taken from various Tamil grammar books and manually tagged with the POS tag, chunk tag, dependency head and dependency relation. This amounts to around 12,000 items of training data.
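For illustration, one row per word of such a manually tagged corpus could look like the hypothetical, transliterated sample below; the actual corpus is in Tamil script, and the exact tag sets and column layout used in this project are described in Chapter 5.

    ID  WORD         POS  CHUNK  HEAD  RELATION
    1   raaman       NN   NP     3     SUB
    2   puththagam   NN   NP     3     OBJ
    3   padiththaan  VF   VP     0     ROOT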
1.7 Thesis Overview

Chapter 2 covers the various works on dependency parsing and the various approaches to clause boundary identification. Chapter 3 covers the concepts of dependency parsing; the various kinds of dependency parsing approaches and the applications of dependency parsing are also presented. Chapter 4 covers the methodology used in this project for dependency parsing and the modifications made to the existing system. Chapter 5 presents the implementation of the dependency parser and the various tools used; the learning algorithm behind the parser and the results obtained are also discussed. Chapter 6 gives the conclusion of this project and the methods that can be applied to improve the accuracy of the parser.
CHAPTER 2
LITERATURE SURVEY

2.1 Various Works for Dependency Parser

Dependency parsing has been a very important line of work in natural language processing over the last several years. Various dependency parsing works have been carried out for different languages; some of them are listed here.

Nivre [1] reviewed the state of the art in dependency-based parsing in his paper on dependency grammar and dependency parsing. He added that the notion of dependency is based on the idea that the syntactic structure of a sentence consists of binary asymmetrical relations between the words of the sentence. He explained the computational implementation of syntactic analysis based on dependency representations, i.e., representations involving lexical nodes connected by dependency arcs, possibly labeled with dependency types. He also added that dependency-based syntactic representations have played a fairly major role in the history of linguistic theory as well as in that of NLP.

Ryan McDonald [2] developed a discriminative learning method for dependency parsing using online large-margin training combined with spanning tree inference algorithms, and showed that it provides state-of-the-art accuracy, is extensible through the feature set, and can be implemented efficiently. He also demonstrated the language-independent nature of the method by evaluating it on over a dozen diverse languages, and showed its practical applicability through integration into a sentence compression system.

The earliest known parsing algorithm, first suggested by Yngve in 1955, is used in the shift-reduce parsers common for computer languages by Aho and Ullman in 1972. In bottom-up parsing, the parser starts with the words of the input and tries to build trees from the words up, again by applying rules from the grammar one at a time.
An important research effort was the CoNLL 2006 and 2007 shared tasks [3], which allowed a comparison of many algorithms and approaches for this task on many languages. Dependency parsers can be categorized into three families: local-and-greedy transition-based parsers, globally optimized graph-based parsers, and hybrid systems, which combine the output of various parsers into a new and improved parse. The MALT parser is an example of a transition-based parser, the MST parser of a graph-based parser, and the parsers developed by Sagae and Lavie in 2006 and by Nivre and McDonald in 2008 are examples of hybrid parsing systems. Transition-based parsers scan the input from left to right and are fast; the complexity of this type of parser is O(n). Graph-based parsers, on the other hand, are globally optimized.

DeSR is a shift-reduce dependency parser which uses a variant of the approach of Yamada and Matsumoto from 2003. Dependency structures are built by scanning the input from left to right and deciding at each step whether to perform a shift or to create a dependency between two adjacent tokens [4]. DeSR, though, uses a different set of rules and includes additional rules to handle non-projective dependencies, allowing parsing to be performed deterministically in a single pass. The algorithm also produces fully labeled dependency trees. A classifier is used for learning and predicting the proper parsing action.

The ISBN dependency parser (idp) is a configurable and trainable dependency parser based on a generative statistical model, Incremental Sigmoid Belief Networks (ISBNs) [5]. One modern approach to building dependency parsers, called data-driven dependency parsing, is to learn good and bad parsing decisions solely from labeled data, without the intervention of an underlying grammar (Ryan McDonald and Joakim Nivre).

Deivasundarm has prepared a morphological analyzer for Tamil for his Tamil word processor. He too makes use of phonological and morphophonemic rules and morphotactic constraints for developing his parser.

AUKBC morphological parser for Tamil: the AUKBC NLP team under the supervision of Rajendran [6] prepared a morphological parser for Tamil. The API processor of
AUKBC makes use of finite-state machinery like PC-KIMMO. It parses, but does not generate.

Winston Cruz's parsing and generation of Tamil verbs: Winston Cruz makes use of the GSmorph method for parsing Tamil verbs. GSmorph too does morphotactics by indexing: the algorithm simply looks up two files to see whether the indices match or not. The processor generates as many forms as it parses and uses only two files.

Dhurai Pandi's morphological generator and parsing engine for Tamil verb forms is a full-fledged morphological generator and parsing engine for verb patterns in modern Tamil.

Baskaran's finite-state machine for syntactic parsing: finite-state automata are one of the important techniques for parsing at all levels of language structure. On an experimental basis, Baskaran in 1984 attempted a finite-state machine for parsing sentences in Tamil.

Keeping the characteristics of Tamil in mind, Kumara Shanmugam in 2004 prepared a parser for Tamil. The parser he designed carries out a complete morphological analysis of the words of the sentences at the first level in order to help in dependency determination. Shanmugam, while proposing a program for syntactic parsing in Tamil, makes the following comments: "Structural description of the units of a language can be provided by the grammar of the language, making use of a principle called 'Projection Principle'. According to this principle, as in transformational grammatical treatises, the structure of a sentence or phrase can be projected or plotted from the lexical specification of the head of the phrase or sentence. The projected structure will be an abstract structure which will be modified with due substitution of appropriate lexical items." Shanmugam in 2002 advocated a minimalist program for Tamil parsing.

All grammatical formalisms identify a lexicon and certain procedures for creating and manipulating grammatical structures. The minimalist program, a grammatical model and an extension of the GB framework, was proposed by Chomsky to expose the grammatical
patterns found in languages. Some of his MPhil and PhD students have worked on context-free grammar formalism, transformational generative grammar formalism, the projection principle, and the minimalist program for their dissertations, and prepared syntactic parser models for Tamil based on the formalism they chose.

The RCILTS-Tamil syntactic parser handles simple and complex sentences with multiple nouns, adjective and adverb clauses. Handling of conjunctions has been tackled to a limited extent; the addition of rules for semantic dependencies could enhance the performance of the parser. It parses sentences in terms of clauses such as noun clauses, verb clauses, adjective clauses and adverbial clauses, and the clauses have been parsed into categories.

Selvam et al. in 2008 developed a parser using a hybrid language model, trained with a phrase-structured treebank using the immediate head parsing technique. A lexicalized and statistical parser which employs this hybrid language model and the immediate head parsing technique gives better results.

Indian Institute of Technology Bombay and the Central Institute of Indian Languages, Mysore jointly organized a three-day national symposium on modeling and shallow parsing of Indian languages (MSPIL-06) at IIT Bombay in April 2006, with the sponsorship of C-DAC, IBM Research India, Yahoo and the Development Gateway Foundation.

The parsing system developed by Aniruddha Ghosh et al. [7] for the ICON NLP tools contest 2010 uses a MALT parser and a post-processing approach. The system is developed for the Bengali language and uses a hybrid approach, whereas previous methods used either a grammar-driven or a data-driven approach; the hybrid architecture is motivated by the syntactic richness of the language. Grammar-driven parsing parses sentences by eliminating those which do not satisfy the constraints, while data-driven parsing needs a large, manually annotated corpus. The corpus size being small, the hybrid method was adopted. The data are trained with the MALT parser and a model is created. Due to the small dataset, many dependencies are sparsely distributed, which leads to low accuracy, so post-processing is applied to the parser output, mainly to reduce the errors made during parsing. Depending on the nature of the errors, a set of rules has been devised; the output of the MALT parser and of the rule-based system are compared, and the rule-based system is given high priority since it is based on syntactico-semantic rules. The system works best with simple sentences.

Another parsing system, 'A two stage constraint based hybrid dependency parser for Telugu', was developed for the ICON NLP tools contest 2010 by Sruthilaya Reddy Kesidi et al. [8]. As the name suggests, it parses in two stages: the first stage mainly handles the intra-clausal relations, and the second stage handles the more complex inter-clausal relations, such as those involved in the construction of coordination and subordination between clauses. In this system, the average number of output parses for each sentence is 10. The parses are ranked based on S-constraints and H-constraints, and the parse with the maximum rank is given as the final output. The difference between the parses is very small, so the ranking becomes non-trivial. Also, this method does not handle ellipsis, so sentences with NULL nodes are eliminated.

The parsing system 'Experiments with malt parser for parsing Indian languages', developed by Sudheer Kolachina et al. [9] for the ICON NLP tools contest 2010, works for the Hindi, Telugu and Bangla languages. The system uses the MALT parser, with arc dimensionality and valency added as features, and was experimented with by changing the MALT parser parameters. Use of the novel features improves the performance of the parser.

The other parsing system, 'Experiments on Indian language dependency parsing', was developed by Prudhvi Kosaraju et al. [10] for Hindi, Telugu and Bangla. This system uses the MALT parser and SVM as the learning algorithm, and explores the state-of-the-art MALT parser with different parsing strategies; a data-driven approach was chosen. Morpho-syntactic information improves the accuracy of the parser, but there was not much improvement due to the less efficient semantic tagger. The contribution to achieving high accuracy comes from intra-chunk relations; inter-chunk relations still suffer from the problem of imperfection.
Chunking and dependency parsing by G. Attardi et al. [11] explains how chunking can be used to improve the accuracy of the parser. The paper tells us that dependencies from one chunk to another will usually attach to the head of the chunk, while within a chunk only intra-chunk relations exist. The process therefore becomes easier if the head of the chunk is identified, and the errors caused by incorrect dependency heads are reduced to some extent, because dependencies from one chunk to the other will usually be to the head of the chunk, and the head of a noun chunk will generally be the word at its right end.
2.2 Works for the Clause Boundary Identification

R. Vijay Sundar Ram et al. [12] from the AU-KBC Research Institute identified clause boundaries using conditional random fields. The method uses a hybrid approach: the boundaries are identified using the CRF and then checked with an error pattern analyzer for false boundary identification.

Fredrik Jorgensen [13] from the Department of Linguistics and Scandinavian Studies, University of Oslo, detected clause boundaries in transcribed spoken language by classifying coordinating conjunctions in spoken language discourse as belonging to either the syntactic level or the discourse level of analysis.

Hyun-Ju Lee et al. [14] from the Department of Computer Engineering, Kyungpook National University, Daegu, Korea proposed a method for Korean clause boundary recognition. The ending points of clauses are first recognized, and the starting points are then identified by considering the typological characteristics of Korean.

Tomohiro Ohno et al. [15] published a paper on incremental dependency parsing of Japanese spoken monologue based on clause boundaries. The paper proposed a technique for incremental dependency parsing of Japanese spoken monologue on a clause-by-clause basis: the technique identifies the clauses based on clause boundary analysis, analyzes their dependency structures, and tries to decide the dependency relations with other clauses simultaneously with the monologue speech input.

Tomohiro Ohno et al. [16] also proposed a paper on dependency parsing of Japanese monologue using clause boundaries. The method is implemented at two levels, the clause level and the sentence level. At the clause level, the sentences are split into clauses and the dependencies within each clause are identified; at the sentence level, the dependencies across clauses are identified.

Dan Lowe Wheeler [17] presented, for his thesis, a statistical approach to machine translation through clausal syntax for Chinese to English, a tree-to-tree translation that predicts the English clause structure from the Chinese clause structure. For this, the sentences have to be split into clauses. The splitting is based on a set of rules on the Chinese side, applied in two steps: minimal verb phrases are first marked as clauses, and these then propagate up through a series of constraints. This is how the sentences are split in Chinese.

There are also other works for Tamil. The Stanford Natural Language Processing group has also developed software for dependency parsing of English sentences, using the Penn Treebank tag set for parsing new sentences.
2.3 Summary

• Various works are going on in dependency parsing.
• The ICON NLP tools contest 2010 showcased various approaches used to improve the accuracy of dependency parsers for different languages.
• Attardi [11] shows that chunking will improve the accuracy of the parser and reduce the errors.
• Nivre [1] explained dependency grammar and dependency parsing.
CHAPTER 3
DEPENDENCY PARSING

3.1 Introduction

Dependency parsing is the task of finding the tree structure for a given input sentence. The dependency structure for a sentence is a directed graph originating from a unique, artificially inserted root node [18]. A dependency graph is a weakly connected directed graph in which each word has exactly one incoming edge, except the root, which has no incoming edge. There are no cycles, and if there are n words in the sentence, then the graph has exactly n−1 edges. Dependency graphs which satisfy these tree constraints are called dependency trees. Whenever two words are connected by a dependency relation, we say that one of them is the head and the other is the dependent, and that there is a link connecting them. In general, the dependent is a modifier, object or complement; the head plays the larger role in determining the behavior of the pair. In our dependency representation, the source of an edge represents the modifier and the destination points to the head word. The primary reason for using dependency structures instead of more informative lexicalized phrase structures is that they are more efficient to learn and parse while still encoding much of the predicate-argument information needed in applications.
3.2 Dependency Graph

Dependency graphs represent words and their relationships to syntactic modifiers using directed edges. Figure 3.1 shows a dependency graph for an example sentence. This example belongs to the special class of dependency graphs that only contain projective (also known as nested or non-crossing) edges.
Fig. 3.1 An example of dependency graph

Figure 3.1 shows the construction of a projective dependency graph for the example sentence. Equivalently, we can say a dependency graph is projective if and only if an edge from word w to word u implies that there exists a directed path in the graph from w to every word between w and u in the sentence. Due to English's rigid word order, projective graphs are sufficient to analyze most English sentences; in fact, the largest source of English dependencies is automatically generated from the Penn treebank and is, by construction, exclusively projective. However, there are certain examples in which a non-projective graph is preferable.

Formally, a dependency structure for a given sentence is a directed graph originating from a unique, artificially inserted root node, which we always insert as the leftmost word. In the most common case, every valid dependency graph has the following properties:
1. It is weakly connected (in the directed sense).
2. Each word has exactly one incoming edge in the graph (except the root, which has no incoming edge).
3. There are no cycles.
4. If there are n words in the sentence (including the root), then the graph has exactly n−1 edges.
It is easy to show that 1 and 2 imply 3, and that 2 implies 4. In particular, a dependency graph that satisfies these constraints must be a tree. Thus we say that dependency graphs satisfying these properties satisfy the tree constraint, and call such graphs dependency trees. For most of this work we address the problem of parsing dependency graphs that are trees, which is a common constraint.
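The tree constraint and the projectivity condition translate directly into code. The following sketch (not from the report's implementation) represents a graph as a head list, where heads[i] is the head of word i+1 and 0 is the root inserted at the leftmost position:

    def is_tree(heads):
        # Every word must reach the root without revisiting a node.
        for i in range(1, len(heads) + 1):
            seen, node = set(), i
            while node != 0:
                if node in seen:          # cycle: not a tree
                    return False
                seen.add(node)
                node = heads[node - 1]
        return True

    def is_projective(heads):
        # Equivalent crossing test: no two arcs (head, dependent) may cross.
        spans = [sorted((h, d)) for d, h in enumerate(heads, start=1)]
        for lo1, hi1 in spans:
            for lo2, hi2 in spans:
                if lo1 < lo2 < hi1 < hi2:
                    return False
        return True

    print(is_tree([3, 3, 0]), is_projective([3, 3, 0]))   # True True
    print(is_projective([3, 4, 0, 3]))                    # False: arcs (3,1) and (4,2) cross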
Directed edges in a dependency graph represent words and their modifiers, e.g., a verb and its subject, a noun and a modifying adjective, etc. The word constitutes the head of the edge and the argument the modifier. This relationship is often called the head-modifier or governor-dependent relationship. The head is also sometimes called the parent, and the modifier the child or argument. We will always refer to words in a dependency relationship as the head and modifier.
Fig. 3.2 An example of labeled dependency graph

Dependency structures can be labeled to indicate grammatical, syntactic and even semantic properties of the head-modifier relationships in the graph. For instance, we can add syntactic/grammatical function labels to the structure in Figure 3.1 to produce the graph in Figure 3.2.
3.3 Classes of Dependency Parsing

There are two classes of dependency parsing:
• Data-driven dependency parsing
• Grammar-based dependency parsing
An approach is data-driven if it makes essential use of machine learning from linguistic data in order to parse new sentences. An approach is grammar-based if it relies on a formal grammar defining a formal language, so that it makes sense to ask whether a given input sentence is in the language defined by the grammar or not. It is important to note that these categorizations are orthogonal: it is possible for a parsing method to make essential use of machine learning and also use a formal grammar, hence to be both data-driven and grammar-based. However, most methods fall into only one of these classes.
3.4 Data-driven Dependency Parsing

Data-driven approaches make use of machine learning to parse new sentences. In this class of dependency parsing, whatever sentence is given as input, we will definitely get an output: the parser does not check whether the given sentence is valid. That is, data-driven approaches assume that any input string is a valid sentence and that the task of the parser is to return the most probable dependency structure for it, no matter how unlikely that structure may be. There are two classes of data-driven dependency parsing:
• Transition-based dependency parsing
• Graph-based dependency parsing
Most data-driven methods fall into one of these two categories.
3.4.1 Machine Learning Approach

There are two types of methods in the machine learning approach: supervised and unsupervised. In the supervised method, an annotated corpus is given; the system learns from the annotated corpus and creates a model based on the training set given to it. In the unsupervised method, only raw data is given; the system learns from the data itself and creates a model based on it. One form of unsupervised learning is clustering, which induces the model from large corpora. The supervised method learns from annotated corpora. We focus on supervised methods, that is, methods presupposing that the sentences used as input to machine learning have been annotated with their correct dependency structures.

In supervised dependency parsing, there are two different problems that need to be solved computationally. The first is the learning problem, the task of learning a parsing model from a representative sample of sentences and their dependency structures. The second is the parsing problem, the task of applying the learned model to the analysis of a new sentence. We can represent this as follows:
• Learning: Given a training set of sentences (annotated with dependency graphs), induce a parsing model that can be used to parse new sentences.
• Parsing: Given a parsing model and a sentence, derive the optimal dependency graph for the sentence according to the induced model.
Data-driven approaches differ in the type of parsing model adopted, the algorithms used to learn the model from data, and the algorithms used to parse new sentences with the model.
3.4.2 Transition-based Dependency Parsing

A transition-based method defines a transition system, or state machine, for mapping a sentence to its dependency graph. In other words, transition-based dependency parsing can be viewed as a sequence of transitions between states. This approach uses a greedy algorithm in which a single action is chosen at every point. These systems parameterize a model over the transitions of an abstract machine for deriving dependency trees, where a greedy, deterministic parsing algorithm is used to predict new trees. A transition system is an abstract machine consisting of a set of configurations (states) and transitions between configurations. Transition systems used for dependency parsing have complex configurations with internal structure, and the transitions are the steps in the derivation of the dependency tree. The idea is that a sequence of valid transitions, starting in the initial configuration for a given sentence and ending in one of several terminal configurations, defines a valid dependency tree for the input sentence. The machine learning problem can be viewed as:
• Learning problem: To induce a model for predicting the next state transition, given the transition history.
• Parsing problem: To construct the optimal transition sequence for the input sentence, given the induced model.
This type of parsing method is sometimes referred to as shift-reduce dependency parsing. The MALT parser follows this dependency parsing approach.
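The following sketch illustrates the transition idea with the arc-standard shift-reduce system; the transition sequence here is written by hand for a three-word example, whereas the MALT parser predicts each transition with a trained classifier (Chapter 5).

    def run(n_words, transitions):
        """Apply a transition sequence; return the arcs as (head, dependent)."""
        stack, buffer, arcs = [0], list(range(1, n_words + 1)), []
        for t in transitions:
            if t == "SHIFT":                   # move the next word onto the stack
                stack.append(buffer.pop(0))
            elif t == "LEFT-ARC":              # second-top becomes dependent of top
                dep = stack.pop(-2)
                arcs.append((stack[-1], dep))
            elif t == "RIGHT-ARC":             # top becomes dependent of new top
                dep = stack.pop()
                arcs.append((stack[-1], dep))
        return arcs

    # "raaman puththagam padiththaan": both nouns attach to the verb (word 3),
    # which attaches to the artificial root (0).
    seq = ["SHIFT", "SHIFT", "SHIFT", "LEFT-ARC", "LEFT-ARC", "RIGHT-ARC"]
    print(run(3, seq))   # [(3, 2), (3, 1), (0, 3)]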
3.4.3 Graph-based Dependency Parsing

Graph-based methods define a space of candidate dependency graphs for a sentence. A graph-based system explicitly parameterizes models over substructures of a dependency tree, instead of the indirect parameterization over the transitions used to construct a tree. Many trees are constructed, and each candidate tree is given a score based on some local or global scoring function; the parser outputs the highest-scoring tree. The machine learning problem can be viewed as:
• Learning problem: To induce a model for assigning scores to the candidate dependency graphs for a sentence.
• Parsing problem: To find the highest-scoring dependency graph for the input sentence, given the induced model.
This is often called maximum spanning tree parsing, since the problem of finding the highest-scoring dependency graph is equivalent to the problem of finding a maximum spanning tree in a dense graph. The MST (Maximum Spanning Tree) parser follows this type of dependency parsing approach.
3.5 Grammar-based Dependency Parsing

An approach is grammar-based if it relies on a formal grammar defining a formal language, so that it makes sense to ask whether a given input sentence is in the language defined by the grammar or not. Here, the formal grammar is an essential component of the model used to parse new sentences. In this approach, parsing is defined as the analysis of a sentence with respect to the given grammar: if the parser finds an analysis, the sentence is said to belong to the language described by the grammar; if the parser does not find an analysis, the sentence does not belong to the language. In other words, when a new sentence is given to a grammar-based dependency parser, the model checks against the formal grammar whether the sentence is valid. Only if the sentence is valid does the parser proceed to parse; otherwise the parser reports an invalid string. Grammar-based approaches may or may not be data-driven. The two classes of grammar-based approaches are:
• Context-free dependency parsing
• Constraint-based dependency parsing
Unlike data-driven dependency parsing, which works independently of the language, this type of parsing is specific to a particular language.
3.5.1 Context-free Dependency Parsing

Context-free dependency parsing exploits a mapping from dependency structures to context-free phrase structure representations and reuses the algorithms developed for context-free grammars. This includes chart parsing algorithms, which are also used in graph-based parsing, as well as shift-reduce type algorithms, which are closely related to the methods used in transition-based parsing.
3.5.2 Constraint-based Dependency Parsing

Constraint-based dependency parsing views parsing as a constraint satisfaction problem. A grammar is defined as a set of constraints on well-formed dependency graphs, and the parsing problem amounts to finding a dependency graph for a sentence that satisfies all the constraints of the grammar. Some approaches allow soft, weighted constraints and score dependency graphs by a combination of the weights of the constraints violated by the graph. Parsing then becomes the problem of finding the dependency graph for a sentence that has the best score, which is essentially the same formulation as in graph-based parsing.
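As a sketch of the soft-constraint idea (the constraints and weights below are invented for illustration, not taken from any grammar in the literature), a candidate head list can be penalized by the weight of every constraint it violates:

    CONSTRAINTS = [
        # Exactly one word may attach to the artificial root.
        (lambda heads, tags: heads.count(0) == 1, 5.0),
        # Any noun with a non-root head should be headed by a verb.
        (lambda heads, tags: all(tags[h - 1] == "V"
                                 for d, h in enumerate(heads)
                                 if h != 0 and tags[d] == "N"), 2.0),
    ]

    def violation_score(heads, tags):
        return sum(w for check, w in CONSTRAINTS if not check(heads, tags))

    tags = ["N", "N", "V"]                     # invented POS tags
    print(violation_score([3, 3, 0], tags))    # 0: both constraints satisfied
    print(violation_score([2, 3, 0], tags))    # 2.0: a noun headed by a noun

The graph with the lowest violation score (equivalently, the best overall score) is returned, mirroring the graph-based formulation noted above.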
3.5.3 Comparison of Transition-based with Graph-based Approaches

• Parameterization: Transition-based systems parameterize models over transitions in an abstract state machine, where each state (or configuration) represents a dependency graph. This allows these models to create rich feature representations over possible next transitions as well as all previous transitions that have brought the system into the current state. Conversely, graph-based models parameterize over sub-graphs of the resulting dependency tree. As such, these models have a rather impoverished feature representation with a very local scope, often just over a single arc, and cannot model decisions on a truly global scale.

• Parsing algorithms: Transition-based models use greedy algorithms to move from one configuration to the next by simply choosing the most likely next transition. Such a greedy approach cannot provide any guarantee that mistakes made early in the process do not propagate to decisions at later stages, a defect often called error propagation. On the other hand, graph-based models can often search the entire space of dependency trees, with the guarantee that the returned dependency tree is the most likely under the model parameters. This is true for arc-factored models as well as many projective non-arc-factored models.
3.6 Types of Dependency Parsing

Based on the structure of the dependency graph, there are two types of dependency parsing:
• Projective dependency parsing
• Non-projective dependency parsing
3.6.1 Projective Dependency Parsing

A dependency tree is projective when the words are in linear order, preceded by the root, and the edges can be drawn above the words without crossings. Figure 3.3 is an example of projective dependency parsing.
Fig. 3.3 Example of projective dependency parsing
3.6.2 Non-projective Dependency Parsing

A dependency tree is non-projective when the edges cannot be drawn above the words without crossings, i.e., the arcs cross. Figure 3.4 is an example of non-projective dependency parsing. This type of structure is more common in free word order languages.
Fig. 3.4 Example of non-projective dependency parsing
3.7 Applications of Dependency Parsing Dependency parsing helps to serve various NLP tasks.
Relation extraction
Machine translation
Synonym generation
Lexical resource augmentation
Information extraction
Question answering
Word sense disambiguation
Sentence structure identification
Summarization
Relation Extraction
The dependency tree is useful in relation extraction, the task of extracting the relations between the words in a sentence: how the words are related to one another, what relation holds between them, and how they are connected. All of this information can be read off the dependency tree and used for various purposes.
Machine Translation
Machine translation is the automatic translation from one human language to another. The dependency graph and dependency tree show how the words are related to one another, which makes the translation task easier.
Synonym Generation
Using dependency parsing, we can find the exact meaning of a word, i.e., the sense in which it occurs. For example, one Tamil word can occur as either a verb or a noun ('dance' if a verb, 'goat' if a noun). Since the dependency parser gives subject-object information, we can identify the appropriate synonym of the word as used.
Lexical Resource Augmentation
For a free word order language like Tamil, a dependency parser is well suited to showing how the lexical items (words) are related to one another. Languages like English follow a fixed pattern of sentence structure, but in Tamil a sentence can be reordered in many ways; in such cases a dependency parser is very helpful.
Question Answering and Summarization
These are further applications in which dependency parsing is very helpful.
Word Sense Disambiguation
Using dependency parsing, we can find the sense in which a word has been used. Certain words fall under more than one part-of-speech category. For example, another Tamil word can occur as either a noun or a cardinal ('river' if a noun, the number six if a cardinal). Since the dependency parser gives subject-object information, we can easily identify the sense in which the word is used.
Sentence Structure Identification
The structure of a sentence can be easily identified with the help of the parser. Since Tamil is a free word order language, a sentence can follow any structural pattern, and the parser helps identify the structure actually used.
3.8 Summary
Dependency parsing is the task of identifying the dependency structure for the sentence
The two types of dependency parsing: Data-driven and Grammar-based.
Data-driven approaches use machine learning and have two classes: Transition-based and Graph-based.
Grammar-based approaches make use of a formal grammar; the two classes are Context-free and Constraint-based.
Based on the dependency graph, there are two types of parsing: Projective and Non-projective.
CHAPTER 4 METHODOLOGY
4.1 Introduction
The implementation of dependency parsing involves a sequence of steps. The input sentences are first tokenized. The tokens are sent to the POS tagger to obtain POS-tagged data, and the tagged sentences are then passed to the chunker to obtain chunked data. At this point each word carries a POS tag and a chunk tag. The data is then converted to the parser input format and passed to the parser. The parsed output is converted to a digraph format, which serves as the input to the tree viewer; the tree viewer displays the parsed output as a tree. The block diagram of the proposed system is given in Figure 4.1; a code sketch of the pipeline follows the figure.
Fig. 4.1 Block diagram of the proposed method
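To summarize the block diagram in code form, here is a minimal end-to-end sketch (Python). The stage functions are stubs standing in for the external tools (SVMTool tagger, CRF chunker, MaltParser, GraphViz); their names and placeholder outputs are invented for illustration and are not the tools' real APIs:

    # Each stage stub shows only the data format handed to the next stage.
    def pos_tag(tokens):
        return [(t, "NN") for t in tokens]            # SVMTool stand-in

    def chunk(tagged):
        return [(w, p, "NP") for w, p in tagged]      # CRF chunker stand-in

    def to_malt(rows):
        # MaltParser expects tab-separated columns, one token per line
        return "\n".join("\t".join(r) for r in rows)

    def parse_sentence(sentence):
        tokens = sentence.split()                     # tokenization
        return to_malt(chunk(pos_tag(tokens)))        # tag -> chunk -> format

    print(parse_sentence("a small example"))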
4.2 Modifications to the Existing System
Many systems currently exist for dependency parsing in different languages, and many groups are working in this field because of its applications in various areas. The system we developed modifies an existing system in order to improve the accuracy of the parser, which in turn improves the accuracy of the applications built on top of it. The developed system differs from the existing system in the following ways:
The system is developed for Tamil language
The tag set used is different
Chunking is also included in the process
Two different models are created
CHAPTER 5 IMPLEMENTATION
5.1 Introduction
The input sentence is given as a string and is first written into a file. The sentence in the file is then split into tokens based on the spaces between words: whenever a space is encountered, a token boundary is created. The tokenized input is given to the SVM tool for POS tagging. The POS tagger accepts only space-separated columns, so a space separates each word from its tag; Malt, however, accepts only tab-separated columns, so after tagging the space between each word and its POS tag is replaced by a tab. The chunker accepts either spaces or tabs, so the tab-separated, POS-tagged data is given directly to the chunker. The chunked data is then preprocessed into the Malt input format, the Malt parser is run to produce the parsed output, and the parsed sentence is given to the GraphViz tool, which generates the tree structure from the digraph format.
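As a concrete illustration of the separator handling just described, the following sketch (Python; it assumes the tagger emits one "word TAG" pair per line, space-separated, and the real Malt input has further columns not shown here) converts tagger output into tab-separated lines:

    # Replace the single space between each word and its POS tag with a
    # tab, as the Malt input format requires tab-separated columns.
    def space_to_tab(tagged_output):
        lines = tagged_output.strip().split("\n")
        return "\n".join(line.replace(" ", "\t") for line in lines)

    print(space_to_tab("birds NN\nfly VF\n. DOT"))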
5.2 POS Tagging
Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag or other lexical class marker to each word in a sentence. In linguistics, part-of-speech tagging, also termed grammatical tagging or word-category disambiguation, is the process of marking up the words in a text or corpus as corresponding to a particular part of speech, based on both the word's definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence or paragraph [19]. It can also be defined as the automatic annotation of syntactic categories for each word in a corpus, and it is similar to the process of tokenization for computer languages. POS tagging is a well-understood problem in NLP to which machine learning approaches are applied. Interest in annotated corpora is spreading, as there is
increasing interest in using existing machine learning approaches for corpus processing. The input to a tagging algorithm is a string of words of a natural language sentence and a specified tag set (a finite list of part-of-speech tags); the output is a single best POS tag for each word. For morphologically rich languages such as Tamil and Malayalam, the morph analyzer (a tool that splits a given word into its constituent morphemes and identifies their corresponding grammatical categories) can itself identify the part of speech in most cases. But the morph analyzer fails to resolve some lexical ambiguities, for which we need a POS tagger. Tags play an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction; this is the main reason for us to use POS tagging.
5.2.1 POS Tagger
A part-of-speech tagger (POS tagger) is a piece of software that reads text in some language and assigns a part of speech, such as noun, verb or adjective, to each word (and other token) in the given text. A tagger is a necessary component of most text analysis systems, as it assigns a syntactic class to every word in a sentence. A POS tagger, also called a grammatical tagger or word-category disambiguator, assigns each word of a text the proper POS tag for its context of appearance in the sentence. The importance of the problem stems from the fact that POS tagging is one of the first stages in many natural language processing pipelines.
5.2.2 POS Tag Set
We follow the Amrita tag set for tagging the words. The main drawback of most other tag sets is that they encode verb and noun inflections, so at tagging time every inflected word in the corpus must be split, which is a laborious process. At the POS level we only want to determine a word's POS category or tag, and that can be done with a limited number of tags; the inflectional problems can be handled by the morph analyzer, so there is no need for a large tag set. Moreover, a large number of tags leads to more complexity, which in turn reduces tagging accuracy. Considering the complexity of Tamil POS tagging and after surveying various tag sets, the Amrita tag set [20] is used; it is summarized in Table 5.1. Our tag set contains 28 tags and does not consider inflections. Compound tags are used only for nouns (NNC) and proper nouns (NNPC), and the tag VBG covers verbal nouns and participle nouns.

Table 5.1 POS tag set

No  Tag    Description                 No  Tag   Description
1   NN     NOUN                        15  VINT  VERB INFINITE
2   NNC    COMPOUND NOUN               16  CNJ   CONJUNCTION
3   NNP    PROPER NOUN                 17  CVB   CONDITIONAL VERB
4   NNPC   COMPOUND PROPER NOUN        18  QW    QUESTION WORDS
5   ORD    ORDINALS                    19  COM   COMPLEMENTIZER
6   CRD    CARDINALS                   20  NNQ   QUANTITY NOUN
7   PRP    PRONOUN                     21  PPO   POSTPOSITIONS
8   ADJ    ADJECTIVE                   22  DET   DETERMINERS
9   ADV    ADVERB                      23  INT   INTENSIFIER
10  VNAJ   VERB NON FINITE ADJECTIVE   24  EMP   EMPHASIS
11  VNAV   VERB NON FINITE ADVERB      25  COMM  COMMA
12  VBG    VERBAL GERUND               26  DOT   DOT
13  VF     VERB FINITE                 27  QM    QUESTION MARK
14  VAX    AUXILIARY VERB              28        ADVERBIAL NOUN
5.2.3 Explanation of the POS Tag Set

NN (Noun)
The tag NN is used for common nouns (general nouns) without differentiating them based on grammatical information. Words such as ... fall under this category. Example: ... .

NNC (Compound Noun)
Compound nouns are tagged NNC. Words such as ... fall under this category. Example: ... .

NNP (Proper Noun)
The tag NNP tags proper nouns. Words such as ... fall under this category. Example: ... .

NNPC (Compound Proper Noun)
Compound proper nouns are tagged NNPC. Words such as ... fall under this category. Example: ... .

ORD (Ordinal)
Expressions denoting ordinals are tagged ORD. Words such as ... fall under this category. Example: ... .

CRD (Cardinal)
The tag CRD tags the cardinals (numbers) in the language. Words such as ... and 42 fall under this category. Example: ... .

PRP (Pronoun)
All pronouns are tagged PRP. Words such as ... fall under this category. Example: ... .

ADJ (Adjective)
All adjectives in the language are tagged ADJ. Words such as ... fall under this category. Example: ... .

ADV (Adverb)
The tag ADV tags the adverbs in the language; it is used only for manner adverbs. Words such as ... fall under this category. Example: ... .

VNAJ (Verb Nonfinite Adjective)
The tag VNAJ tags the verb nonfinite adjectives in the language. Words such as ... fall under this category. Example: ... .

VNAV (Verb Nonfinite Adverb)
The tag VNAV tags the verb nonfinite adverbs in the language. Words such as ... fall under this category. Example: ... .

VBG (Verbal Gerund)
All verbal gerunds in the language are tagged VBG. Words such as ... fall under this category. Example: ... .

VF (Verb Finite)
The tag VF is used to tag the finite verbs in the language. Words such as ... fall under this category. Example: ... ?

VAX (Auxiliary Verb)
The tag VAX tags the auxiliary verbs in the language. Words such as ... fall under this category. Example: ... .

VINT (Verb Infinite)
The tag VINT tags the verb infinites in the language; these generally occur together with auxiliary or finite verbs. Words such as ... fall under this category. Example: ... .

CVB (Conditional Verb)
The tag CVB tags the conditional verbs in the language. Words such as ... fall under this category. Example: ... .

CNJ (Conjunction)
The tag CNJ is used for tagging both co-ordinating and subordinating conjuncts. Words such as ... fall under this category. Example: ... .

QW (Question Word)
The question words in the language, such as ..., are tagged QW. Example: ... ?

COM (Complementizer)
The tag COM tags the complementizers in the language. Words such as ... fall under this category. Example: ... .

NNQ (Quantity Noun)
The tag NNQ tags the quantity nouns in the language. Words such as ... fall under this category. Example: ... .

PPO (Postposition)
All the Indian languages have the phenomenon of postpositions. Postpositions are tagged PPO. Words such as ... fall under this category. Example: ... .

DET (Determiner)
The tag DET tags the determiners in the language. Words such as ... fall under this category. Example: ... .

INT (Intensifier)
An intensifier is used for intensifying adjectives or adverbs in a language. Words such as ... fall under this category.

EMP (Emphasis)
The tag EMP tags the emphasis words in the language. Words such as ... fall under this category. Example: ... .

COMM (Comma)
The tag COMM tags the comma in a sentence. Example: ... .

DOT (Dot)
The tag DOT tags the dots (periods) in a sentence. Example: ... .

QM (Question Mark)
The question marks in the language are tagged QM. Example: ... ?
5.2.4 SVM Tool
The tool used for POS tagging is the Support Vector Machines (SVM) tool [21]. The SVM tool is a simple, flexible and effective generator of sequential taggers based on SVMs, applied here to the problem of part-of-speech tagging. This SVM-based tagger is robust and flexible for feature modeling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger under exactly the same conditions and achieves a very competitive accuracy for Tamil.
Generally, tagging is required to be as accurate and as efficient as possible, but there is a trade-off between these two desirable properties: obtaining higher accuracy relies on processing more and more information, digging deeper and deeper into it. Depending on the kind of application, a loss in efficiency may be acceptable in order to obtain more precise results, or, the other way around, a slight loss in accuracy may be tolerated in favor of tagging speed. Moreover, some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns. The tag set size and ambiguity rate may also vary from language to language and from problem to problem. Besides, if few data are available for training, the proportion of unknown words may be huge; sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape.
Another very interesting property for sequential taggers is portability. Multilingual information is a key ingredient in NLP tasks such as machine translation, information retrieval, information extraction, question answering and word sense disambiguation, to name a few. Therefore, having a tagger that works equally well for several languages is crucial for system robustness. Besides, for some languages in particular, but also in general, lexical resources are hard to obtain, so ideally a tagger should be capable of learning from less (or even no) annotated data. The SVM tool is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework and by offering NLP researchers a highly customizable sequential tagger generator.
5.2.4.1 Properties of SVM Tool
The following are the properties of the SVM Tool:
Simplicity
The SVM Tool is easy to configure and to train. The learning is controlled by means of a very simple configuration file, and there are very few parameters to tune.
The tagger itself is very easy to use, accepting standard input and output pipelining; embedded usage is also supported by means of the SVM Tool API.
Flexibility
The size and shape of the feature context can be adjusted. Rich features can be defined, including word and POS (tag) n-grams as well as ambiguity classes and "maybe" features, apart from lexicalized features for unknown words and sentence-level information. The behavior at tagging time is also very flexible, allowing different strategies.
Robustness
The overfitting problem is addressed by tuning the C parameter in the soft-margin version of the SVM learning algorithm. A sentence-level analysis may also be performed in order to maximize the sentence score, and several strategies have been implemented and tested so that unknown words do not punish the system effectiveness too severely.
Portability
The SVM Tool is language independent. It has been successfully applied to English and Spanish without a priori knowledge other than a supervised corpus. Moreover, for languages in which labeled data is a scarce resource, the SVM Tool can also learn from unsupervised data based on the role of non-ambiguous words, with the only additional help of a morpho-syntactic dictionary.
Accuracy
Compared to state-of-the-art POS taggers reported to date, it exhibits a very competitive accuracy (97.2% for English on the WSJ corpus). Rich feature sets allow most of the information involved to be modeled very precisely. The learning paradigm, SVM, is also very suitable for working accurately and efficiently with high-dimensional feature spaces.
Efficiency
Performance at tagging time depends on the feature set size and the tagging scheme selected. For the default (one-pass, left-to-right, greedy) tagging scheme, it exhibits a tagging speed of 1,500 words/second, whereas the C++ version achieves over 10,000 words/second. This has been achieved by working in the primal formulation of the SVM. The use of linear kernels makes the tagger more efficient at both tagging and learning time, but forces the user to define a richer feature space; however, the learning time remains linear with respect to the number of training examples.
5.2.4.2 Components of SVM Tool
The SVM Tool software package consists of three main components: the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component; different models are learned for the different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy that is most suitable for the purpose of the tagging. Finally, given a correctly annotated corpus and the corresponding SVMTool-predicted annotation, the SVMTeval component reports the tagging results.
SVMTlearn
Given a training set of examples (either annotated or unannotated), SVMTlearn is responsible for training a set of SVM classifiers. To do so, it makes use of SVMlight, an implementation of Vapnik's Support Vector Machines in C developed by Thorsten Joachims.
Training Data Format
Training data must be in column format, i.e., a token per line, in a sentence-by-sentence fashion. The column separator is the blank space. The token is expected to be the first column of the line, and the tag to predict takes the second column; the rest of the line may contain additional information. See an example:
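A minimal illustrative fragment in this column format, using English placeholder words and tags from the Amrita tag set (the words and their tags are invented for illustration; the report's own example is a Tamil sentence), might look like:

    birds NN
    fly VF
    . DOT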
SVMTlearn behavior is easily adjusted through a configuration file.
Usage: SVMTlearn [options] <config-file>
Options:
-V verbose: 0 none, 1 low [default], 2 medium, 3 high
Example: SVMTlearn -V 2 config.svmt
These are the currently available config-file options:
Sliding window: The size of the sliding window for feature extraction can be adjusted, and the core position at which the word to disambiguate is located may be selected. By default, the window size is 5 and the core position is 2, counting from 0; the default window size of 5 was used here. (A short illustrative sketch of this windowing appears at the end of this option list.)
Feature set: Three different kinds of feature types can be collected from the sliding window:
Word features: word-form n-grams. Usually unigrams, bigrams and trigrams suffice. The sentence-final word, which corresponds to a punctuation mark ('.', '?', '!'), is also important.
POS features: annotated part-of-speech and ambiguity-class n-grams, and "maybe" features. As for words, considering unigrams, bigrams and trigrams is enough. The ambiguity class of a word determines which POS tags are possible for it; a "maybe" states, for a certain word, that a certain POS may be possible, i.e., that it belongs to the word's ambiguity class.
Lexicalized features: prefixes and suffixes, capitalization, hyphenation, and similar information related to a word form. In Tamil, capitalization is not used, but all the other features are accepted.
Default feature sets are defined for every model.
Feature filtering: The feature space can be kept to a convenient size; smaller models allow higher efficiency. By default, no more than 100,000 dimensions are used. Features appearing fewer than n times can also be discarded, which helps the system both to fight overfitting and to exhibit higher accuracy; by default, features appearing just once are ignored.
SVM model compression: Weight-vector components below a given threshold can be filtered out of the resulting SVM models, enhancing efficiency by decreasing the model size while still preserving the accuracy level. This behavior of SVM models is currently under study: discarding up to 70% of the weight components leaves accuracy stable, and it is not until 95% of the components are discarded that accuracy falls below the current state of the art (97.0%-97.2%).
C parameter tuning: In order to deal with noise and outliers in the training data, the soft-margin version of the SVM learning algorithm allows the misclassification of certain training examples when maximizing the margin. This balance can be adjusted automatically by optimizing the value of the C parameter. A local maximum is found by exploring accuracy on a validation set for different C values at progressively shorter intervals.
Dictionary repairing: The lexicon extracted from the training corpus can be automatically repaired, either based on frequency heuristics or on a list of corrections supplied by the user; this makes the tagger robust to corpus errors. A heuristic threshold may also be specified so that a (word x, tag y) pair is considered a tagging error when it occurs less than a certain proportion of times relative to the number of occurrences of word x. For example, with a threshold of 0.001, (run, DT) would be considered an error if the word run had been seen at least 1000 times and tagged as DT only once. This kind of heuristic dictionary repairing does not harm tagger performance; on the contrary, it may help considerably. The repairing list must comply with the SVMTool dictionary format.
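Following the sliding-window option described above, here is a small sketch (Python; the function name and the exact feature strings are invented for illustration, not SVMTool's real feature templates) of extracting word n-gram features from a window of size 5 with core position 2:

    # Illustrative sliding-window feature extraction: unigrams and
    # bigrams from a 5-word window whose core position is index 2.
    def window_features(words, i, size=5, core=2):
        padded = ["<s>"] * core + words + ["</s>"] * (size - core - 1)
        win = padded[i:i + size]                      # window around word i
        feats = [f"w[{j - core}]={w}" for j, w in enumerate(win)]
        feats += [f"w[{j - core},{j - core + 1}]={win[j]}_{win[j + 1]}"
                  for j in range(size - 1)]
        return feats

    print(window_features("birds fly very high today".split(), 2))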