PATTERN BASED ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM

A Thesis submitted for the degree of

Master of Science (by research) in the School of Engineering

By SARAVANAN. S

Centre for Excellence in Computational Engineering
Amrita School of Engineering
Amrita Vishwa Vidyapeetham University
Coimbatore – 641112

March, 2012

Amrita School of Engineering Amrita Vishwa Vidyapeetham, Coimbatore – 641112

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled “PATTERN BASED ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM” submitted by SARAVANAN. S (Reg. No.: CB.EN.M*CEN09008) for the award of the degree of Master of Science (by research) in the School of Engineering, is a bonafide record of the research work carried out by him under my guidance. He has satisfied all the requirements put forth for the project and has completed all the formalities regarding the same to the fullest of my satisfaction.

Ettimadai, Coimbatore. Date:

DR. K P SOMAN RESEARCH GUIDE AND HEAD, CEN.

Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore – 641112
Centre for Excellence in Computational Engineering

DECLARATION

I, SARAVANAN. S (REG. NO.: CB.EN.M*CEN09008), hereby declare that this thesis entitled PATTERN BASED ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM is the record of the original work done by me under the guidance of Dr. K P Soman, Head, Centre for Excellence in Computational Engineering, Amrita School of Engineering, Coimbatore, and that, to the best of my knowledge, this work has not formed the basis for the award of any degree / diploma / associateship / fellowship or similar award to any candidate in any University.

Place: Ettimadai
Date:

Signature of the Student

Countersigned by

K P SOMAN PROFESSOR AND HEAD, CEN, AMRITA VISHWA VIDYAPEETHAM, COIMBATORE.

ACKNOWLEDGEMENTS

I would like to thank all the minds that helped mould me into what I am now. To all the hands that crafted me and helped me complete this thesis, I am too inarticulate to express my gratitude; all I can say is 'thanks', the only word I can recollect from my mental lexicon.

I am deeply indebted to my thesis advisor, Prof. K. P. Soman, the most energetic person I know, from whom I learnt how things are to be learned in the way they are to be learned. Without the inspiration I drew from him, I would not have pursued a research career, something I had never dreamt of. I sincerely thank him, who is touted as a hero in my university circle, for patting my shoulder whenever necessary and for being a good critic.

Though motivation and resources sufficed to get going, I stumbled and struggled, being naive especially in the area of linguistics. I was fortunate to learn from and work with Prof. A. G. Menon (University of Leiden), who transformed me from someone who knows the language into someone who knows the language scientifically. The day-long discussions and lessons, delivered with a parental touch, definitely put some character into me. I am very grateful to him for being a great well-wisher and mentor, and for everything.

I would like to thank Dr. C. J. Srinivasan (University of Edinburgh), who introduced NLP to me formally, and Dr. Hemant Darbari (Executive Director, CDAC-Pune), who groomed my skills in MT.

My special thanks to the three horsemen of the CEN department: Mr. Loganathan Ramasamy (Charles University), Mr. Vijaya Krishna Menon (Caledonian College of Engineering, Muscat) and Mr. Rakesh Peter, who indulged and moulded me. The numerous hour-long debates and discussions helped me not only to develop academic skills but also to gain worldly knowledge, which later became the seed of my own ideologies and my perception of the universe as a whole.

I would also like to thank the all-rounder Mr. Ajith, creator of the English-Tamil transliteration tool integrated with the MT system, for providing me the non-licensed version of the tool; trouble-shooter Mr. Sajan for helping me configure the online version of the translation system; shiv-monster Mr. Shivapratap for recharging me with his preposterous, funny, no-one-dares-to-do antics; and GUI designer Mr. Senthil for developing the interface for the standalone version.

This thesis would not have been possible without the linguistic data. I would like to thank the contributors: Mrs. Meena Sukumaran for the English-Tamil proper-name and place-name transliteration parallel corpora; Mrs. Dhanalakshmi, Mr. V. Anand Kumar, Mr. C. Murugeshan, Mrs. Kaveri, Mr. Harshawardhan and Mr. Rama Samy for English-Tamil lexicon data creation; and Miss Mridula Sara Augstine and Mrs. Dhanya for English-Malayalam lexicon creation. Last but not least, I would like to thank the CEN department and its faculty and research fellows for their unconditional support.

ABSTRACT

The native languages all over the world are growing rapidly along with the growth of technology in general and information technology in particular. On the one hand, the world experiences growth in native languages; on the other, precious and nascent information comes through foreign languages. The demand for making this information available in native languages is increasing, so an efficient and practical method is needed to fulfill it. This thesis describes one such method: an English to Tamil Machine Translation system. The method employs a pattern-based reordering mechanism and a dependency-to-morphological-feature transformation method. Though the thesis elucidates an English to Tamil translation system, the methodology can be extended to other Dravidian languages. Like any conventional rule-based system, the developed system parses the input sentence, reorders it to obtain the target phrasal structure, replaces the words of the sentence with their target equivalents, and finally synthesizes the target words to obtain the complete word forms. Though various modules are involved in the process of translation, assiduous research effort in the development of the morphological analyzer and synthesizer paid off, and this reflects in their performance. The corroboratory inferences from the results emphasize the need for a morphological synthesizer for agglutinative languages, a translation engine for translating English to Dravidian languages, and a better dependency-to-morphological-feature transformation method. The accuracy of the morphological analyzer is 88%. On a 1-to-5 scale, the system scores 3 for 63% of the sentences and 4 or 5 for 14% of the sentences tested from the EILMT Tourism corpus.


TABLE OF CONTENTS

1 INTRODUCTION ... 1
  1.1 MACHINE TRANSLATION FOR DUMMIES ... 2
  1.2 IS IT WORTH GOING MT WAY? ... 3
  1.3 PREVIOUS WORK ON ENGLISH-TAMIL MT ... 3
  1.4 OVERVIEW OF CHAPTERS ... 4
2 THEORETICAL BACKGROUND ... 6
  2.1 APPROACHES ... 6
  2.2 RULE-BASED MT ... 6
  2.3 INTER-LINGUAL MT ... 7
  2.4 TRANSFER-BASED MT ... 7
  2.5 STATISTICAL MT ... 7
  2.6 CONTEXT FREE GRAMMARS ... 8
  2.7 PROBABILISTIC CFG (PCFG) ... 9
  2.8 PARSING ... 9
  2.9 SYNCHRONOUS CONTEXT FREE GRAMMAR ... 10
  2.10 SYNCHRONOUS TREE ADJOINING GRAMMAR ... 11
3 SYSTEM OVERVIEW ... 16
4 RULE BASED WORD REORDERING ... 27
  4.1 TAMIL SENTENCE STRUCTURE ... 27
  4.2 ENGLISH STRUCTURE TO TAMIL STRUCTURE ... 27
  4.3 PATTERN BASED CFG FOR REORDERING ... 28
    4.3.1 DISSECTION OF THE REORDERING RULE ... 29
      4.3.1.1 SOURCE PATTERN ... 29
      4.3.1.2 TARGET PATTERN ... 29
      4.3.1.3 TRANSFER LINK ... 30
      4.3.1.4 REORDERING RULE ... 31
    4.3.2 REORDERING ALGORITHM – SIMPLIFIED VERSION ... 31
    4.3.3 ARCHITECTURE ... 35
      4.3.3.1 REORDERING EXAMPLE ... 35
      4.3.3.2 LIMITATIONS OF THE REORDERING RULES ... 36
      4.3.3.3 HANDLING PREPOSITION ... 37
    4.3.4 HANDLING COPULA ... 39
    4.3.5 HANDLING AUXILIARY ... 39
    4.3.6 HANDLING RELATIVE PRONOUNS ... 39
5 PHRASAL STRUCTURE VIEWER ... 40
  5.1 NEED ... 40
  5.2 INTRODUCTION ... 40
  5.3 ARCHITECTURE OF P-VIEWER ... 41
  5.4 TREE RENDERING ... 42
  5.5 REORDERING RULE FORMAT ... 43
  5.6 FILTERED SEARCH ... 44
  5.7 SCREEN SHOT OF THE P-VIEWER TOOL ... 46
  5.8 REORDERING RULE EDITOR ... 47
  5.9 DICTIONARY EDITOR ... 47
  5.10 MULTIPLE SENTENCE AND WORD OPTIONS ... 48
  5.11 MORPH SYNTHESIZER FEATURES INFORMATION ... 48
  5.12 STEP BY STEP PROCEDURE TO CREATE A NEW RULE IN P-VIEWER ... 49
6 MAPPING OF DEPENDENCY TO MORPHOLOGICAL FEATURE ... 52
  6.1 INTRODUCTION AND WHY NEED FEATURE EXTRACTOR ... 52
  6.2 STANFORD TYPED DEPENDENCIES ... 54
  6.3 MORPHOLOGICAL FEATURE INFORMATION FROM DEPENDENCY RELATIONS ... 55
  6.4 INFORMATION FOR VERB ... 57
  6.5 COMPUTING AUXILIARY INFORMATION ... 59
  6.6 HANDLING COPULA SENTENCES ... 61
  6.7 COPULA: MORE ISSUES ... 62
  6.8 HANDLING DATIVE VERBS ... 62
  6.9 HANDLING MULTIPLE SUBJECTS ... 65
7 MORPHOLOGICAL ANALYZER AND SYNTHESIZER ... 66
  7.1 NEED OF SYNTHESIZER FOR MACHINE TRANSLATION ... 66
  7.2 WHY MORPHOLOGICAL ANALYZER ... 68
  7.3 INTRODUCTION ... 68
  7.4 MORPH ANALYZER AND SYNTHESIZER – SYSTEM ARCHITECTURE ... 70
    7.4.1 NOUN LEXICON ... 71
    7.4.2 MORPHOTACTICS MODEL ... 71
    7.4.3 ORTHOGRAPHIC MODEL ... 72
  7.5 BUILDING A SIMPLIFIED MORPH ANALYZER AND SYNTHESIZER ... 73
    7.5.1 FILE FORMAT FOR FSM TOOL KIT ... 74
    7.5.2 FSM TOOLKIT COMMANDS ... 76
    7.5.3 FST MODEL FOR MORPHOTACTICS RULE OF NOUN (SIMPLIFIED VERSION) ... 77
    7.5.4 FST MODEL FOR ORTHOGRAPHIC RULE ... 78
    7.5.5 TWO LEVEL MORPHOLOGY WITH AN EXAMPLE WORD ... 79
8 EXPERIMENTS AND RESULTS ... 83
  8.1 DATA FORMATS ... 83
  8.2 TRANSFER RULES ... 83
  8.3 DEPENDENCY TO MORPH MAPPING ... 84
  8.4 AUXTENSE TO MORPH MAPPING ... 85
  8.5 NOUN LEXICON ... 86
  8.6 TRANSLATION: STEP-BY-STEP PROCESS ... 86
  8.7 TESTING ... 87
    8.7.1 TESTING RESULTS: MORPHOLOGICAL ANALYZER AND SYNTHESIZER ... 88
    8.7.2 CONTRIBUTIONS ... 90
9 SCREEN SHOTS ... 94
10 LIMITATIONS AND FUTURE WORK ... 101
11 CONCLUSION ... 104
REFERENCES ... 105
APPENDIX A ... 108
  A.1 TAMIL TRANSLITERATION SCHEME ... 108
APPENDIX B ... 109
  B.1 REORDERING RULES ... 109
APPENDIX C ... 113
  C.1 TENSE-MORPHOLOGICAL FEATURES LOOKUP TABLE ... 113
APPENDIX D ... 116
  D.1 TAMIL VERB MORPHOLOGY ... 116
  D.2 TAMIL NOUN MORPHOLOGY ... 139
  D.3 ORTHOGRAPHIC RULES ... 141
APPENDIX E ... 148
  E.1 LIST OF POST POSITIONS IN TAMIL (PARTIAL LIST) ... 148
PUBLICATIONS ... 151


LIST OF FIGURES

FIG. 2.1. A PARSE TREE FOR 'BEAUTIFUL GIRL' ... 9
FIG. 2.2. SYNCHRONOUS CFG DERIVATIONS ... 11
FIG. 2.3. ELEMENTARY TREES (INITIAL TREES) ... 11
FIG. 2.4. REWRITTEN ELEMENTARY TREES ... 12
FIG. 2.5. ELEMENTARY TREE REQUIRED FOR THE EXAMPLE SENTENCE ... 13
FIG. 2.6. TAG DERIVATION ... 14
FIG. 2.7. SYNCHRONOUS TAG ... 15
FIG. 3.1. TRANSLATION SYSTEM - BLOCK DIAGRAM ... 17
FIG. 3.2. ANNOTATION OF THE INPUT SENTENCE ... 18
FIG. 3.3. PARSE TREE (RAM GAVE HIM A BOOK) ... 18
FIG. 3.4. REORDERING RULE ... 19
FIG. 3.5. TRANSFORMATION OF PARSE TREE TO REORDERED TREE ... 20
FIG. 3.6. FLATTENING OF TREE ... 21
FIG. 3.7. SYNTHESIZING OPERATION ... 22
FIG. 3.8. TRANSLATED OUTPUT ... 23
FIG. 3.9. DEPENDENCY RELATION TO MORPH FEATURES ... 24
FIG. 4.1. SOURCE RULE ... 29
FIG. 4.2. TARGET RULE ... 30
FIG. 4.3. REORDERING RULE ... 31
FIG. 4.4. PARSE TREE ... 33
FIG. 4.5. REORDERING RULES ... 33
FIG. 4.6. APPLICATION OF RULE R1 AND R2 ... 34
FIG. 4.7. REORDERING ARCHITECTURE ... 35
FIG. 4.8. TRANSFORMATION OF PARSE TREE TO REORDERED TREE ... 36
FIG. 4.9. REORDERING RULES ... 38
FIG. 4.10. PARSE AND REORDER STRUCTURE ... 38
FIG. 5.1. ARCHITECTURE OF P-VIEWER ... 42
FIG. 5.2. FILTERED SEARCH ... 45
FIG. 5.3. SCREEN SHOT OF P-VIEWER ... 46
FIG. 5.4. SCREEN SHOT OF REORDERING RULE EDITOR ... 47
FIG. 5.5. SCREEN SHOT OF DICTIONARY EDITOR ... 48
FIG. 5.6. P-VIEWER CREATION OF NEW RULE ... 49
FIG. 5.7. P-VIEWER: OUTPUT WITH THE NEW RULE ... 51
FIG. 6.1. FLATTENING OF TREE AND REPLACING WORDS WITH TARGET LEXICON ... 52
FIG. 6.2. MORPH FEATURE INFO TO SYNTHESIZER ... 54
FIG. 6.3. DEPENDENCY TREE: RAM GAVE HIM A BOOK ... 55
FIG. 6.4. SL TO TL DEPENDENCY RELATIONS TRANSFORMATION ... 56
FIG. 6.5. FEATURE EXTRACTION ... 59
FIG. 6.6. REORDERING: SENTENCE WITH POSSESSIVE VERB ... 64
FIG. 6.7. PHRASAL STRUCTURE (HE HAS TWO BOOKS) ... 64
FIG. 7.1. MORPHOLOGICAL ANALYZER AND SYNTHESIZER - BLOCK DIAGRAM ... 70
FIG. 7.2. TRANSDUCER FOR MORPHOTACTICS RULE ... 71
FIG. 7.3. TRANSDUCER FOR MORPHOTACTICS RULE - LEXICON-LESS MODEL ... 72
FIG. 7.4. TRANSDUCER FOR ORTHOGRAPHIC RULE FOR LEXICON-LESS MODEL ... 72
FIG. 7.5. TAMIL NOUNS: FSM REPRESENTATION ... 73
FIG. 7.6. TRANSDUCER FOR TAMIL NOUN INFLECTION ... 78
FIG. 7.7. TRANSDUCER FOR V/Y INSERTION RULE ... 79
FIG. 7.8. TRANSDUCER FOR MORPHOTACTICS RULE ... 79
FIG. 7.9. TRANSDUCER FOR ORTHOGRAPHIC / SPELLING RULE ... 80
FIG. 7.10. INPUT WORD IN FINITE-STATE REPRESENTATION ... 80
FIG. 7.11. INTERMEDIATE STAGE FSA ... 80
FIG. 7.12. FST FOR SYNTHESIZED WORD ... 81
FIG. 7.13. FLOW GRAPH OF MORPH SYNTHESIZER ... 81
FIG. 8.1. TRANSFER RULES DB ... 84
FIG. 8.2. DEPENDENCY-MORPH FEATURE DB ... 85
FIG. 8.3. AUXTENSE-MORPH FEATURES DB ... 85
FIG. 8.4. NOUN LEXICON ... 86
FIG. 9.1. GUI OF ENGLISH-TAMIL MT SYSTEM (STAND ALONE VERSION) ... 94
FIG. 9.2. DICTIONARY PANEL AND MORPH SYNTHESIZER PANEL ... 95
FIG. 9.3. GUI: TAMIL MORPH ANALYZER AND GENERATOR ... 96
FIG. 9.4. GUI: MALAYALAM MORPH ANALYZER AND GENERATOR ... 96
FIG. 9.5. GUI: ENGLISH-MALAYALAM MT SYSTEM ... 97
FIG. 9.6. ENGLISH-TAMIL MT SYSTEM (ONLINE VERSION) ... 98
FIG. 9.7. MORPH ANALYZER AND SYNTHESIZER (ONLINE VERSION) ... 99


LIST OF TABLES

TABLE 4.1. POS LABELS ... 28
TABLE 5.1. FORMAT OF REORDERING RULE IN DB ... 44
TABLE 6.1. DEPENDENCY RELATIONS ... 56
TABLE 6.2. LEXICON ... 57
TABLE 6.3. PERSON NUMBER GENDER FEATURE ... 58
TABLE 6.4. AUXTENSE-MORPH FEATURE LOOKUP ... 60
TABLE 7.1. NOUN INFLECTIONS ... 67
TABLE 8.1. TRANSLATION RESULTS ... 87
TABLE 8.2. TESTING RESULTS OF MORPH ANALYZER AND SYNTHESIZER ... 88
TABLE 8.3. IMPLEMENTATION DETAILS ... 91
TABLE 8.4. DEPENDENCY TO MORPHOLOGICAL FEATURE MAPPING ... 92
TABLE 8.5. REORDERING RULES ... 92
TABLE 8.6. NUMBER OF WORDS USED IN MORPH ANALYZER AND SYNTHESIZER MODEL ... 93
TABLE 8.7. NUMBER OF RULES IN MORPH ANALYZER AND SYNTHESIZER ... 93


CHAPTER 1 1 INTRODUCTION Language is not only the means of communication. It could influence our culture and in fact, it influences the thought process of human beings. It is an important element of culture and through the language the culture can be learned and preserved. The native languages all over the world are growing rapidly along with the growth of technology, in general, and information technology, in particular. On the one hand, the world experiences a growth in the native language and on the other hand, precious and nascent information comes through foreign languages. Knowledge of the mother tongue alone is no longer enough to follow the information supplied by the other languages. Because of this ever-increasing gap and the speed with which information is supplied, there is a possibility of death knell for many native languages. Recent research shows language death is accelerated to the rate of two languages per month1. It is necessary to bridge this gap with the help of modern technologies as early as possible, thus enables the information supplied by the other languages available in the native language. This impedes the shrinking of native language speakers, thus helps to save mother tongue eventually, and in turn helps to preserve our culture. Even though it‟s inevitable to prevent the evolving culture, which is insane but an effort to preserve it in some way or another, is appreciated and well received by some of the cultural groups. This notion incites me to work on Machine Translation (MT). The mind blowing applications of MT and its potentiality and impact in future eventually pursues and led me to develop a MT system for English to the one of the longest surviving classical, could be the world‟s oldest surviving, literature rich, culturally significant language Tamil. The aim of this project is to translate the English input sentence to Tamil sentence as close as the human translation and to get comprehendible translated Tamil sentence. 
This project primarily focuses on the development of an MT system to translate English to Tamil. Depending on the success of the prototype model of the MT system, the approach employed for

1 http://www.commonsenseadvisory.com/Default.aspx?Contenttype=ArticleDetAD&tabID=63&Aid=1207&moduleId=391


the prototype can be extended to English-to-Dravidian language pairs such as Malayalam, Kannada, and Telugu. The project also focuses on the morphological analyzer and synthesizer, which are among the important components of the MT system. Various other MT tools that aid linguists are to be developed as well, such as a framework for developing heuristics at various levels of the MT system and for testing and improving the system.

1.1 MACHINE TRANSLATION FOR DUMMIES

The material in this dissertation is not recommended for cubs. Aspirant cubs should commence from other standard materials; but, so as not to disappoint them, here is an elucidation of MT in a few words.

MT is the translation of one natural language into another natural language, mechanically. The simplest method for automatic translation that pops up in a naive mind is word-to-word translation: the words in the Source Language (SL) input sentence are translated into the Target Language (TL). Each word in the SL input sentence is looked up in an exhaustive bilingual dictionary, which provides the TL word equivalences. In the real world, languages are not simple enough to translate SL to TL using this word-to-word method. The bilingual dictionary lookup provides target equivalences for the root or stem words (e.g., boy, give) of the SL, not for the inflected forms (e.g., boys, played, playing), so word-to-word translation fails. What if we had a mechanism that converts inflected forms (boys) to root forms (boy), enabling direct word-to-word translation? Fortunately, we have one, and that mechanism is called a stemmer. The stemmer strips off the inflections (e.g., "s" from boys) and outputs the root / stem form (boy). This approach is called a direct translation system.

The information provided here should suffice for aspirant cubs to spring off into the standard materials, because the other MT approaches cannot be explained without heavyweight MT terminology that the cubs are not yet familiar with. Wait a minute. Do you know what a parser is (just the basics) and what it does? What a parts-of-speech tagger is and what it does? What a morphological synthesizer/generator is and what it does? If the answer is yes to all of these, the system overview chapter may provide some more ideas for rookies.
Give it a try.
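As a toy sketch of the direct translation idea just described, the following Python fragment pairs a naive stemmer with a tiny bilingual lookup table. The dictionary entries and the Tamil romanizations are invented for illustration only; a real system would use a full lexicon and a proper morphological analyzer.

```python
# A toy "direct translation" system: stem each word, then look it up.
# The suffix list and dictionary are deliberately tiny and illustrative.

BILINGUAL = {"boy": "paiyan", "give": "kotu", "book": "puTTakam", "a": "oru"}

def stem(word):
    # Naive suffix stripping: strip a suffix only if the remainder is a
    # known root, so "book" is not mangled while "boys" becomes "boy".
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and word[:-len(suffix)] in BILINGUAL:
            return word[:-len(suffix)]
    return word

def direct_translate(sentence):
    # Word-by-word lookup after stemming; unknown words pass through unchanged.
    return " ".join(BILINGUAL.get(stem(w), w) for w in sentence.lower().split())

print(direct_translate("Boys give a book"))  # -> paiyan kotu oru puTTakam
```

Note that the output keeps the English word order and loses all inflection, which is exactly the weakness of the direct approach that the rest of this thesis addresses.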

1.2 IS IT WORTH GOING THE MT WAY?

To begin from the very beginning: the philosophical arguments and history trace back to the seventeenth century. It is a long way from the seventeenth century to the latest Google online translation services. A couple of episodes from that history are enough to show why the "MT way" is worth discussing here; anyone who hates to have a blank in their history can consult MT history resources2 to fill the gap.

In 1954, the Georgetown experiment [1] gave promising automatic machine translation results for over sixty Russian sentences translated into English [2]. The experiment's team claimed that the machine translation problem was solvable and could be solved within three to five years. From 1954 to the present3, several such five-year spans have passed, but the problem remains wide open, and many researchers are still trying to bring this monstrous artificial intelligence task down to a solvable problem. Though a few methodologies have found some success over the years, the problem is not completely solved. Some of the major players worth mentioning here are SysTran [3], the Japanese MT systems [4], [5], EUROTRA, AltaVista's Babel Fish (which uses SysTran technology), and Google Language Tools (which initially used SysTran technology).

Attacking an unsolved problem is itself worth doing. The European Union spends over 1 billion euros4 annually to make official documents available in all 23 official languages for its 27 member states. This shows how important it is to tackle the challenges in MT as early as possible.

1.3 PREVIOUS WORK ON ENGLISH-TAMIL MT

Though numerous works have been reported in MT using various methodologies, very little work has been done or reported for the Dravidian languages (Tamil, Kannada, Telugu, Malayalam, etc.). Recently, Google released the alpha version of its online MT services for Tamil, Kannada, and Telugu. Google uses the Statistical Machine Translation (SMT) approach5. The quality of the translation is not bad for simple and frequently used sentences. Mostly, the translation output

2 History of MT: http://www.hutchinsweb.me.uk/Nutshell-2005.pdf
3 Present Status of Automatic MT: http://www.mt-archive.info/Bar-Hillel-1960-App3.pdf
4 EU spends 1 billion Euro on language services: http://www.independent.co.uk/news/world/europe/cost-intranslation-eu-spends-83641bn-on-language-services-407991.html
5 Google SMT: http://translate.google.com/about/


is comprehensible, even though long sentences have issues with word ordering and with the morphological generation of the Tamil words.

Renganathan (2002) [6] demonstrated a functional rule-based English-Tamil MT system with a limited set of words and rules; no further work has been reported since. Germann (2001) [7] reported an SMT system trained on a 5000-sentence parallel corpus. Most of the research and development in Tamil NLP has been reported by the AU-KBC research centre, which has reported a prototype English-Tamil MT system; its performance is unknown, and it is not available for testing. The English to Indian Language Machine Translation (Anuvadaksh EILMT6) consortium of 8 academic institutions and 2 government organizations, funded by the Department of IT, India, focuses on developing a domain-specific (tourism and health domain) MT system [8] for 6 language pairs, including English to Tamil. Amrita Vishwa Vidyapeetham is the consortium member that looks after the English-Tamil MT system, along with CDAC, Pune (the leader of the consortium). Though the consortium planned to release four versions of the English-Tamil MT system, based on LTAG7 [9], SMT [10], EBMT [11], and Anal-Gen [12], only the LTAG-based English-Tamil MT was released at the end of the first phase of the project. This is the first viable English-Tamil MT system released to the public. Amrita Vishwa Vidyapeetham is currently working on an English-Tamil MT system funded by the Ministry of Human Resource Development (MHRD)8. The system is available on the university's website9 for testing by closed groups, even though it has not been officially launched by MHRD.

1.4 OVERVIEW OF CHAPTERS

CHAPTER 2 covers the necessary theory. It introduces the various approaches to MT and, in particular, rule-based MT.

CHAPTER 3 gives a general overview of the machine translation (MT) system. It summarizes the necessary components of the MT system, which are explained in greater

6 Anuvadaksh online translation service: http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php
7 XTAG Report: http://www.cis.upenn.edu/~xtag/
8 MT work at Amrita: http://www.amrita.edu/pages/research/projects/cb-pr-38.php
9 Translation Demo: http://nlp.amrita.edu:8080/AMriTs/


detail in the following chapters. The objective of that chapter is to give the reader an idea of the developed MT system as a whole, along with the flow of the system.

CHAPTER 4 introduces the pattern-based word-order transformation methodology for word reordering in the sentence. The format of the reordering rules, and their application to the Source Language (SL) to transform the SL word order into the TL word order, are explained in detail.

CHAPTER 5 introduces the phrasal structure viewer tool, a framework that assists the linguist in creating word reordering rules with visual aid and empowers the linguist to do a thorough analysis of the SL and TL with the help of a visual tree representation.

CHAPTER 6 discusses the mapping of dependency feature information to morphological feature information. The intricacies involved in the feature mapping are elucidated there.

CHAPTER 7 gives a brief introduction to Tamil morphology and the finite state machine toolkit, and details the development of the morphological analyzer and synthesizer using finite state transducers.


CHAPTER 2
2 THEORETICAL BACKGROUND

The process of translation can be described simply as a two-stage process: decoding the meaning of the source language (SL) text and re-encoding that meaning in the target language (TL). This description is not as simple as it seems. The word "decoding" implies interpreting and analyzing all the features of the SL text, a process that requires a profound knowledge of the grammar, semantics, syntax, etc. of the SL. Similarly, the re-encoding process requires a sound knowledge of the TL. The challenge of automatic MT is how to teach a computer to do what human beings do: understand the SL text and create a new TL text based on it. This problem has been approached in a number of ways. The approaches are mainly categorized into the linguistic way and everything else; the latter approaches involve little or no linguistic knowledge.

2.1 APPROACHES

Those who believe that the Natural Language Understanding (NLU) problem must be solved before MT go the linguistic way (the rule-based methods); others go the statistical way (Statistical Machine Translation).

2.2 RULE-BASED MT

The rule-based methods generally parse a text (analyzing its grammar, semantics, etc.) and create an intermediate representation from which the text in the TL is generated. Depending on the nature of the intermediate representation, the method is described as inter-lingual MT or transfer-based MT.


2.3 INTER-LINGUAL MT The SL text to be translated is transformed to the inter-lingua, an abstract language independent representation. The TL text is then generated from the language independent representation.

2.4 TRANSFER-BASED MT

The notion of having an intermediate representation that captures the meaning of the original sentence, in order to generate the correct translation, is the same for inter-lingual and transfer-based MT. Even though both follow the same pattern, using linguistic rules to obtain the intermediate representation from the SL text, they differ in the intermediate representation itself. The inter-lingua's intermediate representation must be language independent, whereas in transfer-based MT it depends on the source and target language pair.

2.5 STATISTICAL MT

The idea of seeing translation as decryption, where the SL text is treated as an encrypted form of the TL text and translation solves the problem of decrypting it, comes from information theory. SMT is an MT paradigm in which translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The translation of SL to TL is done according to the probability distribution prob(TL | SL). Bayes' theorem is used to model this distribution:

prob(TL | SL) ∝ prob(SL | TL) * prob(TL)

where the translation model prob(SL | TL) is the probability that the SL string is a translation of the TL string, and the language model prob(TL) is the probability of seeing that target language string. Finding the best translation is done by choosing the one that gives the highest probability:

TL^ = argmax over TL of prob(SL | TL) * prob(TL)
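The decision rule above can be illustrated with toy numbers. The candidate translations and their probabilities below are invented for illustration; a real SMT decoder searches a huge hypothesis space rather than scoring a fixed list.

```python
# Noisy-channel decoding sketch: pick the TL candidate that maximizes
# prob(SL|TL) * prob(TL). All numbers here are made up for illustration.

candidates = {
    # TL candidate: (translation model prob(SL|TL), language model prob(TL))
    "naan oru puTTakam kotuTTEn": (0.30, 0.20),
    "naan kotuTTEn oru puTTakam": (0.30, 0.02),  # bad TL word order -> low LM
    "avan oru puTTakam kotuTTAn": (0.05, 0.20),  # poor match to the SL sentence
}

def decode(cands):
    # argmax over candidates of translation-model score times LM score
    return max(cands, key=lambda tl: cands[tl][0] * cands[tl][1])

print(decode(candidates))  # -> naan oru puTTakam kotuTTEn
```

Note how the language model penalizes a hypothesis with good lexical choices but un-Tamil word order, which is exactly the division of labor the equation expresses.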

2.6 CONTEXT FREE GRAMMARS

A context free grammar (CFG) is a formal grammar that consists of a set of production rules; it is also called a Phrase-Structure Grammar [13]. The production rules describe how the symbols of the language can be grouped and ordered together. For example, the following productions express that a noun phrase can be composed of either a proper noun or an adjective followed by a noun. These can be hierarchically embedded.

NP → Adjective Noun    (R1)
NP → ProperNoun        (R2)
Adjective → beautiful  (R3)
Noun → girl            (R4)

The lexical items 'beautiful' and 'girl' are called terminal symbols. The symbols that express the grouping of these terminal symbols are called non-terminals. In a production rule, the item to the left of the arrow '→' is a non-terminal symbol, and to the right of the arrow is a list of one or more terminals or non-terminals. A CFG is used to describe the structure of the sentences and words in a language; it provides a mechanism for describing how phrases are built from smaller blocks. Read the arrow '→' as "rewrite the symbol on the left with the set of symbols on the right". The symbol NP can be rewritten as 'Adjective Noun' as in rule R1, or alternatively as 'ProperNoun' as in rule R2. Further, 'Adjective Noun' is rewritten as 'beautiful girl'. This sequence of rewritings is called a derivation of the string of words. The resulting hierarchically embedded set of symbols is represented by the parse tree shown in figure 2.1.


FIG. 2.1. A PARSE TREE FOR 'BEAUTIFUL GIRL'

The set of strings derivable from a single non-terminal symbol, called the start symbol (the root node in the parse tree representation), is a formal language, which can be defined by a CFG. The start symbol / root node is often interpreted as the sentence node (S), and the set of strings derivable from S is the set of sentences.
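The derivation of 'beautiful girl' from NP via rules R1, R3, and R4 can be sketched mechanically. This is a minimal illustration, not a parser: it always takes the first alternative for each non-terminal.

```python
# Leftmost CFG derivation sketch using rules R1-R4 from the text.
# Grammar: non-terminal -> list of alternative right-hand sides.
GRAMMAR = {
    "NP": [["Adjective", "Noun"], ["ProperNoun"]],
    "Adjective": [["beautiful"]],
    "Noun": [["girl"]],
}

def derive(symbols):
    """Rewrite every non-terminal (taking the first alternative each time)
    until only terminal symbols remain."""
    out = []
    for sym in symbols:
        if sym in GRAMMAR:
            out.extend(derive(GRAMMAR[sym][0]))  # apply first production
        else:
            out.append(sym)                       # terminal: keep as-is
    return out

print(" ".join(derive(["NP"])))  # -> beautiful girl
```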

2.7 PROBABILISTIC CFG (PCFG)

A CFG whose productions are augmented with probabilities is a PCFG, also known as a Stochastic Context Free Grammar (SCFG) [13]. A PCFG augments each production rule with a conditional probability, e.g., Noun → girl [0.5]. This is the probability that the non-terminal symbol 'Noun' will be expanded to the given string, in this case 'girl'; it is often written as prob(Noun → girl) or prob(Noun → girl | Noun).

2.8 PARSING

Parsing is the task of recognizing a sentence and assigning a syntactic structure to it [13]. The task is to determine how the input is derived from the start symbol. This can be done in two main ways: top-down parsing and bottom-up parsing. Some parsers that use top-down parsing are the recursive descent parser, the LL parser, and the Earley parser. Some parsers that use bottom-up parsing are the precedence parser, the bounded context parser, the LR parser, and the CYK parser.

2.9 SYNCHRONOUS CONTEXT FREE GRAMMAR

A synchronous CFG is like a CFG except that its production rules have two right-hand sides (called the source side and the target side) instead of one [14], [15]. A synchronous CFG derivation for the example sentence "I gave a book" is shown in figure 2.2. The source side and target side symbols are linked by the numbering. For example, a synchronous CFG for an English and Tamil sentence pair is as follows:

I.   S → (NP1 VP2, NP1 VP2)
II.  VP → (V1 NP2, NP2 V1)
III. NP → (I, naan)
IV.  NP → (a book, oru puTTakam)
V.   V → (gave, kotu)

The non-terminal symbols of the source side are mapped to the non-terminal symbols of the target side by the numbering. The symbol mapping is one-to-one, and the numbering constraints must be satisfied. As in a CFG, the non-terminal symbols are rewritten using production rules; the synchronous CFG rewrites the non-terminal symbols on the source side and the target side simultaneously. Similar to a CFG derivation, a synchronous CFG derivation can be viewed as a tree, but here we get a pair of derivation trees. Starting with the symbol S, as in a CFG:

(S1, S1) ⇒ (NP1 VP2, NP1 VP2) ⇒ (NP1 V2 NP3, NP1 NP3 V2) ⇒ (I gave a book, naan oru puTTakam kotu)
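The paired derivation above can be sketched in a few lines of Python. The rule encoding, and the way the two NP lexical choices are supplied, are simplifications invented for this illustration; the essential point is that each rule rewrites a linked non-terminal on both sides at once.

```python
# Synchronous CFG derivation sketch for rules I-V from the text.
# Each rule: non-terminal -> (source RHS, target RHS); integers link the
# source and target occurrences of the same child.

RULES = {
    "S":  ([("NP", 1), ("VP", 2)], [("NP", 1), ("VP", 2)]),   # rule I
    "VP": ([("V", 1), ("NP", 2)], [("NP", 2), ("V", 1)]),     # rule II
}
LEX = {"NP:subj": ("I", "naan"),                # rule III
       "NP:obj": ("a book", "oru puTTakam"),   # rule IV
       "V": ("gave", "kotu")}                  # rule V

def subderive(sym, np_roles):
    if sym in RULES:
        return expand(sym, np_roles)
    if sym == "NP":
        return LEX[np_roles.pop(0)]  # next NP lexical choice, in order
    return LEX[sym]

def expand(symbol, np_roles):
    src_rhs, tgt_rhs = RULES[symbol]
    # Expand each linked child once, then reuse the result on the target
    # side so the linked slots stay synchronized.
    children = {link: subderive(sym, np_roles) for sym, link in src_rhs}
    src = " ".join(children[link][0] for _, link in src_rhs)
    tgt = " ".join(children[link][1] for _, link in tgt_rhs)
    return src, tgt

print(expand("S", ["NP:subj", "NP:obj"]))
# -> ('I gave a book', 'naan oru puTTakam kotu')
```

Rule II is where the SVO-to-SOV reordering happens: the same V and NP children appear in swapped order on the target side.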


FIG. 2.2. SYNCHRONOUS CFG DERIVATIONS

2.10 SYNCHRONOUS TREE ADJOINING GRAMMAR

In synchronous TAG [15], the productions are pairs of elementary trees [16] (see figure 2.3). The non-terminals of the source and target nodes are linked, as in synchronous CFG. There are two types of elementary trees: initial trees and auxiliary trees. An initial tree is attached at a leaf node of another elementary tree when the root of the first tree matches that leaf; this process is called substitution. An auxiliary tree is attached through a two-step substitution process called adjoining. An auxiliary tree has a special leaf node, called the foot node, marked with an asterisk '*'.

FIG. 2.3. ELEMENTARY TREES (INITIAL TREES)

In a TAG derivation [17], [18], we start with an elementary tree rooted in the start symbol, then repeatedly choose a leaf non-terminal symbol X and attach to it an elementary tree rooted in X. Where a CFG production rule rewrites a symbol with a set of symbols, in TAG the production rule is an elementary tree: the initial elementary tree B is substituted into tree A, and C into A. The process of repeated rewriting is shown in figure 2.4.

FIG. 2.4. REWRITTEN ELEMENTARY TREES

The auxiliary tree adjoining process is explained with the following example: "The boy who is my classmate gave a book". The initial and auxiliary trees required for this example sentence are shown in figure 2.5.


FIG. 2.5. ELEMENTARY TREE REQUIRED FOR THE EXAMPLE SENTENCE

The auxiliary tree D adjoins with the initial tree A by the following two-stage substitution process: 1) the leaf node NP1 of tree A is removed and substituted into the auxiliary tree D; 2) the auxiliary tree is substituted at the leaf node of tree A. TAG starts with an elementary tree rooted in the start symbol and repeatedly attaches elementary trees to obtain the derivation.


FIG. 2.6. TAG DERIVATION

Synchronous TAG generalizes TAG exactly as synchronous CFG generalizes CFG. It starts with a pair of elementary trees in which the nodes of the source elementary trees are linked with those of the target. Figure 2.6 shows the TAG derivation tree, and figure 2.7 the synchronous TAG derivation tree, of the example sentence "The boy who is my classmate gave a book".


FIG. 2.7. SYNCHRONOUS TAG


CHAPTER 3
3 SYSTEM OVERVIEW

The aim of the system is to translate a source language sentence into a target language sentence. Any rule-based machine translation system from a Subject-Verb-Object structured source language to a Subject-Object-Verb structured target language involves a word reordering transformation technique. This machine translation system uses a pattern-based word reordering technique [19].

Tamil is largely a free word order language, but the word order is still bounded in certain constructions, at least in the literary language. Though there is no restriction on constructing a sentence in any order, verb-final sentences are the most widely accepted formally, and placing the verb in other positions makes the sentence more poetic and less fluent in the modern literary language. The dependency relations between the words define the syntax, and the word order has little say in the syntax of a Tamil sentence. The relation between the words in the sentence is defined by postpositions, case endings, or inflections.

Although the dependency transfer method is arguably the best method for translating the English-Tamil pair, the developed system uses the pattern-based reordering mechanism. The main reason is that the transformation rules required are minimal and much easier to develop than those of the dependency transfer method. In the target language, the case endings (the counterpart of English prepositions), which define the relations between nouns, are not isolated words; they are glued to the preceding noun. Synthesizing the noun with its case endings requires dependency feature information. This is the reason for choosing the Stanford parser: it provides both the typed dependency relations [20] and the parse tree structure [21], [22]. The reordering is done using the parse tree structure, and the word generation (synthesis) uses the typed dependency features.
The general block diagram of a simplified version of the translation system is shown in figure 3.1. The basic components required for machine translation are shown in the block diagram (refer to figure 3.1), and the more intricate components are saved for later chapters. As far as the developed MT system is concerned, little needed to be done with the English parts-of-speech tagger and parser; the system mainly focuses on the remaining components in the block diagram. For parsing, the Stanford parser packages are used to obtain the parse tree structure and the typed dependency relations. The contribution of this project starts after the parsing stage; the technical background of the POS tagger and parser is out of the scope of this dissertation.

FIG. 3.1. TRANSLATION SYSTEM- BLOCK DIAGRAM

The objective of this chapter is to give a general overview and the flow of the MT system. The system is best explained with an example: the source language sentence "Ram gave him a book" is used throughout to explain the components of the system.


FIG. 3.2. ANNOTATION OF THE INPUT SENTENCE

FIG. 3.3. PARSE TREE (RAM GAVE HIM A BOOK)

The input sentence "Ram gave him a book" is parsed using the Stanford parser with one of its English parser models. The parser requires an annotated sentence as input, so the input sentence is POS tagged before being passed to the parser. The annotated sentence is shown in figure 3.2; the POS tagger annotates the sentence with the UPenn tag set. The parser recognizes the input sentence and assigns a syntactic structure to it (refer to figure 3.3).

Word reordering cannot be done blindly, knowing only the position of each word in the sentence. Grammatical knowledge of how the input sentence is constructed serves word reordering far better than blind reordering. The parse tree representation has an equivalent bracketed representation: "(S (NP (NNP Ram)) (VP (VBD gave) (NP (PRP him)) (NP (DT a) (NN book))))". The tree representation is used in this document wherever necessary.

The word reordering is done by swapping the nodes of the tree based on transformation rules. Figure 3.4 is the tree representation of a reordering rule. A transformation rule has three parts: the source pattern, the target pattern, and the transfer link. The rules are similar to context free grammar (CFG) rules, except that they specify the structure of two phrases, one in the source language and the other in the target language.

FIG. 3.4. REORDERING RULE

The tree-reordering algorithm searches for the source rule pattern in the parse tree structure. If a match is found, the source rule pattern is replaced with its counterpart target rule pattern. The node mapping is done using the transfer link. In the figure, the numbering shows the transfer link, a mapping of the source rule pattern's nodes to the target rule pattern's nodes.
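The search-and-swap step can be sketched as follows, assuming a nested-list tree encoding and a rule reduced to a child-label pattern plus a permutation. Both the encoding and the rule format are simplifications invented for illustration; the actual system uses the transfer-link notation and a rule database.

```python
# Tree-reordering sketch. Trees are nested lists: [label, child1, child2, ...]
# A hypothetical rule for "VP (VBD NP NP) -> VP (NP NP VBD)": the "order"
# list gives, for each target slot, the index of the source child to place.

RULE = {"parent": "VP", "src": ["VBD", "NP", "NP"], "order": [1, 2, 0]}

def reorder(tree, rule):
    if not isinstance(tree, list):
        return tree                                   # leaf word: unchanged
    label, children = tree[0], [reorder(c, rule) for c in tree[1:]]
    if label == rule["parent"] and [c[0] for c in children] == rule["src"]:
        children = [children[i] for i in rule["order"]]  # apply transfer link
    return [label] + children

parse = ["S", ["NP", ["NNP", "Ram"]],
              ["VP", ["VBD", "gave"], ["NP", ["PRP", "him"]],
                     ["NP", ["DT", "a"], ["NN", "book"]]]]
print(reorder(parse, RULE))
# VP children become: him-NP, a-book-NP, gave-VBD  (SVO -> SOV)
```

Because the function recurses before matching, nested verb phrases would be reordered bottom-up, which is the usual choice for this kind of pattern replacement.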


In the example parse tree, the source rule pattern matches and is replaced with the target rule pattern. Figure 3.5 shows the transformation.

FIG. 3.5. TRANSFORMATION OF PARSE TREE TO REORDERED TREE

The reordered tree structure is more or less the target phrasal structure. The leaf nodes of the tree are still anchored with the English words, but the source language sentence pattern has been transformed into the target language pattern: SVO → SOV. More than 150 such transformation rules have been identified for the English-Tamil language pair. The English words anchoring the reordered tree are then substituted with target language words with the help of the bilingual dictionary, actually a carefully managed set of lexicons for each parts-of-speech category. The English dictionary does not have the word 'gave'; storing all the inflected forms of the word 'give' in the lexicon is not feasible, and doing so for all verbs and nouns would be laborious and painstakingly tedious. Even if such a lexicon list were available for the source language, what about the target language? Tamil words are highly inflectional, and the word 'gave' has no direct equivalent in Tamil; instead, the word form has to be generated from the root form of the equivalent of 'give'. The lexicon lookup table therefore only needs to store the roots of the source language and their equivalents in the target language, not all the inflectional forms.


The word to be searched in the lexicon list to get the target equivalent in the example sentence is 'gave', but the root word lexicon list does not have it. This is where the English stemmer / morphological analyzer comes into the picture: the morphological analyzer outputs the root word along with the morphological feature information, as shown in figure 3.6.

FIG. 3.6. FLATTENING OF TREE

Now the Tamil words are substituted and the word order is correct, but this is still not a complete translation: the unsynthesized root words do not make sense yet. The input sentence "Ram gave him a book" should be translated to "பாம் அயனுக்கு எரு புத்த஑த்ததக் க஑ாடுத்தான்" (rAm avannukku oru puTTakaTTaik kotuTTAn). The words are synthesized to attain this, as shown in figure 3.7.

FIG. 3.7. SYNTHESIZING OPERATION

avan + DAT → avannukku
puTTakam + ACC → puTTakaTTai
kotu + past + 3SM → kotuTTAn

DAT, ACC, past, and 3SM are morphological feature labels representing the dative morpheme, the accusative morpheme, the past tense marker, and the 3rd person singular masculine marker, respectively. See the morphological analyzer and synthesizer chapter (chapter 7) for more details.
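The synthesis step can be caricatured with a lookup table. The suffixes and sandhi adjustments below are toy approximations written only to reproduce the noun examples above; the real synthesizer (chapter 7) is a finite state transducer with proper morphophonemic rules.

```python
# Toy noun synthesis: root + case feature -> surface form.
# The doubling/sandhi rules here are invented approximations, not Tamil grammar.

SUFFIX = {"DAT": "ukku", "ACC": "ai"}

def synthesize_noun(root, case):
    if root.endswith("am") and case == "ACC":
        # e.g. puTTakam -> puTTak + aTT + ai  (stem alternation before ACC)
        return root[:-2] + "aTT" + SUFFIX[case]
    if case == "DAT":
        # e.g. avan -> avan + n + ukku  (consonant doubling before DAT)
        return root + "n" + SUFFIX[case]
    return root + SUFFIX[case]

print(synthesize_noun("avan", "DAT"))      # -> avannukku
print(synthesize_noun("puTTakam", "ACC"))  # -> puTTakaTTai
```

Even this tiny sketch shows why synthesis cannot be plain concatenation: the stem itself changes shape depending on the attached case morpheme.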


FIG. 3.8. TRANSLATED OUTPUT

Suppose the reader is satisfied with forms such as "avan + DAT". Nevertheless, how do we manage to get morphological feature information like DAT, ACC, PAST, etc.? The answer is dependency relations. Those features are extracted from the typed dependency relations provided by the Stanford parser, shown in figure 3.9. The relations are transformed to the target language; the dependency relations of the target language are not necessarily the same as those of the source language. The morphological features are then taken from the target language dependency relations: the dependency relation information of the target language is mapped to the morphological features of the target language. For example, the indirect object (iobj) relation is mapped to the dative case marker (DAT) and the direct object (dobj) relation is mapped to the accusative case marker (ACC) in Tamil. The translated output of the sentence, with the intermediate stages, is shown in figure 3.8.
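This mapping step can be sketched as a simple table lookup over dependency triples, assuming tuples in the (relation, governor, dependent) shape that typed dependency output takes; the table below holds just the two relations discussed for the running example.

```python
# Dependency-relation -> morphological-feature mapping sketch for
# "Ram gave him a book" (iobj -> DAT, dobj -> ACC, per the running example).

DEP_TO_MORPH = {"iobj": "DAT", "dobj": "ACC"}

def morph_features(dependencies):
    """dependencies: list of (relation, governor, dependent) triples.
    Returns {dependent word: morphological feature} for mapped relations."""
    return {dep: DEP_TO_MORPH[rel]
            for rel, gov, dep in dependencies if rel in DEP_TO_MORPH}

deps = [("nsubj", "gave", "Ram"), ("iobj", "gave", "him"),
        ("dobj", "gave", "book"), ("det", "book", "a")]
print(morph_features(deps))  # -> {'him': 'DAT', 'book': 'ACC'}
```

The full system's table (Table 8.4) covers many more relations; the point here is only that the case morpheme to synthesize comes from the dependency label, not from word position.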


FIG. 3.9. DEPENDENCY RELATION TO MORPH FEATURES

Though the current system has more than 60k lexicon entries and is still growing, this is not enough for a practical MT system. What about words that are not in the bilingual dictionary or lexicon? How do we translate words that are not present in the lexicon list, or words not eligible to be in it, such as person names and place names? Out-of-vocabulary (OOV) words cannot be translated, but they can be transliterated. Transliteration is an automatic method that converts words/characters in one alphabetical system to the corresponding phonetically equivalent words/characters in another alphabetical system [23]. A third-party transliteration module10, developed based on the Sequence Labeling Approach [23], is integrated into the current system. Named entities are transliterated from English to Tamil using this tool; 30k person names and 30k place names (English-Tamil pairs) were used for training with an SVM. Words not present in the root lexicon (OOV words) are also transliterated in the same manner using the same tool. Though the transliteration tool works well for names and places, it does not produce good results for other OOV words, since it was trained specifically on names and places. The current system does not have a separate

10 Transliteration Module developed by Ajith, CEN, Amrita Vishwa Vidyapeetham.


transliteration module for the OOV words. As a future enhancement, a named entity identifier can be used to identify names and places, which would be transliterated using the existing tool, while other OOV words could be transliterated using a new tool built from mapping rules. For example:

He gave Sitha a pen
அயன் சீதாயிற்கு எரு ப஧஦ா க஑ாடுத்தான்
Avan cITAviRku oru pEnA kotuTTAn

In this example, the target equivalent of the word 'Sitha' (a proper name) is not present in the lexicon. The word gets transliterated as 'cITA', the phonetically equivalent characters in the Tamil alphabetical system for the English word 'Sitha'.

Apart from the main challenges, other issues such as multi-word expressions and compound nouns are handled in the current system. A multi-word expression (MWE) is a lexeme made up of a sequence of two or more lexemes that together constitute a meaning; the meaning is not predictable from the individual entities. An MWE can be a compound or a fragment of a sentence. Example: "He kicked the bucket" means he died, not the literal meaning of kicking a bucket with the foot. Translating such expressions into the target language literally gives a nonsensical meaning, even though the translated sentence may be syntactically correct.

Compound words are lexemes of two or more lexemes that constitute a meaning, similar to MWEs, but compound words are not idiosyncratic: they are clusters of words that are supposed to be interpreted as a single entity. The individual entities have their own meanings but do not convey any meaning in the context of the sentence. Example:

I/PRP bought/VBD a/DT table/NN top/NN wet/NN grinder/NN
஥ான் எரு பநதை ஈபநா஦ ஑ிாிந்தர் மூடிக஧ாருத்து யாங்஑ிப஦ன்

In the above example, the cluster of words is translated individually into the target language, which produces a nonsensical sentence; to get the correct translation, the cluster (table, top, wet, grinder) should be treated as a single entity. Handling multiple outputs, subject-verb feature agreement, resolving word sense using word sense disambiguation techniques, etc., along with detailed explanations of the components of the MT system block diagram, are discussed in later chapters.
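One simple way to handle MWEs and compound nouns before word-level translation is greedy longest-match chunking against an MWE table, sketched below; the table entries and target glosses are invented for illustration, and this is only one possible design, not necessarily the one used in the system.

```python
# Greedy longest-match MWE chunking: spans found in the MWE table are
# collapsed into a single unit before word-by-word translation.

MWES = {("kicked", "the", "bucket"): "die",
        ("table", "top", "wet", "grinder"): "tabletop-wet-grinder"}
MAX_LEN = max(len(k) for k in MWES)

def chunk_mwes(words):
    out, i = [], 0
    while i < len(words):
        # Try the longest possible span first, shrinking down to 2 words.
        for n in range(min(MAX_LEN, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in MWES:
                out.append(MWES[tuple(words[i:i + n])])
                i += n
                break
        else:
            out.append(words[i])  # no MWE starts here; keep the single word
            i += 1
    return out

print(chunk_mwes("he kicked the bucket".split()))  # -> ['he', 'die']
```

Longest-match-first ordering matters: it lets a four-word compound like "table top wet grinder" win over any shorter overlapping entry.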



CHAPTER 4
4 RULE BASED WORD REORDERING

4.1 TAMIL SENTENCE STRUCTURE

Tamil is a head-final language. The verb usually comes at the end of the clause in a standard sentence, with the typical Subject-Object-Verb (SOV) word order, though Object-Subject-Verb (OSV) is not uncommon. Although in standard literary text Tamil is head-final, Tamil is a free word order language; for machine translation, we are mostly concerned with the standard text. The frequency of the SOV and OSV sentence structures in standard monolingual corpora gathered from various sources depends on factors such as literary style, sub-language, and author. As the counterpart of English prepositions, Tamil has postpositions, most of which are agglutinated to the preceding noun. Tamil is a null-subject language: it is not necessary that all Tamil sentences have subjects, verbs, and objects11. Tamil also lacks the linking verb and relative pronouns. These are the few important issues to be considered in word reordering from English to Tamil.

4.2 ENGLISH STRUCTURE TO TAMIL STRUCTURE

Even though translating an English sentence with the word order intact can convey the correct meaning, for the sake of fluency of the Tamil translated sentence, and to get it as close as possible to standard text, the English structure has to be transformed into the Tamil word order structure by some mechanism. The mechanism involves POS tagging, parsing, and reordering rules. The English syntax cannot be transformed into Tamil syntax without knowing the structure of the English sentence. The

11

http://en.wikipedia.org/wiki/Tamil_grammar#Sentence_structure

27

mapping of the sentence to the sentence structure for all grammatically possible sentences in the language is tedious and almost impossible if that‟s manually done. The process of this automatic mapping is parsing. Parsing is the process of analyzing the text that made of sequence of words to determine the grammatical structure with respect to given formal grammar.

4.3 PATTERN BASED CFG FOR REORDERING

A pattern is a pair of Context Free Grammar (CFG) rules [24]. These rules give the equivalent structures in the source and target languages. The reordering rules are formulated on the basis of the patterns to facilitate the machine translation; they reflect the translation patterns of the source and target languages. For example, the following reordering rule is based on an English-Tamil pattern:

VP (VBD NP NP) → VP (NP NP VBD) || 1:2 2:3 3:1

The rule has three parts: the source pattern, the target pattern and the transfer link information. These patterns are represented either in tree form or in Penn bracket notation. Even though the internal reordering process makes use of the bracketed notation, the tree representation is more illustrative and easier to follow. The labels used follow the Penn notation. Table 4.1 shows the meaning of the labels used in the example.

TABLE 4.1. POS LABELS

Label   Meaning
VP      Verb Phrase
VBD     Verb (Past tense form)
NP      Noun Phrase

In contrast to the pair of CFG rules of synchronous CFG [15], where the source CFG rule is used to get the parse derivation of the input sentence, this method uses the pattern only for reordering, not for parsing the source syntactic structure. The derivation of the target structure is not done concurrently, as opposed to synchronous CFG. Only the patterns that need to be replaced with target patterns are required, unlike in synchronous CFG and synchronous TAG [16], where every source rule has its equivalent target counterpart.

4.3.1 DISSECTION OF THE REORDERING RULE

4.3.1.1 SOURCE PATTERN

The bracketed representation of the source pattern CFG rule is VP (VBD NP NP) and its equivalent tree representation is shown in the figure 4.1.

FIG. 4.1. SOURCE RULE

VP is the root node; VBD, NP, NP are the child nodes. The numbering below the child nodes indicates the position of each node in the source pattern.

4.3.1.2 TARGET PATTERN

The bracketed representation of the target pattern CFG rule is VP (NP NP VBD) and its equivalent tree representation is shown in the figure 4.2.


FIG. 4.2. TARGET RULE

VP is the root node; VBD, NP, NP are the child nodes. The numbering below the child nodes indicates the original position of each node in the source pattern.

4.3.1.3 TRANSFER LINK

The final part of the reordering rule is the transfer link. The transfer link for the above example is "1:2 2:3 3:1". Just replacing the source pattern with the target pattern may not solve the reordering completely in all cases. In the above example, there are two NP nodes; the position of these NP nodes in the target pattern has to be defined. This mapping is done using the transfer link, which says exactly which NP node of the source pattern is linked to which NP node of the target pattern. The main function of the transfer link is to disambiguate nodes that have the same name and to map the nodes between the source and target patterns without confusion. The transfer link "1:2 2:3 3:1" maps the three nodes of the source pattern to the target pattern: 1:2 means the second child node of the source pattern is realigned to the first node of the target pattern; 2:3 maps the third child node to the second node of the target pattern, and so on.
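The transfer-link interpretation described above can be sketched as a small function (a sketch only; the list-based representation of pattern children and the function name are illustrative):

```python
def apply_transfer_link(source_children, link):
    """Reorder the children of a matched source pattern using a transfer link.

    A link such as "1:2 2:3 3:1" means: the 1st child of the target pattern
    comes from the 2nd child of the source, the 2nd from the 3rd, and the
    3rd from the 1st. Positions are 1-based, as in the thesis examples.
    """
    pairs = (item.split(":") for item in link.split())
    mapping = {int(tgt): int(src) for tgt, src in pairs}
    return [source_children[mapping[pos] - 1]
            for pos in range(1, len(source_children) + 1)]

# VP (VBD NP NP) -> VP (NP NP VBD) || 1:2 2:3 3:1
print(apply_transfer_link(["VBD", "NP(him)", "NP(a book)"], "1:2 2:3 3:1"))
# -> ['NP(him)', 'NP(a book)', 'VBD']
```

The same function covers two-child links such as "1:2 2:1" used later for preposition reordering.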


4.3.1.4 REORDERING RULE

'VP (VBD NP NP)' is the English CFG rule ('VP' is the root node and 'VBD', 'NP', 'NP' are the child nodes in the tree representation), called the source rule, and 'VP (NP NP VBD)' is the Tamil CFG rule, called the target rule. "1:2 2:3 3:1" is the transfer link. The tree representation of the above rule is shown in figure 4.3.

FIG. 4.3. REORDERING RULE

The transfer link gives the order of the child nodes of the target rule. "1:2 2:3 3:1" says the first child of the target rule comes from the second child of the source rule, the second child of the target rule comes from the third child of the source rule, and so forth. The parse tree of the source language is checked against the source rules. If any match is found in the parse tree, the source pattern is replaced with the corresponding target pattern.

4.3.2 REORDERING ALGORITHM – SIMPLIFIED VERSION

Let T be the parse tree with N total nodes that we process for reordering; let n index the nodes and c index the children of node n; let R be the number of reordering rules; St the t'th source pattern; Gt the t'th target pattern; Stc the array of child nodes of the t'th source pattern; and Gtc the array of child nodes of the t'th target pattern.

Re-order (T):
    Visit node n
    For each source pattern St of the reordering rules R
        If (root node (St) equals label (n))
            If (children (n) equal Stc)
                Replace St with the target tree Gt at node n
            End If
        End If
    End For
    For each child c of n
        Re-order (sub tree (c))
    End For
End

The Re-order algorithm, with N nodes in the parse tree T and R reordering rules, has complexity O(N*R). In the following parse tree, the node labels are numbered in the order in which the tree traversal happens.
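The simplified algorithm above can be sketched in code roughly as follows (a sketch, assuming a minimal Node class and rules stored as (root label, source children labels, transfer link) triples; these names are illustrative, not the system's actual implementation):

```python
class Node:
    """A parse tree node: a label plus an ordered list of children."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def reorder(node, rules):
    """Recursively apply pattern-based reordering rules to a parse tree.

    Each rule is a (root label, source children labels, transfer link)
    triple; the link maps 1-based target positions to source positions,
    e.g. {1: 2, 2: 3, 3: 1} for the link "1:2 2:3 3:1".
    """
    for root, src, link in rules:
        if node.label == root and [c.label for c in node.children] == src:
            node.children = [node.children[link[i] - 1]
                             for i in range(1, len(src) + 1)]
            break  # at most one rule fires per node in this sketch
    for child in node.children:
        reorder(child, rules)

# Rule: VP (VBD NP NP) -> VP (NP NP VBD) || 1:2 2:3 3:1
rules = [("VP", ["VBD", "NP", "NP"], {1: 2, 2: 3, 3: 1})]
tree = Node("VP", [Node("VBD"), Node("NP"), Node("NP")])
reorder(tree, rules)
print([c.label for c in tree.children])  # -> ['NP', 'NP', 'VBD']
```

Since each of the N nodes scans all R rules, the O(N*R) complexity noted above follows directly.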


FIG. 4.4. PARSE TREE

The pattern based reordering rules for reordering the above parse tree (refer figure 4.4) are as follows (figure 4.5):

Rule R1        Rule R2

FIG. 4.5. REORDERING RULES

Re-order (T):
    Visit node 7
    For each source pattern St of the reordering rules R
        If (root node (S1) equals label (7))          -- satisfied at node 7
            If (children (7) equal S1c)
                Replace S1 with the target tree G1 at node 7   -- rule applied
    For each child c of node 7
        Re-order (sub tree (c))

On traversing the parse tree, at node 7 the label of the node equals the root node label of the source pattern S1 of reordering rule R1. The children of node 7 are (8) and (9), which match the children of the source pattern S1, so the source pattern is replaced with the target pattern, with the specific node replacement done via the transfer link (omitted from the algorithm for simplicity). The traversal then continues with the child nodes. By the completion of the traversal, all matching reordering rules have been applied. The tree after the application of the given set of rules is shown in figure 4.6.

FIG. 4.6. APPLICATION OF RULES R1 AND R2

4.3.3 ARCHITECTURE

The output of the parser (the parse tree) is the input to the reordering module, along with the reordering rules from the database. After applying the necessary rules, the reordering module outputs a reordered tree that is closer to the syntactic structure of Tamil, but with English words on the leaf nodes rather than Tamil words. The reordering architecture is shown in figure 4.7.

FIG. 4.7. REORDERING ARCHITECTURE

4.3.3.1 REORDERING EXAMPLE

For example, the parse tree of the English sentence "Ram gave him a book" is (S (NP (NNP Ram)) (VP (VBD gave) (NP (PRP him)) (NP (DT a) (NN book)))). This English phrasal structure is checked against the available reordering rules for a match. The source pattern 'VP (VBD NP NP)' is present in the English phrasal structure, so the reordering rule applies: the pattern "VP (VBD NP NP)" is replaced with its target-language counterpart "VP (NP NP VBD)". The replacement of the source pattern by the target pattern is highlighted in figure 4.8.


FIG. 4.8. TRANSFORMATION OF PARSE TREE TO REORDERED TREE

4.3.3.2 LIMITATIONS OF THE REORDERING RULES

Tamil shows a very high degree of flexibility in ordering the words within a sentence; the positions of words can easily be transposed without much change in meaning. For example, "Ram gave him a book" can be reordered in multiple ways in Tamil; the most common are "Ram him a book gave", "Ram a book him gave", "Him a book Ram gave", etc. The predicate verb mostly takes the last position. In our system, the reordering rules are strictly a one-to-one map: every source rule is mapped to one target rule, formulated on the basis of the most common usage. The Tamil clausal structure is more rigid and shows little flexibility. For example, "Ram, who is smart, gave him a book." is reordered as "(smart Ram) (him) (a book) (gave)"; here the adjectival clause "who is smart" has to be positioned before the noun "Ram" in Tamil. The current system can handle only generic reordering rules. For example, consider the constructs "The capital of India" and "The thousands of devotees". Their parse structures are (NP (NP (DT The) (NN capital)) (PP (IN of) (NP (NNP India)))) and (NP (NP (DT The) (NNS thousands)) (PP (IN of) (NP (NNS devotees)))) respectively. The reordering rule transforms these phrases to "India of – capital" and "devotees of – thousands" respectively. The latter target phrase, "devotees of – thousands", is not correct, and this happens because of the one-to-one reordering rule map. This is the limitation of the current reordering rule mechanism. It can be overcome by letting the system generate multiple outputs via one-to-many reordering rule maps; in post-processing, the best output can then be chosen based on the fluency of the sentence. Currently, this is not incorporated in the system.

4.3.3.3 HANDLING PREPOSITIONS

Tamil does not have prepositions; instead it has postpositions, and a further peculiarity is that the postpositions are not isolated words. Tamil does not have equivalent words for prepositions; instead it has case endings. The prepositions in English and the case endings in Tamil do not have a one-to-one mapping. The case endings depend on the syntax of the sentence, which is hard to handle at the reordering stage of the translation system, and they are highly ambiguous: for the same preposition in English there are many possible case endings in Tamil, depending on the syntactic features of the sentence construction. To handle the preposition-to-case-ending transformation, specific reordering rules (see figure 4.9) are necessary to make sure the words are in the proper positions, even though the preposition has no equivalent word in Tamil. The case ending for a word is determined from the dependency information of the source sentence in a separate module; in the reordering module, only the positions of the words are corrected according to the target language. The reordering example for the preposition case follows, and the parse tree and the reordered tree for the example are shown in figure 4.10. Example sentence:

Delhi is the capital of India

The reordering rules for the preposition-to-postposition transformation are shown in figure 4.9.


FIG. 4.9. REORDERING RULES

FIG. 4.10. PARSE AND REORDER STRUCTURE


4.3.4 HANDLING COPULA

Tamil does not have a linking verb (copula). The English copula is replaced with a Tamil equivalent for the ease of translation, but a sentence construction with such a copula-like word is not fluent in the language.

4.3.5 HANDLING AUXILIARY

Isolated auxiliaries and the finite verb in English have to be translated together as a composite verb, a single entity. The auxiliary slots remain empty and are not replaced with counterparts in Tamil; rather, the auxiliary and the verb are synthesized based on feature information extracted from the typed dependency relations in the synthesizer module.

4.3.6 HANDLING RELATIVE PRONOUNS

Tamil does not have relative pronouns, but the meaning is conveyed by relative participle constructions, which are synthesized based on the dependency relations. For reordering, a rule specific to the relative pronoun construction transforms the English construction into the Tamil syntactic construction.


CHAPTER 5

PHRASAL STRUCTURE VIEWER

5.1 NEED

Creation of reordering rules demands an exhaustive analysis of the source and target language syntactic structures. Comparative analysis of the different syntactic structures in both the source and target languages requires a visual interface rather than the mere bracketed representation. Even though the bracketed notation serves better for text processing, for comparative analysis it lacks the visual aid that helps the linguists. The painstaking, laborious analysis of syntactic structure can be eased with the phrasal structure viewer, a visual interface for the creation of reordering rules, for lexicon development and for the analysis of morphological information.

5.2 INTRODUCTION

The phrasal structures of the input and output sentences are presented as tree structures in the parse tree viewer tool. The phrasal structure has a crucial role in the pattern-based approach: a handful of examples have to be analyzed before creating a reordering rule, and analyzing quite a number of parse structures helps to create a general rule that applies to many types of sentence constructions. The parsed structure of the source language sentence is checked against the existing reordering rules in the rule database and, using the pattern-based approach, the source phrasal structure is transformed into the target structure. The source language phrasal structure is reordered based on the linguistic rules to form a target language phrasal structure. The correctness of the target phrasal structure is assured during testing by viewing both the source and target structures side by side. New reordering rules can be created, or existing rules modified, using the phrasal structure viewer (P-Viewer).

The P-Viewer not only assists the linguist in creating pattern-based reordering rules based on the analysis of both the source and target structures; the tool also has additional features to add, delete and modify the lexicon. On a larger scale, linguists who work independently do not need to bother learning the framework to create reordering rules and lexicon entries. Since the database is centralized, duplication of data creation is avoided. Duplicating existing rules or lexicon entries is one of the major impediments in the development of resources. In any rule-based system, the coverage of the lexicon of the language or sub-language determines the performance of the translation system; that said, the development of a translation system cannot afford the duplication of rules, lexicon, or for that matter any resource development work.

5.3 ARCHITECTURE OF P-VIEWER

The input sentence of the source language is parsed. Any parser can be plugged into the P-Viewer; currently it features the Stanford parser and a Lexicalized Tree Adjoining Grammar Earley parser. The parsed tree is reordered to make it closer to the target language phrasal structure; the tree reordering mechanism is explained in greater detail in the Tree Reordering chapter. In tree rendering, the bracketed representation of the syntactic structure of the source and target languages is converted to an acyclic tree diagram. The reordering rules are stored in the database in a specific format so as to reduce the data latency; the reordering rule database format is discussed later in this chapter. The architecture of the P-Viewer is shown in figure 5.1.


FIG. 5.1. ARCHITECTURE OF P-VIEWER

5.4 TREE RENDERING

The equivalence of the bracketed notation of syntactic structure and the tree diagram representation can be established by devising a recursive function, a well-defined, step-by-step process that converts one representation into the other. The bracketed notation can be converted into a tree by starting with the bracketed words: for each word, the brackets are converted into a tree branch that runs from the word to a node labelled with the label of the left bracket; each word becomes a leaf of the tree. The brackets that enclose sequences of words, which correspond to syntactic categories, are then converted into branches that connect the lexical category nodes to nodes identified by the syntactic category labels on the brackets. The remaining syntactic category brackets are transformed into branches that connect nodes labelled with the category labels on the brackets. This process continues until the outermost brackets are encountered and the root node is attached to the tree by branches connecting it to the nodes at the next level down. Conversely, a pre-order traversal function can be devised that converts a tree into a bracketed string. Hence, the two representations are equivalent.
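The two conversions described above can be sketched as a pair of recursive functions (a sketch only; the nested-list tree representation and the function names are illustrative):

```python
import re

def parse_brackets(s):
    """Convert Penn bracketed notation into a nested-list tree."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def build(pos):
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        children, pos = [], pos + 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = build(pos)      # nested category bracket
            else:
                child, pos = tokens[pos], pos + 1  # a word: a leaf
            children.append(child)
        return [label] + children, pos + 1
    tree, _ = build(0)
    return tree

def to_brackets(tree):
    """Pre-order traversal converting the tree back to a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + tree[0] + " " + " ".join(to_brackets(c) for c in tree[1:]) + ")"

s = "(NP (DT a) (NN book))"
tree = parse_brackets(s)
print(tree)                     # -> ['NP', ['DT', 'a'], ['NN', 'book']]
print(to_brackets(tree) == s)   # -> True
```

The round trip demonstrates the equivalence of the two representations argued above.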

5.5 REORDERING RULE FORMAT

The reordering rule patterns are represented either in tree form or in Penn bracket notation. Even though the internal reordering process makes use of the bracketed notation, the tree representation is more illustrative and easier to follow. Consider the following reordering rule:

VP (VBD NP NP) → VP (NP NP VBD) || 1:2 2:3 3:1

'VP (VBD NP NP)' is the English CFG rule, called the source rule; 'VP (NP NP VBD)' is the Tamil CFG rule, called the target rule; and "1:2 2:3 3:1" is the transfer link. The transfer link disambiguates nodes with the same name and maps each child of the source pattern to its position in the target pattern: 1:2 means the second child of the source pattern becomes the first child of the target pattern, 2:3 maps the third child to the second position, and so on. The three parts of the rule are dissected in detail in section 4.3.1.

5.6 FILTERED SEARCH

Without filtering, no node is excluded from the database search: for each node, all the rules in the database have to be checked for a match. This slows down the tree reordering algorithm to a certain extent in a scaled-up version of the translation system. This time cost can be avoided by doing a filtered search rather than a complete search. For example, consider the following parse tree and reordering rules. At node 4, the label of the node has to be checked against the root node of every reordering rule in the database, and once a match is found, the children of node 4 have to be checked against the children of the source pattern wherever the root node label equals the label of node 4. For ease of processing, the bracketed strings of the source and target patterns of a reordering rule are decomposed into four parts: the root node of the source/target pattern, the children of the source pattern as one entity, the children of the target pattern as one entity, and the transfer link.

TABLE 5.1. FORMAT OF REORDERING RULE IN DB

ROOT NODE   SOURCE PATTERN CHILDREN   TARGET PATTERN CHILDREN   TRANSFER LINK
7           8 9                       9 8                       1:2 2:1
4           5 10                      10 5                      1:2 2:1
4           5 6                       6 5                       1:2 2:1


FIG. 5.2. FILTERED SEARCH

For example, the reordering rules in figure 5.2 are stored in the database in the format shown in table 5.1.
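The filtered search amounts to indexing the decomposed rules by their root label, so that each tree node consults only the rules whose root matches its own label instead of scanning the whole rule set. A minimal sketch (rule tuples follow the decomposed format of table 5.1; the names are illustrative):

```python
from collections import defaultdict

def index_rules(rules):
    """Index reordering rules by root label for filtered search.

    Each rule is a (root, source children, target children, transfer link)
    tuple, mirroring the four-part decomposition described above.
    """
    index = defaultdict(list)
    for root, src, tgt, link in rules:
        index[root].append((src, tgt, link))
    return index

rules = [
    ("7", ["8", "9"], ["9", "8"], "1:2 2:1"),
    ("4", ["5", "10"], ["10", "5"], "1:2 2:1"),
    ("4", ["5", "6"], ["6", "5"], "1:2 2:1"),
]
index = index_rules(rules)
# At a node labelled "4", only two of the three rules need checking:
print(len(index["4"]))  # -> 2
```

With the index, per-node lookup cost depends on the number of rules sharing that root label rather than on the total rule count.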


5.7 SCREEN SHOT OF THE P-VIEWER TOOL

In the screen shot of the P-Viewer tool (see figure 5.3), the input sentence "He gave a book to him" is tested. The top panel of the interface is the input-sentence text area, where input in any language can be tested provided the parser plugged into the tool supports that language. The left panel displays the tree diagram equivalent of the bracketed string of the source parse structure, whereas the right panel displays the tree diagram equivalent of the reordered tree structure. The bottom panel of the interface holds the rule editor and viewer.

FIG. 5.3. SCREEN SHOT OF P-VIEWER


5.8 REORDERING RULE EDITOR

Based on the source and target phrasal structures, it is easy to determine the correctness of the target structure that we are concerned about; the accuracy of the translation output primarily depends on this reordered structure. New rules can be created easily using the rule editor, along with the phrase structure information, and saved into the rule database, which is used by the reordering algorithm for transforming the source structure into the target structure.

FIG. 5.4. SCREEN SHOT OF REORDERING RULE EDITOR

The rule editor has options to save a new rule and to delete an existing rule (see figure 5.4). The rule viewer helps to determine the correctness of the existing rules by looking at the source and target phrasal structures.

5.9 DICTIONARY EDITOR

The terminal nodes in the pseudo phrasal structure are attached to lexicon entries. During the transformation from the source phrasal structure to the target, based on the reordering rules, the source lexical items are translated using the lexical database or transliterated into the target language. The dictionary editor (refer to figure 5.5) helps to add a lexicon entry missing from the current sentence, to add a new entry to the database and to modify an existing one. The dictionary editor has a vital role in the P-Viewer tool: while doing the exhaustive analysis of the word-order transformation from source to target language, having the lexicon glued to the leaf nodes strengthens the visual aid. Both the rule editor and the dictionary editor assist linguists to a great extent in the development of reordering rules and lexicon.

FIG. 5.5. SCREEN SHOT OF DICTIONARY EDITOR

5.10 MULTIPLE SENTENCE AND WORD OPTIONS

Both the pattern-based and LTAG approaches may give multiple outputs for an input sentence. The user is provided with the option to choose 'n' outputs for the given input sentence, as well as multiple word options: right-clicking a target word that has multiple translations displays all the senses of that word. The input and the (corrected, if required) output sentence pair can be saved in the database, which builds up the parallel corpora.

5.11 MORPH SYNTHESIZER FEATURE INFORMATION

The synthesizer module requires the feature information extracted from the phrasal structure and the dependency structure of the source tree. The correctness of the synthesized word can be verified using this information, which is provided in the P-Viewer. By inspecting the intermediate feature extraction and synthesizer information, the morph synthesizer rules can be updated using the simple morph synthesizer rule editor.


5.12 STEP BY STEP PROCEDURE TO CREATE A NEW RULE IN P-VIEWER

The input sentence is "Delhi is the capital of India". In the rule viewer, the existing rules that qualify to transform the source sentence structure into the target sentence structure are shown in the following table:

ROOT NODE   SOURCE PATTERN CHILDREN   TARGET PATTERN CHILDREN   TRANSFER LINK
VP          VB* NP                    NP VB*                    1:2 2:1
PP          IN NP                     NP IN                     1:2 2:1

'*' in the reordering rule is a wildcard character: VB* matches any tag that begins with VB.

FIG. 5.6. P-VIEWER CREATION OF NEW RULE

The source sentence is reordered and the output is "Delhi the capital India of is": கெல்லி – தத஬஥஑பம் இந்தினா – இரு. By looking at the phrasal structures of the source and target languages, it is easy to see that the reordered structure (see figure 5.6) is not correct and lacks one rule. The addition of one rule perfects the target phrasal structure. The rule to be added to the database is:

NP (NP PP) → NP (PP NP) || 1:2 2:1

Click the save button to add the rule to the database; a dialog pops up saying "Rule is inserted successfully" if there is no problem. The addition fails if the rule already exists in the database or if the user does not have permission to add new rules or modify existing ones. After the addition of the new rule, running the reordering gives the following output (refer to figure 5.7). The source sentence is reordered and the output is "Delhi India the capital of is": கெல்லி இந்தினா – தத஬஥஑பம் – இரு


FIG. 5.7. P-VIEWER: OUTPUT WITH THE NEW RULE


CHAPTER 6

MAPPING OF DEPENDENCY TO MORPHOLOGICAL FEATURES

6.1 INTRODUCTION: WHY A FEATURE EXTRACTOR IS NEEDED

The translation of English into an isolating SOV language may require little more than word reordering with some post-manipulation: transforming the structure from the source to the target language and replacing the source words with target lexicon entries completes the translation in the case of isolating SOV languages. Tamil, however, is an agglutinative head-final language. Reordering and replacing source words with target lexicon entries leaves the target sentence with a proper word order but incomplete word generation. For the example sentence "Ram gave him a book", the reordered tree and the lexicalization process are shown in figure 6.1.

FIG. 6.1. FLATTENING OF TREE AND REPLACING WORDS WITH TARGET LEXICON

The output of the lexicalization is பாம் அயன் எரு புத்த஑ம் க஑ாடு (rAm avan oru puTTakam kotu). The word order of the target sentence is proper, but the words are not: they are incomplete and have to be synthesized to get the complete word forms and, in turn, the correct translation. அயன் should be அயனுக்கு; புத்த஑ம் should be புத்த஑த்தத; க஑ாடு should be க஑ாடுத்தான். The morphological synthesizer module synthesizes the root or stem word with the morphological features to form the complete word that conveys the correct meaning in context. The morphological synthesizer requires the input word and the morphological features to synthesize. For example:

அயன் + DATIVE → அயன் + கு → அயனுக்கு
புத்த஑ம் + ACCUSATIVE → புத்த஑ம் + ஍ → புத்த஑த்தத
க஑ாடு + PAST TENSE MORPHEME + PERSON NUMBER GENDER MARKER → க஑ாடு + த்த் + ஆன் → க஑ாடுத்தான்

The morphological synthesizer thus synthesizes the root word with the morphological feature information to get the complete word form. But how are these morphological features obtained? This is where the typed dependency information of the Stanford parser comes to the rescue; indeed, any dependency parser will do. The dependency relations between the words play a major role in extracting, from the source sentence, the morphological feature information needed to synthesize a complete word form. Besides the typed dependency information, the parts of speech tags, the parse tree and the target phrasal structure also help to extract feature information; see figure 6.2.


FIG. 6.2. MORPH FEATURE INFO TO SYNTHESIZER

6.2 STANFORD TYPED DEPENDENCIES

The Stanford typed dependencies [20][25] are all binary relations: a grammatical relation holds between a governor and a dependent. The grammatical relations for an example sentence are given below.

Example sentence: Ram gave him a book.

nsubj(gave, Ram)
iobj(gave, him)
det(book, a)
dobj(gave, book)

These dependency relations map straightforwardly onto a directed graph representation: the words in the sentence are nodes in the graph and the grammatical relations are edge labels. Figure 6.3 shows the graph representation for the example sentence.


FIG. 6.3. DEPENDENCY TREE: RAM GAVE HIM A BOOK

det: DETERMINER. A determiner is the relation between the head of a noun phrase (NP) and its determiner: det(book, a).

dobj: DIRECT OBJECT. The direct object of a verbal phrase (VP) is the noun phrase that is the (accusative) object of the verb; the direct object of a clause is the direct object of the VP that is the predicate of that clause: dobj(gave, book).

iobj: INDIRECT OBJECT. The indirect object of a VP is the NP that is the (dative) object of the verb; the indirect object of a clause is the indirect object of the VP that is the predicate of that clause: iobj(gave, him).
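The mapping of relations onto a directed graph can be illustrated with a minimal sketch (the triple format here is illustrative, not the parser's actual output format):

```python
# Words of the sentence become nodes; typed dependencies become labelled,
# directed edges from governor to dependent (a sketch of figure 6.3).
relations = [
    ("nsubj", "gave", "Ram"),
    ("iobj", "gave", "him"),
    ("det", "book", "a"),
    ("dobj", "gave", "book"),
]
# Edge labels keyed by (governor, dependent) pairs.
edges = {(gov, dep): rel for rel, gov, dep in relations}
print(edges[("gave", "book")])  # -> dobj
```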

6.3 MORPHOLOGICAL FEATURE INFORMATION FROM DEPENDENCY RELATIONS

The dependency relations have three parts: the relation, the governor and the dependent. In dobj(gave, book), 'dobj' is the relation between the words 'gave' and 'book'; 'gave' is the governor and 'book' is the dependent. This dependency relation can be read as 'book is the direct object of gave'.


The relation 'dobj' is the key to finding the morphological feature: the counterpart of 'dobj' in Tamil is the accusative case-ending morpheme. Figure 6.4 shows the transformation of dependency relations from the source language to the target language.

FIG. 6.4. SL TO TL DEPENDENCY RELATIONS TRANSFORMATION

dobj(gave, book) → dobj(க஑ாடு, புத்த஑ம்) → புத்த஑ம் + ACCUSATIVE → புத்த஑ம் + ஍ → புத்த஑த்தத

In Tamil, the accusative morpheme is '஍'. The orthographic rule applied on the root word and the morpheme forms the complete word. Table 6.1 shows some of the relations and the morphological information associated with them.

TABLE 6.1. DEPENDENCY RELATIONS

RELATION      MEANING                   MORPHOLOGICAL INFORMATION
dobj          Direct Object             ACCUSATIVE
iobj          Indirect Object           DATIVE
prep_in       Preposition (in)          LOCATIVE
prep_on       Preposition (on)          LOCATIVE
prep_during   Preposition (during)      TEMPORAL
prep_to       Preposition (to)          DATIVE
prep_with     Preposition (with)        SOCIATIVE
prep_since    Preposition (since)       ABLATIVE
rcmod         Relative Modifier         RELATIVE PARTICIPLE
agent         Agent (e.g. by police)    INSTRUMENTAL
amod          Adjectival modifier       NA
det           Determiner                NA
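The relation-to-case mapping of table 6.1 can be sketched as a simple lookup (a sketch only; the dictionary holds just the case-bearing relations from the table, and the function name is illustrative):

```python
# Relation names as produced by Stanford typed dependencies; case labels
# follow table 6.1. Relations with no case information (amod, det) are
# simply absent from the map.
CASE_FOR_RELATION = {
    "dobj": "ACCUSATIVE",
    "iobj": "DATIVE",
    "prep_in": "LOCATIVE",
    "prep_on": "LOCATIVE",
    "prep_during": "TEMPORAL",
    "prep_to": "DATIVE",
    "prep_with": "SOCIATIVE",
    "prep_since": "ABLATIVE",
}

def case_feature(relation, dependent):
    """Return (dependent, case) if the relation carries case information."""
    case = CASE_FOR_RELATION.get(relation)
    return (dependent, case) if case else None

# dobj(gave, book) -> "book" takes the accusative case ending in Tamil
print(case_feature("dobj", "book"))  # -> ('book', 'ACCUSATIVE')
```

The synthesizer would then attach the corresponding Tamil morpheme (e.g. '஍' for the accusative) to the translated root via the orthographic rules.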

6.4 INFORMATION FOR THE VERB

The verb 'க஑ாடு' gets its tense information from the parts-of-speech tagger. For the input sentence "Ram gave him a book", the tagger output is shown in table 6.2.

TABLE 6.2. LEXICON

SOURCE WORD   PARTS OF SPEECH    TARGET WORD
Ram           PRP                பாம்
gave          VBD (Verb Past)    க஑ாடு
him           PRP                அயன்
a             DT                 எரு
book          NN                 புத்த஑ம்

The source word 'gave' is tagged VBD; VBD in the Penn notation is the finite verb (past). So the Tamil verb 'க஑ாடு' has to be synthesized with the past tense marker; the past tense marker of the verb 'க஑ாடு' is 'த்த்'. A Person Number Gender (PNG) marker is also required to generate the finite verb.


The subject associated with the verb provides this information, obtained from the dependency relation: nsubj(gave, Ram) → nsubj(கொடு, ராம்). The PNG feature information is stored in the lexicon db. Some of the PNG entries are shown in table 6.3.

TABLE 6.3. PERSON NUMBER GENDER FEATURE

PNG   EXPLANATION                     EXAMPLE             TAMIL VERB (DO)
3SM   3rd Person Singular Masculine   He, John            செய்தான்
3SF   3rd Person Singular Feminine    She, Rita           செய்தாள்
3SN   3rd Person Singular Neuter      It, Dog             செய்தது
3SH   3rd Person Singular Honorific   He (with respect)   செய்தார்
3P    3rd Person Plural               They                செய்தார்கள்
1S    1st Person Singular             I                   செய்தேன்
1P    1st Person Plural               We (Inclusive)      நாம் செய்தோம்
1P    1st Person Plural               We (Exclusive)      நாங்கள் செய்தோம்
2S    2nd Person Singular             You                 செய்தாய்
2P    2nd Person Plural               You                 செய்தீர்கள்
2PH   2nd Person Plural Honorific     You                 செய்தீர்
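The root + tense marker + PNG composition described above can be sketched as follows. The suffix table romanizes a few rows of Table 6.3, and the toy subject lookup stands in for the real lexicon db.

```python
# Sketch: build the finite verb from root + tense marker + PNG suffix,
# as in cey + T + An -> ceyTAn (romanized). The suffix table mirrors a few
# rows of Table 6.3; the subject lookup stands in for the real lexicon db.
PNG_SUFFIX = {"3SM": "An", "3SF": "AL", "3SN": "atu", "3P": "ArkaL", "1S": "En"}
SUBJECT_PNG = {"Ram": "3SM", "Rita": "3SF", "they": "3P"}   # toy lexicon entries

def finite_verb(root, tense_marker, subject):
    png = SUBJECT_PNG[subject]           # subject comes from nsubj(gave, Ram)
    return root + tense_marker + PNG_SUFFIX[png]

print(finite_verb("cey", "T", "Ram"))    # ceyTAn
```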

To generate the Tamil verb, the morphological information is collected from the POS tagger, the typed dependencies and the lexicon, as shown in figure 6.5.

FIG. 6.5. FEATURE EXTRACTION

6.5 COMPUTING AUXILIARY INFORMATION

Tamil does not have an equivalent for the English auxiliary; instead, the auxiliary information is glued to the root word to form a single entity. An auxiliary plus a participle verb form in English, such as 'He is doing', is translated as 'செய்து கொண்டிருக்கிறான்'. The Tamil equivalent of the word 'do' is 'செய்'; 'is doing' is the present continuous form of the verb 'do'. To synthesize the Tamil word, the present continuous morph information is required along with the PNG information.


செய் + PRESENT CONTINUOUS + 3SM → செய் + VERBAL PARTICIPLE + PROGRESSIVE MARKER + PRESENT TENSE MARKER + 3SM → செய் + து + கொண்டிரு + க்கிற் + ஆன் → செய்து கொண்டிருக்கிறான் (DELETE K)

The typed dependency relations for the input sentence "He is doing" are [nsubj(doing-3, He-1), aux(doing-3, is-2)]. The 'present continuous' information is not provided by the dependency relations. Nevertheless, it is known from the aux relation that the word 'is' is the auxiliary of the present participle verb 'doing'. This information alone is not sufficient to synthesize the word; POS tagging helps determine that the phrase 'is doing' is the present continuous form of the verb 'do'.

INPUT SENTENCE: He is doing
POS TAGGED: He/PRP is/VBZ doing/VBG
DEPENDENCY INFO: [nsubj(doing-3, He-1), aux(doing-3, is-2)]

Whenever the relation is aux, the auxiliary information is computed using a recursive procedure that combines all the aux forms into one string. This aux string, together with the POS category of the governor (the participle verb), determines the auxiliary and tense information for synthesizing the word. Aux-tense information for some combinations of auxiliary and participle-verb POS is shown in table 6.4.

TABLE 6.4. AUXTENSE-MORPH FEATURE LOOKUP

POS CATEGORY OF   AUXILIARY   AUX-TENSE FORM   INPUT FORMAT FOR
PARTICIPLE VERB                                MORPH SYNTHESIZER
VBZ               -           PRES             PRES
VBD               -           PAST             PAST
VB                WILL        FUT              FUT
VBG               IS          PRES CONT        PAST~VP~PROG~PRES
VBG               WERE        PAST CONT        PAST~VP~PROG~PAST
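A minimal sketch of this lookup, with the aux string collected from the typed dependencies as described above (the keys and the relation-tuple format are illustrative):

```python
# Sketch of the Table 6.4 lookup: the POS tag of the participle verb plus
# the collected auxiliary string select the morph-feature string handed to
# the synthesizer. Keys and the tuple format are illustrative.
AUXTENSE = {
    ("VBZ", ""): "PRES",
    ("VBD", ""): "PAST",
    ("VB", "WILL"): "FUT",
    ("VBG", "IS"): "PAST~VP~PROG~PRES",    # present continuous
    ("VBG", "WERE"): "PAST~VP~PROG~PAST",  # past continuous
}

def collect_aux(relations):
    """Concatenate every aux dependent, e.g. aux(doing, is) -> 'IS'."""
    return " ".join(dep.upper() for rel, gov, dep in relations if rel == "aux")

deps = [("nsubj", "doing", "He"), ("aux", "doing", "is")]
print(AUXTENSE[("VBG", collect_aux(deps))])   # PAST~VP~PROG~PRES
```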


6.6 HANDLING COPULA SENTENCES

The dependency information for a copula sentence is different: it does not provide direct information such as a subject-verb pair. Tamil does not have a linking verb, so English copula sentences are translated differently based on the tense of the linking verb. The dependency relations for the input sentence "She is beautiful" are [nsubj(beautiful-3, She-1), cop(beautiful-3, is-2)], where the subject-verb pair is missing; instead, there is a relation between the complement 'beautiful' and the subject. In contrast, for a normal sentence like "She gave a book", the dependency relations are [nsubj(gave-2, She-1), det(book-4, a-3), dobj(gave-2, book-4)], where the subject-verb pair is clearly defined: 'She' is the subject of the verb 'gave', whereas in the copula case 'She' is the subject of the linking word 'beautiful'. With a simple procedure, the copula verb-subject pair is determined for linking-verb sentences. From the available dependency information [nsubj(beautiful-3, She-1), cop(beautiful-3, is-2)], the procedure determines that the subject of the linking verb 'is' is 'She'. Then the PNG marker for the word 'She' is glued along with the tense marker to synthesize the equivalent of the English copula verb. Although there is no equivalent for most linking-verb cases, for ease of translation the Tamil word 'இரு' is considered the equivalent, at some cost to the fluency of the target sentence. The more fluent translation of "She is beautiful" is "அழகானவள்". For the input sentence "She was beautiful", the translation is "அவள் அழகாக இருந்தாள்". In the former case, the target sentence has no equivalent of the English copula; instead, the equivalent of the adjective 'beautiful' (அழகான) and the equivalent of the subject 'she' (அவள்) are synthesized into a single noun, "அழகானவள்". In the latter case, the sentence construction is totally different: the sentence is translated as "அவள் (she) அழகாக (beautifully) இருந்தாள் (finite verb)". The adjective 'beautiful' becomes an adverb, and the word 'was' is replaced with the finite verb 'இரு', synthesized with the tense and PNG markers. Instead of handling the various linking verbs individually, a general method is employed for word reordering and morphological information extraction. Hence the sentence "She is beautiful" outputs "அவள் அழகாக இருக்கிறாள்" (not fluent, but not a bad translation either).
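The copula heuristic above can be sketched as follows. The dictionary output format, the romanized root 'iru' (இரு) and the ~ADVZ tag are illustrative; the tag anticipates the adverbialization feature discussed in the next section.

```python
# Sketch of the copula heuristic (illustrative names): from
# cop(beautiful, is) and nsubj(beautiful, She), pair the linking verb with
# the subject, replace it with the root 'iru' (romanized), and mark the
# complement for adverbialization (~ADVZ).
def handle_copula(deps):
    cop = next((d for d in deps if d[0] == "cop"), None)
    if cop is None:
        return None                       # not a copula sentence
    complement, link_verb = cop[1], cop[2]
    subject = next(d[2] for d in deps if d[0] == "nsubj" and d[1] == complement)
    tense = "PRES" if link_verb == "is" else "PAST"
    return {"subject": subject, "verb": ("iru", tense),
            "complement": complement + "~ADVZ"}

deps = [("nsubj", "beautiful", "She"), ("cop", "beautiful", "is")]
print(handle_copula(deps))
# {'subject': 'She', 'verb': ('iru', 'PRES'), 'complement': 'beautiful~ADVZ'}
```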

6.7 COPULA: MORE ISSUES

Compare the two sentences and their translations.

A) She/PRP is/VBZ beautiful/JJ → அவள் (She) அழகாக (beautiful) இருக்கிறாள் (is)

B) She/PRP is/VBZ a/DT beautiful/JJ girl/NN → அவள் (She) ஒரு (a) அழகான (beautiful) பெண் (girl)

In sentence A, the word 'beautiful' (adjective) changes its POS category to adverb, whereas in sentence B the POS category of 'beautiful' remains the same. In both cases, the equivalent of 'beautiful' is either synthesized from the root word 'அழகு' (beauty) or fetched directly from the lookup table. This category change occurs in copula sentences, so a specific heuristic is necessary to identify this kind of sentence and treat it differently. The synthesis varies depending on the sentence type: for the copula sentence, the morph information for word generation is ADVZ (adverbialization), and for the other case it is ADJZ (adjectivization).

A) அழகு + ADVZ → அழகு + ஆக → அழகாக

B) அழகு + ADJZ → அழகு + ஆன → அழகான

6.8 HANDLING DATIVE VERBS

Sentences with dative verbs like 'have' and 'has' need different treatment during morphology feature extraction from the dependencies, and also in reordering. The sentence "He has two books" is translated into Tamil as "இரண்டு புத்தகங்கள் அவனிடம் இருக்கின்றன".


With the general reordering rule, the above sentence gets reordered as shown in figure 6.6. In the reordered tree, 'books' is an object, whereas in the target language the object 'books' becomes the subject.

nsubj(has, He) → nsubj(has, books)
num(books, two) → num(books, two)
dobj(has, books) → possession(He)

The phrasal structure of the Tamil sentence "இரண்டு புத்தகங்கள் அவனிடம் இருக்கின்றன" is shown in figure 6.7. This phrasal structure is not the same as the reordered tree (figure 6.6), so a specific reordering rule is required for sentences that have dative verbs. The morphological information cannot be extracted directly from the dependency relations; the dative-verb-type morph extractor function does this job. It swaps the subject and object positions and marks the relation between the verb and the subject that has become the object. The general morphological-information extractor function can then extract the proper feature information from the typed dependencies.
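The dative-verb rewrite described above can be sketched as a single pass over the typed dependencies. The relation names and the 'possession' label are illustrative.

```python
# Sketch of the dative-verb-type rewrite: for possessive 'have/has/had',
# the object becomes the subject and the original subject is marked with a
# possession relation (Tamil '-idam'). Relation names are illustrative.
DATIVE_VERBS = {"has", "have", "had"}

def dative_rewrite(deps):
    out = []
    for rel, gov, dep in deps:
        if gov in DATIVE_VERBS and rel == "nsubj":
            out.append(("possession", gov, dep))   # He -> avan + idam
        elif gov in DATIVE_VERBS and rel == "dobj":
            out.append(("nsubj", gov, dep))        # books becomes the subject
        else:
            out.append((rel, gov, dep))
    return out

deps = [("nsubj", "has", "He"), ("num", "books", "two"), ("dobj", "has", "books")]
print(dative_rewrite(deps))
```

After this rewrite, the general morph extractor can run unchanged on the transformed relations.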


FIG. 6.6. REORDERING: SENTENCE WITH POSSESSIVE VERB

FIG. 6.7. PHRASAL STRUCTURE (HE HAS TWO BOOKS)


6.9 HANDLING MULTIPLE SUBJECTS

The typed dependencies for a sentence with multiple subjects are not sufficient to extract all the subject-verb pair information. The typed dependencies for the sentence "John and Mary presented a car" are [nsubj(presented-4, John-1), conj_and(John-1, Mary-3), det(car-6, a-5), dobj(presented-4, car-6)]. The typed dependency relations provide only one subject-verb pair. For synthesizing the Tamil verb 'வழங்கு' with the tense and PNG morphemes, all the subject-verb pairs in the sentence have to be determined, since the PNG marker varies with multiple subjects. nsubj(presented, John) helps find the PNG feature of 'John', which is '3SM', but for the multiple subjects 'John' and 'Mary' the PNG marker to be synthesized is 3rd person plural.

வழங்கு + PAST TENSE MARKER + 3RD PERSON PLURAL → வழங்கு + இன் + ஆர்கள் → வழங்கினார்கள்
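The multiple-subject handling can be sketched as follows: conjoined subjects are collected via conj_and, and the PNG feature is promoted to third person plural when more than one subject is found. The '3SM' fallback is a toy default; in the system the singular PNG comes from the lexicon.

```python
# Sketch: collect conjoined subjects via conj_and and promote the PNG
# feature to third person plural when more than one subject is found.
# The '3SM' fallback is a toy default; the real PNG comes from the lexicon.
def subject_png(deps, verb):
    subjects = [d for r, g, d in deps if r == "nsubj" and g == verb]
    subjects += [d for r, g, d in deps if r == "conj_and" and g in subjects]
    return "3P" if len(subjects) > 1 else "3SM"

deps = [("nsubj", "presented", "John"), ("conj_and", "John", "Mary"),
        ("dobj", "presented", "car")]
print(subject_png(deps, "presented"))   # 3P
```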


CHAPTER 7

MORPHOLOGICAL ANALYZER AND SYNTHESIZER

7.1 NEED OF A SYNTHESIZER FOR MACHINE TRANSLATION

A lemma of an agglutinative language can have thousands of inflections. To translate a source-language sentence into the target language, the source-language lemma has to be replaced with the target-language lemma. A lookup table mapping source lemmas to target lemmas is not feasible for agglutinative languages, where every lemma has numerous word forms. English prepositions do not have equivalent isolated lemmas in Tamil; instead, a preposition is transformed into a postposition or case ending. These case endings are not isolated words but are glued to the lemma preceding them. In Tamil, verb lemmas are synthesized with morphemes, derived from the morphological information extracted from the source sentence, to form the inflected form of the lemma. Synthesizing a Tamil verb requires the subject information of the verb, which has to be computed from the English sentence. It is therefore impossible to store all the inflected forms in advance, since the subject-verb pair features are unknown at lexicon-creation time; storing every inflected form in a lookup table is not a feasible solution. Here the morphological synthesizer comes to the rescue. A sample of the behaviour of the verb 'செய்' (do) and its inflections is shown in the list below, limited to present, past and future tense and two different subjects. It has been estimated that a Tamil verb has more than three thousand inflected forms.

He did        avan ceyTAn
He does       avan ceykiRAn
He will do    avan ceyvAn
She did       avaL ceyTAL
She does      avaL ceykiRAL


The complication of storing the verb 'do' and its Tamil counterpart 'செய்' is that, at the time of creating the lookup table, the subject-verb pair information is unknown, and storing all the forms is laborious and practically impossible. Nevertheless, the verb can be generated on the fly during the translation process.

He did → அவன் செய் + past tense + PNG feature of the subject 'He' → அவன் செய் + த் + ஆன் → அவன் செய்தான்.

Consider the following English sentence and its Tamil counterpart:

The son of my friend is a medical doctor → என்னுடைய நண்பனின் மகன் ஒரு மருத்துவர்

The phrase 'of friend' is translated to 'நண்பனின்'. Table 7.1 shows a few of the inflected forms of the word 'நண்பன்' (friend) and their English equivalents. This shows why storing all the forms in a lookup table is tedious.

TABLE 7.1. NOUN INFLECTIONS

LEMMA + MORPH FEATURE    WORD + MORPHEME   SYNTHESIZED FORM   ENGLISH EQUIVALENT
நண்பன் + Possessive       நண்பன் + இன்       நண்பனின்            of friend
நண்பன் + Dative           நண்பன் + கு        நண்பனுக்கு          to friend
நண்பன் + Benefactor       நண்பன் + காக      நண்பனுக்காக         for friend
நண்பன் + Sociative        நண்பன் + ஓடு       நண்பனோடு           with friend
நண்பன் + Accusative       நண்பன் + ஐ        நண்பனை             friend (accusative)

7.2 WHY MORPHOLOGICAL ANALYZER

What is the role of a morphological analyzer in an English to Tamil translation system? Strictly, none. However, the framework used for developing the morphological synthesizer performs the analysis task with minimal modification, and the same heuristics are used for both. In fact, morphological synthesis is the reverse process of morphological analysis, so the development of the synthesizer parallels that of the morphological analyzer. The approach used for developing the morphological analyzer and synthesizer is explained in the forthcoming sections.

7.3 INTRODUCTION (12)

Morphology deals primarily with the structure of words. Morphological analysis detects, identifies, and describes the meaningful constituent morphs in a word, which function as the building blocks of the word [26]. More on Tamil morphology is given in the appendix. The densely agglutinative Dravidian languages such as Tamil, Malayalam, Telugu, and Kannada display a unique structural formation of words: suffixes representing various senses or grammatical categories are added after the roots or stems. Senses such as person, number, gender, and case are linked to a noun stem in an orderly formation; verbal categories such as transitive, causative, tense, and person-number-gender are added to a verbal root or stem. The morphs representing these categories have their own slots behind the roots or stems. The highly complicated nominal and verbal morphology does not stand alone: it regulates the direct syntactic agreement between the subject and the predicate. Another important aspect of the addition of morphs is the changes that often take place at the boundaries between these morphs and within a stem. A morphological analyzer and synthesizer should take care of these changes while assigning a suitable morph to the correct slot to generate a word. The combination of sense and form in a morph, and the possibility of identifying the governing rules, are the incentives to attempt to build an engine that can automatically analyse and generate, mirroring the processes taking place in the brain of a native speaker.

(12) Excerpt from our published work [26].


The slots behind the root/stem can be filled by many morphs. The rules governing the order of the morphs in a word and the selection of the correct morph for the correct slot should be formulated for analysis and synthesis. The inflections and derivations are not the same for all nouns and verbs. The biggest challenge is grouping nouns and verbs in such a way that the members of the same group have similar inflections and derivations; otherwise, one has to make rules for each noun and verb, which is not feasible. The most difficult slot in a verb is the one that follows the verb root/stem. This position is occupied by the suffixes belonging to the transitive category. The elusive behaviour of these suffixes poses many problems, and most earlier morphological analyzers did not handle this problem adequately. Our system, as mentioned earlier, works on rules, and these rules are capable of resolving this clumsiness in an elegant manner. Many changes take place at the boundaries of morphs and words. Identifying the rules that govern these changes is a challenge because dissimilar changes take place in similar contexts; in such cases, it is necessary to look into the phonological as well as morphological factors that induce the changes. The designed system involves building an exhaustive lexicon for nouns, verbs, and other categories. The performance is directly related to this exhaustiveness, and building it is a laborious task. A Finite State Transducer (FST) is used for the morphological analyzer and generator [27]; we have used the AT&T Finite State Machine toolkit to build this tool. An FST maps between two sets of symbols. It can be used as a transducer that accepts the input string if it is in the language and generates another string as its output. The system is based on a lexicon and orthographic rules from a two-level morphological system.
For the morphological generator, if the string containing the root word and its morphemic information is accepted by the automaton, then it generates the corresponding root word and morpheme units at the first level. The output of the first level becomes the input of the second level, where the orthographic rules are handled; if it is accepted there, the inflected word is generated.


7.4 MORPH ANALYZER AND SYNTHESIZER – SYSTEM ARCHITECTURE

A simplified version of the system architecture is shown in figure 7.1. The practical system is a combination of a lexicon model and a lexicon-less model. The lexicon-less model serves as the fail-safe when the input lexeme is not present in the lexical model. A newly encountered noun is appended to the lexicon list automatically; for a verb, minimal human intervention is required to classify its paradigm.

FIG. 7.1. MORPHOLOGICAL ANALYZER AND SYNTHESIZER - BLOCK DIAGRAM


7.4.1 NOUN LEXICON

Currently, the Tamil noun lexicon has around 100,000 entries. In the morphology literature, it is shown that nouns can be categorized into a finite number of paradigms; however, the developed system uses a lexeme-based approach (Item and Process). The synthesized word is the result of applying rules to the lexeme that change the root/stem to form a new word.

7.4.2 MORPHOTACTICS MODEL

The order of the morphemes and their positions are restricted by a set of rules based on the noun or verb structure of the language. The structure of Tamil nouns and verbs is explained in greater depth in the Tamil morphology section. In the lexicon-based model, the transducer of the noun lexicon (Lexicon.fst) is concatenated with the morphotactics rule (Morphotactics.fst). The result for the Tamil noun 'ammA' after concatenating lexicon.fst with morphotactics.fst is shown in figure 7.2.

FIG. 7.2. TRANSDUCER FOR MORPHOTACTICS RULE

In the lexicon-less model, there is no defined restriction on what precedes the morphotactics model; see figure 7.3.


FIG. 7.3. TRANSDUCER FOR MORPHOTACTICS RULE - LEXICON LESS MODEL

7.4.3 ORTHOGRAPHIC MODEL

The spelling rules for nouns are handled in the noun orthographic rules and compiled to create a transducer model. Every rule is an FST model, and each gets composed with the morphotactics model to create the noun morph-analyzer transducer; refer to figure 7.4.

FIG. 7.4. TRANSDUCER FOR ORTHOGRAPHIC RULE FOR LEXICONLESS MODEL

The model is now ready for testing. Swapping the input and output symbols in the noun analyzer transducer yields the synthesizer model; the 'fsmreverse' command is used for this operation (for details, refer to the FSM toolkit commands section). The 'fsmunion' command performs the union operation on the given input transducer models. The remaining blocks in the system diagram are self-explanatory, and the process for verbs is much the same as for nouns.


7.5 BUILDING A SIMPLIFIED MORPH ANALYZER AND SYNTHESIZER

This section elucidates a simplified version of the Morphological Analyzer and Synthesizer (MAS) for Tamil, covering a limited set of nouns and the noun inflections of the plural marker and four case markers, and gives the step-by-step procedure for developing it.

STEP-1: REPRESENTING THE MORPHOLOGY AS A FSM

Come up with a finite-state representation of the morphology of the individual categories (such as nouns, verbs, etc.). For the simplified analyzer it looks like figure 7.5.

FIG. 7.5. TAMIL NOUNS: FSM REPRESENTATION

Note that this is just a partial representation of Tamil nouns; it does not include dedicated states for the morphophonemic changes popularly known as 'Sandhi' (the orthographic rules).

STEP-2A: IDENTIFICATION OF PARADIGMS

Five nouns have been selected for the simplified analyzer. Note that the paradigm approach is explained here for simplicity, even though the system uses a hybrid of lexeme-based morphology and word-based (paradigm) morphology.

- maram (tree)
- karam (arm)
- siram (head)
- kaN (eye)
- maN (sand/earth)

It can be noted that the first three words rhyme, as do the last two. Interestingly, words ending in similar sounds behave in the same way morphologically: the first three words always inflect in the same way when some lexical information (such as a plural or case marker) is added to them. These groups are called 'paradigms' in the literature. The task for the linguist is to identify the different paradigms in a language and come up with a representative example for each paradigm.

STEP-2B: AUTOMATIC CLASSIFICATION OF THE NOUNS/VERBS INTO PARADIGMS

Once the linguists identify the paradigms, a script can be written to classify the words in the lexicon automatically into one of the identified paradigms. This can be done relatively easily (with some error) by looking at the word endings.
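Such an ending-based classifier can be sketched as below, using the five nouns of this section. The ending table is illustrative; real paradigms need many more endings, and verbs additionally need manual checking.

```python
# Sketch of the ending-based classifier: the longest matching ending
# selects the paradigm representative. The ending table is illustrative;
# real paradigms need many more endings (and manual checking for verbs).
PARADIGM_ENDINGS = {"ram": "maram", "N": "kaN"}   # ending -> representative

def classify(noun):
    for ending in sorted(PARADIGM_ENDINGS, key=len, reverse=True):
        if noun.endswith(ending):
            return PARADIGM_ENDINGS[ending]
    return None                # unknown: needs human classification

for w in ["karam", "siram", "maN"]:
    print(w, "->", classify(w))
```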

STEP-3: BUILDING THE FSTS FOR THE INDIVIDUAL CATEGORIES

It is better to build the FST for each individual category and then compose them into a bigger FST using the commands available in the toolkit.

7.5.1 FILE FORMAT FOR THE FSM TOOLKIT (13)

The FSM toolkit (14) takes acceptor files in a space/tab-separated format of five columns (the last column is optional, for weights). The first and second columns mark the source and destination state of a specific transition; the next two columns mark the input symbol and output symbol of the transition. Each transition has to be on a separate line. The lines below give an example of the transduction of the word 'maram'.

(13) Excerpt: A Toy Morphological Analyzer – handout from Loganathan Ramasamy, 2009, Amrita Vishwa Vidyapeetham.
(14) OpenFST and FSM toolkit tutorials: http://www2.research.att.com/~fsmtools/fsm/tut.html, http://www.openfst.org/twiki/bin/view/FST/FstExamples

0	1	ma	ma
1	2	ra	ra
2	3	m	m

3

Note that individual Tamil aksharas (for example 'ma') are transduced state by state instead of the entire word in one transition. This makes it easier to implement morphophonemic changes in the word, which usually affect the last akshara. The last line gives the end state and hence has just one column. In addition, we need to create two files for the symbols used on the input and output sides of the FST, such that each symbol is given a unique index. These files have two columns giving the symbol and its unique identifier. For the above case, the input and output symbol files will be the same:

ma	1
ra	2
m	3

In a real scenario these files will differ, since the output side will have symbols for the categories and lexical information (such as N, PL, ACC, etc.); it is nevertheless a good idea to use the same identifier for common symbols. In building a morph analyzer and synthesizer, the creation of these three files should be automated to the extent possible.
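Generating these acceptor files can be automated; a sketch on a pre-split akshara list follows (splitting romanized text into aksharas is itself simplified away here).

```python
# Sketch: generate the acceptor lines for a word, one akshara per
# transition, matching the 'maram' example above. Splitting romanized text
# into aksharas is simplified away; the list is given pre-split.
def acceptor_lines(aksharas):
    lines = [f"{i}\t{i + 1}\t{a}\t{a}" for i, a in enumerate(aksharas)]
    lines.append(str(len(aksharas)))     # final state on its own line
    return "\n".join(lines)

print(acceptor_lines(["ma", "ra", "m"]))
```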

STEP-4: Once the data files are created as detailed above, use the fsmcompile command (commands are explained below) to generate the binary form of the FST. This FST is then reversed (swapping the input and output symbols makes it a synthesizer or an analyzer, depending on which side is which) and determinized again using the toolkit commands. This completes the building of the transducer, which is then ready for analysis/synthesis.


STEP-5: Now a word is tested by passing it through the transducer, which outputs the path traversed by the word. The FSM toolkit does this in a slightly circuitous way:

1. Create the acceptor file for the word and compile it using the fsmcompile command. For example, the acceptor file for the word 'cirattil' (on the head) will look like:

0	1	ci
1	2	ra
2	3	tt
3	4	il
4

2. Compose this FSM with the FST and then project it onto the output side (again, commands are given below).

3. The output will be another FSM file giving the analysis along with the categories.

7.5.2 FSM TOOLKIT COMMANDS

For any command in the toolkit, "fsmcommand -?" gives a decent help message for that command.

1. To compile an acceptor/transducer file into an FSA/FST:
   a. fsmcompile -t -i input_symbols.txt -o output_symbols.txt -F nouns.fst nouns_acceptor.txt
   b. fsmdraw -i input_symbols.txt -o output_symbols.txt nouns.fst | dot -Tjpg > nouns.jpg
   c. Here -t and -o are for transducers; drop them for acceptors.
   d. The -F option specifies the output file name to store the FSM/FST binary.
   e. The last argument takes the name of the transition file.

2. Reverse, remove epsilon transitions and determinize the FST:
   a. fsmreverse nouns.fst | fsmrmepsilon | fsmdeterminize -F nouns_rev.fst

3. To print the FST in human-readable form:
   a. fsmprint -o output_symbols.txt -i input_symbols.txt nouns.fst
   b. Omitting the -i and -o switches will print the data with indices instead of symbols.

4. Similarly, compile the acceptor file to be used for testing the morph analyzer (this file uses the word 'cirattil' (on the head) for testing):
   a. fsmcompile -i input_symbols.txt -F test.fsm test_set.txt
   b. Notice that the -t and -o switches are not used.

5. Reverse, remove epsilon transitions and determinize the FSA:
   a. fsmreverse test.fsm | fsmrmepsilon | fsmdeterminize -F test_rev.fsm

6. Now analyze with the FSM by composing it and projecting it onto the output side:
   a. fsmcompose test_rev.fsm nouns_rev.fst | fsmproject -o -F result.fsm
   b. This creates the file result.fsm, which has the analyzed output.

7. Print the results in human-readable form:
   a. fsmprint -i output_symbols.txt result.fsm
   b. Note that the -i switch actually takes output_symbols.txt. This is because we have projected the FSM onto the output side, which now becomes the input side of the FSM.

7.5.3 FST MODEL FOR MORPHOTACTICS RULE OF NOUN (SIMPLIFIED VERSION)

The possible inflections of the noun (simplified version) are given below; refer to figure 7.6 for the FST model of the noun transducer.

- Root/Stem + Plural
- Root/Stem + Plural + Case
- Root/Stem + Case
- Root/Stem + Oblique
- Root/Stem + Oblique + Case
- Root/Stem + Oblique + Plural + Case
- Root/Stem + Interrogation Marker
- Root/Stem + Plural + Interrogation Marker
- Root/Stem + Oblique + Interrogation Marker
- Root/Stem + Oblique + Plural + Interrogation Marker
- Root/Stem + Conjunction Marker
- Root/Stem + Plural + Conjunction Marker
- Root/Stem + Oblique + Conjunction Marker
- Root/Stem + Oblique + Plural + Conjunction Marker

FIG. 7.6. TRANSDUCER FOR TAMIL NOUN INFLECTION

7.5.4 FST MODEL FOR THE ORTHOGRAPHIC RULE

The (v/y)-insertion rule model is shown in figure 7.7. The 'v' insertion happens when a word ending in a vowel other than 'i', 'I', 'ai', 'e' and 'E' glues to a morpheme that begins with a vowel. Example: amma + Al → ammavAl. The 'y' insertion occurs when a word ends in 'i', 'I', 'ai', 'e' or 'E' and the next morpheme begins with a vowel. Example: alai + Al → alaiyAl.


FIG. 7.7. TRANSDUCER FOR V/Y INSERTION RULE

7.5.5 TWO-LEVEL MORPHOLOGY WITH AN EXAMPLE WORD

For the Tamil noun 'ammA', the morphotactics-rule and orthographic-rule transducer models are shown in figure 7.8.

ammA + N + INS → ammA ^ Al → ammAvAl

In the first stage, ammA + N + INS becomes ammA ^ Al, where '^' indicates the morpheme boundary, 'INS' is the instrumental case and 'Al' is the instrumental case marker. In the second level, the spelling rule is applied: since the word 'ammA' ends with 'A' and the morpheme to be glued ('Al') begins with a vowel, ammA ^ Al is accepted by the 'v insertion rule' transducer, which generates the output ammAvAl.
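The two-level cascade for this example can be sketched directly in code. The morpheme table and vowel classes below are illustrative, covering just the instrumental case and the v/y-insertion rule of the previous section.

```python
# Sketch of the two-level cascade for 'ammA': level one rewrites
# lemma + feature into lemma ^ morpheme; level two applies the
# v/y-insertion rule at the boundary. The morpheme table and vowel classes
# are illustrative and cover only this example.
VOWELS = set("aAiIuUeEoO")
FRONT = set("iIeE")                      # endings that trigger 'y' insertion

def level_one(lemma, feature):
    markers = {"INS": "Al"}              # tiny morpheme table: instrumental only
    return lemma + "^" + markers[feature]

def level_two(s):
    lemma, morpheme = s.split("^")
    if lemma[-1] in VOWELS and morpheme[0] in VOWELS:
        glue = "y" if lemma[-1] in FRONT else "v"
        return lemma + glue + morpheme
    return lemma + morpheme

print(level_two(level_one("ammA", "INS")))   # ammAvAl
print(level_two("alai^Al"))                  # alaiyAl
```

In the real system both levels are FST compositions rather than string functions, but the data flow is the same.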

FIG. 7.8. TRANSDUCER FOR MORPHOTACTICS RULE

In figure 7.9, the output symbol corresponding to the input symbol '+N' is 'empty'. In the real application it is epsilon, and this epsilon has to be removed using the FSM epsilon-removal command; for more details, see the FSM toolkit commands section.


FIG. 7.9. TRANSDUCER FOR ORTHOGRAPHIC / SPELLING RULE

Note that the string 'empty' remains in the FST model here, which is not the case in the real application; for simplicity, it is kept as such. With the model done, some testing is in order. The figures above are just a small part of the bigger picture of the model: imagine zooming into the full FST model a thousand times and viewing one particular rule and its application for the noun 'ammA'. The input word given to the synthesizer for testing is 'ammA'; its finite-state representation is shown in figure 7.10.

FIG. 7.10. INPUT WORD IN FINITE-STATE REPRESENTATION

The input FSA is composed with the morphotactics rule transducer, and the intermediate-stage finite-state representation is:

FIG. 7.11. INTERMEDIATE STAGE FSA


Further composition of the intermediate finite-state model with the orthographic rule gives the synthesized output. The synthesized finite-state model is shown in figure 7.12.

FIG. 7.12. FST FOR SYNTHESIZED WORD

For simplicity, the input FSA was composed with the lexical transducer and then the intermediate result with the orthographic transducer; in the real case, the model is created by first composing the lexical transducer with the orthographic transducer, and the input finite-state acceptor is then composed with this created model. The flow graph of the morphological synthesizer is shown in figure 7.13.

FIG. 7.13. FLOW GRAPH OF MORPH SYNTHESIZER

For the morphological analyzer, swapping the input and output symbols suffices. The swapping can be done using the fsmreverse command.


CHAPTER 8

EXPERIMENTS AND RESULTS

Most of the modules of the translation system are implemented in Java. The implemented modules include syntax reordering, dependency-to-morphological-feature mapping, a morphological synthesizer/analyzer for Tamil, and a transliteration module for English to Tamil transliteration. The Stanford Parser is integrated with the system to POS-tag and parse the English sentence. The system can be scaled up by developing enough resources, including English-Tamil pattern-based reordering rules, an English-Tamil transfer lexicon, rules for mapping the typed dependency relations to morphological features, and morphotactics and orthographic rules for Tamil nouns and verbs.

8.1 DATA FORMATS

This section explains the format of the databases used in the system. Understanding the data formats is vital for system experiments: missing or invalid resources may crash the system. The system uses the following databases: transfer rules, dependency-to-morph mapping, aux-tense-to-morph-feature mapping, and the noun, verb, adjective, adverb, pronoun, preposition and general (other POS categories) transfer lexicons.

8.2 TRANSFER RULES

The format of the reordering rules is explained in detail in the reordering rules chapter. The four columns in the db, CurNode, Source, Target and TransferLink, hold the root node of the source/target pattern rule, the children of the source rule pattern, the children of the target rule pattern, and the transfer link that maps the nodes between the source and target rules, respectively. A screenshot of the reordering rules db is shown in figure 8.1.

FIG. 8.1. TRANSFER RULES DB

8.3 DEPENDENCY TO MORPH MAPPING

The first column of the db is the relation between two words, and the second is the features to be synthesized along with the stem of one of the words. Consider the example sentence "Ram gave a book to him". The dependency relations given by the parser are [nsubj(gave-2, Ram-1), det(book-4, a-3), dobj(gave-2, book-4), prep_to(gave-2, him-6)]. In the db shown in figure 8.2, prep_to is mapped to ~N~DAT. The target lexeme of the word 'him' in transliterated form is 'avan'. Remember that the transfer lexicon has only the root/stem form of the English word and its equivalent. The word 'avan' gets the morphological feature information '~N~DAT'; the synthesizer requires this information to synthesize the complete word form:

avan~N~DAT → avan + ku → avanukku


FIG. 8.2. DEPENDENCY-MORPH FEATURE DB

8.4 AUXTENSE TO MORPH MAPPING

The auxiliary and tense are computed from the lookup table shown in Figure 8.3. The columns Pos, Aux, AuxTense and MorphInfo hold, respectively, the POS category of the verb for which the aux-tense information is computed, the auxiliary verb (if any) preceding the verb, the aux-tense form, and the morph info required by the synthesizer.

FIG. 8.3. AUXTENSE-MORPH FEATURES DB
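The lookup can be sketched in Python as a table keyed on the (verb POS tag, preceding auxiliary) pair; the entries below are taken from the tense table in Appendix C, and the sketch is illustrative rather than the Java implementation.

```python
# Illustrative lookup keyed on (verb POS tag, preceding auxiliary); the
# entries are from the aux-tense table in Appendix C. This is a sketch,
# not the Java implementation.
AUXTENSE = {
    ("VBD", None):   "PAST",
    ("VB",  "will"): "FUT",
    ("VBG", "is"):   "PAST~VP~PROG~PRES",  # present continuous
    ("VBN", "had"):  "PAST~VP~PERF~PAST",  # past perfect
}

def morph_info(pos, aux=None):
    """Return the MorphInfo string for the synthesizer, or None when the
    (pos, aux) combination is not in the table."""
    return AUXTENSE.get((pos, aux))

print(morph_info("VBD"))        # PAST
print(morph_info("VBG", "is"))  # PAST~VP~PROG~PRES
```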

8.5 NOUN LEXICON

Figure 8.4 shows the database format of the noun lexicon. The fourth column, 'Feature', holds the person-number-gender information, and the fifth column holds the synonyms (if any) of the source word. When there are multiple synonyms, the words are separated by commas.

FIG. 8.4. NOUN LEXICON

The verb, adjective, adverb, preposition and general lexicons have the same database format as the noun lexicon.
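As a sketch of how one lexicon row might be read, the Python below splits the comma-separated synonym column. The first three field names are assumptions for illustration; the text specifies only that column four is the person-number-gender 'Feature' and column five holds synonyms.

```python
# Sketch of reading one lexicon row. The first three field names are
# assumptions; the text fixes only the fourth (Feature) and fifth
# (comma-separated synonyms) columns.
def parse_lexicon_row(row):
    word, target, pos, feature, synonyms = row
    return {
        "word": word, "target": target, "pos": pos,
        "feature": feature,  # person-number-gender information
        "synonyms": synonyms.split(",") if synonyms else [],
    }

# 'paiyan', 'siRuvan', 'paalakan' are transliterated Tamil sample data.
entry = parse_lexicon_row(("boy", "paiyan", "NN", "3SM", "siRuvan,paalakan"))
print(entry["synonyms"])  # ['siRuvan', 'paalakan']
```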

8.6 TRANSLATION: STEP-BY-STEP PROCESS

Step 1: The source sentence is passed to the Stanford parser. For the input sentence "Ram gave a book to him", the parse tree in bracketed representation is:

(S (NP (NNP Ram)) (VP (VBD gave) (NP (DT a) (NN book)) (PP (TO to) (NP (PRP him)))))

The typed dependency output from the parser is:

[nsubj(gave-2, Ram-1), det(book-4, a-3), dobj(gave-2, book-4), prep_to(gave-2, him-6)]

Step 2: The parse tree is passed to the reordering module, whose output is:

(S (NP (NNP Ram)) (VP (NP (DT a) (NN book)) (PP (NP (PRP him)) (TO to)) (VBD gave)))

Step 3: The typed dependency output is fed to the morph feature extraction module. The intermediate stage of this module is:

Rel: [nsubj, det, dobj, prep_to]
Gov: [gave-2, book-4, gave-2, gave-2]
Dep: [Ram-1, a-3, book-4, him-6]

and its output is:

(S (NP (NNP Ram)) (VP (NP (DT a) (NN book~N~ACC)) (PP (NP (PRP him~N~DAT)) (TO to)) (VBD gave~V~PAST~3SM)))

Step 4: The English words are translated to Tamil, and flattening the tree gives the reordered sentence:

(S (NP (NNP பாம்)) (VP (NP (DT எரு) (NN புத்த஑ம்~N~ACC)) (PP (NP (PRP அயன்~N~DAT)) (TO -)) (VBD க஑ாடு~V~PAST~3SM)))

பாம் எரு புத்த஑ம்~N~ACC அயன்~N~DAT க஑ாடு~V~PAST~3SM

Step 5: The incomplete word forms are synthesized to get the final translation output:

பாம் எரு புத்த஑த்தத அயனுக்கு க஑ாடுத்தான்
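The data flow of the five steps can be seen in one place with the toy Python pipeline below. Each stage is a deliberately simplified stand-in for the corresponding Java module (a tokenizer instead of the parser, hard-coded reordering and feature attachment for this one sentence); steps 4 and 5 (lexicon lookup and synthesis) are omitted from the sketch.

```python
# Minimal sketch of the five-step flow above. Each stage is a toy
# stand-in wired together the way the text describes; the real modules
# are the Java implementations around the Stanford parser.
def run_pipeline(sentence, stages):
    data = sentence
    for stage in stages:
        data = stage(data)
    return data

stages = [
    # Step 1 stand-in: tokenize instead of a full parse.
    lambda s: s.split(),
    # Step 2 stand-in: reorder "Ram gave a book to him" to the SOV order
    # "Ram a book him to gave" (indices into the token list).
    lambda w: [w[0], w[2], w[3], w[5], w[4], w[1]],
    # Step 3 stand-in: attach the morph features from the example.
    lambda w: [w[0], w[1], w[2] + "~N~ACC", w[3] + "~N~DAT",
               w[4], w[5] + "~V~PAST~3SM"],
]
print(run_pipeline("Ram gave a book to him", stages))
# ['Ram', 'a', 'book~N~ACC', 'him~N~DAT', 'to', 'gave~V~PAST~3SM']
```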

8.7 TESTING

A stochastic evaluation metric was tried initially, but the results were misleading: even translated sentences rated highly by the metric were not up to par. Manual testing is therefore employed for the time being, until a better method is found. The quality of the translation system's output is rated from 1 to 5 by human judges: 1 for very poor quality; 2 for good reordering but a bad lexicon; 3 for comprehensible output; 4 for comprehensible output with a good lexicon and good reordering; and 5 for comprehensible output with perfect syntax and a perfect lexicon. 14100 input sentences from the tourism sub-language were tested.

TABLE 8.1. TRANSLATION RESULTS

RATING       NO. OF SENTENCES   PERCENTAGE
RATE 1 & 2   3134               22.22
RATE 3       8970               63.61
RATE 4 & 5   1996               14.15
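The table's figures can be sanity-checked directly: the three rating bands sum to the 14100 test sentences, and the percentages follow from the counts (the table appears to truncate, rather than round, to two decimals).

```python
# Sanity check of Table 8.1: the rating-band counts sum to the 14100
# sentences tested, and each percentage follows from its count
# (truncated, not rounded, to two decimals, as in the table).
counts = {"rate 1 & 2": 3134, "rate 3": 8970, "rate 4 & 5": 1996}
total = sum(counts.values())
assert total == 14100

for band, n in counts.items():
    pct = (n * 10000 // total) / 100  # truncate to two decimals
    print(f"{band}: {pct}")
# rate 1 & 2: 22.22, rate 3: 63.61, rate 4 & 5: 14.15
```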

The online demo version for testing the English-to-Tamil translation system is available at the following link: http://www.nlp.amrita.edu:8080/TransWeb/index.jsp.

8.7.1 TESTING RESULTS: MORPHOLOGICAL ANALYZER AND SYNTHESIZER

A monolingual corpus with a finite number of words was tested with the developed system, and the analyzed/synthesized and unanalyzed/non-synthesized forms were stored separately. For better testing results, nouns and verbs should be identified in the corpus and tested individually. The testing results of the morphological analyzer and synthesizer are given in Table 8.2.

TABLE 8.2. TESTING RESULTS OF MORPH ANALYZER AND SYNTHESIZER

MODEL        TOTAL    ANALYZED/SYNTHESIZED   PERCENTAGE
LEXICON      113653   101076                 88.93
NO LEXICON   12577    9789                   77.83

The online demo version of the morphological analyzer and synthesizer is available at the following link: http://www.nlp.amrita.edu:8080/MorphWeb/index.jsp. Sample inputs for testing are also available on the website. Testing the morph analyzer and synthesizer amounts to testing the orthographic and morphotactic rules. Testing the analyzer is comparatively easier than testing the synthesizer because of the lack of input samples for the synthesizer. For testing the analyzer, monolingual corpora from various sources were collected, pruned and tokenized. The words are first tested in the lexicon model. During development, the orthographic and morphotactic rules were tested for all possibilities immediately after the creation of every rule. Words that fail to be analyzed do so probably because of missing rules or the absence of a lexical entry; in those cases the system produces no output. In all other cases the system produces one or more outputs, which may or may not be correct. The experiments showed that at least one of the analyzed outputs is correct in most cases; in very few cases all the analyzed outputs are spurious, and this is negligible in a large test set. Note that 88.93% is not the accuracy but the percentage of words that were analyzed. The accuracy would be somewhat lower, though it has not been measured manually; if the spurious outputs are disregarded, the accuracy would be close to 88.93%.

The morphological analyzer and synthesizer (MAS), developed using the OpenFST package, shares the same orthographic and morphotactic rules for analysis and synthesis: the synthesizer is the inverse of the analyzer. Feeding the output of the analyzer into the synthesizer reproduces the original word along with any other possible word formations; that is, all of the 88.93% of analyzed outputs synthesize back to the original word. For example, the word "஧டித்திருந்தான்" is analyzed as "஧டி~V~PAST~VP~PERF~PAST~3SM". Feeding this analysis to the synthesizer produces "஧டிந்திருந்தான்" and "஧டித்திருந்தான்". Though both outputs are valid, the important observation is that one of them is always the original word that was input to the analyzer. Thus 88.93% of root words with their morph information synthesize back to a word form.

The words that are not analyzed are then tested in the lexicon-less model. However, the output of the lexicon-less model may not be correct, as opposed to the lexicon model, where the system gives output only if it has the proper rule. The percentage of words analyzed varies with the corpus. The performance of the system can be improved through exhaustive testing: the non-analyzed word forms are clustered, and patterns are identified in the clusters to create new orthographic rules. This continuous, iterative process of creating heuristics, testing, and finding new heuristics from the non-analyzed word forms can improve both the accuracy and the coverage of the system.
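The round-trip property described above (analyzer output fed to the synthesizer reproduces the original word) can be shown with a deliberately tiny toy in Python. The suffix rules here are illustrative stand-ins built from the document's dative example, and they ignore the sandhi and consonant doubling that the real orthographic rules handle.

```python
# Toy demonstration of the analyzer/synthesizer round trip: both
# directions are built from the same (suffix, feature) rules, mirroring
# how the OpenFST model shares one rule set. The rules are illustrative
# and skip real Tamil sandhi; transliterated forms are used.
RULES = [("ukku", "~N~DAT"), ("ai", "~N~ACC")]

def synthesize(analysis):
    for suffix, feat in RULES:
        if analysis.endswith(feat):
            return analysis[: -len(feat)] + suffix
    return analysis  # no matching rule: return unchanged

def analyze(word):
    for suffix, feat in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + feat
    return word

word = synthesize("avan~N~DAT")       # toy form, sandhi ignored
assert analyze(word) == "avan~N~DAT"  # the analysis round-trips
```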

The categories of sentences used for the testing procedure are as follows:

Simple + copula
(Co-ordinate) + That Complement
Simple + copula (Possessive form)
Co-ordinates
Simple + copula (Co-ordinate)
Infinitival clause (initial)
Simple + copula (Infinitive)
Gerundial
Simple with PP object
Discourse connector
Relative Clause (subordinate clause)
Conditional
Relative Clause (subordinate clause) with imperative
Relative Clause (subordinate clause) - Hidden
Copula with infinitive
Copula constructs with gerundial & 'Wh' sub-ordinates
Appositional with verb participial initial
Appositional (verb participial + initial) and conjuncts
Appositional (complement + initial)
Appositional with verb participial initial
That Complement
Complex sentence with Relative clause
Complex sentence with Relative clause (Co-ordinate) + That Complement
(Hidden) complement
Co-ordinates
Adverbial clause initial
PP initial
Simple (NP conjuncts)

The results shown in Table 8.1 cannot be fairly compared with Google's online translation service (alpha version), which was only a couple of weeks old at the time of this write-up. Google's English-Tamil SMT system appears to be better at choosing the target lexicon, whereas its morphological synthesis lags behind the rule-based system. The only other system available for comparison is the EILMT consortium's LTAG-based MT system. A comparison of the two systems was made with 500 sentences, and the observed results of this initial testing appear to be neck and neck.

8.7.2 CONTRIBUTIONS

The main contributions to the project, in programming and in linguistics, are shown in Tables 8.3 to 8.7.


TABLE 8.3. IMPLEMENTATION DETAILS

Morph Analyzer and Synthesizer: Rules are written in the 'lextools' framework and the model is created using the 'openfst' package.

Reordering: Implemented in Java. Takes the parse tree and the reordering rules as input and outputs the reordered structure.

Dependency to Morph Feature Mapping: Implemented in Java. The parse tree and the typed dependency relations are used to compute the subject of the verb, to deduce the auxiliary-tense and copula features, and to map the dependency relations to noun inflections and postpositions.

Translation of Words: Implemented in Java. The target equivalent of each source word is looked up in the bilingual lexicon. The module also handles POS-category jumping. For example, the word "tall" in the input sentence "He is tall" has the POS category adjective, yet the sentence is translated as "அயன் உனபநா஑ இருக்஑ி஫ான்", where "உனபநா஑" is an adverb; this module ensures that the adverb equivalent of "tall" is chosen rather than the adjective equivalent "உனபநா஦".

Integration of Synthesizer, Stanford Parser, Transliteration and Font Converter: These modules are implemented in Java. A multi-threaded synthesizer is developed for the Windows version of the MT system.

Desktop GUI: This tool is implemented in Java.

Phrase Tree Viewer and Reordering Rules Editor: Not a part of the MT system; a visual aid tool for developing the reordering rules and the lexicon.

Database Organization: MySQL is used for the database solutions; 12 different tables are present in the database.

Online Version of MT; Online Version of Morph Analyzer and Synthesizer: The online versions are implemented using JSP, Java Servlets, JavaScript, Java, XML and HTML.

Extensions: The system is developed in such a way that the linguistic components are separated out, making a generic MT framework for English to any Dravidian language and to Hindi. For another language pair, only the linguistic data have to be provided in the database: the bilingual dictionary, the reordering rules for the new language pair, and the morph synthesizer rules. A toy version of an English-Malayalam MT system and a Malayalam morphological analyzer and synthesizer were developed and tested to verify the adaptability of the developed framework.

TABLE 8.4. DEPENDENCY TO MORPHOLOGICAL FEATURE MAPPING

CATEGORY   NUMBER OF MAPPINGS
VERB       97
NOUN       44

TABLE 8.5. REORDERING RULES

NUMBER OF REORDERING RULES: 114

TABLE 8.6. NUMBER OF WORDS USED IN MORPH ANALYZER AND SYNTHESIZER MODEL

CATEGORY    NUMBER OF WORDS
NOUN        70207
VERB        2930
ADJECTIVE   141

TABLE 8.7. NUMBER OF RULES IN MORPH ANALYZER AND SYNTHESIZER

CATEGORY   RULE TYPE      NUMBER OF RULES
NOUN       MORPHOTACTIC   218
NOUN       ORTHOGRAPHIC   113
VERB       MORPHOTACTIC   370
VERB       ORTHOGRAPHIC   158

(The number of orthographic rules varies depending on how the rules are counted.)

CHAPTER 9

SCREEN SHOTS

FIG. 9.1. GUI OF ENGLISH-TAMIL MT SYSTEM (STAND ALONE VERSION)


FIG. 9.2. DICTIONARY PANEL AND MORPH SYNTHESIZER PANEL


FIG. 9.3. GUI: TAMIL MORPH ANALYZER AND GENERATOR

FIG. 9.4. GUI: MALAYALAM MORPH ANALYZER AND GENERATOR

FIG. 9.5. GUI: ENGLISH-MALAYALAM MT SYSTEM


FIG 9.6. ENGLISH-TAMIL MT SYSTEM (ONLINE VERSION)


FIG 9.7. MORPH ANALYZER AND SYNTHESIZER (ONLINE VERSION)


The graphical user interface (GUI) of the standalone version of the MT system, implemented in Java, is shown in Figure 9.1. Besides translation, the GUI version offers other features, such as modifying or adding lexicon entries on the fly and verifying the morphological synthesizer output. Figure 9.2 shows these features, with the dictionary panel and the morphological synthesizer panel enabled. The GUI of the standalone version of the morphological analyzer and synthesizer for Tamil, implemented in Java, is shown in Figure 9.3. With the success of the prototype implementation of the MT system, the framework developed for the English-Tamil language pair was extended to the other Dravidian languages. The framework was tested on a toy version of an English-Malayalam MT system. As part of this testing, a prototype version of the morphological analyzer and synthesizer was developed for Malayalam; it is shown in Figure 9.4. Using the framework of the English-Tamil MT system, and merely by changing the English-Tamil dictionary to an English-Malayalam dictionary along with the Malayalam Unicode font mapping, the toy version of the English-Malayalam system was tested, as shown in Figure 9.5. The online versions of the English-Tamil MT system and of the morphological analyzer and synthesizer are implemented using Java Servlets and JSP and are shown in Figures 9.6 and 9.7 respectively.


CHAPTER 10

LIMITATIONS AND FUTURE WORK

The quality of the translation output and of the morphological analyzer and synthesizer deteriorates for various reasons. This chapter discusses those reasons, the limitations of the current system, and suggestions for overcoming the issues that degrade the performance of each module and of the translation system as a whole.

The Stanford parser was chosen for its ability to produce both the syntactic structure and the typed dependency relations. The reordering module uses the parser's output for word reordering. Therefore, if the parser outputs a wrong syntactic structure or an ambiguous parse, the reordering module produces a wrong target phrasal structure, which in turn leads to a wrong translation. Neither the input nor the output of the parser is adjusted or tweaked in the current system. The performance of the parser might have been improved had it been trained for the specific domain; currently, the WSJ-trained model is used. Training the parser for a domain is out of the scope of this thesis, since it requires a gold-standard corpus for the source language in the particular domain; this thesis is devoted to developing the necessary tools for the target language and does not address the source side.

Tamil shows a very high degree of flexibility in ordering the words within a sentence: the positions of words can easily be transposed without much change in meaning. For example, "Ram gave him a book" can be reordered in multiple ways in Tamil, the most common being "Ram him a book gave", "Ram a book him gave", "Him a book Ram gave", and so on. The predicate verb mostly takes the last position. In our system, the reordering rules are strictly a one-to-one map: every source rule is mapped to one target rule, formulated according to the most common usage. The Tamil clausal structure, by contrast, is more rigid and shows little flexibility.
For example, "Ram, who is smart, gave him a book." is reordered as "(smart Ram) (him) (a book) (gave)": the adjectival clause "who is smart" has to be positioned before the noun "Ram" in Tamil. The current system can handle only generic reordering rules. For example, consider the constructs "The capital of India" and "The thousands of devotees". The parse structures for these phrases are (NP (NP (DT The) (NN capital)) (PP (IN of) (NP (NNP India)))) and (NP (NP (DT The) (NNS thousands)) (PP (IN of) (NP (NNS devotees)))) respectively. The reordering rule transforms them to "India of the capital" and "devotees of the thousands" respectively. The latter target phrase, "devotees of the thousands", is not correct, and this happens because of the one-to-one reordering rule map. This is the limitation of the current reordering mechanism. It can be overcome by letting the system generate multiple outputs through one-to-many reordering rule maps; in post-processing, the best output can then be chosen based on the fluency of the sentence, using a language model of the target language. This is not yet incorporated in the system. The quality of the translation output is also affected by the lack of reordering rules; new heuristics can be created through exhaustive testing and the identification of patterns during testing.

The morphological information needed to synthesize the target word is extracted from the typed dependencies produced by the Stanford parser. The information that is extracted and mapped to the morphological syntax is the input to the morphological synthesizer, so the synthesizer's performance depends on proper input and is affected by wrong typed dependency information from the parser. Improving the typed dependency output of the parser is out of the scope of this thesis since, as stated before, the thesis is not concerned with developing tools on the source-language side. The quality of the morphological analyzer and synthesizer is degraded by missing orthographic or morphotactic rules, improper heuristics, or the absence of a lexical entry, and this leads to poor-quality translation.
The performance of the morphological analyzer and synthesizer can be increased with the help of vigorous testing on a huge corpus: all the words that are not analyzed in a large dataset are sorted alphabetically from right to left and clustered, and analyzing the patterns in each cluster may lead to new rules. The absence of a lexical entry in the dictionary, or the presence of an entry that is not right for the given sentence context, plays an active role in the quality of the translation output. Even a domain-specific dictionary may not be enough to obtain the correct equivalent of the source word in the right sense without a Word Sense Disambiguation (WSD) module. The current MT system has no WSD module, which is critical for a rule-based MT system; its absence is a serious drawback that leads to poor-quality translation.

Out-of-vocabulary (OOV) words cannot be translated, only transliterated. The current system has a transliteration module only for named entities, that is, personal names and place names. Words that are not present in the dictionary are transliterated in the same manner, using the same tool trained for names and places. Though the transliteration tool works well for names and places, it does not produce good results for other OOV words, since it was trained specifically for names and places; this is a setback that degrades the translation quality. As a future enhancement, a named-entity identifier can be used to identify names and places, which would be transliterated using the existing tool, while other OOV words could be transliterated using a tool built from mapping rules.

Finally, as a chain reaction of a word or words in the input sentence being wrongly tagged by the POS tagger, the translation quality will be poor. Apart from these limitations, there are a handful of other issues with respect to specific sentence structures, specific word categories, specific contexts, and so on.
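The one-to-many reordering extension proposed in this chapter can be sketched as follows in Python: keep several target orders per source pattern, generate every candidate, and let a target-language scorer pick the most fluent one. The scorer here is a stub standing in for the proposed language model, and the phrase strings are only illustrative.

```python
# Sketch of the proposed one-to-many reordering: several target orders
# per source pattern, with a target-language scorer (a language model in
# the proposal; a stub here) choosing among the candidates.
def best_reordering(children, target_orders, score):
    candidates = [[children[i - 1] for i in order] for order in target_orders]
    return max(candidates, key=score)

# "NP PP" may become "PP NP" ("India of the capital") or stay "NP PP"
# ("thousands of devotees"); the stub scorer prefers the latter here.
orders = [(2, 1), (1, 2)]
pick = best_reordering(["thousands", "of devotees"], orders,
                       score=lambda c: 1.0 if c[0] == "thousands" else 0.0)
print(pick)  # ['thousands', 'of devotees']
```

A real language model over the target side would replace the stub, scoring each flattened candidate sentence for fluency.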


CHAPTER 11

CONCLUSION

With the moderate success of this system, the methodology has been extended to Malayalam and Kannada, and prototype versions have been developed. The accuracy of the developed English-Tamil MT system at present is around 14%; this is only the current status. Through the experiments and vigorous testing of the system, careful observations were made. The inference is that the performance of the system improves with all or any of the following: beefing up the transfer lexicon, and creating more specific rules for mapping the dependency relations to the morphological features. Another important observation, made by comparing different approaches, is that morphologically rich languages like Tamil demand a top-notch morphological analyzer and synthesizer for any kind of approach, whether linguistic, stochastic, or a hybrid of both. The word order of the sentence contributes very little to conveying the sense of the sentence in Tamil: as long as the relations between the words are clearly defined by the inflections of the words, the meaning of the sentence remains intact irrespective of wrong or less fluent reordering. The success of rule-based English-Tamil MT mostly depends on how well the relations between the words are captured from English and transformed into Tamil, rather than on reordering the SVO sentence structure of English into the SOV sentence structure of Tamil. Steering our thoughts in that direction may help us accomplish our goals in a far more specific way. All I can say now is that the journey that started in 1954 at Georgetown is not yet over. There is plenty of room and scope to explore, and that gives me the hope that one day mother tongues could be saved and cultures preserved.


REFERENCES

[1] Paul Garvin, "The Georgetown-IBM experiment of 1954: an evaluation in retrospect," in Papers in Linguistics in Honor of Dostert, New York, 1967, pp. 46-56.
[2] W. John Hutchins, "Machine Translation: A Brief History," in Concise History of the Language Sciences: From the Sumerians to the Cognitivists, 1995, pp. 431-445.
[3] Dorothy Senez, "Developments in Systran," Aslib Proceedings, vol. 47, no. 4, 1995, pp. 99-107.
[4] Hideki Hirakawa, Hiroyasu Nogami, and Shin-ya Amano, "EJ/JE Machine Translation System ASTRANSAC," in MT Summit III, Washington, DC, USA, July 1991.
[5] M. Nagao, J. Tsujii, and Nakamura, "The Japanese Government Project of Machine Translation," Computational Linguistics, 1985, pp. 91-110.
[6] V. Renganathan, "An interactive approach to development of English-Tamil," in Tamil Internet Conference, 2002.
[7] U. Germann, "Building a statistical machine translation system from scratch: how much bang for the buck can we expect?," in Proceedings of the Workshop on Data-Driven Methods in Machine Translation, Morristown, NJ, USA, 2001, pp. 1-8.
[8] Hemant Darbari, Anuradha Lele, Aparupa Dasgupta, Priyanka Jain, and Saravanan S., "EILMT: A Pan-Indian Perspective in Machine Translation," in Tamil Internet Conference, Coimbatore, Tamil Nadu, 2010.
[9] Anne Abeillé, Kathleen M. Bishop, Sharon Cote, and Yves Schabes, "A Lexicalized Tree Adjoining Grammar for English," University of Pennsylvania, Technical Report, 1990.
[10] Ananthakrishnan Ramanathan, Pushpak Bhattacharyya, Jayprasad Hegde, Ritesh M. Shah, and Sasikumar M., "Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation," in International Joint Conference on NLP (IJCNLP-08), Hyderabad, India, January 2008.
[11] Sudip Naskar and Sivaji Bandyopadhyay, "A Phrasal EBMT System for Translating English to Bengali," in Proceedings of MT Summit X, Phuket, Thailand, 2005.
[12] Akshar Bharati, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev Sangal, "Anusaaraka: Machine Translation in Stages," A Quarterly in Artificial Intelligence, vol. 10, NCST, Mumbai, July 1997, pp. 22-25.
[13] Daniel Jurafsky and James H. Martin, Speech and Language Processing, 2nd ed. Pearson, 2008.
[14] Kenji Yamada and Kevin Knight, "A Syntax-based Statistical Translation Model," in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 2001.
[15] D. Chiang, "An Introduction to Synchronous Grammars," Institute for Advanced Computer Studies, University of Maryland, College Park, Maryland, Technical Report, 2006.
[16] Stuart M. Shieber and Yves Schabes, "Synchronous tree-adjoining grammars," in Proceedings of the Thirteenth International Conference on Computational Linguistics (COLING), vol. 3, 1990, pp. 1-6.
[17] Aravind K. Joshi and Yves Schabes, "Tree-adjoining grammars," in Grzegorz Rozenberg and Arto Salomaa, eds., Handbook of Formal Languages, Springer-Verlag, Heidelberg, 1997.
[18] A. K. Joshi, M. Takahashi, and L. S. Levy, "Tree adjunct grammars," Journal of Computer and System Sciences, 1975, pp. 136-163.
[19] Saravanan S., Menon A. G., and Soman K. P., "Pattern Based English-Tamil Machine Translation," in Proceedings of Tamil Internet Conference, Coimbatore, India, 2010, pp. 295-299.
[20] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning, "Generating Typed Dependency Parses from Phrase Structure Parses," in LREC, 2006.
[21] Dan Klein and Christopher D. Manning, "Accurate Unlexicalized Parsing," in Proceedings of the 41st Meeting of the Association for Computational Linguistics, 2003, pp. 423-430.
[22] Dan Klein and Christopher D. Manning, "Fast Exact Inference with a Factored Model for Natural Language Parsing," in Advances in Neural Information Processing Systems, Cambridge, 2003, pp. 3-10.
[23] Vijaya M. S., Shivapratap G., Dhanalakshmi V., Ajith V. P., and Soman K. P., "Sequence labelling approach for English to Tamil Transliteration using Memory based Learning," in Proceedings of the 6th International Conference on Natural Language Processing, 2008.
[24] Koichi Takeda, "Pattern-Based Machine Translation," in Association for Computational Linguistics, Santa Cruz, California, USA, June 1996.
[25] Marie-Catherine de Marneffe and Christopher D. Manning, "Stanford typed dependencies manual," 2008.
[26] Menon A. G., Saravanan S., Loganathan R., and Soman K. P., "Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach," in Proceedings of Tamil Internet Conference, Cologne, Germany, 2009, pp. 239-243.
[27] Kimmo Koskenniemi, "A General Computational Model for Word-Form Recognition and Production," in COLING 84, 1984, pp. 178-181.
[28] Thomas Lehmann, A Grammar of Modern Tamil, 2nd ed. Pondicherry, India: Pondicherry Institute of Linguistics and Culture, 1992.
[29] Sribadrinarayanan R., Saravanan S., and Soman K. P., "Data Driven Suffix List and Concatenation Algorithm for Telugu Morphological Generator," International Journal of Engineering Science and Technology, vol. 3, no. 8, pp. 6712-6717, August 2011.
[30] Anandan, Ranjini Parthasarathy, and Geetha, "Morphological Generator for Tamil," in Tamil Internet 2001 Conference, Kuala Lumpur, Malaysia, 2001.
[31] Tina Bögel, Miriam Butt, Annette Hautli, and Sebastian Sulger, "Developing a Finite-State Morphological Analyzer for Urdu and Hindi," in LREC, 2010.
[32] Vikram T. N. and Shalini R., "Development of Prototype Morphological Analyzer for the South Indian Language of Kannada," in Proceedings of the 10th International Conference on Asian Digital Libraries, Heidelberg, Berlin, 2007, pp. 109-116.
[33] Goyal V. and Singh Lehal G., "Hindi Morphological Analyzer and Generator," in Emerging Trends in Engineering and Technology, Washington, DC, USA, 2008.
[34] Ramasamy Veerappan, Antony P. J., Saravanan S., and Soman K. P., "A Rule Based Kannada Morphological Analyzer and Generator using Finite State Transducer," International Journal of Computer Applications, pp. 45-52, August 2011.

APPENDIX A

A.1 TAMIL TRANSLITERATION SCHEME

TAMIL-ROMAN CHARACTER MAPPING

VOWELS:
அ -> a    ஆ -> A    இ -> i    ஈ -> I    உ -> u    ஊ -> U
஋ -> e    ஌ -> E    ஍ -> ai   எ -> o    ஏ -> O    ஐ -> au   ஃ -> q

CONSONANTS:
க் -> k   ங் -> G   ச் -> c   ஞ் -> J   ட் -> t   ண் -> N
த் -> T   ந் -> W   ப் -> p   ம் -> m   ய் -> y   ர் -> r
ல் -> l   வ் -> v   ழ் -> z   ள் -> L   ற் -> R   ன் -> n

SANSKRIT LETTERS:
ஜ் -> j   ஷ் -> s   ஸ் -> S   க்ஷ் -> x   ஹ் -> h

APPENDIX B B.1 REORDERING RULES

ROOT NODE OF SOURCE/TARGET

SOURCE RULE CHILD

TARGET RULE CHILD

TRANSFER LINK

ADJP

ADJP PP

PP ADJP

1:2 2:1

ADJP

JJ PP

PP JJ

1:2 2:1

ADJP

JJ S

S JJ

1:2 2:1

ADJP

VB* PP

PP VB*

1:2 2:1

ADVP

ADVP SBAR

SBAR ADVP

1:2 2:1

ADVP

RB RB

RB RB

1:2 2:1

FRAG

NP PP NP-TMP .

PP NP-TMP NP .

1:2 2:3 3:1 4:4

NP

DT NN CD

CD DT NN

1:3 2:1 3:2

NP

DT NN S

S NN DT

1:3 2:2 3:1

NP

JJ NN S

S JJ NN

1:3 2:1 3:2

NP

NN S

S NN

1:2 2:1

NP

NP , SBAR

SBAR , NP

1:3 2:2 3:1

NP

NP NP

NP NP

1:2 2:1

NP

NP NP .

NP NP .

1:2 2:1 3:3

NP

NP PP

PP NP

1:2 2:1

NP

NP SBAR

SBAR NP

1:2 2:1

NP

NP VP

VP NP

1:2 2:1

NP

QP RB

RB QP

1:2 2:1

NP

RB CD NN

CD NN RB

1:2 2:3 3:1

PP

CC NP

NP CC

1:2 2:1

PP

IN ADVP

ADVP IN

1:2 2:1

PP

IN NP

NP IN

1:2 2:1

PP

IN PP

PP IN

1:2 2:1

PP

IN S

S IN

1:2 2:1

RULE

109

PP

TO NP

NP TO

1:2 2:1

PP

VB* NP

NP VB*

1:2 2:1

QP

IN CD TO CD

CD TO CD IN

1:2 2:3 3:4 4:1

QP

RB JJR IN CD

CD IN RB JJR

1:4 2:3 3:1 4:2

S

ADVP NP VP .

NP ADVP VP .

1:2 2:1 3:3 4:4

S

NP ADVP VP .

NP VP ADVP .

1:1 2:3 3:2 4:4

SBAR

IN S

S IN

1:2 2:1

SBAR

RB IN S

S IN RB

1:3 2:2 3:1

SBAR

WHADVP S

S WHADVP

1:2 2:1

SBAR

WHNP S

S WHNP

1:2 2:1

SBAR

WHPP S

S WHPP

1:2 2:1

SBAR

XS

SX

1:2 2:1

SBARQ

WHNP SQ .

SQ WHNP .

1:2 2:1 3:3

SINV

PP VP NP .

PP NP PP .

1:1 2:3 3:2 4:4

SINV

VB* NP ADJP

NP ADJP VB*

1:2 2:3 3:1

SQ

MD NP VP

NP VP MD

1:2 2:3 3:1

SQ

MD NP VP .

NP VP MD .

1:2 2:3 3:1 4:4

SQ

MD RB NP VP .

NP VP RB MD .

1:3 2:4 3:2 4:1 5:5

SQ

S , MD NP VP .

S , NP VP MD .

1:1 2:2 3:4 4:5 5:3 6:6

SQ

VB* NP ADJP .

NP ADJP VB* .

1:2 2:3 3:1 4:4

SQ

VB* NP NP .

NP NP VB* .

1:2 2:3 3:1 4:4

SQ

VB* NP PP

PP NP VB*

1:3 2:2 3:1

SQ

VB* NP VP

NP VP VB*

1:2 2:3 3:1

SQ

VB* NP VP .

PHRASE | ENGLISH PATTERN | REORDERED PATTERN | ALIGNMENT
       |                 | NP VP VB* . | 1:2 2:3 3:1 4:4
VP | ADVP VB* NP | NP ADVP VB* | 1:3 2:1 3:2
VP | ADVP VB* SBAR | ADVP SBAR VB* | 1:1 2:3 3:2
VP | IN S | S IN | 1:2 2:1
VP | MD ADVP VP | ADVP VP MD | 1:2 2:3 3:1
VP | MD RB VP | VP MD RB | 1:3 2:1 3:2
VP | MD VP | VP MD | 1:2 2:1
VP | NN ADVP SBAR | ADVP NN SBAR | 1:2 2:1 3:3
VP | TO VP | VP TO | 1:2 2:1
VP | VB* , ADVP , S | , ADVP , S VB* | 1:2 2:3 3:4 4:5 5:1
VP | VB* ADJP | ADJP VB* | 1:2 2:1
VP | VB* ADJP , SBAR | ADJP VB* , SBAR | 1:2 2:1 3:3 4:4
VP | VB* ADJP ADVP | ADVP ADJP VB* | 1:3 2:2 3:1
VP | VB* ADJP PP | PP ADJP VB* | 1:3 2:2 3:1
VP | VB* ADVP | ADVP VBZ | 1:2 2:1
VP | VB* ADVP ADJP | ADVP ADJP VB* | 1:2 2:3 3:1
VP | VB* ADVP ADVP | ADVP ADVP VB* | 1:3 2:2 3:1
VP | VB* ADVP NP | ADVP NP VB* | 1:2 2:3 3:1
VP | VB* ADVP NP-TMP | NP-TMP ADVP VB* | 1:3 2:2 3:1
VP | VB* ADVP PP | PP ADVP VB* | 1:3 2:2 3:1
VP | VB* ADVP VP | VP ADVP VB* | 1:3 2:2 3:1
VP | VB* CC VB* PP | PP VB* CC VB* | 1:4 2:1 3:2 4:3
VP | VB* FRAG | FRAG VB* | 1:2 2:1
VP | VB* NP | NP VB* | 1:2 2:1
VP | VB* NP ADVP | NP ADVP VB* | 1:2 2:3 3:1
VP | VB* NP NP | NP NP VB* | 1:2 2:3 3:1
VP | VB* NP NP-TMP | NP-TMP NP VB* | 1:3 2:2 3:1
VP | VB* NP PP | NP PP VB* | 1:2 2:3 3:1
VP | VB* NP PP , PP | NP PP , PP VB* | 1:2 2:3 3:4 4:5 5:1
VP | VB* NP PP PP | PP PP NP VB* | 1:3 2:4 3:2 4:1
VP | VB* NP SBAR | NP SBAR VB* | 1:2 2:3 3:1
VP | VB* NP-TMP | NP-TMP VB* | 1:2 2:1
VP | VB* PP | PP VB* | 1:2 2:1
VP | VB* PP , PP | PP , PP VB* | 1:2 2:3 3:4 4:1
VP | VB* PP ADVP | ADVP PP VB* | 1:3 2:2 3:1
VP | VB* PP NP-TMP | PP NP-TMP VB* | 1:2 2:3 3:1
VP | VB* PP PP | PP PP VB* | 1:2 2:3 3:1
VP | VB* PP S | S PP VB* | 1:3 2:2 3:1
VP | VB* PP SBAR | PP SBAR VB* | 1:2 2:3 3:1
VP | VB* PRT | PRT VB* | 1:2 2:1
VP | VB* PRT NP | NP PRT VB* | 1:3 2:2 3:1
VP | VB* PRT NP-TMP PP | NP-TMP PRT PP VB* | 1:3 2:2 3:4 4:1
VP | VB* PRT PP | PP PRT VB* | 1:3 2:2 3:1
VP | VB* RB ADJP | ADJP VB* RB | 1:3 2:1 3:2
VP | VB* RB NP | NP VB* RB | 1:3 2:1 3:2
VP | VB* RB VP | VP RB VB* | 1:3 2:2 3:1
VP | VB* S | S VB* | 1:2 2:1
VP | VB* SBAR | SBAR VB* | 1:2 2:1
VP | VB* VP | VP VB* | 1:2 2:1
WHPP | IN WHNP | WHNP IN | 1:2 2:1
VP | VB* NP NP SBAR | SBAR NP NP VB* | 1:4 2:2 3:3 4:1
VP | VB* ADVP ADJP PP | ADVP ADJP PP VB* | 1:2 2:3 3:4 4:1
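The reordering rules above can be applied mechanically once the alignment string is parsed. The sketch below is illustrative, not the thesis implementation; it reads each alignment pair as target-position:source-position, which appears consistent with the rows above.

```python
# Illustrative sketch: applying one reordering rule from the table above
# to the children of a parse-tree node. Function names are hypothetical.

def parse_alignment(alignment):
    """Turn '1:3 2:1 3:2' into a list of 0-based source indices in target order."""
    pairs = (p.split(":") for p in alignment.split())
    return [int(src) - 1 for _, src in sorted((int(tgt), src) for tgt, src in pairs)]

def reorder(children, alignment):
    """Reorder a node's children according to a rule's alignment string."""
    return [children[i] for i in parse_alignment(alignment)]

# Rule: VP -> ADVP VB* NP becomes NP ADVP VB* with alignment 1:3 2:1 3:2
print(reorder(["quickly", "ate", "the apple"], "1:3 2:1 3:2"))
# -> ['the apple', 'quickly', 'ate']
```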

APPENDIX C

C.1 TENSE-MORPHOLOGICAL FEATURES LOOKUP TABLE

POS CATEGORY OF PARTICIPLE | AUXILIARY | AUX-TENSE | INPUT FORMAT FOR MORPH SYNTHESIZER
VBZ | - | PRES | PRES
VBP | - | PRES | PRES
VBD | - | PAST | PAST
VB | WILL | FUT | FUT
VBG | AM | PRES CONT | PAST~VP~PROG~PRES
VBG | IS | PRES CONT | PAST~VP~PROG~PRES
VBG | ARE | PRES CONT | PAST~VP~PROG~PRES
VBG | WERE | PAST CONT | PAST~VP~PROG~PAST
VBG | WAS | PAST CONT | PAST~VP~PROG~PAST
VBG | WILL BE | FUT CONT | PAST~VP~PROG~FUT
VBG | SHALL BE | FUT CONT | PAST~VP~PROG~FUT
VBN | HAVE | PRES PERF | PAST~VP~PERF~PRES
VBN | HAS | PRES PERF | PAST~VP~PERF~PRES
VBN | HAD | PAST PERF | PAST~VP~PERF~PAST
VBN | WILL HAVE | FUT PERF | PAST~VP~PERF~FUT
VBN | SHALL HAVE | FUT PERF | PAST~VP~PERF~FUT
VBG | HAVE BEEN | PRES PERF CONT | PAST~VP~PROG~PRES
VBG | HAS BEEN | PRES PERF CONT | PAST~VP~PROG~PRES
VBG | HAD BEEN | PAST PERF CONT | PAST~VP~PROG~PAST
VBG | WILL HAVE BEEN | FUT PERF CONT | PAST~VP~PROG~FUT
VBG | SHALL HAVE BEEN | FUT PERF CONT | PAST~VP~PROG~FUT
VBN | AM | PASS PRES | INF~PAS~PRES
VBN | IS | PASS PRES | INF~PAS~PRES
VBN | ARE | PASS PRES | INF~PAS~PRES
VBN | WERE | PASS PAST | INF~PAS~PAST
VBN | WAS | PASS PAST | INF~PAS~PAST
VBN | WILL BE | PASS FUT | INF~PAS~FUT
VBN | SHALL BE | PASS FUT | INF~PAS~FUT
VBN | IS BEING | PASS PRES CONT | INF~PAS~PAST~VP~PROG~PRES
VBN | ARE BEING | PASS PRES CONT | INF~PAS~PAST~VP~PROG~PRES
VBN | WAS BEING | PASS PAST CONT | INF~PAS~PAST~VP~PROG~PAST
VBN | WERE BEING | PASS PAST CONT | INF~PAS~PAST~VP~PROG~PAST
VBN | HAS BEEN | PASS PRES PERF | INF~PAS~PAST~VP~PERF~PRES
VBN | HAVE BEEN | PASS PRES PERF | INF~PAS~PAST~VP~PERF~PRES
VBN | HAD BEEN | PASS PAST PERF | INF~PAS~PAST~VP~PERF~PAST
VBN | WILL HAVE BEEN | PASS FUT PERF | INF~PAS~PAST~VP~PERF~FUT
VBN | HAS BEEN BEING | PASS PRES PERF CONT | INF~PAS~PAST~VP~PROG~PRES
VBN | HAVE BEEN BEING | PASS PRES PERF CONT | INF~PAS~PAST~VP~PROG~PRES
VBN | HAD BEEN BEING | PASS PAST PERF CONT | INF~PAS~PAST~VP~PROG~PAST
VBN | WILL HAVE BEEN BEING | PASS FUT PERF CONT | INF~PAS~PAST~VP~PROG~FUT
VBN | WILL BE BEING | PASS FUT CONT | INF~PAS~PAST~VP~PROG~FUT
VB | CAN | CAN | INF~AUX_CAN~NPNG
VB | CAN NOT | CAN NOT | INF~AUX_CANNOT~NPNG
VB | MAY | MAY | VN~AUX_MAY~NPNG
VB | MAY NOT | MAY NOT | NEG~VP~PERF~VN~AUX_MAYNOT~NPNG
VB | SHOULD | SHOULD | INF~AUX_SHOULD~NPNG
VB | SHOULD NOT | SHOULD NOT | INF~AUX_SHOULDNOT~NPNG
VB | DID NOT | DID NOT | INF~NEG_NF~NPNG
VB | DOES NOT | DOES NOT | INF~NEG_F
VB | WILL NOT | WILL NOT | INF~NEG_F
VBN | HAD NOT | HAD NOT | PAST~VP~PERF~INF~NEG_NF~NPNG
VBN | HAS NOT | HAS NOT | PAST~VP~PERF~INF~NEG_NF~NPNG
VBN | WILL NOT HAVE | WILL NOT HAVE | PAST~VP~PERF~INF~NEG_F
VBZ | NEVER | NEVER | INF~CLI_E~NEG_F
VBN | NEVER | NEVER | INF~CLI_E~NEG_NF~NPNG
VBZ | WILL NEVER | WILL NEVER | INF~CLI_E~NEG_F
VBN | WILL HAVE NEVER | WILL HAVE NEVER | PAST~VP~PERF~INF~CLI_E~NEG_F
VBN | HAVE NEVER | HAVE NEVER | PAST~VP~PERF~INF~CLI_E~NEG_F
VBN | HAS NEVER | HAS NEVER | PAST~VP~PERF~INF~CLI_E~NEG_F
VBN | HAD NEVER | HAD NEVER | PAST~VP~PERF~INF~CLI_E~NEG_NF~NPNG
VBD | HAS | PRES PERF | PAST~VP~PERF~PRES
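The lookup above can be represented directly as a dictionary keyed on the POS tag of the participle and its auxiliary. This is an illustrative sketch of the idea, not the system's actual data structure; only a few rows are included and the helper name is hypothetical.

```python
# Illustrative sketch of the Appendix C lookup: map an English verb
# group's (POS tag, auxiliary) pair to the feature string expected by
# the Tamil morphological synthesizer. Only a handful of rows shown.

TENSE_TABLE = {
    ("VBZ", None):        "PRES",
    ("VBD", None):        "PAST",
    ("VB",  "WILL"):      "FUT",
    ("VBG", "IS"):        "PAST~VP~PROG~PRES",
    ("VBG", "WAS"):       "PAST~VP~PROG~PAST",
    ("VBN", "HAS"):       "PAST~VP~PERF~PRES",
    ("VBN", "WILL HAVE"): "PAST~VP~PERF~FUT",
    ("VBN", "IS"):        "INF~PAS~PRES",   # passive reading of VBN
}

def morph_features(pos, auxiliary=None):
    """Return the morph-synthesizer input string, or None if unlisted."""
    return TENSE_TABLE.get((pos, auxiliary))

print(morph_features("VBG", "IS"))   # -> PAST~VP~PROG~PRES
```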


APPENDIX D

D.1 TAMIL VERB MORPHOLOGY

The structure of a Tamil verb is as follows:

ROOT / STEM + TRANSITIVE + CAUSATIVE + TENSE / NEGATIVE + EMPTY + PERSON-NUMBER-GENDER

A Tamil verb is characterized by its ability to take a tense suffix, a Person-Number-Gender (PNG) suffix and, wherever possible, a transitive and a causative suffix. We start with a simple verb stem. The Tamil verb stem equivalent to the English verb "say" is "col" [ கசால் ]. The paradigm of this verb, following the structure above, is as follows:

கசால் + ஑ிற் + ஆன்      ( he says )              STEM + PRESENT TENSE + 3RD PERSON MASCULINE SINGULAR
கசால் + வ் + ஆள்        ( she will say )         STEM + FUTURE TENSE + 3RD PERSON FEMININE SINGULAR
கசால் + இன் + ஆர்       ( they said )            STEM + PAST TENSE + 3RD PERSON COMMON PLURAL
கசால் + இன் + அது      ( it said )              STEM + PAST TENSE + 3RD PERSON NEUTER SINGULAR
கசால் + இன் + ஆர்஑ள்    ( they [ neuter ] said ) STEM + PAST TENSE + 3RD PERSON PLURAL
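The agglutinative paradigm above can be mimicked by simple suffix concatenation. The sketch below is illustrative only, using romanized suffixes and hypothetical helper names; real synthesis must also apply sandhi rules at morpheme boundaries.

```python
# Minimal sketch of STEM + TENSE + PNG synthesis for the stem "col" (say).
# Suffixes are romanized; sandhi is deliberately ignored.

TENSE = {"PRES": "kiR", "PAST": "in", "FUT": "v"}
PNG = {"3SM": "aan", "3SF": "aaL", "3P": "aar"}

def synthesize(stem, tense, png):
    """Naive concatenation of stem, tense suffix and PNG suffix."""
    return stem + TENSE[tense] + PNG[png]

print(synthesize("col", "PRES", "3SM"))  # -> colkiRaan ( he says )
print(synthesize("col", "FUT", "3SF"))   # -> colvaaL   ( she will say )
```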

The verb "col" has no transitive-intransitive contrast. Therefore, we continue below with a verb stem which is capable of taking a transitive suffix.

thazh ( தாழ் ) - sink.

Note: This section is based on the PhD thesis of Dr. A. G. Menon.

தாழ் + ஑ிற் + ஆன்          ( he sinks )                      STEM + PRES + 3SM
தாழ் + த்து + ஑ிற் + ஆன்    ( he makes something sink )       STEM + TRANSITIVE + PRES + 3SM
தாழ் + த்து + வ் + ஆன்      ( he will make something sink )   STEM + TRANSITIVE + FUT + 3SM
தாழ் + த்து + இன் + ஆன்    ( he made something sink )        STEM + TRANSITIVE + PAST + 3SM

We go further with an example of the causative. In the case of verbs with an intransitive-transitive contrast, both the transitive and the causative suffix must be present; in other words, the transitive suffix is obligatory before a causative suffix is added.

தாழ் + த்து + யி + த்த் + ஆன்    ( he caused someone to make something sink )        STEM + TRANSITIVE + CAUSATIVE + PAST + 3SM
தாழ் + த்து + யி + க்஑ிற் + ஆன்  ( he causes someone to make something sink )        STEM + TRANSITIVE + CAUSATIVE + PRES + 3SM
தாழ் + த்து + யி + ப்ப் + ஆன்    ( he will cause someone to make something sink )    STEM + TRANSITIVE + CAUSATIVE + FUT + 3SM

In the case of verbs lacking an intransitive-versus-transitive contrast, the causative suffix is added immediately after the stem to form a transitive. Example: கசய் [ do ]

கசய் + யி               [ to make someone do ]     STEM + CAUSATIVE SUFFIX
கசய் + த் + ஆன்          [ he did ]                 STEM + PAST + 3SM
கசய் + யி + த்த் + ஆன்    [ he made someone do ]     STEM + CAUSATIVE + PAST + 3SM

The causative suffix is replaced by a lexical item in the form of an auxiliary verb in modern Tamil. The following examples illustrate this change.

தாழ் [ sink ]                ( INTRANSITIVE )
தாம தய                      [ to make something sink ]                    ( TRANSITIVE )
தாழ்த்த தய                   [ to cause someone to make something sink ]   ( CAUSATIVE )

தாம தய + த்த் + ஆன்          [ he made something sink ]                    VERB INFINITIVE + AUXILIARY + PAST + 3SM
தாம தய + க்஑ிற் + ஆன்        [ he makes something sink ]                   VERB INFINITIVE + AUXILIARY + PRES + 3SM
தாம தய + ப்ப் + ஆன்          [ he will make something sink ]               VERB INFINITIVE + AUXILIARY + FUT + 3SM
தாழ்த்த தய + த்த் + ஆன்       [ he caused someone to make something sink ]  TRANSITIVE VERB INFINITIVE + AUXILIARY + PAST + 3SM
தாழ்த்த தய + க்஑ிற் + ஆன்     [ he causes someone to make something sink ]  TRANSITIVE VERB INFINITIVE + AUXILIARY + PRES + 3SM
தாழ்த்த தய + ப்ப் + ஆன்       [ he will cause someone to make something sink ]  TRANSITIVE VERB INFINITIVE + AUXILIARY + FUT + 3SM

In this case, the auxiliary verb 'வை' has taken over the functions of both the transitive and the intransitive.

஋டு + க்஑ிற் + ஆன்     [ he takes ]                                  STEM + PRES + 3SM
஋டு + ப்ப் + ஆன்       [ he will take ]                              STEM + FUT + 3SM
஋டு + த்த் + ஆன்       [ he took ]                                   STEM + PAST + 3SM
புத஑ + ந்த் + ஆன்      [ he became angry ]                           STEM + PAST + 3SM
புத஑ + த் + த் + ஆன்    [ he burnt something / he smoked something ]  STEM + TRANS + PAST + 3SM
புத஑ + ஑ிற் + ஆன்      [ he is angry ]                               STEM + PRES + 3SM
புத஑ + க் + ஑ிற் + ஆன்  [ he burns something / he smokes something ]  STEM + TRANS + PRES + 3SM
புத஑ + வ் + ஆன்        [ he will be angry ]                          STEM + FUT + 3SM
புத஑ + ப் + ப் + ஆன்    [ he will burn something / he will smoke something ]  STEM + TRANS + FUT + 3SM

TENSE

In general, Tamil verbs distinguish three tenses: present, past and future.

PRESENT TENSE

Suffixes: ஑ிறு, ஑ின்று. There are no distributional differences between these two suffixes. Both occur after the same verb stems, except in finite verbs with a neuter pronominal plural, such as:

஧டி + க்஑ின்ற் + அன் + அ    ( They ( neuter ) read )   STEM + PRES + EMPTY + 3PN
கசய் + க்஑ின்ற் + அன் + அ    ( They do )                STEM + PRES + EMPTY + 3PN

PAST

There are three sets of past tense markers in Tamil: ந்த், த் and its variants, and இன் and its variants. One of the biggest problems of Tamil verb morphology is predicting which past and future suffixes occur after which verb stems.


It is difficult, if not impossible, to predict which verb stem will take which past tense marker; the distribution of the past tense suffixes is in a way predetermined in the language.

அழு + த + ஆன்       [ he cried ]               STEM + PAST + 3SM
க஑ாடு + த்த் + ஆன்    [ he gave ]                STEM + PAST + 3SM
சாடு + இன் + ஆன்     [ he jumped ]              STEM + PAST + 3SM
ப஧ாடு + ட் + ஆன்      [ he let something fall ]  STEM + PAST + 3SM
஥ெ + ந்த் + ஆன்       [ he walked ]              STEM + PAST + 3SM

In the above cases, it is not predictable which past tense suffix will occur after each stem. The distribution of the past tense suffixes is not restricted by the form of the verb stem; it is rather information which comes along with the language.

Verb stems are classified into three major groups on the basis of the past tense markers they take. ந்த் has variants in the spoken dialects; for example, ஋ாிந்தான் - ஋ாிஞ்சான். 'த்' has the variants 'த்' and 'த்த்'; in the spoken language, 'த்த்' has a palatalized variant 'ச்ச்'. To the same group also belong the stems which produce the past tense by gemination of the stem consonant. Example:

஥டு + ஑ிற் + ஆன்  ->  ஥டு஑ி஫ான்
஥டு + வ் + ஆன்    ->  ஥டுயான்
஥டு + ட் + ஆன்    ->  ஥ட்ொன்

In the past tense, the 'த்' is geminated to produce the past form. This can be explained in two ways. As a process of gemination.
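Since the past tense marker a stem takes is lexically determined, a generator can only record each stem's class in a lexicon. The sketch below is romanized and illustrative, with a hypothetical lexicon; sandhi between stem and suffix is not applied.

```python
# Illustrative sketch: a lexicon mapping each verb stem to the past tense
# marker class it takes, since the class cannot be predicted from form.

PAST_CLASS = {          # stem -> past tense marker (romanized)
    "nata": "nth",      # natanthaan  ( he walked )
    "kotu": "thth",     # kotuththaan ( he gave )
    "azhu": "th",       # azhuthaan   ( he cried )
}

def past_form(stem, png="aan"):
    """Naive STEM + PAST + PNG concatenation, no sandhi."""
    return stem + PAST_CLASS[stem] + png

print(past_form("nata"))  # -> natanthaan
```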


As a case of sandhi, in which the stem consonant and the following past tense marker 'த' are converted into 'த்த்'. Another example of this sandhi change and the resulting geminate past tense marker is 'பதாற்஫ான்':

பதாற் + ற் + ஆன்

[ he failed ]

STEM + PAST + 3SM

The verb stem 'பதால்' ends in 'ல்'; after the addition of the past tense marker 'த்த்' it becomes 'ற்ற்': பதால் + த்த் + ஆன் becomes பதாற்஫ான்.

பதால் + த் + த் + ஆன்   ->  பதாற் + த் + ஆன்    ->  பதாற்஫ான்
பதால் + ஑ிற் + ஆன்      ->  பதாற் + ஑ிற் + ஆன்   ->  பதாற்஑ி஫ான்
பதால் + ப் + ப் + ஆன்    ->  பதாற் + ப் + ஆன்    ->  பதாற்஧ான்

Though Tamil verbs can be classified into three general groups, it is necessary to divide each group further into subgroups. The reasons for this subgrouping are:

- the absence of a common feature in the verbs which take, for example, 'த்' or 'த்த்' as past tense markers;
- automatic generation and analysis can be simplified by grouping these stems into subgroups.

தய + ந் + ஆன்      [ he scolded ]
தய + த்த் + ஆன்     [ he placed something ]

Syntactically, there is no difference between these two stems, because both of them can take the accusative case marker; in other words, there is no distinguishing morphological feature.


The third past tense suffix 'இன்' has three variants: 'இன்', 'இ' and 'ன்'. The distribution of these suffixes is easy to predict: 'இன்' occurs in finite verbs and sometimes in relative participles. The following examples illustrate these distributions.

ஏடு + இன் + ஌ன்   [ I ran ]       STEM + PAST + 1S
ஏடி஦ான்
ஏடு + இ + அ       [ which ran ]   STEM + PAST + RP MARKER
ஏடின

In the case of the relative participle, it alternates with 'இ'. For example, ஏடு + இன் + அ and ஏடு + இ + ய் + அ. The suffix 'இ' occurs at the end of the verbal participle, such as 'ஏடி' [ having run ]. The verbs 'ப஧ா' [ to go ] and 'ஆ' [ to become ] take 'ய்' as the past tense marker apart from 'இன்' and 'இ'. Example: ப஧ா + ய் + அ corresponding to ப஧ா + ன் + அ and ப஧ா + ய் + இன் + அ. The most frequent form in modern Tamil is 'ப஧ா஦'. The suffix 'ன்' occurs in the finite verbs of 'ப஧ா' and 'ஆ' apart from their relative participles. Example:

ப஧ா + ன் + ஆன்    [ he went ]       STEM + PAST + 3SM
ஆ + ன் + ஆன்      [ he became ]     STEM + PAST + 3SM
ப஧ா + ன் + அ      [ which went ]    STEM + PAST + RELATIVE PARTICIPLE
ஆ + ன் + அ        [ which became ]  STEM + PAST + RELATIVE PARTICIPLE

Some of the verbs take two different past tense suffixes. This takes place during the transition from intransitive to transitive.


Example:

தாழ் + ந்த் + ஆன்         [ he sank ]                   STEM + PAST + 3SM
தாழ் + த்த் + இன் + ஆன்   [ he made something sink ]    STEM + TRANS + PAST + 3SM

FUTURE

In Tamil, there are three future tense suffixes: ப் (and its variant ப்ப்), வ், and உம். The distribution of 'ப்ப்' parallels that of the past tense marker 'த்த்' and the present tense marker 'க்஑ிற்'. Example: அடி [ beat ]

அடி + க்஑ிற் + ஆன்    [ he beats ]       STEM + PRES + 3SM
அடி + ப்ப் + ஆன்      [ he will beat ]   STEM + FUT + 3SM
அடி + த்த் + ஆன்      [ he beat ]        STEM + PAST + 3SM
஑ாண் + ப் + ஆன்       [ he will see ]    STEM + FUT + 3SM
தீன் + ப் + ஆன்        [ he will eat ]    STEM + FUT + 3SM

Examples for the future tense marker 'வ்' are:

யரு + வ் + ஆன்     [ he will come ]
கசால் + வ் + ஆன்    [ he will say ]
஋ழு + வ் + ஆன்     [ he will stand up ]
஧ாடு + வ் + ஆன்     [ he will sing ]
஧ெர் + வ் + ஆன்     [ he will spread ]
க஑ாள் + வ் + ஆன்    [ he will have ]

The last future tense marker is 'உம்'. It occurs in finite verbs of the neuter singular and plural, and also in the future relative participle of all verbs.

அது யர் + உம்      [ it will come ]
அதய யர் + உம்     [ they will come ]
யரும் த஧னன்        [ the boy who will come ]             relative participle
எடிக்கும் த஑஑ள்     [ the hands which break something ]

Some of the verbs take a formative suffix, which may be either 'க்க்' or 'ப்ப்'. The occurrence of 'க்க்' and 'ப்ப்' goes parallel with the verb stems which take double geminates for the three tenses, such as the stem '஋டு' [ take ]: ஋டுக்஑ [ infinitive of 'etu', to take ], ஋டுப்஧ [ infinitive of 'etu', to take ], ஋ட்டுக்஑ி஫ான், ஋டுப்஧ான், ஋டுத்தான்.

EMPTY

Since the tense markers have already been used in the above examples, we move to the next position in the Tamil verb. You might have noticed the presence of a suffix between the tense and the PNG. This suffix is labelled the empty suffix because its function is lost in Modern Tamil. It is also known as a bridge suffix because of its connecting function: it connects a tense suffix with a PNG suffix. We will not go further into the historical function of the empty / bridge suffix. A few examples:

ய + ந்த் + அன் + அன்     [ he came ]    STEM + PAST + EMPTY + 3SM
ய + ந்த் + அன் + அள்     [ she came ]   STEM + PAST + EMPTY + 3SF

In Modern Tamil, the bridge suffix is not in use.

Another important aspect of a Tamil verb is its explicit use of a person, number and gender suffix in agreement with the subject. However, verb stems with the future tense suffix 'உம்' do not show this agreement through explicit morphemes. The person-number-gender markers indicate whether the subject is singular or plural, whether it is masculine, feminine or neuter, and whether it is in the first, second or third person. The following examples illustrate the functioning of the PNG suffixes in Tamil.

஧டி + க்஑ிற் + ஌ன்           ( I read )                 STEM + PRES + 1S
஧டி + க்஑ிற் + ஏம்           ( We read )                STEM + PRES + 1P
஧டி + க்஑ிற் + ஆய்           ( You read )               STEM + PRES + 2S
஧டி + க்஑ிற் + ஈர் ( -஑ள் )    ( You ( pl ) read )        STEM + PRES + 2P
஧டி + க்஑ிற் + ஆன்           ( He reads )               STEM + PRES + 3SM
஧டி + க்஑ிற் + ஆள்           ( She reads )              STEM + PRES + 3SF
஧டி + க்஑ிற் + ஆர் ( -஑ள் )    ( They read )              STEM + PRES + 3P ( can also be used as an honorific )
஧டி + க்஑ிற் + athu          ( It reads )               STEM + PRES + 3SN
஧டி + க்஑ின்ற் + அன் + அ      ( They ( neuter ) read )   STEM + PRES + EMPTY + 3PN
஧டி + க்க் + உம்             ( They / it will read )    STEM + FORMATIVE SUFFIX + FUT
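The PNG agreement shown above amounts to a table lookup. The sketch below is romanized and illustrative, for the present tense of 'pati' (read); suffix spellings follow the examples above and sandhi is not modelled.

```python
# Illustrative sketch of present tense PNG agreement for "pati" (read).
# The PNG suffixes are romanized versions of those in the table above.

PNG_SUFFIX = {
    "1S": "een", "1P": "oom", "2S": "aay", "2P": "iir",
    "3SM": "aan", "3SF": "aaL", "3P": "aar", "3SN": "athu",
}

def present(stem, png):
    """STEM + formative/present 'kkiR' + PNG suffix, naive concatenation."""
    return stem + "kkiR" + PNG_SUFFIX[png]

print(present("pati", "1S"))   # -> patikkiReen ( I read )
print(present("pati", "3SM"))  # -> patikkiRaan ( he reads )
```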

NOTE ON THE FORMATIVE SUFFIX

The distribution of the formative suffixes 'க்க்' and 'ப்ப்' is not clear. The so-called strong verbs form their stems by adding 'க்க்' or 'ப்ப்' to their roots, such as '஧டி' [ read ] and '஧டிக்க்-', the same stem with the formative suffix 'க்க்'. It also occurs with the formative suffix 'ப்ப்', as in '஧டிப்ப்-'. The formative suffix shows up very clearly in the infinitive forms of these verbs, for example '஧டிக்஑' [ to read ], '஋டுக்஑' [ to take ], 'க஑ாடுக்஑' [ to give ] and '஧ார்க்஑' [ to see ]. The same formative suffix also occurs as a transitive suffix in verb stems which show a transitive-intransitive contrast, for example 'அமி஑ின' versus 'அமிக்஑', '஋ாின' versus '஋ாிக்஑', and 'நடின' versus 'நடிக்஑'.

NEGATIVE

In the above discussion of Tamil finite verbs, we have left out the role of the negative suffixes. In a verb, the tense markers are replaced by a negative marker; in other words, tense suffixes and negative suffixes do not co-occur in a Tamil verb. The formation of the negative differs radically between Classical and Modern Tamil. For example, the negative form of 'தாழ்த்துபயன்' ( I will make something sink ) is 'தாழ்த்பதன்' ( I will not make something sink ). In Modern Tamil, 'இல்த஬' and 'நாட்ட்-' are used to express the negative. Examples: யபயில்த஬ [ did not come ], 'யபநாட்பென்' [ I will not come ], 'கசால்஬யில்த஬' [ didn't say ], கசால்஬ நாட்பென் [ I will not say ]. 'இல்த஬' is used to express a negative in the past / present, and 'நாட்ட்-' is used to express the negative in the future. There are also morphemes expressing negation, such as -ஆ- and -ஆத்-.
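The Modern Tamil negative formation described above can be sketched as follows. This is romanized and illustrative only; the glide insertion and PNG variation are simplified, and the helper name is hypothetical.

```python
# Illustrative sketch of Modern Tamil negation: "illai" (with v-glide,
# giving -villai) for past/present, "maatt-" + PNG suffix for future.

def negative(infinitive, tense, png_suffix="een"):
    """Attach the Modern Tamil negative to a romanized infinitive."""
    if tense in ("PAST", "PRES"):
        return infinitive + "villai"          # e.g. varavillai ( did not come )
    return infinitive + "maatt" + png_suffix  # e.g. varamaatteen ( I will not come )

print(negative("vara", "PAST"))  # -> varavillai
print(negative("vara", "FUT"))   # -> varamaatteen
```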

IMPERATIVES

Imperatives may be positive or negative, and can take a plural marker; the singular is unmarked. In negative imperatives a negative suffix is added, and in transitive imperatives a transitive suffix is added. The structure of an imperative is as follows:

STEM + TRANSITIVE + CAUSATIVE / STEM + TRANSITIVE + NEGATIVE + EE ( IMPERATIVE MARKER )

Example:

தாழ்                          [ sink ]                                ( IMPERATIVE )                              STEM
தாழ் + த்த் + உ                [ make something sink ]                 ( TRANSITIVE IMPERATIVE )                   STEM + TRANSITIVE + U
தாழ் + த்த் + ஆத் + ஌          [ don't make something sink ]           ( TRANSITIVE NEGATIVE IMPERATIVE )          STEM + TRANSITIVE + NEGATIVE + IMPERATIVE MARKER
தாழ் + த்த் + ஆத் + ஈர் (-஑ள்)   [ you shouldn't make something sink ]   ( TRANSITIVE NEGATIVE IMPERATIVE, 2ND PERSON )  STEM + TRANSITIVE + NEGATIVE + 2P
தாழ் + த்த் + உங்஑ள்            [ you make something sink ]             ( TRANSITIVE IMPERATIVE )                   STEM + TRANSITIVE + UNKAL
தாழ் + த்த் + உம்               [ you make something sink ]             ( TRANSITIVE IMPERATIVE )                   STEM + TRANSITIVE + UM
தாழ் + த்த் + உ                [ make something sink ]                 ( TRANSITIVE IMPERATIVE )                   STEM + TRANSITIVE + U

OPTATIVE

The structure of the Tamil optative is STEM + TRANSITIVE + CAUSATIVE + OPTATIVE MARKER. In the case of verbs without a transitive-intransitive opposition, the optative marker is added directly to the stem. Example:

தாழ் + ஑                [ may you go down ]              STEM + OPTATIVE MARKER
தாழ் + த்த் + உ + ஑      [ may you bring something down ]  STEM + TRANS + U + OPTATIVE MARKER

NON-FINITE

There are three types of non-finite verbal forms: the infinitive, the verbal participle and the relative participle.

Conditionals occur in both infinitives and verbal participles. Examples:

kaaN ( see )
kaaN + a ( to see )

naan ( I ) raamanai ( Raman + acc ) kaaNa ( to see ) pooneen ( went )
஥ான் பாநத஦ ஑ாண ப஧ாப஦ன்
I went to see Raman.

An infinitive can be intransitive, transitive, or conditional.

kaaNa ( kaaN + a )            [ to see ]             intransitive infinitive
kaatta ( kaaN + trans + a )   [ to show ]            transitive infinitive
kaaN + in                     [ if ( one ) sees ]    conditional infinitive

Note: In Modern Tamil, the conditional infinitive is not in use; the conditional verbal participle has replaced it. For example, 'யாின்' has been replaced by 'யந்தால்'.

VERBAL PARTICIPLE

One of the important formal features of the verbal participle is that it takes only past tense markers. Hereafter, Tamil words are written in Romanized form; for details of the Tamil Romanization, see the Appendix. Example:

kotu + thth + u    [ having given ]       stem + past + u ( verbal participle marker )
maRa + nth + u     [ having forgotten ]   stem + past + u ( verbal participle marker )
uN + t + u         [ having eaten ]       stem + past + u ( verbal participle marker )
oot + i + 0        [ having run ]         stem + past + 0

( Verbs which take 'in' as the past tense marker have the structure verb + tense + 0. ) The stem may be transitive or intransitive.


oti + nt + u          [ having broken ]              ( INTRANSITIVE )   STEM + PAST + VP MARKER
oti + th + th + u     [ having broken something ]    ( TRANSITIVE )     STEM + TRANS + PAST + VP MARKER
kizhi + nt + u        [ having torn ]                                   STEM + PAST + VP MARKER
kizhi + th + th + u   [ having torn something ]                         STEM + TRANS + PAST + VP MARKER

NEGATIVE VERBAL PARTICIPLE

The past tense marker is replaced by a negative marker in negative verbal participles. For example:

pook + aath + u    [ without going ]     stem + neg + vp marker
oot + aath + u     [ without running ]   stem + neg + vp marker

The negative suffix 'aath' is always followed by 'u'. 'aa' is another negative marker, which occurs in combination with the verbal participle marker 'mal'. Example:

pook + aa + mal    [ without going ]     stem + neg + vp marker
oot + aa + mal     [ without running ]   stem + neg + vp marker

CONDITIONAL VERBAL PARTICIPLE

The 'u' at the end of the verbal participle form is replaced by the suffix 'aal'. Example:

vaa + nth + aal    [ if ( one ) comes ]  stem + past + conditional vp marker

Though a verbal participle contains only a past tense suffix, its meaning is always determined by the main verb. Only verbs in the future tense can occur after a conditional verbal participle. Example:

nii vanthaal ithu natakkum [ ஥ீ யந்தால் இது ஥ெக்கும் ]
If you come, then this will happen.

RELATIVE PARTICIPLE

The structure of a relative participle is STEM + TRANSITIVE + TENSE / NEG + A ( RP MARKER ). Syntactically, the relative participle functions as an adjective. Example:

Intransitive:          va + nt + a              [ which came ]                         stem + past + rp marker
Transitive:            varu + thth + in + a     [ which made something come ]          stem + trans + past + rp
Transitive negative:   varu + thth + aath + a   [ which didn't make someone come ]     stem + trans + neg + rp

VERBAL NOUNS

The category of verbal nouns, as a word class, has a unique position within verb morphology because, like nouns, they can also take case suffixes. A verbal noun suffix is added to an intransitive or transitive verb stem. When they take the verbal noun suffixes, they do not take tense or PNG suffixes. We deal here only with the verbal noun forms ending in a verbal noun suffix. There are also other nominalized forms of verbs, such as vanthoon or vanthavan, which contain both tense and PNG suffixes and are also capable of taking case suffixes; we shall deal with such forms later. The structure of a verbal noun is:

STEM + TRANS + NEG + VERBAL NOUN SUFFIX

The following examples illustrate these structures.

varu + thal              [ coming ]                       STEM + VERBAL NOUN SUFFIX
varu + ththu + thal      [ making something come ]        STEM + TRANS + VN SUFFIX
var + aa + mai           [ not coming ]                   STEM + NEG + VN SUFFIX
varu + thth + aa + mai   [ making something not come ]    STEM + TRANS + NEG + VN
varu + kai               [ coming ]                       STEM + VN
etu + ppu                [ posture ], [ taking ]          STEM + VN
thool + vi               [ failure ]                      STEM + VN

The verbal noun suffixes can be broadly classified into six groups: al, thal, mai, kai, pu, vi.

Returning to the pronominalized verbal nouns with transitive, tense or negative and pronominal suffixes: from a verb like 'varu' it is possible to derive the pronominalized verbal noun 'varu + p + avan' [ he who will come ] ( stem + fut + png ). These forms are different from finite verbal forms such as varu + v + aan [ he will come ] ( stem + fut + png ). The translation makes it clear that in the first case the emphasis lies on the pronominal suffix and in the second case on the verb itself.
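Verbal noun formation with the six suffix groups listed above can be sketched as plain suffixation. This is romanized and illustrative only; surface forms like etuppu involve gemination that the naive concatenation below does not model.

```python
# Illustrative sketch: forming verbal nouns with the six suffix groups
# named above (al, thal, mai, kai, pu, vi). No sandhi/gemination applied.

VN_SUFFIXES = ("al", "thal", "mai", "kai", "pu", "vi")

def verbal_noun(stem, suffix):
    """Attach one of the six verbal noun suffixes to a romanized stem."""
    if suffix not in VN_SUFFIXES:
        raise ValueError("unknown verbal noun suffix: " + suffix)
    return stem + suffix

print(verbal_noun("varu", "thal"))  # -> varuthal ( coming )
print(verbal_noun("varu", "kai"))   # -> varukai  ( coming )
```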

[ he who comes ]

STEM + PRES + PNG

varu + kinR + aan

[ he comes ]

STEM + PRES + PNG

va + nth + avan

[ he who came ]

STEM + PAST + PNG

va + nth + aan

[ he came ]

STEM + PAST + PNG

varu + ththu + kinR + avan   [ he who makes something come ]   STEM + TRANS + PRES + PNG
varu + ththu + kinR + aan    [ he makes something come ]       STEM + TRANS + PRES + PNG

One of the important differences lies in the pronominal suffixes. The nominalized verbal forms take different pronominal suffixes from those found in the finite verbs.


ADJECTIVES

There are no real adjectival stems or roots in Tamil. Many adjectives are formed from noun roots, and mostly end in 'a'. Example:

nal ( root of nalla )      nalla    [ good ]
peru ( root of periya )    periya   [ big ]
ini [ root of sweetness ]  iniya    [ sweet ]
ciR                        ciRiya   [ small ]

DERIVED ADJECTIVES

They are formed by adding the relative participles 'aana' or 'aaya' to a noun. Example:

azhaku [ beauty ]   azhakaana [ beautiful ]   azhakaaya [ beautiful ]

Another type of derived adjective is formed by adding 'iya' to a noun. Example:

azhaku [ beauty ]   azhakiya [ beautiful ]
puthu [ new ]       puthiya puththakam [ new book ]


ADVERB

Most adverbs are formed by adding 'aaka' to a noun. Example:

veekam [ speed ] + aaka       ->  veekamaaka [ speedily ]
methu [ soft ] + aaka         ->  methuvaaka [ softly ]
katumai [ harshness ] + aaka  ->  katumaiyaaka [ harshly ]

There are also simple adverbial forms, such as atikkati [ frequently ], innum [ still ], ini [ hereafter ], marupati [ again ], mella [ slowly ] and nanku [ well ]. Some of the verbal participle forms are also used as adverbs.

INTERROGATIVES AND CLITICS

Clitics are added to words or heads of all syntactic categories except adjectivals and nominals functioning as noun modifiers. Clitics are bound forms: they cannot be inflected or modified with any other suffix. They have clear semantic functions such as emphasis, interrogation and collectiveness, apart from some grammatical functions such as coordination. There are five important clitics in Tamil: um, oo, aa, ee and thaan.

CLITIC 'um'

The clitic 'um' has a very high frequency in Tamil. It is used for emphasis, concessives, completeness and also for conjunction.

[ he came ]

avan + um vanththaan

[ he also came ] [ emphasis ]

avan pookalam

[ he can go ]

avan + um pookalam

[ he can also go ] [ concessive ]

elloor + um pookam

[ everyone can go ] [ completeness ]

avanum

avalum

naanum

cinimaaviRku

poonoom

[ conjunction ]

Example: He, she and I went to the film. avan [ he ] + um [ and ] aval [ she ] + um [ and ] naan [ I ] + um [ and ]

CLITIC – „thaan‟

„thaan‟ is also used for emphasis. For example,

avan ceythaan [ he did ] avan thaan ceythaan [ he only did ]

avan [ he ] neeRRu [ yesterday ] vanthaan [ came ]

[ he came yesterday ]

avan [ he ] neeRRu [ yesterday ] thaan [ only ] vanthaan [ came ] [ he came only yesterday ]

avan ooti vanththaan [ he came running ]
avan ootiththaan vanththaan [ he came indeed running ]
avan ettu maNikki varuvaan [ he will come at eight o'clock ]
avan ettu maNikku thaan varuvaan [ he will come only at eight o'clock ]


'aavathu' is morphologically a verb consisting of a verb root + future tense suffix + neuter singular suffix. It is a derived pronominalized verbal form, where the emphasis is on the pronominal suffix, and its morphological meaning is 'that which will come'. Syntactically it is used as a clitic in the meaning of an ordinal or as a concessive. For example:

iraNtu [ two ] + aavathu [ second ]   [ ordinal ]
naal [ four ] + aavathu [ fourth ]    [ ordinal ]

raman inRu vanthirukkalaam [ Raman could have come today ]
raman + aavathu inRu vanthirukkalaam [ At least Raman could have come today ]
raman [ Raman ] + aavathu inRu [ today ] vanthirukkalaam [ could have come ]   [ concessive ]

raman mathuraikku pooyirukkalaam [ Raman could have gone to Madurai ]
raman mathuraikku + aavathu pooyirukkalaam [ Raman could at least have gone to Madurai ]   [ concessive ]

'aam', also formed from the verb 'aa' plus the future tense marker 'm', is used to express ordinals; but 'aam' cannot be used as a concessive clitic. Example:

naal + aam viitu [ fourth house ]
naal [ four ] + aam viitu [ house ]
ainthu + aam kii.mi [ fifth km ]

The clitics 'aa', 'oo' and 'ee' are suffixed to words other than adjectives. 'aa' is used in informative questions, 'oo' in doubtful questions, and 'ee' in rhetorical questions.

ramanaa vanthaan [ did Rama come? ]   [ informative ]
ramanoo vanthaan [ Rama came, didn't he? ]   [ the questioner doubts whether Rama has come or not ]
raman varavillaiyee [ Rama didn't come, did he? ]   [ rhetorical ]
raman vanthaanee [ Rama came, didn't he? ]   [ rhetorical ]

Note: Just like the verb 'aa', the verb 'kuutu' also forms a word, 'kuuta', which is used in the meaning of 'also'.
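Clitic attachment as described above is plain suffixation, and clitics can stack (e.g. thaan followed by an interrogative). The sketch below is romanized and illustrative; the clitic inventory follows the five clitics listed above.

```python
# Illustrative sketch: attaching Tamil clitics to a finite word form.
# Clitics are bound forms and are simply suffixed; they can stack.

CLITICS = {"um": "also/conjunction", "thaan": "emphasis",
           "aa": "informative question", "oo": "doubt", "ee": "rhetoric"}

def add_clitics(word, *clitics):
    """Suffix one or more clitics, in order, to a romanized word."""
    for c in clitics:
        if c not in CLITICS:
            raise ValueError("unknown clitic: " + c)
        word += c
    return word

print(add_clitics("avan", "um"))            # -> avanum ( he also )
print(add_clitics("kontu", "thaan", "aa"))  # -> kontuthaanaa
```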

ASPECTUAL

Aspect is expressed by auxiliary verbs added to the verbal participle form of the main verb. We give below the aspects and the corresponding auxiliaries.

CONTINUOUS TENSE

The auxiliary verb 'koNtu' expresses a continuous action. Example:

vanthu [ come ] koNtu [ -ing ] irunth [ was ] aan [ he ]   [ he was coming ]

'koNtu + iru' represents the continuous tense.

vanthu koNtu irukkiR aan   [ he is coming ]
vanthu koNtu irupp aan     [ he will be coming ]
vanthu koNtu irunth aan    [ he was coming ]

It is possible to add a number of clitics after 'koNtu'. Example:

patiththuk [ read ] koNtu [ -ing ] + um [ also ] irukkiR [ is ] + aan [ he ]   [ he is also reading ]
vanthu koNtu + thaan irukkiR + aan    [ he is indeed coming ]
ootik koNtu + ee irukkiR + aan        [ he is continuously running ]
ootik koNtu + aa irukkiR + aan        [ is he running? ]
ootik koNtu + oo irukkiR + aan        [ he is running, isn't he? ]   [ doubt ]
ootik koNtu + ee + aa irukkiR + aan   [ is he continuously running? ]

After 'thaan' it is possible to add interrogative suffixes. Example:

ootik koNtu thaan + aa irukkiR + aan   [ is he running? ]   [ emphasis on running ]
ootik koNtu thaan + oo irukkiR + aan   [ is he running? ]   [ doubt and emphasis on running ]

ootik koNtu thaan + ee irukkiR + aan       [ he is indeed running, isn't he? ]   [ emphasis on running ]
viLaiyaatik koNtee thaanee irukkiR + aan   [ he is indeed continuously playing ]
viLaiyaatik [ play ] koNtu [ -ing ] + ee [ continuously ] thaanee [ indeed ] irukkiR [ is ] + aan [ he ]

PERFECT

This aspect is expressed by the auxiliary verb 'iru'. Example:

paarththu + irukkiR + een   [ I have seen ]   paarththu [ seen ] + irukkiR [ have ] + een [ I ]
naan mathuraikku pooy + iru + pp + een     [ I would have gone to Madurai ]
naan mathuraikku pooy + iru + ntth + een   [ I had been to Madurai ]
paarththu + iru + kkiR + aan   [ He has seen ]
paarththu + iru + ntth + aan   [ He had seen ]
paarththu + iru + pp + aan     [ He would have seen ]

DEFINITIVE

The auxiliary verb 'vit' expresses definitiveness. Example:

avan veelaiyee ceythu + vittaan   [ he has completed the work ]
naan caappittu + vitteen          [ I have finished eating ]

REFLEXIVE

The reflexive expresses an action performed for one's own benefit; the verb 'koL' expresses this meaning. Example:

naan puththakangkalai vaangkik koNteen   [ I bought the books for myself ]
avan iraNtu rupaay etuththuk koNtaan     [ He took two rupees for himself ]
naan en kaiyai vettik koNteen            [ I cut my own hand ]

TRIAL

The auxiliary verb for trial is 'paar'. Example:

caappittu paarththaan    [ He tasted ( it ) ]
yoociththu paarththaan   [ He tried to recollect ]

RESERVATIONAL

The reservational is expressed by the auxiliary verb 'vai'. Example:

avar oru itam othikki vaiththaar               [ He reserved a place ]
avaar enkkuc caappaatu othikki vaiththaar      [ He reserved food for me ]
avaar enkkuc caappaatuc camaiththu vaiththaar  [ She cooked and kept the food for me ]

MODAL

We give below the modal verbs, their meaning and distribution. mutiyum [ can ]-

The modal verb occurs after an infinitive and without png in the future

tense and with neuter singular in the other tenses. It has also a negative form. Example ungkaLukku intha veelai ceyya muti + y (sandi)+um (fut) [ you can (fut) do this work ] ungkaLukku intha veelai ceyya muti + y + aathu (neg) [ you cannot (fut) do this work ] ungkaLukku intha veelai ceyya muti + kiR + athu [ you can (present) do this work ] ungkaLukku intha veelai ceyya muti + nth + athu [ you could do this work ] „aam‟ [ may ]-

„aam‟ occurs after a verbal noun.

Example niingkaL uLLee varal + aam [ you may come inside ] 138

niingkaL [you] uLLee [inside] varal [coming] + aam [may] niingkaL inRu varamal [ negative verbal participle ] irukal [ verbal noun ]+ aam [ you may not come today ] niingkaL [you] inRu [today] varaamal [without coming] irukal []+ aam [may] [ you may not come today ] VEENTUM

'veeNtum' [ should ] follows an infinitive. If the action is forbidden, the infinitive is instead followed by 'kuutaathu'.

Example:
niingkaL uLLee vara [ Vinf ] kuutaathu [ you should not come ]
niingkaL inRu kataikku vara veeNtum [ you should come to the shop today ]
niingkaL inRu kataikku vara veeNtaam [ you need not come to the shop today ]

PERMISSIVE MODE: ttum
avan uLLee varattum [ let him come inside ]
  avan [ he ] uLLee [ inside ] vara [ come ] + ttum [ let ]

'iyalum' and 'iyalaathu' behave the same as 'mutiyum' and 'mutiyaathu'.
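The modal distribution described above (which verb form each modal attaches to) can be sketched as a small checker. This is an illustrative simplification: the English keys, function name, and space-separated output are assumptions of this appendix, not the thesis implementation.

```python
# Toy model of Tamil modal distribution: each modal requires a particular
# verb form (infinitive or verbal noun), per the description above.
MODALS = {
    "can": ("infinitive", "mutiyum"),
    "may": ("verbal_noun", "aam"),
    "should": ("infinitive", "veeNtum"),
    "should_not": ("infinitive", "kuutaathu"),
}

def attach_modal(verb_form, form_type, modal_key):
    """Attach a modal to a verb form, checking the required form type."""
    required, suffix = MODALS[modal_key]
    if form_type != required:
        raise ValueError(f"'{suffix}' requires a {required}")
    return verb_form + " " + suffix

print(attach_modal("ceyya", "infinitive", "can"))  # ceyya mutiyum
```

A transfer component could use such a check to reject ill-formed modal attachments before generation.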

D.2 TAMIL NOUN MORPHOLOGY

The noun structure is as follows:

NOUN + PLURAL SUFFIX (op) + [ OBLIQUE SUFFIX (op) + CASE SUFFIX ] (op) + CONJUNCTION / EMPHASIS (op) + INTERROGATION / EMPHATIC (op)

Examples of the nominative and oblique forms of Tamil nouns:

NOMINATIVE          OBLIQUE
maram   நபம்        marath   நபத்
naatu   ஥ாடு        naat     ஥ாட்
aaRu    ஆறு         aaR      ஆற்
katal   ஑ெல்        katal    ஑ெல்
kal     ஑ல்         kal      ஑ல்
kaal    ஑ால்        kaal     ஑ால்

ROOT: the minimal morpheme.
STEM: a root with a suffix, which may be a formative suffix. In some cases there is no difference between a root and a stem. Example: kal is both a root and a stem.
GENDER: example: thief: திருென் [ thirutan ] (masculine), திருடி [ thiruti ] (feminine).
PLURAL:

஑ள் 'kaL' is the neuter plural suffix. Example: the plural form of 'kaal' is 'kaalkaL' [ ஑ால்஑ள் ].

In the case of collective nouns such as makkaL [human beings] there is no need to add a separate plural suffix. Example: makkaL vanthanar [ நக்஑ள் யந்த஦ர் ] ( People came )
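The pieces above (oblique stems, the plural suffix, case attachment) can be sketched as a small generator. Everything here is an illustrative simplification: the oblique table is a toy stand-in and sandhi is ignored except for the stored oblique alternants; this is not the thesis's morphological generator.

```python
# Toy noun inflector following NOUN + PLURAL(op) + [OBLIQUE(op) + CASE](op).
# Forms are transliterated; only a few oblique stems are listed.
OBLIQUE = {
    "maram": "marathth",   # maram + ai -> maraththai
    "naatu": "naatt",      # naatu + ai -> naattai
    "aaRu":  "aaRR",       # aaRu  + ai -> aaRRai
}

def inflect(noun, plural=False, case=""):
    if plural:
        return noun + "kaL" + case          # e.g. kaal -> kaalkaL
    if case:
        return OBLIQUE.get(noun, noun) + case  # oblique stem before case
    return noun

print(inflect("maram", case="ai"))   # maraththai
print(inflect("kaal", plural=True))  # kaalkaL
```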

TAMIL CASE SYSTEM

CASE   ENGLISH              SIGNIFICANCE                                      SUFFIXES (Arden and others)
1st    Nominative           Subject of sentence                               Null
2nd    Accusative           Object of action                                  ai
3rd    Instrumental         Means by which action is done                     aal, aan3
3rd    Social               Association, or means by which action is done     ootu, otu, utan3
4th    Dative               Object to whom / for whom action is performed     (u)kku, (u)kkaaka
5th    Ablative of motion   Motion from an inanimate object                   il, ininru, iliruntu, iruntu
                            Motion from an animate object                     itattilirunthu
6th    Genitive             Possessive                                        (Null), in, utaiya, inutaiya, athu, aathu, a
7th    Locative             Place in which; on the person (animate);          il, itam, kan1
                            in the presence of
8th    Vocative             Addressing, calling                               (Null), e, a

In all, 27 case morphemes are there.

D.3 ORTHOGRAPHIC RULES

This section lists some of the orthographic (sandhi) rules, with examples.

(Two or more syllables) am + { Case / Postposition } → aththu + { Case / Postposition }


maram + ai → maraththu + ai → maraththai
maram + meel → maraththu meel
nam + ai → nammai (the stem does not become naththu, because nam has only one syllable)
um + ai → ummai
em + ai → emmai

tu / Ru + { Case / Postposition } → ttu / RRu + { Case / Postposition }

naatu + ai → naattu + ai → naattai
naatu + paRRu → naattu + paRRu → naattuppaRRu
aaRu + ai → aaRRu + ai → aaRRai
aaRu + pakkam → aaRRu + pakkam → aaRRuppakkam
akatu + ai → akattu + ai → akattai
akatu + pakkam → akattu + pakkam → akattup pakkam

This change happens only when the word has two or more syllables; in the case of disyllabic words, the first syllable must be long.

natu + ai → natuvai
aRu + ai → aRuvai
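The tu/Ru doubling rule and its syllable condition can be sketched in code. The long-vowel and syllable tests below are crude approximations over the transliteration, introduced here for illustration only; they are not the thesis implementation.

```python
import re

def oblique_tu_Ru(word):
    """naatu -> naattu, aaRu -> aaRRu, when the syllable condition holds:
    more than two syllables, or exactly two with a long first syllable."""
    m = re.match(r"^(.*?)(t|R)u$", word)
    if not m:
        return word
    base, c = m.groups()
    # Crude tests over the transliteration (long vowels written doubled).
    long_first = bool(re.match(r"^[^aeiouAEIOU]*(aa|ii|uu|ee|oo|ai|au)", word))
    syllables = len(re.findall(r"aa|ii|uu|ee|oo|ai|au|[aeiou]", word))
    if syllables > 2 or (syllables == 2 and long_first):
        return base + c + c + "u"
    return word

print(oblique_tu_Ru("naatu"))  # naattu
print(oblique_tu_Ru("natu"))   # natu (disyllabic, short first syllable)
```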

[Consonant] Short Vowel [ l / L ] + [Consonant] → [Consonant] Short Vowel [ R / t ] + [Consonant]

pal + kaL → paRkaL
col + kaL → coRkaL
kal + kaL → kaRkaL
muL + kaL → mutkaL
muL + kiriitam → mutkiriitam

Contrary to the rule stated above, in the following cases the change from 'l' to 'R' (and 'L' to 't') does not take place. The rule must be reformulated, or a new rule added, to accommodate both sets of cases.

kaal + kaL → kaalkaL
muungkil + kaL → muungkilkaL
kool + kaL → koolkaL
thukil + kaL → thukilkaL
mukil + kaL → mukilkaL
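One way to state the reformulation asked for above: the l/L change applies only to monosyllabic stems with a short vowel; long-vowel or polysyllabic stems keep l/L. A sketch follows; the regular expression over the transliteration is an assumption of this appendix (it handles single-consonant onsets only, not digraph onsets such as 'th'), not the thesis implementation.

```python
import re

def plural_l_L(word):
    """pal+kaL -> paRkaL, muL+kaL -> mutkaL; long-vowel or polysyllabic
    stems (kaal, muungkil) keep l/L unchanged before kaL."""
    # Monosyllable with a short vowel: optional consonant + vowel + l/L.
    short_monosyllable = re.fullmatch(r"[^aeiou]?[aeiou][lL]", word) is not None
    if short_monosyllable:
        word = word[:-1] + ("R" if word.endswith("l") else "t")
    return word + "kaL"

print(plural_l_L("pal"))   # paRkaL
print(plural_l_L("kaal"))  # kaalkaL
```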

y / r / zh / Vowel + Plosive → y / r / zh / Vowel + doubled Plosive

The plosives are k, c, th, p.

nay + kutti → naykkutti
ver + katalai → verkkatalai
pukazh + col → pukazhccol
puli + pal → pulippal

n/N+PR/t+P

pon + kutam → poRkutam
maN + kutam → matkutam
eN + kaL → eNkaL
maN + kaL → maNkaL
maan + kaL → maankaL
katan + kaL → katankaL

(As the last four examples show, the change does not take place before the plural suffix kaL.)

m + { k / c / th / p } → { ng / nj / n / m } + { k / c / th / p } (the final m assimilates to the following plosive; before p it remains m)

maram + kaL → marangkaL
arum + ceyal → arunjceyal
perum + thokai → perunthookai


karum + palakai → karumpalakai
karum + kal → karungkal
varum + kai → varungkai
maram + pakkam → maram pakkam
mara + pakkam → marappakkam
mara [ adj ] + palakai → marappalakai
mara [ adj ] + kuruvi → marakkuruvi
em + kaL → engkaL
um + kaL → ungkaL
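The nasal assimilation rule above can be sketched as follows. The digraph handling for 'th' is a simplification for transliterated input, and the function is an illustration, not the thesis implementation.

```python
# m assimilates to a following plosive: m+k -> ngk, m+c -> njc, m+th -> nth;
# before p the m is unchanged (karum + palakai -> karumpalakai).
ASSIM = {"k": "ngk", "c": "njc", "th": "nth"}

def join_m(stem, suffix):
    first = "th" if suffix.startswith("th") else suffix[0]
    if stem.endswith("m") and first in ASSIM:
        return stem[:-1] + ASSIM[first] + suffix[len(first):]
    return stem + suffix

print(join_m("maram", "kaL"))    # marangkaL
print(join_m("arum", "ceyal"))   # arunjceyal
```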

l + w (dental) → n

naal + wuuRu → naanuuRu
kaal + wimitam → kaanimitam
kaal + weruppu → kaaneruppu
vaaL + wimiththam → vaaNimiththam
col + watai → connatai or colwatai
kal + winaivu → kanninaivu
nal + wakaram → nannakaram

Exceptions:
vaal + weruppu → vaaneruppu or vaal weruppu
il + winaivu → il winaivu
pakal + watikan → pakal watikan

(Consonant) Vowel + Consonant → (Consonant) Vowel + doubled Consonant

mu + pathu → muppathu
mu + wuuru → munnuru
pu + pathu → puppathu

pu + thal → puththal

{ a / aa / u / uu / o / oo / au } + Vowel → { 1 } + v + Vowel
{ i / ii / ai / e / ee } + Vowel → { 1 } + y + Vowel

alai + ai → alaiyai
alai + um → alaiyum
ii + ai → iiyai
paNi + ootu → paNiyootu
paNi + um → paNiyum
kani + ai → kaniyai
thii + ai → thiiyai
nilaa + ai → nilaavai
amma + aal → ammavaal
ammu + ai → ammuvai
koo + ai → koovai
vee + ai → veeyai
caravana + aal → caravanavaal
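The glide-insertion rule can be sketched directly from the two vowel classes above. This is transliteration-level toy code, not the thesis generator.

```python
def glide_join(stem, suffix):
    """Insert y after front-vowel finals (i, ii, ai, e, ee) and v after the
    other vowel finals, when a vowel-initial suffix follows."""
    vowels = "aeiou"
    y_class = ("i", "ii", "ai", "e", "ee")
    if stem[-1] in vowels and suffix[0] in vowels:
        glide = "y" if stem.endswith(y_class) else "v"
        return stem + glide + suffix
    return stem + suffix

print(glide_join("alai", "ai"))    # alaiyai
print(glide_join("nilaa", "ai"))   # nilaavai
```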

{ a / i / e } + v + Vowel → { a / i / e } + vv + Vowel
{ a / i / e } + Consonant → { a / i / e } + doubled Consonant

av + uyir → avvuyir
av + itam → avvitam
av + aaRu → avvaaRu
a + patam → appatam
a + viitu → avviitu
a + pati → appati
e + pati → eppati

Here a and i are demonstrative, e is interrogative, and pati is a noun meaning 'manner'.

l + Nasal → n
L + Nasal → N

kal + malai → kanmalai
pal + malai → panmalai
muL + makutam → muNmakutam
muL + muti → muNmuti

In modern Tamil, this rule is not in use.

kuRRiyalukaram

The ultra-short 'u' comes after the plosive consonants k, c, t, th, p and R.

Consonant + Long Vowel + { k / c / th / p } + u + Vowel → Consonant + Long Vowel + { k / c / th / p } + Vowel
[Two or more syllables] + { k / c / th / p } + u + Vowel → [Two or more syllables] + { k / c / th / p } + Vowel

Consonant + Long Vowel + { t / R } + u + Vowel → Consonant + Long Vowel + { tt / RR } + Vowel
[Two or more syllables] + { t / R } + u + Vowel → [Two or more syllables] + { tt / RR } + Vowel

natu : ukaram [ thani kuRil ]
natu + ai → natuvai [ Uyir + Uyir ]
mathu + ai → mathuvai
kocu + ai → koocuvai

naatu : kuRRiyalukaram
katuku : kuRRiyalukaram
paricu : kuRRiyalukaram
akatu : kuRRiyalukaram
parunthu : kuRRiyalukaram
vaaththu : kuRRiyalukaram

katuku + ai → katukai
vaaththu + ai → vaaththai
marunthu + ai → marunthai
paampu + ai → paampai
kaathu + ai → kaathai
kaacu + ai → kaacai
naatu + ai → naattai
kaatu + ai → kaattai
kacatu + ai → kacattai
aatu + ai → aattai
aaRu + ai → aaRRai
naaRu + ai → naaRRai
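Once the t/R doubling has produced the oblique stem (naatu → naattu), the kuRRiyalukaram behaviour in the examples above reduces to dropping the final u before a vowel-initial suffix. A minimal sketch: it applies only to kuRRiyalukaram words; a thani-kuRil word like natu instead keeps its full u and takes a v-glide (natuvai), which this function does not handle.

```python
def drop_kurriyal_u(word, suffix):
    """katuku + ai -> katukai; paampu + ai -> paampai. Assumes the input
    word ends in an ultra-short u and any t/R doubling is already done
    (e.g. call with 'naattu', not 'naatu')."""
    if word.endswith("u") and suffix[0] in "aeiou":
        return word[:-1] + suffix
    return word + suffix

print(drop_kurriyal_u("katuku", "ai"))   # katukai
print(drop_kurriyal_u("naattu", "ai"))   # naattai
```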

ACCUSATIVE AND DATIVE + PLOSIVE CONSONANTS → { 1 } + Plosive doubled

avanaip paarththeen [ I saw him ]
avanukkuk kotuththeen [ I gave (it) to him ]
cinimaviRkuc cenReen [ I went to the cinema ]


APPENDIX E

E.1 LIST OF POSTPOSITIONS IN TAMIL (PARTIAL LIST)

About: paRRi ஧ற்஫ி, kuRiththu கு஫ித்து, paarththu ஧ார்த்து, nookki ப஥ாக்஑ி
Above: meelee பநப஬
Across: kuRukkee குறுக்ப஑
After: piRaku ஧ி஫கு, pin ஧ின், appuRam அப்பு஫ம்
Against: mun முன், munnaal முன்஦ால், ethithaRpool ஋திர்தாற் ப஧ால்
Along: kuuta கூெ
Alongside:
Amid: uLLee உள்ப஭, itaiyil இதெனில்
Amidst: itaiyil இதெனில், natuvil ஥டுயில்
Among: itaiyee இதெபன, itaiyil இதெனில்
Amongst: itaiyee இதெபன, itaiyil இதெனில்
Around: eeRakkuRaiya ஌஫க்குத஫ன, cuRRi சுற்஫ி
As: poola ப஧ா஬
At: il இல், ku கு
Atop: ucci உச்சி
Before: mun முன், munnaal முன்஦ால்
Behind: pin ஧ின், pinnaal ஧ின்஦ால்
Below: kiizh ஑ீழ், kiizhee ஑ீபம, atiyil அடினில்
Beneath: kiizh ஑ீழ், kiizhee ஑ீபம, atiyil அடினில்
Beside: pakkam ஧க்஑ம்
Between: itayil இதெனில்
Beyond: appaal அப்஧ால்
By: koNtu க஑ாண்டு
Despite: iruntha poothilum இருந்த ப஧ாதிலும்
Down: kizhee ஑ீபம
During: poothu ப஧ாது
Except: thavira தயிப
From: il-irunthu இல்-இருந்து, muthal முதல்
In: il – uLLa இல் – உள்஭
Inside: uLLee உள்ப஭
Into: uLLee உள்ப஭

Like: poola ப஧ா஬, maathiri நாதிாி
Mid: natu ஥டு
Near: arukee அருப஑, arukil அரு஑ில்
Next: atuththu அடுத்து
Notwithstanding: appati irunthaalum அப்஧டி இருந்தாலும், aayinum ஆனினும்
Off: appaal அப்஧ால்
On: meel பநல்
Opposite: ethiRee ஋திபப, ethiril ஋திாில்
Outside: veLiyee கய஭ிபன
Over: meel பநல்
Regarding: paRRi ஧ற்஫ி
Round: cuRRi சுற்஫ி
Since: il – irunthu இல் – இருந்து
Than: vita யிெ
Through: vazhiyaaka யமினா஑, vaayilaaka யானி஬ா஑, muulam மூ஬ம்
Throughout: muzhuvathumaaka முழுயதுநா஑
Till: varai யதப
Times: neerangkaLil ப஥பங்஑஭ில்
To: varai யதப
Toward: nookki ப஥ாக்஑ி, paarththu ஧ார்த்து
Towards: nookki ப஥ாக்஑ி, paarththu ஧ார்த்து
Under: kiizhee ஑ிபம, atiyil அடினில்
Underneath: kiizhee ஑ிபம, atiyil அடினில்
Unlike: pool allaathu ப஧ால் அல்஬ாது
Until: varai யதப, varaikkum யதபக்கும்
Up: meelee பநப஬
Upon: meeR paRRi பநற்஧ற்஫ி, miithu நீது
Via: vaziyaaka யமினா஑
With: utan உென், itam இெம்
Within: uLLee உள்ப஭
Without: veLiyee கய஭ிபன
According to: pati ஧டி
Ahead of: munnaal முன்஦ால்
Along with: utan உென்
As to:
Aside from: thavira தயிப
Because of: aathalaal ஆத஬ால், aakaiyaal ஆத஑னால்
Close to: arukil அரு஑ில், arukee அருப஑
Due to: aathalaal ஆத஬ால்

Far from: thooraththil irunthu தூபத்தில் இருந்து
Inside of: uLLee உள்ப஭
Instead of: pathilaaka ஧தி஬ா஑
Near to: arukil அரு஑ில்
Next to: atuththu அடுத்து, atuththaRpool அடுத்தாற் ப஧ால்
Out of: veLiyee கய஭ிபன
Outside of: veLiyee கய஭ிபன
Owing to: karaNamaaka ஑பணநா஑
Prior to: munnaal முன்஦ால்
Pursuant to: thotarnthu கதாெர்ந்து
Subsequent to: atuththu அடுத்து
As far as: varai யதப
As well as: kuuta கூெ
By means of: koNtu க஑ாண்டு
In accordance with: pati ஧டி
In addition to: meel பநல்
In front of: munnaal முன்஦ால், ethithaRpool ஋திர்தாற் ப஧ால்
In spite of: irunthum இருந்தும்
In place of: pathilaaka ஧தி஬ா஑
On account of: kaaraNamaaka ஑ாபணநா஑
On behalf of: pathilaaka ஧தி஬ா஑
On top of: meelaaka பந஬ா஑
With regard to: otti எட்டி
With reference to: vaiththu தயத்து, paarvai ஧ார்தய
In case of: irunthaal இருந்தால், enRaal ஋ன்஫ால்
Ago: munnaal முன்஦ால்
Apart: thavira தயிப, puRampaka பு஫ம்஧ா஑
Away: appaal அப்஧ால், thuuraththil தூபத்தில்
Hence: aathalaal ஆத஬ால், aakaiyal ஆத஑னால்
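For use in transfer, the list above amounts to a one-to-many lookup from English prepositions to candidate Tamil postpositions. A tiny illustrative excerpt follows, with transliterated entries copied from the list; the data structure and function name are assumptions of this appendix, not the thesis's transfer component.

```python
# Partial English-preposition -> Tamil-postposition lookup (transliterated).
POSTPOSITIONS = {
    "about": ["paRRi", "kuRiththu", "paarththu", "nookki"],
    "near": ["arukee", "arukil"],
    "with": ["utan", "itam"],
    "instead of": ["pathilaaka"],
    "than": ["vita"],
}

def tamil_postpositions(prep):
    """Return the candidate Tamil postpositions for an English preposition."""
    return POSTPOSITIONS.get(prep.lower(), [])

print(tamil_postpositions("About"))  # ['paRRi', 'kuRiththu', 'paarththu', 'nookki']
```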


PUBLICATIONS

Sribadri Narayanan R, Saravanan S, and Soman K P, "Data Driven Suffix List and Concatenation Algorithm for Telugu Morphological Generator," International Journal of Engineering Science and Technology, vol. 3, no. 8, pp. 6712-6717, August 2011.

Ramasamy Veerappan, Antony P J, Saravanan S, and Soman K P, "A Rule Based Kannada Morphological Analyzer and Generator using Finite State Transducer," International Journal of Computer Applications, pp. 45-52, August 2011.

Hemant Darbari, Anuradha Lele, Aparupa Dasgupta, Priyanka Jain, and Saravanan S, "EILMT: A Pan-Indian Perspective in Machine Translation," in Proceedings of the Tamil Internet Conference, Coimbatore, Tamil Nadu, 2010.

Saravanan S, Menon A G, and Soman K P, "Pattern Based English-Tamil Machine Translation," in Proceedings of the Tamil Internet Conference, Coimbatore, India, 2010, pp. 295-299.

Menon A G, Saravanan S, Loganathan R, and Soman K P, "Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach," in Proceedings of the Tamil Internet Conference, Cologne, Germany, 2009, pp. 239-243.
