PATTERN BASED ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM
A Thesis submitted for the degree of
Master of Science (by research) in the School of Engineering
By SARAVANAN. S
Centre for Excellence in Computational Engineering Amrita School of Engineering Amrita Vishwa Vidyapeetham University Coimbatore – 641112
March, 2012
Amrita School of Engineering Amrita Vishwa Vidyapeetham, Coimbatore – 641112
BONAFIDE CERTIFICATE
This is to certify that the thesis entitled “PATTERN BASED ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM” submitted by SARAVANAN. S (Reg. No.: CB.EN.M*CEN09008) for the award of the degree of Master of Science (by research) in the School of Engineering, is a bonafide record of the research work carried out by him under my guidance. He has satisfied all the requirements put forth for the project and has completed all the formalities regarding the same to the fullest of my satisfaction.
Ettimadai, Coimbatore. Date:
DR. K P SOMAN RESEARCH GUIDE AND HEAD, CEN.
Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore – 641112 Centre for Excellence in Computational Engineering.

DECLARATION
I, SARAVANAN. S (REG. NO.: CB.EN.M*CEN09008), hereby declare that this thesis entitled PATTERN BASED ENGLISH TO TAMIL MACHINE TRANSLATION SYSTEM is the record of the original work done by me under the guidance of Dr. K P Soman, Head, Centre for Excellence in Computational Engineering, Amrita School of Engineering, Coimbatore, and that, to the best of my knowledge, this work has not formed the basis for the award of any degree / diploma / associateship / fellowship or a similar award to any candidate in any University.
Place: Ettimadai Date: Signature of the Student Countersigned by
K P SOMAN PROFESSOR AND HEAD, CEN, AMRITA VISHWA VIDYAPEETHAM, COIMBATORE.
ACKNOWLEDGEMENTS

I would like to thank all the minds that helped to mould me into what I am now. For all those hands that crafted me and helped me complete this thesis, I am at a loss for words; all I can say is 'thanks', the only word I can recollect from my mental lexicon.
I am deeply indebted to my thesis advisor, the most energetic person I know, Prof. K. P. Soman, from whom I learnt 'how things are to be learned in the way they are to be learned'. Without the inspiration I drew from him, I would not have pursued a research career I had never dreamt of. I sincerely thank him, touted as a hero in my university circle, for the pats on my shoulder whenever necessary and for being a good critic.
Though the motivation and resources sufficed to get going, I stumbled and struggled, being naïve especially in the area of linguistics. I am fortunate to have learnt from and worked with Prof. A. G. Menon (University of Leiden), who transformed me from a knowing-the-language being into a knowing-the-language-scientifically being. The day-long discussions and lessons, delivered with a parental touch, definitely put some character into me. I am very grateful to him for being a great well-wisher and mentor, and for everything.
I would like to thank Dr. C. J. Srinivasan (University of Edinburgh), who formally introduced NLP to me, and Dr. Hemant Darbari (Executive Director, CDAC-Pune), who groomed my skills in MT.
My special thanks to the three Horsemen of the CEN department: Mr. Loganathan Ramasamy (Charles University), Mr. Vijaya Krishna Menon (Caledonian College of Engineering, Muscat) and Mr. Rakesh Peter. They indulged and molded me. The numerous hour-long debates and discussions not only helped me develop academic skills but also worldly knowledge, which later became the seed of my own ideologies and my perception of the universe as a whole.
I would also like to thank all-rounder Mr. Ajith, creator of the English-Tamil transliteration tool integrated with the MT system, for providing me the non-licensed version of the tool; trouble-shooter Mr. Sajan for helping me configure the online version of the translation system; shiv-monster Mr. Shivapratap for re-charging me with his preposterous, funny, no-one-dares-to-do antics; and gui-designer Mr. Senthil for developing the GUI for the standalone version.
This thesis would not have been possible without the linguistic data. I would like to thank the contributors: Mrs. Meena Sukumaran for the English-Tamil proper-name and place-name transliteration parallel corpora; Mrs. Dhanalakshmi, Mr. V. Anand Kumar, Mr. C. Murugeshan, Mrs. Kaveri, Mr. Harshawardhan and Mr. Rama Samy for English-Tamil lexicon data creation; and Miss. Mridula Sara Augstine and Mrs. Dhanya for English-Malayalam lexicon creation. Last but not least, I would like to thank the CEN department and its faculty and research fellows for their unconditional support.
ABSTRACT

Native languages all over the world are growing rapidly along with the growth of technology in general and information technology in particular. On the one hand, the world experiences growth in native languages; on the other hand, precious and nascent information comes through foreign languages. The demand for making this information available in native languages is increasing, so we need an efficient and practical method to fulfill it. This thesis describes one such method: a pattern-based English to Tamil Machine Translation system. The method employs a pattern-based reordering mechanism and a dependency-to-morphological-feature transformation method. Though the thesis elucidates an English to Tamil translation system, the methodology can be extended to other Dravidian languages. Like any conventional rule-based system, the developed system parses the input sentence, reorders it to obtain the target phrasal structure, replaces the words of the sentence with their target equivalents, and finally synthesizes each target word to get the complete word form. Though various modules are involved in the translation process, the assiduous research effort invested in the morphological analyzer and synthesizer paid off, and this is reflected in their performance. The corroboratory inferences from the results emphasize the need for a morphological synthesizer for agglutinative languages, a translation engine for translating English to Dravidian languages, and a better dependency-to-morphological-feature transformation method. The accuracy of the morphological analyzer is 88%. On a 1-to-5 scoring scale, the system scores 3 for 63% of the sentences and 4 or 5 for 14% of the sentences tested from the EILMT Tourism corpus.
TABLE OF CONTENTS

1 INTRODUCTION ... 1
1.1 MACHINE TRANSLATION FOR DUMMIES ... 2
1.2 IS IT WORTH GOING MT WAY? ... 3
1.3 PREVIOUS WORK ON ENGLISH-TAMIL MT ... 3
1.4 OVERVIEW OF CHAPTERS ... 4
2 THEORETICAL BACKGROUND ... 6
2.1 APPROACHES ... 6
2.2 RULE-BASED MT ... 6
2.3 INTER-LINGUAL MT ... 7
2.4 TRANSFER-BASED MT ... 7
2.5 STATISTICAL MT ... 7
2.6 CONTEXT FREE GRAMMARS ... 8
2.7 PROBABILISTIC CFG (PCFG) ... 9
2.8 PARSING ... 9
2.9 SYNCHRONOUS CONTEXT FREE GRAMMAR ... 10
2.10 SYNCHRONOUS TREE ADJOINING GRAMMAR ... 11
3 SYSTEM OVERVIEW ... 16
4 RULE BASED WORD REORDERING ... 27
4.1 TAMIL SENTENCE STRUCTURE ... 27
4.2 ENGLISH STRUCTURE TO TAMIL STRUCTURE ... 27
4.3 PATTERN BASED CFG FOR REORDERING ... 28
4.3.1 DISSECTION OF THE REORDERING RULE ... 29
4.3.1.1 SOURCE PATTERN ... 29
4.3.1.2 TARGET PATTERN ... 29
4.3.1.3 TRANSFER LINK ... 30
4.3.1.4 REORDERING RULE ... 31
4.3.2 REORDERING ALGORITHM – SIMPLIFIED VERSION ... 31
4.3.3 ARCHITECTURE ... 35
4.3.3.1 REORDERING EXAMPLE ... 35
4.3.3.2 LIMITATIONS OF THE REORDERING RULES ... 36
4.3.3.3 HANDLING PREPOSITION ... 37
4.3.4 HANDLING COPULA ... 39
4.3.5 HANDLING AUXILIARY ... 39
4.3.6 HANDLING RELATIVE PRONOUNS ... 39
5 PHRASAL STRUCTURE VIEWER ... 40
5.1 NEED ... 40
5.2 INTRODUCTION ... 40
5.3 ARCHITECTURE OF P-VIEWER ... 41
5.4 TREE RENDERING ... 42
5.5 REORDERING RULE FORMAT ... 43
5.6 FILTERED SEARCH ... 44
5.7 SCREEN SHOT OF THE P-VIEWER TOOL ... 46
5.8 REORDERING RULE EDITOR ... 47
5.9 DICTIONARY EDITOR ... 47
5.10 MULTIPLE SENTENCE AND WORD OPTIONS ... 48
5.11 MORPH SYNTHESIZER FEATURES INFORMATION ... 48
5.12 STEP BY STEP PROCEDURE TO CREATE A NEW RULE IN P-VIEWER ... 49
6 MAPPING OF DEPENDENCY TO MORPHOLOGICAL FEATURE ... 52
6.1 INTRODUCTION AND WHY NEED FEATURE EXTRACTOR ... 52
6.2 STANFORD TYPED DEPENDENCIES ... 54
6.3 MORPHOLOGICAL FEATURE INFORMATION FROM DEPENDENCY RELATIONS ... 55
6.4 INFORMATION FOR VERB ... 57
6.5 COMPUTING AUXILIARY INFORMATION ... 59
6.6 HANDLING COPULA SENTENCES ... 61
6.7 COPULA: MORE ISSUES ... 62
6.8 HANDLING DATIVE VERBS ... 62
6.9 HANDLING MULTIPLE SUBJECTS ... 65
7 MORPHOLOGICAL ANALYZER AND SYNTHESIZER ... 66
7.1 NEED OF SYNTHESIZER FOR MACHINE TRANSLATION ... 66
7.2 WHY MORPHOLOGICAL ANALYZER ... 68
7.3 INTRODUCTION ... 68
7.4 MORPH ANALYZER AND SYNTHESIZER – SYSTEM ARCHITECTURE ... 70
7.4.1 NOUN LEXICON ... 71
7.4.2 MORPHOTACTICS MODEL ... 71
7.4.3 ORTHOGRAPHIC MODEL ... 72
7.5 BUILDING A SIMPLIFIED MORPH ANALYZER AND SYNTHESIZER ... 73
7.5.1 FILE FORMAT FOR FSM TOOL KIT ... 74
7.5.2 FSM TOOLKIT COMMANDS ... 76
7.5.3 FST MODEL FOR MORPHOTACTICS RULE OF NOUN (SIMPLIFIED VERSION) ... 77
7.5.4 FST MODEL FOR ORTHOGRAPHIC RULE ... 78
7.5.5 TWO LEVEL MORPHOLOGY WITH AN EXAMPLE WORD ... 79
8 EXPERIMENTS AND RESULTS ... 83
8.1 DATA FORMATS ... 83
8.2 TRANSFER RULES ... 83
8.3 DEPENDENCY TO MORPH MAPPING ... 84
8.4 AUXTENSE TO MORPH MAPPING ... 85
8.5 NOUN LEXICON ... 86
8.6 TRANSLATION: STEP-BY-STEP PROCESS ... 86
8.7 TESTING ... 87
8.7.1 TESTING RESULTS: MORPHOLOGICAL ANALYZER AND SYNTHESIZER ... 88
8.7.2 CONTRIBUTIONS ... 90
9 SCREEN SHOTS ... 94
10 LIMITATIONS AND FUTURE WORK ... 101
11 CONCLUSION ... 104
REFERENCES ... 105
APPENDIX A ... 108
A.1 TAMIL TRANSLITERATION SCHEME ... 108
APPENDIX B ... 109
B.1 REORDERING RULES ... 109
APPENDIX C ... 113
C.1 TENSE-MORPHOLOGICAL FEATURES LOOKUP TABLE ... 113
APPENDIX D ... 116
D.1 TAMIL VERB MORPHOLOGY ... 116
D.2 TAMIL NOUN MORPHOLOGY ... 139
D.3 ORTHOGRAPHIC RULES ... 141
APPENDIX E ... 148
E.1 LIST OF POST POSITIONS IN TAMIL (PARTIAL LIST) ... 148
PUBLICATIONS ... 151
LIST OF FIGURES

FIG. 2.1. A PARSE TREE FOR 'BEAUTIFUL GIRL' ... 9
FIG. 2.2. SYNCHRONOUS CFG DERIVATIONS ... 11
FIG. 2.3. ELEMENTARY TREES (INITIAL TREES) ... 11
FIG. 2.4. REWRITTEN ELEMENTARY TREES ... 12
FIG. 2.5. ELEMENTARY TREE REQUIRED FOR THE EXAMPLE SENTENCE ... 13
FIG. 2.6. TAG DERIVATION ... 14
FIG. 2.7. SYNCHRONOUS TAG ... 15
FIG. 3.1. TRANSLATION SYSTEM - BLOCK DIAGRAM ... 17
FIG. 3.2. ANNOTATION OF THE INPUT SENTENCE ... 18
FIG. 3.3. PARSE TREE (RAM GAVE HIM A BOOK) ... 18
FIG. 3.4. REORDERING RULE ... 19
FIG. 3.5. TRANSFORMATION OF PARSE TREE TO REORDERED TREE ... 20
FIG. 3.6. FLATTENING OF TREE ... 21
FIG. 3.7. SYNTHESIZING OPERATION ... 22
FIG. 3.8. TRANSLATED OUTPUT ... 23
FIG. 3.9. DEPENDENCY RELATION TO MORPH FEATURES ... 24
FIG. 4.1. SOURCE RULE ... 29
FIG. 4.2. TARGET RULE ... 30
FIG. 4.3. REORDERING RULE ... 31
FIG. 4.4. PARSE TREE ... 33
FIG. 4.5. REORDERING RULES ... 33
FIG. 4.6. APPLICATION OF RULE R1 AND R2 ... 34
FIG. 4.7. REORDERING ARCHITECTURE ... 35
FIG. 4.8. TRANSFORMATION OF PARSE TREE TO REORDERED TREE ... 36
FIG. 4.9. REORDERING RULES ... 38
FIG. 4.10. PARSE AND REORDER STRUCTURE ... 38
FIG. 5.1. ARCHITECTURE OF P-VIEWER ... 42
FIG. 5.2. FILTERED SEARCH ... 45
FIG. 5.3. SCREEN SHOT OF P-VIEWER ... 46
FIG. 5.4. SCREEN SHOT OF REORDERING RULE EDITOR ... 47
FIG. 5.5. SCREEN SHOT OF DICTIONARY EDITOR ... 48
FIG. 5.6. P-VIEWER CREATION OF NEW RULE ... 49
FIG. 5.7. P-VIEWER: OUTPUT WITH THE NEW RULE ... 51
FIG. 6.1. FLATTENING OF TREE AND REPLACING WORDS WITH TARGET LEXICON ... 52
FIG. 6.2. MORPH FEATURE INFO TO SYNTHESIZER ... 54
FIG. 6.3. DEPENDENCY TREE: RAM GAVE HIM A BOOK ... 55
FIG. 6.4. SL TO TL DEPENDENCY RELATIONS TRANSFORMATION ... 56
FIG. 6.5. FEATURE EXTRACTION ... 59
FIG. 6.6. REORDERING: SENTENCE WITH POSSESSIVE VERB ... 64
FIG. 6.7. PHRASAL STRUCTURE (HE HAS TWO BOOKS) ... 64
FIG. 7.1. MORPHOLOGICAL ANALYZER AND SYNTHESIZER - BLOCK DIAGRAM ... 70
FIG. 7.2. TRANSDUCER FOR MORPHOTACTICS RULE ... 71
FIG. 7.3. TRANSDUCER FOR MORPHOTACTICS RULE - LEXICON LESS MODEL ... 72
FIG. 7.4. TRANSDUCER FOR ORTHOGRAPHIC RULE FOR LEXICONLESS MODEL ... 72
FIG. 7.5. TAMIL NOUNS: FSM REPRESENTATION ... 73
FIG. 7.6. TRANSDUCER FOR TAMIL NOUN INFLECTION ... 78
FIG. 7.7. TRANSDUCER FOR V/Y INSERTION RULE ... 79
FIG. 7.8. TRANSDUCER FOR MORPHOTACTICS RULE ... 79
FIG. 7.9. TRANSDUCER FOR ORTHOGRAPHIC / SPELLING RULE ... 80
FIG. 7.10. INPUT WORD IN FINITE-STATE REPRESENTATION ... 80
FIG. 7.11. INTERMEDIATE STAGE FSA ... 80
FIG. 7.12. FST FOR SYNTHESIZED WORD ... 81
FIG. 7.13. FLOW GRAPH OF MORPH SYNTHESIZER ... 81
FIG. 8.1. TRANSFER RULES DB ... 84
FIG. 8.2. DEPENDENCY-MORPH FEATURE DB ... 85
FIG. 8.3. AUXTENSE-MORPH FEATURES DB ... 85
FIG. 8.4. NOUN LEXICON ... 86
FIG. 9.1. GUI OF ENGLISH-TAMIL MT SYSTEM (STAND ALONE VERSION) ... 94
FIG. 9.2. DICTIONARY PANEL AND MORPH SYNTHESIZER PANEL ... 95
FIG. 9.3. GUI: TAMIL MORPH ANALYZER AND GENERATOR ... 96
FIG. 9.4. GUI: MALAYALAM MORPH ANALYZER AND GENERATOR ... 96
FIG. 9.5. GUI: ENGLISH-MALAYALAM MT SYSTEM ... 97
FIG. 9.6. ENGLISH-TAMIL MT SYSTEM (ONLINE VERSION) ... 98
FIG. 9.7. MORPH ANALYZER AND SYNTHESIZER (ONLINE VERSION) ... 99
LIST OF TABLES

TABLE 4.1. POS LABELS ... 28
TABLE 5.1. FORMAT OF REORDERING RULE IN DB ... 44
TABLE 6.1. DEPENDENCY RELATIONS ... 56
TABLE 6.2. LEXICON ... 57
TABLE 6.3. PERSON NUMBER GENDER FEATURE ... 58
TABLE 6.4. AUXTENSE-MORPH FEATURE LOOKUP ... 60
TABLE 7.1. NOUN INFLECTIONS ... 67
TABLE 8.1. TRANSLATION RESULTS ... 87
TABLE 8.2. TESTING RESULTS OF MORPH ANALYZER AND SYNTHESIZER ... 88
TABLE 8.3. IMPLEMENTATION DETAILS ... 91
TABLE 8.4. DEPENDENCY TO MORPHOLOGICAL FEATURE MAPPING ... 92
TABLE 8.5. REORDERING RULES ... 92
TABLE 8.6. NUMBER OF WORDS USED IN MORPH ANALYZER AND SYNTHESIZER MODEL ... 93
TABLE 8.7. NUMBER OF RULES IN MORPH ANALYZER AND SYNTHESIZER ... 93
CHAPTER 1

1 INTRODUCTION

Language is not only a means of communication; it shapes our culture and, in fact, influences the human thought process. It is an important element of culture, and through language a culture can be learned and preserved. Native languages all over the world are growing rapidly along with the growth of technology in general and information technology in particular. On the one hand, the world experiences growth in native languages; on the other hand, precious and nascent information comes through foreign languages. Knowledge of the mother tongue alone is no longer enough to follow the information supplied in other languages. Because of this ever-increasing gap and the speed with which information is supplied, the death knell may sound for many native languages. Recent research shows that language death has accelerated to a rate of two languages per month [1]. It is necessary to bridge this gap with the help of modern technologies as early as possible, thereby making the information supplied in other languages available in the native language. This slows the shrinking of the native-speaker base, helps to save the mother tongue, and in turn helps to preserve our culture. Even though it is impossible to stop a culture from evolving, an effort to preserve it in some way or another is appreciated and well received by some cultural groups.

This notion incited me to work on Machine Translation (MT). The mind-blowing applications of MT and its potential and future impact eventually led me to develop an MT system from English to Tamil, one of the longest-surviving classical languages, possibly the world's oldest surviving one, rich in literature and culturally significant. The aim of this project is to translate an English input sentence into a Tamil sentence that is as close as possible to a human translation, and to obtain a comprehensible translated Tamil sentence.
This project primarily focuses on the development of an MT system that translates English to Tamil. Depending on the success of the prototype, the approach employed can be extended to English-Dravidian language pairs such as Malayalam, Kannada and Telugu. The project also focuses on the morphological analyzer and synthesizer, which are among the important components of the MT system, and on the various other MT tools that aid linguists, such as a framework for developing heuristics at various levels of the MT system and for testing and improving the system.

¹ http://www.commonsenseadvisory.com/Default.aspx?Contenttype=ArticleDetAD&tabID=63&Aid=1207&moduleId=391
1.1 MACHINE TRANSLATION FOR DUMMIES
The material in this dissertation is not recommended for cubs. Aspirant cubs should commence from other standard materials; but for the sake of not disappointing them, here comes an elucidation of MT in a few words. MT is the translation of one natural language to another natural language, mechanically. The simplest method for automatic translation that pops up in a naive mind is word-to-word translation: each word in the Source Language (SL) input sentence is translated to the Target Language (TL). Each SL word would be fed into an exhaustive bilingual dictionary lookup, which provides the target language word equivalences. In the real world, languages aren't simple enough to translate SL to TL using the word-to-word method. The bilingual dictionary lookup provides target equivalences for the root or stem words (e.g. boy, give) of the SL and not for inflected forms (e.g. boys, played, playing), so the word-to-word translation fails. What if we had a mechanism that converts the inflected forms (boys, etc.) to the root form (boy), enabling direct word-to-word translation? Fortunately, we have one, and that mechanism is called a stemmer. The stemmer rips off the inflections (e.g. "s" from boys) and outputs the root / stem form (boy). This approach is called a direct translation system. The information provided here should suffice for the aspirant cubs to spring up from the standard materials, because the other MT approaches can't be explained without heavy MT terminology the cubs aren't familiar with. Wait a minute. Do you know what a parser is (just the basic stuff) and what it does? Do you know what a part-of-speech tagger is and what it does? Do you know what a morphological synthesizer/generator is and what it does? If the answer is yes to all those questions, the system overview chapter may provide some more ideas for the rookies.
Give it a try.
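The stemmer-plus-lookup pipeline just described can be sketched in a few lines. The suffix list and the romanized bilingual lexicon entries below are illustrative assumptions, not the system's actual resources.

```python
# A minimal sketch of the direct (word-to-word) translation approach:
# strip a known inflectional suffix, then look the stem up in a
# bilingual dictionary. All entries here are illustrative.

SUFFIXES = ["ing", "ed", "s"]  # tried in order, longest first

BILINGUAL_LEXICON = {          # root word -> romanized Tamil equivalent
    "boy": "paiyan",
    "give": "kotu",
    "play": "vilaiyAtu",
}

def stem(word):
    """Rip off a known inflectional suffix and return the stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def direct_translate(sentence):
    """Translate word by word via stemming plus bilingual lookup."""
    out = []
    for word in sentence.lower().split():
        root = stem(word)
        out.append(BILINGUAL_LEXICON.get(root, word))  # keep OOV words as-is
    return " ".join(out)
```

Note that a crude suffix-stripper like this cannot handle irregular forms such as "gave"; that is exactly the gap the morphological analyzer of chapter 7 fills.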
1.2 IS IT WORTH GOING THE MT WAY?
To begin from the very beginning, the philosophical arguments and history trace back to the seventeenth century. It's a pretty long way from the seventeenth century to the latest Google online translation services. A couple of issues from that history are enough to show why the MT way is worth discussing; anyone who hates to have a blank in the history may refer to MT history resources² to fill the gap. In 1954, the Georgetown experiment [1] gave promising automatic machine translation results for over sixty Russian sentences translated to English [2]. The team behind the experiment claimed that the machine translation problem is solvable and could be solved within three to five years. From 1954 to the present³, several five-year spans have passed, but the problem is still wide open, and many researchers are trying to bring this monstrous artificial intelligence task down to a solvable problem. Though a few methodologies have found some success over the years, the problem is not solved completely. Some of the major players worth mentioning here are SysTran [3], the Japanese MT systems [4], [5], EUROTRA, AltaVista's Babel Fish (which uses SysTran technology) and Google Language Tools (which initially used SysTran technology). Attacking an unsolved problem is itself worth doing. The European Union spends over 1 billion Euros annually⁴ to make its official documents available in all 23 official languages for its 27 member states. This shows how important it is to tackle the challenges in MT as early as possible.
1.3 PREVIOUS WORK ON ENGLISH-TAMIL MT
Though numerous works have been reported in MT using various methodologies, very little work has been done or reported for the Dravidian languages (Tamil, Kannada, Telugu, Malayalam, etc.). Recently, Google released an alpha version of its online MT services for Tamil, Kannada and Telugu. Google uses the Statistical Machine Translation (SMT) approach⁵. The quality of the translation is not bad for simple and frequently used sentences. Mostly, the translation output
² History of MT: http://www.hutchinsweb.me.uk/Nutshell-2005.pdf
³ Present Status of Automatic MT: http://www.mt-archive.info/Bar-Hillel-1960-App3.pdf
⁴ EU spends 1 billion Euro on language services: http://www.independent.co.uk/news/world/europe/cost-intranslation-eu-spends-83641bn-on-language-services-407991.html
⁵ Google SMT: http://translate.google.com/about/
is comprehensible, even though long sentences have issues with word ordering and the morphological generation of the Tamil words. Renganathan (2002) [6] demonstrated a functional English-Tamil rule-based MT system with a limited set of words and rules; no further work has been reported since. Germann (2001) [7] reported an SMT system trained on a 5000-sentence parallel corpus. Most of the research and development in Tamil NLP has been reported by the AU-KBC research centre, which reported a prototype English-Tamil MT system; its performance is unknown and it is not available for testing. The English to Indian Language Machine Translation (Anuvadaksh EILMT⁶) consortium of 8 academic institutions and 2 government organizations, funded by the Department of IT, India, focuses on developing domain-specific (tourism and health domain) MT systems [8] for 6 language pairs including English to Tamil. Amrita Vishwa Vidyapeetham is the consortium member that looks after the English-Tamil MT system along with CDAC, Pune (leader of the consortium). Though the consortium planned to release four versions of the English-Tamil MT system, based on LTAG⁷ [9], SMT [10], EBMT [11] and Anal-Gen [12], only the LTAG-based English-Tamil MT was released at the end of the first phase of the project. This is the first ever viable English-Tamil MT system released to the public. Amrita Vishwa Vidyapeetham is currently working on an English-Tamil MT system funded by the Ministry of Human Resource Development (MHRD)⁸. The system is available on the university's website⁹ for testing by closed groups, even though it has not been officially launched by MHRD.
1.4 OVERVIEW OF CHAPTERS
CHAPTER 2 briefs the necessary theory. It introduces various approaches to MT and, in particular, rule-based MT. CHAPTER 3 gives a general overview of the machine translation (MT) system. This chapter summarizes the necessary components of the MT system that will be explained in greater
⁶ Anuvadaksh online translation service: http://tdil-dc.in/components/com_mtsystem/CommonUI/homeMT.php
⁷ XTAG Report: http://www.cis.upenn.edu/~xtag/
⁸ MT work at Amrita: http://www.amrita.edu/pages/research/projects/cb-pr-38.php
⁹ Translation Demo: http://nlp.amrita.edu:8080/AMriTs/
detail in the following chapters. The objective of this chapter is to give the reader an idea of the developed MT system as a whole, with the flow of the system. CHAPTER 4 introduces the pattern-based word order transformation methodology for reordering the words in the sentence. The format of the reordering rules and their application on the Source Language (SL), transforming the SL word order to the TL word order, is explained in great detail. CHAPTER 5 introduces the phrasal structure viewer tool, a framework that assists the linguist in creating word reordering rules with visual aid and empowers the linguist to do a thorough analysis of SL and TL with the help of the visual tree representation. CHAPTER 6 discusses the mapping of the dependency feature information to the morphological feature information; the intricacies involved in the feature mapping are elucidated there. CHAPTER 7 gives a brief introduction to Tamil morphology and the finite state machine toolkit, and details the development of the morphological analyzer and synthesizer using finite state transducers.
CHAPTER 2
2 THEORETICAL BACKGROUND
The process of translation is described simply as a two-stage process: decoding the meaning of the source language (SL) text and re-encoding that meaning in the target language (TL). The process is not as simple as the description makes it seem. The word 'decoding' connotes interpreting and analyzing all the features of the SL text, a process that requires a profound knowledge of the grammar, semantics, syntax, etc. of the SL. Similarly, the re-encoding process requires sound knowledge of the TL. The challenge of automatic MT is how to teach a computer to do what human beings do: understand the SL and create a new TL text based on it. This problem is approached in a number of ways, mainly categorized as the linguistic approaches and everything else; the latter involve little or no linguistic knowledge.
2.1 APPROACHES
Those who believe that the Natural Language Understanding (NLU) problem must be solved before MT go the linguistic way (the rule-based methods); others go the statistical way (Statistical Machine Translation).
2.2 RULE-BASED MT
The rule-based methods generally parse a text (analyzing its grammar, semantics, etc.) and create an intermediate representation from which the text in the TL is generated. Depending on the intermediate representation, the method is described as inter-lingual MT or transfer-based MT.
2.3 INTER-LINGUAL MT
The SL text to be translated is transformed into the inter-lingua, an abstract language-independent representation. The TL text is then generated from that language-independent representation.
2.4 TRANSFER-BASED MT
The notion of having an intermediate representation that captures the meaning of the original sentence in order to generate the correct translation is the same for inter-lingual and transfer-based MT. Even though both follow the same pattern, using linguistic rules to obtain the intermediate representation from the SL text, they differ in the intermediate representation itself: the inter-lingua's representation must be language independent, whereas in transfer-based MT it depends on the source and target language pair.
2.5 STATISTICAL MT
The idea of seeing translation as if the SL text were the TL text encrypted, and of solving the problem of decrypting it, comes from information theory. SMT is an MT paradigm where translations are generated on the basis of statistical models whose parameters are derived from the analysis of bilingual text corpora. The translation of SL to TL is done according to the probability distribution prob(TL|SL). Bayes' theorem is used to model the probability distribution:

prob(TL|SL) ∝ prob(SL|TL) * prob(TL)

where the translation model prob(SL|TL) is the probability that the SL string is the translation of the TL string, and the language model prob(TL) is the probability of seeing that target language string. Finding the best translation is done by choosing the one that gives the highest probability.
TL^ = argmax_{TL ∈ TL*} prob(SL|TL) * prob(TL)
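The decision rule above can be illustrated with a toy decoder that scores a handful of candidate target strings. The candidates and all probabilities below are made-up numbers, purely for illustration.

```python
# Toy illustration of the noisy-channel decision rule: pick the TL
# candidate maximizing prob(SL|TL) * prob(TL). All numbers are invented.

candidates = {
    # TL candidate: (translation model prob(SL|TL), language model prob(TL))
    "naan oru puTTakam kotuTTEn": (0.40, 0.050),
    "naan kotuTTEn oru puTTakam": (0.40, 0.002),  # same TM score, bad order
    "oru naan puTTakam kotuTTEn": (0.10, 0.001),
}

def decode(cands):
    """Return the candidate with the highest prob(SL|TL) * prob(TL)."""
    return max(cands, key=lambda tl: cands[tl][0] * cands[tl][1])
```

The language model term is what penalizes the badly ordered candidate even when the translation model scores it equally well.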
2.6 CONTEXT FREE GRAMMARS
A context free grammar (CFG) is a formal grammar that consists of a set of production rules. It is also called Phrase-Structure Grammar [13]. The production rules describe how the symbols of the language can be grouped and ordered together. For example, the following productions express that a noun phrase can be composed of either a proper noun or an adjective followed by a noun. These can be hierarchically embedded.

NP → Adjective Noun   (R1)
NP → ProperNoun       (R2)
Adjective → beautiful (R3)
Noun → girl           (R4)
The lexical items 'beautiful' and 'girl' are called terminal symbols. The symbols that express the grouping of these terminal symbols are called non-terminals. In a production rule, the item to the left of the arrow '→' is a non-terminal symbol, and to the right of the arrow is a list of one or more terminals or non-terminals. A CFG is used to describe the structure of the sentences and words in a language; it provides a mechanism for describing how phrases are built from smaller blocks. Read the arrow '→' as: rewrite the symbol on the left with the set of symbols on the right. The symbol 'NP' can be rewritten as 'Adjective Noun' as in rule R1, or alternatively as 'ProperNoun' as in rule R2. Further, the symbols 'Adjective Noun' are rewritten as 'beautiful girl'. This sequence of rewritings is called a derivation of the string of words. The hierarchically embedded set of symbols is represented by a parse tree, as shown in figure 2.1.
FIG. 2.1. A PARSE TREE FOR 'BEAUTIFUL GIRL'
The set of strings derivable from a single non-terminal symbol, called the start symbol (the root node in the parse tree representation), is a formal language, which can be defined by a CFG. The start symbol / root node is often interpreted as the sentence node (S), and the set of strings derivable from 'S' is the set of sentences.
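A derivation under rules R1-R4 can be sketched as a simple recursive expansion. Deterministically taking the first alternative of each non-terminal is a simplification for illustration; a real derivation chooses among alternatives.

```python
# A minimal top-down derivation using the grammar R1-R4 above.
# Non-terminals map to lists of alternative right-hand sides;
# symbols absent from the grammar are terminals.

GRAMMAR = {
    "NP": [["Adjective", "Noun"], ["ProperNoun"]],
    "Adjective": [["beautiful"]],
    "Noun": [["girl"]],
}

def derive(symbol):
    """Expand a symbol left-to-right, always taking the first production."""
    if symbol not in GRAMMAR:          # terminal symbol
        return [symbol]
    words = []
    for child in GRAMMAR[symbol][0]:   # first alternative only
        words.extend(derive(child))
    return words
```

Calling `derive("NP")` replays the derivation NP ⇒ Adjective Noun ⇒ beautiful girl described in the text.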
2.7 PROBABILISTIC CFG (PCFG)
A CFG whose productions are augmented with probabilities is a PCFG, also known as a Stochastic Context Free Grammar (SCFG) [13]. A PCFG augments each production rule with a conditional probability, e.g. Noun → girl [0.5]. This gives the probability that the non-terminal symbol 'Noun' will be expanded to the given string, in this case 'girl', and is often referred to as prob(Noun → girl) or prob(Noun → girl | Noun).
2.8 PARSING
Parsing is the task of recognizing a sentence and assigning a syntactic structure to it [13]. The task is to determine how the input is derived from the start symbol. This can be done primarily in two ways: top-down parsing and bottom-up parsing. Some parsers that use top-down parsing are the recursive descent parser, the LL parser and the Earley parser. Some parsers that use bottom-up parsing are the precedence parser, the bounded context parser, the LR parser and the CYK parser.
2.9 SYNCHRONOUS CONTEXT FREE GRAMMAR
A synchronous CFG is like a CFG except that its production rules have two right-hand sides (a source side CFG and a target side CFG) instead of one [14], [15]. A synchronous CFG derivation for the example sentence "I gave a book" is shown in figure 2.2. The source-side and target-side symbols are linked by the numbering. For example, a synchronous CFG for an English and Tamil sentence pair is as follows:
I.   S → (NP1 VP2, NP1 VP2)
II.  VP → (V1 NP2, NP2 V1)
III. NP → (I, naan)
IV.  NP → (a book, oru puTTakam)
V.   V → (gave, kotu)
The non-terminal symbols of the source side are mapped to the non-terminal symbols of the target side by the numbering. The symbol mapping is one-to-one, and the numbering constraints must be satisfied. As in a CFG, the non-terminal symbols are rewritten using production rules; the synchronous CFG rewrites the non-terminal symbols on the source side and target side simultaneously. Similar to a CFG derivation, a synchronous CFG derivation can be viewed as a tree, but here we get a pair of derivation trees. Starting with the symbol 'S' as in a CFG:
(S1, S1) ⇒ (NP1 VP2, NP1 VP2) ⇒ (NP1 V2 NP3, NP1 NP3 V2) ⇒ (I gave a book, naan oru puTTakam kotu)
FIG. 2.2. SYNCHRONOUS CFG DERIVATIONS
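The paired derivation above can be sketched programmatically: expanding a source node fixes the target order through the linked indices. The rule encoding with '#' link indices is an illustrative convention of this sketch, not the thesis's notation.

```python
# Sketch of a synchronous CFG derivation: each rule pairs a source RHS
# with a target RHS; linked non-terminals share an index. Rules mirror
# I-V above; the '#index' labels are an illustrative encoding.

SYNC_RULES = {
    "S": (["NP#1", "VP#2"], ["NP#1", "VP#2"]),
    "VP": (["V#1", "NP#2"], ["NP#2", "V#1"]),
}
LEX = {  # lexical rules III-V (source phrase -> target phrase)
    "I": "naan", "gave": "kotu", "a book": "oru puTTakam",
}

def derive_target(symbol, source_children):
    """Expand one node, reordering linked children per the target RHS."""
    src_rhs, tgt_rhs = SYNC_RULES[symbol]
    by_link = {}
    for label, child in zip(src_rhs, source_children):
        idx = label.split("#")[1]
        if isinstance(child, str):            # lexical leaf (rules III-V)
            by_link[idx] = LEX[child]
        else:                                 # (symbol, children) subtree
            by_link[idx] = derive_target(child[0], child[1])
    return " ".join(by_link[l.split("#")[1]] for l in tgt_rhs)

# source derivation tree for "I gave a book"
tree = ("S", ["I", ("VP", ["gave", "a book"])])
```

`derive_target(tree[0], tree[1])` walks the English derivation tree and emits the Tamil side in its reordered (verb-final) order.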
2.10 SYNCHRONOUS TREE ADJOINING GRAMMAR
In Synchronous TAG [15], the productions are pairs of elementary trees [16] (see figure 2.3). The non-terminals of the source and target trees are linked, similar to synchronous CFG. Initial and auxiliary trees are the two types of elementary trees. The former is an elementary tree that gets attached at a leaf node of another elementary tree when the root of the first tree is the same as that leaf; this process is called substitution. The latter, the auxiliary tree, gets attached by a two-step process called adjoining. The auxiliary tree's distinguished leaf node, called the foot node, is marked with an asterisk '*'.
FIG. 2.3. ELEMENTARY TREES (INITIAL TREES)
In a TAG derivation [17], [18], we start with an elementary tree that is rooted in the start symbol, then repeatedly choose a leaf non-terminal symbol 'X' and attach to it an elementary tree rooted in 'X'. Similar to a CFG production rule, where a symbol gets rewritten by a set of symbols, in TAG the production rule is an elementary tree: the initial elementary tree 'B' is substituted into tree 'A', and 'C' into 'A'. The process of repeated rewriting is shown in figure 2.4.
FIG. 2.4. REWRITTEN ELEMENTARY TREES
The auxiliary tree adjoining process is explained with the following example: 'The boy who is my classmate gave a book'. The initial and auxiliary trees required for this example sentence are shown in figure 2.5.
FIG. 2.5. ELEMENTARY TREE REQUIRED FOR THE EXAMPLE SENTENCE
The auxiliary tree 'D' adjoins with the initial tree 'A' by the following two-stage substitution process: 1) the leaf node NP1 of tree 'A' is removed and substituted into the auxiliary tree 'D'; 2) the auxiliary tree is substituted at the leaf node of tree 'A'. TAG starts with an elementary tree rooted in the start symbol and repeatedly attaches elementary trees to get the derivation.
FIG. 2.6. TAG DERIVATION
Synchronous TAG generalizes TAG in exactly the same way that synchronous CFG generalizes CFG. It starts with a pair of elementary trees in which the nodes of the source elementary tree are linked with those of the target elementary tree. Figure 2.6 shows the TAG derivation tree and figure 2.7 shows the synchronous TAG derivation tree for the example sentence "The boy who is my classmate gave a book".
FIG. 2.7. SYNCHRONOUS TAG
CHAPTER 3
3 SYSTEM OVERVIEW
The aim of the system is to translate the source language sentence to the target language sentence. Any rule-based machine translation system from a Subject-Verb-Object (SVO) structured source language to a Subject-Object-Verb (SOV) structured target language involves a word reordering transformation technique. This machine translation system uses the pattern-based word reordering technique [19]. Tamil is largely a free word order language, but the word order is still bounded in certain constructions, at least in the literary language. Though there isn't any restriction on constructing the sentence in any order, verb-final sentences are the most widely accepted formally; placing the verb in other positions makes the sentence more poetic and less fluent in modern literary language. The dependency relations between the words define the syntax, and the word order has little say about the syntax of a Tamil sentence. The relation between the words in the sentence is defined by postpositions, case endings, or inflections. Although the dependency transfer method is arguably the best method for translating the English-Tamil pair, the developed system uses the pattern-based reordering mechanism. The main reason is that the transformation rules required are minimal and much easier to develop than those of the dependency transfer method. In the target language, the case endings (the counterpart of English prepositions), which define the relations between nouns, are not isolated words; they are glued to the preceding noun. Synthesizing a noun with its case ending requires dependency feature information. The reason for choosing the Stanford parser is that it provides both the typed dependency relations [20] and the parse tree structure [21] [22]: the reordering is done using the parse tree structure, and the word generation (synthesizing) uses the typed dependency features.
The general block diagram of a simplified version of the translation system is shown in figure 3.1. The basic components required for machine translation are shown in the block diagram (refer to figure 3.1); the other, more intricate components are saved for later chapters. As far as the developed MT system is concerned, there is not much to do with the parts-of-speech tagger and parser for English: the Stanford parser packages are used to get the parse tree structure and the typed dependency relations, and the developed MT system mainly focuses on the remaining components in the block diagram. The contribution of this project starts after the parsing stage; the technical background of the POS tagger and parser is out of the scope of this dissertation.
FIG. 3.1. TRANSLATION SYSTEM- BLOCK DIAGRAM
The objective of this chapter is to give a general overview of the flow of the MT system. The system is best explained with an example: the source language sentence "Ram gave him a book" is used to explain the components of the system.
FIG. 3.2. ANNOTATION OF THE INPUT SENTENCE
FIG. 3.3. PARSE TREE (RAM GAVE HIM A BOOK)
The input sentence "Ram gave him a book" is parsed using the Stanford parser with any of its English parser models. The parser requires an annotated sentence as input, so the input sentence is POS tagged before being passed on to the parser. The annotated sentence is shown in figure 3.2; the POS tagger annotates the sentence with the UPenn tag set. The parser recognizes the input sentence and assigns a syntactic structure to it (refer to figure 3.3). Word reordering can't be done blindly, based only on the position of each word in the sentence; grammatical knowledge of how the input sentence is constructed serves reordering far better. The parse tree representation has an equivalent bracketed representation: "(S (NP (NNP Ram)) (VP (VBD gave) (NP (PRP him)) (NP (DT a) (NN book))))". The tree representation is followed in this document wherever necessary. Word reordering is done by swapping the nodes of the tree based on transformation rules. Figure 3.4 is the tree representation of a reordering rule. A transformation rule has three parts: the source pattern, the target pattern and the transfer link. The rules are similar to context free grammar (CFG) rules, except that they specify the structure of two phrases, one in the source language and the other in the target language.
FIG. 3.4. REORDERING RULE
The tree-reordering algorithm searches for the source rule pattern in the parse tree structure. If a match is found, the source rule pattern is replaced with its counterpart target rule pattern. The node mapping is done using the transfer link: in the figure, the numbering shows the transfer link, a mapping of the source rule pattern's nodes to the target rule pattern's nodes.
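The matching-and-swapping step can be sketched on a nested-list encoding of the parse tree. The single hard-coded rule and the transfer link written as a permutation are illustrative assumptions about the rule representation.

```python
# Sketch of the tree-reordering step: find a node whose children match
# the source pattern and permute them per the transfer links. Trees are
# nested lists in Penn-style bracket order; leaves are [POS, word].

RULE = {
    "parent": "VP",
    "source": ["VBD", "NP", "NP"],   # source pattern VP (VBD NP NP)
    "links": [2, 3, 1],              # target position i <- source position links[i-1]
}

def label(node):
    return node[0]

def reorder(node, rule=RULE):
    """Recursively apply the reordering rule wherever it matches."""
    if len(node) == 2 and isinstance(node[1], str):
        return node                  # leaf: [POS, word]
    children = [reorder(c, rule) for c in node[1:]]
    if node[0] == rule["parent"] and [label(c) for c in children] == rule["source"]:
        children = [children[i - 1] for i in rule["links"]]
    return [node[0]] + children

parse = ["S",
         ["NP", ["NNP", "Ram"]],
         ["VP", ["VBD", "gave"],
                ["NP", ["PRP", "him"]],
                ["NP", ["DT", "a"], ["NN", "book"]]]]
```

Applying `reorder(parse)` moves the VBD node after the two NP nodes, turning the SVO tree into the SOV-ordered tree of figure 3.5.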
In the example parse tree, the source rule pattern matches and is replaced with the target rule pattern. Figure 3.5 shows the transformation.
FIG. 3.5. TRANSFORMATION OF PARSE TREE TO REORDERED TREE
The reordered tree structure is more or less the target phrasal structure. The leaf nodes of the tree are still anchored to the English words, but the source language sentence pattern has been transformed to the target language pattern: SVO → SOV. More than 150 such transformation rules have been identified for the English-Tamil language pair. The English words anchoring the reordered tree are substituted with target language words with the help of the bilingual dictionary, actually a carefully managed set of lexicons for each part-of-speech category. The English dictionary doesn't have the word 'gave': storing all the inflected forms of 'give' in the lexicon is not feasible, and doing so for all verbs and nouns would be laborious and painstakingly tedious. Say such a lexicon list were available for the source language; what about the target language? Tamil words are highly inflectional, and the word 'gave' doesn't have a direct equivalent in Tamil. Instead, the word form has to be generated from the root form of the equivalent of 'give'. The lexicon lookup table need only store the root word of the source language and its equivalent in the target language, not all the inflectional forms.
The word to be searched in the lexicon list to get the target equivalent in the example sentence is 'gave', but the root word lexicon list doesn't have it. This is where the English stemmer / morphological analyzer comes into the picture: the morphological analyzer outputs the root word along with the morphological feature information, as shown in figure 3.6.
FIG. 3.6. FLATTENING OF TREE
Now the Tamil words are substituted and the word order is right, but it is still not a perfect translation: the scrambled words don't make any sense. The input sentence "Ram gave him a book" should be translated to "ராம் அவனுக்கு ஒரு புத்தகத்தைக் கொடுத்தான்" (rAm avannukku oru puTTakaTTaik kotuTTAn). The scrambled words are synthesized to attain this, as shown in figure 3.7.
FIG. 3.7. SYNTHESIZING OPERATION
avan + DAT → avannukku
puTTakam + ACC → puTTakaTTai
kotu + past + 3SM → kotuTTAn
DAT, ACC, past and 3SM are morphological feature labels that represent the dative case marker, the accusative case marker, the past tense marker, and the 3rd person singular masculine marker respectively. Read the morphological analyzer and synthesizer chapter for more details (refer to chapter 7).
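The synthesis step can be illustrated as a lookup from root plus features to a surface form. The real synthesizer of chapter 7 computes these forms with finite state transducers, so the hard-coded table below is only an illustrative stand-in: the sandhi changes are spelled out rather than derived.

```python
# A simplified sketch of morphological synthesis: a root plus feature
# labels maps to an inflected surface form. Entries follow the example
# above; a real FST-based synthesizer computes these compositionally.

SYNTH = {
    ("avan", "DAT"): "avannukku",
    ("puTTakam", "ACC"): "puTTakaTTai",
    ("kotu", "past", "3SM"): "kotuTTAn",
}

def synthesize(root, *features):
    """Return the inflected surface form for a root plus features."""
    return SYNTH[(root,) + features]   # KeyError for unknown combinations
```

For instance, `synthesize("kotu", "past", "3SM")` yields the fully inflected verb of the example sentence.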
FIG. 3.8. TRANSLATED OUTPUT
I have read the morph chapter. You know what, I'm satisfied with the "avan + DAT" sorts of things. Nevertheless, how do you manage to get morphological feature information like 'DAT', 'ACC' and 'PAST'? The answer is dependency relations. Those features are extracted from the typed dependency relations provided by the Stanford parser, shown in figure 3.9. The relations are transformed to the target language; the dependency relations of the target language may not necessarily be the same as those of the source language. From the target language dependency relations, the morphological features are taken: the dependency relation information of the target language is mapped to the morphological features of the target language. For example, the indirect object (iobj) relation is mapped to the dative case marker (DAT) and the direct object (dobj) is mapped to the accusative case marker (ACC) in Tamil. The translated output of the sentence with the intermediate stages is shown in figure 3.8.
FIG. 3.9. DEPENDENCY RELATION TO MORPH FEATURES
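The mapping from typed dependencies to case markers can be sketched as follows. The triple format for dependencies and the two-entry mapping table are illustrative assumptions; the full mapping (Table 8.4) covers many more relations.

```python
# Sketch of the dependency-to-morphological-feature mapping: typed
# dependencies from the parser select the Tamil case markers
# (iobj -> dative, dobj -> accusative, as in the example sentence).

DEP_TO_CASE = {"iobj": "DAT", "dobj": "ACC"}

def map_features(dependencies):
    """dependencies: list of (relation, head, dependent) triples."""
    features = {}
    for rel, head, dep in dependencies:
        if rel in DEP_TO_CASE:
            features[dep] = DEP_TO_CASE[rel]
    return features

# typed dependencies for "Ram gave him a book"
deps = [("nsubj", "gave", "Ram"),
        ("iobj", "gave", "him"),
        ("dobj", "gave", "book")]
```

The resulting feature dictionary is what the synthesizer consumes: 'him' carries DAT and 'book' carries ACC, matching "avan + DAT" and "puTTakam + ACC" above.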
Though the current system has more than 60k lexicon entries and is still growing, for a practical MT system it is not enough. What about words that are not in the bilingual dictionary or lexicon? How do we translate words that are not present in the lexicon list, or words not eligible to be in the lexicon list, such as person names, place names, etc.? Out of vocabulary (OOV) words cannot be translated, but they can be transliterated. Transliteration is an automatic method that converts words/characters in one alphabetical system to the corresponding phonetically equivalent words/characters in another alphabetical system [23]. A third-party transliteration module¹⁰ developed on the Sequence Labeling Approach [23] is integrated into the current system. Named entities are transliterated from English to Tamil using this tool; 30k person names and 30k place names (English-Tamil pairs) were used for training with an SVM. Words that are not present in the root lexicon (OOV words) are also transliterated in the same manner using the same tool. Though the transliteration tool works well for names and places, it does not produce good results for other OOV words, since it is trained specifically on names and places. The current system doesn't have a separate
¹⁰ The transliteration module was developed by Ajith, CEN, Amrita Vishwa Vidyapeetham.
transliteration module for the OOV words. As a future enhancement, a named entity identifier can be used to identify names and places, which are then transliterated using the existing tool, while other OOV words are transliterated using a new tool built on mapping rules. For example:

He gave Sitha a pen
அவன் சீதாவிற்கு ஒரு பேனா கொடுத்தான்
avan cITAviRku oru pEnA kotuTTAn
In the previous example, the target equivalent of the word 'Sitha' (a proper name) is not present in the lexicon. The word gets transliterated as 'cITA', the phonetically equivalent characters in the Tamil alphabetical system for the English word 'Sitha'. Apart from the main challenges, other issues like multi-word expressions and compound nouns are handled in the current system. A multi-word expression (MWE) is a lexeme made of a sequence of two or more lexemes that together constitute a meaning; the meaning is not predictable from the individual entities. An MWE can be a compound or a fragment of a sentence. Example: 'He kicked the bucket' means he died, not the literal meaning of kicking a bucket with the foot. Translating such expressions into the target language literally gives a nonsensical meaning, though the translated sentence may be syntactically correct. Compound words are also lexemes of two or more lexemes that constitute a meaning, similar to MWEs, but compound words aren't idiosyncratic: they are clusters of words that are supposed to be interpreted as a single entity. The individual entities have their own meanings but do not convey any meaning in the context of the sentence. Example: I/PRP bought/VBD a/DT table/NN top/NN wet/NN grinder/NN ான் எரு பநதை ஈபநா ிாிந்தர் மூடிகாருத்து யாங்ிபன் In the above example, the cluster of words is translated individually into the target language, which yields a nonsensical sentence; to get the correct translation, the cluster (table, top, wet, grinder) should be treated as a single entity. Handling multiple outputs, subject-verb feature agreement, resolving senses using word sense disambiguation techniques, etc., along with detailed explanations of the components of the MT system block diagram, are discussed in later chapters.
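The treat-the-cluster-as-one-unit idea can be sketched with a greedy longest-match over a phrase lexicon, so that a span like "table top wet grinder" reaches the translation stage as a single lexicon key. The phrase entries below are illustrative assumptions.

```python
# Sketch of MWE / compound handling: greedily group the longest token
# span found in a phrase lexicon into one unit, so it can later be
# translated as a single entity rather than word by word.

PHRASES = {
    ("table", "top", "wet", "grinder"),
    ("kicked", "the", "bucket"),
}
MAX_LEN = max(len(p) for p in PHRASES)

def chunk(tokens):
    """Group tokens, preferring the longest phrase-lexicon match."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in PHRASES:
                out.append(" ".join(key))   # one unit for the lexicon lookup
                i += n
                break
        else:
            out.append(tokens[i])           # ordinary single word
            i += 1
    return out
```

Each grouped unit then gets a single entry in the bilingual lexicon, which is what prevents the nonsensical word-by-word rendering shown above.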
CHAPTER 4
4 RULE BASED WORD REORDERING
4.1 TAMIL SENTENCE STRUCTURE
Tamil is a head-final language. The verb usually comes at the end of the clause in a standard sentence, with the typical Subject Object Verb (SOV) word order, though Object Subject Verb (OSV) is not uncommon. Although in standard literary text Tamil is head-final, Tamil is a free word order language; for machine translation, we are mostly concerned with the standard text. The frequency of the SOV and OSV sentence structures in standard monolingual corpora bagged from various sources depends on factors such as the literary style, sub-language, author, etc. As the counterpart to English prepositions, Tamil has postpositions, most of which are agglutinated to the preceding noun. Tamil is a null-subject language: it is not necessary that all Tamil sentences have subjects, verbs, and objects¹¹. Tamil does not have linking verbs or relative pronouns. These are a few important issues that must be considered for word reordering from English to Tamil.
4.2 ENGLISH STRUCTURE TO TAMIL STRUCTURE
Even though translating the English sentence with the word order intact conveys the correct meaning, for the sake of the fluency of the translated Tamil sentence and to get a translation close to the standard text, the English structure has to be transformed to the Tamil word order structure with the help of some mechanism. The mechanism involves POS tagging, parsing and reordering rules. The English syntax cannot be transformed to the Tamil syntax without knowing the structure of the English sentence. The
¹¹ http://en.wikipedia.org/wiki/Tamil_grammar#Sentence_structure
mapping of a sentence to its sentence structure for all grammatically possible sentences in the language is tedious, and almost impossible if done manually. The process of doing this mapping automatically is parsing: the process of analyzing a text made of a sequence of words to determine its grammatical structure with respect to a given formal grammar.
4.3
PATTERN BASED CFG FOR REORDERING
A pattern is a pair of Context Free Grammar (CFG) rules [24]. These rules give the equivalent structures in the source and target languages. On the basis of the patterns, the reordering rules are formulated to facilitate the machine translation. They reflect the translation patterns of the source and target languages. For example, the following reordering rule is based on an English-Tamil pattern:

VP (VBD NP NP)
VP (NP NP VBD) || 1:2 2:3 3:1
The rule has three parts: the source pattern, the target pattern, and the transfer link information. These patterns are represented either in tree form or in Penn bracket notation. Even though the internal reordering process makes use of the bracketed notation, the tree representation is more illustrative and easier to follow. The labels used follow the Penn notation. Table 4.1 shows the meaning of the labels used in the example.
TABLE 4.1. POS LABELS

Label   Meaning
VP      Verb Phrase
VBD     Verb (Past tense form)
NP      Noun Phrase
In contrast to the pair of CFG rules of a Synchronous CFG [15], where the source CFG rule is used to get the parse derivation of the input sentence, this method uses the pattern only to do reordering; it is not used for parsing the source syntactic structure. The derivation of the target structure is not done concurrently, as opposed to synchronous CFG. Only the patterns that need to be replaced with target patterns are required, unlike in synchronous CFG and synchronous TAG [16], where every source rule has its equivalent target counterpart.
4.3.1 DISSECTION OF THE REORDERING RULE
4.3.1.1 SOURCE PATTERN
The bracketed representation of the source pattern CFG rule is VP (VBD NP NP) and its equivalent tree representation is shown in the figure 4.1.
FIG. 4.1. SOURCE RULE
VP is the root node. VBD, NP, NP are the Children nodes. The numbering below the child nodes indicates the position of the nodes in the source pattern.
4.3.1.2 TARGET PATTERN
The bracketed representation of the target pattern CFG rule is VP (NP NP VBD) and its equivalent tree representation is shown in the figure 4.2.
FIG. 4.2. TARGET RULE
VP is the root node. VBD, NP, NP are the Children nodes. The numbering below the child nodes indicates the position of the nodes originally in the source pattern.
4.3.1.3 TRANSFER LINK
The final part of the reordering rule is the transfer link. The transfer link for the above example is "1:2 2:3 3:1". Just replacing the source pattern with the target pattern may not solve the reordering completely in all cases. In the above example, there are two NP nodes, and the positions of these NP nodes in the target pattern have to be defined. This mapping is done using the transfer link. The transfer link says exactly which NP node out of the two NP nodes of the source pattern is linked to the target pattern's NP node. The main function of the transfer link is to disambiguate nodes that have the same name and to do the exact mapping of the nodes between the source and target patterns with no confusion. The transfer link "1:2 2:3 3:1" maps three nodes of the source pattern to the target pattern: 1:2 means the second child node of the source pattern is realigned to the first node of the target pattern, 2:3 moves the third child node to the second node of the target pattern, and so on.
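The transfer link semantics described above can be sketched in a few lines of illustrative Python (not the thesis implementation; the function name is a hypothetical one):

```python
def apply_transfer_link(source_children, transfer_link):
    """Reorder the children of a matched source pattern per a transfer
    link such as '1:2 2:3 3:1'.  Each pair 'i:j' means: the i-th child
    of the TARGET pattern comes from the j-th child of the SOURCE
    pattern (positions are 1-based)."""
    target = [None] * len(source_children)
    for pair in transfer_link.split():
        tgt_pos, src_pos = (int(x) for x in pair.split(":"))
        target[tgt_pos - 1] = source_children[src_pos - 1]
    return target

# VP (VBD NP NP) -> VP (NP NP VBD) with link "1:2 2:3 3:1"
children = ["VBD:gave", "NP:him", "NP:a_book"]
print(apply_transfer_link(children, "1:2 2:3 3:1"))
# ['NP:him', 'NP:a_book', 'VBD:gave']
```

Note that the two NP children keep their identities: the link, not the label, decides which NP lands where.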
4.3.1.4 REORDERING RULE
'VP (VBD NP NP)' is the English CFG rule ('VP' is the root node and 'VBD', 'NP', 'NP' are the children nodes in the tree representation), called the source rule, and 'VP (NP NP VBD)' is the Tamil CFG rule, called the target rule. "1:2 2:3 3:1" is the transfer link. The tree representation of the above rule is shown in the figure 4.3.
FIG. 4.3. REORDERING RULE
The transfer link contains the order of the children nodes of the target rule. "1:2 2:3 3:1" says the first child of the target rule is from the second child of the source rule; the second child of the target rule is from the third child of the source rule, and so on. The parse tree of the source language is checked against the source rules. If any match is found in the parse tree, then the source pattern is replaced with the corresponding target pattern.
4.3.2 REORDERING ALGORITHM – SIMPLIFIED VERSION
Let T be the parse tree, with N the total number of nodes that we process for reordering; n the index of a node and c the indices of the children of node n; R the number of reordering rules; St the source pattern and Gt the target pattern of the t-th rule; Stc the array of child nodes of the t-th source pattern; Gtc the array of child nodes of the t-th target pattern.

Re-order (T):
    Visit node n
    For each St of reordering rules R
        If (root node (St) equals label (n))
            If (children (n) equal Stc)
                Replace St with target tree Gt at node n
            End If
        End If
    End For
    For each child c of n
        Re-order (sub tree (c))
    End For
End
The Re-order algorithm, with N nodes in the parse tree T and R reordering rules, has a complexity of O(N*R). In the following parse tree, the node labels are numbered in the order in which the tree traversal happens.
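The Re-order pseudocode above can be rendered as a short Python sketch. This is an illustrative reading, not the thesis code: a tree is a nested list `[label, child_1, ..., child_k]` with strings as leaves, and a rule is a `(source_root, source_children, transfer_link)` triple.

```python
# Illustrative rendering of the Re-order algorithm.
# A rule's link 'i:j' puts the j-th source child at target position i (1-based).

def reorder(tree, rules):
    if isinstance(tree, str):            # leaf word: nothing to match
        return tree
    label, children = tree[0], tree[1:]
    child_labels = [c if isinstance(c, str) else c[0] for c in children]
    for src_root, src_children, link in rules:
        if src_root == label and src_children == child_labels:
            new = [None] * len(children)
            for pair in link.split():
                t, s = (int(x) for x in pair.split(":"))
                new[t - 1] = children[s - 1]
            children = new
            break                        # one-to-one rule map: first match wins
    return [label] + [reorder(c, rules) for c in children]

rules = [("VP", ["VBD", "NP", "NP"], "1:2 2:3 3:1")]
tree = ["S", ["NP", "Ram"], ["VP", ["VBD", "gave"], ["NP", "him"], ["NP", "book"]]]
print(reorder(tree, rules))
# ['S', ['NP', 'Ram'], ['VP', ['NP', 'him'], ['NP', 'book'], ['VBD', 'gave']]]
```

Every node is visited once and every rule is tried at each node, matching the O(N*R) complexity stated above.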
FIG. 4.4. PARSE TREE
The pattern based reordering rules for reordering the above parse tree (refer figure 4.4) are as follows (figure 4.5):
Rule R1
Rule R2
FIG. 4.5. REORDERING RULES
Re-order (T):
    Visit node 7
    For each St of reordering rules R
        If (root node (S1) equals label (7))      -- satisfied at node 7
            If (children (7) equal S1c)           -- reordering rule applied
                Replace S1 with target tree G1 at node 7
    For each child c of node 7
        Re-order (sub tree (c))

On traversing the parse tree, at node 7 the label of node 7 equals the root node label of the source pattern S1 of the reordering rule R1. The children of node 7 are (8) and (9), and they match the children of the source pattern S1. So the source pattern is replaced with the target pattern, with the specific node replacement done using the transfer link (not shown in the algorithm, for simplicity). The traversal then continues with the child nodes. At the completion of the traversal of the tree, all the reordering rules that found matches have been applied. The tree after the application of the given set of rules is shown in the figure 4.6.
FIG. 4.6. APPLICATION OF RULE R1 AND R2
4.3.3 ARCHITECTURE
The output of the parser (the parse tree) is the input to the reordering module, along with the reordering rules from the database. After the application of the necessary rules, the reordering module outputs a reordered tree whose shape is close to the syntactic structure of Tamil, but with the English words on the leaf nodes rather than the Tamil words of an actual Tamil syntactic structure. The reordering architecture diagram is shown in the figure 4.7.
FIG. 4.7. REORDERING ARCHITECTURE
4.3.3.1 REORDERING EXAMPLE
For example, the parse tree of the English sentence "Ram gave him a book" is (S (NP (NNP Ram)) (VP (VBD gave) (NP (PRP him)) (NP (DT a) (NN book)))). This English phrasal structure is checked against the available reordering rules to find a match. The pattern 'VP (VBD NP NP)' is present in the English phrasal structure, so the structure is eligible for the reordering rule. The pattern "VP (VBD NP NP)" of the source language is thus replaced with its counterpart "VP (NP NP VBD)" in the target language. The replacement of the source pattern by the target pattern is highlighted in the figure 4.8.
FIG. 4.8. TRANSFORMATION OF PARSE TREE TO REORDERED TREE
4.3.3.2 LIMITATIONS OF THE REORDERING RULES
Tamil shows a very high degree of flexibility in ordering the words within a sentence. The positions of the words can be easily transposed without much change in the meaning. For example, "Ram gave him a book" can be reordered in multiple ways in Tamil; the most common ones are "Ram him a book gave", "Ram a book him gave", "Him a book Ram gave", etc. The predicate verb mostly takes the last position. In our system, the reordering rules are strictly a one-to-one map: every source rule is mapped to one target rule, and the target rule is formulated based on the most common usage. The Tamil clausal structure is more rigid and shows little flexibility. For example, "Ram, who is smart, gave him a book." is reordered as "(smart Ram) (him) (a book) (gave)". Here the adjectival clause "who is smart" has to be positioned before the noun "Ram" in Tamil. The current system can handle only generic reordering rules. For example, consider the constructs "The capital of India" and "The thousands of devotees". The parse structures for the phrases are (NP (NP (DT The) (NN capital)) (PP (IN of) (NP (NNP India)))) and (NP (NP (DT The) (NNS thousands)) (PP (IN of) (NP (NNS devotees)))) respectively. The reordering rule transforms the above phrases to "India of – capital" and "devotees of – thousands" respectively. The latter target phrase, "devotees of – thousands", is not correct, and this happens because of the one-to-one reordering rule map. This is the limitation of the current reordering rule mechanism. It can be overcome by letting the system generate multiple outputs through one-to-many reordering rule maps; in post-processing, the best output can then be chosen based on the fluency of the sentence. Currently, this is not incorporated in the system.
4.3.3.3 HANDLING PREPOSITION
Tamil doesn't have prepositions; instead it has postpositions, and another peculiarity is that the postpositions aren't isolated words. Tamil doesn't have equivalent words for the prepositions; instead, Tamil has case endings. The prepositions in English and the case endings in Tamil don't have a one-to-one mapping. The case endings depend on the syntax of the sentence, which is quite hard to handle at the reordering stage of the translation system, and they are very ambiguous: for the same preposition in English, there are many case endings in Tamil, depending on the syntactic features of the sentence construction. To carry out the preposition-to-case-ending transformation, specific reordering rules (see figure 4.9) are necessary to make sure the words are in the proper position even though the preposition has no equivalent in Tamil. The case endings for the words are determined based on the dependency information of the source sentence, which is handled in a separate module; in the reordering module, the positions of the words are corrected according to the target language. The reordering example for the preposition case is as follows, and the parse tree and the reordered tree for the given example are shown in the figure 4.10. Example Sentence:
Delhi is the capital of India
Reordering Rules for Preposition to postposition transformation is shown in the figure 4.9.
FIG. 4.9. REORDERING RULES
FIG. 4.10. PARSE AND REORDER STRUCTURE
4.3.4 HANDLING COPULA
Tamil doesn't have a linking verb (copula). The English copula is replaced with a Tamil copula-like word for the ease of translation, but a sentence construction with such a copula-like word is not fluent in the language.
4.3.5 HANDLING AUXILIARY
The isolated auxiliaries and the finite verb in English have to be translated as a composite verb, as a single entity. The auxiliary slots remain empty and are not replaced with counterparts in Tamil. Rather, the auxiliary and the verb are synthesized based on feature information extracted from the typed dependency relations in the synthesizer module.
4.3.6 HANDLING RELATIVE PRONOUNS
Tamil doesn't have relative pronouns, but the meaning is conveyed by relative participle constructions, which are synthesized based on the dependency relations. For the reordering, a reordering rule specific to the relative pronoun construction transforms the English sentence construction into the Tamil syntactic construction.
CHAPTER 5
PHRASAL STRUCTURE VIEWER

5.1 NEED
Creation of reordering rules demands an exhaustive analysis of the source and target language syntactic structures. The comparative analysis of the different syntactic structures in both the source and target language requires a visual interface rather than the mere bracketed representation of the syntactic structure. Even though the bracketed notation serves better for text processing, for comparative analysis the bracketed representation lacks the visual aid that would help the linguists. Much of the painstaking and laborious analysis of syntactic structure can be avoided with the phrasal structure viewer, a visual interface for the creation of reordering rules, for lexicon development, and for the analysis of morphological information.
5.2 INTRODUCTION
The phrasal structure of the input sentence and the output sentence is presented as a tree structure in the parse tree viewer tool. The phrasal structure has a crucial role in the pattern-based approach. A handful of examples have to be analyzed before creating a reordering rule; analyzing quite a number of parse structures helps to create a general rule that applies to many types of sentence construction. The parsed structure of the source language sentence is checked against the existing reordering rules in the rule database, and using the pattern-based approach, the source phrasal structure is transformed into the target structure. The source language phrasal structure is reordered based on the linguistic rules to form a target language phrasal structure. The correctness of the target phrasal structure is assured during testing by viewing both the source and target structures side by side. New reordering rules can be created, or existing reordering rules modified, using the phrasal structure viewer (P-Viewer).
The P-Viewer not only assists the linguist in creating pattern-based reordering rules based on the analysis of both the source and target structures; the tool also has additional features to add, delete and modify the lexicon. On a larger scale, linguists who work independently do not need to bother learning the framework to create reordering rules and lexicon. Since the database is centralized, duplication of data creation is avoided. Duplicating existing rules or lexicon is one of the major impediments in the development of resources. In any rule-based system, the coverage of the lexicon of the language or sub-language determines the performance of the translation system. Having said that, the development of a translation system cannot afford the duplication of rules or lexicon, or for that matter any resource development work.
5.3 ARCHITECTURE OF P-VIEWER
The input sentence of the source language is parsed. Any parser can be plugged into the P-Viewer; currently, the P-Viewer features the Stanford Parser and a Lexicalized Tree Adjoining Grammar Earley parser. The parsed tree is reordered to make it closer to the target language phrasal structure; the tree reordering mechanism is explained in greater detail in the Tree Reordering chapter. The bracketed representation of the syntactic structure of the source and target language is converted to an acyclic tree diagram in Tree rendering. The reordering rules are stored in the database in a specific format so as to reduce data latency; the reordering rule db format is discussed later in this chapter. The architecture of the P-Viewer is shown in the figure 5.1.
FIG. 5.1. ARCHITECTURE OF P-VIEWER
5.4 TREE RENDERING

The equivalence of the bracketed notation of syntactic structure and the tree diagram representation of the analysis of a sentence can be established by devising a recursive function, a well-defined, step-by-step process that converts one of the representations into the other. The bracketed notation can be converted into a tree by starting with the bracketed words. For each word, the brackets are converted into a tree branch that runs from the word to a node labelled with the label of the left bracket; each word can be called a leaf of the tree. The brackets that enclose sequences of words, and which correspond to syntactic categories, can then be converted into branches of the tree that connect the lexical category nodes to nodes corresponding to, and identified by, the syntactic category labels on the brackets. The remaining syntactic category brackets are transformed into branches that connect nodes labelled with the category labels on the brackets. This process continues until the outermost brackets are encountered and the root node is attached to the tree by branches connecting it to the nodes at the next level down. Conversely, a pre-order traversal function can be devised that converts a tree into a bracketed string. Hence, the two representations are equivalent.
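The two conversions described above can be sketched as a pair of small recursive functions. This is an illustrative sketch (the function names are hypothetical, not the P-Viewer code), using nested lists as the tree representation:

```python
import re

def parse_bracketed(s):
    """Parse a Penn bracketed string like '(NP (DT a) (NN book))'
    into nested lists like ['NP', ['DT', 'a'], ['NN', 'book']]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            word = tokens[pos]; pos += 1
            return word                      # a word: a leaf of the tree
        pos += 1                             # consume '('
        label = tokens[pos]; pos += 1        # the category label on the bracket
        children = []
        while tokens[pos] != ")":
            children.append(node())          # recurse into enclosed brackets
        pos += 1                             # consume ')'
        return [label] + children

    return node()

def to_bracketed(tree):
    """Pre-order traversal converting a tree back into a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join([tree[0]] + [to_bracketed(c) for c in tree[1:]]) + ")"

s = "(S (NP (NNP Ram)) (VP (VBD gave) (NP (PRP him))))"
assert to_bracketed(parse_bracketed(s)) == s   # the two forms are equivalent
```

The round-trip assertion at the end is exactly the equivalence argument of this section.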
5.5 REORDERING RULE FORMAT

The reordering rule patterns are represented either in tree form or in Penn bracket notation. Even though the internal reordering process makes use of the bracketed notation, the tree representation is more illustrative and easier to follow. Consider the following reordering rule example:

VP (VBD NP NP)
VP (NP NP VBD) || 1:2 2:3 3:1
The bracketed representation of the source pattern CFG rule is VP (VBD NP NP): VP is the root node, VBD, NP, NP are the children nodes, and the numbering below the child nodes indicates their positions in the source pattern. The bracketed representation of the target pattern CFG rule is VP (NP NP VBD); here the numbering below the child nodes indicates the positions the nodes originally held in the source pattern. The third part of the reordering rule is the transfer link, "1:2 2:3 3:1" in the above example. Just replacing the source pattern with the target pattern may not solve the reordering completely in all cases: in the above example there are two NP nodes, and the positions of these NP nodes in the target pattern have to be defined. This mapping is done using the transfer link, which says exactly which NP node out of the two NP nodes of the source pattern is linked to the target pattern's NP node. The main function of the transfer link is to disambiguate nodes that have the same name and to do the exact mapping of the nodes between the source and target patterns with no confusion. The transfer link "1:2 2:3 3:1" maps the three nodes of the source pattern to the target pattern: 1:2 means the second child node of the source pattern is realigned to the first node of the target pattern, 2:3 moves the third child node to the second node of the target pattern, and so on. 'VP (VBD NP NP)' is the English CFG rule, called the source rule, and 'VP (NP NP VBD)' is the Tamil CFG rule, called the target rule; "1:2 2:3 3:1" is the transfer link. The tree representation of the above rule is shown in the figure below.
The transfer link contains the order of the children nodes of the target rule. "1:2 2:3 3:1" says the first child of the target rule is from the second child of the source rule; the second child of the target rule is from the third child of the source rule, and so on. The parse tree of the source language is checked against the source rules. If any match is found in the parse tree, then the source pattern is replaced with the corresponding target pattern.
5.6 FILTERED SEARCH
No node is excluded from the database search for a match; for each node, all the rules in the db would have to be checked. This slows down the tree reordering algorithm to a certain extent in a scaled-up version of the translation system. The loss of time can be avoided by doing a filtered search rather than a complete search for the match. For example, consider the following parse tree and reordering rules. At node 4, the label of the node has to be checked against the root node of all the reordering rules in the database, and once a match is found, the children of node 4 have to be checked against the children of the source pattern wherever the root node label is the same as the label of node 4. For ease of processing, the bracketed strings of the source and target patterns of the reordering rule are decomposed into four parts: the root node of the source/target pattern, the children nodes of the source pattern as one entity, the children nodes of the target pattern as one entity, and the transfer link.
TABLE 5.1. FORMAT OF REORDERING RULE IN DB

ROOT NODE   SOURCE PATTERN CHILDREN   TARGET PATTERN CHILDREN   TRANSFER LINK
7           8 9                       9 8                       1:2 2:1
4           5 10                      10 5                      1:2 2:1
4           5 6                       6 5                       1:2 2:1
FIG. 5.2. FILTERED SEARCH
For example, the reordering rules in figure 5.2 are stored in the database in the format shown in the table 5.1.
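The filtered search amounts to indexing the rule table by root label, so only rules whose root matches the current node are ever examined. A minimal sketch, using the rows of table 5.1 (names are illustrative, not the P-Viewer db code):

```python
from collections import defaultdict

rules = [
    # (root node, source pattern children, target pattern children, transfer link)
    ("7", ["8", "9"],  ["9", "8"],  "1:2 2:1"),
    ("4", ["5", "10"], ["10", "5"], "1:2 2:1"),
    ("4", ["5", "6"],  ["6", "5"],  "1:2 2:1"),
]

# Build the filter index once: root label -> rules with that root.
index = defaultdict(list)
for root, src, tgt, link in rules:
    index[root].append((src, tgt, link))

def candidate_rules(node_label, child_labels):
    """Check only the rules whose root matches this node's label,
    then keep those whose source children also match."""
    return [(src, tgt, link) for src, tgt, link in index.get(node_label, [])
            if src == child_labels]

print(candidate_rules("4", ["5", "6"]))   # [(['5', '6'], ['6', '5'], '1:2 2:1')]
print(candidate_rules("7", ["8", "10"]))  # [] -- root matches, children don't
```

With the index, a node whose label has no entry in the db is rejected without touching the rule list at all, which is the time saving this section describes.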
5.7 SCREEN SHOT OF THE P-VIEWER TOOL

In the screen shot of the P-Viewer tool (see figure 5.3), the input sentence "He gave a book to him" is tested. The top panel of the interface is the input-sentence text area, where any language input can be tested provided the parser plugged into the tool supports that language. The left panel displays the tree diagram equivalent of the bracketed string of the source parse structure, whereas the right panel displays the tree diagram equivalent of the reordered tree structure. The bottom panel of the interface has the rule editor and viewer.
FIG. 5.3. SCREEN SHOT OF P-VIEWER
5.8 REORDERING RULE EDITOR

Based on the source and target phrasal structures, it is easy to determine the correctness of the target structure that we are concerned with. The accuracy of the translation output depends primarily on this reordered structure. New rules can be created easily using the rule editor, along with the phrase structure information, and saved into the rule database, which is used by the reordering algorithm for transforming the source structure into the target structure.
FIG. 5.4. SCREEN SHOT OF REORDERING RULE EDITOR
The rule editor has an option to save a new rule and a delete option to remove an existing rule; see figure 5.4. The rule viewer helps to determine the correctness of the existing rules by looking at the source and target phrasal structures.
5.9 DICTIONARY EDITOR
The terminal nodes in the pseudo phrasal structure are attached to the lexicon. During the transformation from the source phrasal structure to the target, based on the reordering rules, the source words are translated using the lexical database or transliterated into the target language. The dictionary editor (refer figure 5.5) helps to add a missing lexicon entry for the current sentence, to add a new entry to the database, and to modify an existing one. The dictionary editor has a vital role in the P-Viewer tool: while doing the exhaustive analysis of the word-order transformation from the source to the target language, having the lexicon glued to the leaf nodes strengthens the visual aid. Both the rule editor and the dictionary editor assist linguists to a great extent in the development of reordering rules and lexicon with ease.
FIG. 5.5. SCREEN SHOT OF DICTIONARY EDITOR
5.10 MULTIPLE SENTENCE AND WORD OPTIONS

Both the pattern-based and the LTAG approach may give multiple outputs for an input sentence. The user is provided with the option to choose 'n' outputs for the given input sentence, and also with multiple word options: right-clicking a target word that has multiple candidates displays all the senses of that word. The input and the corrected (if required) output sentence pair can be saved in the database, which builds up the parallel corpora.
5.11 MORPH SYNTHESIZER FEATURE INFORMATION

The synthesizer module requires the feature information that is extracted from the phrasal structure and the dependency structure of the source tree. The correctness of the synthesized word can be verified using this information, which is provided in the P-Viewer. By inspecting the intermediate feature extraction and synthesizer information, the morph synthesizer rules can be updated using the simple morph synthesizer rule editor.
5.12 STEP BY STEP PROCEDURE TO CREATE A NEW RULE IN P-VIEWER

The input sentence is "Delhi is the capital of India". In the rule viewer, the existing rules that qualify to transform the source sentence structure into the target sentence structure are shown in the following table:
ROOT NODE   SOURCE PATTERN CHILDREN   TARGET PATTERN CHILDREN   TRANSFER LINK
VP          VB* NP                    NP VB*                    1:2 2:1
PP          IN NP                     NP IN                     1:2 2:1
'*' in the reordering rule is the wildcard character: VB* matches any label that begins with VB (VBD, VBZ, VBG, etc.).
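The wildcard match can be sketched in a couple of lines (an illustrative helper, not the P-Viewer code):

```python
def label_matches(pattern, label):
    """'VB*' matches any label beginning with 'VB'; otherwise exact match."""
    if pattern.endswith("*"):
        return label.startswith(pattern[:-1])
    return pattern == label

assert label_matches("VB*", "VBD")      # past-tense verb matches VB*
assert label_matches("VB*", "VBZ")      # 3rd-person singular verb matches too
assert not label_matches("VB*", "NN")   # a noun does not
assert label_matches("NP", "NP")        # non-wildcard labels match exactly
```

One wildcard rule such as VP (VB* NP) thus covers every tense form of the verb without duplicating the rule per POS tag.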
FIG. 5.6. P-VIEWER CREATION OF NEW RULE
The source sentence is reordered and the output is

Delhi the capital India of is
கெல்லி – ததபம் இந்தினா – இரு

By looking at the phrasal structures of the source and target language, it is easy to see that the reordered structure (see figure 5.6) is not correct and lacks one rule; the addition of one rule perfects the target phrasal structure. The rule to be added to the database is:

NP (NP PP)
NP (PP NP) || 1:2 2:1
Click the Save button to add the rule to the db; a dialog pops up saying "Rule is inserted successfully" if there isn't any problem. The addition of a rule to the db fails if the rule already exists in the db or if the user doesn't have permission to add new rules or modify existing ones. After the addition of the new rule, running the reordering gives the following output (refer figure 5.7). The source sentence is reordered and the output is

Delhi India the capital of is
கெல்லி இந்தினா – ததபம் – இரு
FIG. 5.7. P-VIEWER: OUTPUT WITH THE NEW RULE
CHAPTER 6
MAPPING OF DEPENDENCY TO MORPHOLOGICAL FEATURE

6.1 INTRODUCTION: WHY A FEATURE EXTRACTOR IS NEEDED

The translation of English to an isolating SOV language may mostly require only word reordering, with some post-manipulation: transforming the structure from the source to the target language and replacing the source words with target lexicon completes the translation for isolating SOV languages. Tamil, however, is an agglutinative, head-final language. Reordering and replacing the source words with the target lexicon leaves the target sentence with the proper word order but with incomplete word generation. For the example sentence "Ram gave him a book", the reordered tree and the lexicalization process are shown in the figure 6.1.
FIG. 6.1. FLATTENING OF TREE AND REPLACING WORDS WITH TARGET LEXICON
The output of the lexicalization is பாம் அயன் எரு புத்தம் காடு (rAm avan oru puTTakam kotu). The word order of the target language is proper, but the words are not: they are incomplete, and so they have to be synthesized to get the complete words and, in turn, the correct translation. அயன் should be அயனுக்கு, புத்தம் should be புத்தத்தத, and காடு should be காடுத்தான். The morphological synthesizer module synthesizes the root or stem word with the morphological features to form the complete word that conveys the correct meaning in the context. The morphological synthesizer requires the input word and the morphological features to synthesize. For example:

அயன் + DATIVE --> அயன் + கு --> அயனுக்கு
புத்தம் + ACCUSATIVE --> புத்தம் + --> புத்தத்தத
காடு + PAST TENSE MORPHEME + PERSON NUMBER GENDER MARKER --> காடு + த்த் + ஆன் --> காடுத்தான்

So the morphological synthesizer combines the root word with the morphological feature information to get the complete word form. But how are these morphological features obtained? Here the typed dependency information of the Stanford parser comes to the rescue (not only the Stanford parser; any dependency parser will do). The dependency relations between the words play a major role in extracting, from the source sentence, the morphological feature information needed to synthesize the complete word form. Besides the typed dependency information, the parts-of-speech tags, the parse tree and the target phrasal structure also help to extract feature information; see figure 6.2.
FIG. 6.2. MORPH FEATURE INFO TO SYNTHESIZER
6.2 STANFORD TYPED DEPENDENCIES

The Stanford typed dependencies [20] [25] are all binary relations: a grammatical relation holds between a governor and a dependent. The grammatical relations for an example sentence are given below.

Example sentence: Ram gave him a book.

Dependency relations for the above example sentence:

nsubj (gave, Ram)
iobj (gave, him)
det (book, a)
dobj (gave, book)

These dependency relations map straightforwardly onto a directed graph representation: the words in the sentence are nodes in the graph and the grammatical relations are edge labels. The figure 6.3 shows the graph representation for the example sentence above.
FIG. 6.3. DEPENDENCY TREE: RAM GAVE HIM A BOOK
det: DETERMINER
A determiner is the relation between the head of a Noun Phrase (NP) and its determiner.
det (book, a)

dobj: DIRECT OBJECT
The direct object of a verbal phrase (VP) is the noun phrase that is the (accusative) object of the verb; the direct object of a clause is the direct object of the VP that is the predicate of that clause.
dobj (gave, book)

iobj: INDIRECT OBJECT
The indirect object of a VP is the NP that is the (dative) object of the verb; the indirect object of a clause is the indirect object of the VP that is the predicate of that clause.
iobj (gave, him)
6.3 MORPHOLOGICAL FEATURE INFORMATION FROM DEPENDENCY RELATIONS
The dependency relations have three parts: the relation, the governor and the dependent. In dobj (gave, book), 'dobj' is the relation between the words 'gave' and 'book'; 'gave' is the governor and 'book' is the dependent. This dependency relation can be read as "book is the direct object of the word gave".
The relation 'dobj' is the key to finding the morphological feature. The counterpart of 'dobj' in Tamil is the accusative case-ending morpheme. The figure 6.4 shows the source language to target language dependency relations transformation.
FIG. 6.4. SL TO TL DEPENDENCY RELATIONS TRANSFORMATION
dobj (gave, book) --> dobj (காடு, புத்தம்)
புத்தம் + ACCUSATIVE --> புத்தம் + --> புத்தத்தத

In Tamil, the accusative morpheme is ''. The orthographic rule applied on the root word and the morpheme forms the complete word. The following table 6.1 shows some of the relations and the morphological information associated with each relation.
TABLE 6.1. DEPENDENCY RELATIONS

RELATION      MEANING                 MORPHOLOGICAL INFORMATION
dobj          Direct Object           ACCUSATIVE
iobj          Indirect Object         DATIVE
prep_in       Preposition (in)        LOCATIVE
prep_on       Preposition (on)        LOCATIVE
prep_during   Preposition (during)    TEMPORAL
prep_to       Preposition (to)        DATIVE
prep_with     Preposition (with)      SOCIATIVE
prep_since    Preposition (since)     ABLATIVE
rcmod         Relative Modifier       RELATIVE PARTICIPLE
agent         Agent (Eg: by police)   INSTRUMENTAL
amod          Adjectival modifier     NA
det           Determiner              NA
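The relation-to-feature mapping of table 6.1 is essentially a lookup table. A minimal sketch in Python (the function and table names are illustrative assumptions; the relation names are Stanford typed dependencies and the feature labels are the thesis's own):

```python
# Relations that carry a Tamil case/morphological feature, per Table 6.1.
RELATION_TO_FEATURE = {
    "dobj": "ACCUSATIVE",
    "iobj": "DATIVE",
    "prep_in": "LOCATIVE",
    "prep_on": "LOCATIVE",
    "prep_during": "TEMPORAL",
    "prep_to": "DATIVE",
    "prep_with": "SOCIATIVE",
    "prep_since": "ABLATIVE",
    "rcmod": "RELATIVE PARTICIPLE",
    "agent": "INSTRUMENTAL",
}

def features_for(dependencies):
    """dependencies: list of (relation, governor, dependent) triples.
    Returns {dependent: feature} for relations carrying a feature;
    amod/det and the like contribute nothing (NA in Table 6.1)."""
    return {dep: RELATION_TO_FEATURE[rel]
            for rel, gov, dep in dependencies
            if rel in RELATION_TO_FEATURE}

deps = [("nsubj", "gave", "Ram"), ("iobj", "gave", "him"),
        ("det", "book", "a"), ("dobj", "gave", "book")]
print(features_for(deps))   # {'him': 'DATIVE', 'book': 'ACCUSATIVE'}
```

For the running example, 'him' picks up the DATIVE feature and 'book' the ACCUSATIVE feature, which is exactly what the synthesizer needs to produce அயனுக்கு and புத்தத்தத.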
6.4 INFORMATION FOR VERB

The verb 'காடு' gets its tense information from the parts-of-speech tagger. For the input sentence "Ram gave him a book", the parts-of-speech tagger output is shown in the table 6.2.
TABLE 6.2. LEXICON

SOURCE WORD   PARTS OF SPEECH    TARGET WORD
Ram           NNP                பாம்
gave          VBD (Verb Past)    காடு
him           PRP                அயன்
a             DT                 எரு
book          NN                 புத்தம்
The source word 'gave' is tagged as VBD; VBD in Penn notation is the finite verb (past). So the Tamil verb 'காடு' has to be synthesized with the past tense marker. The past tense marker of the verb 'காடு' is 'த்த்'. A Person Number Gender (PNG) marker is also required to generate the finite verb.
The subject associated with the verb provides this information, via the dependency relation: nsubj (gave, Ram) --> nsubj (காடு, பாம்). The PNG feature information is stored in the lexicon db. Some of the PNG information is shown in the table 6.3.
TABLE 6.3. PERSON NUMBER GENDER FEATURE

PNG   EXPLANATION                     EXAMPLE             TAMIL VERB (DO)
3SM   3rd Person Singular Masculine   He, John            கசய்தான்
3SF   3rd Person Singular Feminine    She, Rita           கசய்தாள்
3SN   3rd Person Singular Neuter      It, Dog             கசய்தது
3SH   3rd Person Singular Honorific   He (with respect)   கசய்தார்
3P    3rd Person Plural               They                கசய்தார்ள்
1S    1st Person Singular             I                   கசய்பதன்
1P    1st Person Plural               We (Inclusive)      ாம் கசய்பதாம்
1P    1st Person Plural               We (Exclusive)      ாங்ள் கசய்பதாம்
2S    2nd Person Singular             You                 கசய்தாய்
2P    2nd Person Plural               You                 கசய்தீர்ள்
2PH   2nd Person Plural Honorific     You                 கசய்தீர்
To generate the Tamil verb, the morphological information is collected from the POS tagger, the typed dependencies and the lexicon, as shown in figure 6.5.
FIG. 6.5. FEATURE EXTRACTION
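The extraction sketched in figure 6.5 can be illustrated in a few lines. The following is a minimal Python sketch (the thesis modules themselves are implemented in Java), with a hypothetical PNG lexicon and our own function names; it builds the morph-feature string the synthesizer expects, such as gave~V~PAST~3SM.

```python
# Hypothetical PNG lexicon: subject word -> person-number-gender code.
PNG_LEXICON = {"Ram": "3SM", "Rita": "3SF", "dog": "3SN"}

# Penn POS tags mapped to the tense feature used in this chapter.
TENSE_BY_TAG = {"VBD": "PAST", "VBZ": "PRES", "VB": "FUT"}

def verb_features(verb_tag, dependencies):
    """Return the feature suffix for a finite verb, e.g. '~V~PAST~3SM'."""
    tense = TENSE_BY_TAG.get(verb_tag, "PRES")
    # The subject of the verb comes from the nsubj relation.
    subject = next(dep for rel, gov, dep in dependencies if rel == "nsubj")
    png = PNG_LEXICON.get(subject, "3SM")
    return "~V~" + tense + "~" + png

deps = [("nsubj", "gave", "Ram"), ("dobj", "gave", "book")]
print("gave" + verb_features("VBD", deps))   # gave~V~PAST~3SM
```

The same feature suffix is later attached to the Tamil lemma rather than the English word.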
6.5 COMPUTING AUXILIARY INFORMATION

Tamil does not have an equivalent for the English auxiliary; instead, the auxiliary information is glued to the root word to become a single entity. An aux plus participle verb form in English such as „He is doing‟ is translated as „கசய்து காண்டிருக்ிான்‟. The Tamil equivalent of the word „do‟ is „கசய்‟, and „is doing‟ is the present continuous form of the verb „do‟. For synthesizing the Tamil word, the present continuous morph information is required along with the PNG information.
கசய் + PRESENT CONTINUOUS + 3SM
கசய் + VERBAL PARTICIPLE + PROGRESSIVE MARKER + PRESENT TENSE MARKER + 3SM
கசய் + து + காண்டிரு + க்ிற் + ஆன்
கசய்து காண்டிருக்ிான் (DELETE K)

The typed dependency relations for the input sentence "He is doing" are [nsubj(doing-3, He-1), aux(doing-3, is-2)]. The „present continuous‟ information is not provided by the dependency relations; nevertheless, it is known from them that the word „is‟ is the auxiliary of the present participle verb „doing‟. This information alone does not suffice to synthesize the word. POS tagging helps to determine that the phrase „is doing‟ is the present continuous form of the verb „do‟.

INPUT SENTENCE: He is doing
POS TAGGED: He/PRP is/VBZ doing/VBG
DEPENDENCY INFO: [nsubj(doing-3, He-1), aux(doing-3, is-2)]

Whenever the relation is aux, the auxiliary information is computed using a recursive procedure: all the aux forms are combined into one string. From this aux-info string and the POS category of the governor word (the participle form of the verb), the auxiliary and tense information for synthesizing the word is determined. The aux-tense information for some combinations of auxiliary and participle-verb POS category is shown in table 6.4.
TABLE 6.4. AUXTENSE-MORPH FEATURE LOOKUP

POS CATEGORY      AUXILIARY   AUX-TENSE FORM   INPUT FORMAT FOR
OF PARTICIPLE                                  MORPH SYNTHESIZER
VBZ               -           PRES             PRES
VBD               -           PAST             PAST
VB                WILL        FUT              FUT
VBG               IS          PRES CONT        PAST~VP~PROG~PRES
VBG               WERE        PAST CONT        PAST~VP~PROG~PAST
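The lookup in table 6.4 amounts to a simple dictionary keyed on the participle's POS tag and the combined auxiliary string. A Python sketch under those assumptions (the real modules are in Java, and the function name is ours; an empty auxiliary string stands for "no auxiliary"):

```python
# Keys: (POS of participle, combined auxiliary string); values follow
# table 6.4. An empty auxiliary string means no auxiliary precedes the verb.
AUXTENSE = {
    ("VBZ", ""):     "PRES",
    ("VBD", ""):     "PAST",
    ("VB",  "WILL"): "FUT",
    ("VBG", "IS"):   "PAST~VP~PROG~PRES",
    ("VBG", "WERE"): "PAST~VP~PROG~PAST",
}

def aux_tense(pos_of_participle, aux_words):
    """Combine the aux forms into one string and look up the morph info."""
    aux_string = " ".join(w.upper() for w in aux_words)
    return AUXTENSE[(pos_of_participle, aux_string)]

# "He is doing": doing/VBG with aux(doing-3, is-2)
print(aux_tense("VBG", ["is"]))   # PAST~VP~PROG~PRES
```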
6.6 HANDLING COPULA SENTENCES

The dependency information for a copula sentence is different: it does not provide direct information like a subject-verb pair. Tamil does not have the linking verb, so English copula sentences are translated differently based on the tense of the linking verb. The dependency relations for the input sentence "She is beautiful" are [nsubj(beautiful-3, She-1), cop(beautiful-3, is-2)], where the subject-verb pair is missing; instead, a relation between the complement „beautiful‟ and the subject exists. In contrast, for a normal sentence like "She gave a book", the dependency relations are [nsubj(gave-2, She-1), det(book-4, a-3), dobj(gave-2, book-4)], where the subject-verb pair is clearly defined: „She‟ is the subject of the verb „gave‟, whereas in the copula case „She‟ is the subject of the link word „beautiful‟.

With a simple procedure, the copula verb-subject pair is determined for linking-verb sentences. From the available dependency information [nsubj(beautiful-3, She-1), cop(beautiful-3, is-2)] for the input sentence "She is beautiful", the procedure determines that the subject of the linking verb „is‟ is „She‟. Then the PNG marker for the word „She‟ is glued along with the tense marker to synthesize the equivalent of the English copula verb. Although there is no equivalent for most linking-verb cases, for ease of translation the Tamil word „இரு‟ is considered as the equivalent, compromising the fluency of the target sentence construction. The more fluent translation of the sentence "She is beautiful" is "அமாயள்". For the input sentence "She was beautiful", the translation is "அயள் அமா இருந்தாள்". In the former case the target sentence has no equivalent of the English copula; instead, the equivalent of the adjective „beautiful‟ (அமா) and the equivalent of the subject „she‟ (அயள்) are synthesized to form the noun "அமாயள்". In the latter case the sentence construction is totally different.
The sentence is translated as "அயள் (She) அமா (beautifully) இருந்தாள் (finite verb)". The adjective "beautiful" becomes an adverb, and the word „was‟ is replaced with the finite verb „இரு‟, synthesized with the tense and PNG markers. Instead of handling the various linking verbs separately, a general method is employed for word reordering and morphological information extraction. Hence the sentence "she is beautiful" outputs "அயள் அமா இருக்ிாள்" (not fluent, but not a bad translation either).
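The copula subject-determination procedure described above can be sketched as follows. This is an illustrative Python fragment (not the system's Java code), and the tuple representation of the relations is our own.

```python
def copula_subject(dependencies):
    """Return (linking_verb, subject) for a copula sentence, else None."""
    cop = next(((gov, dep) for rel, gov, dep in dependencies if rel == "cop"), None)
    if cop is None:
        return None                        # not a copula sentence
    complement, linking_verb = cop
    # The nsubj of the complement is taken as the subject of the linking verb.
    subject = next(dep for rel, gov, dep in dependencies
                   if rel == "nsubj" and gov == complement)
    return linking_verb, subject

deps = [("nsubj", "beautiful", "She"), ("cop", "beautiful", "is")]
print(copula_subject(deps))   # ('is', 'She')
```

The PNG marker of the returned subject is then glued, together with the tense marker, onto „இரு‟.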
6.7 COPULA: MORE ISSUES

Compare the two sentences and their translations.
A) She/PRP is/VBZ beautiful/JJ → அயள்(She) அமா(beautiful) இருக்ிாள்(is)
B) She/PRP is/VBZ a/DT beautiful/JJ girl/NN → அயள்(She) எரு(a) அமா(beautiful) கண்(girl)
In sentence A, the word „beautiful‟ (adjective) changes its POS category to adverb, whereas in sentence B the POS category of the word „beautiful‟ (adjective) remains the same. In both cases, the equivalent word for „beautiful‟ is either synthesized from the root word „அமகு‟ (beauty) or fetched directly from the lookup table. This category change occurs in copula sentences; therefore, a specific heuristic is necessary to identify this kind of sentence and treat it differently. The synthesis varies depending on the sentence type: for the copula sentence the morph information for word generation is ADVZ (adverbialization), and for the other case the morph information is ADJZ (adjectivization).
A) அமகு + ADVZ → அமகு + ஆ → அமா
B) அமகு + ADJZ → அமகு + ஆ → அமா
6.8 HANDLING DATIVE VERBS

Sentences with dative verbs like „have‟, „has‟, etc. need a different treatment during morphology feature extraction from the dependencies, and also in the reordering. The sentence "He has two books" is translated into Tamil as "இபண்டு புத்தங்ள் அயிெம் இருக்ின்".
With the general reordering rule, the above sentence gets reordered as shown in figure 6.6. In the reordered tree „books‟ is an object, whereas in the target language the object „books‟ becomes the subject.

nsubj(has, He) → nsubj(has, books)
num(books, two) → num(books, two)
dobj(has, books) → possession(He)

The phrasal structure of the Tamil sentence "இபண்டு புத்தங்ள் அயிெம் இருக்ின்" is shown in figure 6.7. This phrasal structure is not the same as the reordered tree, so a specific reordering rule is required for sentences having dative verbs. The morphological information cannot be extracted directly from the dependency relations; the dative-verb-type morph-extractor function does the job. It swaps the subject and object positions and marks the relation between the verb and the subject that turned into an object. The general morphological-information extractor function then extracts the proper feature information from the typed dependencies.
FIG. 6.6. REORDERING: SENTENCE WITH POSSESSIVE VERB
FIG. 6.7. PHRASAL STRUCTURE (HE HAS TWO BOOKS)
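The subject/object swap performed by the dative-verb-type morph extractor can be sketched as below. This is one illustrative way to realize the swap in Python (the verb list and function name are assumptions, not the system's actual code).

```python
# Verbs treated as possessive/dative; illustrative list.
POSSESSIVE_VERBS = {"have", "has", "had"}

def swap_for_dative(dependencies):
    """Swap the roles of nsubj/dobj when the governing verb is possessive."""
    out = []
    for rel, gov, dep in dependencies:
        if gov in POSSESSIVE_VERBS and rel == "nsubj":
            out.append(("possession", gov, dep))   # He -> possessor
        elif gov in POSSESSIVE_VERBS and rel == "dobj":
            out.append(("nsubj", gov, dep))        # books -> subject
        else:
            out.append((rel, gov, dep))
    return out

deps = [("nsubj", "has", "He"), ("num", "books", "two"), ("dobj", "has", "books")]
print(swap_for_dative(deps))
```

After the swap, the general morphological-information extractor can run unchanged.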
6.9 HANDLING MULTIPLE SUBJECTS

The typed dependencies for a sentence that has multiple subjects do not suffice to extract all the subject-verb pair information. The typed dependencies for the sentence "John and Mary presented a car" are [nsubj(presented-4, John-1), conj_and(John-1, Mary-3), det(car-6, a-5), dobj(presented-4, car-6)]. The typed dependency relations provide only one subject-verb pair. For synthesizing the Tamil verb „யமங்கு‟ with the tense and PNG morphemes, all the subject-verb pairs in the sentence have to be determined. The PNG marker varies for multiple subjects: nsubj(presented, John) helps to find the PNG feature of „John‟, which is „3SM‟, but for the multiple subjects „John‟ and „Mary‟ the PNG marker to be synthesized is „3rd person plural‟.

யமங்கு + PAST TENSE MARKER + 3RD PERSON PLURAL
யமங்கு + இன் + ஆர்ள்
யமங்ிார்ள்
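The multiple-subject handling above can be sketched as follows: subjects coordinated through conj_and are collected, and more than one subject forces the third-person-plural PNG. Illustrative Python (function names are ours; the system is implemented in Java).

```python
def collect_subjects(dependencies):
    """Gather the nsubj word plus anything conj_and-coordinated with it."""
    subjects = [dep for rel, gov, dep in dependencies if rel == "nsubj"]
    for rel, gov, dep in dependencies:
        if rel == "conj_and" and gov in subjects:
            subjects.append(dep)
    return subjects

def png_for(subjects, png_lexicon):
    """Multiple subjects -> 3rd person plural; else look up the single subject."""
    if len(subjects) > 1:
        return "3P"
    return png_lexicon.get(subjects[0], "3SM")

deps = [("nsubj", "presented", "John"), ("conj_and", "John", "Mary"),
        ("det", "car", "a"), ("dobj", "presented", "car")]
subjects = collect_subjects(deps)
print(subjects, png_for(subjects, {"John": "3SM", "Mary": "3SF"}))
```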
CHAPTER 7
7 MORPHOLOGICAL ANALYZER AND SYNTHESIZER

7.1 NEED OF SYNTHESIZER FOR MACHINE TRANSLATION

A lemma of any agglutinative language can have thousands of inflections. The source language lemma has to be replaced with the target language lemma for translating the source sentence into the target language. Having a lookup table mapping the source lemma to the target lemma is not feasible for agglutinative languages, where every lemma has numerous word forms. The prepositions of English do not have equivalent isolated lemmas in Tamil; instead, a preposition is transformed into a postposition or case ending. These case endings are not isolated: they are glued to the lemma preceding them. In the agglutinative language Tamil, verb lemmas are synthesized with morphemes, obtained from the morphological information extracted from the source sentence, to form the inflected form of the lemma. Synthesizing a Tamil verb requires the subject information of the verb, which has to be computed from the English sentence; it is therefore impossible to store all the inflected forms without the subject-verb pair feature information, and storing all the inflected forms in a lookup table is not a feasible solution. Here the morphological synthesizer comes to the rescue. A sample of the behaviour of the verb „கசய்‟ (do) and its inflections is shown in the list below; this is limited to present, past and future tense and two different subjects. It has been estimated that a Tamil verb has more than three thousand inflected forms.

He did
avan ceyTAn
He does
avan ceykiRAn
He will do
avan ceyvAn
She did
avaL ceyTAL
She does
avaL ceykiRAL
The complication of storing the verb „do‟ and its Tamil counterpart „கசய்‟ is that at the time of creating the lookup table the subject-verb pair information is unknown, and storing all the forms is laborious and practically impossible. Nevertheless, the verb can be generated on the fly during the translation process.

He did
அயன் கசய் + past tense + PNG feature of the subject „He‟
அயன் கசய் + த் + ஆன்
அயன் கசய்தான்

Consider the following English sentence and its Tamil counterpart:

The son of my friend is a medical doctor
ன்னுதென ண்ின் நன் எரு நருத்துயர்

The phrase „of friend‟ is translated to „ண்ின்‟. Table 7.1 shows a few of the inflected forms of the word „ண்ன்‟ (friend) and their equivalent counterparts in English; this shows why storing all the forms in a lookup table is tedious.
TABLE 7.1. NOUN INFLECTIONS

LEMMA + MORPH FEATURE    WORD + MORPHEME   SYNTHESIZED FORM   ENGLISH EQUIVALENT
ண்ன் + Possessive        ண்ன் + இன்        ண்ின்             of friend
ண்ன் + Dative            ண்ன் + கு         ண்னுக்கு           to friend
ண்ன் + Benefactor        ண்ன் + ா          ண்னுக்ா           for friend
ண்ன் + Sociative         ண்ன் + ஏடு        ண்பாடு            with friend
ண்ன் + Accusative        ண்ன் +            ண்த              friend (accusative)
7.2 WHY MORPHOLOGICAL ANALYZER

What is the role of the morphological analyzer in the English-to-Tamil translation system? The answer: none directly. The framework used for developing the morphological synthesizer does the analysis task with minimal modification, and the same heuristics are used for the morph synthesizer and analyzer. In fact, the morphological synthesizer is the reverse process of the morphological analyzer, so the development of the synthesizer parallels that of the morphological analyzer. The approach used for developing the morphological analyzer and synthesizer is explained in the forthcoming sections.
7.3 INTRODUCTION [12]

Morphology deals, primarily, with the structure of words. Morphological analysis detects, identifies, and describes the meaningful constituent morphs in a word, which function as building blocks of a word [26]. More on Tamil morphology is in the Appendix. The densely agglutinative Dravidian languages such as Tamil, Malayalam, Telugu, and Kannada display a unique structural formation of words by the addition of suffixes representing various senses or grammatical categories after the roots or stems. The senses such as person, number, gender, and case are linked to a noun stem in an orderly formation. Verbal categories such as transitive, causative, tense and person, number and gender are added to a verbal root or stem. The morphs representing these categories have their own slots behind the roots or stems. The highly complicated nominal and verbal morphology does not stand alone; it regulates the direct syntactic agreement between the subject and the predicate. Another important aspect of the addition of morphs is the changes that often take place in the space between these morphs and within a stem. A morphological analyzer and synthesizer should take care of these changes while assigning a suitable morph to the correct slot to generate a word. The combination of sense and form in a morph, and the possibility of identifying the governing rules, are the incentives to attempt to build an engine which can automatically analyse and generate, mirroring the same processes taking place in the brain of a native speaker.
[12] Excerpt from our published work [26].
The slots behind the root/stem can be filled by many morphs. The rules governing the order of the morphs in a word, and the selection of the correct morph for the correct slot, should be formulated for analysis and synthesis. The inflections and derivations are not the same for all nouns and verbs. The biggest challenge is grouping nouns and verbs in such a way that the members of the same group have similar inflections and derivations; otherwise, one has to make rules for each noun and verb, which is not feasible. The most difficult slot in a verb is the one that follows the verb root/stem. This position is occupied by the suffixes belonging to the category transitive. The elusive behaviour of these suffixes poses many problems, and most of the earlier morphological analyzers did not handle this problem adequately. Our system, as mentioned earlier, works on rules, and these rules are capable of resolving this clumsiness in an elegant manner. Many changes take place at the boundaries of morphs and words. Identifying the rules that govern these changes is a challenge because dissimilar changes take place in similar contexts; in such cases, it is necessary to look into the phonological as well as morphological factors that induce such changes. The designed system involves building an exhaustive lexicon for noun, verb, and other categories. The performance is directly related to this exhaustiveness, and building it is a laborious task. A Finite State Transducer (FST) is used for the morphological analyzer and generator [27]; we have used the AT&T Finite State Machine toolkit to build this tool. An FST maps between two sets of symbols. It can be used as a transducer that accepts the input string if it is in the language and generates another string on its output. The system is based on a lexicon and orthographic rules from a two-level morphological system.
For the morphological generator, if the string that has the root word and its morphemic information is accepted by the automaton, then it generates the corresponding root word and morpheme units in the first level. The output of the first level becomes the input of the second level, where the orthographic rules are handled; if it is accepted there, the inflected word is generated.
7.4 MORPH ANALYZER AND SYNTHESIZER – SYSTEM ARCHITECTURE
The simplified version of the system architecture is shown in figure 7.1. The practical system has a combination of both lexicon and lexicon-less models. The lexicon-less model serves as a fail-safe when the input lexeme is not present in the lexical model. A newly tested lexeme is appended to the lexicon list automatically in the case of nouns; in the case of verbs, minimal human intervention is required to classify the verb paradigm.
FIG. 7.1. MORPHOLOGICAL ANALYZER AND SYNTHESIZER - BLOCK DIAGRAM
7.4.1 NOUN LEXICON
Currently, the Tamil noun lexicon has around 100,000 entries. In the morphology literature it is shown that nouns can be categorized into a finite number of paradigms; however, the developed system uses a lexeme-based approach (Item and Process). The synthesized word is the result of applying rules to the lexeme that change the root/stem to form a new word.
7.4.2 MORPHOTACTICS MODEL
The order of the morphemes and their positions are restricted by a set of rules. These rules are based on the noun or verb structure of the language; the structure of Tamil nouns and verbs is explained in greater depth in the Tamil morphology section. In the lexicon-based model, the transducer of the noun lexicon (Lexicon.fst) is concatenated with the morphotactics rule (Morphotactics.fst). An example for the Tamil noun „ammA‟ after concatenating lexicon.fst with morphotactics.fst is given in figure 7.2.
FIG. 7.2. TRANSDUCER FOR MORPHOTACTICS RULE
In the lexicon-less model, there is no defined restriction on what precedes the morphotactics model; see figure 7.3.
FIG. 7.3. TRANSDUCER FOR MORPHOTACTICS RULE - LEXICON LESS MODEL
7.4.3 ORTHOGRAPHIC MODEL
The spelling rules of nouns are handled in the noun orthographic rules and compiled to create a transducer model. Every rule is an FST model, and each of them is composed with the morphotactics model to create the noun morph-analyzer transducer; refer to figure 7.4.
FIG. 7.4. TRANSDUCER FOR ORTHOGRAPHIC RULE FOR LEXICONLESS MODEL
The model is then ready for testing. Swapping the input and output symbols in the transducer model of the noun analyzer serves as the synthesizer model; the „fsminvert‟ command swaps the input and output symbols. For details, refer to the FSM toolkit commands topic. The „fsmunion‟ command performs the union operation on the given input transducer models. The remaining blocks in the system diagram are self-explanatory; the process for verbs is much the same as for nouns.
7.5 BUILDING A SIMPLIFIED MORPH ANALYZER AND SYNTHESIZER

A simplified version of the morphological analyzer and synthesizer (MAS) for Tamil, for a limited set of nouns and the noun inflections of the plural marker and four case markers, is elucidated in this section, which gives the step-by-step procedure for developing it.
STEP-1: REPRESENTING THE MORPHOLOGY AS A FSM
Come up with the finite-state representation of the morphology of the individual categories (such as nouns, verbs, etc.). For the simplified analyzer it would look like figure 7.5.
FIG. 7.5. TAMIL NOUNS: FSM REPRESENTATION
Note that this is just a partial representation of Tamil nouns, and it does not include dedicated states for the morphophonemic changes popularly known as „Sandhi‟ (the orthographic rules).
STEP-2A: IDENTIFICATION OF PARADIGMS
Five nouns have been selected for the simplified analyzer. Note that the paradigm approach is explained here for simplicity, even though the system uses a hybrid of lexeme-based and word-based (paradigm) morphology.
maram (tree)
karam (arm)
siram (head)
kaN (eye)
maN (sand/ earth)
It can be noted that the first three words rhyme, as do the last two. Interestingly, words ending in similar sounds behave in the same way morphologically: the first three words always inflect in the same way when some lexical information (such as a plural or case marker) is added to them. These individual groups are called „paradigms‟ in the literature. The task for the linguist is to identify the different paradigms in a language and come up with a representative example for each paradigm.
STEP-2B: AUTOMATIC CLASSIFICATION OF THE NOUNS/ VERBS INTO PARADIGMS
Once the linguists identify the paradigms, a script can be written to automatically classify the words in the lexicon into one of the identified paradigms. This can be done relatively easily (with some error) by looking at the word endings.
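The ending-based classification of step 2b can be sketched in a few lines. The paradigm names and the two endings below are illustrative, chosen for the five sample nouns; a real classifier would use a fuller ending table.

```python
# Longest-match suffix table: ending -> paradigm (illustrative).
PARADIGM_BY_ENDING = {"am": "maram-paradigm", "N": "kaN-paradigm"}

def classify(noun):
    """Assign a noun to the paradigm of its longest matching ending."""
    for ending in sorted(PARADIGM_BY_ENDING, key=len, reverse=True):
        if noun.endswith(ending):
            return PARADIGM_BY_ENDING[ending]
    return None                           # unclassified; needs human review

for word in ["maram", "karam", "siram", "kaN", "maN"]:
    print(word, classify(word))
```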
STEP-3: BUILDING THE FST FOR THE INDIVIDUAL CATEGORIES:
It is better to build the FST for the individual categories and then compose them into a bigger FST using the commands available in the toolkit.
7.5.1 FILE FORMAT FOR FSM TOOLKIT [13]

The FSM toolkit [14] takes acceptor files in a space/tab separated format of five columns (the last column is optional, for weights). The first and second columns mark the source and destination states of a specific transition; the next two columns mark the input symbol and output symbol of the transition. Each transition has to be on a separate line. The lines below give an example of the transduction of the word „maram‟.

[13] Excerpt: A Toy Morphological Analyzer – handout from Loganathan Ramasamy, 2009, Amrita Vishwa Vidyapeetham.
[14] OpenFST and FSM Toolkit tutorials: http://www2.research.att.com/~fsmtools/fsm/tut.html, http://www.openfst.org/twiki/bin/view/FST/FstExamples
0	1	ma	ma
1	2	ra	ra
2	3	m	m
3

Note that individual Tamil aksharas (for example „ma‟) are transduced in every state instead of the entire word in one transition. This makes it easier to implement morphophonemic changes in the word, which usually change the last akshara. The last line gives the end state and hence has just one column. In addition, we need to create two files for the symbols used on the input and output sides of the FST, such that each symbol is given a unique index. These files have two columns giving the symbol and its unique identifier. For the above case the input and output files will be the same:

ma	1
ra	2
m	3
However, in a real scenario these files will be different, since the output side will have symbols for the categories and lexical information (such as N, PL, ACC, etc.). It is nevertheless a good idea to use the same identifier for common symbols. In building a morph analyzer and synthesizer, the creation of these three files should be automated to the extent possible.
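Automating the creation of these files, as suggested above, is straightforward. The sketch below (Python; the akshara list is supplied by hand here, since proper akshara segmentation is outside its scope) emits the acceptor lines and the symbol table in the toolkit's tab-separated format.

```python
def acceptor_lines(units):
    """One transition line per unit, plus the final state on its own line."""
    lines = ["%d\t%d\t%s\t%s" % (i, i + 1, u, u) for i, u in enumerate(units)]
    lines.append(str(len(units)))         # end state has a single column
    return lines

def symbol_table(units):
    """Each distinct symbol gets a unique 1-based index."""
    table, seen = [], {}
    for u in units:
        if u not in seen:
            seen[u] = len(seen) + 1
            table.append("%s\t%d" % (u, seen[u]))
    return table

units = ["ma", "ra", "m"]                 # 'maram' split into aksharas
print("\n".join(acceptor_lines(units)))
print("\n".join(symbol_table(units)))
```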
STEP-4: Once the data files are created as detailed above, use the fsmcompile command (commands are explained below) to generate the binary form of the FST. This FST is then reversed (swapping the input/output symbols makes it a synthesizer or analyzer, depending on which side is which) and determinized again by the commands. This completes the building of the transducer, which is then ready for analysis/synthesis.
STEP-5: Now a word is tested by passing it through the transducer, which outputs the path traversed by the word. However, the FSM toolkit does this in a slightly circuitous way:

1. Create the acceptor file for the word and then compile it using the fsmcompile command. For example, the acceptor file for the word „cirattil‟ (on the head) will look like:

0	1	ci
1	2	ra
2	3	tt
3	4	il
4

2. Compose this fsm with the fst and then project it onto the output side (again, commands are given below).
3. The output will be another fsm file giving the analysis along with the categories.
7.5.2 FSM TOOLKIT COMMANDS

For any command in the toolkit, "fsmcommand -?" gives decent help for that command.

1. To compile an acceptor/transducer file into an FSA/FST:
   a. fsmcompile –t –i input_symbols.txt –o output_symbols.txt –F nouns.fst nouns_acceptor.txt
   b. fsmdraw –i input_symbols.txt –o output_symbols.txt nouns.fst | dot –Tjpg > nouns.jpg
   c. Here –t and –o are for transducers; drop them for acceptors.
   d. The –F option specifies the output file name to store the fsm/fst binary.
   e. The last argument takes the name of the transition file.
2. Reverse, remove epsilon transitions and determinize the fst:
   a. fsmreverse nouns.fst | fsmrmepsilon | fsmdeterminize -F nouns_rev.fst
3. To print the fst in human-readable form:
   a. fsmprint -o output_symbols.txt -i input_symbols.txt nouns.fst
   b. Omitting the –i and –o switches will print the data with indices instead of symbols.
4. Similarly compile the acceptor file to be used for testing the morph analyzer (this file uses the word „cirattil‟ (on the head) for testing):
   a. fsmcompile –i input_symbols.txt –F test.fsm test_set.txt
   b. Notice that the –t and –o switches are not used.
5. Reverse, remove epsilon transitions and determinize the fsa:
   a. fsmreverse test.fsm | fsmrmepsilon | fsmdeterminize -F test_rev.fsm
6. Now analyze the word by composing its fsm with the fst and projecting onto the output side:
   a. fsmcompose test_rev.fsm nouns_rev.fst | fsmproject -o -F result.fsm
   b. This creates the file result.fsm, which has the analysed output.
7. Print the results in human-readable form:
   a. fsmprint -i output_symbols.txt result.fsm
   b. Note that the –i switch actually takes output_symbols.txt. This is because we have projected the fsm onto the output side, which now becomes the input side of the fsm.
7.5.3 FST MODEL FOR MORPHOTACTICS RULE OF NOUN (SIMPLIFIED VERSION)
The possible inflections of the noun (simplified version) are given below; refer to figure 7.6 for the FST model of the noun transducer.

Root/Stem + Plural
Root/Stem + Plural + Case
Root/Stem + Case
Root/Stem + Oblique
Root/Stem + Oblique + Case
Root/Stem + Oblique + Plural + Case
Root/Stem + Interrogation Marker
Root/Stem + Plural + Interrogation Marker
Root/Stem + Oblique + Interrogation Marker
Root/Stem + Oblique + Plural + Interrogation Marker
Root/Stem + Conjunction Marker
Root/Stem + Plural + Conjunction Marker
Root/Stem + Oblique + Conjunction Marker
Root/Stem + Oblique + Plural + Conjunction Marker
FIG. 7.6. TRANSDUCER FOR TAMIL NOUN INFLECTION
7.5.4 FST MODEL FOR ORTHOGRAPHIC RULE

The (v/y) insertion rule model is shown in figure 7.7. The „v‟ insertion happens when a word that ends with a vowel other than „i‟, „I‟, „ai‟, „e‟ and „E‟ glues with a morpheme that begins with a vowel.

Example: amma + Al → ammavAl

The „y‟ insertion occurs when a word that ends with „i‟, „I‟, „ai‟, „e‟ or „E‟ glues with a morpheme that begins with a vowel.

Example: alai + Al → alaiyAl
FIG. 7.7. TRANSDUCER FOR V/Y INSERTION RULE
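The v/y insertion rule, written as an FST in figure 7.7, can also be expressed directly in code. The sketch below uses the romanization of the text; it models only this one sandhi rule and leaves consonant-final stems untouched.

```python
# Endings that trigger 'y' insertion before a vowel-initial morpheme.
FRONT_ENDINGS = ("i", "I", "ai", "e", "E")
VOWELS = "aAiIuUeEoO"

def glue(stem, morpheme):
    """Join stem + morpheme, inserting the v/y glide where the rule applies."""
    if morpheme[0] not in VOWELS:
        return stem + morpheme            # morpheme starts with a consonant
    if stem.endswith(FRONT_ENDINGS):
        return stem + "y" + morpheme      # y insertion
    if stem[-1] in VOWELS:
        return stem + "v" + morpheme      # v insertion
    return stem + morpheme                # consonant-final stem: no insertion

print(glue("ammA", "Al"))   # ammAvAl
print(glue("alai", "Al"))   # alaiyAl
```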
7.5.5 TWO LEVEL MORPHOLOGY WITH AN EXAMPLE WORD

For the Tamil noun „ammA‟, the morphotactics rule and orthographic rule transducer models are shown in figure 7.8.

ammA + N + INS → ammA ^ Al → ammAvAl

In the first stage, ammA + N + INS becomes ammA ^ Al; „^‟ indicates the morpheme boundary. „INS‟ is the instrumental case and „Al‟ is the instrumental case marker. In the second level, the spelling rule is applied: since the word „ammA‟ ends with „A‟ and the morpheme to be glued begins with a vowel (in this case „Al‟), ammA ^ Al is accepted by the „v insertion rule‟ transducer, which generates the output ammAvAl.
FIG. 7.8. TRANSDUCER FOR MORPHOTACTICS RULE
In figure 7.9, the output symbol corresponding to the input symbol „+N‟ is „empty‟. In the real application it is epsilon; this epsilon (or „empty‟) has to be removed using the fsm epsilon-removal command. For more details, see the FSM manual section.
FIG. 7.9. TRANSDUCER FOR ORTHOGRAPHIC / SPELLING RULE
Note that the string „empty‟ remains in the fst model, which is not the case in the real application; for simplicity it is kept as such. The model is now complete, so let us do some testing. The above figures are just a small part of the bigger picture of the model that can be shown here: imagine zooming into the bigger fst model to view one particular rule and its application for the noun „ammA‟. The input word for testing the synthesizer is „ammA‟; its finite-state representation is shown in figure 7.10.
FIG. 7.10. INPUT WORD IN FINITE-STATE REPRESENTATION
The input FSA is composed with the morphotactics rule transducer, and the intermediate-stage finite-state representation is shown in figure 7.11.
FIG. 7.11. INTERMEDIATE STAGE FSA
Further composition of the intermediate finite-state model with the orthographic rule gives the synthesized output. The synthesized finite-state model is shown in figure 7.12.
FIG. 7.12. FST FOR SYNTHESIZED WORD
For simplicity, the input FSA is composed with the lexical transducer and then the intermediate result with the orthographic transducer; in the real case, however, the model is created first by composing the lexical transducer with the orthographic transducer, and the input finite state is then composed with the created model. The flow graph of the morphological synthesizer is shown in figure 7.13.
FIG. 7.13. FLOW GRAPH OF MORPH SYNTHESIZER
For the morphological analyzer, swapping the input and output symbols suffices; the swapping can be done using the „fsminvert‟ command.
CHAPTER 8
8 EXPERIMENTS AND RESULTS

Most of the modules of the translation system are implemented in Java. The implemented modules include syntax reordering, dependency-to-morphological-feature mapping, the morphological synthesizer/analyzer for Tamil, and the transliteration module for English-to-Tamil transliteration. The Stanford parser is integrated with the system to POS-tag and parse the English sentence. The system can be scaled up by developing enough resources, which include English-Tamil pattern-based reordering rules, the English-Tamil transfer lexicon, rules for mapping the typed dependency relations to morphological features, and morphotactics and orthographic rules for Tamil nouns and verbs.
8.1 DATA FORMATS

This section explains the format of the databases used in the system. Understanding the data formats is vital for running system experiments: missing or invalid resources may crash the system. The system uses the following databases: transfer rules, dependency-to-morph mapping, aux-tense-to-morph-feature mapping, and the noun, verb, adjective, adverb, pronoun, preposition and general (other POS categories) transfer lexicons.
8.2 TRANSFER RULES

The format of the reordering rules is explained in detail in the reordering rules chapter. The four columns in the db, namely CurNode, Source, Target and TransferLink, are the root node of the source/target pattern rule, the children of the source rule pattern, the children of the target rule pattern, and the transfer link that maps the nodes between the source and target rules, respectively. A screenshot of the reordering rules db is shown in figure 8.1.
FIG. 8.1. TRANSFER RULES DB
8.3 DEPENDENCY TO MORPH MAPPING

The first column of the db is the relation between two words, and the second is the features to be synthesized along with the stem of one of the words. Consider the example sentence "Ram gave a book to him". The dependency relations given by the parser are [nsubj(gave-2, Ram-1), det(book-4, a-3), dobj(gave-2, book-4), prep_to(gave-2, him-6)]. In the db shown in figure 8.2, prep_to is mapped to ~N~DAT. The target lexicon entry for the word „him‟ in transliterated form is „avan‟. Remember that the transfer lexicon has only the root/stem form of the English word and its equivalent. The word „avan‟ gets the morphological feature information „~N~DAT‟; the synthesizer requires this information to synthesize the complete word form.

avan~N~DAT → avan + ku → avannukku
FIG. 8.2. DEPENDENCY-MORPH FEATURE DB
8.4 AUXTENSE TO MORPH MAPPING

The auxiliary and tense are computed based on the lookup table shown in figure 8.3. The columns Pos, Aux, AuxTense and MorphInfo are, respectively, the POS category of the verb for which the aux-tense information is computed, the auxiliary verb (if any) that precedes the verb, the aux-tense form, and the morph info required by the synthesizer.
FIG. 8.3. AUXTENSE-MORPH FEATURES DB
8.5 NOUN LEXICON

Figure 8.4 shows the db format of the noun lexicon. The fourth column, „Feature‟, has the person-number-gender information, and the fifth column has synonyms (if any) of the source word; in case of multiple synonyms, the words are separated by commas.
FIG. 8.4. NOUN LEXICON
The verb, adjective, adverb, preposition and general lexicons have the same db format as the noun lexicon.
8.6 TRANSLATION: STEP-BY-STEP PROCESS

Step 1: The source sentence is passed to the Stanford parser. For the input sentence "Ram gave a book to him", the parse tree in bracketed representation is as follows:

(S (NP (NNP Ram)) (VP (VBD gave) (NP (DT a) (NN book)) (PP (TO to) (NP (PRP him)))))

The typed dependency output from the parser is as follows:

[nsubj(gave-2, Ram-1), det(book-4, a-3), dobj(gave-2, book-4), prep_to(gave-2, him-6)]

Step 2: The parse tree is passed to the reordering module. The output of the reordering module is as follows:

(S (NP (NNP Ram)) (VP (NP (DT a) (NN book)) (PP (NP (PRP him)) (TO to)) (VBD gave)))

Step 3: The typed dependency output is fed to the morph feature extraction module. The intermediate stage and the output of this module are shown below:

Rel: [nsubj, det, dobj, prep_to]
Gov: [gave-2, book-4, gave-2, gave-2]
Dep: [Ram-1, a-3, book-4, him-6]

(S (NP (NNP Ram)) (VP (NP (DT a) (NN book ~N~ACC)) (PP (NP (PRP him ~N~DAT)) (TO to)) (VBD gave~V~PAST~3SM)))

Step 4: The English words are translated to Tamil, and flattening the tree gives the reordered sentence:

(S (NP (NNP பாம்)) (VP (NP (DT எரு) (NN புத்தம்~N~ACC)) (PP (NP (PRP அயன்~N~DAT)) (TO -)) (VBD காடு~V~PAST~3SM)))

பாம் எரு புத்தம்~N~ACC அயன்~N~DAT காடு~V~PAST~3SM

Step 5: The incomplete word forms are synthesized to get the final translation output:

பாம் எரு புத்தத்தத அயனுக்கு காடுத்தான்
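The last two steps amount to replacing each English stem with its target-lexicon equivalent and handing each incomplete word form to the synthesizer. A toy sketch using the Roman transliteration scheme of Appendix A instead of Tamil script (the lexicon entries and the single dative rule inside `synthesize` are illustrative placeholders, not the real FST-based synthesizer):

```python
# Steps 4-5 in miniature: translate stems, then synthesize word forms.
# Transliterated toy lexicon for the running example (illustrative).
LEXICON = {"Ram": "rAm", "a": "oru", "book": "puththakam",
           "him": "avan", "gave": "kodu"}

def translate(token):
    """Replace the English stem, carrying the morph features along."""
    stem, sep, feats = token.partition("~")
    return LEXICON.get(stem, stem) + sep + feats

def synthesize(token):
    """Placeholder for the FST synthesizer: only the dative is handled here.
    The real system applies the full orthographic/morphotactic rules."""
    stem, sep, feats = token.partition("~")
    if feats == "N~DAT":
        return stem + "ukku"   # toy rule: avan + ukku -> avanukku
    return stem                # features this toy rule cannot realize

reordered = ["Ram", "a", "book~N~ACC", "him~N~DAT", "gave~V~PAST~3SM"]
print(" ".join(synthesize(translate(t)) for t in reordered))
```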
8.7 TESTING
A stochastic evaluation metric was tried initially, but the testing results were misleading: even translated sentences rated highly by the metric were not up to par. So manual testing is employed for the time being, until a better method is found. The testing of the MT system is done manually. The quality of the translation system's output is rated from 1 to 5 by human judges: 1 for very poor quality; 2 for good reordering but bad lexicon; 3 for comprehensible output; 4 for comprehensible output with good lexicon and good reordering; and 5 for comprehensible output with perfect syntax and perfect lexicon. A total of 14,100 input sentences from the sub-language of tourism were tested.
TABLE 8.1. TRANSLATION RESULTS
RATE     NO. OF SENTENCES    PERCENTAGE
1 & 2    3134                22.22
3        8970                63.61
4 & 5    1996                14.15
The online demo version for testing the English to Tamil translation system is available at the following link, http://www.nlp.amrita.edu:8080/TransWeb/index.jsp.
8.7.1 TESTING RESULTS: MORPHOLOGICAL ANALYZER AND SYNTHESIZER
Monolingual corpora containing a finite number of words were tested with the developed system, and the analyzed/synthesized forms and unanalyzed/non-synthesized forms were stored separately. For better testing results, nouns and verbs should be identified from the corpora and tested individually. The testing results of the morphological analyzer and synthesizer are given in table 8.2.
TABLE 8.2. TESTING RESULTS OF MORPH ANALYZER AND SYNTHESIZER
MODEL         TOTAL     ANALYZED/SYNTHESIZED    PERCENTAGE
LEXICON       113653    101076                  88.93
NO LEXICON    12577     9789                    77.83
The online demo version for testing the morphological analyzer and synthesizer is available at the following link: http://www.nlp.amrita.edu:8080/MorphWeb/index.jsp. Sample input for testing is also available on the website.

Testing the morph analyzer and synthesizer means testing the orthographic and morphotactic rules. Testing the analyzer is comparatively easier than testing the synthesizer because of the lack of input samples for the synthesizer. For testing the analyzer, monolingual corpora from various sources were collected, pruned and tokenized. The words are first tested in the lexicon model. During development, the orthographic and morphotactic rules were tested for all possibilities immediately after the creation of every rule. A word may fail to be analyzed either because of missing rules or because of the absence of a lexical entry; in that case, the system produces no output. In all other cases the system produces one or more outputs, which may or may not be correct. It has been found through experiment that at least one of the analyzed outputs is correct in most cases. In very few cases all the analyzed outputs are spurious, and this is negligible in the large test set. Note that 88.93% is not the accuracy but the percentage of words that were analyzed; the accuracy of the system would be somewhat less, but it was not tested manually. If the spurious outputs are neglected, the accuracy would be almost 88.93%.

The MAS developed using the OpenFst package shares the same orthographic and morphotactic rules between the analyzer and the synthesizer. The morph synthesizer is the inverse of the morph analyzer: feeding the output of the analyzer to the synthesizer yields the original word along with the other possible word formations, if any. That is, all of the 88.93% of analyzed output synthesizes back to the original word. For example, the word "டித்திருந்தான்" is analyzed and the output is "டி~V~PAST~VP~PERF~PAST~3SM". Feeding this output to the synthesizer produces "டிந்திருந்தான்" and "டித்திருந்தான்". Though both outputs are correct, the important observation is that one of them is the original word that was input to the analyzer, and this holds in all cases. Thus, 88.93% of root words with their morph information synthesize to a word form, and setting spurious outputs aside, the accuracy would be almost 88.93%.

The words that are not analyzed are tested in the lexicon-less model. However, the output of the lexicon-less model may not be correct, as opposed to the lexicon model, where most of the time the system gives an output only if it has the proper rule. The percentage of words analyzed varies with the corpus. The performance of the system can be improved by exhaustive testing: the non-analyzed word forms are clustered, and patterns can be identified to create new orthographic rules. This continuous iterative process of creating heuristics, testing, and finding new heuristics based on the non-analyzed word forms can improve both the accuracy and the coverage of the system.
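The coverage figures in table 8.2 follow directly from the counts; as a quick mechanical check (the function name is illustrative, the counts are those of table 8.2):

```python
# Coverage percentages from Table 8.2, rounded to two decimals.
def coverage(analyzed, total):
    """Percentage of word forms the analyzer/synthesizer handled."""
    return round(100.0 * analyzed / total, 2)

print(coverage(101076, 113653))  # 88.93 (lexicon model)
print(coverage(9789, 12577))     # 77.83 (lexicon-less model)
```

As discussed above, these are coverage figures, not accuracy figures.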
The categories of the sentences used for the testing procedure are as follows:

Simple + copula
Simple + copula (Possessive form)
Simple + copula (Co-ordinate)
Simple + copula (Infinitive)
Simple with PP object
Simple (NP conjuncts)
Co-ordinates
Infinitival clause (initial)
Gerundial
Discourse connector
Conditional
Relative clause (subordinate clause)
Relative clause (subordinate clause) with imperative
Relative clause (subordinate clause), hidden
Copula with infinitive
Copula constructs with gerundial & 'Wh' sub-ordinates
Appositional with verb participial initial
Appositional (verb participial + initial) and conjuncts
Appositional (complement + initial)
That complement
(Co-ordinate) + That complement
(Hidden) complement
Complex sentence with relative clause
Adverbial clause initial
PP initial
The results shown in table 8.1 cannot be compared meaningfully with Google's online English-Tamil translation service (alpha version), which was only a couple of weeks old at the time of this write-up. Google's English-Tamil SMT system seems better at choosing the target lexicon, whereas its morphological synthesis lags behind the rule-based system. The only other system available to compare these results against is the EILMT consortium's LTAG-based MT system. A comparison of the two systems was done with 500 sentences, and the observed results of this initial testing seem to be neck and neck.
8.7.2 CONTRIBUTIONS
The main contributions to the project in programming and linguistics are shown in the tables 8.3 to 8.7.
TABLE 8.3. IMPLEMENTATION DETAILS

MODULE: Morph Analyzer and Synthesizer
DETAILS: Rules are written in the 'lextools' framework and the model is created using the 'openfst' packages.

MODULE: Reordering
DETAILS: Implemented in Java. It takes the parse tree and reordering rules as input and outputs the reordered structure.

MODULE: Dependency to Morph feature mapping
DETAILS: Implemented in Java. The parse tree and typed dependency relations are used to compute the subject of the verb, to deduce the auxiliary-tense and copula features, and to map the dependency relations to noun inflections and postpositions.

MODULE: Translation of words
DETAILS: Implemented in Java. The target equivalent of the source word is looked up in the bilingual lexicon. The module also handles POS category jumping. Ex.: the word "tall" in the input sentence "He is tall" has the POS category 'adjective'. This sentence is translated as "அயன் உனபநா இருக்ிான்". The word "உனபநா" is an adverb; this module ensures that the adverb equivalent of "tall" ("உனபநா") is chosen and not the adjective equivalent.

MODULE: Integration of Synthesizer, Stanford Parser, Transliteration and Font Converter
DETAILS: These modules are implemented in Java. The multi-threaded synthesizer is developed for the Windows version of the MT system.

MODULE: Desktop GUI, Phrase Tree Viewer and Reordering Rules Editor
DETAILS: This tool is implemented in Java. It is not a part of the MT system; it is a visual aid tool for developing reordering rules and the lexicon.

MODULE: Database Organization
DETAILS: MySQL is used for db solutions. 12 different tables are present in the db.

MODULE: Online version of MT; Online version of Morph Analyzer and Synthesizer
DETAILS: The online version is implemented using JSP, Java Servlets, JavaScript, Java, XML and HTML.

MODULE: Extensions
DETAILS: The system is developed in such a way that the linguistic components are clustered out from the system to make a generic framework for MT from English to any Dravidian language and Hindi. Therefore, for English to another language pair, only the linguistic data has to be provided in the database, i.e. the bilingual dictionary, the reordering rules for the new language pair and the morph synthesizer rules. A toy version of the English to Malayalam MT system and a Malayalam morphological analyzer and synthesizer were developed and tested to verify the adaptability of the developed framework.
TABLE 8.4. DEPENDENCY TO MORPHOLOGICAL FEATURE MAPPING
CATEGORY    NUMBER OF MAPPINGS
VERB        97
NOUN        44
TABLE 8.5. REORDERING RULES
NUMBER OF REORDERING RULES    114
TABLE 8.6. NUMBER OF WORDS USED IN MORPH ANALYZER AND SYNTHESIZER MODEL
CATEGORY     NUMBER OF WORDS
NOUN         70207
VERB         2930
ADJECTIVE    141
TABLE 8.7. NUMBER OF RULES IN MORPH ANALYZER AND SYNTHESIZER
CATEGORY    RULE TYPE       NUMBER OF RULES
NOUN        MORPHOTACTIC    218
NOUN        ORTHOGRAPHIC    113
VERB        MORPHOTACTIC    370
VERB        ORTHOGRAPHIC    158

(The number of orthographic rules varies depending on how the rules are counted.)
CHAPTER 9
SCREEN SHOTS
FIG. 9.1. GUI OF ENGLISH-TAMIL MT SYSTEM (STAND ALONE VERSION)
FIG. 9.2. DICTIONARY PANEL AND MORPH SYNTHESIZER PANEL
FIG. 9.3. GUI: TAMIL MORPH ANALYZER AND GENERATOR
FIG. 9.4. GUI: MALAYALAM MORPH ANALYZER AND GENERATOR
FIG. 9.5. GUI: ENGLISH-MALAYALAM MT SYSTEM
FIG. 9.6. ENGLISH-TAMIL MT SYSTEM (ONLINE VERSION)
FIG. 9.7. MORPH ANALYZER AND SYNTHESIZER (ONLINE VERSION)
The graphical user interface (GUI) of the standalone version of the MT system, implemented in Java, is shown in figure 9.1. Besides translation, the GUI version has other features, such as modifying/adding lexicon entries on the fly and verifying the morphological synthesizer output. Figure 9.2 shows those features in the GUI with the dictionary panel and the morphological synthesizer panel enabled. The GUI of the standalone version of the morphological analyzer and synthesizer for Tamil, implemented in Java, is shown in figure 9.3.

With the success of the prototype implementation of the MT system, the framework developed for the English-Tamil language pair was extended to the other Dravidian languages. The framework was tested with a toy version of the English to Malayalam MT system. As part of this testing, a prototype version of the morphological analyzer and synthesizer was developed for Malayalam and is shown in figure 9.4. Using the framework of the English-Tamil MT system, and just by changing the English-Tamil dictionary to an English-Malayalam dictionary along with the Malayalam Unicode font mapping, the toy version of the English-Malayalam system was tested and is shown in figure 9.5. The online versions of the English-Tamil MT system and of the morphological analyzer and synthesizer are implemented using Java Servlets and JSP and are shown in figure 9.6 and figure 9.7 respectively.
CHAPTER 10
LIMITATIONS AND FUTURE WORK
The quality of the translation output and of the morphological analyzer and synthesizer deteriorates for various reasons. This chapter discusses those reasons, the limitations of the current system, and suggestions to overcome the issues that degrade the performance of every module and of the translation system as a whole.

The Stanford parser was chosen for its capability of producing both the syntactic structure and the typed dependency relations. The reordering module uses the parser's output for word reordering. Therefore, if the parser outputs a wrong syntactic structure or an ambiguous parse, the reordering module produces a wrong target phrasal structure and thus a wrong translation as well. Neither the input nor the output of the parser is adjusted or tweaked in the current system. The performance of the parser might have been improved had it been trained for the specific domain; currently, the WSJ-trained model is used. Training the parser for a domain is out of the scope of this thesis, since it requires a gold-standard corpus for the source language in the particular domain; this thesis is devoted to developing the necessary tools for the target language and does not consider the source side.

Tamil shows a very high degree of flexibility in ordering the words within a sentence. The positions of words can easily be transposed without much change in meaning. For example, "Ram gave him a book" can be reordered in multiple ways in Tamil; the most common are "Ram him a book gave", "Ram a book him gave", "Him a book Ram gave", etc. The predicate verb mostly takes the last position. In our system, the reordering rules are strictly a one-to-one map: every source rule is mapped to one target rule, and the target rule is formulated based on the most common usage. The Tamil clausal structure is more rigid and shows little flexibility.
For example, "Ram, who is smart, gave him a book." is reordered as "(smart Ram) (him) (a book) (gave)". Here the adjectival clause "who is smart" has to be positioned before the noun "Ram" in Tamil. The current system can handle only generic reordering rules. For example, consider the constructs "The capital of India" and "The thousands of devotees". The parse structures for these phrases are (NP (NP (DT The) (NN capital)) (PP (IN of) (NP (NNP India)))) and (NP (NP (DT The) (NNS thousands)) (PP (IN of) (NP (NNS devotees)))) respectively. The reordering rule transforms the above phrases to "India of the capital" and "devotees of the thousands" respectively. The latter target phrase, "devotees of the thousands", is not correct, and this happens because of the one-to-one reordering rule map. This is the limitation of the current reordering rule mechanism. It can be overcome by letting the system generate multiple outputs through one-to-many reordering rule maps; in post-processing, the best output can then be chosen based on the fluency of the sentence using a language model of the target language. Currently, this is not incorporated in the system. The quality of the translation output is also affected by the lack of reordering rules; new heuristics can be created through exhaustive testing and identification of patterns during testing.

The morphological information needed to synthesize the target word is extracted from the typed dependencies produced by the Stanford parser. The information that is extracted and mapped to the morphological syntax is the input for the morphological synthesizer, so the performance of the synthesizer depends on proper input: it will be affected by wrong typed dependency information from the parser. Enhancing the typed dependency relations of the parser is out of the scope of this thesis since, as stated before, the thesis is not concerned with the development of tools on the source-language side. The quality of the morphological analyzer and synthesizer is degraded by the lack of orthographic or morphotactic rules, improper heuristics, or the absence of a lexical entry, and this leads to poor quality translation.
The performance of the morphological analyzer and synthesizer is increased with the help of rigorous testing on a huge corpus. All the words from the dataset that are not analyzed are sorted alphabetically from right to left and clustered; analyzing the patterns in a cluster may lead to a new rule.

The absence of a lexical entry in the dictionary, or the presence of an entry that is not right for the given context of the sentence, plays an active role in the quality of the translation output. Even a domain-specific dictionary may not be enough to get the correct equivalent of the source word in the right sense without a Word Sense Disambiguation (WSD) module. The current MT system does not have a WSD module, which is very critical for a rule-based MT system; its absence is a serious drawback of the system that leads to poor quality translation.

Out-of-vocabulary (OOV) words cannot be translated, only transliterated. The current system has a module to transliterate named entities, but no transliteration module for OOV words other than names and place names. Words that are not present in the dictionary are transliterated in the same manner, using the same tool that was trained for names and places. Though the transliteration tool works well for names and places, it does not produce good results for other OOV words, since it was trained specifically for names and places. This is a setback for the developed MT system that degrades the translation quality. As a future enhancement, a named entity identifier can be used to identify names and places, which are then transliterated using the existing tool, while other OOV words can be transliterated using a tool built from mapping rules.

As a chain reaction of words wrongly tagged by the POS tagger in the input sentence, the quality of the translation will be poor. Apart from these limitations, there is a handful of other issues with respect to specific sentence structures, specific word categories, specific contexts, and so on.
CHAPTER 11
CONCLUSION
With the moderate success of this system, the methodology has been extended to Malayalam and Kannada, and prototype versions have been developed. The accuracy of the developed English-Tamil MT system at present is around 14%; this is only the current status. Through experiments and rigorous testing of the system, careful observations were made. The inference is that the performance of the system improves with all or any of the following: beefing up the transfer lexicon, and creating more specific rules for mapping the dependency relations to the morphological features. Another important observation, made by comparing different approaches, is that a morphologically rich language like Tamil demands a top-notch morphological analyzer and synthesizer for any kind of approach, be it linguistic, stochastic or a hybrid of both. The word reordering of the sentence contributes very little to conveying the sense of the sentence in Tamil: as long as the relations between the words are clearly defined by the inflections of the words, the meaning of the sentence stays intact irrespective of wrong or less fluent reordering. The success of rule-based English-Tamil MT depends mostly on how well the relations between the words are captured from English and transferred to Tamil, rather than on reordering the SVO sentence structure of English into the SOV sentence structure of Tamil. Steering our thoughts in that direction may help to accomplish our goals in a far more specific way. All I can say now is that the journey that started in 1954 at Georgetown is not yet over. There is plenty of room and scope to play around, and that gives me the hope that one day mother tongues could be saved and cultures could be preserved.
REFERENCES

[1] Paul Garvin, "The Georgetown-IBM experiment of 1954: an evaluation in retrospect," in Papers in Linguistics in Honor of Dostert, New York, 1967, pp. 46-56.
[2] W. John Hutchins, "Machine Translation: A Brief History," in Concise History of the Language Sciences: From the Sumerians to the Cognitivists, 1995, pp. 431-445.
[3] Dorothy Senez, "Developments in Systran," in Aslib Proceedings, vol. 47, no. 4, 1995, pp. 99-107.
[4] Hideki Hirakawa, Hiroyasu Nogami, and Shin-ya Amano, "EJ/JE Machine Translation System ASTRANSAC," in MT Summit III, Washington, DC, USA, July 1991.
[5] M. Nagao, J. Tsujii, and Nakamura, "The Japanese Government Project of Machine Translation," in Computational Linguistics, 1985, pp. 91-110.
[6] V. Renganathan, "An interactive approach to development of English-Tamil," in the International Tamil Internet Conference, 2002.
[7] U. Germann, "Building a statistical machine translation system from scratch: how much bang for the buck can we expect?," in Proceedings of the Workshop on Data-Driven Methods in Machine Translation, Morristown, NJ, USA, 2001, pp. 1-8.
[8] Hemant Darbari, Anuradha Lele, Aparupa Dasgupta, Priyanka Jain, and Sarvanan S., "EILMT: A Pan-Indian Perspective in Machine Translation," in Tamil Internet Conference, Coimbatore, Tamil Nadu, 2010.
[9] Anne Abeillé, Kathleen M. Bishop, Sharon Cote, and Yves Schabes, "A Lexicalized Tree Adjoining," University of Pennsylvania, Technical Report, 1990.
[10] Ananthakrishnan Ramanathan, Pushpak Bhattacharyya, Jayprasad Hegde, Ritesh M. Shah, and M. Sasikumar, "Simple Syntactic and Morphological Processing Can Help English-Hindi," in International Joint Conference on NLP (IJCNLP-08), Hyderabad, India, January 2008.
[11] Sudip Naskar and Sivaji Bandyopadhyay, "A Phrasal EBMT System for Translating English to Bengali," in Proceedings of MT Summit X, Phuket, Thailand, 2005.
[12] Akshar Bharati, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev Sangal, "Anusaaraka: Machine Translation in Stages," in A Quarterly in Artificial Intelligence, vol. 10, NCST, Mumbai, July 1997, pp. 22-25.
[13] Daniel Jurafsky and James H. Martin, Speech and Language Processing, 2nd ed., Pearson, 2008.
[14] Kenji Yamada and Kevin Knight, "A Syntax-based Statistical Translation Model," in Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001.
[15] D. Chiang, "An introduction to synchronous grammars," Technical report, Institute for Advanced Computer Studies, University of Maryland, College Park, 2006.
[16] Stuart M. Shieber and Yves Schabes, "Synchronous tree-adjoining grammars," in Proceedings of the Thirteenth International Conference on Computational Linguistics (COLING), vol. 3, 1990, pp. 1-6.
[17] Aravind K. Joshi and Yves Schabes, "Tree-adjoining grammars," in G. Rozenberg and A. Salomaa, editors, Handbook of Formal Languages, Springer-Verlag, Heidelberg, 1997.
[18] A. K. Joshi, M. Takahashi, and L. S. Levy, "Tree adjunct grammars," Journal of Computer and System Sciences, 1975, pp. 136-163.
[19] Saravanan S., Menon A. G., and Soman K. P., "Pattern Based English-Tamil Machine Translation," in Proceedings of the Tamil Internet Conference, Coimbatore, India, 2010, pp. 295-299.
[20] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning, "Generating Typed Dependency Parses from Phrase Structure Parses," in LREC, 2006.
[21] Dan Klein and Christopher D. Manning, "Accurate Unlexicalized Parsing," in Proceedings of the 41st Meeting of the Association for Computational Linguistics, 2003, pp. 423-430.
[22] Dan Klein and Christopher D. Manning, "Fast Exact Inference with a Factored Model for Natural Language Parsing," in Advances in Neural Information Processing Systems, Cambridge, 2003, pp. 3-10.
[23] Vijaya M. S., Shivapratap G., Dhanalakshmi V., Ajith V. P., and Soman K. P., "Sequence labelling approach for English to Tamil Transliteration using Memory based Learning," in Proceedings of the 6th International Conference on Natural Language Processing, 2008.
[24] Koichi Takeda, "Pattern-Based Machine Translation," in Association for Computational Linguistics, Santa Cruz, California, USA, June 1996.
[25] Marie-Catherine de Marneffe and Christopher D. Manning, "Stanford typed dependencies manual," 2008.
[26] Menon A. G., Saravanan S., Loganathan R., and Soman K. P., "Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach," in Proceedings of the Tamil Internet Conference, Cologne, Germany, 2009, pp. 239-243.
[27] Kimmo Koskenniemi, "A General Computational Model for Word-Form Recognition and Production," in COLING 84, 1984, pp. 178-181.
[28] Thomas Lehmann, A Grammar of Modern Tamil, 2nd ed., Pondicherry, India: Pondicherry Institute of Linguistics and Culture, 1992.
[29] Sribadri Narayanan R., Saravanan S., and Soman K. P., "Data Driven Suffix List and Concatenation Algorithm for Telugu Morphological Generator," International Journal of Engineering Science and Technology, vol. 3, no. 8, pp. 6712-6717, August 2011.
[30] Anandan, Ranjini Parthasarathy, and Geetha, "Morphological Generator for Tamil," in Tamil Internet 2001 Conference, Kuala Lumpur, Malaysia, 2001.
[31] Tina Bögel, Miriam Butt, Annette Hautli, and Sebastian Sulger, "Developing a Finite-State Morphological Analyzer for Urdu and Hindi," in LREC, 2010.
[32] Vikram T. N. and Shalini R., "Development of Prototype Morphological Analyzer for the South Indian Language of Kannada," in Proceedings of the 10th International Conference on Asian Digital Libraries, Berlin/Heidelberg, 2007, pp. 109-116.
[33] Goyal V. and Lehal G. S., "Hindi Morphological Analyzer and Generator," in Emerging Trends in Engineering and Technology, Washington, DC, USA, 2008.
[34] Ramasamy Veerappan, Antony P. J., Saravanan S., and Soman K. P., "A Rule Based Kannada Morphological Analyzer and Generator using Finite State Transducer," International Journal of Computer Applications, pp. 45-52, August 2011.
APPENDIX A

A.1 TAMIL TRANSLITERATION SCHEME

Tamil-Roman character mapping:

VOWELS: அ a, ஆ A, இ i, ஈ I, உ u, ஊ U, எ e, ஏ E, ஐ ai, ஒ o, ஓ O, ஔ au, ஃ q

CONSONANTS: க் k, ங் G, ச் c, ஞ் J, ட் t, ண் N, த் T, ந் W, ப் p, ம் m, ய் y, ர் r, ல் l, வ் v, ழ் z, ள் L, ற் R, ன் n

SANSKRIT LETTERS: ஜ் j, ஷ் s, ஸ் S, க்ஷ் x, ஹ் h
APPENDIX B

B.1 REORDERING RULES

ROOT NODE | SOURCE RULE CHILDREN | TARGET RULE CHILDREN | TRANSFER LINK RULE
ADJP | ADJP PP | PP ADJP | 1:2 2:1
ADJP | JJ PP | PP JJ | 1:2 2:1
ADJP | JJ S | S JJ | 1:2 2:1
ADJP | VB* PP | PP VB* | 1:2 2:1
ADVP | ADVP SBAR | SBAR ADVP | 1:2 2:1
ADVP | RB RB | RB RB | 1:2 2:1
FRAG | NP PP NP-TMP . | PP NP-TMP NP . | 1:2 2:3 3:1 4:4
NP | DT NN CD | CD DT NN | 1:3 2:1 3:2
NP | DT NN S | S NN DT | 1:3 2:2 3:1
NP | JJ NN S | S JJ NN | 1:3 2:1 3:2
NP | NN S | S NN | 1:2 2:1
NP | NP , SBAR | SBAR , NP | 1:3 2:2 3:1
NP | NP NP | NP NP | 1:2 2:1
NP | NP NP . | NP NP . | 1:2 2:1 3:3
NP | NP PP | PP NP | 1:2 2:1
NP | NP SBAR | SBAR NP | 1:2 2:1
NP | NP VP | VP NP | 1:2 2:1
NP | QP RB | RB QP | 1:2 2:1
NP | RB CD NN | CD NN RB | 1:2 2:3 3:1
PP | CC NP | NP CC | 1:2 2:1
PP | IN ADVP | ADVP IN | 1:2 2:1
PP | IN NP | NP IN | 1:2 2:1
PP | IN PP | PP IN | 1:2 2:1
PP | IN S | S IN | 1:2 2:1
PP | TO NP | NP TO | 1:2 2:1
PP | VB* NP | NP VB* | 1:2 2:1
QP | IN CD TO CD | CD TO CD IN | 1:2 2:3 3:4 4:1
QP | RB JJR IN CD | CD IN RB JJR | 1:4 2:3 3:1 4:2
S | ADVP NP VP . | NP ADVP VP . | 1:2 2:1 3:3 4:4
S | NP ADVP VP . | NP VP ADVP . | 1:1 2:3 3:2 4:4
SBAR | IN S | S IN | 1:2 2:1
SBAR | RB IN S | S IN RB | 1:3 2:2 3:1
SBAR | WHADVP S | S WHADVP | 1:2 2:1
SBAR | WHNP S | S WHNP | 1:2 2:1
SBAR | WHPP S | S WHPP | 1:2 2:1
SBAR | X S | S X | 1:2 2:1
SBARQ | WHNP SQ . | SQ WHNP . | 1:2 2:1 3:3
SINV | PP VP NP . | PP NP PP . | 1:1 2:3 3:2 4:4
SINV | VB* NP ADJP | NP ADJP VB* | 1:2 2:3 3:1
SQ | MD NP VP | NP VP MD | 1:2 2:3 3:1
SQ | MD NP VP . | NP VP MD . | 1:2 2:3 3:1 4:4
SQ | MD RB NP VP . | NP VP RB MD . | 1:3 2:4 3:2 4:1 5:5
SQ | S , MD NP VP . | S , NP VP MD . | 1:1 2:2 3:4 4:5 5:3 6:6
SQ | VB* NP ADJP . | NP ADJP VB* . | 1:2 2:3 3:1 4:4
SQ | VB* NP NP . | NP NP VB* . | 1:2 2:3 3:1 4:4
SQ | VB* NP PP | PP NP VB* | 1:3 2:2 3:1
SQ | VB* NP VP | NP VP VB* | 1:2 2:3 3:1
SQ | VB* NP VP . | NP VP VB* . | 1:2 2:3 3:1 4:4
VP | ADVP VB* NP | NP ADVP VB* | 1:3 2:1 3:2
VP | ADVP VB* SBAR | ADVP SBAR VB* | 1:1 2:3 3:2
VP | IN S | S IN | 1:2 2:1
VP | MD ADVP VP | ADVP VP MD | 1:2 2:3 3:1
VP | MD RB VP | VP MD RB | 1:3 2:1 3:2
VP | MD VP | VP MD | 1:2 2:1
VP | NN ADVP SBAR | ADVP NN SBAR | 1:2 2:1 3:3
VP | TO VP | VP TO | 1:2 2:1
VP | VB* , ADVP , S | , ADVP , S VB* | 1:2 2:3 3:4 4:5 5:1
VP | VB* ADJP | ADJP VB* | 1:2 2:1
VP | VB* ADJP , SBAR | ADJP VB* , SBAR | 1:2 2:1 3:3 4:4
VP | VB* ADJP ADVP | ADVP ADJP VB* | 1:3 2:2 3:1
VP | VB* ADJP PP | PP ADJP VB* | 1:3 2:2 3:1
VP | VB* ADVP | ADVP VBZ | 1:2 2:1
VP | VB* ADVP ADJP | ADVP ADJP VB* | 1:2 2:3 3:1
VP | VB* ADVP ADVP | ADVP ADVP VB* | 1:3 2:2 3:1
VP | VB* ADVP NP | ADVP NP VB* | 1:2 2:3 3:1
VP | VB* ADVP NP-TMP | NP-TMP ADVP VB* | 1:3 2:2 3:1
VP | VB* ADVP PP | PP ADVP VB* | 1:3 2:2 3:1
VP | VB* ADVP VP | VP ADVP VB* | 1:3 2:2 3:1
VP | VB* CC VB* PP | PP VB* CC VB* | 1:4 2:1 3:2 4:3
VP | VB* FRAG | FRAG VB* | 1:2 2:1
VP | VB* NP | NP VB* | 1:2 2:1
VP | VB* NP ADVP | NP ADVP VB* | 1:2 2:3 3:1
VP | VB* NP NP | NP NP VB* | 1:2 2:3 3:1
VP | VB* NP NP-TMP | NP-TMP NP VB* | 1:3 2:2 3:1
VP | VB* NP PP | NP PP VB* | 1:2 2:3 3:1
VP | VB* NP PP , PP | NP PP , PP VB* | 1:2 2:3 3:4 4:5 5:1
VP | VB* NP PP PP | PP PP NP VB* | 1:3 2:4 3:2 4:1
VP | VB* NP SBAR | NP SBAR VB* | 1:2 2:3 3:1
VP | VB* NP-TMP | NP-TMP VB* | 1:2 2:1
VP | VB* PP | PP VB* | 1:2 2:1
VP | VB* PP , PP | PP , PP VB* | 1:2 2:3 3:4 4:1
VP | VB* PP ADVP | ADVP PP VB* | 1:3 2:2 3:1
VP | VB* PP NP-TMP | PP NP-TMP VB* | 1:2 2:3 3:1
VP | VB* PP PP | PP PP VB* | 1:2 2:3 3:1
VP | VB* PP S | S PP VB* | 1:3 2:2 3:1
VP | VB* PP SBAR | PP SBAR VB* | 1:2 2:3 3:1
VP | VB* PRT | PRT VB* | 1:2 2:1
VP | VB* PRT NP | NP PRT VB* | 1:3 2:2 3:1
VP | VB* PRT NP-TMP PP | NP-TMP PRT PP VB* | 1:3 2:2 3:4 4:1
VP | VB* PRT PP | PP PRT VB* | 1:3 2:2 3:1
VP | VB* RB ADJP | ADJP VB* RB | 1:3 2:1 3:2
VP | VB* RB NP | NP VB* RB | 1:3 2:1 3:2
VP | VB* RB VP | VP RB VB* | 1:3 2:2 3:1
VP | VB* S | S VB* | 1:2 2:1
VP | VB* SBAR | SBAR VB* | 1:2 2:1
VP | VB* VP | VP VB* | 1:2 2:1
WHPP | IN WHNP | WHNP IN | 1:2 2:1
VP | VB* NP NP SBAR | SBAR NP NP VB* | 1:4 2:2 3:3 4:1
VP | VB* ADVP ADJP PP | ADVP ADJP PP VB* | 1:2 2:3 3:4 4:1
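Each transfer link above is a permutation of child positions: "1:2 2:1" says that target child 1 is source child 2, target child 2 is source child 1, and so on. A minimal sketch of applying such a link to a list of child nodes (the function name is illustrative; the real module operates on parse-tree nodes rather than labels):

```python
# Apply a transfer link such as "1:3 2:2 3:1" to a list of child nodes.
def apply_transfer_link(children, link):
    """Permute `children` according to 'tgt:src' pairs in `link` (1-based)."""
    target = [None] * len(children)
    for pair in link.split():
        tgt, src = pair.split(":")
        target[int(tgt) - 1] = children[int(src) - 1]
    return target

# VP rule from the table: VB* NP PP -> NP PP VB*  (link "1:2 2:3 3:1")
print(apply_transfer_link(["VB*", "NP", "PP"], "1:2 2:3 3:1"))
# ['NP', 'PP', 'VB*']
```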
APPENDIX C C.1 TENSE-MORPHOLOGICAL FEATURES LOOKUP TABLE
POS CATEGORY OF PARTICIPLE
AUXILIARY
AUX-TENSE
INPUT FORMAT FOR MORPH SYNTHESIZER

POS TAG   AUXILIARY              TENSE / FORM          MORPH SYNTHESIZER TAG
VBZ       -                      PRES                  PRES
VBP       -                      PRES                  PRES
VBD       -                      PAST                  PAST
VB        WILL                   FUT                   FUT
VBG       AM                     PRES CONT             PAST~VP~PROG~PRES
VBG       IS                     PRES CONT             PAST~VP~PROG~PRES
VBG       ARE                    PRES CONT             PAST~VP~PROG~PRES
VBG       WAS                    PAST CONT             PAST~VP~PROG~PAST
VBG       WERE                   PAST CONT             PAST~VP~PROG~PAST
VBG       WILL BE                FUT CONT              PAST~VP~PROG~FUT
VBG       SHALL BE               FUT CONT              PAST~VP~PROG~FUT
VBN       HAVE                   PRES PERF             PAST~VP~PERF~PRES
VBN       HAS                    PRES PERF             PAST~VP~PERF~PRES
VBN       HAD                    PAST PERF             PAST~VP~PERF~PAST
VBN       WILL HAVE              FUT PERF              PAST~VP~PERF~FUT
VBN       SHALL HAVE             FUT PERF              PAST~VP~PERF~FUT
VBG       HAVE BEEN              PRES PERF CONT        PAST~VP~PROG~PRES
VBG       HAS BEEN               PRES PERF CONT        PAST~VP~PROG~PRES
VBG       HAD BEEN               PAST PERF CONT        PAST~VP~PROG~PAST
VBG       WILL HAVE BEEN         FUT PERF CONT         PAST~VP~PROG~FUT
VBG       SHALL HAVE BEEN        FUT PERF CONT         PAST~VP~PROG~FUT
VBN       AM                     PASS PRES             INF~PAS~PRES
VBN       IS                     PASS PRES             INF~PAS~PRES
VBN       ARE                    PASS PRES             INF~PAS~PRES
VBN       WAS                    PASS PAST             INF~PAS~PAST
VBN       WERE                   PASS PAST             INF~PAS~PAST
VBN       WILL BE                PASS FUT              INF~PAS~FUT
VBN       SHALL BE               PASS FUT              INF~PAS~FUT
VBN       IS BEING               PASS PRES CONT        INF~PAS~PAST~VP~PROG~PRES
VBN       ARE BEING              PASS PRES CONT        INF~PAS~PAST~VP~PROG~PRES
VBN       WAS BEING              PASS PAST CONT        INF~PAS~PAST~VP~PROG~PAST
VBN       WERE BEING             PASS PAST CONT        INF~PAS~PAST~VP~PROG~PAST
VBN       WILL BE BEING          PASS FUT CONT         INF~PAS~PAST~VP~PROG~FUT
VBN       HAS BEEN               PASS PRES PERF        INF~PAS~PAST~VP~PERF~PRES
VBN       HAVE BEEN              PASS PRES PERF        INF~PAS~PAST~VP~PERF~PRES
VBN       HAD BEEN               PASS PAST PERF        INF~PAS~PAST~VP~PERF~PAST
VBN       WILL HAVE BEEN         PASS FUT PERF         INF~PAS~PAST~VP~PERF~FUT
VBN       HAS BEEN BEING         PASS PRES PERF CONT   INF~PAS~PAST~VP~PROG~PRES
VBN       HAVE BEEN BEING        PASS PRES PERF CONT   INF~PAS~PAST~VP~PROG~PRES
VBN       HAD BEEN BEING         PASS PAST PERF CONT   INF~PAS~PAST~VP~PROG~PAST
VBN       WILL HAVE BEEN BEING   PASS FUT PERF CONT    INF~PAS~PAST~VP~PROG~FUT
VB        CAN                    CAN                   INF~AUX_CAN~NPNG
VB        CAN NOT                CAN NOT               INF~AUX_CANNOT~NPNG
VB        MAY                    MAY                   VN~AUX_MAY~NPNG
VB        MAY NOT                MAY NOT               NEG~VP~PERF~VN~AUX_MAYNOT~NPNG
VB        SHOULD                 SHOULD                INF~AUX_SHOULD~NPNG
VB        SHOULD NOT             SHOULD NOT            INF~AUX_SHOULDNOT~NPNG
VB        DID NOT                DID NOT               INF~NEG_NF~NPNG
VB        DOES NOT               DOES NOT              INF~NEG_F
VB        WILL NOT               WILL NOT              INF~NEG_F
VBN       HAD NOT                HAD NOT               PAST~VP~PERF~INF~NEG_NF~NPNG
VBN       HAS NOT                HAS NOT               PAST~VP~PERF~INF~NEG_NF~NPNG
VBN       WILL NOT HAVE          WILL NOT HAVE         PAST~VP~PERF~INF~NEG_F
VBZ       NEVER                  NEVER                 INF~CLI_E~NEG_F
VBN       NEVER                  NEVER                 INF~CLI_E~NEG_NF~NPNG
VBZ       WILL NEVER             WILL NEVER            INF~CLI_E~NEG_F
VBN       WILL HAVE NEVER        WILL HAVE NEVER       PAST~VP~PERF~INF~CLI_E~NEG_F
VBN       HAVE NEVER             HAVE NEVER            PAST~VP~PERF~INF~CLI_E~NEG_F
VBN       HAS NEVER              HAS NEVER             PAST~VP~PERF~INF~CLI_E~NEG_F
VBN       HAD NEVER              HAD NEVER             PAST~VP~PERF~INF~CLI_E~NEG_NF~NPNG
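The mapping above can be pictured as a lookup keyed on the Penn Treebank POS tag and the auxiliary sequence. The sketch below is illustrative only: the dictionary and function names are hypothetical, only a few rows are shown, and it is not the thesis implementation.

```python
# Hypothetical sketch of the morph-synthesizer input lookup.
# Keys are (Penn POS tag, auxiliary sequence); values are the
# Tamil morph-synthesizer tag patterns from the table above.
MORPH_TAGS = {
    ("VBZ", ""): "PRES",
    ("VBD", ""): "PAST",
    ("VB", "WILL"): "FUT",
    ("VBG", "IS"): "PAST~VP~PROG~PRES",
    ("VBN", "HAD"): "PAST~VP~PERF~PAST",
    ("VBN", "WAS"): "INF~PAS~PAST",
}

def morph_tag(pos, aux=""):
    """Return the synthesizer tag for a (POS, auxiliary) pair, or None."""
    return MORPH_TAGS.get((pos, aux.upper()))
```

For instance, `morph_tag("VBG", "is")` looks up the progressive row and yields the tag `PAST~VP~PROG~PRES`.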
APPENDIX D

D.1 TAMIL VERB MORPHOLOGY [15]

The structure of a Tamil verb is as follows:

ROOT / STEM + TRANSITIVE + CAUSATIVE + TENSE / NEGATIVE + EMPTY + PERSON-NUMBER-GENDER

A Tamil verb is characterized by its ability to take a tense suffix, a person-number-gender (PNG) suffix and, wherever possible, transitive and causative suffixes. We start with a simple verb stem in Tamil. The verb stem equivalent to the English verb 'say' is 'col' [ கசால் ]. The paradigm of this verb, following the structure mentioned above, is as follows:

கசால் + ிற் + ஆன்      ( he says )
STEM + PRESENT TENSE + 3RD PERSON MASCULINE SINGULAR

கசால் + வ் + ஆள்       ( she will say )
STEM + FUTURE TENSE + 3RD PERSON FEMININE SINGULAR

கசால் + இன் + ஆர்      ( they said )
STEM + PAST TENSE + 3RD PERSON COMMON PLURAL

கசால் + இன் + அது      ( it said )
STEM + PAST TENSE + 3RD PERSON NEUTER SINGULAR

கசால் + இன் + ஆர்ள்    ( they [ neuter ] said )
STEM + PAST TENSE + 3RD PERSON PLURAL
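The paradigm above is purely concatenative: a finite verb is built as STEM + TENSE + PNG. As a rough illustration of this structure (a sketch only, using Romanized suffixes and hypothetical names, and ignoring the sandhi changes the rest of this appendix describes):

```python
# Illustrative sketch: a Tamil finite verb as STEM + TENSE + PNG,
# in Romanized form. Suffix choices here fit the verb 'col' (say);
# real synthesis must also pick stem-specific allomorphs and apply sandhi.
TENSE = {"PRES": "kiR", "PAST": "in", "FUT": "v"}
PNG = {"3SM": "aan", "3SF": "aaL", "3PC": "aar"}

def synthesize(stem, tense, png):
    """Concatenate stem + tense suffix + person-number-gender suffix."""
    return stem + TENSE[tense] + PNG[png]
```

For example, `synthesize("col", "FUT", "3SF")` gives `colvaaL` ( she will say ), matching the second paradigm row above.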
The verb 'col' has no transitive/intransitive contrast. Therefore, we continue below with a verb stem which is capable of taking a transitive suffix: thazh ( தாழ் ) – sink.
[15] This section is based on the PhD thesis of Dr. A. G. Menon.
தாழ் + ிற் + ஆன்             ( he sinks )
STEM + PRES + 3SM

தாழ் + த்து + ிற் + ஆன்       ( he makes something sink )
STEM + TRANSITIVE + PRES + 3SM

தாழ் + த்து + வ் + ஆன்        ( he will make something sink )
STEM + TRANSITIVE + FUT + 3SM

தாழ் + த்து + இன் + ஆன்       ( he made something sink )
STEM + TRANSITIVE + PAST + 3SM
We go further with an example of the causative. In the case of verbs with an intransitive/transitive contrast, both the transitive and the causative suffixes should be present. In other words, the transitive suffix is obligatory before adding a causative suffix.
தாழ் + த்து + யி + த்த் + ஆன்      ( he caused someone to make something sink )
STEM + TRANSITIVE + CAUSATIVE + PAST + 3SM

தாழ் + த்து + யி + க்ிற் + ஆன்     ( he causes someone to make something sink )
STEM + TRANSITIVE + CAUSATIVE + PRES + 3SM

தாழ் + த்து + யி + ப்ப் + ஆன்      ( he will cause someone to make something sink )
STEM + TRANSITIVE + CAUSATIVE + FUT + 3SM
In the case of verbs which are devoid of an intransitive versus transitive contrast, the causative suffix is added immediately after the stem to form a transitive. Example: கசய் [ do ]

கசய் + யி      [ to make someone do ]
STEM + CAUSATIVE SUFFIX
கசய் + த் + ஆன்           [ he did ]
STEM + PAST + 3SM

கசய் + யி + த்த் + ஆன்     [ he made someone do ]
STEM + CAUSATIVE + PAST + 3SM
The causative suffix is replaced by a lexical item in the form of an auxiliary verb in Modern Tamil. The following example illustrates this change.
தாழ் [ sink ]        ( INTRANSITIVE )
தாம தய              [ to make something sink ]                  ( TRANSITIVE )
தாழ்த்த தய           [ to cause someone to make something sink ]  ( CAUSATIVE )

தாம தய + த்த் + ஆன்       [ he made something sink ]
VERB INFINITIVE + AUXILIARY + PAST + 3SM

தாம தய + க்ிற் + ஆன்      [ he makes something sink ]
VERB INFINITIVE + AUXILIARY + PRES + 3SM

தாம தய + ப்ப் + ஆன்       [ he will make something sink ]
VERB INFINITIVE + AUXILIARY + FUT + 3SM

தாழ்த்த தய + த்த் + ஆன்    [ he caused someone to make something sink ]
TRANSITIVE VERB INFINITIVE + AUXILIARY + PAST + 3SM

தாழ்த்த தய + க்ிற் + ஆன்   [ he causes someone to make something sink ]
TRANSITIVE VERB INFINITIVE + AUXILIARY + PRES + 3SM

தாழ்த்த தய + ப்ப் + ஆன்    [ he will cause someone to make something sink ]
TRANSITIVE VERB INFINITIVE + AUXILIARY + FUT + 3SM
In this case, the auxiliary verb 'வை' has taken over the functions of both the transitive and the causative.

டு + க்ிற் + ஆன்       [ he takes ]                                          STEM + PRES + 3SM
டு + ப்ப் + ஆன்        [ he will take ]                                      STEM + FUT + 3SM
டு + த்த் + ஆன்        [ he took ]                                           STEM + PAST + 3SM
புத + ந்த் + ஆன்       [ he became angry ]                                   STEM + PAST + 3SM
புத + த் + த் + ஆன்     [ he burnt something / he smoked something ]          STEM + TRANS + PAST + 3SM
புத + ிற் + ஆன்        [ he is angry ]                                       STEM + PRES + 3SM
புத + க் + ிற் + ஆன்    [ he burns something / he smokes something ]          STEM + TRANS + PRES + 3SM
புத + வ் + ஆன்         [ he will be angry ]                                  STEM + FUT + 3SM
புத + ப் + ப் + ஆன்     [ he will burn something / he will smoke something ]  STEM + TRANS + FUT + 3SM
TENSE

In general, Tamil verbs distinguish three tenses: present, past and future.

PRESENT TENSE

Suffixes: ிறு, ின்று. There are no distributional differences between these two suffixes. Both of them occur after the same verb stems, except in the case of finite verbs with a neuter pronominal plural, such as:

டி + க்ின்ற் + அன் + அ      ( they ( neuter ) read )    STEM + PRES + EMPTY + 3PN
கசய் + க்ின்ற் + அன் + அ    ( they do )                 STEM + PRES + EMPTY + 3PN
PAST

There are three sets of past tense markers in Tamil: ந்த், த் and its variants, and இன் and its variants. One of the biggest problems of Tamil verb morphology is the prediction of the past and future suffixes after the verb stems.
It is difficult, if not impossible, to predict which verb stem will take which past tense marker. The distribution of the past tense suffixes is in a way predetermined in the language.

அழு + த + ஆன்       [ he cried ]                 STEM + PAST + 3SM
காடு + த்த் + ஆன்    [ he gave ]                  STEM + PAST + 3SM
சாடு + இன் + ஆன்     [ he jumped ]                STEM + PAST + 3SM
பாடு + ட் + ஆன்      [ he let something fall ]    STEM + PAST + 3SM
ெ + ந்த் + ஆன்       [ he walked ]                STEM + PAST + 3SM

In the above cases, it is not predictable which past tense suffix will occur after each stem. The distribution of the past tense suffixes is not restricted by the form of the verb stem; it is rather information which comes along with the language.
Verb stems are classified into three major groups on the basis of the past tense markers they take. 'ந்த்' has variants in the spoken dialects. Example: ாிந்தான் – ாிஞ்சான். 'த்' has the variants 'த்' and 'த்த்'. In the spoken language 'த்த்' has a palatalized variant 'ச்ச்'. To the same group also belong the stems which produce the past tense by gemination of the stem consonant. Example:

டு + ிற் + ஆன்     டுிான்
டு + வ் + ஆன்      டுயான்
டு + ட் + ஆன்      ட்ொன்

In the case of the past tense, the 'த்' is geminated to produce the past tense form. This can be explained in two ways. As a process of gemination.
As a case of sandhi, in which the stem consonant and the following past tense marker 'த்' are converted into 'த்த்'. Another example of this sandhi change and the resulting geminate past tense marker is 'பதாற்ான்':

பதாற் + ற் + ஆன்      [ he failed ]      STEM + PAST + 3SM

The verb stem 'பதால்' ends in 'ல்'; after the addition of the past tense marker 'த்த்' it becomes 'ற்ற்'. பதால் + த்த் + ஆன் becomes பதாற்ான்.

பதால் + த் + த் + ஆன்     பதாற் + த் + ஆன்      பதாற்ான்
பதால் + ிற் + ஆன்        பதாற் + ிற் + ஆன்     பதாற்ிான்
பதால் + ப் + ப் + ஆன்     பதாற் + ப் + ஆன்      பதாற்ான்

Though Tamil verbs can be classified into three general groups, it is necessary to divide each group further into sub-groups. The reasons for this sub-grouping are the absence of a common feature in the verbs which take, for example, 'த்' or 'த்த்' as past tense markers, and that the automatic generation and analysis can be simplified by grouping these stems into sub-groups.
தய + ந் + ஆன்       [ he scolded ]
தய + த்த் + ஆன்      [ he placed something ]

Syntactically there is no difference between these two stems, because both of them can take the accusative case marker. In other words, there is no distinguishing morphological feature.
The third past tense suffix 'இன்' has three variants: 'இன்', 'இ' and 'ன்'. The distribution of these suffixes is easy to predict. 'இன்' occurs in finite verbs and sometimes in relative participles. The following examples illustrate these distributions.

ஏடு + இன் + ன்      ஏடிான்     [ I ran ]         STEM + PAST + 1S
ஏடு + இ + அ        ஏடின       [ which ran ]     STEM + PAST + RP MARKER
In the case of the relative participle it alternates with 'இ'. For example, ஏடு + இன் + அ and ஏடு + இ + ய் + அ. The suffix 'இ' occurs at the end of verbal participles such as 'ஏடி' [ having run ]. The verbs 'பா' [ to go ] and 'ஆ' [ to become ] take 'ய்' as the past tense marker apart from 'இன்' and 'இ'. Example: பா + ய் + அ, corresponding to பா + ன் + அ and பா + ய் + இன் + அ. The most frequent form in Modern Tamil is 'பா'. The suffix 'ன்' occurs in the finite verbs of 'பா' and 'ஆ', apart from the relative participles. Example:

பா + ன் + ஆன்     [ he went ]        STEM + PAST + 3SM
ஆ + ன் + ஆன்      [ he became ]      STEM + PAST + 3SM
பா + ன் + அ       [ which went ]     STEM + PAST + RELATIVE PARTICIPLE
ஆ + ன் + அ        [ which became ]   STEM + PAST + RELATIVE PARTICIPLE
Some of the verbs take two different past tense suffixes. This takes place during the transition from intransitive to transitive.
Example:

தாழ் + ந்த் + ஆன்            [ he sank ]                   STEM + PAST + 3SM
தாழ் + த்த் + இன் + ஆன்       [ he made something sink ]    STEM + TRANS + PAST + 3SM
FUTURE

In Tamil, there are three future tense suffixes: ப் and its variant ப்ப், வ், and உம். The distribution of 'ப்ப்' is parallel to the distribution of the past tense marker 'த்த்' and the present tense marker 'க்ிற்'. Example: அடி [ beat ]

அடி + க்ிற் + ஆன்     [ he beats ]         STEM + PRES + 3SM
அடி + ப்ப் + ஆன்      [ he will beat ]     STEM + FUT + 3SM
அடி + த்த் + ஆன்      [ he beat ]          STEM + PAST + 3SM
ாண் + ப் + ஆன்       [ he will see ]      STEM + FUT + 3SM
தீன் + ப் + ஆன்       [ he will eat ]      STEM + FUT + 3SM

Examples for the future tense marker 'வ்' are:

யரு + வ் + ஆன்       [ he will come ]
கசால் + வ் + ஆன்     [ he will say ]
ழு + வ் + ஆன்        [ he will stand up ]
ாடு + வ் + ஆன்       [ he will sing ]
ெர் + வ் + ஆன்       [ he will spread ]
காள் + வ் + ஆன்      [ he will have ]

The last future tense marker is 'உம்'. It occurs in finite verbs of the neuter singular and plural, and also in the future relative participle of all verbs.

அது யர் + உம்        [ it will come ]
அதய யர் + உம்        [ they will come ]
யரும் தனன்           [ the boy who will come ]              relative participle
எடிக்கும் தள்          [ the hands which break something ]

Some of the verbs take a formative suffix, which may be either 'க்க்' or 'ப்ப்'. The occurrence of 'க்க்' and 'ப்ப்' goes parallel with the verb stems which take double geminates for the three tenses, such as the stem 'டு' [ take ]: டுக் [ infinitive of 'etu', to take ], டுப் [ infinitive of 'etu', to take ], ட்டுக்ிான், டுப்ான், டுத்தான்.
EMPTY

Because the tense markers were already covered in the examples above, we move to the next position in the Tamil verb. You might have noticed the presence of a suffix between the tense and the PNG. This suffix is labelled the empty suffix because its function is lost in Modern Tamil. It is also known as a bridge suffix because of its connecting function: it connects a tense suffix with a PNG suffix. We will not go further into the historical function of the empty / bridge suffix. A few examples:

ய + ந்த் + அன் + அன்      [ he came ]      STEM + PAST + EMPTY + 3SM
ய + ந்த் + அன் + அள்      [ she came ]     STEM + PAST + EMPTY + 3SF

In Modern Tamil, the bridge suffix is not in use.
Another important aspect of a Tamil verb is its explicit use of a person, number and gender suffix in agreement with the subject. However, verb stems with the future tense suffix 'உம்' do not show this agreement through explicit morphemes. The person-number-gender markers indicate whether the subject is singular or plural, whether it is masculine, feminine or neuter, and whether it is in the first, second or third person. The following examples illustrate the functioning of the PNG in Tamil.

டி + க்ிற் + ன்              ( I read )                  STEM + PRES + 1S
டி + க்ிற் + ஏம்             ( we read )                 STEM + PRES + 1P
டி + க்ிற் + ஆய்             ( you read )                STEM + PRES + 2S
டி + க்ிற் + ஈர் ( -ள் )      ( you ( pl ) read )         STEM + PRES + 2P
டி + க்ிற் + ஆன்             ( he reads )                STEM + PRES + 3SM
டி + க்ிற் + ஆள்             ( she reads )               STEM + PRES + 3SF
டி + க்ிற் + ஆர் ( -ள் )      ( they read )               STEM + PRES + 3P ( can also be used as honorific )
டி + க்ிற் + அது             ( it reads )                STEM + PRES + 3SN
டி + க்ின்ற் + அன் + அ       ( they ( neuter ) read )    STEM + PRES + EMPTY + 3PN
டி + க்க் + உம்              ( they / it will read )     STEM + FORMATIVE SUFFIX + FUT
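The PNG agreement paradigm is again a matter of picking one suffix from a closed set. A minimal sketch, in Romanized form with hypothetical names (the suffix spellings follow the Romanization used later in this appendix, and sandhi is ignored):

```python
# Hypothetical sketch: person-number-gender (PNG) agreement suffixes,
# Romanized, as used with a stem like 'pati' (read) in the present tense.
PNG_SUFFIX = {
    "1S": "een", "1P": "oom", "2S": "aay", "2P": "iirkaL",
    "3SM": "aan", "3SF": "aaL", "3P": "aarkaL", "3SN": "athu",
}

def finite_verb(stem, tense_suffix, png):
    """Attach a tense suffix and a PNG agreement suffix to a stem."""
    return stem + tense_suffix + PNG_SUFFIX[png]
```

For example, `finite_verb("pati", "kkiR", "3SF")` yields `patikkiRaaL` ( she reads ).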
NOTE ON FORMATIVE SUFFIX
The distribution of formative suffixes such as 'க்க்' and 'ப்ப்' is not clear. The so-called strong verbs form their stems by adding 'க்க்' or 'ப்ப்' to their stems, such as 'டி' [ read ] and 'டிக்க்-', the same stem with the formative suffix 'க்க்'. It also occurs with the formative suffix 'ப்ப்', such as 'டிப்ப்-'. The formative suffix is exhibited very clearly in the infinitive forms of these verbs. For example, 'டிக்' [ to read ], 'டுக்' [ to take ], 'காடுக்' [ to give ] and 'ார்க்' [ to see ]. The same formative suffix also occurs as a transitive suffix in the verb stems which show a transitive-intransitive contrast. For example, 'அமிின' versus 'அமிக்', 'ாின' versus 'ாிக்', 'நடின' versus 'நடிக்'.
NEGATIVE
In the above discussion of the Tamil finite verbs, we have left out the role of the negative suffixes. In a verb the tense markers are replaced by a negative marker; in other words, tense suffixes and negative suffixes do not co-occur in a Tamil verb. Formation of the negative differs radically between Classical and Modern Tamil verbs. For example, the negative form of 'தாழ்த்துபயன்' ( I will make something sink ) is 'தாழ்த்பதன்' ( I will not make something sink ). In Modern Tamil, 'இல்த' and 'நாட்ட்-' are used for expressing the negative. Example: யபயில்த [ did not come ], 'யபநாட்பென்' [ I will not come ], 'கசால்யில்த' [ didn't say ], கசால் நாட்பென் [ I will not say ]. 'இல்த' is used to express a negative in the past / present, and 'நாட்ட்-' is used to express the negative in the future. There are also morphemes expressing negation, such as -ஆ- and -ஆத்-.
IMPERATIVES
Imperatives may be positive or negative, and an imperative can take a plural marker; the singular is unmarked. In negative imperatives a negative suffix is added, and in transitive imperatives a transitive suffix is added. The structure of an imperative is as follows:

STEM + TRANSITIVE + CAUSATIVE  /  STEM + TRANSITIVE + NEGATIVE + EE ( IMPERATIVE MARKER )

Example:

தாழ்                              [ sink ]                                ( IMPERATIVE ) STEM
தாழ் + த்த் + உ                   [ make something sink ]                  ( TRANSITIVE IMPERATIVE ) STEM + TRANSITIVE + U
தாழ் + த்த் + ஆத் +               [ don't make something sink ]            ( TRANSITIVE NEGATIVE IMPERATIVE ) STEM + TRANSITIVE + NEGATIVE + IMPERATIVE MARKER
தாழ் + த்த் + ஆத் + ஈர் ( -ள் )    [ you shouldn't make something sink ]    ( TRANSITIVE NEGATIVE IMPERATIVE, 2ND PERSON ) STEM + TRANSITIVE + NEGATIVE + 2P
தாழ் + த்த் + உங்ள்               [ you make something sink ]              ( TRANSITIVE IMPERATIVE ) STEM + TRANSITIVE + UNKAL
தாழ் + த்த் + உம்                 [ you make something sink ]              ( TRANSITIVE IMPERATIVE ) STEM + TRANSITIVE + UM
OPTATIVE

The structure of the Tamil optative is STEM + TRANSITIVE + CAUSATIVE + OPTATIVE MARKER. In the case of verbs without a transitive/intransitive opposition, the optative marker is added directly to the stem. Example:

தாழ் +                     [ may you go down ]
STEM + OPTATIVE MARKER

தாழ் + த்த் + உ +           [ may you bring something down ]
STEM + TRANS + U + OPTATIVE MARKER

NON-FINITE

There are three types of non-finite verbal forms: the infinitive, the verbal participle and the relative participle.
INFINITIVE

Conditionals occur both in infinitives and verbal participles. Example: kaaN ( see ), kaaN + a ( to see ).

naan ( I ) raamanai ( Raman + acc ) kaaNa ( to see ) pooneen ( I went )
ான் பாநத ாண பாபன்     I went to see Raman.

An infinitive can be intransitive, transitive or conditional.

kaaNa ( kaaN + a )             [ to see ]             intransitive infinitive
kaatta ( kaaN + trans + a )    [ to show ]            transitive infinitive
kaaN + in                      [ if ( one ) sees ]    conditional infinitive

Note: In Modern Tamil, the conditional infinitive is not in use; the conditional verbal participle has replaced it. For example, 'யாின்' has been replaced by 'யந்தால்'.

VERBAL PARTICIPLE
One of the important formal features of the verbal participle is that it takes only past tense markers. Hereafter the Tamil words are written in Romanized form; for details of the Tamil Romanization, see the Appendix. Example:

kotu + thth + u     [ having given ]        stem + past + u ( verbal participle marker )
maRa + nth + u      [ having forgotten ]    stem + past + u ( verbal participle marker )
uN + t + u          [ having eaten ]        stem + past + u ( verbal participle marker )
oot + i + 0         [ having run ]          stem + past + 0

( Verbs which take 'in' as the past tense marker have the structure verb + tense + 0. ) The stem may be transitive or intransitive.
oti + nt + u           [ having broken ]               ( intransitive ) stem + past + vp marker
oti + th + th + u      [ having broken by someone ]    ( transitive ) stem + trans + past + vp marker
kizhi + nt + u         [ having torn ]                 stem + past + vp marker
kizhi + th + th + u    [ having torn something ]       stem + trans + past + vp marker
NEGATIVE VERBAL PARTICIPLE
The past tense marker is replaced by a negative marker in negative VPs. For example,

pook + aath + u     [ without going ]      stem + neg + vp marker
oot + aath + u      [ without running ]    stem + neg + vp marker

The negative suffix 'aath' is always followed by 'u'. 'aa' is another negative marker, which occurs in combination with the verbal participle marker 'mal'. Example:

pook + aa + mal     [ without going ]      stem + neg + vp marker
oot + aa + mal      [ without running ]    stem + neg + vp marker

CONDITIONAL VERBAL PARTICIPLE

The 'u' at the end of the verbal participle form is replaced by the suffix 'aal'. Example:

vaa + nth + aal     [ if ( one ) come(s) ]    stem + past + conditional vp marker

Though a VP contains only a past tense suffix, the meaning is always determined by the main verb. Only verbs in the future tense can occur after a conditional verbal participle. Example:

nii vanthaal ithu natakkum [ ீ யந்தால் இது ெக்கும் ]     If you come, then this will happen.
RELATIVE PARTICIPLE

The structure of a relative participle is STEM + TRANSITIVE + TENSE / NEG + A ( RP MARKER ). Syntactically, the RP functions as an adjective. Example:

Intransitive:            va + nt + a               [ which came ]                        stem + past + rp marker
Transitive:              varu + thth + in + a      [ which made something come ]         stem + trans + past + rp
Transitive negative:     varu + thth + aath + a    [ which didn't make someone come ]    stem + trans + neg + rp
VERBAL NOUNS
The category verbal noun, as a word class, has a unique position within verb morphology because, like nouns, verbal nouns can also take case suffixes. A verbal noun suffix is added to an intransitive or transitive verb stem. When they take verbal noun suffixes, they do not take tense or PNG suffixes. We are dealing here only with the verbal noun forms ending in a verbal noun suffix. There are also other nominalized forms of verbs, such as vanthoon or vanthavan, which contain both tense and PNG suffixes and are also capable of taking case suffixes; we shall deal with such forms later. The structure of a verbal noun is STEM + TRANS + NEG + VERBAL NOUN SUFFIX. The following examples illustrate these structures.

varu + thal                [ coming ]                          STEM + VERBAL NOUN SUFFIX
varu + ththu + thal        [ making something come ]           STEM + TRANS + VN SUFFIX
var + aa + mai             [ not coming ]                      STEM + NEG + VN SUFFIX
varu + thth + aa + mai     [ making something not come ]       STEM + TRANS + NEG + VN
varu + kai                 [ coming ]                          STEM + VN
etu + ppu                  [ posture ], [ taking ]             STEM + VN
thool + vi                 [ failure ]                         STEM + VN

The verbal noun suffixes can be broadly classified into the following six groups: al, thal, mai, kai, pu, vi.

Back to the pronominalized verbal nouns, with trans, tense or negative and pronominal suffixes. From a verb like 'varu' it is possible to derive the pronominalized verbal noun

varu + p + avan            [ he who will come ]                STEM + FUT + PNG

These forms are different from the finite verbal forms such as

varu + v + aan             [ he will come ]                    STEM + FUT + PNG

The translation makes it clear that the emphasis in the first case lies on the pronominal suffix and in the second case on the verb itself.

varu + kinR + avan            [ he who comes ]                 STEM + PRES + PNG
varu + kinR + aan             [ he comes ]                     STEM + PRES + PNG
va + nth + avan               [ he who came ]                  STEM + PAST + PNG
va + nth + aan                [ he came ]                      STEM + PAST + PNG
varu + ththu + kinR + avan    [ he who makes something come ]  STEM + TRANS + PRES + PNG
varu + ththu + kinR + aan     [ he makes something come ]      STEM + TRANS + PRES + PNG

One of the important differences lies in the pronominal suffixes. The nominalized verbal forms take pronominal suffixes different from those found in the finite verbs.
ADJECTIVES
There are no real adjectival stems or roots in Tamil. Many of the adjectives are formed from noun roots. They end mostly with an 'a'. Example:

nal ( root of nalla )          nalla [ good ]
peru ( root of periya )        periya [ big ]
ini ( root of sweetness )      iniya [ sweet ]
ciR                            ciRiya [ small ]

DERIVED ADJECTIVES

Derived adjectives are formed by adding the relative participles 'aana' or 'aaya' to a noun. Example:

azhaku [ beauty ]      azhakaana [ beautiful ]      azhakaaya [ beautiful ]

Another type of derived adjective is formed by adding 'iya' to a noun. Example:

azhaku [ beauty ]      azhakiya [ beautiful ]
puthu [ new ]          puthiya puththakam [ new book ]
ADVERB
Most adverbs are formed by adding 'aaka' to a noun. Example:

veekam [ speed ] + aaka -> veekamaaka [ speedily ]
methu [ soft ] + aaka -> methuvaaka [ softly ]
katumai [ harshness ] + aaka -> katumaiyaaka [ harshly ]

There are also simple adverbial forms such as atikkati [ frequently ], innum [ still ], ini [ hereafter ], marupati [ again ], mella [ slowly ], nanku [ well ]. Some of the verbal participle forms are also used as adverbs.

CLITICS

Clitics are added to words or heads of all syntactic categories except the adjectivals and nominals functioning as noun modifiers. Clitics are bound forms: they cannot be inflected or modified with any other suffix. They have clear semantic aspects such as emphasis, interrogation and collectiveness, apart from some grammatical functions such as coordination. There are five important clitics in Tamil: um, oo, aa, ee and thaan.

CLITIC 'um'
The clitic 'um' has a very high frequency in Tamil. It is used for emphasis, concessive, completeness and also for conjunction.

avan vanththaan                    [ he came ]
avan + um vanththaan               [ he also came ]        [ emphasis ]
avan pookalam                      [ he can go ]
avan + um pookalam                 [ he can also go ]      [ concessive ]
elloor + um pookalam               [ everyone can go ]     [ completeness ]
avanum avalum naanum cinimaaviRku poonoom                  [ conjunction ]

Example: "He, she and I went to the film" – avan [ he ] + um [ and ], aval [ she ] + um [ and ], naan [ I ] + um [ and ].

CLITIC 'thaan'

'thaan' is also used for emphasis. For example,

avan ceythaan                      [ he did ]
avan thaan ceythaan                [ he only did ]
avan [ he ] neeRRu [ yesterday ] vanthaan [ came ]                    [ he came yesterday ]
avan [ he ] neeRRu [ yesterday ] thaan [ only ] vanthaan [ came ]     [ he came only yesterday ]
avan ooti vanththaan               [ he came running ]
avan ootiththaan vanththaan        [ he came indeed running ]
avan ettu maNikku varuvaan         [ he will come at eight o'clock ]
avan ettu maNikku thaan varuvaan   [ he will come only at eight o'clock ]

CLITIC 'aavathu'

'aavathu' is morphologically a verb consisting of a verb root + future tense suffix + neuter singular suffix. It is a derived pronominalized verbal form where the emphasis is on the pronominal suffix, and its morphological meaning is 'that which will come'. Syntactically it is used as a clitic in the meaning of an ordinal or as a concessive. For example,

iraNtu [ two ] + aavathu                       [ second ]     [ ordinal ]
naal [ four ] + aavathu                        [ fourth ]     [ ordinal ]
raman inRu vanthirukkalaam                     [ Raman could have come today ]
raman + aavathu inRu vanthirukkalaam           [ at least Raman could have come today ]     [ concessive ]
raman mathuraikku pooyirukkalaam               [ Raman could have gone to Madurai ]
raman mathuraikku + aavathu pooyirukkalaam     [ Raman could at least have gone to Madurai ]     [ concessive ]

'aam', also formed from the verb 'aa' + the future tense marker 'm', is used to express ordinals, but 'aam' cannot be used as a concessive clitic. Example:

naal [ four ] + aam viitu [ house ]            [ fourth house ]
ainthu + aam kii.mi                            [ fifth k.m. ]

The clitics 'aa', 'oo' and 'ee' are suffixed to words other than adjectives. 'aa' is used in an informative question, 'oo' in a doubtful question, and 'ee' in a rhetorical question.

ramanaa vanthaan        [ did Raman come? ]            [ informative ]
ramanoo vanthaan        [ Raman came, didn't he? ]     [ the questioner doubts whether Raman has come ]
raman varavillaiyee     [ Raman didn't come, did he? ] [ rhetorical ]
raman vanthaanee        [ Raman came, didn't he? ]     [ rhetorical ]

Note: Just like the verb 'aa', the verb kuutu also forms a word 'kuuta', which is used in the meaning of 'also'.

ASPECTUAL

Aspect is expressed by auxiliary verbs which are added to the verbal participle form of the main verb. We give below the aspects and the corresponding auxiliaries.

CONTINUOUS TENSE

The auxiliary verb 'koNtu' expresses a continuous action. Example:

vanthu [ come ] koNtu [ -ing ] irunth [ was ] aan [ he ]     [ he was coming ]

'koNtu + iru' represents the continuous tense.

vanthu koNtu irukkiR aan     [ he is coming ]
vanthu koNtu irupp aan       [ he will be coming ]
vanthu koNtu irunth aan      [ he was coming ]

It is possible to add a number of clitics after 'koNtu'. Example:

patiththuk [ read ] koNtu [ -ing ] + um [ also ] irukkiR [ is ] + aan [ he ]     [ he is also reading ]
vanthu koNtu + thaan irukkiR + aan          [ he is indeed coming ]
ootik koNtu + ee irukkiR + aan              [ he is continuously running ]
ootik koNtu + aa irukkiR + aan              [ is he running? ]
ootik koNtu + oo irukkiR + aan              [ he is running, isn't he? ]     [ doubt ]
ootik koNtu + ee + aa irukkiR + aan         [ is he continuously running? ]

After 'thaan' it is possible to add interrogative suffixes. Example:

ootik koNtu thaan + aa irukkiR + aan        [ is he running? ]     [ with emphasis on running ]
ootik koNtu thaan + oo irukkiR + aan        [ is he running? ]     [ doubt and emphasis on running ]
ootik koNtu thaan + ee irukkiR + aan        [ he is indeed running, isn't he? ]     [ emphasis on running ]
viLaiyaatik [ play ] koNtu [ -ing ] + ee [ continuously ] thaanee [ indeed ] irukkiR [ is ] + aan [ he ]     [ he is indeed continuously playing ]

PERFECT

This aspect is expressed by the auxiliary verb 'iru'. Example:

paarththu [ seen ] + irukkiR [ have ] + een [ I ]     [ I have seen ]
naan mathuraikku pooy + iru + pp + een      [ I would have gone to Madurai ]
naan mathuraikku pooy + iru + ntth + een    [ I had been to Madurai ]
paarththu + iru + kkiR + aan                [ he has seen ]
paarththu + iru + ntth + aan                [ he had seen ]
paarththu + iru + pp + aan                  [ he would have seen ]

DEFINITIVE

The auxiliary verb 'vit' expresses definitiveness. Example:

avan veelaiyee ceythu + vittaan             [ he has completed the work ]
naan caappittu + vitteen                    [ I have finished eating ]

REFLEXIVE

The reflexive expresses an action performed for one's own benefit. The verb 'koL' expresses this meaning. Example:

naan puththakangkalai vaangkik koNteen      [ I bought the books for myself ]
avan iraNtu rupaay etuththuk koNtaan        [ he took two rupees for himself ]
naan en kaiyai vettik koNteen               [ I cut my own hand ]

TRIAL

The auxiliary verb for trial is 'paar'. Example:

caappittu paarththaan       [ he tasted ( it ) ]
yoociththu paarththaan      [ he tried to recollect ]

RESERVATIONAL

The reservational is expressed by the auxiliary verb 'vai'. Example:

avar oru itam othikki vaiththaar                 [ he reserved a place ]
avar enakkuc caappaatu othikki vaiththaar        [ he reserved food for me ]
avar enakkuc caappaatuc camaiththu vaiththaar    [ she cooked and kept the food for me ]

MODAL

We give below the modal verbs, their meaning and distribution.

mutiyum [ can ] – This modal verb occurs after an infinitive, without PNG in the future tense and with neuter singular in the other tenses. It also has a negative form. Example:

ungkaLukku intha veelai ceyya muti + y ( sandhi ) + um ( fut )     [ you can ( fut ) do this work ]
ungkaLukku intha veelai ceyya muti + y + aathu ( neg )             [ you cannot ( fut ) do this work ]
ungkaLukku intha veelai ceyya muti + kiR + athu                    [ you can ( present ) do this work ]
ungkaLukku intha veelai ceyya muti + nth + athu                    [ you could do this work ]

aam [ may ] – 'aam' occurs after a verbal noun. Example:

niingkaL uLLee varal + aam     [ you may come inside ]
niingkaL [ you ] uLLee [ inside ] varal [ coming ] + aam [ may ]
niingkaL inRu varaamal [ negative verbal participle ] irukkal [ verbal noun ] + aam     [ you may not come today ]

VEENTUM

If something is forbidden, then the infinitive is followed by 'kuutaathu'. Example:

niingkaL uLLee vara [ Vinf ] kuutaathu      [ you should not come ]
niingkaL inRu kataikku vara veeNtum         [ you should come to the shop today ]
niingkaL inRu kataikku vara veeNtaam        [ you should not come to the shop today ]

PERMISSIVE MOOD – ttum

avan uLLee varattum     [ let him come inside ]
avan [ he ] uLLee [ inside ] vara [ come ] ttum [ let ]

iyalum and iyalaathu are used the same as mutiyum and mutiyaathu.
D.2 TAMIL NOUN MORPHOLOGY
The noun structure is as follows:

NOUN + PLURAL SUFFIX (OP) + [ OBLIQUE SUFFIX (OP) + CASE SUFFIX ] (OP) + CONJUNCTION / EMPHASIS (OP) + INTERROGATION / EMPHATIC (OP)

Examples of the nominative and oblique forms of nouns in Tamil:

NOMINATIVE            OBLIQUE
நபம்   maram          நபத்   marath
ாடு    naatu          ாட்    naat
ஆறு    aaRu           ஆற்    aaR
ெல்    katal          ெல்    katal
ல்     kal            ல்     kal
ால்    kaal           ால்    kaal
ROOT: the minimal morpheme.
STEM: a root with a suffix, which may be a formative suffix. In some cases there is no difference between a root and a stem. Example: kal – it is both a root and a stem.
GENDER: example: thief – திருென் [ thirutan ], திருடி [ thiruti ].
PLURAL: 'ள்' [ kaL ] is the neuter plural suffix. Example: the plural form of 'kaal' is 'kaalkaL' [ ால்ள் ].
In the case of collective nouns such as makkaL [ human beings ], there is no need to add a separate plural suffix. Example: makkaL vanthanar [ நக்ள் யந்தர் ] ( people came ).
TAMIL CASE SYSTEM

CASE    ENGLISH NAME           SIGNIFICANCE                                      SUFFIXES ( ARDEN )
1st     Nominative             Subject of sentence                               Null
2nd     Accusative             Object of action                                  ai
3rd     Instrumental / Social  Means by which action is done; association        aal, aan3, ootu, otu, utan3
4th     Dative                 Object to whom / for whom action is performed     [u]kku, [u]kkaaka
5th     Ablative of motion     Motion from an inanimate object;                  il, ininru, iliruntu, iruntu;
                               motion from an animate object                     itattilirunthu
6th     Genitive               Possessive                                        [Null], in, athu, aathu, a, utaiya, inutaiya
7th     Locative               Place in which; presence of                       il ( inanimate ), itam ( person / animate ), kan1
8th     Vocative               Addressing, calling                               [Null], e, a
D.3 ORTHOGRAPHIC RULES

This section lists some of the orthographic rules with examples.

( Two or more syllables ) am + { Case / Postposition }    aththu + { Case / Postposition }

maram + ai      maraththu + ai      maraththai
maram + meel    maraththu meel
nam + ai        nammai ( and not *naththu, because nam has only one syllable )
um + ai         ummai
em + ai         emmai
tu / Ru + { Case / Postposition }    ttu / RRu + { Case / Postposition }

naatu + ai        naattu + ai        naattai
naatu + paRRu     naattu + paRRu     naattuppaRRu
aaRu + ai         aaRRu + ai         aaRRai
aaRu + pakkam     aaRRu + pakkam     aaRRuppakkam
akatu + ai        akattu + ai        akattai
akatu + pakkam    akattu + pakkam    akattup pakkam

This change happens only when the word has two or more syllables. In the case of disyllabic words, the first syllable should be long:

natu + ai    natuvai
aRu + ai     aRuvai
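The gemination rule above lends itself to a string-rewriting sketch. The following is illustrative only (hypothetical function name; it covers just the tu / Ru gemination and the loss of the final 'u' before a vowel-initial suffix, not the syllable-count condition or the doubling of a following plosive):

```python
# Hypothetical sketch of one sandhi rule: stem-final 'tu' / 'Ru'
# geminates before a suffix, e.g. naatu + ai -> naattai (Romanized).
def geminate(stem, suffix):
    """Apply tu/Ru gemination; other stems are concatenated unchanged."""
    if stem.endswith("tu"):
        stem = stem[:-2] + "ttu"
    elif stem.endswith("Ru"):
        stem = stem[:-2] + "RRu"
    else:
        return stem + suffix
    if suffix[:1] in ("a", "e", "i", "o", "u"):
        stem = stem[:-1]  # a vowel-initial suffix replaces the final 'u'
    return stem + suffix
```

For example, `geminate("naatu", "ai")` yields `naattai` and `geminate("aaRu", "ai")` yields `aaRRai`, matching the examples above.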
[Consonant] Short Vowel [l / L] + [Consonant]
pal + kaL → paRkaL
col + kaL → coRkaL
kal + kaL → kaRkaL
muL + kaL → mutkaL
muL + kiriitam → mutkiriitam
Contrary to the rule above, in the following cases the change from 'l' to 'R' does not take place. The rule must be reformulated, or a new rule added, to accommodate both sets of cases.
kaal + kaL → kaalkaL
muungkil + kaL → muungkilkaL
kool + kaL → koolkaL
thukil + kaL → thukilkaL
mukil + kaL → mukilkaL
y / r / zh / Vowel + Plosive → y / r / zh / Vowel + doubled Plosive
(Plosive = k, c, th, p)
nay + kutti → naykkutti
ver + katalai → verkkatalai
pukazh + col → pukazhccol
puli + pal → pulippal
n / N + Plosive → R / t + Plosive
pon + kutam → poRkutam
maN + kutam → matkutam
In the following cases, however, no change takes place:
eN + kaL → eNkaL
maN + kaL → maNkaL
maan + kaL → maankaL
katan + kaL → katankaL
m + { k / c / th / p } → { ngk / njc / nth / mp (unchanged) }
maram + kaL → marangkaL
arum + ceyal → arunjceyal
perum + thokai → perunthokai
karum + palakai → karumpalakai
karum + kal → karungkal
varum + kai → varungkai
maram + pakkam → maram pakkam (no change)
mara + pakkam → marappakkam
mara [adj] + palakai → marappalakai
mara [adj] + kuruvi → marakkuruvi
em + kaL → engkaL
um + kaL → ungkaL
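The nasal assimilation of final m before k / c / th / p can be sketched as a small mapping. This is an illustrative sketch, not the thesis implementation: m becomes the nasal homorganic with the following plosive, and before p it stays m unchanged, matching the examples above.

```python
# Nasal assimilation of stem-final 'm' before a plosive-initial suffix.
ASSIMILATION = {"k": "ng", "c": "nj", "th": "n", "p": "m"}

def join_m(stem: str, suffix: str) -> str:
    """Join a stem ending in 'm' to a suffix beginning with k/c/th/p."""
    if not stem.endswith("m"):
        return stem + suffix
    for plosive in ("th", "k", "c", "p"):  # check 'th' before single letters
        if suffix.startswith(plosive):
            return stem[:-1] + ASSIMILATION[plosive] + suffix
    return stem + suffix

print(join_m("maram", "kaL"))      # marangkaL
print(join_m("arum", "ceyal"))     # arunjceyal
print(join_m("perum", "thokai"))   # perunthokai
print(join_m("karum", "palakai"))  # karumpalakai
```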
l + w (dental nasal) → n
L + w (dental nasal) → N
naal + wuuRu → naanuuRu
kaal + wimitam → kaanimitam
kaal + weruppu → kaaneruppu
vaaL + wimiththam → vaaNimiththam
col + watai → connatai or colwatai
kal + winaivu → kanninaivu
nal + wakaram → nannakaram
Exceptions:
vaal + weruppu → vaaneruppu or vaal weruppu
il + winaivu → il winaivu
pakal + watikan → pakal watikan
(Consonant) Vowel + Consonant → (Consonant) Vowel + doubled Consonant
mu + pathu → muppathu
mu + wuuru → munnuru
pu + pathu → puppathu
pu + thal → puththal
{ a / aa / u / uu / o / oo / au } + Vowel → {1} v Vowel
{ i / ii / ai / e / ee } + Vowel → {1} y Vowel
alai + ai → alaiyai
alai + um → alaiyum
ii + ai → iiyai
paNi + ootu → paNiyootu
paNi + um → paNiyum
kani + ai → kaniyai
thii + ai → thiiyai
nilaa + ai → nilaavai
amma + aal → ammavaal
ammu + ai → ammuvai
koo + ai → koovai
vee + ai → veeyai
caravana + aal → caravanavaal
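The v/y glide-insertion rule can be sketched very compactly at the transliteration level. This is an illustrative sketch, not the thesis implementation: the final vowel letter of the stem decides the glide (i/e-final stems, including ai, take y; a/u/o-final stems, including aa/uu/oo/au, take v), which matches the examples above.

```python
# Glide insertion between a vowel-final stem and a vowel-initial suffix.
def glide_join(stem: str, suffix: str) -> str:
    """Insert a y-glide after front vowels, a v-glide after back vowels."""
    if stem and suffix and stem[-1] in "aeiou" and suffix[0] in "aeiou":
        glide = "y" if stem[-1] in "ie" else "v"
        return stem + glide + suffix
    return stem + suffix

print(glide_join("alai", "ai"))    # alaiyai
print(glide_join("nilaa", "ai"))   # nilaavai
print(glide_join("paNi", "ootu"))  # paNiyootu
print(glide_join("koo", "ai"))     # koovai
```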
{ a / i / e } + Vowel → { a / i / e } vv Vowel
{ a / i / e } + Consonant → { a / i / e } + doubled Consonant
av + uyir → avvuyir
av + itam → avvitam
av + aaRu → avvaaRu
a + patam → appatam
a + viitu → avviitu
a + pati → appati
e + pati → eppati
Here a and i are demonstrative bases and e is interrogative; pati is a noun meaning 'manner'.
l + Nasal → n
L + Nasal → N
kal + malai → kanmalai
pal + malai → panmalai
muL + makutam → muNmakutam
muL + muti → muNmuti
In modern Tamil, this rule is no longer in use.
kuRRiyalukaram
The ultra-short 'u' (kuRRiyalukaram) occurs after the plosive consonants k, c, t, th, p and R.
Consonant + Long Vowel + { k / c / th / p } + u + Vowel → Consonant + Long Vowel + { k / c / th / p } + Vowel
[Two or more syllables] + { k / c / th / p } + u + Vowel → { k / c / th / p } + Vowel
Consonant + Long Vowel + { t / R } + u + Vowel → Consonant + Long Vowel + { tt / RR } + Vowel
[Two or more syllables] + { t / R } + u + Vowel → { tt / RR } + Vowel
natu – ukaram [thani kuRil]
naatu – kuRRiyalukaram
katuku – kuRRiyalukaram
paricu – kuRRiyalukaram
akatu – kuRRiyalukaram
parunthu – kuRRiyalukaram
vaaththu – kuRRiyalukaram

A full ukaram takes a v-glide before a vowel-initial suffix [Uyir + Uyir]:
natu + ai → natuvai
mathu + ai → mathuvai
kocu + ai → kocuvai
katuku + ai → katukai
vaaththu + ai → vaaththai
marunthu + ai → marunthai
paampu + ai → paampai
kaathu + ai → kaathai
kaacu + ai → kaacai
naatu + ai → naattai
kaatu + ai → kaattai
kacatu + ai → kacattai
aatu + ai → aattai
aaRu + ai → aaRRai
naaRu + ai → naaRRai
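The behaviour in the examples above can be sketched in one function. This is an illustrative sketch, not the thesis implementation: a full ukaram (a short disyllable like natu) keeps its u and takes a v-glide, while a kuRRiyalukaram drops its u, geminating a single t or R; the syllable and vowel-length tests are rough approximations over the transliteration.

```python
# Attaching a vowel-initial suffix (e.g. accusative 'ai') to a u-final word.
import re

LONG_VOWELS = ("aa", "ii", "uu", "ee", "oo", "ai", "au")

def syllables(word: str) -> int:
    # Count maximal vowel runs as syllable nuclei (approximation).
    return len(re.findall(r"[aeiou]+", word))

def add_vowel_suffix(word: str, suffix: str) -> str:
    stem = word[:-1]  # drop the final u
    kurriyal = syllables(word) >= 3 or any(v in word for v in LONG_VOWELS)
    if not kurriyal:                       # full ukaram: natu + ai -> natuvai
        return word + "v" + suffix
    if stem.endswith(("t", "R")) and not stem.endswith(("tt", "RR")):
        stem += stem[-1]                   # naatu -> naatt, aaRu -> aaRR
    return stem + suffix

print(add_vowel_suffix("naatu", "ai"))   # naattai
print(add_vowel_suffix("katuku", "ai"))  # katukai
print(add_vowel_suffix("natu", "ai"))    # natuvai
```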
ACCUSATIVE AND DATIVE + PLOSIVE CONSONANT → the plosive is doubled
avanaip paarththeen
avanukkuk kotuththeen
cinimaviRkuc cenReen
APPENDIX E
E.1 LIST OF POST POSITIONS IN TAMIL (PARTIAL LIST)
About – paRRi, kuRiththu, paarththu, nookki
Above – meelee
Across – kuRukkee
After – piRaku, pin, appuRam
Against – mun, munnaal, ethithaRpool
Along – kuuta
Alongside –
Amid – uLLee, itaiyil
Amidst – itaiyil, natuvil
Among – itaiyee, itaiyil
Amongst – itaiyee, itaiyil
Around – eeRakkuRaiya, cuRRi
As – poola
At – il, ku
Atop – ucci
Before – mun, munnaal
Behind – pin, pinnaal
Below – kiizh, kiizhee, atiyil
Beneath – kiizh, kiizhee, atiyil
Beside – pakkam
Between – itayil
Beyond – appaal
By – koNtu
Despite – iruntha poothilum
Down – kizhee
During – poothu
Except – thavira
From – il-irunthu, muthal
In – il, uLLa
Inside – uLLee
Into – uLLee
Like – poola, maathiri
Mid – natu
Near – arukee, arukil
Next – atuththu
Notwithstanding – appati irunthaalum, aayinum
Off – appaal
On – meel
Opposite – ethiRee, ethiril
Outside – veLiyee
Over – meel
Regarding – paRRi
Round – cuRRi
Since – il-irunthu
Than – vita
Through – vazhiyaaka, vaayilaaka, muulam
Throughout – muzhuvathumaaka
Till – varai
Times – neerangkaLil
To – varai
Toward – nookki, paarththu
Towards – nookki, paarththu
Under – kiizhee, atiyil
Underneath – kiizhee, atiyil
Unlike – pool allaathu
Until – varai, varaikkum
Up – meelee
Upon – meeR paRRi, miithu
Via – vaziyaaka
With – utan, itam
Within – uLLee
Without – veLiyee
According to – pati
Ahead of – munnaal
Along with – utan
As to –
Aside from – thavira
Because of – aathalaal, aakaiyaal
Close to – arukil, arukee
Due to – aathalaal
Far from – thooraththil irunthu
Inside of – uLLee
Instead of – pathilaaka
Near to – arukil
Next to – atuththu, atuththaRpool
Out of – veLiyee
Outside of – veLiyee
Owing to – kaaraNamaaka
Prior to – munnaal
Pursuant to – thotarnthu
Subsequent to – atuththu
As far as – varai
As well as – kuuta
By means of – koNtu
In accordance with – pati
In addition to – meel
In front of – munnaal, ethithaRpool
In spite of – irunthum
In place of – pathilaaka
On account of – kaaraNamaaka
On behalf of – pathilaaka
On top of – meelaaka
With regard to – otti
With reference to – vaiththu, paarvai
In case of – irunthaal, enRaal
Ago – munnaal
Apart – thavira, puRampaka
Away – appaal, thuuraththil
Hence – aathalaal, aakaiyaal
PUBLICATIONS

Sribadri Narayanan R, Saravanan S, and Soman K P, "Data Driven Suffix List and Concatenation Algorithm for Telugu Morphological Generator," International Journal of Engineering Science and Technology, vol. 3, no. 8, pp. 6712-6717, August 2011.

Ramasamy Veerappan, Antony P J, Saravanan S, and Soman K P, "A Rule Based Kannada Morphological Analyzer and Generator using Finite State Transducer," International Journal of Computer Applications, pp. 45-52, August 2011.

Hemant Darbari, Anuradha Lele, Aparupa Dasgupta, Priyanka Jain, and Saravanan S, "EILMT: A Pan-Indian Perspective in Machine Translation," in Proceedings of the Tamil Internet Conference, Coimbatore, India, 2010.

Saravanan S, Menon A G, and Soman K P, "Pattern Based English-Tamil Machine Translation," in Proceedings of the Tamil Internet Conference, Coimbatore, India, 2010, pp. 295-299.

Menon A G, Saravanan S, Loganathan R, and Soman K P, "Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach," in Proceedings of the Tamil Internet Conference, Cologne, Germany, 2009, pp. 239-243.