MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE

A Thesis Submitted for the Degree of Doctor of Philosophy in the School of Engineering

by ANAND KUMAR M

CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

AMRITA SCHOOL OF ENGINEERING

AMRITA VISHWA VIDYAPEETHAM
COIMBATORE-641 112, TAMILNADU, INDIA

April 2013

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641 112

BONAFIDE CERTIFICATE

This is to certify that the thesis entitled “MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE” submitted by Mr. ANAND KUMAR M, Reg. No. CB.EN.D*CEN08002 for the award of the Degree of Doctor of Philosophy in the School of Engineering is a bonafide record of the work carried out by him under my guidance and supervision at Amrita School of Engineering, Coimbatore.

Thesis Advisor

Dr. K.P. SOMAN
Professor and Head,
Center for Excellence in Computational Engineering and Networking

AMRITA SCHOOL OF ENGINEERING
AMRITA VISHWA VIDYAPEETHAM, COIMBATORE-641 112
CENTER FOR EXCELLENCE IN COMPUTATIONAL ENGINEERING AND NETWORKING

DECLARATION

I, ANAND KUMAR M (Reg. No. CB.EN.D*CEN08002), hereby declare that this thesis entitled "MORPHOLOGY BASED PROTOTYPE STATISTICAL MACHINE TRANSLATION SYSTEM FOR ENGLISH TO TAMIL LANGUAGE" is the record of the original work done by me under the guidance of Dr. K.P. SOMAN, Professor and Head, Center for Excellence in Computational Engineering and Networking, Amrita School of Engineering, Coimbatore, and that, to the best of my knowledge, this work has not formed the basis for the award of any degree/diploma/associateship/fellowship or a similar award to any candidate in any University.

Place: Coimbatore

Signature of the Student

Date:

COUNTERSIGNED

Thesis Advisor
Dr. K.P. SOMAN
Professor and Head,
Center for Excellence in Computational Engineering and Networking

TABLE OF CONTENTS

ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
ABBREVIATIONS
ABSTRACT

1 INTRODUCTION
   1.1 GENERAL
   1.2 OVERVIEW OF MACHINE TRANSLATION
   1.3 ROLE OF MACHINE TRANSLATION IN NLP
   1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM
   1.5 MOTIVATION OF THE THESIS
   1.6 OBJECTIVE OF THE THESIS
   1.7 RESEARCH METHODOLOGY
       1.7.1 Overall System Architecture
       1.7.2 Details of Preprocessing English Language Sentence
             1.7.2.1 Reordering English Language Sentence
             1.7.2.2 Factorization of English Language Sentence
             1.7.2.3 Compounding of English Language Sentence
       1.7.3 Details of Preprocessing Tamil Language Sentence
             1.7.3.1 Tamil Part-of-Speech Tagger
             1.7.3.2 Tamil Morphological Analyzer
       1.7.4 Factored SMT System for English to Tamil Language
       1.7.5 Postprocessing for English to Tamil SMT
             1.7.5.1 Tamil Morphological Generator
   1.8 RESEARCH CONTRIBUTIONS
   1.9 ORGANISATION OF THE THESIS

2 LITERATURE SURVEY
   2.1 PART OF SPEECH TAGGER
       2.1.1 Part of Speech Tagger for Indian Languages
       2.1.2 Part of Speech Tagger for Tamil Language
   2.2 MORPHOLOGICAL ANALYZER AND GENERATOR
       2.2.1 Morphological Analyzer and Generator for Indian Languages
       2.2.2 Morphological Analyzer and Generator for Tamil Language
   2.3 MACHINE TRANSLATION SYSTEMS
       2.3.1 Machine Translation Systems for Indian Languages
       2.3.2 Machine Translation Systems for Tamil Language
   2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM
   2.5 RELATED NLP WORKS IN TAMIL
   2.6 SUMMARY

3 THEORETICAL BACKGROUND
   3.1 GENERAL
       3.1.1 Tamil Language
       3.1.2 Tamil Grammar
       3.1.3 Tamil Characters
       3.1.4 Morphological Richness of Tamil Language
       3.1.5 Challenges in Tamil NLP
             3.1.5.1 Ambiguity in Morpheme
             3.1.5.2 Ambiguity in Word Class
             3.1.5.3 Ambiguity in Word Sense
             3.1.5.4 Ambiguity in Sentence
   3.2 MORPHOLOGY
       3.2.1 Types of Morphology
       3.2.2 Lexemes
       3.2.3 Lemma and Stems
       3.2.4 Inflections and Word forms
       3.2.5 Morphemes and Types
       3.2.6 Allomorphs
       3.2.7 Morpho-Phonemics
       3.2.8 Morphotactics
   3.3 MACHINE LEARNING FOR NLP
       3.3.1 Machine Learning
       3.3.2 Support Vector Machines
       3.3.3 Geometrical Interpretation of SVM
       3.3.4 SVM Formulation
   3.4 VARIOUS APPROACHES FOR POS TAGGING
       3.4.1 Supervised POS Tagging
       3.4.2 Unsupervised POS Tagging
       3.4.3 Rule based POS Tagging
       3.4.4 Stochastic POS Tagging
       3.4.5 Other Techniques
   3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER
       3.5.1 Two level Morphological Analysis
       3.5.2 Unsupervised Morphological Analyzer
       3.5.3 Memory based Morphological Analysis
       3.5.4 Stemmer based Approach
       3.5.5 Suffix Stripping based Approach
   3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION
       3.6.1 Linguistic or Rule based Approaches
             3.6.1.1 Direct Approach
             3.6.1.2 Interlingua Approach
             3.6.1.3 Transfer Approach
       3.6.2 Non Linguistic Approaches
             3.6.2.1 Dictionary based Approach
             3.6.2.2 Empirical or Corpus based Approach
             3.6.2.3 Example based Approach
             3.6.2.4 Statistical Approach
       3.6.3 Hybrid Machine Translation System
   3.7 EVALUATING STATISTICAL MACHINE TRANSLATION
       3.7.1 Human Evaluation Techniques
       3.7.2 Automatic Evaluation Techniques
             3.7.2.1 BLEU Score
             3.7.2.2 NIST Metric
             3.7.2.3 Precision and Recall
             3.7.2.4 Edit Distance Measures
   3.8 SUMMARY

4 PREPROCESSING FOR ENGLISH LANGUAGE
   4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE
       4.1.1 POS and Lemma Information
       4.1.2 Syntactic Information
       4.1.3 Dependency Information
   4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES
       4.2.1 Reordering English Sentences
             4.2.1.1 Syntactic Comparison between English and Tamil
             4.2.1.2 Reordering Methodology
       4.2.2 Factoring English Sentence
       4.2.3 Compounding English Language Sentence
             4.2.3.1 Morphological Comparison between English and Tamil
             4.2.3.2 Compounding Methodology for English Sentence
       4.2.4 Integrating Reordering and Compounding
   4.3 SUMMARY

5 PART OF SPEECH TAGGER FOR TAMIL LANGUAGE
   5.1 GENERAL
       5.1.1 Part of Speech Tagging
       5.1.2 Tamil POS Tagging
   5.2 COMPLEXITY IN TAMIL POS TAGGING
       5.2.1 Root Ambiguity
       5.2.2 Noun Complexity
       5.2.3 Verb Complexity
       5.2.4 Adverb Complexity
       5.2.5 Postposition Complexity
   5.3 PART OF SPEECH TAGSET DEVELOPMENT
       5.3.1 Available POS Tagsets for Tamil
       5.3.2 AMRITA POS Tagset
   5.4 DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING
       5.4.1 Untagged and Tagged Corpus
       5.4.2 Available Corpus for Tamil
       5.4.3 POS Tagged Corpus Development
       5.4.4 Applications of Tagged Corpus
       5.4.5 Details of POS Tagged Corpus Developed
   5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL
       5.5.1 SVMTool
       5.5.2 Features of SVMTool
       5.5.3 Components of SVMTool
             5.5.3.1 SVMTlearn
             5.5.3.2 SVMTagger
             5.5.3.3 SVMTeval
   5.6 RESULTS AND COMPARISON WITH OTHER TOOLS
   5.7 ERROR ANALYSIS
   5.8 SUMMARY

6 MORPHOLOGICAL ANALYZER FOR TAMIL
   6.1 GENERAL
       6.1.1 Morphology in Language
       6.1.2 Computational Morphology
       6.1.3 Morphological Analyzer
       6.1.4 Role of Morphological Analyzer in NLP
   6.2 TAMIL MORPHOLOGY
       6.2.1 Tamil Morphology and Language
       6.2.2 Syntax of Tamil Morphology
       6.2.3 Word Formation Rules (WFR) in Tamil
       6.2.4 Tamil Verb Morphology
       6.2.5 Tamil Noun Morphology
       6.2.6 Tamil Morphological Analyzer
       6.2.7 Challenges in Tamil Morphological Analyzer
   6.3 TAMIL MORPHOLOGICAL ANALYZER SYSTEM
   6.4 TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS
       6.4.1 Morphological Analyzer using Machine Learning
       6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer
             6.4.2.1 Paradigm Classification
             6.4.2.2 Word forms
             6.4.2.3 Morphemes
             6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer
             6.4.2.5 Issues in Data Creation
       6.4.3 Morphological Tagging Framework using SVMTool
             6.4.3.1 Support Vector Machine (SVM)
             6.4.3.2 SVMTool
             6.4.3.3 Implementation of Morphological Analyzer System
   6.5 MORPH ANALYZER FOR PRONOUN USING PATTERNS
   6.6 MORPH ANALYZER FOR PROPER NOUN USING SUFFIXES
   6.7 RESULTS AND EVALUATION
   6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE
   6.9 SUMMARY

7 FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL
   7.1 STATISTICAL MACHINE TRANSLATION
   7.2 COMPONENTS OF SMT
       7.2.1 Translation Model
             7.2.1.1 Expectation Maximization
             7.2.1.2 Word based Translation Model
             7.2.1.3 Phrase based Translation Model
       7.2.2 Language Model
             7.2.2.1 N-gram Language Models
       7.2.3 Statistical Machine Translation Decoder
   7.3 INTEGRATING LINGUISTIC INFORMATION IN SMT
       7.3.1 Factored Translation Models
             7.3.1.1 Decomposition of Factored Translation
       7.3.2 Syntax based Translation Models
   7.4 TOOLS USED IN SMT SYSTEM
       7.4.1 MOSES
       7.4.2 GIZA++ & MKCLS
       7.4.3 SRILM
   7.5 DEVELOPMENT OF FACTORED CORPORA
       7.5.1 Parallel Corpora Collection
       7.5.2 Monolingual Corpora Collection
       7.5.3 Automatic Creation of Factored Corpora
   7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE
       7.6.1 Building Language Model
       7.6.2 Building Phrase based Translation Model
   7.7 SUMMARY

8 POSTPROCESSING FOR ENGLISH TO TAMIL SMT
   8.1 GENERAL
   8.2 MORPHOLOGICAL GENERATOR
       8.2.1 Challenges in Tamil Morphological Generator
       8.2.2 Simplified Part-of-Speech Categories
   8.3 MORPHOLOGICAL GENERATOR FOR TAMIL NOUN AND VERB
       8.3.1 Algorithm for Noun and Verb Morphological Generator
       8.3.2 Word-forms Handled in Morphological Generator
       8.3.3 Data Required for the Algorithm
             8.3.3.1 Morpho Lexical Information File
             8.3.3.2 Paradigm Classification Rules
             8.3.3.3 Suffix Table
             8.3.3.4 Stemming Rules
   8.4 MORPHOLOGICAL GENERATOR FOR TAMIL PRONOUNS
   8.5 SUMMARY

9 EXPERIMENTS AND RESULTS
   9.1 GENERAL
   9.2 EXPERIMENTAL SETUP AND RESULTS
   9.3 SUMMARY

10 CONCLUSION AND FUTURE WORK
   10.1 SUMMARY
   10.2 SUMMARY OF WORK DONE
   10.3 CONCLUSIONS
   10.4 FUTURE DIRECTIONS

APPENDIX-A
   A.1 TAMIL TRANSLITERATION
   A.2 DETAILS OF AMRITA POS TAGS

APPENDIX-B
   B.1 PENN TREE BANK POS TAGS
   B.2 DEPENDENCY TAGS
   B.3 TAMIL VERB MLI
   B.4 TAMIL NOUN WORD FORM
   B.5 TAMIL VERB WORD FORM
   B.6 MOSES INSTALLATION AND TRAINING
   B.7 COMPARISON WITH GOOGLE OUTPUT
   B.8 GRAPHICAL USER INTERFACES

REFERENCES
AUTHOR'S PUBLICATIONS

ACKNOWLEDGEMENT

I would never have been able to finish my dissertation without the guidance, support and encouragement of numerous people, including my mentors, friends and colleagues, and the support of my family and my wife. At the end of my thesis, I would like to thank all those who made this thesis possible and an unforgettable experience for me.

First and foremost, I feel deeply indebted to Her Holiness Most Revered Mata Amritanandamayi Devi (Amma) for her inspiration and guidance throughout my doctoral studies, in both unseen and unconcealed ways. I wholeheartedly thank our respected Pro Chancellor, Swami Abhayamrita Chaitanya, for providing the necessary environment, infrastructure and encouragement for my research at Amrita Vishwa Vidyapeetham University. I thank Dr. P. Venkat Rangan, our respected Vice Chancellor, for his wholehearted encouragement and support throughout my doctoral studies.

I would like to express my sincere gratitude to my supervisor, Dr. K.P. Soman, Professor and Head, Centre for Excellence in Computational Engineering and Networking (CEN), for his excellent guidance and patience, and for providing an excellent atmosphere for doing research. His wide knowledge and logical way of thinking have been a great source of inspiration for me. I am really happy and proud to say that I am a student of Dr. K.P. Soman. He has always extended a helping hand in solving research problems. The in-depth discussions, scholarly supervision and constructive suggestions received from him have broadened my knowledge. I strongly believe that without his guidance, the present work could not have reached this stage.

I wish to thank my doctoral committee members, Dr. C.S. Shunmuga Velayutham and Dr. V.P. Mohandass, for their encouraging words and support throughout this research. I express my heartfelt gratitude to Dr. N.S. Pandian, Dean, PG Programmes, Amrita Vishwa Vidyapeetham, Coimbatore, for his continuous support of my Ph.D. study and research.

I wish to thank Dr. S. Rajendran for his supervision, advice and guidance from the very early stage of this research, as well as for giving me extraordinary experiences throughout the work. I express my deepest gratitude to Mrs. V. Dhanalakshmi, Head of the Department of Tamil, SRM University, Chennai; whatever knowledge I have gained in linguistics is definitely because of her. I also wish to thank my school teacher, Mr. B. Vaithiyanathan M.Sc., M.Ed., for supporting me since my school days. I would like to thank Mr. Arun Sankar K, who has been a good friend since my graduate days and is always willing to help and give his best suggestions.

I express my sincere gratitude to my beloved Director, Dr. K.A. Chinnaraju, and Principal, Dr. N. Nagarajan, CIET, for giving me all the moral support to complete the thesis successfully. I would like to express my gratitude to my Head of the Department, Dr. S. Gunasekaran, who has always inspired me to complete this thesis work. I would also like to thank Mr. G. Ravi Kumar and Prof. Mrs. Janaki Kumar for their timely support and suggestions. I would like to thank my colleagues at the Department of Computer Science and Engineering, especially Mr. N. Ramkumar, Mr. N. Boopal, Mr. A. Suresh, Mr. M. Yogesh, Mr. C. Prabu and Mr. B. Saravanan, for sharing their enthusiasm and for supporting me from the beginning of my career at CIET.

I wish to express my warm and sincere thanks to Dr. Mrs. M.S. Vijaya, HOD (MCA), GRD Krishnamal College for Women, and Dr. M. Sabarimalai Manikandan, SAMSUNG Electronics, for their kind support and direction, which have been of great value in this study. My sincere thanks also go to Mr. Sivaprathap, Mr. Rakesh Peter, Mr. Loganathan, Mr. Antony P J, Mr. Ajit, Mr. Saravanan, Mr. Kathir, Mr. Senthil, Mr. V. Anand Kumar, Mrs. Latha Menon and Sampath Kumar of the CEN department for supporting me in all ways. I also express my sense of gratitude to my friends Ms. Resmi N.G and Ms. Preeja for their encouragement and guidance. My research would not have been possible without the help of my friends C. Murugesan, S. Ramakrishnan, S. Mohanraj and A. Baladhandapani; I would like to thank them for being with me in all circumstances.

I wish to give special thanks to my friends Mrs. Rekha Kishore, Mr. C. Arun Kumar, Mrs. Padmavathy and Mr. Tirumeni for supporting me in this research. I would like to thank my Grandpa Mr. M. Narayanasamy and Mr. A. Peter, who left us too soon; I hope that this work will make them proud. I would like to thank my uncle Mr. P.M. Palraj and aunt Mrs. P. Rajeswari for their encouragement and motivation during the difficult moments of my long years of education. I would also like to express my deepest gratitude to my Grandma Mrs. N. Valliyammal and my uncles Mr. N. Natesapandiyan and Mr. N. Pandiyan for supporting me from my school days.

I want to thank my parents, Mr. N. Madasamy and Mrs. M. Manohari, for their kind support and the confidence and love they have shown me. You have been my greatest strength, and I am blessed to be your son. I would also like to give special thanks to my beloved brother, Mr. M. Vasanthkumar, for his support in all ways. I wish to thank my sister, Mrs. S. Arthi, and her husband, Mr. K. Suresh, for supporting me in all ways. I would like to thank my father-in-law, Mr. P. Velusamy, and mother-in-law, Mrs. V. Ponnuthai; without their encouragement and moral support it would have been impossible for me to finish this work.

Finally, I would like to give special thanks to my wife, Mrs. Sharmiladevi V. She is always there to cheer me up at difficult times with great patience. Without her love and support it would have been impossible for me to finish this work.

-ANAND KUMAR M

LIST OF FIGURES

Figure 1.1   Morphology based Factored SMT for English to Tamil Language
Figure 1.2   Reordering of English Language
Figure 1.3   Mapping English Word Factors to Tamil Word Factors
Figure 1.4   Thesis Organization
Figure 3.1   Maximum Margin and Support Vectors
Figure 3.2   Training Errors in Support Vector Machine
Figure 3.3   Non-linear Classifier
Figure 3.4   Classification of POS Tagging Models
Figure 3.5   Two Level Morphology
Figure 3.6   Block Diagram of Direct Approach to Machine Translation
Figure 3.7   The Vauquois Triangle
Figure 3.8   Block Diagram of Transfer Approach
Figure 3.9   Block Diagram of EBMT System
Figure 3.10  Block Diagram of SMT System
Figure 3.11  Rule based Translation System with Post-processing
Figure 3.12  Statistical Machine Translation System with Pre-processing
Figure 4.1   Example of English Syntactic Tree
Figure 4.2   Preprocessing Stages of English Sentence
Figure 4.3   Process of Reordering
Figure 4.4   English Syntactic Tree
Figure 4.5   English to Tamil Alignment
Figure 4.6   Block Diagram for Compounding
Figure 4.7   Integration Process
Figure 5.1   Example of Untagged Corpus
Figure 5.2   Example of Tagged Corpus
Figure 5.3   Untagged Corpus before Pre-editing
Figure 5.4   Untagged Corpus after Pre-editing
Figure 5.5   Training Data Format
Figure 5.6   Implementation of SVMTlearn
Figure 5.7   Example Input
Figure 5.8   Example Output
Figure 5.9   Implementation of SVMTagger
Figure 5.10  Implementation of SVMTeval
Figure 6.1   Role of Morphological Analyzer in NLP
Figure 6.3   General Framework for Morphological Analyzer System
Figure 6.4   Preprocessing Steps
Figure 6.5   Implementation of Noun/Verb Morph Analyzer
Figure 6.6   Structure of Pronoun Word form
Figure 6.7   Implementation of Pronoun Morph Analyzer
Figure 6.8   Implementation of Proper Noun Morph Analyzer
Figure 6.9   Training Data vs Accuracy
Figure 7.1   The Noisy Channel Model to Machine Translation
Figure 7.2   Block Diagram for Factored Translation
Figure 7.3   Mapping English Factors to Tamil Factors
Figure 8.1   Tamil Sentence Generation
Figure 8.2   Algorithm for Morphological Generator
Figure 8.3   Architecture of Tamil Morphological Generator
Figure 8.4   Pseudo Code for Paradigm Classification
Figure 8.5   Structure of Pronoun Word form
Figure 8.6   Pronoun Morphological Generator
Figure 9.1   BLEU-1 Score for Various Models
Figure 9.2   BLEU-4 Score for Various Models
Figure 9.3   NIST Score for Various Models
Figure 9.4   Google Translation System

LIST OF TABLES

Table 1.1   Factored English Sentences
Table 1.2   Compounded English Sentences
Table 3.1   Tamil Grammar
Table 3.2   Tamil Vowels
Table 3.3   Tamil Compound Letters
Table 3.4   Ambiguity in Morpheme's Position
Table 3.5   An Example to Illustrate the Direct Approach
Table 3.6   An Example for Interlingua Representation
Table 3.7   An Example for Transfer Approach
Table 3.8   Example of English and Tamil Sentences
Table 3.9   Scales of Evaluation
Table 4.1   POS and Lemma of Words
Table 4.2   Reordering Rules
Table 4.3   Original and Reordered Sentences
Table 4.4   Description of Factors in English Word
Table 4.5   Example of English Word Factors
Table 4.6   Factored Representation of English Language Sentence
Table 4.7   Word forms of English
Table 4.8   Content Words of English
Table 4.9   Function Words of English
Table 4.10  English Word Forms based on Tenses
Table 4.11  Tamil Word Forms based on Tenses
Table 4.12  Compounding Rules for English Sentence
Table 4.13  Average Words per Sentence
Table 4.14  Factored English Sentence
Table 4.15  Compounded English Sentence
Table 4.16  Preprocessed English Sentences
Table 5.1   AMRITA POS Tagset
Table 5.2   Tag Count
Table 5.3   Corpus Statistics
Table 5.4   Example of Suitable POS Features for Model 0
Table 5.5   Example of Suitable POS Features for Model 1
Table 5.6   Example of Suitable POS Features for Model 2
Table 5.7   Comparison of Accuracies
Table 5.8   Trials and Error
Table 5.9   Confusion Matrix
Table 6.1   Compound Word-forms Formation
Table 6.2   Simple Verb Finite Forms
Table 6.3   Noun Case Markers
Table 6.4   Minimized POS Tagset
Table 6.5   Number of Paradigms and Inflections
Table 6.6   Noun Paradigms
Table 6.7   Verb Paradigms
Table 6.8   Noun Word Forms
Table 6.9   Verb Word Forms
Table 6.10  Noun Morphemes
Table 6.11  Verb Morphemes
Table 6.12  Verb/Noun Ambiguous Morphemes
Table 6.13  Sample Data Format
Table 6.14  Example of Proper Noun Inflections
Table 6.15  Tagged vs Untagged Accuracies
Table 6.16  Number of Words and Characters and Level of Efficiencies
Table 6.17  Sentence Level Accuracies
Table 6.18  Preprocessed English and Tamil Sentence
Table 7.1   Factored Parallel Sentences
Table 8.1   Morpho-phonemic Changes
Table 8.2   Simplified POS Tagset
Table 8.3   Verb and Noun Word Forms
Table 8.4   MLI for Tamil Verb
Table 8.5   Look-up Table for Paradigm Classification
Table 8.6   Paradigms and Inflections
Table 8.7   Suffix Table
Table 8.8   Stemming End Characters
Table 9.1   Details of Baseline Parallel Corpora
Table 9.2   Details of Factored Parallel Corpora
Table 9.3   BLEU and NIST Scores
Table 10.1  Mapping of Major Research Outcome to Publications

LIST OF ABBREVIATIONS

ABBREVIATION   FULL FORM

1PL       First Person Plural
1S        First Person Singular
2PE       Second Person Plural Epicene
2S        Second Person Singular
2SE       Second Person Singular Epicene
3PE       Third Person Plural Epicene
3PN       Third Person Plural Neutral
3SE       Third Person Singular Epicene
3SF       Third Person Singular Feminine
3SM       Third Person Singular Masculine
3SN       Third Person Singular Neutral
ACC       Accusative
AI        Artificial Intelligence
AU-KBC    Anna University K. B. Chandrasekhar Research Centre
BL        Baseline
BLEU      Bilingual Evaluation Understudy
CALTS     Centre for Applied Linguistics and Translation Studies
CIIL      Central Institute of Indian Languages
CLIR      Cross-Lingual Information Retrieval
CRF       Conditional Random Fields
CWF       Compressed Word Format
EBMT      Example based Machine Translation
EM        Expectation Maximization
EOS       End of Sentence
FSA       Finite State Automata
FSM       Finite State Machine
FSMT      Factored Statistical Machine Translation
FST       Finite State Transducer
HMM       Hidden Markov Model
IBM       International Business Machines
IE        Information Extraction
IIIT      International Institute of Information Technology
IR        Information Retrieval
KWIC      Key Word In Context
LDC       Linguistic Data Consortium
LSV       Letter Successor Varieties
ManTra    MAchiNe assisted TRAnslation
MBMA      Memory Based Morphological Analysis
MEMM      Maximum Entropy Markov Models
MG        Morphological Generator
MIRA      Margin Infused Relaxed Algorithm
ML        Machine Learning
MLI       Morpho-Lexical Information
MT        Machine Translation
NIST      National Institute of Standards and Technology
NLI       Natural Language Interface
NLP       Natural Language Processing
NLU       Natural Language Understanding
PBSMT     Phrase Based Statistical Machine Translation
PCFG      Probabilistic Context Free Grammar
PER       Position-independent Word Error Rate
PLIL      Pseudo Lingual for Indian Languages
PN        Proper Noun
PNG       Person-Number-Gender
POS       Part-of-Speech
POST      Part-of-Speech Tagging
QA        Question Answering
RBMT      Rule Based Machine Translation
RCILTS    Resource Centre for Indian Language Technology Solutions
SMR       Statistical Machine Reordering
SMT       Statistical Machine Translation
SOV       Subject-Object-Verb
SRILM     SRI Language Modeling Toolkit
SVM       Support Vector Machine
SVO       Subject-Verb-Object
TBL       Transformation Based Learning
TDIL      Technology Development for Indian Languages
TER       Translation Edit Rate
TnT       Trigrams'n'Tags
UCSG      Universal Clause Structure Grammar
UN        United Nations
VG        Verb Group
WER       Word Error Rate
WFR       Word Formation Rules
WSJ       Wall Street Journal
WWW       World Wide Web

ABSTRACT

Machine translation is the automatic translation of text in one natural language to another using a computer. In this thesis, a morphology based Factored Statistical Machine Translation (F-SMT) system is proposed for translating sentences from English to Tamil. Tamil linguistic tools, namely a Part-of-Speech Tagger, a Morphological Analyzer and a Morphological Generator, are also developed as part of this research work.

Conventionally, rule-based approaches are employed for developing machine translation. They use transfer rules between the source language and the target language to produce grammatical translations. The major drawback of this approach is that it always requires a good linguist for rule improvement. Consequently, data-driven approaches such as example-based and statistics-based systems have recently been getting more attention from the research community. Currently, Statistical Machine Translation (SMT) systems play a major role in developing translation between languages. The main advantages of Statistical Machine Translation are that it is language independent and that it disambiguates word senses automatically through the use of large quantities of parallel corpora. An SMT system treats the translation problem as a machine learning problem: statistical learning methods perform translation based on large amounts of parallel training data. First, non-structural information and statistical parameters are derived from the bilingual corpora; these statistical parameters are then used for translation.

A baseline Statistical Machine Translation system considers only surface forms and does not use linguistic knowledge of the languages; therefore its performance is better for similar language pairs than for dissimilar ones. Translating English into morphologically rich languages is a challenging task. Because of the highly rich morphological nature of the Tamil language, simple lexical mapping alone does not suffice for retrieving and mapping all the morphological and syntactic information from English sentences. Tamil word formation is highly productive: morphemes are concatenated into a single word form without spaces, so each inflected form of a Tamil word is a separate surface word. This leads to the problem of sparse data. It is very difficult to collect or create a parallel corpus which contains all possible Tamil surface words, because a single Tamil root verb is inflected into more than ten thousand different forms. Moreover, selecting the correct Tamil word or phrase during translation is a challenging job. The corpus size and quality decide the accuracy of a machine translation system. The limited availability of parallel corpora for the English-Tamil language pair and the high inflectional variation increase the data sparseness problem for a baseline phrase-based SMT system. While translating from English to Tamil, the SMT baseline system cannot generate Tamil word forms that are not present in the training corpora.

The proposed machine translation system is based on factored Statistical Machine Translation models. Words are factored into lemmas and inflectional features based on their part of speech; this factorization reduces data sparseness in decoding. Factored translation models allow the integration of linguistic information into a phrase-based translation model, and these linguistic features are treated as separate tokens during the factored training process. The baseline SMT system uses untagged corpora for training, whereas factored SMT uses linguistically factored corpora. The pre-processing phase indirectly incorporates language-specific knowledge into the parallel corpus: the bilingual corpora are converted into factored bilingual corpora using linguistic tools and reordering rules. Similarly, Tamil sentences are pre-processed using the proposed linguistic tools, the POS tagger and the Morphological Analyzer. These factored corpora are then given to the Statistical Machine Translation models for training. Finally, the Tamil Morphological Generator is used to generate a surface word from the output factors.

CHAPTER 1
INTRODUCTION

1.1 GENERAL

Machine Translation is the automatic translation of text in one natural language to another using a computer. Initial attempts at Machine Translation, made in the 1950s, did not meet with success. Today, internet users need fast automatic translation between languages. Several approaches, such as linguistic based and Interlingua based systems, have been used to develop machine translation systems, but currently statistical methods dominate the machine translation field. The Statistical Machine Translation (SMT) approach draws knowledge from automata theory, artificial intelligence, data structures and statistics.

An SMT system treats translation as a machine learning problem. This means that a learning algorithm is applied to a large amount of parallel corpora. Parallel corpora are sentences in one language along with their translations. Learning algorithms create a model from parallel sentences, and using this model, unseen sentences are translated. If parallel corpora are available for a language pair, then it is easy to build a bilingual SMT system. The accuracy of the system is highly dependent on the quality and quantity of the parallel corpus and on the domain. These parallel corpora are constantly growing; they are the fundamental resource for an SMT system, and are available from governments' bilingual text books, newspapers, websites and novels. SMT models give good accuracy, particularly for similar language pairs, in specific domains, or for languages that have a large availability of bilingual corpora. If the sentences of a language pair are not structurally similar, then the translation patterns are difficult to learn, and huge amounts of parallel corpora are required to learn them; therefore statistical methods are difficult to apply to "less resourced" languages. To enhance the translation performance for dissimilar language pairs and less resourced languages, external preprocessing is required. This preprocessing is performed using linguistic tools.

In an SMT system, statistical methods are used for mapping source language phrases into target language phrases. The statistical model parameters are estimated from bilingual and monolingual corpora. There are two models in an SMT system: the translation model and the language model. The translation model takes parallel sentences and finds the translation hypotheses between phrases. The language model is based on the statistical properties of n-grams and uses monolingual corpora. Several translation models are available for SMT; some important models are the phrase based model, the syntax based model and the factored model. Phrase Based Statistical Machine Translation (PBSMT) is limited to the mapping of small text chunks. The factored translation model is an extension of phrase based models that integrates linguistic information at the word level.

This thesis proposes a pre-processing method that uses linguistic tools for the development of an English to Tamil machine translation system. In this translation system, external linguistic tools are used to augment the parallel corpora with linguistic information. The pre- and post-processing methodology proposed in this thesis is applicable to other language pairs too.
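As background, the interplay of the translation model and the language model introduced above can be summarized by the standard noisy channel formulation of SMT (a textbook result, elaborated in Chapter 7, not a contribution of this thesis). Writing e for the English input sentence and t for a candidate Tamil translation, the decoder seeks

    \hat{t} = \operatorname*{arg\,max}_{t} P(t \mid e)
            = \operatorname*{arg\,max}_{t} \frac{P(e \mid t)\, P(t)}{P(e)}
            = \operatorname*{arg\,max}_{t} P(e \mid t)\, P(t)

where the denominator P(e) is dropped because it is constant for a given input. Here P(e | t) is the translation model estimated from the bilingual parallel corpus, and P(t) is the n-gram language model estimated from monolingual Tamil text, matching the division of data sources described above.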

1.2 OVERVIEW OF MACHINE TRANSLATION

Machine translation is one of the oldest and most active areas in natural language processing. The word 'translation' refers to the transformation of text or speech from one language into another. Machine translation can be defined as the application of computers to the task of translating texts from one natural language to another. It is a focussed field of research drawing on the linguistic concepts of syntax, semantics, pragmatics and discourse. Today a number of systems are available for producing translations, though they are not perfect. In the process of translation, whether carried out manually or automated through machines, the context of the text in the source language, when translated, must convey the exact context in the target language. Translation is not just word level replacement. A translator, whether machine or human, must interpret and analyse all the elements in the text, should be familiar with all the issues that arise during the translation process, and must know how to handle them. This requires in-depth knowledge of grammar, sentence structure and meaning, as well as an understanding of each language's culture in order to handle idioms and phrases originating from different cultures. This cross-cultural understanding is an important issue that determines the accuracy of the translation.

It is a great challenge to design an automatic machine translation system, since it is difficult to translate sentences while taking into consideration all the required information. Humans need several revisions to produce a perfect translation, and no two human translators generate identical translations of the same text in the same language pair. Hence it is an even greater challenge to design a fully automated machine translation system that produces high quality translations.

1.3 ROLE OF MACHINE TRANSLATION IN NLP

Natural Language Processing (NLP) is the field of computer science devoted to the development of models and technologies that enable computers to use human languages both as input and as output [1]. The ultimate goal of NLP is to build computational models that equal human performance in the tasks of reading, writing, learning, speaking and understanding. Computational models are useful for exploring the nature of linguistic communication as well as for enabling effective human-machine interaction. Jurafsky and Martin (2005) [2] describe Natural Language Processing as "computational techniques that process spoken and written human language as language". According to Microsoft researchers, the goal of Natural Language Processing is "to design and build software that will analyze, understand and generate languages that humans use naturally, so that eventually one will be able to address their computer like addressing another person".

Machine translation is used for translating texts for assimilation purposes, which aids bilingual or cross-lingual communication, and also for searching, accessing and understanding foreign language information from databases and web pages [3]. In the field of information retrieval, much research is going on in Cross-Language Information Retrieval (CLIR), i.e. information retrieval systems capable of searching databases in many different languages [4]. The construction of robust systems for speech-to-speech translation to facilitate "cross-lingual" oral communication has been the dream of speech and natural language researchers for decades, and machine translation is an important module in speech translation systems. Currently, computer assisted learning plays a major role in the academic environment. The use of machine translation in language learning has not yet received enough attention because of the poor quality of automatic translation output. Using a good automatic translation system, students can improve their translation and writing skills; such a system can break the language barriers of students and language learners.

1.4 FEATURES OF STATISTICAL MACHINE TRANSLATION SYSTEM

Traditionally, rule-based approaches are used to develop machine translation systems. A rule-based approach feeds hand-written rules into the machine using appropriate representations, but encoding all linguistic knowledge in this way is very hard. In this context, the statistical approach to Machine Translation has some attractive qualities that have made it the preferred approach in machine translation research over the past two decades. Statistical translation models learn translation patterns directly from data and generalize them to translate new text. The SMT approach is largely language-independent, i.e. the models can be applied to any language pair. In SMT, implementation and development times are much shorter than for traditional rule-based systems, and an SMT system can be improved by coupling in new models for reordering and decoding. It only needs parallel corpora to learn a translation system. In contrast, a rule-based system needs transfer rules which only linguistic experts can write. These rules are entirely dependent on the language pair involved, and defining general “transfer rules” is not an easy task, especially for languages with different structures [5].

An SMT system can be developed rapidly if an appropriate corpus is available, whereas a Rule Based Machine Translation (RBMT) system incurs substantial development and customization costs before it reaches the desired quality threshold. Packaged RBMT systems have already been developed, and it is extremely difficult to reprogram their models and equivalences. Above all, RBMT involves a much longer process and more human resources: an RBMT system is retrained by adding new rules and vocabulary, among other things [5]. Statistical Machine Translation works well for translations in a specific domain when the engine is trained with a bilingual corpus from that domain. An SMT system does require more computing resources in terms of hardware to train the models; billions of calculations take place during the training of the engine, and the computing knowledge required for it is highly specialized. However, training time can be reduced nowadays thanks to the wider availability of more powerful computers. RBMT requires a longer deployment and compilation time by experts, so that, in principle, its building costs are also higher. SMT learns statistical patterns automatically, including exceptions to rules. The rules governing the transfer in RBMT systems can certainly be seen as special cases of statistical patterns; nevertheless, they generalize too much and cannot handle exceptions. Finally, SMT systems can be upgraded with syntactic and even semantic information, like RBMT systems. An SMT engine can generate improved translations when retrained or adapted, whereas an RBMT system generates very similar translations after retraining [5].

SMT systems, in general, have trouble handling morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology can have severe consequences for the meaning of a sentence: they change the grammatical function of words, or change the interpretation of the sentence through a wrong verb tense. Factored translation models try to solve this issue by explicitly handling morphology on the generation side. Another advantage of a Statistical Machine Translation system is that it generates a more natural, or closer to literal, translation of the input sentence.

Symbolic approaches to machine translation demand great human effort in language engineering. In knowledge-based machine translation, for example, designers must first find out what kinds of linguistic, general common-sense and domain-specific knowledge are important for a task. Then they have to design an Interlingua representation for the knowledge and write grammars to parse input sentences; output sentences are generated from the Interlingua representation. All of this requires expertise in language technologies and is tedious and laborious work. The major advantage of a Statistical Machine Translation system is its learnability: once a model is set up, it can learn automatically with well-studied algorithms for parameter estimation, so a parallel corpus replaces human expertise for the task. Coverage of the grammar is also one of the serious problems in rule-based systems; an SMT system can learn to have good coverage as long as the training data is representative enough. It can also statistically model the noise in spoken language, so it does not have to make a binary keep/abandon decision and is therefore more robust to noisy data [5].

1.5 MOTIVATION OF THE THESIS

Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. Even though machine translation was envisioned as a computer application in the 1950s, it is still considered an open problem [3]. The demand for machine translation is growing rapidly. As multilingualism is considered to be a part of democracy, the European Union funds EuroMatrixPlus [6], a project to build machine translation systems for all European language pairs, in order to translate documents automatically into its 23 official languages, which were previously translated manually. Similarly, as the United Nations (UN) translates a large number of documents into several languages, it has created bilingual corpora for some language pairs, such as Chinese-English and Arabic-English, which are among the largest bilingual corpora distributed through the Linguistic Data Consortium (LDC). On the World Wide Web, around 20% of web pages and other resources are available in national languages; Machine Translation can be used to translate these web pages and resources into the required language in order to make their content understandable, thereby reducing the effect of language as a barrier to communication [7].

In a linguistically diverse country like India, machine translation is a very essential technology. Human translation has been widely prevalent in India since ancient times, as is evident from the various works of philosophy, arts, mythology, religion and science which have been translated among ancient and modern Indian languages. Numerous classic works of art, ancient, medieval and modern, have also been translated between European and Indian languages since the 18th century. At present, human translation in India finds application mainly in administration, media and education, and to a lesser extent in business, the arts, and science and technology [8]. India has 18 constitutional languages, which are written in 10 different scripts. Hindi is the official language of India; English is the language most widely used in the media, commerce, science and technology, and education. Many of the states have their own regional language, which is either Hindi or one of the other constitutional languages.

In such a situation, there is a big market for translation between English and the various Indian languages. Currently, this translation is done manually, and the use of automation is largely restricted to word processing. Two specific examples of high-volume manual translation are the translation of news from English into local languages, and the translation of annual reports of government departments and public sector units among English, Hindi and the local language. Many resources in English, such as news, weather reports and books, are being manually translated into Indian languages; news and weather reports from around the world, in particular, are frequently translated from English into Indian languages by human translators. Human translation, however, is slower and more expensive than machine translation, so there is a large market for machine translation from English into Indian languages.

Tamil, a Dravidian language, is spoken by around 72 million people and has official status in the state of Tamilnadu and the Indian union territory of Puducherry. Tamil is also an official language of Sri Lanka and Singapore, and is spoken by significant minorities in Malaysia and Mauritius as well as by emigrant communities around the world. It is one of the 22 scheduled languages of India and was declared a classical language by the Government of India in 2004 [9]. In this thesis a methodology for English to Tamil Statistical Machine Translation is proposed, along with a pre-processing technique. This pre-processing method is used to handle the morphological variance between English and Tamil. Linguistic tools are developed to generate linguistically motivated data for the factored translation model for English-Tamil.

1.6 OBJECTIVE OF THE THESIS

The main aim of this research is to develop a morphology-based prototype Statistical Machine Translation system for English to Tamil by integrating different linguistic tools. This research also addresses the issue of how a morphologically correct sentence can be generated when translating from a morphologically simple language into a morphologically rich one. The objectives of the research are detailed as follows:

• Develop a pre-processing module (reordering, factorization and compounding) for English language sentences, to transform their structure into one more similar to that of Tamil. The pre-processing module for the source language includes three stages: reordering, factorization and compounding. In the reordering stage, the source language sentence is syntactically reordered according to Tamil syntax. After reordering, the English words are factored into lemma and other morphological features. This is followed by the compounding process, in which the various function words are removed from the reordered sentence and attached as a morphological factor to the corresponding content word.

• Develop a Tamil Part-of-Speech (POS) tagger to label the Tamil words in a sentence. The Tamil POS tagger is to be developed using a Support Vector Machine (SVM) based machine learning tool, and a POS-annotated corpus will be created for training the automatic tagger.

• Develop a Morphological Analyser to segment Tamil surface words into linguistic factors. The morphological analyser is to be developed using a machine learning approach. The POS tagger and morphological analyser are to be used for pre-processing the Tamil language sentences, and the linguistic information from these tools is to be incorporated into the surface words before SMT training.

• Build a morphology-based prototype Factored Statistical Machine Translation (F-SMT) system for English to Tamil. After pre-processing, the bilingual sentences are to be created and transformed into factored bilingual sentences. Monolingual corpora for Tamil are collected and factored using the Tamil POS tagger and morphological analyser. These sentences will be used for training the factored Statistical Machine Translation model.

• Develop a Tamil Morphological Generator to generate Tamil surface word forms. The morphological generator transforms the translation output into a grammatically correct target language sentence; it is used in the post-processing module of the English to Tamil machine translation system.

1.7 RESEARCH METHODOLOGY

1.7.1 Overall System Architecture

Tamil is a morphologically rich language with a relatively free word order following the Subject-Object-Verb (SOV) pattern. English is morphologically simple, with a fixed Subject-Verb-Object (SVO) word order. A baseline SMT system would not perform well for languages with different word orders and disparate morphological structures. To resolve this, factored models are introduced into the SMT system. The factored model, a subtype of SMT, allows multiple levels of representation of a word, from the most specific level to more general levels of analysis such as lemma, part-of-speech and morphological features [10].

Figure 1.1 shows the overall architecture of the proposed English to Tamil SMT system. The preprocessing module is externally attached to the factored SMT system. This module converts bilingual corpora into factored bilingual corpora using morphology-based linguistic tools and reordering rules. After preprocessing, the representation of the source language sentence closely follows the sentence structure of the target language. This transformation decreases the complexity of alignment, which is one of the key problems in a baseline SMT system. Parallel corpora are used to train the statistical translation models; they are created and converted into factored parallel corpora during preprocessing. English sentences are factored using the Stanford Parser, and Tamil sentences are factored using the Tamil POS tagger and morphological analyzer. A monolingual corpus is collected from various newspapers and factored using the Tamil linguistic tools; this monolingual corpus is used for the language model. Finally, in post-processing, the Tamil morphological generator is used for generating a surface word from the output factors.

Figure 1.1 Morphology based Factored SMT for English to Tamil language
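
To summarize the architecture in code form, the following sketch traces one sentence through the pipeline of Figure 1.1. It is a minimal illustration only: every function name is a hypothetical placeholder for the modules described in the following sections (Stanford Parser based pre-processing, Moses decoding, and the suffix-based morphological generator).

```python
# Minimal sketch of the Figure 1.1 pipeline. All functions are hypothetical
# placeholders for the modules described in Sections 1.7.2-1.7.5; this shows
# the data flow, not the actual implementation.

def translate_english_to_tamil(sentence):
    # Pre-processing of the English (source) sentence
    reordered = reorder_to_sov(sentence)      # SVO -> SOV word order
    factored = factorize(reordered)           # word|lemma|POS|morph factors
    compounded = compound(factored)           # fold function words into factors

    # Factored SMT decoding (Moses with GIZA++/SRILM models)
    target_factors = decode(compounded)       # Tamil lemma + morph factors

    # Post-processing: generate Tamil surface forms from the output factors
    words = [generate_surface(lemma, word_class, morph)
             for (lemma, word_class, morph) in target_factors]
    return " ".join(words)
```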

1.7.2 Details of Pre-processing English Language Sentence

A Machine Translation system for a language pair with disparate morphological structure needs appropriate pre-processing or modeling before translation. The pre-processing can be performed on the raw source language sentence to make it more appropriate for translating into the target language sentence. The pre-processing module for English language sentences consists of reordering, factorization and compounding.

1.7.2.1 Reordering English Language Sentence

Reordering means rearranging the word order of the source language sentence into a word order that is closer to that of the target language sentence. It is an important process for languages which differ in their syntactic structure, and the English-Tamil language pair has disparate syntactic structures: English word order is Subject-Verb-Object (SVO) whereas Tamil word order is Subject-Object-Verb (SOV). For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object [11]. English syntactic relations are retrieved from the Stanford Parser tool, and the source language sentence is reordered based on reordering rules.
The reordering rules are handcrafted using the syntactic word-order differences between English and Tamil; 180 rules were created based on the sentence structures of the two languages. Reordering significantly improves the performance of the Machine Translation system. A lexicalized distortion reordering model is implemented in the Moses toolkit [180], but this automatic reordering is effective only over short ranges, so an external tool or component is needed to deal with long-distance reordering. This reordering is also one way of indirectly integrating syntactic information into the source language. 80% of the English sentences are reordered correctly according to the rules developed. An example of English reordering is given in Figure 1.2.

English Sentence:  I bought vegetables to my home.
Reordered English: I my home to vegetables bought.
Tamil Sentence:    நான் என்னுடைய வீட்டிற்கு காய்கறிகள் வாங்கினேன்.

Figure 1.2 Reordering of English language
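
As an illustration of how such handcrafted rules can operate, the sketch below performs a naive SVO-to-SOV rearrangement over POS-chunked input and reproduces the reordering of Figure 1.2. The chunk labels and the single clause-level rule are simplifying assumptions; the actual system applies its 180 rules over Stanford Parser output.

```python
# A naive illustration of English -> Tamil-order reordering over chunked
# input. This toy clause rule (subject first, postpositional phrases, then
# objects, verb last) stands in for the 180 handcrafted rules.

def reorder(chunks):
    """chunks: list of (text, label) pairs with labels NP, V, IN."""
    subj, verb, objs, pps = None, None, [], []
    i = 0
    while i < len(chunks):
        text, label = chunks[i]
        if label == "IN" and i + 1 < len(chunks) and chunks[i + 1][1] == "NP":
            pps.append(chunks[i + 1][0] + " " + text)  # preposition -> postposition
            i += 2
        elif label == "V" and verb is None:
            verb = text
            i += 1
        elif label == "NP" and subj is None:
            subj = text                                # first NP = subject
            i += 1
        else:
            objs.append(text)                          # remaining NPs = objects
            i += 1
    return " ".join([subj] + pps + objs + [verb])      # verb moved to the end

chunks = [("I", "NP"), ("bought", "V"), ("vegetables", "NP"),
          ("to", "IN"), ("my home", "NP")]
print(reorder(chunks))   # -> "I my home to vegetables bought"
```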

1.7.2.2 Factorization of English Language Sentence

Factored models can be used for morphologically rich languages in order to reduce the amount of bilingual data required. Factorization refers to splitting a word into linguistic factors and integrating them as a vector. The Stanford Parser is used to parse the English sentences; from the parse tree, linguistic information such as the lemma, part-of-speech tag, syntactic information and dependency information is retrieved. This linguistic information is integrated as factors on the original word.

1.7.2.3 Compounding for English Language Sentence

Compounding is defined as adding additional morphological information to the morphological factor of the source (English) language words [188]. The additional morphological information includes function words, subject information, dependency relations, auxiliary verbs and modal verbs, and is based on the morphological structure of the Tamil sentence. In the compounding phase, the function words are identified in the English factored corpora using dependency information; they are then removed from the factored sentence and attached as a morphological factor to the corresponding content word. The compounding process reduces the length of the English sentence. Like function words, auxiliary verbs and modal verbs are also removed and attached as a morphological factor of the source language word. After this step, the morphological representation of the English sentence is similar to that of the Tamil sentence; compounding thus indirectly integrates dependency information into the source language factors. Table 1.1 and Table 1.2 show a factored and a compounded sentence respectively.

Table 1.1 Factored English Sentences

I | i | PN | prn
my | my | PN | PRP$
home | home | N | NN
to | to | TO | TO
vegetables | vegetable | N | NNS
bought | buy | V | VBD
.

Table 1.2 Compounded English Sentences

I | i | PN | prn_i
my | my | PN | PRP$
home | home | N | NN_to
vegetables | vegetable | N | NNS
bought | buy | V | VBD_1S
.
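
The transition from Table 1.1 to Table 1.2 can be illustrated with a small routine that folds function words into the morphological factor of the preceding content word. The factor layout and the single folding rule below are simplified assumptions; in the real system, dependency relations from the Stanford Parser determine the content word, and subject agreement (the prn_i and VBD_1S factors) is also folded in.

```python
# Illustrative compounding step on the factored sentence of Table 1.1.
# Only the folding of function words (here the postposition 'to') is shown;
# folding of subject agreement into prn_i / VBD_1S is omitted for brevity.

FUNCTION_TAGS = {"TO", "IN"}   # tags treated as function words (assumption)

def compound(factored):
    """factored: list of [word, lemma, pos, tag] factor vectors."""
    out = []
    for word, lemma, pos, tag in factored:
        if tag in FUNCTION_TAGS and out:
            out[-1][3] += "_" + lemma     # e.g. NN -> NN_to on 'home'
        else:
            out.append([word, lemma, pos, tag])
    return out

sentence = [["I", "i", "PN", "prn"], ["my", "my", "PN", "PRP$"],
            ["home", "home", "N", "NN"], ["to", "to", "TO", "TO"],
            ["vegetables", "vegetable", "N", "NNS"],
            ["bought", "buy", "V", "VBD"]]
for vec in compound(sentence):
    print(" | ".join(vec))    # 'home' becomes home | home | N | NN_to
```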

1.7.3 Details of Pre-processing for Tamil Language Sentence

Like the English sentences, the Tamil sentences are also pre-processed, using linguistic tools such as a Part-of-Speech (POS) tagger and a morphological analyzer. Tamil surface words are segmented into linguistic information, and this information is integrated as factors in the SMT training corpora. A Tamil sentence is first given to the Part-of-Speech tagger; from this part-of-speech information, a simplified POS tag is identified. Based on this simplified tag, the word is given to the Tamil morphological analyzer, which splits the word into its lemma and morphological information. Both the parallel corpora and the monolingual corpora are preprocessed in this stage.

1.7.3.1 Tamil Part-of-Speech Tagger

POS tagging means labeling grammatical classes, i.e. assigning a part-of-speech tag to each and every word of a given sentence. Tamil sentences are POS tagged using the Tamil POS tagger tool. This tagger was developed using a Support Vector Machine (SVM) based machine learning tool, SVMTool [12], which makes the task simple and efficient. In this method, a POS-tagged corpus is created and used to generate a trained model: the SVMTool builds models from the tagged sentences, and untagged sentences are then tagged using those models. 42k sentences (approximately 5 lakh words) were tagged for this Part-of-Speech tagger with the help of an eminent Tamil linguist, and the experiments were conducted with this tagged corpus. An overall accuracy of 94.6% is obtained on the test set, which contains 6k sentences (approximately 35 thousand words).

1.7.3.2 Tamil Morphological Analyzer

After POS tagging, the sentences in the corpora are morphologically analyzed to find the lemma and morphological information. A morphological analyzer is a software tool used to segment a word into meaningful units. Morphological analysis of Tamil is a complex process because of the language's morphologically rich nature. Generally, rule-based approaches are used to develop morphological analyzer systems, but for a morphologically rich language like Tamil the creation of rules is a challenging task. Here a novel machine learning based approach is proposed and implemented for a Tamil verb and noun morphological analyzer; the approach has additionally been tested on languages such as Malayalam, Telugu and Kannada. The approach is based on sequence labeling and training by kernel methods, and it captures the non-linear relationships and various morphological features of natural language words in a better and simpler way. In this machine learning approach, two training models are created for the morphological analyzer. The first model is trained using the sequence of input characters and their corresponding output labels; this trained Model-I is used for finding the morpheme boundaries. The second model is trained using the sequence of morphemes and their grammatical categories; this trained Model-II is used for assigning grammatical classes to each morpheme. An SVM-based tool was used for training the data. The resulting tool segments each word into its lemma and morphological information.
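
To make the two-model formulation concrete, the sketch below casts Model-I (morpheme boundary detection) as character-level sequence labeling. The context-window features, the romanized toy training pairs, and the use of scikit-learn's LinearSVC in place of the SVM tool actually used are all illustrative assumptions.

```python
# Sketch of Model-I: morpheme boundary detection as character-level sequence
# labeling ('B' = begins a morpheme, 'I' = inside one). scikit-learn's
# LinearSVC stands in for the SVM tool used in the thesis; the training pairs
# are romanized toy examples (e.g. "vanthaan" = va + nth + aan, 'he came').

from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

def char_features(word, i, window=2):
    feats = {"char": word[i], "pos_from_end": str(len(word) - i)}
    for k in range(1, window + 1):
        feats["prev%d" % k] = word[i - k] if i - k >= 0 else "<s>"
        feats["next%d" % k] = word[i + k] if i + k < len(word) else "</s>"
    return feats

train = [("vanthaan", "va+nth+aan"), ("vanthaal", "va+nth+aal")]

X, y = [], []
for word, seg in train:
    labels = []
    for morph in seg.split("+"):
        labels += ["B"] + ["I"] * (len(morph) - 1)   # boundary labels per char
    for i in range(len(word)):
        X.append(char_features(word, i))
        y.append(labels[i])

vec = DictVectorizer()
clf = LinearSVC().fit(vec.fit_transform(X), y)

test = "vanthaan"
feats = [char_features(test, i) for i in range(len(test))]
print(list(zip(test, clf.predict(vec.transform(feats)))))  # 'B' marks boundaries
```

Model-II would be trained analogously over the resulting morpheme sequences to assign each morpheme its grammatical category.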

1.7.4 Factored SMT System for English to Tamil Language

Factored translation is an extension of Phrase-based Statistical Machine Translation (PBSMT) that allows the integration of additional morphological and lexical information, such as lemma, word class, gender, number, etc., at the word level on both the source and the target language. In the SMT system, three different toolkits are used for translation modeling, language modeling and decoding: GIZA++, SRILM and Moses. GIZA++ is a Statistical Machine Translation toolkit used to train IBM models 1-5 and an HMM word alignment model; it is an extension of GIZA, which was designed as part of the EGYPT SMT toolkit. SRILM is a toolkit for language modeling that can be used in speech recognition, statistical tagging and segmentation, and Statistical Machine Translation. Moses is an open-source SMT toolkit that allows translation models to be trained automatically for any language pair; all that is needed is a collection of translated texts (a parallel corpus). An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices. Figure 1.3 illustrates the mapping of English factors to Tamil factors in the factored SMT system.

Figure 1.3 Mapping English Word Factors to Tamil Word Factors

Morphological, syntactic and semantic information can be integrated as factors in the factored translation model during training. Initially, the English factors “Lemma” and “Minimized-POS” are aligned to the Tamil factors “Lemma” and “M-POS”; then the “Minimized-POS” and “Compound-Tag” factors of the English word are aligned to the “Morphological information” factor of the Tamil word. The important point here is that new Tamil surface words are not generated in the SMT decoder: only factors are generated by the SMT system, and the surface word is generated in the post-processing stage, where the Tamil morphological generator produces a Tamil surface word from the output factors. The system is evaluated on different sentence patterns; for simple, continuous and modal-auxiliary sentences, 85% of the sentences are translated correctly, while for other sentence types the performance is 60%. The prototype machine translation system properly handles noun-verb agreement, which is an essential requirement for translating into morphologically rich languages like Tamil. BLEU and NIST evaluation scores clearly show that the factored model, with its integration of linguistic knowledge, gives better results for English to Tamil Statistical Machine Translation.
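
For reference, the BLEU score mentioned above can be reproduced with standard tooling; the snippet below uses NLTK's corpus_bleu as a stand-in for the official evaluation scripts used in the thesis, with made-up romanized example sentences.

```python
# Computing BLEU for system output against reference translations. NLTK's
# corpus_bleu is used here as a stand-in for the official mteval/NIST
# scripts; the tokenized sentences are made-up romanized examples.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["naan", "en", "veettirku", "kaaykarikal", "vaanginen"]]]
hypotheses = [["naan", "en", "veettirku", "kaaykarikal", "vaanginen"]]

smooth = SmoothingFunction().method1   # avoid zero n-gram counts on short text
print(corpus_bleu(references, hypotheses, smoothing_function=smooth))  # 1.0 here
```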

1.7.5 Post-processing for English to Tamil SMT

Post-processing is employed to generate a Tamil surface word from the output factors. In the factored SMT system, the aim is to translate factors only, not to generate a surface word; owing to the morphologically rich nature of Tamil, word generation is handled separately. The morphological generator is applied in the post-processing stage of the English to Tamil Machine Translation system, transforming the translated factors into a grammatically correct target language sentence.

1.7.5.1 Tamil Morphological Generator

The Tamil morphological generator receives factors in the form “lemma + word_class + morpho-lexical information”, where lemma specifies the lemma of the word form to be generated, word_class denotes the grammatical category, and the morpho-lexical information states the type of inflection. These factors are the output of the proposed Machine Translation system. A novel suffix-based approach is developed for the Tamil morphological generator. Tamil noun and verb paradigm classification is done based on case and tense markers respectively, and the number of paradigms for verbs and nouns is fixed: in Tamil, verbs are classified into 32 paradigms and nouns into 25 [13]. The noun and verb paradigms are used for creating a suffix table. The morphological generator is divided into three modules. The first module takes the lemma and word class as input and gives the lemma's paradigm number and the word's stem as output; this paradigm number is referred to as the column index. The paradigm number provides information about all the possible inflected words of a lemma in a particular word class. The second module takes the morpho-lexical information as input and gives its index number as output: from the complete morpho-lexical information list, the index number of the corresponding input morpho-lexical information factor is identified, and this is referred to as the row index. In the third module, a two-dimensional suffix table is used to generate the word using the row index and column index; the identified suffix is attached to the stem to create the word form. For pronouns, a pattern-matching approach is followed to generate the pronoun word form.
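
The three modules map naturally onto table lookups. The sketch below is a minimal illustration with invented paradigm numbers, row indices and romanized suffix strings; the actual tables cover all 32 verb and 25 noun paradigms and the full morpho-lexical inventory.

```python
# Minimal sketch of the suffix-table morphological generator. The paradigm
# number, row indices and suffixes below are invented romanized examples;
# the real system uses full tables for 32 verb and 25 noun paradigms.

PARADIGMS = {("vaangu", "verb"): (7, "vaangi")}    # module 1: lemma -> (column, stem)
MORPH_INDEX = {"past+1sg": 0, "past+3sg.masc": 1}  # module 2: morph info -> row
SUFFIX_TABLE = {                                   # module 3: (row, column) -> suffix
    (0, 7): "nen",     # past tense, 1st person singular
    (1, 7): "naan",    # past tense, 3rd person singular masculine
}

def generate(lemma, word_class, morph_info):
    column, stem = PARADIGMS[(lemma, word_class)]
    row = MORPH_INDEX[morph_info]
    return stem + SUFFIX_TABLE[(row, column)]

print(generate("vaangu", "verb", "past+1sg"))       # -> "vaanginen" ('I bought')
print(generate("vaangu", "verb", "past+3sg.masc"))  # -> "vaanginaan" ('he bought')
```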

1.8 RESEARCH CONTRIBUTIONS

This thesis shows how preprocessing and post-processing can be used to improve statistical machine translation from English to Tamil. The main focus of this research is on translation from English into Tamil, together with the development of linguistic tools for the Tamil language. The contributions are:

• Introduced a novel pre-processing method for English sentences based on reordering and compounding. Reordering rearranges the English sentence structure according to the Tamil sentence structure; compounding removes the function words and auxiliaries and merges them into the morphological factor of the content word. This pre-processing reorganizes the English sentence structure to match the structure of the Tamil sentence.

• Created a Tamil POS tagger and a tagged corpus of 5 lakh words, used as part of the pre-processing of Tamil sentences.

• Introduced a novel method for developing a Tamil morphological analyser based on a machine learning approach. The corpora developed for this approach contain 4 lakh morphologically segmented Tamil verbs and 2 lakh Tamil nouns.

• Introduced a novel algorithm for developing a Tamil morphological generator with the use of paradigms and suffixes. Using this generator, it is possible to generate 10 thousand distinct word forms of a single Tamil verb.

• Successfully integrated these pre-processing and post-processing modules and developed an English to Tamil factored SMT system.

 

1.9 ORGANIZATION OF THE THESIS

This thesis is divided into ten chapters. Figure 1.4 shows the organization of the thesis.

Chapter-1: INTRODUCTION
Chapter-2: LITERATURE SURVEY
Chapter-3: BACKGROUND
Chapter-4: PREPROCESSING ENGLISH LANGUAGE
Chapter-5: PREPROCESSING TAMIL LANGUAGE (POS TAGGER FOR TAMIL)
Chapter-6: PREPROCESSING TAMIL LANGUAGE (MORPH ANALYZER FOR TAMIL)
Chapter-7: FACTORED SMT
Chapter-8: MORPHOLOGICAL GENERATOR FOR TAMIL
Chapter-9: EXPERIMENTS AND RESULTS
Chapter-10: CONCLUSION

Figure 1.4 Thesis Organization

 

This thesis is organized as follows. A general introduction is presented in Chapter 1. Chapter 2 presents the literature survey on linguistic tools and the available Machine Translation systems for Indian languages. In Chapter 3, the theoretical background and language processing for Tamil are described. Chapter 4 covers the different stages of preprocessing English sentences: reordering, factorization and compounding. Chapters 5 and 6 present the preprocessing of Tamil sentences using linguistic tools: Chapter 5 explains the development of the Tamil POS tagger, and Chapter 6 describes the morphological analyzer for Tamil, which is developed using the new machine learning based approach; detailed descriptions of the method and data resources are also given. Chapter 7 presents the factored SMT system for English to Tamil and explains how the factored corpora are trained and decoded using the SMT toolkit. Post-processing for Tamil is discussed in Chapter 8, where the morphological generator is used as the post-processing tool; this chapter also gives a detailed description of the new algorithm developed for the Tamil morphological generator. Chapter 9 presents the experiments and results of the English to Tamil Statistical Machine Translation system, describing the training and testing details of the SMT toolkit; the output of the developed system is evaluated using the BLEU and NIST metrics. Finally, Chapter 10 concludes the thesis and outlines future directions for this research.

CHAPTER 2

LITERATURE SURVEY

This chapter presents the state of the art in the field of Tamil linguistic tools and Machine Translation systems. The Tamil linguistic tools include the POS tagger, the morphological analyzer and the morphological generator. The chapter reviews the literature on linguistic tools and Machine Translation systems for Indian languages, including Tamil.

2.1 PART OF SPEECH TAGGER

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech or other lexical class marker to each and every word in a sentence, comparable to tokenization for computer languages. POS tagging is an important step in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation. Different approaches have been used for POS tagging, the notable ones being rule-based, stochastic and transformation-based learning approaches. Rule-based taggers try to assign a tag to each word using a set of handwritten rules. These rules could specify, for instance, that a word following a determiner and an adjective must be a noun; such a rule set must be properly written and checked by human experts. The stochastic (probabilistic) approach uses a training corpus to pick the most probable tag for a word [14-17]. All the probabilistic methods cited above are based on first-order or second-order Markov models, although a few other techniques use a probabilistic approach for POS tagging, such as the Tree Tagger [18]. Finally, the transformation-based approach combines the rule-based and statistical approaches: it picks the most likely tag based on a training corpus and then applies a certain set of rules to see whether the tag should be changed to anything else, saving any new rules that it learns in the process for future use. One example of an effective tagger in this category is the Brill tagger [19-22]. All of the approaches discussed above fall under the rubric of supervised POS tagging, where a pre-tagged corpus is a prerequisite; unsupervised POS tagging [23] [24] [25], on the other hand, does not require any pre-tagged corpus.

Koskenniemi (1985) [26] used a rule-based approach implemented with finite-state machines. Greene and Rubin (1971) [27] used a rule-based approach in the TAGGIT program, which was an aid in tagging the Brown corpus [28]; TAGGIT disambiguated 77% of the corpus, and the rest was done manually over a period of several years. Derouault and Merialdo (1986) [29] used a bootstrap method for training: at first, a relatively small amount of text was manually tagged and used to train a partially accurate model; the model was then used to tag more text, and the tags were manually corrected and used to retrain the model. Church (1988) [15] used the tagged Brown corpus for training; models of this kind involve probabilities for each word in the lexicon, so a large tagged corpus is required for reliable estimation. Jelinek (1985) [30] used a Hidden Markov Model (HMM) for training a text tagger; parameter smoothing can be conveniently achieved using the method of ‘deleted interpolation’, in which weighted estimates are taken from second- and first-order models and a uniform probability distribution. Kupiec (1992) [31] used word equivalence classes (referred to as ambiguity classes) based on parts of speech to pool data from individual words; the most common words are still represented individually, as sufficient data exist for robust estimation. Yahya O. Mohamed Elhadj (2004) [32] presents the development of an Arabic part-of-speech tagger for analyzing and annotating traditional Arabic texts, especially the Quran text. The tagger employs an approach that combines morphological analysis with Hidden Markov Models (HMMs) based on Arabic sentence structure: the morphological analysis is used to reduce the size of the tag lexicon by segmenting Arabic words into prefixes, stems and suffixes, since Arabic is a derivational language, while the HMM represents the Arabic sentence structure in order to take the linguistic combinations into account.

In the recent literature, several approaches to POS tagging based on statistical and machine learning techniques have been applied, including Hidden Markov Models [33], Maximum Entropy taggers [34], transformation-based learning [35], memory-based learning [36], decision trees [37] and Support Vector Machines [38]. Most of the previous taggers have been evaluated on the English WSJ corpus, using the Penn Treebank set of POS categories and a lexicon constructed directly from the annotated corpus. Although the evaluations were performed with slight variations, there was wide consensus in the late 90s that the state-of-the-art accuracy for English POS tagging was between 96.4% and 96.7%. In recent years, the most successful and popular taggers in the NLP community have been the HMM-based TnT tagger, the transformation-based learning (TBL) tagger [35] and several variants of the Maximum Entropy (ME) approach [34]. The SVMTool [38] is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework and by offering NLP researchers a highly customizable sequential tagger generator. TnT is an example of a truly practical tagger for NLP applications: it is available to anybody, simple and easy to use, considerably accurate, and extremely efficient, allowing training from one-million-word corpora in just a few seconds and tagging thousands of words per second [39]. In the case of the TBL and ME approaches, their great success has been due to the flexibility they offer in modeling contextual information, ME being slightly more accurate than TBL.
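
As a concrete illustration of the stochastic approach described above, the following toy bigram-HMM tagger decodes with the Viterbi algorithm. The two-tag tagset and all probabilities are invented; real taggers estimate them from tagged corpora with smoothing such as deleted interpolation.

```python
# Toy bigram-HMM POS tagger with Viterbi decoding. The tagset, transition
# and emission probabilities are invented for illustration only.

import math

TAGS = ["N", "V"]
trans = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3,
         ("N", "N"): 0.3, ("N", "V"): 0.7,
         ("V", "N"): 0.6, ("V", "V"): 0.4}
emit = {("N", "time"): 0.9, ("V", "time"): 0.1,
        ("N", "flies"): 0.4, ("V", "flies"): 0.6}

def viterbi(words):
    # states maps the previous tag to (log-probability, best tag sequence)
    states = {"<s>": (0.0, [])}
    for w in words:
        new = {}
        for t in TAGS:
            new[t] = max((lp + math.log(trans[(pt, t)]) + math.log(emit[(t, w)]),
                          seq + [t]) for pt, (lp, seq) in states.items())
        states = new
    return max(states.values())[1]

print(viterbi(["time", "flies"]))   # -> ['N', 'V']
```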

2.1.1 Part-of-Speech Tagger for Indian Languages

Various approaches have been used for developing Part-of-Speech taggers for Indian languages. Smriti Singh et al. (2006) [40] proposed a tagger for Hindi that uses the affix information stored in a word and assigns a POS tag using no contextual information. By considering the previous and the next word in the Verb Group (VG), it correctly identifies the main verb and the auxiliaries; lexicon lookup is used for identifying the other POS categories. A Hidden Markov Model (HMM) based tagger for Hindi was proposed by Manish Shrivastava and Pushpak Bhattacharyya (2008) [41]. The authors attempted to utilize the morphological richness of the language without resorting to complex and expensive analysis. The core idea of their approach was to explode the input in order to increase the length of the input and to reduce the number of unique types encountered during learning; this in turn increases the probability score of the correct choice while simultaneously decreasing the ambiguity of the choices at each stage. In the NLPAI ML contest, Dalal et al. (2006) [42] achieved accuracies of 82.22% and 82.4% for Hindi POS tagging and chunking respectively using maximum entropy models, and Karthik et al. (2006) [43] obtained 81.59% accuracy for Telugu POS tagging using HMMs. Nidhi Mishra and Amit Mishra [44] proposed a Part-of-Speech tagging method for a Hindi corpus in 2011: the system scans the Hindi corpus, extracts the sentences and words from it, searches for the tag pattern in a database, and displays the tag of each Hindi word, such as noun, adjective, number or verb tags. Based on lexical sequence constraints, a POS tagger algorithm for Hindi was proposed by Pradipta Ranjan Ray (2003) [45]. The proposed algorithm acts as the first level of a part-of-speech tagger, using constraint propagation based on ontological information, morphological analysis information and lexical rules. Even though the performance of this POS tagger has not been statistically tested due to a lack of lexical resources, it covers a wide range of language phenomena and accurately captures the four major local dependencies in Hindi.

Sivaji Bandyopadhyay et al. (2006) [46] came up with a rule-based chunker for Bengali which gave an accuracy of 81.64%. The chunker was developed using a rule-based approach since adequate training data was not available. A list of suffixes was prepared for handling unknown words; they used 435 suffixes, many of which usually appear at the end of verb, noun and adjective words. For Telugu, three POS taggers have been proposed using different POS tagging approaches, viz. (1) a rule-based approach, (2) the transformation-based learning (TBL) approach of Eric Brill, and (3) a Maximum Entropy model, a machine learning technique [47]. For Bengali, Sandipan et al. (2007) [48] developed a corpus-based semi-supervised learning algorithm for POS tagging based on HMMs. Their system uses a small tagged corpus (500 sentences) and a large unannotated corpus along with a Bengali morphological analyzer; when tested on a corpus of 100 sentences (1003 words), their system obtained an accuracy of 95%.

Antony P J and Soman K P [49] of Amrita University, Coimbatore proposed a statistical approach to building a POS tagger for the Kannada language using SVMs. They proposed a tagset consisting of 30 tags. The proposed POS tagger for Kannada is based on a supervised machine learning approach and was modeled using an SVM kernel. A stochastic Hidden Markov Model (HMM) based part-of-speech tagger has been proposed for Malayalam; since the stochastic approach requires an annotated corpus, and none was available, a morphological analyzer was also developed to generate a tagged corpus from the training set [50]. Antony P J et al. (2010) [51] developed a tagset and a tagged corpus of 180,000 words for the Malayalam language. This tagged corpus is used for training the system; the SVM-based tagger achieves 94% accuracy and showed improved results over the HMM-based tagger.

2.1.2 Part of Speech Tagger for Tamil Language

Various methodologies have been developed for POS tagging of the Tamil language. A rule-based POS tagger for Tamil was developed and tested by Dr. Arulmozhi P et al. [52]; this system gives only the major tags, and the sub-tags are overlooked during evaluation. A hybrid POS tagger for Tamil using the HMM technique together with a rule-based system was also developed [53]. A part-of-speech tagging scheme tags a word in a sentence with its part of speech. It is done in three stages: pre-editing, automatic tag assignment, and manual post-editing. In pre-editing, the corpus is converted to a suitable format so that a part-of-speech tag can be assigned to each word or word combination. Because of orthographic similarity, one word may have several possible POS tags; after the initial assignment of possible POS tags, words are manually corrected to disambiguate them in the texts.

Vasu Ranganathan's Tagtamil (2001)

Tagtamil by Vasu Ranganathan [55] is based on a lexical phonological approach. Tagtamil handles the morphotactics of morphological processing of verbs by using an index method, and performs both tagging and generation.

Ganesan's POS tagger (2007)

Ganesan [56] prepared a POS tagger for Tamil. His tagger works well on the CIIL corpus; its efficiency on other corpora has yet to be tested. He has a rich tagset for Tamil. He tagged a portion of the CIIL corpus by using a dictionary as well as a morphological analyzer, corrected it manually, and trained on the rest of the corpus with it. The tags are added morpheme by morpheme, for example:

pUkkaLai : pU_N_PL_AC
vawthavan : va_IV_wth_PT_avan_3PMS

Kathambam of RCILTS-Tamil

Kathambam attaches part-of-speech tags to the words of a given Tamil document. It uses heuristic rules based on Tamil linguistics for tagging and uses neither a dictionary nor a morphological analyzer. It gives 80% accuracy for large documents, uses 12 heuristic rules, and identifies the tags based on PNG, tense and case markers. Standalone words are checked against lists stored in the tagger. It uses a 'fill-in rule' to tag unknown words, and also uses bigrams to identify an unknown word from the previous word's category.

Lakshmana Pandian S and Geetha T V (2008) [54] developed a morpheme-based language model for Tamil Part-of-Speech tagging. The model uses information from the stem type, the last morpheme, and the morpheme before the last morpheme of a word to categorize its part of speech. For estimating the contribution factors of the model, they followed the generalized iterative scaling technique. Lakshmana Pandian S and Geetha T V (2009) [57] later developed CRF models for Tamil part-of-speech tagging and chunking. This method avoids a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models. The language models are developed using CRFs and are designed based on the morphological information of Tamil.

Selvam and Natarajan (2009) [58] developed a rule-based morphological analyser and POS tagger for Tamil, and improved these systems using projection and induction techniques. A rule-based morphological analyzer and POS tagger can be built from well-defined morphological rules of Tamil. Projection and induction techniques are used for POS tagging, base noun-phrase bracketing, named entity tagging and morphological analysis from a resource-rich language to a resource-deficient language. They applied alignment and projection techniques for projecting POS tags, and alignment, lemmatization and morphological induction techniques for inducing root words from English to Tamil. Categorical information and root words are obtained from POS projection and morphological induction respectively, from English, via alignment across sentence-aligned corpora. They generated more than 600 POS tags for rule-based morphological analysis and POS tagging.

2.2 MORPHOLOGICAL ANALYZER AND GENERATOR

Different methodologies have been adopted for developing morphological analyzers in various languages. Generally, rule-based approaches dominate the morphological analysis problem, but statistical methods have recently been introduced. For example, a Thai morphological analysis based on the theoretical background of Conditional Random Fields (CRFs) formulates an unsegmented language as a sequential supervised learning problem [59]. Memory-based learning has been successfully applied to morphological analysis and part-of-speech tagging in Western and Eastern European languages [60]; MBMA (Memory-Based Morphological Analysis) is a memory-based learning system, where memory-based learning is a class of inductive supervised machine learning algorithms that learn by storing examples of a task in memory. A corpus-based morphological analyzer for unvocalized Modern Hebrew was developed by combining statistical methods with rule-based syntactic analysis [61]. John Goldsmith (2001) [62] shows how stems and affixes can be inferred from a large unannotated corpus. A data-driven method for automatically analyzing the morphology of ancient Greek used a nearest-neighbour machine learning framework [63]. A language modeling technique to select the optimal segmentation, rather than using heuristics, has been proposed for a Thai morphological analyzer [64].

2.2.1 Morphological Analyzer and Generator for Indian Languages

T. N. Vikram and Shalini R (2007) [65] developed a prototype morphological analyzer for the Kannada language based on Finite State Machines (FSMs). It can simultaneously serve as a stemmer, part-of-speech tagger and spell checker, but it does not handle compound-formation morphology and can handle a maximum of 500 distinct nouns and verbs. Recently, Shambhavi B. R and Dr. Ramakanth Kumar (2011) [66] developed a paradigm-based morphological generator and analyzer using a trie-based data structure. The disadvantage of a trie is that it consumes more memory, as each node can have at most y children, where y is the alphabet count of the language; as a result, the system can handle up to a maximum of 3700 root words and around 88K inflected words. Uma Maheshwar Rao G. and Parameshwari K. of CALTS, University of Hyderabad (2010) [67] attempted to develop morphological analyzers and generators for South Dravidian languages. MORPH, a network and process model for Kannada morphological analysis/generation, was developed by K. Narayana Murthy (2001) [68]; the performance of the system is 60 to 70% on general texts. For Bengali, an unsupervised methodology was used to develop a morphological analyzer system [69], and a two-level morphology approach was used to handle Bengali compound words. Rule-based morphological analyzers have been developed for the Sanskrit [71] and Oriya [70] languages.

2.2.2 Morphological Analyzer and Generator for Tamil Language

For the Tamil language, the first step towards the preparation of a morphological analyzer was initiated by the Anusaraka group. Ganesan (2007) [56] developed a morphological analyzer for Tamil to analyze the CIIL corpus; phonological and morphophonemic rules are taken into account in building the analyzer. The Resource Centre for Indian Language Technological Solutions (RCILTS)-Tamil has prepared a morphological analyzer (Atcharam) for Tamil, adopting a finite-automata state table for its development [72].

Tamil morphological analyzers and generators have been built based on various techniques and constraints such as morphotactics, morphological alternations, phonology and morphophonemics. Some of these works were reported by the authors Rajendran, Ganesan, Kapilan, Deivasundaram, Vishnavi, Ramasamy, Winston Cruz and Dhurai Pandi, and by organizations such as AU-KBC (Anna University-KBC) at the Madras Institute of Technology (MIT), Chennai and the Resource Centre for Indian Language Technological Solutions-Tamil (RCILTS-T) at Anna University, Chennai. A simple morphological tagger which identifies suffixes, labels them and separates root words from transliterated Tamil words has also been reported.

Parameswari K. (2010) [74] developed a Tamil morphological analyzer and generator using the APERTIUM toolkit. This attempt involves a practical adoption of lttoolbox for modern standard written Tamil in order to develop an improved open-source morphological analyzer and generator. The tool uses Finite State Transducers (FST) for one-pass analysis and generation, and the database is developed in the morphological model called word-and-paradigm.

Vijay Sundar Ram R et al. (2010) [75] designed a Tamil morphological analyzer using a paradigm-based approach and Finite State Automata (FSA), which work efficiently in recursive tasks and consider only the current state when making a transition. In this approach, complex affixations are easily handled by the FSA, and the required orthographic changes are handled in every state. They built an FSA using all possible suffixes, categorized the root word lexicon based on the paradigm approach to optimize the number of orthographic rules, and used morpho-syntax rules to get the correct analysis for a given word. The FSA analyzes a word suffix by suffix; FSAs are a proven technology for efficient and speedy processing. There are three major components in their morphological analyzer system: the first is the Finite State Automaton, modeled using all possible suffixes (allomorphs); the next is the lexicon, categorized based on the paradigm approach; and the final component is the morpho-syntax rules for filtering the correct parse of the word.

Akshar Bharati et al. (2001) [76] developed an algorithm for unsupervised learning of morphological analysis and generation for inflectionally rich languages. The algorithm uses the frequencies of occurrence of word forms in a raw corpus. They introduce the concept of an "observable paradigm" by forming equivalence classes of feature structures which are not obvious; the frequency of word forms for each equivalence class is collected from such data for known paradigms. In this algorithm, if the morphological analyzer cannot recognize an inflectional form, the possible stem and paradigm are guessed using the corpus frequencies. The method assumes that the morphological package makes use of paradigms, and the package is able to guess the stem-paradigm pair for an unknown word. The method depends only on the frequencies of the word forms in raw corpora and does not require any linguistic rules or tagger; the performance of the system depends on the size of the corpora.

Vasu Ranganathan (2001) [55] built a Tamil tagger by implementing the theory of lexical phonology and morphology, and the system was tested with an English-Tamil machine translation system and a number of natural language processing tasks. The tagger tool was written in Prolog and built with a knowledge base of morphological rules of the Tamil language; it is capable of accounting for all morphological information during recognition and generation, and was built using successive stages of knowledge-based morphological rules. In this method, three different coding procedures were adopted to recognize written Tamil literary word forms. The output contains morphological information such as the type of the word, the root form of the word and suitable morphological tags for affixes. The tagger is capable of recognizing and generating Tamil word forms, including finite and non-finite verbs with aspect, modality and tense forms, as well as noun forms such as participial nouns, verbal nouns and case-inflected forms. A dictionary built as part of this system contains information about the root words and their grammatical information. The tagger was tested in, and included in, a machine translation system.

Duraipandi (2002) [77] designed morpho-phonemic rules for Tamil computing. These rules are a primary resource for a spell checker and a machine-aided translation system. His morphological generator and parsing engine for Tamil verb forms is a full-fledged engine covering verb patterns in modern Tamil.

Dhanabalan et al. (2003) [13] developed a spell checker for the Tamil language. Lexicons with morphological and syntactic information are used in developing this spell checker, which can be integrated with word processors. Each word is compared against a dictionary of correctly spelled words; the tool needs syntactic and semantic knowledge for catching misspelled words. It also provides a facility to customize the spell checker's dictionary so that technical words and proper nouns can be appended. Initially the spell checker reads a word from the document; if the word is present in the dictionary, it is interpreted as a valid word, otherwise the word is forwarded to the error-correction process. The tool consists of three phases: text parsing, spelling verification and generation. The spell checker uses a Tamil morphological analyzer and morphological generator: the analyzer is used for analyzing the given word, and the generator for generating different suggestions. The morphological analyzer first tries to split off the suffix of the word; if there is a spelling mistake, the word is passed to the spelling verification and correction module. After finding the root word, the system compares it with the dictionary entries; if the root word is not present in the dictionary, the nearest root word is taken and given to the morphological generator system.

M. Ganesan (2007, 2009) [56] described the analysis and generation of Tamil corpora and developed various tools to analyze them, including a POS tagger, morphological analyzer, frequency counter and KWIC (KeyWord In Context) concordance. At the word level the tagset contains 22 tags, and at the morph level it contains 82 tags. Using this POS tagger and morphological analyzer he has also developed a syntactic tagger, which tags a Tamil sentence at the phrase level and clause level.

K. Rajan et al. (2009) [79] developed an unsupervised approach to Tamil morpheme segmentation. The main objective of this work is the production of a list of morphemes for the Tamil language. Morpheme identification is done using Letter Successor Varieties (LSV) and an n-gram based approach; the words used in this segmentation are collected from the CIIL Tamil corpus. The segmentation algorithm is based on the peak-and-plateau model, one of the variants of Letter Successor Varieties. The basic idea of LSV is to count the number of different letters encountered after a part of a word and to compare it with the counts before and after that position. A successive-split method is used for splitting the Tamil words: initially all words are treated as stems; in the first pass these stems are split into new stems and suffixes based on the similarity of the characters, splitting at the position where two words differ, with the right substring stored in a suffix list and the left substring kept as a stem; in the second pass the same procedure is followed, and the suffixes are stored in a separate suffix list.

Uma Maheswar Rao (2004) [67] proposed a modular model based on a hybrid approach which combines the two basic primary concepts of analyzing word forms and the paradigm model. The morphological model underlying this description is influenced by the need for a flexible design that permits plugging in language-specific databases with minimum changes, and by simplicity of computation. The architecture involves the identification of different layers among the affixes, which enter into concatenation to generate word forms. Unlike in the traditional item-and-arrangement model, allomorphic variants are not given any special status if they are generable through simple morphophonemic rules; automatic phonological rules are also used to derive surface forms.

Menon et al. (2009) [131] developed a Finite State Transducer (FST) based Tamil morphological analyzer and generator, using the AT&T Finite State Machine toolkit. An FST maps between two sets of symbols; it is used as a transducer that accepts the input string if it is in the language and generates another string on its output. The system is based on a lexicon and orthographic rules from a two-level morphological system. For the morphological generator, if the string containing the root word and its morphemic information is accepted by the automaton, it generates the corresponding root word and morpheme units in the first level.

2.3 MACHINE TRANSLATION SYSTEMS

2.3.1 Machine Translation Systems for Indian Languages

Machine translation systems have been developed in India both for translation from English to Indian languages and between Indian languages. These systems are also used for teaching machine translation to students and researchers. Most of them work in the English-to-Hindi domain, with exceptions such as a Hindi-to-English and an English-to-Kannada machine translation system. English is an SVO language, while the Indian regional languages are SOV and relatively free in word order. The translation domains are mostly government documents, health, tourism, news reports and stories. A survey of the machine translation systems developed in India for translation from English to Indian languages and among Indian languages shows that such software is either in field testing or available as a web translation service. The Indian machine translation systems [80] are presented below; most of these systems translate English into Hindi.

R. M. K. Sinha et al. [81] proposed Anglabharti, a machine-aided translation system specifically designed for translating English to Indian languages. Instead of designing a separate translator from English to each Indian language, Anglabharti uses a pseudo-interlingua approach: it analyses English only once and creates an intermediate structure called PLIL (Pseudo Lingua for Indian Languages). This constitutes the basic translation step from the English source into PLIL, with most of the disambiguation already performed. The PLIL structure is then converted into each Indian language through a process of text generation. The effort of analysing the English sentences and translating into PLIL is estimated at about 70%, with text generation accounting for the remaining 30%; thus a new English-to-Indian-language translator can be built with only an additional 30% effort. Anglabharti is a pattern-directed rule-based system with a context-free-grammar-like structure for English, the source language, which generates a 'pseudo-target' (PLIL) applicable to a group of Indian target languages. A set of rules obtained through corpus analysis is used to identify plausible constituents, with respect to which the movement rules for PLIL are constructed. The idea of using PLIL is primarily to exploit structural similarity and obtain advantages similar to those of an interlingua approach. The system also uses an example base to identify noun and verb phrases and resolve their ambiguities [82].

Akshar Bharati et al. proposed Anusaaraka, a project started at IIT Kanpur and now continued at IIIT Hyderabad, with the explicit aim of translation from one Indian language to another. It produces output which a reader can understand but which is not exactly grammatical. For example, a Bengali-to-Hindi Anusaaraka can take a Bengali text and produce Hindi output that the user can understand but that is not grammatically perfect. Likewise, a person visiting a site in a language he does not know can run Anusaaraka, read the text and understand the context. Anusaarakas have been built from Telugu, Kannada, Bengali and Marathi to Hindi [83].

MaTra [84] is a human-assisted translation project for English to Indian languages, currently Hindi, essentially based on a transfer approach using a frame-like structured representation. The focus is on the innovative use of man-machine synergy: the user can visually inspect the system's analysis and provide disambiguation information through an intuitive GUI, allowing the system to produce a single correct translation. The system uses rule bases and heuristics to resolve ambiguities to the extent possible; for example, a rule base is used to map English prepositions onto Hindi postpositions. The system can work in a fully automatic mode and produce rough translations for end users, but it is primarily meant for translators, editors and content providers. Currently it works for simple sentences, and work is under way to extend coverage to complex sentences. The MaTra lexicon and approach are general-purpose, but the system has been applied mainly in the domains of news, annual reports and technical phrases, and has been funded by TDIL.

Mantra (MAchiNe assisted TRAnslation tool) translates English text into Hindi in a specified domain of personal administration, specifically gazette notifications, office orders, office memorandums and circulars. Initially, the Mantra system started with the translation of administrative documents such as appointment letters, notifications and circulars issued by the central government, from English to Hindi, and the system is ready for use in its domains. It has a text categorization component at the front, which determines the type of news story, such as political, terrorism or economic, before operating on the given story; depending on the type of news, it uses an appropriate dictionary. It also requires considerable human assistance in analysing the input. Another novel component is that, given a complex English sentence, it breaks the sentence up into simpler sentences, which are then analysed and used to generate Hindi. The translation system is being used in a project on Cross Lingual Information Retrieval (CLIR) that enables a person to query the web in Hindi for documents related to health issues [85].

The Anubharti approach to machine-aided translation is a hybridized example-based machine translation (EBMT) approach, a combination of example-based and corpus-based approaches with some elementary grammatical analysis. Example-based approaches emulate the human learning process of storing knowledge from past experiences for use in the future. In Anubharti, the traditional EBMT approach has been modified to reduce the requirement of a large example base, primarily by generalizing the constituents of the raw examples and replacing them with abstracted forms. The abstraction is achieved by identifying the syntactic groups. Matching of the input sentence against the abstracted examples is done on the basis of the syntactic categories and semantic tags of the source-language structure [85].

Two machine translation systems for English to Hindi, Shiva and Shakti [86], were developed jointly by Carnegie Mellon University, USA, the Indian Institute of Science, Bangalore, and the International Institute of Information Technology, Hyderabad. The Shakti system has been designed so that machine translation systems for new languages can be produced rapidly. Shakti combines a rule-based approach with a statistical one, whereas Shiva is an example-based machine translation system. The rules are linguistic in nature, and the statistical component tries to infer or use linguistic information; some modules also use semantic information. Currently Shakti works for three target languages (Hindi, Marathi and Telugu).

An English-Telugu machine translation system has been developed jointly at CALTS with IIIT Hyderabad, Telugu University, Hyderabad and Osmania University, Hyderabad. It uses an English-Telugu lexicon of 42,000 words, and a word-form synthesizer for Telugu has been developed and incorporated into the system [85].

An English-Kannada machine-aided translation system has been developed at the Resource Centre for Indian Language Technology Solutions, University of Hyderabad, by Dr. K. Narayana Murthy [85]. The approach is based on the Universal Clause Structure Grammar (UCSG) formalism; it is essentially a transfer-based approach, has been applied to the domain of government circulars, and was funded by the Karnataka government [85].


Anubaad [87] is a hybrid machine translation system for translating English news headlines into Bengali, developed by Sivaji Bandyopadhyay at Jadavpur University, Kolkata. The current version of the system works at the sentence level.

R. Mahesh et al. [88] proposed Hinglish, a machine translation system for pure (standard) Hindi to pure English. It was implemented by adding a layer to the existing English-to-Hindi (AnglaBharti-II) and Hindi-to-English (AnuBharti-II) translation systems developed by Sinha. The system is claimed to produce satisfactory, acceptable results in more than 90% of cases; only for polysemous verbs, owing to the very shallow grammatical analysis used in the process, is the system unable to resolve the meaning.

An English-Hindi example-based machine translation system was developed by the IBM India Research Lab at New Delhi. They have recently initiated work on statistical machine translation between English and Indian languages, building on IBM's existing work on statistical machine translation [89].

Gurpreet Singh Josan et al. [90] developed a Punjabi-to-Hindi machine translation system at Punjabi University, Patiala. The system is based on a direct word-to-word translation approach and consists of modules for pre-processing, word-to-word translation using a Punjabi-Hindi lexicon, morphological analysis, word sense disambiguation, transliteration and post-processing. The system has reported 92.8% accuracy.

Sampark, a machine translation system for translation among Indian languages, was developed by a consortium of institutions including IIIT Hyderabad, University of Hyderabad, CDAC (Noida, Pune), Anna University KBC Chennai, IIT Kharagpur, IIT Kanpur, IISc Bangalore, IIIT Allahabad, Tamil University and Jadavpur University. Experimental systems have currently been released, namely {Punjabi, Urdu, Tamil, Marathi}-to-Hindi and Tamil-Hindi machine translation systems [85].

Vishal Goyal et al. [91] developed a Hindi-to-Punjabi machine translation system at Punjabi University, Patiala. This system is likewise based on a direct word-to-word translation approach, with modules for pre-processing, word-to-word translation using a Hindi-Punjabi lexicon, morphological analysis, word sense disambiguation, transliteration and post-processing. The system has reported 95% accuracy.

2.3.2 Machine Translation Systems for Tamil

Prashanth Balajapally et al. [92] developed example-based machine translation for the English-to-{Hindi, Kannada, Tamil} and Kannada-to-Tamil language pairs. It is based on a bilingual dictionary comprising a sentence dictionary, a phrase dictionary, a word dictionary and a phonetic dictionary, each containing parallel corpora of sentences, phrases and words, and phonetic mappings of words, in their respective files. The EBMT system has a set of 75,000 of the most commonly spoken sentences, originally available in English, which have been manually translated into three target Indian languages, namely Hindi, Kannada and Tamil.

A Tamil-Hindi machine-aided translation system was developed by Prof. C. N. Krishnan at the AU-KBC Research Centre, MIT Campus, Anna University, Chennai. The system is based on the Anusaaraka machine translation system; it uses lexical-level translation and has 80-85% coverage. Stand-alone, API and web-based on-line versions have been developed. A Tamil morphological analyser and a Tamil-Hindi bilingual dictionary are by-products of this system. They also developed a prototype English-Tamil MAT system that includes exhaustive syntactic analysis; at present it has a limited vocabulary and a small set of transfer rules.

A Telugu-Tamil machine translation system is also being developed at CALTS. It uses the Telugu morphological analyser and the Tamil generator developed at CALTS, and the backbone of the system is a Telugu-Tamil dictionary [85].

Ruvan Weerasinghe (2004) [93] developed an SMT system for Sinhala to Tamil. Corpora were drawn from a Sri Lankan newspaper published in both languages, along with a website containing translations of English articles into Sinhala and Tamil; together these resources form a small trilingual parallel corpus of news items and articles related to politics and culture in Sri Lanka. The fundamental task of sentence boundary detection was performed with a semi-automatic approach: a basic heuristic was first applied to identify sentence boundaries, and the situations that were exceptions to the heuristic were then identified. Sentences were aligned manually. After cleaning the texts and completing the manual alignment, a total of 4,064 Sinhala and Tamil sentences were used for SMT. All language processing used raw words and was based on statistical information; the CMU-Cambridge Statistical Language Modeling Toolkit (version 2) was used to build n-gram language models.

Vasu Renganathan (2002) [94] pursued the development of an English-Tamil web-based machine translation system. The system is rule-based, containing around five thousand words in its lexicon and a wide range of transfer rules written in Prolog, and it maps frequently occurring English structures to the corresponding Tamil structures. An interesting feature of this system is that it can be updated easily by adding words to the lexicon and rules to the rule base. Two types of lexicon are described: one based on the grammatical categories of head and target words, and the other based on the semantic and syntactic properties of words. The former type is used to translate technical, colloquial and news documents, whereas the latter is needed for translating complex literary texts such as fiction, poems and biographies; the former type was used to build this English-to-Tamil translation system. Prolog is used to code the complex rules in a robust way, and the author also discusses the feasibility of further research in this area. The morphological transducer built as part of the system uses this information to generate correct inflectional forms; it is constructed following the theory of lexical phonology, which accounts for the interrelationship between phonological and morphological rules in terms of lexical and post-lexical rules.

Ulrich Germann (2001) [95] reported his experience of building a statistical MT system from scratch, including the creation of a small parallel Tamil-English corpus of about 100,000 words on the Tamil side, created within one month using several translators. The paper describes the complete experience of creating the parallel corpus and offers advice for similar future projects. The overall organization of the project covered source data retrieval, hiring and management of the translators, design and implementation of a web interface for managing the project over the Internet, and development of a transliterator and a stemmer. To boost text coverage, a simple text stemmer for Tamil was built, based on the Tamil inflection tables. The stemmer uses regular expression matching to cut off inflectional endings and introduces some extra tokens for negation and certain case markings (such as locative and genitive), which are all marked morphologically in Tamil. Finally, it was shown that the performance of the MT system improves when the stemmer is used.
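A stemmer of the kind Germann describes can be approximated with a few regular expressions. The sketch below is only an illustration of the idea: the transliterated ending list and the marker tokens are invented samples, not Germann's actual inflection tables.

import re

# A small, illustrative sample of transliterated Tamil inflectional endings
# mapped to marker tokens (not Germann's actual tables), matched longest-first.
ENDINGS = {
    "kalai": "+PL+ACC", "kalil": "+PL+LOC", "kal": "+PL",
    "ai": "+ACC", "il": "+LOC", "ukku": "+DAT",
}
PATTERN = re.compile("(" + "|".join(sorted(ENDINGS, key=len, reverse=True)) + ")$")

def stem(word):
    """Cut one inflectional ending and emit an extra marker token for it."""
    m = PATTERN.search(word)
    if m is None:
        return [word]
    return [word[:m.start()], ENDINGS[m.group(1)]]

print(stem("pookkalil"))   # -> ['pook', '+PL+LOC']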

Fredric C. Gey (2002) [96] reported on the prospects of machine translation for Tamil, mentioning the major problems connected with machine translation and cross-language retrieval of Tamil (and other Indian languages). The primary issue is the lack of machine-readable resources for either machine translation or cross-language dictionary lookup; most Tamil language research has concentrated on the rich classical literature. He assembled a corpus of over 3,000 Tamil news stories from the Thinaboomi website, providing a rich source for modern Tamil linguistic studies and retrieval. This corpus has been used to develop an experimental statistical machine translation system from Tamil to English at the Information Sciences Institute (http://www.isi.edu), one of the leading machine translation research organizations.

K. C. Chellamuthu (2002) [97] explained the role of machine translation in information dissemination, along with a brief history of MT and its strategies, and described the various components and functions of an early MT system for Russian-to-Tamil translation developed at Tamil University, Tanjore. The Russian-to-Tamil MT system uses an intermediate language with a syntax closer to the target language, and consists of functional components such as a preprocessor, a parser, a lexical analyzer, a bilingual dictionary, a morphological analyzer, and translator and generation modules. Depending on the strategy adopted in an MT system, the functional organization may vary from system to system. Here the primary tasks are analyzing the input text, parsing the sentences, analyzing the words lexically and morphologically, conceptualizing the source-language sentences, performing table lookup in the bilingual dictionary, and translating each input word using the linguistic knowledge already defined in the system. The bilingual dictionary, a major database of the MT system, contains about 1,200 vocabulary entries with lexical markers and attributes. The translation strategy adopted is a transfer methodology involving an intermediate language: the given Russian sentence is lexically analyzed, and after syntactic and morphological analysis and word-by-word translation, its syntax is transformed to the grammar of the intermediate language.

The Computational Engineering and Networking research centre of Amrita School of Engineering, Coimbatore, proposed an English-Tamil translation memory system. The system follows a phrase-based approach, incorporating concept labeling over a translation memory of parallel corpora. It consists of 50,000 English-Tamil parallel sentences, 5,000 proverbs and 1,000 idioms and phrases, with a dictionary containing more than 200,000 technical words and 100,000 general words, and has an accuracy of 70% [98].

Loganathan R (2010) [124] developed English-Tamil machine translation systems using rule-based and corpus-based approaches. For the rule-based approach, the structural difference between English and Tamil is considered and a syntax-transfer methodology is adopted for translation. Saravanan et al. (2010) [99] developed a rule-based machine translation system for English to Tamil. Using the statistical machine translation approach, Google has developed a web-based machine translation engine for English to Tamil; this system can also identify the source language automatically.

2.4 ADDING LINGUISTIC INFORMATION FOR SMT SYSTEM

Statistical translation models have evolved from the word-based models originally proposed by Brown et al. [100] to syntax-based and phrase-based techniques. The beginnings of phrase-based translation can be seen in the alignment template model introduced by Och et al. [101]. A joint probability model for phrase translation was proposed by Marcu et al. [102]. Koehn et al. [103] propose heuristics to extract phrases that are consistent with the bidirectional word alignments generated by the IBM models [100]; phrases extracted using these heuristics are shown to perform better than syntactically motivated phrases, the joint model and IBM model 4 [103].
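The consistency criterion behind such phrase extraction can be stated compactly: a source span and a target span form a phrase pair if every alignment link touching either span falls inside the pair, and at least one link falls inside it. The following simplified Python sketch illustrates this; unlike full extractors, it ignores extensions over unaligned words, and the toy alignment is invented for illustration.

# Extract phrase pairs consistent with a word alignment: no link may leave
# the source/target box, and the box must contain at least one link.

def extract_phrases(alignment, src_len, tgt_len, max_len=4):
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_len)):
            # target positions linked to the source span [i1, i2]
            tgt = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            # consistency: links into [j1, j2] must all come from [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# Toy alignment for "he ate rice" <-> "avan sAdham sAppittAn" (SVO vs SOV).
links = [(0, 0), (1, 2), (2, 1)]
print(extract_phrases(links, 3, 3))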

Syntax-based models use parse-tree representations of the sentences in the training data to learn, among other things, tree transformation probabilities. These methods require a parser for the target language and, in some cases, for the source language too. Yamada et al. [104] propose a model that transforms target-language parse trees into source-language strings by applying reordering, insertion and translation operations at each node of the tree. Graehl et al. and Melamed propose methods based on tree-to-tree mappings [105][106]. Imamura et al. present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation [107].

Recently, various pre-processing approaches have been proposed for handling syntax within SMT. These algorithms attempt to reconcile the word-order differences between the source and target sentences by reordering the source-language data prior to the SMT training and decoding cycles. Nießen et al. [108] propose restructuring steps for German-English SMT; Popovic et al. [109] report the use of simple local transformation rules for Spanish-English and Serbian-English translation; and Collins et al. [110] propose German clause restructuring to improve German-English SMT. The use of morphological information for SMT has been reported in [108] and [109]; the detailed experiments of Nießen et al. [108] show that using morpho-syntactic information drastically reduces the need for bilingual training data. Recent work by Koehn et al. [10] proposes factored translation models, which combine feature functions handling syntactic, morphological and other linguistic information in a log-linear model.

The following addresses the various approaches to handling idioms and phrasal verbs, one of the most important tasks in machine translation. Sahar Ahmadi et al. [111] analysed the translatability of colour idiomatic expressions in English-Persian and Persian-English texts, exploring the translation strategies applied to such expressions and the cultural similarities and differences between colour idiomatic expressions in English and Persian. Martine Smets et al. [112] designed their machine translation system to handle verbal idioms. Verbal idioms constitute a challenge for machine translation systems: their meaning is not compositional, preventing a word-for-word translation, and they can be discontinuous, preventing a match during tokenization.


Elisabeth Breidt et al. [113] suggested describing the syntactic restrictions and idiosyncratic peculiarities of such expressions with local grammar rules, which at the same time permit expressing regularities valid for a whole class of multi-word lexemes, such as word-order variation in German. Digital Sonata, a provider of NLP services and products, has released a toolkit called the Caraboa Language Kit, in which idioms serve as the backbone of the architecture; it is mainly rule-based, and idioms are treated as sequences, each sequence being a combination of one or more lexical units.

Panagiotis (2005) [114] proposes a novel algorithm for incorporating morphological knowledge into an English-to-Greek Statistical Machine Translation (SMT) system, suggesting a method of improving the translation quality of existing SMT systems by incorporating word stems. First, word stems are acquired automatically for the source and target languages using an unsupervised morphological acquisition algorithm. Second, the stems are incorporated into the SMT system using a general statistical framework that combines a word-based and a stem-based SMT system. The combined lexical and morphological SMT system is implemented using late integration and lattice re-scoring, with the Linguistica system performing morphological analysis for both the source and target languages. The system was trained on parts of the Europarl corpus [115], a parallel corpus in 11 European languages extracted from the proceedings of the European Parliament, and was then evaluated on Europarl using automatic evaluation methods for various training-corpus sizes.

Soha Sultan (2011) [116] introduces two approaches to augmenting English-Arabic statistical machine translation (SMT) with linguistic knowledge. The first approach improves SMT by adding linguistically motivated syntactic features to particular phrases; these features are based on English syntactic information, namely part-of-speech tags and dependency parse trees. The second approach improves morphological agreement in the machine translation output through post-processing: the projection of the English dependency parse tree onto the Arabic sentence, together with Arabic morphological analysis, is used to extract the agreement relations between words in the Arabic sentence. Individual morphological features are trained using syntactic and morphological information from both the source and target languages, and the predicted morphological features are then used to generate the correct surface forms.

Adrià de Gispert Ramis (2006) [117] addresses the use of morpho-syntactic information to improve the performance of Statistical Machine Translation (SMT) systems, providing them with linguistic information beyond the surface level of the words in the parallel corpora. The author proposes a translation model that tackles verb-form generation through an additional verb instance model, reporting experiments on English-to-Spanish tasks. Word alignment is treated as the first step in training SMT systems, and morpho-syntactic information is included prior to word alignment; improvements in word alignment and translation quality are also studied. A classification approach is proposed and attached to standard SMT decoding, with results reported for the English-to-Spanish translation task.

Ann Clifton (2010) [118] examines various methods of augmenting SMT models with morphological information to improve the quality of translation into morphologically rich languages, comparing them on an English-to-Finnish translation task. Unsupervised morphological segmentation methods are integrated into the translation model, and this segmentation-based system is combined with a Conditional Random Field morphology prediction model. The morphology-aware models yield significantly more fluent translation output than a baseline word-based model.

Sara Stymne (2009) [119] explores how compound processing can be used to improve phrase-based statistical machine translation (PBSMT) between English and German/Swedish. For translation into Swedish and German, the compound parts are merged after translation. The effect of different splitting algorithms for translation between English and German, and of different merging algorithms for German, is also investigated; for translation between English and German, different splitting algorithms work best for different translation directions. A novel merging algorithm based on part-of-speech matching is designed and evaluated.

Rabih M. Zbib (2010) [120] presented methods for using linguistically motivated information to enhance the performance of statistical machine translation (SMT), applying linguistic knowledge at various levels to improve Arabic-English translation. In the first part, morphological information is used to preprocess the Arabic text for Arabic-to-English and English-to-Arabic translation; this preprocessing reduces the gap in morphological complexity between Arabic and English. The second method addresses the issue of long-distance reordering in translation, to account for the difference in the syntax of the two languages. The third part shows how additional local context information on the source side can be incorporated to reduce lexical ambiguity, and proposes two methods of using binary decision trees to control the amount of context information introduced. Finally, the system combines the outputs of an SMT system and a rule-based MT (RBMT) system, taking advantage of the flexibility of the statistical approach and the rich linguistic knowledge embedded in the rule-based system.

Young-Suk Lee (2004) [121] presents a novel morphological analysis technique that induces morphological and syntactic symmetry between two languages with highly asymmetrical morphological structures, in order to improve statistical machine translation quality; the technique was applied to Arabic-English sentence alignment. The algorithm identifies morphemes to be merged or deleted in the morphologically richer language so as to induce the desired morphological and syntactic symmetry, utilizing two sets of translation probabilities to determine the merge/deletion analysis of a morpheme. Additional morphological analysis, induced from noun-phrase parsing of Arabic, is applied to achieve syntactic as well as morphological symmetry between the two languages. An Arabic part-of-speech tagger with around 120 tags and an English part-of-speech tagger with around 55 tags were used. The technique improves Arabic-to-English translation quality significantly.

Irimia Elena and Alexandru Ceauşu (2010) [122] present a method for extracting translation examples using the dependency linkage of both the source- and target-language sentences. They identified two types of dependency link structures, super-links and chains, and used these structures to set the borders of the translation examples. A Romanian-English parallel corpus containing about 600,000 translation units was used. The GIZA++ tool was used to build translation models from the linguistically analyzed parallel corpora, and unidirectional translation models were also constructed. The performance of the dependency-based approach is measured with the BLEU-NIST score, in comparison with a baseline system.

Sriram Venkatapathy et al. (2010) [123] propose a dependency-based statistical system that uses discriminative techniques to train its parameters, with experiments conducted on English-Hindi parallel corpora. The use of syntax (a dependency tree) makes it possible to address the extensive word reordering between English and Hindi. Function words are grouped with their corresponding content words; these groups are called local word groups, and in them the function words are treated as factors of the content words. Three types of transformation features are explored: local features, syntactic features and contextual features. MIRA, an online large-margin algorithm, is used to update the weights learned in the training algorithm.

2.5 RELATED NLP WORKS IN TAMIL

Kumaran A and Tobias Kellner (2007) [125] proposed a machine transliteration framework based on a core algorithm modelled as a noisy channel, in which the source string gets garbled into the target string. Viterbi alignment was used to align the source- and target-language segments. The transliteration is learned by estimating the parameters of the distribution that maximizes the likelihood of observing the garbling seen in the training data, using the Expectation Maximization (EM) algorithm. Subsequently, given a target-language string 't', the most probable source-language string 's' that gave rise to 't' is decoded. The method is applied to forward transliteration from English into Hindi, Tamil, Arabic and Japanese, and to backward transliteration from Hindi, Tamil, Arabic and Japanese into English.

Afraz and Sobha (2008) [126] developed a statistical transliteration engine using an n-gram based approach. The algorithm uses n-gram frequencies of the transliteration units, each of which is a consonant-vowel pattern in the word, to find the probabilities. This transliteration engine is used in their Tamil-to-English CLIR system.

Srinivasan C. Janarthanam et al. (2008) [127] proposed an efficient algorithm for the transliteration of English named entities into Tamil. In the first stage of the transliteration process, a Compressed Word Format (CWF) algorithm compresses both English and Tamil named entities from their actual forms. The Compressed Word Format of a word is created using an ordered set of rewrite and remove rules: rewrite rules replace characters and clusters of characters with other characters or clusters, while remove rules simply delete characters or clusters. The CWF algorithm is used for both English and Tamil names, but with different rule sets, and the final CWF forms retain only the minimal consonant skeleton. In the second stage, Levenshtein's edit distance algorithm is modified to incorporate Tamil characteristics such as the long-short vowel distinction and the ambiguities in consonants like 'n', 'r', 'l', etc. Finally, the CWF mapping transliteration algorithm takes an input source-language named-entity string, converts it into CWF form and then matches it against similar Tamil CWF words using the modified edit distance, producing a ranked list of transliterated names in the target language, Tamil, for an English source-language name.

Vijaya et al. (2010) [128] developed an English-to-Tamil transliteration system using a one-class Support Vector Machine (SVM) algorithm. This is a statistically based transliteration system in which training, testing and evaluation were performed with a publicly available SVM tool. The experimental results show that the SVM-based transliteration outperformed previous methods.
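The kind of language-aware edit distance used in such matching can be illustrated with a standard Levenshtein computation in which selected substitutions are made cheaper. The cost values and the long/short vowel pairs below are illustrative assumptions, not the exact modification of Janarthanam et al.

# Levenshtein distance with a cheaper cost for long/short vowel confusions
# (transliterated), in the spirit of the modified edit distance above.
CHEAP_SUBS = {("a", "A"), ("A", "a"), ("i", "I"), ("I", "i"), ("u", "U"), ("U", "u")}

def edit_distance(s, t):
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                sub = 0.0
            elif (s[i - 1], t[j - 1]) in CHEAP_SUBS:
                sub = 0.5          # a long/short vowel mismatch is "half" an edit
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # (possibly cheap) substitution
    return d[m][n]

print(edit_distance("chennai", "chennAi"))   # -> 0.5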

Basically, a chunker divides a sentence into its major non-overlapping phrases and attaches a label to each. Chunkers differ in their precise output and in the way a chunk is defined; many do more than simple chunking, while others find only NPs. Chunking falls between tagging (which is feasible but sometimes of limited use) and full parsing (which is more useful but is difficult on unrestricted text and may result in massive ambiguity). The structure of individual chunks is fairly easy to describe, while the relations between chunks are harder and more dependent on individual lexical properties; chunking is thus a compromise between the currently available and the ideal processing output. Chunkers tokenize and tag the sentence; most simply use the information in the tags, but others look at the actual words.

Noun Phrase Chunking in Tamil

Noun phrase chunking deals with extracting the noun phrases from a sentence. While NP chunking is much simpler than parsing, it is still challenging to build an accurate and efficient NP chunker. The importance of NP chunking derives from its use in many applications. Noun phrases can serve as a preprocessing step before parsing the text: because of the high ambiguity of natural language, exact parsing of the text may become very complex, and chunking can be used as a preprocessing tool to partially resolve these ambiguities. Noun phrases can also be used in information retrieval systems, where chunking allows data to be retrieved from documents on the basis of chunks rather than words; nouns and noun phrases in particular are more useful for retrieval and extraction purposes. Most recent work on machine translation uses texts in two languages (parallel corpora) to derive useful transfer patterns, and noun phrases also have applications in aligning text in parallel corpora: the sentences can be aligned using the chunk information, by relating the chunks in the source and target languages, which is considerably easier than word alignment between the texts of the two languages. Chunked noun phrases can further be used in other applications where in-depth parsing of the data is not necessary [129].

AUKBCRC's Noun Phrase Chunker for Tamil

The approach is rule-based. A corpus is taken and divided into two or more sets, one of which is used as training data. The training data set is manually chunked for noun phrases, thereby evolving rules that can be applied to separate the noun phrases in a sentence; these rules serve as the basis for chunking. The chunker program uses the rules to chunk the test data, and the coverage of the rules is tested on this test set. Precision and recall are calculated, and the results are analyzed to check whether more rules are needed to improve the coverage of the system. If so, additional rules are added and the same process is repeated to check for an increase in precision and recall. The system is then tested on various other applications [130].

Vaanavil of RCILTS-Tamil

Vaanavil identifies the syntactic constituents of a Tamil sentence and gives the parse tree in list form. It handles both simple and complex sentences: simple sentences can have a verb, many noun phrases, and simple adverbs and adjectives, while complex sentences can have multiple adjectival, adverbial and noun clausal forms. For sentences with multiple clauses, Vaanavil groups the clauses syntactically based on cue words and phrases. It makes use of phrase structure grammar, uses look-ahead to handle free word order, handles ambiguity using 15 heuristic rules, and uses the morphological analyzer to obtain the root word [129].

2.6 SUMMARY

This chapter presented the literature survey of linguistic tools and of the available machine translation systems for Indian languages. The literature on linguistic tools such as POS taggers, morphological analyzers and morphological generators was reviewed; most of these tools are built on rule-based methods, and a few are developed using data-driven techniques. In India, machine translation systems have been developed using the direct machine translation approach for closely related language pairs, and some of these systems are very successful and still operational. Statistical machine translation methods are frequently applied to unrelated language pairs; it is therefore concluded that the statistical approach is the most appropriate for unrelated languages.


CHAPTER 3

THEORETICAL BACKGROUND

3.1 GENERAL

Natural Language Processing (NLP) research has a long tradition in European countries, and it has taken giant leaps in the last decade with the introduction of efficient machine learning algorithms and the creation of large annotated corpora for various languages. In a country like India, where more than a thousand languages are in use, NLP is especially relevant. However, NLP research in Indian languages has mainly focused on the development of rule-based techniques because of the lack of annotated corpora. The prerequisites for developing NLP applications in Tamil are the availability of speech corpora, annotated text corpora, parallel corpora, lexical resources and computational models; the sparseness of these resources is one of the major reasons for the slow growth of NLP work in Tamil. Like the processing of other languages, Tamil language processing involves morphological analysis, syntactic analysis and semantic analysis.

3.1.1 Tamil Language

Tamil belongs to the southern branch of the Dravidian languages, a family of around twenty-six languages native to the Indian subcontinent. It flourished in India as a language with a rich literature during the Sangam period (300 BCE to 300 CE). Tamil scholars divide the history of the language into three periods: Old Tamil (300 BCE - 700 CE), Middle Tamil (700 - 1600) and Modern Tamil (1600 - present). Epigraphic attestation of Tamil begins in the Old Tamil period with rock inscriptions from the 3rd century BCE, written in Tamil-Brahmi, an adapted form of the Brahmi script. The earliest extant literary text is the தொல்காப்பியம் (tholkAppiyam), a work on grammar and poetics which describes the language of the classical period. The Sangam literature contains about 50,000 lines of poetry in 2,381 poems attributed to 473 poets, including many women poets [9]. In the Modern Tamil period, i.e., in the early 20th century, the chaste Tamil Movement called for the removal of all Sanskrit and other foreign elements from Tamil. It received support from Dravidian parties and from nationalists who supported Tamil independence, and it led to the replacement of a significant number of Sanskrit loan words by Tamil equivalents.

An important factor specific to Tamil is the existence of two main varieties of the language, colloquial Tamil and formal Tamil செந்தமிழ் (sewthamiz), which are sufficiently divergent that the language is classed as diglossic. Colloquial Tamil is used for most spoken communication, while formal Tamil is spoken in a restricted number of high contexts, such as lectures and news bulletins, and is also used in writing; the two varieties differ in their lexis, morphology and segmental phonology. Tamil is the official language of the Indian state of Tamilnadu and one of the 22 languages under the eighth schedule of the Constitution of India. It is also an official language of the Union Territories of Puducherry and the Andaman & Nicobar Islands, as well as of Sri Lanka and Singapore, and it is widely spoken in Malaysia. Tamil became the first legally recognized classical language of India in the year 2004 [9].

3.1.2 Tamil Grammar

Traditional Tamil grammar consists of five parts, namely எழுத்து (ezuththu), சொல் (sol), பொருள் (poruL), யாப்பு (yAppu) and அணி (aNi). Of these, the last two are applicable mostly to poetry. Table 3.1 gives additional information about these parts. The தொல்காப்பியம் (tholkAppiyam) is the oldest work on the grammar of the Tamil language [132].

Table 3.1 Tamil Grammar Divisions

Division             Meaning    Main grammar books
எழுத்து (ezuththu)     Letter     தொல்காப்பியம் (tholkAppiyam), நன்னூல் (wannUl)
சொல் (sol)            Word       தொல்காப்பியம் (tholkAppiyam), நன்னூல் (wannUl)
பொருள் (poruL)         Meaning    தொல்காப்பியம் (tholkAppiyam)
யாப்பு (yAppu)         Form       yApperungkalAkkArikai
அணி (aNi)             Method     தனியலங்காரம் (thaniyalangkAram)

3.1.3 Tamil Characters

Tamil is written using a script called vattEzuththu. The Tamil script has twelve vowels, uyirezuththu (உயிரெழுத்து) "soul letters"; eighteen consonants, meyyezuththu (மெய்யெழுத்து) "body letters"; and one character, the Aytha ezuththu (ஆய்த எழுத்து) "the hermaphrodite letter", which is classified in Tamil grammar as neither a consonant nor a vowel, though it is often considered part of the vowel set. The script, however, is syllabic, not alphabetic. The complete script therefore consists of the thirty-one letters in their independent forms and an additional 216 compound letters, representing a total of 247 combinations. The compound letters are formed by adding a vowel marker to the consonant. The details of the Tamil vowels are given in Table 3.2. Some vowels require the basic shape of the consonant to be altered in a way that is specific to that vowel; others are written by adding a vowel-specific suffix to the consonant, yet others a prefix, and some vowels require adding both a prefix and a suffix. Table 3.3 lists the vowel letters across the top and the consonant letters along the side; their combinations give all the Tamil compound (uyirmei) letters.

Table 3.2 Tamil Vowels

Short vowel    அ  இ  உ  எ  ஒ
Long vowel     ஆ  ஈ  ஊ  ஏ  ஓ
Diphthong      ஐ  ஔ

In every case, the vowel marker differs from the standalone character for the vowel. The Tamil script is written from left to right. Vowels are also called the 'life' (uyir) or 'soul' letters; they are divided into short (kuril) and long (nedil) vowels, five of each type, and two diphthongs. Tamil compound (uyirmei) letters are formed by adding a vowel marker to the consonant, and there are 216 such compound letters. The Tamil transliteration scheme is given in Appendix A.
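In Unicode, these compound letters are formed mechanically: a "pure" consonant carries the puLLi (virama), which is dropped for the inherent அ, and a dependent vowel sign is appended for the other vowels. The short sketch below illustrates this; the function name and the transliteration keys are chosen for illustration.

# Composing Tamil compound (uyirmei) letters in Unicode.
VIRAMA = "\u0BCD"   # the puLLi
VOWEL_SIGNS = {"A": "\u0BBE", "i": "\u0BBF", "I": "\u0BC0", "u": "\u0BC1",
               "U": "\u0BC2", "e": "\u0BC6", "E": "\u0BC7", "ai": "\u0BC8",
               "o": "\u0BCA", "O": "\u0BCB", "au": "\u0BCC"}

def compound(consonant, vowel):
    base = consonant.replace(VIRAMA, "")       # strip the puLLi
    if vowel == "a":
        return base                            # inherent vowel: no sign
    return base + VOWEL_SIGNS[vowel]

print(compound("க்", "A"))    # -> கா (kA)
print(compound("க்", "ai"))   # -> கை (kai)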


Table 3.3 Tamil Compound Letters

Cons\Vow  அ(a) ஆ(A) இ(i) ஈ(I) உ(u) ஊ(U) எ(e) ஏ(E) ஐ(ai) ஒ(o) ஓ(O) ஔ(au)
க் (k)     க    கா   கி   கீ   கு   கூ   கெ   கே   கை    கொ   கோ   கௌ
ங் (ng)    ங    ஙா   ஙி   ஙீ   ஙு   ஙூ   ஙெ   ஙே   ஙை    ஙொ   ஙோ   ஙௌ
ச் (s)     ச    சா   சி   சீ   சு   சூ   செ   சே   சை    சொ   சோ   சௌ
ஞ் (nj)    ஞ    ஞா   ஞி   ஞீ   ஞு   ஞூ   ஞெ   ஞே   ஞை    ஞொ   ஞோ   ஞௌ
ட் (d)     ட    டா   டி   டீ   டு   டூ   டெ   டே   டை    டொ   டோ   டௌ
ண் (N)     ண    ணா   ணி   ணீ   ணு   ணூ   ணெ   ணே   ணை    ணொ   ணோ   ணௌ
த் (th)    த    தா   தி   தீ   து   தூ   தெ   தே   தை    தொ   தோ   தௌ
ந் (w)     ந    நா   நி   நீ   நு   நூ   நெ   நே   நை    நொ   நோ   நௌ
ப் (p)     ப    பா   பி   பீ   பு   பூ   பெ   பே   பை    பொ   போ   பௌ
ம் (m)     ம    மா   மி   மீ   மு   மூ   மெ   மே   மை    மொ   மோ   மௌ
ய் (y)     ய    யா   யி   யீ   யு   யூ   யெ   யே   யை    யொ   யோ   யௌ
ர் (r)     ர    ரா   ரி   ரீ   ரு   ரூ   ரெ   ரே   ரை    ரொ   ரோ   ரௌ
ல் (l)     ல    லா   லி   லீ   லு   லூ   லெ   லே   லை    லொ   லோ   லௌ
வ் (v)     வ    வா   வி   வீ   வு   வூ   வெ   வே   வை    வொ   வோ   வௌ
ழ் (z)     ழ    ழா   ழி   ழீ   ழு   ழூ   ழெ   ழே   ழை    ழொ   ழோ   ழௌ
ள் (L)     ள    ளா   ளி   ளீ   ளு   ளூ   ளெ   ளே   ளை    ளொ   ளோ   ளௌ
ற் (R)     ற    றா   றி   றீ   று   றூ   றெ   றே   றை    றொ   றோ   றௌ
ன் (n)     ன    னா   னி   னீ   னு   னூ   னெ   னே   னை    னொ   னோ   னௌ

3.1.4 Morphological Richness of Tamil Language

Tamil is an agglutinative language: Tamil words consist of a lexical root to which one or more affixes are attached. Most Tamil affixes are suffixes. Tamil suffixes can be derivational, changing either the part of speech of the word or its meaning, or inflectional, marking categories such as person, number, mood and tense. There is no absolute limit on the length and extent of agglutination, which can lead to long words with a large number of suffixes that would require several words, or a whole sentence, in English.

Tamil is a morphologically rich language in which most of the morphemes attach to the root words in the form of suffixes. In the noun class, suffixes perform the functions of case, the plural marker, euphonic increments and postpositions; Tamil verbs are inflected for tense, person, number, gender, mood and voice. Other features of the language include the use of the plural for honorific nouns, frequent echo words, and the null-subject property, i.e., not all sentences have an explicit subject. Computationally, each root word can take more than ten thousand inflected word forms, of which only a few hundred will occur in a typical corpus [129]. Tamil is a consistently head-final language: the verb comes at the end of the clause, with a typical word order of Subject-Object-Verb (SOV). However, Tamil allows the word order to be changed, making it a relatively free word-order language. In Tamil, subject-verb agreement is required for the grammaticality of a sentence.
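A single hand-segmented example makes this agglutination concrete. The segmentation and glosses below are written out by hand for illustration; the surface word also reflects sandhi at the morpheme boundaries.

# One inflected Tamil word, வீடுகளிலிருந்து ("from the houses"),
# decomposed by hand into root + suffixes.
word = "வீடுகளிலிருந்து"
morphemes = ["வீடு", "கள்", "இல்", "இருந்து"]   # house + plural + locative + ablative
gloss = ["house", "PL", "LOC", "ABL"]
for m, g in zip(morphemes, gloss):
    print(m, "\t", g)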

3.1.5 Challenges in Tamil NLP

Many issues make Tamil language processing difficult; they relate to the problems of representation and interpretation. Language computing requires a precise representation of context, and since natural languages are highly ambiguous and vague, achieving such representations is very hard. The various sources of ambiguity in Tamil are described below.

3.1.5.1 Ambiguity in Morphemes

Tamil morphemes are ambiguous both in their grammatical category and in the position they take in a word construction.

Ambiguity in a morpheme's grammatical category: a morpheme can have more than one grammatical category. For example, the morphemes அது (athu), அன (ana) and து (thu) can occur either as nominalizing suffixes or as third person neuter suffixes.

Ambiguity in a morpheme's position: the position at which a morpheme is suffixed also leads to ambiguity. Table 3.4 gives a few examples of morphemes and their possible grammatical features.

Table 3.4 Ambiguity in Morpheme's Position

Table 3.4 Ambiguity in Morpheme’s Position

Morpheme

Possible Grammatical Features

அ (a)

Infinitive

Relative Participle

கல் (kal)

Root

Nominal Suffix

ஆக (Aka)

Benefactive

Adverbial Suffix

த் (th)

Sandhi

Tense

ெசய் (sey)

Root

Auxiliary Root

3.1.5.2 Ambiguity in Word Class

A word may be ambiguous in its part of speech, or word class, and may therefore have more than one interpretation. For example, the word படி (padi) can belong to the noun class or to the verb class; such ambiguity has to be resolved by referring to the context.

padi - study (V) or step (N)

கீழே படி உள்ளது, கவனமாக செல்லவும். - step (N)

தினமும் பாடங்களை படி என ஆசிரியை மாணவர்களிடம் கூறினார். - study (V)

3.1.5.3 Ambiguity in Word Sense

Even when a word belongs to a specific grammatical category, it may still be ambiguous in sense. For instance, the Tamil word காட்டு (kAddu) has 11 senses in the noun class and 18 senses in the verb class [kiriyAvin tharkAla Tamil akarAthi, 2006] [133]. For example, the following sentence has two different meanings:

அவன் பாடல் கேட்டான்.
(He heard the song.)
(He asked for the song.)

3.1.5.4 Ambiguity in Sentence

A sentence may be ambiguous even if its words are not. For example, the following sentence has two interpretations:

"நான் ஒரு அழகான பெண்ணையும் ஆணையும் பார்த்தேன்"
(I saw a beautiful woman and a man.)
(I saw a beautiful woman and a beautiful man.)

The words are not ambiguous, but the sentence is.

3.2 MORPHOLOGY

Morphology is the field within linguistics that studies the internal structure of words. While words are generally accepted as the smallest units of syntax, it is clear that in most (if not all) languages words can be related to other words by rules. Morphology is the branch of linguistics that studies patterns of word formation within and across languages, and it attempts to formulate rules that model the knowledge of the speakers of those languages.

3.2.1 Types of Morphology

Morphology is traditionally classified into three main divisions: inflection, derivation and compounding. Inflectional morphology deals with the formation of the different forms in the paradigm of a lexeme: words undergo a change in form to express some grammatical function, but their syntactic category remains unchanged. Many inflectional features appear on words to express agreement (in person, number and gender) as well as case, aspect, mood and tense.

Derivational morphology is concerned with the creation of a new lexeme via affixation. In English, word formation through derivation involves two types of affixation: prefixation, which means placing a morpheme before a word, e.g. un-happy; and suffixation, which means placing a morpheme after a word, e.g. happi-ness. Derivation poses a problem for translation in that not all derived words have a straightforward compositional translation as derived words. In English, for example, the same meaning can be expressed by different affixes, and the same affix can have more than one meaning. This can be exemplified by the suffix -er, which can express the agent, as in player and singer, but can also describe instruments, as in mixer and cooker. An affix can thus have a range of equivalents in the target language, and any attempt to set up one-to-one correspondences for affixes will be greatly misguided.

Compounding morphology is the process of forming a new word by combining two or more words. Compounding involves combining complete word forms into a single compound form: dog catcher is a compound, because both dog and catcher are complete word forms in their own right before the compounding process is applied, and they are subsequently treated as one form. An important notion in compounding is that of the head: a compound noun is divided into a head and one or more modifiers. For instance, in the compound noun watchtower, tower is the head and watch the modifier.

3.2.2 Lexemes

A lexical database is organized around lexemes, which cover all the morphemes of a language. A lexeme is conventionally listed in a dictionary as a separate entry. Generally, a lexeme corresponds to the set of forms taken by a single word; for example, in English, run, runs, ran and running are forms of the same lexeme "run".

3.2.3 Lemma and Stems

A lemma in morphology is the canonical form of a lexeme. In lexicography, this unit is usually the citation form or headword by which it is indexed. Lemmas have special significance in highly inflected languages such as Tamil, and the process of determining the lemma for a given word is called lemmatization. A stem is the part of the word that never changes, even when morphologically inflected, whilst a lemma is the base form. For example, for the word "produced" the lemma is "produce", but the stem is "produc-", because of related words such as "production". In linguistic analysis, the stem is defined more generally as the analyzed base form from which all inflected forms can be formed. When phonology is taken into account, defining the stem as the unchangeable part of the word is not useful, as can be seen in the phonological forms of the words in the preceding example: "produced" vs. "production".
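The contrast can be shown with a toy example; the suffix rules and the lemma table below are deliberately minimal stand-ins for a real stemmer and lemmatizer.

# Toy contrast between stemming (cutting to an invariant substring) and
# lemmatization (mapping to the canonical dictionary form).
def toy_stem(word):
    for suffix in ("tion", "ed", "e"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

LEMMAS = {"produced": "produce", "production": "produce"}  # tiny hand-built map

for w in ("produced", "production"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", LEMMAS[w])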

3.2.4 Inflections and Word Forms

Given the notion of a lexeme, it is possible to distinguish two kinds of morphological rules: rules that relate different forms of the same lexeme, called inflectional rules, and rules that relate two different lexemes, called word-formation rules. The English plural, as illustrated by dog and dogs, is an inflectional rule; compounds like dog-catcher or dishwasher provide examples of word-formation rules. Informally, word-formation rules form "new words" (that is, new lexemes), while inflection rules yield variant forms of the "same" word (lexeme). Derivation involves affixing bound (non-independent) forms to existing lexemes, whereby the addition of the affix derives a new lexeme. One clear example of derivation is the word independent, which is derived from the word dependent by prefixing it with the derivational prefix in-, while dependent itself is derived from the verb depend.

3.2.5 Morphemes and Types

A morpheme is the minimal meaningful unit in a word. The concepts of word and morpheme are distinct: a morpheme may or may not stand alone, and one or several morphemes compose a word.

• Free morphemes, like town and dog, can appear with other lexemes (as in town hall or dog house) or can stand alone, i.e. "free".

• Bound morphemes, like "un-", appear only together with other morphemes to form a lexeme. Bound morphemes in general tend to be prefixes and suffixes.


• Derivational morphemes can be added to a word to create (derive) another word: for example, the addition of "-ness" to "happy" gives "happiness". They carry semantic information.

• Inflectional morphemes modify a word's tense, number, aspect and so on, without deriving a new word or a word in a new grammatical category (as when the morpheme "dog", written with the plural marker morpheme "-s", becomes "dogs"). They carry grammatical information.

Agglutinative languages have words containing several morphemes that are always clearly differentiable from one another, in that each morpheme represents only one grammatical meaning and the boundaries between morphemes are easily demarcated. The bound morphemes are affixes, and they may be individually identified. Agglutinative languages tend to have a high number of morphemes per word, and their morphology is highly regular [134].

3.2.6 Allomorphs

One of the largest sources of complexity in morphology is that the one-to-one correspondence between meaning and form scarcely applies to every case in a language. English has word-form pairs like ship/ships, ox/oxen, goose/geese and sheep/sheep, where the difference between the singular and the plural is signalled in a way that departs from the regular pattern, or is not signalled at all. Even cases considered "regular", with the final -s, are not so simple: the -s in dogs is not pronounced the same way as the -s in cats, and in a plural like dishes an "extra" vowel appears before the -s. These cases, where the same distinction is effected by alternative forms of a "word", are called allomorphs.
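A toy rule for choosing between English plural allomorphs illustrates the point; the suffix rule and the irregular table below are simplified illustrations, not a full account of English plural phonology.

# Choosing between English plural allomorphs: "-es" appears after
# sibilant-final words (as in "dishes"), "-s" elsewhere; irregulars such as
# ox/oxen or goose/geese must simply be listed, since no rule covers them.
IRREGULAR = {"ox": "oxen", "goose": "geese", "sheep": "sheep"}
SIBILANT_ENDINGS = ("s", "sh", "ch", "x", "z")

def pluralize(noun):
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    if noun.endswith(SIBILANT_ENDINGS):
        return noun + "es"
    return noun + "s"

for n in ("dog", "dish", "ox", "goose"):
    print(n, "->", pluralize(n))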

3.2.7 Morpho-Phonemics

Morpho-phonology, or morpho-phonemics, studies the phonemic changes that occur when one morpheme is inflected with another. This phenomenon is called 'sandhi' in Tamil. Sandhi occurs very frequently in Tamil and must be taken care of when building morphological analyzers or generators. For instance, the noun root பூ 'pU' (flower), when pluralized, becomes 'pUkkaL' instead of 'pUkaL': when the root is monosyllabic, ends with a long vowel, and the following morpheme starts with a vallinam consonant, the consonant geminates. Sandhi changes can occur between two morphemes or between two words. Although sandhi rules mostly depend on the phonemic properties of the morphemes, they sometimes depend on the grammatical relations of the words on which they operate: gemination may be invalid when the words are in a subject-predicate relation but valid if they are in a modifier-modified relation. Sandhi changes can occur in four different ways: gemination, insertion, deletion and modification. Gemination is a case of insertion in which a vallinam consonant doubles itself. In general, insertion happens when new characters are inserted between words or morphemes; deletion happens when existing characters at the end of the first word or at the start of the second word are dropped; and modification happens when characters are replaced by other characters with close phonological properties.
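The gemination rule just described can be written as a small string-rewriting function. The transliterated vowel and consonant sets below are illustrative samples, and the sketch implements only this one rule; real Tamil sandhi, as noted above, involves several other change types.

# One sandhi rule, in transliteration: when a root ending in a long vowel
# meets a suffix beginning with a vallinam stop, the stop geminates
# (pU + kaL -> pUkkaL).
LONG_VOWELS = set("AIUEO")        # transliterated long vowels (sample)
VALLINAM = set("kcdtpR")          # transliterated vallinam stops (sample)

def join(root, suffix):
    if root[-1] in LONG_VOWELS and suffix[0] in VALLINAM:
        return root + suffix[0] + suffix       # gemination: double the stop
    return root + suffix

print(join("pU", "kaL"))     # -> pUkkaL
print(join("maram", "kaL"))  # -> maramkaL (no gemination; real sandhi differs here)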

3.2.8 Morphotactics

The morphemes of a word cannot occur in random order. In every language, there are well-defined ways to sequence the morphemes. The morphemes can be divided into a number of classes, and the morpheme sequences are normally defined in terms of sequences of classes. For instance, in Tamil, the case morphemes follow the number morpheme in noun constructions; for example, க்கைள (stem + கள் + ஐ). The other way around, ஐக்கள் (stem + ஐ + கள்), is invalid. The order in which morphemes follow each other is strictly governed by a set of rules called morphotactics. In Tamil, these rules play a very important role in word construction and derivation, as the language is agglutinative and words are formed by long sequences of morphemes. Rules of morphotactics also serve to disambiguate morphemes that occur in more than one class of morphemes. The analyzer uses these rules to identify the structure of words.
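As an illustration of how morphotactic ordering can be enforced mechanically, the sketch below validates morpheme-class sequences against a small transition table; the class inventory (stem, plural, case) is a simplified assumption, not the full rule set of the analyzer described later.

```python
# Sketch: morphotactics as allowed transitions between morpheme classes.
# The inventory (stem -> plural -> case) is a simplified assumption.
ALLOWED = {
    "START": {"stem"},
    "stem": {"plural", "case", "END"},
    "plural": {"case", "END"},
    "case": {"END"},
}

def is_valid(classes):
    """Return True if the morpheme-class sequence obeys the morphotactics."""
    state = "START"
    for c in classes:
        if c not in ALLOWED[state]:
            return False
        state = c
    return "END" in ALLOWED[state]

print(is_valid(["stem", "plural", "case"]))  # True  (e.g. stem + kaL + ai)
print(is_valid(["stem", "case", "plural"]))  # False (case before plural)
```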


3.3 MACHINE LEARNING FOR NLP

3.3.1 Machine Learning

Machine learning deals with techniques that allow computers to learn automatically and make accurate predictions based on past observations. The major focus of machine learning is to extract information from data automatically, using computational and statistical methods. Machine learning techniques are used for solving various tasks of Natural Language Processing, including speech recognition, document categorization, document segmentation, part-of-speech tagging, word-sense disambiguation, named entity recognition, parsing, machine translation and transliteration.

There are two main tasks involved in machine learning: learning/training and prediction. The system is given a set of examples called training data. The primary goal is to automatically acquire an effective and accurate model from the training data. The training data provides the domain knowledge, i.e., the characteristics of the domain from which the examples are drawn. This is a typical task for inductive learning and is usually called concept learning or learning from examples. The larger the amount of training data, the better the model will usually be. The second phase of machine learning is prediction, wherein a set of inputs is mapped to the corresponding target values. The main challenge of machine learning is to create a model with good prediction performance on the test data, i.e., a model that generalizes well to unknown data.

Machine learning algorithms are categorized based on the desired outcome of the algorithm. Types of machine learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning and transduction [135]. In supervised learning, the target function is completely specified by the training data: there is a label associated with each example. If the label is discrete, the task is called classification; for real-valued labels, the task becomes a regression problem. Based on the examples in the training data, the label for a new case is predicted. Hence, learning is not only a question of remembering but also of generalizing to unseen cases. Any change in the learning system can be seen as acquiring some kind of knowledge. So, depending on what the system learns, learning is categorized as follows (a small supervised example follows the list).

• Model learning: The system predicts values of an unknown function. This is called prediction and is a task well known in statistics. If the function is discrete, the task is called classification; for continuous-valued functions it is called regression.

• Concept learning: The system acquires descriptions of concepts or classes of objects.

• Explanation-based learning: Using traces (explanations) of correct (or incorrect) performances, the system learns rules for more efficient performance of unseen tasks.

• Case-based (exemplar-based) learning: The system memorizes cases (exemplars) of correctly classified data or correct performances and learns how to use them (e.g. by making analogies) to process unseen data.
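As a concrete instance of supervised model learning with discrete labels (classification), the following minimal sketch trains a classifier on labeled examples and predicts the label of an unseen case; the toy feature vectors and labels are invented for illustration.

```python
# Toy supervised classification with scikit-learn: learn from labeled
# training examples, then predict the label of a new instance.
from sklearn.tree import DecisionTreeClassifier

X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]  # feature vectors (toy data)
y_train = [0, 1, 1, 0]                       # discrete labels -> classification

clf = DecisionTreeClassifier().fit(X_train, y_train)  # training phase
print(clf.predict([[1, 0]]))                 # prediction on an unseen case
```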

3.3.2 Support Vector Machines

The Support Vector Machine (SVM) represents a comparatively recent approach to supervised pattern classification which has been successfully applied to a wide range of pattern recognition problems. SVM is attractive as a supervised machine learning technology because it rests on an extremely well developed learning theory, namely statistical learning theory. SVM is based on strong mathematical foundations and results in simple yet very powerful algorithms.

A simple way to build a binary classifier is to construct a hyperplane separating class members from non-members in the input space. Unfortunately, most real-world problems involve non-separable data, for which there does not exist a hyperplane that successfully separates the class members from non-class members in the training set. One solution to the inseparability is to map the data into a higher-dimensional space and define a separating hyperplane in that space. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training examples. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable. However, translating the training set into a higher-dimensional space incurs both computational and learning-theoretic costs: representing the feature vectors corresponding to the training set can be extremely expensive in terms of memory and time, and artificially separating the data in this way exposes the learning system to the risk of finding trivial solutions that overfit the data. Support Vector Machines elegantly sidestep both difficulties [136].

Support Vector Machines avoid overfitting by choosing a specific hyperplane among the many that can separate the data in the feature space: they find the maximum margin hyperplane, the hyperplane that maximises the minimum distance from the hyperplane to the closest training point. The maximum margin hyperplane can be represented as a linear combination of training points. Consequently, the decision function for classifying points with respect to the hyperplane involves only dot products between points. Furthermore, the algorithm that finds a separating hyperplane in the feature space can be stated entirely in terms of vectors in the input space and dot products in the feature space. Thus, a support vector machine can locate a separating hyperplane in the feature space and classify points in that space without ever representing the space explicitly, simply by defining a function, called a kernel function, that plays the role of the dot product in the feature space. This technique avoids the computational burden of explicitly representing the feature vectors.

Another appealing feature of SVM classification is the sparseness of its representation of the decision boundary. The location of the separating hyperplane in the feature space is specified via real-valued weights on the training-set examples. Those training examples that lie far away from the hyperplane do not participate in its specification and therefore receive weights of zero. Only the training examples that lie close to the decision boundary between the two classes receive nonzero weights. These training examples are called the support vectors, since removing them would change the location of the separating hyperplane. All the information about classification in the training samples can be considered to be represented by these support vectors. In a typical case, the number of support vectors is quite small compared to the total number of training samples.

The maximum margin allows the SVM to select among multiple candidate hyperplanes. However, for many data sets, the SVM may not be able to find any separating hyperplane at all, either because the kernel function is inappropriate for the training data or because the data contains mislabeled examples. The latter problem can be addressed by using a soft margin that accepts some misclassifications of the training examples. A soft margin can be obtained in two different ways. The first is to add a constant factor to the kernel function output whenever the given input vectors are identical. The second is to define a priori an upper bound on the size of the training-set weights. In either case, the magnitude of the constant factor added to the kernel, or of the bound on the size of the weights, controls the number of training points that the system misclassifies, and the setting of this parameter depends on the specific data at hand. Completely specifying a support vector machine therefore requires two parameters: the kernel function and the magnitude of the penalty for violating the soft margin.

Thus, a support vector machine finds a nonlinear decision function in the input space by mapping the data into a higher-dimensional feature space and separating it there by means of a maximum margin hyperplane. The computational complexity of the classification operation does not depend on the dimensionality of the feature space, which can even be infinite. Overfitting is avoided by controlling the margin. The separating hyperplane is represented sparsely as a linear combination of points; the system automatically identifies a subset of informative points and uses them to represent the solution. Finally, the training algorithm solves a simple convex optimization problem. All these features make SVMs an attractive classification system.

3.3.3 Geometrical Interpretation of SVM

Typically, the machine is presented with a set of training examples (xi, yi), where the xi are the real-world data instances and the yi are the labels indicating which class each instance belongs to. For the two-class pattern recognition problem, yi = +1 or yi = -1. A training example (xi, yi) is called positive if yi = +1 and negative otherwise. SVMs construct a hyperplane that separates the two classes (this can be extended to multi-class problems). While doing so, the SVM algorithm tries to achieve maximum separation between the classes. Separating the classes with a large margin minimizes a bound on the expected generalization error [137]. 'Minimum generalization error' means that when new examples (data points with unknown class values) arrive for classification, the chance of making an error in predicting the class they belong to, based on the learned classifier (hyperplane), should be minimal. Intuitively, such a classifier is one which achieves the maximum separation margin between the classes.

Figure 3.1 illustrates the concept of 'maximum margin'. The two planes parallel to the classifier which pass through one or more points in the data set are called 'bounding planes'. The distance between these bounding planes is called the 'margin', and SVM 'learning' means finding a hyperplane which maximizes this margin. The points (in the dataset) falling on the bounding planes are called 'support vectors'. These points play a crucial role in the theory, and hence the name support vector machines ('machine' here means algorithm). Vapnik (1998) has shown that if the training vectors are separated without errors by an optimal hyperplane, the expected error rate on a test sample is bounded by the ratio of the expected number of support vectors to the number of training vectors. Since this ratio is independent of the dimension of the problem, if one can find a small set of support vectors, good generalization is guaranteed [136].

Figure 3.1 Maximum Margin and Support Vectors (the figure marks the support vectors lying on the bounding planes and the maximum margin between them)

In the case wherein the data points are as shown in Figure 3.2, one may simply minimize the number of misclassifications whilst maximizing the margin with respect to the correctly classified examples. In such a case it is said that the SVM training algorithm allows a training error. There may be another situation wherein the points are clustered such that the two classes are not linearly separable, as shown in Figure 3.3; that is, if one tries for a linear classifier, it may have to tolerate a large training error. In such cases, one prefers a non-linear mapping of the data into some higher-dimensional space, called the 'feature space' F, where it is linearly separable. In order to distinguish between these two spaces, the original space of data points is called the 'input space'. The hyperplane in the feature space corresponds to a highly non-linear separating surface in the original input space; hence the classifier is called a non-linear classifier.

Figure 3.2 Training Errors in Support Vector Machine


Figure 3.3 Non-linear Classifier

The process of mapping the data into a higher-dimensional space involves heavy computation, especially when the data itself is of high dimension. However, there is no need to do any explicit mapping to the higher-dimensional space for finding the hyperplane classifier: all computations can be done in the input space itself [138].

3.3.4 SVM Formulation

Notation used:

$m$ = number of data points in the training set

$n$ = number of features (variables) in the data

$x_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$, an $n$-dimensional vector representing a data point in the "input space"

$d_i = D_{ii}$ = target value of the $i$-th data point; it takes the value $+1$ or $-1$

$d = (d_1, d_2, \ldots, d_m)^T$, the vector representing the target values of the $m$ data points

$D = \mathrm{diag}(d)$, the $m \times m$ diagonal matrix with $d_1, d_2, \ldots, d_m$ on its diagonal

$w = (w_1, w_2, \ldots, w_n)^T$, the weight vector orthogonal to the hyperplane $w_1 x_1 + w_2 x_2 + \cdots + w_n x_n - \gamma = 0$, where $\gamma$ is a scalar generally known as the bias term

$A$ = the $m \times n$ matrix whose $i$-th row is $x_i^T$; the $m \times m$ matrix $AA^T$, whose $(i,j)$-th entry is $x_i^T x_j$, is called the linear kernel of the dataset

$\phi(\cdot)$ = a nonlinear mapping function that maps an input vector $x$ into a high-dimensional feature vector $\phi(x)$

$K$ = the $m \times m$ matrix whose $(i,j)$-th entry is $\phi(x_i)^T \phi(x_j)$; $K$ is called the non-linear kernel of the input dataset

$Q$ = the $m \times m$ matrix whose $(i,j)$-th element is $d_i d_j \phi(x_i)^T \phi(x_j)$, i.e. $Q = K \mathbin{.\!*} (d\, d^T)$, where $.*$ represents element-wise multiplication

From the geometric point of view, the support vector machine constructs an optimal hyperplane $w^T x - \gamma = 0$ between the two classes of examples. The free parameters are the weight vector $w$, which is orthogonal to the hyperplane, and the threshold value $\gamma$. The aim is to find maximally separated bounding planes $w^T x - \gamma = 1$ and $w^T x - \gamma = -1$ such that data points with $d = -1$ satisfy the constraint $w^T x - \gamma \le -1$ and data points with $d = +1$ satisfy $w^T x - \gamma \ge 1$. The perpendicular distance of the bounding plane $w^T x - \gamma = 1$ from the origin is $|{-\gamma + 1}| / \|w\|$, and the perpendicular distance of the bounding plane $w^T x - \gamma = -1$ from the origin is $|{-\gamma - 1}| / \|w\|$. The margin between the optimal hyperplane and a bounding plane is $1 / \|w\|$, and so the distance between the bounding hyperplanes is $2 / \|w\|$. The learning problem is then formulated as the optimization problem given below.

Minimize $\frac{1}{2}\|w\|^2$ subject to $D_{ii}(w^T x_i - \gamma) \ge 1$, $i = 1, \ldots, m$.

The 'training of the SVM' consists of finding $w$ and $\gamma$, given the matrix of data points $A$ and the corresponding class vector $d$. Once $w$ and $\gamma$ are obtained, the decision boundary is $w^T x - \gamma = 0$ and the decision function is given by $f(x) = \mathrm{sign}(w^T x - \gamma)$; that is, for a new point $x$, the sign of $w^T x - \gamma$ is assigned as the class value. The problem is easily solved in terms of its Lagrangian dual variables.
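Library SVM implementations solve exactly this optimization (via its dual). The sketch below, on invented toy data, exposes the pieces named in the text: the kernel function, the soft-margin penalty C, the support vectors, and the sign of the decision function.

```python
# SVM classification sketch: kernel function + soft-margin penalty C.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))                 # toy input-space data
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)   # not linearly separable

clf = SVC(kernel="rbf", C=1.0).fit(X, y)     # kernel + soft-margin parameter
print("support vectors:", clf.support_vectors_.shape[0], "of", len(X))
print("prediction:", clf.predict([[0.5, 0.5]]))  # sign of decision function
```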

3.4 VARIOUS APPROACHES FOR POS TAGGING

There are different approaches to POS tagging. Figure 3.4 illustrates the different POS tagging models. Most tagging algorithms fall into one of two classes: rule-based taggers or stochastic taggers.

3.4.1 Supervised POS Tagging

The supervised POS tagging models require pre-tagged corpora, which are used for training to learn rule sets, information about the tagset, word-tag frequencies, etc. The learning tool generates trained models along with the statistical information. The performance of the models generally increases with the size of the pre-tagged corpus.

Figure 3.4 Classification of POS Tagging Models (the figure divides POS tagging into supervised and unsupervised models; each branch further splits into rule-based (e.g. Brill), stochastic (n-gram based, Maximum Likelihood, Hidden Markov Models with the Viterbi and Baum-Welch algorithms) and neural approaches)

3.4.2 Unsupervised POS Tagging

Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus. Instead, they use advanced computational methods such as the Baum-Welch algorithm to automatically induce tagsets, transformation rules, etc. Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation-based systems.

3.4.3 Rule based POS Tagging

The rule-based POS tagging models apply a set of hand-written rules and use contextual information to assign POS tags to the words in a sentence. These rules are often known as context frame rules. For example, a context frame rule might say: "If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective." The transformation-based approaches, on the other hand, use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training. Some models also use information about capitalization and punctuation, the usefulness of which is largely dependent on the language being tagged.

The earliest algorithms for automatically assigning parts of speech were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to bring this list down to a single part of speech for each word [139]. The ENGTWOL tagger [140] is based on the same two-stage architecture, although both the lexicon and the disambiguation rules are much more sophisticated than in the early algorithms. The ENGTWOL lexicon is based on two-level morphology. It has about 56,000 entries for English word stems, counting a word with multiple parts of speech (e.g. the nominal and verbal senses of hit) as separate entries, and of course not counting inflected and many derived forms. Each entry is annotated with a set of morphological and syntactic features. In the first stage of the tagger, each word is run through the two-level lexicon transducer and the entries for all possible parts of speech are returned.

3.4.4 Stochastic POS Tagging

A stochastic approach makes use of frequency, probability or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text. The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of the language. An alternative to the word-frequency approach, known as the n-gram approach, calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes; these are known as the unigram, bigram and trigram models. The most common algorithm for implementing an n-gram approach for tagging new text is the Viterbi algorithm, a search algorithm that avoids the polynomial expansion of a breadth-first search by trimming the search tree at each level using the best m Maximum Likelihood Estimates (MLE), where m represents the number of tags of the following word (a minimal sketch is given after the list of advantages below). The advantages of the statistical approach are:

• Very robust; can process any input string
• Training is automatic and very fast
• Can be retrained for different corpora/tagsets without much effort
• Language independent
• Minimizes human effort and human error
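The sketch below decodes the best tag sequence for a toy bigram HMM; the tiny transition and emission probabilities are invented for illustration and would normally be estimated from a tagged corpus.

```python
# Viterbi decoding for a bigram HMM POS tagger (toy probabilities).
TAGS = ["DT", "NN", "VB"]
trans = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.3, ("<s>", "VB"): 0.1,
         ("DT", "NN"): 0.9, ("DT", "VB"): 0.05, ("DT", "DT"): 0.05,
         ("NN", "VB"): 0.6, ("NN", "NN"): 0.3, ("NN", "DT"): 0.1,
         ("VB", "DT"): 0.5, ("VB", "NN"): 0.4, ("VB", "VB"): 0.1}
emit = {("DT", "the"): 0.9, ("NN", "dog"): 0.4, ("NN", "barks"): 0.1,
        ("VB", "barks"): 0.5}

def viterbi(words):
    # best[t] = (probability, path) of the best tag sequence ending in tag t
    best = {t: (trans[("<s>", t)] * emit.get((t, words[0]), 1e-6), [t])
            for t in TAGS}
    for w in words[1:]:
        best = {t: max(((p * trans[(prev, t)] * emit.get((t, w), 1e-6),
                         path + [t]) for prev, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["the", "dog", "barks"]))  # ['DT', 'NN', 'VB']
```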

3.4.5 Other Techniques

Apart from these, a few different approaches to tagging have been developed.

Support Vector Machines: a powerful machine learning method used for various applications in NLP and in other areas like bio-informatics and data mining.

Neural Networks: potential candidates for the classification task, since they learn abstractions from examples [141].

Decision Trees: classification devices based on hierarchical clusters of questions. They have been used for natural language processing tasks such as POS tagging; "Weka" can be used for classifying the ambiguous words [141].

Maximum Entropy Models: these avoid certain problems of statistical interdependence and have proven successful for tasks such as parsing and POS tagging.

Example-Based Techniques: these find the training instance that is most similar to the current problem instance and assume the same class for the new problem instance as for the similar one.

3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL ANALYZER

3.5.1 Two level Morphological Analysis

Koskenniemi (1985) [26] describes two-level morphology as a "general, language independent framework which has been implemented for a host of different languages (Finnish, English, Russian, Swedish, German, Swahili, Danish, Basque, Estonian, etc.)". It consists of two representations and one relation.

The surface representation of a word-form: this is the actual spelling of the final valid word. For example, the English words eating and swimming are both surface representations.

The lexical (also called morphophonemic) representation of a word-form: this shows a simple concatenation of base forms and tags. Consider the following examples showing the lexical and surface forms of English words.

Lexical Form           Surface Form
talk + Verb            talk
walk + Verb + 3PSg     walks
eat + Verb + Prog      eating
swim + Verb + Prog     swimming

It may be noted that the lexical representation (or form) is often invariant or constant. In contrast, affixes and bases of the surface form tend to have alternating shapes. This can be seen in the above examples: the same tag "+Verb +Prog" is used with both eat and swim, but swim is realized as swimm in the context of ing, while eat shows no alternation in the context of ing. The rule component consists of rules which map the two representations to each other; each rule is described through a Finite-State Transducer (FST). Figure 3.5 schematically depicts two-level morphology.

Figure 3.5 Two Level Morphology
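To make the lexical-to-surface mapping concrete, the toy generator below imitates the alternations in the table above (gemination for swim, none for eat); it is a plain-Python stand-in for the finite-state transducers of two-level morphology, not an FST implementation.

```python
# Toy lexical -> surface generator imitating two-level alternations.
def generate(lemma: str, tags: list[str]) -> str:
    surface = lemma
    if "Prog" in tags:
        # gemination rule: CVC stems double the final consonant before -ing
        if (len(surface) >= 3 and surface[-1] not in "aeiou"
                and surface[-2] in "aeiou" and surface[-3] not in "aeiou"):
            surface += surface[-1]            # swim -> swimm
        surface += "ing"                      # eat -> eating, swimm -> swimming
    elif "3PSg" in tags:
        surface += "s"                        # walk -> walks
    return surface

print(generate("eat", ["Verb", "Prog"]))   # eating
print(generate("swim", ["Verb", "Prog"]))  # swimming
print(generate("walk", ["Verb", "3PSg"]))  # walks
```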

3.5.2 Unsupervised Morphological Analyzer

The task of unsupervised learning of morphology is defined as follows. Input: raw (un-annotated, non-selective) natural language text data. Output: a description of the morphological structure (at various levels of detail) of the language of the input text. Some approaches have explicit or implicit biases towards certain kinds of languages; they are nevertheless considered to be unsupervised learning of morphology. Morphology may be narrowly taken to include only derivational and grammatical affixation, where the number of affixations a root may take is finite and the order of affixation may not be permuted. A number of approaches focus on concatenative morphology/compounding only. All the works considered are designed to function on orthographic words, i.e., raw text data in an orthography that segments at the word level.


3.5.3 Memory based Morphological Analysis

The memory-based learning approach models the morphological analysis (including compounding) of complex word-forms as a sequence of classification tasks. MBMA (Memory-Based Morphological Analysis) is a memory-based learning system (Stanfill and Waltz, 1986) [142]. Memory-based learning is a class of inductive, supervised machine learning algorithms that learn by storing examples of a task in memory. Computational effort is invested on a "call-by-need" basis for solving new examples (henceforth called instances) of the same task. When new instances are presented to a memory-based learner, it searches for the best-matching instances in memory according to a task-dependent similarity metric. When it has found the best matches (the nearest neighbors), it transfers their solution (classification, label) to the new instance.

3.5.4 Stemmer based Approach

A stemmer uses a set of rules containing a list of stems and replacement rules to strip off affixes. It is a program-oriented approach in which the developer has to specify all possible affixes together with replacement rules. The Porter algorithm is one of the most widely used stemming algorithms, and it is freely available. The advantage of the stemmer approach is that it is well suited to building Morphological Analyzers and Generators for highly agglutinative languages like the Dravidian languages.

3.5.5 Suffix Stripping based Approach

For highly agglutinative languages such as the Dravidian languages, a Morphological Analyzer and Generator can be successfully built using the suffix stripping approach. The advantage of the Dravidian languages is that no prefixes or circumfixes exist for words: words are usually formed by serially adding suffixes to the root word. This property makes them well suited to a suffix-stripping-based Morphological Analyzer and Generator. Once a suffix is identified, the stem of the whole word can be obtained by removing that suffix and applying the proper orthographic (sandhi) rules. Using a set of dictionaries, such as a stem dictionary and a suffix dictionary, together with morphotactics and sandhi rules, a suffix stripping algorithm successfully implements a MAG, as the sketch below illustrates.
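This is a minimal sketch of the suffix-stripping idea on romanized Tamil; the tiny stem and suffix dictionaries, the glosses and the longest-match strategy are illustrative assumptions rather than the actual MAG developed in this work.

```python
# Suffix-stripping morphological analysis sketch (romanized Tamil).
STEMS = {"pU": "flower", "maram": "tree"}
SUFFIXES = {"kkaL": "<plural>", "kaL": "<plural>", "ai": "<accusative>"}

def analyze(word):
    """Strip known suffixes right-to-left (longest match first),
    then look up the remaining stem in the stem dictionary."""
    morphs = []
    while True:
        for suf in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suf) and word != suf:
                morphs.insert(0, SUFFIXES[suf])
                word = word[: -len(suf)]
                break
        else:
            break  # no more suffixes match
    return (word, morphs) if word in STEMS else None

print(analyze("pUkkaL"))    # ('pU', ['<plural>'])  -- sandhi form kkaL
print(analyze("pUkkaLai"))  # ('pU', ['<plural>', '<accusative>'])
```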


3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION

Since the first idea of using machines for language translation, many different approaches to machine translation have been proposed, implemented and put into use. The main approaches to machine translation are:

• Linguistic or Rule Based Approaches
  o Direct Approach
  o Interlingua Approach
  o Transfer Approach

• Non-Linguistic Approaches
  o Dictionary Based Approach
  o Corpus Based Approach
  o Example Based Approach
  o Statistical Approach

• Hybrid Approach

The direct, interlingua and transfer approaches are linguistic approaches, which require some sort of linguistic knowledge to perform translations, whereas the dictionary-based, example-based and statistical approaches fall under the non-linguistic approaches, which do not require any linguistic knowledge to translate sentences. The hybrid approach is a combination of both linguistic and non-linguistic approaches.

3.6.1 Linguistic or Rule Based Approaches

Rule-based approaches require a lot of linguistic knowledge during translation. They use grammar rules and computer programs to analyse the text, determining grammatical information and features for every word in the source language, and translate it by replacing each word with a lexicon entry or word that has the same context in the target language. The rule-based approach is the principal methodology that was first developed in machine translation. Linguistic knowledge is required in order to write the rules for this type of approach, and these rules play a vital role during the different levels of translation. This approach is also called theory-based machine translation.

The benefit of the rule-based machine translation method is that it can examine a sentence deeply at the syntactic and semantic levels. Its complications are the prerequisite of vast linguistic knowledge and the very large number of rules needed in order to cover all the features of a language. An advantage of the approach is that the developer has more control over the translations than is the case with corpus-based approaches. The three approaches that require linguistic knowledge are as follows.

3.6.1.1 Direct Approach

The direct translation approach can be considered the first approach to machine translation. In this type of approach, the machine translation system is designed specifically for one particular pair of languages. There is no need to identify schematic roles and universal concepts in this approach. It involves analysing the morphological information, identifying the constituents, reordering the words of the source language according to the word-order pattern of the target language, replacing the source language words with target language words using a lexical dictionary for that particular language pair and, as a last step, inflecting the words appropriately to produce the translation. Although this looks like a lot of work, each step is simple and can be accomplished easily in a short span of time. Figure 3.6 illustrates the block diagram of the direct approach to machine translation. This approach performs only a simple and minimal syntactic and semantic analysis, by which it differs from the other rule-based translation systems such as the interlingua and transfer-based approaches; for this reason the direct approach is considered ad hoc and generally unsuitable for wide-coverage machine translation. Table 3.5 shows, as an example, how the sentence "he came late to school yesterday" is translated from English to Tamil using the direct approach.


Figure 3.6 Block Diagram of Direct Approach to Machine Translation

Table 3.5 An Example to Illustrate the Direct Approach

Input Sentence in English             He came late to school yesterday
After Morphological Analysis          He come PAST late to school yesterday
Constituent Identification            <He> <come PAST> <late> <to school> <yesterday>
Word Reordering                       <He> <yesterday> <to school> <late> <come PAST>
Dictionary Lookup                     mtd; new;W gs;spf;F neuk; fHpj;J th PAST
Inflect (final translated sentence)   mtd; new;W gs;spf;F neuk; fHpj;J te;jhd;.

3.6.1.2 Interlingua Approach

The interlingua approach to machine translation aims at transforming texts in the source language into a common representation that is applicable to many languages. Using this representation, the translation of the text into the target language is performed; it should be possible to translate into every language from the same interlingua representation, given the right rules. The interlingua approach sees machine translation as a two-stage process:

1. Analysing and transforming the source language texts into a common language-independent representation.

2. Generating the text in the target language from the common language-independent form.

The first stage is particular to the source language and does not require any knowledge about the target language, whereas the second stage is particular to the target language and does not require any knowledge about the source language. The main advantage of the interlingua approach is that it creates an economical multilingual environment: it requires only 2n translation components to translate among n languages, where the direct approach requires n(n-1) translation systems. Table 3.6 shows the interlingua representation of the sentence "he will reach the hospital in ambulance".

Table 3.6 An Example for Interlingua Representation

Predicate     Reach
Agent         Boy (Number: Singular)
Theme         Hospital (Number: Singular)
Instrument    Ambulance (Number: Singular)
Tense         FUTURE

The concepts and relations that are used are the most important aspect of any interlingua-based system. The ontology should be powerful enough that all subtleties of meaning expressible in any language can be represented in the interlingua. The interlingua approach is more economical when translation is carried out with three or more languages, but the complexity of the approach also increases dramatically. This is clearly evident from the Vauquois triangle, shown in Figure 3.7.

Figure 3.7 The Vauquois Triangle

3.6.1.3 Transfer Approach

The less determined transfer approach has three stages, comprising intermediate representations of the source and target language texts, instead of the two stages of the interlingua approach. Transfer can be performed by considering either the syntactic or the semantic information of the text; in general, transfer can be either syntactic or semantic depending on the need.

The transfer model involves three stages: analysis, transfer and generation. In the analysis stage, the source language sentence is parsed, and the sentence structure and the constituents of the sentence are identified. In the transfer stage, transformations are applied to the source language parse tree to convert its structure to that of the target language. The generation stage translates the words and expresses the tense, number, gender etc. in the target language. Figure 3.8 shows the block diagram of the transfer approach.

Figure 3.8 Block Diagram for Transfer Approach

Consider the sentence "he will come to school in bus". Table 3.7 illustrates the three stages of the translation of this sentence using the transfer approach. The representation of the sentence after the analysis stage is shown in the analysis row; the representation after reordering according to the Tamil word order, the result of the transfer stage, is shown in the transfer row; and the final generation stage replaces the source language words with target language words.

From this example it is clear that the analyser stage of this approach produces a representation that is source-language dependent, and the generation stage generates the final translation from a target-language dependent representation of the sentence. Thus, using this approach in a multilingual machine translation system to translate among n languages requires n analyser components, n(n-1) transfer components (since an individual transfer component is required for each direction of each language pair) and n generation components.


Table 3.7 An Example for Transfer Approach

Input Sentence        He will come to school in bus
Analysis              (parse of the English sentence)
Transfer              (constituents reordered to Tamil word order)
Generation (Output)   அவன் பேருந்தில் பள்ளிக்கு வருவான்

3.6.2 Non-Linguistic Approaches

The non-linguistic approaches are those which do not require any explicit linguistic knowledge to translate texts from the source language to the target language. The only resource required by this type of approach is data: dictionaries for the dictionary-based approach, or bilingual and monolingual corpora for the empirical or corpus-based approaches.

3.6.2.1 Dictionary based Approach

The dictionary-based approach to machine translation uses a dictionary for the language pair to translate texts from the source language to the target language. In this approach, word-level translations are produced. The dictionary lookup can be preceded by pre-processing stages that analyse the morphological information and lemmatize the word before it is retrieved from the dictionary. This kind of approach can be used to translate the phrases in a sentence, but it is of little use for translating a full sentence. It is, however, very useful for accelerating human translation, providing meaningful word translations and limiting the work of the human to correcting the syntax and grammar of the sentence.

3.6.2.2 Empirical or Corpus based Approach

The corpus-based approaches do not require any explicit linguistic knowledge to translate a sentence, but a bilingual corpus of the language pair and a monolingual corpus of the target language are required to train the system. This approach has attracted a great deal of interest worldwide.


3.6.2.3 Example based Approach

This approach to machine translation is mainly based on how human beings interpret and solve problems: normally, humans split a problem into sub-problems, solve each of the sub-problems using the idea of how they solved similar problems in the past, and integrate the solutions to solve the problem as a whole. This approach needs a huge bilingual corpus of the language pair between which translation is to be performed.

An EBMT system functions like a translation memory. A translation memory is a computer-aided translation tool that is able to reuse previous translations: if the sentence or a similar sentence has been translated previously, the previous translation is returned. In contrast, an EBMT system can translate novel sentences, not just reproduce previous sentence translations. EBMT translates in three steps: matching, alignment and recombination [143]. 1) In matching, the system looks in its database of previous examples and finds the pieces of text that together give the best coverage of the input sentence. This matching is done using various heuristics, from exact character match to matches using higher linguistic knowledge to calculate the similarity of words or to identify generalized templates. 2) The alignment step is then used to identify which target words these matching strings correspond to. This identification can be done using existing bilingual dictionaries or deduced automatically from the parallel data. 3) Finally, these correspondences are recombined, and the rejoined sentences are judged using either heuristic or statistical information. Figure 3.9 shows the block diagram of the example-based approach.

Figure 3.9 Block Diagram of EBMT System

In order to get a clear idea of this approach, consider the English sentence "He bought a home" and the Tamil translations of the related sentences given in Table 3.8.

Table 3.8 Example of English and Tamil Sentences

English           Tamil
He bought a pen   mtd; xU ngdh th';fpdhd;
He has a home     mtDf;F xU tPL ,Uf;fpwJ

The parts of the sentence to be translated are matched against these two sentences in the corpus. Here, the part "He bought" matches the words in the first sentence pair and "a home" matches the words in the second sentence pair. Therefore, the corresponding Tamil parts of the matched segments of the corpus sentences are taken and combined appropriately. Sometimes, post-processing may be required in order to handle number and gender if exact words are not available in the corpus.

3.6.2.4 Statistical Approach

The statistical approach to machine translation generates translations using statistical methods, deriving the parameters for those methods by analysing bilingual corpora. This approach differs from the other approaches to machine translation in many aspects. Figure 3.10 shows the simple block diagram of a Statistical Machine Translation (SMT) system.

Figure 3.10 Block Diagram of SMT System

The advantages of the statistical approach over other machine translation approaches are as follows:

• It makes enhanced use of the resources available for machine translation, such as manually translated parallel and aligned texts of a language pair, books available in both languages, and so on; large amounts of machine-readable natural language text are available to which this approach can be applied.

• In general, statistical machine translation systems are language independent, i.e., not designed specifically for one pair of languages.

• Rule-based machine translation systems are generally expensive, as they employ manual creation of linguistic rules, and they cannot be generalised to other languages, whereas statistical systems can be generalised to any pair of languages if a bilingual corpus for that particular language pair is available.

• Translations produced by statistical systems are more natural compared to those of other systems, as the system is trained on real texts from the bilingual corpus, and the fluency of the sentence is guided by a monolingual corpus of the target language.

Statistical parameters are analysed and determined from bilingual and monolingual corpora. Using these parameters, translation and language models are generated. Designing a statistical system for a particular language pair is a rapid process, because the work lies in creating the bilingual corpus for that particular language pair. In order to obtain good translations from this approach, the system needs at least two million words for a particular domain. Moreover, statistical machine translation requires an extensive hardware configuration to create the translation models and reach average performance levels.
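Concretely, the translation and language models mentioned above are usually combined in the standard noisy-channel formulation (standard SMT notation, not reproduced from this thesis): for a source (English) sentence $e$, the system searches for the target (Tamil) sentence $t$ that maximizes

$$t^{*} = \operatorname*{arg\,max}_{t} P(t \mid e) = \operatorname*{arg\,max}_{t} P(e \mid t)\, P(t),$$

where the translation model $P(e \mid t)$ is estimated from the bilingual corpus and the language model $P(t)$ from the monolingual target-language corpus.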

3.6.3 Hybrid Machine Translation System

The hybrid machine translation approach makes use of the advantages of both the statistical and the rule-based translation methodologies. Commercial translation providers such as Asia Online and Systran offer systems implemented using this approach. Hybrid machine translation approaches differ in a number of aspects:

Rule-based system with post-processing by the statistical approach: here the rule-based machine translation system produces translations for a given text from the source language to the target language. The output of this rule-based system is then post-processed by a statistical system to provide better translations. Figure 3.11 shows the block diagram for this system.

Figure 3.11 Rule based Translation System with Post-processing

Statistical translation system with pre-processing by the rule-based approach: in this approach a statistical machine translation system is combined with a rule-based system that pre-processes the data before it is used for training and testing. The output of the statistical system can also be post-processed using the rule-based system to provide better translations. The block diagram for this type of system is shown in Figure 3.12.

Figure 3.12 Statistical Machine Translation System with Pre-processing

3.7 EVALUATING STATISTICAL MACHINE TRANSLATION

This section describes evaluation methods for assessing the quality of a machine translation system. Evaluation of machine translation is a very active field of research. There are two important types of evaluation techniques in machine translation: automatic evaluation and manual (human) evaluation. This section shows how to evaluate the performance of an MT system both manually and automatically. The most reliable method for evaluating translation adequacy and fluency is human evaluation, but human evaluation is a slow and expensive process, and the judgments of more than one human evaluator usually have to be averaged. A quick, cheap and consistent approach is therefore required to judge MT systems. A fully precise automated evaluation technique would require linguistic understanding; methods for automatic evaluation instead measure the similarity between the translation output and one or more reference translations.

3.7.1 Human Evaluation Techniques

Statistical machine translation outputs are very hard to evaluate. To judge the quality of a translation, one may ask human translators to score a machine translation output or to compare a system output with a gold-standard output; these gold-standard outputs are generated by human translators. Different translators translate the same sentence in different ways: there is no single correct answer for the translation task, because a sentence can be translated in several ways, the variation arising from the translators' choice of words, word order and style. So machine translation quality is very hard to judge. Human evaluation provides the best insight into the performance of an MT system, but it comes with major drawbacks: it is an expensive and time-consuming evaluation method. To overcome some of these drawbacks, automatic evaluation metrics have been introduced. These are much faster and cheaper than human evaluation, and they are consistent in their evaluation, since they always provide the same evaluation given the same data. The disadvantage of automatic evaluation metrics is that their judgments are often not as accurate as those provided by a human. Such reference-based evaluation, however, has the advantage that it is not tied to a single realistic translation scenario. Most often, evaluation is performed on sentences for which one or more gold-standard reference translations already exist [143]. In the human evaluation method, the judges are presented with a gold-standard sentence and some translations. Table 3.9 shows the scales used for evaluation when the language being translated into is English. Using this scale, the judges are asked to assign a score to each of the presented translations. Adequacy and fluency are the widespread criteria for manual evaluation.


Table 3.9 Scales of Evaluation

Score   Adequacy   Fluency
5       All        Flawless
4       Most       Good
3       Much       Non-native
2       Little     Disfluent
1       None       Incomprehensible

3.7.2 Automatic Evaluation Techniques

Automatic evaluation methods use a computer program to judge whether a translation output is good or not. Automatic evaluation metrics are currently widely used to evaluate machine translation systems, which are tuned and upgraded based on the rise and fall of the scores of these metrics. The major advantages of this technique are time and money: it requires little time to judge a huge amount of output. In situations like everyday system evaluation, human evaluation would be too expensive, slow and inconsistent; a reliable automatic evaluation metric is therefore very important to the progress of the machine translation field. In this section, the most widely used automatic evaluation metrics are described: BLEU, NIST, edit-distance measures, and precision and recall.

3.7.2.1 BLEU Score

The first and most widely used automatic evaluation measure is BLEU (BiLingual Evaluation Understudy) [144], introduced by IBM in Papineni et al. (2002). It computes the geometric mean of modified n-gram precisions. BLEU considers not only single-word matches between the output and the reference sentence, but also n-gram matches up to some maximum n: for each order n it takes the ratio of correct n-grams to the total number of generated n-grams of that order. The maximum n-gram order is typically set to four, and the resulting score is then called BLEU-4. Multiple references can also be used to compute BLEU; evaluating a system translation against multiple reference translations provides a more robust assessment of translation quality [144]. The BLEU metric takes the geometric mean of the scores assigned to all n-gram lengths. Equation 3.1 shows the formula for BLEU, where N is the maximum order of n-grams used (usually 4) and $p_n$ is the modified n-gram precision, in which each n-gram in the reference can be matched by at most one n-gram from the hypothesis. BP is a brevity penalty, used to penalize translations that are too short; it is based on the length of the hypothesis, c, and the reference length, r. If several references are used, there are alternative ways of calculating the reference length, using the closest, average or shortest reference length. BLEU can only be used to give accurate system-wide scores, since the geometric-mean formulation means the score is zero if there are no overlapping 4-grams, which is often the case for single sentences.

$$\mathrm{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{N} \frac{1}{N} \log p_n \right) \qquad (3.1)$$

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$
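As a sanity check on the definition, BLEU can be computed with NLTK's implementation; the sentences are invented, and smoothing is enabled because, as noted above, a single sentence with no 4-gram overlap would otherwise score zero.

```python
# Sentence-level BLEU with NLTK (smoothed, since single sentences often
# have no matching 4-grams and would otherwise score exactly zero).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the boy is going to the school".split()]   # list of references
hypothesis = "the boy goes to the school".split()

smooth = SmoothingFunction().method1
print(sentence_bleu(reference, hypothesis, smoothing_function=smooth))
```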

3.7.2.2 NIST Metric

The NIST metric (Doddington, 2002) is an extension of the BLEU metric [145]. It was introduced to address two characteristics of BLEU. First, the geometric average in BLEU makes the overall score more sensitive to the modified precision of the individual n-gram orders than an arithmetic average would, which may be a problem when few high-order n-gram matches exist. Second, all word forms are weighted equally in BLEU, although less frequent word forms may be of higher importance for the translation than, for example, high-frequency function words; NIST compensates for this by introducing an information weight. Additionally, the brevity penalty is changed to have less impact for small variations in length. The information weight of an n-gram $w_1 \ldots w_n$ is calculated by the following equation:

$$\mathrm{info}(w_1 \ldots w_n) = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)} \qquad (3.2)$$

This information weight is used in equation (3.4) in place of the simple count of matching n-grams. In addition, the arithmetic average is used instead of the geometric one, and the brevity penalty is calculated from the average reference length rather than the closest reference length; the reference lengths are summed over the entire corpus to give r, and likewise the translation lengths give t. With $\beta$ a constant controlling the severity of the penalty,

$$BP = \exp\!\left( \beta \, \log^2 \min\!\left( \frac{t}{r},\, 1 \right) \right) \qquad (3.3)$$

$$\mathrm{NIST} = BP \cdot \sum_{n=1}^{N} \frac{\displaystyle\sum_{\text{matching } n\text{-grams}} \mathrm{info}(n\text{-gram})}{\displaystyle\sum_{n\text{-grams in hypothesis}} 1} \qquad (3.4)$$

The NIST metric is very similar to the BLEU metric, and their correlations with human evaluations are also close. Perhaps NIST correlates a bit better with adequacy, while BLEU correlates a bit better with fluency (Doddington, 2002) [145].

3.7.2.3 Precision and Recall

In automatic evaluation metrics, each sentence of the system translation is compared against a gold-standard human translation, called the reference translation. The precision and recall approach is based on word matches: precision is the fraction of retrieved items that are relevant, and recall is the fraction of relevant items that are retrieved. This metric is mainly used in information retrieval systems. Its significant drawback when used for machine translation is that it takes no account of word order.

Precision: the number of relevant items retrieved divided by the total number of items retrieved, P(relevant | retrieved). For MT, this is the number of correct words divided by the output length.

Recall: the number of relevant items retrieved divided by the total number of existing relevant items (those that should have been retrieved), P(retrieved | relevant). For MT, this is the number of correct words divided by the reference length.

For example:

SMT OUTPUT: Israeli officials responsibility of airport safety
REFERENCE:  Israeli officials are responsible for airport security

Three output words (Israeli, officials, airport) occur in the reference, so

Precision = Correct / Output length = 3 / 6 = 50%
Recall = Correct / Reference length = 3 / 7 = 42.85%

The F-measure (weighted harmonic mean) is a combined measure that assesses the precision/recall trade-off:

F = 2(P x R) / (P + R) = 2(0.5 x 0.4285) / (0.5 + 0.4285) = 46%
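The worked numbers above can be reproduced in a few lines; the clipped bag-of-words matching mirrors the example and deliberately ignores word order, which is exactly the drawback noted.

```python
# Reproduce the precision/recall/F example above (bag-of-words matching).
from collections import Counter

output = "Israeli officials responsibility of airport safety".split()
reference = "Israeli officials are responsible for airport security".split()

correct = sum((Counter(output) & Counter(reference)).values())  # clipped matches
precision = correct / len(output)      # 3/6 = 0.50
recall = correct / len(reference)      # 3/7 = 0.4285...
f_measure = 2 * precision * recall / (precision + recall)       # ~0.46
print(round(precision, 2), round(recall, 4), round(f_measure, 2))
```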

3.7.2.4 Edit Distance Measures

Edit distance measures provide an estimate of translation quality based on the number of changes which must be applied to the automatic translation in order to transform it into a reference translation.

• WER, Word Error Rate (Nießen et al., 2000) [147]: this measure is based on the Levenshtein distance (Levenshtein, 1966) [146], the minimum number of substitutions, deletions and insertions that have to be performed to convert the automatic translation into a reference translation.

• PER, Position-independent Word Error Rate (Tillmann et al., 1997) [148]: a shortcoming of WER is that it does not allow reordering of words. To overcome this problem, the position-independent word error rate compares the words in the two sentences without taking word order into account.

• TER, Translation Edit Rate (Snover et al., 2006) [149]: TER measures the amount of post-editing that a human would have to perform to change a system output so that it exactly matches a reference translation. Possible edits include insertions, deletions and substitutions of single words as well as shifts of word sequences; all edits have equal cost.

TER = number of edits to the closest reference / average number of reference words

The edits that TER considers are insertion, deletion and substitution of individual words, as well as shifts of contiguous word sequences. TER has also been shown to correlate well with human judgment.
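A minimal sketch of WER follows: the Levenshtein distance over words (insertions, deletions and substitutions at equal cost) normalized by the reference length.

```python
# Word Error Rate: Levenshtein distance over words / reference length.
def wer(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edits to turn the first i hyp words into the first j ref words
    dp = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        dp[i][0] = i                      # deletions
    for j in range(len(ref) + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = dp[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("Israeli officials responsibility of airport safety",
          "Israeli officials are responsible for airport security"))
```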

3.8 SUMMARY

This chapter provided background on Tamil language processing and on the various approaches for developing linguistic tools and machine translation systems. It gave an overview of the Tamil language and its morphology, and it also discussed machine learning for natural language processing and evaluation methods for machine translation.

CHAPTER 4

PREPROCESSING FOR ENGLISH SENTENCE

Current phrase-based Statistical Machine Translation systems do not use any linguistic information; they operate only on surface word forms. It has been shown that adding linguistic information helps to improve the translation process, and adding linguistic information can be done through preprocessing steps. Moreover, a machine translation system for a language pair with disparate morphological structures needs suitable pre-processing or modeling before translation. This chapter explains how preprocessing is applied to the raw source language sentence to make it more appropriate for translation.

4.1 MORPHO-SYNTACTIC INFORMATION OF ENGLISH LANGUAGE

The grammar of a language is divided into syntax and morphology. Syntax is how words are combined to form a sentence, while morphology deals with the formation of words; morphology is also defined as the study of how meaningful units can be combined to form words. One reason to process morphology and syntax together in language processing is that a single word in one language can be equivalent to a combination of words in another. The term "morpho-syntax" is a hybrid word that comes from morphology and syntax. It plays a major role in processing different types of languages, and it is directly relevant to machine translation because the fundamental units of machine translation are words and phrases.

Retrieving syntactic information is a primary step in pre-processing English sentences. The process of retrieving the syntactic structure of a given sentence is called parsing, and the tool used to retrieve the morphological features of a word is called a morphological analyzer. Syntactic information includes dependency relations, syntactic structure and POS tags; morphological information consists of the lemma and morphological features.

Klein and Manning (2003) [150] of Stanford University proposed a statistical technique for retrieving the syntactic structure of English sentences. Based on this technique, the "Stanford Parser" tool was developed. This parser provides dependency relationships as well as phrase structure trees for a given sentence. The Stanford parser package is a Java implementation of probabilistic natural language parsers, including a highly optimized PCFG parser, lexicalized dependency parsers and a lexicalized PCFG parser. The parser has also been developed for other languages such as Chinese, Italian, Bulgarian and Portuguese. It uses the knowledge gained from hand-parsed sentences to produce the most likely analysis of new sentences. In this preprocessing, the Stanford parser is used to retrieve the morpho-syntactic information of English sentences.

4.1.1 POS and Lemma Information

Part-of-Speech (POS) tagging is the task of labeling each word in a sentence with its appropriate part of speech, such as noun, verb or adjective. The process takes an untagged sentence as input, assigns a POS tag to each word, and produces the tagged sentence as output. The most widely used part-of-speech tagset for English is the Penn Treebank tagset, which is given in Appendix A; in this thesis, English sentences are tagged using this tagset. The POS tags and lemmas of some word forms are shown in Table 4.1. The example shown below illustrates POS tagging for an English sentence.

English Sentence
The boy is going to the school.

Part-of-Speech Tagging
The/DT boy/NN is/VBZ going/VBG to/TO the/DT school/NN ./.

Table 4.1 POS and Lemma of Words

Word       POS    Lemma
Playing    NN     Playing
Playing    VBG    Play
Walked     VBD    Walk
Pens       NNS    Pen
Training   VBG    Train
Training   NN     Training
Trains     NNS    Train
Trains     VBZ    Train

A morphological analyzer or lemmatizer is used to find the lemma of a word. Lemmas have special importance in highly inflected languages. A lemma is a dictionary word or root word: for example, the word "play" is found in the dictionary, while word forms like playing, played and plays are not, so "play" is called the lemma or dictionary word for those word forms.
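For illustration only (the thesis pipeline uses the Stanford tools), the same kind of POS and lemma information can be obtained with NLTK; the sketch assumes the tagger and WordNet data have been downloaded beforehand.

```python
# POS tags and lemmas with NLTK (assumes the required data are installed,
# e.g. nltk.download('averaged_perceptron_tagger'), nltk.download('wordnet')).
import nltk
from nltk.stem import WordNetLemmatizer

tokens = "The boy is going to the school".split()
tagged = nltk.pos_tag(tokens)        # [('The', 'DT'), ('boy', 'NN'), ...]

lemmatizer = WordNetLemmatizer()
for word, tag in tagged:
    pos = "v" if tag.startswith("VB") else "n"   # map Penn tag to WordNet POS
    print(word, tag, lemmatizer.lemmatize(word.lower(), pos=pos))
```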

4.1.2 Syntactic Information

Syntactic information of a language is used in NLP tasks such as machine translation, question answering, information extraction and language generation. Syntactic information can be extracted through parsing. Parsing extracts information such as part-of-speech tags, phrases and relationships between the words in a sentence. In addition, from the parse tree of a sentence, noun phrases, verb phrases and prepositional phrases are identified. Figure 4.1 shows an example of an English syntactic tree. The parser output is a tree structure with a sentence label as the root. The example below indicates the syntactic information of an English sentence.

Figure 4.1 Example of English Syntactic Tree


English Sentence: The boy is going to the school.

Parts of speech for each word (NN = Noun, VBZ = Verb, DT = Determiner, VBG = Verbal Gerund):

(S
  (NP (DT the) (NN boy))
  (VP (VBZ is)
    (VP (VBG going)
      (PP (TO to)
        (NP (DT the) (NN school))))))

Parsing information:
(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) (PP (TO to) (NP (DT the) (NN school))))) (. .)))

Phrases:
Noun Phrases (NP): "the boy", "the school"
Verb Phrases (VP): "is", "going"
Sentences (S): "the boy is going to the school"
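The phrase information above can also be extracted programmatically from the bracketed parse. The following minimal sketch (an illustration only, not part of the thesis toolchain) uses NLTK's Tree class, assuming the nltk package is installed:

from nltk import Tree

# Load the bracketed Stanford parse shown above.
parse = Tree.fromstring(
    "(ROOT (S (NP (DT The) (NN boy)) (VP (VBZ is) (VP (VBG going) "
    "(PP (TO to) (NP (DT the) (NN school))))) (. .)))")

# List every noun phrase in the tree.
for subtree in parse.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(subtree.leaves()))
# prints: "The boy" and "the school"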

4.1.3 Dependency Information

Dependency information represents relations between individual words. A typed dependency parser additionally labels dependencies with grammatical relations, such as subject, direct object and indirect object. It is used in several NLP applications, which benefit particularly from having access to dependencies between words typed with grammatical relations, since these relations also provide information about predicate-argument structure that is not readily available from phrase structure parse trees. The Stanford typed dependency representation was designed to provide a simple description of the grammatical relationships in a sentence. It can be easily understood and used even by people without linguistic knowledge, and it is also used to extract textual relations. An example of the typed dependency relations for an English sentence is given below.

English Sentence: The boy is going to the school.

Subject: "the boy"    Verb: "is going"    Object: "to the school"

Typed dependencies:
det(boy-2, The-1)
nsubj(going-4, boy-2)
aux(going-4, is-3)
root(ROOT-0, going-4)
prep(going-4, to-5)
det(school-7, the-6)
pobj(to-5, school-7)

4.2 DETAILS OF PREPROCESSING ENGLISH SENTENCES

Recently, SMT systems have been augmented with linguistic information in order to address the problems of word order and morphological variance between language pairs. This preprocessing of the source language is applied consistently to both the training and testing corpora. Source-side pre-processing steps bring the source language sentence closer to the target language sentence. This section explains the preprocessing methods applied to English sentences to improve the quality of the English to Tamil Statistical Machine Translation system. The preprocessing module for an English sentence includes three stages: reordering, factorization and compounding. Figure 4.2 shows the preprocessing stages of an English sentence. The first step in preprocessing an English sentence is to retrieve linguistic features such as the lemma, POS tag and syntactic relations using the Stanford parser. These linguistic features, along with the sentence, are then subjected to the reordering and factorization stages.

Reordering applies the reordering rules to the syntactic trees to rearrange the phrases in the English sentence. Factorization takes the surface words in the sentence, factors them using the syntactic tool, and appends this information to the words in the sentence. Part-of-speech tags are simplified and included as a factor during factorization. The factored sentence is then given to the compounding stage. Compounding is defined as adding additional morphological information to the morphological factor of the source (English) language words. This additional morphological information includes function words, subject information, dependency relations, auxiliary verbs and modal verbs, and is based on the morphological structure of the target language. After adding this information, a few function words and auxiliary items are removed, and the reordered information is incorporated in the integration phase.

[Figure 4.2: English Sentence → Stanford Parser Tool → Reordering → Factorization → Compounding → Integration → Preprocessed English Sentence]

Figure 4.2 Preprocessing Stages of English Sentence


4.2.1 Reordering English Sentences

Reordering transforms the source language sentence into a word order that is closer to that of the target language. In machine translation, the order of the words in the source language sentence is often different from that of the target language sentence, and this word-order difference is one of the most significant sources of errors in a machine translation system. Phrase-based SMT systems are limited in handling long-distance reordering. A set of syntactic reordering rules was therefore developed and applied to English sentences to better align them with Tamil sentences. The reordering rules capture the structural differences between English and Tamil sentences. These transformation rules are applied to the parse trees of the English source sentences, which are produced with the Stanford parser tool. The quality of the parse trees plays an important role in syntactic reordering. In this thesis the source language is English, for which parsers are comparatively accurate, since English parsers benefit from a longer tradition of development and advanced statistical parsing techniques; consequently, the reordering based on these parses matches the target language closely. Reordering has been successfully applied in French to English (Xia and McCord, 2004) [151] and German to English (Collins et al., 2005) translation systems [152]. Xia and McCord (2004) used reordering rules to improve translation quality, with the rules learned automatically from the parse trees of both source and target sentences [151]. Marta Ruiz Costa-jussà (2006) proposed a novel reordering algorithm for SMT [153], introducing two new approaches: block reordering and Statistical Machine Reordering (SMR). The same author also surveyed various reordering methods, such as syntax-based reordering and heuristic reordering, in 2009 [154]. Ananthakrishnan et al. (2008) developed syntactic and morphological preprocessing for an English to Hindi SMT system; they reorder the English source sentence as per Hindi syntax and segment Hindi suffixes for morphological processing [155]. Recent developments have shown an improvement in translation quality when explicit syntax-based reordering is used. One such development is the pre-translation approach, which alters the word order of the source language sentence to the target language word order before translation. This is done based on predefined linguistic rules that are either manually created or automatically learned from parallel corpora.

4.2.1.1 Syntactic Comparison between English and Tamil

This subsection takes a closer look at notable differences between the syntax of English and Tamil. Syntax is a theory of sentence structure, and it guides reordering when translations between a language pair involve disparate sentence structures. English and Tamil are from different language families: English is an Indo-European language and Tamil is a Dravidian language. English has Subject-Verb-Object (SVO) word order, whereas Tamil has Subject-Object-Verb (SOV) word order. For example, the main verb of a Tamil sentence always comes at the end, but in English it comes between the subject and the object. English is a fixed word-order language, whereas Tamil word order is flexible; flexibility in word order means that the order may change freely without affecting the grammatical meaning of the sentence. While translating from English to Tamil, English verbs have to be moved from after the subject to the end of the sentence, and English prepositions become postpositions in Tamil. Tamil is a head-final language, also called a verb-final language: the Tamil verb comes at the end of the clause, and demonstratives and modifiers precede the noun within the noun phrase. The simplest Tamil sentence can consist of only two noun phrases, with no verb (not even a linking verb) present. For example, when the pronoun இது (idu) 'this' is followed by the noun புத்தகம் (puththakam) 'book', the sequence is translated into English as 'This is a book.' Tamil is a null-subject language; not all Tamil sentences have subjects, verbs and objects, and it is possible to construct grammatically valid and meaningful sentences that lack some of these elements. For example, a sentence may consist of only a verb (such as a verb meaning 'completed'), or of only a subject and object without a verb, as in அது என் வீடு (athu en vIdu, "That [is] my house"). Tamil does not have a copula verb (a linking verb equivalent to the word is); the word is is included in the translations only to convey the meaning more easily. Schiffman (1999) observed that Tamil syntax is the mirror image of the order in an English sentence, especially when there are relative clauses, quotations, adjectival and adverbial clauses, conjoined verbal constructions, and aspectual and modal auxiliaries, among others [156].


4.2.1.2 Reordering Methodology

Reordering is the first step applied in preprocessing the source language sentence. It is an important step for languages which differ in their syntactic structure. The reordering rules are handcrafted using the syntactic word-order differences between English and Tamil. During training, the English sentences in the parallel corpus are reordered, and during decoding, the English sentence given for testing is reordered. Lexicalized automatic reordering is implemented in the Moses toolkit, but this automatic reordering only performs well over short ranges; an external module is therefore needed to handle long-range reordering. This reordering stage is also a way of indirectly integrating syntactic information into the source language. The example below shows the reordering of an English sentence: the first line is the original English sentence, the second is the pre-translation reordered source sentence, and the third is the translated Tamil sentence.

English Sentence: I bought vegetables to my home.

Reordered English: I my home to vegetables bought.

Tamil Sentence: நான் என்னுடைய வீட்டிற்கு காய்கறிகள் வாங்கினேன்
(wAn ennudaiya vIddiRkku kAikaRikaL vAmginEn)

Figure 4.3 shows the method of reordering English sentences. The English sentence is given to the Stanford parser to retrieve its syntactic information. After this information is retrieved, the English sentence is reordered using predefined syntactic rules. In order to obtain a word order similar to the target language, reordering is applied to the source sentence prior to translation. In total, 180 rules were developed according to the syntactic differences between English and Tamil. These syntax-based rules reorder the English sentence to better match the Tamil sentence; they are applied to the English parse tree at the sentence level and are specific to this particular language pair. Reordering is carried out on phrases as well as words. All the created rules are compared with the production rules of the English sentence.


If a match is found, the transformation is performed according to the target rule. Examples of English reordering rules are shown in Table 4.2; the complete set of rules is included in Appendix B.

[Figure 4.3: English Sentence → Stanford Parser Tool → Syntactic Information → Reordering (using Syntactic Rules) → Reordered English Sentence]

Figure 4.3 Process of Reordering

In general, PBSMT reordering performance degrades when encountering unknown phrases or long sentences. The main advantages of reordering are improved word order in the translation and better utilization of the phrase-based SMT system [157]; incorporating reordering into the search process itself would imply a high computational cost. Each reordering rule consists of three units (Table 4.2):

i. the production rule of the original English sentence (source);
ii. the transformed production rule according to the Tamil sentence (target);
iii. source part numbers and target part numbers, which indicate how the source sentence is reordered (transformations).


Table 4.2 Reordering Rules

S.no  Source                 Target                   Transformation
1     S -> NP VP             # S -> NP VP             # 0:0,1:1
2     PP -> TO NP-PRP        # PP -> TO NP-PRP        # 0:0,1:1
3     VP -> VB NP* SBAR      # VP -> NP* VB SBAR      # 0:1,1:0,2:2
4     VP -> VBD NP           # VP -> NP VBD           # 0:1,1:0
5     VP -> VBD NP-TMP       # VP -> NP-TMP VBD       # 0:1,1:0
6     VP -> VBP PP           # VP -> PP VBP           # 0:1,1:0
7     VP -> VBD NP NP-TMP    # VP -> NP NP-TMP VBD    # 0:2,1:0,2:1
8     VP -> VBD NP PP        # VP -> PP NP VBD        # 0:2,1:1,2:0
9     VP -> VBD S            # VP -> S VBD            # 0:1,1:0
10    VP -> VB S             # VP -> S VB             # 0:1,1:0
11    VP -> VB NP            # VP -> NP VB            # 0:1,1:0
12    PP -> TO NP            # PP -> NP TO            # 0:1,1:0
13    VP -> VBD PP           # VP -> PP VBD           # 0:1,1:0

For instance, take the sixth reordering rule from Table 4.2:

VP -> VBP PP # VP -> PP VBP # 0:1,1:0

Here, # divides the units of the reordering rule, and the last unit gives the source and target indexes. In this example, "0:1,1:0" indicates that the first child of the target rule comes from the second child of the source rule, and the second child of the target rule comes from the first child of the source rule.

Source → 0(VBP) 1(PP)
Target → 1(PP) 0(VBP)
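To make the rule format concrete, the following minimal Python sketch (an illustration, not the thesis implementation) parses one rule of this form and applies its index mapping to the children of a matched node:

# Parse a reordering rule of the form
# "VP -> VBP PP # VP -> PP VBP # 0:1,1:0".
def parse_rule(rule_line):
    source, target, mapping = [part.strip() for part in rule_line.split("#")]
    # Map each target child position to the source child it is copied from.
    order = {}
    for pair in mapping.split(","):
        tgt, src = pair.split(":")
        order[int(tgt)] = int(src)
    return source, target, order

# Reorder the children of a matched node according to the index mapping.
def reorder_children(children, order):
    return [children[order[i]] for i in range(len(children))]

rule = "VP -> VBP PP # VP -> PP VBP # 0:1,1:0"
source, target, order = parse_rule(rule)
print(reorder_children(["(VBP live)", "(PP in Chennai)"], order))
# prints: ['(PP in Chennai)', '(VBP live)']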

For example, take the English sentence "I bought vegetables to my home". The syntactic tree for this sentence is shown in Figure 4.4.

English Sentence: I bought vegetables to my home.

Figure 4.4 English Syntactic Tree

Production rules of the English sentence:
i. S -> NP VP
ii. VP -> VBD NP PP
iii. PP -> TO NP
iv. NP -> PRP$ NN

The first production rule (i) S -> NP VP matches the first reordering rule in Table 4.2. Its target transformation is the same as the source pattern, so there is no change to this production rule. The next production rule (ii) VP -> VBD NP PP matches the eighth reordering rule in the table, whose transformation is 0:2,1:1,2:0. This means that the source child order (0,1,2), the indexes of VBD, NP and PP, is transformed into (2,1,0), giving the transformed pattern PP NP VBD. This process is applied in turn to each of the production rules. The final transformed production rules are given below.

Reordered production rules of the English sentence:
i. S -> NP VP
ii. VP -> PP NP VBD
iii. PP -> NP TO
iv. NP -> NN PRP$

Reordered English Sentence: I my home to vegetables bought.

The English side of the parallel corpus used for training is reordered, and the test sentences are reordered as well. About 80% of English sentences are reordered correctly according to the developed rules. Original and reordered English sentences are shown in Table 4.3. After reordering, the English sentences are given to the compounding stage.

Table 4.3 Original and Reordered Sentences

Original Sentences                     Reordered Sentences
I saw a beautiful child                I a beautiful child saw
He came last week                      He last week came
Sharmi gave her book to Arthi          Sharmi her book Arthi to gave
She went to shop for buying fruits     She fruits buying for shop to went
Cat is sleeping on the table           Cat the table on sleeping is

4.2.2 Factoring English Sentence

Current phrase-based models are limited to the mapping of small text chunks without the use of any explicit linguistic information, morphological or syntactic. Such information plays a significant role in morphologically rich languages. On the other hand, for many language pairs the availability of bilingual corpora is very limited, and SMT performance depends on the quality and quantity of corpora. SMT therefore needs a method which uses linguistic information explicitly with smaller amounts of parallel data. Koehn and Hoang (2007) developed the factored translation framework for statistical translation models to tightly integrate linguistic information [10]. It is an extension of phrase-based Statistical Machine Translation that allows the integration of additional morphological and lexical information, such as lemma, word class, gender and number, at the word level on both the source and target languages. Factoring the English sentence is a basic step in a factored translation system, and the factored translation model is one way of representing morphological knowledge explicitly in Statistical Machine Translation. The factors considered in preprocessing English, with their descriptions, are shown in Table 4.4. Here, word refers to the surface word, lemma represents the dictionary or root word, word class represents the word-class category, and morphology represents a compound tag which contains morphological information and/or function words; in some cases the morphology tag also contains dependency relations and/or PNG information. For instance, the English sentence "I bought vegetables to my home" is factored into the linguistic factors shown in Table 4.5, and the factored representation of the sentence is shown in Table 4.6.

Table 4.4 Description of Factors in English Word

FACTORS      DESCRIPTION                                        EXAMPLE
Word         Surface words or word forms                        coming, went, beautiful, eyes
Lemma        Root word or dictionary word                       play, run, home, pen
Word Class   Minimized POS tag                                  N, V, ADJ, ADV
Morphology   POS tag, dependency information, function words,   VBD, NNS, nsubj, pobj, to,
             subject information, auxiliary and modal verbs     has been, will

Table 4.5 Example of English Word Factors

WORD         LEMMA       POSTAG   W-C   DEPLABEL
I            i           PRP      PRP   nsubj
bought       buy         VBD      V     root
vegetables   vegetable   NNS      N     dobj
to           to          TO       PRE   prep
my           my          PRP$     PRP   poss
home         home        NN       N     pobj

Table 4.6 Factored Representation of English Language Sentence

WORD         FACTORS (1)
I            i|PRP|PRP_nsubj
bought       buy|V|VBD
vegetables   vegetable|N|NNS_dobj
to           to|PRE|TO_prep
my           my|PRP|PRP$_poss
home         home|N|NN_pobj

(1) The Stanford Parser 1.6.5 tool is used for factorization.

English factorization is one of the important pre-processing steps. Factorization splits each surface word into linguistic factors and integrates them as a vector.


Instead of mapping surface words in translation, factored models map the linguistic units (factors) of the language pair. The Stanford parser is used for factorizing English sentences: from the parser output, linguistic information such as the lemma, part-of-speech tags, syntactic information and dependency information is retrieved and integrated as factors on the surface word.
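As a small illustration of this representation (a sketch over an assumed token schema, not the actual pipeline code), parser output can be packed into factored tokens as follows:

# Pack parser output into factored tokens of the form
# word|lemma|wordclass|morphology (token schema assumed for illustration).
def to_factored(tokens):
    return " ".join("{word}|{lemma}|{wc}|{morph}".format(**t) for t in tokens)

sentence = [
    {"word": "I", "lemma": "i", "wc": "PRP", "morph": "PRP_nsubj"},
    {"word": "bought", "lemma": "buy", "wc": "V", "morph": "VBD"},
    {"word": "home", "lemma": "home", "wc": "N", "morph": "NN_pobj"},
]
print(to_factored(sentence))
# prints: I|i|PRP|PRP_nsubj bought|buy|V|VBD home|home|N|NN_pobj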

4.2.3 Compounding English Language Sentence

A baseline Statistical Machine Translation (SMT) system only considers surface word forms and does not use linguistic information. Translating into a target surface word form depends not only on the source word form but also on additional morpho-syntactic information. While translating from a morphologically simpler language into a morphologically rich one, it is very hard to retrieve the required morphological information from the source language sentence, yet this morphological information is essential for producing the target word form. The preprocessing phase called compounding is used to retrieve the required linguistic information from the source sentence. Morphologically rich languages have a large number of surface forms in the lexicon to compensate for free word order, and this large number of Tamil word forms is very difficult to generate from English words. Compounding is defined as adding additional morphological information to the morphological factor of the source (English) words. The additional morphological information includes subject information, dependency relations, auxiliary verbs, modal verbs and a few function words, and it is based on the morphological structure of Tamil. In the compounding phase, dependency relations are used to identify the function words in the English factored corpora. During integration, a few function words are deleted from the factored sentence and attached as a morphological factor to the corresponding content word. In Tamil, function words are not available as separate words but are fused with the corresponding content words; so, instead of forcing the sentences into a similar representation, function words are removed from the English sentence. This process reduces the length of the English sentences. Like function words, auxiliary verbs and modal verbs are also identified and attached to the morphological factor of the head word of the source sentence.


Now the morphological factor representation of the English sentence is similar to that of the Tamil sentence. This compounding step indirectly integrates dependency information into the source language factors.

4.2.3.1 Morphological Comparison between English and Tamil

Morphology is the study of the structure of words in a language. Words are made up of morphemes, the smallest meaningful units in a word. For example, "pens" is made of "pen" + "s", where "s" is a plural marker, and "talked" is made of "talk" + "ed", where "ed" represents past tense. English is a morphologically simple language, whereas Tamil is morphologically rich. Morphology is one of the significant factors in improving the performance of a machine translation system. Morphological changes in English verbs are due to tense, and in nouns due to number. Each root verb (or lemma) in English is inflected into four or five word forms; word forms of English are shown in Table 4.7. For example, the English verb 'give' has four inflected forms. In English, verb morphology based on tense is sometimes expressed through auxiliary verbs.

Table 4.7 Word Forms of English

Word Class   Lemma or Root Word   Word-forms
Noun         cat                  cats
Verb         give                 gave, giving, given, gives
Adjective    green                greener, greenest
Adverb       soon                 sooner, soonest

English words are divided into two types: content words, which carry meaning by referring to objects and actions, and function words, which express the relationships between content words. Tables 4.8 and 4.9 show the content words and function words of English. The relationship between content words can also be encoded in morphology. Content words are also called open-class words and function words closed-class words. The part-of-speech categories of content words are verbs, nouns, adjectives and adverbs; for function words the categories are prepositions, conjunctions and determiners.


Generally, languages differ not only in word order but also in how they encode the relationships between words. English is a strictly fixed word-order language that makes heavy use of function words and little use of morphology. Tamil has a rich morphological structure and heavy usage of content words, but is a free word-order language.

Table 4.8 Content Words of English

Content Words    Examples
Nouns            John, room, answer, Kumar
Adjectives       happy, new, large, grey
Full verbs       search, grow, hold, play
Adverbs          really, completely, very, also, enough
Numerals         one, thousand, first
Yes/No answers   yes, no (as answers)

Table 4.9 Function Words of English

Function Words    Examples
Prepositions      of, at, in, without, between
Pronouns          he, they, anybody, it, one
Determiners       the, a, that, my, more, much, either, neither
Conjunctions      and, that, when, while, although, or
Modal verbs       can, must, will, should, ought, need, used
Auxiliary verbs   be (is, am, are), have, got, do
Particles         no, not, nor, as


Because of function words, the average number of words in an English sentence is higher than in the equivalent Tamil sentence. Some English function words do not exist as separate words in Tamil because they are fused with Tamil content words. English has more function words than content words, while Tamil has more content words, with the translations of English function words folded into Tamil content words. When translating from English to Tamil, no separate translation is available for the English function words, which leads to alignment problems. Table 4.10 shows the various English word forms based on tense. In Tamil, verbs are morphologically inflected for tense and PNG (Person-Number-Gender) markers, and nouns are inflected for number and case. Each Tamil verb root can be inflected into more than ten thousand surface word forms because of the agglutinative nature of the language. This morphological richness of Tamil leads to the sparse-data problem in Statistical Machine Translation. Examples of Tamil word forms based on tense are given in Table 4.11.

Table 4.10 English Word Forms based on Tenses

Root Word: play

Tense                Word Form
Simple Present       play
Present Continuous   is playing
Present Perfect      have played
Past                 played
Past Perfect         had played
Future               will play
Future Perfect       will have played

Table 4.11 Tamil Word Forms based on Tenses

Root Word: விளையாடு (vilayAdu)

Tense         Word Form
Present+1S    விளையாடுகின்றேன் (vilayAdu-kinR-En)
Present+3SN   விளையாடுகின்றது (vilayAdu-kinR-athu)
Present+3PN   விளையாடுகின்றன (vilayAdu-kinR-ana)
Past+1S       விளையாடினேன் (vilayAd-in-En)
Past+3SM      விளையாடினான் (vilayAd-in-An)
Future+2S     விளையாடுவாய் (vilayAdu-v-Ay)
Future+3SF    விளையாடுவாள் (vilayAdu-v-AL)
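The agglutination pattern in Table 4.11 can be sketched as simple segment concatenation (a toy illustration over the transliterated segments; real Tamil morphology also involves sandhi changes not modeled here):

# Toy model of Tamil verb agglutination: root + tense marker + PNG suffix.
root = "vilayAdu"                          # 'play'
tense_markers = {"present": "kinR", "future": "v"}
png_suffixes = {"1S": "En", "2S": "Ay", "3SF": "AL"}

def word_form(tense, png):
    return "-".join([root, tense_markers[tense], png_suffixes[png]])

print(word_form("present", "1S"))          # vilayAdu-kinR-En
print(word_form("future", "3SF"))          # vilayAdu-v-AL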

4.2.3.2 Compounding Methodology for English Sentence

The morphological difference between English and Tamil makes Statistical Machine Translation a complex task. English mostly conveys the relationship between words using function words or word position, whereas Tamil expresses it through morphological variation of the words. Tamil therefore has a much larger vocabulary of surface forms, which leads to the sparse-data problem in the English to Tamil SMT system. To cover all Tamil surface forms, a very large parallel training corpus would be required, and it is very difficult to create or collect such a corpus because Tamil is a less-resourced language. Instead of covering all surface forms, a new method is needed to handle all word forms with a limited amount of data. Consider the example English sentence in Figure 4.5, "the cat went to the room". In this sentence, the word "to" is a function word which has no separate translation unit in Tamil; its translation is fused into the Tamil content word "அறை" (aRai).


Since the Statistical Machine Translation system uses phrase-based models, it will treat "to the room" as a single phrase aligned to the Tamil word "அறைக்கு" (aRaikku), so there is no difficulty in alignment and decoding. The problem arises for a new sentence which contains a phrase like "to the X" (e.g. "to the school"), where X is any noun. Even if X is available in the bilingual corpus, the system cannot decode a correct translation for "to the X", because phrase-based SMT treats "to the X" as an unknown phrase even when X itself is aligned correctly. Function words should therefore be treated separately before the SMT system, and here they are taken care of by the preprocessing step called compounding. Compounding identifies certain function words and attaches them to the morphological factor of the related content word in the factored model; it retrieves the morphological information of an English content word from dependency relations and function words.

Figure 4.5 English to Tamil Alignment

Compounding also identifies the subject information from the English dependency relations [158]. This subject information is folded into the morphological factor of the English verb and helps to identify the PNG (Person-Number-Gender) marker of the Tamil verb during translation. The PNG marker plays an important role in Tamil morphology because of the subject-verb agreement in Tamil; most Tamil verb forms are generated using this marker. English auxiliary verbs are also identified from the dependency information, then removed and folded into the morphological factor of the head word/verb. Figure 4.6 shows the block diagram of compounding an English sentence. The English sentence is factorized and then subjected to the compounding phase; a word in the factorized sentence includes part-of-speech and morphological information as factors. Compounding takes the dependency relations from the Stanford parser and produces the compounded sentence using pre-defined linguistic rules. These rules are developed based on the morphological differences between English and Tamil, and they identify the transformations from English morphological factors to Tamil morphological factors. Sample compounding rules, developed on the basis of the dependency information, are shown in Table 4.12.

[Figure 4.6: English Sentence → Stanford Parser (Factorization; Dependency Information) → Dependency Rules → Deletion of Function Words; Update Morph Factor of Head Word; Update Morph Factor of Child Word → Compounded English Sentence]

Figure 4.6 Block Diagram for Compounding

Another important advantage of compounding is that it also solves the difficulty of handling copula constructions in English sentences.

The copula is a special type of verb in English, while in other languages other parts of speech serve the role of the copula. A copula links the subject of a sentence with a predicate; it is also referred to as a linking verb because it does not describe an action. Examples of copula sentences are given below.

1. Sharmi is a doctor.
2. Mani and Arthi are lawyers.

Table 4.12 Compounding Rules for English Sentence

Dependency     Removed      Morphological Feature Added
Information    Word         Head Word      Child Word
aux, auxpass   Child        +Child Word    -
dobj           -            -              +ACC
pobj           Head word    -              +Head word
poss           Child Word   +poss          -
nsubj          -            +Subject       -

Alignment is one of the important factors in improving translation performance. Compounding helps to improve the quality of word alignment and reduces the length of the English sentences; Table 4.13 shows the average word count per sentence. It also indirectly helps target word generation. In this thesis, factored SMT is used only for mapping the linguistic factors between English and Tamil. After compounding, the morphological factors of English and Tamil words are much more similar, so it becomes easier for the SMT system to align and decode morphological information. Tables 4.14 and 4.15 show a factored and a compounded sentence respectively; the reordered sentence is taken as input for compounding.


Table 4.13 Average Words per Sentence

Method     Sentences   Words   Average Words per Sentence
SMT/FSMT   8300        49632   5.98
C-FSMT     8300        33711   4.06

Table 4.14 Factored English Sentence

I | i | PN | prn
bought | buy | V | VBD
vegetables | vegetable | N | NNS
to | to | TO | TO
my | my | PN | PRP$
home | home | N | NN

Table 4.15 Compounded English Sentence

I | i | PN | prn
bought | buy | V | VBD_i
vegetables | vegetable | N | NNS
to | to | TO | TO
my | my | PN | PRP$
home | home | N | NN_to
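The following minimal Python sketch (an illustration under assumed rule semantics, not the thesis implementation) applies a subset of the compounding rules of Table 4.12 to the factored tokens above; for simplicity the function word is deleted immediately, although in the actual pipeline deletion happens during integration:

# tokens: {index: [word, lemma, word class, morphology]}, indices start at 1.
# deps: list of (relation, head_index, child_index) typed dependencies.
def compound(tokens, deps):
    removed = set()
    for rel, head, child in deps:
        if rel in ("aux", "auxpass"):          # fold auxiliary into head verb
            tokens[head][3] += "_" + tokens[child][0].lower()
            removed.add(child)
        elif rel == "nsubj":                   # fold subject into head verb
            tokens[head][3] += "_" + tokens[child][0].lower()
        elif rel == "pobj":                    # fold preposition into its object
            tokens[child][3] += "_" + tokens[head][0].lower()
            removed.add(head)
        elif rel == "dobj":                    # mark accusative on direct object
            tokens[child][3] += "_ACC"
    return [" | ".join(tokens[i]) for i in sorted(tokens) if i not in removed]

tokens = {1: ["I", "i", "PN", "prn"], 2: ["bought", "buy", "V", "VBD"],
          3: ["vegetables", "vegetable", "N", "NNS"], 4: ["to", "to", "TO", "TO"],
          5: ["my", "my", "PN", "PRP$"], 6: ["home", "home", "N", "NN"]}
deps = [("nsubj", 2, 1), ("pobj", 4, 6)]
print("\n".join(compound(tokens, deps)))
# bought becomes "bought | buy | V | VBD_i"; home becomes "home | home | N | NN_to"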

4.2.4 Integrating Reordering and Compounding

Integration is the final stage in source-side preprocessing. Here the preprocessed English sentence is obtained from the reordering and compounding stages. Reordering takes the raw sentence and reorders it according to the predefined rules; compounding takes the factored sentence and alters the morphological factors of the content words using the compounding rules. The function words identified in the compounding stage are partly removed during integration. Figure 4.7 shows the integration of the preprocessing stages, and Table 4.16 shows examples of preprocessed sentences.


Original English sentence:
I(0) bought(1) vegetables(2) to(3) my(4) home(5).

Reordered English sentence:
I(0) my(4) home(5) to(3) vegetables(2) bought(1).

Factored English sentence:
I|i|PN|prn bought|buy|V|VBD vegetables|vegetable|N|NNS to|to|TO|TO my|my|PN|PRP$ home|home|N|NN

Compounded English sentence:
I|i|PN|prn bought|buy|V|VBD_i vegetables|vegetable|N|NNS to|to|TO|TO my|my|PN|PRP$ home|home|N|NN_to

Preprocessed English sentence:
I|i|PN|prn_i my|my|PN|PRP$ home|home|N|NN_to vegetables|vegetable|N|NNS bought|buy|V|VBD_1S

[Figure 4.7: Reordered Sentence + Factored Sentence → Compounded Sentence + Function Word Index → Integration → Preprocessed English Sentence]

Figure 4.7 Integration Process
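A minimal sketch of the integration step (illustrative only; the reordering indices and the deletion set are assumed to be supplied by the earlier stages):

# Apply the reordering indices to the compounded factored tokens and
# drop the function words marked for deletion during compounding.
def integrate(factored_tokens, order, drop):
    return " ".join(factored_tokens[i] for i in order if i not in drop)

tokens = ["I|i|PN|prn", "bought|buy|V|VBD_i", "vegetables|vegetable|N|NNS",
          "to|to|TO|TO", "my|my|PN|PRP$", "home|home|N|NN_to"]
# Reordered source order: I my home to vegetables bought; "to" (index 3) is dropped.
print(integrate(tokens, order=[0, 4, 5, 3, 2, 1], drop={3}))
# prints: I|i|PN|prn my|my|PN|PRP$ home|home|N|NN_to vegetables|vegetable|N|NNS bought|buy|V|VBD_i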


Table 4.16 Preprocessed English Sentences

Original Sentence: She may not come here
Pre-processed: she|she|PN|prp_she here|here|AD|adv come|come|V|vb_3SF_may_not

Original Sentence: I went to school
Pre-processed: I|i|PN|prp_i school|school|N|nn_to went|go|V|vb.past_1S

Original Sentence: I gave a book to him
Pre-processed: I|i|PN|prp_i a|a|AR|det book|book|N|nn_ACC him|him|PN|prp_to gave|give|V|vb.past_1S

Original Sentence: The cat was killing the rat
Pre-processed: the|the|AR|det cat|cat|N|nn the|the|AR|det rat|rat|N|nn_ACC killing|kill|V|vb.prog_3SN_was

4.3 SUMMARY

This chapter presented linguistic preprocessing of English sentences for better matching with Tamil sentences. The preprocessing stages include reordering, factoring and compounding, and a final integration process combines the stages. The chapter also discussed the effects of the syntactic and morphological variance between English and Tamil. It was shown that the reordering and compounding rules produce significant gains in the factored translation system. Reordering plays an especially important role for language pairs with disparate sentence structure: the difference in word order between two languages is one of the most significant sources of errors in machine translation, and while phrase-based MT systems do very well at reordering inside short windows of words, long-distance reordering remains challenging. Translation accuracy can be significantly improved if reordering is done prior to translation. The reordering rules developed here are valid only for the English-Tamil pair, although they could be adapted to other Dravidian languages with small modifications. In future work, automatic rule creation for reordering from bilingual corpora should improve accuracy and make the approach applicable to any language pair.


Compounding and factoring are used in order to reduce the amount of English-Tamil bilingual data required. Preprocessing also reduces the number of words in the English sentence. The accuracy of preprocessing depends heavily on the quality of the parser. Several studies have shown that preprocessing is an effective way to obtain word order and morphological information that match the target language. Moreover, this preprocessing approach is generally applicable to other language pairs which differ in word order and morphology. This research has shown that adding linguistic knowledge in the preprocessing of training data can lead to remarkable improvements in translation performance.


CHAPTER 5
PART OF SPEECH TAGGER FOR TAMIL

5.1 GENERAL

Knowledge of the language pair has been shown to improve translation performance, and preprocessing in an SMT system makes the two languages of a pair more similar. Koehn and Hoang (2007) [10] developed a factored translation framework for statistical translation models to tightly integrate linguistic information. It is an extension of phrase-based statistical machine translation that allows the integration of additional morphological and lexical information, such as lemma, word class, gender and number, at the word level on both the source and target languages. Preprocessing methods are used to convert Tamil sentences into factored Tamil sentences. The preprocessing module for Tamil sentences includes two stages: POS tagging and morphological analysis. The first step in preprocessing a Tamil sentence is to retrieve the part-of-speech information of every word; this information is included in the factors of the surface word. This chapter explains the development of the Tamil POS tagger system. In the next stage, Tamil morphological analysis is used to retrieve the lemma and morphological information, which are also included in the factors of the surface word. The next chapter (Chapter 6) explains the implementation details of the Tamil morphological analyzer.

5.1.1 Part of Speech Tagging

Part-of-Speech (POS) tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a sentence. It is analogous to tokenization for computer languages, and POS tagging is therefore considered an important step in speech recognition, natural language parsing, morphological parsing, information retrieval and machine translation. Generally, a word in a language carries both a grammatical category and grammatical features. Tamil, being a morphologically rich language, inflects with many grammatical features, which makes a POS tagger system complex. Here, a POS tagger has been developed based only on grammatical categories; a morphological analyzer has additionally been developed to handle grammatical features.

An automatic part-of-speech tagger can help in building automatic word-sense disambiguation algorithms. Parts of speech are very often used for shallow parsing of texts, or for finding noun and other phrases for information extraction applications. Corpora that have been marked for part of speech are very useful for linguistic research, for example to find instances or frequencies of particular words or sentence constructions in large corpora. Apart from these, many Natural Language Processing (NLP) activities such as summarization, document classification, Natural Language Understanding (NLU) and Question Answering (QA) systems depend on part-of-speech tagging. Words are divided into different classes called parts of speech (POS), word classes, morphological classes or lexical tags. In traditional grammar there are only a few parts of speech (noun, verb, adjective, adverb, etc.), whereas many recent models have a much larger number of word classes (POS tags). Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up the words in a text as corresponding to a particular part of speech, based on both its definition and its context. Parts of speech can be divided into two broad super-categories:

• CLOSED CLASS types
• OPEN CLASS types

Closed classes are those that have a relatively fixed membership. For example, prepositions are a closed class because there is a fixed set of them in English; new prepositions are rarely coined. By contrast, nouns and verbs are open classes because new nouns and verbs are continually coined or borrowed from other languages. There are four major open classes that occur in the languages of the world: nouns, verbs, adjectives, and adverbs. It turns out that Tamil and English have all four of these, although not every other language does. POS tagging means assigning grammatical classes, i.e. suitable parts-of-speech tags, to each word in a natural language sentence. Assigning a POS tag to each word of an un-annotated text by hand is a laborious and time-consuming process, which has led to the development of various approaches to automate POS tagging. An automatic POS tagger takes a sentence as input, assigns a POS tag to each word in the sentence, and produces the tagged text as output.

Tags are also applied to punctuation markers; thus tagging for natural language is the same process as tokenization for computer languages. The input to a tagging algorithm is a string of words and a specified tagset; the output is a single best tag for each word. For example, in English:

Take/VB that/DT book/NN (tagged using the Penn Treebank tagset)
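As an aside, an off-the-shelf statistical tagger reproduces this analysis. The sketch below uses NLTK (assuming the tokenizer and tagger resources have been downloaded); it is only an illustration, not the tagger developed in this thesis:

import nltk

# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
tokens = nltk.word_tokenize("Take that book")
print(nltk.pos_tag(tokens))
# typical output: [('Take', 'VB'), ('that', 'DT'), ('book', 'NN')]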

Even in this simple sentence, automatically assigning a tag to each word is not trivial. For example, the word book is ambiguous: it has more than one possible usage and part of speech. It can be a verb (as in book that bus or to book the suspect) or a noun (as in hand me that book, or a book of matches). Similarly, that can be a determiner (as in Does that flight serve dinner) or a complementizer (as in I thought that your flight was earlier). As Tamil is a morphologically rich language, a word may have many grammatical categories, which leads to ambiguity. For example, consider the following sentence and its corresponding POS tags:

Tamil Example:
கோவிலில்/NN ஆறு/CRD அடி/NN உயரமான/ADJ மணி/NN உள்ளது/VF
(kOvilil ARu adi uyaramAna maNi uLLathu)
(Tagged using the AMRITA tagset)

Here "adi" can be tagged as noun (NN) or finite verb (VF), "ARu" can be tagged as noun (NN) or cardinal (CRD), and "maNi" can be tagged as common noun (NN) or proper noun (NNP). Considering the syntax and the context of the sentence, the word "adi" should be tagged as a noun (NN). The problem of automatic POS tagging is to resolve such ambiguities by choosing the proper tag for the context; part-of-speech tagging is thus a disambiguation task.


Another important point which was discussed and agreed upon was that POS tagging is NOT a replacement for a morphological analyzer. A 'word' in a text carries the following linguistic knowledge:

• grammatical category, and
• grammatical features such as gender, number and person.

The POS tag should be based on the 'category' of the word; the features can be acquired from the morphological analyzer.

5.1.2 Tamil POS Tagging

Words can be classified under various part-of-speech classes based on the role they play in the sentence. The traditional Tamil grammarian Tholkappiar classified Tamil word categories into four major classes:

• பெயர் peyar (noun)
• வினை vinai (verb)
• இடை idai (part of speech which modifies the relationships between verbs and nouns)
• உரி uri (word that further qualifies a noun or verb)

Examining the grammatical properties of words in modern Tamil, Thomas Lehman (1983) proposed eight POS categories [134]. The following are the major POS classes in Tamil:

1. Nouns
2. Verbs
3. Adjectives
4. Adverbs
5. Determiners
6. Postpositions
7. Conjunctions
8. Quantifiers


Other POS categories for Tamil

Apart from nouns and verbs, the other POS categories that are "open class" are the adverbs and adjectives. Most adjectives and adverbs, in their root form, can be placed in the lexicon, but there are adjectives and adverbs that are derived from noun and verb stems. The morphotactics of adjectives derived from noun roots and verb stems are as follows.

Examples:
noun_root + adjective_suffix: uyaram + Ana = uyaramAna (உயரம் + ஆன = உயரமான)
verb_stem + relative_participle: cey + tha = ceytha (செய் + த = செய்த)

The morphotactics of adverbs derived from noun roots and verb stems are:

noun_root + adverb_suffix: uyaram + Aka = uyaramAka (உயரம் + ஆக = உயரமாக)
verb_stem + adverbial_participle: cey + thu = ceythu (செய் + து = செய்து)

There are a number of non-finite verb structures in Tamil. Apart from participle forms, they are grammatically classified into structures such as infinitive, conditional, etc.

Example:
verb_stem + infinitive_marker: paRa + kka = paRakka (to fly) (பற + க்க = பறக்க)


Example:
verb_stem + conditional_suffix: vA + wth + Al = vawthAl (if one comes) (வா + ந்த் + ஆல் = வந்தால்)

There are other categories like conjunctions, complementizers, etc. Some of these may be derived forms, but there are not many, so they can be listed in the lexicon. Other categories that need to be listed in the lexicon as roots are the postpositions, which are "closed class". This is because they can occur as words in isolation even though they are semantically bonded to the noun or verb preceding them.

5.2 COMPLEXITY IN TAMIL POS TAGGING

As Tamil is an agglutinative language, nouns are inflected for number and case, and verbs are inflected with tense, person, number and gender suffixes, among others. Verbs can be adjectivalized and adverbialized, and verbs and adjectives can be nominalized by means of certain nominalizers. Adjectives and adverbs themselves do not inflect. Many postpositions in Tamil [159] come from nominal and verbal sources, so one often has to depend on the syntactic function or context to decide whether a word is a noun, adjective, adverb or postposition. This is what makes Tamil POS tagging complex.

5.2.1 Root Ambiguity

The root word itself can be ambiguous: it can have more than one sense, and sometimes roots belong to more than one POS category. Though the POS can often be disambiguated using contextual information such as co-occurring morphemes, this is not always possible. These issues must be taken care of when POS taggers are built for Tamil. For example, Tamil root words like adi, padi, isai, mudi and kudi can take both the noun and the verb category, which leads to the root ambiguity problem in POS tagging.

5.2.2 Noun Complexity

Nouns are words which denote a person, place, thing, time, etc. In Tamil, nouns are inflected for number and case at the morphological level; at the phonological level, four types of suffixes can occur with a noun stem.


Morphological level inflection:

Noun (+ number) (+ case)
Example: பூக்களை pUk-kaL-ai (flower + plural + accusative case suffix)

Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Example: பூக்களினால் pUk-kaL-in-Al (flower + plural + euphonic suffix + case suffix)

Nouns need to be annotated as common noun, compound noun, proper noun, compound proper noun, pronoun, cardinal and ordinal; pronouns need to be further annotated as personal pronouns. Confusion can arise between common noun and compound noun, and also between proper noun and compound proper noun. A common noun can also occur as part of a compound noun, for example:

ஊராட்சி UrAdci தலைவர் thalaivar

When UrAdci and thalaivar come together, the combination can be a compound noun (NNC), but when UrAdci and thalaivar occur separately in a sentence they should be tagged as common nouns (NN). Similar complexity occurs with the proper noun and compound proper noun (NNPC). Moreover, confusion occurs between noun and adverb, and between pronoun and emphasis, at the syntactic level.

5.2.3 Verb Complexity

Verbal forms are complex in Tamil. A finite verb shows the following morphological structure:

Verb stem + Tense + Person-Number + Gender
Example: நட + ந்த் + ஏன் (wada + wth + En) = நடந்தேன் 'I walked'

A number of non-finite forms are possible: adverbial forms, adjectival forms, infinitive forms and conditionals.

Verb stem + adverbial participle
Example: cey + thu = ceythu (செய் + து = செய்து) 'having done'

Verb stem + relative participle
Example: cey + tha = ceytha (செய் + த = செய்த) 'who did'

Verb stem + infinitive suffix
Example: azu + a = aza (அழு + அ = அழ) 'to weep'

Verb stem + conditional suffix
Example: kEL + d + Al = kEddAl (கேள் + ட் + ஆல் = கேட்டால்) 'if asked'

A distinction needs to be made between a main verb followed by another main verb and a main verb followed by an auxiliary verb. A main verb followed by an auxiliary verb must be interpreted together, whereas a main verb followed by another main verb must be interpreted separately. This leads to the functional ambiguities described below.

Functional ambiguity in the adverbial form. The morphological structure of the adverbial verb is verb root + adverbial participle:

Example: sey + thu = seythu (செய் + து = செய்து) 'having done'
vawthu sAppidduviddu pO 'having come and having eaten, went'

Functional ambiguity in the adjectival form. The adjectival forms differ by tense markings (verb stem + tense + adjectivalizer):

Example: vandta 'x who came', varukiRa 'x who comes', varum 'x who will come'

The adjectival form allows several interpretations, as in the following example:

sAppida ilai 'the leaf which is eaten by x' / 'the leaf on which x had his food and ate'

The um-suffixed adjectival form clashes with other homophonous forms, which leads to ambiguity:

varum paiyan 'the boy who will come'
varum 'it will come'

Functional ambiguity in the infinitival form (verb_stem + infinitive suffix):

Example: azu + a = aza (அழு + அ = அழ)
vara-v-iru 'going to come'
vara-k-kuuTaatu 'should not come'
vara-s-sol 'ask to come'

5.2.4 Adverb Complexity

A number of adjectival and adverbial forms of verbs are lexicalized as adjectives and adverbs respectively, and clash semantically with their corresponding sentential adjectival and adverbial forms, creating ambiguity in POS tagging. Adverbs also need to be distinguished based on their source category. Many adverbs are derived by suffixing Aka to nouns in Tamil, and a functional clash can be seen between noun and adverb in such Aka-suffixed forms. This type of clash is seen in other Dravidian languages too.

அவள் அழகாக இருக்கிறாள்
avaL azakAka irukkiRAL (she beauty_ADV be_PRE_she) 'she is beautiful'

5.2.5 Postposition Complexity

Postpositions in Tamil come from various categories such as verbal, nominal and adverbial. Many a time the demarcation line between verb/noun/adverb and postposition is slim, leading to ambiguity. Some postpositions are simple and some are compound, and postpositions are conditioned by the case of the nouns they follow, so simply tagging one form as a postposition can be misleading. There are postpositions which come after nouns and also after verbs, which makes the postposition ambiguous (spatial vs. temporal):

pinnAl 'behind', as in vIddukkup pinnAl 'behind the house'
pinnAl 'after', as in avanukkup pinnAl vawthAn 'he came after him'

5.3 PART OF SPEECH TAGSET DEVELOPMENT

For developing a POS-tagged corpus, it is necessary to define the tagset (POS tagset) used in that corpus; the collection of all possible tags is called the tagset. Tagsets differ from language to language. After studying the available tagsets for Tamil and other languages, a customized tagset named the AMRITA tagset was developed. The guidelines from AnnCorra, IIIT Hyderabad [160] and EAGLES (1996) were also considered while developing the AMRITA tagset. The guidelines followed are given below.

1. The tags should be simple.
2. Simplicity should be maintained for ease of learning and consistency in annotation.
3. POS tagging is not a replacement for a morphological analyzer.
4. A 'word' in a text carries a grammatical category and grammatical features such as gender, number and person. The POS tag should be based on the 'category' of the word; the features can be acquired from the morphological analyzer.


Another point considered while deciding the tags was whether to come up with a totally new tagset or to take an existing standard tagset as a reference and modify it according to the objectives of the new tagger. The latter option is often better, because tag names assigned by an existing tagger may already be familiar to users and are thus easier to adopt for a new language than a totally new set; it saves the time of getting familiar with new tags before working with them.

5.3.1 Available POS Tagsets for Tamil

Tagset by AUKBC. The AUKBC Research Centre at Chennai developed a tagset with the help of eminent linguists from Tamil University, Tanjore. This is an exhaustive tagset which covers almost all possible grammatical and lexical constituents; it contains 68 tags [161].

Vasu Renganathan's Tagset. Tagtamil by Vasu Renganathan is based on a lexical-phonological approach. Tagtamil handles the morphotactics of morphological processing of verbs by using an index method, and performs both tagging and generation [162].

Tagset by IIIT, Hyderabad. A POS tagset for Indian languages was developed by IIIT, Hyderabad. Its tags are based on coarse linguistic information, with the idea of expanding to finer distinctions if required. The annotation standard for POS tagging for Indian languages includes 26 tags [163].

CIIL Tagset for Tamil. This tagset was developed by CIIL (Central Institute of Indian Languages), Mysore. It contains 71 tags for Tamil. As the tagset considers noun and verb inflection, the number of tags increased: it has 30 noun forms, including pronoun categories, and 25 verb forms, including participle forms [164].


Ganesan's POS Tagset. Ganesan prepared a POS tagger for Tamil which works well on the CIIL corpus; its efficiency on other corpora remains to be tested. He has a rich tagset for Tamil. He tagged a portion of the CIIL corpus using a dictionary as well as a morphological analyzer, corrected it manually, and used it to train tagging of the rest of the corpus; the tags are added morpheme by morpheme [165].

Selvam's POS Tagset. The tagset developed by Selvam considers morphological inflections on nouns for the various cases, such as accusative, dative, instrumental, sociative, locative, ablative, benefactive, genitive and vocative, as well as clitics, and morphological inflections on verbs for tense, etc. [166].

5.3.2 AMRITA POS Tagset

The main drawback of the majority of tagsets used for Tamil is that they take the verb and noun inflections into account for tagging. Hence, at tagging time, every inflected word in the corpus has to be split into morphemes, which is a tough and time-consuming process. At the POS level, one needs to determine only the word's grammatical category, which can be done with a limited number of tags; the inflectional forms can be taken care of by the morphological analyzer. So there is no need for a large number of tags. Moreover, a large number of tags leads to more complexity, which in turn reduces tagging accuracy. Considering the complexity of Tamil POS tagging and after referring to various tagsets, a customized tagset was developed (the AMRITA POS tagset). This tagset, used for the present research work, contains 32 tags and does not consider inflections. The 32 tags are listed in Table 5.1. In the AMRITA POS tagset, compound tags for common nouns (NNC) and proper nouns (NNPC) are used, and the tag VBG is used for verbal nouns and participle nouns. These 32 POS tags are used for the POS tagger and chunker; for the morphological analyzer, they were further simplified and reduced to 10 tags.


Table 5.1 AMRITA POS Tagset

5.4 DEVELOPMENT OF TAMIL POS CORPORA FOR PREPROCESSING

Corpus linguistics seeks to further the understanding of language through the analysis of large quantities of naturally occurring data. Text corpora are used in a number of different ways. Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description. Corpora have been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems [167]; they have been used for studying child language development; and speech corpora play a vital role in the specification, design and implementation of telephonic communication and broadcast media. There is a long tradition of corpus-linguistic studies in Europe. The need for a corpus for a language is multifarious: from the preparation of dictionaries and lexicons to machine translation, the corpus has become an inevitable resource for the technological development of languages. A corpus is a large body of text incorporating various types of textual material, including newspapers, weeklies, fiction, scientific writing, literary writing, and so on; it represents all the styles of a language.


A corpus must be very large, as it is used for many language applications, such as the preparation of lexicons of different sizes, purposes and types, NLP tools, machine translation programs, and so on.

5.4.1 Untagged and Tagged Corpus

An untagged or un-annotated corpus provides only limited information to its users. A corpus can be augmented with additional information by labeling morphemes, words, phrases and sentences with their grammatical values; such information helps the user to retrieve information selectively and easily. Figure 5.1 presents an example of an untagged corpus. The frequency of a lemma is useful in the analysis of a corpus: comparing the frequency of a particular word with other context words shows whether the word is common or rare. Frequencies are relatively reliable for the most common words in a corpus; but to analyze the senses and association patterns of words, a very large number of occurrences is needed, drawn from a very large corpus containing many different texts and a wide range of topics, so that word frequencies are less influenced by individual texts. Frequency lists based on an untagged corpus are of limited usefulness because they do not indicate which grammatical uses are common or rare. A tagged corpus is thus an important dataset for NLP applications; Figure 5.2 shows an example of a tagged corpus.

Figure 5.1 Example of Untagged Corpus

Figure 5.2 Example of Tagged Corpus


5.4.2 Available Corpus for Tamil
Corpora can be distinguished as tagged corpora, parallel corpora and aligned corpora. A tagged corpus is one that is tagged for part-of-speech, morphology, lemma, phrases, etc. A parallel corpus contains texts and their translations in each of the languages involved, allowing wider scope for double-checking translation equivalents. An aligned corpus is a kind of bilingual corpus where text samples of one language and their translations into another language are aligned sentence by sentence, phrase by phrase, word by word, or even character by character.

CIIL Corpus for Tamil
As far as building corpora for the Indian languages is concerned, it was the Central Institute of Indian Languages (CIIL) that took the initiative and started preparing corpora for some of the Indian languages (Tamil, Telugu, Kannada and Malayalam). The Department of Electronics (DOE) financed the corpus-building project. The target was to prepare a corpus of ten million words for each language, but due to a financial crunch and time restrictions the project ended up with just three million words per language. The Tamil corpus, with three million words, was built by CIIL in this way; it is a partially tagged corpus.

AUKBC-RC's Improved Tagged Corpus for Tamil
The AUKBC Research Centre, which has taken up NLP-oriented work for Tamil, has improved upon the CIIL Tamil corpus and tagged it for its MT programs. It has also developed an English-Tamil parallel corpus to promote its goal of preparing an MT tool for English-Tamil translation. A parallel corpus is very useful for training translation models and for building example-based machine translation, and is thus a valuable resource for MT programs.

5.4.3 POS Tagged Corpus Development
A tagged corpus is an immediate requirement for different analyses in the field of Natural Language Processing; most language processing work needs a large database of texts providing real, natural, native language of varying types. Corpora can be annotated at various levels: part of speech, phrase/clause level, dependency level, etc. Part-of-speech tagging forms the basic step towards building an annotated corpus. For creating the Tamil part-of-speech tagger, a grammatically tagged corpus is needed, so a tagged corpus of 500k words was built. Sentences from the Dinamani newspaper, Yahoo Tamil news, Tamil short stories, etc. were collected and tagged. POS corpus tagging is done in three stages:

1. Pre-editing
2. Manual tagging
3. Bootstrapping

In pre-editing, the untagged corpus is converted to a format suitable for SVMTool, in order to assign a part-of-speech tag to each word. Because of orthographic similarity, one word may have several possible POS tags. After an initial assignment of possible POS tags, words are manually tagged using the AMRITA tagset. The tagged corpus is trained using the SVMTlearn component; after training, new untagged text is tagged using SVMTagger, and the output of SVMTagger is manually corrected and added to the tagged corpus to increase its size. The corpus development process is elaborated below.

Pre-editing
Tamil text documents were collected from the Dinamani website, Yahoo Tamil, Tamil short stories, etc. (for example, Figure 5.3). The corpus was cleaned using a simple program that removes punctuation (except dots, commas and question marks), and was sententially aligned. The next step is to convert the corpus into column format, because the SVMTool training data must be in column format, i.e. one token (word) per line, sentence by sentence; the column separator is the blank space. A sketch of this pre-editing step is given below.
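As an illustration of the pre-editing step, a minimal sketch in Python is shown here; the cleaning rules follow the description above, while the input file name is hypothetical:

# preedit.py - a sketch: clean raw Tamil text and emit SVMTool column format
# (one token per line, sentence punctuation kept as its own token)
import re

def preedit(raw_text):
    # keep dots, commas and question marks; remove all other punctuation
    cleaned = re.sub(r'[^\w\s.,?]', ' ', raw_text)
    tokens = []
    for token in cleaned.split():
        # detach trailing punctuation so it becomes a separate token
        m = re.match(r'(.+?)([.,?])$', token)
        if m:
            tokens.extend([m.group(1), m.group(2)])
        else:
            tokens.append(token)
    return '\n'.join(tokens)

if __name__ == '__main__':
    with open('dinamani_raw.txt', encoding='utf-8') as f:  # hypothetical file
        print(preedit(f.read()))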

Figure 5.3 Untagged Corpus before Pre-editing


Manual tagging
After pre-editing, the untagged corpus is tokenized into column format (Figure 5.4). In the second stage, the untagged corpus is manually POS-tagged using the AMRITA tagset; initially, 10,000 words were manually tagged. Considerable difficulty was faced in assigning tags during this manual tagging process.

Bootstrapping
After the manual tagging is completed, the tagged corpus is given to the learning component of the training algorithm to generate a model. Using the generated model, the decoder of the training algorithm tags further untagged sentences; the output is a tagged corpus containing some errors. The tags are then corrected manually, and the corrected corpus is added to the training corpus to increase its size. A sketch of this loop is given below.
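The bootstrapping loop can be written schematically as follows (illustrative Python; the SVMTlearn/SVMTagger invocations follow the SVMTool command-line usage described later in this chapter, and the file and model names are placeholders):

# bootstrap.py - a sketch of one round of the corpus-growing loop
import subprocess

def bootstrap_round(untagged_file, out_file):
    # 1. learn a model from the current tagged corpus (config.svmt points at it)
    subprocess.run(['SVMTlearn', 'config.svmt'], check=True)
    # 2. tag a fresh batch of untagged sentences with the learned model
    with open(untagged_file) as fin, open(out_file, 'w') as fout:
        subprocess.run(['SVMTagger', 'TAMIL_MODEL'],
                       stdin=fin, stdout=fout, check=True)
    # 3. out_file is then corrected by hand and appended to the training
    #    corpus, which grows the data available for the next round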

Figure 5.4 Untagged Corpus after Pre-editing

5.4.4 Applications of POS Tagged Corpus
The POS tagged corpus is used in the following tasks:

• Chunking
• Parsing
• Information extraction and retrieval
• Machine translation
• Treebank creation
• Document classification
• Question answering
• Automatic dialogue systems
• Speech processing
• Summarization
• Statistical training of language models
• Machine translation using multilingual corpora
• Text checkers for evaluating spelling and grammar
• Computer lexicography
• Educational applications like Computer-Assisted Language Learning

5.4.5 Details of POS Tagged Corpus Developed
The details of the developed POS tagged corpus are given in the corpus statistics and tag count tables (Tables 5.2 and 5.3).

Table 5.2 Corpus Statistics

No of sentences        45682
No of words            510467
No of distinct words   70689

Table 5.3 Tag Count

S.No   Tags   Counts
1             17400
2             22150
3             6853
4             8488
5             3955
6             12600
7             2315
8             6368
9             43514
10            8
11            2777
12            2289
13            128349
14            74575
15            34594
16            9042
17            1911
18            1773
19            7467
20            399
21            789
22            15559
23            1615
24            922
25            2389
26            273
27            7793
28            9294
29            34888
30            11604
31            17843
32            20671

5.5 DEVELOPMENT OF POS TAGGER USING SVMTOOL

5.5.1 SVMTool
This section presents the SVMTool, a simple, flexible and effective generator of sequential taggers based on Support Vector Machines (SVM), and explains how it is applied to the problem of Part-of-Speech tagging. This SVM-based tagger is robust and flexible for feature modeling (including lexicalization), trains efficiently with almost no parameters to tune, and is able to tag thousands of words per second, which makes it practical for real NLP applications. Regarding accuracy, the SVM-based tagger significantly outperforms the TnT tagger [39] under exactly the same conditions, and achieves a very competitive accuracy of 94.6% for Tamil.

Generally, tagging is required to be as accurate and as efficient as possible, but there is a conflict between these two desirable properties, since obtaining higher accuracy relies on processing more and more information. Depending on the kind of application, a loss in efficiency may be acceptable in order to obtain more precise results; or, the other way around, a slight loss in accuracy may be tolerated in favor of tagging speed. Moreover, some languages, like Tamil, have a richer morphology than others, which leads the tagger to have a large set of feature patterns. The tagset size and ambiguity rate may also vary from language to language and from problem to problem. Besides, if few data are available for training, the proportion of unknown words may be huge; sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape.

Another very interesting property of sequential taggers is portability. Multilingual information is a key ingredient in NLP tasks such as Machine Translation, Information Retrieval, Information Extraction, Question Answering and Word Sense Disambiguation, so having a tagger that works equally well for several languages is crucial for system robustness. For some languages, lexical resources are hard to obtain; therefore, ideally, a tagger should be capable of learning from less annotated data. The SVMTool is intended to comply with all the requirements of modern NLP technology, combining simplicity, flexibility, robustness, portability and efficiency with state-of-the-art accuracy. This is achieved by working in the Support Vector Machines (SVM) learning framework, and by offering NLP researchers a highly customizable sequential tagger generator. The SVMTool, a language-independent sequential tagger, is applied here to Tamil POS tagging.

5.5.2 Features of SVMTool
The following are the features of the SVMTool [10].

Simplicity: The SVMTool is easy to configure and train. The learning is controlled by means of a very simple configuration file, and there are very few parameters to tune. The tagger itself is also very easy to use, accepting standard input and output pipelining; embedded usage is supported through the SVMTool API.

Flexibility: The size and shape of the feature context can be adjusted. Rich features can be defined, including word and POS (tag) n-grams as well as ambiguity classes and "may be's", apart from lexicalized features for unknown words and sentence general information. The behavior at tagging time is also very flexible, allowing different strategies.

Robustness: The overfitting problem is addressed by tuning the C parameter in the soft-margin version of the SVM learning algorithm. A sentence-level analysis may also be performed in order to maximize the sentence score. And, so that unknown words do not penalize system effectiveness too severely, several strategies have been implemented and tested.

Portability: The SVMTool is language independent. It has been successfully applied to English and Spanish without a priori knowledge other than a supervised corpus. Moreover, for languages in which labeled data is a scarce resource, the SVMTool may also learn from unsupervised data, based on the role of non-ambiguous words, with the only additional help of a morpho-syntactic dictionary.

Accuracy: Compared to state-of-the-art POS taggers reported to date, it exhibits a very competitive accuracy. Rich feature sets allow modeling very precisely most of the information involved, and the SVM learning paradigm is highly suitable for working accurately and efficiently with high-dimensionality feature spaces.


Efficiency: Performance at tagging time depends on the feature set size and the tagging scheme selected. For the default (one-pass, left-to-right, greedy) tagging scheme, the tagger exhibits a speed of 1,500 words/second, whereas the C++ version achieves over 10,000 words/second. This has been achieved by working in the primal formulation of the SVM. The use of linear kernels makes the tagger perform more efficiently both at tagging and at learning time, but forces the user to define a richer feature space.

5.5.3 Components of SVMTool
The SVMTool [12] software package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component; different models are learned for different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy most suitable for the purpose. Finally, given a correctly annotated corpus and the corresponding SVMTool-predicted annotation, the SVMTeval component displays the tagging results.

5.5.3.1 SVMTlearn
Given a set of examples (either annotated or unannotated) for training, SVMTlearn trains a set of SVM classifiers. To do so, it makes use of SVM-light, an implementation of Vapnik's Support Vector Machines in C developed by Thorsten Joachims (2002); this implementation has been used to train the models.

Training Data Format
Training data must be in column format, i.e. a token-per-line corpus in a sentence-by-sentence fashion. The column separator is the blank space. The word is the first column of the line, and the tag to predict takes the second column in the output; the rest of the line may contain additional information. An example is given below in Figure 5.5. No special sentence-separator mark is employed; sentence punctuation is used instead, i.e. the [.!?] symbols are taken as unambiguous sentence separators. In this system, the symbols [.?] are used as sentence separators.

Figure 5.5 Training Data Format

Known words features:
C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1) C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1) C(1;0) k(0) k(1) k(2) m(0) m(1) m(2)

Unknown words features:
C(0;-1) C(0;0) C(0;1) C(0;-2,-1) C(0;-1,0) C(0;0,1) C(0;-1,1) C(0;1,2) C(0;-2,-1,0) C(0;-1,0,1) C(0;0,1,2) C(1;-1) C(1;-2,-1) C(1;-1,1) C(1;1,2) C(1;-2,-1,0) C(1;0,1,2) C(1;-1,0) C(1;0,1) C(1;-1,0,1) C(1;0) k(0) k(1) k(2) m(0) m(1) m(2) a(2) a(3) a(4) a(5) a(6) a(7) a(8) a(9) a(10) a(11) a(12) a(13) a(14) a(15) z(2) z(3) z(4) z(5) z(6) z(7) z(8) z(9) z(10) z(11) z(12) z(13) z(14) z(15) ca(1) cz(1) L SN CP CN


Models
Five different kinds of models have been implemented in this tool. Models 0, 1 and 2 differ only in the features they consider. Models 3 and 4 are just like Model 0 with respect to feature extraction, but examples are selected in a different manner. Model 3 is for unsupervised learning: given an unlabeled corpus and a dictionary, at learning time it can only count on knowing the ambiguity class, and POS information only for unambiguous words. Model 4 achieves robustness by simulating unknown words in the learning context at training time.

Model 0: This is the default model. The unseen context remains ambiguous. It was designed with the one-pass on-line tagging scheme in mind, i.e. the tagger goes either left-to-right or right-to-left making decisions, so past decisions feed future ones in the form of POS features. At tagging time, only the parts-of-speech of already disambiguated tokens are considered; for the unseen context, ambiguity classes are considered instead (Table 5.4).

Model 1: This model considers the unseen context already disambiguated in a previous step. It is thus intended for a second pass, revisiting and correcting already tagged text (Table 5.5).

Model 2: This model does not consider POS features at all for the unseen context. It is designed to work at a first pass, requiring Model 1 to review the tagging results in a second pass (Table 5.6).

Model 3: The training is based on the role of unambiguous words. Linear classifiers are trained with examples of unambiguous words extracted from an unannotated corpus, so less POS information is available. The only additional information required is a morpho-syntactic dictionary.

Model 4: Errors caused by unknown words at tagging time penalize the system severely. To reduce this problem, during learning some words are artificially marked as unknown in order to learn a more realistic model. The process is very simple: the corpus is divided into a number of folds, and before samples are extracted from each fold, a dictionary is generated from the remaining folds; words appearing in a fold but not in the rest are thus unknown words to the learner. A sketch of this procedure is given below.
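The following is a minimal sketch of Model 4's fold-based unknown-word simulation (illustrative Python; the fold count and the tokenized-corpus representation are assumptions, not SVMTool's actual code):

# unknown_sim.py - sketch of Model 4's unknown-word simulation
def mark_unknowns(sentences, n_folds=10):
    # sentences: list of sentences, each a list of (word, tag) pairs
    folds = [sentences[i::n_folds] for i in range(n_folds)]
    samples = []
    for i, fold in enumerate(folds):
        # dictionary built from every fold except the current one
        dictionary = {w for j, f in enumerate(folds) if j != i
                        for sent in f for (w, _) in sent}
        for sent in fold:
            # words absent from the dictionary are "unknown" to the learner
            samples.append([(w, t, w not in dictionary) for (w, t) in sent])
    return samples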

Table 5.4 Example of Suitable POS Features for Model 0

Ambiguity classes     a0, a+1, a+2
May_be's              m0, m+1, m+2
POS features          p-3, p-2, p-1
POS bigrams           (p-2, p-1), (p-1, a+1), (a+1, a+2)
POS trigrams          (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
Single characters     ca(1), cz(1)
Prefixes              a(2), a(3), a(4)
Suffixes              z(2), z(3), z(4)
Lexicalized features  SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info         punctuation ('.', '?', '!')

Table 5.5 Example of Suitable POS Features for Model 1

Ambiguity classes     a0, a+1, a+2
May_be's              m0, m+1, m+2
POS features          p-2, p-1, p+1, p+2
POS bigrams           (p-2, p-1), (p-1, p+1), (p+1, p+2)
POS trigrams          (p-2, p-1, a0), (p-2, p-1, p+1), (p-1, a0, p+1), (p-1, p+1, p+2)
Single characters     ca(1), cz(1)
Prefixes              a(2), a(3), a(4)
Suffixes              z(2), z(3), z(4)
Lexicalized features  SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info         punctuation ('.', '?', '!')


Table 5.6 Example of Suitable POS Features for Model 2

Ambiguity classes     a0
May_be's              m0
POS features          p-2, p-1
POS bigrams           (p-2, p-1)
POS trigrams          (p-2, p-1, a0)
Single characters     ca(1), cz(1)
Prefixes              a(2), a(3), a(4)
Suffixes              z(2), z(3), z(4)
Lexicalized features  SA, CAA, AA, SN, CP, CN, CC, MW, L
Sentence_info         punctuation ('.', '?', '!')

SVMTlearn for Tamil POS Tagging
SVMTlearn is the primary component of the SVMTool; it is used for training on the tagged corpus using SVMlight, and it works on the Linux operating system. A POS-tagged corpus is required for training. If there is enough data, it is good practice to split it into three working sets (training, validation and test), which allows the system to be trained, tuned and evaluated. With less data one can still train, tune and test the system through cross-validation, but accuracy and efficiency will suffer. The input to this component is the POS-tagged training corpus, which is given to SVMTlearn together with the features defined in the configuration file; the features are defined based on the Tamil language. The outputs of SVMTlearn are a dictionary file and merged files for known and unknown words (for all models); each merged file contains all the features of known and unknown words (Figure 5.6).


Figure 5.6 Implementation of SVMTlearn

Example of the training output of SVMTlearn:

---------------------------------------------------------------------
SVMTool v1.3 (C) 2006 TALP RESEARCH CENTER.
Written by Jesus Gimenez and Lluis Marquez.
---------------------------------------------------------------------
TRAINING SET = /media/disk-1/SVM/SVMTool-1.3/bin/TAMIL_CORPUS.TRAIN
DICTIONARY [31605 words]
BUILDING MODELS... [MODE = 0 :: DIRECTION = LR]
C-PARAMETER TUNING by 10-fold CROSS-VALIDATION on [KNOWN]
C-RANGE = [0.01..1] :: [log] :: #LEVELS = 3 :: SEGMENTATION RATIO = 10
LEVEL = 0 :: C-RANGE = [0.01..1] :: FACTOR = [* 10]
level - 0 : ITERATION 0 - C = 0.01 [M0 :: LR]
TEST ACCURACY: 90.6093% KNOWN [ 92.886% ]  AMBIG.KNOWN [ 83.3052% ] UNKNOWN [ 78.5781% ]
TEST ACCURACY: 90.392%  KNOWN [ 92.6809% ] AMBIG.KNOWN [ 82.838% ]  UNKNOWN [ 78.0815% ]
TEST ACCURACY: 90.1015% KNOWN [ 92.6128% ] AMBIG.KNOWN [ 83.4766% ] UNKNOWN [ 77.5075% ]
TEST ACCURACY: 89.7127% KNOWN [ 92.0721% ] AMBIG.KNOWN [ 81.8731% ] UNKNOWN [ 77.5281% ]
TEST ACCURACY: 90.7699% KNOWN [ 92.7304% ] AMBIG.KNOWN [ 83.4785% ] UNKNOWN [ 80.3874% ]
TEST ACCURACY: 89.8988% KNOWN [ 92.3462% ] AMBIG.KNOWN [ 81.5675% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 90.8836% KNOWN [ 92.9671% ] AMBIG.KNOWN [ 83.6309% ] UNKNOWN [ 79.5591% ]
TEST ACCURACY: 89.9724% KNOWN [ 92.4002% ] AMBIG.KNOWN [ 82.1854% ] UNKNOWN [ 77.664% ]
TEST ACCURACY: 90.2643% KNOWN [ 92.5675% ] AMBIG.KNOWN [ 83.0289% ] UNKNOWN [ 78.0907% ]
TEST ACCURACY: 90.7494% KNOWN [ 92.7798% ] AMBIG.KNOWN [ 82.7494% ] UNKNOWN [ 79.8929% ]
OVERALL ACCURACY [Ck = 0.01 :: Cu = 0.07975] : 90.33539% KNOWN [ 92.6043% ] AMBIG.KNOWN [ 82.81335% ] UNKNOWN [ 78.45753% ]
MAX ACCURACY -> 90.33539 :: C-value = 0.01 :: depth = 0 :: iter = 1
level - 0 : ITERATION 1 - C = 0.1 [M0 :: LR]
TEST ACCURACY: 91.7702% KNOWN [ 94.2402% ] AMBIG.KNOWN [ 87.5492% ] UNKNOWN [ 78.7175% ]
TEST ACCURACY: 91.8881% KNOWN [ 94.4737% ] AMBIG.KNOWN [ 88.4324% ] UNKNOWN [ 77.9821% ]
TEST ACCURACY: 91.3219% KNOWN [ 94.0596% ] AMBIG.KNOWN [ 88.0441% ] UNKNOWN [ 77.5928% ]
TEST ACCURACY: 91.0615% KNOWN [ 93.6037% ] AMBIG.KNOWN [ 86.6795% ] UNKNOWN [ 77.9326% ]
TEST ACCURACY: 92.0852% KNOWN [ 94.2575% ] AMBIG.KNOWN [ 88.3275% ] UNKNOWN [ 80.5811% ]
TEST ACCURACY: 91.3927% KNOWN [ 94.1299% ] AMBIG.KNOWN [ 87.4226% ] UNKNOWN [ 77.286% ]
TEST ACCURACY: 91.9891% KNOWN [ 94.2944% ] AMBIG.KNOWN [ 88.0182% ] UNKNOWN [ 79.4589% ]
TEST ACCURACY: 91.3063% KNOWN [ 93.9605% ] AMBIG.KNOWN [ 87.1258% ] UNKNOWN [ 77.8502% ]
TEST ACCURACY: 91.3654% KNOWN [ 93.8499% ] AMBIG.KNOWN [ 87.2127% ] UNKNOWN [ 78.2339% ]
TEST ACCURACY: 91.8693% KNOWN [ 94.1% ]    AMBIG.KNOWN [ 87.0546% ] UNKNOWN [ 79.9416% ]
OVERALL ACCURACY [Ck = 0.1 :: Cu = 0.07975] : 91.60497% KNOWN [ 94.09694% ] AMBIG.KNOWN [ 87.58666% ] UNKNOWN [ 78.55767% ]
MAX ACCURACY -> 91.60497 :: C-value = 0.1 :: depth = 0 :: iter = 2

5.5.3.2 SVMTagger
Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs the POS tagging of a sequence of words. The tagging goes on-line, based on a sliding window which gives a view of the feature context to be considered at every decision. In any case, there are two important concepts to be considered:

• Example generation
• Feature extraction

Example generation: This step defines what an example is, according to the concept the machine is to learn. For instance, in POS tagging the machine has to correctly classify words according to their POS. Thus, every POS tag is the class of a word: a word generates a positive example for its own class, and a negative example for the rest of the classes. Therefore, every sentence may generate a large number of examples.

Feature extraction: The set of features used by the algorithm has to be defined. For instance, POS tags may be guessed according to the preceding and following words; thus, every example is represented by a set of active features, and these representations are the input for the SVM classifiers. To understand the working of SVMTool, it is necessary to run SVMTlearn (the Perl version). By setting the REMOVE_FILES option in the configuration file to 0, the intermediate files are kept; if the option is set to 1, all intermediate files are removed. Feature extraction is performed by the sliding-window object. A sliding window works on a very local context (as defined in the CONFIG file), usually a 5-word context [-2, -1, 0, +1, +2], with the current word under analysis at the core position.

Taking this context into account, a number of features may be extracted. The feature set depends on how the tagger is going to proceed later, i.e. the context and information that will be available at tagging time. Generally, all the words are known before tagging, but a POS tag is available only for some words (those already tagged). At the tagging stage, if the input word is known and ambiguous, the word is tagged (i.e., classified), and the predicted tag feeds forward into the next decisions. This is done in the "sub classify_sample_merged()" subroutine in the SVMTAGGER file. In order to speed up SVM classification, the feature mapping and the SVM weights and biases are merged into a single file. Therefore, when a new example is to be tagged, the tagger just accesses the merged model and, for every active feature, retrieves the associated weight; then, for every possible tag, the bias is also retrieved. Finally, the SVM classification rule (i.e., scalar product + bias) is applied.

Example:
கூட்டுக்குடிநீர் திட்டம் நிறைவேற்றப்படும் என்றார் முதல்வர்
kUddukkudiwIr thiddam wiRaivERRappadum enRAr muthalvar

Here the word "wiRaivERRappadum" is ambiguous between the tags VF and VNAJ; its correct tag is VF.

For tagging this sentence, first take the active features such as w-1 and w+1, i.e., predict POS tags based only on the preceding and following words. For w0, the current word wiRaivERRappadum, the active features are "w-1 is thiddam" and "w+1 is enRAr". When applying the SVM classification rule for a given POS tag, it is necessary to go to the merged model and retrieve the weight for these features, together with the bias corresponding to that POS (first line after the header, beginning with "BIASES"). For instance, suppose this ".MRG" file:


BIASES …:0.37059487 …:-0.19514606 …:0.43007979 …:-0.037037037 …:0.55448766 …:-0.19911161 …:-1.1815452 …:-0.86491783 …:0.61775334 …:0.072242349 …:0.7906585 …:-0.21980137 …:0.44012828 …:1.3656117 …:0.30304924 …:-0.2182171 …:0.89491131 …:-0.15550162 …:0.56913633 …:0.35316978 …:0.039121434 …:0.84771943 …:0.041690388 …:0.23199934 VF:0.33486366 …:0.0048185684 VNAJ:0.42063524 …:0.18009116

C0~-1:thiddam …:0.00579042912371902 …:-0.508699048073652 …:0.532690716551973 …:-0.000698015879911668 …:0.142313085089229 VF:0.296699729267891 VNAJ:-0.32

C0~1:enRAr …:0.132726597682121 VF:0.66667135122578 VNAJ:-0.676332541749603

The SVM score for "wiRaivERRappadum" being VNAJ is:
weight("w-1: thiddam", VNAJ) + weight("w+1: enRAr", VNAJ) - bias(VNAJ)
= (-0.32) + (-0.676332541749603) - (0.42063524) = -1.416967781749603

The SVM score for "wiRaivERRappadum" being VF is:
weight("w-1: thiddam", VF) + weight("w+1: enRAr", VF) - bias(VF)
= (0.296699729267891) + (0.66667135122578) - (0.33486366) = 0.628507420493671
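The same computation can be written as a small sketch (Python), using the weights and biases recovered from the ".MRG" excerpt above:

# score.py - a sketch of the SVM classification rule: sum of the weights of
# the active features minus the per-tag bias, with the example's numbers
weights = {
    ('C0~-1:thiddam', 'VNAJ'): -0.32,
    ('C0~-1:thiddam', 'VF'):    0.296699729267891,
    ('C0~1:enRAr',   'VNAJ'):  -0.676332541749603,
    ('C0~1:enRAr',   'VF'):     0.66667135122578,
}
biases = {'VNAJ': 0.42063524, 'VF': 0.33486366}

def score(active_features, tag):
    return sum(weights.get((f, tag), 0.0) for f in active_features) - biases[tag]

feats = ['C0~-1:thiddam', 'C0~1:enRAr']
for tag in ('VNAJ', 'VF'):
    print(tag, score(feats, tag))
# VNAJ scores about -1.4170 and VF about 0.6285, so VF is assigned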

Here the SVM score for VF is higher than the score for VNAJ, so the tag VF is assigned to the word "wiRaivERRappadum". Computed part-of-speech tags feed forward directly into subsequent tagging decisions as context features. The SVMTagger component works on standard input/output. It processes a token-per-line corpus in a sentence-by-sentence fashion. The token is expected to be the first column of the line, and the predicted tag will take the second column in the output; the rest of the line remains unchanged. Lines beginning with '##' are ignored by the tagger. Figure 5.7 is an example of an input file; SVMTagger considers only the first column of the input file. Figure 5.8 shows an example of an output file.


Figure 5.7 Example Input

Figure 5.8 Example Output

SVMTagger for Tamil
In the SVMTagger component, the important options are the tagging strategy and the backup lexicon. It is important to choose the tagging strategy that is going to be used; this may depend, for instance, on efficiency requirements. If the tagging must be as fast as possible, then one should avoid strategies 1, 5 and 6, because strategy 1 goes in two passes and strategies 5 and 6 perform sentence-level tagging. Strategy 3 is only for unsupervised learning (no hand-annotated data is needed). To choose among strategies 0, 2 and 4, the best solution is to try them all; if unknown words are expected at tagging time, strategies 2 and 4 are more robust than strategy 0. If speed is not a requirement, strategies 4 and 6 systematically show the best results.


The format of the backup lexicon file is the same as the dictionary format, so a Perl program can be used to convert a tagged corpus into dictionary format. Tagging is complex for open tag categories, and the main difficulty in POS tagging is tagging proper nouns. English exploits capitalization for tagging proper-noun words, but in Tamil this is not possible; therefore a large backup lexicon with proper nouns is provided to the system. A large dataset of proper nouns (Indian place and person names) was collected and given as input to the morphological generator (using a Perl program); the morphological generator produces nearly twelve inflections for every proper noun. This new dataset was converted into the SVMTool dictionary format and given to SVMTagger as a backup lexicon. Figure 5.9 shows the steps in the implementation of SVMTagger for Tamil. The input to the system is an untagged, cleaned Tamil corpus and the output is a tagged (annotated) corpus. Supporting files are the training corpus, the dictionary file, the merged models for known and unknown words, and the backup lexicon.
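A minimal sketch of such a corpus-to-dictionary conversion is shown below (Python rather than Perl, for illustration; the assumed per-line layout of word, total count, number of tags, then tag/count pairs approximates the SVMTool dictionary format and should be checked against the SVMTool documentation; file names are hypothetical):

# make_dict.py - a sketch: turn a tagged corpus (word and tag per line)
# into a dictionary-format backup lexicon
from collections import Counter, defaultdict

tag_counts = defaultdict(Counter)
with open('tagged_corpus.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            tag_counts[parts[0]][parts[1]] += 1

with open('backup.lex', 'w', encoding='utf-8') as out:
    for word, tags in tag_counts.items():
        fields = [word, str(sum(tags.values())), str(len(tags))]
        for tag, n in tags.items():
            fields += [tag, str(n)]
        out.write(' '.join(fields) + '\n')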

Figure 5.9 Implementation of SVMTagger

5.5.3.3 SVMTeval
Given a SVMTool-predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. It is a very useful component for tuning the system parameters, such as the C parameter, the feature patterns and filtering, the model compression, etc. Based on a given morphological dictionary (e.g., the one automatically generated at training time), results may also be presented for different sets of words (known vs. unknown words, ambiguous vs. unambiguous words). A different view of these same results can be obtained from the ambiguity-class perspective, i.e., words sharing the same kind of ambiguity may be considered together. Also, words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can be grouped.

Usage: SVMTeval [mode] <model> <gold> <pred>
- mode: 0 - complete report (everything)
        1 - overall accuracy only [default]
        2 - accuracy of known vs. unknown words
        3 - accuracy per level of ambiguity
        4 - accuracy per kind of ambiguity
        5 - accuracy per class
- model: model name
- gold:  correct tagging file
- pred:  predicted tagging file

Example: SVMTeval TAMIL_CORPUS_4L TAMIL.GOLD TAMIL.OUT

SVMTeval for Tamil
SVMTeval is the last component of SVMTool; it is used to evaluate the tagger outputs under the different modes listed above. The main input of this component is a correctly tagged corpus, also called the gold standard (Figure 5.10).


Figure 5.10 Implementation of SVMTeval

SVMTeval report

Brief report
By default, a brief report mainly returning the overall accuracy is produced. It also provides information about the number of tokens processed, and how many were known/unknown and ambiguous/unambiguous according to the model dictionary. Results are always compared to the most-frequent-tag (MFT) baseline.

*========================= SVMTeval report =========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING ... vs. ... on model ...
*================= TAGGING SUMMARY =================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= OVERALL ACCURACY ================================
HITS    TRIALS    ACCURACY    MFT
* -------------------------------------------------------------------
1002    1063      94.2615%    71.2135%
* ===================================================================

Known vs. unknown tokens
Accuracy for four different sets of words is returned. The first set is that of all known tokens, i.e. tokens which were seen during training. The second and third sets contain, respectively, all ambiguous and all unambiguous tokens among these known tokens. Finally, there is the set of unknown tokens, which were not seen during training.

*========================= SVMTeval report =========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING ... vs. ... on model ...
*================= TAGGING SUMMARY =================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= KNOWN vs UNKNOWN TOKENS =========================
                             HITS    TRIALS    ACCURACY
* -------------------------------------------------------------------
known                        816     854       95.5504%
  known unambiguous tokens   604     623       96.9502%
  known ambiguous tokens     212     231       91.7749%
unknown                      186     209       88.9952%
*================= OVERALL ACCURACY ================================
HITS    TRIALS    ACCURACY    MFT
* -------------------------------------------------------------------
1002    1063      94.2615%    71.2135%
* ===================================================================

154

*=========================SVMTevalreport ============================== * model

= [E:\\SVMTool-1.3\\bin\\CORPUS]

* testset (gold)

= [E:\\SVMTool-1.3\\bin\\files\\test.gold]

* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out] * ====================================================================== == EVALUATING vs. on model ... *=================TAGGINGSUMMARY====================================== ================ #TOKENS

= 1063

AVERAGE_AMBIGUITY = 6.4901 tags per token * ---------------------------------------------------------------------------------------#KNOWN

= 80.3387% -->

854 / 1063

#UNKNOWN

= 19.6613% -->

209 / 1063

#AMBIGUOUS

= 21.7310% -->

231 / 1063

#MFT baseline

= 71.2135% -->

757 / 1063

*=================ACCURACY

PER

LEVEL

OF

AMBIGUITY

======================================= #CLASSES = 5 *===================================================================== ==================== LEVEL

HITS

TRIALS

ACCURACY

MFT

*---------------------------------------------------------------------------------------1

605

624

96.9551%

96.6346%

2

204

220

92.7273%

66.8182%

3

7

9

77.7778%

66.6667%

4

2

3

66.6667%

33.3333%

28

184

207

88.8889%

0.0000%

*=================OVERALLACCURACY===================================== ================= HITS

TRIALS

ACCURACY

MFT

*---------------------------------------------------------------------------------------1002

1063

94.2615%

155

71.2135%

* ====================================================================== ===================

Kind of ambiguity
This view is much finer: every class of ambiguity is studied separately.

*========================= SVMTeval report =========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING ... vs. ... on model ...
*================= TAGGING SUMMARY =================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER CLASS OF AMBIGUITY =================
#CLASSES = 55
CLASS   HITS   TRIALS   ACCURACY    MFT
* -------------------------------------------------------------------
…       28     28       100.0000%   100.0000%
…       184    207      88.8889%    0.0000%
…       1      1        100.0000%   100.0000%
…       1      1        100.0000%   100.0000%
…       31     32       96.8750%    96.8750%
…       1      2        50.0000%    50.0000%
…       2      2        100.0000%   100.0000%
…       20     20       100.0000%   100.0000%
…       1      1        100.0000%   100.0000%
…       17     17       100.0000%   100.0000%
…       49     49       100.0000%   100.0000%
…       23     23       100.0000%   95.6522%
…       22     22       100.0000%   100.0000%
…       2      2        100.0000%   50.0000%
…       2      2        100.0000%   0.0000%
…       6      6        100.0000%   100.0000%
…       0      1        0.0000%     0.0000%
…       14     14       100.0000%   100.0000%
…       77     77       100.0000%   100.0000%
…       1      1        100.0000%   100.0000%
…       6      6        100.0000%   100.0000%
…       0      1        0.0000%     0.0000%
…       81     91       89.0110%    89.0110%
…       148    161      91.9255%    58.3851%
…       1      1        100.0000%   100.0000%
…       0      1        0.0000%     0.0000%
…       1      1        100.0000%   100.0000%
…       1      1        100.0000%   100.0000%
…       46     47       97.8723%    97.8723%
…       4      5        80.0000%    80.0000%
…       1      1        100.0000%   100.0000%
…       1      1        100.0000%   0.0000%
…       34     37       91.8919%    91.8919%
…       0      1        0.0000%     0.0000%
…       0      1        0.0000%     0.0000%
…       4      4        100.0000%   100.0000%
…       3      3        100.0000%   66.6667%
…       6      6        100.0000%   100.0000%
…       1      1        100.0000%   100.0000%
…       2      2        100.0000%   100.0000%
…       2      2        100.0000%   100.0000%
…       33     33       100.0000%   100.0000%
…       4      4        100.0000%   100.0000%
…       5      5        100.0000%   100.0000%
…       4      4        100.0000%   100.0000%
…       1      1        100.0000%   100.0000%
…       7      7        100.0000%   100.0000%
…       8      8        100.0000%   100.0000%
…       11     11       100.0000%   100.0000%
…       2      2        100.0000%   100.0000%
…       39     40       97.5000%    97.5000%
…       12     12       100.0000%   91.6667%
…       15     16       93.7500%    93.7500%
…       20     20       100.0000%   100.0000%
…       17     18       94.4444%    94.4444%
*================= OVERALL ACCURACY ================================
HITS    TRIALS    ACCURACY    MFT
* -------------------------------------------------------------------
1002    1063      94.2615%    71.2135%
* ===================================================================

Class
Every class is studied individually.

*========================= SVMTeval report =========================
* model               = [E:\\SVMTool-1.3\\bin\\CORPUS]
* testset (gold)      = [E:\\SVMTool-1.3\\bin\\files\\test.gold]
* testset (predicted) = [E:\\SVMTool-1.3\\bin\\files\\test.out]
* ===================================================================
EVALUATING ... vs. ... on model ...
*================= TAGGING SUMMARY =================================
#TOKENS           = 1063
AVERAGE_AMBIGUITY = 6.4901 tags per token
* -------------------------------------------------------------------
#KNOWN        = 80.3387% --> 854 / 1063
#UNKNOWN      = 19.6613% --> 209 / 1063
#AMBIGUOUS    = 21.7310% --> 231 / 1063
#MFT baseline = 71.2135% --> 757 / 1063
*================= ACCURACY PER PART-OF-SPEECH =====================
POS    HITS   TRIALS   ACCURACY    MFT
* -------------------------------------------------------------------
…      30     31       96.7742%    90.3226%
…      47     48       97.9167%    70.8333%
…      21     21       100.0000%   95.2381%
…      17     17       100.0000%   100.0000%
…      49     49       100.0000%   100.0000%
…      26     26       100.0000%   84.6154%
…      7      8        87.5000%    75.0000%
…      36     36       100.0000%   100.0000%
…      77     77       100.0000%   100.0000%
…      1      1        100.0000%   100.0000%
…      6      7        85.7143%    85.7143%
…      243    259      93.8224%    57.9151%
…      145    162      89.5062%    46.2963%
…      43     44       97.7273%    86.3636%
…      0      16       0.0000%     0.0000%
…      4      4        100.0000%   100.0000%
…      2      2        100.0000%   100.0000%
…      9      9        100.0000%   100.0000%
…      2      3        66.6667%    66.6667%
…      2      2        100.0000%   100.0000%
…      34     34       100.0000%   97.0588%
…      4      4        100.0000%   100.0000%
…      5      5        100.0000%   100.0000%
…      6      6        100.0000%   66.6667%
…      1      1        100.0000%   100.0000%
…      18     18       100.0000%   77.7778%
…      20     22       90.9091%    54.5455%
…      68     68       100.0000%   66.1765%
…      16     18       88.8889%    83.3333%
…      41     42       97.6190%    69.0476%
…      22     23       95.6522%    73.9130%
*================= OVERALL ACCURACY ================================
HITS    TRIALS    ACCURACY    MFT
* -------------------------------------------------------------------
1002    1063      94.2615%    71.2135%
* ===================================================================

5.6 RESULTS AND COMPARISON WITH OTHER TOOLS

Apart from SVMTool, three other taggers, namely TnT [39], MBT [60] and WEKA [168], were trained on the same corpus, and the accuracy of SVMTool is compared with these tools on the same testing corpus. A brief description of the above-mentioned taggers follows.

TnT (Trigrams'n'Tags) is a very efficient statistical part-of-speech tagger that is trainable on different languages and virtually any tagset. The tagger is an implementation of the Viterbi algorithm for second-order Markov models. The component for parameter generation trains on tagged corpora, and the system incorporates several methods of smoothing and of handling unknown words [39].
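To make the decoding step concrete, the following is a minimal sketch of Viterbi decoding for an HMM tagger (Python). TnT itself uses second-order (trigram) models; for brevity a first-order version is shown, and smoothed, non-zero probability tables trans, emit and start are assumed to be given:

# viterbi.py - a sketch of Viterbi decoding for a first-order HMM tagger
import math

def viterbi(words, tags, trans, emit, start):
    # best[t] = (log-prob of best path ending in tag t, that path)
    best = {t: (math.log(start[t]) + math.log(emit[t].get(words[0], 1e-9)), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # pick the best previous tag, then add this word's emission
            p, path = max(((lp + math.log(trans[prev].get(t, 1e-9)), pth)
                           for prev, (lp, pth) in best.items()),
                          key=lambda x: x[0])
            new[t] = (p + math.log(emit[t].get(w, 1e-9)), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]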

MBT (Memory Based Tagger) is an approach to POS tagging based on Memory-based learning. It is an extension of the classical k-Nearest Neighbor (k-NN) approach to statistical pattern classification. Here, all the instances are fully stored in memory and classification involves a pass along all stored instances. The approach is based on the assumption that reasoning is based on direct reuse of stored experiences rather than on the application of knowledge (such as rules or decision trees) abstracted from experience. Hence the tagging accuracy for unknown words is low [60].


WEKA is a collection of machine learning algorithms for solving real-world data mining problems; the J48 classifier was used for the implementation of Tamil POS tagging. All three tools were trained on the same corpus as used for SVMTool, with the same data format in all cases [168]. The experiments were conducted with our tagged corpus, divided into a training set and a test set. SVMTool obtained 94.6% overall accuracy, much higher than the other taggers. POS tagging using MBT gave a very low accuracy (65.65%) for unknown words, since the algorithm is based on direct reuse of stored experiences. Ambiguous words were handled poorly by TnT, whereas WEKA gave a high accuracy of 90.11% for ambiguous words. Though SVMTool gave very high accuracy in all cases, its training time was significantly higher than that of the other tools. The unknown-word accuracy of SVMTool is 86.25%, and the accuracy goes down for some specific tags. Accuracy results of SVMTool compared to the other tools on the same corpus are given in Table 5.7.

Table 5.7 Comparison of Accuracies

                   WEKA      MBT       TNT       SVMTool
Known              ……        88.45%    92.54%    96.74%
Ambiguous          90.11%    80.23%    78.72%    94.57%
Unknown            ……        65.65%    74.18%    86.25%
Overall Accuracy   ……        78.48%    89.56%    94.6%

5.7 ERROR ANALYSIS

A detailed error analysis was conducted to identify mispredicted tags. About 1200 untagged sentences (10k words) were taken for testing the system. For analyzing the errors, the 8 tags on which errors occurred most frequently are considered; these tags with their trials and errors are shown in Table 5.8. For instance, the errors for CRD indicate that the tagger failed to identify the CRD tag in 30 occurrences. Table 5.9 shows the confusion matrix for the 8 POS tags; this matrix shows the performance of the tagger.


Table 5.8 Trials and Errors

Tags    Trials   Hits    Errors
CRD     642      612     30
NN      4200     3989    211
NNC     2317     2264    53
NNP     1768     1721    47
NNPC    47       33      14
ORD     32       22      10
VBG     274      258     16
VNAJ    682      662     20

Table 5.9 Confusion Matrix

        CRD     NN      NNC     NNP     NNPC    ORD     VBG     VNAJ    O
CRD     0.953   0.019   0       0       0       0.028   0       0       0
NN      0       0.95    0.016   0.003   0.001   0.001   0.01    0.003   0.016
NNC     0.001   0.018   0.977   0.002   0.001   0       0       0       0.001
NNP     0.001   0.013   0.001   0.973   0.007   0       0.001   0       0.004
NNPC    0       0.106   0.106   0.085   0.702   0       0       0       0
ORD     0.125   0.063   0       0       0       0.688   0       0       0.125
VBG     0       0.04    0       0.004   0       0       0.942   0       0.015
VNAJ    0.001   0.004   0       0       0       0       0.007   0.971   0.016
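A sketch of how such a row-normalized confusion matrix can be computed from a gold file and a tagger output (both in the word-tag column format used throughout this chapter) is given below; the file names are hypothetical:

# confusion.py - a sketch: build a row-normalized confusion matrix
from collections import Counter, defaultdict

TAGS = ['CRD', 'NN', 'NNC', 'NNP', 'NNPC', 'ORD', 'VBG', 'VNAJ']

counts = defaultdict(Counter)
with open('test.gold', encoding='utf-8') as g, open('test.out', encoding='utf-8') as p:
    for gline, pline in zip(g, p):
        gparts, pparts = gline.split(), pline.split()
        if len(gparts) >= 2 and len(pparts) >= 2:
            counts[gparts[1]][pparts[1]] += 1   # counts[gold_tag][pred_tag]

for gold in TAGS:
    total = sum(counts[gold].values()) or 1
    row = {t: counts[gold][t] / total for t in TAGS}
    row['O'] = 1.0 - sum(row.values())   # mass predicted as any other tag
    print(gold, {t: round(v, 3) for t, v in row.items()})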

5.8 SUMMARY

This chapter detailed the development of the POS tagger and the tagged corpora. Part-of-speech tagging plays an important role in various speech and language processing applications, and many statistical tools are currently available for it. The SVMTool has already been successfully applied to English and Spanish POS tagging, exhibiting state-of-the-art performance (97.16% and 96.89%, respectively); in both cases, the results clearly outperform the HMM-based TnT part-of-speech tagger. For Tamil, an accuracy of 94.6% has been obtained. Any language can be trained easily using the existing statistical tagger tools, and this POS tagging work can be extended to other languages. The main obstacle for POS tagging of the Indian languages is the lack of annotated (tagged) corpora; here, a POS-annotated corpus of 45k sentences (5 lakh words) was developed to train the POS tagger.


CHAPTER 6
MORPHOLOGICAL ANALYZER FOR TAMIL

6.1 GENERAL

A morphological analyzer is one of the most important basic tools in the automatic processing of any human language. It analyses the naturally occurring word forms in a sentence and identifies the root word and its features. In spite of its significance, the Dravidian languages do not have any morphological analyzers available in the public domain, and the absence of such a tool severely impedes the development of language technologies and applications like natural language interfaces, machine translation, etc. in these languages. In this thesis, a Tamil morphological analyzer is used for preprocessing Tamil sentences.

6.1.1 Morphology in Language
The grammar of any language can be broadly divided into morphology and syntax. The term morphology was coined by August Schleicher in 1859. Morphology deals with words and their construction, while syntax deals with how to put words together in some order to make meaningful sentences. Morphology is the field within linguistics that studies the internal structure of words. While words are generally accepted as being the smallest units of syntax, it is clear that in most languages words can be related to other words by rules, and morphology attempts to formulate rules that model the knowledge of the speakers of those languages. Morphemes are the smallest elements of which words are built. Two broad classes of morphemes are stems and affixes; affixes are morphemes added to a base to denote relations between words. Morphemes can either be free (they can stand alone, i.e. they can be words in their own right), e.g. dog, or bound (they must occur as part of a word), e.g. the plural suffix -s on dogs.

6.1.2 Computational Morphology
Computational morphology deals with developing theories and techniques for the computational analysis and synthesis of word forms. Through the computational analysis of morphology, any information encoded in a word can be extracted and brought out, so that later layers of processing can make use of it.

6.1.3 Morphological Analyzer
Morphological analysis segments words into their lemma and morpho-lexical information. It is a primary step for various types of text analysis in any language. Morphological analyzers are used in search engines for retrieving documents from keywords [169], and they increase the recall of search engines. They are also used in speech recognition, lemmatization, information retrieval/extraction, summarization, spell and grammar checking, and machine translation.

In written text, a word is defined as a sequence of characters delimited by spaces, punctuation marks, etc. There is no difficulty in identifying words in written text entered into the computer, because one simply has to look for the delimiters. A word can be of two types: simple and compound. A simple word consists of a root or stem together with suffixes or prefixes. A compound word (also called a conjoined word) can be broken up into two or more independent words; each of its constituent words is either a compound word or a simple word, and may be used independently as a word. On the other hand, the root and the affixes, which are the constituents of a simple word, are not all independent words and cannot occur as separate words in the text. The constituents of a simple word are called morphemes, or meaning units. The overall meaning of a simple word comes from its morphemes and their relationships; similarly, the meaning of a compound word follows from its constituent words and their inter-relationships. It should be noted that a pragmatic position has been taken here regarding words: anything that is identifiable using the delimiters is a word. This is a convenient position from the processing viewpoint, and the same holds for the definition of compound words. With the above definition, an analyzer of words in a sentence does not have to do much work to identify a word; it simply has to look for the delimiters. Having identified a word, it must determine whether it is a compound word or a simple word; if it is a compound word, it must first break it up into its constituent simple words before proceeding to analyze them. The former task is performed by a sandhi analyzer and the latter by a morphological analyzer, both of which are important parts of a word analyzer.

The detailed linguistic analysis of a word can be useful for NLP. However, most NLP researchers have concentrated on other aspects, like grammatical analysis, semantic interpretation, etc.; as a result, NLP systems use rather simple morphological analyzers. A generator does the reverse of an analyzer: given a root and its features (or affixes), a morphological generator generates a word. Similarly, a sandhi generator can take the output of a morphological generator and group simple words into compound words, where possible.

Examples of morphological analysis:
books    = book + Noun + Plural (or) book + Verb + Pres + 3SG
stopping = stop + Verb + Cont
happiest = happy + Adj + Superlative
went     = go + Verb + Past

Examples of morphological generation:
book + Noun + Plural      = books
stop + Verb + Cont        = stopping
happy + Adj + Superlative = happiest
go + Verb + Past          = went
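The analyzer/generator relation can be made concrete with a toy sketch (Python); the hand-written lexicon below covers only the four examples above and is not the actual system, which is described in the following sections:

# morph.py - a toy analyzer/generator over a tiny hand-written lexicon
ANALYSES = {
    'books':    'book+Noun+Plural',
    'stopping': 'stop+Verb+Cont',
    'happiest': 'happy+Adj+Superlative',
    'went':     'go+Verb+Past',
}
GENERATE = {v: k for k, v in ANALYSES.items()}  # generation inverts analysis

print(ANALYSES['went'])           # go+Verb+Past
print(GENERATE['go+Verb+Past'])   # went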

6.1.4 Role of Morphological Analyzer in NLP
The morphological analyzer plays an important role in the field of Natural Language Processing. Figure 6.1 shows the role of the morphological analyzer in NLP. Some of the applications are:

• Spell checker
• Search engines
• Information extraction and retrieval
• Machine translation system
• Grammar checker
• Content analysis
• Question answering system
• Automatic sentence analyzer
• Dialogue system
• Knowledge representation in learning
• Language teaching
• Language-based educational exercises

Figure 6.1 Role of Morphological Analyzer in NLP

6.2 TAMIL MORPHOLOGY

6.2.1 Tamil Morphology and Language
Tamil morphology is very rich. Tamil is an agglutinative language, like the other Dravidian languages. Tamil words are made up of lexical roots followed by one or more affixes. The lexical roots and the affixes are the smallest meaningful units and are called morphemes; Tamil words are therefore made up of morphemes concatenated to one another in a series. The first one in the construction is always a lexical morpheme (lexical root), which may or may not be followed by other functional or grammatical morphemes. For instance, the word புத்தகங்கள் 'puththakangkaL' in Tamil can be meaningfully divided into புத்தகம் 'puththakam' and கள் 'kaL'. In this example, புத்தகம் 'puththakam' is the lexical root, representing a real-world entity, and கள் 'kaL' is the plural feature marker (suffix): a grammatical morpheme that is bound to the lexical root to add plurality to it. Unlike English, Tamil words can have a long sequence of morphemes. For instance, புத்தகங்களை 'puththakangkaLai' = புத்தகம் 'puththakam' (book) + கள் 'kaL' (plural) + ஐ 'ai' (accusative case marker). Tamil nouns can take case suffixes after the plural marker, and can also take postpositions after that. Tamil words consist of a lexical root to which one or more affixes are attached; most Tamil affixes are suffixes. Tamil suffixes can be derivational suffixes, which change either the part of speech of the word or its meaning, or inflectional suffixes, which mark categories such as person, number, mood, tense, etc. Words can be analyzed as above by identifying their constituent morphemes, whose features can then be identified.

6.2.2 Syntax of Tamil Morphology
Tamil is a consistently head-final language: the verb comes at the end of the clause, with the typical word order Subject-Object-Verb (SOV). Tamil is also a relatively free word-order language; due to this, the noun-phrase arguments before a final verb can appear in any permutation, yet convey the same sense of the sentence. Tamil has postpositions rather than prepositions. Demonstratives and modifiers precede the noun within the noun phrase, and subordinate clauses precede the verb of the matrix clause. Tamil is a null-subject language: not all Tamil sentences have subjects, verbs and objects. It is possible to construct valid sentences that have only a verb, such as முடிந்தது 'mudi-wth-athu' ("Completed"), or only a subject and object without a verb, such as அது என் வீடு 'athu en viitu' ("That, my house"). Tamil does not have a copula (a linking verb equivalent to the word "is"); the word "is" is included in the translations only to convey the meaning more easily.


6.2.3 Word Formation Rules (WFR) in Tamil
Any new word created by Word Formation Rules (WFR) must be a member of a major lexical category; the WFR determines the category of the output of the rule. In Tamil, the grammatical category may or may not change after the operation of a WFR. The following is a list of inputs and outputs of different kinds of WFRs in the derivation of simple words in Tamil [170].

1. Noun → Noun
[[vElai]N + kAran]suf]N 'servant'   [[வேலை]N + காரன்]suf]N வேலைக்காரன்
[[thozil]N + ALi]suf]N 'laborer'   [[தொழில்]N + ஆளி]suf]N தொழிலாளி

2. Verb → Noun
[[padi]V + ppu]suf]N 'education'   [[படி]V + ப்பு]suf]N படிப்பு
[[ezuthu-]V + thu]suf]N 'letter'   [[எழுது-]V + து]suf]N எழுத்து
[[kEL]V + vi]suf]N 'question'   [[கேள்]V + வி]suf]N கேள்வி

3. Adjective → Noun
[[walla]adj + thanam]suf]N 'good quality'   [[நல்ல]adj + தனம்]suf]N நல்லதனம்
[[periya]adj + tu]suf]N 'big (thing)'   [[பெரிய]adj + து]suf]N பெரியது

4. Noun → Verb
[[uyir]N + ppi]suf]V 'to give life'   [[உயிர்]N + ப்பி]suf]V உயிர்ப்பி

5. Adjective → Verb
[[veLLai]adj + aakku]suf]V 'to make (something) white'   [[வெள்ளை]adj + ஆக்கு]suf]V வெள்ளையாக்கு
[[karuppu]adj + aakku]suf]V 'to make (something) black'   [[கருப்பு]adj + ஆக்கு]suf]V கருப்பாக்கு

6. Verb → Verb
[[cey]V + vi]suf]V 'cause to do'   [[செய்]V + வி]suf]V செய்வி
[[wada]V + thth + u]suf]V 'cause to walk'   [[நட]V + த்து]suf]V நடத்து
[[vidu]V + vi]suf]V 'to liberate'   [[விடு]V + வி]suf]V விடுவி

7. Noun → Adjective
[[uyaram]N + Ana]suf]adj 'high'   [[உயரம்]N + ஆன]suf]adj உயரமான
[[azaku]N + Ana]suf]adj 'beautiful'   [[அழகு]N + ஆன]suf]adj அழகான
[[nErmai]N + Ana]suf]adj 'honest'   [[நேர்மை]N + ஆன]suf]adj நேர்மையான

8. Verb → Adverb
[[cey]V + tu]suf]adv 'having done'   [[செய்]V + து]suf]adv செய்து
[[ezuth-]V + i]suf]adv 'having written'   [[எழுது-]V + இ]suf]adv எழுதி
[[padi]V + ththu]suf]adv 'having read'   [[படி]V + த்து]suf]adv படித்து

Compound Word Forms
Table 6.1 shows the possible combinations for compound word formation. Examples:

{[kalvi]N # [kUdam]N #}N 'educational institution'   {[கல்வி]N # [கூடம்]N #}N கல்விகூடம்
{[paNi]N # [puri]V #}V '(perform) work'   {[பணி]N # [புரி]V #}V பணிபுரி
{[ezuthu]V # [kOl]N #}N 'writing instrument'   {[எழுது]V # [கோல்]N #}N எழுதுகோல்
{[periya]adj # [wakaram]N #}N 'big city'   {[பெரிய]adj # [நகரம்]N #}N பெரிய நகரம்
{[wERRu]adv # [iravu]N #}N 'last night'   {[நேற்று]adv # [இரவு]N #}N நேற்று இரவு

Table 6.1 Compound Word-forms Formation

No. | First member | Second member | Compound form
1 | Noun | Noun | Noun
2 | Noun | Verb | Verb
3 | Verb | Noun | Noun
4 | Adjective | Noun | Noun
5 | Adverb | Verb | Verb

6.2.4 Tamil Verb Morphology

Tamil verbs are inflected by means of suffixes. Tamil verbs can be finite or non-finite: finite verb forms occur in the main clause of the sentence, while non-finite forms occur as the predicate of subordinate or embedded clauses. Morphologically, finite verb forms are inflected for tense, mood, aspect, person, number and gender. The simple finite verb forms are given in Table 6.2: the first column presents the PNG (Person-Number-Gender) tag, and the following columns present the present, past and future tense forms respectively. For the word படி 'padi' (study), the various simple finite inflected forms with tense markers and PNG markers are given in Table 6.2. The PNG suffix is a portmanteau morpheme that encodes person, number and gender all in one. Finite verbs take the form

Verb_stem + tense + png_suffix

Tamil recognizes four kinds of non-finite verbs: infinitive, verbal participle, adjectival participle and conditional. They take the following morphotactics:

Verb_stem + infinitive_suffix (infinitive)
Verb_stem + vp_suffix (verbal participle)
Verb_stem + tense + rp_suffix (adjectival participle)
Verb_stem + conditional_suffix (conditional)


Modal verbs can be defective, in that they cannot take any further inflectional suffixes, or they can be regular verbs that are inflected for tense and PNG suffixes:

Verb_stem + infinitive_suffix + modal_verb
Verb_stem + infinitive_suffix + modal_stem + tense + png_suffix

Table 6.2 Simple Verb Finite Forms

PNG | Root-Pres-PNG | Root-Past-PNG | Root-Fut-PNG
3SE | padi-kinR-Ar | padi-thth-Ar | padi-pp-Ar
3SM | padi-kinR-An | padi-thth-An | padi-pp-An
3SF | padi-kinR-AL | padi-thth-AL | padi-pp-AL
2S | padi-kinR-Ay | padi-thth-Ay | padi-pp-Ay
1P | padi-kinR-Om | padi-thth-Om | padi-pp-Om
1S | padi-kinR-En | padi-thth-En | padi-pp-En
2SE | padi-kinR-Ir | padi-thth-Ir | padi-pp-Ir
3SN | padi-kinR-athu | padi-thth-athu | padi-pp-athu
2PE | padi-kinR-IrkaL | padi-thth-IrkaL | padi-pp-IrkaL
3PE | padi-kinR-ArkaL | padi-thth-ArkaL | padi-pp-ArkaL
3PN | padi-kinR-ana | padi-thth-ana | padi-pp-ana
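As a simple illustration of the Verb_stem + tense + PNG morphotactics behind Table 6.2, the following minimal Python sketch (not the analyzer developed in this thesis) regenerates the table by plain concatenation for the 'padi' paradigm; other paradigms would additionally require sandhi handling.

```python
# A minimal sketch of the morphotactics Verb_stem + tense + PNG
# from Table 6.2, for the 'padi' paradigm only. Suffix strings are
# taken from the table; no sandhi rules are modelled here.

TENSE = {"present": "kinR", "past": "thth", "future": "pp"}

PNG = {
    "3SE": "Ar", "3SM": "An", "3SF": "AL", "2S": "Ay",
    "1P": "Om", "1S": "En", "2SE": "Ir", "3SN": "athu",
    "2PE": "IrkaL", "3PE": "ArkaL", "3PN": "ana",
}

def finite_form(stem: str, tense: str, png: str) -> str:
    """Concatenate stem + tense marker + PNG suffix (no sandhi)."""
    return f"{stem}-{TENSE[tense]}-{PNG[png]}"

print(finite_form("padi", "past", "3SM"))   # padi-thth-An  'he studied'
print(finite_form("padi", "future", "1P"))  # padi-pp-Om    'we will study'
```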

6.2.5 Tamil Noun Morphology

Tamil nouns (and pronouns) are classified into two super-classes, "rational" and "irrational", which together comprise a total of five classes. Humans and deities are classified as rational; all other nouns (animals, objects, abstract nouns) are classified as irrational. The rational nouns and pronouns belong to one of three classes: masculine singular, feminine singular and rational plural. The irrational nouns and pronouns belong to one of two classes: irrational singular and irrational plural. The plural form of rational nouns may be used as an honorific, gender-neutral, singular form [132]. Suffixes are used to perform the functions of cases or postpositions. Traditional grammarians grouped the various suffixes into eight cases corresponding to the cases used in Sanskrit: nominative, accusative, dative, sociative, genitive, instrumental, locative and ablative. The various noun forms are given in Table 6.3, which presents the singular and plural forms of the word எலி 'eli' (rat) with the case markers.

Table 6.3 Noun Case Markers

Case | Singular | Plural

Nominative | eli | eli-kaL
Accusative | eli-ai | eli-kaL-ai
Dative | eli-uku | eli-kaL-uku
Benefactive | eli-ukk-Aka | eli-kaL-ukk-Aka
Instrumental | eli-Al | eli-kaL-Al
Sociative (Odu) | eli-Odu | eli-kaL-Odu
Sociative (udan) | eli-udan | eli-kaL-udan
Locative | eli-il | eli-kaL-il
Ablative | eli-il-iruwthu | eli-kaL-il-iruwthu
Genitive | eli-in-athu | eli-kaL-in-athu

Noun forms without any inflection are called noun stems; nouns in their stem forms are singular.

aaciriyarkaL = aaciriyar (teacher) + kaL (plural marker); ஆசிரியர்கள் = ஆசிரியர் + கள்
peenaakkaL = peenaa (pen) + kaL (plural marker); பேனாக்கள் = பேனா + கள்
puththakangkaL = puththakam (book) + kaL (plural marker); புத்தகங்கள் = புத்தகம் + கள்

The examples shown above are a few instances of plural inflection. Creating the plural form of a noun is not simply a matter of concatenating 'kaL': in 'peenaakkaL' the 'k' is doubled, and in 'puththakangkaL' the stem 'puththakam' is inflected ('am' in the stem is replaced by 'ang', followed by 'kaL'). These differences are due to the 'Sandhi'

changes that take place when the noun stem is concatenated with the 'kaL' morpheme. Tamil uses case suffixes and postpositions for case marking instead of prepositions. Case markers indicate the relationship between the noun phrases and the verb phrase, i.e. the semantic role of the noun phrases in the sentence. The genitive case indicates the relationship between noun phrases and is expressed by the 'in' morpheme. Case suffixes are concatenated to nouns in their stem form, or after the plural morpheme for a plural noun:

noun_stem + {kaL} + case_suffix

e.g. kaththiyAl (with a knife) = kaththi (knife) + Al (with); கத்தியால் = கத்தி + ஆல்

Postpositions are of two kinds: bound and free. Bound postpositions occur with their respective governing case suffixes. In such a case, the morphotactics is

noun_stem + {kaL} + case_suffix + bound_post_position

e.g. vIddiliruwthu (from the house) = vIdu (house) + il + iruwthu (from); வீட்டிலிருந்து = வீடு + இல் + இருந்து

Sometimes a postposition follows the case suffix as a separate word after a blank space. Free postpositions follow noun stems without any case suffixes; they are written as separate words and do not concatenate with the noun. Basically, there are eight cases in Tamil. Verbs can take the form of nouns when followed by nominal suffixes. Nominalized verb forms are an example of derivational morphology in Tamil. They occur in the following format:

Verb_stem + tense + nominal_suffix

e.g. ceythavar (one who did) = cey (do) + th (past) + avar (third person singular honorific); செய்தவர் = செய் + த் + அவர்
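The noun and postposition morphotactics described in this section can likewise be illustrated with a minimal sketch; the suffix strings follow Table 6.3 for the 'eli' paradigm, and sandhi changes such as puththakam → puththakang-kaL are deliberately not modelled.

```python
# A minimal sketch of the noun morphotactics
# noun_stem + {kaL} + case_suffix, with suffixes from Table 6.3.
# Sandhi changes are not modelled here.

CASE = {
    "NOM": "", "ACC": "-ai", "DAT": "-uku", "INS": "-Al",
    "LOC": "-il", "GEN": "-in-athu", "ABL": "-il-iruwthu",
}

def noun_form(stem: str, plural: bool, case: str) -> str:
    """Stem, optional plural marker, then case suffix."""
    return stem + ("-kaL" if plural else "") + CASE[case]

print(noun_form("eli", False, "INS"))  # eli-Al             'with a rat'
print(noun_form("eli", True, "ABL"))   # eli-kaL-il-iruwthu 'from the rats'
```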


6.2.6 Tamil Morphological Analyzer

Tamil is morphologically rich and agglutinative. Such a language needs deep analysis at the word level to capture the meaning of a word from its morphemes and their categories. Each root is affixed with several morphemes to generate a word. In general, Tamil words are inflected by suffixes attached after the root word, and each root word can take more than ten thousand inflected word forms. Tamil has both lexical and inflectional morphology. Lexical morphology changes the meaning of the word and its class by adding derivational and compounding morphemes to the root. Inflectional morphology changes the form of the word and adds information to it by attaching inflectional morphemes to the root.

6.2.7 Challenges in Tamil Morphological Analysis

The morphological structure of Tamil is quite complex: verbs inflect for person, gender and number, and also combine with auxiliaries that indicate aspect, mood, causation, attitude, etc. A single verb root can inflect into more than ten thousand word forms including auxiliaries. A noun root inflects with plural, oblique, case, postposition and clitic suffixes; a single noun root can inflect into more than five hundred word forms including postpositions. The root and morphemes have to be identified and tagged for further language processing at the word level. The structure of the verbal complex is unique, and capturing this complexity in a machine-analyzable and machine-generatable format is a challenging job. The formation of the verbal complex involves the arrangement of the verbal units and the interpretation of their combinatory meaning. Phonology also plays its part in the formation of the verbal complex, in terms of the morphophonemic or 'Sandhi' rules that account for the shape changes due to inflection. Understanding verbal complexity involves identifying the structure of simple finite verbs and compound verbs; by understanding its nature, it is possible to evolve a methodology to recognize it. In order to analyze the verbal forms, in which the inflection varies from one set of verbs to another, a classification of Tamil verbs based on tense markers is evolved. The inflection includes finite, infinitive, adjectival, adverbial and conditional forms of verbs.

Compared to verb morphological analysis, noun morphological analysis is relatively easy: a noun can occur separately or with plural, oblique, case, postposition and clitic suffixes. A corpus was developed with all the morphological feature information, so that the machine itself captures all the morphological rules, including the 'Sandhi' and morphotactic rules.

6.3 TAMIL MORPHOLOGICAL ANALYZER SYSTEM

The morphological analyzer is the second stage of pre-processing Tamil language sentences; in the first stage, Tamil sentences are tagged by the Tamil POS tagger tool. The system developed for the Tamil morphological analyzer consists of five modules (Figure 6.3).

Figure 6.3 General Framework for Morphological Analyzer System (POS tagged sentence → minimized POS tagger → noun/verb, pronoun, proper noun and other word-class analyzers → morphologically annotated sentence)

The five modules are:

1. Minimized POS Tagger
2. Noun/Verb Analyzer
3. Pronoun Analyzer
4. Proper Noun Analyzer
5. Other Word Class Analyzers

The input to the morphological system is a POS tagged sentence. In the first module, the POS tagged sentence is refined according to the simplified POS tagset given in Table 6.4, and the refined sentence is split according to the simplified POS tags. The second module morphologically analyzes the noun (N) and verb (V) forms. The third and fourth modules morphologically analyze pronouns (P) and proper nouns (PN). Other word classes are analyzed in the fifth module, which takes the POS tag itself as the morphological feature.

Table 6.4 Minimized POS Tagset

6.4 TAMIL MORPHOLOGICAL ANALYZER FOR NOUNS AND VERBS

6.4.1 Morphological Analyzer using Machine Learning

The morphological analyzer identifies the root and suffixes of a word. Traditionally, rule-based approaches are used for morphological analysis; they rely on a set of rules and a dictionary that contains root words and morphemes. In a rule-based approach, a particular word is given as input to the morphological analyzer, and if the corresponding morphemes or root word are missing from the dictionary, the system fails. Moreover, each rule depends on the previous rule, so if one rule fails, it affects the entire set of rules that follows.


Recently, machine learning approaches have come to dominate the Natural Language Processing field. Machine learning is a branch of Artificial Intelligence (AI) concerned with the design of algorithms that learn from examples. Machine learning algorithms can be supervised or unsupervised: supervised learning uses input data together with the corresponding outputs, whereas unsupervised learning uses only input samples. The goal of a machine learning approach is to use the given examples to find generalization and classification rules automatically. All the rules, including complex spelling rules, can be handled by this method. A morphological analyzer based on machine learning does not require any hand-coded morphological rules; it only needs a morphologically segmented corpus. H. Poon et al. (2009) [189] reported the first log-linear model for unsupervised morphological segmentation; for Arabic and Hebrew it outperforms the state-of-the-art systems by a large margin.

Sequence labeling is a significant generalization of the supervised classification problem: a label is assigned to each input element in a sequence. The elements to be labeled are typically parts of speech or syntactic chunk labels [171]. Many tasks in fields such as natural language processing and bioinformatics are formalized as sequence labeling problems. There are two types of sequence labeling approaches [171]:

• Raw labeling
• Joint segmentation and labeling

In raw labeling, each element gets its own tag, whereas in joint segmentation and labeling, whole segments get a single label. In a morphological analyzer, the sequence is usually a word and a character is an element: the input is a word and the output is the root and its inflections. The input word is denoted 'W', and the root word and inflections are denoted 'R' and 'I' respectively:

[W]Noun/Verb = [R]Noun/Verb + [I]Noun/Verb

In turn, 'I' can be expanded as i1 + i2 + ... + in, where 'n' is the number of inflections or morphemes. Further, 'W' is converted into a sequence of characters: the morphological analyzer accepts a sequence of characters as input and generates a sequence of characters as output. Let X be the finite set of input characters and Y the finite set of output characters. An input string x is segmented as x1 x2 ... xn, where each xi ∈ X; similarly, an output string y is segmented as y1 y2 ... yn, where each yi ∈ Y and 'n' is the number of segments.

Inputs: x = (x1, x2, x3, ..., xn)
Labels: y = (y1, y2, y3, ..., yn)

The main objective of the sequence labeling approach is to predict y from the given x. In the training data, the input sequence x is mapped to the output sequence y; the morphological analyzer problem is thus transformed into a sequence labeling problem. The training data are described in the following subsections. Finally, morphological analysis is redefined as a classification task, which is solved using the sequence labeling methodology.

6.4.2 Novel Data Modeling for Noun/Verb Morphological Analyzer

Data formulation plays the key role in supervised machine learning approaches. The first step in developing the corpora for the morphological analyzer is classifying paradigms for verbs and nouns; Tamil verbs and nouns are classified based on tense markers and case markers respectively. Each paradigm inflects with the same set of inflections. The second step is to collect the list of root words for each paradigm.

6.4.2.1 Paradigm Classification

A paradigm provides information about all the possible word forms of a root word in a particular word class. Tamil noun and verb paradigm classification is done based on case and tense markers respectively, and the number of paradigms for each word class (noun/verb) is defined. For the sake of computational data modeling, Tamil verbs were classified into 32 paradigms [13], and nouns into 25 paradigms to resolve the challenges in noun morphological analysis. Based on its paradigm, each root word is grouped accordingly. Table 6.5 shows the number of paradigms and inflections of verbs and nouns handled in the system; 'Total' represents the total number of word forms handled by the analyzer. The noun and verb paradigm lists are shown in Tables 6.6 and 6.7.


Table 6.5 Number of Paradigms and Inflections

Word forms | Paradigms | Inflections | Auxiliaries | Postpositions | Total
Verb | 32 | 164 | 67 | -- | 10988
Noun | 25 | 30 | -- | 290 | 320

Table 6.6 Noun Paradigms

6.4.2.2 Word Forms

Noun word forms

The morphological system for nouns handles more than three hundred word forms including postpositions. Traditional grammarians group the various suffixes into eight cases corresponding to the cases used in Sanskrit: nominative, accusative, dative, sociative, genitive, instrumental, locative and ablative. The sample word forms used in this thesis are shown in Table 6.8; the remaining word forms are included in Appendix B.

Table 6.7 Verb Paradigms

Table 6.8 Noun Word Forms (sample forms of நரம்பு 'narampu')

நரம்பினை (narampinai), நரம்பை (narampai), நரம்பினத்தை (narampinathai), நரம்போடு (narampOdu), நரம்பினோடு (narampinOdu), நரம்பினத்தோடு (narampinaththOdu), நரம்பது (narampathu), நரம்பினது (narampinathu), நரம்பின்கண் (narampinkaN), நரம்பதுகண் (narampathukaN), நரம்புக்காக (narampukkAka), நரம்பாலான (narampAlAna), நரம்பால் (narampAl), நரம்பினால் (narampinAl), நரம்புடைய (narampudaiya), நரம்பினுடைய (narampinudaiya), நரம்புக்கு (narampukku), நரம்பிற்கு (narampiRku), நரம்பில் (narampil), நரம்பினில் (narampinil), நரம்பின் (narampin), நரம்புடன் (narampudan)


Verb word forms

Verbs are also morphologically deficient, i.e. some verbs do not take all the suffixes meant for verbs. The verb is an obligatory part of a sentence except in copula sentences. Verbs can be classified into different types based on morphological, syntactic and semantic characteristics. Based on the tense suffixes, verbs can be classified into weak, strong and medium verbs. Based on form and function, verbs can be classified into finite verbs (e.g. va-ndt-aan 'come_PAST_he') and non-finite verbs (e.g. va-ndt-a 'come_PAST_RP' and va-ndt-u 'come_PAST_VPAR'). Depending on whether the non-finite form occurs before a noun or a verb, it is classified as an adjectival (relative participle) form (e.g. vandta paiyan 'the boy who came') or an adverbial (verbal participle) form (e.g. vandtu poonaan 'having come he went'). The morphological system for verbs handles more than ten thousand word forms including auxiliaries and clitics. The sample verb forms used in this research are shown in Table 6.9 (forms of படி 'padi'); the remaining word forms are given in Appendix B.

Table 6.9 Verb Word Forms

padi, padiththAn, padiththAL, padiththAr, padiththArkaL, padiththathu, padiththana, padiththAy, padiththIr, padiththIrkaL, padiththEn, padiththOm, padikkiRAn, padikkiRAL, padikkiRAr, padikkiRArkaL, padikkinRathu, padikkinRana, padikkinRAy, padikkinRIr, padikkinRIrkaL, padikkinREn, padikkinROm, padippAn, padippAL, padippAr, padippArkaL, padippathu, padippana, padippAy, padippIr, padippIrkaL, padippEn, padippOm, padikkum, padiththa, padikkinRa, padikkAtha, padiththavan, padiththavaL, padiththavar, padiththathu

6.4.2.3 Morphemes

Noun morphemes

The morphological analyzer system for nouns handles 92 morphemes in the morpholexical tagging (Phase II). The morphemes used in this thesis are given in Table 6.10.

Table 6.10 Noun Morphemes

ஐ ஆக ஆன க் ச் த் ப் அ ஆல் இன் இல் உள் ஓ கண் கள் ப ேபால வைர விட அதன் இடம் இனம் உடன் உைடய ஒழிய கீேழ கீழ் க்கு தவிர பின் ேபால் ன் ேமேல ேமல் ற்கு அண்ைட அ ேக உக்கு உள்ேள எதிேர ஒட் கிட்ட க்காக பதில் பற்றி பிறகு தல் லம் ஆட்டம் ெகாண் சுற்றி தாண் ெதற்கு ேநாக்கி பக்கம் பதிலாக பிந்தி பின்ேன பின் மாதிாி ேமற்கு வடக்கு வழியாக விட் ெவளிேய ைவத் அ யில் அப்பால் அ கில் இைடயில் இ ந் எதிாில் கிழக்கு ந வில் வைரயில் அப் றம் அல்லாமல் இல்லாமல் குறித் கு க்ேக பார்த் பின்னால் ன்னால் ெவளியில் எதிர்க்கு எதிர்க்ேக தவிர்த் வைரக்கும் அ த்தாற் ேபால் அ கி ந் எதிர்த்தார் ேபால் எதிர்த்தாற் ேபால்

Verb morphemes

The morphological system for verbs handles 170 morphemes in the morpholexical tagging (Phase II). The morphemes used in this analyzer are shown in Table 6.11.

Table 6.11 Verb Morphemes

அ ஆ இ உ ஏ ஓ க ட ண ன ய ர ற ல ள ைக க் சா ச் ட் த் ன் ேபா ப் யி ற் வா ேவ ைவ வ் அ அல் ஆத் ஆன் ஆைம ஆம் ஆய் ஆர் ஆல் ஆள் இன் இ ஈர் உம் உள் ஏன் ஓன் ஓம் கல் கள் கிட க்க டல் ணல் தல் னர் னல் ப ய யல் ரல் லல் ளல் வன் வர் வள் ஆமல் இயல் காத் காைம கிற் கிழி கும் கூ ெகா ெகாள் சாகு ெசய் ணாத் ணாைம ம் தான் திாி தீர் ெதாைல த் த்த் ந் ந்த் னாத் னாைம

Ambiguous Morphemes of Noun and Verb

A morpheme may have one or more morpho-syntactic categories, which leads to ambiguity. The ambiguous morphemes of nouns and verbs are shown in Table 6.12.

Table 6.12 Verb/Noun Ambiguous Morphemes

Morphemes: அ ஆ ஆள் இ இயல் இ உள் கட் கல் காட் கிட கிழி கூ ெகா ெகாண் ெகாள் ெசய் தள் திாி ெதாைல த் ப பண் பார் ெப ேபா ப் மாட் ய ள வா வி பக்கம் ஆக த் ேவண் ன

Ambiguous Tags:



6.4.2.4 Data Creation for Noun/Verb Morphological Analyzer

The data creation for the first phase of the noun/verb morphological analyzer system is done in the following stages:

• Preprocessing
• Mapping
• Bootstrapping

Preprocessing

Preprocessing is an important step in data creation; it is involved in the training stage as well as the decoding stage. Figure 6.4 shows the preprocessing steps involved in the development of the corpora. The morphological corpus used for machine learning is developed through Romanization, segmentation and alignment.

Romanization

The input word forms are converted to Romanized forms using a Unicode-to-Roman mapping. Romanization is done for easy computational processing: in Tamil, a syllable (compound character) exists as a single character in which vowel and consonant cannot be separated, so Tamil graphemes are converted into Roman forms to make this separation possible. The Tamil-Roman mapping is given in Appendix A.

Segmentation

After Romanization, every word in the corpora is segmented based on Tamil graphemes, and each syllable in the word is further segmented into consonants and vowels. The suffixes '-C' and '-V' are attached to each consonant and vowel respectively; this is called the C-V (Consonant-Vowel) representation. The C-V representation is used only for the input data. In the output data, morpheme boundaries are indicated by the '*' symbol.

Alignment

The segmented words are aligned vertically as segments using the gap between them.


Figure 6.4 Preprocessing Steps

Mapping and Bootstrapping

The aligned input segments are then mapped to output segments in the mapping stage. Bootstrapping is done to increase the training data size. The sample data format for the word படித்தான் 'padiththAn' is given in Table 6.13: the first column represents the input data and the second the output data, with '*' indicating morpheme boundaries.

Table 6.13 Sample Data Format

I/P | O/P
p-C | p
a-V | a
d-C | d
i-V | i*
th-C | th
th-C | th*
A-V | A
n-C | n*
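The C-V representation of the input column in Table 6.13 can be illustrated with the following minimal sketch; the grapheme inventory here is a small assumed toy set, whereas the real system segments against the full Tamil-Roman mapping of Appendix A.

```python
# A minimal sketch of the C-V input encoding. The grapheme inventory
# is a toy set assumed for this example only; multi-letter graphemes
# such as 'th' must be listed explicitly for longest-match to work.

GRAPHEMES = sorted(["th", "ng", "p", "a", "d", "i", "A", "n"],
                   key=len, reverse=True)  # longest-match first
VOWELS = set("aAiIuUeEoO")

def cv_encode(word: str) -> list:
    """Segment a Romanized word and suffix -C / -V to each unit."""
    units, i = [], 0
    while i < len(word):
        for g in GRAPHEMES:
            if word.startswith(g, i):
                units.append(g + ("-V" if g in VOWELS else "-C"))
                i += len(g)
                break
        else:
            raise ValueError(f"unknown grapheme at {word[i:]!r}")
    return units

print(cv_encode("padiththAn"))
# ['p-C', 'a-V', 'd-C', 'i-V', 'th-C', 'th-C', 'A-V', 'n-C']
```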

6.4.2.5 Issues in Data Creation

Mapping mismatched segments

Mismatching is the main problem that occurs when mapping input characters to output characters. It occurs in two cases: the input has either more or fewer units than the output. The mismatch is solved by inserting a null symbol '$' or by combining two units based on the morphophonemic rules, after which the input segments are mapped to the output segments. After mapping, the machine learning tool is used for training on the data.

In Case 1, the input sequence has more segments (14) than the output sequence (13). The Tamil verb 'padikkayiyalum' has 14 segments in the input sequence but only 13 in the output. The first occurrence of 'y' (8th segment) in the input sequence becomes null due to a morphophonemic rule, so there is no output segment to map it to. For this reason, in training, the input segment 'y' is mapped to the '$' symbol ('$' indicates null) in the output sequence, after which the input and output segments match one to one.

Case 1:
Input sequence: p-C | a-V | d-C | i-V | k | k-C | a-V | y-C | i-V | y-C | a-V | l-C | u-V | m (14 segments)
Mismatched output sequence: p | a | d | i* | k | k | a* | i | y | a | l* | u | m* (13 segments)
Corrected output sequence: p | a | d | i* | k | k | a* | $ | i | y | a | l* | u | m* (14 segments)

In Case 2, the input sequence has fewer segments than the output sequence. The Tamil verb 'OdinAn' has 6 segments in the input sequence but 7 in the output. Using a morphophonemic rule, the segment 'd-C' (2nd segment) of the input sequence is mapped to the two output segments 'd' and 'u*' (2nd and 3rd segments); that is, in training, 'd-C' is mapped to 'du*'. The input and output segments are thus equalized and the sequence mismatch problem is solved.

Case 2:
Input sequence: O | d-C | i-V | n-C | A-V | n (6 segments)
Mismatched output sequence: O | d | u* | i | n* | A | n (7 segments)
Corrected output sequence: O | du* | i | n* | A | n (6 segments)
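The bookkeeping for the two corrections can be sketched as follows; in the actual system the repair positions come from morphophonemic rules, while here they are supplied by hand purely for illustration.

```python
# A minimal sketch of the two mismatch corrections described above.
# The repair positions are hand-supplied; real alignment derives
# them from morpho-phonemic ('Sandhi') rules.

def insert_null(output: list, pos: int) -> list:
    """Case 1: input longer than output -- pad output with '$'."""
    return output[:pos] + ["$"] + output[pos:]

def merge_segments(output: list, pos: int) -> list:
    """Case 2: input shorter than output -- fuse two output units."""
    return output[:pos] + [output[pos] + output[pos + 1]] + output[pos + 2:]

out1 = "p a d i* k k a* i y a l* u m*".split()
print(insert_null(out1, 7))     # 14 segments: ..., 'a*', '$', 'i', ...

out2 = "O d u* i n* A n".split()
print(merge_segments(out2, 1))  # ['O', 'du*', 'i', 'n*', 'A', 'n']
```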

6.4.3 Morphological Tagging Framework using SVMTool

6.4.3.1 Support Vector Machine (SVM)

Support Vector Machine (SVM) approaches have been around since the mid-1990s, initially as a binary classification technique, with later extensions to regression and multi-class classification. Here, the morphological analyzer problem is converted into a classification problem, which can be solved by supervised machine learning algorithms [12]. The SVM is a machine learning algorithm for binary classification that has been successfully applied to a number of practical problems, including NLP [12]. Let {(x1, y1), ..., (xN, yN)} be the set of N training examples, where each instance xi is a vector in R^N and yi ∈ {−1, +1} is the class label. SVM is attractive because it has an extremely well developed statistical learning theory: it is based on strong mathematical foundations and results in simple, yet very powerful, algorithms. SVMs are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory.

6.4.3.2 SVMTool

The SVMTool is an open-source generator of sequential taggers based on Support Vector Machines [12]. Originally, SVMTool was developed for POS tagging, but here the tool is used for morphological analysis. The SVMTool software package consists of three main components: the model learner (SVMTlearn), the tagger (SVMTagger) and the evaluator (SVMTeval). SVM models (weight vectors and biases) are learned from a training corpus using SVMTlearn, with different models learned for the different strategies. Given a training set of annotated examples, SVMTlearn trains a set of SVM classifiers; to do so, it makes use of SVM-light, an implementation of Vapnik's SVMs in C developed by Thorsten Joachims (2002). Given a text corpus (one token per line) and the path to a previously learned SVM model (including the automatically generated dictionary), SVMTagger performs tagging of a sequence of characters. Finally, given a correctly annotated corpus and the corresponding SVMTool-predicted annotation, the SVMTeval component reports the tagging results, evaluating performance in terms of accuracy.

6.4.3.3 Implementation of Morphological Analyzer System

Existing Tamil morphological analyzers are explained in Chapter 2. Jan Hajic et al. (1998) [190] developed morphological tagging for inflectional languages using an exponential probabilistic model based on automatically selected features, with the model parameters computed using simple estimates. Using the machine learning approach, the morphological analyzer for Tamil is developed with separate engines for nouns and verbs: the noun morphological analyzer handles inflected noun forms and postpositionally inflected nouns, while the verb analyzer handles all verb forms (finite, non-finite and auxiliaries). Morphological analysis is redefined as a classification task, solved using the SVM. In this machine learning approach, two training models are developed: Model-I (the segmentation model) and Model-II (the morpho-syntactic tagging model). Model-I is trained on sequences of input characters and their corresponding output labels, and is used for predicting morpheme boundaries. Model-II is trained on sequences of morphemes and their grammatical categories, and is used for assigning a grammatical class to each morpheme. Figure 6.5 illustrates the three phases involved in the morphological analyzer:

• Pre-processing
• Morpheme segmentation
• Morpho-syntactic tagging

Preprocessing

The word to be morphologically analyzed is given as input to the pre-processing phase. The word first undergoes Romanization, and the Romanized word is segmented based on Tamil graphemes, which consist of vowels, consonants and syllables. The syllables are broken into vowel and consonant, and '-C' and '-V' are suffixed to each consonant and vowel respectively.

Morpheme segmentation

In the morpheme segmentation process, words are segmented into morphemes according to their morpheme boundaries. The input sequence is given to the trained Model-I, which predicts a label for each input segment. The output sequence is then aligned into morpheme segments by an alignment program.

Figure 6.5 Implementation of Noun/Verb Morph Analyzer (input word, e.g. படித்தான் → preprocessing → trained Model-I → morpheme alignment → trained Model-II → postprocessing → output, e.g. படி த்த் ஆன்)

Morpho-syntactic tagging

The segmented morpheme sequence is given to the trained Model-II, which predicts a grammatical category for each segment (morpheme) in the sequence.

6.5 MORPHOLOGICAL ANALYZER FOR PRONOUN USING PATTERN MATCHING

The morphological analyzer for Tamil pronouns is developed using a pattern matching approach. Personal pronouns play an important role in Tamil and therefore need special attention during both generation and analysis. Morphological processing of Tamil pronoun word forms is handled independently by the pronoun morphological analysis system. Morphological analysis of a pronoun is based on pattern matching against the pronoun root word, and the structure of the pronoun word form is used for creating a pattern file. The pronoun word structure is divided into four stages:

i. PRN – ROOT
ii. CASE – CLITIC
iii. PPO – CL
iv. CLITIC

These stages are illustrated in Figure 6.6.

Figure 6.6 Structure of Pronoun Word-form (pronoun root, optionally followed by case and postposition, each of which may host a clitic)


Example for the structure of pronoun (forms of அவன் 'avan'):

அவன் (avan), அவனுக்கு (avanukku), அவனருகில் (avanarukil)
அவனா (avanA), அவனுக்கா (avanukkA), அவனருகிலா (avanarukilA)

The pronoun word class is a closed class, so it is easy to collect all the pronoun root words. In the pronoun morphological system, the word form is processed from left to right; morphological analysis systems generally handle a word from right to left, but the limited vocabulary of pronouns makes a left-to-right formulation possible. The pronoun word form is Romanized using the Unicode-to-Roman mapping, and the Romanized word is first compared with the pronoun stem file, which contains all the stems and roots of pronoun words. If the Roman form matches an entry in the pronoun stem file, the matched part is replaced with the value of the corresponding entry. The remaining part is then compared with three different suffix files, and each matched part is replaced with the corresponding value of the suffix element. Finally, the root word is converted back into Unicode form. Figure 6.7 shows the implementation of the pronoun morph analyzer.

Figure 6.7 Implementation of Pronoun Morph Analyzer (pronoun word form matched against the pattern dataset, yielding pronoun root + MLI)

Steps:
Step 1: Check whether the input word is present in the dictionary.
Step 2: If present, go to Step 3; else go to Step 4.
Step 3: Retrieve the root and the morpho-lexical information (MLI).
Step 4: Assign the input word as the root word and null as the MLI.
Step 5: The final output is the combination of the root word and the MLI.
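A minimal sketch of the lookup in Steps 1-5 follows; the dictionary entries are illustrative stand-ins for the pattern files described above.

```python
# A minimal sketch of the dictionary lookup in Steps 1-5. The
# entries are illustrative; the real system uses a stem file and
# three suffix files rather than a flat dictionary.

from typing import Optional, Tuple

PRONOUN_DICT = {
    "avan": ("avan", "3SM"),          # அவன் 'he'
    "avanukku": ("avan", "3SM+DAT"),  # அவனுக்கு 'to him'
}

def analyze_pronoun(word: str) -> Tuple[str, Optional[str]]:
    """Return (root, morpho-lexical information) for a pronoun form."""
    if word in PRONOUN_DICT:          # Steps 1-3: retrieve root + MLI
        return PRONOUN_DICT[word]
    return word, None                 # Step 4: unknown -> root = word, MLI = null

print(analyze_pronoun("avanukku"))    # ('avan', '3SM+DAT')
```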

6.6 MORPHOLOGICAL ANALYZER FOR PROPER NOUN USING SUFFIXES

The morphological analyzer for proper nouns is developed using suffixes. Figure 6.8 shows the implementation of the proper noun morph analyzer. The proper noun word form, identified from its POS tag in the minimized POS tagged sentence, is taken as input for proper noun morphological analysis. The word form is first converted into Roman form for easy computation, using a simple key-pair mapping of Roman and Tamil characters: the mapping program recognizes each Tamil character unit and replaces it with the corresponding Roman character. This Roman form is given to the proper noun analyzer system, which compares the word with a predefined suffix set. The system first identifies the suffix and replaces it with the corresponding information in the proper noun suffix data set. The suffix data set is created using various proper noun inflections and their end characters. For example, in Table 6.14, the word 'sithamparam' (சிதம்பரம்) ends with 'm' (ம்) and the word 'pANdisEri' (பாண்டிச்சேரி) ends with 'ri' (ரி); the possible inflections of both words are given in the table. The morphological changes differ for proper nouns based on their end characters, so end characters are used in creating the rules. From the various inflections of the word form, the suffix is identified, and the remaining part is the stem. The suffix is mapped to the corresponding morphological information: the algorithm replaces the encountered suffix with the morphological information in the suffix table. The steps are as follows (see the sketch after this list):

Step 1: The suffix of the input word is identified using the suffix table.
Step 2: The identified suffix is stripped from the word.
Step 3: Suffix stripping also yields the stem of the word.
Step 4: Based on the suffix, the stem is converted into the root word.
Step 5: The morpho-lexical information is identified for the suffix.
Step 6: The final output is the combination of the root word and the morpho-lexical information.
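A minimal sketch of these steps follows; the suffix table entries are illustrative Romanized stand-ins for the real suffix data set, which keys its rules on Tamil end characters.

```python
# A minimal sketch of the suffix-stripping steps above. The suffix
# table entries are illustrative Romanized forms assumed for this
# example only.

# suffix -> (stem ending that restores the lemma, MLI)
SUFFIX_TABLE = {
    "aththai": ("am", "ACC"),  # sithamparaththai -> sithamparam + ACC
    "aththil": ("am", "LOC"),  # sithamparaththil -> sithamparam + LOC
    "iyai":    ("i",  "ACC"),  # ...Eriyai        -> ...Eri + ACC
}

def analyze_proper_noun(word: str):
    for suffix, (ending, mli) in SUFFIX_TABLE.items():
        if word.endswith(suffix):          # Step 1: identify suffix
            stem = word[: -len(suffix)]    # Steps 2-3: strip it, get stem
            return stem + ending, mli      # Steps 4-6: lemma + MLI
    return word, "NOM"                     # uninflected form

print(analyze_proper_noun("sithamparaththai"))  # ('sithamparam', 'ACC')
```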


Figure 6.8 Implementation of Proper Noun Morph Analyzer (input word → suffix/stem check → split suffix and stem → convert stem to lemma using the suffix DB → attach the suffix's morpho-lexical information → lemma + MLI as morphological output)

Table 6.14 Example for Proper Noun Inflections

Root | சிதம்பரம் | பாண்டிச்சேரி
Root+ACC | சிதம்பரத்தை | பாண்டிச்சேரியை
Root+LOC | சிதம்பரத்தில் | பாண்டிச்சேரியில்
Root+DAT | சிதம்பரத்திற்கு | பாண்டிச்சேரியிற்கு
Root+ABL | சிதம்பரத்திலிருந்து | பாண்டிச்சேரியிலிருந்து
Root+UM | சிதம்பரமும் | பாண்டிச்சேரியும்

6.7 RESULTS AND EVALUATION

The efficiency of the system is assessed in this section, and various machine learning tools are compared using the same morphologically annotated data. The system accuracy is estimated at several levels, which are briefly discussed below.

Training Data vs Accuracy

In Figure 6.9, the X axis represents the training data size and the Y axis the accuracy. From the graph, it is found that the morphological analyzer accuracy increases with the volume of training data. Accuracies are calculated for training corpus sizes from 10k to 300k.

Figure 6.9 Training Data vs Accuracy

Tagged and Untagged Accuracies

In the sequence-based morphological system, output is obtained in two different stages using the trained models. The first stage takes a sequence of characters as input and gives untagged morphemes as output using the trained Model-I; this corresponds to morpheme identification. In the second stage, these morphemes are tagged using the trained Model-II. The accuracies of the untagged and tagged morphemes for verbs and nouns are shown in Table 6.15.

Table 6.15 Tagged vs Untagged Accuracies

Accuracy | Verb | Noun
Untagged (Model-I) | 93.56% | 94.34%
Tagged (Model-II) | 91.73% | 92.22%


Word-level and character-level accuracies

Accuracies are compared at the word level as well as the character level. 2300 verb tokens and 1750 noun tokens were taken randomly from the POS tagged corpus for testing the system. Table 6.16 shows the number of words and characters in the whole testing data set, together with the prediction efficiency.

Table 6.16 Number of Words and Characters and Level of Efficiency

Category | VERB Words | VERB Characters | NOUN Words | NOUN Characters
Testing data | 2300 | 20627 | 1750 | 10534
Predicted correctly | 2071 | 19089 | 1639 | 9645
Efficiency | 90.4% | 92.5% | 91.5% | 93.6%

The word-level and character-level efficiencies are calculated by the following formulae:

Word-level efficiency = (number of words split correctly) / (total number of words in the testing set)

Character-level efficiency = (number of characters tagged correctly) / (total number of characters in the testing set)

For example, the verb word-level efficiency is 2071/2300 = 90.4%.

Sentence-level accuracies

POS tagged sentences are given to the morphological analyzer tool, so the accuracy of POS tagging affects the performance of the analyzer. Here, 1200 POS tagged sentences consisting of 8358 words were taken for testing the morphological system. Table 6.17 shows the sentence-level accuracy of the morphological analyzer system. For the other categories of simplified POS tags, the part-of-speech information is considered as the morphological information.


Table 6.17 Sentence Level Accuracies

Category | Input | Correct output | Percentage
N | 2821 | 2642 | 93.65
V | 2794 | 2543 | 91.00
P (Pronoun) | 562 | 543 | 96.61
PN | 279 | 258 | 92.47
O (Others) | 1902 | 1817 | 95.53
Overall accuracy | | | 93.86

6.8 PREPROCESSED ENGLISH AND TAMIL SENTENCE

English sentences are preprocessed using an existing parser and developed rules (Chapter 4). For Tamil, the POS tagger (Chapter 5) and the morphological analyzers (Chapter 6) are used to preprocess the sentences; preprocessing of Tamil sentences is the same as the factorization of Tamil sentences. Table 6.18 shows an example of a preprocessed English and Tamil sentence.

Table 6.18 Preprocessed English and Tamil Sentence

Preprocessed English sentence:
I | i | PN | prn_i
my | my | PN | PRP$
home | home | N | NN_to
vegetables | vegetable | N | NNS
bought | buy | V | VBD_1S

Preprocessed Tamil sentence:
நான் | நான் | P | null
என்னுடைய | என் | P | poss
வீட்டிற்கு | வீடு | N | DAT
காய்கறிகள் | காய்கறி | N | PL
வாங்கினேன் | வாங்கு | V | PAST_1S

6.9 SUMMARY

This chapter explained the development of a morphological analyzer for Tamil using a machine learning approach. Capturing the agglutinative structure of Tamil words in an automatic system is a challenging job. Generally, rule-based approaches are used for building morphological analyzers; here, the Tamil morphological analyzer for nouns and verbs is developed using a new, state-of-the-art machine learning approach in which the morphological analysis problem is redefined as a classification problem. The approach is based on sequence labeling and training with kernel methods, which capture the non-linear relationships of the morphological features from the training data samples in a better and simpler way. An SVM-based tool is used for training the system on 6 lakh morphologically tagged verbs and nouns. The same methodology has been implemented for other Dravidian languages such as Malayalam, Telugu and Kannada. Tamil pronouns and proper nouns are handled by separate analyzer systems. Other word classes need not be further analyzed for morphological features, so their POS tag information is taken as the morphological information.


CHAPTER 7

FACTORED SMT SYSTEM FOR ENGLISH TO TAMIL

This chapter outlines the Statistical Machine Translation system and its components. It also explains the integration of linguistic knowledge in factored translation models and the development of factored corpora.

7.1 STATISTICAL MACHINE TRANSLATION

The noisy channel model is used in speech recognition and machine translation. If one needs to translate a sentence f in the source language F into a sentence e in the target language E, the noisy channel model describes the situation as follows: the sentence f to be translated was initially conceived in language E as some sentence e, and during transmission the sentence e was corrupted by the channel into the sentence f. Each sentence in E is assumed to be a translation of the sentence f with some probability, and the sentence chosen as the translation, ê, is the one with the highest probability, as given in equation (7.1). In mathematical terms [172],

ê = argmax_e P(e | f)    (7.1)

According to Bayes' theorem,

P(e | f) = P(e) P(f | e) / P(f)    (7.2)

Therefore, equation (7.1) becomes

ê = argmax_e P(e) P(f | e) / P(f)    (7.3)

Since f is fixed, P(f) can be omitted from the maximization in equation (7.3):

ê = argmax_e P(e) P(f | e)    (7.4)


Here, P(e | f) is split into P(e) and P(f | e) according to Bayes' rule. This is done because practical translation models give high probabilities to P(f | e) or P(e | f) when the words in f are generally translations of the words in e. For instance, when the English sentence "He will come tomorrow" is translated to Tamil, both P(e | f) and P(f | e) give equal probabilities to the following sentences:

அவன் நாளை வருவான். (avan wALai varuvAn)
அவன் வருவான் நாளை. (avan varuvAn wALai)

This problem can be avoided when equation (7.4) is used instead of equation (7.1): the second sentence will be ruled out, since the first sentence has a much higher value of P(e), and the first sentence will be taken into consideration during the translation process. This leads to another perspective on the statistical machine translation model: the best translation is the output sentence that is both faithful to the original sentence and fluent in the target language [173]. This is shown in equation (7.4), where P(e) is the probability of sentences that are likely in the target language — the language model, which is responsible for the fluency of the translation — and P(f | e) is the probability of the way in which sentences in E get translated to sentences in F — the translation model, which is responsible for the faithfulness of the translated sentence. The block diagram of the noisy channel model for Statistical Machine Translation (SMT) is given in Figure 7.1.

Figure 7.1 The Noisy Channel Model of Machine Translation

COMP PONENTS OF SMT T

The three mainn componentts in statisticcal machine translation t aare, 1. Translation model 201  

2. Language model 3. The Statistical Machine Translation Decoder

7.2.1 Translation Model

The translation system must produce words that preserve the original meaning and arrange them in a sequence that forms a fluent sentence in the target language. The role of the translation model is to find P(f | e), the probability of the source sentence f given the translated sentence e; note that the translation model computes P(f | e) and not P(e | f). The training corpus for the translation model is a sentence-aligned parallel corpus of the languages F and E. One might try to compute P(f | e) directly from counts of the sentences f and e in the parallel corpus, but the problem, again, is data sparsity. The apparent solution is to find (or approximate) the sentence translation probability using the translation probabilities of the words in the sentences; the word translation probabilities in turn can be found from the parallel corpus. There is, however, a remaining issue: the parallel corpus gives us only the sentence alignments; it does not tell us how the words in the sentences are aligned. A word alignment between sentences tells us exactly how each word in sentence f is translated in e. The problem of obtaining word alignment probabilities from a corpus that is only sentence-aligned is solved by using the Expectation-Maximization (EM) algorithm.

7.2.1.1 Expectation Maximization

The key intuition behind Expectation-Maximization is that if the number of times a word aligns with another in the corpus is known, then it is easy to calculate the word translation probabilities; conversely, if the word translation probabilities are known, then it is possible to find the probabilities of the various alignments. One can therefore start with uniform word translation probabilities, calculate alignment probabilities, and then use these alignment probabilities to obtain better translation probabilities. This iterative procedure, called the Expectation-Maximization algorithm, works because words that are actually translations of each other co-occur in the sentence-aligned corpus.
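The EM intuition can be sketched in the style of IBM Model 1 word alignment; the toy corpus and the uniform initialisation below are illustrative assumptions, not the thesis's training setup.

```python
# A minimal sketch of EM for word translation probabilities in the
# style of IBM Model 1. The two-sentence toy corpus and the uniform
# initialisation are illustrative only.

from collections import defaultdict

corpus = [("he came".split(), "avan vawthAn".split()),
          ("he went".split(), "avan pOnAn".split())]

t = defaultdict(lambda: 0.25)  # uniform initial t(f | e)

for _ in range(20):                        # EM iterations
    count = defaultdict(float)
    total = defaultdict(float)
    for e_sent, f_sent in corpus:
        for f in f_sent:                   # E-step: expected counts
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():        # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

print(round(t[("avan", "he")], 2))         # approaches 1.0
```

Because 'avan' co-occurs with 'he' in both sentence pairs while the other words do not, the re-estimated probability mass concentrates on the correct pairings, exactly as the intuition above describes.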

7.2.1.2 Word-based Translation Model

As explicitly introduced as a model parameter in the IBM formulation, word alignment becomes a function from source positions j to target positions i, so that a(j) = i. This definition implies that the resulting alignment solutions will never contain many-to-many links, but only many-to-one, as only one function result is possible for a given source position j. Although this limitation does not account for many real-life alignment relationships, in principle the IBM models can address it by estimating the probability of generating the source empty word, which can translate into non-empty target words. However, many current statistical machine translation systems do not use the IBM model parameters in their training methods, but only the most probable alignment (using a Viterbi search) given the estimated IBM models. Therefore, in order to obtain many-to-many word alignments, alignments from source-to-target and target-to-source are usually performed, and symmetrization strategies have to be applied.

In the word-based translation model [174], the translation elements are words. Typically, the number of words in translated sentences differs, due to compound words, morphology and idioms. The ratio of the lengths of sequences of translated words is called fertility, which tells how many English words each native word produces. Simple word-based translation is not able to translate language pairs with fertility rates different from one. To make word-based translation systems handle, for instance, high fertility rates, the system could map a single word to multiple words, but not vice versa: if one is translating from English to Tamil, each word in Tamil could produce zero or more English words, but there is no way to group two Tamil words producing a single English word. An example of a word-based translation system is the freely available GIZA++ package, which includes the training programs for the IBM models and HMM models. Word-based translation is not widely used today compared to phrase-based systems; however, most phrase-based systems still use GIZA++ to align the corpus. The

alignments are then used to extract phrases or induce syntactic rules, and the word alignment problem is still actively discussed in the community. Because of the importance of GIZA++, several distributed implementations of it are now available online. Statistical machine translation is based on the assumption that every sentence t in a target language is a possible translation of a given sentence e in a source language. The main difference between two possible translations of a given sentence is the probability assigned to each, which is to be learned from a bilingual text corpus. The first statistical machine translation models applied these probabilities to words, therefore considering words to be the translation units of the process.

7.2.1.3 Phrase-based Translation Model

In the phrase-based translation model [175], the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, whose lengths may differ. The sequences of words are called blocks or phrases; they are typically not linguistic phrases but phrases found from corpora using statistical methods. The job of the translation model, given a Tamil sentence T and an English sentence E, is to assign a probability that T generates E. While one can estimate these probabilities by considering how each individual word is translated, modern statistical machine translation is based on the intuition that a better way to compute them is by considering the behavior of phrases: sequences of words, as well as single words, serve as the fundamental units of translation. The generative story of phrase-based translation has three steps. First, the source words are grouped into phrases E1, E2, ..., El. Second, each Ei is translated into Ti. Finally, each phrase in the source is reordered. The probability model for phrase-based translation relies on a translation probability and a distortion probability. The factor φ(Ti | Ei) is the translation probability of generating source phrase Ti from target phrase Ei, and the reordering of the source phrases is captured by the distortion probability d. The distortion probability in phrase-based

In phrase-based translation model [175], the aim is to reduce the restrictions of wordbased translation by translating whole sequences of words, where the lengths may differ. The sequences of words are called blocks or phrases, but typically are not linguistic phrases but phrases found using statistical methods from corpora. The job of the translation model, given a Tamil sentence T and an English sentence E, is to assign a probability that T generates E. While one can estimate these probabilities by thinking about how each individual word is translated. Modern statistical machine translation is based on the intuition that a better way to compute these probabilities is by considering the behavior of phrases. The intuition of phrasebased statistical machine translation is to use phrases i.e., sequences of words as well as single words as the fundamental units of translation. The generative story of phrase based translation has three steps. First, the source word is grouped into phrases E1 , E2 ,… El . Second, each Ei is translated into Ti . Finally, each phrase in the source is reordered. The probability model for phrase based translation relies on a translation probability and distortion probability. The factor ϕ (Ti | Ei ) is the translation probability of generating source phrase Ti from target phrase Ei . The reordering of the source phrase is done by distortion probability d. The distortion probability in phrase based 204  

translation means the probability of two consecutive Tamil phrases being separated in English by a span of English word of a particular length. The distortion is parameterized by d (ai − bi −1 ) where ai is the start position of the source English phrase generated by the ith Tamil phrase, and bi −1 is the end position of the source English phrase generated by i-1th Tamil phrase. One can use a very simple distortion probability which penalizes large distortions by giving lower and lower probability for larger distortion. The final translation model for phrase based machine translation is based on the equation (7.5). P(T | E ) = ∏ ϕ (Ti | Ei )d (ai − bi − 1)

(7.5)

i

Phrase based models works in a successful manner only if the source and the target language have almost same in word order. Difference in the order of words in phrase based models is handled by calculating distortion probabilities. Reordering is done by the phrase based models. It has been shown that restricting the phrases to linguistic phrases decreases the quality of translation. By the turn of the century it became clear that in many cases specifying translation models at the level of words turned out to be inappropriate, as much local context seemed to be lost during translation. Novel approaches needed to describe their models according to longer units, typically sequences of consecutive words or phrases. The translation process takes three steps: 1. The sentence is first split into phrases - arbitrary contiguous sequences of words. 2. Each phrase is translated. 3. The translated phrases are permuted into their final order. The permutation problem and its solutions are identical to those in word-based translation. Consider the following particular set of phrases for our example sentences: Position Tamil

1 ேநற்

2 நான்

205  

3 அவைள

4 பார்த்ேதன்

English

Netru

naAn

avaLai

pArththEn

yesterday

i

saw

her

Since the phrases do not follow each other directly in order, the distortions are not all 1, and the probability P(E | T) can be computed as:

P(E | T) = P(yesterday | Netru) × d(1) × P(i | naAn) × d(1) × P(her | avaLai) × d(2) × P(saw | pArththEn) × d(2)

Phrase-based models produce better translations than word-based models, and they are widely used. They successfully model many local re-orderings, and individual passages are often fluent. However, they cannot easily model long-distance reordering without invoking the expense of arbitrary permutation.
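The product of phrase translation probabilities and distortion penalties computed above can be sketched as follows; the probability values and the decaying distortion penalty are illustrative assumptions, not estimates from data.

```python
# A minimal sketch of the phrase-based score P(E|T) computed above.
# The phrase probabilities and the decay constant are illustrative.

phrase_probs = {
    ("yesterday", "Netru"): 0.8,
    ("i", "naAn"): 0.9,
    ("her", "avaLai"): 0.7,
    ("saw", "pArththEn"): 0.6,
}

def d(distance: int, alpha: float = 0.5) -> float:
    """Distortion penalty that decays with the jump distance."""
    return alpha ** abs(distance)

# (English phrase, Tamil phrase, distortion distance) as in the example
derivation = [("yesterday", "Netru", 1), ("i", "naAn", 1),
              ("her", "avaLai", 2), ("saw", "pArththEn", 2)]

score = 1.0
for e, t, dist in derivation:
    score *= phrase_probs[(e, t)] * d(dist)  # phi(E_i|T_i) * d(...)

print(score)
```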

7.2.2 Language Model

In general, the language model is used to estimate the fluency of the translated sentence. This plays an important role in the statistical approach, as it chooses the most fluent sentence, with the highest value of P(e), among all possible translations generated by the translation model P(f | e). The language model can be defined as the model that estimates and assigns a probability P(e) to a sentence e: a high value is assigned to the most fluent sentence and a low value to the least fluent one. The language model can be estimated from a monolingual corpus of the target language of the translation process. For example, consider the English sentence "The pen is on the table." If it is translated to Tamil by a system without a language model, the following are a few possible translations output by the translation model alone:

• எழுதுகோல் மேஜையின் மேல் உள்ளது.
• எழுதுகோல் மேல் மேஜையின் உள்ளது.
• மேஜையின் மேல் எழுதுகோல் உள்ளது.

ேகால் உள்ள .

Even the second and third translation looks awkward to read, the probability assigned to the translation model to each sentences will be same, as translation model mainly concerns with producing the best output words for each word in the source sentence, e. But when the fluency and accuracy of the translation comes into picture, only the first translation of the given English sentence is correct. This problem can be very well handled by the language models. This is because the probability assigned by the language model for the first sentence will be greater when compared with the other two sentences. That is,

P(எழுதுகோல் மேஜையின் மேல் உள்ளது.) > P(எழுதுகோல் மேல் மேஜையின் உள்ளது.) > P(மேஜையின் மேல் எழுதுகோல் உள்ளது.)

Consider a sentence e which consists of n words w1, w2, ..., wn. The naive estimation of P(e) on a monolingual corpus of the target language with N words is done as in equation (7.6):

P(e) = ∏_{i=1}^{n} P(wi)    (7.6)

where

P(wi) = count(wi) / N    (7.7)

Equation (7.6) will assign zero probability to the sentence e if any word in the sentence has not occurred in the monolingual corpus. This in turn will

affect the accuracy of the translation process. In order to overcome this problem the probabilities that have been estimated from the corpus have to be approximated and this can be done by using n-gram language models.

7.2.2.1 N-gram Language Models

One of the most widely used methods for language modelling is the n-gram language model. In general, n-gram language models are based on statistics of how likely words are to follow each other. For the above example, when a corpus is analysed to determine the probability of the sentence using an n-gram language model, the probability of the word மேல் following the word மேஜையின் will be greater than that of the other words. In n-gram language modelling, the process of predicting a word sequence W is broken up into predicting one word at a time. Thus the probability is decomposed using the chain rule as in equation (7.8):

P(w1, w2, ..., wn) = P(w1) P(w2 | w1) ... P(wn | w1, w2, ..., wn−1)    (7.8)

The language model probability can be defined as the probability of a word given a history of preceding words. In order to estimate these probability distributions, it is possible to limit the history to m words, as in equation (7.9):

P(wn | w1, w2, ..., wn−1) ≈ P(wn | wn−m, ..., wn−2, wn−1)    (7.9)

A chain in which only a limited history is considered is called a Markov chain, and the number of previous words considered is termed the order of the model. This follows from the Markov assumption, which states that only a limited number of previous words affect the probability of the next word. The assumption can be proved wrong with counterexamples showing that a longer history is needed. Typically, the order of the language model is based on the amount of training data available: limited data restricts the model to short histories, i.e. a small order. Generally, trigram language models are used, but language models of smaller order, such as unigrams and bigrams, as well as models of higher order,

208  

are also used. In most cases this depends mainly on the amount of data from which language model probabilities are estimated. The language model estimates the probability, P (e) to a sequence of words {

w1 , w2 ,… wm } in a sentence e using the n-gram approach as the product of conditional probabilities of each word wi given the previous N-1 words. In other words, an n-gram model [43] can be defined as the probability of the word given the previous N-1 words instead of the probability of a word given all the previous words. Thus the probability assigned by the language model using the N-gram approach to a sequence of words {

w1 , w2 ,… wm } in sentence e is given by equation (7.10). m

P ( wn | w1 ,… , wm ) = ∏ P ( wi | wi −( n −1) ,… , wi −1 )

(7.10)

i =1

In the n-gram approach, the probability of each word w_i is conditioned only on the previous n-1 words. Though the n-gram approach is simple, it has been incorporated into many applications such as speech recognition, spell checking, translation and other tasks where language modelling is required. In general, an n-gram model that computes the probability of a word given the previous one word is termed a bigram model, and one that conditions on the previous two words a trigram model. Language model probabilities under the n-gram approach can be calculated directly from a monolingual corpus. The equation for calculating trigram probabilities is given by equation (7.11):

P(w_n | w_{n-2}, w_{n-1}) = count(w_{n-2} w_{n-1} w_n) / Σ_w count(w_{n-2} w_{n-1} w)        (7.11)

Here count(w_{n-2} w_{n-1} w_n) denotes the number of occurrences of the sequence w_{n-2} w_{n-1} w_n in the corpus. The denominator on the right-hand side sums, over all words w in the corpus, the number of times w_{n-2} w_{n-1} occurs before any word. Since this is just count(w_{n-2} w_{n-1}), equation (7.11) can be written as equation (7.12):

P(w_n | w_{n-2}, w_{n-1}) = count(w_{n-2} w_{n-1} w_n) / count(w_{n-2} w_{n-1})        (7.12)
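As an illustration of equation (7.12), the following minimal Python sketch (an illustrative aid, not part of the thesis toolchain) estimates trigram probabilities by relative frequency from a toy corpus:

from collections import defaultdict

def train_trigram_lm(corpus):
    # Estimate P(w_n | w_{n-2}, w_{n-1}) by relative frequency, as in (7.12).
    tri_counts = defaultdict(int)   # count(w_{n-2} w_{n-1} w_n)
    bi_counts = defaultdict(int)    # count(w_{n-2} w_{n-1})
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        for i in range(2, len(tokens)):
            history = (tokens[i - 2], tokens[i - 1])
            tri_counts[history + (tokens[i],)] += 1
            bi_counts[history] += 1
    return {tri: c / bi_counts[tri[:2]] for tri, c in tri_counts.items()}

lm = train_trigram_lm(["the cat sat", "the cat ran"])
print(lm[("<s>", "the", "cat")])   # 1.0: 'cat' always follows '<s> the' here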

7.2.3 The Statistical Machine Translation Decoder

The statistical machine translation decoder performs decoding, which is the process of finding a target translated sentence for a source sentence using the translation model and the language model. In general, decoding is a search problem that maximizes the translation and language model probabilities. Statistical machine translation decoders use best-first search based on heuristics; in other words, the decoder is responsible for searching for the best translation in the space of possible translations. Given a translation model and a language model, the decoder constructs the possible translations and looks for the most probable one. There are numerous decoders for statistical machine translation; two important kinds are greedy decoders and beam-search decoders. In greedy decoders, the initial hypothesis is a word-to-word translation which is refined iteratively using hill-climbing heuristics. Beam-search decoders use a heuristic search algorithm that explores a graph by expanding the most promising nodes within a limited set.
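As a toy illustration of the search idea only (real decoders such as Moses additionally handle phrase segmentation, reordering and future-cost estimation; the two callback functions here are assumed, hypothetical interfaces):

import heapq

def beam_search_decode(src_words, translate_options, lm_score, beam_size=5):
    # Expand partial translations left to right, keeping only the
    # beam_size highest-scoring hypotheses at each step.
    # translate_options(word) -> list of (target_word, tm_log_prob)
    # lm_score(prev_word, word) -> language model log probability
    beam = [(0.0, ["<s>"])]                     # (score, partial translation)
    for src in src_words:
        candidates = []
        for score, hyp in beam:
            for tgt, tm in translate_options(src):
                new_score = score + tm + lm_score(hyp[-1], tgt)
                candidates.append((new_score, hyp + [tgt]))
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])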

7.3 INTEGRATING LINGUISTIC INFORMATION IN SMT

7.3.1 Factored Translation Models

Factored translation models [10] can be defined as an extension of phrase-based models in which every word is substituted by a vector of factors such as word, lemma, part-of-speech information, morphology, etc. Here, the translation process becomes a combination of pure translation and generation steps. Figure 7.2 provides a simple block diagram illustrating the translation and generation steps. Factored translation models differ from standard phrase-based models in the following ways [10]:

• The parallel corpus must be annotated with factors such as lemma, part-of-speech, morphology, etc., before training.

• Additional language models for every annotated factor can be used in training the system.

• Translation steps are similar to those of standard phrase-based systems, but generation steps are trained only on the target side of the corpus.

• Models corresponding to the different factors and components are combined in a log-linear fashion.

Figure 7.2 Block Diagram for Factored Translation

The current state-of-the-art approaches to statistical machine translation, the so-called phrase-based models, are limited to the mapping of small text chunks (phrases) without any explicit use of linguistic information. Such additional information has been demonstrated to be valuable by integrating it in pre-processing or post-processing. Therefore, a framework is developed for statistical translation models to integrate additional information. This framework is an extension of the phrase-based approach; it adds additional annotation at the word level.

In baseline SMT, the word house is completely independent of the word houses. Any instance of house in the training data does not add any knowledge to the

translation of houses. In the extreme case, while the translation of house may be known to the model, the word houses may be unknown, and the system will not be able to translate it. This problem does not show up as strongly in English, due to its very limited morphological production, but it is a significant problem for morphologically rich languages. Thus, it may be preferable to model translation between morphologically rich languages on the level of lemmas, thereby pooling the evidence for the different word forms that derive from a common lemma. In such a model, lemma and morphological information are translated separately, and this information is combined on the output side to ultimately generate the output surface words. Such a model can be defined straightforwardly as a factored translation model.

7.3.1.1 Decomposition of Factored Translation

The translation of factored representations of input words into the factored representations of output words is broken up into a sequence of mapping steps that either translate input factors into output factors, or generate additional output factors from existing output factors. In this model the translation process is broken up into the following three mapping steps (a small sketch follows the list):

• Translate input lemmas into output lemmas
• Translate morphological and POS factors
• Generate surface forms given the lemma and linguistic factors
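To make the decomposition concrete, the following minimal Python sketch mirrors the three mapping steps on a single word; the factor names and the toy mapping functions are illustrative assumptions, not the actual models learned from the corpus:

def translate_lemma(lemma):                  # step 1: lemma -> lemma
    return {"house": "வீடு"}.get(lemma)

def translate_factors(pos, morph):           # step 2: POS/morph factors
    return {("NN", "plural"): "PL"}.get((pos, morph))

def generate_surface(lemma, morph):          # step 3: surface generation
    return lemma + ("கள்" if morph == "PL" else "")

word = {"surface": "houses", "lemma": "house", "pos": "NN", "morph": "plural"}
tgt_lemma = translate_lemma(word["lemma"])
tgt_morph = translate_factors(word["pos"], word["morph"])
print(generate_surface(tgt_lemma, tgt_morph))   # வீடுகள் ('houses')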

7.3.2 Syntax-based Translation Models

Syntax-based translation models [176] use parse-tree representations of the sentences in the training data to learn, among other things, tree transformation probabilities. These methods require a parser for the target language and, in some cases, the source language too. Yamada and Knight (2001) propose a model that transforms target language parse trees to source language strings by applying reordering, insertion and translation operations at each node of the tree. In general, this model incorporates syntax on the source and/or target side. Graehl et al. [177] and Melamed [178] propose methods based on tree-to-tree mappings. Imamura et al. (2005) [179] present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation. Recently, various preprocessing approaches have been proposed for handling syntax within statistical machine translation. These algorithms attempt to reconcile the word-order differences between the source and target language sentences by reordering the source language data prior to the SMT training and decoding cycles.

Approaches in syntax-based models:

• Syntactic phrase-based models based on tree transducers:
  o Tree-to-string: build mappings from target parse trees to source strings.
  o String-to-tree: build mappings from target strings to source parse trees.
  o Tree-to-tree: build mappings from parse trees to parse trees.
• Synchronous grammar formalisms that learn a grammar which can simultaneously generate both trees:
  o Syntax-based: respect linguistic units in translation.
  o Hierarchical phrase-based: respect phrases in translation.

7.4 TOOLS USED IN SMT SYSTEM

7.4.1 MOSES

Moses is an open-source toolkit for statistical machine translation. Moses is an extended phrase-based machine translation system with factors and confusion-network decoding. Morphological, syntactic and semantic information can be integrated as factors during training. The confusion network allows the translation of ambiguous sentences; this enables, for instance, the tight integration of speech recognition and machine translation: instead of passing along the single best output of the recognizer, a network of different word choices may be examined by the machine translation system [180]. Moses has an efficient data structure that allows memory-intensive translation models and language models to exploit larger data resources with limited hardware. It implements an efficient representation of the phrase translation table using a prefix tree structure, which allows loading into memory only the fraction of the phrase table that is needed to translate the test sentences. Moses uses the beam-search algorithm, which quickly finds the highest-probability translation among the exponential number of choices.



• Moses offers two types of translation models: phrase-based and tree-based.

• Moses features factored translation models, which enable the integration of linguistic and other information at the word level.

• Moses allows the decoding of confusion networks and word lattices, enabling easy integration with ambiguous upstream tools, such as automatic speech recognizers or morphological analyzers.

The development of Moses is mainly supported by the EuroMatrix, EuroMatrixPlus and LetsMT projects, funded by the European Commission under Framework Programmes 6 and 7.

7.4.2 GIZA++ & MKCLS

GIZA++ is a statistical machine translation toolkit that is used to train IBM Models 1-5 and an HMM word alignment model. It is an extension of GIZA, which was designed as part of the EGYPT SMT toolkit and includes IBM Models 3 and 4. It uses the mkcls [181] tool for unsupervised classification to help the model. It also implements the HMM alignment model and various smoothing techniques for the fertility, distortion and alignment parameters. A bilingual dictionary is built by this tool from the bilingual corpus. More details about GIZA++ can be found in [182].

7.4.3 SRILM

SRILM is a toolkit for language modelling that can be used in speech recognition, statistical tagging and segmentation, and statistical machine translation. It is a freely available collection of C++ libraries, executable programs and supporting scripts that can build and manage language models. SRILM implements various smoothing algorithms such as Good-Turing, absolute discounting, Witten-Bell and modified Kneser-Ney. Besides the standard word-based n-gram back-off models, SRILM implements several other language model types [182], such as word-class-based n-gram models, cache-based models, disfluency and hidden-event language models, HMMs of n-gram models and many more.

7.5 DEVELOPMENT OF FACTORED CORPORA

7.5.1 Parallel Corpora Collection

Corpus is the term used in linguistics for a (finite) collection of texts in a specific language. A collection of documents in more than one language is called a multilingual corpus. A parallel corpus is a collection of texts in different languages where one of them is the original text and the others are its translations. A bilingual corpus is a collection of texts in two different languages where each is a translation of the other. Parallel corpora are very important resources for tasks in the translation field, such as linguistic studies, the development of information retrieval systems and natural language processing. In order to be useful, these resources must be available in reasonable quantities, because most application methods are based on statistics. The quality of the results depends a great deal on the size of the corpora, which means robust tools are needed to build and process them. Alignment at the sentence and word levels makes parallel corpora both more interesting and more useful, and aligned bilingual corpora have proved useful in many ways, including machine translation, sense disambiguation and bilingual lexicography.

Parallel sentences for the English-Tamil language pair are available, but not abundantly. In European countries, parallel data for many European language pairs are available from the proceedings of the European Parliament, but for Tamil no such parallel data are readily available. Hence English sentences have to be collected and manually translated into Tamil in order to create a bilingual corpus for the English-Tamil language pair. Even when parallel data are available for English-Tamil, there is a chance that they are not aligned properly, and the paragraphs have to be separated into individual sentences (for example, newspaper corpora). This requires a great deal of human effort. It is time-intensive work, and as the bilingual corpus is the main resource for the statistical machine translation system, more time and importance have to be given to developing it. During manual translation of English sentences into Tamil, terminology data banks for the English-Tamil language pair are found to be very useful.

7.5.2 Monolingual Corpora Collection

The situation for developing a monolingual corpus for the Tamil language is not as difficult as for the bilingual English-Tamil corpus. Tamil data are available in the form of news on the websites of many Tamil newspapers, so it is not a tedious job to develop a monolingual corpus for Tamil. However, some human effort is necessary for pre-processing, to manually remove unnecessary words or characters from the data.

7.5.3 Automatic Creation of Factored Corpora

Before the bilingual English-Tamil corpus and the monolingual Tamil corpus are given to the statistical machine translation decoder and the language modelling kit, SRILM, respectively, for training the system to create the translation models and language models, both corpora have to be prepared. First, they are tokenized to separate words and punctuation: for example, 'coming,' is separated into 'coming' and ',' with a space in between. Second, they are lowercased so that words that differ only in case are treated as a single word; for example, 'He' and 'he' would otherwise be considered different entities by the statistical systems. Finally, in some cases the corpus is cleaned to remove sentences that exceed the maximum sentence length considered for the parallel corpus; cleaning is not necessary for the monolingual Tamil corpus.

Pre-processing plays a major role in creating factored training corpora. For English, reordering and compounding steps are used for creating the factored corpora; for Tamil, linguistic tools such as the POS tagger and morphological analyzers are used. Factored parallel sentences are given in Table 7.1.
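The basic corpus preparation can be sketched as follows (a minimal Python illustration with an assumed length limit, not the actual scripts used in this work):

import re

MAX_LEN = 80   # assumed maximum sentence length used for cleaning

def prepare(line):
    # Tokenize punctuation and lowercase: 'coming,' -> 'coming ,'
    line = re.sub(r"([.,!?;:])", r" \1 ", line)
    return line.lower().split()

def clean_pair(src, tgt, max_len=MAX_LEN):
    # Drop sentence pairs that exceed the length limit.
    s, t = prepare(src), prepare(tgt)
    return (s, t) if len(s) <= max_len and len(t) <= max_len else None

print(clean_pair("He is coming, now.", "அவன் இப்போது வருகிறான்."))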

Table 7.1 Factored Parallel Sentences

Factored English sentence: I|i|PN|prp_i school|school|N|nn_to went|go|V|vb.past_1S .|.|.|.
Factored Tamil sentence: நான்|நான்|PN|null பள்ளிக்கு|பள்ளி|NN|DAT சென்றேன்|செல்|V|PAST_1S .|.|.|.

Factored English sentence: I|i|PN|prp_i a|a|AR|det book|book|N|nn_ACC him|him|PN|prp_to gave|give|V|vb.past_1S .|.|.|.
Factored Tamil sentence: நான்|நான்|PN|null அவருக்கு|அவர்|PN|DAT ஒரு|ஒரு|AD|null புத்தகத்தை|புத்தகம்|MN|ACC கொடுத்தேன்|கொடு|V|PAST-1S .|.|.|.

Factored English sentence: the|the|AR|det cat|cat|N|nn the|the|AR|det rat|rat|N|nn_ACC killing|kill|V|vb.prog_3SN_was .|.|.|.
Factored Tamil sentence: ைன|ைன|NN|INS எ யால்|எ |NN|null ெகால்லப்பட் க்கும்|ெகால்|V|VP-IRU-UM .|.|.|.

7.6 FACTORED SMT FOR ENGLISH TO TAMIL LANGUAGE

Factored translation is an extension of phrase-based statistical machine translation that allows the integration of additional morphological and lexical information, such as lemma, word class, gender and number, at the word level on the source and target sides. In SMT, three key components are used: translation modelling, language modelling and decoding. These components are implemented using the GIZA++, SRILM and Moses toolkits described above: GIZA++ trains the IBM models and the HMM word alignment model, SRILM builds the language models, and Moses allows translation models to be trained automatically for any language pair. All that is needed is a collection of translated texts (a parallel corpus).

An efficient search algorithm quickly finds the highest-probability translation among the exponential number of choices. Morphological, syntactic and semantic information can be integrated as factors during training. Figure 7.3 explains the mapping of English factors to Tamil factors in the factored SMT system. Initially, the English factors "Lemma" and "Minimized-POS" are mapped to the Tamil factors "Lemma" and "M-POS"; then the "Minimized-POS" and "Compound-Tag" factors of English are mapped to the "Morphological information" factor of Tamil.

Figure 7.3 Mapping English Factors to Tamil Factors

Importantly, Tamil surface word forms are not generated in the SMT decoder. Only factors are generated by the SMT system, and the word is generated in the post-processing stage: the Tamil morphological generator is used in post-processing to generate a Tamil surface word from the output factors.

7.6.1 Building Language Model

The SRILM language modelling kit can be used to build an n-gram language model from the monolingual corpus of the Tamil language. The SRILM tool ngram-count can be used to generate n-gram language models of any order by specifying optional parameters for smoothing of unseen n-grams, such as interpolation, modified Kneser-Ney smoothing, absolute discounting, Good-Turing smoothing and Witten-Bell smoothing. The output of this tool is a language model file that contains the n-gram probabilities of each word in the monolingual corpus [183]. The general syntax of executing ngram-count in SRILM is:

>ngram-count -order n [options] -text CORPUS_FILE -lm LM_FILE

where:
-order n : the order of the n-gram language model, where n denotes the order
[options] : various switches, such as -interpolate, -kndiscount, -ndiscount and so on, that can be used when generating the language model file
-text : the file name of the monolingual corpus file
-lm : the file name of the language model file to be created
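For instance, a trigram language model with interpolated modified Kneser-Ney smoothing could be built as follows (the file names here are placeholders):

>ngram-count -order 3 -interpolate -kndiscount -text tamil.mono.txt -lm tamil.lm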

7.6.2 Building Phrase-based Translation Model

To build a phrase-based translation model, the Perl script train-model.perl in Moses is used. The script involves the following steps:

• Prepare the data: The parallel corpus is converted into a format that is suitable for the GIZA++ toolkit. Two vocabulary files are generated and the parallel corpus is converted into a numbered format. The vocabulary files contain words, integer word identifiers and word count information. GIZA++ also requires words to be placed into word classes; this is done by automatically calling the mkcls program. Word classes are only used for the IBM reordering model in GIZA++.

• Run GIZA++: GIZA++ is a freely available implementation of the IBM models. It is required as an initial step to establish word alignments. Our word alignments are taken from the intersection of bidirectional runs of GIZA++, plus some additional alignment points from the union of the two runs. Running GIZA++ is the most time-consuming step in the training process, and it also requires a lot of memory. GIZA++ learns the translation tables of IBM Model 4, but what is required here is the word alignment file.

• Align words: To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied. The default heuristic, grow-diag-final, starts with the intersection of the two alignments and then adds additional alignment points. Other possible alignment methods are intersection, grow, grow-diag, union, srctotgt and tgttosrc. Alternative alignment methods can be specified with the switch alignment.

• Get lexical translation table: Given the word alignment, it is quite straightforward to estimate a maximum-likelihood lexical translation table: w(e|f) and the inverse w(f|e) are estimated from the word alignment.

• Extract phrases: In the phrase extraction step, all phrases are dumped into one big file. Each line of this file contains a foreign phrase, an English phrase and the alignment points between them; alignment points are pairs (English, Tamil). An inverted alignment file extract.inv is also generated, and if the lexicalized reordering model is trained (the default), a reordering file extract.o is produced as well.



• Score phrases: Subsequently, a translation table is created from the stored phrase translation pairs. The two steps are separated because, for larger translation models, the phrase translation table does not fit into memory; fortunately, it need not be stored in memory and can be constructed on disk. To estimate the phrase translation probability ϕ(e|f), one proceeds as follows: first, the extract file is sorted, which ensures that all English phrase translations for a foreign phrase are next to each other in the file. The file can then be processed one foreign phrase at a time, collecting counts and computing ϕ(e|f) for that foreign phrase f. To estimate ϕ(f|e), the inverted file is sorted, and ϕ(f|e) is estimated one English phrase at a time. In addition to the phrase translation probability distributions ϕ(f|e) and ϕ(e|f), further phrase translation scoring functions can be computed, e.g. lexical weighting, word penalty and phrase penalty. Currently, five different phrase translation scores are computed: the phrase translation probability ϕ(f|e), the lexical weighting lex(f|e), the phrase translation probability ϕ(e|f), the lexical weighting lex(e|f), and the phrase penalty (always exp(1) = 2.718).
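The relative-frequency part of this step can be illustrated with a small Python sketch (assuming a simplified in-memory extract format 'foreign ||| english ||| alignment'; the real pipeline works on sorted files on disk):

from collections import defaultdict

def score_phrases(extract_lines):
    # Relative-frequency estimate of phi(e|f) from extract lines.
    pair_counts = defaultdict(int)
    f_counts = defaultdict(int)
    for line in extract_lines:
        f, e, _align = [part.strip() for part in line.split("|||")]
        pair_counts[(f, e)] += 1
        f_counts[f] += 1
    return {(f, e): c / f_counts[f] for (f, e), c in pair_counts.items()}

extract = ["வீடு ||| house ||| 0-0", "வீடு ||| home ||| 0-0", "வீடு ||| house ||| 0-0"]
print(score_phrases(extract)[("வீடு", "house")])   # 2/3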



• Build reordering model: By default, only a distance-based reordering model is included in the final configuration. This model assigns a cost linear in the reordering distance; for instance, skipping over two words costs twice as much as skipping over one word. Possible configurations are msd-bidirectional-fe (default), msd-bidirectional-f, msd-fe, msd-f, monotonicity-bidirectional-fe, monotonicity-bidirectional-f, monotonicity-fe and monotonicity-f.



• Build generation model: The generation model is built from the target side of the parallel corpus. By default, forward and backward probabilities are computed. If the switch generation-type single is used, only the probabilities in the direction of the step are computed.

• Create configuration file: As a final step, a configuration file for the decoder is generated with all the correct paths for the generated models and a number of default parameter settings. This file is called model/moses.ini.
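As an illustration, one possible invocation of the training script is shown below (the corpus name, language suffixes and paths are placeholders, not the exact settings used in this work):

>train-model.perl -root-dir train -corpus corpus/train.clean -f en -e ta -alignment grow-diag-final -reordering msd-bidirectional-fe -lm 0:3:/path/to/tamil.lm:0 -external-bin-dir /path/to/giza-tools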

7.7 SUMMARY

This chapter described the factored translation model, which extends the phrase-based model to incorporate linguistic information as additional factors in the representation of words. The limited availability of training data increases the advantage of using word-level information in more linguistically motivated models. Mapping translation factors in the factored model aids the disambiguation of source words and improves the grammar of the target factors. The SMT generation model is not utilized in this translation system; the model is tuned only for producing the lemma, POS tag and morphological factors. It is shown that the developed system improves translation over a baseline system and other factored systems.


CHAPTER 8

POSTPROCESSING FOR ENGLISH TO TAMIL SMT

8.1 GENERAL

The aim of Natural Language Processing (NLP) is to study the problems in the automatic generation and understanding of natural languages; computational models are built for analyzing and generating natural languages. Tamil is a morphologically rich and agglutinative language [134], so Tamil words are postpositionally inflected with various grammatical features. A Tamil verb specifies almost everything: gender, number and person markings, together with auxiliaries which represent mood and aspect [13]. Tamil nouns inflect for plural, case suffixes and postpositions. In the Tamil language, the lemma undergoes morphological change when it gets attached to certain morphemes.

Post-processing transforms the translated output from the SMT system into a standard target language sentence. In pre-processing, each Tamil word as well as each English word is segmented into a root and morphological information; in contrast, post-processing generates the Tamil word from the root word and the morphological information. The Tamil morphological generator is utilized in the post-processing stage of the English to Tamil machine translation system. The morphological generator takes a lemma, POS category and morpho-lexical description as input and gives a word-form as output; it is the reverse process of a morphological analyzer. In any natural language generation system, a morphological generator is an essential component of the post-processing stage.

Based on the syntactic category of a Tamil word, different approaches are followed for generating a word form. Morphological generators are developed for nouns, verbs and pronouns. The morphological generator systems for verbs and nouns are implemented using a new paradigm-based algorithm, which is simple and efficient. A paradigm classification is done for nouns and verbs based on Dr. S. Rajendran's paradigm classification [13]: Tamil verbs are classified into 32 paradigms and nouns into 25 paradigms. Tables 6.6 and 6.7 show the verb and noun paradigms for Tamil.

The proposed morphological generator algorithm requires only a minimal amount of data for generating a word form. It handles morpho-phonemic changes without using any hand-coded rules, so this approach can be easily implemented for less-resourced and morphologically rich languages. For generating pronoun word forms, a separate morphological generator is developed using a pattern-matching approach. The noun morphological generator is used for handling proper nouns. Other categories, such as adjectives and adverbs, are treated based on their suffixes and part-of-speech tags.

8.2 MORPHOLOGICAL GENERATOR

A morphological generator can be an individual module or be integrated with several NLP applications such as machine translation (MT), automatic sentence generation and information retrieval. An automated machine translation system requires a morphological analyzer for the source language and a morphological generator for the target language. The most competent approach to morphological generation is the use of Finite State Transducers (FST). A letter-transducer-based morphological analyzer and generator was developed by Alicia Garrido [184]. Perez Aguiar used an intuitive pattern-matching approach for developing a morphological generator for the Spanish language [184]. Guido Minnen and his team developed a morphological generator based on finite-state techniques, implemented using the widely available Unix Flex utility [185]. For Indian languages, many attempts have been made to build morphological generators. A Hindi morphological generator has been developed based on a data-driven approach [186]. Tel-More, a morphological generator for Telugu, is based on linguistic rules and implemented as a Perl program [187]. A morphological generator has been designed for the syntactic categories of Tamil using paradigms and sandhi rules [72], and finite state machines have also been used for developing a morphological generator for Tamil [131].

8.2.1 Challenges in Tamil Morphological Generation

Tamil is morphologically rich and agglutinative. Each verb is inflected with more than ten thousand forms, including auxiliaries and clitics; the inflections include finite, infinite, adjectival, adverbial and conditional forms of verbs. In generation, the inflections vary from one set of verbs to another, so it is difficult to translate/generate the required word form of a Tamil verb. To resolve this complexity, a classification of Tamil verbs based on tense markers and inflections is made.

Normally, rule-based approaches are used for developing a morphological generator system. In such an approach, the paradigm number of the given root word is identified using a lookup table, which contains the word class, the lemma and its corresponding paradigm number. If the given lemma is not present in the lookup table, the system fails to identify its paradigm, and it is difficult to create a lookup table with all dictionary words, as proper nouns and compound word forms are not easy to include in a dictionary. This challenge can be solved if the system can automatically identify the paradigm number of the word: when the lemma is given as input, the proposed system automatically identifies the paradigm number based on its end characters.

Another challenging task in Tamil word generation is handling morpho-phonemic change, which represents the modifications that occur when an inflection is attached to a root word. The proposed morphological generator system solves this by using stemming rules and suffixes. Morpho-phonemic changes are shown in Table 8.1. Suffixes play a major role in the proposed Tamil morphological generator system, and the creation of these suffixes for all the word forms is also a challenging job.

Table 8.1 Morpho-phonemic Changes

Word + Inflection           Morpho-phonemic Change        Word-form
பூ+கள் (pU+kaL)             பூ+க்+கள் (pU+k+kaL)          பூக்கள் (pUkkaL)
கத்தி+ஆல் (kaththi+Al)      கத்தி+ய்+ஆல் (kaththi+y+Al)   கத்தியால் (kaththiyAl)
படம்+ஐ (padam+ai)           பட+த்+ஐ (pada+th+ai)          படத்தை (padaththai)

8.2.2 Simplified Part-of-Speech Categories

The input to the morphological generator system is a factored sentence from the SMT output. The factored Tamil sentence is categorized according to the simplified POS tag; the simplified POS tagset is shown in Table 8.2. Based on this simplified tag factor, the morphological generator generates the word-form. The morphological generator for nouns handles proper nouns and common nouns, while the generation of Tamil verb forms is taken care of by the morphological generator for verbs.

Table 8.2 Simplified POS Tagset

 

Figure 8.1 shows the categorization of the Tamil sentence generation system. This system contains five different modules. Morphological generators for Tamil nouns and verbs are developed using a suffix-based approach. Tamil pronouns come under the 'closed word-class' category, so a pattern-matching technique is followed for generating pronominal word forms.

Figure 8.1 Tamil Sentence Generation (modules: morph generators for verbs, nouns, pronouns and other categories)

8.3 MORPHOLOGICAL GENERATOR FOR TAMIL NOUNS AND VERBS

This section presents a new, simple morphological generator algorithm for Tamil. Generally, a morphological generator tool is developed using a rule-based approach, which requires a set of morpho-phonemic (spelling) rules and a morpheme dictionary. The method proposed here can be applied to any morphologically rich language. In this novel approach, morpheme and word dictionaries are not required; the algorithm only needs the suffix table and the code for paradigm classification. If the lemma, POS category and Morpho-Lexical Inflection (MLI) are given, the proposed algorithm will generate the intended word form.

The morphological generator receives an input in the form lemma + word_class + morpho-lexical information, where lemma specifies the root word of the word-form to be generated, word_class specifies the grammatical (POS) category, and the morpho-lexical information specifies the type of inflection. The word class information is used to decide whether the particular word is a noun or a verb. The Morpho-lexical Information (MLI) has been extracted from the morphological analyzer tool for Tamil. Examples for the Tamil morphological generator system are given below:

Lemma + WC + MLI = WORD FORM
ஓடு + V + FT_3SM = ஓடுவான் (Odu + V + FT_3SM = OduvAn; 'run')
காடு + N + ACC = காட்டை (kAdu + N + ACC = kAddai; 'forest')

In the above examples, "V" represents a verb and "FT_3SM" represents future tense with third person singular masculine; "N" symbolizes a noun and ACC means the accusative case. FT_3SM and ACC are called Morpho-lexical Information (MLI).

Three different modules are used to build the noun and verb generator system. The first module takes the lemma and POS category as input and gives the lemma's paradigm number and the word's stem as output. The second module takes the morpho-lexical information as input and gives its index number as output. In the third module, a suffix table is used to generate the word form from the information produced by the above two modules.

8.3.1 Algorithm for Noun and Verb Morphological Generator

This subsection illustrates the new algorithm developed for the morphological generator system; it is implemented as a Java program. The algorithm is shown in Figure 8.2.

Input = (lemma + word class + morpho-lexical information)
lemma, wc, morph = SPLIT(Input)
roman_lemma = ROMAN(lemma)
parnum = PARNUM(roman_lemma, wc)
col-index = parnum
row-index = INDEX(morph, wc)
suff = SUFFIX-TABLE[row-index][col-index]
stem = STEM(roman_lemma, wc, parnum)
word = JOIN(stem, suff)
output = UNICODE(word)

Figure 8.2 Algorithm for Morphological Generator

In this algorithm, lemma represents the root word of the word form, wc denotes the word class and morph stands for the morpho-lexical information. The given input is divided into lemma, word class and morpho-lexical information using the SPLIT function. The lemma (root word), which is in Unicode form, is romanized using the ROMAN function; roman_lemma represents the romanized lemma. parnum represents the paradigm number of the lemma, which the PARNUM function identifies using the end suffixes. The romanized lemma and the paradigm number are given as input to the STEM function along with the word class; this function finds the stem of the root word. The given morpho-lexical information is matched against the morpho-lexical information list and the corresponding index number is retrieved; this index number is referred to as the row-index. The paradigm number of the input lemma serves as the col-index. Using the row and column indices, the suffix part is retrieved from the suffix table. The stem and the retrieved suffix are joined to generate the word form, which is then converted to Tamil Unicode form.
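A minimal Python sketch of this lookup core is given below, with toy data standing in for the real suffix table, stemming rules and paradigm identification (the actual system covers 32 verb and 25 noun paradigms and hundreds of MLI rows):

SUFFIX_TABLE = {("V", 25): {"PAST_3SM": "ththAn", "FT_3SM": "ppAn"}}
STEM_RULES = {25: ""}   # paradigm 25 has end character '*': remove nothing

def paradigm_of(roman_lemma, wc):
    # stand-in for PARNUM; the real function matches end suffixes
    return 25

def generate(roman_lemma, wc, mli):
    parnum = paradigm_of(roman_lemma, wc)
    cut = STEM_RULES[parnum]
    stem = roman_lemma[:-len(cut)] if cut else roman_lemma
    return stem + SUFFIX_TABLE[(wc, parnum)][mli]

print(generate("padi", "V", "PAST_3SM"))   # padiththAn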

Figure 8.3 Architecture of Tamil Morphological Generator

Figure 8.3 shows the architectural view of the Tamil morphological generator system. The system needs to process three major components: first the lemma, then the word class, and finally the morpho-lexical information. The way the generator is implemented makes it distinct from other generator systems. The input, which is in Tamil Unicode form, is first romanized, and then the paradigm number is identified using the end characters; the romanized form is used for easy computation and efficient processing. The morpho-lexical information of the required word form is given as input, and its index number is identified from the morpho-lexical information list; this is referred to as the row-index. Based on the specified word class, the system uses the corresponding suffix table. In the two-dimensional suffix table, the rows are morpho-lexical information indices and the columns are paradigm numbers; for each paradigm, a complete set of morphological inflections corresponding to the morpho-lexical information list is created. Finally, using the column index and row index, the morphological suffix is retrieved from the suffix table and affixed to the stem to generate the word form.

8.3.2 Word-forms Handled in the Morphological Generator

The Tamil verb morphological generator is capable of generating more than ten thousand forms of a single Tamil verb; some of these word-forms are shown in Table 8.3, and the remaining word forms are listed in Appendix-B. The noun morphological generator system handles nearly three hundred word forms, including postpositions; these noun word-forms are also given in Appendix-B.

Table 8.3 Verb and Noun Word-forms

Verb Word forms         Noun Word forms
படி.                    நரம்பு.
படித்தான்.              நரம்பை.
படித்தாள்.              நரம்பினை.
படித்தார்.              நரம்பினத்தை.
படித்தார்கள்.           நரம்போடு.
படித்தது.               நரம்பினோடு.
படித்தன.                நரம்பினத்தோடு.
படித்தாய்.              நரம்பினால்.
படித்தீர்.              நரம்பால்.
படித்தீர்கள்.           நரம்புக்கு.
படித்தேன்.              நரம்பிற்கு.
படித்தோம்.              நரம்பின்.
படிக்கிறான்.            நரம்பது.
படிக்கிறாள்.            நரம்பினது.
படிக்கிறார்.            நரம்பின்கண்.
படிக்கிறார்கள்.         நரம்பதுகண்.
படிக்கின்றது.           நரம்புக்காக.
படிக்கின்றன.            நரம்பாலான.
படிக்கின்றாய்.          நரம்புடைய.
படிக்கின்றீர்.          நரம்பினுடைய.
படிக்கின்றீர்கள்.       நரம்பில்.
படிக்கின்றீர்கள்.       நரம்பினில்.

8.3.3 Data Required for the Algorithm

The Tamil morphological generator for nouns and verbs requires the following resources for generating a word form:

i. Morpho-Lexical Information (MLI)
ii. Paradigm classification rules
iii. Suffix table
iv. Stemmer

8.3.3.1 Morpho-Lexical Information File

The morphological features of a word form are considered as morpho-lexical information; for example, the morpho-lexical information of the word படித்தான் (padiththAn) is PAST_3SM. The Morpho-Lexical Information (MLI) file is the collection of possible morphological patterns of a particular word class; nouns and verbs need separate MLI files. An example of the MLI file is shown in Table 8.4, and all the patterns in the file are given in Appendix-B. According to the different MLI patterns, a suffix table is created.

Table 8.4 MLI File for Tamil Verbs

PT+3SM          PRT+1P
PT+3SF          FT+3SM
PT+3SE          FT+3SF
PT+3SE+PL       FT+3SE
PT+3SN          FT+3SE+PL
PT+NOM_athu     FT+3SN
PT+RP+3SN       FT+RP+3SN
PT+3PN          FT+NOM_athu
PT+RP+3PN       FT+3PN
PT+NOM_ana      FT+RP+3PN
PT+2S           FT+NOM_ana
PT+2EH          FT+2S
PT+2EH+PL       FT+2EH
PT+1S           FT+2EH+PL
PT+1P           FT+1S
PRT+3SM         FT+1P
PRT+3SF         FT_3SN
PRT+3SE         RP_UM
PRT+3SE+PL      PT+RP
PRT+3SN         PRT+RP
PRT+RP+3SN      NM+RP
PRT+NOM_athu    PT+RP+3SM
PRT+3PN         PT+RP+3SF
PRT+RP+3PN      PT+RP+3SE
PRT+NOM_ana     PRT+RP+3SM
PRT+2S          PRT+RP+3SF
PRT+2EH         PRT+RP+3SE
PRT+2EH+PL      PRT+3SN
PRT+1S          PRT+3SN

 

 

8.3.3.2 Paradigm Classification Rules

This section explains how the paradigm is classified based on the end characters. Generally, a paradigm is identified using a lookup table. Initially, the root word is romanized using a Tamil Unicode-to-roman mapping file. This romanized form is compared with the end suffixes in the paradigm classification file; if an end suffix matches the end characters of the root word, the paradigm number is identified. The end suffixes are created based on the paradigms and sorted according to their character length. Figure 8.4 shows the pseudo code for paradigm classification. A lookup table is used only for paradigms 25 (படி, padi) and 26 (நட, wada), as the end suffixes cannot be generalized for these two paradigms. The end suffixes with their corresponding paradigm numbers are given in Table 8.5. For example, the verb பயில் (payil) matches the end suffix 'ல்' (il); therefore its first paradigm is 11 and there is no possibility of a second paradigm (N). In some cases, a word may have two paradigms: for example, the words படி (padi) and தீர் (thIr) each have two paradigms. Because of the word's sense, padi has two paradigms, and the intransitive form makes the word thIr fall under two paradigms. Examples of the various word forms are given below:

padi (படி) ('read'): படித்தான், படிந்தான், படிக்க, படிய, படித்து, படிந்து

thIr (தீர்) ('finish'): தீர்த்தான், தீர்ந்தான், தீர, தீர்க்க, தீர்ந்து, தீர்த்து

 

Root word is romanized
For all end suffixes:
    If the end suffix matches the end of the root word, then
        the paradigm number is identified
    End if
End for

Figure 8.4 Pseudo Code for Paradigm Classification

Table 8.5 shows the lookup table for Tamil verb paradigms and end suffixes. For instance, the first paradigm is செய் (sey); some other words that fall under this paradigm are பெய் (pey), மெய் (mey) and உய் (uy), so the generalized end suffixes are எய் (ey), ஏய் (Ey) and உய் (uy).

Table 8.5 Look-up Table for Paradigm Classification (End Suffix : 1st PNum / 2nd PNum; N = no second paradigm)

A : 10, O : 17, ey : 1, Ey : 1, oy : 1, uy : 1, ozu : 2, uzu : 2, azu : 2, idu : 4, sAku : 3, wOku : 9, padu : 25/4, wadu : 4, kedu : 25/4, sudu : 4, odu : 4, pOdu : 4, eRu : 5, uRu : 5, RRu : 30/15, Ru : 29, Ez : 6, Az : 6, iz : 6, az : 6, Uz : 6, Ay : 6, Iy : 6, vizu : 7, izu : 25, ezu : 7, i : 8, ai : 8, akal : 11, kal : 18, al : 11/18, AL : 12, aL : 12, uL : 12, kEL : 19, sol : 16, ol : 13, el : 13, oL : 14, Ul : 18, Ol : 18, vil : 18, El : 19, UL : 19, IL : 19/12, UN : 20, uN : 21, AN : 22, in : 23, wil : 24, il : 11, en : 27, In : 28, Aku : 31, puku : 32, Nuku : 15, uku : 32, iku : 32, u : 15, r : 6
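The longest-suffix-first matching can be sketched in Python as follows (a toy subset of Table 8.5 only; the full system uses all end suffixes plus the lookup table for paradigms 25 and 26):

END_SUFFIXES = {"ey": 1, "azu": 2, "idu": 4, "i": 8, "il": 11, "wil": 24, "u": 15}

def parnum(roman_lemma):
    # try longer suffixes first, mirroring the sort by character length
    for suffix in sorted(END_SUFFIXES, key=len, reverse=True):
        if roman_lemma.endswith(suffix):
            return END_SUFFIXES[suffix]
    return None

print(parnum("payil"))   # 11, as in the payil example above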

8.3.3.3 Suffix Table

The suffix table is the most essential resource for this algorithm. It is a simple two-dimensional (2D) table in which each row corresponds to a morpho-lexical information entry and each column corresponds to a paradigm number. Nouns and verbs each have their own suffix tables. The noun suffix table contains 325 rows (word-forms) and 25 columns (paradigms). The verb generator uses two suffix tables: the first is built for forms without auxiliaries and the second for forms with auxiliaries; the first contains 164 rows and the second 67 rows, and both contain 32 columns (paradigms). Table 8.6 shows the numbers of paradigms and inflections handled for verbs and nouns; Total represents the total number of inflections handled by the generator system.

Table 8.6 Paradigms and Inflections

Word forms   Paradigms   Inflections   Auxiliaries   Postpositions   Total
Verb         32          164           67            --              10988
Noun         25          30            --            290             320

For every paradigm, a word is selected and termed the head word, and for this head word all the inflected word forms are created. For example, for the verb "pAdu", the inflected word forms include pAdinAn, pAdinAL and pAddiyathu. A morpho-lexical information list is also created for all the word forms. Using these word-forms, a 2D table is created in which each column corresponds to a paradigm and each row represents a morpho-lexical information entry. The word-forms of every paradigm are romanized and put in the table. The stem of each head word is identified and removed from its word-forms (i.e., the common term is removed from all word forms), so that only the remaining portion is left in the table; this remaining portion is called the suffix, and the table is accordingly called the suffix table. Table 8.7 illustrates a sample suffix table for Tamil verbs: the rows (MLI-1, MLI-2, …) specify the morpho-lexical inflections and the columns (P-1, P-2, …) indicate paradigm numbers.

Table 8.7 Suffix Table

8.3.3.4 Stemming Rules

Stemming is the process of reducing a word-form to its stem. The stem need not be identical to the morphological root of the word; it is an important process in search engines and information retrieval. Table 8.8 shows the characters identified for stemming according to paradigm: in stemming, these characters are removed from the root word. For example, the stemming character for the second paradigm ("azu") is 'zu', so the end characters 'zu' will be deleted from words which fall in the second paradigm. If a paradigm number has the end character "*", no character should be removed from the word.

Table 8.8 Stemming End Characters

Paradigm Number : End Characters
1 : *      9 : wOku   17 : *     25 : *
2 : Zu     10 : A     18 : L     26 : *
3 : sAku   11 : L     19 : L     27 : *
4 : Du     12 : L     20 : *     28 : *
5 : Ru     13 : L     21 : *     29 : U
6 : wOku   14 : L     22 : kAN   30 : Ru
7 : Zu     15 : U     23 : *     31 : U
8 : *      16 : L     24 : L     32 : U

8.4 MORPHOLOGICAL GENERATOR FOR TAMIL PRONOUNS

Tamil pronouns are essential, and their structures are used in everyday conversation. They include personal pronouns (referring to the persons speaking, the persons spoken to, or the persons or things spoken about), indefinite pronouns, relative pronouns (connecting parts of sentences) and reciprocal or reflexive pronouns (in which the object of a verb is acted on by the verb's subject). Personal, indefinite, relative and reciprocal or reflexive pronouns all play an important role in the Tamil language, and therefore need special attention in generation as well as in analysis. Tamil pronouns come under the closed word-class category, so the morphological generator for Tamil pronouns is developed separately using a pattern-matching technique; the primary advantage of the pattern-matching method is that it performs well for closed-class words. The reverse process of the pronoun morphological analyzer is used in generating the pronoun surface word, so the same data used in the pronoun morphological analyzer are used in generation too. The pronoun word-form structure is shown in Figure 8.5, and an example of the pronoun structure is described below.

Figure 8.5 Structure of Pronoun Word-form (pronoun root, followed by optional clitic, case, clitic, postposition (PP) and clitic slots)

Examples for the structure of the pronoun are inflected forms of அவன் ('he') with case suffixes, postpositions and clitics.

A pattern-matching approach is followed for generating the pronoun word form. Figure 8.6 shows the architecture of pronoun morphological generation.

Figure 8.6 Pronoun Morphological Generation (pronoun root + MLI → root word romanization → suffix database lookup → pronoun word form)

The input to this morphological generator is the pronoun root word and the morphological information. The pronoun root word is first romanized and then matched against the suffix database that is used in morphological analysis. Four types of suffix lookup tables are used in this system; each table is created based on one of the levels in the pronoun structure.

8.5 SUMMARY

Post-processing generates a Tamil sentence from the factored output of the SMT system. Each unit in the factored output contains a root word, word class and morphological information. A morphological generator is used as the post-processing component for word generation: it generates a word-form from a lemma, a word-class tag and a morpho-lexical description. It is needed for various applications in Natural Language Processing and also acts as a post-processing component for other NLP applications. The structure of the main sentence in Tamil is Subject-Object-Verb (SOV); in this SOV order, the verb agrees with the subject in gender and number, and the morphological generator fulfils this agreement while generating a word-form. Paradigms and suffixes are used to generate Tamil verbs and nouns, while a pattern-matching-based algorithm is used for generating pronoun word-forms. Finally, these generators are integrated to generate a complete Tamil sentence. The important features of the developed morphological generator system are given below:

• Automatic paradigm identification.
• Uses a very small suffix data set to generate more than ten thousand word forms.
• Simple and efficient method.
• Handles compound words and proper nouns.
• Handles transitive and intransitive forms.
• No explicit morpho-phonemic rules.
• No verb/noun dictionary for paradigm identification.
• Easily updatable; errors can be corrected without difficulty.
• Applicable to any morphologically rich language.

CHAPTER 9

EXPERIMENTS AND RESULTS

9.1 GENERAL

This chapter explains the experiments and results of the English to Tamil statistical machine translation system. The experiments include the installation of the SMT toolkit and the training and testing regimes. This machine translation system is an integration of various modules, so its accuracy depends on each module in the system. Roughly, the translation system is divided into three modules: pre-processing, translation and post-processing; the results and the errors therefore depend on the pre-processing stages of the English sentence, the factored SMT system, and the post-processing. In pre-processing, English sentences are transformed using the reordering and compounding phases; pre-processing uses rules for reordering and compounding together with an English parser, and its accuracy depends on the parser's output and the rules developed. The next stage is the factored SMT system, whose output depends on the size and quality of the corpora; the parameter tuning and decoding steps also play a major role in producing the SMT output. Finally, the output is given to the post-processing stage, where the Tamil morphological generator is utilized; therefore the accuracy of the morphological generator decides the precision of the post-processing stage. Sample outputs of the three modules are shown in the appendix. Finally, the output of the machine translation system is evaluated using the BLEU and NIST metrics, and different models are developed to compare the results of the developed translation engine.

9.2 EXPERIMENTAL SETUP AND RESULTS

This section describes the experimental setup and the data used in the English to Tamil statistical machine translation system. The training data consist of approximately 16K English-Tamil parallel sentences; the health-domain English-Tamil parallel corpora of the EILMT project (English to Indian Language Machine Translation, a project funded by DIT) are used in the experiments. The training set is built with 13,500 parallel sentences, a test set is constructed with 1,532 sentences, and 1,000 parallel sentences are used for tuning the system. For the language model, 90k Tamil sentences are used.

The average word lengths of the sentences in the baseline and factored parallel corpora used in these experiments are shown in Tables 9.1 and 9.2.

Table 9.1 Details of Baseline Parallel Corpora

            Total Sentences   Average Word Length (English)   Average Word Length (Tamil)
Training    13500             22.047                          17.145
Tuning      1000              21.8                            15.334
Testing     1532              22.169                          -

Table 9.2 Details of Factored Parallel Corpora

            Total Sentences   Average Word Length (English)   Average Word Length (Tamil)
Training    13500             17.955                          17.145
Tuning      1000              17.674                          15.334
Testing     1532              18.039                          -

Nine different types of models are trained, tuned and tested with the help of the parallel corpora. The general categories of the models are baseline and factored systems. The detailed models are:

1. Baseline (BL)
2. Baseline with Automatic Reordering (BL+AR)
3. Baseline with Rule-based Reordering (BL+RR)
4. Factored system + Morph-Generator (Fact)
5. Factored system + Auto Reordering + Morph-Generator (Fact+AR)
6. Factored system + Rule-based Reordering + Morph-Generator (Fact+RR)
7. Factored system + Compounding + Morph-Generator (Fact+Comp)
8. Factored system + Auto Reordering + Compounding + Morph-Generator (Fact+AR+Comp)
9. Factored system + Rule-based Reordering + Compounding + Morph-Generator (Fact+RR+Comp)

For the baseline system, a standard phrase-based system is built using the surface forms of the words, without any additional linguistic knowledge and with a 4-gram LM in the decoder; the cleaned raw parallel corpus is used for training this system. A lexicalized reordering model (msd-bidirectional-fe) is used in the baseline with the automatic reordering model, and another baseline system is built with rule-based reordering. In all the developed factored models, the Tamil morphological generator is used in the post-processing stage. Instead of using only the surface form of the word, the root, part of speech and morphological information are included as additional factors, and a factored parallel corpus is used for training. English factorization is done using the Stanford Parser tool, while for Tamil the POS tagger and morphological analyzers are used to factor the sentences. In this factored model system, a token/word is represented with four factors as Surface|Root|Wordclass|Morphology, where the Morphology factor contains morphological information and function words on the English side, and morphological tags on the Tamil side. In the factored model with rule-based reordering and compounding (Fact+RR+Comp), the English words are factored and reordered, and in addition compounding is performed on the English side.

All the developed models are evaluated with the same test set, which contains 1,532 English sentences. The well-known machine translation metrics BLEU [144] and NIST [145] are used to evaluate the developed models. In addition, the existing "Google Translate" online English-Tamil machine translation system is also evaluated for comparison with the developed models. The results, in terms of BLEU-1, BLEU-4 and NIST scores, are shown in Table 9.3. In Figures 9.1 and 9.2, the X axis represents the various machine translation models and the Y axis denotes the BLEU-1 and BLEU-4 scores; Figure 9.3 shows the NIST scores of the developed models. The graphs in the figures clearly show that the proposed system (Fact+RR+Comp) improves the BLEU and NIST scores compared to the other developed models and the "Google Translate" system. The Google translation system's output is shown in Figure 9.4: both sentences fail to produce noun-verb agreement, case markers are not identified in the second sentence, and a grammatically correct output is not available among the alternate translations either. A detailed output comparison is shown in Appendix-B. The F-SMT-based system developed in this thesis handles noun-verb agreement as well; this is an important and challenging job when translating into morphologically rich languages like Tamil.
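For reference, BLEU scores of the kind reported below can be computed, for example, with NLTK (a hedged illustration only, not the evaluation scripts used in this work; the sample sentences are placeholders):

from nltk.translate.bleu_score import corpus_bleu

references = [[["அவன்", "இப்போது", "பள்ளிக்கு", "சென்றான்"]]]   # one reference list per hypothesis
hypotheses = [["அவன்", "இப்போது", "பள்ளிக்கு", "சென்றான்"]]

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu1, bleu4)   # 1.0 1.0 for this identical pair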

Table 9.3 BLEU and NIST Scores

Models                BLEU-1   BLEU-4   NIST
BL                    0.2924   0.0368   2.7221
BL+AR                 0.2929   0.0403   2.7488
BL+RR                 0.2594   0.0258   2.4148
Fact+Mgen             0.6406   0.3722   3.9831
Fact+AR+Mgen          0.6405   0.3725   3.9876
Fact+RR+Mgen          0.6285   0.3653   3.3887
Fact+Comp             0.6239   0.3573   3.9626
Fact+AR+Comp+Mgen     0.6237   0.3577   3.9673
Fact+RR+Comp+Mgen     0.6753   0.3894   4.2667
Google Translate      0.3105   0.3125   2.9526

Figure 9.1 BLEU-1 Scores for Various Models

Figure 9.2 BLEU-4 Scores for Various Models

Figure 9.3 NIST Scores for Various Models

Figure 9.4 Google Translation System¹

9.3 SUMMARY

This chapter described the results used to test the effectiveness of the English to Tamil machine translation system. The results of the evaluation clearly confirm that the new techniques proposed in this thesis are significant: the pre-processing and post-processing allow the developed system to achieve a relative improvement in the BLEU and NIST scores. Furthermore, the Tamil linguistic tools that form modules of the translation system were also implemented in this research, and the pre-processing techniques developed in this work help to increase the translation quality. The BLEU and NIST evaluation scores clearly show that the factored model with integrated linguistic knowledge gives better results for the English to Tamil statistical machine translation system.

¹ Google Translate output tested on 31-07-2012.


CHAPTER 10

CONCLUSION AND FUTURE WORK

In this chapter, the main contributions and most significant achievements of this research work are summarized. The conclusion, which follows the summary, attempts to highlight the research contributions in the field of Tamil language processing. At the same time, the limitations and future scope of the developed systems are also mentioned, so that researchers who are interested in extending any of this work can easily explore the possibilities.

10.1 SUMMARY

Due to the multilingual nature of the present information society, human language processing and machine translation have become essential. Several linguistic features, both morphological and syntactic, make translation a truly challenging problem, and the importance of machine translation will grow as the need for translating knowledge resources from one language to another increases. Machine translation is a particularly challenging task for languages which differ in morphological structure and word order. In many applications, only small amounts of bilingual corpora are available for the desired domain and language pair. Using linguistic knowledge in SMT can reduce the need for massive amounts of data by raising the level of generalization, thereby providing a basis for more efficient data exploitation; this is especially desirable for language pairs, like English and Tamil, for which massive parallel corpora are not available. For training an SMT system, both monolingual corpora and bilingual sentence-aligned parallel corpora of significant size are essential.

This thesis presented novel methods for incorporating linguistic knowledge in SMT to achieve an enhancement in the English to Tamil machine translation system. Most of the techniques presented in this thesis can be applied directly to other language pairs, especially for translating from a morphologically simple language to a morphologically rich language. The precision of the translation system depends on the performance of each module and linguistic tool used in the system. The experimental results clearly demonstrate that the new techniques proposed in this thesis are significant. Four different machine translation models were experimented with, and their BLEU and NIST scores were compared. The developed model (factored SMT with pre- and post-processing) reported a 4.2667 NIST score for English to Tamil translation. Adding pre- and post-processing to the factored SMT provided about a 0.38 BLEU-1 improvement over the word-based baseline system and about a 0.03 improvement over the factored baseline system. Finally, this score was compared with the "Google Translate" online machine translation system: the developed model reported a 1.3 improvement in NIST score and about a 0.36 improvement in BLEU-1 score. The improvements in the BLEU and NIST evaluation scores show that the proposed approach is appropriate for an English to Tamil machine translation system.

10.2 SUMMARY OF WORK DONE

In this section, the research work done is summarized with reference to the objectives of the proposed work. This thesis highlights five developments in the field of language processing, listed below.

• A pre-processing module (reordering, compounding and factorization) for English sentences was developed to transform the sentence structure into one more similar to that of Tamil. The compounding step in pre-processing is a novel method specific to the English-to-Tamil machine translation system.



• A Tamil POS tagger was developed using an SVM-based machine learning tool. The major challenge in developing a statistical POS tagger for Indian languages is the unavailability of annotated (tagged) corpora. The developed Tamil POS tagger was trained on 5 lakh POS-annotated words; this tagged corpus, built as part of this research, is a major resource for Tamil language processing.



• A morphological analyzer tool was built for Tamil using a machine learning approach, with the morphological analysis problem redefined as a classification problem. An SVM-based tool was used to train the system on 6 lakh morphologically tagged verbs and nouns; this tagged corpus was also developed as part of this research work. These morphologically segmented and tagged words are a significant resource for analyzing Tamil word forms, and the same methodology has been successfully implemented for other Dravidian languages such as Malayalam, Telugu and Kannada.

• A morphological generator was developed for Tamil using a new suffix-based algorithm, and a post-processing module was developed to tackle the challenges in agreement and word-form generation. The Tamil morphological generator tool is used in post-processing to assist surface word generation; the algorithm is capable of automatically generating more than ten thousand word forms of a single Tamil verb.



• An English-to-Tamil factored statistical machine translation system was developed by integrating the different modules and linguistic tools (a sketch of the factored token format it consumes follows this list). The scarcity of training data increases the advantage of using word-level information in more linguistically motivated models. Mapping translation factors in the factored model aids the disambiguation of source words and improves the grammar of the target word factors. It is shown that the developed system improves translation accuracy over a baseline system and over other factored machine translation models.
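As referenced in the last item above, the factored representation consumed by Moses attaches the extra factors to each surface word with a | separator. The sketch below illustrates producing this format; pos_tag() and analyze() are placeholders standing in for the POS tagger and morphological analyzer developed in this work, and the dummy tools at the end exist only to make the example runnable.

# factorize.py -- illustrative sketch of the Moses factored-corpus format
def factorize(tokens, pos_tag, analyze):
    """Return Moses-style factored tokens: surface|lemma|pos|morph."""
    factored = []
    for word, pos in zip(tokens, pos_tag(tokens)):
        lemma, morph = analyze(word)   # e.g. ('paRavai', 'PL+ACC')
        factored.append('%s|%s|%s|%s' % (word, lemma, pos, morph))
    return ' '.join(factored)

# dummy stand-ins so the sketch runs on its own
toy_tagger = lambda toks: ['NN'] * len(toks)
toy_analyzer = lambda w: (w, 'NULL')
print(factorize(['paRavaikaLai'], toy_tagger, toy_analyzer))
# -> paRavaikaLai|paRavaikaLai|NN|NULL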

The major outcomes of this research are mapped to the publications that resulted from this work, as shown in Table 10.1.

Table 10.1 Mapping of Major Research Outcomes to Publications

Development of Tamil POS Tagger:
  SVMTool based Part-of-Speech Tagger and Chunker for Tamil Language

Development of Morphological Analyzer for Tamil language:
  A Sequence Labelling Approach to Morphological Analyzer for Tamil
  Morphological Analyzer for Agglutinative Languages Using Machine Learning Approaches

Development of Morphological Generator for Tamil language:
  A Novel Data Driven Algorithm for Tamil Morphological Generator (IJCA)
  A Novel Algorithm for Tamil Morphological Generator (ICON)

Development of English to Tamil Statistical Machine Translation system:
  Morphology based Factored Statistical Machine Translation System from English to Tamil
  Factored Statistical Machine Translation System for English to Tamil using Tamil Linguistic Tools (Accepted for Publication)

10.3 CONCLUSION

The major achievement of this research is the development of a factored statistical machine translation system for English to Tamil that integrates linguistic tools. Linguistic tools such as the POS tagger and morphological analyzer were also developed as part of this research work; developing these tools is a challenging and demanding task, especially for a highly agglutinative language like Tamil. The performance of statistical and machine learning methods depends mainly on the size and correctness of the corpus: if the corpus contains all types of surface word forms, word categories and sentence structures, a learning algorithm can extract all the required features. Pre-processing systems were automated for creating factored parallel and monolingual corpora, which are an essential resource for developing a factored machine translation system from English to Tamil. The proposed work incorporates more lexical information of the Tamil language and generates language processing models that solve the problem more effectively. The methods presented in this thesis, such as the machine-learning-based morphological analyzer, the suffix-based morphological generator and compounding in the English pre-processing, are novel methods in language processing research.
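To make the suffix-based generation idea mentioned above concrete, the toy sketch below concatenates a verb root with a suffix selected by a morpho-lexical (MLI) label of the kind listed in Appendix B.3. The three-entry suffix table is a hypothetical fragment for illustration only; the actual generator developed in this thesis covers more than ten thousand forms per verb and handles the required sandhi changes.

# generate.py -- toy illustration of suffix-based generation
SUFFIX_TABLE = {            # MLI label -> Roman-transliterated suffix (hypothetical fragment)
    'PT+3SM': 'ththAn',     # past tense, third singular masculine
    'PT+3SF': 'ththAL',     # past tense, third singular feminine
    'FT+3SM': 'ppAn',       # future tense, third singular masculine
}

def generate(verb_root, mli_label):
    """Concatenate the root with the suffix selected by the MLI label."""
    return verb_root + SUFFIX_TABLE[mli_label]

print(generate('padi', 'PT+3SM'))   # -> padiththAn ('he read')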


The Tamil linguistic tools can be used in the future to implement machine translation systems from Tamil to other languages, especially Dravidian languages like Telugu, Malayalam and Kannada. The developed linguistic tools and annotated corpora can also be used in other language processing tasks such as information retrieval and extraction, speech processing, question answering and word sense disambiguation. Finally, it is concluded that morphologically rich languages need extensive morphological pre-processing before SMT training, to make the source language structurally similar to the target language, and efficient post-processing in order to generate the surface words correctly.

10.4 FUTURE DIRECTIONS

This thesis addressed techniques for improving the quality of machine translation by separating the root and morphological information of surface words. The main limitation of the approach presented here is that it is not directly applicable in the reverse direction (Tamil to English). The developed computational linguistic tools and MT systems are domain specific and scalable, so researchers interested in extending this work can easily explore the possibilities. There are a number of possible directions for future work based on the findings in this thesis; some of them are given below.

• Increasing the size of the parallel corpus always helps to improve the accuracy of the system. Adding different sentence structures and handling idioms and phrases externally would also improve the system.



• The tools and methodologies developed here were used for translation from English to Tamil. It would be interesting to apply similar methods to translating English into other morphologically rich languages.



• The developed tools and methodologies can also be used to build translation systems from other languages into Tamil.




• The reordering and compounding rules suggested in this thesis are relatively simple and do not perform well on long sentences. The handcrafted rules could be replaced with automatically learned rules.



• Applying advanced SMT models such as syntax-based models, hierarchical models and hybrid approaches would improve the system.



• A future advancement of this system lies in integrating it with speech recognition systems. Additionally, the system could be ported to a mobile environment for translating simple sentences.



• It would be useful to perform a thorough error analysis of the translation output; such an analysis would guide future improvements.


APPENDIX-A

A.1 TAMIL TRANSLITERATION

[Table A.1 gives the complete Tamil-to-Roman transliteration scheme used throughout this thesis, listing every Tamil vowel, consonant and consonant-vowel combination with its Roman equivalent. The extracted text of this four-page table is too damaged to reproduce row by row; the recoverable core of the scheme is summarized below.]

Vowel signs: a, A, i, I, u, U, e, E, ai, o, O, au; aytham: q
Pure consonants: க் k, ங் ng, ஞ் nj, ச் s, ட் d, ண் N, த் th, ந் w, ப் p, ம் m, ய் y, ர் r, ல் l, வ் v, ழ் z, ள் L, ற் R, ன் n
Grantha consonants: ஜ் j, ஷ் sh, ஸ் S, ஹ் h, க்ஷ x; the ligature ஸ்ரீ is transliterated sri
Consonant-vowel combinations concatenate the two symbols, e.g. கா kA, கி ki, கீ kI, கு ku, கூ kU, ெக ke, ேக kE, ைக kai, ெகா ko, ேகா kO, ெகௗ kau.

A.2 DETAILS OF AMRITA POS TAGS

The major POS tags are noun, verb, adjective and adverb. The noun tag is further classified into 10 tag categories and the verb tag into 7; the rest are adjective, adverb and other tags.

Noun Tags
Nouns are words which denote a person, place, thing, time, etc. In Tamil, nouns are inflected for number and case at the morphological level; at the phonological level, four types of suffixes can occur with a noun stem.

Noun (+ number) (+ case)
Example: pUk-kaL-ai
  Flower-plural-accusative case suffix

Noun (+ number) (+ oblique) (+ euphonic) (+ case)
Example: pUk-kaL-in-Al
  Flower-plural-euphonic suffix-instrumental case suffix
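A longest-match suffix-stripping sketch over the two examples above (a toy illustration with a hypothetical suffix list, not the SVM-based analyzer used in this work):

# segment.py -- toy suffix stripping for the pUkkaLai / pUkkaLinAl examples
NOUN_SUFFIXES = [('Al', 'case:instrumental'), ('ai', 'case:accusative'),
                 ('in', 'euphonic'), ('kaL', 'plural')]

def segment(word):
    """Strip known suffixes right to left, longest match first."""
    morphs = []
    stripped = True
    while stripped:
        stripped = False
        for suffix, label in sorted(NOUN_SUFFIXES, key=lambda s: -len(s[0])):
            if word.endswith(suffix) and len(word) > len(suffix):
                morphs.insert(0, (suffix, label))
                word = word[:-len(suffix)]
                stripped = True
                break
    return word, morphs

print(segment('pUkkaLai'))    # -> ('pUk', [('kaL', 'plural'), ('ai', 'case:accusative')])
print(segment('pUkkaLinAl'))  # -> ('pUk', [('kaL', 'plural'), ('in', 'euphonic'), ('Al', 'case:instrumental')])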

 

As mentioned earlier, distinct tags based on grammatical information are avoided, since plurality and case suffixation can be obtained from a morphological analyzer. This brings the number of tags down and helps to achieve simplicity, consistency and better machine learning. Therefore, only two tags, common noun and common compound noun, are used, without further subdivision based on the grammatical information contained in the noun word.

Example for Common Nouns (NN):
  paRavai 'bird', paRavai-kaL 'birds', paRavai-kku 'to the bird', paRavai-yAl 'by the bird'
Example for Compound Nouns (NNC):
  UrAdsi thalaivar 'township leader', vanap pakuthi 'forest area'

Proper Noun Tag
Proper nouns denote a particular person, place or thing. Indian languages, unlike English, do not mark proper nouns in their orthographic convention: English proper nouns begin with a capital letter, which distinguishes them from common nouns, whereas most words which occur as proper nouns in Indian languages can also occur as common nouns. For example, in English, John, Harry and Mary occur only as proper nouns, while in Tamil, thAmarai, maNi, pissai, arasi, etc. are used both as proper nouns and as common nouns. Given below is a list of Tamil words, with their grammatical category and English glosses, which can occur in text as both common and proper nouns.

  thAmarai   noun   lotus
  maNi       noun   bell
  pissai     noun   beg
  arasi      noun   queen

Two tags have been used for proper nouns: proper noun and compound proper noun.
Example for Proper Nouns (NNP): raja wERRu vawthAn. "Raja came yesterday"
Example for Compound Proper Nouns (NNPC): apthul kalAm inRu cennai varukiRAr. "Today, Abdul Kalam is coming to Chennai"


Cardinal Tag
Any word denoting a cardinal number is marked with the cardinal tag.
Example for Cardinals: enakku 150 rupAy vENdum. "I need 150 rupees." mUnRu waparkaL angku amarwthirukkiRArkaL. "Three people were sitting there."

Ordinal Tag
Ordinals are an extension of the natural numbers, different from integers and from cardinals. In Tamil, ordinals are formed by adding the suffixes Am and Avathu. Expressions denoting ordinals are marked with the ordinal tag.
Example for Ordinals: muthalAm vakuppu "first class"; 12-Am wURRANdu "12th century"

Pronoun Tag
Pronouns are words that take the place of nouns; a pronoun is used in place of a noun to refer to the same entity. Linguistically, a pronoun is a variable which functions as a noun, and a separate tag for pronouns is helpful for anaphora resolution.
Example for Personal Pronouns (PRP): avan weRRu ingku vawthAn. "He came here yesterday"

Adjective Tag
Adjectives are noun modifiers. Modern Tamil has both simple and derived adjectives; derived adjectives are formed by adding the suffix Ana to the noun root. A dedicated tag is used for representing adjectives.
Example for Adjectives: iwtha walla paiyan "this nice boy"; oru azakAna peN "a beautiful girl"

Adverb Tag
Adverbs are words which tell more about verbs. Modern Tamil has both simple and derived adverbs; derived adverbs are formed by adding the suffixes Aka and Ay to the verb root. Temporal and spatial entities are also tagged as adverbs. A dedicated tag is used for representing adverbs.
Example for Adverbs: avan adikkadi vidumuRai eduththAn. "He took leave frequently"; kuthirai vEkam-Aka Odiyathu. "The horse ran fast"

Verb Tags
Verbs are the action words, which can take tense suffixes; person, number and gender suffixes; and a few other verbal suffixes. Tamil verb forms can be divided into finite and non-finite verbs.

Finite Verb Tag
Finite verbs, as the predicate of the main clause, occur at the end of the sentence.
Example for Finite Verb: avan paSsil vawthAn "he came by bus"

Non-finite Verb Tag
Tamil distinguishes four types of non-finite verb forms: verbal participle, adjectival participle, infinitive and conditional.
Example for verbal participle: wowthu poo "become vexed"
Example for adjectival participle: vawtha paiyan "the boy who came"
Example for infinitive: awtha maraththil idi viza "may thunder fall on the tree"
Example for conditional: wI wEraththOdu vawthAl "if you would come in time"

Nominalized Verb Tag
Nominalized verbal forms are verbal nouns, participle nouns and adjectival nouns; a single tag is used for all forms of nominalized verbs.
Example for Nominalized verb forms: seythal (doing), seyAthu (a neuter which is doing), seythavan (a man who did). Anaw enna seyvathu? "What shall (we) do, Anand?"

Auxiliary Verb Tag
Auxiliary verbs are verbs which lose their original syntactic and semantic properties when they collocate with main verbs; they signify various grammatical meanings auxiliary to the main verb in a sentence. A dedicated tag is used for denoting auxiliary verbs.
Example for auxiliary verb: sItha angkE iruwthAL "Seetha was there" (main verb); sItha angkE vawthirukkiRAL "Seetha has come there" (auxiliary verb)

Other Tags

Postposition Tag
Tamil has a few free forms which are referred to as postpositions; they are added after the case marker. For example, in kovil kiri vIddukku pakkaththil uLLathu "The temple is near Giri's house", the postposition pakkaththil 'near' occurs after the dative noun phrase vIddukku. Here pakkaththil is not considered a suffix: it is a free form, and because of its place of occurrence it is termed a postposition. Postpositions are historically derived from verbs; Schiffman (1999) describes various postpositions. Postpositions are conditioned by the case of the nouns they follow, and in Tamil some postpositions are simple while some are compound. A dedicated tag is used for representing postposition words.
Example for Postposition: avan wAyaip pOl kaththinAn. "He cried like a dog"

Conjunction Tag
Co-ordination (conjunction) in Tamil is mainly realized by certain noun forms, verb forms and clitics. A dedicated tag is used for representing conjunctions.
Example for Conjunction: ciRiyA AnAl walla pen. "a small but nice girl"

Determiner Tag
A determiner is a noun modifier that expresses the reference of a noun or noun phrase in context, rather than attributes expressed by adjectives; this function is usually performed by articles, demonstratives or possessive determiners. A dedicated tag is used for annotating determiners.
Example for determiners: awthath thittam. "that plan"

Complementizer Tag
A complementizer is a syntactic category roughly equivalent to the term subordinating conjunction in traditional grammar; for example, the word "that" is a complementizer in the English sentence "Mary believes that it is raining". A dedicated tag is used for tagging complementizers.
Example for complementizer: avan vawthAn enRu kELvippattEn. "I heard that he had come"

Emphasis Tag
Force or intensity of expression that gives impressiveness or importance to a word is termed emphasis. A dedicated tag is used for annotating emphasis.
Example for Emphasis: avan than sonnAn. "He only said"

Echo Word Tag
The tag <ECH> is used to denote echo words.
Example for Echo words (ECH): kAppi kIppi "coffee keeffee"

Reduplication Word Tag
Reduplication words are the same word written twice, for purposes such as indicating emphasis or deriving one category from another. A dedicated tag is used for tagging reduplication words.
Example for Reduplication words: palapala thiddam. "many plans"

Question Word and Question Marker
Separate tags are used for the question word and the question marker.
Example for question word and marker: avan vawthAnA? "Did he come?"

Symbol Tags
Only two symbols, the dot tag and the comma tag, are considered in the corpus. The dot tag marks sentence separation; the comma tag is used between multiple nouns and proper nouns.
Example for the dot tag: avan thAn sonnAn . "He only said"
Example for the comma tag: rithaniyA, tharunyAudan vawthaL. "Rethanya came with Dharunya"


APPENDIX-B

B.1 PENN TREEBANK POS TAGS

1   CC    Coordinating conjunction
2   CD    Cardinal number
3   DT    Determiner
4   EX    Existential there
5   FW    Foreign word
6   IN    Preposition or subordinating conjunction
7   JJ    Adjective
8   JJR   Adjective, comparative
9   JJS   Adjective, superlative
10  LS    List item marker
11  MD    Modal
12  NN    Noun, singular or mass
13  NNS   Noun, plural
14  NNP   Proper noun, singular
15  NNPS  Proper noun, plural
16  PDT   Predeterminer
17  POS   Possessive ending
18  PRP   Personal pronoun
19  PRP$  Possessive pronoun
20  RB    Adverb
21  RBR   Adverb, comparative
22  RBS   Adverb, superlative
23  RP    Particle
24  SYM   Symbol
25  TO    To
26  UH    Interjection
27  VB    Verb, base form
28  VBD   Verb, past tense
29  VBG   Verb, gerund or present participle
30  VBN   Verb, past participle
31  VBP   Verb, non-3rd person singular present
32  VBZ   Verb, 3rd person singular present
33  WDT   Wh-determiner
34  WP    Wh-pronoun
35  WP$   Possessive wh-pronoun
36  WRB   Wh-adverb

B.2 DEPENDENCY TAGS

dep: Dependent
aux: Auxiliary
auxpass: passive auxiliary
cop: Copula
conj: Conjunct
cc: Coordination
arg: Argument
subj: Subject
nsubj: nominal subject
nsubjpass: passive nominal subject
csubj: clausal subject
comp: Complement
obj: Object
dobj: direct object
iobj: indirect object
pobj: object of preposition
attr: Attributive
ccomp: clausal complement with internal subject
xcomp: clausal complement with external subject
compl: Complementizer
mark: marker (introducing an advcl)
rel: relative (introducing a rcmod)
acomp: adjectival complement
agent: Agent
xsubj: controlling subject
ref: Referent
expl: expletive (expletive there)
mod: Modifier
advcl: adverbial clause modifier
purpcl: purpose clause modifier
tmod: temporal modifier
rcmod: relative clause modifier
amod: adjectival modifier
infmod: infinitival modifier
partmod: participial modifier
num: numeric modifier
number: element of compound number
appos: appositional modifier
nn: noun compound modifier
abbrev: abbreviation modifier
advmod: adverbial modifier
neg: negation modifier
poss: possession modifier
possessive: possessive modifier ('s)
prt: phrasal verb particle
det: Determiner
prep: prepositional modifier
sdep: semantic dependent

B.3 TAMIL VERB MLI FILE Null  PT+3SM  PT+3SF  PT+3SE  PT+3SE+PL  PT+3SN  PT+NOM_athu  PT+RP+3SN  PT+3PN  PT+RP+3PN  PT+NOM_ana  PT+2S  PT+2EH  PT+2EH+PL  PT+1S  PT+1P  PRT+3SM  PRT+3SF  PRT+3SE  PRT+3SE+PL  PRT+3SN  PRT+RP+3SN  PRT+NOM_athu  PRT+3PN  PRT+RP+3PN  PRT+NOM_ana  PRT+2S  PRT+2EH  PRT+2EH+PL  PRT+1S  PRT+1P  FT+3SM  FT+3SF  FT+3SE  FT+3SE+PL  FT+3SN  FT+RP+3SN  FT+NOM_athu  FT+3PN  FT+RP+3PN  FT+NOM_ana  FT+2S  FT+2EH  FT+2EH+PL  FT+1S  FT+1P  FT_3SN  RP_UM 

PT+RP  PRT+RP NM+RP  PT+RP+3SM  PT+RP+3SF  PT+RP+3SE PRT+RP+3SM  PRT+RP+3SF  PRT+RP+3SE PRT+3SN  PRT+3SN  PRT+NOM_athu  FT+RP+3SM FT+RP+3SF  FT+RP+3SE  NM+RP+3SM  NM+RP+3SF NM+RP+3SE  NM+RP+NOM_athu  PT+RP+3PN_vai  NM+NOM_ana VP  INF  NM+3SN NM+VP NM+VP_aamal  NOM_thal  NOM_al NOM_kai  NM+NOM_mai  NOM_kkal+MOOD_aam  NOM_kkal+MOOD_aakum  NOM_kkal+NM+3SN  VP+PAAR_AUX+PT+3SM  VP+PAAR_AUX+PT+3SF  VP+PAAR_AUX+PT+3SE  VP+PAAR_AUX+PT+2S  VP+PAAR_AUX+PT+1P  VP+PAAR_AUX+PT+1S VP+PAAR_AUX+PRT+3SE  VP+PAAR_AUX+PRT+3SE+PL  VP+PAAR_AUX+PRT+3SF  VP+PAAR_AUX+PRT+3SM  VP+PAAR_AUX+PRT+1P  VP+PAAR_AUX+PRT+2S  VP+PAAR_AUX+PRT+1S  VP+PAAR_AUX+FT+3SE  VP+PAAR_AUX+FT+3SF  266

VP+PAAR_AUX+FT+3SM  VP+PAAR_AUX+FT+2S  VP+PAAR_AUX+FT+1S  VP+PAAR_AUX+FT+1P  VP+IRU_AUX+PT+3SE  VP+IRU_AUX+PT+3SF  VP+IRU_AUX+PT+3SM  VP+IRU_AUX+PT+2S  VP+IRU_AUX+PT+1S  VP+IRU_AUX+PT+1P  VP+IRU_AUX+PRT+3SE  VP+IRU_AUX+PRT+3SF  VP+IRU_AUX+PRT+3SM  VP+IRU_AUX+PRT+1S  VP+IRU_AUX+PRT+1P  VP+IRU_AUX+FT+3SE  VP+IRU_AUX+FT+3SF  VP+IRU_AUX+FT+3SM  VP+IRU_AUX+FT+2S  VP+IRU_AUX+FT+1S  VP+IRU_AUX+FT+1P  VP+KODU_AUX+PT+3SM  VP+KODU_AUX+PT+3SF  VP+KODU_AUX+PT+3SE  VP+KODU_AUX+PT+3SE  VP+KODU_AUX+PT+3SN  VP+KODU_AUX+PT+3PN  VP+KODU_AUX+PT+2S  VP+KODU_AUX+PT+2SE  VP+KODU_AUX+PT+2SE+PL  VP+KODU_AUX+PT+1S  VP+KODU_AUX+PT+1P  VP+KODU_AUX+PRT+3SM  VP+KODU_AUX+PRT+3SF  VP+KODU_AUX+PRT+3SE  VP+KODU_AUX+PRT+3SE+PL  VP+KODU_AUX+PRT+3SN  VP+KODU_AUX+PRT+3PN  VP+KODU_AUX+PRT+2S  VP+KODU_AUX+PRT+2EH  VP+KODU_AUX+PRT+2EH+PL  VP+KODU_AUX+PRT+1S  VP+KODU_AUX+PRT+1P  VP+KODU_AUX+FT+3SM  VP+KODU_AUX+FT+3SF  VP+KODU_AUX+FT+3SE  VP+KODU_AUX+FT+3SE+PL  VP+KODU_AUX+FT+3SN  VP+KODU_AUX+FT+3PN  VP+KODU_AUX+FT+2S 

VP+KODU_AUX+FT+2EH  VP+KODU_AUX+FT+2EH+PL  VP+KODU_AUX+FT+1S  VP+KODU_AUX+FT+1P  VP+POO_AUX+PT+3SM  VP+POO_AUX+PT+3SF  VP+POO_AUX+PT+3SE  VP+POO_AUX+PT+3SE+PL  VP+POO_AUX+PT+3SN VP+POO_AUX+PT+3PN  VP+POO_AUX+PT+2S  VP+POO_AUX+PT+2EH  VP+POO_AUX+PT+2EH+PL  VP+POO_AUX+PT+1S  VP+POO_AUX+PT+1P  VP+POO_AUX+PRT+3SM  VP+POO_AUX+PRT+3SF  VP+POO_AUX+PRT+3SE  VP+POO_AUX+PRT+3SE+PL  VP+POO_AUX+PRT+3SN  VP+POO_AUX+PRT+3PN  VP+POO_AUX+PRT+2S  VP+POO_AUX+PRT+2EH  VP+POO_AUX+PRT+2EH+PL  VP+POO_AUX+PRT+1S  VP+POO_AUX+PRT+1P  VP+POO_AUX+FT+3SM  VP+POO_AUX+FT+3SF VP+POO_AUX+FT+3SE  VP+POO_AUX+FT+3SE+PL  VP+POO_AUX+FT+3SN VP+POO_AUX+FT+3PN VP+POO_AUX+FT+2S  VP+POO_AUX+FT+2EH  VP+POO_AUX+FT+2EH+PL  VP+POO_AUX+FT+1S  VP+POO_AUX+FT+1P  VP+VIDU_AUX+PT+3SE  VP+VIDU_AUX+PT+3SF VP+VIDU_AUX+PT+3SM  VP+VIDU_AUX+PT+2S  VP+VIDU_AUX+PT+1P  VP+VIDU_AUX+PT+1S VP+VIDU_AUX+PRT+3SE  VP+VIDU_AUX+PRT+3SM  VP+VIDU_AUX+PRT+3SF  VP+VIDU_AUX+PRT+2S  VP+VIDU_AUX+PRT+1P  VP+VIDU_AUX+PRT+1S  VP+VIDU_AUX+FT+3SE 267

VP+VIDU_AUX+FT+3SF  VP+VIDU_AUX+FT+3SM  VP+VIDU_AUX+FT+2S  VP+VIDU_AUX+FT+1S  VP+VIDU_AUX+FT+1P  VP+KI_AUX+PT+3SE  VP+KI_AUX+PT+3SM  VP+KI_AUX+PT+3SF  VP+KI_AUX+PT+2S  VP+KI_AUX+PT+1P  VP+KI_AUX+PT+1S  VP+KI_AUX+PRT+3SE  VP+KI_AUX+PRT+3SF  VP+KI_AUX+PRT+3SM  VP+KI_AUX+PRT+2S  VP+KI_AUX+PRT+1P  VP+KI_AUX+PRT+1S  VP+KI_AUX+FT+3SE  VP+KI_AUX+FT+3SM  VP+KI_AUX+FT+3SF  VP+KI_AUX+FT+2S  VP+KI_AUX+FT+1P  VP+KI_AUX+FT+1S  VP+AUX_aayiRRu  VP+FT+POODU_AUX+PT+3SE  VP+FT+POODU_AUX+PT+3SM  VP+FT+POODU_AUX+PT+3SF  VP+FT+POODU_AUX+PT+2S  VP+FT+POODU_AUX+PT+1S  VP+FT+POODU_AUX+PT+1P  VP+FT+POODU_AUX+PRT+3SE  VP+FT+POODU_AUX+PRT+3SM VP+FT+POODU_AUX+PRT+3SF  VP+FT+POODU_AUX+PRT+2S  VP+FT+POODU_AUX+PRT+1S  VP+FT+POODU_AUX+PRT+1P  VP+FT+POODU_AUX+FT+3SE  VP+FT+POODU_AUX+FT+3SM  VP+FT+POODU_AUX+FT+3SF  VP+FT+POODU_AUX+FT+2S  VP+FT+POODU_AUX+FT+1P  VP+FT+POODU_AUX+FT+1S  VP+THOLAI_AUX+PT+3SE  VP+THOLAI_AUX+PT+3SM  VP+THOLAI_AUX+PT+3SF  VP+THOLAI_AUX+PT+2S  VP+THOLAI_AUX+PT+1P  VP+THOLAI_AUX+PT+3SM  VP+THOLAI_AUX+PRT+3SE  VP+THOLAI_AUX+PRT+3SF 

VP+THOLAI_AUX+PRT+3SM  VP+THOLAI_AUX+PRT+2S  VP+THOLAI_AUX+PRT+1P  VP+THOLAI_AUX+PRT+1S  VP+THOLAI_AUX+FT+3SE  VP+THOLAI_AUX+FT+3SM  VP+THOLAI_AUX+FT+3SF  VP+THOLAI_AUX+FT+2S  VP+THOLAI_AUX+FT+1S  VP+THOLAI_AUX+FT+1P  VP+THALLU_AUX+PT+3SE  VP+THALLU_AUX+PT+3SF  VP+THALLU_AUX+PT+3SM  VP+THALLU_AUX+PT+2S  VP+THALLU_AUX+PT+1P  VP+THALLU_AUX+PT+1S  VP+THALLU_AUX+FT+3SE  VP+THALLU_AUX+FT+3SM  VP+THALLU_AUX+FT+3SF  VP+THALLU_AUX+FT+2S  VP+THALLU_AUX+FT+1S  VP+THALLU_AUX+FT+1P  VP+KIZI_AUX+PT+3SE  VP+KIZI_AUX+PT+3SM VP+KIZI_AUX+PT+3SF  VP+KIZI_AUX+PT+2S  VP+KIZI_AUX+PT+1S  VP+KIZI_AUX+PT+1P VP+KIZI_AUX+FT+3SE  VP+KIZI_AUX+FT+3SM  VP+KIZI_AUX+PRT+3SF VP+KIZI_AUX+FT+2S VP+KIZI_AUX+FT+1S  VP+KIZI_AUX+FT+1P  VP+KIZI_AUX+PRT+3SE VP+KIZI_AUX+PRT+3SM  VP+KIZI_AUX+PRT+3SF  VP+KIZI_AUX+PRT+2S  VP+KIZI_AUX+PRT+1S VP+KIZI_AUX+PRT+1P  VP+KIZI_AUX+FT+3SE  VP+KIZI_AUX+FT+3SM  VP+KIZI_AUX+FT+3SF VP+KIZI_AUX+FT+2S  VP+KIZI_AUX+FT+1S  VP+KIZI_AUX+FT+1P VP+KIDA_AUX+PT+3SE VP+KIDA_AUX+PT+3SM  VP+KIDA_AUX+PT+3SF  VP+KIDA_AUX+PT+2S 268

VP+KIDA_AUX+PT+1S  VP+KIDA_AUX+PT+1P  VP+KIDA_AUX+PRT+3SE  VP+KIDA_AUX+PRT+3SM  VP+KIDA_AUX+PRT+3SF  VP+KIDA_AUX+PRT+2S  VP+KIDA_AUX+PRT+1S  VP+KIDA_AUX+PRT+1P  VP+KIDA_AUX+FT+3SE  VP+KIDA_AUX+FT+3SM  VP+KIDA_AUX+FT+3SF  VP+KIDA_AUX+FT+2S  VP+KIDA_AUX+FT+1S  VP+KIDA_AUX+FT+1P  VP+THEER_AUX+PT+3SE  VP+THEER_AUX+PT+3SM  VP+THEER_AUX+PT+3SF  VP+THEER_AUX+PT+2S  VP+THEER_AUX+PT+1P  VP+THEER_AUX+PT+1S  VP+THEER_AUX+PRT+3SE  VP+THEER_AUX+PRT+3SM  VP+THEER_AUX+PRT+3SF  VP+THEER_AUX+PRT+2S  VP+THEER_AUX+PRT+1P  VP+THEER_AUX+PRT+1S  VP+THEER_AUX+FT+3SE  VP+THEER_AUX+FT+3SM  VP+THEER_AUX+FT+3SF  VP+THEER_AUX+FT+2S  VP+THEER_AUX+FT+1P  VP+THEER_AUX+FT+1S  VP+MUDI_AUX+PT+3SE  VP+MUDI_AUX+PT+3SM  VP+MUDI_AUX+PT+3SF  VP+MUDI_AUX+PT+2S  VP+MUDI_AUX+PT+1P  VP+MUDI_AUX+PT+1S  VP+MUDI_AUX+PRT+3SE  VP+MUDI_AUX+PRT+3SM  VP+MUDI_AUX+PRT+3SF  VP+MUDI_AUX+PRT+2S  VP+MUDI_AUX+PRT+1P  VP+MUDI_AUX+PRT+1S  VP+MUDI_AUX+FT+3SE  VP+MUDI_AUX+FT+3SM  VP+MUDI_AUX+FT+3SF  VP+MUDI_AUX+FT+2S  VP+MUDI_AUX+FT+1S  VP+MUDI_AUX+FT+1P 

INF+AUX_attum INF+MOD_vendum  INF+MOD_vendam  INF+MOD_koodum  INF+MOD_koodathu INF+MAATTU_AUX+3SM  INF+MAATTU_AUX+3SF  INF+MAATTU_AUX+3SE  INF+MAATTU_AUX+1S INF+MAATTU_AUX+2S  INF+MAATTU_AUX+1P  INF+MOD_illai  INF+IYAL_AUX+RP_UM  INF+IYAL_AUX+FT_3SN  INF+IYAL_AUX+PT+3SN  INF+IYAL_AUX+PRT+3SN  INF+MUDI_AUX+PT+3SN  INF+MUDI_AUX+PRT+3SN  INF+MUDI_AUX+FT_3SN  INF+MUDI_AUX+RP_UM  INF+IRU_AUX+PT+3SM  INF+IRU_AUX+PT+3SF  INF+IRU_AUX+PT+3SE  INF+POO_AUX+PT+3SM  INF+POO_AUX+PT+3SF  INF+POO_AUX+PT+3SE  INF+VAA_AUX+PT+3SM  INF+VAA_AUX+PT+3SF INF+VAA_AUX+PT+3SE  INF+PAAR_AUX+PT+3SM  INF+PAAR_AUX+PT+3SF  INF+PAAR_AUX+PT+3SE  INF+VAI_AUX+PT+3SE  INF+VAI_AUX+PT+3SM  INF+VAI_AUX+PT+3SF INF+PANNU_AUX+PT+3SM  INF+PANNU_AUX+PT+3SF  INF+PANNU_AUX+PT+3SE  INF+SEY_AUX+PT+3SM  INF+SEY_AUX+PT+3SF  INF+SEY_AUX+PT+3SE  INF+PERU_AUX+PT+3SM  INF+PERU_AUX+PT+3SF  INF+PERU_AUX+PT+3SE  INF+PADU_AUX+PT+3SN  INF+PADU_AUX+PT+3SM  INF+PADU_AUX+PT+3SF  INF+PADU_AUX+PT+3SE  INF+PADU_AUX+PRT+3SN  INF+PADU_AUX+PRT+3SM  269

INF+PADU_AUX+PRT+3SF  INF+PADU_AUX+PRT+3SE  INF+PADU_AUX+FT_3SN  INF+PADU_AUX+RP_UM  INF+VVA_AUX+PT+3SN  INF+VVA_AUX+PRT+3SN  INF+VVA_AUX+FT_3SN  INF+VVA_AUX+RP_UM  INF+VI_AUX+PT+3SN  INF+VI_AUX+PT+3SN  INF+VI_AUX+PRT+3SN  INF+VI_AUX+PRT+3SN  INF+VI_AUX+FT_3SN  INF+VI_AUX+RP_UM  VP+KAADDU_AUX+PT+3SM  VP+KAADDU_AUX+PT+3SF  VP+KAADDU_AUX+PT+3SE  VP+KAADDU_AUX+PT+2S  VP+KAADDU_AUX+PT+1S  VP+KAADDU_AUX+PT+1P  VP+KAADDU_AUX+PRT+3SM  VP+KAADDU_AUX+PRT+3SF  VP+KAADDU_AUX+PRT+3SE  VP+KAADDU_AUX+PRT+2S  VP+KAADDU_AUX+PRT+1P  VP+KAADDU_AUX+FT+3SE  VP+KAADDU_AUX+FT+3SM  VP+KAADDU_AUX+FT+3SF  VP+KAADDU_AUX+FT+2S  VP+KAADDU_AUX+FT+1P  VP+VAI_AUX+PT+3SE  VP+VAI_AUX+PT+3SM  VP+VAI_AUX+PT+3SF  VP+VAI_AUX+PT+2S  VP+VAI_AUX+PT+1P  VP+VAI_AUX+PT+1S  VP+VAI_AUX+PRT+3SE  VP+VAI_AUX+PRT+2S  VP+VAI_AUX+PRT+3SM  VP+VAI_AUX+PRT+3SF  VP+VAI_AUX+PRT+1P  VP+VAI_AUX+FT+3SE  VP+VAI_AUX+FT+3SM  VP+VAI_AUX+FT+3SF  VP+VAI_AUX+FT+2S  VP+VAI_AUX+FT+1P  VP+KAADDU_AUX+INF+MOD_vend VP+VAI_AUX+INF+MOD_vendum  VP+THOLAI_AUX+INF+MOD_vendu  VP+KIZI_AUX+INF+MOD_vendum

VP+PAAR_AUX+INF+MOD_vendum  VP+IRU_AUX+INF+MOD_vendum  VP+KODU_AUX+INF+MOD_vendu  VP+POO_AUX+INF+MOD_vendum  VP+VIDU_AUX+INF+MOD_vendum  VP+KI_AUX+INF+MOD_vendum  VP+POODU_AUX+INF+MOD_vendu  VP+THALLU_AUX+INF+MOD_vendum  VP+KIDA_AUX+INF+MOD_vendum  VP+THEER_AUX+INF+MOD_vendum  VP+MUDI_AUX+INF+MOD_vendum  VP+PAAR_AUX+INF+MOD_vendum  VP+SEY_AUX+INF+MOD_vendum  VP+KAADDU_AUX+INF+MOD_venda  VP+VAI_AUX+INF+MOD_vendam  VP+THOLAI_AUX+INF+MOD_venda  VP+KIZI_AUX+INF+MOD_vendam  VP+PAAR_AUX+INF+MOD_vendam  VP+IRU_AUX+INF+MOD_vendam  VP+KODU_AUX+INF+MOD_vendam  VP+POO_AUX+INF+MOD_vendam  VP+VIDU_AUX+INF+MOD_vendam  VP+KI_AUX+INF+MOD_vendam  VP+POODU_AUX+INF+MOD_vendam  VP+THALLU_AUX+INF+MOD_vendam  VP+KIDA_AUX+INF+MOD_vendam  VP+THEER_AUX+INF+MOD_vendam  VP+MUDI_AUX+INF+MOD_venda  VP+PAAR_AUX+INF+MOD_vendam  VP+SEY_AUX+INF+MOD_vendam  VP+PAAR_AUX+INF+MOD_illai  VP+IRU_AUX+INF+MOD_illai  VP+KODU_AUX+INF+MOD_illai  VP+POO_AUX+INF+MOD_illai  VP+VIDU_AUX+INF+MOD_illai  VP+KI_AUX+INF+MOD_illai  VP+POODU_AUX+INF+MOD_illai  VP+THOLAI_AUX+INF+MOD_illai  VP+THALLU_AUX+INF+MOD_illai  VP+KIZI_AUX+INF+MOD_illai  VP+KIDA_AUX+INF+MOD_illai  VP+THEER_AUX+INF+MOD_illai  VP+MUDI_AUX+INF+MOD_illai  VP+VAA_AUX+INF+MOD_illai  VP+VAI_AUX+INF+MOD_illai  VP+KAADDU_AUX+INF+MOD_illai  VP+KODU_AUX+INF+MOD_illai  INF+VAA_AUX+INF+MOD_illai  INF+VAI_AUX+INF+MOD_illai  INF+SEY_AUX+INF+MOD_illai  270

INF+PANNU_AUX+INF+MOD_illai INF+PERU_AUX+INF+MOD_illai  INF+PADU_AUX+INF+MOD_illai  INF+VENDU_AUX+VP+NOM_athu+

INF+PADU_AUX+VP+VIDU_AUX+RP_UM INF+PADU_AUX+VP+VIDU_AUX+PT+3SN  VP+KII_AUX+PRT+1S  VP+KII_AUX+PT+1S  VP+KII_AUX+FT+1S VP+KII_AUX+PRT+1P  VP+KII_AUX+PT+1P  VP+KII_AUX+FT+1P  VP+KII_AUX+PRT+2S VP+KII_AUX+PT+2S  VP+KII_AUX+FT+1S  VP+KII_AUX+PRT+3SM  VP+KII_AUX+PT+3SM VP+KII_AUX+FT+3SM  VP+KII_AUX+PRT+3SF  VP+KII_AUX+PT+3SF VP+KII_AUX+FT+3SF  VP+KII_AUX+PRT+3SN  VP+KII_AUX+PT+3SN  VP+KII_AUX+FT+3SN VP+KII_AUX+PRT+3PE  VP+KII_AUX+PT+3PE  VP+KII_AUX+FT+3PE  VP+IRU_AUX+PRT+1P VP+IRU_AUX+PRT+2S  VP+IRU_AUX+PT+2S  VP+IRU_AUX+PRT+3SN  VP+IRU_AUX+PT+3SN VP+IRU_AUX+FT+3SN  VP+KI_AUX+PRT+3SN  VP+KI_AUX+PT+3SN VP+KI_AUX+FT+3SN VP+IRU_AUX+PRT+3PE  VP+IRU_AUX+PT+3PE  VP+IRU_AUX+FT+3PE VP+KI_AUX+PRT+3PE  VP+KI_AUX+PT+3PE  VP+KI_AUX+FT+3PE  VP+IRU_AUX+FT_3SN VP+IRU_AUX+RP_UM  VP+KI_AUX+FT_3SN  VP+KI_AUX+RP_UM  VP+KII_AUX+FT_3SN VP+KII_AUX+RP_UM 

INF+VENDU_AUX+VP+3SN+MOD_illai  FT+NOM_athu+MOD_illai  FT+3SN+MOD_illai  FT+NOM_athu+CM_dat+MOD_illai  INF+UL_AUX+RP+3SM  INF+UL_AUX+RP+3SF  INF+UL_AUX+RP+3PE  INF+UL_AUX+RP+3SN  Ungal  INF+CL_um  INF+PADU_AUX+VP+IRU_AUX+PT+3S  INF+PADU_AUX+VP+IRU_AUX+PT+3 INF+PADU_AUX+VP+IRU_AUX+PT+3 INF+PADU_AUX+VP+IRU_AUX+PT+2 INF+PADU_AUX+VP+IRU_AUX+PT+1s  INF+PADU_AUX+VP+IRU_AUX+PT+1S INF+PADU_AUX+VP+IRU_AUX+PRT+3SE  INF+PADU_AUX+VP+IRU_AUX+P3SE+PL  INF+PADU_AUX+VP+IRU_AUX+PRT+3SF  INF+PADU_AUX+VP+IRU_AUX+PRT+3S INF+PADU_AUX+VP+IRU_AUX+PRT+2S INF+PADU_AUX+VP+IRU_AUX+PRT+1P  INF+PADU_AUX+VP+IRU_AUX+PRT+1S  INF+PADU_AUX+VP+IRU_AUX+PRT+3S INF+PADU_AUX+VP+IRU_AUX+FT+3SE INF+PADU_AUX+VP+IRU_AUX+FT+3SM INF+PADU_AUX+VP+IRU_AUX+FT+3SF  INF+PADU_AUX+VP+IRU_AUX+FT+2S  INF+PADU_AUX+VP+IRU_AUX+FT+1P INF+PADU_AUX+VP+IRU_AUX+FT+1S  INF+PADU_AUX+VP+IRU_AUX+PT+3SN  INF+PADU_AUX+VP+IRU_AUX+FT_3SN  INF+PADU_AUX+VP+IRU_AUX+RP_UM INF+PADU_AUX+VP+IRU_AUX+INF+MOD  INF+PADU_AUX+VP+IRU_AUX+INF+MOD  INF+PADU_AUX+VP+UL_AUX+RP+3PE  INF+PADU_AUX+VP+VIDU_AUX+FT_3SN     271

B.4 TAMIL NOUN WORD FORMS நரம்ைப

நரம்பின்

நரம்பிைன

நரம்பின

நரம்பினத்ைத

நரம்பதன்

நரம்ேபா

நரம் ப்பக்கம்

நரம்பிேனா

நரம்பின்பக்கம்

நரம்பினத்ேதா

நரம்பதின்பக்கம்

நரம்பினால்

நரம்பண்ைட

நரம்பால்

நரம்பினண்ைட

நரம் க்கு

நரம் க்கண்ைட

நரம்பிற்கு

நரம்பின ேக

நரம்பின்

நரம் க்க ேக

நரம்ப

நரம்பதன ேக

நரம்பின

நரம்ப கில்

நரம்பின்கண்

நரம்பின கில்

நரம்ப கண்

நரம்பதன கில்

நரம் க்காக

நரம் க்க கில்

நரம்பாலான

நரம் கிட்ட

நரம் ைடய

நரம்பின்கிட்ட

நரம்பி

நரம் ேமல்

ைடய

லம் லம் லம்

நரம்பில்

நரம்பின்ேமல்

நரம்பினில்

நரம் க்குேமல்

நரம் டன்

நரம்பின ேமல்

நரம்பி

நரம் ேமேல

டன்

நரம் வைரக்கும்

நரம்பின்ேமேல

நரம் வைரயில்

நரம் க்குேமேல

நரம்பின்வைர

நரம்ப ேமேல

நரம்பில்லாமல்

நரம்பின ேமேல

நரம்ேபா ல்லாமல்

நரம்பின்கீழ்

நரம் க்கில்லாமல்

நரம் க்குங்கீழ்

நரம் டனில்லாமல்

நரம்பதின்கீழ்

நரம்பாட்டம்

நரம்பின்கீேழ

நரம் க்காட்டம்

நரம் க்குங்கீேழ

நரம்

நரம்பதின்கீேழ

தல்

நரம் வழியாக

நரம்ைபப்பற்றி

நரம்பின்வழியாக

நரம்பிைனப்பற்றி

நரம்பின வழியாக

நரம்பினத்ைதப்பற்றி

நரம்பிடம்

நரம்ைபக்குறித் 272

நரம்பிைனக்குறித்

நரம்பிைனப்ேபால்

நரம்பினத்ைதக்குறித்

நரம்பினத்ைதப்ேபால்

நரம்ைபப்பார்த்

நரம் மாதிாி

நரம்பிைனப்பார்த்

நரம்ைபமாதிாி

நரம்பினத்ைதப்பார்த்

நரம்ைபவிட

நரம்ைபேநாக்கி

நரம்பிைனவிட

நரம்பினத்ைதேநாக்கி

நரம்பினத்ைதவிட

நரம்பிைனேநாக்கி

நரம் க்குப்பதிலாக

நரம்ைபச்சுற்றி

நரம் க்காக

நரம்பினத்ைதச்சுற்றி

நரம்பி

நரம்பிைனச்சுற்றி

நரம் க்குப்பிறகு

நரம்ைபத்தாண்

நரம் க்கப் றம்

நரம்பிைனத்தாண்

நரம் க்கப்பால்

நரம்பினத்ைதத்தாண்

நரம் க்குேமல்

நரம்ைபத்தவிர்த்

நரம் க்குேமேல

நரம்பிைனத்தவிர்த்

நரம் க்குங்கீழ்

நரம்பினத்ைதத்தவிர்த்

நரம் க்குங்கீேழ

நரம்ைபத்தவிர

நரம் க்குள்

நரம்பிைனத்தவிர

நரம் க்குள்ேள

நரம்பினத்ைதத்தவிர

நரம் க்குெவளியில்

நரம்ெபாழிய

நரம் க்குெவளிேய

நரம்ைபெயாழிய

நரம் க்க யில்

நரம்பினத்ைதெயாழிய

நரம் க்க கில்

நரம்ைபெயாட்

நரம்

நரம்பிைனெயாட்

நரம் க்கு ன்

நரம்பினத்ைதெயாட்

நரம்

நரம்ைபக்ெகாண்

நரம் க்கு ன்னால்

நரம்பிைனக்ெகாண்

நரம் பின்னால்

நரம்பினத்ைதக்ெகாண்

நரம் க்குப்பின்னால்

நரம்ைபைவத்

நரம் க்குப்பின்

நரம்பினத்ைதைவத்

நரம் க்குப்பிந்தி

நரம்பிைனைவத்

நரம் க்குகு க்ேக

நரம்ைபவிட்

நரம் க்குள்

நரம்பிைனவிட்

நரம்பி

நரம்பினத்ைதவிட்

நரம் க்குள்ேள

நரம்ைபப்ேபால

நரம்பி

நரம்பிைனப்ேபால

நரம்ெபதிேர

நரம்பினத்ைதப்ேபால

நரம் க்ெகதிேர

நரம்ைபப்ேபால்

நரம்ெபதிர்க்கு 273

க்காக

ன் ன்னால்

ள் க்குள்ேள

நரம்ெபதிாில்

நரம் கள்வைரக்கும்

நரம் க்ெகதிாில்

நரம் கள்வைரயில்

நரம் க்ெகதிர்த்தாற்ேபால்

நரம் களில்லாமல்

நரம் க்கிைடயில்

நரம் களல்லாமல்

நரம் க்குந வில்

நரம் களாட்டம்

நரம் க்க த்தாற்ேபால்

நரம் கள் தல்

நரம்பி

நரம் களின்ப

ந்

நரம் டன்

நரம் களின்வழியாக

நரம்பிடமி ந்

நரம் களிடம்

நரம்

நரம் களின்

லமி ந்

லம்

நரம் ப்பக்கமி ந்

நரம் களின்பக்கம்

நரம்பண்ைடயி ந்

நரம் களினண்ைட

நரம்ப ேகயி ந்

நரம் கள ேக

நரம் க்க ேகயி ந்

நரம் களின ேக

நரம்ப கி

நரம் களின கில்

ந்

நரம் க்க கி

ந்

நரம் கள்கிட்ட

நரம் கிட்டயி ந்

நரம் களின்ேமல்

நரம் ேம

நரம் களின்ேமேல

ந்

நரம் க்குேம

ந்

நரம் களின்கீழ்

நரம் ேமேலயி ந்

நரம் களின்கீேழ

நரம் க்குேமேலயி ந்

நரம் கைளப்பற்றி

நரம் க்குங்கீழி ந்

நரம் கைளக்குறித்

நரம்பின்கீழி ந்

நரம் கைளப்பார்த்

நரம்பதின்கீழி ந்

நரம் கைளேநாக்கி

நரம்பின்கீேழயி ந்

நரம் கைளச்சுற்றி

நரம்பதன்கீேழயி ந்

நரம் கைளத்தாண்

நரம் கள்

நரம் கைளத்தவிர்த்

நரம் கைள

நரம் கைளத்தவிர

நரம் களிைன

நரம் கைளெயாழிய

நரம் களின்

நரம் கைளெயாட்

நரம் க

க்காக

நரம் கைளக்ெகாண்

நரம் க

க்கான

நரம் கைளைவத்

நரம் க

க்கு

நரம் கைளவிட்

நரம் களில்

நரம் கைளப்ேபால்

நரம் க

நரம் கைளப்ேபால

டன்

நரம் கேளா

நரம் கைளமாதிாி

நரம் கள

நரம் கைளவிட

நரம் களால்

நரம் க

க்குப்பதிலாக

நரம் க

நரம் க

க்காக

டன் 274

B.5 TAMIL VERB WORD FORMS



ப க்கின்ற

ப த்தான்

ப க்காத

ப த்தாள்

ப த்தவன்

ப த்தார்

ப த்தவள்

ப த்தார்கள்

ப த்தவர்

ப த்த

ப த்த

ப த்தன

ப க்கின்றவன்

ப த்தாய்

ப க்கின்றவள்

ப த்தீர்

ப க்கின்றவர்

ப த்தீர்கள்

ப க்கின்ற

ப த்ேதன்

ப ப்பவன்

ப த்ேதாம்

ப ப்பவள்

ப க்கிறான்

ப ப்பவர்

ப க்கிறாள்

ப ப்ப

ப க்கிறார்

ப க்காதவன்

ப க்கிறார்கள்

ப க்காதவள்

ப க்கின்ற

ப க்காதவர்

ப க்கின்றன

ப க்காத

ப க்கின்றாய்

ப த்தைவ

ப க்கின்றீர்

ப க்காதன

ப க்கின்றீர்கள்

ப த்

ப க்கின்ேறன்

ப க்க

ப க்கின்ேறாம்

ப க்கா

ப ப்பான்

ப க்காமல்

ப ப்பாள்

ப த்தல்

ப ப்பார்

ப க்காைம

ப ப்பார்கள்

ப க்கலாம்

ப ப்ப

ப க்கலாகும்

ப ப்பன

ப க்கலாகா

ப ப்பாய்

ப த் ப்பார்த்தான்

ப ப்பீர்

ப த் ப்பார்த்தாள்

ப ப்பீர்கள்

ப த் ப்பார்த்தார்

ப ப்ேபன்

ப த் ப்பார்த்தாய்

ப ப்ேபாம்

ப த் ப்பார்த்ேதாம்

ப க்கும்

ப த் ப்பார்த்ேதன்

ப த்த

ப த் ப்பார்க்கின்றார் 275

ப த் ப்பார்க்கின்றார்கள்

ப த் க்ெகா த்ேதாம்

ப த் ப்பார்க்கின்றாள்

ப த் க்ெகா க்கிறான்

ப த் ப்பார்க்கின்றான்

ப த் க்ெகா க்கிறாள்

ப த் ப்பார்க்கின்ேறாம்

ப த் க்ெகா க்கிறார்

ப த் ப்பார்க்கின்றாய்

ப த் க்ெகா க்கிறார்கள்

ப த் ப்பார்க்கின்ேறன்

ப த் க்ெகா க்கின்ற

ப த் ப்பார்ப்பார்

ப த் க்ெகா க்கின்றன

ப த் ப்பார்ப்பாள்

ப த் க்ெகா க்கின்றாய்

ப த் ப்பார்ப்பான்

ப த் க்ெகா க்கின்றீர்

ப த் ப்பார்ப்பாய்

ப த் க்ெகா க்கின்றீர்கள்

ப த் ப்பார்ப்ேபன்

ப த் க்ெகா க்கின்ேறன்

ப த் ப்பார்ப்ேபாம்

ப த் க்ெகா க்கின்ேறாம்

ப த்தி ந்தார்

ப த் க்ெகா ப்பான்

ப த்தி ந்தாள்

ப த் க்ெகா ப்பாள்

ப த்தி ந்தான்

ப த் க்ெகா ப்பார்

ப த்தி ந்தா

ப த் க்ெகா ப்பார்கள்

ப த்தி ந்ேதன்

ப த் க்ெகா ப்ப

ப த்தி ந்ேதாம்

ப த் க்ெகா ப்பன

ப த்தி க்கின்றார்

ப த் க்ெகா ப்பாய்

ப த்தி க்கின்றாள்

ப த் க்ெகா ப்பீர்

ப த்தி க்கின்றான்

ப த் க்ெகா ப்பீர்கள்

ப த்தி க்கின்ேறன்

ப த் க்ெகா ப்ேபன்

ப த்தி க்கின்ேறாம்

ப த் க்ெகா ப்ேபாம்

ப த்தி ப்பார்

ப த் ப்ேபானான்

ப த்தி ப்பாள்

ப த் ப்ேபானாள்

ப த்தி ப்பான்

ப த் ப்ேபானார்

ப த்தி ப்பாய்

ப த் ப்ேபானார்கள்

ப த்தி ப்ேபன்

ப த் ப்ேபான

ப த்தி ப்ேபாம்

ப த் ப்ேபாயின

ப த் க்ெகா த்தான்

ப த் ப்ேபானாய்

ப த் க்ெகா த்தாள்

ப த் ப்ேபானீர்

ப த் க்ெகா த்தார்

ப த் ப்ேபானீர்கள்

ப த் க்ெகா த்தார்

ப த் ப்ேபாேனன்

ப த் க்ெகா த்த

ப த் ப்ேபாேனாம்

ப த் க்ெகா த்தன

ப த் ப்ேபாகிறான்

ப த் க்ெகா த்தாய்

ப த் ப்ேபாகிறாள்

ப த் க்ெகா த்தீர்

ப த் ப்ேபாகிறார்

ப த் க்ெகா த்தீர்கள்

ப த் ப்ேபாகிறார்கள்

ப த் க்ெகா த்ேதன்

ப த் ப்ேபாகின்ற 276

ப த் ப்ேபாகின்றன

ப த் க்ெகாண்

ந்ேதாம்

ப த் ப்ேபாகின்றாய்

ப த் க்ெகாண்

ந்ேதன்

ப த் ப்ேபாகின்றீர்

ப த் க்ெகாண்

க்கிறார்

ப த் ப்ேபாகின்றீர்கள்

ப த் க்ெகாண்

க்கிறாள்

ப த் ப்ேபாகின்ேறன்

ப த் க்ெகாண்

க்கிறான்

ப த் ப்ேபாகின்ேறாம்

ப த் க்ெகாண்

க்கிறாய்

ப த் ப்ேபாவான்

ப த் க்ெகாண்

க்கிேறாம்

ப த் ப்ேபாவாள்

ப த் க்ெகாண்

க்கிேறன்

ப த் ப்ேபாவார்

ப த் க்ெகாண்

ப்பார்

ப த் ப்ேபாவார்கள்

ப த் க்ெகாண்

ப்பான்

ப த் ப்ேபாவ

ப த் க்ெகாண்

ப்பாள்

ப த் ப்ேபாவன

ப த் க்ெகாண்

ப்பாய்

ப த் ப்ேபாவாய்

ப த் க்ெகாண்

ப்ேபாம்

ப த் ப்ேபா ர்

ப த் க்ெகாண்

ப்ேபன்

ப த் ப்ேபா ர்கள்

ப த்தாயிற்

ப த் ப்ேபாேவன்

ப த் ப்ேபாட்டார்

ப த் ப்ேபாேவாம்

ப த் ப்ேபாட்டான்

ப த் விட்டார்

ப த் ப்ேபாட்டாள்

ப த் விட்டாள்

ப த் ப்ேபாட்டாய்

ப த் விட்டான்

ப த் ப்ேபாட்ேடன்

ப த் விட்டாய்

ப த் ப்ேபாட்ேடாம்

ப த் விட்ேடாம்

ப த் ப்ேபா கிறார்

ப த் விட்ேடன்

ப த் ப்ேபா கிறான்

ப த் வி கின்றார்

ப த் ப்ேபா கிறாள்

ப த் வி கின்றான்

ப த் ப்ேபா கிறாய்

ப த் வி கின்றாள்

ப த் ப்ேபா கிேறன்

ப த் வி கின்றாய்

ப த் ப்ேபா கிேறாம்

ப த் வி கின்ேறாம்

ப த் ப்ேபா வார்

ப த் வி கிேறன்

ப த் ப்ேபா வான்

ப த் வி வார்

ப த் ப்ேபா வாள்

ப த் வி வாள்

ப த் ப்ேபா வாய்

ப த் வி வான்

ப த் ப்ேபா ேவாம்

ப த் வி வாய்

ப த் ப்ேபா ேவன்

ப த் வி ேவன்

ப த் த்ெதாைலத்தார்

ப த் வி ேவாம்

ப த் த்ெதாைலத்தான்

ப த் க்ெகாண்

ந்தார்

ப த் த்ெதாைலத்தாள்

ப த் க்ெகாண்

ந்தான்

ப த் த்ெதாைலத்தாய்

ப த் க்ெகாண்

ந்தாள்

ப த் த்ெதாைலேதாம்

ப த் க்ெகாண்

ந்தாய்

ப த் த்ெதாைலேதான் 277

ப த் த்ெதாைலகிறார்

ப த் க்கிழிக்கிறாய்

ப த் த்ெதாைலகிறாள்

ப த் க்கிழிக்கிேறன்

ப த் த்ெதாைலகிறான்

ப த் க்கிழிக்கிேறாம்

ப த் த்ெதாைலகிறாய்

ப த் க்கிழிப்பார்

ப த் த்ெதாைலகிேறாம்

ப த் க்கிழிப்பான்

ப த் த்ெதாைலகிேறன்

ப த் க்கிழிப்பாள்

ப த் த்ெதாைலப்பார்

ப த் க்கிழிப்பாய்

ப த் த்ெதாைலப்பான்

ப த் க்கிழிப்ேபன்

ப த் த்ெதாைலப்பாள்

ப த் க்கிழிப்ேபாம்

ப த் த்ெதாைலப்பாய்

ப த் க்கிடந்தார்

ப த் த்ெதாைலப்ேபன்

ப த் க்கிடந்தான்

ப த் த்ெதாைலப்ேபாம்

ப த் க்கிடந்தாள்

ப த் த்தள்ளினார்

ப த் க்கிடந்தாய்

ப த் த்தள்ளினாள்

ப த் க்கிடந்ேதன்

ப த் த்தள்ளினான்

ப த் க்கிடந்ேதாம்

ப த் த்தள்ளினாய்

ப த் க்கிடக்கிறார்

ப த் த்தள்ளிேனாம்

ப த் க்கிடக்கிறான்

ப த் த்தள்ளிேனன்

ப த் க்கிடக்கிறாள்

ப த் த்தள்

வார்

ப த் க்கிடக்கிறாய்

ப த் த்தள்

வான்

ப த் க்கிடக்கிேறன்

ப த் த்தள்

வாள்

ப த் க்கிடக்கிேறாம்

ப த் த்தள்

வாய்

ப த் க்கிடப்பார்

ப த் த்தள்

ேவன்

ப த் க்கிடப்பான்

ப த் த்தள்

ேவாம்

ப த் க்கிடப்பாள்

ப த் க்கிழித்தார்

ப த் க்கிடப்பாய்

ப த் க்கிழித்தான்

ப த் க்கிடப்ேபன்

ப த் க்கிழித்தாள்

ப த் க்கிடப்ேபாம்

ப த் க்கிழித்தாய்

ப த் த்தீர்த்தார்

ப த் க்கிழித்ேதன்

ப த் த்தீர்த்தான்

ப த் க்கிழித்ேதாம்

ப த் த்தீர்த்தாள்

ப த் க்கிழிப்பார்

ப த் த்தீர்த்தாய்

ப த் க்கிழிப்பான்

ப த் த்தீர்த்ேதாம்

ப த் க்கிழிக்கிறாள்

ப த் த்தீர்த்ேதன்

ப த் க்கிழிப்பாய்

ப த் த்தீர்க்கிறார்

ப த் க்கிழிப்ேபன்

ப த் த்தீர்க்கிறான்

ப த் க்கிழிப்ேபாம்

ப த் த்தீர்க்கிறாள்

ப த் க்கிழிக்கிறார்

ப த் த்தீர்க்கிறாய்

ப த் க்கிழிக்கிறான்

ப த் த்தீர்க்கிேறாம்

ப த் க்கிழிக்கிறாள்

ப த் த்தீர்க்கிேறன் 278

ப த் த்தீர்ப்பார்

ப க்க

ந்த

ப த் த்தீர்ப்பான்

ப க்க

கிற

ப த் த்தீர்ப்பாள்

ப க்க

ம்

ப த் த்தீர்ப்பாய்

ப க்கயி ந்தான்

ப த் த்தீர்ப்ேபாம்

ப க்கயி ந்தாள்

ப த் த்தீர்ப்ேபன் ப த்

த்தார்

ப த்

த்தான்

ப த்

த்தாள்

ப த்

த்தாய்

ப த்

த்ேதாம்

ப த்

த்ேதன்

ப த்

க்கிறார்

ப த்

க்கிறான்

ப த்

க்கிறாள்

ப த்

க்கிறாய்

ப த்

க்கிேறாம்

ப த்

க்கிேறன்

ப த்

ப்பார்

ப த்

ப்பான்

ப த்

ப்பாள்

ப த்

ப்பாய்

ப த்

ப்ேபன்

ப த்

ப்ேபாம்

ப க்கட் ம் ப க்கேவண் ம் ப க்கேவண்டாம் ப க்கக்கூ ம் ப க்கக்கூடா ப க்கமாட்டான் ப க்கமாட்டாள் ப க்கமாட்டார் ப க்கமாட்ேடன் ப க்கமாட்டாய் ப க்கமாட்ேடாம் ப க்கவில்ைல ப க்கயிய

ம்

ப க்கயியன்ற ப க்கவிய

கிற 279

B.6 MOSES INSTALLATION AND TRAINING

This subsection explains the installation of the Moses toolkit and the issues that occurred while training the system; remedies for these issues are also given in detail. The required packages are in the packages folder.

mkdir smt/moses/tools
# copy giza-pp into tools and extract it, then:
cd smt/moses/tools/gizapp
make
# if make fails, install the missing packages:
yum install gcc
yum install gcc-c++
yum install glibc-static       # for the "can't find -lm" error
yum install libstdc++-static   # for the "can't find -lstdc++" error

cd ../
mkdir bin
cp giza-pp/GIZA++-v2/GIZA++ bin/
cp giza-pp/mkcls-v2/mkcls bin/
cp giza-pp/GIZA++-v2/snt2cooc.out bin/

# copy srilm into tools and extract it
cd srilm
# set the SRILM path variable to the srilm directory in srilm/Makefile
# in srilm/common, open makefile.machine.i686: under TCL support,
# comment out the other two options and add NO_TCL = X
yum install automake
yum install zlib-devel
yum install boost-devel

# install the C shell package, i.e. the tcsh package
make World
make all


export PATH=/home/anand/smt/moses/tools/srilm/bin/i686:/home/jschroe1/demo/tools/srilm/bin:$PATH
cd ..
# copy moses into tools and extract it
yum install libtool
./regenerate-makefiles.sh
./configure --with-srilm=/home/anand/smt/moses/tools/srilm --with-irstlm=/home/anand/smt/moses/tools/irstlm
make -j 2

To confirm the setup:

cd /home/anand/smt/moses
mkdir data
# copy sample-models here
cd sample-models/phrase-model/
../../../tools/moses/moses-cmd/src/moses -f moses.ini < in > out
# 'in' holds two sentences; expected output: this is a small house .

Compile the Moses support scripts:

cd ../../../tools/
mkdir moses-scripts
cd moses/scripts
# edit the makefile, lines 13-14:
#   TARGETDIR?=/home/s0565741/terabyte/bin -> TARGETDIR?=/home/anand/smt/moses/tools/moses-scripts
#   BINDIR?=/home/s0565741/terabyte/bin    -> BINDIR?=/home/anand/smt/moses/tools/bin
# in moses/scripts/makefile, line 79: change ./check-dependencies.pl to perl check-dependencies.pl
make release
export SCRIPTS_ROOTDIR=/home/anand/smt/moses/tools/moses-scripts/scripts-YYYYMMDD-HHMM

Additional scripts:

cd ../../
# extract the scripts archive as well
cp mteval-v11b.pl .

For training errors such as:
  Use of uninitialized value $a in scalar chomp at tools/moses-scripts/scripts-20110204-0333/training/train-model.perl line 1079.
  Use of uninitialized value $a in split at tools/moses-scripts/scripts-20110204-0333/training/train-model.perl line 1082.
type perl before $GIZA2BAL in line 1019 of train-model.perl.

For tuning errors: if the error is
  sh: /home/anand/smt-moses/tools/moses-scripts/scripts-20101111-0136/training/cmert-0.5/score-nbest.py: /usr/bin/python^M: bad interpreter: No such file or directory
then in /home/anand/smt/moses/tools/moses-scripts/scripts-20110204-0333/training/mert-moses.pl, type python before $cmertdir, i.e. in the line
  $SCORENBESTCMD = "$cmertdir/score-nbest.py" if ! defined $SCORENBESTCMD;

Creating the language model:

ngram-count -order 4 -interpolate -kndiscount -text SMT-project/smt-cmd/lm/monolingual -lm SMT-project/smt-cmd/lm/monolingual.lm
./ngram-count -order 5 -text 300.morph.txt -lm 300.morph.lm

Training:

SMT-project/moses/bin/moses-scripts/scripts-20090302-0358/training/train-factored-phrase-model.perl -scripts-root-dir SMT-project/moses/bin/moses-scripts/scripts-20090302-0358/ -root-dir . -corpus SMT-project/smt-cmd/corpus/corpus -f en -e ma -alignment grow-diag-final -reordering msd-bidirectional-fe -lm 0:4:SMT-project/smt-cmd/lm/monolingual.lm:0


Testing:

SMT-project/moses/moses-cmd/src/moses -config SMT-project/smt-cmd/model/moses.ini -input-file SMT-project/smt-cmd/testing/input > SMT-project/smt-cmd/testing/output

Training the factored model with alignment and translation factors:

bin/moses-scripts/scripts-20100311-1743/training/train-factored-phrase-model.perl -scripts-root-dir bin/moses-scripts/scripts-20100311-1743/ -root-dir running_files -corpus pc/twofactfour -f eng -e tam -lm 0:4:/root/Desktop/D5/srilm/bin/i686/300.lm:0 -lm 2:5:/root/Desktop/D5/srilm/bin/i686/300.pos.lm:0 -lm 3:5:/root/Desktop/D5/srilm/bin/i686/300.morph.lm:0 --alignment-factors 0,1,2,3-0,1,2,3 --translation-factors 0-0+1-1+2-2+3-3 --reordering-factors 0-0+1-1+2-2+3-3 --generation-factors 3-2+3,2-1+1,2,3-0 --decoding-steps t3,t2,t1,g0,g1,g2

Moses config file:

#########################
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0
1
2
3

# mapping steps
[mapping]
0 T 0
0 T 1

# translation tables: source-factors, target-factors, number of scores, file
[ttable-file]
1,2 1,2 5 /media/DISK5/tools/trunk/scripts/running_files/model/phrase-table.1,2-1,2.gz
2,3 3 5 /media/DISK5/tools/trunk/scripts/running_files/model/phrase-table.2,3-3.gz


# no generation models, no generation-file section

# language models: type (srilm/irstlm), factors, order, file
[lmodel-file]
0 1 4 /root/Desktop/D5/srilm/bin/i686/300.lem.lm
0 2 4 /root/Desktop/D5/srilm/bin/i686/300.pos.lm
0 3 4 /root/Desktop/D5/srilm/bin/i686/300.morph_.lm

# limit on how many phrase translations e for each phrase f are loaded
# 0 = all elements loaded
[ttable-limit]
20
0

# distortion (reordering) weight
[weight-d]
0.6

# language model weights
[weight-l]
0.1667
0.1667
0.1667

# translation model weights
[weight-t]
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2

# no generation models, no weight-generation section

# word penalty
[weight-w]
-1

[distortion-limit]
6

Testing the factored model:

./moses-cmd/src/moses -report-all-factors -config ./scripts/running_files/model/moses.ini -input-file ./scripts/test/test.eng > ./scripts/test/test.tam

B.7 COMPARISON WITH GOOGLE OUTPUT

English Sentences I went to his home. She is playing with her friends. He is a doctor. They are studying in my school. The book is on the table.

Google Output நான் அவர ட் ற்கு ெசன்றார். அவள்

நண்பர்க

ெகாண்

க்கிறார்.

அவர் ஒ

டாக்டர்.

அவர்கள்

என்

டன்

பள்ளியில்

த்தகம் அட்டவைண உள்ள . அவர்கள் ஒ

The rat was killed by the cat .



She is studying with me .

அவள் என்

She did not come with me.

அவள் என்ைன வரவில்ைல.

I deposited money to the bank.

நான் வங்கிக்கு பணம் ெடபாசிட்.

We will study today.

இன்

English Sentences

F-SMT Output

I went to his home.

நான் அவன

பாட்ைட பாட ேபாகிறார்.

ைன ெகால்லப்பட்டார். டன் ப த்

ெகாண்

க்கிறார்.

நாம் ப ப்ேபாம்.

அவள்

ட் ற்கு ேபாேனன். நண்பர்க

She is playing with her friends.

ெகாண்

He is a doctor.

அவர் ஒ

The book is on the table.

பயின்

வ கிறார்கள்.

They will sing a song.

They are studying in my school.

விைளயா

டன்

விைளயா

க்கிறாள். டாக்டர்.

அவர்கள்

என்

பள்ளியில்

ப த்

வ கிறார்கள். த்தகம் ேமைசயின் ேமல் உள்ள .

They will sing a song.

அவர்கள் ஒ

The rat was killed by the cat .



பாட்ைட பாட ேபாகிறார்கள்.

ைனயால் ெகால்லப்பட்ட .

அவள்

என்

டன்

She is studying with me .

ெகாண்

She did not come with me.

அவள் என்

I deposited money to the bank.

நான் வங்கிக்கு பணம் ெசன்ேறன்.

We will study today.

இன்


க்கிறாள். டன் வரவில்ைல.

நாம் ப ப்ேபாம்.

ப த்

B.8 GRAPHICAL USER INTERFACES

[Figure B.1 Tamil POS Tagger GUI]

[Figure B.2 Tamil Morphological Analyzer GUI]

[Figure B.3 Tamil Morphological Generator GUI]

[Figure B.4 English to Tamil Machine Translation System GUI]

REFERENCES

[1]

Lee, L. (2004). ''I'm sorry Dave, I'm afraid I can't do that'': Linguistics, statistics, and natural language processing circa 2001. In Committee on the Fundamentals of Computer Science: Challenges and Opportunities, Computer Science and Telecommunications Board, National Research Council (Eds.), Computer science: Reflections on the field, reflections from the field (pp. 111-118). Washington, DC: The National Academies Press.

[2]

Jurafsky Daniel and Martin James H (2005), “An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition”, Prentice Hall, ISBN: 0130950696, contributing writers: Andrew Kehler, Keith Vander Linden, and Nigel Ward.

[3]

Hutchins John 2001, Machine translation and human translation: in competition or in complementation?, International Journal of Translation, 13, 1-2, p. 5-20

[4]

Allen James (1995), “Natural Language Processing”, 1-2.Redwood Benjamin/ Cummings.

[5]

http://www.pangea.com.mt/en/q2-why-statistical-mt/

[6]

http://cordis.europa.eu/fp7/ict/language-technologies/projecteuromatrixplus_en.html

[7]

Ma, Xiaoyi. 1999. Parallel text collections at the Linguistic Data Consortium. In Machine Translation Summit VII, Singapore.

[8]

Durgesh Rao. 2001. Machine Translation in India: A Brief Survey. In "Proceedings of SCALLA 2001 Conference", Bangalore, India.

[9]

http://en.wikipedia.org/wiki/Tamil_language

[10]

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proc. EMNLP+CoNLL, pages 868–876, Prague

[11]

http://en.wikipedia.org/wiki/Subject%E2%80%93verb%E2%80%93object


[12]

Jesús Giménez and Lluís Màrquez (2006), "SVMTool: Technical Manual v1.3", August 2006.

[13]

S. Rajendran, S. Arulmozi, Ramesh Kumar and S. Viswanathan (2003), "Computational Morphology of Verbal Complex", Language in India, Volume 3:4, April 2003.

[14]

Bahl L and Mercer R. L (1976), “Part-Of-Speech assignment by a statistical decision algorithm”, IEEE International Symposium on Information Theory.

[15]

Church, K. W. (1988). A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the Second Conference on Applied Natural Language Processing, pages 136–143.

[16]

Cutting D, Kupiec J, Pederson J and Nipun P (1992), “A Practical Part-ofspeech Tagger”, Proceedings of the 3rd Conference of Applied Natural Language Processing, ANLP, 1992, pp. 133-140.

[17]

DeRose, S. (1988). 'Grammatical category disambiguation by statistical optimization. Computational Linguistics 14, 31-39

[18]

Schmid H (1994), “Probabilistic Part-Of-Speech Tagging using Decision Trees”, Proceedings of the International Conference on new methods in language processing, Manchester, UK.

[19]

Brill E (1992), "A simple rule based part of speech tagger", Proceedings of the Third Conference on Applied Natural Language Processing, ACL, Trento, Italy.

[20]

Brill E (1993), “Automatic grammar induction and parsing free text: A transformation based approach”, Proceedings of 31st Meeting of the Association of Computational Linguistics, Columbus.

[21]

Brill E (1993), “Transformation based error driven parsing”, Proceedings of the Third International Workshop on Parsing Technologies, Tilburg, The Netherlands.


[22]

Brill E (1994), "Some advances in rule based part of speech tagging", Proceedings of The Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington.

[23]

Prins R, and Van Noord G (2001), “Unsupervised Pos- Tagging Improves Parsing Accuracy And Parsing Efficiency”, Proceedings of the International Workshop on Parsing Technologies.

[24]

Pop M (1996), “Unsupervised Part-of-speech Tagging”, Department of Computer Science, Johns Hopkins University.

[25]

Brill E (1997), “Unsupervised Learning of Disambiguation Rules for Part of Speech Tagging”, In Proceeding of The Natural Language Processing Using Very Large Corpora, Boston.

[26]

Kimmo Koskenniemi (1985), Compilation of automata from morphological two-level rules. In F. Karlsson (ed.), Papers from the fifth Scandinavian Conference of Computational Linguistics, Helsinki, pp. 143-149.

[27]

Greene, B. B. & Rubin, G. M. 1971. Automatic grammatical tagging of English. Technical Report, Brown University. Providence, RI.

[28]

Francis, W.N., & Kucera, H. (1982). Frequency analysis of English usage: Lexicon and grammar. Boston: Houghton Mifflin.

[29]

Derouault A M and Merialdo B (1986), “Natural language modeling for phoneme-to-text transcription”, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8, pp. 742-749.

[30]

Jelinek, F. 1985. Self-organized language modeling for speech recognition. Technical report, IBM T.J. Watson Research Center, Continuous Speech Recognition Group, Yorktown Heights, NY.

[31]

Kupiec J M (1992), “Robust part-of-speech tagging using a Hidden Markov Model”, Computer Speech and Language, pp.113-118.


[32]

Yahya O and Mohamed Elhadj (2004), “Statistical Part-of-Speech Tagger for Traditional Arabic Texts”, Journal of Computer Science 5 (11): 794-800, ISSN 1549-3636.

[33]

Weischedel R, Schwartz R, Pahnueci J, Meteer M and Ramshaw L (1993), “Coping with Ambiguity and Unknown words through Probabilistic Models”. Computational Linguistics, pp. 19(2):260-269.

[34]

Ratnaparkhi Adwait (1996), “A Maximum Entropy Model for Part-Of-Speech Tagging”, Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP-1996), University of Pennsylvania.

[35]

Brill E (1995), “Transformation-based Error-driven Learning and Natural Language Processing: A Case Study in Part-of-speech Tagging”. Computational Linguistics, 1995, pp 21 (4):543-565.

[36]

Ray Lau, Ronald Rosenfeld, and Salim Roukos. 1993. Adaptive Language Modeling Using The Maximum Entropy Principle. In Proceedings of the Human Language Technology Workshop, pages 108-113. ARPA.

[37]

Quinlan J R (1986), "Induction of Decision Trees", Machine Learning, 1:81-106.

[38]

Jesús Giménez and Lluís Màrquez (2004), "SVMTool: A general POS tagger generator based on support vector machines", Proceedings of the 4th LREC Conference.

[39]

Brants, T. (2000b). TnT – A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing ANLP-2000. Seattle, WA

[40]

Smriti Singh, Kuhoo Gupta, Manish Shrivastava and Pushpak Bhattacharyya (2006), “Morphological richness offsets resource demand – experiences in constructing a pos tagger for Hindi”, Proceedings of the COLING/ACL 2006, Sydney, Australia Main Conference Poster Sessions, pp. 779–786.


[41] Manish Shrivastava and Pushpak Bhattacharyya (2008), "Hindi POS Tagger Using Naive Stemming: Harnessing Morphological Information Without Extensive Linguistic Knowledge", International Conference on NLP (ICON-2008), Pune, India, December 2008.

[42] Dalal Aniket, Kumar Nagaraj, Uma Sawant and Sandeep Shelke (2006), "Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach", Proceedings of NLPAI-2006, Machine Learning Workshop on Part of Speech and Chunking for Indian Languages.

[43] Karthik Kumar G, Sudheer K and Avinesh P V S (2006), "Comparative Study of Various Machine Learning Methods for Telugu Part of Speech Tagging", Proceedings of NLPAI Machine Learning Workshop on Part of Speech and Chunking for Indian Languages.

[44] Nidhi Mishra and Amit Mishra (2011), "Part of Speech Tagging for Hindi Corpus", International Conference on Communication Systems and Network Technologies.

[45] Pradipta Ranjan Ray, Harish V, Sudeshna Sarkar and Anupam Basu (2003), "Part of Speech Tagging and Local Word Grouping Techniques for Natural Language Parsing in Hindi", Indian Institute of Technology, Kharagpur, India. www.mla.iitkgp.ernet.in/papers/hindipostagging.pdf.

[46] Sivaji Bandyopadhyay, Asif Ekbal and Debasish Halder (2006), "HMM Based POS Tagger and Rule-based Chunker for Bengali", Proceedings of NLPAI Machine Learning Workshop on Part of Speech and Chunking for Indian Languages.

[47] RamaSree R J and Kusuma Kumari P (2007), "Combining POS Taggers for Improved Accuracy to Create Telugu Annotated Texts for Information Retrieval", Available at http://www.ulib.org/conference/2007/RamaSree.pdf.

[48] Sandipan Dandapat (2007), "Part of Speech Tagging and Chunking with Maximum Entropy Model", Proceedings of the IJCAI Workshop on Shallow Parsing for South Asian Languages.


[49] Antony P J and Soman K P (2010), "Kernel Based Part of Speech Tagger for Kannada", International Conference on Machine Learning and Cybernetics (ICMLC 2010), Volume 4, pp. 2139-2144, July.

[50] Manju K, Soumya S and Sumam Mary Idicula (2009), "Development of a POS Tagger for Malayalam - An Experience", International Conference on Advances in Recent Technologies in Communication and Computing, pp. 709-713.

[51] Antony P J, Santhanu P Mohan and Soman K P (2010), "SVM Based Part of Speech Tagger for Malayalam", International Conference on Recent Trends in Information, Telecommunication and Computing (ITC 2010).

[52] Arulmozhi P, Sobha L and Kumara Shanmugam B (2004), "Parts of Speech Tagger for Tamil", Proceedings of the Symposium on Indian Morphology, Phonology & Language Engineering, Indian Institute of Technology, Kharagpur.

[53] Arulmozhi P and Sobha L (2006), "A Hybrid POS Tagger for a Relatively Free Word Order Language", Proceedings of MSPIL-2006, Indian Institute of Technology, Bombay.

[54] Lakshmana Pandian S and Geetha T V (2008), "Morpheme Based Language Model for Parts-of-Speech Tagging", POLIBITS – Research Journal on Computer Science and Computer Engineering with Applications, Volume 38, Mexico, pp. 19-25.

[55] Vasu Renganathan (2001), "Development of Part-of-Speech Tagger for Tamil", Tamil Internet 2001 Conference, Kuala Lumpur, August 26-28, 2001.

[56] Ganesan M (2007), "Morph and POS Tagger for Tamil" (Software), Annamalai University, Annamalai Nagar.

[57] Lakshmana Pandian S and Geetha T V (2009), "CRF Models for Tamil Part of Speech Tagging and Chunking", Proceedings of the 22nd ICCPOL.

[58] Selvam M and Natarajan A M (2009), "Improvement of Rule Based Morphological Analysis and POS Tagging in Tamil Language via Projection and Induction Techniques", International Journal of Computers, Issue 4, Volume 3, 2009.

[59] Canasai Kruengkrai, Virach Sornlertlamvanich and Hitoshi Isahara (2006), "A Conditional Random Field Framework for Thai Morphological Analysis", Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC-06), Genoa, Italy.

[60] Daelemans Walter, Zavrel J, Van den Bosch A and Van der Sloot K (2003), "MBT: Memory Based Tagger, Version 2.0, Reference Guide", Technical Report ILK 03-13, ILK Research Group, Tilburg University.

[61] Alon Itai and Erel Segal (2003), "A Corpus Based Morphological Analyzer for Unvocalized Modern Hebrew", Department of Computer Science, Technion – Israel Institute of Technology, Haifa, Israel.

[62] John Goldsmith (2001), "Unsupervised Learning of the Morphology of a Natural Language", Computational Linguistics, 27(2):153-198.

[63] Asanee Kawtrakul and Chalatip Thumkanon (1997), "A Statistical Approach to Thai Morphological Analyzer", Proceedings of the 5th Workshop on Very Large Corpora.

[64] John Lee (2008), "A Nearest-Neighbour Approach to the Automatic Analysis of Ancient Greek Morphology", CoNLL-2008: Proceedings of the 12th Conference on Computational Natural Language Learning, Manchester.

[65] Vikram T N and Shalini R (2007), "Development of Prototype Morphological Analyzer for the South Indian Language of Kannada", Lecture Notes in Computer Science: Proceedings of the 10th International Conference on Asian Digital Libraries, Vol. 4822/2007, pp. 109-116.

[66] Shambhavi B R, Ramakanth Kumar P, Srividya K, Jyothi B J, Spoorti Kundargi and Varsha Shastri (2011), "Kannada Morphological Analyser and Generator Using Trie", International Journal of Computer Science and Network Security (IJCSNS), Vol. 11, No. 1, January 2011, pp. 112-116.

[67] Uma Maheswara Rao G and Parameshwari K (2010), "On the Description of Morphological Data for Morphological Analyzers and Generators: A Case of Telugu, Tamil and Kannada", CALTS, University of Hyderabad.

[68] Narayana Murthy K (2001), "Issues in the Design of a Spell Checker for Morphologically Rich Languages", 3rd International Conference on South Asian Languages (ICOSAL-3), January 4-6, 2001, University of Hyderabad.

[69] Sajib Dasgupta and Vincent Ng (2007), "Unsupervised Morphological Parsing of Bengali", Language Resources and Evaluation, 40:3-4, pp. 311-330.

[70] Mohanty S, Santi P K and Adhikary K P D (2004), "Analysis and Design of Oriya Morphological Analyser: Some Tests with OriNet", In Proceedings of the Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur.

[71] Girish Nath Jha, Muktanand Agarwal, Subash, Sudhir K Mishra, Diwakar Mishra, Manji Bhadra and Surjit K Singh (2007), "Inflectional Morphology for Sanskrit", In Proceedings of the First International Symposium on Sanskrit Computational Linguistics, pp. 46-77.

[72] Anandan P, Ranjani Parthasarathy and Geetha T V (2002), "Morphological Analyzer for Tamil", ICON 2002, RCILTS-Tamil, Anna University, India.

[73] Viswanathan S, Ramesh Kumar S, Kumara Shanmugam B, Arulmozi S and Vijay Shanker K (2003), "A Tamil Morphological Analyser", Proceedings of the International Conference on Natural Language Processing (ICON 2003), Central Institute of Indian Languages, Mysore, India, pp. 31-39.

[74] Parameshwari K (2011), "An Implementation of APERTIUM Morphological Analyzer and Generator for Tamil", Language in India, www.languageinindia.com, 11:5 May 2011, Special Volume: Problems of Parsing in Indian Languages.


[75] Vijay Sundar Ram R, Menaka S and Sobha Lalitha Devi (2010), "Tamil Morphological Analyser", In Mona Parakh (ed.), Morphological Analyser for Indian Languages, CIIL, Mysore, pp. 1-18.

[76] Akshar Bharati, Rajeev Sangal, Bendre S M, Pavan Kumar and Aishwarya (2001), "Unsupervised Improvement of Morphological Analyzer for Inflectionally Rich Languages", Proceedings of the NLPRS, pp. 685-692.

[77] Duraipandi (2002), "The Morphological Generator and Parsing Engine for Tamil Verb Forms", Tamil Internet Conference 2002.

[78] Dalal Aniket, Kumar Nagaraj, Uma Sawant and Sandeep Shelke (2006), "Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach", Proceedings of NLPAI-2006, Machine Learning Workshop on Part of Speech and Chunking for Indian Languages.

[79] Rajan K, Ramalingam V, Ganesan M, Palanivel S and Palaniappan B (2009), "Automatic Classification of Tamil Documents Using Vector Space Model and Artificial Neural Network", Expert Systems with Applications, 36 (2009), pp. 10914-10918.

[80] Naskar S and Bandyopadhyay S (2002), "Use of Machine Translation in India: Current Status", Proceedings of the 7th EAMT Workshop on Teaching Machine Translation (TMT'02), Manchester, UK, pp. 23-32.

[81] Sinha R M K, Jain R and Jain A (2001), "Translation from English to Indian Languages: ANGLABHARTI Approach", In Proceedings of the Symposium on Translation Support Systems (STRANS-2001), IIT Kanpur, India, February 15-17, 2001.

[82] Murthy B K and Deshpande W R (1998), "Language Technology in India: Past, Present and Future".

[83] Akshar Bharati, Vineet Chaitanya, Amba P Kulkarni and Rajeev Sangal (1997), "Anusaaraka: Machine Translation in Stages", Vivek: A Quarterly in Artificial Intelligence, Vol. 10, No. 3, pp. 22-25.


[84] Durgesh Rao (2001), "Machine Translation in India: A Brief Survey", In Proceedings of the SCALLA 2001 Conference, Bangalore, India.

[85] Murthy B K and Deshpande W R (1998), "Language Technology in India: Past, Present and Future".

[86] Bharati A, Moona R, Reddy P, Sankar B and Sharma D M (2003), "Machine Translation: The Shakti Approach", In Proceedings of the 19th International Conference on Natural Language Processing, India, pp. 1-7, December 2003.

[87] Bandyopadhyay S (2000), "ANUBAAD, the Translator for English to Indian Languages", In Proceedings of the 7th State Science and Technology Congress (SSTC'00), Calcutta, India, pp. 1-9, 2000.

[88] Sinha R M K and Anil Thakur (2005), "Machine Translation of Bilingual Hindi-English (Hinglish) Text", In Conference Proceedings: The Tenth Machine Translation Summit, Phuket, Thailand, pp. 149-156, September 13-15, 2005.

[89] Lata Gore and Nishigandha Patil (2002), "English to Hindi Translation System", In Proceedings of the Symposium on Translation Support Systems (STRANS-2002), IIT Kanpur, March 15-17, 2002.

[90] Josan G S and Lehal G S (2008), "Punjabi to Hindi Machine Translation System", In Proceedings of the 22nd International Conference on Computational Linguistics (COLING 2008), Manchester, UK, pp. 157-160, August 2008.

[91] Vishal Goyal and Gurpreet Singh Lehal (2011), "Hindi to Punjabi Machine Translation System", Information Systems for Indian Languages, Communications in Computer and Information Science, Vol. 139, Springer Berlin Heidelberg, pp. 236-241.

[92] Prashanth Balajapally, Phanindra Pydimarri, Madhavi Ganapathiraju, Balakrishnan N and Raj Reddy (2006), "Multilingual Book Reader: Transliteration, Word-to-Word Translation and Full-text Translation", In Proceedings of VALA 2006: 13th Biennial Conference and Exhibition, Melbourne, Australia, February 8-10, 2006.


[93] Ruvan Weerasinghe (2004), "A Statistical Machine Translation Approach to Sinhala-Tamil Language Translation", In SCALLA 2004.

[94] Vasu Renganathan (2002), "An Interactive Approach to Development of English to Tamil Machine Translation System on the Web", INFITT, Tamil Internet 2002 (TI2002).

[95] Germann U (2001), "Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?", In Proceedings of the Workshop on Data-driven Methods in Machine Translation, pp. 1-8, ACL, Morristown, NJ, USA.

[96] Fredric C. Gey (2002), "Prospects for Machine Translation of the Tamil Language", In Proceedings of Tamil Internet 2002, California, USA.

[97] Chellamuthu (2001), "Russian Language to Tamil Machine Translation System", INFITT, Tamil Internet 2001 (TI2001).

[98] Harshawardhan R, Mridula Sara Augustine and Soman K P (2011), "Phrase Based English-Tamil Translation System by Concept Labeling Using Translation Memory", International Journal of Computer Applications (0975-8887), Volume 20, No. 3, April 2011.

[99] Saravanan S, Menon A G and Soman K P (2010), "English to Tamil Machine Translation System", INFITT 2010, Coimbatore.

[100] Brown P, Cocke J, Della Pietra S, Della Pietra V, Jelinek F, Mercer R and Roossin P (1990), "A Statistical Approach to Machine Translation", Computational Linguistics, 16(2):79-85.

[101] Och F J (1999), "An Efficient Method for Determining Bilingual Word Classes", In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[102] Daniel Marcu and William Wong (2002), "A Phrase-Based, Joint Probability Model for Statistical Machine Translation", In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP-2002), Philadelphia, PA, July 6-7, 2002.

[103] Philipp Koehn, Franz Josef Och and Daniel Marcu (2003), "Statistical Phrase-Based Translation", In Proceedings of HLT/NAACL 2003.

[104] Kenji Yamada and Kevin Knight (2001), "A Syntax-based Statistical Translation Model", In Proceedings of ACL 2001, pp. 523-530.

[105] Graehl J and Knight K (2004), "Training Tree Transducers", In Proceedings of HLT-NAACL 2004: Main Proceedings, pp. 105-112, Boston, Massachusetts, USA, May 2-7, 2004.

[106] Melamed (2004), "Statistical Machine Translation by Parsing", In the Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 653-660.

[107] Imamura K, Okuma H and Sumita E (2005), "Practical Approach to Syntax-based Statistical Machine Translation", In Proceedings of MT Summit X, pp. 267-274.

[108] Sonja Nießen and Hermann Ney (2004), "Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information", Computational Linguistics, 30(2), pp. 181-204.

[109] Maja Popovic and Hermann Ney (2006), "Statistical Machine Translation with a Small Amount of Bilingual Training Data", 5th LREC SALTMIL Workshop on Minority Languages, pp. 25-29.

[110] Michael Collins, Philipp Koehn and Ivona Kučerová (2005), "Clause Restructuring for Statistical Machine Translation", In Proceedings of ACL 2005, pp. 531-540.

[111] Sahar Ahmadi and Saeed Ketabi (2011), "Translation Procedures and Problems of Color Idiomatic Expressions in English and Persian", Journal of International Social Research, Volume 4, Issue 17.

[112] Martine Smets, Joseph Pentheroudakis and Arul Menezes (2005), "Translation of Verbal Idioms", Microsoft Research.


[113] Breidt E, Segond F and Valetto G (1996), "Local Grammars for the Description of Multi-word Lexemes and Their Automatic Recognition in Texts", In Proceedings of COMPLEX96, Budapest.

[114] Karageorgakis P, Potamianos A and Klasinas I (2005), "Towards Incorporating Language Morphology into Statistical Machine Translation Systems", In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU).

[115] Koehn P (2002), "Europarl: A Multilingual Corpus for Evaluation of Machine Translation", Draft, Unpublished.

[116] Sultan Soha (2011), "Applying Morphology to English-Arabic Statistical Machine Translation", Master's Thesis Nr. 11, ETH Zurich in collaboration with Google Inc.

[117] Adria de Gispert Ramis (2006), "Introducing Linguistic Knowledge into Statistical Machine Translation", Ph.D. thesis, TALP Research Center, Speech Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.

[118] Ann Clifton (2010), "Unsupervised Morphological Segmentation for Statistical Machine Translation", Master of Science thesis, Simon Fraser University.

[119] Sara Stymne (2009), "Compound Processing for Phrase-Based Statistical Machine Translation", Licentiate thesis, Linköping University, Sweden.

[120] Rabih M. Zbib (2010), "Using Linguistic Knowledge in Statistical Machine Translation", Ph.D. thesis, Massachusetts Institute of Technology, September 2010.

[121] Lee Y S (2004), "Morphological Analysis for Statistical Machine Translation", Defense Technical Information Center.

[122] Elena Irimia and Alexandru Ceausu (2010), "Dependency-based Translation Equivalents for Factored Machine Translation", 11th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2010).


[123] Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali (2010), "A Discriminative Approach for Dependency Based Statistical Machine Translation".

[124] Loganathan R (2010), "English-Tamil Machine Translation System", Master of Science by Research thesis, Amrita Vishwa Vidyapeetham, Coimbatore.

[125] Kumaran A and Tobias Kellner (2007), "A Generic Framework for Machine Transliteration", In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 721-722.

[126] Mohammad Afraz and Sobha L (2008), "English to Dravidian Language Machine Transliteration: A Statistical Approach Based on N-grams", In Proceedings of the International Seminar on Malayalam and Globalization, Trivandrum, Kerala.

[127] Srinivasan Janarthanam, Sethuramalingam S and Udhyakumar Nallasamy, "Named Entity Transliteration for Cross Language Information Retrieval Using Compressed Word Format Algorithm", 2nd International ACM Workshop on Improving Non-English Web Searching (iNEWS-08), California.

[128] Vijaya M S, Shivapratap G and Soman K P (2010), "English to Tamil Transliteration Using One Class Support Vector Machine", International Journal of Applied Engineering Research, Volume 5, Number 4, pp. 641-652.

[129] Rajendran S (2006), "Parsing in Tamil – Present State of Art", Language in India, www.languageinindia.com, Volume 6:8.

[130] Sobha L and Vijay Sundar Ram R (2006), "Noun Phrase Chunker for Tamil", In Proceedings of the Symposium on Modeling and Shallow Parsing of Indian Languages, Indian Institute of Technology, Mumbai, pp. 194-198.

[131] Menon A G, Saravanan S, Loganathan R and Soman K P (2009), "Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach", Tamil Internet Conference (TIC 2009), Cologne, Germany, pp. 239-243.


[132] http://en.wikipedia.org/wiki/Tamil_grammar

[133] kiriyAvin thaRkAla thamiz akarAthi (2006), Cre-A, pp. 230-231.

[134] Lehmann Thomas (1983), "A Grammar of Modern Tamil", Pondicherry: Pondicherry Institute of Linguistics and Culture.

[135] http://en.wikipedia.org/wiki/Machine_learning

[136] Vapnik V (1998), "Statistical Learning Theory", Wiley & Sons, Inc., New York.

[137] Soman K P, Shyam Diwakar and Ajay V (2008), "Insight into Data Mining: Theory and Practice", Prentice Hall of India, pp. 174-198.

[138] Soman K P, Ajay V and Loganathan R (2009), "Machine Learning with SVM and Other Kernel Methods", Prentice-Hall India, ISBN: 978-81-203-3435-9.

[139] Harris Z S (1962), "String Analysis of Sentence Structure", Mouton, The Hague.

[140] Voutilainen Atro (1995), "A Syntax-based Part-of-speech Analyzer", EACL 1995.

[141] Schmid H (1994), "Probabilistic Part-of-Speech Tagging Using Decision Trees", Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK.

[142] Stanfill C and Waltz D (1986), "Toward Memory-based Reasoning", Communications of the ACM, Vol. 29, pp. 1213-1228.

[143] Jakob Elming (2008), "Syntactic Reordering in Statistical Machine Translation", PhD thesis, Copenhagen Business School.

[144] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu (2002), "BLEU: A Method for Automatic Evaluation of Machine Translation", In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, PA, pp. 311-318.

[145] George Doddington (2002), "Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics", In Proceedings of the ARPA Workshop on Human Language Technology.

[146] Levenshtein V I (1966), "Binary Codes Capable of Correcting Deletions, Insertions and Reversals", Soviet Physics Doklady, 10(8), pp. 707-710, February.

[147] Nießen S, Och F J, Leusch G and Ney H (2000), "An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research", In Proceedings of the Second International Conference on Language Resources and Evaluation, pp. 39-45, Athens, Greece, May.

[148] Christoph Tillmann, Stefan Vogel, Hermann Ney and Alex Zubiaga (1997), "A DP-based Search Using Monotone Alignments in Statistical Translation", In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics, pp. 289-296, Somerset, New Jersey.

[149] Snover M, Dorr B, Schwartz R, Micciulla L and Makhoul J (2006), "A Study of Translation Edit Rate with Targeted Human Annotation", In Proceedings of the Association for Machine Translation in the Americas, pp. 223-231.

[150] http://nlp.stanford.edu/software/lex-parser.shtml

[151] Fei Xia and Michael McCord (2004), "Improving a Statistical MT System with Automatically Learned Rewrite Patterns", In Proceedings of the 20th International Conference on Computational Linguistics (COLING '04), pp. 508-514, Geneva, Switzerland, August 2004, Association for Computational Linguistics.

[152] Michael Collins, Philipp Koehn and Ivona Kučerová (2005), "Clause Restructuring for Statistical Machine Translation", In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL '05), pp. 531-540, Ann Arbor, Michigan, USA, June 2005, Association for Computational Linguistics.

[153] Marta Ruiz Costa-jussà (2006), "On Developing Novel Reordering Algorithms for Statistical Machine Translation", Ph.D. thesis, Speech Processing Group, Department of Signal Theory and Communications, Universitat Politècnica de Catalunya.

[154] Marta Ruiz Costa-jussà and Fonollosa J A R (2009), "State-of-the-art Word Reordering Approaches in Statistical Machine Translation", IEICE Transactions on Information and Systems, Vol. 92, No. 11, pp. 2179-2185, November 2009.

[155] Ananthakrishnan Ramanathan, Pushpak Bhattacharya, Jayprasad Hegde, Ritesh M.Shah, and Sasikumar M. 2008. Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation. In IJCNLP 2008, Hyderabad, India. Rochester, NY, April [156] Schiffman, Harold (1999) A Reference Grammar of Spoken Tamil. Cambridge University Press. [157] Simon Zwarts and Mark Dras. 2007. Syntax-Based Word Reordering in PhraseBased Statistical Machine Translation: Why Does it Work? In Proceedings of MT Summit XI, pages 559–566. [158] Marie-Catherine de Marneffe and Christopher D. Manning (2008), “Stanford typed dependencies manual”. [159] Rajendran S, (2007) Complexity of Tamil in POS tagging, Language in India, Jan 2007. [160] Akshar Bharati, Rajeev Sangal, Dipti Misra Sharma and Lakshmi Bai (2006), “AnnCorra: Annotating Corpora Guidelines for POS and Chunk Annotation for Indian Languages”, Language Technologies Research Centre IIIT, Hyderabad. [161] http://www.au-kbc.org/research_areas/nlp/projects/postagger.html. [162] http://www.infitt.org/ti2001/papers/vasur.pdf. [163] http://shiva.iiit.ac.in/SPSAL2007/SPSAL-Proceedings.pdf. [164] http://www.ldcil.org/up/conferences/pos%20tag/presentation.html. [165] http://www.infitt.org/ti2009/papers/ganesan_m_final.pdf [166] http://tdil.mit.gov.in/Tamil-AnnaUniversity-ChennaiJuly03.pdf [167] Rajan K (2002), “Corpus Analysis And Tagging for Tamil”, Annamalai University, Annamalai nagar. [168] http://www.cs.waikato.ac.nz/~ml/weka/


[169] Daelemans Walter (2004), "Morphology", in G. Booij, Ch. Lehmann and J. Mugdan (eds.), Morphology: A Handbook on Inflection and Word Formation, Berlin and New York: Walter de Gruyter, pp. 1893-1900.

[170] Ramaswami N (2001), "Lexical Formatives and Word Formation Rules in Tamil", Volume 1:8, December 2001.

[171] Hal Daumé (2006), http://nlpers.blogspot.com/2006/11/getting-started-in-sequence-labeling.html

[172] Brown P, Cocke J, Della Pietra S, Della Pietra V, Jelinek F, Mercer R and Roossin P (1990), "A Statistical Approach to Machine Translation", Computational Linguistics, 16(2):79-85.

[173] Jurafsky Daniel and James H. Martin (2009), "Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics", 2nd edition, Prentice-Hall.

[174] Philipp Koehn and Kevin Knight (2001), "Knowledge Sources for Word-Level Translation Models", In Proceedings of EMNLP 2001.

[175] Philipp Koehn, Franz Josef Och and Daniel Marcu (2003), "Statistical Phrase-Based Translation", In Proceedings of HLT/NAACL 2003.

[176] Kenji Yamada and Kevin Knight (2001), "A Syntax-based Statistical Translation Model", In Proceedings of ACL 2001, pp. 523-530.

[177] Graehl J and Knight K (2004), "Training Tree Transducers", In Proceedings of HLT-NAACL 2004: Main Proceedings, pp. 105-112, Boston, Massachusetts, USA, May 2-7, 2004.

[178] Melamed (2004), "Statistical Machine Translation by Parsing", In the Companion Volume to the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 653-660.

[179] Imamura K, Okuma H and Sumita E (2005), "Practical Approach to Syntax-based Statistical Machine Translation", In Proceedings of MT Summit X, pp. 267-274.

[180] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch et al. (2007), "Moses: Open Source Toolkit for Statistical Machine Translation", In Proceedings of ACL, Demonstration Session.

[181] Och F J (1999), "An Efficient Method for Determining Bilingual Word Classes", In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL).

[182] Och F J and Ney H (2003), "A Systematic Comparison of Various Statistical Alignment Models", Computational Linguistics, 29(1):19-51.

[183] Stolcke A (2002), "SRILM – An Extensible Language Modelling Toolkit", In Proceedings of the ICSLP.

[184] Garrido Alicia, Amaia Iturraspe, Sandra Montserrat et al. (1999), "A Compiler for Morphological Analysers and Generators Based on Finite State Transducers", Procesamiento del Lenguaje Natural, 25:93-98.

[185] Guido Minnen, John Carroll and Darren Pearce (2000), "Robust Applied Morphological Generation", Proceedings of the First International Natural Language Generation Conference, pp. 201-208, June 12-16.

[186] Goyal V and Singh Lehal G (2008), "Hindi Morphological Analyzer and Generator", Emerging Trends in Engineering and Technology (ICETET '08).

[187] Madhavi Ganapathiraju and Lori Levin (2006), "TelMore: Morphological Generator for Telugu Nouns and Verbs", Proceedings of the Second International Conference on Universal Digital Library, Alexandria, Egypt, November 17-19, 2006.

[188] Reyyan Yeniterzi and Kemal Oflazer (2010), "Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish", In Proceedings of ACL 2010, Uppsala, Sweden, July 2010.

[189] Hoifung Poon, Colin Cherry and Kristina Toutanova (2009), "Unsupervised Morphological Segmentation with Log-Linear Models", In Proceedings of NAACL-HLT 2009, Association for Computational Linguistics, June 2009.


[190] Jan Hajič and Barbora Vidová-Hladká (1998), "Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset", In Proceedings of the Conference COLING-ACL '98.


PUBLICATIONS

International Journals

1. Anand Kumar M, Dhanalakshmi V, Rekha R U, Soman K P and Rajendran S, "A Novel Data Driven Algorithm for Tamil Morphological Generator", International Journal of Computer Applications (IJCA) - Foundation of Computer Science, 6(12):52-56, 2010.

2. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S, "A Sequence Labeling Approach to Morphological Analyzer for Tamil Language", International Journal on Computer Science and Engineering (IJCSE), Vol. 02, No. 06, pp. 2201-2208, 2010.

3. Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S, "Natural Language Processing Tools for Tamil Grammar Learning and Teaching", International Journal of Computer Applications (IJCA) - Foundation of Computer Science, October 2010.

4. Dhanalakshmi V, Anand Kumar M, Shivapratap G, Soman K P and Rajendran S (2009), "Tamil POS Tagging using Linear Programming", International Journal of Recent Trends in Engineering, Vol. 1, No. 2, ISSN 1797-9617.

5. Antony P John, Anand Kumar M and Soman K P, "A Paradigm Based Morphological Analyzer for English to Kannada using a Machine Learning Approach", Research India Publication (RIP), October 2010.

6. Poornima C, Dhanalakshmi V, Anand Kumar M and Soman K P, "Rule based Sentence Simplification for English to Tamil Machine Translation System", International Journal of Computer Applications (IJCA) - Foundation of Computer Science, August 2011.

7. Tirumeni, Anand Kumar M, Dhanalakshmi V and Soman K P, "An Approach to handle Idioms and Phrasal verbs in English to Tamil Machine Translation System", International Journal of Computer Applications (IJCA) (Impact Factor 0.835) - Foundation of Computer Science, July 2011.


8. Anand Kumar M, Dhanalakshmi V and Soman K P, "Factored Statistical Machine Translation System for English to Tamil using Tamil Linguistic Tools", Journal of Computer Science, Science Publications. [Indexed by IET, ISI Thomson Scientific Index, SCOPUS]. (Accepted for publication.)

International Conferences

1. Anand Kumar M, Dhanalakshmi V, Rekha R U, Soman K P and Rajendran S (2010), "A Novel Algorithm for Tamil Morphological Generator", 8th International Conference on Natural Language Processing (ICON-2010), IIT Kharagpur, India. (Received the Best Second Paper award.)

2. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2009), "A Novel Approach for Tamil Morphological Analyzer", Proceedings of the 8th Tamil Internet Conference 2009, Cologne, Germany.

3. Dhanalakshmi V, Anand Kumar M, Vijaya M S, Loganathan R, Soman K P and Rajendran S (2008), "Tamil Part-of-Speech Tagger based on SVMTool", Proceedings of the International Conference on Asian Language Processing 2008 (IALP 2008), Chiang Mai, Thailand.

4. Dhanalakshmi V, Anand Kumar M, Rekha R U, Arun Kumar C, Soman K P and Rajendran S (2009), "Morphological Analyzer for Agglutinative Languages Using Machine Learning Approaches", Proceedings of the International Conference on Advances in Recent Technologies in Communication and Computing, India. Papers are archived in the IEEE Xplore and IEEE CS Digital Library. (Indexed in SCOPUS)

5. Dhanalakshmi V, Anand Kumar M, Padmavathy P, Soman K P and Rajendran S (2009), "Chunker for Tamil", Proceedings of the International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom 2009), India. Papers are archived in the IEEE Xplore and IEEE CS Digital Library. (Indexed in SCOPUS)

6. Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2009), "POS Tagger and Chunker for Tamil Language", Proceedings of the 8th Tamil Internet Conference, Cologne, Germany.

7. Dhanalakshmi V, Padmavathy P, Anand Kumar M, Soman K P and Rajendran S (2009), "Chunker for Tamil using Machine Learning", 7th International Conference on Natural Language Processing (ICON-2009), IIIT Hyderabad.

8. Anand Kumar M, Dhanalakshmi V, Rekha R U, Soman K P and Rajendran S (2010), "Morphological Generator for Tamil: A New Data Driven Approach", 9th Tamil Internet Conference, Chemmozhi Maanaadu, Coimbatore, India.

9. Dhanalakshmi V, Anand Kumar M, Rekha R U, Soman K P and Rajendran S (2010), "Grammar Teaching Tools for Tamil", Technology for Education Conference (T4E), IIT Bombay, India. Papers are archived in the IEEE Computer Society. (Indexed in SCOPUS)

10. Rekha R U, Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2010), "A Novel Approach to Morphological Generator for Tamil", 2nd International Conference on Data Engineering and Management (ICDEM 2010), Trichy, India. Conference proceedings published in Lecture Notes in Computer Science (LNCS), Springer Verlag, Germany. (Indexed in SCOPUS)

11. Abeera V P, Aparna S, Rekha R U, Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2010), "Morphological Analyzer for Malayalam Using Machine Learning", 2nd International Conference on Data Engineering and Management (ICDEM 2010), Trichy, India. Conference proceedings published in Lecture Notes in Computer Science (LNCS), Springer Verlag, Germany. (Indexed in SCOPUS)

12. Kiranmai G, Mallika K, Anand Kumar M, Dhanalakshmi V and Soman K P (2010), "Morphological Analyzer for Telugu using Support Vector Machine", International Conference on Advances in Information and Communication Technologies (ICT 2010), Kochi, India. The proceedings are published in Springer CCIS and are available in the Springer Digital Library. (Indexed in SCOPUS)


13. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S (2011), "Morphology based Factored Statistical Machine Translation System from English to Tamil", INFITT 2011, held at the University of Pennsylvania, Philadelphia, USA, June 17-19, 2011.

14. Dhanalakshmi V, Anand Kumar M, Soman K P and Rajendran S (2011), "Shallow Parser for Tamil", INFITT 2011, held at the University of Pennsylvania, Philadelphia, USA, June 17-19, 2011.

15. Keerthana S, Dhanalakshmi V, Anand Kumar M and Soman K P (2011), "Tamil to Hindi Machine Transliteration Using Support Vector Machines", International Joint Conference on Advances in Signal Processing and Information Technology (SPIT 2011). The proceedings are published by Springer and are available in the Springer Digital Library.

16. Dhivya R, Dhanalakshmi V, Anand Kumar M and Soman K P (2011), "Clause Boundary Identification for Tamil Language Using Dependency Parsing", International Joint Conference on Advances in Signal Processing and Information Technology (SPIT 2011). The proceedings are published by Springer and are available in the Springer Digital Library.

The following paper has been accepted but not yet published:

1. Anand Kumar M, Dhanalakshmi V, Soman K P and Rajendran S, "English to Tamil Factored Statistical Machine Translation using Morphological Tools", International Conference on Asian Language Processing 2011 (IALP 2011), jointly organized by the Chinese and Oriental Languages Information Processing Society (COLIPS) of Singapore and the IEEE Singapore Computer Chapter. Conference proceedings will be included in the IEEE Xplore digital library.
