English to Marathi Rule-based Machine Translation of Simple

0 downloads 0 Views 486KB Size Report
word grouping is performed based on the morphological ... Word Grouping, Local Word Separator. ... in their regional language such Hindi, Bengali, Marathi,.
> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < on the Table of the house, Bulletin Part-I and Part-II. Anglabharati(2001) and Anglabharti-II(2004) MT system developed by R.M.K. Sinha et al for English to Indian languages. Shiva and Shakti MT System (2003) designed using an Example-based and combination of rule based and statistical approaches. The Shakti system works for three target languages Hindi, Marathi and Telgu. Shiva and Shakti are the two Machine Translation systems from English to Hindi developed jointly by Carneige Mellon University of USA, IIIT Hyderabad and IISc, Bangalore, India. The MaTra System (2004, 2006) developed by Ananthakrishnan R et al uses transfer-based approach to translate news, annual reports and technical phrases from English to Hindi. A consortium of Nine institutions namely C-DAC Mumbai, IISc Bangalore, IIIT Hyderabad, C-DAC Pune, IIT Mumbai, Jadavpur University Kolkata, IIIT Allahabad, Utkal University Bangalore, Amrita University Coimbatore and Banasthali Vidyapeeth, Banasthali are developing EILMT(2006-) a MT System for English to Indian Languages in Tourism and Healthcare Domains. This project is funded by Department of Information Technology, MCIT Government of India. They have developed Sampark MT system (2009). The role of CDAC Mumbai is to develop statistical models and resources for a statistical MT (SMT) system from English to Hindi/Marathi/Bengali. Rule based machine translation from English to Urdu using transfer approach is developed by Naila Ata et al at Karachi, Pakistan [5]. They handled case phrases and verb postpositions through concept of Pannian Grammar. Dilshan De Silva et al have developed Sinhala to English translator with various inbuilt tools like grammar tool, dictionary, Unicode fonts, debugging tool, add word tool [3]. There are many more MT systems developed in India and abroad. III. SYSTEM OVERVIEW English language has a SVO (Subject-Verb-Object) structure whereas Marathi language follows SOV structure and is relatively free word order [1][8]. The overall architecture of system is depicted in Fig. 1. The input to the system is simple assertive English sentence like ‘He is going to school’. Lexical Analyzer splits the input sentence into tokens/lexemes separated by delimiters. Lexicon (Dictionary) contains a set of known words along with their complete morphological information such as its root, category, case, gender, number etc. Morphological Analyzer accepts a token and checks whether that tokens is present in the lexicon or not. If the token is present, system retrieves complete morphological information about it. IV. COMPONENT OF SYSTEM The proposed system composed of following components  Lexical Analyzer  Morphological Analyzer  Parser

2

Fig. 1. Architecture of proposed system

   

Mapping module Local word Grouper Tokens rearrangement Transliteration

A. Lexical Analyzer This module splits the given English sentence into the tokens and removes delimiters. Input: Sentence Output: Tokens/Lexemes. B. Morphological Analyzer Given a word, the morphological analyzer identifies the root and the grammatical features of the word. For languages that are not rich in inflections, a simple lookup dictionary that contains all the word forms would be sufficient. But creating such a dictionary for inflectionally rich languages is nontrivial and requires huge storage and high performance computing. The best alternative is to have a dictionary of root words and attaching the grammatical prefixes/suffixes is taken care in the target language generation. Part of Speech Tagging (POS) is the process of assigning a part of speech to each word in the sentence. Identification of the parts of speech such as nouns, verbs, adjectives, adverbs for each word of the sentence helps in analyzing the role of each constituent in a sentence Input: Token Output: Gender, Number, Person, Tense, Root. Example: Input: He is going to school Output of morphological analyzer: He [Root : he

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < Gender : male Person : third Number : singular Category: pronoun] is [ Root : is Category: auxiliary] going [Root : go Category: verb ] to [Root : to Category: Preposition school [Root : school Category : noun Gender : Female Number : singular Person : third Type: common ] C. Shallow Parser This involves identifying simple noun phrases, verb groups, adjectival phrase, and adverbial phrase in a sentence. It also involves identifying the boundary of chunks and the labeling. Normally in language processing, sentences are parsed to identify the syntactic structure of the sentence. There are more similarities than differences between Indian languages. For example Marathi and Hindi language pair does not require a full parse. In this system, MT would be performed without a full sentential parser. The structural transformation is required when the source language structure does not have an equivalent structure in the target language. A partial parse or shallow parse is sufficient to identify the specific constituents in the sentence that has to undergo transformation. This component will also include the task of transliteration. The transliteration is done among Indian languages. Transliteration allows a word or words to be rendered in the script of the reader. Input: Tokens Output: Syntactically checked sentence D. Mapping Module In this module, each English word is mapped to the corresponding Marathi word. Verbs and Nouns are attached pratyay according to Gender Number Person and Tense. Adjectives are attached pratyay according to the Noun and Pronoun E. Local word Grouper Adjectives are grouped with their corresponding Nouns and Pronouns. (All need not be present in every case). If only pronoun or only noun exists, then it is considered as one group. Adverbs are grouped with their corresponding Verbs. Adverbs may be absent. In such cases a single verb may act as one group. Prepositions are grouped with the respective Noun or Pronoun. Transforming Algorithm: In Local Word Grouper (LWG)[3], grouping of all the tokens after assigning morphological information to them will

3

be carried out. The Pronoun ‘your’ and Noun ‘name’ becomes Noun Phrase (NP) according to the rules given below. The NP and VP-PPS becomes NP-VP-PPS. Finally the Pronoun ‘what’, Auxiliary verb ‘is’ and NP-VP-PPS forms a sentence S. This way LWG takes place using bottom-up parsing technique. The rules used in LWG for “What is your name?” are as follows: PRONOUN: = he | NULL AUX: = is | NULL NP: = name | NULL NP: = PRONOUN NP VP: = VERB VERB: = NULL PP: = PREP NP PREP: = NULL VP-PPS: = VP PP NP-VP-PPS: = NP VP-PPS S: = PRONOUN AUX NP-PP-VPS

Fig. 2. Parse tree generated by syntax analysis

The structure of the valid sentence represented as parse tree is shown in Fig. 2. Syntactic analysis exploits the result of morphological analysis to build a structural description of the sentence. A shallow parser is designed for source language and the lexicon is built to store the root words of source language followed by target language. The rules mentioned above show a simple context free phrase structure grammar for English. Local Word Separator separates tokens from the sentence generated in LWG to search corresponding token in target language dictionary. Mapping Block maps the tokens of source language (English) to corresponding tokens of target language (Marathi).

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < Though Marathi is relatively free word order language, there are few units which occur in fixed order. Noun phrase (NP) and verb phrase (VP) can be formed using only local information and more importantly they provide sufficient information for further processing of the sentence. This is how the LWG provides all the necessary information with minimum computational efforts. F. Tokens rearrangement Algorithm The Local Word Groups need to be rearranged in order to generate a valid target language (Marathi) sentence. After going through a number of research papers and analyzing a set of sentences, we proposed an algorithm for rearrangement of words. After analyzing a number of sentences, a regular pattern is observed in most of the sentences. It is found that, by keeping the first group (noun phrase) as it is and then reversing the remaining groups in sequence; it produces a valid Marathi sentence. This technique is apt for most of the sentences tested using the algorithm presented in this paper. It is found that the rearrangement technique suggested in this paper is the most optimized and simplistic in terms of understanding and implementation as compared to other methods. This comprises of major part of research work. Algorithm: 1. 2. 3. 4.

5. 6. 7. 8.

Read in English Sentence Perform Lexical Analysis i.e. obtain tokens Perform morphological analysis i.e. retrieve each token from dictionary along with its morphology Check the syntax of the input sentence using production rules If (syntax is okay) { Retrieve corresponding Marathi tokens Perform mapping of English tokens to corresponding Marathi tokens Attach infections (pratyays) to Marathi tokens } else go to step 1 Apply Tokens rearrangement algorithm Perform transliteration (use any transliteration tool) Generate target language(Marathi) sentence End

Tokens rearrangement Algorithm 1. 2.

Traverse input sentence from left to right While(not end of sentence) { Keep position of NP/Noun intact and traverse a sentence by skipping NP/Noun till the end Reverse the sentence till NP/Noun Map source language tokens to target language tokens

4

} 3. End Example: Input English Sentence: ∣ I ∣ ∣ went ∣ ∣to∣ ∣ the ∣ ∣ market ∣ ∣ with ∣ ∣ my ∣ ∣ mother ∣ ∣ to ∣ ∣ buy ∣ ∣ apples∣. Word to Word Transliterated Mapping for Local Language: ∣ mI ∣ ∣ gelo ∣ ∣ lA ∣ ∣ bAjAra ∣ ∣ barobara ∣ ∣ mAzhyA ∣ ∣ AI ∣ NyAsAThi ∣ ∣ kharida ∣ ∣ saparchaMda ∣ Lexical Word Grouping: ∣ mI ∣ ∣ gelo ∣ ∣ bAjAralA ∣ ∣ mAzhyA AIbarobara ∣ ∣ kharidanyAsAThi ∣ ∣ saparchaMda ∣ After Applying Rearrangement Algorithm: ∣ mI ∣ ∣ saparchaMda ∣ ∣ kharidanyAsAThi ∣ ∣ mAzhyA AIbarobara ∣ ∣ bAjAralA ∣ ∣ gelo ∣ Translation in Marathi using Transliteration: . G. Transliteration After the rearrangement, text which is transliterated in Western script is converted into Devanagari script. There are many transliteration tools available to convert source language script to target language (English to Devanagari in this case). It is done using “Akshara Bridge” tool. V. MATHEMATICAL MODEL The overall process of a Rule Based Translation is carried out by using a following mathematical model. Terminologies used in the mathematical model are: STT - Source Translation Token equivalent to single word of English TTT - Target Translation Token is equivalent to word in Marathi. S  STT1, STT2 ,..., STTn  represented as a set of Tokens in English T  TTT1, TTT2, ...TTTn  represented as a set of Tokens in Marathi Ts  T is a subset of Token of the category Noun T v T is a subset of Token of category Verb To  T is a subset of Token of category Object Ta  T is a subset of Token of category Article

Tp  T is a subset of Token of category Preposition Tu  T is a subset of Token of category Auxiliary verb S  - Legitimate Input Sentence in the source language English T  - Output Sentence in the target language Marathi

> REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT)
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 3) It was done by his mother 4) My father told me to work myself for his movie

5) They themselves will be very enthusiastic for their nice and amazing movie

compound and complex sentences. REFERENCES [1]

[2]

[3]

6) I went to market with my mother to buy apples [4]

7) She has a nice doll having blue eyes 8) My favourite book was stolen

[5] [6]

9) My dog should go running for food [7]

10) I have been speaking with him 11) My father had been eating mango 12) I did it for my father

VIII. CONCLUSION AND FUTURE SCOPE This paper presents a system for machine translation of Simple English assertive sentences to their Marathi counterpart. It follows the transfer approach with rule-based translation and emphasizes on assertive sentences and reordering algorithm for target language generation. It is difficult to frame the generalized rules for Marathi because grammar of English and Marathi are out of line. The system is successfully tested on 115 different simple assertive sentences using our production rules and produced satisfactory results. The major challenge in Machine Translation is to resolve the ambiguity in the meaning of words in the sentence. e.g.- ‘He is standing near the bank?’ two possible contexts of the word ‘bank’- bank of river or the money bank. Resolving lexical and structural ambiguity would be big challenge for researchers. Grammar of the English Language sometimes allows the change in the sequence of words without changing the meaning of the sentence. e. g.- ‘Should Ram have gone to the store?’ can be written as ‘‘Should have Ram gone to the store?’. The former sentence is translated by our system correctly but the latter is not. To allow such flexibility, there is a need to make rules more generalized. Translation of simple assertive sentences discussed in this paper can be extended for other sub-types of simple sentences such as imperative, negative and exclamatory sentences. The scope can be further expanded for

6

[8]

[9]

Goraksh V. Garje, Manisha Marathe, Urmila Adsule, “Translation of Simple Interrogative English Sentences to Marathi Sentences”, proceedings of ICWET’10, Mumbai, Maharashtra, India February 26– 27, 2010, pp. G.V. Garje, G.K. Kharate,. (2013, Oct) “Survey of Machine Translation Systems in India”, International Journal on Natural Language Computing (IJNLC) [Online],Vol. 2, No.4, pp. Available: http:// Ray, P.; Harish V.; Sarkar, S.; and Basu, A.; “Part of Speech Tagging and Local Word Grouping Techniques for Natural Language Parsing in Hindi”; Proceedings of the 1st International Conference on Natural Language Processing (ICON), Mysore, India, 2003, pp. Dilshan De Silva, Asanga Alahakoon, Imesha Udayangani, Vishva Kumara, Devinda Kolonnage, Harindu Perera, and Samantha Thelijjagoda, “Sinhala to English Language Translator”, proceedings of 4th International Conference on Information and Automation for Sustainability(ICIAFS), Colombo, Srilanka, 2008, pp. Dr. Shridhar Shanvare, “Abhinav Marathi Vyakaran”, Marathi Lekhan’, Vidya Vikas Mandal, Nagpur. Naila Ata, Bushra Jawaid , Amir Kamarn, “ Rule based English to Urdu Machine Translation”, in proceedings of Conference on Language and Technology(CLT’07], University of Karachi, Pakistan, 2007, pp. Rajiv Sangal, Vineet Chaitanya, “Natural Language Processing- a Paninian Perspective”, Akshar Bharati Group, PHI publication. R.M.K. Sinha, A.Jain, “AnglaHindi: an English to Hindi MachineAided Translation System”, in proceedings of the 9th MT Summit(MTS), New Orieans, USA, Sep. 23-27, 2003, pp. 494-497 Uzair Muhammad, Atif Khan, “Handling Proper nouns in machine translation from English into Urdu”, Journal of Information & Communication Technology, Vol. 1, No. 2, 2007, pp. 56-64.