
Fourth International Conference on Industrial and Information Systems, ICIIS 2009, 28 - 31 December 2009, Sri Lanka

Theoretical based approach to English to Sinhala machine translation

B. Hettige (1), A. S. Karunananda (2)
(1) Department of Statistics and Computer Science, University of Sri Jayewardenepura, Sri Lanka
(2) Faculty of Information Technology, University of Moratuwa, Sri Lanka

Abstract: Machine translation is a long-felt need in countries that use English as a second language. Owing to the inherent complexity of natural languages, most machine translation systems adopt rather ad hoc strategies, such as word-level translation, without a proper theoretical basis. This has been a major reason why developments in natural language processing have not achieved what was expected. This paper presents a theory-based approach to English-Sinhala machine translation through the concept of Varanagema (conjugation) in the Sinhala language. The theory of Varanagema handles major language primitives, including nouns, verbs and prepositions. The concept of Varanagema also drastically reduces the number of word forms that must be stored in the dictionaries of the machine translation system. The design, implementation and results of test versions of the English-Sinhala machine translation system are presented in the paper.

I. INTRODUCTION

Use of the mother tongue is essential for creative thinking and for the acquisition of world knowledge. As such, in the recent past, more attention has been given to developing machine translation systems from English to other languages than to having everybody learn English. However, Sri Lanka is a latecomer to this line of research. In Sri Lanka, the Sinhala language is spoken by about 16 million people. Sinhala is one of the constitutionally recognized official languages of Sri Lanka, along with Tamil. However, 80% of Sinhala speakers do not have the ability to read and write English well [35]. This has become a language barrier that hinders the dissemination of world knowledge to rural communities with limited knowledge of English. It also prevents the intellectual contributions of rural communities from flowing into the development of the world. This issue is not confined to Sri Lanka; it appears in many parts of the world. Computer-based machine translation from English to other languages would arguably be the most practical solution to this issue.

Machine translation is a process that translates text or speech from one natural language to another. Nowadays, many machine translation systems have been developed for different languages. Among others, Apertium [25], Moses [30][31], Google Translate [27], Babel Fish [28], Bing Translator [34] and SYSTRAN [32][33] are well-known machine translation systems worldwide. In the region, Anusaaraka [1][2], AnglaHindi [4], ManTra [6] and AnglaBharti [5] belong to the Indian family of machine translation systems. On the other hand, perhaps, EDR [36],

978-1-4244-4837-1/09/$25.00 ©2009 IEEE

the machine translation system developed in Japan, is the most complete system so far. These translation systems use various approaches to machine translation, including human-assisted translation, rule-based translation, statistical translation and example-based translation. However, owing to various reasons associated with the complexity of languages, machine translation has been identified for more than fifty-five years as one of the least successful areas in computing. Most of these issues are associated with semantic handling in machine translation systems.

The human-assisted machine translation approach is more practical for machine translation projects in their early stages. This approach is also more suitable for local languages that do not have large lexical resources, such as large parallel corpora and large lexical dictionaries. However, such approaches are often rather ad hoc, without a proper theoretical basis. We have been working on a project to develop an English to Sinhala machine translation system that is governed by the theory of Varanagema (conjugation) in the Sinhala language. In this project we have already developed a Sinhala parser [7], an intermediate editor [11], a Sinhala morphological analyzer [8], three lexical dictionaries [9] and a transliteration module [10]. Each of these modules and their prototype integrations have been tested through several real-world applications [11][12][13]. In order to demonstrate the overall features of our machine translation system, this paper reports on the theory behind the approach and the system integration in detail.

The rest of this paper is organized as follows. Section 2 gives an overview of some existing machine translation systems. Section 3 describes some problems faced in English to Sinhala machine translation. Section 4 gives the complete design of the English to Sinhala machine translation system. Section 5 shows how the translation engine works for a given English sentence.
Finally, Section 6 concludes the paper with a note on further work.

II. EXISTING MACHINE TRANSLATION SYSTEMS

In a broad sense, machine translation approaches can be classified into three categories, namely the statistical approach, the example-based approach and the rule-based approach. The statistical approach uses statistics, such as the mean and variance, computed over bilingual text corpora to find the most appropriate translation. The example-based approach is often characterized by its use of a bilingual corpus with parallel texts as its main knowledge base. The rule-based approach requires extensive lexicons with morphological, syntactic and semantic information, and large sets of rules.

Therefore, a rule-based machine translation system typically contains a source language morphological analyzer, a source language parser, a translator, a target language morphological analyzer, a target language parser and several lexical dictionaries. The source language morphological analyzer analyzes a source language word and provides morphological information. The source language parser is a syntax analyzer that analyzes a source language sentence. The translator translates a source language word into the target language. The target language morphological analyzer works as a generator: it generates appropriate target language words for given grammatical information. Similarly, the target language parser works as a composer: it composes a suitable target language sentence.

Furthermore, this type of machine translation system needs a minimum of three dictionaries: the source language dictionary, the bilingual dictionary and the target language dictionary. The source language morphological analyzer needs the source language dictionary for morphological analysis. The bilingual dictionary is used by the translator to translate from the source language into the target language, and the target language morphological generator uses the target language dictionary to generate target language words. For English to Sinhala machine translation, the system therefore needs an English dictionary, an English-Sinhala bilingual dictionary and a Sinhala dictionary.

A large number of machine translation systems have been developed under the above three broad headings. For instance, Apertium [25][26] is a rule-based MT system that translates between related languages.
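
The generic rule-based architecture outlined earlier in this section (analyzer, parser, translator, generator, composer and three dictionaries) can be illustrated with a minimal Python sketch. This is only a toy under stated assumptions, not the system described in this paper: every function name, the dictionary layout, the naive subject-verb-object parser and the romanized Sinhala stand-in words are hypothetical and chosen purely for illustration.

```python
# Toy sketch of a rule-based MT pipeline. All names and entries are
# hypothetical; the romanized target words are illustrative stand-ins only.

# Source language (English) dictionary: surface form -> (lemma, part of speech)
ENGLISH_DICT = {
    "the": ("the", "DET"),
    "a": ("a", "DET"),
    "boy": ("boy", "NOUN"),
    "book": ("book", "NOUN"),
    "reads": ("read", "VERB"),
}

# Bilingual dictionary: English lemma -> target language base word
BILINGUAL_DICT = {"boy": "lamaya", "book": "pota", "read": "kiyavanava"}

# Target language dictionary: base word -> surface form to emit.
# A real generator would inflect; here each base word stores one form.
SINHALA_DICT = {"lamaya": "lamaya", "pota": "pota", "kiyavanava": "kiyavanava"}


def analyze(word):
    """Source morphological analyzer: surface word -> (lemma, part of speech)."""
    return ENGLISH_DICT[word.lower()]


def parse(analyzed):
    """Source parser: pick subject, verb and object from a simple SVO clause."""
    verb_index = next(i for i, (_, pos) in enumerate(analyzed) if pos == "VERB")
    subject = next(lemma for lemma, pos in analyzed[:verb_index] if pos == "NOUN")
    obj = next(lemma for lemma, pos in analyzed[verb_index + 1:] if pos == "NOUN")
    return {"subject": subject, "verb": analyzed[verb_index][0], "object": obj}


def translate(lemma):
    """Translator: look up the target base word in the bilingual dictionary."""
    return BILINGUAL_DICT[lemma]


def generate(base):
    """Target morphological generator: base word -> target surface form."""
    return SINHALA_DICT[base]


def compose(parts):
    """Target composer: arrange the words in SOV order."""
    return " ".join(parts[role] for role in ("subject", "object", "verb"))


def translate_sentence(sentence):
    analyzed = [analyze(word) for word in sentence.split()]
    roles = parse(analyzed)
    target = {role: generate(translate(lemma)) for role, lemma in roles.items()}
    return compose(target)


print(translate_sentence("The boy reads a book"))
```

Each function corresponds to one component named in the text, and each dictionary is consulted by exactly one component, which mirrors the division of labour among the three dictionaries. The sketch drops determiners and performs no inflection, so it shows only the flow of data between modules.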
Apertium is an open-source system that can be used to translate between any two related languages. This MT engine follows a shallow-transfer approach and consists of eight pipelined modules: a de-formatter, a morphological analyzer, a part-of-speech (PoS) tagger, a lexical transfer module, a structural transfer module, a morphological generator, a post-generator and a re-formatter. Moses [30] is a free-software statistical machine translation engine that allows translation models to be trained automatically for any language pair, given a collection of source and target text pairs (a parallel corpus) [31]. Google Translate [27] translates a section of text, or a web page, into another language. It does not always deliver accurate translations and does not apply grammatical rules, since its algorithms are based on statistical analysis rather than traditional rule-based analysis. Babel Fish [28] is a web-based application developed by AltaVista that translates text or web pages from one of several languages into another. The translation technology for Babel Fish is provided by SYSTRAN [32], whose technology also powers the translator at Google and a number of other

sites. It can translate among English, Simplified Chinese, Traditional Chinese, Dutch, French, German, Greek, Italian, Japanese, Korean, Portuguese, Russian and Spanish. A number of sites have sprung up that use the Babel Fish service to translate back and forth between two or more languages. Bing Translator [34] is a service provided by Microsoft as part of its Bing services that allows users to translate texts or entire web pages into different languages. All translation pairs are powered by Microsoft Translation, developed by Microsoft Research, which uses Microsoft's own syntax-based statistical machine translation technology.

Anusaaraka [1][2] is a popular machine-aided translation system for Indian languages that makes text in one Indian language accessible in another Indian language. This system uses the Paninian grammar model [3] for its language analysis. The Anusaaraka project has been developed to translate Punjabi, Bengali, Telugu, Kannada and Marathi into Hindi. The approach and lexicon are general, but the system has mainly been applied to children's stories. AnglaBharti [5] is also a human-aided machine translation system used in India. Since India has many languages, there are a variety of machine translation systems. For example, AnglaHindi [5] translates English to Hindi using a machine-aided translation methodology. The human-aided machine translation approach is a common feature of most Indian machine translation systems. In addition, these systems use both pre-editing and post-editing as the means of human intervention in the machine translation process.

Electronic Dictionary Research (EDR) [36], from Japan, is perhaps the most successful machine translation system. This system has taken a knowledge-based approach in which the translation process is supported by several dictionaries and a huge corpus. While using the knowledge-based approach, EDR is governed by a process of statistical machine translation.
Compared with other machine translation systems, EDR is more than a mere translation system: it also provides a great deal of related information.

From the Sinhala language point of view, only a little research has been done on machine translation. Vitanage's English to Sinhala translator for the weather forecasting domain [16] and Silva and others' Sinhala to English language translator are some prototype projects. There have also been some attempts to develop Sinhala to Tamil machine translation [17] and Japanese to Sinhala machine translation [14].

It is evident from this discussion that human-assisted machine translation is more practical for machine translation projects that are at their early stages. Therefore, we have developed an English to Sinhala machine translation system that also takes the approach of human-assisted translation. However, we go beyond pre-editing and post-editing, and introduce an intermediate-editing stage into the machine translation system. This system has been tested through a standard application and a web application.

III. TRANSLATION PROBLEMS

Because English and Sinhala have different language structures, translating English into Sinhala is a difficult process. English is a West Germanic language that originated in Anglo-Saxon England. Sinhala belongs to the Indo-Aryan branch of the Indo-European languages [29]. In Sinhala there is distinctive diglossia, as in many languages of South Asia: the literary language and the spoken language differ from each other in many respects. The most important difference between the two varieties is the lack of inflected verb forms in the spoken language. In addition, Sinhala uses SOV (Subject Object Verb) word order, whereas English uses SVO (Subject Verb Object) word order, which makes the translation process more complex. Some translation problems are described below.

A. Word Order in Phrases

One of the major structural differences between English and Sinhala is that the order of words in certain phrases differs between the two languages. For example, the English sentence "The boy goes to his school by school bus" can be translated as msßñ