Transliteration System for English to Sinhala Machine Translation 1

B. Hettige 1, A. S. Karunananda 2
1 Department of Statistics and Computer Science, Faculty of Applied Sciences, University of Sri Jayawardenepura, Sri Lanka
2 Faculty of Information Technology, University of Moratuwa, Sri Lanka
[email protected], [email protected]

Abstract – Machine translation is a challenging task in natural language processing. Handling out-of-vocabulary words, proper nouns and technical terms are major issues common to all machine translation systems. This paper presents a transliteration approach to machine translation from English to Sinhala. We have used finite state automata to develop transducers for English to Sinhala transliteration. This approach can transliterate both original English text and Sinhala words written in English letters. The transliteration system has been developed using SWI-Prolog and Prolog Server Pages (PSP). The English WordNet and a Sinhala chatbot were used to test the transducers, and reasonable results were achieved.

I. INTRODUCTION

Machine translation is a potential means of giving those with other mother tongues access to the world knowledge available in English. A number of Machine Translation (MT) systems are available for many languages. Among them, the Electronic Dictionary Research (EDR) system [27] is the most powerful MT system in the Asian region. The EDR system translates English to Japanese and vice versa, using a knowledge-based machine translation approach. In the Asian region, a number of MT systems have also been developed to translate English into languages of the Indian family, among them Anglabharati, Anusaaraka, MaTra and Mantra [6]. The Anglabharati MT system translates English to Indian languages, primarily Hindi, using a rule-based transfer approach. The Anusaaraka MT system [3] provides language access between Indian languages and uses the Paninian Grammar (PG) model [1] for its language analysis. The Anusaaraka project has developed language accessors from Punjabi, Bengali, Telugu, Kannada and Marathi into Hindi. Its approach and lexicon are general, but the system has mainly been applied to children’s stories. MaTra is a human-assisted translation project for English to Indian languages, and the Mantra project, which is based on the TAG formalism, has been developed for the domain of gazette notifications pertaining to government appointments [2].

The machine translation process can be described simply as decoding the meaning of the source text and re-encoding this meaning in the target language [11]. An MT system typically contains a source language morphological analyzer, a source language parser, a translator, a target language morphological generator, a target language composer and lexical databases [8][9]. At present, a number of approaches are used for machine translation, including dictionary-based, statistical and example-based machine translation [25]. However, machine translation cannot be achieved merely by handling languages at the morphological, syntactic and semantic levels, because handling out-of-vocabulary words, proper nouns and technical terms becomes crucial when the source and target languages differ considerably in alphabet, pronunciation and so on. Various transliteration approaches have been taken to solve these issues.

In general, transliteration is the practice of transcribing a word or text written in one writing system into another writing system [25]. In other words, machine transliteration is a method for automatically converting words in one language into phonetically equivalent ones in another language. For example, the English word ‘machine’ is transliterated into Sinhala as මැෂින්. At present, a number of machine transliteration approaches are available, namely the grapheme-based, phoneme-based, hybrid and correspondence-based transliteration models [18]. Grapheme-based transliteration is a direct orthographical mapping from source graphemes to target graphemes. Several transliteration methods have been proposed on this basis, including the channel model and the decision tree model [18]. The phoneme-based transliteration model is based on pronunciation (the source phoneme) rather than spelling (the source grapheme). This model uses a source-grapheme-to-source-phoneme transformation followed by a source-phoneme-to-target-grapheme transformation. Using this transliteration model, Knight and others have developed a Japanese-to-English transliteration system with weighted finite state transducers (WFSTs) [19].
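
The contrast between the grapheme-based and phoneme-based models can be sketched in a few lines of Python. This is only a toy illustration: the rule tables, the phoneme labels (F, OW, N) and the romanised target units are hypothetical and are not taken from this paper or from the cited systems. The sketch shows a direct spelling-to-target mapping next to a two-step mapping that passes through source phonemes.

```python
# Toy rule tables. Every entry is an illustrative assumption, not data from the paper.
GRAPHEME_TO_TARGET = {"ph": "f", "o": "o", "n": "n", "e": "e"}    # spelling -> target units
GRAPHEME_TO_PHONEME = {"ph": "F", "o": "OW", "n": "N", "e": ""}   # spelling -> source phonemes
PHONEME_TO_TARGET = {"F": "f", "OW": "oo", "N": "n"}              # phonemes -> target units


def segment(word, table):
    """Split word into keys of table, preferring the longest match at each position."""
    units = []
    longest = max(len(key) for key in table)
    i = 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in table:
                units.append(chunk)
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")
    return units


def grapheme_based(word):
    """Map source graphemes directly to target units."""
    return "".join(GRAPHEME_TO_TARGET[u] for u in segment(word, GRAPHEME_TO_TARGET))


def phoneme_based(word):
    """Map source graphemes to source phonemes, then phonemes to target units."""
    phonemes = [GRAPHEME_TO_PHONEME[u] for u in segment(word, GRAPHEME_TO_PHONEME)]
    return "".join(PHONEME_TO_TARGET[p] for p in phonemes if p)
```

With these toy tables, `grapheme_based("phone")` keeps the silent final letter and yields `fone`, whereas `phoneme_based("phone")` drops it and yields `foon`, which is the kind of difference that motivates modelling pronunciation rather than spelling.
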
Arabic-to-English and English-to-Chinese transliteration systems have also used the phoneme-based model. The hybrid and correspondence-based transliteration models use both source graphemes and source phonemes for transliteration. Furthermore, there are two types of transliteration, namely forward transliteration and backward transliteration. Forward transliteration is the transliteration of a name from its native script to a foreign one. Backward transliteration is the restoration of a previously transliterated name to its native script. However, any transliteration model can be modeled as phoneme-based, letter-based or substring-based transliteration. Among others, Weerasinghe and others have developed a rule-based syllabification algorithm for the Sinhala language that can be considered letter-based transliteration [24]. This is a backward transliteration approach that reads Sinhala text and its pronunciation. We present a transliteration approach that falls under the letter-based tradition and uses the theory of finite automata. At this stage, the project has considered only forward transliteration from English to Sinhala.

The rest of this paper is organized as follows. Section II describes the structures of Sinhala and English words. Section III reports on the design and implementation of the proposed transliteration models for English to Sinhala machine translation. Section IV gives the conclusion and further work.

II. THE STRUCTURE OF THE SINHALA AND ENGLISH LANGUAGES

Machine transliteration is not a simple task, mainly because there are many ambiguities in the relationship between the spelling of a word and its pronunciation. Further, some letters cannot be pronounced in isolation but must be considered together with their adjacent letters. Therefore, a word-level analysis is required before going into transliteration approaches.
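
The dependence of a letter on its neighbours, and the way a letter-based finite-state transducer can handle it, can be sketched as follows. This is a hedged illustration in Python rather than the SWI-Prolog used for the system reported here; the state names, the two digraph rules and the romanised output units are assumptions made for the example and do not reproduce the transducers of this paper.

```python
# (state, input letter) -> (next state, emitted unit). Hypothetical rules:
# 's' and 'c' are held back until the next letter shows whether a digraph follows.
TRANSITIONS = {
    ("q0", "s"): ("qs", ""),
    ("qs", "h"): ("q0", "sh"),
    ("q0", "c"): ("qc", ""),
    ("qc", "h"): ("q0", "ch"),
}

# Unit emitted for a held-back letter when no digraph is completed (assumed values).
PENDING = {"qs": "s", "qc": "k"}


def transduce(word):
    """Run the toy transducer over word and return the emitted units."""
    state = "q0"
    output = []
    for letter in word:
        if (state, letter) not in TRANSITIONS:
            # No digraph: flush the held-back letter and restart from q0.
            output.append(PENDING.get(state, ""))
            state = "q0"
        # Letters without a rule in q0 are copied through unchanged.
        state, emitted = TRANSITIONS.get((state, letter), ("q0", letter))
        output.append(emitted)
    output.append(PENDING.get(state, ""))  # flush at end of word
    return [unit for unit in output if unit]
```

For instance, `transduce("ship")` gives `['sh', 'i', 'p']` while `transduce("sip")` gives `['s', 'i', 'p']`: the unit emitted for `s` is decided only after the following letter has been read.
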

A. Structure of Sinhala Words

Sinhala is constitutionally recognized as an official language of Sri Lanka, along with Tamil. Sinhala has its own writing system, which is an offspring of the Brahmi script [25]. Dhivehi, the language of the Maldives, is the closest relative of Sinhala. However, Sinhala differs from all other Indo-Aryan languages in that it contains a pair of vowel sounds that are unique to it: the short vowel ඇ (ae) and the long vowel ඈ (aae) [21]. Sinhala also contains a set of five nasal sounds known as “half nasals” or “prenasalized stops”. These sounds, as represented in modern Sinhala writing, and their romanised notation are as follows: ඟ (nng), ඦ (ndj), ඬ (nnd), ඳ (nd), ඹ (mb). The Sinhala alphabet consists of 61 letters, comprising 18 vowels, 41 consonants and 2 semi-consonants [5][17]. These symbols represent 40 sounds: 14 vowel sounds and 26 consonant sounds. This is quite similar to other Indic alphabets, as all of them appear to be offshoots of the Sanskrit alphabet. Table I shows the Sinhala alphabet.

Furthermore, some graphical symbols, called strokes, are used in conjunction with consonants. They are also used in writing some vowels (for example ආ, ඒ, ඓ). Unlike in English, a stroke may be positioned on any of the four sides of the base letter. Table II shows the Sinhala strokes and their positions. Note that Sinhala letters (characters) are formed from vowels, consonants, and combinations of consonants with strokes. Table III shows the combinations of the consonant ක (k) with vocalic strokes.
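
The combination of a consonant with vocalic strokes can be illustrated with Unicode code points. This sketch assumes Unicode Sinhala text rather than the font encoding used in the tables of this paper. The four strokes chosen, the romanised vowel labels and the helper names are illustrative, and the stroke positions are stated from general knowledge of the script rather than copied from Table II.

```python
KA = "\u0d9a"  # the consonant k (SINHALA LETTER ALPAPRAANA KAYANNA)

# (romanised vowel, Unicode vowel sign, side of the base letter where it is drawn)
STROKES = [
    ("aa", "\u0dcf", "right"),  # aela-pilla
    ("i", "\u0dd2", "above"),   # ketti is-pilla
    ("u", "\u0dd4", "below"),   # ketti paa-pilla
    ("e", "\u0dd9", "left"),    # kombuva
]


def combine(consonant, sign):
    """Attach a vocalic stroke to a consonant.

    Unicode stores the sign after the consonant even when the stroke is
    drawn to the left of it, so plain concatenation is sufficient.
    """
    return consonant + sign


def stroke_table(consonant=KA):
    """Return (romanised vowel, combined letter, stroke position) rows."""
    return [(vowel, combine(consonant, sign), side) for vowel, sign, side in STROKES]
```

Calling `stroke_table()` returns one row per stroke, for example the row for `e` pairs the consonant with the kombuva, which a renderer draws to the left of the base letter although it follows the consonant in the stored text.
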

TABLE I
The Sinhala Alphabet

Letter types: Vowels, Consonants, Semi-Consonants

Vowels: w, wd, we, wE, b, B, W, W!, Ì, Ï, iD, iDD, t, ta, ft, T, ´, T!
Consonants: l, L, . , >, V, Õ, p, P, c, Cv [, {, P, g, G, v, V, K, Ë, ; , : , o, O, k, |, m, M, n, N, u, U, h, r, ,, j, Y, I, i, y,