Example Based Machine Translation for English-Sinhala Translations
Anne Mindika Silva, Ruvan Weerasinghe
University of Colombo School of Computing, Colombo 07, Sri Lanka.
Contact No: 0714230465 E-mail: [email protected]


This paper presents an Example Based Machine Translation System for English-Sinhala translations, intended mainly for use in the government domain. The System uses a bilingual English-Sinhala corpus, aligned at sentence level, as its knowledge base. Given a source phrase, the System retrieves the English sentences, and the corresponding Sinhala sentences, in which the input phrase is found (Intra-Language Matching). The System then applies a scoring algorithm to the retrieved Sinhala sentences to find the most frequently occurring Sinhala phrase in the set, which is most likely to be the best candidate translation for the phrase (Inter-Language Matching). The output of the System has obtained BLEU scores of 0.17 - 0.26 for 3-gram analysis using one reference translation.
Index Terms— Example Based Machine Translation, Natural Language Processing, Sinhala Language Processing

1. Introduction
Sri Lanka has a multiracial society comprising a 74% Sinhala-speaking population and an 18% Tamil-speaking population. English is spoken competently by about 10% of the population. Translation between the three official languages (Sinhala, Tamil and English) is very important in a multi-cultural country like Sri Lanka. Given the ethnic conflict in the country, translation can play a vital role in bringing the society together by improving understanding among its peoples. Translation, by definition, is the activity of interpreting the meaning of a text in one language (the source text) and producing, in another language, a new, equivalent text (the target text, or translation). Machine Translation can be considered the attempt to automate all, or part, of the process of translating from one human language to another. The major problem of manual translation is the huge demand for translation relative to the small number of human translators. The productivity of human translators can be greatly increased with Machine Translation techniques.

1.1 Approach
This paper presents a Machine Translation tool for English-Sinhala translations, to be used for translating government documents. The approach used in this study is Example Based Machine Translation, which uses a sentence-aligned bilingual corpus and a list of function words for both languages. Example Based Machine Translation techniques prove ideal for a less-resourced language like Sinhala, since they allow researchers to experiment with the virtues of the technique without waiting for the resources to become available. One of the advantages of the approach is that the quality of the translation will improve incrementally as the example set becomes more complete, without the need to update and improve detailed grammatical and lexical descriptions. [15]

1.2 Motivation
Although many free and commercial systems exist for the widely used languages (English, French, German, etc.), systems for English-Sinhala and Tamil-Sinhala translations are not available. There is very little local (Sinhala/Tamil) content available in electronic form even now, so locals in Sri Lanka have to rely mostly on English material. Not only international content but also much academic content is available only in English. This problem is mainly faced by speakers who are familiar only with Sinhala or Tamil. It would be highly desirable if English content could be translated into Sinhala/Tamil using Machine Translation techniques. Another requirement is for material in Sinhala/Tamil to be translated into English, so that local knowledge and culture can be easily disseminated to the global community. Translation between Sinhala and Tamil is also very desirable, especially in the current context of conflict, to develop a good relationship between the two main cultural groups of Sri Lanka. In Sri Lanka there are many kinds of material that need to be prepared in all three official languages: Sinhala, Tamil and English. For example, government documents, gazettes, public notices, etc. are issued in all three languages. This is an area where Machine Translation could be put to valuable use, especially when restricted to a particular domain.

1.3 Scope
The system is expected to provide possible translations for chunks of the source text (words, phrases, sentences) in the target language, so that the user of the System (the Human Translator) can select the most suitable translation or else insert his own translation. In case the user defines a new translation, it should be saved in the corpus so that it can be re-used in future translations (the functionality of a Translation Memory). The bilingual corpus can be a shared resource, so that a group of translators working in the same domain can benefit from each other's contributions to the system.

2. Background
2.1 Translation
The translation process can be described simply as [11]: 1. decoding the meaning of the source text, and 2. re-encoding this meaning in the target language. To decode the meaning of a text, the translator must first identify its component "translation units", that is to say, the segments of the text to be treated as cognitive units. A translation unit may be a word, a phrase or even one or more sentences. This process requires thorough knowledge of the grammar, semantics, syntax, idioms and the like of the source language, as well as the culture of its speakers.

2.2 Problems in translation
Translation in general is a difficult activity, and there are several problems faced even by human translators [11]:
- The source text may be difficult to read, misspelled/misprinted, incomplete or inaccurate
- Language problems such as dialect terms, unexplained acronyms and abbreviations, proper names, obscure jargon and idioms, and slang
- Rhymes, poetic meters, highly specific cultural references and humour
- The problem of untranslatability: untranslatable words
- Words having different meanings in different contexts
- The same word having different meanings depending on the culture
- Words having different levels of precision
- Expressions referring to concepts that do not exist in the other language

2.3 Problems in English-Sinhala translations
Sinhala is a language of SOV (Subject-Object-Verb) word order, whereas English is an SVO (Subject-Verb-Object) word order language. In Sinhala, there are almost no subordinate clauses as in English, but only non-finite clauses formed by means of participles and verbal adjectives. E.g.: "The man who writes books" translates to /pot liyana miniha:/, literally "books writing man". Sinhala is a left-branching language, which means that determining elements are usually put in front of what they determine. An exception to this is statements of quantity, which usually stand behind what they define. E.g.: "the four books" translates to /pot hatara/, literally "books four". There are no prepositions, only postpositions. E.g.: "under the book" translates to /pota jata/, literally "book under".

Sinhala is a pro-drop language: the subject of a sentence can be omitted when it is redundant because of the context. E.g.: the sentence /koheda gie:/, literally "where went", can mean "where did I/you/he/she/we... go". Also, the copula "to be" is generally omitted: "I am rich" translates to /mama po:sat/, literally "I rich". There is a four-way deictic system (which is rare): there are four demonstrative stems, /me:/ "here, close to the speaker", /o:/ "there, close to the person addressed", /ara/ "there, close to a third person, visible" and /e:/ "there, close to a third person, not visible".

2.4 Trends in translation
The classical architectures for machine translation are Direct Translation, the Transfer Approach and the Interlingua Approach. Real systems tend to involve combinations of elements from these three architectures; thus each is best thought of as a point in an algorithmic design space rather than as an actual algorithm. In direct translation, we proceed word-by-word through the source language text, translating each word as we go. Direct translation uses a large bilingual dictionary, each of whose entries is a small program with the job of translating one word. In transfer approaches, we first parse the input text, and then apply rules to transform the source language parse structure into a target language parse structure. We then generate the target language sentence from the parse structure. In interlingua approaches, we analyze the source language text into some abstract meaning representation, called an interlingua. We then generate into the target language from this interlingual representation. A common way to visualize these three approaches is with the Vauquois triangle shown in Fig. 1 [6]. The triangle shows the increasing depth of analysis required (on both the analysis and generation ends) as we move from the direct approach through transfer approaches to interlingual approaches. In addition, it shows the decreasing amount of transfer knowledge needed as we move up the triangle: from huge amounts of transfer at the direct level (almost all knowledge is transfer knowledge for each word), through transfer (transfer rules only for parse trees or thematic roles), to interlingua (no specific transfer knowledge). Most Transfer or Interlingual rule-based systems are based on the idea that success in practical MT involves defining a level of representation for texts which is abstract enough to make translation itself straightforward, but which is at the same time superficial enough to permit sentences in the various source and target languages to be successfully mapped into that level of representation. That is, successful MT involves a compromise between depth of analysis, or understanding of the source text, and the need to actually compute the abstract representation. In this sense, Transfer systems are less ambitious than Interlingual systems, because they accept the need for often quite complex mapping rules between the most abstract representations of source and target sentences. As linguistic knowledge increases, MT systems too should improve, based on linguistic rules encoding that knowledge. This position is based on the fundamental assumption that finding a sufficiently abstract level of representation for MT is an attainable goal. However, some researchers have suggested that it is not always the case that the deepest level of representation is necessarily the best level for translation. Also, for languages that have similar properties, shallow transfer methods can be used without going to syntax-level transfers. The currently available MT systems can also be classified as: 1. Machine Translation, where the translator supports the machine, and 2. Computer Assisted Translation, where the computer program supports the translator.

2.5 Machine Translation
Machine Translation (MT) is a form of translation where a computer program analyses the source text and produces a target text without human intervention. In Machine Translation, the translator supports the machine. At its basic level, MT performs simple substitution of atomic words in one natural language for words in another. Using corpus techniques, more complex translations can be performed, allowing for better handling of differences in linguistic typology, phrase recognition, and translation of idioms, as well as the isolation of anomalies. Current machine translation software often allows for customisation by domain or profession (such as weather reports), improving output by limiting the scope of allowable substitutions. Although most such systems (e.g. Alta Vista's 'Babel Fish', Google's translation facility) produce what is called a "gisting translation" (a rough translation that gives the "gist" of the source text), in fields with highly limited ranges of vocabulary and simple sentence structure, for example weather reports, machine translation can deliver very useful results. Improved output quality can also be achieved by human intervention. [10]
There are some identified sub-fields in Machine Translation:
- Dictionary Based Machine Translation: uses a method based on dictionary entries, which means that the words are translated as a dictionary does, word by word, usually without much correlation of meaning between them. [12]
- Statistical Machine Translation (STAT MT or SMT): tries to generate translations using statistical methods based on bilingual text corpora. The document is translated on the probability that a string e in English is the translation of a string f in French, using parameter estimation. The statistical translation models can be word based or, as in many more recent designs, phrase based. Models based on syntax have also been tried. [13]
- Example Based Machine Translation (EBMT): essentially translation by analogy. EBMT is also regarded as a case-based reasoning approach to MT, where previously resolved translation cases are reused to translate new SL text.
- Interlingual Machine Translation: uses a rule-based machine translation approach. According to this approach, the source language (i.e. the text to be translated) is transformed into an interlingual, language-independent

representation. The target language is then generated from the interlingua. [11]
Both Statistical Machine Translation and Example Based Machine Translation are mainly based on the Direct Translation model. Both use machine learning and data-driven approaches, where pattern recognition and data mining concepts can be put to use. The advantages of these two approaches are non-reliance on expert knowledge, learnability and trainability. While EBMT systems place more reliance on the examples, SMT systems place more reliance on statistical techniques.

3. Example Based Machine Translation
The basic assumption of EBMT is: "If a previously translated sentence occurs again, the same translation is likely to be correct again". [1] This idea is sometimes thought to be reminiscent of how human translators proceed when using a bilingual dictionary: looking at the examples given to find the Source Language (SL) example that best approximates what they are trying to translate, and constructing a translation on the basis of the Target Language (TL) example that is given. [2] The general architecture of an EBMT system, presented by Konstantinidis [4], is given in Fig. 2. The system begins with the input, referred to as the source text. The most similar and analogous examples are retrieved from the source language database. The next step is to retrieve the corresponding translations of the analogous examples. The final step is to recombine the examples into the final translation.

The EBMT approach proposed by Nagao uses raw, unanalysed, unannotated bilingual data and a set of SL and TL lexical equivalences, mainly expressed in terms of word pairs (with SL and TL verb equivalences expressed in terms of case frames), as the linguistic backbone of the translation process. [7] The translation process is mainly a matching process which aims at locating the best match, in terms of semantic similarity, between the input sentence and the available examples in the database. In EBMT, instead of using explicit mapping rules for translating sentences from one language to another, the translation process is basically a procedure of matching the input sentence against the stored example translations. The basic idea is to collect a bilingual corpus of translation pairs and then use a best-match algorithm to find the closest example to the source phrase in question. This gives a translation template, which can then be filled in by word-for-word translation. The distance calculation, for finding the best match for a source phrase, can involve calculating the closeness of items in a hierarchy of terms and concepts provided by a thesaurus. For a given input, the system then calculates how close it is to various stored example translations, based on the distance of the input from the example in terms of the thesaurus hierarchy, and on how likely the various translations are on the basis of frequency ratings for elements in the database of examples. In order to do this, it must be assumed that the database of examples is representative of the texts we intend to translate. Systems using a memory-based approach examine the MT problem from a human-learning point of view: they exploit language models based on corpora, statistics and examples, applying the analogy principle to translation by making use of past experiences. Some EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. [8]
It is evident that the feasibility of the approach depends crucially on the collection of good data. However, one of the advantages of the approach is that the quality of translation will improve incrementally as the example set becomes more complete, without the need to update and improve detailed grammatical and lexical descriptions. Moreover, the approach can be (in principle) very efficient, since in the best case there is no complex rule application to perform. All one has to do is find the appropriate example and (sometimes) calculate distances. However, there are some complications. For example, one problem arises when one has a number of different examples, each of which matches part of the string, but where the parts they match overlap, and/or do not cover the whole string. In such cases, calculating the best match can involve considering a large number of possibilities. [15]
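The best-match step described above uses a thesaurus hierarchy for its distance calculation. As a much simpler illustrative stand-in, the sketch below ranks stored example pairs by plain word overlap (Jaccard similarity) with the input; this swaps the thesaurus-based distance for a lexical one, and all names are hypothetical rather than taken from any actual EBMT system.

```python
# Illustrative only: the thesaurus-based distance described in the text
# is replaced here by plain word overlap (Jaccard similarity).

def jaccard(a_words, b_words):
    """Fraction of shared words between two word lists."""
    a, b = set(a_words), set(b_words)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_match(source, examples):
    """Return the (SL, TL) example pair whose source side is closest
    to the input under the word-overlap measure."""
    src = source.lower().split()
    return max(examples, key=lambda pair: jaccard(src, pair[0].lower().split()))
```

The target side of the returned pair would then serve as the translation template to be adapted, e.g. by word-for-word substitution of the differing words.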

4. The proposed system
4.1 Functionality of the system
A prototype has been developed for English to Sinhala translation (using Visual C#.net 2005) which uses a sentence-aligned bilingual corpus as its Knowledge Base. Given a source sentence to translate, it allows the Translator to find the most suitable translations at phrase level, and thus provides the facility to quickly arrange the suggested phrases to form the target sentence. Finding suitable translations for source chunks is done by retrieving the source and target language sentences which match or contain the given input, and then determining the best target language match using a scoring algorithm. The System also has the facility to take as input a source file to be translated (a plain text file in which the sentences are separated by line breaks) and then assist the user in translating the file, taking one sentence at a time. Another important feature is the System's ability to learn from past translations. The user can save newly translated content into the Corpus, so that the new knowledge can be used for subsequent translations. Since the Corpus should be sentence aligned, the System remembers past translations at sentence level until they are added to the Corpus.

4.2 The Knowledge Base
The current System essentially requires no knowledge of the structure of the languages, grammar rules, morphological analysis or parsing, although these can be used to improve the outcome in future developments. The sources of knowledge that the System uses are a bilingual English-Sinhala corpus and function word lists for the source and target languages. Thus, the preliminary tasks involved in the experiment included finding pre-translated material for English and Sinhala in electronic form and aligning the text at sentence level to be added to the Corpus. The sources of the pre-translated material were the Order Books of the Parliament (obtained by courtesy of the Parliament of Sri Lanka) and the Vibhasha translation magazine published by the Centre for Policy Alternatives (CPA), which were available in Sinhala, English and Tamil.

The translated texts were then aligned at sentence level using an interface developed as part of the System. The aligned sentences were saved as text files in UTF-16 encoding. An initial Corpus consisting of several files, mainly in the government and political domain, was thus prepared for use by the System. A list of function words for each language was also saved in two separate files, so that the Translator can edit the content to suit his needs. These function words are used as stop words when looking for occurrences and finding best matches in the Corpus.

4.3 Steps of the translation process
Once the user inputs a source sentence to be translated, the System checks whether an exact match can be found in the English files of the Corpus. If an exact match is found, the corresponding Sinhala sentence is retrieved from the Corpus and returned as the output sentence. If an exact match is not found, the user can break the source sentence into logical phrases and see whether the System can suggest an acceptable translation for each phrase. Given a source phrase, the System retrieves the English sentences, and the corresponding Sinhala sentences, in which the input phrase is found (Intra-Language Matching). The System then performs a scoring algorithm on the retrieved Sinhala sentences to find the most frequently occurring Sinhala phrase in the set, which is most likely to be the best candidate translation for the phrase (Inter-Language Matching). Once the scoring is done, the user can select the best match from the highest-scoring outputs and proceed to translate another phrase of the source sentence. Once the entire sentence is translated, the user can make the necessary modifications (phrase re-ordering, etc.) and proceed to translate the next sentence. When the user has translated the whole file, he can save the entire translation to a file and can also add the new translations to the Corpus.
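The first step above, exact sentence lookup, can be sketched as follows. This is an illustrative re-implementation (the prototype was written in C#), and the function and variable names are hypothetical.

```python
# Sketch of the exact-match step: if the input sentence appears verbatim
# in the English side of the corpus, reuse its stored Sinhala counterpart.

def exact_match(sentence, corpus):
    """corpus: list of (english, sinhala) pairs aligned at sentence level."""
    wanted = sentence.strip().lower()
    for english, sinhala in corpus:
        if english.strip().lower() == wanted:
            return sinhala  # exact hit: return the stored translation
    return None  # no exact match: fall back to phrase-level translation
```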

4.4 Intra-Language matching
The input to this step is an English sentence/phrase which is submitted for translation. The output is a list of the form:

(input phrase S
  (english-sentence-1, sinhala-sentence-1)
  (english-sentence-2, sinhala-sentence-2)
  (english-sentence-3, sinhala-sentence-3)
  (english-sentence-4, sinhala-sentence-4)
  .........
  (english-sentence-n, sinhala-sentence-n)
)

in which the input phrase S occurs in each english-sentence-i, and sinhala-sentence-i is the corresponding translation of english-sentence-i. The user can set the parameters for the matching process, namely, to match the exact input phrase (matching contiguous words), or to find matches where all of the words given as input occur anywhere in english-sentence-i. Thus, when matching in the second mode (matching all the words regardless of position), the system retrieves English sentences in which:
- There are gaps between the words matching the input words. E.g.: A X Y B Z C can match the input chunk A B C.
- The word order is different from that in the input chunk. E.g.: B C A can match A B C.
The user can also specify whether to drop the function words of the input string when matching for words. (This option is not available when matching the exact input string.)
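The two matching modes described above can be sketched as follows. This is an illustrative re-implementation in Python (the prototype was in C#), with hypothetical names; the corpus is taken to be a list of (english, sinhala) sentence pairs.

```python
# Sketch of the two Intra-Language matching modes: exact (contiguous)
# phrase match, and all-words match regardless of position and gaps.

def matches_exact(phrase_words, sentence_words):
    """Contiguous match: the input phrase appears as-is in the sentence."""
    n = len(phrase_words)
    return any(sentence_words[i:i + n] == phrase_words
               for i in range(len(sentence_words) - n + 1))

def matches_all_words(phrase_words, sentence_words, stop_words=frozenset()):
    """All input words occur somewhere in the sentence, in any order,
    optionally ignoring function (stop) words of the input."""
    wanted = {w for w in phrase_words if w not in stop_words}
    return wanted <= set(sentence_words)

def intra_language_match(phrase, corpus, exact=True, stop_words=frozenset()):
    """Return (english, sinhala) pairs whose English side matches the phrase."""
    pw = phrase.lower().split()
    hits = []
    for en, si in corpus:
        sw = en.lower().split()
        ok = matches_exact(pw, sw) if exact \
             else matches_all_words(pw, sw, stop_words)
        if ok:
            hits.append((en, si))
    return hits
```

With the paper's own examples, "A X Y B Z C" and "B C A" match the chunk "A B C" only in the second mode, while a sentence containing "A B C" contiguously matches in both.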

4.5 Inter-Language matching
In this phase, the system calculates a score to find the most frequently occurring phrases in the list of target language sentences (sinhala-sentence-i) retrieved by Intra-Language Matching. Here, an assumption is made that if the input string is n words long, then the corresponding translation of the string will be n/2 to 2*n words long. The double-length assumption may not be suitable for long input strings, but is important when translating 1-2 word strings. A score is calculated for the contiguous substrings of each target string, from n/2 words long up to 2*n words long. E.g.: If the input string is 2 words long, and the current target language sentence is A B C D E F G, then the substrings considered in the scoring process would be:
A, B, C, D, E, F, G
AB, BC, CD, DE, EF, FG
ABC, BCD, CDE, DEF, EFG
ABCD, BCDE, CDEF, DEFG
When only the above target string has been scored, all the listed substrings get a score of 1. But when the score is calculated over all of the target language sentences, the substring that occurs most often gets the highest score, and is thus most likely to be the best matching translation for the given input string. The user has the option to disregard the function words of the target language when scoring, to avoid function words getting an unnecessarily high score, which would affect the output.

5. An example translation
An example of the matching output produced in the process of translating the following source sentence is described below. "The Sri Lankan government is in turmoil after one of the constituent parties quit the ruling coalition last week, leaving the Peoples Alliance (PA) as a minority in parliament with only 109 out of 225 seats."

Since it is unlikely that an exact match will be found for the above sentence, the user has to translate the input sentence at phrase level. If the user tries to translate the phrase "The Sri Lankan government", the System will find all the sentences (using Intra-Language Matching) in which the input phrase occurs in the Corpus (see Fig. 3 for the list of sentences retrieved). The System then performs Inter-Language Matching to find the target string that is most likely to be the translation of the given input phrase (see Fig. 4). The user can accept one of the suggested translations and then proceed to translate another phrase.

6. Evaluation
6.1 Evaluation of Machine Translation



One of the most difficult things in machine translation is the evaluation of a proposed system/algorithm. Given the ambiguity of natural languages, it is hard to assign numbers to the output of natural language processing applications. When evaluating machine translation, a "good translation" or a "better translation" is hard to define. Also, in machine translation there may not be just one good translation: even when a sentence is translated by two humans, there may be variances in word choice and word order. Typically, there are many "perfect" translations for a given source sentence, and even experts may not agree when coming to a conclusion as to which translation is better. Human evaluations of machine translation weigh many aspects of translation, including adequacy, fidelity and fluency. Although human evaluations are extensive, they are also very expensive and time consuming. Because of this, the need for quick, inexpensive automatic machine translation evaluation has arisen, especially among machine translation researchers and developers. It is accepted that the closer a machine translation is to a professional translation, the better it is. The fluency and adequacy of the output sentences can be checked by n-gram analysis. If a reference translation is available, it becomes possible to compare the output with the reference and to put a number to the notion of a "good translation". Some automatic machine translation evaluation techniques include BLEU, NIST, WER (Word Error Rate), PER (Position-independent word Error Rate) and GTM (General Text Matcher). [5]

6.2 The Bleu translation metric
The Bleu metric (sometimes called the Blue metric) is an IBM-developed metric. The central idea is that the closer a machine translation is to a professional human translation, the better it is. To check how close a candidate translation is to a reference translation, an n-gram comparison is done between the two translations. The closeness metric was designed after the highly successful word error rate metric used by the speech recognition community, appropriately modified for multiple reference translations and allowing for legitimate differences in word choice and word order. The main idea is to use a weighted average of variable-length phrase matches against the reference translations. [3] Basically, it compares n-grams of the candidate with n-grams of the reference translation and counts the number of matches. These matches are position independent. The more matches there are, the better the candidate translation is. This sort of modified n-gram precision scoring captures two aspects of translation: adequacy and fluency. A translation using the same words (1-grams) as the references tends to satisfy adequacy; the longer n-gram matches account for fluency. [3] The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. The score improves when there are more reference translations per sentence. A brevity penalty is applied if the length of the result is less than the length of the references. However, because the evaluation is based on n-gram comparison with reference sentences, it is possible to make sentences with completely different meanings by switching words/n-grams and still get high scores. The opposite can also occur: for example, when the machine translation algorithm consistently translates a certain constituent to "New South Wales politics", it is penalized heavily when the reference texts mention "politics of New South Wales", especially when using larger n-grams. [14]
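The scoring idea described above can be sketched for the single-reference case as follows. This is a simplified illustration, not the exact IBM formulation: it applies "add one" smoothing to every n-gram precision (standard BLEU smooths only when needed), handles one reference rather than several, and all names are hypothetical.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Multiset of n-grams of a word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(candidate, reference, max_n=3):
    """Simplified single-reference BLEU with add-one smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped matches: each candidate n-gram counts at most as often
        # as it appears in the reference (modified n-gram precision).
        matched = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, sum(c_counts.values()))
        prec = (matched + 1) / (total + 1)  # "add one" smoothing
        log_prec += math.log(prec) / max_n  # geometric mean of precisions
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)
```

A candidate identical to the reference scores 1.0 under this sketch, while a shortened candidate loses score through both the missing higher-order n-grams and the brevity penalty.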

6.3 Evaluation of the system
Evaluation of Machine Translation should typically be done using text that has not previously been seen by the System. For this purpose, a Training Set and a Testing Set were defined from the available data. The translated output is then compared with one or more reference translations to obtain the translation score. The output of the System was evaluated using the Bleu translation metric. Since corpus alignment was a slow and manual process, there existed plenty of pre-translated material which could be used as the Testing Set. The bilingual corpus used for the experiment consisted of approximately 3000 sentence pairs. Out of the text that had not been added to the Corpus, a source text consisting of 94 sentences, extracted from the Order Paper of Parliament for Wednesday, August 10, 2005, was selected for the evaluation. Using the selected input text, two tests were carried out:
Test 1: Translation of the file by always selecting the top Sinhala match for the selected input phrase
Test 2: Translation of the file by allowing the user to select a target phrase from the top 5 matches produced by inter-language matching
In each test, the translation proceeded by first searching for an exact match for each sentence; in case an exact match was not found, the sentence was translated at phrase level. In addition, the user always had to accept the suggestions given by the system, even if the system did not produce a good translation, and the user was not allowed to do word/phrase re-ordering or to use his own translation. The two Sinhala translations thus produced were evaluated using the Bleu metric with only one reference translation. The reference translation for the input text was prepared using the original translation of the text.

6.4 Evaluation results

The BLEU score for the translation produced by Test 1, using 3-gram analysis, is given below:

Precision 1-gram: 0.980392 (1.000000 after smoothing)
Precision 2-gram: 0.200000 (0.400000 after smoothing)
Precision 3-gram: 0.000000 (0.500000 after smoothing)
Weighted Precision: 0.477648
Brevity Penalty: 0.544524
Smoothing: "Add one"
BLEU = 0.260091

The BLEU score for the translation produced by Test 2, using 3-gram analysis, is given below:

Precision 1-gram: 0.924242 (0.939394 after smoothing)
Precision 2-gram: 0.066667 (0.133333 after smoothing)
Precision 3-gram: 0.000000 (0.111111 after smoothing)
Weighted Precision: 0.226156
Brevity Penalty: 0.784723
Smoothing: "Add one"
BLEU = 0.177470

The BLEU scores for the evaluation using 1-gram, 2-gram and 3-gram analysis are given in Table 1.

Table 1: BLEU scores by n-gram order

N-gram     Test 1        Test 2
1-gram     0.533847      0.725274
2-gram     0.241119      0.194789
3-gram     0.260091      0.177470
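The reported figures follow the standard smoothed BLEU computation: modified n-gram precisions combined by geometric mean, add-one smoothing, and a brevity penalty. A minimal single-reference sketch is given below; the exact smoothing variant behind the numbers above may differ from this one:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=3):
    # Geometric mean of add-one smoothed n-gram precisions, times the
    # brevity penalty; candidate and reference are token lists.
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        # Add-one smoothing keeps a zero n-gram overlap (as in the
        # 3-gram rows above) from driving the whole score to zero.
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # The brevity penalty punishes candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_precision)
```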


6.5 Discussion of results

Typically, a manual translation gets a BLEU score of around 0.4, and Statistical Machine Translation systems typically score 0.05 - 0.25. State-of-the-art French-English MT systems have been known to score 0.25 with 2-4 reference translations. For English-Sinhala translations, BLEU scores of 0.02 - 0.06 have been obtained, and for Sinhala-Tamil translations, BLEU scores of 0.12 - 0.14. Since the number of reference translations affects the score, the "adjusted" Sinhala-Tamil score is said to be close to 0.185 [9].

The evaluation results show a higher BLEU score for Test 2 (where the user can select from the top 5 matches) in the 1-gram analysis. In both the 2-gram and 3-gram analyses, however, the output of Test 1 (where the user has to accept the top match) obtained the higher score; the Test 1 output was also given a higher brevity penalty. When comparing with other translation systems, the chunking of the source sentence into logical phrases and the selection of the best translation from the suggestions may have been influenced by the user's competence in translation. However, the system's ability to suggest acceptable translations for source language phrases, together with the exact matches, may also have affected the score. It remains to be seen how the system would perform when automatic chunking and alignment are available.

7. Conclusion

7.1 Problems encountered

One of the major problems encountered at the initial stages was the lack of a bilingual corpus for English-Sinhala. Although bilingual corpora exist for the widely used languages, resources for machine translation of Sinhala and Tamil are still very rare. Much time was therefore spent initially on building a sentence-aligned bilingual corpus to serve as the knowledge base of the system. The lack of Sinhala documents in electronic form was another problem encountered in the course of the project. In addition, both the corresponding English and Sinhala translations were needed to align the corpus. The data required for the alignment were obtained from the Sri Lanka Parliament and from a magazine published by the Centre for Policy Alternatives. The alignment had to be done manually, which was a time-consuming task.

To cope with the structural differences between the two languages, a complex alignment algorithm is needed, one which makes use of a tagged corpus and parsing techniques to determine the parse trees of the source and target language sentences. The system's inability to automatically break down the input sentence into logical phrases was another factor limiting its efficiency: if the input phrases are not logical, the suggested translations for those phrases will also be meaningless.

7.2 Conclusions reached

The use of Example Based Machine Translation for the translation of government documents proves to be appropriate, since such documents use formal language and follow the same format in most cases. To increase the probability that a suggested target string is the translation of the given input string, the system should increase the number of sentences retrieved by the intra-language matching process. This could be done by accepting sentences in which all the input string words occur anywhere in the sentence, by disregarding function words when matching, and by accepting morphological variants of the words. The best candidate among the target language phrases can also be given a higher score by disregarding function words and, if possible, by penalising candidate phrases according to the difference between the number of words in the candidate phrase and in the input phrase.
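The scoring refinements proposed above can be sketched as follows; the function-word list and the exact form of the length-difference penalty are assumptions made for illustration:

```python
# Illustrative English function-word list; a real list would be larger.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "is", "for"}

def content_words(phrase):
    # Disregard function words when matching, as suggested above.
    return [w for w in phrase.lower().split() if w not in FUNCTION_WORDS]

def score_candidate(frequency, candidate_word_count, input_word_count):
    # Frequency of the candidate target phrase, discounted as its word
    # count strays from the input phrase's word count (one possible form
    # of the proposed length-variance penalty).
    return frequency / (1 + abs(candidate_word_count - input_word_count))
```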

7.3 Future extensions

Suggestions for future extensions of the system are given below:

- To automate the process of breaking down the input sentence into logical phrases
- To introduce an alignment module to automatically align the translations of the phrases to form the output sentence
- To incorporate morphological analysis into the matching process in order to increase the number of matching sentences in the corpus
- To apply a penalty for phrase length when calculating scores for the candidate target language phrases, such that the score is higher when the number of words in the target string approximates the number of words in the input string
- To use the system for the translation of other language pairs (e.g. Sinhala-Tamil)

Acknowledgments

The authors wish to thank Dr. H. L. Premaratne for reviewing the paper, Mr. Dulip Herath of the Language Technology Research Laboratory (LTRL), and the staff of the LTRL. Special thanks to Mr. Jagath Gajaweera, Director - Legislative of the Parliament of Sri Lanka, for his kind cooperation in obtaining translated material in Sinhala, English and Tamil.

References

[1] Ralf D. Brown. "Example-Based Machine Translation in the Pangloss System". In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 169-174, Copenhagen, Denmark, August 5-9, 1996.
[2] Nikos Drakos. http://www.muni.cz/usr/wong/teaching/mt/notes/node24.html. Accessed 19/08/2006.
[3] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. "Bleu: a Method for Automatic Evaluation of Machine Translation". In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002.
[4] M. Konstantinidis. Example-Based Machine Translation. http://www.cs.cmu.edu/afs/cs.cmu.edu/user/ralf/pub/WWW/ebmt/ebmt.html, 1999. Accessed 24/08/2006.
[5] Chin-Yew Lin and Franz Josef Och. "Orange: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation".
[6] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Pearson Education, 2000.
[7] M. Nagao. "A Framework of a Mechanical Translation between Japanese and English by Analogy Principle". In Artificial and Human Intelligence: Edited Review Papers at the International NATO Symposium on Artificial and Human Intelligence. Elsevier Science Publishers, Amsterdam, 1984.
[8] D. Turcato, P. McFetridge, F. Popowich and J. Toole. "A Unified Example-Based and Lexicalist Approach to Machine Translation". In Proceedings of the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI '99), 1999. http://www.cs.sfu.ca/research/groups/NLL/elib/guide.html.
[9] A. R. Weerasinghe. "A Statistical Machine Translation Approach to Sinhala-Tamil Language Translation". International Information Technology Conference, Colombo, 2003.
[10] Wikipedia. Machine Translation. http://en.wikipedia.org/wiki/Machine_translation. Accessed 18/08/2006.
[11] Wikipedia. Translation. http://en.wikipedia.org/wiki/Translation. Accessed 18/08/2006.
[12] Wikipedia. Dictionary-Based Machine Translation. http://en.wikipedia.org/wiki/Dictionary-based_machine_translation. Accessed 18/08/2006.
[13] Wikipedia. Statistical Machine Translation. http://en.wikipedia.org/wiki/Statistical_machine_translation. Accessed 18/08/2006.
[14] Simon Zwarts. Machine Translation Evaluation.
[15] D. J. Arnold, Lorna Balkan, Siety Meijer, R. Lee Humphreys and Louisa Sadler. Machine Translation: An Introductory Guide. Blackwells-NCC, London, 1994.