Towards Sinhala Tamil Machine Translation


Randil Pushpananda¹, Ruvan Weerasinghe¹, Mahesan Niranjan²

¹ Language Technology Research Laboratory, University of Colombo School of Computing, Sri Lanka
² School of Electronics and Computer Science, University of Southampton, Highfield, Southampton SO17 1BJ, UK

Introduction

Statistical Machine Translation (SMT) is a well-established, data-driven approach that translates source-language text into target-language text using statistical models learned from bilingually aligned corpora. Figure 1 shows the main processes of the SMT model.
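As background, the statistical formulation behind such a system is often written in noisy-channel form. The block below is a standard textbook sketch, not a formula taken from this poster; the symbols $f$ (source sentence), $e$ (candidate target sentence) and $\hat{e}$ (selected translation) are notation assumed here.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

In this reading, $P(f \mid e)$ plays the role of the translation model estimated from the parallel corpus, and $P(e)$ plays the role of the language model estimated from the monolingual target corpus. Phrase-based decoders such as MOSES generalise this product to a weighted log-linear combination of feature scores.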

[Figure 1 labels: Monolingual Target Corpus; Alignment (GIZA++)]

for Sinhala and Tamil can be achieved by having a larger Sinhala-Tamil parallel corpus.

Data Collection - Monolingual Corpus

Sinhala: UCSC-LTRL 10 million word corpus
Tamil: Tamil 3 million Newspaper corpus
English (FR-EN): Europarl corpus of 2.0 million (2,007,723) sentences
English (DE-EN): Europarl corpus of 1.9 million (1,920,210) sentences

Parallel Corpus

French - English: Europarl corpus of 2.0 million (2,007,723) parallel sentences (Training: 357,088, Tuning: 53, Testing: 53)
German - English: Europarl corpus of 1.9 million (1,920,210) parallel sentences (Training: 397,634, Tuning: 53, Testing: 53)

Figure 4: BLEU Score vs Random 800 Sentences

[Figure 1 labels: Phrase Extraction (Phrase-Extract & Phrase-Score); N-gram Training (SRILM Tool); Language Model; Translation Model; Source Input; Decoder (MOSES); Target Output]

Results

From the parallel training sentences, random samples of 400, 600 and 800 sentences were each extracted 15 times for each language pair. Training, tuning and testing were carried out separately for each of the 15 random samples, and the results were then averaged for each language pair. Figures 2, 3 and 4 show how the minimum, maximum and average BLEU scores vary with the sample size.
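The repeated-sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scripts: `summarize_runs`, `score_fn` and the fixed seed are hypothetical names and choices, and `score_fn` stands in for a full train, tune and test cycle that returns a BLEU score.

```python
import random
import statistics


def summarize_runs(train_pairs, sample_size, score_fn, runs=15, seed=0):
    """Draw `runs` random samples of `sample_size` sentence pairs and
    summarize the scores that `score_fn` returns for each sample."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    scores = []
    for _ in range(runs):
        # Sample sentence pairs without replacement from the training set.
        sample = rng.sample(train_pairs, sample_size)
        scores.append(score_fn(sample))
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": statistics.mean(scores),
    }


if __name__ == "__main__":
    # Placeholder data: 900 dummy sentence pairs standing in for a training set.
    pairs = [(f"src {i}", f"tgt {i}") for i in range(900)]

    # Hypothetical scorer: a real run would train, tune and test an SMT
    # system on `sample` and return its BLEU score.
    def dummy_score(sample):
        return len(sample) / 100.0

    for size in (400, 600, 800):
        print(size, summarize_runs(pairs, size, dummy_score))
```

Keeping the scorer as a parameter separates the sampling and aggregation logic from the expensive SMT pipeline, so the same loop can be reused for each language pair.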

Figure 5 shows how the BLEU scores vary with the number of parallel sentences, up to 800. The BLEU scores of the SI-TA and TA-SI translations increase as the data size grows from 400 to 800, and they are higher than those of FR-EN and DE-EN with this limited amount of data. The main reason for this appears to be the close alignment between Sinhala and Tamil sentence structures compared with the other pairs.

[Figure 5 plot title: BLEU Score vs Number of Sentences]
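For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU. The poster does not state which BLEU implementation was used, so this is an illustration only, under these assumptions: whitespace tokenization, a single reference per hypothesis, n-grams up to order 4, no smoothing, and scores reported on a 0 to 100 scale. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0 to 100) with one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection keeps the minimum count, which clips
            # each hypothesis n-gram to its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero if any order has no match
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Because the score is zero whenever any n-gram order has no match, very small test sets can produce volatile values, which is one reason to average over repeated samples.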

Figure 1: The main processes in SMT

*** Focus of the Research ***

To investigate how translation performance varies with the amount of parallel training data, in order to find the minimum amount needed to develop a baseline machine translation system for the Sinhala-Tamil language pair.

Figure 2: BLEU Score vs Random 400 Sentences

Experimental Setup

Tools Used


GIZA++: word alignment tool
SRILM: language modeling toolkit
MOSES: phrase-based open source SMT system

Data Collection - Parallel Corpus

Sinhala - Tamil: 1006 parallel sentences from the book “An introduction to Spoken Tamil” by Prof. W. S. Karunathilake (Training: 900, Tuning: 53, Testing: 53)

Evaluation Metric Used

1. BLEU: Bilingual Evaluation Understudy

Language Pairs Used

1. Sinhala - Tamil
2. French - English
3. German - English

Figure 3: BLEU Score vs Random 600 Sentences

Figure 5: BLEU Score vs Number of Sentences (x-axis: Number of Sentences, 350 to 850; y-axis: BLEU Score, 0 to 25; series: FR-EN, DE-EN, SI-TA, TA-SI)

According to Figures 2, 3 and 4, the minimum BLEU score for each language pair increases with the sample size, as expected. The maximum score also rises, except for the DE-EN language pair at a sample size of 800. This provides strong evidence that a good translation system for Sinhala and Tamil can be achieved by having a larger Sinhala-Tamil parallel corpus.

Summary and conclusions

Comparing the Sinhala-Tamil and Tamil-Sinhala translation results with those for the French-English and German-English language pairs gives positive evidence that statistical machine translation can be expected to perform well for translating between the Sinhala and Tamil languages. Using this approach, we plan to implement a system capable of producing acceptable translations between Sinhala and Tamil for use by the wider community.

ICTer Conference, 11 - 15 December 2013, Colombo, Sri Lanka