Evaluation of English to Urdu Machine Translation

Vaishali Gupta, Nisheeth Joshi, Iti Mathur
Apaji Institute, Banasthali University, Rajasthan, India

[email protected], [email protected], [email protected]

Abstract. This paper presents an evaluation of English to Urdu Machine Translation. Evaluation measures the quality of Machine Translation output and follows two approaches: Human Evaluation and Automatic Evaluation. In this paper, we concentrate mainly on Human Evaluation. Machine Translation is an emerging research area in which human beings play a very crucial role. Since language is vast and diverse in nature, accuracy is difficult to maintain; Human Evaluation is therefore taken as the baseline for judging it. Human Evaluation can be used with different parameters to judge the quality of sentences.

Keywords: Human Evaluation, Adequacy, Fluency, Comprehensibility.

1 Introduction

We live in a trans-lingual society where people from different linguistic backgrounds live together. It is practically impossible for an individual to speak all these languages, and in this context MT proves to be a savior. Research work in MT depends heavily on the results of its evaluation. Since Machine Translation is an automated process, the system will not necessarily give us accurate results; to know the accuracy of the results, evaluation is required. We can evaluate translations in two ways: Human Evaluation and Automatic Evaluation. In Human Evaluation, a source language text and its target language translation are compared by an expert on the basis of certain parameters. Since human effort is a valuable resource, this approach is costly and time consuming.

Human Evaluation is also based on subjective judgment. Despite these shortcomings, humans are the ultimate end users of the output. For Automatic Evaluation, many metrics have been proposed which are fast and cheap; such metrics should be consistent and reliable. They compare the Machine Translated output with a human reference translation and check the closeness of the result, and the closest result is considered the best output. In this paper, we describe the results of Human Evaluation using some scale-based parameters; these scales rate the Machine Translation output. In this process, we focus on the English-Urdu language pair.
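As a minimal illustration of how such a metric works (this is our own sketch, not the evaluation setup used in this paper), the following snippet scores one hypothesis against one reference with BLEU. It assumes the NLTK library is installed, and the romanised Urdu tokens are invented for the example.

```python
# Minimal sketch of automatic evaluation with BLEU (illustration only).
# Assumes NLTK is installed (pip install nltk); tokens below are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "jaipur rajasthan ki rajdhani hai".split()    # human reference translation
hypothesis = "jaipur rajasthan ka rajdhani hai".split()   # machine translated output

# BLEU counts n-gram overlap between hypothesis and reference;
# smoothing keeps short sentences from scoring zero on higher n-grams.
score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")   # closer to 1.0 means closer to the reference
```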

2 Related Work

Many researchers have analyzed the quality of MT output and have described methods and approaches for evaluation. John S. White et al. [1] proposed a framework for the evaluation of machine translation, using evaluation methodologies that minimize the subjectivity and heterogeneity problems. Snover et al. [2] presented a study of translation edit rate (TER) with targeted human annotation; their approach employs human annotation to make TER a more accurate measure of translation quality. Bremin et al. [3] describe methods for human evaluation of machine translation, namely automatic metrics, human error analysis, reading comprehension and eye tracking. Wojak and Graliński [4] reported the Matura evaluation experiment based on human evaluation of machine translation, presenting a web-based system for human evaluation of machine translation. Vilar et al. [5] proposed human evaluation of machine translation through binary system comparison, a novel evaluation scheme based on the direct comparison of two sentences at a time by human judges. Denkowski and Lavie [6] examined how to choose the right evaluation for machine translation, studying the motivation, design, annotator performance and practical results of several types of human evaluation tasks. Adly and Al Ansary [7] evaluated an Arabic machine translation system based on the interlingua approach, the Universal Networking Language (UNL) system, which is designed for multilingual translation.

3 Evaluation of English to Urdu MT

In this section, we emphasize significant studies of MT evaluation and the importance of evaluation in the research domain of MT. Each country has its own languages, and these languages have their own structures. India, for example, has 22 constitutional languages, one of which is Urdu. In this paper we focus on the evaluation of English to Urdu Machine Translation. After obtaining the translated output, we examine its quality, that is, how good, bad or reliable it is. Through this evaluation we can gauge the progress of Machine Translation in this area. Several dimensions have been identified in Machine Translation evaluation:

1. Human Evaluation vs. Automatic Evaluation.
2. Quality assessment at sentence level vs. system level vs. task-based evaluation.
3. Black box vs. glass box evaluation.
4. Evaluation for external validation vs. contrastive comparison of different MT systems.

Human Evaluation is based on adequacy and fluency. Adequacy is the extent to which the meaning of the source text is rendered in the translated text. Fluency is the extent to which the target text appeals to native speakers in terms of well-formedness, grammatical correctness, absence of misspellings, adherence to common language usage of terms and meaningfulness within the context [1]. Automatic metrics offer a quick, easy and cheap way to gauge the quality of MT output. Many metrics have been proposed for automated evaluation, among them BLEU, NIST, METEOR, ATEC and AMBER, of which BLEU and NIST are highly compatible with English to Urdu Machine Translation evaluation. Hutchins and Somers provide a survey of evaluation methods, namely quality assessment in terms of accuracy, intelligibility and style; error analysis; and benchmark tests. Black box vs. glass box is another specific dimension of Machine Translation evaluation. In black box evaluation, the evaluator has access only to the input and output of the system under evaluation. In glass box evaluation, the evaluator also has access to the inner workings of the system and can thus assess each subpart of the system. There are a number of typologies of MT evaluation approaches currently in use. White [1], whose work is based on that of Arnold et al. [9] and augmented by the models of Vasconcellos [10], organises MT evaluation into six main types (Marian Flanagan [8]):

- Declarative evaluation: measures the ability of an MT system to handle texts representative of an actual end-user, White [1].
- Operational evaluation: addresses the question of whether an MT system will actually serve its purpose in the context of its operational use or, if focusing on cost in particular, determines the cost-effectiveness of an MT system in the context of a particular operational environment.
- Feasibility evaluation: provides measures of interest to researchers and the sponsors of research as to whether the system has any actual potential for success after further research and implementation.
- Internal evaluation: occurs on a continual or periodic basis in the course of research and/or development to test whether the components of a system work as intended.
- Usability evaluation: measures the ability of a system to be useful to people whose expertise lies outside MT.
- Comparison evaluation: measures some attribute of a system against the same attribute of other systems.

4 Human Evaluation

The goal of a Machine Translation system is to produce results as close as possible to those of human translation, and a good system correlates well with the translation quality judgments of human judges. To evaluate the translation quality of any sentence, we need human evaluation. In human evaluation, two sentences are directly compared at a time by the human judges. Human evaluators use numerical scales for judging the quality of Machine Translation output, and these judgments are basically categorized on the basis of adequacy and fluency.

- Adequacy: evaluators judge how much of the meaning expressed in the reference translation is also expressed in the translation hypothesis, using the following scale: 5. All, 4. Most, 3. Much, 2. Little, 1. None.
- Fluency: evaluators judge the well-formedness of a translation hypothesis in the target language, without regard to sentence meaning, using the following scale: 5. Flawless, 4. Good, 3. Non-native, 2. Disfluent, 1. Incomprehensible.

Fig. 1 shows the human evaluation process for Machine Translation output.

Fig. 1. Human Evaluation Process
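A small sketch of how such adequacy and fluency judgments could be recorded and averaged is given below. This is our own illustration of the two scales above, not the authors' evaluation tool, and the individual judgment values are hypothetical.

```python
# Sketch: recording and averaging adequacy/fluency judgments on the
# 5-point scales defined above. The individual judgments are hypothetical.
from statistics import mean

ADEQUACY = {5: "All", 4: "Most", 3: "Much", 2: "Little", 1: "None"}
FLUENCY = {5: "Flawless", 4: "Good", 3: "Non-native", 2: "Disfluent", 1: "Incomprehensible"}

# one (adequacy, fluency) pair per evaluated sentence, assigned by a human judge
judgments = [(4, 3), (2, 2), (5, 4)]

avg_adequacy = mean(a for a, _ in judgments)
avg_fluency = mean(f for _, f in judgments)
print(f"adequacy = {avg_adequacy:.2f} ({ADEQUACY[round(avg_adequacy)]})")
print(f"fluency  = {avg_fluency:.2f} ({FLUENCY[round(avg_fluency)]})")
```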

In this paper, a human expert evaluated the English to Urdu machine translated output. For translation, we use test corpora drawn from the health and tourism domains. For these corpora, we consider three different translation systems: Google, Babylon and Ijunoon. The outputs obtained from these translation systems are given to the human expert for sentence-level evaluation. This sentence-level evaluation depends on the scales and parameters proposed by Joshi et al. [11] for human evaluation. The scales are as follows:

1. Not Acceptable
2. Partially Acceptable
3. Acceptable
4. Perfect
5. Ideal

The parameters are as follows:

1. Translation of gender and number of the noun(s).
2. Translation of tense in the source sentence.
3. Translation of voice in the source sentence.
4. Identification of the proper nouns.
5. Use of adjectives and adverbs corresponding to the nouns and verbs in the source sentence.
6. Selection of proper words / synonyms.
7. The sequence of noun, helping verb and verb in the translation.
8. Use of punctuation signs in the translation.
9. Maintaining the stress on the significant part of the source sentence in the translation.
10. Maintaining the semantics of the source sentence in the translation.

We assigned numerical scores to the Urdu output obtained from each MT system under the ten parameters mentioned above and then calculated the average of these scores; the average shows the closeness of the MT output to the reference translation. For this we used the three Machine Translation systems. First, we translated a corpus of 1000 English sentences into Urdu using the Google MT system and scored the output under the ten parameters; the corpus was then translated by the other two Machine Translation systems, Babylon and Ijunoon. After obtaining the numerical scores for the machine translated Urdu output, we can estimate the approximate accuracy. Tables 2 and 3 below show the accuracy as numerical scores and as percentages.
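To make the scoring procedure concrete, the sketch below averages the ten per-parameter scores of a single translated sentence. The parameter labels abbreviate the list above and the scores themselves are invented, so this is only an illustration of the described method, not the authors' actual scoring code.

```python
# Sketch of the sentence-level scoring described above: a human expert
# assigns one score per parameter and the sentence score is their average.
# Parameter labels abbreviate the ten parameters; scores are invented.
from statistics import mean

PARAMETERS = [
    "gender/number", "tense", "voice", "proper nouns", "adjectives/adverbs",
    "word choice", "noun/helping verb/verb order", "punctuation",
    "stress", "semantics",
]

def sentence_score(scores):
    """Average the per-parameter scores assigned by the human evaluator."""
    assert len(scores) == len(PARAMETERS)
    return mean(scores)

# hypothetical per-parameter scores for one Google-translated sentence
google_scores = [2, 2, 1, 2, 1, 2, 1, 3, 2, 1]
print(f"average sentence score = {sentence_score(google_scores):.2f}")
```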

For example, consider the following sentence translated by Google, Babylon and Ijunoon.

Source text: Jaipur is known as pink city and capital of Rajasthan.

MT output:

Google: جے پور راجستھان کے گلابی شہر اور دارالحکومت کے طور پر جانا جاتا ہے
(Jaipur rajasthan ke gulabi shahar aur daaralhukumat ke taur par jana jata hai)

Babylon: جئے پور کو ہولنا شہر اور راجستھان کے دارالحکومت
(jaipur ko holna shahar aur rajasthan ke daralhukumat)

Ijunoon: جئے پور ہے جانا جس طرح آراستہ کرنا شہر اور راجستھان کا اصل
(jaipur hai jana jis tarah aarastah karna shahar aur rajasthan ka asal)

These MT outputs are evaluated according to the parameters and scale above, and the quality of each output is analysed with the numerical scale and as a percentage in the table below.

Table 1. Evaluation of MT Output

From the above table, we observe that Google provides the highest score for the given input, with an accuracy of 2 scale points. These scale points are then converted into percentages: a score of 2, an average score on the 0 to 4 scale, corresponds to 50% accuracy. Similarly, Babylon obtains a score of 0.7 on the 0 to 4 scale, which after conversion gives 17.50% accuracy. Finally, the output from Ijunoon scores 0.9, which means it is 22.50% accurate. We then evaluated the whole corpus with these three Machine Translation systems, which gives the scores reported below in Table 2 and Table 3.

Table 2. Accuracy as numerical scores for MT output

Parameter    Google    Babylon   Ijunoon
1            2.0423    1.8577    1.7462
2            2.0845    1.9465    1.7482
3            1.3826    1.0413    1.1319
4            1.4813    1.0837    1.2356
5            1.2575    1.0070    1.0161
6            2.1953    1.6498    1.9063
7            1.2034    1.0090    1.0080
8            2.5050    1.6468    1.9919
9            1.4179    1.0413    1.0503
10           1.1470    1.0030    1.0020
Total        1.6717    1.3286    1.3836

Table 3. Accuracy in percentage for MT output

Parameter    Google %   Babylon %   Ijunoon %
1            40.85      37.23       34.92
2            41.69      38.93       34.96
3            27.65      20.82       22.63
4            29.62      21.67       24.71
5            25.15      20.14       20.32
6            43.90      32.99       38.12
7            24.06      20.18       20.16
8            50.10      32.93       39.83
9            28.35      20.82       21.00
10           22.94      20.06       20.04
Total        33.43      26.57       27.67
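The percentage values in Table 3 appear to be the Table 2 scores mapped from the 5-point scale onto 0-100% (i.e. multiplied by 20). The short sketch below reproduces the total rows under that assumption, which is our reading of the tables rather than a formula stated explicitly above.

```python
# Sketch: converting the average scores of Table 2 into the percentages of
# Table 3, assuming percentage = score / 5 * 100 (our reading of the tables).
totals = {"Google": 1.6717, "Babylon": 1.3286, "Ijunoon": 1.3836}  # Table 2, "Total" row

for system, score in totals.items():
    percentage = score / 5 * 100
    print(f"{system}: {percentage:.2f}%")  # 33.43%, 26.57%, 27.67% as in Table 3
```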

As we can observe from these tables, the Google translator provides better accuracy than the others, approximately 33.43% over the whole corpus. Its accuracy for the 1st and 2nd parameters is 40.85% and 41.69% respectively, and it is most accurate for the 8th parameter, with 50.10% accuracy. It has also been observed that the Google translator gives correct output for short sentences, but when a long or complex sentence is given, it is unable to maintain the structure and meaning of the sentence. Babylon achieves 26.57% accuracy on our corpus. The Babylon translator provides average translated output but does not provide better results than Google; it even leaves some source words untranslated, i.e. the source word is simply kept as it is, and it does not preserve the meaning of long sentences. The Ijunoon translator, on the other hand, is used only for English to Urdu machine translation and gives more accurate output than Babylon: Ijunoon is 27.67% accurate over the complete corpus.

5 Conclusion

We have presented an approach for the evaluation of English to Urdu Machine Translation, in which Machine Translation outputs are evaluated with the help of human evaluators. It shows that, among the three MT engines, the Google translator performs better than the others over the whole corpus. The second best MT engine was Ijunoon, as it could at least translate some complex sentences and preserve the meaning of the source sentence in the system-generated target sentence.

6 References

1. John S. White, Theresa A. O'Connell, and Lynn M. Carlson. Evaluation of Machine Translation. In Proceedings of the Workshop on Human Language Technology, pp. 206-210. (1993)
2. Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, pp. 223-231. (2006)
3. Sofia Bremin, Hongzhan Hu, Johanna Karlsson, Anna Prytz Lillkull, Martin Wester, Henrik Danielsson and Sara Stymne. Methods for Human Evaluation of Machine Translation. Small 14, pp. 55-67. (2010)
4. Aleksandra Wojak and Filip Graliński. Matura Evaluation Experiment Based on Human Evaluation of Machine Translation. In Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 547-551. (2010)
5. David Vilar, Gregor Leusch, Hermann Ney and Rafael E. Banchs. Human Evaluation of Machine Translation Through Binary System Comparisons. In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 96-103. (2007)
6. Michael Denkowski and Alon Lavie. Choosing the Right Evaluation for Machine Translation: an Examination of Annotator and Automatic Metric Performance on Human Judgment Tasks. In Proceedings of AMTA. (2010)
7. Noha Adly and Sameh Al Ansary. Evaluation of Arabic Machine Translation System Based on the Universal Networking Language. In H. Horacek et al. (Eds.): NLDB 2009, LNCS 5723, pp. 243-257. Springer-Verlag Berlin Heidelberg. (2010)
8. Marian Flanagan. Recycling Texts: Human Evaluation of Example-Based Machine Translation Subtitles for DVD. School of Applied Language and Intercultural Studies, Dublin City University. (2009)
9. Doug Arnold, Louisa Sadler, and R. Lee Humphreys. Evaluation: An Assessment. Machine Translation, 8 (1-2), pp. 1-24. (1993)
10. Muriel Vasconcellos. Panel: Apples, Oranges, or Kiwis? Criteria for the Comparison of MT Systems. In: MT Evaluation: Basis for Future Directions. Proceedings of a Workshop Sponsored by the National Science Foundation, San Diego, California, pp. 37-50. (1992)
11. Nisheeth Joshi, Hemant Darbari, Iti Mathur. Human and Automatic Evaluation of English-Hindi Machine Translation Systems. Advances in Computer Science, Engineering and Applications. Advances in Intelligent and Soft Computing Series, Vol. 166, pp. 423-432, Springer Verlag. (2012)