Some Improvements over the BLEU Metric for Measuring Translation Quality for Hindi
Niladri Chatterjee (a), Anish Johnson (b), Madhav Krishna (b)
(a) Department of Mathematics, Indian Institute of Technology, Hauz Khas, New Delhi – 110 016, India. Email:
[email protected]
(b) Division of Computer Engineering, Netaji Subhas Institute of Technology, Dwarka, New Delhi – 110 075, India. Email: {anish.johnson, madhkrish}@gmail.com
Abstract
The BLEU translation quality evaluation metric is a modified n-gram precision measure that uses a number of reference translations of a given candidate text in order to evaluate its translation quality. In this paper, we propose some modifications to this metric so that it suits Indian languages, more specifically Hindi. Using BLEU for Hindi presents two difficulties: the non-availability of multiple references and the prevalence of free word order in sentence construction. It is established that the validity of BLEU scores generally increases with the number of reference translations used. Further, Hindi being a free word order language, the naïve n-gram matching adopted by BLEU does not accurately predict the quality of a translated text. In our approach we have modified BLEU to address the above-mentioned shortcomings. Our proposed metric obtains a closer correlation with human judgment than BLEU while using just one reference translation.
1. Introduction IBM’s BLEU [1] uses a weighted geometric mean of modified precision scores that measure n-gram matches between a translation to be evaluated (candidate sentences) and a set of model translations (reference sentences). The precision measure is modified to penalize over-generation of correct word forms, so as to correctly combine results for multiple references. Further, a multiplicative brevity penalty is computed at the corpus level in order to compensate for the lack of recall. The resulting numeric metric is declared to represent the quality of translation of the candidate sentences, and is presented in [1] as a substitute for skilled human judges where there is need for quick or frequent evaluations. In
any case, for BLEU scores to be a valid indicator of quality, a number of human translations must be used – the ideal number should be determined empirically, as it depends upon the language pair, the domain of knowledge covered, idiomatic content, etc. As the authors of [1] claim, automatic metrics such as BLEU and NIST [2] (an n-gram based metric which uses an arithmetic n-gram average and calculates the brevity penalty differently) would be extremely useful during the iterative development cycle of an MT system, since the improvement in quality obtained by changes to the system may be easily adjudged using a fixed set of references. However, one must be more wary of using n-gram metrics as the basis for a formal evaluation of a translation, since a large amount of research on BLEU and its variants has exposed several weaknesses. Some of them are:
1) Since BLEU uses multiple reference texts, the recall measure cannot be used directly. It has been shown in [3] that the recall score for a given candidate predicts translation quality better than BLEU or any other precision measure does.
2) The baseline BLEU metric increments the overall score equally for all equal-length matches, regardless of the semantic and content worth of the matched text. A means of extending BLEU has been suggested in [4], with multiplicative weights for the statistical significance of lexical items determined from the frequency of such terms in the text.
3) BLEU fails to discern between low-quality texts, as it is essentially a geometric mean of the modified precision scores, which equals zero if any one score is zero.
4) The reliability of BLEU scores depends heavily on the number of human references used – for best results, all possible translation equivalents for a source-language sentence must be covered in the reference set. In general, the more reliability required of the BLEU score, the greater the number of reference translations to be used.
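To make weakness (3) concrete, the following minimal sketch (ours, not taken from [1]; function names are illustrative and the brevity penalty is omitted) computes BLEU-style clipped n-gram precisions against a single reference and their geometric mean: if even one n-gram order has no matches, the whole score collapses to zero.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped (modified) n-gram precision, as in BLEU, with a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def geometric_mean_score(candidate, reference, max_n=4):
    """Unweighted geometric mean of p_1 .. p_max_n (brevity penalty omitted)."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if any(p == 0.0 for p in precisions):
        return 0.0  # a single zero precision zeroes the whole score
    return exp(sum(log(p) for p in precisions) / max_n)
```

For instance, a candidate that shares every word with the reference but no contiguous 4-gram receives a score of 0.0 under a 4-gram geometric mean, indistinguishable from nonsense output.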
This paper is organized as follows. Section 2 demonstrates how and why BLEU and its variants are not suitable for judging the translation quality of a Hindi document, due to the syntactic properties of the language itself. Section 3 presents the basis of our methodology for the development of a translation quality measurement system for Hindi, which may be extended for use with many other Indian languages. The actual implementation of this method is briefly described in Section 4, while our experiment is explained in Section 5. Finally, Section 6 discusses how our work addresses the drawbacks of BLEU discussed above and presents some conclusions drawn from the results. It also lists possible future improvements that may be made to our system.
2. Naïve N-gram Metrics vis-à-vis Hindi MT Evaluation
Hindi, like other Indian languages, has limited electronic resources and few professional translators. It is also a relatively free word order language, in which two sentences may convey the same meaning using the same words in a different order. In most cases, this variation in syntax is utilised to emphasize particular words rather than to change the meaning of the sentence. Given the generic drawbacks of BLEU listed in Section 1, the scarcity of reference translations, and the fact that several word orders are valid, BLEU scores may vary widely for texts that are semantically close, and the algorithm is therefore not at all suited to judging the translation quality of Hindi texts. One such example is given below:
A1. jo machhlee Mohan ne khayee thee, usne maindak khaayaa thaa. (-the one- fish Mohan -doer- eaten had, -that one-+doer- frog eaten had)
A2. Mohan ne jo machhlee khayee thee, usne maindak khaayaa thaa. (Mohan -doer- -the one- fish eaten had, -that one-+doer- frog eaten had)
A3. jo Mohan ne khayee thee, us machhlee ne maindak khaayaa thaa. (-the one- Mohan -doer- eaten had, -that one- fish -doer- frog eaten had)
A4. jo Mohan ne machhlee khayee thee, usne maindak khaayaa thaa. (-the one- Mohan -doer- fish eaten had, -that one-+doer- frog eaten had)
A5. maindak us machhlee ne khaayaa thaa jo Mohan ne khayee thee. (frog -that one- fish -doer- eaten had -the one- Mohan -doer- eaten had)
The above sentences, while using the same words (or word root forms) in different orders, convey the same gross meaning and would all be translated into English as "The fish that Mohan ate had eaten a frog." The BLEU scores obtained by sentences A2 to A5, with A1 as the reference and with varying maximum n-gram lengths considered, are shown in Figure 1.
Figure 1. BLEU scores with A1 as reference
It can be seen from Figure 1 that even though sentences A2 to A5 are all grammatically correct and equivalently transfer the meaning "The fish that Mohan ate had eaten a frog", they receive widely varying BLEU scores. Specifically, sentences A2 and A5, which differ markedly in word order from A1, receive low BLEU scores that fall even lower as the length of the n-grams considered during matching increases. Though the above example is just one sentence, and admittedly constructed, it does give an idea of what can go wrong when using BLEU as a guide to translation quality. Most grammatically correct sentences in Hindi (or other Indian languages) similarly have several permutations of word order that are syntactically valid. Though a few of these might be preferred over others for reasons of style and common usage, all can be used depending on which part of the sentence is to be stressed. If one were to use BLEU to estimate the translation quality of a text translated into Hindi, one would require a number of reference translations in order to account for the numerous variations in word order that are possible in a Hindi sentence. This is because higher-order n-gram matching tries to impose a certain word order on a given candidate translation; the longer the n-grams considered during matching, the lower the score given to a sentence that uses a valid word order different from the reference translation's. As mentioned earlier, with limited resources such as professional translators and/or linguists, a large number of references would be difficult to arrange for a language like Hindi.
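The effect is easy to reproduce. The short, self-contained sketch below computes the BLEU-style clipped n-gram precisions of A5 against A1 for n = 1 to 4 (using our own transliteration-based tokenization, not the exact setup used to produce Figure 1): the unigram precision is high, but the precisions fall steadily as n grows, dragging the geometric-mean score down.

```python
from collections import Counter

A1 = "jo machhlee Mohan ne khayee thee usne maindak khaayaa thaa".split()
A5 = "maindak us machhlee ne khaayaa thaa jo Mohan ne khayee thee".split()

def clipped_ngram_precision(candidate, reference, n):
    """BLEU-style modified n-gram precision against a single reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(c, ref[g]) for g, c in cand.items())
    return matched / max(sum(cand.values()), 1)

# Precision drops sharply with n even though A1 and A5 share almost every word.
for n in range(1, 5):
    print(n, round(clipped_ngram_precision(A5, A1, n), 2))
```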
The METEOR metric [3] partially handles some of the above problems. METEOR computes a score using a combination of unigram precision, unigram recall and a measure of fragmentation; the last component penalizes a candidate sentence for lack of word-order alignment with its corresponding reference sentence. Though this is suitable for languages with a highly positional grammar, such as English, the assumption of a fixed unigram word order is not entirely valid for a free word order language like Hindi. In the following sections, we instead attempt to account differently for this property of free word order as observed in Hindi.
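For reference, here is a compact sketch of this METEOR-style combination as we understand the formulation in [3] (single reference, exact unigram matching only; how the matched unigrams and chunks are computed is abstracted away and assumed as inputs):

```python
def meteor_like_score(matched, cand_len, ref_len, chunks):
    """Combine unigram precision, unigram recall and a fragmentation penalty.

    matched - number of candidate unigrams aligned to the reference
    chunks  - number of contiguous runs those matched unigrams form
    """
    if matched == 0:
        return 0.0
    precision = matched / cand_len
    recall = matched / ref_len
    fmean = 10 * precision * recall / (recall + 9 * precision)  # recall-weighted F-mean
    penalty = 0.5 * (chunks / matched) ** 3                     # fragmentation penalty
    return fmean * (1 - penalty)
```

A perfectly ordered candidate forms a single chunk and is barely penalized, while a reordered one forms many chunks and is penalized heavily; for Hindi, however, such reordering is often perfectly grammatical, which is exactly the limitation discussed above.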
3. Word Order in Hindi
3.1. Groups of Words with Fixed Internal Order
In Hindi, certain groups of words can be identified in which the words occur in a fixed order within the particular group [6, 7], while the groups themselves may permute amongst one another in a given sentence. These have been identified as the following:
a) Noun/Pronoun + Parsarg (post-positional marker): Post-positional markers (parsargs) play many important roles in almost all Hindi sentences, where they generally serve to indicate the meaning-roles of the nominals that they follow. For example, in sentences A1 to A5, in the word sequence "Mohan ne", "Mohan" is the noun (here, the name of a person) and "ne" is the parsarg that specifies Mohan as the doer of the action, or the kartaa. Parsargs are not all fixed in the sense they convey – while ne is used exclusively to designate the doer, ko can indicate a recipient, a doer, or a relation, depending on the construction of the sentence. Note that in Hindi, pronouns are usually conflated with the following post-positional marker to give the actual form of the pronoun, as is the case with "usne", formed of "us" and "ne", in the same set of sentences. Also, some pronoun forms change completely when conflated with a parsarg, e.g. main + ko (I + -recipient-) = mujhe.
b) Verb group: Verbs in Hindi usually occur in sets of two or three words, and a Hindi sentence almost always ends with a verb sequence. In the sentences A1-A5, the verb groups "khaayaa thaa" and "khayee thee" occur in all the sentences, though in different positions. Such verb groups usually contain a main verb that primarily expresses the action, along with auxiliary verbs that indicate tense, aspect and modality (TAM).
c) Adjective Group + Noun/Pronoun: Adjectives usually come before the noun/pronoun which they qualify. For instance, consider the following translated sentence pair:
chhote gareeb ladke ko kuchh de do. (small poor boy -recipient- something give) ~ Give the small poor boy something.
Here, the adjectives "chhote" and "gareeb" precede the inflected noun form "ladke".
d) Adverb + Verb sequence: Adverbs almost always precede a verb sequence. For example, in the following sentence "jaldi" precedes the verb sequence "kar sakta hoon":
main yeh kaam jaldi kar sakta hoon. (I this work quickly do can) ~ I can do this work quickly.
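As a rough illustration of how such fixed-order groups might be detected from a POS-tagged, morphologically analysed sentence, consider the sketch below. The tag names, the parsarg and auxiliary lists, and the chunk_groups function are our own simplifications for exposition, not the output format of the analyser used in Section 4; interactions between groups (e.g. adjective + noun + parsarg) and the Adverb + Verb case are handled analogously and omitted here.

```python
# Each token is (surface_form, coarse_pos); tags and word lists are illustrative only.
# Conflated pronoun forms such as "usne" are assumed to have been split already.
PARSARGS = {"ne", "ko", "se", "mein", "par", "kaa", "kee", "ke"}
AUX_TAM = {"thaa", "thee", "hai", "hoon", "sakta", "saktee", "rahaa", "rahee"}

def chunk_groups(tokens):
    """Group a tagged Hindi sentence into fixed-order units like those of Section 3.1."""
    groups, i = [], 0
    while i < len(tokens):
        word, pos = tokens[i]
        # a) Noun/Pronoun followed by a parsarg
        if pos in ("NOUN", "PRON") and i + 1 < len(tokens) and tokens[i + 1][0] in PARSARGS:
            groups.append(("NP+PARSARG", [word, tokens[i + 1][0]]))
            i += 2
        # b) main verb plus trailing auxiliaries carrying TAM information
        elif pos == "VERB":
            seq = [word]
            while i + 1 < len(tokens) and tokens[i + 1][0] in AUX_TAM:
                i += 1
                seq.append(tokens[i][0])
            groups.append(("VERB_GROUP", seq))
            i += 1
        # c) adjectives preceding the noun they qualify
        elif pos == "ADJ":
            seq = [word]
            while i + 1 < len(tokens) and tokens[i + 1][1] == "ADJ":
                i += 1
                seq.append(tokens[i][0])
            if i + 1 < len(tokens) and tokens[i + 1][1] == "NOUN":
                i += 1
                seq.append(tokens[i][0])
            groups.append(("ADJ+NOUN", seq))
            i += 1
        else:
            groups.append(("UNIGRAM", [word]))
            i += 1
    return groups
```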
3.2. Correspondence with Translation Quality Measures
The above-mentioned groups are the most important ones that can be identified in Standard Hindi. Almost all permutations of these groups in a sentence are acceptable equivalents conveying the same gross meaning. This imposes some sort of order upon a valid Hindi sentence and can therefore be used to identify grammatically acceptable sentences, corresponding to the known measure of translation fluency. All these groups, and especially the first two – Noun/Pronoun + Parsarg and Verb Group – also play more important roles. These groups are integral to conveying important information in a sentence: how, who did what, to whom, where, when, etc. So the unchanged presence of these groups is almost guaranteed across translations equivalent in information content, allowing for synonymic replacement and idiomatic usage. This is a consequence of the "karaka theory" of the "Paninian Grammar Framework" (PGF) [7], which, when applied to Hindi, maintains that this external structural information is also key to the deeper semantic sense of a sentence. With the use of the Verb Group and the Noun + Parsarg group (and the other two groups, to a lesser extent) we hope to achieve a "syntacto-semantic" matching in an indirect manner, thus accounting for translation adequacy as well as translation informativeness. Based on their frequency of occurrence and relevance to the above measures, we propose to collect data for the text under three categories of word groupings:
a) Ungrouped unigrams
b) Noun + Parsarg Group and Verb Group
c) All remaining groups as identified in Section 3.1
Each category contributes to the overall score with its own weight. The appropriate weights need to be chosen depending upon the content and style of the text to be translated and of the reference text; for example, a higher incidence of idioms may warrant a higher Category (a) weight. After several trials and tests, we found the most suitable weights for the three categories to be (a) 0.70, (b) 0.25 and (c) 0.05. However, more experiments need to be carried out before suggesting this as the optimal weighting scheme for this purpose.
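A minimal sketch of how these category weights might combine per-category scores into a single number is given below; the weights are the ones quoted above, while the function name and the assumption that each category has already been reduced to a score in [0, 1] are ours.

```python
# Category weights from Section 3.2: (a) ungrouped unigrams, (b) Noun+Parsarg and
# Verb groups, (c) all remaining groups.
CATEGORY_WEIGHTS = {"a": 0.70, "b": 0.25, "c": 0.05}

def aggregate_score(category_scores):
    """Weighted arithmetic mean of the per-category scores."""
    return sum(CATEGORY_WEIGHTS[c] * category_scores[c] for c in CATEGORY_WEIGHTS)

# Hypothetical example: strong unigram overlap, weaker group-level matches.
print(aggregate_score({"a": 0.62, "b": 0.48, "c": 0.30}))
```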
4. The Implementation
We propose a translation quality metric for Hindi that identifies these categories of word groupings in a sentence and performs matching amongst corresponding groupings in another sentence using the following steps:
1) Two texts, one each from the candidate and reference documents, both Hindi translations of the same English passage, were sentence and phrase aligned. For the present work, this alignment was performed manually.
2) Both texts were read concurrently into linked-list structures representing two aligned sentences (R and C respectively). Using a Hindi morphological analyzer (currently, the one developed by LTRC at IIIT-Hyderabad, http://ltrc.iiit.net/onlineServices/morph/hin_morph.zip), additional information, such as each analysed word's stem, part-of-speech, gender, number, TAM label (for a verb), etc., was generated and added to the lists.
3) The sentences were run through a few rule-based layers, some as suggested in [7] and others developed as heuristic measures, to catch and remove contextually incorrect morphological analyses of a word, based on Standard Hindi grammar and usage [8].
4) Once the rules (based on identifying and tagging the word groups listed in Section 3.1) were run on the lists, the two sentences were examined to find the common word groups as described earlier, using different levels of matching:
(a) Compatible group constituent matching: for all word groups identified, the constituents of compatible word groups in R and C were tested for a match. A direct word match contributed 1.0 (100%), while a successful root/stem match contributed 0.7 (70%) to the individual match score.
(b) Word group matching: after (a), the word groups themselves were tested for a match. For a word group of length N, with constituents' matched score S (the sum of the individual match scores of the constituent unigrams), the group was declared matched if a minimum threshold was crossed; we required S >= 0.4*N (that is, at least 40 percent of the maximum possible score, taking all constituent words into account) before declaring a successful word group match. Also, some groups were not declared matched unless the main content word was matched as well, such as the noun when matching adjective-noun groups, or the main verb in verb sequences.
(c) Naïve unigram matching: while (a) and (b) were first carried out only for those words that are part of a word group in either R or C, so as to maximize the group match score, the remaining words – those not part of any group, or not matched in (a) – were naively matched with each other, with individual match score contributions as in (a).
Hence, instead of naïve n-gram sequence matching, only the compatible word groups identified in R and C were matched, where a match depends on whether the constituent words match, either as words or as stems, those making up the equivalent word group in C.
5) Once all the words/groups were matched to the highest possible extent, the overall Recall (|R∩C| / |R|) and Precision (|R∩C| / |C|) scores for the sentence pair were computed separately for the three categories of word groupings.
6) The same procedure was repeated for all the aligned sentences in both texts, with running recall and precision scores maintained for all three categories. Thus, for the entire text, aggregate recall and precision scores were obtained for the three categories of groupings.
7) A fragmentation penalty, the same as the one used in [3] (Pen = 0.5 * (#chunks / #matched unigrams)^3), was calculated for the Category (a) scores (unigrams). Dependent on the deviations from the reference word order among the matched candidate unigrams, this penalizes the overall score contribution from unigram matches, subject to the matches of longer sequences from the reference; this penalty also helps handle words that cannot be put into any of the groups identified in Section 3.1.
8) The overall precision and recall scores for the document were used to calculate three separate F-means (Fmean = (1+α)PR / (P+αR), where P = Precision and R = Recall; we set α = 9, as recommended in [9]).
9) The F-mean for Category (a) was then modified as follows:
Pen = Pen × (1 − Fmean(b)) × (1 − Fmean(c))
Fmean(a) = Fmean(a) × (1 − Pen)
The penalty was applied only to Fmean(a), as this helped reduce the unigram contribution to the total score, dependent on unigram word order, especially in cases where the score contributions of Categories (b) and (c) were abnormally low, as in simple sentences, or with nonsense candidate output that contains the same words as the reference but in a random, meaningless sequence (see the sketch following this list).
10) As explained at the end of Section 3.2, the three categories of matches contribute differently to translation quality, and hence the weights mentioned there were used with the individual F-mean scores to obtain our final score, an aggregate F-mean, calculated as the weighted arithmetic mean.
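The following sketch pulls together the group-match threshold of step 4(b) and the penalty and F-mean adjustments of steps 7-9. The function names and data layout are ours, and the formulas are transcribed from the steps above rather than from the actual implementation; stems_match is an assumed helper, and the additional main-content-word requirement of step 4(b) is omitted for brevity.

```python
ALPHA = 9  # recall weighting used in the F-mean, as in step 8

def constituent_score(cand_words, ref_words, stems_match):
    """Sum of individual match scores for a pair of compatible, aligned word groups.

    stems_match(c, r) is assumed to return True when two words share a root/stem.
    A direct word match contributes 1.0, a stem-only match 0.7 (step 4a).
    """
    score = 0.0
    for c, r in zip(cand_words, ref_words):
        if c == r:
            score += 1.0
        elif stems_match(c, r):
            score += 0.7
    return score

def group_matched(cand_words, ref_words, stems_match):
    """Step 4(b): the group matches if the constituent score reaches 40% of its maximum."""
    n = len(ref_words)
    return constituent_score(cand_words, ref_words, stems_match) >= 0.4 * n

def fmean(precision, recall, alpha=ALPHA):
    """F-mean as given in step 8: (1 + alpha) * P * R / (P + alpha * R)."""
    if precision + alpha * recall == 0:
        return 0.0
    return (1 + alpha) * precision * recall / (precision + alpha * recall)

def adjusted_unigram_fmean(fmean_a, fmean_b, fmean_c, chunks, matched_unigrams):
    """Steps 7 and 9: fragmentation penalty, scaled down when group matches are strong."""
    if matched_unigrams == 0:
        return 0.0
    pen = 0.5 * (chunks / matched_unigrams) ** 3
    pen = pen * (1 - fmean_b) * (1 - fmean_c)
    return fmean_a * (1 - pen)
```

Note how the (1 − Fmean(b)) and (1 − Fmean(c)) factors shrink the fragmentation penalty when group-level evidence is strong, so word-order deviations are penalized mainly when they are not backed by matched Noun + Parsarg, Verb or other groups.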
5. Our Experiment
We used an English text consisting of 150 sentences ranging from simple to complex in structure. Four human translations of this text into Hindi were obtained, henceforth referred to as T1, T2, T3 and T4.
To ascertain the accuracy of an evaluation system, we determined the Pearson correlation coefficient between the set of translation quality scores given by an automatic system and those given by human evaluators. This method is widely used by other researchers (e.g. [1], [3]), as well as in the official DARPA/TIDES evaluation program; it measures the closeness of the automated scores to scores given by humans, who may be regarded as experts in the translation quality evaluation domain. To this end, seven bilingual (English and Hindi) speakers compared the sentences of the four translations with their English equivalents and rated them individually on a 1-5 scale for readability and a 1-5 scale for adequacy (degree of information transfer). Each point on the two scales was given an adequate description so as to make the rating process as objective as possible. The two scores assigned to a sentence were then combined to give an overall score out of ten. The human-assigned scores were averaged and the following ranking of the translations was determined: T1, T2, T3 and T4, arranged from highest to lowest average human score. The correlation values between the automated systems (BLEU and our metric) and the human-assigned scores, while considering a different translation as the reference each time, are given in Table 1.

Table 1. Correlation values
Reference     T3         T2         T1         T4
BLEU          0.880485   0.953138   0.575762   0.341664
OUR METRIC    0.990678   0.958197   0.779016   0.690431
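The correlation computation itself is straightforward. Below is a sketch under the assumption that metric scores and averaged human scores are available as parallel lists; the sample numbers are placeholders for illustration, not the data behind Table 1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Placeholder example: metric scores vs. averaged human scores (out of 10).
metric_scores = [0.47, 0.41, 0.36, 0.33]
human_scores = [7.8, 7.1, 6.4, 5.9]
print(round(pearson(metric_scores, human_scores), 3))
```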
As can be inferred from Table 1, our metric comes closer to human judgment of translation quality than BLEU does while using just one reference translation, and may therefore be considered an improvement over it. Admittedly, though, the correlation values for both systems depend upon the choice of reference translation. The average correlation values for BLEU and for our metric are 0.6877623 and 0.8545803 respectively – an improvement of more than 16 percentage points.

Table 2. Translation rankings (best to worst)
Reference     T1            T2            T3            T4
BLEU          T3, T2, T4    T1, T3, T4    T1, T2, T4    T1, T3, T2
OUR METRIC    T2, T3, T4    T1, T3, T4    T1, T2, T4    T1, T2, T3
In Table 2, the rankings as assigned to the translations by BLEU and by our metric are shown (ordered from highest to lowest score), while considering a different text as a reference translation each time; of course, this
text has not been included in the ranking, as it would be assigned the highest possible score by both metrics. While the rankings assigned by our metric are consistent across all combinations, those assigned by BLEU show an inconsistency in the relative ranking of the pair T2 and T3, the latter being rated higher than the former on two occasions. From Tables 1 and 2, it can be gauged that our metric performs better than BLEU in relation to human judgments when using a single reference, in terms of both correlation with average scores and subjectively assigned rankings, regardless of the choice of reference text. Further, the mean scores assigned by our metric to the various translations are given in Table 3.

Table 3. Scores assigned by our metric
Candidate     Reference T1   Reference T2   Reference T3   Reference T4
T1            -              0.445          0.494          0.410
T2            0.466          -              0.437          0.390
T3            0.454          0.367          -              0.320
T4            0.371          0.336          0.309          -
Table 4. Paired t-test (p-values)
              Reference T3    Reference T2    Reference T1    Reference T4
T1-T3