Automatic Evaluation of Machine Translation Quality
Cyril Goutte (for the XRCE Multilingual Technology team)
January 27, 2006

1 Evaluating Translation Quality

Any scientific endeavour must be evaluated in order to assess its correctness. In many applied sciences it is necessary to check that the theory adequately matches actual observations. In Machine Translation (MT), evaluation serves two purposes: relative evaluation allows us to check whether one MT technique is better than another, while absolute evaluation gives an absolute measure of performance, eg a score of 1 for a perfect translation.

A peculiarity of MT evaluation is that designing a proper evaluation measure is in itself a difficult problem. In many fields there are obvious, measurable performance metrics, for example the difference between a prediction and an actual observed outcome. Because natural language is neither exact nor unambiguous, and because its structure is complicated, estimating whether a translation is correct, or how far it is from being correct, is much more difficult. Two entirely different sequences of words (a.k.a. sentences) may be fully equivalent, while two sequences that differ only in a small detail can have entirely different meanings. Traditionally, the two main dimensions for evaluating translations are adequacy (does the translation convey the same meaning as the source text?) and fluency (is the translation grammatically correct?).

Ideally, with no time or money constraints, MT output would be judged by humans to provide an idea of the system's performance. Obviously this is rarely feasible, hence the need for cheap and fast ways to evaluate MT systems. All metrics presented below rely on a number of reference translations to which the MT output is compared. This does not mean that all texts to be translated must have reference translations – only benchmark texts do. It does mean, however, that the performance measured automatically on that benchmark may not carry over to a different body of text, especially in a different domain. In the following sections, we review a few automatic MT evaluation metrics from two families: some based on string matching, others inspired by metrics used in Information Retrieval (IR).

2 Automatic Off-Line Evaluation

2.1 String Matching Techniques

The following metrics are based on matching the (word) string produced by MT against the reference translation. The matching is based on the computation of the minimum edit (Levenshtein) distance, which identifies the minimum number of insertions, deletions and substitutions necessary to transform one string into the other.

The Word Error Rate (WER) is computed as the sum of insertions, substitutions and deletions, normalised by the length of the reference sentence. A WER of 0 means the translation is identical to the reference. One problem with WER is that this “rate” is in fact not guaranteed to lie between 0 and 1, and in some settings a wrong translation may yield a WER higher than 1. A differently normalised WER, denoted WERg [1], normalises the sum of insertions, substitutions and deletions by the length of the Levenshtein alignment path, ie insertions, substitutions, deletions and matches. The advantage of this metric is that it is guaranteed to lie between 0 and 1, where 1 is the worst case (no matches).

The Position-independent Error Rate (PER) does not take the ordering of words into account in the matching operation. It treats the translation and the reference as bags of words and computes the difference between them, normalised by the reference length.

In fact, any string comparison technique may be used to derive similar MT evaluation metrics. One such example relies on the “string kernel”, and allows various levels of matching to be taken into account, depending eg on the part-of-speech of the words, or on synonymy relations [2].

Experiments carried out in [1] showed that WERg had the best (although poor) correlation with human judgement at the sentence level. However, at the corpus level, techniques relying on n-grams have been shown to provide better correlation with human judgement. These techniques are presented below.
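As an illustration only, the sketch below computes WER, WERg and PER from a single Levenshtein alignment. Function names and the exact bag-of-words reading of PER are my own choices, not those of the references above.

```python
from collections import Counter


def levenshtein_counts(hyp, ref):
    """Dynamic-programming Levenshtein alignment between two token lists.
    Returns (insertions, deletions, substitutions, matches) along one
    minimum-cost edit path from the hypothesis to the reference."""
    m, n = len(hyp), len(ref)
    # dist[i][j] = minimum edit cost between hyp[:i] and ref[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i
    for j in range(1, n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub_cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,             # extra hypothesis word (insertion)
                             dist[i][j - 1] + 1,             # missing reference word (deletion)
                             dist[i - 1][j - 1] + sub_cost)  # match or substitution
    # Trace back one optimal path and count the individual operations.
    ins = dele = sub = match = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else 1):
            if hyp[i - 1] == ref[j - 1]:
                match += 1
            else:
                sub += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ins += 1
            i -= 1
        else:
            dele += 1
            j -= 1
    return ins, dele, sub, match


def wer(hyp, ref):
    """Word Error Rate: errors normalised by the reference length (may exceed 1)."""
    ins, dele, sub, _ = levenshtein_counts(hyp, ref)
    return (ins + dele + sub) / len(ref)


def werg(hyp, ref):
    """WERg: errors normalised by the length of the alignment path, hence in [0, 1]."""
    ins, dele, sub, match = levenshtein_counts(hyp, ref)
    return (ins + dele + sub) / (ins + dele + sub + match)


def per(hyp, ref):
    """Position-independent Error Rate: bag-of-words difference over the
    reference length (one simple reading of the definition)."""
    missing = sum((Counter(ref) - Counter(hyp)).values())
    return missing / len(ref)


if __name__ == "__main__":
    hyp = "the cat sat on mat".split()
    ref = "the cat sat on the mat".split()
    print(wer(hyp, ref), werg(hyp, ref), per(hyp, ref))
```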

2.2 IR-style Techniques

These metrics use measures inspired by Information Retrieval. In particular, the n-gram precision is the proportion of n-grams from the translation that are also present in the reference. These may be calculated for several values of n and combined in various ways.

BLEU: This metric, proposed by [8], is the geometric mean of the n-gram precisions for n = 1, ..., 4, multiplied by an exponentially decaying length penalty. This penalty compensates for short, high-precision translations such as “the”.

NIST: This metric was used in the MT evaluation rounds organised by NIST [3]. It computes the arithmetic mean of the n-gram precisions, also with a length penalty. Another significant difference with BLEU is that the n-gram precisions are weighted by the n-gram frequencies, to put more emphasis on the less frequent (and more informative) n-grams.

F-measure: Pushing the IR analogy a step further, the F-measure [7] is the harmonic mean of precision and recall. It relies on first finding a maximum matching between the MT output and the reference, which favours long consecutive (n-gram) matches. Precision and recall are then computed as the ratio of the total number of matching words over the length of the translation and of the reference, respectively.

Meteor: The Meteor evaluation system improves upon the F-measure in at least two ways: it uses some linguistic processing to match stemmed words in addition to exact matches, and it puts a lot more weight on recall in the harmonic mean [5].

Love them or hate them, BLEU and NIST are the metrics that are currently most widely used, and the ones all other MT evaluation metrics have to be compared with. The F-measure claims to provide higher correlation with human judgements [7], but this is apparently not always the case, especially for smaller segments [1]. Empirical evidence [5] suggests that putting more emphasis on recall further improves the correlation; in fact, it shows that recall alone often correlates best with human judgement, at odds with the exclusive use of precision in BLEU and NIST.
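For concreteness, here is a minimal single-sentence, single-reference sketch of a BLEU-style score: clipped n-gram precisions combined by a geometric mean, times a brevity penalty. The official BLEU implementation [8] aggregates counts over a whole corpus and handles multiple references; this sketch only illustrates the idea.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hyp, ref, max_n=4):
    """BLEU-style score for one hypothesis against one reference (sketch only)."""
    if not hyp:
        return 0.0
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or overlap == 0:
            return 0.0  # at the sentence level the geometric mean collapses to 0
        log_precisions += math.log(overlap / total)
    geo_mean = math.exp(log_precisions / max_n)
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * geo_mean


if __name__ == "__main__":
    hyp = "the cat sat on the mat today".split()
    ref = "the cat sat on the mat".split()
    print(round(bleu(hyp, ref), 3))
```

Note that at the sentence level the geometric mean collapses to zero as soon as one n-gram order has no match, which is one reason why BLEU is usually reported at the corpus level.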

2.3 Multiple References

As noted above, there may be many correct translations of a given source text. One way to partially take this into account is to provide several reference translations, produced by different translators. However, handling multiple references within an evaluation metric is not always straightforward. For WER/WERg/PER, one would take the minimum error over all references, while for the F-measure, the proposal is to perform the maximum matching over the concatenation of the references.
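A minimal sketch of the minimum-over-references strategy, assuming a single-reference error metric such as the wer() function sketched earlier, could look as follows.

```python
def multi_reference_error(hyp, refs, error_fn):
    """Multi-reference score for error-style metrics (WER, WERg, PER):
    keep the best (minimum) error over all available references.
    `error_fn` is any single-reference metric, eg the wer() sketch above."""
    return min(error_fn(hyp, ref) for ref in refs)
```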

2.4 Comparison with Human Translation

In addition to pitting MT systems against each other, it may be interesting to compare their performance to that of human translators. When several reference translations are available, this may be done in a “round-robin” fashion. Assume for example that we have 4 references. Each of the four is evaluated, just like an MT-produced translation, against the other 3 using the evaluation metric, and the scores are averaged to provide a single human performance measure. The automatic MT systems are also evaluated against each subset of 3 references, and their scores are averaged in order to produce a comparable MT performance measure.

In the latest rounds of evaluations organised by NIST (using the NIST evaluation score), it has been noted that the top MT systems were getting close to human performance. However, even a cursory inspection of the MT output reveals that it is still far from human-level quality. This effect becomes even more pronounced with statistical MT systems that are directly optimised on the evaluation metric. This points to the limits of the current automatic MT evaluation metrics: although they have been very useful to tune and improve modern MT systems, they may not be very discriminating at the current level of MT performance.
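A possible sketch of this round-robin procedure is given below, assuming some multi-reference scoring function is available (eg the multi-reference sketch above, or a BLEU/NIST-style score); the function name and interface are my own.

```python
def round_robin_scores(mt_output, references, multi_ref_score):
    """Leave-one-out comparison of MT output with human references.
    `multi_ref_score(hyp, refs)` is any multi-reference metric.
    Returns (human_score, mt_score), both averaged over the same
    leave-one-out reference subsets so that they are directly comparable."""
    human, machine = [], []
    for i, held_out in enumerate(references):
        others = references[:i] + references[i + 1:]
        human.append(multi_ref_score(held_out, others))     # a human reference scored as if it were MT output
        machine.append(multi_ref_score(mt_output, others))  # MT output scored against the same reference subset
    return sum(human) / len(human), sum(machine) / len(machine)
```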

3 Beyond Automatic Off-Line Measures

A number of approaches have been suggested to go beyond these off-line metrics.

Key-Stroke Ratio: In the context of interactive translation help, the TransType2 project (http://tt2.atosorigin.es/) used the Key-Stroke Ratio (KSR) as a translation quality measure (eg [9]). This is the number of key strokes necessary to obtain the reference translation using the interactive translation engine, divided by the number of key strokes needed to simply type the reference, ie the length of the reference (in characters). A KSR of 0 means the system generated a perfect translation without user input, while a KSR of 1 means that the system never suggested anything even partially useful.

Post-Editing Cost: Within the Global Autonomous Language Exploitation (GALE) programme, NIST is evaluating the translation component with a new metric which measures the cost of post-editing MT output into fluent and adequate English. Specifically, external post-editors will be hired to modify each segment until they feel the resulting translation completely captures the meaning of the source sentence. The Post-Editing Cost is essentially the edit distance between the original MT output and the post-edited version. It should be first tested in the winter 2006 evaluation. This metric aims to measure translation quality in a more realistic way. The fact that NIST switched to post-editing cost from their own automatic NIST score is a clear sign of dissatisfaction with fully automatic quality measures. The downside is of course the significantly higher cost.

Learning MT evaluation: As it seems difficult to design a metric that provides high correlation with human judgements, especially at the sentence level, it is natural to invoke Machine Learning and try to learn, from data, a metric that correlates well. Kulesza and Shieber [4] do this by training a Support Vector Machine categoriser to discriminate between good (human) and bad (machine) translations. More recently, [6] adopted a different approach: they propose a flexible parameterised model that contains well-known metrics such as BLEU or the F-measure. The metric may then be trained to increase correlation with human judgement, or to privilege either fluency or adequacy.

Of these three approaches, KSR is tailored to the particular setting of interactive translation, post-editing cost is attractive but costly, and the learning approach has not been widely adopted so far.
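To illustrate the learning approach, the sketch below trains a Support Vector Machine to separate human from machine translations, in the spirit of [4]. The feature set, the made-up toy data and the use of scikit-learn are my own illustrative choices and do not reproduce the published setup.

```python
from collections import Counter
from sklearn.svm import SVC  # assumes scikit-learn is installed


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def features(hyp, ref, max_n=4):
    """Illustrative feature vector: clipped n-gram precisions for n = 1..4
    plus a hypothesis/reference length ratio (Kulesza and Shieber [4]
    use a different, richer feature set)."""
    feats = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())
        feats.append(overlap / max(1, sum(h.values())))
    feats.append(len(hyp) / max(1, len(ref)))
    return feats


# Toy, made-up training pairs: label 1 for a human translation, 0 for MT output.
ref = "the committee adopted the resolution unanimously".split()
human = "the committee unanimously adopted the resolution".split()
machine = "committee has voted the resolution all".split()
X = [features(human, ref), features(machine, ref)]
y = [1, 0]

clf = SVC(kernel="linear").fit(X, y)
# The signed distance to the separating hyperplane acts as a learned quality score.
new_hyp = "the committee adopted unanimously the resolution".split()
print(clf.decision_function([features(new_hyp, ref)]))
```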

4 Conclusion

This note presents an overview of the MT evaluation problem and of the main metrics in use today. Automatic off-line evaluation metrics such as BLEU have served the MT community well over the past years and have supported progress in statistical MT. They usually show good correlation with human judgement at the system level, ie aggregated across sentences. However, one important limitation is that they fail to provide a good indication of MT quality at the sentence level. Sentence-level MT quality estimation is important in some settings, eg to combine the output of various MT systems. In addition, recent comparisons with human-produced translations suggest that these metrics tend to overestimate the quality of MT output.

Designing efficient and reliable MT evaluation metrics has been an area of active research for the past three years, an area which is developing alongside progress in MT. One challenge for work in this area is to go beyond n-gram statistics while staying fully automatic. The importance of a fully automatic metric should not be underestimated, as it allows faster development and improvement of MT systems. Although it has received limited attention so far, the learning approach seems very attractive, as it promises to let the data tune the evaluation metric to a particular criterion (eg adequacy vs fluency) or to a particular domain.

References

[1] John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. Confidence estimation for machine translation. Final report, JHU / CLSP Summer Workshop, 2003.

[2] Nicola Cancedda and Kenji Yamada. Method and apparatus for evaluating machine translation quality. US Patent Application 20050137854, 2005.

[3] George Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proc. HLT-02, 2002.

[4] Alex Kulesza and Stuart M. Shieber. A learning approach to improving sentence-level MT evaluation. In 10th International Conference on Theoretical and Methodological Issues in Machine Translation, 2004.

[5] Alon Lavie, Kenji Sagae, and Shyamsundar Jayaraman. The significance of recall in automatic metrics for MT evaluation. In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-04), 2004.


[6] Lucian Vlad Lita, Monica Rogati, and Alon Lavie. BLANC: Learning evaluation metrics for MT. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 740–747, Vancouver, British Columbia, Canada, October 2005. Association for Computational Linguistics.

[7] I. Dan Melamed, Ryan Green, and Joseph P. Turian. Precision and recall of machine translation. In Proc. HLT-03, pages 61–63, 2003.

[8] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proc. ACL'02, pages 311–318, 2002.

[9] Richard Zens, Franz Josef Och, and Hermann Ney. Efficient search for interactive statistical machine translation. In EACL '03: Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, pages 387–393, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
