Augmenting PROMIS translation evaluation process with statistical, semantic and neural network metrics

Krzysztof Wołk1, Agnieszka Żukowska2, Wojciech Glinkowski2,3

1Polish-Japanese Academy of Information Technology, 2Medical University of Warsaw, 3Polish Telemedicine Society

1. Introduction

Language and cultural differences make it difficult to convert a text from one language into another. Different languages rarely use exactly the same words, yet every translation must convey the same meaning, which requires a deep analysis of context. The objective of this research was to develop accurate, human-independent evaluation metrics, especially for the reconciliation step and as an additional source of information for the expert review that follows backward translation within the PROMIS translation process. Within this study, we developed a semi-automatic semantic evaluation metric for the Polish language based on the HMEANT idea. Secondly, we conducted the evaluation using a statistical Support Vector Machine (SVM) classifier. We also applied deep neural networks, which attempt to mimic how the human brain processes information. Lastly, we compared those results with human judgments and with well-known machine translation metrics such as BLEU, NIST, TER and METEOR. The results of our investigation show that some of the tested metrics can be of great help in the PROMIS translation evaluation process: not only do some of them correlate highly with human judgments, but they also provide additional semantic information that is independent of human experience.

2. PROMIS evaluation process

The entire PROMIS translation process is conducted using the FACIT methodology, which ensures that the cross-lingual translation is highly consistent. The translation process itself can be divided into eleven steps. First, the English items are treated as source texts and are forward translated simultaneously by two independent translators. In the second step, a third independent translator reconciles the two versions from the first step into a single translation. Third, the reconciled version is back translated into English by a native English translator. In the fourth step, the backward translation is reviewed for harmonization between the languages. In the next step, three independent experts examine all previously taken actions to select the most appropriate translation for each item or to propose their own. In the sixth step, the translation project manager evaluates the merit of the expert comments and, if necessary, identifies problems; with this information, he or she formulates guidance for the target language coordinator. In the next step, this coordinator determines the final translation by examining all the information gathered in the previous steps. To ensure cross-lingual harmonization and quality assurance, the translation project manager then performs a preliminary assessment of the accuracy and equivalence of the final translation. In the ninth step, the items are formatted and proofread by two independent native speakers, and their results are reconciled. Next, the target language version is pretested within a group of target-language native speakers, each of whom is debriefed and interviewed to check whether the intent of the items was correctly understood. In the final step, the comments collected during pretesting are analyzed, and any remaining issues are summarized.

3. Automatic Evaluation Metrics and data preparation

As reference data we used the TED Talks parallel corpus (about 15 MB), which contains almost 2 million untokenized words. The talk transcripts were provided as plain text encoded in UTF-8 and were prepared by the FBK team. The TED corpus contained 92,135 unique Polish words. Machine translation quality is highly dependent on the input data, which should be of high quality and similar in topic domain. We therefore performed domain adaptation on the TED data using Modified Moore-Lewis filtering and linear interpolation to adapt it for PROMIS evaluation purposes. We also used the TF-IDF technique to reduce the number of out-of-vocabulary (OOV) words. We empirically chose an acceptance threshold that retained only the 20% of the data most similar to the target domain. As in-domain data, we used the original PROMIS translations.
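As an illustration of this selection step, the sketch below scores each out-of-domain sentence by TF-IDF cosine similarity to the in-domain data and keeps the most similar 20%. This is a minimal sketch assuming scikit-learn, not the authors' implementation, and the file names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical file names: one sentence per line, UTF-8 encoded.
in_domain = open("promis_pl.txt", encoding="utf-8").read().splitlines()
out_domain = open("ted_pl.txt", encoding="utf-8").read().splitlines()

# Fit TF-IDF on both corpora so they share a single vocabulary.
vectorizer = TfidfVectorizer()
vectorizer.fit(in_domain + out_domain)

# Represent the in-domain corpus as a single centroid vector.
centroid = np.asarray(vectorizer.transform(in_domain).mean(axis=0))

# Score every out-of-domain sentence and keep the top 20% most similar.
scores = cosine_similarity(vectorizer.transform(out_domain), centroid).ravel()
threshold = np.percentile(scores, 80)
selected = [s for s, sc in zip(out_domain, scores) if sc >= threshold]
```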

4. SMT system preparation

A machine translation system was necessary to translate the English source documents automatically and to compare the automatic translation results with human judgments using the widely used BLEU, NIST, TER and METEOR metrics. For those metrics, this was the only possible way to extract cross-lingual information. Our Statistical Machine Translation (SMT) system was implemented using the Moses open source SMT toolkit with its Experiment Management System (EMS). The phrase symmetrization method for word alignment processing was set to grow-diag-final-and. SyMGiza++, a tool that supports the creation of symmetric word alignment models, was used to extract parallel phrases from the data. The KenLM tool was used for language model training as well as binarization; this library enables highly efficient queries to language models, saving both memory and computation time. The lexical values of phrases are used to condition the reordering probabilities of phrases, and lexical reordering was set to hier-msd-bidirectional-fe.
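For orientation, an EMS run can be launched from a short wrapper such as the sketch below. The paths and configuration file name are hypothetical, and the actual training settings (alignment, language model, reordering) live in the EMS configuration file rather than in this wrapper.

```python
import subprocess

# Hypothetical paths; the EMS configuration is assumed to describe the corpora,
# SyMGiza++ alignment, KenLM language model and hier-msd-bidirectional-fe reordering.
MOSES_EMS = "/opt/mosesdecoder/scripts/ems/experiment.perl"
CONFIG = "config.promis.ems"

# Execute the full pipeline: corpus preparation, training, tuning and evaluation.
subprocess.run(["perl", MOSES_EMS, "-config", CONFIG, "-exec"], check=True)
```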

5.4. METEOR

The Metric for Evaluation of Translation with Explicit Ordering (METEOR) is intended to take into account, more directly, several factors that are only indirect in BLEU. Recall (the proportion of matched n-grams to total reference n-grams) is used directly in this metric. In addition, METEOR explicitly measures higher-order n-grams, considers word-to-word matches, and applies arithmetic averaging for a final score. The best matches against multiple reference translations are used. The METEOR method uses a sophisticated and incremental word alignment procedure that starts by considering exact word-to-word matches, word stem matches, and synonym matches. Alternative word order similarities are then evaluated based on those matches. The calculation of precision is similar in the METEOR and NIST metrics. Recall is calculated at the word level. To combine the precision and recall scores, METEOR uses a harmonic mean.
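To make the final combination step concrete, the sketch below computes a METEOR-style score from unigram match statistics, using the recall-weighted harmonic mean and fragmentation penalty of the original METEOR formulation. It is an illustrative simplification and omits the stem and synonym alignment stages.

```python
def meteor_style_score(matches, hyp_len, ref_len, chunks):
    """METEOR-style score from unigram match statistics.

    matches : number of matched unigrams between hypothesis and reference
    chunks  : number of contiguous matched chunks (fewer chunks = better order)
    """
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Recall-weighted harmonic mean (recall weighted 9:1 in the original METEOR).
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fragmentation penalty rewards matches that appear in long contiguous runs.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Example: 7 of 8 hypothesis words match a 9-word reference, in 3 chunks.
print(round(meteor_style_score(matches=7, hyp_len=8, ref_len=9, chunks=3), 3))
```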

5.5. SVM Classifier

For the sentence similarity metric, we implemented an algorithm that uses the likelihood output of a statistical SVM classifier and normalizes it into the 0-1 range. The classifier must be trained to determine whether sentence pairs are translations of each other and to evaluate them. Besides being an excellent classifier, a Support Vector Machine (SVM) can provide the distance to the separating hyperplane during classification, and this distance can easily be mapped through a sigmoid function to a likelihood-like value between 0 and 1. The use of a classifier means that the quality of the alignment depends not only on the input but also on the quality of the trained classifier. To train the classifier, good quality parallel data were needed, as well as a dictionary that included translation probabilities. For this purpose, we used the previously discussed TED Talks corpora.
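A minimal sketch of this idea, assuming scikit-learn and a handcrafted feature vector per sentence pair (for example, length ratio and dictionary translation coverage); the features shown are hypothetical and not the authors' exact design.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical features per sentence pair: [length ratio, dictionary coverage].
X_train = np.array([[0.95, 0.80], [1.02, 0.75], [0.40, 0.10], [2.10, 0.05]])
y_train = np.array([1, 1, 0, 0])  # 1 = the sentences are translations of each other

clf = LinearSVC()
clf.fit(X_train, y_train)

def similarity(features):
    """Map the signed distance to the separating hyperplane into (0, 1)."""
    distance = clf.decision_function([features])[0]
    return 1.0 / (1.0 + np.exp(-distance))  # sigmoid of the SVM margin

print(round(similarity([0.98, 0.70]), 3))
```

Platt scaling, available in scikit-learn as SVC(probability=True), achieves a comparable likelihood-style output.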

5.6. Neural Network Classifier

Our evaluation neural network was implemented using the Groundhog and Theano tools. Most of the neural machine translation models that have been proposed belong to the encoder-decoder family [28], using an encoder and a decoder for every language, or a language-specific encoder applied to each sentence whose outputs are then compared. A translation is the output that a decoder produces from the encoded vector. The entire encoder-decoder system, consisting of an encoder and a decoder for each language pair, is jointly trained to maximize the number of correct evaluations. A significant and unique feature of this approach is that it does not attempt to encode an input sentence into a fixed-length vector. Instead, the sentence is mapped to a sequence of vectors, and the model adaptively chooses a subset of these vectors as it decodes the translation. More precisely, we trained a neural network on the TED Talks data, previously adapted as described above, to teach it to evaluate translations correctly. In simple terms, our neural model tries to capture different textual notions of cross-lingual similarity using the input and output gates of an LSTM (a sequence learning architecture that uses a memory cell to preserve state over long periods, which enables distributed representations of sentences to be built from distributed representations of words), arranged into a Tree Long Short-Term Memory (Tree-LSTM). The network learning rate was set to 0.05 with a regularization strength of 0.0001. The memory dimension was set to 150 cells, and training was performed for ten epochs.
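The sketch below is a heavily simplified, sequential (non-tree) LSTM similarity scorer written in PyTorch purely for illustration; the authors' model is a Tree-LSTM built with Groundhog/Theano, so everything here apart from the stated 150-cell memory, learning rate and regularization strength (embedding size, scoring head, optimizer choice) is an assumption.

```python
import torch
import torch.nn as nn

class SentenceSimilarityLSTM(nn.Module):
    """Simplified sequential LSTM similarity scorer (illustrative only;
    the paper's model is a Tree-LSTM trained with Groundhog/Theano)."""
    def __init__(self, vocab_size, embed_dim=100, mem_dim=150):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, mem_dim, batch_first=True)
        # Map the two sentence vectors to a similarity score in (0, 1).
        self.scorer = nn.Sequential(nn.Linear(2 * mem_dim, 50), nn.ReLU(),
                                    nn.Linear(50, 1), nn.Sigmoid())

    def encode(self, token_ids):
        _, (hidden, _) = self.lstm(self.embed(token_ids))
        return hidden[-1]                  # final hidden state as the sentence vector

    def forward(self, src_ids, trg_ids):
        pair = torch.cat([self.encode(src_ids), self.encode(trg_ids)], dim=-1)
        return self.scorer(pair)

model = SentenceSimilarityLSTM(vocab_size=10000)
# Learning rate 0.05 and L2 strength 0.0001 follow the values reported above.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05, weight_decay=1e-4)

# Toy forward pass on two already-numericalised sentences (batch of one).
src = torch.randint(0, 10000, (1, 7))
trg = torch.randint(0, 10000, (1, 6))
print(model(src, trg).item())
```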

5.7. HMEANT Metric

HMEANT is a semi-automatic metric that first requires human annotation in two stages: semantic role labeling (SRL) and alignment. In the SRL phase, annotators are asked to mark all the frames (a predicate and its roles) in the reference and translated texts. To annotate a frame, it is necessary to mark its predicate (a verb, but not a modal verb) and its arguments (role fillers linked to that predicate). Secondly, the annotators need to align the elements of the frames. They must link both actions and roles, and mark them as "Correct" or "Partially Correct" (depending on the equivalence of their meaning). Once the annotation step is completed, the HMEANT score can be calculated as the F-score over the counts of matched predicates and their corresponding role fillers. Predicates (together with their roles) that have no correct match are not taken into account. The HMEANT model is defined as follows, where #F_i is the number of correct role fillers for predicate i in the machine translation, #F_i(partial) is the number of partially correct role fillers for predicate i in the machine translation, #MT_i and #REF_i are the total numbers of role fillers for predicate i in the machine translation and the reference respectively, N_mt and N_ref are the total numbers of predicates in the MT output and the reference, and w is the weight of a partial match (0.5 in the uniform model):

$$P = \sum_{\text{matched } i} \frac{\#F_i}{\#MT_i}, \qquad R = \sum_{\text{matched } i} \frac{\#F_i}{\#REF_i}$$

$$P_{part} = \sum_{\text{matched } i} \frac{\#F_i(\text{partial})}{\#MT_i}, \qquad R_{part} = \sum_{\text{matched } i} \frac{\#F_i(\text{partial})}{\#REF_i}$$

$$P_{total} = \frac{P + w \cdot P_{part}}{N_{mt}}, \qquad R_{total} = \frac{R + w \cdot R_{part}}{N_{ref}}$$

$$HMEANT = \frac{2 \cdot P_{total} \cdot R_{total}}{P_{total} + R_{total}}$$
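As a concrete illustration of the scoring arithmetic (not of the annotation tooling), the following sketch computes HMEANT from the per-predicate counts defined above; the data structure holding the counts is hypothetical.

```python
def hmeant(frames, n_mt, n_ref, w=0.5):
    """frames: list of matched predicates, each a dict with counts
    f (correct fillers), f_partial, mt (fillers in MT), ref (fillers in reference)."""
    p = sum(fr["f"] / fr["mt"] for fr in frames)
    r = sum(fr["f"] / fr["ref"] for fr in frames)
    p_part = sum(fr["f_partial"] / fr["mt"] for fr in frames)
    r_part = sum(fr["f_partial"] / fr["ref"] for fr in frames)
    p_total = (p + w * p_part) / n_mt
    r_total = (r + w * r_part) / n_ref
    if p_total + r_total == 0:
        return 0.0
    return 2 * p_total * r_total / (p_total + r_total)

# Example: two matched predicates out of 3 in the MT output and 3 in the reference.
frames = [{"f": 2, "f_partial": 1, "mt": 4, "ref": 3},
          {"f": 1, "f_partial": 0, "mt": 2, "ref": 2}]
print(round(hmeant(frames, n_mt=3, n_ref=3), 3))
```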

5. Automatic Evaluation Metrics

5.1. Bilingual Evaluation Understudy (BLEU)

BLEU was developed based on a premise similar to that used for speech recognition, described as: "The closer a machine translation is to a professional human translation, the better it is." The BLEU metric is therefore designed to measure how close SMT output is to human reference translations. BLEU attempts to match variable-length phrases between the SMT output and the reference translations. The basic metric requires calculation of a brevity penalty P_B, which is calculated as follows:

$$P_B = \begin{cases} 1, & c > r \\ e^{(1 - r/c)}, & c \le r \end{cases}$$

where r is the length of the reference corpus and c is the length of the candidate translation. The basic BLEU metric is then determined as:

$$BLEU = P_B \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

where w_n are positive weights summing to one, and the n-gram precision p_n is calculated using n-grams with a maximum length of N.
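The following sketch evaluates exactly the two formulas above, given already-computed modified n-gram precisions; in practice one would use an existing implementation such as NLTK's sentence_bleu or sacrebleu.

```python
import math

def bleu(p_ngram, candidate_len, reference_len):
    """BLEU from modified n-gram precisions p_1..p_N with uniform weights w_n."""
    n = len(p_ngram)
    weights = [1.0 / n] * n                     # positive weights summing to one
    if candidate_len > reference_len:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - reference_len / candidate_len)
    if min(p_ngram) == 0:
        return 0.0                              # log(0) undefined: score collapses to 0
    return brevity_penalty * math.exp(sum(w * math.log(p)
                                          for w, p in zip(weights, p_ngram)))

# Example: 1- to 4-gram precisions for an 18-word candidate vs. a 20-word reference.
print(round(bleu([0.80, 0.55, 0.40, 0.30], candidate_len=18, reference_len=20), 3))
```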

5.2. NIST

The NIST metric was designed to improve on BLEU by rewarding the translation of infrequently used words. This was intended to further prevent the inflation of SMT evaluation scores that results from focusing on common words and high-confidence translations. As a result, the NIST metric assigns heavier weights to rarer words. The final NIST score is calculated using the arithmetic mean of the n-gram matches between the SMT output and the reference translations. In addition, a smaller brevity penalty is applied for small variations in phrase length. The reliability and quality of the NIST metric have been shown to be superior to those of the BLEU metric.
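For reference, NIST scores can be computed with NLTK's implementation; the sketch below assumes NLTK is installed and that the hypothesis and references are already tokenized (the sentences themselves are made up).

```python
from nltk.translate.nist_score import sentence_nist

# Tokenized reference and MT hypothesis (illustrative sentences only).
reference = ["the", "patient", "reported", "severe", "pain", "today"]
hypothesis = ["the", "patient", "reported", "strong", "pain", "today"]

# sentence_nist takes a list of references; n is the maximum n-gram order.
print(sentence_nist([reference], hypothesis, n=3))
```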

5.3. TER

Translation Edit Rate (TER) was designed to provide a very intuitive SMT evaluation metric, requiring less data than other techniques while avoiding the labor intensity of human evaluation. It calculates the number of edits required to make a machine translation exactly match the closest reference translation in fluency and semantics. The TER metric is defined as:

$$TER = \frac{E}{w_R}$$

where E represents the minimum number of edits required for an exact match, and w_R is the average length of the reference text. Edits may include deletion of words, word insertion and word substitution, as well as changes in word or phrase order.
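A minimal sketch of the calculation, using a plain word-level Levenshtein distance; real TER additionally allows block shifts of phrases, which this simplification omits.

```python
def simple_ter(hypothesis, references):
    """Simplified TER: word-level edit distance to the closest reference,
    divided by the average reference length (block shifts are not modelled)."""
    def edit_distance(a, b):
        # Standard dynamic-programming Levenshtein distance over words.
        d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
             for i in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                d[i][j] = min(d[i - 1][j] + 1,                            # deletion
                              d[i][j - 1] + 1,                            # insertion
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))   # substitution
        return d[len(a)][len(b)]

    e = min(edit_distance(hypothesis.split(), ref.split()) for ref in references)
    w_r = sum(len(ref.split()) for ref in references) / len(references)
    return e / w_r

print(round(simple_ter("the patient reported strong pain",
                       ["the patient reported severe pain"]), 3))
```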

6. Results

In our research, we used 150 officially accepted PROMIS item translations between Polish and English. The results indicate that HMEANT is a very good predictor of human judgment. In our opinion, the slight difference arises from human translation habits, because the choices made are highly debatable. Even though the SVM and the neural network provided much lower accuracy, it must be noted that both methods depend heavily on the reference training corpus, which was not well suited for such an evaluation (but was the best we could obtain). Detailed results, expressed as percentages, are presented in Figure 1.

Figure 1. Percentage results for the METEOR, TER, NIST, BLEU, SVM, Neural and HMEANT metrics (scale 0–100%).