Using Language Models to Assist in the Correction of Machine Translation Output

Beatrice Alex

Supervisors: Donnla Nic Gerailt and Miles Osborne


Master of Science in Speech and Language Processing
Department of Theoretical and Applied Linguistics
University of Edinburgh
September 2002

Abstract

Machine translation (MT) systems are renowned for making many translation errors. Spotting such errors can be a time-consuming and labour-intensive process, which makes automatic evaluation and correction of MT output highly desirable for both system developers and end-users. Based on the novel approach of using statistical language models to assess the quality of MT, the main aim of this project is to automatically spot sentences containing translation errors in the output of a commercial MT system by means of N-gram models built from a target language corpus. This method, which is presented in this MSc dissertation, aims to differentiate between good- and bad-quality translations of sentences in terms of the cross entropy scores produced by the language model. The cross entropy values assigned to a set of known good-quality human-written sentences will be used as a reference point in the pilot experiment. Issues such as sentence length and the occurrence of unseen events in the test data will be addressed, and the behaviour of various language modeling parameters, including the N-gram order, the smoothing technique, the amount of training data and the vocabulary size, will be investigated.


Declaration

I declare that this MSc dissertation was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text. This work has not been submitted for any other degree or professional qualification except as specified.

(Beatrice Alex)


Acknowledgements

I would like to thank both my supervisors, Donnla Nic Gerailt and Miles Osborne, as well as my course organiser, Simon King, for their good advice and guidance throughout this project. Many thanks go to Miles Osborne for suggesting the topic of this project and monitoring my progress. I am particularly grateful to Donnla Nic Gerailt, who invested a lot of time and effort going through the first drafts of this dissertation. Her continuous feedback was very much appreciated. Special thanks also go to Simon King, who showed a great interest in this work. Not only did he help me to get hold of the training data from the LDC, but he was also very helpful in answering a number of questions that arose in the course of the project. Finally, a big thank you goes to Keith and Elisabeth for their encouragement and support throughout.


Table of Contents

1 Introduction
  1.1 Statement of the Problem
  1.2 Objectives and Hypotheses of the Study
  1.3 Layout of the Dissertation

2 A Critical Look at Machine Translation
  2.1 Machine Translation Paradigms
    2.1.1 Rule-Based Machine Translation
    2.1.2 Statistical Machine Translation
    2.1.3 Example-Based Machine Translation
    2.1.4 Hybrid Machine Translation Paradigms
  2.2 Current Uses of Machine Translation
  2.3 Motivation for Automatic Assessment of Machine Translation Output

3 Recent Efforts in Machine Translation Evaluation
  3.1 Key Trends in MTE
    3.1.1 Aspects of MT Quality
    3.1.2 Evaluation Metrics
  3.2 Automatic MT Assessment via Statistical Methods
    3.2.1 Automatic Selection of the Best MT Output
    3.2.2 Automatic Translation Quality Assessment
    3.2.3 A Machine Learning Approach to Automatic MTE
  3.3 Summary

4 Theory and Implementation
  4.1 Statistical Language Modeling Theory
    4.1.1 Simple N-gram Models
    4.1.2 N-gram Models over Sparse Data
  4.2 Entropy Theory
    4.2.1 Entropy of Language
    4.2.2 Cross Entropy
    4.2.3 Cross Entropy as a Translation Quality Measure
  4.3 Resources
    4.3.1 The MT System
    4.3.2 The Statistical Language Modeling Toolkit
    4.3.3 The Corpora
  4.4 Data Preparation
    4.4.1 Junk Formatting
    4.4.2 Capitalisation
    4.4.3 Tokenisation
    4.4.4 Sentence Segmentation
  4.5 Experimental Procedure
    4.5.1 Training and Testing the SLM
    4.5.2 Human Ranking of Test Sentences
    4.5.3 Threshold Determination

5 Results and Evaluation
  5.1 Pilot Experiment
  5.2 Main Experiments
    5.2.1 Experiment 1: The Conventional Trigram Model
    5.2.2 Experiment 2: Higher Order N-gram Models
    5.2.3 Experiment 3: Different Smoothing Techniques
    5.2.4 Experiment 4: Different Amounts of Training Data
    5.2.5 Experiment 5: Different Vocabulary Sizes

6 Discussion and Future Work

A Future Perfect Sentences
B Results for Different Smoothing Techniques
C Results for Different Training Data Sets
D Cross Entropy Scores for Different Vocabularies
E Results for Different Vocabularies

Bibliography

List of Abbreviations

AFP     - Agence France Presse
ALPAC   - Automatic Language Processing Advisory Committee
AMTA    - Association for Machine Translation in the Americas
AP      - Associated Press
DARPA   - Defense Advanced Research Projects Agency
DPA     - Deutsche Presse Agentur
EBMT    - Example-Based Machine Translation
FAHQMT  - Fully Automatic, High Quality Machine Translation
FAZ     - Frankfurter Allgemeine Zeitung
LDC     - Linguistic Data Consortium
MAT     - Machine-Assisted Translation
MT      - Machine Translation
MTE     - Machine Translation Evaluation
NLP     - Natural Language Processing
POS     - Part-of-Speech
RBMT    - Rule-Based Machine Translation
SL      - Source Language
SLM     - Statistical Language Model
TL      - Target Language

List of Figures

2.1 Conventional Approaches to MT
2.2 A Statistical MT System
2.3 EBMT Architecture
2.4 MT as Basis for Human Translation
3.1 Modified N-gram Precision for Machine and Human Translations
3.2 BLEU versus Bilingual and Monolingual Judgments
4.1 The Basic Concept of Smoothing
4.2 SYSTRAN Translation Process
4.3 CMU-Cambridge SLM Toolkit Design
4.4 Threshold Determination
5.1 Cross Entropy Distributions of SYSTRAN Translated and Human-Written Sentences
5.2 Human Ranking of Heldout Data
5.3 Natural Distributions of Good- and Bad-Quality MT Sentences
5.4 Even Distribution of Good and Bad MT Sentences
5.5 Accuracy Scores for Different Smoothing Techniques
5.6 Accuracy Scores for Different Amounts of Training Data
5.7 Accuracy Scores for Different Vocabulary Sizes
5.8 Cross Entropy Distributions of the Best Model

List of Tables

3.1 French to English Translations
3.2 English to French Translations
3.3 Accuracy of the Decision Tree Method
4.1 Word Types and Word Tokens
5.1 SYSTRAN Output versus Human-Written Text
5.2 SYSTRAN Output versus Human-Written Sentences
5.3 Optimal Thresholds and Accuracy Scores for Higher Order N-grams
5.4 N-gram Log Likelihoods and Back-Off Classes from the 5-gram Model
5.5 Relationship between Word Type Frequency and Vocabulary Size

Chapter 1
Introduction

1.1 Statement of the Problem

Machine translation (MT), that is, "the application of computers to the task of translating texts from one natural language to another" (EAMT n.d.)1, was one of the earliest pursuits in computer science and proved to be a significantly more complex task than had first been anticipated. In order to achieve high-quality MT, the computer must essentially function in the same way as a human translator and must therefore carry out processes which approximate thought. The truth is that language is far more complex and intricate than even linguists had ever imagined, and not like some kind of code that can be cracked. Many MT researchers have consequently abandoned the idea of developing fully automatic, high quality (general-purpose) machine translation (FAHQMT) systems to replace human translators. Even though modern approaches to MT appear more realistic and level-headed than those which arose from the initial euphoria of the 1950s, the raw output produced by most MT systems is by no means flawless.

1 www.eamt.org/mt.html


As Makoto Nagao, Professor in the Faculty of Engineering at Kyoto University and a researcher well known for his work in the field of MT, rightly pointed out, "MT is still little more than a new-born child" (Nagao 1989), and it remains an extremely difficult challenge. Prolonged research and development are required in order to create MT systems that are able to deal with all the nuances of translation. Despite all resulting translation errors, there are several MT systems currently available on the market which produce raw output that nevertheless serves a useful purpose for certain users. Translators use MT engines to produce a first rough draft of documents, which they then hand-correct to create their final translation. Assuming the MT system is able to translate some words, expressions and even sentences correctly, this method can save translators a certain amount of time and dictionary work. Amongst other users of MT output are people who simply want to get a rough idea of what a document translated from another language says. In the case of technical documents, specialists can often use their own knowledge to interpret awkwardly worded translations. This means that the end-user's own world knowledge or knowledge of the source language is required to determine whether a translation generated by a machine is appropriate or not. It appears desirable for end-users to receive additional feedback on the quality of a translation, a measure of approximately how good the output is. Translators could benefit from such a quality measure since it would help them distinguish which parts of the output text require post-editing. People using MT engines to translate into a language they do not understand could use this measure to single out bad translations and rephrase their original input. Similarly, end-users who use MT to get the overall gist of a document in a source language they do not master could use automatic quality assessment of MT output to determine which translations are likely to contain severe translation errors that may lead to misinterpretation. So far, only very few systems exist that automatically assess the quality of MT output. While some of them are designed to select the best output from a set of translations


done by multiple MT engines (Callison-Burch & Flournoy 2001), others have been developed as research aids based on a limited set of human reference translations (Papineni, Roukos, Ward & Zhu 2001) and are therefore to a large extent restricted to the use of system designers and MT researchers. The idea of a system which provides MT output quality assessment that is also accessible and beneficial to end-users does not seem far-fetched. However, such a system would have to deal with any given MT output to be able to determine its quality. It would therefore require as much knowledge of the target language as possible. This study investigates how statistical language modeling technology, which allows building models that approximate the probability distribution of natural language by means of a large training corpus, can be used to assess the quality of MT output.

1.2 Objectives and Hypotheses of the Study

The main objective of this study is to investigate to what extent Statistical Language Models (SLMs) built from a German newspaper corpus are capable of automatically spotting sentences containing translation errors in MT output. The aim is to build a model that can differentiate between good- and bad-quality translated sentences without additional linguistic knowledge about the text. The ultimate goal is to determine a threshold which reliably divides machine translations into either correct and acceptable translations on the one hand, or poor and wrong translations on the other, i.e. into good-quality MT output and output that needs hand correction. Such a system would not only be very desirable to end-users of MT engines, but could equally be further upgraded to a system which, in addition to producing a quality score, also automatically spotted individual translation errors and auto-corrected them, an undertaking beyond the scope of this study. Since the SLMs are built from a large German training corpus, it is anticipated that good-quality test translations are well captured by the model, i.e. they are likely to


contain N-grams that also appear in the training corpus and consequently to match a word sequence in the training corpus in word choice and word order. As a result, the log likelihood of good-quality translated sentences is expected to be higher than that of bad-quality translated sentences. It must be noted at this point that in the experiments conducted for this research, evaluation is carried out by means of cross entropy (see Chapter 4.2.2). This metric is derived from the log-probability of a test sample normalised by the number of tokens within it, the main advantage being that we can account for the test sample length. Cross entropy therefore allows us to compare test translations of different lengths. This is essential for this study since the set of test translations comprises real newspaper data. Human ranking of the test sentences was carried out to produce a baseline for comparison. One major difficulty, addressed in Chapter 5.2.1, is the extremely uneven distribution of good and bad machine-translated sentences in a real sample of data, resulting from the relatively poor performance of the MT system. Since bad-quality translations outnumber the good-quality translations by far, a comparison of the two cross entropy distributions becomes extremely difficult. It was therefore decided to test the SLM on an evenly split set of good and bad translations. Although this even-split test set no longer represents the real quality distribution within a collection of machine-translated sentences, it nevertheless allows us to test to what extent there is any potential in the method of using SLMs to distinguish good- from bad-quality machine translations. Moreover, extensive experiments were carried out using this test set to study the behaviour of different language modeling parameters and to determine the SLM which works best for automatic quality assessment of MT output.
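To make the scoring concrete, the following is a minimal sketch of how a per-token cross entropy score can be derived from a sentence log-probability and compared against a threshold. The function names, the base-2 log-probability values and the threshold are illustrative assumptions for this sketch, not the actual implementation or figures used in the experiments.

```python
def cross_entropy(logprob_base2, num_tokens):
    """Per-token cross entropy: the negated log2 probability of the
    sentence, normalised by its length in tokens (see Chapter 4.2.2)."""
    return -logprob_base2 / num_tokens

def classify(sentence_logprob2, num_tokens, threshold):
    """Label a machine-translated sentence by comparing its cross
    entropy against a previously determined threshold."""
    h = cross_entropy(sentence_logprob2, num_tokens)
    return "good" if h <= threshold else "needs hand correction"

# Illustrative values only: a 12-token sentence whose log2 probability
# under the language model is -96 has a cross entropy of 8 bits/token.
print(classify(-96.0, 12, threshold=9.0))   # -> good
print(classify(-150.0, 12, threshold=9.0))  # -> needs hand correction
```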

1.3 Layout of the Dissertation

The layout of this dissertation is organised as follows:

Chapter 2 provides a detailed overview of the most common approaches to MT and


their performance. Furthermore, the current uses of MT are described, which aims to help the reader understand the motivation for conducting this research.

Chapter 3 reviews the literature with regard to the key concepts behind Machine Translation Evaluation (MTE). It then explains what three recent studies on the automatic assessment of MT quality have achieved and clarifies in what way the present study differs from what has been accomplished already.

Chapter 4 describes the implementation of this research. Firstly, the theory behind N-gram models and entropy as a metric to evaluate them is thoroughly explained. The chapter then goes on to outline all resources used for the experiments and explains the individual data preparation stages as well as the experimental design as a whole.

Chapter 5 presents all experiments carried out (including the pilot study) and discusses the results and difficulties that were encountered.

Chapter 6, the last chapter of this dissertation, provides an overall summary of all key results and draws conclusions accordingly. Finally, this chapter suggests directions for future research.

Chapter 2
A Critical Look at Machine Translation

This chapter provides a brief description of different approaches to MT. A critical analysis of existing state-of-the-art MT paradigms then allows us to explain the motivation for automatic quality assessment of MT output.

2.1 Machine Translation Paradigms

The reader may be surprised to learn that the field of MT is almost as old as the invention of the computer itself. The concept of MT was first elaborated by Andrew D. Booth, a British crystallographer, and Warren Weaver, a member of the Rockefeller Foundation, at the end of the 1940s (Arnold, Balkan, Humphreys, Meijer & Sadler 1994). Since then, many different approaches to MT have been developed, their main aim being the enhancement of translation quality.


2.1.1 Rule-Based Machine Translation

The majority of conventional MT systems (see Figure 2.1) are rule-based. These can be sub-divided into direct, interlingual and transfer translation systems (Bennett 2000):

Figure 2.1: Conventional Approaches to MT (analysis of the source text, followed by direct translation, transfer or an interlingua, and generation of the target text)

• Direct translation is an approach to MT in which the lexical and syntactic analysis of the source language (SL) is very limited and which often involves word-by-word translation.

• Interlingual translation entails a procedure which consists of two stages. The SL text is first analysed into an intermediate, language-independent interlingual representation, which in turn is used to generate the appropriate target language (TL) text. While the SL analysis is SL-specific irrespective of the TL, the TL generation is TL-specific regardless of the SL.

• Transfer translation is an MT approach carried out in three stages. Firstly, the SL text is analysed into an SL representation. This is then transferred into a TL


representation. The TL representation is finally used to generate the appropriate TL text.

SYSTRAN (see Chapter 4.3.1), a well-known MT engine which is used for the experiments outlined in Chapter 5, functions based on an architecture similar to that of a transfer system. Both interlingual and transfer translation systems use a considerable number of grammatical and lexical rules. The more rules such a system employs, the more sophisticated and complex it becomes. In order to model all variations inherent in natural language correctly, RBMT systems essentially require a complete set of lexical, grammatical, syntactic and semantic rules, which so far has been impossible to acquire since language is far more complex than one can easily imagine. A further problem with RBMT is the fact that the implemented rules are often incoherent and contradict each other, which causes frequent translation errors. Bearing in mind that it is extremely difficult to formulate linguistic rules and keep them consistent, it becomes clear that Rule-Based Machine Translation (RBMT) systems are intrinsically difficult to upgrade and maintain.1

1 The book Machine Translation: An Introductory Guide by Arnold et al. (1994) devotes a whole chapter to RBMT and its feasibility.

2.1.2 Statistical Machine Translation

Statistical approaches to MT, on the other hand, try to avoid encoding linguistic rules and therefore denote a clear departure from the assumptions which have dominated MT research since the early 1960s (Hutchins & Somers 1992). In fact, Warren Weaver proposed to make use of statistical methods for MT purposes in 1949, when researchers first raised the idea of automatic MT from one natural language to another (Weaver 1955). However, the limited computing resources available at the time represented a major obstacle to this computationally intensive approach. Hence, researchers only returned to the idea of employing statistical techniques for the task of MT in the late 1980s


when both computing power and storage had increased considerably and machine-readable corpora were readily available (Brown, Cocke, Pietra, Jelinek, Lafferty, Mercer & Roossin 1990). Rather than expressing formal linguistic information in an explicit way as in RBMT, purely statistical MT systems aim to encode this information by means of the probability distribution of large volumes of good-quality corpora in the relevant languages. The basic assumption of statistical MT is that every sentence in the target language (TL) is a possible translation of each sentence in the source language (SL). Consequently, each sentence pair (SS, TS) is assigned a probability P(TS|SS), the probability that the MT system will produce the TL sentence when presented with the SL sentence. The task of the MT system decoder (see Figure 2.2) is to discover the TL sentence TS (a member of the universe of TL sentences) which is the most probable given a particular SL sentence SS, in other words, the TS which maximises the conditional probability P(TS|SS) (Manning & Schütze 2001):

\arg\max_{TS} P(TS \mid SS) \qquad (2.1)

Since it is unclear how to solve the conditional probability P(TS|SS) directly, it must be broken down into probabilities that are easier to compute. This can be done by means of Bayes' theorem, which allows us to rearrange the order of dependence between the two events:

\arg\max_{TS} P(TS \mid SS) = \arg\max_{TS} \frac{P(SS \mid TS)\, P(TS)}{P(SS)} \qquad (2.2)

Equation 2.2 states that the conditional probability P(TS|SS) is equal to the probability of the SL sentence conditioned on the TL sentence, P(SS|TS) (also referred to as the likelihood of the TL sentence), multiplied by the prior probability of the TL sentence being produced irrespective of the SL, P(TS), and divided by the probability of the SL sentence, P(SS). Since P(SS) always remains the same when considering different TL sentences as possible matches, it only acts as a scaling factor on the probabilities


and can therefore be ignored when finding the TL sentence that maximises P(TS|SS):

\arg\max_{TS} P(TS \mid SS) = \arg\max_{TS} \underbrace{P(SS \mid TS)}_{\text{likelihood}} \; \underbrace{P(TS)}_{\text{prior}} \qquad (2.3)

Computing the prior P(TS) and the likelihood P(SS|TS), as shown in Equation 2.3, involves constructing a statistical model of likely TL sentences and a model of how SL sentences translate to TL sentences, respectively (see Figure 2.2). Both of these tasks can only be accomplished with the help of large, high-quality monolingual and bilingual corpora. Even then, the experimental results reported for purely statistical approaches have yielded a relatively low percentage of correctly translated sentences. Brown et al. (1990), for example, described the results of a statistical MT system built by a team at IBM. Their system was constructed by means of the parallel Canadian French/English Hansard Corpus, consisting of proceedings of the Canadian Parliament. Only 5% of the translations (of the 72 test sentences) produced by the system were an exact copy of the actual Hansard translation. Approximately 48% were either exact or at least preserved the meaning of the official translations.

Figure 2.2: A Statistical MT System (a decoder combines a language model P(TS) and a translation model P(SS|TS) to find the TL sentence maximising P(TS, SS) for the input SS)

The main drawback of this approach is that translation performance is extremely sensitive to the training data and estimates of rare words are unreliable. Furthermore, such models have no notion of syntactic phrasing and often fail to capture long-distance dependencies (e.g. wh-movement) as a result of the independence assumptions made.


2.1.3 Example-Based Machine Translation

Example-Based MT (EBMT), a more recent approach to MT, relies on a database of pre-translated sentences (Bennett 2000) in order to translate new input text (see Figure 2.3). The bilingual corpus of translation pairs is searched for the closest matching example(s) to the source sentence being translated. The corresponding translations of these examples are then combined to create a final translation, in the same way that human translators tend to work when using previously translated material and parallel texts (Wilss 1998). Another major advantage often claimed by the advocates of EBMT is that translation accuracy improves as the example base becomes more complete. For example, Mirna, Iida & Furuse (1998) report that the translation quality of their EBMT system increased in an almost linear fashion, from 30% with 100 examples to 65% with 774 examples.

Figure 2.3: EBMT Architecture (the most analogous source language examples are found, the corresponding target language examples are retrieved via the stored correspondences, and the examples are combined into the target language text)


Despite the improvement in MT engine performance due to the increase in the number of examples in the database, difficulties arise when the database offers multiple similar or indeed identical examples, or when so many examples are stored that the time it takes to search through the database makes the EBMT system computationally inefficient. A further problem with EBMT systems is that translation quality depends highly on the quality of the examples in the database. This means that sentences wrongly translated by the system cannot serve as examples for future translations, since that would automatically decrease translation quality.2

2 For a comprehensive review of EBMT the reader is referred to Somers (1999).

2.1.4 Hybrid Machine Translation Paradigms

Current thinking in MT circles suggests that significant progress in the field of MT is unlikely to be achieved by refining any single approach. It has therefore become a common interest to merge different MT paradigms into one system in order to yield better translation results. In recent years, we have consequently seen the development of an increasing number of hybrid MT systems, with the aim of combining the strengths of each individual approach and improving overall translation quality as a result (e.g. Hogan & Frederking 1998, Streiter, Carl & Iomdin 2000). At this point in time, the extent to which such hybrid MT paradigms can improve the performance of MT engines is not yet fully known, since the work carried out in this field is still in its infancy.

2.2 Current Uses of Machine Translation

The idea of a machine translating any text without the help of a human translator is very appealing to today's global and multilingual society, for reasons such as cost-effectiveness and data accessibility. However, as discussed in Chapters 2.1.1 to 2.1.4, automated translation is a notoriously difficult task. The system must essentially mimic fundamental functions of human natural language processing (NLP) carried out in the


brain, or at least recreate the results of human intelligence operations. This is where all current MT engines are still greatly limited. They have problems producing a translation comparable in quality to that produced by a professional human translator. MT output is error-prone due to the fact that human languages contain a variety of complicated relationships and ambiguities. Modern MT methods are not yet sophisticated enough to deal with that level of complexity. Lexical, morphological, syntactic and semantic variations present in both the SL and the TL further complicate the process. But despite the generally poor quality of their output, MT engines are nevertheless used in four different areas (Hutchins 1999):

1. As a basis for a more extensive process of human translation, where the MT provides a rough draft that is used for further refinement (e.g. by the translators employed by the EU institutions).

2. As a way of allowing the reader to understand the general meaning of the content of a document written in a language that they do not understand.

3. As a one-to-one communication tool (e.g. telephone or written correspondence).

4. As a sub-system within multilingual systems for information retrieval, information extraction, database access, etc. (e.g. AVENTINUS3).

When a system has to process texts that vary greatly in type and content, it is unavoidable that the resulting translations will contain many errors and even untranslated words. Human translators who use MT output as a rough draft to produce their final translation are therefore required to carry out extensive post-editing in order to obtain a high-quality translation (see Figure 2.4). Spotting and correcting MT errors by hand is both time-consuming and labour-intensive. In some cases, such revision may even be so substantial that it takes more time than manual translation.

3 AVENTINUS is a system that will provide information about drugs, criminals and suspects to police forces on databases accessible in any of the EU languages.

Figure 2.4: MT as Basis for Human Translation (the English sentence "Machine translation output often contains errors." is passed through an MT engine, producing "Die Maschinelle Übersetzung, die enthält ausgegeben wird häufig Fehler.", which human correction turns into "Maschinelle Übersetzungen enthalten häufig Fehler.")

It is evident that translators would benefit greatly from obtaining machine translations with an automatically assigned quality score that gives them an idea of whether correction is likely to be necessary. A method to automatically spot and highlight errors within machine-translated texts would be even more useful in this scenario. The question that needs to be asked at this point is: would an automatic approach actually be more desirable than improving the MT engine itself? What is the advantage of developing an additional system which may itself not be completely accurate?

2.3 Motivation for Automatic Assessment of Machine Translation Output

The majority of today's state-of-the-art MT systems have required years of research to reach a point where they still produce relatively low-quality translations. One major difficulty is the fact that translation varies with the SL and TL, so not all language combinations require the system to make similar decisions. An RBMT system that translates from English into German, for example, must contain a syntactic rule that ensures that in a German adjectival phrase the adjective precedes the noun, as it does in English. This rule cannot, however, be employed when the system translates into French, where adjectives mostly follow the noun. Bearing this in mind, it becomes clear that MT systems that offer translations for multiple language pairs, such as SYSTRAN (see


Chapter 4.3.1), are extremely difficult to improve, since refinement is required for each SL/TL pair individually. Although, admittedly, the ultimate goal is to improve the performance of MT engines themselves, a fast and cheap error-spotting approach would be very attractive for end-users such as the translators of the European Commission, who since 1976 have made use of SYSTRAN output as a consequence of the increase in the number of documents and languages handled (Petrits 2001). Although the knowledge of whether a translation is of high or low quality does not directly improve the quality of a system's translation, such additional feedback is expected to have a series of secondary benefits for end-users in general (Flournoy & Callison-Burch 2001):

• Translators will get an idea of which translations need to be corrected.

• It will give users an insight into the strengths and weaknesses of MT technology, which will allow them to be better equipped when using the system to generate the highest quality translation.

• As users expand their knowledge of an MT system's performance, their confidence in the translation quality of the output and their satisfaction with the translation experience increase.

But how can the quality of MT output be critiqued automatically? The next chapter will outline key trends in evaluating the performance of MT engines and analyse recent efforts which used statistical methods to automatically assess the quality of MT output.

Chapter 3
Recent Efforts in Machine Translation Evaluation

At present, numerous commercial MT packages such as SYSTRAN1, PROMT2 and iTranslator3 are readily available to users. However, despite the magnificent descriptions on the box, they are only able to satisfy a limited range of general-purpose translation needs. The fact that MT is still so error-prone is the very reason why Machine Translation Evaluation (MTE) is desirable for everyone involved: researchers who need to establish if their theories are effective, commercial developers who aim to improve a system's performance, and end-users who have to decide which system to employ, or which translations will require post-editing (Hovy, King & Popescu-Belis 2002). This chapter presents a concise overview of the most significant trends in recent efforts to determine the quality of MT output. In addition, a second section provides a thorough analysis of three studies which all focused on the automatic assessment of MT quality by means of statistical methods.

1 http://www.systransoft.com
2 http://www.translate.ru
3 http://www.itranslator.com


3.1 Key Trends in MTE

3.1.1 Aspects of MT Quality

The extensive theoretical literature proposes an enormous variety of methodologies and approaches to MTE, which underlines the importance of assessing the quality of MT output.4 MTE efforts range from the early and influential ALPAC Report (Pierce, Carroll, Hamp, Hays, Hockett, Oettinger & Perlis 1966), which had a rather damaging effect on government funding for MT research and development in the USA and consequently hampered initial efforts in the field, to the largest MT evaluations supported by the US Defense Advanced Research Projects Agency (DARPA) (White 1992-1994) and beyond. Each individual approach tackles problematic questions such as: Which aspects of the MT output must be considered? How can evaluation be achieved? The most important characteristics of a translation, on which many studies have focused their attention throughout the history of MTE, are:

• Intelligibility, that is, the syntactic and lexical well-formedness of a sentence;

• Fidelity, which represents the degree of semantic clarity produced by the MT system.

As pointed out by Hovy et al. (2002), many MT researchers consider an MT system to be sufficient if it produces syntactically and lexically well-formed sentences and does not distort the meaning of the input.

4 Summaries on MTE and descriptions of individual MTE studies can be found in: Hovy et al. (2002), Hutchins (1997) and Sparck-Jones & Galliers (1996).

3.1.2 Evaluation Metrics

Despite the difficulties of measuring intelligibility and fidelity, researchers have proposed various metrics to do so. While some focused on the analysis of syntactic units


such as relative clauses, number agreement, etc. (e.g. Flanagan 1994), others required monolingual or bilingual judges to rate the adequacy, fluency and informativeness of translated sentences on an N-point scale with respect to a set of ideal human translations (e.g. White 1992-1994, Doyon, Taylor & White 1998). This latter type of MTE is a particularly difficult task, not only because the results depend on the subjective assessment of the judges, but also because researchers, and even professional translators, are unable to agree on exactly what constitutes a good translation (Vanni & Reeder 2000). This sheer lack of agreement represents an immense impediment to the establishment of ground rules for MTE, which led Vanni et al. to conclude that: "Unlike some other Natural Language Processing problems, there is no gold-standard evaluation possible." (Vanni & Reeder 2000, p.110)

The use of artificially designed test suites as input for MT engines is another method which has been employed by system developers to test the performance of MT systems with respect to particular linguistic phenomena.5 Lewis (1997) rightly pointed out that the use of such diagnostic tools for MTE is generally limited to MT researchers and system developers. Although test suites can also be employed to assess the adequacy of MT output, their construction is an expensive and complex task and therefore often beyond the resources of most users.

Unfortunately, a universally accepted and reliable metric for measuring the quality of MT output does not yet exist, and nobody has yet studied the extent to which the results of existing MTE techniques agree. The fact that human MTE is both time-consuming and labour-intensive is a major drawback of the majority of the aforementioned MTE efforts. Moreover, the complete manual assistance which most MTE methods require cannot be reused for new experiments. This is the very reason why an automated MTE method that is objective and does not require extensive human labour to be replicated is so desirable (Vanni & Reeder 2000).

5 A useful introduction to test suites in NLP can be found in (Balkan, Netter, Arnold & Meijer 1994).


3.2 Automatic MT Assessment via Statistical Methods

Automatic assessment of MT output by means of statistical language models (SLMs) is a very novel approach in the field of MTE. The following sections introduce three such methods and discuss their respective performance and cost-effectiveness.

3.2.1 Automatic Selection of the Best MT Output

A recent study on the automatic evaluation of MT output, carried out by Callison-Burch & Flournoy (2001), developed a system that automatically selected the best translation of a sentence from a set of candidate translations produced by multiple commercial MT engines. The authors of the study aimed to exploit the property that each translation package has its own strengths and weaknesses attributable to its respective implementation and resources (e.g. lexicon size and coverage). When translating a given sentence, each MT system therefore produces output of a distinct quality. Callison-Burch et al. created a program that always selects the best output from multiple MT engines for each individual sentence in the source text. These best outputs can then be concatenated to produce a target text that is expected to be of higher quality than the output each MT system would produce individually. The problem of choosing the best translation was simplified by making the basic assumption that the translation with the most fluent output is that of the highest quality. This idea was implemented by assigning each test translation a probability score provided by an SLM built from an English (TL) corpus of 2 million words. The probability score P(w_1, \ldots, w_n) for each TL sentence w_1, \ldots, w_n was computed simply by multiplying all its trigram probabilities, which condition the probability of a word on the two previous words in the utterance:

P(w_1, \ldots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \qquad (3.1)

(3.1)

Chapter 3. Recent Efforts in Machine Translation Evaluation

20

However, smoothing had to be carried out to overcome the effects of data sparsity. This was achieved by interpolating the trigram, 2-gram and 1-gram relative frequencies f (|) in cases where simple trigram probabilities were not robust enough:6

P(w3 |w1 , w2 ) = 0.80 ∗ f (w3|w1 , w2 ) + 0.14 ∗ f (w3|w2 ) + 0.099 ∗ f (w3)

(3.2)

This allowed Callison-Burch and Flournoy to automatically rank translations with respect to their fluency, and consequently their quality. Monolingual English speakers were asked to determine the fluency of each translation engine by classifying translations as one of the following categories: 1. Nearly perfect 2. Understandable 3. Barely understandable 4. Incomprehensible When comparing the systems’ performance to the human rankings, the baseline measure was considered to be the single best performing engine, i.e. that receiving the highest number of top ranks from the human subjects. When testing English translations from French sentences obtained from the web, the program performed between 7% to 19% better than the baseline metric as can be seen in Table 3.17 (CallisonBurch 2001). Table 3.2 displays the results obtained when testing French machine translations of English sentences on an SLM built from a French corpus of 1.1 million words. The multi-engine tool performs 2% to 4% better than the baseline engine. Even though this improvement was much smaller than that of the tests performed on the English 6 The

coefficients in Equation 3.2 were determined by training on a subset of the human ranked data. that the engine scores do not add up to 100% because of ties.

7 Note

Chapter 3. Recent Efforts in Machine Translation Evaluation

21

All

Barely Understandable

Understandable

Nearly Perfect

(154 sets)

(146 sets)

(118 sets)

(38 sets)

Multi-Engine

84%

82%

81%

87%

Engine E

76%

75%

70%

68%

Engine F

58%

56%

52%

45%

Table 3.1: French to English Translations.

translations, Callison-Burch and Flournoy pointed out that the program can be applied easily to other languages. All

Barely Understandable

Understandable

Nearly Perfect

(51 sets)

(50 sets)

(44 sets)

(34 sets)

Multi-Engine

67%

66%

61%

64%

Engine E

53%

52%

48%

47%

Engine F

49%

48%

41%

47%

Engine G

45%

44%

36%

32%

Engine H

51%

50%

45%

44%

Engine I

63%

62%

57%

62%

Table 3.2: English to French Translations.

It can be concluded that Callison-Burch et al. produced a very effective program which is not only a fast and cheap approach to assessing translation quality, but also a source-language-independent method. Their results also underline the capacity of SLMs to capture probability distributions of human natural language.

Chapter 3. Recent Efforts in Machine Translation Evaluation

22

3.2.2 Automatic Translation Quality Assessment Another method which proposed to automatically assess the quality of translations produced by different MT engines as well as humans, and which appeared even more relevant to this project, was that of Papineni et al. (2001). This so-called Bilingual Evaluation Understudy (BLEU) is based on the underlying assumption that the more a machine translation resembles an expert human translation the better it is. To put this assumption into practice, a numerical translation closeness metric was required to determine the translation quality of a set of test translations produced by a series of MT engines in relation to a corpus of corresponding good quality human translations. This was accomplished by training an SLM on a set of ideal reference translations and computing the distance between each test candidate and their ideal translations in terms of the number of shared word N-grams. In order to counteract over-generation, so-called modified N-gram precision scoring was implemented according to which candidate Ngram counts were clipped by their corresponding maximum reference counts, summed and divided by the total number of candidate N-grams. The sentence was used as the basic unit of evaluation. This method enabled Papineni et al. to capture both the adequacy and the fluency of a sentence, two important translation requirements. Furthermore, a sentence brevity penalty was introduced in order to ensure that a high-scoring test translation not only matched a reference translation in word choice and word order, but also in sentence length. This penalty only penalised candidates shorter than their reference translations due to the fact that those longer than their references were already accounted for by the modified N-gram precision.8 The results, as shown in Figure 3.1, illustrate that the program was capable of estimating the large difference in translation quality between humans (H1, H2) and MT engines (S1, S2, S3). The fact that it could also reliably distinguish minor differences in translation quality of MT engines themselves made this program even more useful. 8 See

Papineni et al. (2001) for more details.

Chapter 3. Recent Efforts in Machine Translation Evaluation

23

0.7 0.6 Precision

0.5 0.4 0.3 0.2 0.1 0 1

2 3 Phrase (n -gram) Length H2

H1

S3

S2

4

S1

Figure 3.1: Modified N-gram Precision for Machine and Human Translations.

Similarly to the previous study, human evaluation was carried out, this time both on monolingual and bilingual subjects, in order to assess the quality of the program with reference to human perception. Figure 3.2 illustrates that both monolingual and bilingual subjects rated human translations with a much better quality score than corresponding machine translations. Similarly to the BLEU score, they also perceived small translation quality differences between MT engines. The high level of agreement between BLEU scores and human judgment with correlation coefficients of 0.99 (with monolingual subjects) and 0.96 (with bilingual subjects) is a key advantage of this system. The fact that it is a fast, cost-effective and source-language-independent approach is another appealing feature. At this point, it is important to mention that the most important prerequisite of BLEU is the fact that each test translation requires a set of professional reference trans-

Normalised score

Chapter 3. Recent Efforts in Machine Translation Evaluation

24

1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 S1

S2

Monolingual

S3

Bilingual

H1

H2

Bleu

Figure 3.2: BLEU versus Bilingual and Monolingual Judgments.

lations to train the SLM. The quality of this gold standard represents an upper limit to the quality score of each test translation. Even though the hope is that the set of ideal reference translations will cover the range of possible variations among acceptable translations, this assumption was not backed up in the study and may well discriminate against good MT output that is not contained within the gold standard. It must also be pointed out that MT evaluation based on human reference data is restricted to the use of researchers and system designers. It would clearly be unnecessary for end-users since they would not require MT in the first place if they were already equipped with ideal reference translations.

Chapter 3. Recent Efforts in Machine Translation Evaluation

25

3.2.3 A Machine Learning Approach to Automatic MTE The third study, carried out by a Microsoft research team, presents a machine learning approach to evaluating the well-formedness of MT output by means of classifiers that learn to distinguish human reference translations from machine translations (CorstonOliver, Gamon & Brockett 2001). The authors of the study had observed that humans can reliably categorise a sentence as either machine- or human-generated and justify their decision. This observation lead them to build a system that evaluates the wellformedness of sentences in order to predict whether they are either machine translations or generated by humans. The system uses decision trees to divide the even split test data, a set of aligned sentence pairs (English human-generated sentences and English MT output translated from Spanish), into one of the two categories. This classification method is determined by a series of features extracted from a parallel reference corpus. The collection of features contains both statistical and linguistic information. The statistical information about the test set was obtained by testing trigram models built from a training corpus (lexicalised trigram perplexity and Part-of-Speech (POS) trigram perplexity). The linguistic information was extracted by means of syntactic parsing and comprises a series of fine-grained features including branching properties of parses, function word densities and constituent length. Since the test data contained an even split of human-written and machine-translated sentences, there was a 50% chance of categorising a sentence correctly. This percentage was used as a baseline for comparison. Features Used

Accuracy (%)

Statistical Features Only

74.73

Linguistic Features Only

76.51

All Features

82.89

Table 3.3: Accuracy of the Decision Tree Method

Chapter 3. Recent Efforts in Machine Translation Evaluation

26

Table 3.3 shows that, compared to the baseline, the system yielded a relatively high accuracy when only using statistical information or only linguistic information. However, the best result of 82.89% accuracy was achieved by employing a classifier that combined all features. Corsten-Oliver et al. concluded that, compared to manual MTE, their classifier represented a less time-consuming and less expensive approach to assessing MT systems that are under constant development and require frequent feedback of their performance. One advantage of the system is that more subtle features can be added to the classifier as the performance of an MT engine improves. While the methodology claims to be independent of source and target language as well as the subject domain, the main drawback is that the system requires a large quantity of aligned translations. Although aligned corpora are available for languages such as English, French, Spanish or German, in domains such as newspaper text or parliamentary debates, it is not as easy to find large quantities of aligned corpora in other domains or for less mainstream languages. This data sparsity problem is a significant restriction on the usage of this classifier. It can only be employed for language pairs for which a sufficiently large parallel corpus is available.

3.3 Summary The literature review showed that recent efforts to produce systems that automatically assess the well-formedness of MT output have all used statistical language modeling technology that capture regularities in natural language. This is a very novel approach to MTE which will be further investigated in this study. The two latter systems (Papineni et al. 2001, Corston-Oliver et al. 2001) were developed as research aids based on a limited set of human reference translations or requiring a parallel corpus for training and testing. The system developed by Callison-Burch et al. (2001), which selects the best translation from a set of machine-translated sentences produced by

Chapter 3. Recent Efforts in Machine Translation Evaluation

27

multiple commercial MT engines, is the only one that, in its current form, is tailored to end-users of MT systems. The study presented in this MSc dissertation also investigates a method using statistical language modeling technology that can be implemented by end-users to assess MT output. However, rather than determining the best translation of a set, the aim here is to produce a system which can differentiate a between good- and bad-quality translated MT output in order to give the end-user more feedback on the performance of the MT engine with regards to individual translated sentences.

Chapter 4 Theory and Implementation This chapter outlines the theoretical framework of conventional language modeling technology, focusing particularly on solutions to the data sparsity problem of N-gram models. Moreover, the fundamental theory of entropy and the reasons for choosing it as a measure to capture the regularities in machine translations are explained. The chapter goes then on to describe the resources which were used for conducting this research - that is, the MT system and the statistical language modeling toolkit, the training data and the test data. The pre-processing steps required to prepare the data as input for the SLM and the experimental procedure are set out in the remaining sections of this chapter.

4.1 Statistical Language Modeling Theory The development of SLMs dates back to the beginning of the 20th century. They were first applied by Andrei Markov in 1913 who aimed to model letter sequences in a work of Russian literature (Markov 1913). Moreover, Claude Shannon’s early work focused on SLMs of letter and word sequences to illustrate the implications of coding and information theory (Shannon 1948).

28

Chapter 4. Theory and Implementation

29

The basic aim of an SLM is to model a word sequence P(W ) = P(w1 , w2 . . . , wm ) by conditioning the probability of the next word in the utterance on that of previous words (i.e. the history): P(W ) = P(w1 )P(w2 |w1 ) . . . P(wm |w1 , . . ., wm−1 )

(4.1)

4.1.1 Simple N-gram Models However, previous identical textual histories are often unavailable since we never have enough data at our disposal to contain all possible utterances. We therefore require a method which groups similar histories. This can be achieved by making the Markov assumption that the probability of the next word is only affected by that of the prior local context (n − 1) which allows the approximation: P(wi |w1 , . . ., wi−1 ) ≈ P(wi |wi−n+1 , . . . , wi−1 )

(4.2)

Given such simple N-gram models, it is then straightforward to compute the probability of an utterance P(W ). Equation 4.3 shows how the probability of a word sequence of length m is calculated using the conventional trigram model (n = 3) where only the most recent two words of the history are used to condition the probability of the next word. m

P(W ) = P(w1 , w2 . . . , wm ) ≈ P(w1 )P(w2 |w1 ) ∏ P(wi |wi−2 , wi−1 )

(4.3)

i=3

4.1.2 N-gram Models over Sparse Data The main problem that occurs when using a such a simple trigram model is the fact that the training vocabulary size V would allow V 3 potential trigrams, most of which do not actually occur in the training corpus. This data sparsity would lead the toolkit to commit a large number of errors, since each test utterance containing a trigram which

Chapter 4. Theory and Implementation

30

does not occur in the training corpus would be assigned a zero probability when using the relative trigram frequency f (wi |wi−2 , wi−1 ) as an estimate of the trigram probability P(P(wi |wi−2 , wi − 1): P(wi |wi−2 , wi−1 ) ≈ f (wi |wi−2 , wi−1 ) =

C(wi−2 , wi−1 , wi ) C(wi−2 , wi−1 )

(4.4)

where C(w_1, ..., w_i) denotes the frequency of the N-gram w_1, ..., w_i in the training corpus.

4.1.2.1 Backing-Off and Smoothing

In order to counteract this data sparseness problem, the CMU-Cambridge SLM Toolkit, which was used in our experiments, employs a technique called backing-off. In the case of a trigram model, trigram probabilities are used as long as their estimates are robust; if not, bigrams are used instead, and once these also no longer give robust results, the model switches to unigrams (Katz 1987). The toolkit considers an estimate robust if the N-gram appeared at least once in the training data. Hence, backing off to a lower order N-gram only happens if there is zero evidence for the higher order.

Backing-off is often combined with smoothing, which entails transferring some probability mass from frequently observed events to unseen or rare events, i.e. N-grams with zero or low frequencies. This is achieved by firstly determining a threshold, then discounting some probability mass from those N-grams above the threshold, and redistributing it to those below (see Figure 4.1). The main aim of this technique is to make the overall probability distribution more uniform (Osborne 2002). Equation 4.5 (Manning & Schütze 2001) shows the computation necessary for applying backing-off and smoothing to the trigram model, whereby P̃ refers to the discounted probabilities and k is usually set to either 0 or 1.

Figure 4.1: The Basic Concept of Smoothing (King 2002); three panels (Thresholding, Discounting, Redistribution) plot probability against N-gram frequency.

P_BO(w_i | w_{i-2}, w_{i-1}) =
    P̃(w_i | w_{i-2}, w_{i-1}),               if C(w_{i-2}, w_{i-1}, w_i) > k
    α(w_{i-2}, w_{i-1}) P̃(w_i | w_{i-1}),    if C(w_{i-2}, w_{i-1}, w_i) ≤ k and C(w_{i-1}, w_i) > k
    α(w_{i-1}) P̃(w_i),                        otherwise
                                                                            (4.5)

One complication involved in the use of backing-off and smoothing is the fact that

the probability mass across all words within the training corpus becomes greater than 1. This difficulty is resolved by setting a back-off weight α, a normalising factor which guarantees that the sum of all probabilities adds up to 1.

There are various strategies to carry out discounting, all based on the basic principle of smoothing. For this research, I experimented with three different smoothing techniques supported by the CMU-Cambridge SLM Toolkit: linear, Witten-Bell and Good-Turing smoothing. Linear smoothing discounts a percentage proportional to the frequencies of more commonly seen N-grams (Good 1953, Ney, Essen & Kneser 1994), while Witten-Bell is a more complex method based on modeling the probability of seeing a zero-frequency N-gram by means of the probability of seeing an N-gram for the first time (Witten & Bell 1991). Good-Turing smoothing is an even more complex method that lends extra counts from more frequent N-grams to infrequent ones, derived from a concept called the frequency of frequencies (Gale & Sampson 1995).¹

¹ The paper of Chen and Goodman (1998) provides a comprehensive explanation and comparison of all types of smoothing.
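A minimal sketch of the back-off scheme of Equation 4.5 is given below. It is a deliberate simplification: it uses a fixed absolute discount instead of the linear, Witten-Bell or Good-Turing discounts offered by the toolkit, and it omits the normalising back-off weights α, so it only illustrates the control flow of falling back from trigram to bigram to unigram estimates. All names and counts are hypothetical.

def backoff_prob(w, context, tri, bi, uni, total, discount=0.5):
    """Trigram probability with backing-off (cf. Equation 4.5, k = 0).
    tri/bi/uni are raw count dictionaries and total is the token count.
    A fixed absolute discount stands in for the toolkit's discounting
    schemes, and the alpha back-off weights are left out for brevity."""
    w1, w2 = context
    if tri.get((w1, w2, w), 0) > 0:                    # robust trigram estimate
        return (tri[(w1, w2, w)] - discount) / bi[(w1, w2)]
    if bi.get((w2, w), 0) > 0:                         # back off to the bigram
        return (bi[(w2, w)] - discount) / uni[w2]
    count = uni.get(w, 0)                              # back off to the unigram
    return max(count, 1) / total                       # crude floor for unseen words

# Hypothetical counts for illustration only
tri = {("ICH", "GEHE", "NACH"): 2}
bi = {("ICH", "GEHE"): 2, ("GEHE", "NACH"): 2}
uni = {"GEHE": 2, "NACH": 2}
print(backoff_prob("NACH", ("ICH", "GEHE"), tri, bi, uni, total=8))   # trigram found: 0.75
print(backoff_prob("HAUSE", ("ICH", "GEHE"), tri, bi, uni, total=8))  # backs off to the unigram floor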


4.2 Entropy Theory

Entropy (or perplexity²) is the metric commonly used to directly evaluate the performance of SLMs and is very important in speech and language processing. Shannon (1948), who described entropy as a measure of information content, defined it mathematically as:

H(X) = - ∑_{x ∈ X} P(x) log_2 P(x)    (4.6)

where the random variable X ranges over what is to be predicted (words, letters, etc.). Since the base of the logarithm is 2, entropy is measured in bits. According to Equation 4.6, entropy can therefore be interpreted as the minimum number of bits required to encode a piece of information. This metric can be used to determine the amount of information in a particular SLM, how well an SLM matches a given language, or the degree to which an SLM captures regularities in the corpus by minimising unpredictability and ambiguity.
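As a small illustration, Equation 4.6 can be evaluated directly for a hand-made distribution; the probabilities in the Python sketch below are invented for the example and do not come from any corpus used in this project.

import math

def entropy(dist):
    """H(X) = -sum_x P(x) * log2 P(x), in bits (Equation 4.6)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# A uniform choice between four words carries 2 bits of information;
# a skewed distribution carries less because it is more predictable.
print(entropy({"DER": 0.25, "DIE": 0.25, "DAS": 0.25, "EIN": 0.25}))  # 2.0
print(entropy({"DER": 0.7, "DIE": 0.2, "DAS": 0.1}))                  # roughly 1.16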

4.2.1 Entropy of Language

If a language is regarded as a stochastic process L which produces sequences of words (W = w_1, ..., w_n), the true entropy of the language, H(L), i.e. the average surprise value associated with a language event, can be defined as the limit of the per-symbol entropy when taking sequences of infinite length into account (Tür 2000):

H(L) = - lim_{n→∞} (1/n) ∑_{w_1,...,w_n ∈ L} P(w_1, ..., w_n) log_2 P(w_1, ..., w_n)    (4.7)

² Perplexity PP is derived from entropy H as PP = 2^H and can be interpreted as the probabilistic approximation of the average branching factor of the test set (Jelinek 1997).


4.2.2 Cross Entropy

However, in most real-life situations we do not know the true probability distribution P. N-gram models function on the basis of the Markov assumption that only the probability of the local prior context affects the probability of the next element in a sequence (cf. Chapter 4.1.1). Unlike natural language, N-gram models are therefore stationary and only provide an approximate estimate of the probability distribution P and of the entropy of natural language. This approximation of the true entropy is called cross entropy, an entropy estimate derived from the SLM (see Equation 4.8) which is always greater than or equal to the actual entropy.

H_SLM(L) = - lim_{n→∞} (1/n) ∑_{w_1,...,w_n ∈ L} P(w_1, ..., w_n) log_2 P_SLM(w_1, ..., w_n)    (4.8)

The Shannon-McMillan-Breiman theorem states that if language is stationary and ergodic, which means regular in certain ways, the cross entropy can be computed from the log probability that the model assigns to a sufficiently long sample of the language (see Algoet & Cover 1988 for a detailed description of the theorem):

H_SLM(L) = - lim_{n→∞} (1/n) log_2 P_SLM(w_1, ..., w_n)    (4.9)

When testing the SLM on a single sentence with a probability P(w_1, ..., w_n), the CMU-Cambridge SLM Toolkit determines the per-word cross entropy, i.e. the average number of bits per word that are required to encode the sentence of length n:

H_SLM(w_1, ..., w_n) = - (1/n) log_2 P_SLM(w_1, ..., w_n)    (4.10)

In Equation 4.10, n is the number of words in the test sentence. Since cross entropy increases with the length of the string, dividing by n helps to normalise cross entropy scores, so that a given model will be assigned roughly the same cross entropy when tested on sentences of different lengths.
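In the same spirit, the per-word cross entropy of Equation 4.10 can be recomputed from the individual conditional probabilities a model assigns to the words of a sentence. The toolkit's evallm program reports these scores directly; the sketch below only re-creates the arithmetic, and the probability values in it are invented for illustration rather than taken from the models trained in this project.

import math

def per_word_cross_entropy(ngram_probs):
    """H = -(1/n) * log2 P(w1..wn), where P is the product of the conditional
    N-gram probabilities of the n words of the sentence (Equation 4.10)."""
    n = len(ngram_probs)
    return -sum(math.log2(p) for p in ngram_probs) / n

def perplexity(cross_entropy):
    """PP = 2^H (see the footnote on perplexity)."""
    return 2 ** cross_entropy

# Hypothetical conditional probabilities for a five-word sentence.
probs = [0.1, 0.02, 0.005, 0.2, 0.05]
h = per_word_cross_entropy(probs)
print(round(h, 2), round(perplexity(h), 1))   # about 4.65 bits, perplexity about 25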


4.2.3 Cross Entropy as a Translation Quality Measure

In our experiments, the cross entropy of a conventional trigram model combined with backing-off and smoothing is used as an initial upper bound for the true entropy of German sentences. Since the model is built from a very large and consequently representative sample of German text (the training corpus), it is assumed that it is able to capture the language well. The main aim of the project is to examine whether the cross entropy of the model can in actual fact serve as a quality measure of machine translations (the test set), without the help of additional linguistic knowledge. If this is the case, good machine translations are expected to produce low cross entropy scores, as they are well matched by the model of natural German. On the other hand, bad machine translations should lead to higher entropy values because the model is less able to capture their irregularities. The ultimate goal is to determine a cross entropy threshold relative to which an unseen machine translation can be classified as either a good translation or one which requires post-editing, i.e. hand correction.

According to the crucial supposition of this research that the SLMs are capable of capturing the regularities of natural German, the representative German language models are consequently expected to produce generally lower cross entropy results for good-quality MT output than for machine translations containing errors. The cross entropy measure is used to evaluate SLMs built under various training regimes (e.g. increasing N-gram order, different types of smoothing, varying amounts of training data and different vocabulary sizes) to assess the behaviour of the statistical language modeling parameters and their influence on the results.

The main advantages of using cross entropy to score machine translations are that the measure has a clear interpretation in the probabilistic context and allows a meaningful comparison of different SLMs built from the same corpus. Cross entropy is also used as an evaluation metric for the experiments of this project because the product of the individual N-gram conditional probabilities (each less than one) results in a very small probability estimate for a given test sentence. A further advantage of choosing cross


entropy over basic probability measures is that cross entropy is already normalised over sentence length n (see Equation 4.10) and therefore represents a much better metric for this research.

4.3 Resources

4.3.1 The MT System

SYSTRAN (SYStem TRANslation) was used for this research since it is a well-known MT engine that is easily accessible online (see the SYSTRAN homepage at www.systran.com). Its first prototype, built in 1968, translated solely from Russian to English. SYSTRAN is by and large considered to be a first-generation direct MT system since it is based on extensive dictionary look-up. It therefore falls into the category of RBMT systems (see Chapter 2.1.1). Today, more than 30 years after SYSTRAN's inception, its corporate version can handle translation between 36 different language pairs (http://www.systransoft.com).

Unlike most other direct approaches to MT, SYSTRAN analyses the source text to some extent. Its architecture, shown in Figure 4.2, consequently resembles a transfer-based MT system, although it does not follow the strict theoretical transfer paradigm (Hutchins & Somers 1992). The main characteristic of SYSTRAN's translation process is the extensive use of large bilingual dictionaries, which provide lexical equivalences as well as grammatical and semantic information at each stage. The MT system analyses source text according to its own linguistic framework (Yang & Gerber 1996). Figure 4.2 shows clearly the different pre-processing, analysis, transfer and synthesis stages through which the source text is passed. The text pre-processing routines involve the look-up of several dictionaries as well as a morphological analysis of the entire text. After that, a more profound linguistic SL analysis is carried out which helps to resolve homographs, identify both phrase boundaries and primary syntactic relations, determine coordinate structures as

Figure 4.2: SYSTRAN Translation Process (a flowchart of the pre-processing, analysis, transfer and synthesis stages, with the associated idiom, main, homograph and limited semantics dictionary look-ups)


well as both the subject and the predicate within sentences, and mark up the deep-structure relations between predicates and their arguments. All pieces of linguistic information are encoded in the so-called byte area appended to each word and are reused later at the transfer and synthesis stages (Petrits 2001). It must be underlined that the analysis routines operate strictly on the SL, with the aim of establishing grammatical, syntactic and semantic relationships between the words in the source text. Only at the next stage, the bilingual transfer stage, does SYSTRAN begin to consider the actual translation of the text. At this point, the system uses a number of lexical routines to retrieve the meaning of conditional idioms, determine correct prepositions and choose between a selection of applicable translations for particular words. The result is a concatenation of TL equivalents for the SL expressions in their canonical form. TL synthesis is consequently required to assign the default translation for any word not already dealt with, add inflections and morphological appurtenances, insert articles, prepositions and infinitive particles according to the syntactic rules of the TL, and finally adapt the word order by means of rearrangement routines (Hutchins & Somers 1992).

Despite the fact that SYSTRAN's architecture is relatively transparent and that a lot of research has been carried out to enhance the system's performance, it employs linguistic rules which are often incoherent, since they are designed empirically for particular constructions in a given language. Each routine may therefore cause unwanted side-effects on the results at other stages. This is the main reason why SYSTRAN is only able to carry out a limited analysis of the source text and produces more translation errors when translating documents that contain complex syntactic and semantic constructions. Angeliki Petrits (2001), a member of the MT Management Team of the European Commission Translation Service, notes that the system often struggles to establish the correct word order in German when translating from either French or English, language pairs with a considerably different word order. Even though SYSTRAN supports translation for multiple language pairs, it cannot be regarded as a real multilingual MT system but rather as a compilation of several uni-directional MT systems,


since each language pair requires its own individual routines (Wong 2001). While this method deals well with some of the translation problems, the major drawback of this system architecture is that a lot of time and effort is required to develop and upgrade each individual uni-directional MT module.

4.3.2 The Statistical Language Modeling Toolkit

The Carnegie Mellon University-Cambridge Statistical Language Modeling (CMU-Cambridge SLM) Toolkit, which was used for this research, comprises a set of Unix software tools which, supplied with a training corpus, facilitate the construction of SLMs (Rosenfeld 1995). The toolkit then uses the resulting SLMs to compute the perplexity and entropy scores of a test corpus. This collection of short and simple programs, required to both train and test the SLMs (see Figure 4.3), can be concatenated by the user, which makes the software very straightforward to employ (Clarkson & Rosenfeld 1997). This design is of particular advantage when modifying model parameters and testing their influence on the results. Another reason for using the CMU-Cambridge SLM Toolkit is that it is able to process large amounts of data, an important prerequisite for this project.

The toolkit considers language input as a stream of words separated by white space (new lines have no special meaning) and is, by means of an SLM, capable of assigning a probability P(W) to every conceivable word string

W = w_1, w_2, ..., w_m    (4.11)

in an unseen test set. It can therefore estimate distributions over sentences whose probabilities are learned from the training data (Jelinek 1997).

Figure 4.3: CMU-Cambridge SLM Toolkit Design (training pipeline: text2wfreq and wfreq2vocab derive the vocabulary from the training corpus, while text2wngram, wngram2idngram and idngram2lm build the statistical language model; testing the model with evallm on the test corpus produces the entropy/perplexity scores)

4.3.2.1 Context Cues

It must be noted at this point that the toolkit is not able to detect individual sentences in either the training or the test data unless the text is marked with additional context cues which indicate the beginning (<s>) and the end (</s>) of a sentence. Both markers are part of the vocabulary and provide useful contextual information for the N-grams.

4.3.2.2 Vocabulary Size

Conventional N-gram models are generally defined relative to a particular vocabulary. In the first experiment (when evaluating an SLM built from a 28.3 million word training corpus), we restricted the vocabulary size to 65,143 entries, that is, all words which appeared at least 12 times in the training text. All words outside the given vocabulary, that is, Out-Of-Vocabulary words (OOVs), were mapped to the same unknown symbol, called <UNK>. The main reason for this procedure is that the difficulty of constructing an SLM, and its complexity, increases with the vocabulary size. The CMU-Cambridge SLM Toolkit allowed us to build an open vocabulary which acknowledges and models OOVs in the same manner as the words within the vocabulary.
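The vocabulary restriction and the <UNK> mapping can be sketched as follows. The frequency cut-off of 12 matches the description above, but the corpus and the helper names are hypothetical rather than the toolkit's own (the toolkit performs the counting and vocabulary selection with its text2wfreq and wfreq2vocab programs).

from collections import Counter

def build_vocabulary(sentences, min_count=12):
    """Keep every word type that occurs at least min_count times."""
    counts = Counter(w for sent in sentences for w in sent)
    return {w for w, c in counts.items() if c >= min_count}

def map_oovs(sentence, vocab, unk="<UNK>"):
    """Replace out-of-vocabulary words with the unknown symbol."""
    return [w if w in vocab else unk for w in sentence]

# Hypothetical toy data for illustration
corpus = [["ICH", "GEHE"], ["ICH", "KOMME"]]
vocab = build_vocabulary(corpus, min_count=2)        # {"ICH"}
print(map_oovs(["ICH", "GEHE"], vocab))              # ['ICH', '<UNK>']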


4.3.3 The Corpora

4.3.3.1 Training Data

In order to build an SLM which is able to reliably estimate the probabilistic distribution of natural German, a very large German training corpus was required. The aim was, therefore, to find as much German text as possible. We chose a subset of the European Language Newspaper Text (LDC95T11) made available by the Linguistic Data Consortium (LDC, http://www.ldc.upenn.edu/Catalog/LDC95T11.html). The corpus extracted for this project comprises 48.1 million words of German newspaper text provided by the Associated Press (AP) World Stream, the Agence France Presse (AFP) and the Deutsche Presse Agentur (DPA). All SGML mark-up was stripped during the text pre-processing stage (see Chapter 4.4.1).

4.3.3.2 Test Data

A small English-German aligned parallel corpus comprising 230 sentences per language was used for the pilot experiment, which compared the entropy scores of human-written text with those of the corresponding machine translations. The language model was tested on 460 German sentences, comprising the original German sentences and the German MT output of the parallel English sentences. The test corpus used for the main experiments consisted of 1,000 sentences from newspaper articles published in the Washington Post. All 1,000 sentences were translated into German by SYSTRAN (see Chapter 4.3.1). A subset of 147 good-quality and 147 bad-quality machine-translated sentences served as a heldout set to study the behaviour of the SLM.

4.4 Data Preparation

Since the aim is to evaluate the quality of individual translated sentences, the SLM needed to be trained on a large corpus of tokenised training sentences. In order to


tokenise sentences and determine sentence boundaries, it was first necessary to deal with several low level formatting issues, such as junk formatting and capitalisation.

4.4.1 Junk Formatting

Since the training data extracted from the LDC corpus contained real German newspaper text and SGML mark-up, it was necessary to filter out information which the SLM was either unable to deal with or which would distort the probabilities of the model. Since the aim was to build an SLM representative of natural German, I wrote a filter program, a series of Perl and Unix shell scripts, to remove all junk, such as mark-up, newspaper headlines, tables, etc. Since the training data originated from three different newspapers with different formatting, the filter program had to be adapted for each. The example below shows that this program retained only connected German text, which served as the input to the tokeniser.

Raw Text:
0114 D/Bunt Pilzsuche kann Taxischein kosten
Lindau, 27. September (AFP) - Nicht einmal Pilze kann man in Bayern suchen, ohne von der Polizei behelligt zu werden.

Text after Junk Formatting:
Nicht einmal Pilze kann man in Bayern suchen, ohne von der Polizei behelligt zu werden.
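The original filter was a collection of Perl and Unix shell scripts tuned to each news provider. The Python sketch below only illustrates the general idea for AFP-style articles such as the one above; its patterns are invented for this example and are not the rules actually used in the project.

import re

def filter_afp_article(lines):
    """Keep only connected German body text from a raw AFP-style article:
    drop SGML mark-up and catchword/headline lines, and strip the dateline.
    The patterns are illustrative assumptions, not the project's own rules."""
    kept = []
    for line in lines:
        line = re.sub(r"<[^>]+>", "", line).strip()        # strip SGML tags
        if not line:
            continue
        if re.match(r"^\d{4}\s+D/", line):                  # catchword/headline line, e.g. "0114 D/Bunt ..."
            continue
        # Remove a leading dateline such as "Lindau, 27. September (AFP) - "
        line = re.sub(r"^[A-ZÄÖÜ][\w./-]*, \d{1,2}\. \w+ \((AFP|AP|DPA)\) - ", "", line)
        kept.append(line)
    return kept

raw = ["0114 D/Bunt Pilzsuche kann Taxischein kosten",
       "Lindau, 27. September (AFP) - Nicht einmal Pilze kann man in Bayern "
       "suchen, ohne von der Polizei behelligt zu werden."]
print(filter_afp_article(raw))   # only the connected body sentence survives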


4.4.2 Capitalisation

As the capitalisation practices were not consistent throughout the original data collection, all words were converted to uppercase. Such neutralisation of case distinctions has the major advantage that it allows the SLM to treat tokens the same when they are identical except for case. The German newspaper text, for example, contains both the tokens FRANKFURT and Frankfurt, which after capitalisation are correctly counted as one and the same word by the SLM. However, capitalisation also entails some drawbacks of its own, since the model is then no longer able to distinguish between tokens which have the same spelling but different meanings depending on whether the first letter is in upper or lower case, e.g. oder, the German word for or, and Oder, the river. On balance, however, the advantages of capitalisation are believed to outweigh these drawbacks.

4.4.3 Tokenisation

The German newspaper text then had to be tokenised, which means that individual words, numbers, abbreviations, acronyms, dates, punctuation marks etc. had to be surrounded with white space on either side. When designing the tokeniser, it turned out that the task was far from trivial. The question of what constitutes a word was problematic and could unfortunately not be solved with the definition of a graphic word from Kučera and Francis, who defined it as:

"a string of contiguous alphanumeric characters with space on either side; may include hyphens and apostrophes, but no other punctuation marks." (Kučera & Francis 1967) and (Francis & Kučera 1982, p.3)

The following examples show that this definition is flawed and illustrate the problem punctuation actually poses for tokenisation.


4.4.3.1 Abbreviations, Acronyms and Initials

Words often have punctuation marks attached which cannot simply be separated off by white space, as in the case of abbreviations or initials, where the period is regarded as part of the word. Since detaching periods from abbreviations and initials would create particular problems for determining sentence boundaries, it was decided to remove sentence-internal punctuation altogether. All sentence-boundary punctuation (including colons and semi-colons attached to the end of alphanumeric tokens but followed by white space) was kept, as it was used at the later sentence segmentation stage. In some cases, however, it was impossible to distinguish whether a period belonged to a token, e.g. ART., the German abbreviation for article, or whether it was a token of its own and therefore indicated a sentence boundary, e.g. ART., the German word for way occurring at the end of a sentence. Tokenisation of abbreviations and acronyms containing more than one period, such as Z.B. (e.g.) or S.O.S., posed similar problems and, on top of that, made it impossible for the SLM to distinguish them from initials.

4.4.3.2 Compound Nouns

Unlike in English, German compound nouns are graphically distinct from non-compounds, which made tokenisation easier. However, when German compounds are part of a coordinated phrase in which two compounds share the same sub-part, such as OBST- UND GEMÜSEGARTEN (literally: fruit and vegetable garden), tokenisation became more problematic. In such cases, it was decided to remove the dash and treat the first part of the first compound as a token on its own. Dashes and forward slashes were also removed when surrounded by white space on either side, but kept where they appeared within a token. Other token-internal punctuation, such as appears in numbers, amounts and time expressions, e.g. 5.200,99 (5,200.99 in English), 1 1/2 KG or 6:00 UHR (6 o'clock), was also kept for obvious reasons.


4.4.3.3 Apostrophes

As far as apostrophes are concerned, they lead to as much confusion in German as they do in English. Apart from marking single quotations, apostrophes are often used in written German to indicate elisions, such as in the expression DAS WAR'S for DAS WAR ES (that was it). In such cases, it was decided to replace the apostrophe by white space and treat the S as a separate token. However, this decision resulted in some unavoidable tokenisation errors. Even though the use of 'S for genitive forms of names is usually viewed as incorrect, the orthography reform of 1998 sanctioned the usage of the apostrophe in instances such as ANDREA'S BLUMENLADEN (Andrea's flower shop), where the genitive form ANDREAS of the female name ANDREA would otherwise be confused with the nominative form of the male name ANDREAS (Der Duden 1996, p.25). Moreover, the ever increasing influence of English leads to the genitive 'S being used more often, even in newspaper text. The tokeniser consequently treated 'S-genitives the same as 'S-elisions, which made the tokenised S indistinguishable for the SLM. It must be underlined that many of these punctuation problems are exceptions and therefore occurred extremely rarely in the data. As the German newspaper corpus used to train the SLM was very large, it was assumed that such tokenisation errors would not have a significant effect on its probability distribution.
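A much simplified tokeniser capturing some of the decisions described above (upper-casing, keeping token-internal punctuation, stripping other sentence-internal punctuation, splitting 'S elisions, retaining sentence-boundary punctuation) might look like the sketch below. The real tokeniser was considerably more elaborate; the regular expressions here are illustrative assumptions only and do not handle compounds or abbreviations.

import re

def tokenise(text):
    """Very rough approximation of the tokenisation rules described above."""
    text = text.upper()                                   # neutralise case (see 4.4.2)
    text = re.sub(r"'S\b", " S", text)                    # split 'S elisions/genitives (see 4.4.3.3)
    tokens = []
    for tok in text.split():
        if re.fullmatch(r"[\wÄÖÜß]+([.,:/-][\wÄÖÜß]+)+[.?!:;]?", tok):
            # token-internal punctuation (numbers, times, slashes) is kept intact
            core, trail = re.fullmatch(r"(.*?)([.?!:;]?)", tok).groups()
        else:
            core, trail = re.fullmatch(r"(.*?)([.,?!:;\"()]*)", tok).groups()
            core = core.strip('",()')                     # drop remaining sentence-internal punctuation
        if core:
            tokens.append(core)
        if trail and trail in ".?!:;":
            tokens.append(trail)                          # keep sentence-boundary punctuation as a token
    return tokens

print(tokenise("Das war's um 6:00 Uhr."))
# ['DAS', 'WAR', 'S', 'UM', '6:00', 'UHR', '.']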

4.4.4 Sentence Segmentation

After tokenisation, the data had to be separated into individual sentences by means of rules for identifying sentence boundaries. As explained above, all sentence-internal punctuation (apart from token-internal punctuation) was removed during tokenisation. The remaining punctuation marks (full stops, question marks, exclamation marks, colons and semi-colons) followed by white space were therefore regarded as sentence-boundary punctuation and served as indicators of where to segment the text. Since


the task is to model natural German sentences, it is evident that the performance of sentence segmentation has a great influence on the probabilities of the model. Although segmentation errors were unavoidable (e.g. most, but not all, colons and semi-colons in German are followed by a new sentence), much work was invested in producing a very accurate sentence splitter. Several random spot checks of 100 sentences showed an accuracy ranging from 98% to 100%.

After sentence segmentation, sentence-boundary punctuation was also removed from the data. It is acknowledged that punctuation provides clues about the overall structure of text and the likely dependencies within it (Manning & Schütze 2001). However, the reason for removing all non-token-internal punctuation from the data was to reduce the number of unseen N-grams within the training corpus. The reader should also note that, while the preceding sections largely discussed data preparation with respect to the training corpus, the SYSTRAN-translated test sentences had to undergo the same pre-processing steps (with the exception of junk formatting). This was necessary in order to guarantee the same format for both test and training data as input to the SLM toolkit. Both training and test sentences were also assigned the context cues <s> and </s> (marking the beginning and end of each sentence respectively) required by the SLM to distinguish between separate sentences. The context cues thus reintroduced sentence-boundary information that had been discarded earlier when removing punctuation.
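The following sketch shows the flavour of the segmentation and context-cue step on an already tokenised stream. The real splitter used more careful rules (abbreviation handling in particular), so this is a simplified, assumption-laden illustration rather than the script used in the project.

SENTENCE_END = {".", "?", "!", ":", ";"}

def segment_and_mark(tokens):
    """Split a tokenised stream at sentence-boundary punctuation, drop the
    punctuation itself and wrap each sentence in the <s> ... </s> context
    cues expected by the SLM toolkit (simplified sketch)."""
    sentences, current = [], []
    for tok in tokens:
        if tok in SENTENCE_END:
            if current:
                sentences.append(["<s>"] + current + ["</s>"])
                current = []
        else:
            current.append(tok)
    if current:                       # flush a trailing sentence with no end mark
        sentences.append(["<s>"] + current + ["</s>"])
    return sentences

tokens = ["ICH", "GEHE", ".", "DU", "KOMMST", "?"]
for sent in segment_and_mark(tokens):
    print(" ".join(sent))
# <s> ICH GEHE </s>
# <s> DU KOMMST </s>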

4.5 Experimental Procedure

Once tokenised and segmented into sentences, the training data was ready to be used to build the SLM of natural German sentences, which was in turn tested on the pre-processed test translations.


4.5.1 Training and Testing the SLM

The first SLM was built from a training corpus comprising 1.5 million sentences (28.3 million word tokens). This corpus contained 418,677 graphically distinct words, called word types, some appearing more frequently than others. Only 65,143 of them, those which appeared at least 12 times in the training data, were entered into the vocabulary. The remaining 353,534 word types were classed as OOVs. Table 4.1 indicates how the number of word types that occurred with a particular frequency in the corpus relates to the actual coverage of the corpus. The first column, C, shows the number of observed instances of a word type. The second column, N(Word Types), shows the number of word types which had this count. The third column, N(Word Tokens), represents the product of C and N(Word Types), i.e. the number of word tokens in the corpus contributed by the word types with count C.

Table 4.1 shows that the OOVs (those words which appeared fewer than 12 times in the training data) account for 84.4% of the total number of word types. However, multiplied by their low frequencies, they only make up 867,781 of the total 28,339,108 word tokens contained in the training corpus and therefore only cover 3.1% of the text. This means that the word types which are part of the vocabulary cover a very large majority (96.9%) of the training text.


C        N(Word Types)        N(Word Tokens)
1        188,822              188,822
2        76,929               153,858
3        40,715               122,145
4        27,940               111,760
5        20,463               102,315
6        16,645               99,870
7        13,926               97,482
8        12,354               98,832
9        10,970               98,730
10       4,417                44,170
11       3,855                42,405
1-11     353,534 = 84.4%      867,781 = 3.1%
≥ 12     65,143 = 15.6%       27,471,327 = 96.9%
Total    418,677              28,339,108

Table 4.1: Word Types and Word Tokens
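The figures in Table 4.1 can be reproduced mechanically from a frequency count of the training corpus. The sketch below shows the computation on a hypothetical count dictionary; it is not the script actually used to produce the table.

from collections import Counter

def coverage_by_cutoff(word_counts, cutoff=12):
    """Split word types at a frequency cut-off and report how many word
    types and word tokens fall on each side (cf. Table 4.1)."""
    in_vocab = {w: c for w, c in word_counts.items() if c >= cutoff}
    oov = {w: c for w, c in word_counts.items() if c < cutoff}
    total_tokens = sum(word_counts.values())
    return {
        "vocab_types": len(in_vocab),
        "oov_types": len(oov),
        "vocab_token_coverage": sum(in_vocab.values()) / total_tokens,
        "oov_token_coverage": sum(oov.values()) / total_tokens,
    }

# Hypothetical counts; the real counts came from the 28.3 million word corpus.
counts = Counter({"DER": 500, "UND": 400, "PILZSUCHE": 2, "TAXISCHEIN": 1})
print(coverage_by_cutoff(counts, cutoff=12))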

The first SLM, a conventional trigram model combined with backing-off and smoothing, was then tested on individual test sentences, which resulted in an entropy score for each. These scores, together with the ranking category assigned to each sentence (see next section), then allowed us to evaluate whether it is possible to distinguish between good- and bad-quality translations simply by means of raw cross entropy scores. Further experiments were carried out to test:

• How different order N-gram models (3-gram, 4-gram, 5-gram and 6-gram) influence the entropy results.
• The effects which different types of smoothing (linear, Good-Turing and Witten-Bell) have.
• To what extent training corpora of smaller and larger sizes (250,000, 500,000, 1,000,000, 1,500,000, 2,000,000 and 2,500,000 sentences) have an effect on the results.
• In what way different sizes of vocabulary change the distribution.

4.5.2 Human Ranking of Test Sentences

In order to test the extent to which cross entropy can be used as a quality measure to distinguish good from bad translations, the translation quality of all German test


machine translations was ranked by a trained German native translator according to the following scale:

1. Correct translation - no correction is required.
2. Acceptable translation - minor translation errors (wrong preposition, determiner or word order) which do not distort the overall meaning of the original sentence; little correction required.
3. Poor translation - several severe translation errors; correction is absolutely necessary, but a monolingual reader can still guess what the sentence means. The human translator would require limited use of the original text for post-editing.
4. Wrong translation - so many errors that there is no resemblance to the original sentence. The human translator would need the original sentence for post-editing and may want to disregard the MT output altogether.

For the purpose of this project, we decided to group correct and acceptable translations as good, and poor and wrong translations as bad.

4.5.3 Threshold Determination

Although the initial aim of this project was to distinguish between good- and bad-quality translations in relation to each other, the ultimate goal is to determine a threshold which divides MT output into good translations (with a cross entropy score below the threshold) and bad translations (with a cross entropy score above the threshold). This allows us to treat the evaluation of the well-formedness of MT output sentences as a classification task: given a sentence, how accurately can we predict whether it is a good- or bad-quality translation? Threshold determination was done empirically, simply by setting a series of thresholds and computing the percentage of good and bad translations divided correctly by each threshold (see Figure 4.4).

Figure 4.4: Threshold Determination (histogram of sentence counts against cross entropy for good- and bad-quality translations, with the threshold separating the two distributions)

The percentage of accuracy was therefore made up of all good translations with a lower cross entropy than the threshold and all bad translations with a higher cross entropy than the threshold. The optimal threshold was that which resulted in the highest accuracy rating. In our experiments, the thresholds always divided between good (correct and acceptable translations) and bad (poor and wrong) translations. However, the threshold value can be changed according to the needs of the user. Human translators, for example, could set the threshold to a lower value and split between correct translations and those which need hand correction. On the other hand, someone who simply wants to understand the overall gist of a document can adjust the threshold to a higher value to only filter out sentences which are completely wrong translations.
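Empirical threshold determination of this kind can be sketched as a simple search over candidate values, as below. The scores and labels in the example are invented; in the actual experiments the thresholds were evaluated over the held-out sets of ranked sentences.

def classification_accuracy(scores, labels, threshold):
    """A sentence counts as 'good' if its cross entropy lies below the
    threshold; accuracy is the fraction of sentences classified correctly."""
    correct = sum((s < threshold) == (lab == "good") for s, lab in zip(scores, labels))
    return correct / len(scores)

def find_optimal_threshold(scores, labels, step=0.1):
    """Sweep candidate thresholds between the lowest and highest score."""
    lo, hi = min(scores), max(scores)
    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return max(candidates, key=lambda t: classification_accuracy(scores, labels, t))

# Invented cross entropy scores (bits) and human rankings.
scores = [8.9, 9.2, 9.4, 10.3, 10.8, 11.5]
labels = ["good", "good", "good", "bad", "bad", "bad"]
best = find_optimal_threshold(scores, labels)
print(round(best, 1), classification_accuracy(scores, labels, best))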

Chapter 5

Results and Evaluation

This chapter presents the results of all the experiments carried out during the course of this project. The pilot experiment and its results are described and evaluated in the first section. The second section introduces the main experiments and assesses how the results are influenced by a series of language modeling parameters, including different order N-gram models, smoothing techniques, amounts of training data and numbers of vocabulary entries.

5.1 Pilot Experiment

The main aim of the project was to analyse to what extent raw cross entropy can be used as an evaluation metric with regard to the quality of MT output. In order to establish whether this approach has any potential, a pilot experiment was conducted to determine the relationship between the entropy scores of human reference translations and those of the corresponding machine translations. The backed-off trigram model was built from a training corpus of 1.5 million German newspaper sentences (28.3 million words) and linear smoothing was applied. A small, parallel German-English corpus of 230 sentences per language (obtained from the EUROPA website of the European


Union, http://europa.eu.int) was used for testing. We translated the 230 English sentences into German using SYSTRAN. The SLM was therefore tested on 230 human-written German sentences and 230 German SYSTRAN-translated sentences. On average, it was expected that the human-written texts would be assigned lower cross entropy scores than the SYSTRAN output, since the former should be captured better by the SLM. The results of the pilot experiment therefore also served as an appropriate method for evaluating the SLM itself: human-written text must receive an overall higher log likelihood than MT output, otherwise the SLM needs to be revised.

The sets of 230 sentences were tested both as a chunk and sentence by sentence. Table 5.1 demonstrates that, as a chunk, the human-written text has an overall cross entropy of 9.05 bits, 1.05 bits lower than that of the corresponding MT output. This means that it takes on average one bit less to encode the human-written test data than to encode the SYSTRAN translations.

Test Set (Chunk)               Cross Entropy    Perplexity
German Human-Written Text      9.05 bits        530.67
German Machine Translations    10.10 bits       1100.70

Table 5.1: SYSTRAN Output versus Human-Written Text

As our aim is to assess the quality of individual machine translations, it is of even more interest to establish how the distribution of cross entropy scores for human-written text differs from that of machine-translated text when testing the SLM on individual sentences. The histogram, which displays cross entropy scores against sentence counts (see Figure 5.1), shows a clear left shift from the distribution of machine-translated sentences to that of human-written sentences, i.e. towards lower cross entropy values. Since the cross entropy and the log likelihood of a sentence are closely correlated (when the log probability increases, the cross entropy decreases, see Equation 4.10


in Chapter 4.2.2), the SLM estimated on average greater log probabilities for human-written sentences than for machine translations. This result indicates, as had been anticipated, that the SLM captures well-formed German sentences better than sentences containing irregularities, i.e. MT errors.

Test Set (Individual Sentences)    Cross Entropy Mean (µ)    Variance (s²)
German Human-Written Text          9.04 bits                 2.10
German Machine Translations        9.97 bits                 1.88

Table 5.2: SYSTRAN Output versus Human-Written Sentences

The mean of the cross entropy scores of the MT sentences was 9.97 bits and that of the human-written sentences was 9.04 bits (see Table 5.2). As when testing both sets as chunks of text, the means of the two cross entropy distributions differed by approximately 1 bit. However, due to their variances (2.10 for human-written and 1.88 for machine-translated sentences), the distributions overlap, as can be seen clearly in Figure 5.1. The optimal threshold was set at 9.6 bits, with an accuracy of 64.78%. This means that, according to the optimal threshold, only 298 of the 460 sentences in the test data are correctly classified as either human-written or machine-translated. Since the test data contains an even split of human-written and machine-translated sentences, the baseline for comparison is chance, i.e. 50%. Even though the pilot experiment yields a result much better than the baseline, the percentage of incorrectly classified sentences is still relatively high, mainly resulting from the overlap of the two distributions. Taking into consideration, however, that the collection of machine-translated sentences contains a subset of good-quality translations whose well-formedness approaches that of natural human language, it is believed that the overlap will be reduced and the accuracy increased when comparing the cross entropy of natural and well-formed German sentences (human-written text or good-quality MT output) with that of a set containing only irregular sentences (bad-quality MT output).


Figure 5.1: Cross Entropy Distributions of SYSTRAN-Translated and Human-Written Sentences (histogram of sentence counts against cross entropy in bits)

Having obtained an accuracy of 64.78% for distinguishing between human-written and machine-translated sentences in the pilot experiment, we therefore expect this percentage to be considerably higher when testing the same model only on MT output and distinguishing between good- and bad-quality translations instead. The extent to which this hypothesis holds true, and the way in which the accuracy rate is affected by various language modeling parameters, were examined in the main experiments which are described in the remaining sections of this chapter.


5.2 Main Experiments

The following sections provide a detailed discussion of the main experiments conducted for this research. While Experiment 1 involved the training and testing of a conventional trigram model, Experiments 2 to 5 tested how the N-gram order, the smoothing technique, the amount of training data and the number of vocabulary entries, respectively, influenced the accuracy rates when distinguishing between good- and bad-quality machine translations.

5.2.1 Experiment 1: The Conventional Trigram Model

The first main experiment was conducted using a conventional backed-off trigram model with linear smoothing, the same model that we employed for the pilot experiment. The model was tested on a much larger test set of 1,000 German machine-translated sentences. The original sentences were extracted from English newspaper articles published in the Washington Post, which we translated into German using the online SYSTRAN engine (www.systransoft.com). Before testing, the author of this dissertation, a qualified translator (to degree level), categorised all 1,000 SYSTRAN-translated sentences according to their quality as either correct, acceptable, poor or wrong translations (see Chapter 4.5.2), which resulted in the data split presented in Figure 5.2. Out of 1,000 sentences overall, 147 translations were classified as good (correct and acceptable) and 853 as bad (poor and wrong). All 1,000 ranked, SYSTRAN-translated sentences were used as the test set in the first part of Experiment 1, described in the following section.

5.2.1.1 Uneven Quality Distribution of MT Output

When testing the SLM on the complete set of 1,000 sentences, the uneven distribution (see Figure 5.2) between the number of good and bad translations had a detrimental effect on the cross entropy distributions (see the first graph of Figure 5.3).

Figure 5.2: Human Ranking of Heldout Data (Correct: 32, Acceptable: 115, Poor: 696, Wrong: 157)

Although the means of the two individual distributions of good and bad translations are considerably different, at 9.26 bits and 10.51 bits respectively, the cross entropy distribution of good machine translations is almost completely overlapped by that of the bad machine translations, simply because there are more than five times as many bad sentences as good ones. A very similar result is presented in the second graph of Figure 5.3, which shows the different distributions for each individual category. This time, the number of poor translations is so high in comparison to the rest that their cross entropy distribution covers up the large majority of the other distributions, despite the fact that each individual curve peaks at a different point.

It is evident that the distribution of good- and bad-quality translated sentences depends on the performance of the MT system. The better the MT system works, the closer its output will be to natural human language and the smaller the percentage of bad-quality translated sentences will be. However, the real ratio between good and bad translations, or between correct and acceptable as well as poor and wrong translations, within the output of a real-world MT system is unknown to the end-user. It largely

Figure 5.3: Natural Distributions of Good- and Bad-Quality MT Sentences (two histograms of sentence counts against cross entropy in bits: good versus bad translations, and correct, acceptable, poor and wrong translations)

depends on the given MT paradigm and on the domain of the SL input text. For instance, an MT system that was built to translate technical documents may produce relatively high-quality output when translating a medical text but not necessarily do as well when translating a newspaper article. This means that, given an unseen translation, the chance of it being either good or bad MT output depends to a large extent on how capable the MT system is of translating well-formed sentences from the domain the sentence belongs to. In our case, only 14.7% of our test set was classified as good-quality MT output. The large majority of test sentences, 85.3%, were labeled as bad translations. With an optimal threshold set at 7.3 bits, we were able to correctly classify 86.5% of all 1,000 machine-translated sentences as either good or bad. However, the low optimal threshold indicates that the largest proportion of correctly classified translations is made up of the bad-quality translations. In fact, the SLM only produces cross entropy scores lower than the optimal threshold for 16 out of 147 good translations, since the cross entropy distribution of the good-quality translations is almost completely overlapped by that of the bad-quality translations. The threshold therefore fails to actually distinguish between good and bad.


5.2.1.2 Evening Up the Quality Distribution

To facilitate a better comparison of the cross entropy distributions, we conducted all further experiments using an even split of 147 good- and 147 bad-quality translations, with the set of bad-quality translations comprising the same ratio of poor to wrong translations as the set of 1,000 SYSTRAN-translated sentences (see Figure 5.2). The decision to even up the quality distributions was made to overcome the almost complete overlap of the cross entropy distribution of good translations by that of bad translations, as described in the previous section. By using an even split of good and bad, we essentially assume that, given a machine-translated sentence, there is a 50% chance of it being either a good- or a bad-quality translation. While this simplification does not hold true in real data, it is made here to investigate the extent to which conventional SLMs are capable of distinguishing between good- and bad-quality MT output in general. The results will then give us more insight into how the quality of unseen MT sentences can be automatically assessed where the chance of a sentence being either a good- or a bad-quality translation is unknown.

The cross entropy distributions produced by the backed-off trigram model when testing the even split of good- and bad-quality MT sentences can be seen in Figure 5.4. Similarly to the pilot experiment, there is a prominent left shift from the cross entropy distribution of the bad- to that of the good-quality translations. In fact, their means of 10.92 bits and 9.26 bits, respectively, are slightly further apart than those obtained in the pilot experiment, where human-written text was compared to a collection of MT output sentences containing both good- and bad-quality translations. In the present experiment, the accuracy score reflects a smaller overlap between the two distributions. We were able to decide with an accuracy of 73.81% whether a translation was good or bad, with an optimal threshold of 9.9 bits. The accuracy rate is the result of the SLM assigning overall lower cross entropy scores to good machine translations than to wrongly translated sentences. Since lower cross entropy scores are caused by the SLM better capturing the test data, we can conclude that the SLM of natural German


Figure 5.4: Even Distribution of Good and Bad MT Sentences (histogram of sentence counts against cross entropy in bits)

matches good-quality MT output better in word choice and word order than bad-quality MT output, as was hypothesised in Chapter 4.2.3. The fact that the accuracy rate considerably outperforms the baseline suggests that an SLM, which is representative enough of a language, is capable of distinguishing between well-formed sentences of that language and others which contain linguistic irregularities. Experiments 2 to 5, described in the following sections, tested how various language modeling parameters, such as higher order N-grams, different smoothing techniques, various amounts of training data and vocabulary entries, influence the performance of the SLM and accuracy scores accordingly.


5.2.2 Experiment 2: Higher Order N-gram Models

Given the large training corpus, the most obvious extension to the conventional 3-gram model was to test the effect that higher order N-gram models, such as 4-gram, 5-gram and 6-gram models, would have on the quality assessment of the MT output. In order to guarantee a realistic comparison of the results produced by the different models, all other parameters were kept the same as in the previous experiment. Thus, the training corpus of 1.5 million sentences and the vocabulary containing 65,143 distinct word types (as defined in Chapter 4.5.1) were used for building the SLMs. Moreover, linear smoothing was applied to each N-gram model.

The 4-, 5- and 6-gram models all resulted in better accuracy rates than the 3-gram model, though the difference was relatively small, as can be seen in Table 5.3. There was no further improvement moving beyond the 5-gram model, which suggests that a point was reached where performance leveled off. We can therefore conclude that, of the 5- and the 6-gram models, the 5-gram model is the better choice since it requires less processing time and memory.

N-gram Order    Optimal Threshold (bits)    Accuracy (%)
3-gram          9.9                         73.81
4-gram          10.0                        74.49
5-gram          10.0                        74.83
6-gram          10.4                        74.83

Table 5.3: Optimal Thresholds and Accuracy Scores for Higher Order N-grams

Although it is fairly unlikely that the training data contained many occurrences of the five- or six-word sequences that appear in the test data, the findings indicate that the higher order N-gram models capture longer contexts in good-quality translations better than in bad-quality translations.


5.2.2.1 An Example

The following example illustrates how the use of higher order N-gram models makes it possible to capture long-distance dependencies in German. The SYSTRAN-translated test sentence:

<s> ICH WERDE NICHT DIE ANWEISUNG ANALYSIEREN </s>

(where WERDE is the auxiliary and ANALYSIEREN the dependent infinitive) was categorised by the translator as a good translation of the English sentence:

I won't analyse the instruction.

It was assigned a cross entropy of 8.43 bits by the 5-gram model. As we determined an optimal threshold of 10 bits for the results of the 5-gram model (see Table 5.3), this sentence belongs to the proportion of good-quality sentences correctly split off by the threshold. The sentence contains a correct future tense construction made up of the first person singular form of the auxiliary WERDEN and the dependent infinitive ANALYSIEREN. Since we are dealing with a main clause, the infinitive appears at the end of the sentence and is therefore separated from the auxiliary by the negation word NICHT and the direct object DIE ANWEISUNG.

N-gram Log Likelihood                                           Back-Off Class
logP(ICH | <s>) = -2.450164                                     2
logP(WERDE | <s> ICH) = -1.450285                               3
logP(NICHT | <s> ICH WERDE) = -1.387740                         4
logP(DIE | <s> ICH WERDE NICHT) = -1.342164                     5
logP(ANWEISUNG | ICH WERDE NICHT DIE) = -3.815166               5-4-3
logP(ANALYSIEREN | WERDE NICHT DIE ANWEISUNG) = -6.870123       5x4-3-2-1
logP(</s> | NICHT DIE ANWEISUNG ANALYSIEREN) = -0.448628        5x4x3x2

Table 5.4: N-gram Log Likelihoods and Back-Off Classes from the 5-gram Model


Table 5.4 presents the individual N-gram log probabilities and back-off classes³ produced for the above-cited SYSTRAN translation when tested on the 5-gram model. As can be seen, the 5-gram WERDE NICHT DIE ANWEISUNG ANALYSIEREN spans the whole predicate of the test sentence, including the auxiliary and the infinitive. Since this 5-gram was not found in the training corpus, as indicated by its back-off class, the SLM had to back off all the way to the unigram ANALYSIEREN. However, the context WERDE NICHT DIE was found in the collection of training sentences. In fact, all sentences in which this context occurs (see Appendix A) contained the same future tense construction, but with different infinitives at the end of the main clause. The log probabilities of the N-grams therefore captured the dependency between the auxiliary and the infinitive to some extent, even though the exact 5-gram is not contained in the training data. Increased log probabilities of higher order N-grams can be put down to the fact that word sequences similar to those of the N-grams in the test data occur less sparsely in the training data (Goodman 2000). However, this can only be guaranteed if the training corpus is reasonably large.

³ The back-off class 5 means that the 5-gram A B C D E is contained in the model and the probability was predicted accordingly. 5-4 and 5x4 mean that the 5-gram was not found and that the model had to back off to the 4-gram B C D E. While 5-4 indicates that the context A B C D was found and a back-off weight was applied, 5x4 means that the context A B C D was not found.

5.2.2.2 Drawback of Higher Order N-gram Models

The main drawback of higher order N-grams is that the use of more complex models becomes fairly computationally expensive, particularly when the size of the vocabulary is large as well. 6-gram models require considerably more memory than conventional 3-gram models, which is also mirrored in the processing time the toolkit requires when reading in the model for testing. This computational cost is a very important factor which needs to be taken into consideration, in particular when developing a commercial MT quality assessment product able to function in real time.


5.2.3 Experiment 3: Different Smoothing Techniques

As stated in the literature on statistical language modeling technology, the use of higher order N-grams combined with the proper smoothing techniques can lead to improved performance (Goodman 2000). We therefore conducted an experiment to test which model combined with which smoothing technique was best at distinguishing good- from bad-quality machine translations within the collection of test sentences. Besides linear smoothing, Good-Turing and Witten-Bell smoothing (see Chapter 4.1.2.1) were also employed to build 3-gram, 4-gram, 5-gram and 6-gram models, resulting in 12 different combinations of N-gram orders and smoothing techniques. All 12 models were built with the same training corpus (1.5 million sentences) and vocabulary (65,143 entries) as used in all previous experiments in order to allow a valid comparison of the accuracy scores.

Figure 5.5: Accuracy Scores for Different Smoothing Techniques (accuracy in % for 3-gram to 6-gram models with linear, Good-Turing and Witten-Bell smoothing)


Figure 5.5 clearly shows the slight increase in performance moving from the conventional 3-gram to the 6-gram model when employing either Good-Turing or Witten-Bell smoothing, the same trend as was found when comparing the performance of higher order N-gram models combined with linear smoothing (see the previous experiment). Even more noteworthy is the difference in performance when applying the three smoothing techniques to a model of the same order. In all cases, Good-Turing and Witten-Bell smoothing led to higher accuracy scores than linear smoothing. The best accuracy of 77.22% was measured when testing the 6-gram model built with Good-Turing smoothing (see Appendix B). This finding was not surprising, since many language modeling publications state that Good-Turing smoothing yields exceedingly good results in terms of cross entropy or perplexity when combined with backing-off (e.g. Jelinek 1997, Chen & Goodman 1998, Manning & Schütze 2001, Hain 2001).

The results of this experiment not only illustrate the importance of smoothing in general, but also demonstrate that the smoothing technique which results in the best performance is highly dependent on both the N-gram order of the model and the available training data. Determining the best smoothing technique is therefore an empirical task and can only be accomplished through experiments.

5.2.4 Experiment 4: Different Amounts of Training Data

Out of the 48.1 million words of German newspaper text extracted from the European Language Newspaper Text (see Chapter 4.3.3.1), the previously described experiments had only made use of 28.3 million words (1.5 million sentences). We increased the total training data to 48.1 million words by pre-processing the remainder of the data. In order to measure the effect of the amount of training data on the accuracy of automatic quality assessment, samples containing 250,000, 500,000, 1,000,000, 1,500,000, 2,000,000 and 2,500,000 random sentences were assembled for training. We therefore built 3-gram, 4-gram, 5-gram and 6-gram models using the six different training data samples, which resulted in a total of 24 different SLMs. The vocabulary containing


65,143 word types was used to train each model, as was Good-Turing smoothing, the technique which yielded the best accuracy scores in Experiment 3.

Figure 5.6: Accuracy Scores for Different Amounts of Training Data (accuracy in % for 3-gram to 6-gram models trained on 250,000 to 2,500,000 sentences)

Figure 5.6 shows that the SLMs benefited from additional training data. There was an overall trend towards higher accuracy scores when moving from a smaller to a larger number of training sentences (see also Appendix C). It is interesting to note that among the SLMs built from the smaller samples comprising 250,000 and 500,000 training sentences, the 3-gram model always performed best. It was pointed out earlier that the more training data is available, the better the estimates for higher order N-grams (see Chapter 5.2.2). This is in fact reflected in the accuracy scores obtained from the models which were trained on a greater number of sentences. At 1 million sentences, the performances of the 3-gram, 5-gram and 6-gram models were the same, each scoring an accuracy of 76.53%. For the larger training sets of 1.5 and 2 million sentences, however, the 5- and 6-gram models outperformed the 3-gram models. Moreover, Figure 5.6 reveals that there is no further improvement in the accuracy rates produced by any of the N-gram models trained on the largest training corpus comprising 2.5 million German newspaper sentences (48.1 million words). Out of all 24 SLMs, the best accuracy score of 78.23% was achieved when testing the 5- and 6-gram models, both trained on 2 million sentences. The second best accuracy score of 77.89% was produced by the 3-gram models trained on 2 and 2.5 million sentences. Since the difference between the best and the second best accuracy is relatively small, the use of the 3-gram models was preferred, especially when taking the required processing time into account. While both 3-gram models were able to assess the test sentences almost in real time, it took several days to test the 5- and 6-gram models on the same test data. Until computational processing power and memory storage increase further, the latter models remain too computationally expensive to be embedded into a commercial MT quality assessment system available to end-users.

5.2.5 Experiment 5: Different Vocabulary Sizes

The previous experiment tested how the amount of training data influences accuracy rates while keeping the vocabulary size the same. The next step was to investigate how smaller vocabularies affect accuracy rates while keeping the number of training sentences the same. Rather than using the 5- or 6-gram models which, combined with a large training corpus comprising 2 million sentences, yielded the best results in the previous experiment, we decided to resort to conventional 3-gram models in this experiment, the main advantage being faster processing. We trained 3-gram models on the total number of 2.5 million sentences and employed Good-Turing smoothing, the combination which, together with a vocabulary of 65,143 entries, yielded the second best accuracy rate in Experiment 4. Since the CMU-Cambridge SLM Toolkit limits the vocabulary to 65,535 entries, we decided to test models built with consistently smaller vocabularies than that used in the previous experiment (see Table 5.5).

Word Type Frequency    Number of Vocabulary Entries
≥ 20                   65,143
≥ 50                   37,429
≥ 100                  23,615
≥ 500                  7,286
≥ 1,000                4,140
≥ 5,000                833
≥ 10,000               410

Table 5.5: Relationship between Word Type Frequency and Vocabulary Size

The largest vocabulary in this experiment therefore contained 65,143 different entries, i.e. all word types which occurred at least 20 times in the training data. We then trained models using consistently smaller vocabularies which contained all word types that occurred at least 50, 100, 500, 1,000, 5,000 and 10,000 times in the training data. Table 5.5 shows that the smallest vocabulary, which contained all word types that appeared at least 10,000 times in the training data, comprised only 410 entries. As can be seen in Figure 5.7, the smaller the vocabulary, the lower the optimal threshold and the lower the accuracy. Since smaller vocabularies typically exclude rare words, the models paradoxically produced lower cross entropy rates (Rosenfeld 1997). The 3-gram model built with a vocabulary of 65,143 entries resulted in the highest accuracy of 77.89% due to the smallest overlap of good- and bad-quality translations (see Figure 5.8). The 3-gram model built with a vocabulary of merely 410 entries, on the other hand, only produced a very low accuracy of 51.70%, just 1.70% above chance. Appendix D shows the remaining cross entropy distributions of the models built with smaller vocabularies. These graphs show that the smaller the vocabulary,

[Figure 5.7: Accuracy Scores for Different Vocabulary Sizes. The figure plots the accuracy (%) and the optimal threshold (bits) of the 3-gram models against vocabulary size (65,143 down to 410 word types).]

the greater the overlap of the distributions of good- and bad-quality translations. It therefore becomes increasingly difficult to decide on the quality of a translation, which is reflected in the decreasing accuracy rates (see Appendix E). These findings demonstrate that in order to distinguish between good- and bad-quality translations, the model must be supplied with a very large vocabulary. In fact, it would be interesting to investigate how even larger vocabularies affect accuracy rates and at what point the performance levels off. The second best accuracy rate of 76.53% was obtained using a vocabulary containing 37,429 entries. Since the difference between the best and the second best accuracy score is only 1.36%, it is expected that larger vocabularies would not markedly increase the performance. Another point which needs to be considered is that the size of the vocabulary also has a direct influence on

[Figure 5.8: Cross Entropy Distributions of the Best Model. Vocabulary = 65,143 word types; the figure plots sentence counts of good- and bad-quality translations against cross entropy (bits).]

the processing time required to test an SLM. It can therefore be concluded that careful fine-tuning of the vocabulary size is necessary to produce a model that not only yields high accuracy scores but is also inexpensive to run.
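The frequency cut-offs of Table 5.5 amount to a simple filtering step over the training data: every word type occurring at least k times is kept in the vocabulary, and all remaining tokens are treated as unknown words. The following Python sketch is a conceptual stand-in for this vocabulary construction step; the <UNK> symbol and the function names are illustrative rather than part of the actual pipeline.

```python
from collections import Counter

def build_vocabulary(sentences, min_count=20):
    """Keep every word type that occurs at least min_count times in the
    training data (min_count=20 corresponds to the 65,143-entry vocabulary)."""
    counts = Counter(token for sentence in sentences for token in sentence.split())
    return {word for word, c in counts.items() if c >= min_count}

def replace_oov(sentence, vocabulary, unk="<UNK>"):
    """Map every out-of-vocabulary token to the unknown-word symbol."""
    return " ".join(token if token in vocabulary else unk
                    for token in sentence.split())
```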

Chapter 6

Discussion and Future Work

The initial goal of this dissertation was to demonstrate to what extent SLMs built from a large TL corpus are capable of distinguishing between good- and bad-quality machine translations without receiving any additional linguistic knowledge about the MT output. We used cross entropy as the evaluation metric since it allowed us to compare sentences of different lengths. By means of human ranking of the test translations, we were able to determine an optimal threshold, i.e. the cross entropy value which most accurately splits the total number of translations into good- and bad-quality output. Chapter 5 presented a number of experiments which demonstrated that an SLM representative of a language is able to capture natural sentences of that language better than sentences which contain irregularities and therefore, on average, assigns lower cross entropy rates to good-quality than to bad-quality MT output. The main difficulty in distinguishing the one from the other is caused by the overlap between the two cross entropy distributions. We therefore aimed to build an SLM that maximises the distance between the two distributions and consequently minimises the overlap. When testing a conventional 3-gram model built with a 28.3 million word training corpus extracted from German newspaper articles, a vocabulary of 65,143 distinct word types and linear smoothing, the optimal threshold correctly classified 73.81% of the test translations as either good or bad. Further experiments showed that this performance is
dependent on a number of language modeling parameters, including the N-gram order, the smoothing technique, the amount of training data and the size of the vocabulary. It was found that performance increases when using higher order N-grams combined with large training data sets. On the other hand, higher order N-gram models require considerably more computational processing time and memory than conventional 3-gram models do. Due to the large training corpus, it took several days to test the 5-gram and 6-gram models, which makes them too computationally expensive. As far as the smoothing technique is concerned, Good-Turing smoothing produced consistently better accuracy rates than linear smoothing and outperformed Witten-Bell smoothing with higher order N-gram models. Moreover, we discovered that the larger the vocabulary, the better the accuracy scores, but the longer it takes to test a model. The careful fine-tuning of all the parameters led to an overall best accuracy score of 78.23%. This result was produced by both the 5- and the 6-gram model built with 2 million training sentences, a vocabulary of 65,143 entries and Good-Turing smoothing. Even the 3-gram model built using the same training data set, vocabulary and smoothing technique yielded a comparatively good accuracy rate of 77.89%. Since the latter model takes up only a fraction of the processing time and memory required by the two best performing models, we regard it as the better option. Based on these findings, it can be concluded that there is real potential in using SLMs to automatically assess the quality of MT output. The main advantage of this approach is that, given an SLM which is representative of a particular TL, the quality assessment is completely independent of the SL. Unlike many existing MTE methods, it can therefore be used for multiple language pairs. Given the right parameters, SLMs also present a very quick and inexpensive approach to determining the quality of machine translations, which would be desirable not only to MT system designers and researchers but also to end-users, such as translators. Since the training corpus we used to build the SLMs was extremely large, we assumed the best performing models to be representative of natural German as a whole. We therefore believe that such SLMs can
be equally employed to assess the quality of texts from different domains. Although the overall cross entropy distribution will in such cases vary compared to that of newspaper text, we still expect the model to assign overall lower cross entropy rates to good translations and higher cross entropy rates to machine translations containing many errors. Apart from testing MT output from different domains, it would also be interesting to examine the performance of SLMs built from a training corpus which distinguishes between upper and lower case and which contains sentence punctuation, altogether valuable information which was removed during the pre-processing stage of this research. Despite the relatively high accuracy rates produced in our experiments, it also becomes apparent that there are certain limitations to SLMs. When using an even split of good- and bad-quality translations, the best performing model still resulted in an error rate of 21.11%. It would therefore be worthwhile testing whether the use of additional linguistic features, such as POS tags or parse tree information, may be advantageous in determining the quality of MT output and thereby reduce the error margin in deciding whether a sentence is well or badly translated. This may also turn out to be beneficial when using a test set with a real quality distribution, where the chance of a translation being either good-quality or bad-quality MT output is unknown. In fact, this is a problem that needs to be solved in order to build a commercial MT quality assessment system that is able to assess the quality of any given MT output sentence. From the results of our experiments it becomes clear that a more sophisticated normalisation is necessary to move the uneven cross entropy distributions apart and minimise their overlap. It is evident that a considerable amount of research is still necessary to develop a commercial system which can actually be employed by end-users of MT engines to assess the quality of MT output. The results of this project provide an important insight into the capacity of SLMs to distinguish between good- and bad-quality translations, which represents an essential first step towards this end.
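To make the intended use concrete, a quality assessment component of the kind envisaged here would score each MT output sentence by its per-word cross entropy under the SLM and compare that score against the threshold learned on development data. The following Python sketch outlines such a component; the function trigram_prob is a hypothetical stand-in for the trained back-off model, and the 10.0 bit threshold is merely illustrative (it lies in the range of the optimal thresholds reported in Chapter 5).

```python
import math

def cross_entropy(sentence_tokens, trigram_prob):
    """Per-word cross entropy (bits) of a sentence under a 3-gram model.
    trigram_prob is a hypothetical callable returning P(w_i | w_{i-2}, w_{i-1});
    a back-off model guarantees a non-zero probability for every event."""
    padded = ["<s>", "<s>"] + sentence_tokens + ["</s>"]
    log_prob = 0.0
    for i in range(2, len(padded)):
        log_prob += math.log2(trigram_prob(padded[i - 2], padded[i - 1], padded[i]))
    # Normalise by the number of predicted tokens (including the end marker),
    # a simplification of the length normalisation described earlier.
    return -log_prob / (len(padded) - 2)

def assess_translation(sentence_tokens, trigram_prob, threshold=10.0):
    """Label an MT output sentence using a previously determined threshold."""
    return "good" if cross_entropy(sentence_tokens, trigram_prob) <= threshold else "bad"
```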

Appendix A

Future Perfect Sentences

1. <s> ICH WERDE NICHT DIE CHANCE MEINES LEBENS VERPASSEN SAGTE SIE IN KAIRO </s>

2. <s> DURCH DAS KONZEPT WERDE NICHT DIE EINHEIT EUROPAS VORANGEBRACHT SONDERN VIEL MEHR DIE BINDUNG DEUTSCHLANDS AN EUROPA GELOCKERT DAMIT DER WEG IN RICHTUNG EINES DEUTSCHEN HEGEMONIALSTREBENS GEGANGEN WERDEN KÖNNE </s>

3. <s> DER HESSISCHE UMWELTMINISTER JOSCHKA FISCHER SAGTE DURCH DAS KONZEPT WERDE NICHT DIE EINHEIT EUROPAS VORANGEBRACHT SONDERN DIE BINDUNG DEUTSCHLANDS AN EUROPA GELOCKERT </s>

4. <s> DURCH DIE ABSENKUNG WERDE NICHT DIE VERMITTELBARKEIT VERBESSERT SONDERN WÜRDEN SOZIALE NÖTE UND DIE SOZIALHILFEBEDÜRFTIGKEIT ERHÖHT </s>

5. <s> AUCH AUF DEM PARTEITAG DER PDS ENDE JANUAR IN MAGDEBURG WERDE NICHT DIE KOALITIONSFRAGE IM VORDERGRUND STEHEN SONDERN DIE LAGE DER KOMMUNEN IN DEN NEUEN BUNDESLÄNDERN </s>

6. <s> ES WERDE NICHT DIE LETZTE KATASTROPHE SEIN </s>

7. <s> ENGLAND WERDE NICHT DIE LETZTE KATASTROPHE SEIN </s>

8. <s> DIE BRITISCHE REGIERUNG WERDE NICHT DIE SYSTEMATISCHE SCHLACHTUNG SÄMTLICHER RINDERBESTÄNDE ZULASSEN IN DENEN FÄLLE VON RINDERWAHN AUFGETAUCHT SIND </s>

9. <s> PARTEISPRECHER VITHAL GADGIL ERKLÄRTE DIE BJP WERDE NICHT DIE NÖTIGE STIMMENMEHRHEIT IM PARLAMENT AUF SICH VEREINEN KÖNNEN </s>

10. <s> FÜR DIE KONGRESSPARTEI ERKLÄRTE DEREN SPRECHER VITHAL GADGIL DIE BJP WERDE NICHT DIE NÖTIGE STIMMENMEHRHEIT IM PARLAMENT AUF SICH VEREINEN KÖNNEN </s>

11. <s> IN SARAJEVO BETONTEN AM SAMSTAG NATO-GENERALSEKRETÄR JAVIER SOLANA UND DER NATO-OBERBEFEHLSHABER IN EUROPA GENERAL GEORGE JOULWAN DIE NATO WERDE NICHT DIE POLITISCHE VERANTWORTUNG FÜR DIE ORGANISATION DER BOSNIEN-WAHL ÜBERNEHMEN </s>

Appendix B

Results for Different Smoothing Techniques

N-gram Order   Smoothing     Optimal Threshold (bits)   Accuracy (%)
3-gram         Linear        9.9                        73.81
3-gram         Good-Turing   10.2                       76.19
3-gram         Witten-Bell   9.9                        76.19
4-gram         Linear        10.0                       74.49
4-gram         Good-Turing   9.9                        76.19
4-gram         Witten-Bell   9.9                        76.19
5-gram         Linear        10.0                       74.83
5-gram         Good-Turing   10.0                       76.87
5-gram         Witten-Bell   9.9                        76.53
6-gram         Linear        10.4                       74.83
6-gram         Good-Turing   10.0                       77.22
6-gram         Witten-Bell   9.9                        76.53

Appendix C

Results for Different Training Data Sets

N-gram Order   Training Sentences   Optimal Threshold (bits)   Accuracy (%)
3-gram         250,000              10.3                       74.15
3-gram         500,000              10.2                       74.15
3-gram         1,000,000            10.1                       76.53
3-gram         1,500,000            9.8                        76.19
3-gram         2,000,000            10.1                       77.89
3-gram         2,500,000            10.0                       77.89
4-gram         250,000              9.9                        73.81
4-gram         500,000              10.0                       73.13
4-gram         1,000,000            10.1                       76.19
4-gram         1,500,000            9.9                        76.19
4-gram         2,000,000            9.9                        77.55
4-gram         2,500,000            9.7                        77.55
5-gram         250,000              9.9                        73.81
5-gram         500,000              10.0                       73.13
5-gram         1,000,000            10.2                       76.53
5-gram         1,500,000            10.0                       76.87
5-gram         2,000,000            9.9                        78.23
5-gram         2,500,000            9.8                        77.89
6-gram         250,000              9.9                        73.81
6-gram         500,000              10.0                       73.13
6-gram         1,000,000            10.2                       76.53
6-gram         1,500,000            10.0                       77.21
6-gram         2,000,000            9.9                        78.23
6-gram         2,500,000            9.8                        77.89

Appendix D

Cross Entropy Scores for Different Vocabularies

[Figures: cross entropy distributions (sentence counts of good- and bad-quality translations against cross entropy in bits) for the 3-gram models built with vocabularies of 37,429, 23,615, 7,286, 4,140, 833 and 410 word types.]

Appendix E

Results for Different Vocabularies

Word Type Frequency   Vocabulary Size   Optimal Threshold (bits)   Accuracy (%)
≥ 20                  65,143            10.0                       77.89
≥ 50                  37,429            9.6                        76.53
≥ 100                 23,615            9.2                        70.75
≥ 500                 7,286             7.2                        65.65
≥ 1,000               4,140             7.3                        63.27
≥ 5,000               833               4.0                        55.78
≥ 10,000              410               3.8                        51.70

Bibliography

Algoet, P. H. & Cover, T. M. (1988), 'A Sandwich Proof of the Shannon-McMillan-Breiman Theorem', The Annals of Probability 16(2), 899–909.

Arnold, D., Balkan, L., Humphreys, R. L., Meijer, S. & Sadler, L. (1994), Machine Translation: An Introductory Guide, NCC Blackwell Ltd., Oxford.

Balkan, L., Netter, K., Arnold, D. & Meijer, S. (1994), TSNLP: Test Suites for Natural Language Processing, in 'Proceedings of the Language Engineering Convention', La Défense, Paris, France, pp. 17–22.

Bennett, S. (2000), Lecture Slides on Translation Technology, Montclair State University.

Brown, P. F., Cocke, J., Pietra, S. A. D., Jelinek, F., Lafferty, J. D., Mercer, R. L. & Roossin, P. S. (1990), 'A Statistical Approach to Machine Translation', Computational Linguistics 16, 79–85.

Callison-Burch, C. (2001), Upping the Ante for "Best of Breed" Machine Translation Providers, Technical report, Amikai, Inc., San Francisco.

Callison-Burch, C. & Flournoy, R. (2001), A Program for Automatically Selecting the Best Output from Multiple Machine Translation Engines, in 'Proceedings of the Machine Translation Summit VIII', Santiago de Compostela, Spain.

Chen, S. F. & Goodman, J. (1998), An Empirical Study of Smoothing Techniques for Language Modeling, Technical report, Center for Research in Computing Technology, Harvard University.

Clarkson, P. & Rosenfeld, R. (1997), Statistical Language Modeling Using the CMU-Cambridge Toolkit, in 'Proceedings of Eurospeech '97', Rhodes, Greece, pp. 2707–2710.

Corston-Oliver, S., Gamon, M. & Brockett, C. (2001), A Machine Learning Approach to the Automatic Evaluation of Machine Translation, in 'Proceedings of the Association for Computational Linguistics', Toulouse, France, pp. 140–147.

Der Duden (1996), Band 1, Rechtschreibung, 21. Auflage.

Doyon, J., Taylor, K. B. & White, J. S. (1998), The DARPA MT Evaluation Methodology: Past and Present, in 'Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA)', Langhorne, Pennsylvania, pp. 113–124.

EAMT (n.d.), What is Machine Translation?, European Association for Machine Translation (EAMT). Available online at: http://www.eamt.org/mt.html.

Flanagan, M. (1994), Error Classification for MT Evaluation, in 'Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA)', Columbia, MD.

Flournoy, R. S. & Callison-Burch, C. (2001), Secondary Benefits of Feedback and User Interaction in Machine Translation, Technical report, Amikai, Inc., San Francisco.

Francis, W. N. & Kucera, H. (1982), Frequency Analysis of English Usage: Lexicon and Grammar, Houghton Mifflin, Boston.

Gale, W. A. & Sampson, G. (1995), 'Good-Turing Frequency Estimation without Tears', Journal of Quantitative Linguistics 2, 217–237.

Good, I. J. (1953), 'The Population Frequencies of Species and the Estimation of Population Parameters', Biometrika 40, 237–264.

Goodman, J. (2000), Putting It All Together: Language Model Combination, in 'Proceedings of the IEEE ICASSP-2000', Istanbul, Turkey.

Hain, T. (2001), Hidden Model Sequence Models for Automatic Speech Recognition, PhD thesis, Darwin College, University of Cambridge.

Hogan, C. & Frederking, R. (1998), An Evaluation of the Multi-Engine Architecture, in 'Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA)', Langhorne, Pennsylvania, pp. 113–124.

Hovy, E., King, M. & Popescu-Belis, A. (2002), Computer-Aided Specification of Quality Models for Machine Translation Evaluation, in 'Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC)', Las Palmas, Canary Islands, Spain.

Hutchins, W. J. (1997), Evaluation of Machine Translation and Translation Tools, in G. Varile & A. Zampolli, eds, 'Survey of the State of the Art in Human Language Technology', Cambridge University Press, chapter 13.3, pp. 418–419. Available online at: http://ourworld.compuserve.com/homepages/WJHutchins/EvalHLT.htm.

Hutchins, W. J. (1999), The Development and Use of Machine Translation Systems and Computer-Based Translation Tools, in 'Proceedings of the International Symposium on Machine Translation and Computer Language Information Processing', Beijing, China. Available online at: http://www.foreignword.com/Technology/art/Hutchins/hutchins99.htm.

Hutchins, W. J. & Somers, H. L. (1992), An Introduction to Machine Translation, Academic Press, London.

Jelinek, F. (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, Massachusetts.

Katz, S. (1987), 'Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer', IEEE Transactions on Acoustics, Speech and Signal Processing 35(3), 400–401.

King, S. (2002), Lecture Notes from 'Speech Processing 2', Edinburgh University. Available online at: http://www.ling.ed.ac.uk/local/teaching/postgrad/modules/sp2/slides/.

Kučera, H. & Francis, W. N. (1967), Computational Analysis of Present Day American English, Brown University Press, Providence.

Lewis, D. (1997), 'MT Evaluation: Science or Art?', Machine Translation Review 6, 25–36.

Manning, C. D. & Schütze, H. (2001), Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, Massachusetts.

Markov, A. A. (1913), An Example of Statistical Investigation in the Text of 'Eugene Onyegin' Illustrating Coupling of 'Tests' in Chains, in 'Proceedings of the Academy of Sciences, St. Petersburg', Vol. 7, pp. 153–162.

Mima, H., Iida, H. & Furuse, O. (1998), Simultaneous Interpretation Utilizing Example-Based Incremental Transfer, in 'Proceedings of the 17th International Conference on COLING and the 36th Conference of the ACL', Montreal, Canada, pp. 855–861.

Nagao, M. (1989), Machine Translation: How Far Can It Go?, Oxford University Press, Oxford.

Ney, H., Essen, U. & Kneser, R. (1994), 'On Structuring Probabilistic Dependencies in Stochastic Language Modeling', Computer Speech and Language 8(1), 1–28.

Osborne, M. (2002), Lecture Notes from 'Data Intensive Linguistics', Edinburgh University. Available online at: http://www.cogsci.ed.ac.uk/school/study/msc/course/dil/.

Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. (2001), Bleu: A Method for Automatic Evaluation of Machine Translation, Technical report, IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, NY.

Petrits, A. (2001), EC Systran: The Commission's Machine Translation System, Technical report, European Commission Translation Service. Internal document.

Pierce, J. R., Carroll, J. B., Hamp, E. P., Hays, D. G., Hockett, C. F., Oettinger, A. G. & Perlis, A. (1966), Languages and Machines: Computers in Translation and Linguistics, Technical report, Automatic Language Processing Advisory Committee (ALPAC), National Academy of Sciences, National Research Council, Washington, D.C.

Rosenfeld, R. (1995), The CMU Statistical Language Modeling Toolkit and its Use in the 1994 ARPA CSR Evaluation, in 'Proceedings of the ARPA Spoken Language Technology Workshop', Austin, TX.

Rosenfeld, R. (1997), Lecture Notes in 'Language and Statistics', Carnegie Mellon University. Available online at: http://www2.cs.cmu.edu/afs/cs.cmu.edu/academic/class/11761-s97/WWW/syllabus.html.

Shannon, C. E. (1948), 'A Mathematical Theory of Communication', Bell System Technical Journal 27, 379–423, 623–656.

Somers, H. L. (1999), 'Review Article: Example-Based Machine Translation', Machine Translation 14(2), 113–158.

Sparck-Jones, K. & Galliers, J. (1996), Evaluating Natural Language Processing Systems: An Analysis and Review, Springer-Verlag, Berlin.

Streiter, O., Carl, M. & Iomdin, L. (2000), A Virtual Translation Machine for Hybrid Machine Translation, in 'DIALOG 2000', Moscow, Russia.

Tür, G. (2000), A Statistical Information Extraction System for Turkish, PhD thesis, Bilkent University, Department of Computer Engineering, Ankara, Turkey.

Vanni, M. & Reeder, M. (2000), How Are You Doing? A Look at MT Evaluation, in 'Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA)', Cuernavaca, Mexico, pp. 109–116.

Weaver, W. (1955), Translation, in W. Locke & D. A. Booth, eds, 'Machine Translation', MIT Press, Cambridge, Massachusetts.

White, J. S. (1992-1994), ARPA Workshops on MT (a series of four workshops on comparative evaluation), Technical report, PRC Inc., McLean, Virginia.

Wilss, W. (1998), Decision Making in Translation, in M. Baker, ed., 'Encyclopedia of Translation Studies', Routledge, London.

Witten, I. H. & Bell, T. C. (1991), 'The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression', IEEE Transactions on Information Theory 37(4), 1085–1094.

Wong, S. H. S. (2001), Course Materials on 'Machine Translation Techniques', Masaryk University, Brno. Available online at: http://www.fi.muni.cz/usr/wong/teaching/mt/notes/index.html.

Yang, J. & Gerber, L. (1996), Chinese-English Machine Translation System, in 'International Conference on Chinese Computing '96 (ICCC'96)', Singapore.
