Crosslingual Implementation of Linguistic Taggers using Parallel Corpora Hani Safadi

School of Computer Science McGill University Montreal, Canada September 2008

A thesis submitted to McGill University in partial fulfillment of the requirements for the degree of Master of Science. © 2008 Hani Safadi


DEDICATION

[Dedication in Arabic.]


ACKNOWLEDGEMENTS

The work of this thesis was graciously supported by the NSERC special research opportunity grant number 307188-2004. I would like to thank my supervisors Richard Rose and Doina Precup for their support and guidance during my Master's studies at McGill. I would like to thank the following members of the Interactive Language Technologies group in the National Research Council of Canada: Pierre Isabelle, Roland Kuhn and Michel Simard for their help, comments and suggestions on the work of this thesis, and Eric Joanis and Samuel Larkin for their invaluable help and support in configuring and using the Portage translation system. I want to thank Alain Désilets at the Interactive Information Group in the National Research Council of Canada for his comments on the work of collecting parallel corpora from the Web. I would like to thank Elliott Macklovitch, Guy Lapalme, and Jian-Yun Nie at the RALI group (Recherche Appliquée en Linguistique Informatique) in the University of Montreal for their fruitful comments and suggestions on the work of this thesis and for giving me access to their part-of-speech-tagged English-French corpus that is used for evaluation in this thesis. Finally, I would like to thank my parents and my family for their gracious support during all of my studies.


ABSTRACT

This thesis addresses the problem of creating linguistic taggers for resource-poor languages using existing taggers in resource-rich languages. Linguistic taggers are classifiers that map individual words or phrases from a sentence to a set of tags. Part of speech tagging and named entity extraction are two examples of linguistic tagging. Linguistic taggers are usually trained using supervised learning algorithms. This requires the existence of labeled training data, which is not available for many languages. We describe an approach for assigning linguistic tags to sentences in a target (resource-poor) language by exploiting a linguistic tagger that has been configured in a source (resource-rich) language. The approach does not require that the input sentence be translated into the source language. Instead, projection of linguistic tags is accomplished through the use of a parallel corpus, which is a collection of texts that are available in a source language and a target language. The correspondence between words of the source and target language allows us to project tags from source to target language words. The projected tags are further processed to compute the final tags of the target language words. A system for part of speech (POS) tagging of French language sentences using an English language POS tagger and an English/French parallel corpus has been implemented and evaluated using this approach. A parallel corpus of the source and target languages might not be readily available for many language pairs. To deal with this problem, we describe a system for automatic acquisition of aligned, bilingual corpora from pre-specified domains on the World Wide Web. The system involves automatic indexing of a given domain using a web crawler, identifying pairs of pages that are translations of one another, and aligning bilingual texts at the sentence level. Using this approach we create a 40,000,000 word English-French parallel corpus from the Government of Canada domain. The quality of this corpus is evaluated and compared to other parallel corpora.


ABRÉGÉ

Le sujet de cette thèse est la création de marqueurs linguistiques pour les langues qui sont pauvres en ressources en utilisant les marqueurs des langues riches en ressources. Les marqueurs linguistiques sont des classificateurs qui conjuguent des mots ou des collections de mots d'une phrase à un ensemble d'étiquettes. La description de nature grammaticale et l'extraction des entités nommées sont deux exemples de marquage linguistique. L'apprentissage supervisé est l'outil principal utilisé pour créer des marqueurs linguistiques. Cela exige l'existence de données de formation marquées, qui ne sont pas disponibles dans plusieurs langues. Nous décrivons une approche pour étiqueter les phrases d'une langue cible qui est pauvre en ressources en utilisant un marqueur linguistique qui a été configuré dans une langue d'origine qui est riche en ressources. Cette approche n'exige pas que la phrase entrée soit traduite dans la langue d'origine. Au lieu de cela, les étiquettes linguistiques sont projetées grâce à l'utilisation d'un corpus parallèle (une collection de textes qui sont disponibles dans plus d'une langue) entre la langue cible et la langue d'origine. La correspondance entre les mots de la langue cible et la langue d'origine nous permet de projeter les étiquettes entre les phrases de ces deux langues. Les étiquettes projetées sont traitées pour calculer l'étiquetage final de la phrase de la langue cible. Pour tester cette approche, un descripteur de nature grammaticale de langue française a été mis en œuvre et évalué en utilisant un descripteur de nature grammaticale de langue anglaise et un corpus parallèle anglais/français. Un corpus parallèle entre la langue d'origine et la langue cible n'est pas toujours disponible pour plusieurs langues. Pour résoudre ce problème, nous décrivons un système d'acquisition automatique d'un corpus parallèle à partir du Web. Le corpus est extrait d'un domaine spécifique et automatiquement aligné. Le système implique une indexation automatique d'un domaine donné en utilisant un robot d'exploration du Web, l'identification des paires de pages qui sont des traductions l'une de l'autre, et l'alignement de textes bilingues au niveau de la phrase. En utilisant cette approche à partir du site web du Gouvernement du Canada, nous avons créé un corpus parallèle anglais/français dont la taille est de plus de 40,000,000 mots. La qualité de ce corpus est évaluée et comparée avec d'autres corpus parallèles.


Contents

1 Introduction
  1.1 Prologue
  1.2 Problem Description
  1.3 Linguistic Tagging
  1.4 Parallel Corpora
  1.5 Contribution
  1.6 Thesis Outline

2 Literature Review
  2.1 Part of Speech Tagging
  2.2 Cross-Lingual Tagging
  2.3 Alignment in Parallel Corpora
  2.4 Statistical Language Modeling
  2.5 Statistical Machine Translation
  2.6 The Statistical Machine Translation Model
    2.6.1 Word Alignment
    2.6.2 Translation Model Estimation
    2.6.3 Sentence Alignment
    2.6.4 Translation Process (Decoding)
    2.6.5 Evaluating Translation Performance
  2.7 Automatic Construction of Parallel Corpora
  2.8 Other Applications of Parallel Corpora
    2.8.1 Example-Based Machine Translation
    2.8.2 Resources for Translators
    2.8.3 Cross Lingual Information Retrieval

3 Implementation of Cross-Lingual Taggers
  3.1 Motivation
  3.2 Corpus Description
    3.2.1 Corpus Preparation
    3.2.2 Tagset
  3.3 Crosslingual System Configuration and Tagging
    3.3.1 Crosslingual System Configuration
    3.3.2 Assigning Tags to Target Language Sentences
  3.4 Experiments
  3.5 Results and Discussion

4 Automatic Acquisition of Parallel Corpora
  4.1 Multilingual Resources on the Web
  4.2 Target Task Domain
  4.3 System Description
    4.3.1 Identifying Bilingual Sites
    4.3.2 Crawling
    4.3.3 Identifying Paired Pages
    4.3.4 Text Extraction and Filtering
    4.3.5 Aligning Sentences
  4.4 Experimental Results
  4.5 Accessing the Corpus

5 Conclusions & Future Work

References

List of Figures

2.1 Example of a parallel corpus with chapter, paragraph and sentence alignments
2.2 Examples of word alignments between English and French sentences
2.3 The structure of a statistical machine translation system
2.4 Example of search graph in SMT decoding, double frames represent final nodes with their scores
2.5 Example of translation spotting
3.1 The structure of the tagging system
3.2 Tagging example: tagging the word "entre" in the sentence "il entre dans le salon"
3.3 Results of experiments. Black is performance of the tagger, white is the performance of the direct projection method
4.1 Example of a pair of pages from a bi-lingual site and their corresponding content
4.2 The structure of the corpus acquisition system
4.3 Distribution of pair length in the three corpora: Europarl, Hansards and GcCorp
4.4 Interfacing the collected corpus from a browser

List of Tables

1.1 Several examples of linguistic taggers
2.1 Core Part of Speech Tags
2.2 Number of tags per English word in the Brown corpus
3.1 Excerpt from the training corpus, notice that, since alignment is estimated from data, not all alignments are correct
3.2 Core Part of Speech Tags
3.3 English and French sub tags of AdjQ
3.4 Distribution of tags per word in UdeM corpus
3.5 Tag Projection Example
3.6 Similar French sentences to "le nombre de femmes qui ont un emploi a augmenté" from a subset of the Hansard corpus
3.7 Performance measures obtained across different tags and corpora
4.1 BLEU scores of the translation models on WPT task

Chapter 1 Introduction This chapter defines the basic concepts of linguistic tagging and parallel texts. It then describes several examples of linguistic taggers and briefly discusses parallel texts and their structure. Finally, it presents the thesis contribution and outline.

1.1 Prologue

Once upon a time, around 3100 BCE, a great civilization rose in Mesopotamia. The center of this civilization was the city of Babylon, a cosmopolitan city that united the human race, all speaking one language and migrating all over the world. As the city's power increased, its people became more ambitious. They crowned their achievements by constructing all sorts of landmarks: temples, statues, hanging gardens, gates, and palaces. Finally they came up with the idea of building a tall tower that reached the sky, so that they could go up and testify to the existence of God. This act of ingratitude and arrogance angered God and made him punish the Babylonians. He destroyed the tower and divided the Babylonians by making them speak different languages, so that they no longer understood each other.

1.2 Problem Description

Regardless of whether the previous story is true or not, people speak different languages, and translation is required so that literature produced in one language can be understood by people who speak another language. Texts that have translations across languages are called parallel texts, or bi-texts when considering only two languages. Parallel texts are an
important resource for cross-lingual applications. The term "parallel corpora" is often used to refer to parallel texts when they are used in a computational task. In this thesis, we are going to use the three terms interchangeably. With the advent of computers and the Internet, approaches have been invented to automate the processing of natural language texts. The term "natural language processing" (NLP) refers to the subfield of computer science that studies these approaches. Many tasks in NLP require discovering the syntactical and functional properties of individual words or phrases in text. These tasks are called linguistic tagging because they tag or label words or phrases in a given text. More formally, linguistic taggers are classifiers that map individual words or a collection of words in a sentence to a set of tags. Part of speech tagging (POS tagging) is one example of a linguistic tagger. In POS tagging every word in a given sentence is labeled according to its syntactical function in the sentence. Typical tags for POS tagging are verb, noun, adjective, adverb and others. Supervised training is mostly used to implement linguistic taggers. This requires the existence of labeled training data, which are not available for many languages. Resource-rich languages (such as English) have many available resources, whereas resource-poor languages (such as Inuktitut) lack many of these resources. Linguistic taggers are building blocks for many NLP applications. NLP tasks such as automatic summarization, information extraction, information retrieval, question answering and text-to-speech generation require the pre-processing of the input text with linguistic taggers. For example, a text-to-speech generator needs to disambiguate words according to their function in the text. The pronunciation of the word "read" depends on whether it is a verb, a noun or an adjective. This requires a part of speech tagger to tag the word "read" in the input sentence. In this thesis we describe an approach for implementing a linguistic tagger in a resource-poor target language by exploiting a linguistic tagger that has been configured in a resource-rich source language. The tagging is accomplished through the use of a bitext corpus for the source and target languages. A system for part of speech (POS) tagging of French language sentences using an English language POS tagger and an English/French bitext corpus has been implemented and evaluated using this approach. A parallel corpus of the source and target languages is also a requirement that might not be readily available for many language pairs. However, today many governments, organizations, and companies publish their content on the Web in more than one language. These
web sites are a valuable resource of translations that can be organized into parallel corpora. To take advantage of these sites, we implement a system for the automatic acquisition of aligned, bilingual corpora from the Web. The system involves automatic indexing of a given domain using a web crawler, identifying pairs of pages that are translations of one another, and aligning bilingual texts at the sentence level. We describe the system and its application to creating a large English-French bi-text corpus from the "Government of Canada" website. The quality of the corpus is evaluated by comparing it to other parallel corpora.

1.3 Linguistic Tagging

As mentioned in the previous section, natural language processing (NLP) studies the generation and understanding of natural language texts. Many NLP subtasks aim to discover the syntactical and functional structure of a given text. In these tasks a tag or label is assigned to words or phrases in a given sentence. The label depends on both the labeled word and the context. This process is called linguistic tagging. The following are typical examples of linguistic taggers:

• Part of Speech Tagging (POS Tagging): Each word in a given sentence is assigned a part of speech label that depends on the word and its context inside the sentence.

• Named Entity Tagging (NE Tagging): Each word or sequence of words in a given sentence is classified into predefined categories such as person names, organizations, places, expressions of date and time, quantities, and others.

• Morphology Analysis: Each word is classified according to its internal structure and origin. In most languages the word is the smallest semantic unit of a sentence. Understanding the nature of individual words is important in order to understand the sentence.

• Noun Phrase Bracketer: A noun phrase bracketer extracts noun phrases in sentences. This is important for many applications that seek to identify key phrases, since noun phrases carry most of the information in sentences.

Table 1.1 shows examples of different linguistic taggers.


Tagging tasks are typically performed by manually annotating a text and training a classifier on it. The annotation is done on the word level or on the phrase level depending on the task. In this thesis we are concentrating on word level tagging. The annotation process is expensive since it requires human expertise. Therefore, annotated training sets are not available for many languages.

Sentence:                 Mr. Safadi introduced his research in Montreal
Part of speech tagging:   Mr./title Safadi/proper-noun introduced/verb his/determiner research/noun in/preposition Montreal/proper-noun
Named entity extraction:  [Mr. Safadi]person introduced his research in [Montreal]place
Morphology analysis:      Mr./abbreviation-of-Mister Safadi/UNKNOWN introduced/past-tense-of-introduce his/3rd-person-male-possessive-pronoun research/singular-noun in/preposition Montreal/proper-noun
Sentence bracketing:      [Mr. Safadi]NP introduced [his research]NP [in Montreal]NP

Table 1.1 Several examples of linguistic taggers

In this thesis we are going to use a tagger that is configured in a source language to tag sentences in another, target language. Tags are projected from words of the source language to words of the target language. The process is carried out using a parallel corpus of the two languages and is described in Chapter 3.

1.4 Parallel Corpora

A parallel corpus is composed of two texts that are translations of each other, along with information about corresponding parts of the two texts. The texts are obtained from multilingual resources, while correspondence information is usually automatically computed. Modern parallel corpora are translations that were created to continuously maintain a multilingual resource. Examples of modern corpora are multilingual corporate web sites, documents created for international organizations, and parliament proceedings of bilingual countries. Of particular interest to us are the Canadian Hansards [Roukos et al., 1995], an English-French transcription of the proceedings of parliamentary discussions in Canada. Parallel corpora are typically used in NLP to train statistical machine translation systems. They are also used by human translators. Translators might want to know the meaning of a specific word in a certain context, or locate its translation in a complete
phrase. Parallel corpora are perfect resources for such tasks. Professional services are available for many languages. For example, TransSearch1 offers a web service that allows subscribers to search English, French and Spanish parallel corpora. It is worth noting that although many modern bilingual resources exist, they lack correspondence information that make them a true parallel corpus. For example, documents of a specific organization may be scattered across their sites without enough information about the correspondence between their different parts. One goal of the research presented in this thesis is to create usable parallel corpora from these available resources. In a parallel corpus, textual information alone is not sufficient for most computational applications. Correspondence between the different parts of the texts is required to obtain useful information. The correspondence information is called alignment. Several levels of alignments can be defined in a parallel corpus. Each level corresponds to one structural level in the corpus. For example, if the parallel corpus is a book with its translation, there would be an alignment between chapters, paragraphs inside chapters, sentences of these paragraphs, and individual words inside the sentences. Higher levels of alignments such as chapter, section and paragraph alignments are usually one to one and easy to compute using heuristic procedures that exploit external boundary information. The two levels that are of particular interest to us are sentence and word alignments.

1.5 Contribution

In this thesis, we introduce a new language-independent approach for implementing linguistic taggers. The approach implements a tagger in a target language using a tagger configured for a source language and a parallel corpus of the source and target languages. It only assumes that tags are assigned to words and that a reasonable mapping between the tagsets of the two languages can be established. Since parallel corpora might not be readily available for many language pairs, we developed a system to automatically build parallel corpora from the Web. With minimal configuration the system crawls bilingual web sites, extracts pages and aligns sentences. Linguistic taggers are important and expensive resources for many natural language processing tasks. The system described in this thesis allows one to cheaply implement taggers for resource-poor languages by exploiting taggers of resource-rich languages.

1 http://transsearch.iro.umontreal.ca

1.6 Thesis Outline Implementing cross-lingual taggers is the main topic of this thesis. The implementation method relies on the availability of a parallel corpus between the two languages. A supporting task presented in this thesis is the acquisition of such corpora from the web. Chapter 2 provides a literature review of linguistic tagging, parallel corpora and other relevant topics. Chapter 3 provides a detailed description of the techniques developed in this thesis work for cross-lingual tagging in the Canadian Hansard domain. Chapter 4 describes the acquisition of a parallel corpus from the web. Finally, Chapter 5 concludes the thesis and discusses future work.


Chapter 2 Literature Review This chapter reviews the literature relevant to the implementation of linguistic taggers and parallel corpora. We start with an overview of relevant work in the fields of part-of-speech tagging and cross-lingual tagging, and then discuss parallel texts and their properties. Next, we discuss statistical language modeling and introduce two important techniques that are used in this thesis: sentence and word alignment. We present previous approaches for automating the acquisition of parallel corpora. Finally, we give an overview of applications in this field.

2.1 Part of Speech Tagging The existence of tremendous amounts of on-line text, in all languages, has increased the need for algorithms for natural language processing (NLP). Accomplishing NLP tasks requires the use of language-dependent taggers, including part-of-speech taggers and named entity extractors. Part of Speech Tagging (POS Tagging) is the process of giving each word in a given sentence a part of speech label that depends on the word and its context inside the sentence. The tags are taken from a given tagset. In this thesis we use the tagset defined for the RALI Tagged Hansard Corpus [RALI-Laboratory, 1996]. The tags are used for both English and French texts. Table 2.1 outlines the core tags in this set. While most English and French words are unambiguous and have unique tags, many of the most common words have more than one tag. For example, the word “look” in the sentence “They look at him” is a verb, whereas in the sentence “He changed his look” it is a noun. Table 2.2 is taken from [Jurafsky and Martin, 2000a] and illustrates the distribution


Tag    Description
Quan   Quantity
Pron   Pronoun
Adve   Adverb
ConC   Conjunction
AdjQ   Adjective
Dete   Determiner
ConS   Condition
NomC   Common Noun
Verb   Verb
NomP   Proper Noun
Ordi   Ordinal Number
Prep   Preposition
Punc   Punctuation

Table 2.1 Core Part of Speech Tags

of tags per English word in the Brown corpus, a standard NLP corpus for training and benchmarking POS taggers. The corpus contains 1,014,312 words tagged with 82 tags [Francis and Kucera, 1979]. The role of a part of speech tagger is to disambiguate the word tag appropriately given the sentence.

Unambiguous (1 tag)      35,340
Ambiguous (2-7 tags)      4,100
  2 tags                  3,760
  3 tags                    264
  4 tags                     61
  5 tags                     12
  6 tags                      2
  7 tags                      1 (the word "still")

Table 2.2 Number of tags per English word in the Brown corpus

There are two main approaches for part of speech tagging: rule based tagging, and stochastic tagging. Rule-based tagging relies on hand-written rules to disambiguate words with more than one tag. The approach requires human experts to craft the rules and is not of interest in this thesis because it is language specific and labor intensive. For more
information on this approach see [Jurafsky and Martin, 2000b]. Stochastic tagging uses a tagged training corpus to build a tagging model that is used to compute the probability of each word having a given tag in the context of the sentence. The tagger computes for every sentence the most probable sequence of tags [Jurafsky and Martin, 2000c]. Let W be a sequence of n words, where the ith word in this sequence is w_i. Let T be the sequence of the corresponding tags of W, t_i be the tag of w_i, and \hat{T} be the best sequence of tags. The role of the tagger is to compute \hat{T}:

    \hat{T} = \argmax_T P(T | W)                                                    (2.1)
            = \argmax_T \frac{P(W | T) P(T)}{P(W)}       (using Bayes law)          (2.2)
            = \argmax_T P(W | T) P(T)          (because P(W) is the same for every T) (2.3)

Using the chain rule of probability:

    P(T) P(W | T) = \prod_{i=1}^{n} P(w_i | w_1 t_1 \ldots w_{i-1} t_{i-1} t_i) \, P(t_i | w_1 t_1 \ldots w_{i-1} t_{i-1}).   (2.4)

To simplify this computation, following [Jurafsky and Martin, 2000c], two assumptions are made: the probability of a word depends only on its tag, and the tag history can be approximated by the most recent two tags:

    P(w_i | w_1 t_1 \ldots w_{i-1} t_{i-1} t_i) = P(w_i | t_i)                       (2.5)
    P(t_i | w_1 t_1 \ldots w_{i-1} t_{i-1}) = P(t_i | t_{i-2}, t_{i-1})              (2.6)

These probabilities can be estimated from their relative frequencies in the training corpus:

    P(t_i | t_{i-2}, t_{i-1}) = \frac{count(t_{i-2}, t_{i-1}, t_i)}{count(t_{i-2}, t_{i-1})}   (2.7)
    P(w_i | t_i) = \frac{count(w_i, t_i)}{count(t_i)}                                          (2.8)
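The two relative-frequency estimates above, together with the argmax of Equations 2.1-2.6, can be prototyped in a few lines. The sketch below is illustrative only: it uses a tiny hand-made corpus and a brute-force search over tag sequences rather than the Viterbi algorithm a real tagger would use, and all names are assumptions made for the example (the tag labels follow Table 2.1).

    from collections import defaultdict
    from itertools import product

    # Toy tagged corpus: a list of (word, tag) sentences. In practice these counts
    # come from a large hand-labeled corpus such as the Brown corpus.
    corpus = [
        [("they", "Pron"), ("look", "Verb"), ("at", "Prep"), ("him", "Pron")],
        [("he", "Pron"), ("changed", "Verb"), ("his", "Dete"), ("look", "NomC")],
    ]

    tag_trigram = defaultdict(int)   # count(t_{i-2}, t_{i-1}, t_i)
    tag_bigram = defaultdict(int)    # count(t_{i-2}, t_{i-1})
    word_tag = defaultdict(int)      # count(w_i, t_i)
    tag_count = defaultdict(int)     # count(t_i)

    for sent in corpus:
        tags = ["<s>", "<s>"] + [t for _, t in sent]
        for w, t in sent:
            word_tag[(w, t)] += 1
            tag_count[t] += 1
        for t2, t1, t in zip(tags, tags[1:], tags[2:]):
            tag_trigram[(t2, t1, t)] += 1
            tag_bigram[(t2, t1)] += 1

    def p_tag(t, t2, t1):
        """P(t_i | t_{i-2}, t_{i-1}) by relative frequency (Eq. 2.7)."""
        return tag_trigram[(t2, t1, t)] / tag_bigram[(t2, t1)] if tag_bigram[(t2, t1)] else 0.0

    def p_word(w, t):
        """P(w_i | t_i) by relative frequency (Eq. 2.8)."""
        return word_tag[(w, t)] / tag_count[t] if tag_count[t] else 0.0

    def tag_sentence(words, tagset):
        """Brute-force argmax over tag sequences (Eqs. 2.1-2.6); fine for short sentences."""
        best, best_score = None, -1.0
        for seq in product(tagset, repeat=len(words)):
            tags = ("<s>", "<s>") + seq
            score = 1.0
            for i, w in enumerate(words):
                score *= p_word(w, seq[i]) * p_tag(seq[i], tags[i], tags[i + 1])
            if score > best_score:
                best, best_score = seq, score
        return best

    print(tag_sentence(["he", "look"], ["Pron", "Verb", "NomC", "Dete", "Prep"]))
    # ('Pron', 'Verb')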

An experiment reported in [Weischedel et al., 1993] used the stochastic approach to
train and tag texts from the Wall Street Journal and achieved 96% accuracy. However, this approach requires the existence of a labeled training corpus, where words have been assigned the correct tag. Unsupervised learning methods were proposed to train part of speech taggers as well, but their performance lags behind that of supervised learning methods [Jurafsky and Martin, 2000c]. Other methods for part-of-speech tagging exist. Transformational-based learning methods [Brill, 1995] use unsupervised learning to train a set of tagging rules. Co-training methods use a small labeled training set to bootstrap a model that is then tuned with unsupervised learning methods. For more information on this approach see [Clark et al., 2003].

2.2 Cross-Lingual Tagging Configuring linguistic taggers in a given language is a time-consuming and expensive operation, which requires both human expertise and the availability of labeled text in the target language. Moreover such labeled resources do not exist for many languages. The goal of this thesis is to obtain linguistic taggers automatically by taking advantage of parallel corpora. Such parallel corpora exist for many language pairs, especially due to the existence of multi-lingual web sites. An approach which automates the process of acquiring parallel corpora is also presented in Chapter 4. The prior work of [Yarowsky et al., 2001, Yarowsky and Ngai, 2001] is the most relevant to the work presented in this thesis. They used parallel corpora to induce several linguistic taggers in several target languages using existing taggers configured in several source languages. These included a part-of-speech tagger, noun-phrase bracketer, named entity extractor, and morphological analyzer. The algorithm requires a source-language tagged and word-aligned parallel corpus. Given a target language input sentence in the corpus, a word alignment was obtained with its source language translation, and the source language tags were projected to the target language sentence. Some prior knowledge was also incorporated to reduce the noise arising from the inherent ambiguity associated with the tag projection. For example, to train a stochastic French part of speech tagger such as the one described in section 2.1, the parameters P (ti |ti−2 , ti−1 ) and P (wi|ti ) need to be estimated. P (wi|ti ) is estimated from the projected tags with one caveat to eliminate noise: Only 1-1 alignments and basic part of speech tags (verbs, nouns, adjectives, adverbs, etc . . . ) are considered.


P(t_i | t_{i-2}, t_{i-1}) is also estimated from the projected tags. However, since P(T) has fewer parameters than P(W|T), it can be estimated from a chosen, filtered subset of the aligned corpus: only alignments with high probabilities are taken into consideration. After that, the parameters are smoothed so that the model is biased toward the majority POS tag for each word. The rationale of this re-estimation is that in both English and French, most words have a unique tag. Good crosslingual performance was obtained for several tasks. In the part of speech tagging experiment, an accuracy of 91% was obtained. However, the tag set was small: only 9 core tags were considered.
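The projection step itself is simple once word alignments are available. The following sketch illustrates the idea under the 1-1 filtering described above; the data structures and function names are assumptions made for illustration and are not taken from [Yarowsky et al., 2001] or from the system built in this thesis.

    from collections import defaultdict

    def project_tags(source_tags, alignment):
        """Project source-language tags onto target words through a word alignment.

        source_tags: list of (source_word, tag) for the source sentence.
        alignment:   list of (source_index, target_index) links from the aligner.
        Only 1-1 links are trusted, mirroring the filtering described above.
        """
        src_links = defaultdict(list)   # source position -> target positions
        tgt_links = defaultdict(list)   # target position -> source positions
        for s, t in alignment:
            src_links[s].append(t)
            tgt_links[t].append(s)

        projected = {}
        for s, t in alignment:
            if len(src_links[s]) == 1 and len(tgt_links[t]) == 1:   # keep 1-1 links only
                projected[t] = source_tags[s][1]
        return projected

    def estimate_emission(projected_items):
        """Estimate P(w|t) from projected (word, tag) items over many sentence pairs."""
        wt, t_count = defaultdict(int), defaultdict(int)
        for w, t in projected_items:
            wt[(w, t)] += 1
            t_count[t] += 1
        return {pair: c / t_count[pair[1]] for pair, c in wt.items()}

    # Example: "the house" / "la maison" with links (0,0) and (1,1)
    eng = [("the", "Dete"), ("house", "NomC")]
    print(project_tags(eng, [(0, 0), (1, 1)]))   # {0: 'Dete', 1: 'NomC'}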

2.3 Alignment in Parallel Corpora

Section 1.4 introduced the general concept of parallel corpora and alignment. Sentence and word alignment are two levels of alignments that are of particular interest to us because they describe the correspondence of two important building blocks of text. Sentence alignment establishes correspondence between sentences in the parallel corpus. Sentence alignment is many-to-many in general, due to the different syntax and structure of different languages. However, most algorithmic approaches produce one-to-one alignments. Figure 2.1 shows an example of paragraph and sentence alignments. Word alignments are translation relationships between words in a pair of corresponding sentences in a parallel corpus. The alignment establishes correspondence between words that are translations of each other in two aligned sentences. The alignment is many-to-many in general and can be seen as a bipartite graph between words of the two sentences. Examples of word alignments for several sentences are shown in Figure 2.2. In this thesis we use an implementation of the word alignment algorithm provided by the Portage statistical machine translation system [Ueffing et al., 2007].

2.4 Statistical Language Modeling

A statistical language model (LM) is a model that predicts the probability of a sequence of words,

    P(w_1 \ldots w_n) = P(w_1) P(w_2 | w_1) \ldots P(w_n | w_1, \ldots, w_{n-1}).    (2.9)



Fig. 2.1 Example of a parallel corpus with chapter, paragraph and sentence alignments. Chapter alignments are indicated with bold boxes, paragraph alignments with rectangular boxes, and sentence alignments with round boxes. In this example, individual phrases are considered sentences to illustrate several types of alignment. For example, the first three English phrases after Chapter 1 align to the first French phrase, and the fourth English phrase aligns to the second and the third French phrases.




Fig. 2.2 Examples of word alignments between English and French sentences

The most common statistical model is the N-gram language model, which approximates the probability P(w_i | w_1, \ldots, w_{i-1}) as P(w_i | w_{i-N}, \ldots, w_{i-1}); i.e., the probability of each word is predicted using at most N preceding words. For instance, the simplest example of an N-gram language model is the unigram model, which predicts the probability of a word sequence as:

    P(w_1 \ldots w_n) = P(w_1) P(w_2) P(w_3) \ldots P(w_n)    (2.10)

A bigram model would predict the word sequence probability as:

    P(w_1 \ldots w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) \ldots P(w_n | w_{n-1}).    (2.11)

Bigram probabilities (as well as N-gram probabilities for any N) can be estimated by counting their frequencies in a text corpus:

    P(w_i | w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}    (2.12)
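The relative-frequency estimate in Equation 2.12 and the bigram product of Equation 2.11 can be illustrated with a short sketch. The toy corpus and the start-of-sentence token below are assumptions made for the example, not part of any corpus used in this thesis.

    from collections import defaultdict

    def train_bigram_lm(sentences):
        """Estimate bigram probabilities by relative frequency (Eq. 2.12)."""
        unigram, bigram = defaultdict(int), defaultdict(int)
        for sent in sentences:
            words = ["<s>"] + sent
            for w1, w2 in zip(words, words[1:]):
                unigram[w1] += 1
                bigram[(w1, w2)] += 1
        return lambda w2, w1: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

    def sentence_probability(p, sent):
        """P(w_1 ... w_n) under the bigram model (Eq. 2.11)."""
        prob, prev = 1.0, "<s>"
        for w in sent:
            prob *= p(w, prev)
            prev = w
        return prob

    p = train_bigram_lm([["the", "house", "is", "red"], ["the", "car", "is", "blue"]])
    print(sentence_probability(p, ["the", "house", "is", "blue"]))  # 0.25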

In general, a larger value for N can provide in principle a better estimate of the probabilities of words in context for the given language. However, as N grows, it becomes more difficult
to compute robust estimates of N-gram probabilities from finite training corpora. More robust methods have been proposed to estimate N-gram probabilities. An overview of several different methods is presented in [Goodman, 2000].

2.5 Statistical Machine Translation

We provide a brief introduction to statistical machine translation (SMT) because many of the important techniques that are applied to parallel corpora originated from SMT research. In particular, sentence and word alignment algorithms are two techniques used in this thesis. In order to put these techniques in context, a discussion of SMT is included here. A machine translation system automatically translates a text from a source language to a target language. Machine translation, as a research topic, is not new. The earliest machine translation systems relied on a rule-based approach where linguists predefined rules that described how text in the source language can be translated to the target language. However, due to the complexity and dynamic nature of natural languages, the implementation of rule-based systems is expensive and difficult for many language pairs where human experts are not available.

2.6 The Statistical Machine Translation Model

A new approach for machine translation which relies on machine learning using parallel texts has emerged. The term used to describe this approach is "statistical machine translation" or SMT [Brown et al., 1993]. SMT systems are usually built using an entirely data-driven approach, in which the associated translation models and language models are generally trained strictly from bilingual texts with no additional prior knowledge. The need for sentence-aligned parallel corpora in many languages motivated research into methods for automatically constructing such resources. SMT assumes that a sentence in the source language can be translated to several sentences in the target language. In this chapter we assume that the source language is French and the target language is English. We denote a French sentence with f and an English sentence with e. Each translation generated by an SMT system has an assigned probability, P(e|f). The goal of an SMT system is to produce the best target language translation e* for a source language sentence f:

    e* = \argmax_e P(e|f)                                                          (2.13)
       = \argmax_e \frac{P(f|e) P(e)}{P(f)}        (using Bayes rule)              (2.14)
       = \argmax_e P(f|e) P(e)                     (because P(f) does not depend on e)  (2.15)

In Equation 2.15 the first term corresponds to the probability of f being a translation of e, and the second term characterizes the degree to which the target sentence is syntactically well-formed. P(e) can be estimated using a language model built from text in the target language; P(f|e) is estimated using a translation model. For example, P(f|e) can be estimated by summing over all possible word alignments a between f and e:

    P(f|e) = \sum_a P(f, a|e).                                                     (2.16)
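As a minimal illustration of Equation 2.15, the following sketch reranks a list of candidate translations; the candidate list stands in for the search over all target sentences (done in practice by the decoder described below), and the two probability functions are assumed to be supplied by a translation model and a language model.

    def best_translation(f, candidates, p_f_given_e, p_e):
        """Pick e* = argmax_e P(f|e) P(e) over a candidate list (Eq. 2.15).

        p_f_given_e and p_e are assumed callables backed by a translation model
        and a target-language model respectively.
        """
        return max(candidates, key=lambda e: p_f_given_e(f, e) * p_e(e))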

The statistical model is then used to translate sentences of the source language into the target language. This translation process is called “decoding”. Figure 2.3 depicts the structure of an SMT system.


Fig. 2.3 The structure of a statistical machine translation system

Before training the translation model, the parallel corpus has to be pre-processed in order to obtain structural information that describes alignment between different parts of the corpus across the two languages. We explain the alignment and model training algorithms in Subsections 2.6.1 and 2.6.2. Sentence alignment is discussed in Section 2.6.3.


2.6.1 Word Alignment

Word alignment gives important information in a parallel corpus, since words define most of the syntactic and semantic structure of a sentence. Moreover, translation probabilities are estimated by observing word alignments in all the corpus pairs. Since word alignment in a parallel corpus is usually done without the availability of data aligned by hand, this process can be considered an unsupervised learning method. The alignments between words depend on the translation probabilities between the words, because the best word alignments are those that maximize the overall translation probability. However, the translation probabilities themselves can only be estimated when word alignments are available. The word alignment for a sentence pair f and e can be written as a vector of integer variables, where the value of the alignment variable a_i is the index of the word e_{a_i} in the target language string that has been aligned with word f_i in the source language string. For example, the word alignment for the third sentence in Figure 2.2 is a = (1, 2, 2, 3, 5, 4). The alignment probability P(a|e, f) can be written, using Bayes rule, as:

    P(a|e, f) = \frac{P(f, a|e)}{P(f|e)}                    (2.17)
              = \frac{P(f, a|e)}{\sum_a P(f, a|e)}.          (2.18)

It is clear from Equation 2.17 that the alignment probability and the translation probability, P(f|e), are dependent on the joint probability, P(f, a|e). In Section 2.6.2 it will be shown how the parameters of P(f, a|e) can be computed from word-aligned sentence pairs after making assumptions about the form of the translation model [Brown et al., 1993, Knight and Koehn, 2007]. It is also clear that, since estimation of the parameters associated with P(f, a|e) depends on the existence of the alignments, there is a hidden variable problem. There are many excellent tutorial references describing how the expectation maximization (EM) algorithm [Dempster et al., 1977] is applied to the estimation of the parameters of P(f, a|e) [Brown et al., 1993, Knight and Koehn, 2007]. This approach assumes that the alignment variables are hidden variables in the EM algorithm. The algorithm works by iteratively obtaining optimum alignments (Expectation
step) and then estimating the translation parameters (Maximization step). These translation parameters include the individual word translation probabilities tr(f_j | e_{a_j}), the probability of source word f_j given target word e_{a_j}. As a notation convention, subscripts will be used to denote individual words in a sentence and superscripts will be used to index sentences in the parallel corpus. For example, f_i is the ith word in the sentence under consideration, f^k is the kth source language sentence in the corpus, and f_i^k is the ith word in the kth source language sentence in the corpus.

2.6.2 Translation Model Estimation

As mentioned in Section 2.5, the probability of a given translation can be written as:

    P(f|e) = \sum_a P(f, a|e)    (2.19)

Computing the probability P(f, a|e) depends on the assumptions made in formulating the translation model. Several approximations for computing P(f, a|e) (known as IBM models) were proposed in [Brown et al., 1993]. We will describe here the simplest model, IBM-1. The IBM-1 translation model assumes that all the alignments between source and target sentences are equally probable and that the probability of each alignment is independent of the length of the sentences and the ordering of their words. Under this constraint the conditional joint probability, P(f, a|e), is given by:

    P(f, a|e) = \frac{1}{(l + 1)^m} \prod_{j=1}^{m} tr(f_j | e_{a_j})    (2.20)

where l and m are the lengths of e and f respectively, f_j is the jth word in f, and e_{a_j} is the corresponding word in e, where a_j is the index of the target language word that has been aligned with source language word f_j. Since all parameters were estimated in the word alignment step, what remains is to plug their values into Equations 2.19 and 2.20 to compute the translation probability.
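A compact way to see how the hidden-alignment problem of Section 2.6.1 and the IBM-1 model of Equation 2.20 fit together is a small EM trainer for the word translation probabilities tr(f|e). The sketch below follows the standard IBM-1 EM recipe on a toy corpus; it is illustrative only and omits the refinements used by real systems such as Portage.

    from collections import defaultdict

    def train_ibm1(pairs, iterations=10):
        """EM training of IBM-1 word translation probabilities tr(f|e).

        pairs: list of (f_sentence, e_sentence), each sentence a list of words.
        A NULL token is added to the target side so that source words may align to nothing.
        """
        f_vocab = {f for fs, _ in pairs for f in fs}
        tr = defaultdict(lambda: 1.0 / len(f_vocab))     # uniform initialization

        for _ in range(iterations):
            count = defaultdict(float)                   # expected counts c(f, e)
            total = defaultdict(float)                   # expected counts c(e)
            for fs, es in pairs:
                es = ["NULL"] + es
                for f in fs:
                    # E-step: distribute one count for f over all e, proportionally to tr(f|e)
                    norm = sum(tr[(f, e)] for e in es)
                    for e in es:
                        c = tr[(f, e)] / norm
                        count[(f, e)] += c
                        total[e] += c
            # M-step: re-estimate tr(f|e) from the expected counts
            for (f, e), c in count.items():
                tr[(f, e)] = c / total[e]
        return tr

    pairs = [(["la", "maison"], ["the", "house"]),
             (["la", "voiture"], ["the", "car"])]
    tr = train_ibm1(pairs, iterations=50)
    print(round(tr[("maison", "house")], 2), round(tr[("la", "the")], 2))
    # both approach 1.0 as EM converges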


2.6.3 Sentence Alignment

There are two main types of techniques used for sentence alignment: sentence-length-based and word-correspondence-based. Sentence-length-based techniques rely on the statistical similarity between the lengths of translated sentences across languages: long and short sentences have long and short translations respectively. Word-correspondence-based methods make use of the actual word co-occurrence information in the sentences to obtain alignments. One of the earliest sentence-length-based alignment techniques is Church & Gale's algorithm [Gale and Church, 1991], which assigns a probabilistic score to every possible alignment pair based on the difference in lengths of the sentences in the pair. Paragraphs in the parallel corpus are aligned first. This is done using simple typographical heuristics like line boundaries and indentation. Then, a probabilistic score is assigned to each possible pair of sentences. The score depends on the ratio of the lengths of the sentences (in characters) and the variance of this ratio. Finally, dynamic programming is used to compute the best alignment of the sentences using the individual alignment scores. The algorithm uses a sentence-level cost function that allows for insertions, deletions, substitutions, contractions, expansions and merges. The algorithm was tested on a tri-lingual corpus of economic reports issued by the Union Bank of Switzerland in English, French and German. The algorithm was able to achieve a 96% recall rate. Moreover, a precision error rate of 0.7% was achieved when only the 80% best alignments (according to their scores) were considered. Moore's alignment algorithm [Moore, 2002] is an example of a word-correspondence-based alignment algorithm. The algorithm has two phases: in the first phase a preliminary alignment is obtained using a length-based approach like Church & Gale's algorithm. Then a translation model is built using the aligned corpus. Scores from the translation model are used to augment the scores of the first phase. More precisely, for a given pair of sentences, f with l words and e with m words, the model combines the following scores:

• Initial probability for a one-to-one alignment between f and e given by the length-based alignment algorithm: P_{1-1}(f, e)

• Translation probability obtained from the translation model: P(e|f)

• Language model probability, P(f), estimated from a word unigram model.


The overall score for a given pair is:

    S(f, e) = P_{1-1}(f, e) \, P(e|f) \, P(f)    (2.21)

The algorithm gave an impressive performance of over 99% precision and recall when tested on a manually aligned English-Spanish corpus. In this thesis, Moore's alignment algorithm is used for all sentence alignments.

2.6.4 Translation Process (Decoding)

The translation model allows us to compute the translation probability of a target language sentence given a source language sentence. However, when translating, only the source language sentence is available. Examining all possible valid target language sentences is not an option because there is a huge number of them. Therefore, at run time, the SMT system needs to do an efficient search in the space of target language sentences to come up with a good translation candidate. This search operation is referred to as "decoding". Many decoding algorithms were developed during the last decade. Their basic mechanism is the same: use word translation probabilities to construct a target language sentence maximizing the overall translation probability. The algorithm described here is from [Och et al., 2001] and is based on the A* graph search algorithm. A* finds the minimum cost[1] path in a given graph from a start node to a goal node. At each node, the algorithm adds two scores to determine the ordering of nodes to be visited. The first score is the true cost of the current path, g(n), where n is the path to the current node. The second score, denoted h(n), is a heuristic estimate of the distance from the current node to the goal node. If h(n) is an optimistic function that underestimates the distance to the goal node, then A* is guaranteed to find the optimal path. In SMT decoding, each intermediate node in the search graph represents a partial sentence. The cost of the path so far, g(n), is the alignment score between the partial sentence and the source sentence. h(n) is an estimate of the score of the best path from node n to the final node.

[1] A* finds the optimal path. This path can be the path with minimum or maximum score depending on the definition of optimality. In SMT decoding the best path is the path with the highest translation probability, i.e. the path with the highest score.


Heuristic functions described in [Och et al., 2001] rely on estimating the translation probabilities for the source sentence positions that have not been covered so far in the search. The simplest function takes into account only the probability P(f|e):

    h(n) = \max_e P(f_n | e).    (2.22)

A node is open if more words can be added to the partial sentence; otherwise the node is closed. For each open node a new word is added to the hypothesis and a new node is created. Figure 2.4 shows an example of A* decoding in the context of statistical machine translation. The source sentence is "Je veux aller". Closed nodes are double framed and their score is indicated. In this simple example g(n) is the alignment score (Section 2.6.1) between the partial English sentence and the French sentence, and h(n) is 1.
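The following toy decoder illustrates the A* search described above for a word-for-word, monotone translation model; real decoders additionally handle reordering, phrases and language model scores. All probabilities and the small lexicon are invented for the example.

    import heapq

    def a_star_decode(source, tr, lexicon):
        """Toy A* decoder: translate one source word at a time, left to right.

        tr(f, e) is a word translation probability; lexicon[f] lists candidate
        target words for f. g is the product of tr over translated words; h is
        the product of the best attainable tr for the words not yet covered (an
        optimistic, i.e. admissible, estimate). Scores are probabilities, so we
        push negative scores on the heap to pop the highest-scoring node first.
        """
        def h(position):
            est = 1.0
            for f in source[position:]:
                est *= max(tr(f, e) for e in lexicon[f])
            return est

        # Each node: (-(g * h), position in source, partial translation, g)
        heap = [(-h(0), 0, [], 1.0)]
        while heap:
            neg_score, pos, partial, g = heapq.heappop(heap)
            if pos == len(source):              # closed (final) node: all words covered
                return partial, g
            f = source[pos]
            for e in lexicon[f]:                # expand the open node with one more word
                g2 = g * tr(f, e)
                heapq.heappush(heap, (-(g2 * h(pos + 1)), pos + 1, partial + [e], g2))

    probs = {("je", "i"): 0.9, ("je", "me"): 0.1,
             ("veux", "want"): 0.7, ("veux", "wish"): 0.3,
             ("aller", "to go"): 0.8, ("aller", "go"): 0.2}
    lexicon = {"je": ["i", "me"], "veux": ["want", "wish"], "aller": ["to go", "go"]}
    print(a_star_decode(["je", "veux", "aller"], lambda f, e: probs[(f, e)], lexicon))
    # (['i', 'want', 'to go'], ~0.504)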





Fig. 2.4 Example of search graph in SMT decoding; double frames represent final nodes with their scores

2.6.5 Evaluating Translation Performance

Evaluating machine translation quality is not straightforward. Two translations of a given sentence can be correct without sharing a single word. The metric widely used in the research community to evaluate the accuracy of SMT systems is the BLEU measure [Papineni et al., 2001]. The BLEU score correlates well with human judgments and ranges from 0 to 1, where 1 denotes a perfect translation, and 0 denotes a translation that has nothing to do with the original sentence. The score is the geometric mean of N-gram overlaps between the produced translations and a set of reference translations produced by professionals. To compute the BLEU score of a candidate translation, a brevity penalty[2] BP is first computed to penalize short translations. If c is the length of the candidate translation and r is the length of the reference, then:

    BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \leq r \end{cases}    (2.23)

The BLEU score is then given by:

    BLEU = BP \cdot \exp\left( \sum_N w_N \log p_N \right)    (2.24)

where pN is the geometric average of N-gram partial overlap between the candidate and the reference translations, and wN are weights that sum to 1. In the baseline of the original paper [Papineni et al., 2001] the sum goes from 1 to 4 and all weights are 1/4. These values are used in all BLEU evaluations in this thesis.
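A simplified single-reference implementation of Equations 2.23 and 2.24 is sketched below; the official BLEU evaluation script additionally supports multiple references and other refinements, so this is only an illustration.

    import math
    from collections import Counter

    def ngrams(words, n):
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    def bleu(candidate, reference, max_n=4):
        """Simplified BLEU with one reference (Eqs. 2.23-2.24), weights w_N = 1/4."""
        c, r = len(candidate), len(reference)
        bp = 1.0 if c > r else math.exp(1.0 - r / c)          # brevity penalty
        log_sum = 0.0
        for n in range(1, max_n + 1):
            cand, ref = ngrams(candidate, n), ngrams(reference, n)
            overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
            p_n = overlap / max(sum(cand.values()), 1)
            if p_n == 0:                                       # any zero precision gives BLEU 0
                return 0.0
            log_sum += 0.25 * math.log(p_n)
        return bp * math.exp(log_sum)

    ref = "the cat is on the mat".split()
    print(round(bleu("the cat is on the mat".split(), ref), 3))  # 1.0
    print(round(bleu("the cat is on a mat".split(), ref), 3))    # ~0.537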

2.7 Automatic Construction of Parallel Corpora

As seen above, parallel corpora are critical for training SMT systems. However, acquiring them by hand is expensive. Many techniques for automating the process of building parallel

[2] One may ask why long translations are not penalized as well. The answer is that SMT models do not produce overly long translations. For example, the IBM-3 model [Brown et al., 1993] assumes that a given source language word may be translated into more than one target language word. This is referred to as fertility. The fertility parameters are learned from the corpus and are mostly between 1 and 2 for English-French translation.


corpora have been proposed. Several workshops were dedicated to this topic [ACL, 2003, ACL, 2005]. One of the earliest systems is STRAND (Structural Translation Recognition, Acquiring Natural Data) [Resnik, 1999]. STRAND relies on web page structure to identify translation pairs. The rationale is that page authors typically use the same layout to represent two pages of the same content in two languages. The processing steps in the system are:

1. Locating pages: A search engine is used to locate two types of pages: a parent page, which is a page that links to two pages of different languages, and a sibling page, which is a page that contains a link to a page of a different language with an anchor specifying the language. These two types of pages indicate the existence of a potential pair.

2. Generation of potential pairs: Pages obtained from the previous step are used to generate pairs. A pair consists of two pages of different languages whose URLs match specific rules. The page language is identified by a language detector.

3. Candidate pair filtering: STRAND uses the structural similarities of the HTML code to select the best match from the candidate pairs.

Another example is the PTMiner system [Chen and Nie, 2000]. PTMiner was developed to collect an English-Chinese parallel corpus that would be used in cross-language information retrieval tasks. The approach, however, is language independent and can be used for other languages. The system is divided into several steps:

1. Locate candidate sites: Search engines are used to locate bilingual sites. Queries for pages whose anchor text indicates the existence of a translation pair are sent. For example, to search for pages that may have English counterparts, the query "anchor:English version" is used.

2. File name fetching: URLs from a candidate site are retrieved using a search engine. Queries like "host:hostname" are used for this task.

3. Site crawling: Candidate sites are crawled so that a large portion of the site is acquired.
4. Pair scan: URL names are used to find pairs of pages. Patterns like "file-lang.html" and ".../lang/.../file.html" are used to detect pairs. The approach assumes that the web site is structured in such a way that URL names are sufficient to detect pairs.

5. Filtering: Pages that are paired together should satisfy text length and character set constraints. For instance, the difference in the length of the two paired pages should be less than a certain threshold, and each page should be in one of the languages under consideration.

6. Alignment: Since the goal of the system is to use the corpus in an information retrieval task, alignment is done at the HTML level rather than the sentence or word level. Identical sequences of characters in corresponding words of the two languages are used to align parts of the HTML pages. Markup is also used in the alignment.

In this thesis, we are going to use similar methods to the approaches mentioned above

in order to collect parallel corpora from the web. However, since the interest of this thesis is in sentence-aligned parallel corpora, the alignment process will go beyond generating page pairs to aligning individual sentences in the pairs.
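A minimal version of the URL-pattern pair scan used by systems such as PTMiner can be sketched as follows; the language markers listed are common naming conventions and are assumptions made for the example, not the exact patterns used by PTMiner or by the system described in Chapter 4.

    # Language markers commonly used in bilingual URLs; illustrative only.
    MARKERS = {"english": ["_e", "-eng", "/en/", "english"],
               "french":  ["_f", "-fra", "/fr/", "francais"]}

    def normalize(url):
        """Strip language markers so that translated pages map to the same key."""
        key = url.lower()
        for marks in MARKERS.values():
            for m in marks:
                key = key.replace(m, "#")
        return key

    def pair_pages(urls):
        """Group crawled URLs whose names differ only by a language marker."""
        buckets = {}
        for url in urls:
            buckets.setdefault(normalize(url), []).append(url)
        return [tuple(group) for group in buckets.values() if len(group) == 2]

    urls = ["http://example.gc.ca/report_e.html",
            "http://example.gc.ca/report_f.html",
            "http://example.gc.ca/contact.html"]
    print(pair_pages(urls))
    # [('http://example.gc.ca/report_e.html', 'http://example.gc.ca/report_f.html')]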

2.8 Other Applications of Parallel Corpora

Statistical translation models and parallel corpora can be used in many applications besides statistical machine translation. This section reviews several such applications.

2.8.1 Example-Based Machine Translation

Unlike SMT systems, which estimate word translation probabilities from the corpus, an example-based machine translation (EBMT) system uses analogy, breaking the input source language sentence into smaller phrases and looking up these phrases in the parallel text in order to create the final translation. EBMT was first introduced in [Nagao, 1984] as a way to simulate the way human translators work. It is based on the assumption that human translators do not do a thorough linguistic analysis. Instead they try to decompose the source language sentence into different phrases and translate each individual phrase as a whole. At the end, they compose the translations into one sentence.


Since not all phrases will be in the corpus, EBMT attempts to construct analogy rules from the corpus by analyzing its content. For example, if the corpus contains the following French-English pairs:

• (il me permet de conduire la voiture, he lets me drive the car)
• (elle me permet de voir le film, she lets me see the movie)

the EBMT system can deduce that the general translation of the expression "me permet de" is "lets me" and that the translations of "il", "elle", "conduire la voiture" and "voir le film" are "he", "she", "drive the car" and "see the movie" respectively. The system uses these small units of translation and composes them in order to generate the complete translation.

2.8.2 Resources for Translators

Multilingual texts are important resources for translators. Even before the advent of computers, translators used to refer to these texts in order to look up phrase translations or study the style and the context of a translation. Parallel texts offer translators a lot of help, since alignment is done automatically and search facilities are available to answer any translation need. The term "translation spotting" is used for the task of identifying the target language translation of a given source language text [Simard, 2003]. Translation spotting is usually associated with "translation memories", which are manually prepared databases of phrase translations. Parallel texts can be considered as translation memories, and translation spotting techniques can be used to look up translation needs inside parallel text. The work of [Simard, 2003] uses a translation model to spot the translation of a source language sentence in a parallel corpus. The input source language sentence is aligned to the target language sentences in the corpus. The target language sentence with the highest alignment probability is chosen and its individual words are aligned to the words in the input. Figure 2.5 illustrates an example of the process. In this example the query "the government's commitment" is aligned to every possible French sentence in the corpus. As described in Section 2.6.1, each alignment has an associated score, and the sentence with the best score is chosen as the translation candidate. It is worth noting that the exact query may not occur verbatim in the English member of the chosen sentence pair. The reason is that the alignment relies on a translation model that is computed from many sentence pairs.


Parallel texts can be created on demand to serve a certain translation need. In [Désilets, 2007] translation queries are sent to search engines, and then a small parallel corpus is constructed from the results on the fly.

Fig. 2.5  Example of translation spotting
  Query: the government's commitment
  Corpus:
    1  Let us see where the governments commitment is really at in terms of the farm community.  /  Voyons quel est le véritable engagement du gouvernement envers la communauté agricole.
    2  I would stress that these increases meet the government's continuing commitment to expenditure restraint.  /  Je voudrais souligner le fait que ces augmentations respectent l'engagement du gouvernement à freiner les dépenses.
  Best alignment: target sentence in pair 1
  Spotting answer: Voyons quel est le véritable engagement du gouvernement envers la communauté agricole.

2.8.3 Cross Lingual Information Retrieval

The size of the text corpus is an important factor in the accuracy of N-gram language model probabilities. The larger the corpus, the more statistically robust the probabilities are. However, for many resource-poor languages it is difficult to obtain a large collection of texts. It is nevertheless possible to exploit related texts in resource-rich languages to enhance the estimation of a language model in a resource-poor language. In [Khudanpur and Kim, 2004] a method that uses English texts to enhance a Chinese language model is presented in the context of a cross lingual information retrieval (CLIR) task, in which a query in Mandarin is issued and English documents are retrieved.

Let d^C_1, ..., d^C_N be the texts of N stories in Mandarin and d^E_1, ..., d^E_N be their corresponding or aligned English stories. The term "correspondence" means that each English document should be talking about the same story, news, or events as the Mandarin document, without the necessity of being an exact translation. Let P(e|d^E_i) be an estimate of the probability of an English word in the document d^E_i, derived from the relative frequency of e in d^E_i, and P(c|e) be the translation probability between the Mandarin and English words c and e. Then a unigram model for Mandarin given the English document d^E_i is:

    P_{CL}(c \mid d^E_i) = \sum_{e \in E} P(c \mid e)\, P(e \mid d^E_i).        (2.25)

According to [Khudanpur and Kim, 2004], this unigram model can be combined with the original Mandarin bi-gram or tri-gram model:

    P_{\text{interpolated}}(c_k \mid c_{k-1}, c_{k-2}, d^E_i) = \lambda\, P_{CL}(c_k \mid d^E_i) + (1 - \lambda)\, P(c_k \mid c_{k-1}, c_{k-2}).        (2.26)
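For concreteness, the following sketch evaluates Equations 2.25 and 2.26 on toy data. The translation table, the English document, the stand-in trigram model and the interpolation weight are invented for illustration; they are not taken from [Khudanpur and Kim, 2004]:

    from collections import Counter

    def p_word_given_doc(e, doc_tokens):
        """Relative-frequency estimate of P(e | d_i^E) from one English document."""
        counts = Counter(doc_tokens)
        return counts[e] / len(doc_tokens) if doc_tokens else 0.0

    def p_crosslingual(c, doc_tokens, p_c_given_e):
        """Equation 2.25: P_CL(c | d_i^E) = sum_e P(c | e) P(e | d_i^E)."""
        return sum(p_c_given_e.get((c, e), 0.0) * p_word_given_doc(e, doc_tokens)
                   for e in set(doc_tokens))

    def p_interpolated(c_k, history, doc_tokens, p_c_given_e, p_ngram, lam=0.3):
        """Equation 2.26: interpolate the cross-lingual unigram with the Mandarin n-gram model."""
        return lam * p_crosslingual(c_k, doc_tokens, p_c_given_e) + \
               (1.0 - lam) * p_ngram(c_k, history)

    # Toy example (all probabilities are made up for illustration only).
    english_doc = "the government announced a new budget".split()
    translation_table = {("政府", "government"): 0.8, ("预算", "budget"): 0.7}
    mandarin_trigram = lambda c, history: 0.01   # stand-in for the original Mandarin model
    print(p_interpolated("政府", ("宣布", "了"), english_doc, translation_table, mandarin_trigram))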

Experiments on the Hong Kong news corpus show a 15–28% reduction in perplexity, a statistical measure that describes how well a language model fits a given corpus [Jelinek et al., 1977], when using a trigram language model. When using the enhanced language model in a speech recognition task, an absolute 0.5% reduction in word error rate was achieved. Parallel corpora can also be used to train translation models that help translate queries for a CLIR task. In [Chen and Nie, 2000] a system to harvest a Chinese-English parallel corpus from the web is presented. The corpus is then used for two tasks: English-Chinese and Chinese-English information retrieval. The results reach about 40% of monolingual precision. This is somewhat low compared to results obtained using other translation technologies, probably because of the noise in the harvested corpus.


Chapter 3
Implementation of Cross-Lingual Taggers

This chapter describes how to assign linguistic tags to sentences in a target language by exploiting a linguistic tagger that has been configured in a source language. The approach does not require that the input sentence be translated into the source language. Instead, crosslingual projection of linguistic tags is accomplished through the use of a bi-text corpus for the source and target languages. We demonstrate this approach by implementing a system for part of speech (POS) tagging of French language sentences using an English language POS tagger and an English/French bi-text corpus. Finally, we evaluate this approach and report the results.

3.1 Motivation

The existence of tremendous amounts of on-line text, in all languages, has increased the need for algorithms for natural language processing (NLP). Accomplishing NLP tasks requires the use of language-dependent taggers, such as part-of-speech taggers, named entity extractors, etc. Constructing these taggers is a time-consuming and expensive operation that requires both human expertise and the availability of labeled text. This work focuses on inducing a French part of speech tagger automatically by taking advantage of the increasing amount of bilingual, unlabeled text on the web.


An obvious solution to this problem is to translate the target language sentence into the source language using an automated translation system, tag the translated sentence using the source language tagger, and project the tags back to the target language sentence. Although this solution is straightforward to implement, it has many drawbacks. First, machine translation is computationally expensive. In addition, tagging tasks try to obtain a characterization of the structure of language. Machine translation, as implemented according to the model in Section 2.5, does not incorporate any of this structure. Therefore, there is no reason that the structure would be carried along with the translation. The system presented in this chapter does not rely on having a translation for the input sentence before linguistic tags can be assigned in the target language. It is assumed that a sentence-aligned parallel corpus is available, and all linguistic tag projection is performed using this corpus. Linguistic tags are assigned to the input sentence in the target language by projecting tags from corpus sentences. For every word in the input sentence, scores are assigned to possible tags. The scores depend both on the frequency of word/tag pairs in the corpus and on the similarity between the input sentence and the corpus target sentences. The previous procedure requires a mapping between tags of the source language and the target language. Since tags in many tasks are language dependent, establishing such a mapping might not be easy. However, in many cases it is possible to establish a reasonable mapping between general categories of tags in the two languages, especially when the two languages have syntactic and grammatical similarities. We evaluate this general approach on a particular task: crosslingual part-of-speech (POS) tagging. POS tags are assigned to French language sentences by exploiting an English language POS tagger and an English-French parallel corpus. In this evaluation, since POS tagging is done for French sentences, French is the target language and English is the source language. Results are reported for a hand-labeled French language test set.

3.2 Corpus Description

This section describes the corpus used in the experiments. Although the approach is general and designed to work with several taggers, the lack of multilingual tagged resources makes evaluation difficult. Evaluation requires labeled data and/or taggers in both the source and the target language. For the POS tagging task, a manually tagged subset of the Hansard


corpus was provided by the Recherche Appliquée en Linguistique Informatique (RALI) group at Université de Montréal (UdeM) [RALI-Laboratory, 1996]. This corpus contains 4527 English sentences and 4439 French sentences. Both the English and the French sentences are manually tagged with a comprehensive tag set. Using a set of hand-derived POS tags in this study is preferable to using an automatic POS tagger. To the extent that POS tags obtained from human transcribers can be considered consistent and correct, the use of these tags eliminates the issue of errors that are made by an automated tagger. While the use of an aligned bilingual corpus with manually derived POS tags is important for evaluation purposes, it is not required for system implementation. Indeed, we assume that an automatic POS tagger in the source language is an integral part of the final system.

3.2.1 Corpus Preparation

The English and French sentences in the UdeM corpus are taken from parallel Hansard texts. The same content is expressed in both English and French. However, the number of sentences in English and French differs. This is because a source sentence is sometimes translated to more than one target sentence. The English and French sentences in the UdeM corpus were aligned using the Moore aligner described in Section 2.6.3. As a result, we obtained 4400 sentence pairs out of the 4527 English and 4439 French sentences. Since we are interested primarily in word-level linguistic taggers, a word-level alignment between source and target language sentences must be obtained. As described in Section 2.6.1, word alignment algorithms use EM and iterate over the corpus, exploiting word co-occurrence information in order to compute word alignment and translation model parameters. In this statistical procedure, the more data is used to compute the alignment, the more accurate the alignments are. It is also worth noting that once the parameters are estimated, they can be used to align any pair of sentences. Since our system relies heavily on word alignments and the corpus used here is relatively small (4400 sentence pairs), we wanted to incorporate more information sources to compute the word alignments. To test this idea, word alignments are computed with three different methods:

1. No external resource is used. We will call the resulting corpus the UdeM corpus.


2. Word alignments of the UdeM sentence pairs are estimated using the Hansard corpus [Roukos et al., 1995]. The resulting corpus is called UdeM-Hansards.

3. Word alignments of the UdeM sentence pairs are estimated using the corpus collected with the method described in Chapter 4. The resulting corpus is called UdeM-GcCorp.

Each entry in the final corpus has five elements: English sentence, French sentence, English tags, French tags, and word alignment. Table 3.1 shows an excerpt from the corpus; only relatively short phrases are shown for ease of display.

3.2.2 Tagset

Both the English and the French sentences are manually tagged with a comprehensive tag set. The tag set contains 13 top-level tags that expand to 183 English tags and 254 French tags. Although French has more tags, due to gender and number differentiation, a reasonable mapping can be established between the 13 English and French top-level tags, which are shown in Table 3.2. Sub-tags give more detailed, language-specific information. For example, Table 3.3 shows the different sub-tags of the AdjQ tag. Table 3.4 shows the distribution of tags per word in the manually tagged corpus. Most words have unique tags in both English and French. The French text has more unique words than the English text due to its more complex morphology. It is worth noting that ambiguous words, i.e. words that have more than one tag, tend to be more frequent than unambiguous words. According to [Chanod and Tapanainen, 1995], the most frequent ambiguous words are usually corpus-independent. These words usually belong to special categories such as determiners, prepositions, pronouns and conjunctions. Similar results were obtained for English when examining the Brown corpus. In a test set selected randomly from a French corpus, [Chanod and Tapanainen, 1995] found that 54% of the words were ambiguous.
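For illustration only, one entry of the prepared corpus, with the five elements listed above, might be represented as follows; the class and field names are ours and are not part of the original implementation:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class CorpusEntry:
        """One aligned pair of the prepared corpus (Section 3.2.1)."""
        english_words: List[str]            # tokenized English sentence
        french_words: List[str]             # tokenized French sentence
        english_tags: List[str]             # one manual POS tag per English word
        french_tags: List[str]              # one manual POS tag per French word
        alignment: List[Tuple[int, int]]    # (English word index, French word index) links

    # A shortened entry in the spirit of the second row of Table 3.1.
    entry = CorpusEntry(
        english_words=["the", "result", "is", "..."],
        french_words=["il", "en", "résulte", "..."],
        english_tags=["Dete", "NomC", "Verb", "..."],
        french_tags=["Pron", "Pron", "Verb", "..."],
        alignment=[(0, 0), (1, 1), (1, 2), (2, 2)],
    )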

3.3 Crosslingual System Configuration and Tagging

Our general framework is described in the block diagram given in Figure 3.1. The framework assumes that a linguistic tagger which assigns linguistic tags to words in the source language already exists.


Table 3.1  Excerpt from the training corpus. Notice that, since the alignment is estimated from data, not all alignments are correct.

Entry 1
  English sentence: taking these applications and proposals at face value , that would still not provide for efforts in the province of québec or in atlantic canada .
  Manual English tags: taking/Verb these/Dete applications/NomC and/ConC proposals/NomC at/Prep face/NomC value/NomC ,/Punc that/Pron would/Verb still/Adve not/Adve provide/Verb for/Prep efforts/NomC in/Prep the/Dete province/NomC of/Prep québec/NomP or/ConC in/Prep atlantic/AdjQ canada/NomP ./Punc
  French sentence: même si l' on prenait ces demandes et ces propositions au sérieux , cela ne couvrirait ni le québec ni les provinces de l' atlantique .
  Manual French tags: même/Adve si/ConS l'/Pron on/Pron prenait/Verb ces/Dete demandes/NomC et/ConC ces/Dete propositions/NomC au/Adve sérieux/Adve ,/Punc cela/Pron ne/Adve couvrirait/Verb ni/ConC le/Dete québec/NomP ni/ConC les/Dete provinces/NomC de/Prep l'/Dete atlantique/NomC ./Punc
  Alignment: taking/prenait these/si these/ces applications/demandes and/on and/et proposals/propositions at/au face/sérieux value/sérieux ,/, that/cela would/cela still/même not/ne provide/ne for/ne efforts/le in/le the/le province/québec of/de québec/québec or/ni or/ni in/les in/de atlantic/l' atlantic/provinces atlantic/l' atlantic/atlantique canada/atlantique ./.

Entry 2
  English sentence: the result is a 1.5 per cent reduction in canadian revenues from the petroleum industry .
  Manual English tags: the/Dete result/NomC is/Verb a/Dete 1.5/Quan per/Prep cent/NomC reduction/NomC in/Prep canadian/AdjQ revenues/NomC from/Prep the/Dete petroleum/NomC industry/NomC ./Punc
  French sentence: il en résulte une réduction de 1,5 p. 100 des recettes canadiennes provenant de l' industrie pétrolière .
  Manual French tags: il/Pron en/Pron résulte/Verb une/Dete réduction/NomC de/Prep 1,5/Quan p./NomC 100/NomC des/Dete recettes/NomC canadiennes/AdjQ provenant/Verb de/Prep l'/Dete industrie/NomC pétrolière/AdjQ ./Punc
  Alignment: the/il result/en result/résulte is/résulte a/une 1.5/1,5 per/de cent/p. cent/100 reduction/réduction in/des canadian/canadiennes revenues/recettes revenues/provenant from/provenant from/de the/l' petroleum/l' petroleum/pétrolière industry/industrie ./.


Table 3.2  Core Part of Speech Tags

  Tag    Description
  AdjQ   Adjective
  Adve   Adverb
  ConC   Conjunction
  ConS   Condition
  Dete   Determiner
  NomC   Common Noun
  NomP   Proper Noun
  Ordi   Ordinal Number
  Prep   Preposition
  Pron   Pronoun
  Punc   Punctuation
  Quan   Quantity
  Verb   Verb

Table 3.3  English and French sub-tags of AdjQ

  English tags: AdjQ, AdjQ-Compar, AdjQ-Super
  French tags:  AdjQ, AdjQ-femi-plur, AdjQ-masc-sing, AdjQ-abbr, AdjQ-femi-sing, AdjQ-masc-plur

Table 3.4  Distribution of tags per word in the UdeM corpus

                              English   French
  Words with 1 tag            5895      7920
  Words with 2 tags           505       422
  Words with 3 tags           47        77
  Words with 4 and more tags  16        21
  Total unique words          6436      8440


Many important linguistic taggers, such as part-of-speech taggers, named entity extractors, and morphological analyzers conform to this description. It also assumes that the source and target languages under consideration are related, so that both reasonable mappings between their tag sets and reasonable word alignments between aligned sentences in bi-text corpora can be established. This is a fair assumption, for example, for English and French, but not for English and Arabic. The approach can be divided into two components: crosslingual system configuration and the actual assignment of tags to the input target language sentence. The steps involved in these two components, with particular application to the POS tagging problem, are described below.

3.3.1 Crosslingual System Configuration

Fig. 3.1  The structure of the tagging system (block diagram components: parallel corpus of English sentences e and French sentences f; source language tagger producing the tags Te of e; word alignment a; tag projection producing the tags Tf of f; input target sentence F; similarity measure; tag engine; tag consensus; output: tags of F)

Crosslingual system configuration, as depicted in Figure 3.1, is based on the existence of a sentence-aligned bilingual text corpus. There are three steps. First, a linguistic tagger that has been configured in the source language is used to assign linguistic tags to the


source language sentences in the parallel corpus. In the experimental study described in Section 3.2.1, English sentences were manually tagged according to POS categories. Second, a word-level alignment is obtained between source and target language parallel sentences. In the POS setup, the words of the English and French pairs are aligned as described in Section 3.2.1. As a prerequisite of word alignment, sentence alignment was also performed as described in Section 3.2.1. Finally, crosslingual tag projection is performed to map the source language tags to the target language. The projection is done by assigning the tag of every source language word in the corpus to the target language word that is aligned to that source language word. For example, in Table 3.1 the French word "même" in the first sentence is given the tag "Adve" in the projection because it is aligned to the English word "still", which is tagged as "Adve". When a target language word is aligned to several source language words, the alignment between the target language word and the source language word with the most similar position in the sentence is chosen. For example, in the same sentence the French word "cela" is aligned to both English words "that" and "would", which have the tags "Pron" and "Verb" respectively. However, "cela" is closer in position to "that" than to "would"; therefore, the projection tags it with "Pron". Table 3.5 shows the complete projection for the two sentences of Table 3.1. The projection process can be thought of as a function that tags every word in the target language. We will denote this function tag(f, w), where f is the target language sentence and w is the word being tagged. It is clear that tags resulting from direct projection are very noisy. The next section describes how to improve on this initial noisy tagging.
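The projection just described can be sketched as follows, assuming the CorpusEntry representation from Section 3.2; the nearest-relative-position tie-breaking is one straightforward reading of the rule above, not necessarily the exact heuristic of the implementation:

    def project_tags(entry):
        """Project English tags onto the French words of one corpus entry (the tag(f, w) function)."""
        n_en, n_fr = len(entry.english_words), len(entry.french_words)
        projected = ["UKN"] * n_fr                       # unaligned words stay unknown
        for j in range(n_fr):
            links = [i for (i, jj) in entry.alignment if jj == j]
            if not links:
                continue
            # When several English words are aligned to this French word, keep the one
            # whose relative position in its sentence is closest (e.g. "cela" -> "that").
            best = min(links, key=lambda i: abs(i / n_en - j / n_fr))
            projected[j] = entry.english_tags[best]
        return projected

    # project_tags(entry) returns one projected tag per French word, as in Table 3.5.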

3.3.2 Assigning Tags to Target Language Sentences

Let the input sentence be f̃ and suppose the position of the word to be tagged in f̃ is k. The tagger should choose the best tag t from the tagset T using the corpus C:

    t = \arg\max_{t \in T} P(t \mid \tilde{f}, k)        (3.1)

The approach relies on every corpus pair participating in the overall score:

    t = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} P(t, f^i, e^i \mid \tilde{f}, k)        (3.2)


Table 3.5  Tag Projection Example

  French sentence: même si l' on prenait ces demandes et ces propositions au sérieux , cela ne couvrirait ni le québec ni les provinces de l' atlantique .
  Projected French tags: même/Verb si/NomC l'/Prep on/Adve prenait/Verb ces/Dete demandes/NomC et/ConC ces/Dete propositions/NomC au/Verb sérieux/NomC ,/Punc cela/Pron ne/Adve couvrirait/NomC ni/ConC le/Dete québec/NomP ni/ConC les/Prep provinces/AdjQ de/Prep l'/Prep atlantique/AdjQ ./Punc

  French sentence: il en résulte une réduction de 1,5 p. 100 des recettes canadiennes provenant de l' industrie pétrolière .
  Projected French tags: il/Dete en/NomC résulte/Quan une/Dete réduction/NomC de/UKN 1,5/Quan p./Prep 100/NomC des/AdjQ recettes/NomC canadiennes/AdjQ provenant/Prep de/UKN l'/Dete industrie/NomC pétrolière/NomC ./Punc

As a notation convention, subscripts will be used to denote individual words in a sentence and superscripts will be used to index sentences in the parallel corpus. For example, f_i is the ith word in the sentence under consideration, f^k is the kth target language sentence in the corpus, and f^k_i is the ith word in the kth target language sentence in the corpus. Applying Bayes' rule to the previous equation:

    t = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} P(t, f^i, e^i \mid \tilde{f}, k)        (3.3)
      = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} \frac{P(\tilde{f} \mid f^i, e^i, t, k)\, P(t, f^i, e^i, k)}{P(\tilde{f})}        (3.4)
      = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} P(\tilde{f} \mid f^i, e^i, t, k)\, P(t, f^i, e^i, k)        (3.5)

The term P(f̃|f^i, e^i, t, k) represents the degree of similarity between the input sentence and the corpus pair, given the word and tag under consideration. Since the input sentence is in the target language, we can assume that this similarity depends only on the target language corpus sentence, and we can rewrite this equation as:


    t = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} P(\tilde{f} \mid f^i)\, P(t, f^i, e^i, k).        (3.6)

By assuming a uniform prior over tags t in source language sentences e^i, and that the input sentence f̃ is independent of the tag t, the corpus source sentence e^i and the word position k, the previous equation can be written as:

    t = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} P(\tilde{f} \mid f^i)\, P(t \mid f^i, e^i, k)\, P(f^i \mid e^i, k)\, P(e^i, k),        (3.7)

assuming that the position k is independent of the source language sentence e^i and that all source language sentences have equal prior probability, the previous equation can be written as:

    t = \arg\max_{t \in T} \sum_{(f^i, e^i) \in C} P(\tilde{f} \mid f^i)\, P(t \mid f^i, e^i, k)\, P(f^i \mid e^i, k).        (3.8)

Equation 3.8 has three components:

1. P(f̃|f^i): this term represents the similarity between the input sentence f̃ and the corpus sentence f^i. We will call this term the similarity component.

2. P(t|f^i, e^i, k): this term represents how the corpus pair (e^i, f^i) contributes to the score of the word f̃_k being tagged with t. Since the projected tags will be used to evaluate this term, we will call it the projection component.

3. P(f^i|e^i, k): this term provides a measure of the degree of correspondence between the corpus sentences f^i and e^i. This correspondence depends on the word being tagged as well, which means that f^i should contain the word f̃_k for this score to be positive. This term is called the alignment component.

Since estimating the previous probabilities exactly is difficult for any finite parallel corpus, the following approximations will be used here (a code sketch combining them is given after Figure 3.2):

1. The similarity component: the measure used to determine the similarity between the target language input sentence and the sentences in the parallel corpus is the standard term frequency–inverse document frequency (tf-idf) measure [Ricardo and Berthier, 1999].


The measure has been adapted here for sentence retrieval, so that tf-idf(i, j) is obtained from the term frequency tf(i, j) of word i in sentence j and the inverse document frequency idf(i) of word i as follows:

    \mathrm{tf}(i, j) = \frac{\mathrm{freq}(i, j)}{\max_l \mathrm{freq}(l, j)}        (3.9)
    \mathrm{idf}(i) = \log\left(\frac{N}{n(i)}\right)        (3.10)
    \text{tf-idf}(i, j) = \mathrm{tf}(i, j) \cdot \mathrm{idf}(i).        (3.11)

In the above equations, N is the total number of target language sentences in the corpus, n(i) is the number of these sentences that contain at least one occurrence of word i, and freq(i, j) is the frequency of word i in sentence j. When measuring the similarity between two sentences f^k and f^l, each sentence is represented as a vector whose components are the tf-idf values of the words in the sentence:

    \vec{v}_k = [\text{tf-idf}(1, k), \text{tf-idf}(2, k), \ldots, \text{tf-idf}(|f^k|, k)]        (3.12)
    \vec{v}_l = [\text{tf-idf}(1, l), \text{tf-idf}(2, l), \ldots, \text{tf-idf}(|f^l|, l)]        (3.13)

The angle between the two vectors \vec{v}_k and \vec{v}_l is used to assess the similarity between the two sentences f^k and f^l:

    \cos(\vec{v}_k, \vec{v}_l) = \frac{\vec{v}_k \cdot \vec{v}_l}{|\vec{v}_k|\, |\vec{v}_l|}        (3.14)

When the cosine is equal to 0, the two vectors are orthogonal and the two sentences are not at all similar. When the cosine is equal to 1, the two vectors are parallel and the two sentences are very similar. Using this measure as the similarity function, we approximate the probability P(f̃|f^i) as:

    P(\tilde{f} \mid f^i) \approx \cos(\vec{v}_{\tilde{f}}, \vec{v}_{f^i})        (3.15)

Table 3.6 displays an example for the French sentence "le nombre de femmes qui ont un emploi a augmenté". The five most similar sentences from the corpus are shown, along with their similarity scores. The scores are normalized so that the best match always receives a score of 1.

Table 3.6  Similar French sentences to "le nombre de femmes qui ont un emploi a augmenté" from a subset of the Hansard corpus

  Similarity score   French sentence
  1.0      comme le nombre de femmes qui ont un emploi rémunéré ou qui élèvent seules leurs enfants a augmenté , il est devenu de plus en plus pressant d' avoir des services de garde d' enfants adéquats .
  0.5771   le nombre de femmes travaillant à leur propre compte a augmenté de 118 p. 100 entre 1975 et 1986 .
  0.4328   m. mulroney : le nombre des femmes engagées dans des postes de gestion ou d' administration a augmenté de 36 p. 100 entre 1984 et 1987 .
  0.4100   c' est dire que le nombre d' emplois occupés par des femmes a augmenté à un rythme presque deux fois plus rapide que le nombre d' emplois occupés par des hommes .
  0.2937   m. mulroney : cela porte à 15,5 p. 100 le taux de croissance de l' emploi chez les femmes , ce qui est spectaculaire quand on sait qu' il a été de 8,8 p. 100 chez les hommes pendant la même période .

2. The projection component: if the word f̃_k does not exist in f^i, or if it exists but its projected tag is not t, then the pair (e^i, f^i) has nothing to do with tagging this word and the score is 0. However, if the word f̃_k exists in f^i and its projected tag is t, then the score that the pair (e^i, f^i) gives is 1. There is a small caveat to this: when the word f̃_k occurs more than once in f^i, the score is the proportion of occurrences of this word that are tagged with t. This is important since long sentences may contain repeated words (mostly common words):

    P(t \mid f^i, e^i, k) \approx \begin{cases} 0 & \text{if } \tilde{f}_k \notin f^i \text{ or } \mathrm{tag}(f^i, \tilde{f}_k) \neq t \\ \dfrac{\mathrm{count}_{f^i}(\mathrm{tag}(f^i, \tilde{f}_k) = t)}{\mathrm{count}_{f^i}(\tilde{f}_k)} & \text{otherwise} \end{cases}        (3.16)

3. The alignment component: since the sentences in the corpus were aligned as described in Section 3.2.1, we know that every target language sentence f^i is aligned to some corresponding source language sentence e^i. The score should reflect the existence of the word f̃_k in f^i:

    P(f^i \mid e^i, k) \approx \begin{cases} 0 & \text{if } \tilde{f}_k \notin f^i \\ 1 & \text{if } \tilde{f}_k \in f^i \end{cases}        (3.17)

Figure 3.2 demonstrates an example of the tagging algorithm.

Fig. 3.2  Tagging example: tagging the word "entre" in the sentence "il entre dans le salon"

  Input sentence: il entre dans le salon, k = 2
  Possible tags for the word "entre": Adve, Verb and Prep
  Best tag according to the algorithm: Verb

  Corpus sentence                                                          Similarity
  nous avons entre/Adve autres mis le programme                            0.162
  mais entre/Adve temps , je donne la parole au député de Davenport .      0.35
  le Canada entre/Verb dans le xxie siècle comme un pays plus fort .       0.518
  dépôt d' un protocole d' entente entre/Prep le Canada et la Pologne      0.174
  l' accord de libre-échange entre/Prep le Canada et les États-Unis        0.206
  comme le gouvernement cherche par le projet de loi . . .                 0
  Other sentences that do not contain the word "entre"                     0

  Score of Adve = 0.512    Score of Verb = 0.518    Score of Prep = 0.480
  Scores of other tags = 0
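To make the combination of the three approximations concrete, the condensed sketch below scores candidate tags for one word of an input sentence according to Equation 3.8, using Equations 3.15–3.17. It assumes tokenized French corpus sentences and their projected tag sequences as input; it is an illustration under those assumptions, not the actual implementation:

    import math
    from collections import Counter, defaultdict

    class CrossLingualTagger:
        """Sketch of the scoring procedure of Equation 3.8 (illustrative only)."""

        def __init__(self, corpus_french, corpus_projected_tags):
            self.sentences = corpus_french           # tokenized French corpus sentences
            self.tags = corpus_projected_tags        # projected tag sequences (Section 3.3.1)
            self.n = len(corpus_french)
            self.df = Counter(w for s in corpus_french for w in set(s))
            self.vectors = [self._vector(s) for s in corpus_french]

        def _vector(self, sentence):
            # Equations 3.9-3.11: tf-idf value of every word of the sentence.
            freq = Counter(sentence)
            max_f = max(freq.values())
            return {w: (f / max_f) * math.log(self.n / self.df[w])
                    for w, f in freq.items() if w in self.df}

        @staticmethod
        def _cosine(u, v):
            # Equation 3.14: cosine similarity between two sparse vectors.
            dot = sum(x * v.get(w, 0.0) for w, x in u.items())
            nu = math.sqrt(sum(x * x for x in u.values()))
            nv = math.sqrt(sum(x * x for x in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        def tag(self, sentence, k):
            # Score every candidate tag for word k of the input sentence (Equation 3.8).
            word = sentence[k]
            v_in = self._vector(sentence)
            scores = defaultdict(float)
            for f_i, t_i, v_i in zip(self.sentences, self.tags, self.vectors):
                if word not in f_i:                   # alignment component, Eq. 3.17
                    continue
                sim = self._cosine(v_in, v_i)         # similarity component, Eq. 3.15
                hits = [t for w, t in zip(f_i, t_i) if w == word]
                for t in set(hits):                   # projection component, Eq. 3.16
                    scores[t] += sim * hits.count(t) / len(hits)
            return max(scores, key=scores.get) if scores else "UKN"

For the example of Figure 3.2, tagger.tag("il entre dans le salon".split(), 1) (the 0-based index of "entre") would accumulate scores for Adve, Verb and Prep from the corpus sentences that contain "entre".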

3.4 Experiments

To evaluate the French language POS tagging performance, using the described approach, the manually tagged UdeM corpus is split into a training set of 4000 pairs and a testing set of


144 pairs. The training set has 6189 and 8250 unique English and French words respectively. The testing set has 1072 and 943 unique English and French words respectively. The corpus is divided into 3700 training sentence pairs and 450 French test sentences. Each French sentence in the test set is automatically tagged, using the independent training set for crosslingual system configuration as described in Section 3.3.1. If the tag cannot be estimated, an unknown tag "UKN" is reported. This happens when the word to be tagged does not exist in the training corpus, or when the input sentence is not similar to any sentence in the corpus. When more than one tag has the same score, the first one according to its location in the tag set is selected. Every tagged word has three possibilities when compared to the correct tagging:

• The tag is correct.
• The tag is incorrect.
• The word was not tagged, i.e. it was tagged with "UKN".

To evaluate the performance of the system, we computed the precision and recall:

1. Precision: percentage of correct tags among the detected tags,

    \mathrm{precision} = \frac{\#\mathrm{correct}}{\#\mathrm{correct} + \#\mathrm{incorrect}}        (3.18)

2. Recall: percentage of tagged words,

    \mathrm{recall} = 1 - \frac{\#\mathrm{UKN}}{\#\mathrm{correct} + \#\mathrm{incorrect} + \#\mathrm{UKN}}        (3.19)
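These two measures are straightforward to compute from the three counts; the counts in the example call below are illustrative only:

    def precision_recall(n_correct, n_incorrect, n_unknown):
        """Equations 3.18 and 3.19."""
        precision = n_correct / (n_correct + n_incorrect)
        recall = 1 - n_unknown / (n_correct + n_incorrect + n_unknown)
        return precision, recall

    # precision_recall(800, 150, 50) -> (0.842..., 0.95)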

The experiment is repeated for all three corpora described in Section 3.2.1. In addition, for evaluation purposes, we evaluated the precision and recall using the simple projection procedure described in Section 3.3.1. We call this experiment direct projection and we perform it for all three corpora, because projection depends on word alignments which may differ across these corpora. The direct projection experiment requires that the input sentence has a translation in the target language. Although this requirement is satisfied in our evaluation (because the test sentences are taken from the parallel corpus) this is not required in the final system. Our motivation to perform experiments with directly projected


tags is first to see how successful our system is in eliminating noise associated with direct projection, and second, to get an idea about the performance of direct projection when a true translation exists. This is in some sense the best possible case for direct projection.

3.5 Results and Discussion

Figure 3.3 summarizes the results of the experiments. Table 3.7 shows the performance of the taggers across tags.

Fig. 3.3  Results of the experiments: (a) precision and (b) recall for the UdeM, UdeM-Hansards and UdeM-GcCorp corpora (black bars: performance of the tagger; white bars: performance of the direct projection method)

All the different experiments gave a high recall, more than 97%. However, recall when using direct projection was higher than recall when using the tagging system. This tradeoff between precision and recall is a typical behavior of taggers and information retrieval systems. The precision of the tagging system exceeds that of the direct projection method by 12% to 18%. This means that the tag assignment algorithm (Section 3.3.2) managed to reduce the noise caused by direct projection. However, the performance still lags behind the typical performance of part of speech taggers, which is above 96% precision [Jurafsky and Martin, 2000a]. Many reasons might have affected the performance.

Table 3.7  Performance measures obtained across different tags and corpora

                     Precision                            Recall
  Tag    Count   UdeM    UdeM-Hansards  UdeM-GcCorp   UdeM    UdeM-Hansards  UdeM-GcCorp
  AdjQ   193     73.42   77.07          73.42         81.87   81.35          81.87
  Adve   181     59.66   72.61          63.64         97.24   86.74          97.24
  ConC   68      100.0   100.0          100.0         100.0   100.0          100.0
  ConS   104     62.5    81.25          79.81         100.0   76.92          100.0
  Dete   608     62.17   58.77          76.64         100.0   85.36          100.0
  NomC   755     86.31   88.81          90.92         89.01   88.74          89.01
  NomP   104     85.71   100.0          78.57         80.77   79.81          80.77
  Ordi   12      8.33    11.11          33.33         100.0   75.0           100.0
  Prep   488     81.35   92.22          95.49         100.0   86.89          100.0
  Pron   295     72.2    81.2           66.78         100.0   90.17          100.0
  Punc   169     89.35   99.4           99.41         100.0   98.82          100.0
  Quan   104     72.04   87.1           86.17         89.42   89.42          90.38
  Verb   552     89.96   88.08          83.65         86.59   86.59          86.41

  Mean           77.63   82.85          83.32         97.96   97.96          97.94
  Std. Dev.      22.87   23.77          17.92         7.55    7.40           7.52


First, unsupervised learning is always more difficult than supervised learning. In addition, there were some particularities about part of speech tag projection. Not all tags have the same meaning across English and French, so projecting tags, even when the alignment is 100% correct, might not give correct results. Moreover, the alignment process itself assumes that a correspondence between words exists. In reality, there are cases in which two translations of the same sentence have completely different wording and structure. Finally, the assumptions made to facilitate the estimation of the parameters in Section 3.3.2 were probably too strong, which may reduce the performance of the system. Aligning words using an external corpus gave an improvement over the self-aligned corpus. This is understandable, since the quality of the alignment depends on the data available to estimate the parameters. It was surprising, though, that the performance when aligning with the automatically collected GcCorp was slightly better than the performance when aligning with the Hansards. This small difference might be due to the more varied content in GcCorp compared to the Hansards; GcCorp spans different government sites and includes different content types, whereas the Hansards contains only parliamentary discussions. The performance did not increase for all tags. For some tags GcCorp performed worse than Hansards. For example, more than 20% precision was lost for proper nouns (NomP). Again, the difference in content between the two corpora is suspected to be the main reason (the UdeM tagged corpus is a subset of the Hansards). This means that proper nouns that exist in UdeM will also exist in Hansards, and will therefore be aligned properly. This is not necessarily true for GcCorp. Looking at the performance of the tagger across different tags, the tags can be roughly divided into three sets:

1. Tags with poor performance (less than 70% precision), such as Adve, Ordi, and Pron: it is interesting to note that the POS tags where the system performs poorly are those POS categories whose definitions are not consistent across languages. For example, the assignment of a word to the adjective class in French is sometimes different from the assignment made in English. While the word "last" is an adjective in English, its French translation "dernier" is a noun most of the time. This suggests that word/tag correspondence is not always consistent across languages. Another reason for the poor performance is inconsistent translation style. For instance, ordinals (Ordi tags) can be written out completely or abbreviated. For example,


translations of the word "first" are both "premier" and "1er". The word "first" may almost always be translated as "1er", so if the tagger is trying to tag "premier", it will not be able to find a suitable match for it. The same is true in the other direction, and for other ordinal numbers.

2. Tags with mediocre performance (70%–90% precision), such as ConS, Dete, Quan and Verb. In this class there is a big variance in performance. We suspect that errors produced by the tagging algorithm are the reason for such variance. Some tags like Verb can be associated with words that also admit other tags, like Noun. This may be another contributing factor to the variance.

3. Tags with good performance (more than 90% precision), such as ConC, NomC, Prep and Punc: most of these tags have a consistent definition across English and French. For example, nouns (NomC) have a clear correspondence. The same is true for conjunctions (ConC) and punctuation (Punc). The system also did well with prepositions (Prep), although not as well as with the previous categories.

It is also worth mentioning that the projection function provided most of the information required for obtaining good tagging. In a separate experiment, the similarity component was omitted and overall precision and recall were reduced by less than 2%. However, since the testing pairs are extracted from the same corpus as the training pairs, it can be assumed that many similarities exist between the training and testing sets. This may reduce the effect of not incorporating the similarity component in the calculation.


Chapter 4
Automatic Acquisition of Parallel Corpora

A parallel corpus is an important resource for multilingual natural language processing tasks. This chapter presents a tool for automatically acquiring parallel corpora by exploiting multi-lingual resources available on the web. Potential bilingual sites are crawled and a relevant portion of their content is downloaded. Pages that are translations of each other are identified and text is extracted from these pages. Then sentence alignment is performed to create the sentence pairs of the final corpus. The quality of the corpus is evaluated against other parallel corpora. We also describe an interface to access the corpus and locate specific content.

4.1 Multilingual Resources on the Web

The Web is home to many bilingual and multilingual sites, e.g. corporate web sites, the documents of international organizations, and the parliamentary proceedings of bilingual countries. Since these organizations often publish content in more than one language, it is possible to obtain the same information expressed in two different languages and extract from it a parallel corpus. In fact, some of the mentioned organizations maintain their whole web site in more than one language. This presents an opportunity to obtain parallel corpora cheaply and without much effort. It is worth noting that although many modern bilingual resources exist, they lack the


alignment information that makes a true corpus. For example, the documents of a specific organization may be scattered across its site without direct correspondence between their different parts. Figure 4.1 shows two web pages from the "Collection Canada" website. These pages are translations of each other. However, they need to be processed in order to be identified as paired pages and to have their content aligned. The system presented in this chapter obtains bilingual content as a first step, and then processes this content in order to obtain alignment information.

Fig. 4.1 Example of a pair of pages from a bi-lingual site and their corresponding content

4.2 Target Task Domain

Canada is a bilingual nation and it is required that all official legislation exist in both official languages, English and French. A large English-French parallel corpus of the Canadian Hansards containing four million aligned sentences is available from the Linguistic Data Consortium (LDC) [Roukos et al., 1995]. Furthermore, the paid subscription service TransSearch provides a continually updated parallel Hansards corpus that is maintained


by professional transcribers (http://transsearch.iro.umontreal.ca). The application we investigate here is augmenting the available English-French parallel corpora by indexing the Government of Canada's "gc.ca" domain and automatically building a bi-text corpus from the paired bilingual pages contained in this domain. The underlying techniques are general and can be adapted to other language pairs with minor modifications.

4.3 System Description

Figure 4.2 shows the system block diagram. The process for acquiring domain-specific bi-text corpora from the World Wide Web has six main steps.

Fig. 4.2  The structure of the corpus acquisition system (identifying bilingual sites → crawling → identifying paired pages → text extraction and filtering → aligning sentences → evaluation)

After identifying a potential bilingual site, a web crawl is performed in which a relevant portion of the online content is downloaded. Then, pairs of pages are identified as being translations of one another. The text is extracted from the downloaded web pages and sentence alignment is performed on those pages that have been identified as bilingual pairs. Finally, the quality of the corpus is evaluated against other parallel corpora. The rest of this section discusses the important issues associated with the implementation of these steps.



4.3.1 Identifying Bilingual Sites

The choice of the seed sites for the crawl depends on a number of issues. These include prior knowledge of the relevance of the websites to the target domain and prior knowledge of the bilingual structure and initial quality of the translations in the paired pages. For example, our work is focused on English and Canadian French content. The standard corpus of English and Canadian French is the Hansards corpus [Roukos et al., 1995]. Both of these issues influenced the choice of the Government of Canada domain "gc.ca" for this work. We wanted to build an English/Canadian French parallel corpus and evaluate it against the Hansards. The resulting parallel corpus will be referred to as GcCorp.

4.3.2 Crawling

In the crawling step, a portion of the website is downloaded to a local machine. A crawler traverses the website starting from the seed web pages. It downloads the pages, extracts the links from the pages, and repeats the process on these links until a termination condition, like a search depth or corpus size limit, is reached. There are various heuristics that can be used to evaluate the richness of a website before actually performing the web crawl. For example, to compare the abundance of content in the two domains "gc.ca", which is bilingual (French and English), and "ilo.org", which is tri-lingual (English, French and Spanish), Google is used by submitting the queries "site:gc.ca" and "site:ilo.org". Google returns an estimate of the total number of search results. This estimate can be used to approximate the exact number [Goo, 2008]. The first query gives 11,900,000 hits whereas the second query gives 955,000 hits. This suggests that there are approximately 6,000,000 and 640,000 English-French pairs in "gc.ca" and "ilo.org" respectively. Other examples of how the structure of a web site can impact the search include the fact that some websites hide direct links between pages using script-based navigation. Others require logging in or the existence of cookies before showing their content. These issues must all be taken into account when configuring the web crawler. Excellent open source crawlers exist, e.g. Heritrix, which was developed and used by the Internet Archive [Callison-Burch et al., 2004]. The Heritrix crawler was used in the system described here. Many parameters should be tuned to ensure correct behavior of the crawler. Some of the important parameters are:


1. Seed list: a list of starting web sites. In this work the page "www.gc.ca" was used.

2. Domain settings: specify which domains to crawl, usually using regular expressions. This is important to direct the crawling process. For example, in this task only pages in the domain "gc.ca" are crawled.

3. Depth level: indicates the maximum number of links to be traversed. This parameter controls the number of crawled levels and thus the number of crawled pages. In this work the depth level was 5.

4. File types: specify which files are to be downloaded (HTML, PDF, text files, etc.). In this work, only HTML files were considered, although it is possible to download other types and parse them.

5. Load settings: control the number of requests that can be sent to the web site server in a given amount of time. Care must be taken in order not to overwhelm a website with too many requests.
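The role of these settings can be illustrated with a toy breadth-first crawler. The sketch below is not Heritrix and not the configuration used in this work; the regular expressions, limits and delay are placeholders:

    import re
    import time
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def crawl(seeds, domain_pattern=r"\.gc\.ca$|^gc\.ca$", max_depth=5, delay=1.0):
        """Breadth-first crawl restricted to one domain and a maximum link depth."""
        seen, pages = set(seeds), {}
        queue = deque((url, 0) for url in seeds)
        while queue:
            url, depth = queue.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    if "text/html" not in resp.headers.get("Content-Type", ""):
                        continue                      # file-type setting: HTML only
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            pages[url] = html
            time.sleep(delay)                         # load setting: be polite to the server
            if depth >= max_depth:                    # depth-level setting
                continue
            for href in re.findall(r'href="([^"#]+)"', html):
                link = urljoin(url, href)
                if link not in seen and re.search(domain_pattern, urlparse(link).netloc):
                    seen.add(link)                    # domain setting
                    queue.append((link, depth + 1))
        return pages

    # pages = crawl(["http://www.gc.ca/"])   # seed list, as in the text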

The crawling progress should be monitored to ensure correct behavior and results. For instance, we examined the logs of the crawler once per day during the crawling process. This is important in order to ensure that the crawling is proceeding properly without creating network congestion.

4.3.3 Identifying Paired Pages

Several methods exist for identifying paired pages. These methods range from simple heuristic procedures to more advanced techniques based on modeling of the page content.

1. URL structure: URLs sometimes indicate the language of the page. Regular-expression-based rules can be written to identify paired pages, for example:

• http://www.gc.ca/intro_e.html
• http://www.gc.ca/intro_f.html

However, URL patterns are site dependent and can change inside the same site. A thorough examination of a web site should be done to uncover all its patterns. In our task, we noticed that several conventions were used in the “gc.ca” domain. For


instance, in the pair mentioned above, "e" or "f" is added to the URL depending on the language of the page. Since we noticed the use of many different conventions, we decided to use a more sophisticated approach to locate paired pages.

2. Link structure: Hyperlinks contain valuable information since they denote the explicit intent of the web page author. In bilingual sites it is common for a page to provide a link to the same page in the other language. The links in a page can be parsed, and only links with anchor text containing a language name are considered as links between paired pages. For example, in the page http://www.gc.ca/copyright_e.html the link to its paired page formatted in this way would be: Version Française. (One complexity of using link structure is that a web site changes continuously, so its structure does not remain fixed. However, for the purpose of this study, we assume that crawling was done in a relatively short time span and that the structure of the website did not change during the crawling process.)

3. Content: If no hints are available to detect paired pages, the content of the pages can be used to perform this task. A translation model can be used to compute the likelihood that two pages are translations of each other, P(Page1|Page2) or P(Page2|Page1). An approximation can be used by evaluating the translation probabilities of the page titles, P(Title1|Title2) or P(Title2|Title1). The problem with this process is that it is computationally expensive, because it needs a number of comparisons that is quadratic in the number of pages in order to compute the probability for every page combination.

For the gc.ca domain, we found that the link structure was sufficient to detect paired pages. Pages were required to link to each other in order to be identified as bilingual pairs. For example, the English page should link to the French page with "Version Française" as anchor text, and the French page should link back to the English page with "English Version" as anchor text. This helped remove many noisy pairs. Using this strategy, 30% of the downloaded pages were identified as not having a bilingual counterpart and were thus rejected.
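A rough sketch of this mutual-link test is given below, using the pages dictionary returned by the crawler sketch above. The anchor-text patterns are illustrative simplifications and would need to cover the actual markup variants found on the site:

    import re
    from urllib.parse import urljoin

    EN_TO_FR = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>\s*version\s+fran', re.I)
    FR_TO_EN = re.compile(r'<a[^>]+href="([^"]+)"[^>]*>\s*english\s+version', re.I)

    def find_pairs(pages):
        """Keep only page pairs that link to each other with language-switch anchor text."""
        pairs = []
        for url, html in pages.items():
            m = EN_TO_FR.search(html)
            if not m:
                continue
            fr_url = urljoin(url, m.group(1))
            fr_html = pages.get(fr_url, "")
            back = FR_TO_EN.search(fr_html)
            if back and urljoin(fr_url, back.group(1)) == url:
                pairs.append((url, fr_url))          # mutual links -> accept the pair
        return pairs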



4.3.4 Text Extraction and Filtering

Most of the crawled content is in the form of HTML pages. An HTML page contains text and markup. It is necessary to preprocess these pages to obtain a list of sentences that will then be input to the sentence alignment process. Extracting the sentence text from an HTML page involves unifying the encoding of all pages to Unicode, removing all the hyperlinks, discarding all text except the text in the body of the page, reducing multiple line feeds, and finally discarding all markup except line and page breaks, which are used to extract paragraphs. Punctuation is then used to extract sentences. In this work we used the period to delineate sentences. Care has been taken to account for cases where a period is not a sentence boundary, as in numbers and acronyms. The regular expression API in Java provides matchers for word and non-word boundaries (http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html).

4.3.5 Aligning Sentences

The extracted sentences are aligned using Moore's alignment algorithm [Moore, 2002], as described in Section 2.6.3. The alignment algorithm is performed in two passes. In the first pass, an approximate alignment is computed using the Gale & Church algorithm [Gale and Church, 1991]. In the second pass, an IBM-1 translation model is trained and used to filter the alignments. The accuracy of the algorithm can be controlled by varying a threshold on the minimum value of the translation probability between two sentences for those sentences to be considered an alignment pair. The default value of this parameter, which is 0.5, was used.
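As an illustration of the period-based sentence splitting of Section 4.3.4 (the actual system relied on Java's regular expression API), a rough Python equivalent might look like the following; the boundary pattern is a simplification that ignores some abbreviation and numbering cases:

    import re

    # Split on a period that ends a sentence: not part of a number (e.g. "1.5"), not a
    # single-letter abbreviation (e.g. "p. 100"), and followed by whitespace and an
    # upper-case letter, or by the end of the text.
    SENT_BOUNDARY = re.compile(r'(?<!\d)(?<!\b[A-Za-z])\.(?=\s+[A-ZÀ-Þ]|\s*$)')

    def split_sentences(paragraph):
        """Period-based sentence splitting (sketch; assumes the paragraph ends with a period)."""
        parts = SENT_BOUNDARY.split(paragraph)
        return [p.strip() + "." for p in parts if p.strip()]

    # split_sentences("Il pleut. La réduction est de 1,5 p. 100 aujourd'hui.")
    # -> ["Il pleut.", "La réduction est de 1,5 p. 100 aujourd'hui."]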

4.4 Experimental Results

A parallel corpus was built from 400,000 pages obtained by crawling the Government of Canada "gc.ca" domain. The number of pages was determined by a preset crawl depth of five levels. A total of 1,000,000 unique sentence pairs were extracted, containing over 40,000,000 words. This section describes the characteristics of this corpus and evaluates the quality of the resulting sentence alignment. It is informative to compare the characteristics of the GcCorp corpus collected here with the four million sentence pair LDC Hansards corpus [Roukos et al., 1995] and the two million sentence pair English-French Europarl corpus [Koehn, 2005]. These corpora were compared in terms of the length of the pairs, measured by the number of words per pair. The histogram in Figure 4.3 shows the distribution of pair length in words in the three



corpora. Since the corpora differ in size, we took a random sample of 980,000 pairs from each of the three corpora.

Fig. 4.3  Distribution of pair length in the three corpora: Europarl, Hansards and GcCorp (x-axis: length of pairs in words, binned from 0–25 up to >175; y-axis: number of pairs in hundreds of thousands; black: Europarl, gray: Hansards, white: GcCorp)

Compared to the Europarl corpus, GcCorp contains significantly more medium-length pairs (25–50 words). The reason for this is the filtering algorithm (Section 4.3.4), in which short text that occurs in titles, headers and other places in the web page is discarded. This high concentration of words in the 25–50 category affects the number of words in the other categories as well. GcCorp contains fewer words than the Hansards in all other categories except >175, where the text extraction described in Section 4.3.4 could not break up some long sentences. The quality of the corpus can potentially be evaluated in two ways. First, the quality of the sentence alignment can be evaluated directly. This has the effect of evaluating both the performance of the system described in this chapter and the quality of the original translations on the bilingual paired pages. The number of misalignments was estimated


manually by sampling the corpus and computing the percentage of misaligned sentence pairs. Manual evaluation was done by sampling 1000 sentence pairs from the corpus and counting the misalignments. 52 of the 1000 pairs were identified by human inspection as being misaligned. This corresponds to an overall sentence alignment accuracy of 94.8%. A second way of evaluating the corpus is to use it, either by itself or augmented with other parallel corpora, to train models for a particular task like SMT or bilingual projection of linguistic tags. SMT performance can be evaluated using standard measures like BLEU [Papineni et al., 2001] on a baseline test set and compared with the SMT performance obtained when training on other corpora. Automatic evaluation was done by running the Portage system [Ueffing et al., 2007] on the ACL WPT shared task [WPT, 2005]. This task provides reference translations for a subset of the Europarl corpus. Three translation models were trained and compared: an in-domain model with 55,000 English-French sentence pairs from Europarl, and two out-of-domain models trained with 100,000 sentence pairs from Hansards and GcCorp respectively. BLEU scores were computed by running the test set on the three models. The results are shown in Table 4.1. As described in Section 2.6.5, the BLEU measure is used to evaluate the performance of an SMT system. The higher the BLEU score, the better the SMT system. In this experiment the SMT system is the same, but it is trained on a different corpus each time. The quality of the corpus thus reflects directly on the quality of the system.

Table 4.1  BLEU scores of the translation models on the WPT task

          50,000 Europarl pairs   100,000 Hansards pairs   100,000 GcCorp pairs
  BLEU    0.36                    0.14                     0.18

The Europarl corpus is best, as expected, because in this case the data used for training includes the testing data. So this figure is provided as an upper bound. While the Hansards and GcCorp trained models both produced SMT performance that was below that of the matched condition, it was surprising that training with the automatically acquired GcCorp actually resulted in better performance than training with the Hansards corpus. This improved performance might be related to the more varied content in GcCorp compared to Hansards; GcCorp


spans different government sites and includes different content types, whereas the Hansards is only about parliamentary discussions.

4.5 Accessing the Corpus

Although the goal of collecting the parallel corpus was to use it in NLP applications, we wanted to be able to examine specific parts of the corpus and to allow others to access it as well. We implemented a web interface that allows users to enter a specific query in English or French, and retrieve pairs that are similar to this query from the corpus. We used Lucene [Gospodnetic and Hatcher, 2004], an open source information retrieval engine, to search the corpus. The user submits the query along with the language to be searched. Corpus sentences are then ranked against the query using the tf-idf measure described in Section 3.3.1. Figure 4.4 demonstrates using the interface to query for "quebec tourism" in English. The score shown is the tf-idf similarity score. This interface allowed us to examine the corpus during its development. It could also be used by human translators to help satisfy their information needs. A translator might need to uncover the meaning of a specific word in a certain context, or locate its translation within a complete phrase. This interface provides a convenient tool for these two tasks.


Fig. 4.4  Interfacing the collected corpus from a browser: (a) query "quebec tourism"; (b) results


Chapter 5
Conclusions & Future Work

In this thesis we described an approach to implement cross-lingual taggers using parallel corpora that are automatically collected from multi-lingual web sites. In Chapter 3 we presented a new approach for obtaining a cross-lingual implementation of linguistic taggers. The approach was evaluated by implementing a part-of-speech tagger for French. Despite the fact that the experiments were performed with a small manually tagged parallel corpus (3700 pairs), French POS tags were estimated with over 80% precision and 95% recall, without the need for translating input sentences into English. In [Chanod and Tapanainen, 1995], the performance of rule-based and statistical POS taggers for French is compared. The study reported a precision rate of 99.5% for the rule-based tagger, and 95% for the stochastic tagger trained on labeled French texts. These two systems outperform the system presented in this thesis, but they are clearly more expensive to implement. The rule-based system requires human experts to craft the tagging rules and the stochastic system requires a labeled corpus for training. The advantage of our approach is that it does not require any resource in the target language. Therefore, it can easily be ported to other languages and tagging tasks. Our approach has great potential for implementing linguistic taggers in resource-poor languages starting from their counterparts in resource-rich languages. Future work includes applying the approach to implement other types of taggers, such as named entity extractors and sentence bracketers, applying it to other language pairs, employing a more robust tag projection mechanism, and exploring similarity measures other than tf-idf.


Although the system presented in this thesis was evaluated on a part-of-speech tagging task for French texts, starting from a part-of-speech tagger for English, the approach is general and can be applied to other tagging tasks and language pairs. It only requires that the tags are word-level and that a mapping between the tag sets of the two languages can be established. Experiments on other tasks and languages are necessary to verify the scalability of the system.

The tf-idf score was used in this work to measure the similarity of sentences. Although well suited for information retrieval tasks, this score discards word-order information, i.e., it treats the sentence as a bag of words. Other measures that preserve context by taking word order into account, such as the n-gram distance described in Section 2.6.5, could be employed; a toy sketch contrasting the two kinds of measures is given at the end of this chapter. It would be interesting to investigate how performance correlates with the distance measure used.

In Chapter 4 we described a system that can automatically assemble parallel corpora from the Internet, and its use in obtaining a large English-French corpus from the "gc.ca" domain. In terms of quality, our system was able to produce a corpus with few misalignments. The corpus is comparable to other parallel corpora in terms of sentence length and according to standardized statistical machine translation measures. The promising results obtained when evaluating this corpus on a standardized MT task suggest that automatically acquired corpora may contribute significantly to improvements in data-driven multilingual NLP applications. Taken together, the two contributions will reduce the time and cost of developing taggers and compiling parallel corpora for many resource-poor languages. We hope that this will have a positive impact on many languages and help linguists, computer scientists and professional translators.
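The sketch referred to above follows. It contrasts a bag-of-words view with an order-sensitive n-gram overlap score; the exact n-gram distance of Section 2.6.5 may be defined differently, so this is only an assumed, simplified variant.

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(a, b, n=2):
    """Fraction of n-grams of sentence a that also occur in sentence b.
    Unlike a bag-of-words score, this is sensitive to word order."""
    grams_a, grams_b = ngrams(a, n), set(ngrams(b, n))
    if not grams_a:
        return 0.0
    return sum(1 for g in grams_a if g in grams_b) / len(grams_a)

s1 = "john saw mary".split()
s2 = "mary saw john".split()
# Identical bags of words, yet no bigram is shared: the score is 0.0.
print(ngram_overlap(s1, s2))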


References

[ACL, 2003] (2003). ACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond. URL: http://www.cs.unt.edu/~rada/wpt/.

[ACL, 2005] (2005). ACL 2005 Workshop on Building and Using Parallel Texts: Data-Driven Machine Translation and Beyond. URL: http://www.statmt.org/wpt05/.

[WPT, 2005] (2005). WPT 2005 Shared Task: Exploiting Parallel Texts for Statistical Machine Translation. URL: http://www.statmt.org/wpt05/mt-shared-task.

[Goo, 2008] (2008). How does Google calculate the number of results? URL: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=70920.

[Brill, 1995] Brill, E. (1995). Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543–565.

[Brown et al., 1993] Brown, P. F., Pietra, V. J. D., Pietra, S. A. D., and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311.

[Callison-Burch et al., 2004] Callison-Burch, C., Talbot, D., and Osborne, M. (2004). Statistical machine translation with word- and sentence-aligned parallel corpora. In Proceedings of ACL 2004 (Meeting of the Association for Computational Linguistics), pages 175–182.

[Chanod and Tapanainen, 1995] Chanod, J.-P. and Tapanainen, P. (1995). Tagging French: comparing a statistical and a constraint-based method. In Proceedings of the Seventh Conference of the European Chapter of the Association for Computational Linguistics, pages 149–156, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[Chen and Nie, 2000] Chen, J. and Nie, J.-Y. (2000). Automatic construction of parallel English-Chinese corpus for cross-language information retrieval. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 21–28, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

[Clark et al., 2003] Clark, S., Curran, J. R., and Osborne, M. (2003). Bootstrapping POS taggers using unlabelled data. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 49–55, Morristown, NJ, USA. Association for Computational Linguistics.

[Dempster et al., 1977] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38.

[Désilets, 2007] Désilets, A. (2007). Translation wikified: How will massive online collaboration impact the world of translation? Opening keynote for Translating and the Computer, London, UK.

[Francis and Kucera, 1979] Francis, W. N. and Kucera, H. (1979). Brown corpus manual. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island, US.

[Gale and Church, 1991] Gale, W. A. and Church, K. W. (1991). A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pages 177–184, Morristown, NJ, USA. Association for Computational Linguistics.

[Goodman, 2000] Goodman, J. (2000). Putting it all together: language model combination. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '00), volume 3, pages 1647–1650.

[Gospodnetic and Hatcher, 2004] Gospodnetic, O. and Hatcher, E. (2004). Lucene in Action. Manning Publications.

[Jelinek et al., 1977] Jelinek, F., Mercer, R. L., Bahl, L. R., and Baker, J. K. (1977). Perplexity – a measure of the difficulty of speech recognition tasks. Journal of the Acoustical Society of America, 62:S63. Supplement 1.

[Jurafsky and Martin, 2000a] Jurafsky, D. and Martin, J. H. (2000a). Speech and Language Processing, pages 298–299. Prentice-Hall, Inc.

[Jurafsky and Martin, 2000b] Jurafsky, D. and Martin, J. H. (2000b). Speech and Language Processing, pages 300–303. Prentice-Hall, Inc.

[Jurafsky and Martin, 2000c] Jurafsky, D. and Martin, J. H. (2000c). Speech and Language Processing, pages 303–307. Prentice-Hall, Inc.

[Khudanpur and Kim, 2004] Khudanpur, S. and Kim, W. (2004). Contemporaneous text as side-information in statistical language modeling. Computer Speech & Language, 18(2):143–162.

[Knight and Koehn, 2007] Knight, K. and Koehn, P. (2007). Tutorial of statistical machine translation. In Machine Translation Summit XI.

[Koehn, 2005] Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In MT Summit.

[Moore, 2002] Moore, R. C. (2002). Fast and accurate sentence alignment of bilingual corpora. In AMTA '02: Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users, pages 135–144, London, UK. Springer-Verlag.

[Nagao, 1984] Nagao, M. (1984). A framework of a mechanical translation between Japanese and English by analogy principle. In Proceedings of the International NATO Symposium on Artificial and Human Intelligence, pages 173–180, New York, NY, USA. Elsevier North-Holland, Inc.

[Och et al., 2001] Och, F. J., Ueffing, N., and Ney, H. (2001). An efficient A* search algorithm for statistical machine translation. In Proceedings of the Workshop on Data-Driven Methods in Machine Translation, pages 1–8, Morristown, NJ, USA. Association for Computational Linguistics.

[Papineni et al., 2001] Papineni, K., Roukos, S., Ward, T., and Zhu, W. (2001). BLEU: a method for automatic evaluation of machine translation.

[RALI-Laboratory, 1996] RALI-Laboratory (1996). Hand-tagged Hansard corpus. Université de Montréal.

[Resnik, 1999] Resnik, P. (1999). Mining the web for bilingual text. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 527–534, Morristown, NJ, USA. Association for Computational Linguistics.

[Ricardo and Berthier, 1999] Ricardo and Berthier (1999). Modern Information Retrieval, pages 29–30. ACM Press / Addison-Wesley.

[Roukos et al., 1995] Roukos, S., Graff, D., and Melamed, D. (1995). Canadian Hansards. LDC Catalog. URL: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95T20.

[Simard, 2003] Simard, M. (2003). Translation spotting for translation memories. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts, pages 65–72, Morristown, NJ, USA. Association for Computational Linguistics.

[Ueffing et al., 2007] Ueffing, N., Simard, M., Larkin, S., and Johnson, H. (2007). NRC's PORTAGE system for WMT 2007. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 185–188. Association for Computational Linguistics.

[Weischedel et al., 1993] Weischedel, R., Schwartz, R., Palmucci, J., Meteer, M., and Ramshaw, L. (1993). Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics, 19(2):361–382.

[Yarowsky and Ngai, 2001] Yarowsky, D. and Ngai, G. (2001). Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1–8, Morristown, NJ, USA. Association for Computational Linguistics.

[Yarowsky et al., 2001] Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In HLT '01: Proceedings of the First International Conference on Human Language Technology Research, pages 1–8, Morristown, NJ, USA. Association for Computational Linguistics.
