Using Collocation Segmentation to Augment the Phrase Table

Carlos A. Henríquez Q.∗, Marta R. Costa-jussà†, Vidas Daudaravicius‡, Rafael E. Banchs†, José B. Mariño∗

∗TALP Research Center, Universitat Politècnica de Catalunya, Barcelona, Spain
{carlos.henriquez,jose.marino}@upc.edu

†Barcelona Media Innovation Center, Barcelona, Spain
{marta.ruiz,rafael.banchs}@barcelonamedia.org

‡Faculty of Informatics, Vytautas Magnus University, Kaunas, Lithuania
[email protected]

Abstract

This paper describes the 2010 phrase-based statistical machine translation system developed at the TALP Research Center of the UPC in cooperation with BMIC and VMU. In phrase-based SMT, the phrase table is the main tool in translation. It is created by extracting phrases from an aligned parallel corpus and then computing translation model scores with them. Performing a collocation segmentation over the source and target corpora before the alignment causes different and larger phrases to be extracted from the same original documents. We performed this segmentation and used the union of this phrase set with the phrase set extracted from the non-segmented corpus to compute the phrase table. We present the configurations considered and also report the results obtained on internal and official test sets.

1 Introduction

The TALP Research Center of the UPC (Universitat Politècnica de Catalunya), in cooperation with BMIC (Barcelona Media Innovation Center) and VMU (Vytautas Magnus University), participated in the Spanish-to-English WMT task. Our primary submission was a phrase-based SMT system enhanced with POS tags, and our contrastive submission was an augmented phrase-based system using collocation segmentation (Costa-jussà et al., 2010), which is mainly a way of introducing new phrases into the translation table. This paper presents the description of both systems together with the results we obtained in the evaluation task. It is organized as follows: first, Sections 2 and 3 present a brief description of phrase-based SMT, followed by a general explanation of collocation segmentation; Section 4 presents the experimental framework, the corpus used and a description of the different systems built for the translation task, and ends by showing the results we obtained over the official test set; finally, Section 5 presents the conclusions drawn from the experiments.

2 Phrase-based SMT

This approach to SMT performs the translation by splitting the source sentence into segments and assigning to each segment a bilingual phrase from a phrase table. Bilingual phrases are translation units that contain source words and target words, e.g. < unidad de traducción | translation unit >, and have different scores associated with them. These bilingual phrases are then sorted in order to maximize a linear combination of feature functions. Such a strategy is known as the log-linear model (Och and Ney, 2003) and is formally defined as:

    \hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)    (1)

where the h_m are different feature functions with weights \lambda_m. The two main feature functions are the translation model (TM) and the target language model (LM). Additional models include POS target language models, lexical weights, word penalty and reordering models, among others.
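To make the decision rule concrete, the following minimal Python sketch scores competing translation candidates with a weighted sum of feature functions, as in Eq. (1). The feature functions, weights and candidate strings are illustrative placeholders only, not the models or tuned weights of the actual system.

```python
# Illustrative sketch of the log-linear decision rule in Eq. (1).
# The feature functions below are hypothetical stand-ins; in a real system
# they would be the translation model, language model, penalties, etc.

def h_tm(source, candidate):
    # placeholder translation-model log-score
    return -2.3 if candidate == "translation unit" else -3.0

def h_lm(source, candidate):
    # placeholder target language-model log-score
    return -4.1 if candidate.startswith("translation") else -4.5

def h_word_penalty(source, candidate):
    # number of target words; its negative weight acts as a word penalty
    return len(candidate.split())

FEATURES = [h_tm, h_lm, h_word_penalty]
WEIGHTS = [0.3, 0.5, -0.1]  # lambda_m, normally tuned on a development set

def score(source, candidate):
    """Weighted sum of feature functions: sum_m lambda_m * h_m(e, f)."""
    return sum(w * h(source, candidate) for w, h in zip(WEIGHTS, FEATURES))

def decode(source, candidates):
    """Pick the candidate e that maximizes the log-linear score of Eq. (1)."""
    return max(candidates, key=lambda e: score(source, e))

print(decode("unidad de traducción", ["translation unit", "unit of translation"]))
```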

3 Collocation segmentation

Collocation segmentation is the process of detecting boundaries between collocation segments within a text (Daudaravicius and Marcinkeviciene, 2004). A collocation segment is a piece of text between boundaries. The boundaries are established in two steps using two different measures: the Dice score and an Average Minimum Law (AML).

The Dice score is used to measure the association strength between two words. It has been used before in the collocation compiler XTract (Smadja, 1993) and in the lexicon extraction system Champollion (Smadja et al., 1996). It is defined as follows:

    Dice(x; y) = \frac{2 f(x, y)}{f(x) + f(y)}    (2)

where f(x, y) is the frequency of co-occurrence of x and y, and f(x) and f(y) are the frequencies of occurrence of x and y anywhere in the text. It gives high scores when x and y occur in conjunction.
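As a small illustration of Eq. (2), the sketch below computes Dice scores for adjacent word pairs from raw token counts. Treating co-occurrence as adjacency of the two words is a simplifying assumption made for this example.

```python
from collections import Counter

def dice_scores(tokens):
    """Dice(x; y) = 2 f(x, y) / (f(x) + f(y)) for every adjacent pair (x, y),
    where f(x, y) is the adjacency count and f(x), f(y) are unigram counts."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(x, y): 2.0 * f_xy / (unigrams[x] + unigrams[y])
            for (x, y), f_xy in bigrams.items()}

# toy example sentence, purely for illustration
tokens = "the translation system builds the translation table".split()
for (x, y), s in sorted(dice_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(f"{x} {y}\t{s:.3f}")
```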

The first step then establishes a boundary between two adjacent words when the Dice score is lower than a threshold t = exp(-8). Such a threshold was established following the results obtained in (Costa-jussà et al., 2010), where an integration of this technique and an SMT system was performed over the Bible corpus.

The second step of the procedure uses the AML. It defines a boundary between words x_{i-1} and x_i when:

    \frac{Dice(x_{i-2}; x_{i-1}) + Dice(x_i; x_{i+1})}{2} > Dice(x_{i-1}; x_i)    (3)

That is, the boundary is set when the Dice value between words x_{i-1} and x_i is lower than the average of the preceding and following values.
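Putting the two steps together, here is a compact sketch of the segmentation procedure as we read it: a boundary is placed where the Dice score falls below the threshold, or where the AML condition of Eq. (3) holds. It reuses pairwise scores such as those produced by the previous sketch and glosses over how the original tool treats sentence boundaries.

```python
import math

def segment(tokens, dice, threshold=math.exp(-8)):
    """Split a token sequence into collocation segments.
    `dice` maps adjacent word pairs to their Dice scores (see sketch above).
    Edge handling is a simplification of ours, not the original tool's."""
    def d(i, j):
        return dice.get((tokens[i], tokens[j]), 0.0)

    boundaries = set()
    for i in range(1, len(tokens)):
        if d(i - 1, i) < threshold:
            boundaries.add(i)                                  # step 1: threshold
        elif 1 < i < len(tokens) - 1 and \
                (d(i - 2, i - 1) + d(i, i + 1)) / 2 > d(i - 1, i):
            boundaries.add(i)                                  # step 2: AML, Eq. (3)

    segments, start = [], 0
    for b in sorted(boundaries) + [len(tokens)]:
        segments.append(" ".join(tokens[start:b]))
        start = b
    return segments
```

Joining each resulting segment into a single token (for example with underscores) yields the "segments as words" representation used later in Section 4.2.2.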

4 Experimental Framework

All systems were built using Moses (Koehn et al., 2007), a state-of-the-art toolkit for phrase-based SMT. For preprocessing Spanish, we used FreeLing (Atserias et al., 2006), an open-source library of natural language analyzers. For English, we used TnT (Brants, 2000) and Moses' tokenizer. The language models were built using SRILM (Stolcke, 2002).

4.1 Corpus

This year, the translation task provided four different sources to collect corpora for the Spanish-English pair. Bilingual corpora included version 5 of the Europarl corpus (Koehn, 2005), the News Commentary corpus and the United Nations corpus. Additional English data was available from the News corpus. The organizers also allowed the use of the English Gigaword Third and Fourth Edition, released by the LDC. For development and internal test, the test sets from the 2008 and 2009 translation tasks were available.

For our experiments, we selected as training data the union of the Europarl and the News Commentary corpora. Development was performed with a section of the 2008 test set, and the 2009 test set was selected as internal test. We deleted all empty lines, removed pairs that were longer than 40 words, either in Spanish or English, and also removed pairs whose ratio between the number of words was bigger than 3.
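The filtering just described is simple enough to state directly in code; the snippet below is a sketch of those three rules (empty lines, the 40-word limit, the length ratio of 3), not the exact scripts used to prepare the corpus.

```python
def keep_pair(src, tgt, max_len=40, max_ratio=3.0):
    """Return True if a sentence pair survives the filters described above:
    non-empty on both sides, at most 40 words per side, length ratio <= 3.
    Thresholds are taken from the text; the implementation is illustrative."""
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False
    if len(s) > max_len or len(t) > max_len:
        return False
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio

def filter_corpus(src_lines, tgt_lines):
    """Yield only the sentence pairs kept by keep_pair."""
    for src, tgt in zip(src_lines, tgt_lines):
        if keep_pair(src, tgt):
            yield src, tgt
```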

As a preprocessing step, all corpora were lower-cased and tokenized. The Spanish corpus was tokenized and POS-tagged using FreeLing, which splits clitics from verbs and also separates words like "del" into "de el". In order to build a POS target language model, we also obtained POS tags from the English corpus using the TnT tagger. Statistics of the selected corpus can be seen in Table 1.

Corpora                          Spanish      English
Training      sent             1,180,623    1,180,623
              running words   26,454,280   25,291,370
              vocabulary         118,073       89,248
Development   sent                 1,729        1,729
              running words       37,092       34,774
              vocabulary           7,025        6,199
Internal test sent                 2,525        2,525
              running words       69,565       65,595
              vocabulary          10,539        8,907
Official test sent                 2,489            -
              running words       66,714            -
              vocabulary          10,725            -

Table 1: Statistics for the training, development and test sets.

It is important to notice that neither the United Nations nor the Gigaword corpus was used for bilingual training. Nevertheless, the English part of the United Nations corpus and the monolingual News corpus were used to build the language model of our systems.

4.1.1 Unknown words

We analyzed the content of the internal and official test sets and realized that they both contained many words that were not seen in the training data. Table 2 shows the number of unknown words found in both sets, classified according to their POS.

                 Internal test   Official test
Adjectives             137              72
Common nouns            369             188
Proper nouns            408           2,106
Verbs                   213             128
Others                  119             168
Total                 1,246           2,662

Table 2: Unknown words found in the internal and official test sets.

On average, we may expect an unknown word every two sentences in the internal test and more than one per sentence in the official test set. It can also be seen that most of those unknown words are proper nouns, representing 32% and 79% of the unknown sets, respectively. Common nouns were the second most frequent type of unknown words, followed by verbs and adjectives.

4.2 Systems

We submitted two different systems to the translation task: first, a baseline using the training data mentioned before; and then an augmented system, where the baseline-extracted phrase list was extended with additional phrases coming from a segmented version of the training corpus. We also considered an additional system built with two different decoding paths: a standard path from words to words and POS, and an alternative path from stems to words and POS on the target side. In the end, we did not submit this system to the translation task because it did not provide better results than the previous two on our internal test.

The set of feature functions used includes: source-to-target and target-to-source relative frequencies, source-to-target and target-to-source lexical weights, word and phrase penalties, a target language model, a POS target language model, and a lexicalized reordering model (Tillman, 2004).

4.2.1 Considering stems as an alternate decoding path

Using Moses' framework for factored translation models, we defined a system with two decoding paths: one decoding path using words, and the other using stems in the source language and words in the target language. Both decoding paths had only a single translation step. The possibility of using multiple alternative decoding paths was developed by Birch et al. (2007).

This system tried to address the problem of unknown words. Because Spanish is morphologically richer than English, this alternative decoding path allowed the decoder to translate words that were not seen in the training data but shared the same root with other known words.
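Moses implements this behaviour natively through factored decoding paths, so the snippet below is only a conceptual illustration of why stems help with unseen inflected forms: when the word-based table has no entry, the lookup backs off to a stem-based one. The toy stemmer and tables are invented for the example.

```python
# Conceptual sketch only: Moses realizes this as two factored decoding paths;
# here we just mimic the fallback behaviour for a single word.

word_table = {"coche": "car"}     # word-based phrase table (toy)
stem_table = {"coche": "car"}     # stem-based table, keyed by stem (toy)

def toy_stem(word):
    """Placeholder stemmer: strip a plural 's' (a real system uses a proper stemmer)."""
    return word[:-1] if word.endswith("s") else word

def translate_word(word):
    """Prefer the word-based entry; otherwise back off to the stem-based one."""
    return word_table.get(word) or stem_table.get(toy_stem(word), word)

print(translate_word("coche"))    # seen word      -> car
print(translate_word("coches"))   # unseen plural  -> car (via the stem)
print(translate_word("tren"))     # fully unknown  -> copied through
```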

4.2.2 Expanding the phrase table using collocation segmentation

In order to build the augmented phrase table with the technique described in Section 3, we segmented each language of the bilingual corpus independently and then, using the collocation segments as words, aligned the corpus and extracted the phrases from it. Once the phrases were extracted, the segments of each phrase were split again into words to obtain standard phrases. Finally, we used the union of these phrases and the phrases extracted from the baseline system to compute the final phrase table. A diagram of the whole procedure can be seen in Figure 1.

[Figure 1: Example of the expansion of the phrase table using collocation segmentation. New phrases added by the collocation-based system are marked with **.]

The objective of this integration is to add new phrases to the translation table and to enhance the relative frequency of the phrases that were extracted by both methods.
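As a rough sketch of how the union can influence the model, consider pooling the phrase pairs extracted by the two runs and re-estimating the source-to-target relative frequency. The toy data and the assumption that the two extractions are simply concatenated are ours; the real system recomputes the full set of Moses phrase-table scores, not just this one.

```python
from collections import Counter

def relative_frequencies(phrase_pairs):
    """phrase_pairs: list of (source_phrase, target_phrase) extraction events.
    Returns P(target | source) estimated by relative frequency, the standard
    phrase translation score."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    return {(src, tgt): count / src_counts[src]
            for (src, tgt), count in pair_counts.items()}

# Union of the two extractions: phrases from the baseline run and phrases
# obtained after collocation segmentation are pooled, so pairs found by both
# methods are counted twice and their relative frequency is boosted.
baseline_phrases  = [("la mesa", "the table"), ("la mesa", "the board")]
segmented_phrases = [("la mesa", "the table")]   # illustrative toy data
print(relative_frequencies(baseline_phrases + segmented_phrases))
```

In the toy example, "the table" is found by both extractions, so its relative frequency rises from 1/2 to 2/3 while "the board" drops accordingly, which mirrors the boosting effect described above.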

4.2.3 Language model interpolation

Because SMT systems are trained on a bilingual corpus, they end up highly tied to the domain that corpus belongs to. Therefore, when the documents we want to translate belong to a different domain, additional domain adaptation techniques are recommended when building the system. Those techniques usually employ additional corpora that correspond to the domain we want to translate from.

The test set for this translation task comes from the news domain, but most of our bilingual corpora belonged to a political domain, the Europarl. Therefore, we used the additional monolingual corpus to adapt the language model to the news domain.

The strategy used followed the experiment performed last year in (Fonollosa et al., 2009). All language models were of order five and used modified Kneser-Ney discounting and interpolation. First, we built three different language models according to their domain: Europarl, United Nations and News; then, we obtained the perplexity of each language model over the News Commentary development corpus; next, we used compute-best-mix to obtain the weights for each language model that minimize the global perplexity. Finally, the models were combined using those weights. We used SRILM during the whole process.

In our experiments, all systems used the resulting language model; therefore, the differences observed in our results were caused only by the translation model.
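The system relies on SRILM's compute-best-mix for this step. Purely as an illustration of what such a tool computes, the sketch below estimates linear interpolation weights with a small EM loop over per-word probabilities on a development set; the numbers are invented, and the real inputs would come from the three n-gram models queried on the News Commentary development corpus.

```python
def mixture_weights(model_probs, iters=50):
    """EM estimation of linear interpolation weights for several language
    models: model_probs[m][i] is the probability model m assigns to the i-th
    development-set word; the returned weights maximize the likelihood of the
    weighted mixture, i.e. minimize its perplexity on that set."""
    n_models = len(model_probs)
    n_words = len(model_probs[0])
    weights = [1.0 / n_models] * n_models
    for _ in range(iters):
        totals = [0.0] * n_models
        for i in range(n_words):
            mix = sum(w * p[i] for w, p in zip(weights, model_probs))
            for m in range(n_models):
                totals[m] += weights[m] * model_probs[m][i] / mix
        weights = [t / n_words for t in totals]
    return weights

# Toy example: three "models" scoring a four-word development set.
europarl = [0.02, 0.01, 0.05, 0.02]
un       = [0.01, 0.01, 0.02, 0.01]
news     = [0.04, 0.03, 0.06, 0.05]
print(mixture_weights([europarl, un, news]))
```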

4.3 Results

We present results for the three systems developed this year: first, the baseline, which included all the features mentioned in Section 4.2; then, the system with an alternative decoding path, called baseline+stem; and finally the augmented system, which integrated collocation segmentation into the baseline. Internal test results can be seen in Table 3. Automatic scores provided by the WMT 2010 organizers for the official test can be found in Table 4. All BLEU scores are case-insensitive and tokenized, except for the official test set, which also includes a case-sensitive, non-tokenized score. We obtained a BLEU score of 26.1 and 25.1 for our case-insensitive and case-sensitive outputs, respectively. The highest score was obtained by the University of Cambridge, with 30.5 and 29.1 BLEU points.

                 internal test
baseline             24.25
baseline+stem        23.45
augmented            23.9

Table 3: Internal test results.

                 test    test cased-detok
baseline         26.1          25.1
augmented        26.1          25.1

Table 4: Results from the translation task.

4.3.1 Comparing systems

Once we obtained the translation outputs from the baseline and the augmented system, we performed a manual comparison of them. Even though we did not find any significant advantage of the augmented system over the baseline, the collocation segmentation strategy chose better morphological structures in some cases, as can be seen in Table 5 (only sentence sub-segments are shown):

Original:  sabiendo que está recibiendo el premio
Baseline:  knowing that it receive the prize
Augmented: knowing that he is receiving the prize

Original:  muchos de mis amigos prefieren no separarla.
Baseline:  many of my friends prefer not to separate them.
Augmented: many of my friends prefer not to separate it.

Original:  Los estadounidenses contarán con un teléfono móvil
Baseline:  The Americans have a mobile phone
Augmented: The Americans will have a mobile phone

Original:  es plenamente consciente del camino más largo que debe emprender
Baseline:  is fully aware of the longest journey must undertake
Augmented: is fully aware of the longest journey that need to be taken

Table 5: Comparison between baseline and augmented outputs.

5 Conclusion

We presented two different submissions for the Spanish-English language pair. The language model for both systems was built by interpolating two large out-of-domain language models and one smaller in-domain language model. The first system was a baseline with a POS target language model, and the second one an augmented system that integrates the baseline with collocation segmentation. Results over the official test set showed no difference in BLEU between the two, even though internal results showed that the baseline obtained a better score.

We also considered adding an additional decoding path from stems to words in the baseline, but internal tests showed that it did not improve translation quality either. The high number of unknown words found in Spanish had suggested that considering the simpler form of stems in parallel could help us achieve better results. Nevertheless, a deeper study of the unknown-word set showed us that most of those words were proper nouns, which have no inflection and therefore cannot benefit from stems.

Finally, even though the internal test did not show an improvement with the augmented system, we submitted it as a secondary run, looking for the effect these phrases could have on human evaluation.

Acknowledgment

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 247762, from the Spanish Ministry of Science and Innovation through the Buceador project (TEC2009-14094-C04-01), and from the Juan de la Cierva fellowship program. The authors also want to thank the Barcelona Media Innovation Centre for its support and permission to publish this research.

References

Jordi Atserias, Bernardino Casas, Elisabet Comelles, Meritxell González, Lluís Padró, and Muntsa Padró. 2006. FreeLing 1.3: Syntactic and semantic services in an open-source NLP library. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), ELRA, Genoa, Italy, May.

Alexandra Birch, Miles Osborne, and Philipp Koehn. 2007. CCG supertags in factored statistical machine translation. In StatMT '07: Proceedings of the Second Workshop on Statistical Machine Translation, pages 9-16, Morristown, NJ, USA. Association for Computational Linguistics.

Thorsten Brants. 2000. TnT - a statistical part-of-speech tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA.

Marta R. Costa-jussà, Vidas Daudaravicius, and Rafael E. Banchs. 2010. Integration of statistical collocation segmentations in a phrase-based statistical machine translation system. In 14th Annual Conference of the European Association for Machine Translation.

Vidas Daudaravicius and Ruta Marcinkeviciene. 2004. Gravity counts for the boundaries of collocations. International Journal of Corpus Linguistics, 9:321-348(28).

José A. R. Fonollosa, Maxim Khalilov, Marta R. Costa-jussà, José B. Mariño, Carlos A. Henríquez Q., Adolfo Hernández H., and Rafael E. Banchs. 2009. The TALP-UPC phrase-based translation system for EACL-WMT 2009. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 85-89, Athens, Greece, March. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ACL '07: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177-180, Morristown, NJ, USA. Association for Computational Linguistics.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Machine Translation Summit.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19-51.

Frank A. Smadja, Kathleen McKeown, and Vasileios Hatzivassiloglou. 1996. Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1):1-38.

Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143-177.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), pages 901-904.

Christoph Tillman. 2004. A Unigram Orientation Model for Statistical Machine Translation. In HLT-NAACL.