Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Lattice Decoding Final Presentation
Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Chris Callison-Burch, Ondrej Bojar, Brooke Cowan, Chris Dyer, Hieu Hoang, Richard Zens, Alexandra Constantin, Evan Herbst, Christine Moran
17 August 2006
Schedule
• First session: Overview and toolkit development
– Factored models and confusion network decoding (Koehn, Federico)
– Moses toolkit (Hoang, Dyer, Herbst, Callison-Burch, Bertoldi)
• Second session: Experiments
– Experiments in small data settings (Shen, Bojar, Moran, Cowan)
– Factored models for morphologically rich languages (Dyer, Koehn, Cowan, Constantin)
– Confusion network experiments (Zens)
Accomplishments
• Open source toolkit
– advances the state of the art of statistical machine translation models
– best performance on the European Parliament task
– competitive on IWSLT and TC-STAR
• Factored models
– outperform traditional phrase-based models
– framework for a wide range of models
– integrated approach to morphology and syntax
• Confusion networks
– exploit ambiguous input and outperform 1-best
– enable an integrated approach to speech translation
Phrase-Based Translation
[Figure: phrase segmentation and alignment of "er geht ja nicht nach hause" → "he does not go home"]
• Foreign input is segmented into phrases
– any sequence of words, not necessarily linguistically motivated
• Each phrase is translated into English; phrases may be reordered
• Log-linear model: many feature functions h_i(e, f) with weights λ_i are combined into an overall score Σ_i λ_i h_i(e, f) → easy to extend
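As a minimal sketch (an illustration, not Moses code), the log-linear combination is just a weighted sum of feature function values:

// Minimal sketch: combine feature function values h_i(e, f) with
// weights lambda_i into an overall score.
#include <cstddef>
#include <iostream>
#include <vector>

double LogLinearScore(const std::vector<double>& h,        // h_i(e, f)
                      const std::vector<double>& lambda) { // weights
    double score = 0.0;
    for (std::size_t i = 0; i < h.size(); ++i)
        score += lambda[i] * h[i];   // sum_i lambda_i * h_i(e, f)
    return score;
}

int main() {
    std::vector<double> h = {0.5, -1.2, 2.0};        // hypothetical feature values
    std::vector<double> lambda = {1.0, 0.5, 0.2};    // hypothetical weights
    std::cout << LogLinearScore(h, lambda) << "\n";  // prints 0.3
}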
Translation
• Task: translate this sentence from German into English:
er geht ja nicht nach hause
Translation step 1
• Task: translate this sentence from German into English
• Pick phrase in input, translate: er → he
Translation step 2
• Pick phrase in input, translate: ja nicht → does not
– it is allowed to pick words out of sequence (reordering)
– phrases may have multiple words: many-to-many translation
Translation step 3
• Pick phrase in input, translate: geht → go
Translation step 4
• Pick phrase in input, translate: nach hause → home
• Result: er geht ja nicht nach hause → he does not go home
Translation options
[Table: phrase translation options for each input word and phrase, e.g. er → he / it; geht → is / are / goes / go; ja → yes / , of course; nicht → not / do not / does not / is not; nach → after / to / according to / in; hause → house / home / chamber / at home; nach hause → home / return home / ...]
• Phrase translation tables provide many translation options
• Learned from automatically word-aligned corpora
Translation options
[Table: the same translation options as above]
• The machine translation decoder does not know the right answer
→ search problem, solved by heuristic beam search
Decoding process: precompute translation options
[Figure: input sentence "er geht ja nicht nach hause" with its translation options]
Decoding process: start with initial hypothesis
[Figure: empty initial hypothesis covering no input words]
Decoding process: hypothesis expansion
[Figures: hypotheses expanded step by step, e.g. he / are / it, then goes, does not, go, home, ...]
Decoding process: find best path
[Figure: best-scoring path through the expanded hypotheses]
Statistical machine translation today
• Best performing methods are based on surface word phrases
– use mappings of short chunks of text (mostly 1–3 words)
– sophisticated methods for phrase extraction and modeling (EM algorithm, generative models, discriminative training)
• Translation is based solely on surface forms of words
– no use of explicit syntactic information
– no use of morphological information
• How can we build richer models?
One motivation: morphology
• Current models treat house and houses as completely different words
– training occurrences of house have no effect on learning the translation of houses
– if we only see house, we do not know how to translate houses
– rich morphology (German, Arabic, Finnish, Czech, ...) → many word forms
• A better approach combines evidence for house and houses
– analyze surface word forms into lemma and morphology, e.g.: Haus +plural
– translate lemma and morphology separately, e.g.: Haus → house; +pl → +pl
– generate target surface form, e.g.: house +pl → houses
Factored translation models
• Factored representation of words:
Input: word, lemma, part-of-speech, morphology, word class, ...
Output: word, lemma, part-of-speech, morphology, word class, ...
• Benefits
– generalization, e.g. by translating lemmas, not surface forms
– richer models, e.g. using syntax for reordering and language modeling
Example factored model
• Our example as a factored model: input and output are factored into word, lemma, and morphology
• Translation process broken up into mapping steps
– translation of lemma
– translation of morphology
– generation of word from lemma and morphology
Expansion of input phrase
• Probabilistic mapping steps
– translation step: lemma → lemma, e.g. haus → house, home, chamber, ...
– translation step: morphology → morphology, e.g. single-noun → single-noun, single-pronoun, plural-noun, ...
– generation step: lemma, morphology → word, e.g. house, single-noun → house; house, plural-noun → houses
• Still a phrase model
– translation steps may map phrases: nach hause → home, return home
– generation steps operate on single words
– traditional phrase models are a special case: single-factor models
Computational complexity of mapping steps
• Number of factored expansions may grow exponentially
• Key insights to reduce complexity for a given input sentence:
– expansions can be pre-computed and stored as translation options
– pruning translation options early
• Future work: problems with more complex models need to be addressed
– we had problems using some models with three steps or more
– see student proposals (Hoang, Dyer) for solutions
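A minimal sketch of why the expansions grow (the per-factor option lists below are hypothetical):

// Minimal sketch, not Moses internals: the factored expansions of one
// input word are the cartesian product of its per-factor options.
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lemmas = {"house", "home", "chamber"};
    std::vector<std::string> morphs = {"single-noun", "plural-noun"};

    // 3 x 2 = 6 expansions; with more factors and longer option lists
    // this grows exponentially, hence pruning translation options early.
    for (const std::string& l : lemmas)
        for (const std::string& m : morphs)
            std::cout << l << "|" << m << "\n";   // e.g. house|plural-noun
}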
Spoken Language Translation with Confusion Networks
Marcello Federico, Nicola Bertoldi, Wade Shen, Richard Zens
August 17, 2006
Outline
• Spoken language translation
• Approaches to SLT
• Confusion network decoding
• Computational issues
• Implementation in Moses
• Language model interface
• Other applications of confusion networks
Spoken Language Translation
Translation from speech input is likely more difficult than translation from text input:
• many styles and genres: formal read speech, unplanned speeches, interviews, spontaneous conversations, ...
• less controlled language: relaxed syntax, spontaneous speech phenomena
• automatic speech recognition is prone to errors: possible corruption of syntax and meaning
This work addresses methods to improve the performance of spoken language translation by better integrating speech recognition and machine translation models.
Integrating Speech Recognition and Translation
• Correlation between transcription word error rate and translation quality:
[Plot: BLEU score (38.5–42.5) vs. WER of transcriptions (14–21); BLEU degrades as WER grows]
• Better transcriptions may have been analyzed during ASR decoding but discarded due to lower scores
• Potential for improving translation quality by exploiting more transcription hypotheses generated during ASR.
Statistical Spoken Language Translation
• Let o be the spoken input in the foreign language
• Let F(o) be a set of possible transcriptions of o
Goal: find the best English translation through the approximate criterion:
e* = arg max_e Pr(e | o) ≈ arg max_e max_{f∈F(o)} Pr(e, f | o)
Pr(e, f | o) is computed with a log-linear model incorporating:
• acoustic features, i.e. probabilities that some foreign words are in the input
• linguistic features, i.e. probabilities of foreign and English sentences
• translation features, i.e. probabilities of translating foreign phrases into English
• alignment features, i.e. probabilities for word reordering
[Figure: example ASR word graph]
ASR Word Graph
A very general set of transcriptions F(o) can be represented by a word graph:
• directly computed from the ASR word lattice (e.g. HTK format, lattice-tool)
• provides a good representation of all hypotheses analyzed by the ASR system
• arcs are labeled with words, acoustic and language model probabilities
• paths correspond to transcription hypotheses, for which probabilities can be computed
Approaches to Spoken Language Translation
The previous statistical framework admits several alternative implementations:
• 1-best translation: translate only the most probable hypothesis in the word graph
– pros: very efficient
– cons: no potential to recover from recognition errors in the 1-best transcription
• N-best translation: translate only the N most probable hypotheses in the word graph
– pros: can exploit more accurate transcriptions in the word graph
– cons: N must be large in order to include good transcriptions, and decoding time increases linearly with N
Approaches to Spoken Language Translation
• Transducer: compose the word graph with a translation FSN and apply a transducer algorithm
– pros: straightforward method that permits working on the full word graph
– cons: computationally prohibitive with large-vocabulary tasks and long-range word reordering
• Confusion network: translate a suitable approximation of the word graph
– pros: effectively explores all paths in the word graph, with no problems with reordering
– cons: can only exploit limited information in the word graph
Confusion Network (Mangu 1999)
A confusion network approximates a word graph with a linear network, such that:
• arcs are labeled with words or with the empty word (ε-word)
• arcs are weighted with word posterior probabilities
[Figure: example confusion network, a linear sequence of columns whose arcs are words or ε, weighted by posterior probabilities]
• paths are a superset of those in the word graph
CNs can be conveniently represented as sequences of columns of different depth.
Confusion Network Decoding
Extension of the basic phrase-based decoding step:
• cover some not-yet-covered consecutive columns (span)
• retrieve phrase translations for all paths inside the columns
• compute translation, distortion and target language model scores
Example. Coverage set: 01110...  Path: cancello d'
[Figure: confusion network columns, e.g. era 0.997 / è 0.002 / ε 0.001 | cancello 0.995 / vacanza 0.004 / ε 0.002 | ε 0.999 / la 0.001 | di 0.615 / d' 0.376 / all' 0.005 / l' 0.002 / ε 0.001 | imbarco 0.999 / bar 0.001 | ...]
Confusion Network Decoding
Computational issues:
• the number of paths grows exponentially with the span length
• implies look-up of translations for a huge number of source phrases
• factored models require considering joint translations over all factors (tuples):
– cartesian product of all translations of each single factor
Solutions implemented in Moses:
• source entries of the phrase table are stored with prefix trees
• translations of all possible coverage sets are pre-fetched from disk
• efficiency achieved by incrementally pre-fetching over the span length
• phrase translations over all factors are extracted independently, then translation tuples are generated and pruned by adding a factor each time
Once translation tuples are generated, the usual decoding applies.
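A minimal sketch of the prefix-tree idea (an assumption, not the Moses phrase table): looking up a span of length n+1 extends the node already found for its length-n prefix, which is what makes incremental pre-fetching over the span length cheap.

// Minimal sketch: a prefix tree over source phrases.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<std::string, std::unique_ptr<TrieNode> > children;
    std::vector<std::string> translations;  // payload for complete phrases
};

// Extend a previously fetched node by one more source word; returns
// nullptr if no stored phrase continues this way.
TrieNode* Extend(TrieNode* node, const std::string& word) {
    auto it = node->children.find(word);
    return it == node->children.end() ? nullptr : it->second.get();
}

int main() {
    TrieNode root;
    root.children["nach"].reset(new TrieNode);
    TrieNode* nach = root.children["nach"].get();
    nach->children["hause"].reset(new TrieNode);
    nach->children["hause"]->translations.push_back("home");

    TrieNode* n = Extend(&root, "nach");   // fetch span "nach"
    if (n) n = Extend(n, "hause");         // incrementally extend the span
    return (n && !n->translations.empty()) ? 0 : 1;
}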
Implementation into Moses
• Input format: CN input can be rather large, so it is better to put one word position per line:
Haus 0.1 aus 0.4 Aus 0.4 eps 0.1
der 0.9 eps 0.1
Zeitung 1.0
Each line represents the alternatives with their probabilities.
• Factored confusion networks: alternatives are over the full factor space:
Haus|N 0.1 aus|PREP 0.4 Aus|N 0.4 eps|eps 0.1
der|DET 0.1 der|PREP 0.8 eps|eps 0.1
Zeitung|N 1.0
Notice: a confusion network can be projected over single factors.
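A minimal sketch (an assumption, not the actual Moses reader) of parsing this one-word-position-per-line format:

// Minimal sketch: read one confusion network column per input line,
// with alternating word/probability tokens.
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<std::pair<std::string, float> >;

int main() {
    std::vector<Column> cn;
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        Column col;
        std::string word;
        float prob;
        while (in >> word >> prob)       // e.g. "Haus 0.1 aus 0.4 ..."
            col.push_back(std::make_pair(word, prob));
        if (!col.empty()) cn.push_back(col);
    }
    std::cout << "read " << cn.size() << " columns\n";
}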
Implementation into Moses
Decoding CN with factored models:
• at each step of the search process, a portion of the CN is explored, and translations are looked up for each factor
Features:
• efficiency by pre-filtering the possible translations for each factor
• decoding of confusion networks is completely hidden from the decoder
Other Applications of Confusion Networks
Translation tasks with ambiguous input:
• linguistic annotation for factored models
– avoid hard decisions by linguistic tools; rather, provide alternative annotations with respective scores
– e.g. particularly ambiguous part-of-speech tags
• insertion of punctuation marks missing in the input
– model all possible insertions of punctuation marks in the input
• translation of input similar to that produced by speech recognition
– e.g. OCR output for optical text translation
• ...
Language Model Interface
• Features
– compact binary format for very large language models
– quantization of probabilities (8 bits)
– fast upload of language model from disk
– upload of n-grams on demand
• Comparison with SRI LM Toolkit
– memory: 50% less with large quantized models
– speed: 10% slower in decoding with a 3-gram LM
• Recent work and improvements
– speed-up by directly storing log-probs
– addition of cache memory on the n-gram internal data structure
– analysis of LM score computations by the search algorithm
– caching of probabilities and LM states: the search algorithm requests the same probabilities many times
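A minimal sketch of the probability-caching idea (an assumption, not the IRST LM code): memoize n-gram log-probabilities so repeated requests skip the expensive lookup.

// Minimal sketch: cache n-gram log-probabilities; LookupOnDemand()
// stands in for the real binary-format, on-demand lookup.
#include <string>
#include <unordered_map>

class CachedLM {
public:
    double LogProb(const std::string& ngram) {
        auto it = cache_.find(ngram);
        if (it != cache_.end()) return it->second;  // cache hit
        double lp = LookupOnDemand(ngram);          // expensive path
        cache_[ngram] = lp;
        return lp;
    }
private:
    double LookupOnDemand(const std::string& ngram) {
        return -1.0;  // hypothetical stand-in for the real lookup
    }
    std::unordered_map<std::string, double> cache_;
};

int main() {
    CachedLM lm;
    double a = lm.LogProb("he goes home");  // miss: looked up and cached
    double b = lm.LogProb("he goes home");  // hit: served from the cache
    return a == b ? 0 : 1;
}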
Requests of N-grams by Decoder
[Plot: requests of 3-gram probabilities during the decoding of a single sentence; about 1.6M requests involving about 120K distinct 3-grams]
Conclusions
Implementation work
• efficient on-demand pre-fetching of phrase translations
• tuning of parameters for confusion network decoding
• language model interface and pre-fetching of n-grams
Development of state-of-the-art baselines for SLT
• IWSLT BTEC Chinese-English SLT: submissions to the IWSLT 2006 evaluation
• EPPS Spanish-English SLT: performance comparable with the best TC-STAR systems
Achievements
• SLT decoder more efficient with respect to the current implementations by IRST and MIT/LL
• works with large-data tasks and large confusion networks
• works with factored confusion networks
Engineering Results JHUSWS 2006
Open software, so what?
State of the world, June 2006
• "Black box" decoder (Pharaoh) widely used
– 20+ citations in this year's ACL Proceedings alone
– ubiquitous baseline system
• But... it is difficult to extend
– new features are limited to what can be expressed in the existing phrase-table format (source, target, feature vector)
– many interesting projects require reinventing the wheel just to change one spoke
Software Goals
• Accessibility
• Easy to maintain
• Flexibility
• Easy for distributed team development
• Portability
Accessibility
• Easy to read: "Nothing should be a black box"
– descriptive names
– uniform coding style
void Load(const std::string &fileName,
          FactorCollection &factorCollection,
          FactorType factorType,
          float weight,
          size_t nGramOrder);
• Available immediately
– source code on Sourceforge.net
• Cross-platform compatibility
– Windows, Linux, MacOS X, 64-bit OS
Easy to Maintain
• Modular code
– team development
– object-oriented framework
• Integrated documentation framework
– using Doxygen
– easy-to-maintain Wiki documentation on the Web
Documentation
[Screenshots of the documentation]
Extensibility
• Open architecture designed for extensibility
– architecture matches theoretical descriptions of phrase-based MT models
– feature function evaluation decoupled from search algorithms
• Short ramp-up time for researchers familiar with SMT but not with any particular decoder
– facilitates experimentation with new classes of feature functions
• Modular design
– framework to allow different replacements of all parts of the decoder
– multiple implementations of translation tables, language models, and other model types
Case Study: Lexicalized Reordering
• Very successful model, but implementation not possible with a "black box" decoder
• With Moses, anyone with an idea can try it
• Adding support for LR models to Moses required code changes in four (relatively logical) locations:
– extend the feature-function base class (ScoreProducer) and implement the logic for feature value computation
– enable the model based on configuration
– call to evaluate the feature function when extending a hypothesis
– add the feature values to the n-best list output for tuning algorithms
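A minimal sketch of the first change, with hypothetical class names standing in for the actual ScoreProducer interface:

// Minimal sketch: the shape of a feature function the decoder calls
// when extending a hypothesis. Names are hypothetical stand-ins.
struct Hypothesis;  // opaque decoder state

class FeatureFunction {                 // stand-in for ScoreProducer
public:
    virtual ~FeatureFunction() {}
    virtual float Evaluate(const Hypothesis& prev,
                           const Hypothesis& next) const = 0;
};

class LexicalizedReordering : public FeatureFunction {
public:
    float Evaluate(const Hypothesis& prev,
                   const Hypothesis& next) const {
        // Score the orientation (monotone/swap/discontinuous) of the
        // phrase pair being applied; computation omitted in this sketch.
        return 0.0f;
    }
};

int main() {
    LexicalizedReordering lr;  // registered with the decoder in real code
    (void)lr;
    return 0;
}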
Regression Testing
• Pharaoh scores used as a baseline, updated as models changed (for example, hypothesis recombination based on LM state rather than n-gram order)
• Detailed logging enables strict test coverage for all model types
• Regression test suite was run approximately 3000 times during the workshop
Accomplishments
• Code contributions from every member of the team
• Performance improvements
– Day 1: 5.01 sec/sentence average decoding time
– Today: 1.43 sec/sentence average decoding time
Summary
State of the world, August 2006
• "White box" multi-factored decoder (Moses) available
– drop-in replacement for Pharaoh
• Further experimentation and development anticipated at:
– Aachen, Charles University, Cornell, Edinburgh, IRST, MIT, Lincoln Labs, UMD... and many more.
Code size
• 10,000 lines at the beginning of the workshop
• 16,000 lines now
System tuning
• Log-linear model:
e* = arg max_e Pr(e | f) = arg max_e p_λ(e | f) = arg max_e Σ_i λ_i h_i(e, f)    (1)
• Real-valued feature functions:
– model specific components of the translation process: fluency, adequacy, reordering, ...
– statistical models are estimated on specific training data
• Feature weights:
– balance the ranges of feature scores
– weight the importance of features
– tuned through Minimum Error Training (MET)
Minimum Error Training
• automatic procedure to optimize feature weights
• minimization of translation errors
• development set (f, ref)
• automatic error function Err(e; ref): (100 − BLEU) score
e*(λ) = arg max_e p_λ(e | f)    (2)
λ* = arg min_λ Err(e*(λ); ref)    (3)
• Err(e) is not mathematically well-behaved ⇒ no exact solution
• approximate iterative algorithms: gradient descent, downhill simplex
CLSP-WS solution for MET
[Diagram: the outer loop runs Moses on the input with the current weights and extracts n-best lists; the inner loop runs the Scorer and Optimizer over the n-best lists against the reference to produce optimal weights]
• outer loop:
– decoding with the actual lambdas
– generation of n-best translations
– addition to the previously generated translations
• inner loop:
– optimization over n-best lists
– decoder weights and "random" weights as initial points
• optimizer:
– iterative optimization of single weights
– discretization of the r-dimensional space of weights
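A minimal sketch of the optimizer step (an assumption, not the workshop code): one weight at a time is searched over a discretized axis; Error() is a toy stand-in for re-ranking the n-best list and computing 100 − BLEU.

// Minimal sketch: coordinate-wise search over a discretized weight space.
#include <cstddef>
#include <vector>

// Toy stand-in: real MET would re-rank the n-best lists with these
// weights and return 100 - BLEU of the resulting 1-best translations.
double Error(const std::vector<double>& lambda) {
    double e = 0.0;
    for (std::size_t i = 0; i < lambda.size(); ++i)
        e += (lambda[i] - 0.3) * (lambda[i] - 0.3);
    return e;
}

std::vector<double> Optimize(std::vector<double> lambda, int iterations) {
    for (int it = 0; it < iterations; ++it)
        for (std::size_t i = 0; i < lambda.size(); ++i) {
            double bestErr = Error(lambda), bestW = lambda[i];
            for (double w = -1.0; w <= 1.0; w += 0.05) {  // discretized axis
                lambda[i] = w;
                double err = Error(lambda);
                if (err < bestErr) { bestErr = err; bestW = w; }
            }
            lambda[i] = bestW;  // keep the best value for this weight
        }
    return lambda;
}

int main() {
    std::vector<double> lambda(4, 0.0);
    lambda = Optimize(lambda, 5);
    return 0;
}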
MET vs. size of n-best list
• German-English EuroParl task
• tuning on a dev set of 2000 sentences
• evaluation on a test set of 2000 sentences
[Plot: BLEU (18–26) vs. iteration (0–14) for 100/200/400/800-best lists]
• convergence in 5–6 iterations: good, faster outer loop
• no impact of the size of the n-best list: good, faster inner loop
MET vs. size of development set
• extraction of 4 subsets: 100, 200, 400, 800 sentences
[Plot: BLEU vs. iteration for dev sets of 100/200/400/800/2000 sentences]
• larger dev set:
– more stable results
– fewer iterations
– better results
• bad:
– overfitting on small dev sets
– large dev set → slower outer loop (decoding)

sentences   iterations   BLEU
100         18           24.3
200         15           25.1
400         16           24.6
800         14           24.9
2000         9           25.3
MET vs. optimization algorithm
• task: Spanish-English EPPS, speech input
• dev set of 2643 confusion networks, test set of 1073 CNs
• CLSP-WS algorithm vs. downhill simplex (RWTH)

algorithm           iterations   Δ BLEU dev   Δ BLEU test
CLSP-WS             4            +1.0         +0.4
downhill simplex    7            +2.9         +3.4

• mismatch between the internal score of the CLSP-WS algorithm and the official score
• better performance of the downhill simplex algorithm
• post-workshop investigation
Moses in parallel
• effective R&D cycle: fast experiments
• computing facilities: 6 clusters, 200 machines
• parallelization of translation on a (remote) cluster of machines
• 'split and merge' technique: a Splitter divides the source input into parts 1..N, one Moses instance translates each part, and a Merger reassembles the translations
• translation time:
– splitting/merging ≈ constant, negligible
– access to the cluster related to cluster load
– loading data ≈ constant
– decoding ∝ input length
Moses in parallel
• Spanish-English EuroParl task
• CLSP cluster, 18 machines
• no control of cluster load

Average time (seconds):
                 standard   1 job   5 jobs   10 jobs   20 jobs
10 sentences     6.3        13.1    9.0      9.0       –
100 sentences    5.2        5.6     3.0      1.7       1.7
1000 sentences   6.3        6.5     2.0      1.6       1.1
Decoder Output Analysis
Evan Herbst
8/17/06
Measurables
• Difficulty
– perplexity
• Error
– WER
– PWER
– BLEU
– confidence intervals
• Significance
– t-test
– sign test
Definition: Perplexity
Measures the likelihood of a corpus given a model (e.g. a language model):
PPX = 2^(−(1/N) Σ_i log2 p_LM(w_i)),  w_i words
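A minimal sketch of the computation, with made-up per-word probabilities:

// Minimal sketch: corpus perplexity from per-word LM probabilities,
// following the definition above.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical probabilities p_LM(w_i) for a 4-word corpus.
    std::vector<double> p = {0.10, 0.25, 0.05, 0.20};
    double logSum = 0.0;
    for (double pi : p) logSum += std::log2(pi);
    double ppx = std::pow(2.0, -logSum / p.size());
    std::cout << "perplexity = " << ppx << "\n";
}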
Definition: WER
Word Error Rate: modified edit distance
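A minimal sketch of one common formulation (the slide does not spell out the exact variant): word-level edit distance normalized by the reference length.

// Minimal sketch: WER as edit distance between hypothesis and reference
// word sequences, divided by the reference length.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

double WER(const std::vector<std::string>& hyp,
           const std::vector<std::string>& ref) {
    const std::size_t H = hyp.size(), R = ref.size();
    std::vector<std::vector<std::size_t> > d(H + 1, std::vector<std::size_t>(R + 1));
    for (std::size_t i = 0; i <= H; ++i) d[i][0] = i;   // deletions
    for (std::size_t j = 0; j <= R; ++j) d[0][j] = j;   // insertions
    for (std::size_t i = 1; i <= H; ++i)
        for (std::size_t j = 1; j <= R; ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,
                                d[i][j - 1] + 1,
                                d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])});
    return static_cast<double>(d[H][R]) / R;
}

int main() {
    std::cout << WER({"he", "goes", "home"},
                     {"he", "does", "not", "go", "home"}) << "\n";  // 0.6
}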
Definition: PWER
Position-independent Word Error Rate: match bags of words
Definition: BLEU
BiLingual Evaluation Understudy: n-gram precision and length comparison
Numbers
Dataset: 2000-sentence Europarl subset

                pharaoh                                    moses baseline
Linguae →       de-en                en-de                 de-en                en-de
BLEU            .2557                .1775                 .2554                .1776
WER             .5432                .6144                 .5428                .6145
PWER/WER        .865                 .940                  .865                 .947
Lemma BLEU      .2625                .2170                 .2622                .2180
N-gram Prec.    .609/.315/.188/.119  .519/.223/.122/.070   .609/.314/.188/.119  .519/.223/.122/.070
Perplexity      40.97                62.01                 40.94                61.77
Ref Perplex.    68.81                125.29                68.81                125.29

Inferences
• lemmas vs. surface: morphology
• output vs. reference perplexity: fluency
• PWER/WER ratio: reordering; phrase tables
Tool: Comparison
Tool: Alignment
Suffix Arrays for More Statistics (and Less Disk Space!)
Chris Callison-Burch
August 17, 2006
Phrase Tables in Statistical Machine Translation
• Using longer phrases leads to better translation quality
• Phrase tables can get unwieldy with long phrases
• The problem of large tables is compounded for factored translation models
Phrase Tables in Factored Translation Models
• Translation tables between source and target phrases, POS tags, stems, morphological markers, etc.
• Plus generation tables
• Want longer sequences for factors with smaller tag sets
• The number of tables depends on the number of conditioning variables and on back-off strategies
• Potentially more tables than all pairwise combinations of factors
Ad Hoc Solutions
• Limit the length of phrases
• Only extract phrases for the test data
• Make unnecessary independence assumptions
Proposed Solution: Intelligent Data Structure
• Uses less memory than table-based data structures
• Allows us to condition on whatever factors we want and easily back off
• Retrieve translation/generation probabilities for arbitrarily long sequences
• Suffix arrays to index the parallel corpus
How Suffix Arrays Work
Corpus (word indices 0–9):
Spain declined to confirm that Spain declined to aid Morocco

Initialized, unsorted suffix array (suffixes denoted by s[i]):
s[0] 0   Spain declined to confirm that Spain declined to aid Morocco
s[1] 1   declined to confirm that Spain declined to aid Morocco
s[2] 2   to confirm that Spain declined to aid Morocco
s[3] 3   confirm that Spain declined to aid Morocco
s[4] 4   that Spain declined to aid Morocco
s[5] 5   Spain declined to aid Morocco
s[6] 6   declined to aid Morocco
s[7] 7   to aid Morocco
s[8] 8   aid Morocco
s[9] 9   Morocco
Alphabetically Sorted
Sorted suffix array (suffixes denoted by s[i]):
s[0] 8   aid Morocco
s[1] 3   confirm that Spain declined to aid Morocco
s[2] 6   declined to aid Morocco
s[3] 1   declined to confirm that Spain declined to aid Morocco
s[4] 9   Morocco
s[5] 5   Spain declined to aid Morocco
s[6] 0   Spain declined to confirm that Spain declined to aid Morocco
s[7] 4   that Spain declined to aid Morocco
s[8] 7   to aid Morocco
s[9] 2   to confirm that Spain declined to aid Morocco
(Reasonably) Fast Find
[Same sorted suffix array as above: all occurrences of a phrase can be located quickly because the suffixes beginning with it form a contiguous block in the sorted order]
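A minimal sketch of the idea (an assumption, not the workshop implementation): build the sorted suffix array over a tokenized corpus, then count phrase occurrences by binary search.

// Minimal sketch: suffix array construction and phrase counting.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> corpus = {
        "Spain", "declined", "to", "confirm", "that",
        "Spain", "declined", "to", "aid", "Morocco"};

    // Initialize s[i] = i, then sort the indices by the suffix they start.
    std::vector<int> sa(corpus.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = static_cast<int>(i);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return std::lexicographical_compare(corpus.begin() + a, corpus.end(),
                                            corpus.begin() + b, corpus.end());
    });

    // Count a phrase: its occurrences form a contiguous block of suffixes.
    std::vector<std::string> phrase = {"Spain", "declined"};
    auto lo = std::lower_bound(
        sa.begin(), sa.end(), phrase,
        [&](int s, const std::vector<std::string>& p) {
            return std::lexicographical_compare(corpus.begin() + s, corpus.end(),
                                                p.begin(), p.end());
        });
    std::size_t count = 0;
    for (auto it = lo; it != sa.end(); ++it) {
        if (static_cast<std::size_t>(*it) + phrase.size() > corpus.size() ||
            !std::equal(phrase.begin(), phrase.end(), corpus.begin() + *it))
            break;
        ++count;
    }
    std::cout << "occurrences of \"Spain declined\": " << count << "\n";  // 2
}

Per-factor arrays (words, POS, stems) would each be indexed this way, with probabilities computed from the counts at query time.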
Applied to Factored Translation Models
Factored corpus (word indices 0–9):
words:  Spain  declined  to  confirm  that  Spain  declined  to  aid  Morocco
POS:    NNP    VBD       TO  VB       IN    NNP    VBN       TO  VB   NNP
stems:  spain  declin    to  confirm  that  spain  declin    to  aid  morocco
• Index each factor
• Store word-level alignments
• Calculate probabilities on the fly
Generation Probabilities
[The factored corpus and word-level sorted suffix array from above]
p(NNP VBN | Spain declined) = 0.5
p(NNP VBD | Spain declined) = 0.5
Generation Probabilities
Sorted suffix array over the POS factor (suffixes denoted by s[i]):
s[0] 4   IN NNP VBN TO VB NNP
s[1] 9   NNP
s[2] 0   NNP VBD TO VB IN NNP VBN TO VB NNP
s[3] 5   NNP VBN TO VB NNP
s[4] 2   TO VB IN NNP VBN TO VB NNP
s[5] 7   TO VB NNP
s[6] 3   VB IN NNP VBN TO VB NNP
s[7] 8   VB NNP
s[8] 1   VBD TO VB IN NNP VBN TO VB NNP
s[9] 6   VBN TO VB NNP

p(Spain | NNP) = 0.66666
p(Morocco | NNP) = 0.33333
Translation Probabilities
[The factored corpus and word-level sorted suffix array from above, aligned with the French sentence:]
Spain declined to confirm that Spain declined to aid Morocco
L' Espagne a refusé de confirmer que l' Espagne avait refusé d' aider le Maroc
p(L'Espagne a refusé de | Spain declined) = 0.5
p(l'Espagne avait refusé d' | Spain declined) = 0.5
Translation Probabilities
[The factored corpus and POS-factor sorted suffix array from above, aligned with the French sentence:]
Spain declined to confirm that Spain declined to aid Morocco
L' Espagne a refusé de confirmer que l' Espagne avait refusé d' aider le Maroc
p(l'Espagne avait refusé d' | Spain declined, NNP VBN) = 1
Advantages
• Memory reduction
– memory = 2 × num factors × corpus + word alignments
– significantly less than phrase tables!
• Greater range of statistics
– arbitrary number of conditioning variables
– allows a range of back-off strategies
• Can extract statistics for arbitrarily long sequences
Research to be Undertaken
• Integrate into the Moses decoder
• Deal with the increased computational complexity
• Change search strategies to incorporate longer factor sequences of different levels of granularity
• Experiment to test whether longer sequences improve translation quality
• Experiment with which variables to condition upon and how to back off
Factored Translation Models for Small Data Problems
Experiments with Spanish, Czech and Chinese
Wade Shen, Brooke Cowan, Ondřej Bojar and Christine Moran
MIT Lincoln + Computer Science AI Labs, Charles University
8/14/2006
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
General Motivations: Challenges with Small Data
• Phrase-based MT relies on large data
– learn "phrase" co-occurrence within a language
– learn translation templates/phrases across languages
• Problems for phrase-based MT with small data
– word alignment
– hard to see enough phrases (coverage) → especially in morphologically rich languages
– tendency to rely on shorter phrases → increased local agreement problems → increased long-distance coherence problems
Possible Advantages of Factored Models: Generalization over Morphology
• We can model morphological variation and phrase translation separately for better statistics: translation + generation
– Spanish gender:
English "he is a red player" → Spanish "Él es un jugador rojo" (morphology: m 3p+sing m m m)
English "she is a red player" → Spanish "Ella es una jugadora roja" (morphology: f 3p+sing f f f)
shared lemmas: el ser un jugador roj
– Czech case:
English "black cats" (nominative + plural) → Czech "černé kočky" (morphology: nom+pl nom+pl)
English "black cats" (dative + plural) → Czech "černým kočkám" (morphology: dat+pl dat+pl)
shared lemmas: černá kočka
Factors as Type Checking: Long Range Phenomena and Divergence
• Long range dependencies can be modeled with latent factors
– Spanish: verb-subject number agreement (AGR)
Spanish: Mi hija de dos años tiene catarro (subject: 3p+sing, verb: 3p+sing)
Gloss: My daughter of two years has cold
Czech: Nachlazena je moje dvouletá dcera.
• Verb-argument dependencies (the verb selects the case of its argument)
Czech: Napsal zprávu o matčině domu na papír (verb: 3p+sing selects noun: accusative)
Gloss: He wrote a message about mother's house on a paper
Czech: Našel zprávu o matčině domu na papíře (verb selects noun: locative)
Gloss: He found a message about mother's house on a paper
Phrase-Level Generalization
• Class-based divergences
– Chinese-English resultative constructions (verb-specific, but a similar pattern holds for a large class of verbs)
Gloss: you made it hit broken done → English: you broke it
• Longer-distance movement dependencies
– Chinese-English questions
Chinese: 你 要 回 答 [clause...] 吗
Gloss: you want reply [clause...] y/n-marker
English: would you like to reply to [clause...] ?
Tags: VModal Pn ... Part (the y/n particle causes reordering)
Large vs. Small Data: How Generalizations May Affect SMT Performance
• With large data sets these phenomena can be learned
– language models should capture local agreement phenomena with enough data
– long-range agreement/coherence remains problematic
– generalization may still be better, but errors in analysis can limit it
• Generalization may be advantageous for small data
– for example (Spanish/Czech agreement): can't learn every noun/adjective/determiner triple
– this is the situation for many real-world problems
Outline
• Motivations
• Experimental Design and Baselines
– Approaches
– Data Sets
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Data Sets and Baselines

Data set        Direction         Size                         Baseline w/ diff. LMs (BLEU, surface)
Full Europarl   English→Spanish   950k LM train, 700k bitext   3g: 29.35; 4g: 29.57; 5g: 29.54
Euromini        English→Spanish   60k LM train, 40k bitext     3g: 23.41; 3g (950k): 25.10
Czech WSJ       English→Czech     20k LM train, 20k bitext     3g: 25.82 (four references)
IWSLT Chinese   Chinese→English   40k LM train, 40k bitext     4g: 19.54 (seven references)
Using Factored Models: Approaches for Small-Data Tasks
• Factored models we tried
– different levels of linguistic information modeled separately (example: morphology vs. phrasal content)
– feature "checking" of existing phrasal models with LMs on factors:
Good: "I would like some donuts" (POS: pn mod vb det np) → high likelihood
Bad: "I would like some big jump" (POS: pn mod vb det adj vb) → low likelihood
– generalized factor-based distortion: phrases are likely to move distance X if the preceding word is tag Y
• Hypothesis: these models allow better utilization of limited training data
Different Factored Approaches: Overview of Models Tried
• High-order language models

Analysis       Problems addressed        Model types
Supervised     Explicit agreement        LMs over verbs/subjects; LMs over noun/determiner/adjectives
Supervised     Long-distance coherence   LMs over POS
Unsupervised   Agreement/coherence       LMs over word classes

• Parallel translation models

Analysis       Problem types         Model types
Supervised     Explicit agreement    Parallel translation models over lemmas and morphology
Unsupervised   Agreement/coherence   Parallel translation models over word classes and surface
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
– Morphology and Agreement Features (Brooke)
– Parallel Lemma and Morphology Translation (Wade)
– Scaling to Larger Corpora (Wade)
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Spanish Experiments: Language Models over Morphological Features
• NDA: noun/determiner/adjective agreement
– generate only on N, D and A tags (don't-cares elsewhere)
– N/D/A features: gender (masc, fem, common, none); number (sing, plural, invariable, none)
• VNP: verb/noun/preposition selection agreement
– generate on V, N or P
– V/N/P features: number (sing, plural, invariable, none); person (1p, 2p, 3p, none); prep-ID (preposition, none)
[Diagram: a surface model over words, plus generate-and-check models over latent nda and vpn factors]
Spanish Experiments: Skipped LMs for Agreement
Example. Source phrase "dio a la mujer" → target phrase "...gave the woman"
word: dio | a | la | mujer
nda:  X | X | s+f | s+f
vpn:  3+s | "a" | X | s
• Allow NULL factors to be generated
• Increase the effective context length to model longer-range dependencies
Spanish Agreement LMs: Experimental Results
• With skipping (EuroMini, baseline 23.41): NDA+Skip 24.03, VPN+Skip 24.16
• No skipping, LM counts don't-care positions (EuroMini, baseline 23.41): NDA 24.47, VPN 24.33, Both 24.54
• No skipping with all morphological features, with and without POS (EuroMini, baseline 23.41): Morph 24.66, Morph+POS 24.25
• All models beat the baseline
– skipping doesn't seem to help
– full morphology is best
Spanish Experiments: Parallel Lemma/Morphology Translation
[Diagram: Me → Mi; analysis splits the surface into lemma (I → Yo) and person+number+gender+case (1ps+Acc), generation recombines them]
• Factor surface forms into lemma and morphology features
• Translate both simultaneously
• Re-generate the target surface form
• Apply LMs on both surface and morphology features
• Results (EuroMini + 950k LM): baseline 25.10, lemma 25.71
Scaling Up to Large Training: POS Language Models
[Plot: POS-LM vs. baseline; BLEU (28–31, note the scale) vs. POS n-gram order (3g–9g) for the baseline, POS-LM, and full-tags models]
• Full training data → less/no gain from richer features
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
– Factored Word Alignment for Limited Data
– Rich Morphology and Tagged LMs
– Putting it Together: Parallel Translation
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Factors for Coping with Limited Data: Better Word Alignment for Czech
• Word alignment is difficult when data is limited and morphology is rich
– data: 20k bitext sentences, large vocabulary
– contrast set: 20k + 840k (out-of-domain) sentences
– task: English→Czech
• Two methods to deal with limited data: stem alignment and lemma alignment
• Contrastive behavior for small and large data:

Data set         Word-Word   Stem-Lemma   Stem-Stem
20k Czech        25.17       25.23        25.82
Large contrast   –           25.40        24.99
Czeching Rich Morphology with Tags: Tagged Czech Language Models
[Diagram: generation of the surface form kočky 'cat' is checked by an LM over morphological tags such as N+acc]
• Idea: use morphologically rich POS tag sequences to "czech" target output generation
• POS information configurations (baseline: 25.82):
– Full tags: feature 1, feature 2, ... (15 total); size 1098 tags; result 27.04
– CNG tags: case, number+gender on V, P, PP, N, A; size 707 tags; result 27.45
– CNG+VP: CNG features plus person+tense+aspect (verbs) and lemma+case (prepositions); size 899 tags; result 27.62
Comparing with Larger Data Models: Tagged Czech Language Models
• Large vs. small data:

Data set                          Baseline   CNG+VP   Relative improvement
20k Czech                         25.82      27.62    6.97%
Large contrast (20k + 840k OOD)   27.47      28.12    2.37%

• Tagged language models improve performance for small data significantly, approaching large-data performance
• The large task also improves (but much less: 2.37% vs. 6.97%)
Parallel Translation Models for Czech
• Motivation: factored LM models seem to lose number information
[Diagram: him → ho; surface, lemma (on) and POS tag + CNG features (3p+acc) are translated in parallel]
• Results:
– Surface → Surface + POS → POS+CNG: 25.94
– Surface → Lemma + POS → POS+CNG: 26.43
• Better than the baseline, but worse than both CNG and CNG+VP
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models (Christine)
– Lexical Distortion Models
– Factor-based Distortion
– Results
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Generalized Distortion Modeling: Introduction to Distortion
• For each phrase pair we learn its likely placement relative to the previous phrase
• Orientations:
– monotone: word alignment point on the top left
– swap: word alignment point on the top right
– discontinuous: not monotone or swap
[Diagram: source/target word-alignment grid illustrating the monotone, swap and discontinuous orientations]
• Examples
– la casa roja → the red house
– D NN ADJ → D ADJ NN
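A minimal sketch of classifying orientations from word-alignment points (the corner convention is an assumption based on the top-left/top-right description above):

// Minimal sketch, not the Moses implementation: orientation of a phrase
// pair from the alignment points adjacent to its top corners.
#include <iostream>
#include <set>
#include <utility>

enum Orientation { MONOTONE, SWAP, DISCONTINUOUS };

// alignment: (source index, target index) word-alignment points.
Orientation Classify(const std::set<std::pair<int, int> >& alignment,
                     int srcStart, int srcEnd, int tgtStart) {
    // Monotone: a point just above-left of the phrase block.
    if (alignment.count(std::make_pair(srcStart - 1, tgtStart - 1)))
        return MONOTONE;
    // Swap: a point just above-right of the phrase block.
    if (alignment.count(std::make_pair(srcEnd + 1, tgtStart - 1)))
        return SWAP;
    return DISCONTINUOUS;
}

int main() {
    // la casa roja -> the red house: la-the (0,0), casa-house (1,2), roja-red (2,1)
    std::set<std::pair<int, int> > a;
    a.insert(std::make_pair(0, 0));
    a.insert(std::make_pair(1, 2));
    a.insert(std::make_pair(2, 1));
    // Phrase pair casa -> house (source span [1,1], target starts at 2):
    std::cout << Classify(a, 1, 1, 2) << "\n";  // prints 1 (SWAP)
}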
Factor-based Distortion Models
• A factor-based extension of lexicalized distortion
– uses more general factors, e.g. POSf-POSe, lemma-lemma
• Can model longer-range dependencies
– more conditioning variables
• Motivating results
– hard-coding a few factor-based rules (e.g. swap nouns and adjectives when translating from English to Spanish) led to improvements (Gispert et al. 2006)
Factor-based Distortion: Spanish Experiments
• Lexicalized distortion only (Europarl):

Lang     Pharaoh   Moses
En→De    18.15     18.85
Es→En    31.46     32.37
En→Es    31.06     31.85

• Factor-based distortion on small data
– baseline (no lexicalized distortion)
– baseline lexicalized
– factored: POS-POS
– system combination: lexicalized + POS-POS
• Further experiments
– other factors
– minimizing model parameters
– combining different models
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
IWSLT Chinese: Experiments with Unsupervised Annotation
• Data: travel-domain sentences, limited vocabulary, short sentences
• Task: text and ASR translation, Chinese→English
• Can we use automatic word classes to learn general sequence constraints?
• First experiment: 2-gram word-class LMs of varying orders
Example. Source phrase: 总共 多少 钱 ? (word classes: c1 c22 c3 c55) → target phrase: How much is it?
[Diagram: separate models over the surface and class factors]
IWSLT Chinese: Alignment Templates for Translation
[Diagram: Me → Mi, I → Yo translated over surface and word-class factors, with generation]
• Second experiment: extend the class-based LM to the translation model
• Bigram word classes for source and target
• Translate alignment templates similar to (Och 98), plus surface forms
• Apply LMs to surface and class factors
IWSLT Chinese: Autoclass Results
[Plot: BLEU (18–22.5, note the scale) vs. class n-gram order (3g–9g) for the baseline, class-LM, and class translation+LM models]
• Class-LM significantly better (p = 0.05, ~1.0 BLEU)
• Class-Trans may be limited by the synchronous phrase-table constraint
– started to address this here, but not in time for the evaluation
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Coping with Rich Morphological Constraints in Czech
• Conclusions and Follow-on Research
Conclusions and Future Work
• The factored approach can help with small data
– large-data tasks may need different factored approaches
• MIT/LL + CSAIL
– continue experiments with morphology and coherence
– fully asynchronous factor translation
– apply the techniques to other languages; extend existing LCTL experiments
– syntax-driven reordering models (Brooke)
• Asynchronous factor translation (Hieu)
• Making use of verb subcategorization information (Ondřej)
Valency-Aware Machine Translation: Project Proposal
Ondřej Bojar
[email protected]
August 17, 2006
Overview
• JHU Workshop motivation and one of the results
• State-of-the-art MT errors
• Project goal
• Motivation: why Czech
• Proposed strategy and information sources
• Summary
Appendices: references, illustrations and further details on Czech and English
Workshop Motivation
• Statistical machine translation (SMT) into morphologically rich languages is more difficult than from them. See e.g. Koehn (2005).
• One of the workshop goals: examine the utility of factored translation models for translating into morphologically rich languages.
• There was room for improvement:
– regular BLEU, English→Czech: 25%
– BLEU of lemmatized MT against lemmatized references: 32%
⇒ Errors in morphology cause a large BLEU loss.
One of the Workshop Results
• Significant improvements gained on small data sets:
– English→Czech: 20k sentences, BLEU 25.82% to 27.62%, or up to 28.12% with additional out-of-domain parallel data
• Still far below the margin of lemmatized BLEU (35%)
• However, local agreement is already very good:
– microstudy of adjective-noun agreement: 74% correct, 2% mismatch, other: missing noun etc.
⇒ So where are the morphological errors?
Current English→Czech MT Errors
Microstudy of the current best MT output (BLEU 28.12%), intuitive metric:
• 15 sentences, 77 verb-modifier pairs in the source text examined:

Translation ...          Verb   Modifier
... preserves meaning    43%    79%
... is disrupted         14%    12%
... is missing           21%    6%

• But: when verb & modifier are correct, 44% of cases are non-grammatical or meaning-disturbing relations.
Sample Errors
Input: Keep on investing.
MT output: Pokračovalo investování. (grammar correct here!)
Gloss: Continued investing. (Meaning: The investing continued.)
Correct: Pokračujte v investování.
⇒ the language model misled us ⇒ need to include source valency information

Input: brokerage firms rushed out ads ...
MT output: brokerské firmy vyběhl reklamy
Gloss: brokerage firms(pl.fem) ran(sg.masc) ads(pl.nom, pl.acc, pl.voc, sg.gen)
Correct option 1: brokerské firmy vyběhly s reklamami(pl.instr)
Correct option 2: brokerské firmy vydaly reklamy(pl.acc)

Target-side data may be rich enough to learn: vyběhnout–s–instr
Not rich enough to learn all morphological and lexical variants: vyběhl–s–reklamou, vyběhla–s–reklamami, vyběhl–s–prohlášením, vyběhli–s–oznámením, ...
Valency-Aware Machine Translation
August 17, 2006
6
Project Goal
Improve MT output quality by using valency information.
Motivation: Why Czech
• Relevant properties: very rich morphological system and relatively free word order
• Well-established theory on syntax, and on valency in particular: Sgall, Hajičová, and Panevová (1986), Panevová (1994)
• Data available: monolingual and parallel corpora, manual surface and deep treebanks (parallel forthcoming!), manual valency lexicons

Language   Corpus                                          Annotation up to                         Tokens
Cs         PDT 2.0 (Hajič, 2005)                           manual surface and deep syntax           1.5M surf.
Cs         CNC (Kocek, Kopřivová, and Kučera, 2000)        automatic lemmatization and morphology   114M
Cs         Web corpus                                      automatic surface syntax                 100M
Cs↔En      PCEDT 1.0 (Čmejrek, Cuřín, and Havelka, 2003)   automatic surface and deep syntax        500k
Cs↔En      CzEng 0.5                                       automatic surface syntax                 15M
Proposed Strategy
Preliminary experiments at the workshop:
• Factored models touching valency explored during the workshop performed badly: no gain or a slight loss.

Future:
• Evaluate the causes. Was it just sparse data?
• Check subcategorization using partially lexicalized language models: a morphological LM with verbs kept lexicalized should capture subcategorization (see the sketch below).
• Experiment with syntax-based language models (Chelba and Jelinek, 1998; Charniak, 2001).
• Map explicit subcategorization information from source to target: translate lemma+subcat to lemma+subcat and POS to POS, then generate the surface form from these.
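A partially lexicalized LM of this kind can be prototyped with a one-pass preprocessing step. The sketch below is only an illustration of the idea, not the project's implementation; the (surface, lemma, tag) triples and the shortened Prague-style tags in the example are assumed inputs from a morphological tagger.

    # Build training text for a partially lexicalized LM: verbs keep their
    # lemma, all other tokens are reduced to their morphological tag, so a
    # plain n-gram LM can learn verb patterns such as vybehnout-s-instr.

    def partially_lexicalize(sentence):
        stream = []
        for surface, lemma, tag in sentence:
            if tag.startswith("V"):      # verbs stay lexicalized
                stream.append(lemma)
            else:                        # everything else: tag only
                stream.append(tag)
        return stream

    # hypothetical tagged sentence "firmy vyběhly s reklamami"
    sent = [("firmy", "firma", "NNFP1"), ("vyběhly", "vyběhnout", "VpTP"),
            ("s", "s", "RR--7"), ("reklamami", "reklama", "NNFP7")]
    print(" ".join(partially_lexicalize(sent)))
    # -> NNFP1 vyběhnout RR--7 NNFP7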
Project Will Use These Sources of Information
• Available valency/subcategorization dictionaries: VALLEX for Czech (∼PropBank for English).
• Automatically collected subcategorization data: (Korhonen, 2002) and earlier work; my dissertation in preparation.
• Word-sense-disambiguation-like algorithms to label verb occurrences with frames: (Bojar, Semecký, and Benešová, 2005), and WSD community results in general.

Compare with simple approaches:
• More monolingual data for plain n-gram language models may help enough.
• Are valency-based generalizations useful in general / on small data / on out-of-domain data?
Summary
• Factored models help fix morphology → local dependencies are already correct.
• There is a significant margin for improving verb-modifier agreement.
• The English→Czech pair is a good fit for these experiments.
• Improved valency models should improve translation quality: valency theory, data, and methods are available.
References
Bojar, Ondřej. 2003. Towards Automatic Extraction of Verb Frames. Prague Bulletin of Mathematical Linguistics, 79–80:101–120.
Bojar, Ondřej, Jiří Semecký, and Václava Benešová. 2005. VALEVAL: Testing VALLEX Consistency and Experimenting with Word-Frame Disambiguation. Prague Bulletin of Mathematical Linguistics, 83:5–17.
Charniak, Eugene. 2001. Immediate-head parsing for language models. In Meeting of the Association for Computational Linguistics, pages 116–123.
Chelba, Ciprian and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 225–231, San Francisco, California. Morgan Kaufmann Publishers.
Čmejrek, Martin, Jan Cuřín, and Jiří Havelka. 2003. Czech-English Dependency-based Machine Translation. In EACL 2003 Proceedings of the Conference, pages 83–90. Association for Computational Linguistics, April.
Collins, Michael. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184–191.
Collins, Michael, Jan Hajič, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A Statistical Parser of Czech. In Proceedings of the 37th ACL Conference, pages 505–512, University of Maryland, College Park, USA.
Hajič, Jan. 2005. Complex Corpus Annotation: The Prague Dependency Treebank. In Mária Šimková, editor, Insight into Slovak and Czech Corpus Linguistics, pages 54–73, Bratislava, Slovakia. Veda, vydavateľstvo SAV.
Holan, Tomáš. 2003. K syntaktické analýze českých(!) vět [On the syntactic analysis of Czech(!) sentences]. In MIS 2003. MATFYZPRESS, January 18–25, 2003.
Kocek, Jan, Marie Kopřivová, and Karel Kučera, editors. 2000. Český národní korpus – úvod a příručka uživatele [The Czech National Corpus – Introduction and User's Handbook]. FF UK – ÚČNK, Praha.
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit X, September.
Korhonen, Anna. 2002. Subcategorization Acquisition. Technical Report UCAM-CL-TR-530, University of Cambridge, Computer Laboratory, Cambridge, UK, February.
Kruijff, Geert-Jan M. 2003. 3-Phase Grammar Learning. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development.
Panevová, Jarmila. 1994. Valency Frames and the Meaning of the Sentence. In Ph. L. Luelsdorff, editor, The Prague School of Structural and Functional Linguistics, pages 223–243, Amsterdam-Philadelphia. John Benjamins.
Sgall, Petr, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, Netherlands.
Analysis of Czech
[Slide figure: three layers of analysis for "Zákony udělejte pro lidi" ("Laws make for people"). Morphological layer: each form with its lemma and positional tag (e.g. zákony/zákon with NNIP1/NNIP4/NNIP5/NNIP7 readings; udělejte/udělat; pro/pro-1; lidi/člověk with NNMP1/NNMP4/NNMP5 readings). Analytic (surface-syntactic) layer: dependency tree with labels such as PRED, OBJ, ADV, AUX. Tectogrammatical (deep-syntactic) layer: udělat(imp) as PRED with participants labeled ACT, PAT, BEN, e.g. zákon(Pl), člověk(Pl,pro), and the unexpressed actor #36 "you".]
Properties of the Czech Language

                     Czech                                   English
  Rich morphology    ≥ 4,000 tags possible, ≥ 2,300 seen    50 used
  Word order         free                                    rigid

• rigid global word order phenomena: clitics
• rigid local word order phenomena: coordination, mutual order of clitics

  Nonprojectivity (Czech)
  Nonprojective sentences    23.3%  (16,920)
  Nonprojective edges         1.9%  (23,691)

  Known parsing results      Czech          English
  Edge accuracy              69.2–82.5%     91%
  Sentence correctness       15.0–30.9%     43%

Data by (Collins et al., 1999), (Holan, 2003), Zeman (http://ckl.mff.cuni.cz/~zeman/projekty/neproj/index.html) and (Bojar, 2003). Consult (Kruijff, 2003) for measuring word order freeness.
Detailed Numbers on Czech

  Edge length (cumulative)   1      ≤ 2    ≤ 5
  English [%]                74.2   86.3   95.6
  Czech [%]                  51.8   72.1   90.2

  Number of gaps             0      1      2
  Sentences [%]              76.9   22.7   0.42

  Climbing steps             1      2      3     4     5
  Nodes [%]                  90.3   8.0    1.3   0.3   0.1

Edge-length data for English by (Collins, 1996) and for Czech by (Holan, 2003); gap and climbing-step data by (Holan, 2003).
Analytic vs. Tectogrammatical (2)
[Slide figure: analytic and tectogrammatical trees for "To by se mělo změnit." ("It should change."). Analytic tree: To (it, SB), by (conjunct particle, AUXV), se (reflexive particle, AUXR), mělo (should, PRED), změnit (change, OBJ), final full stop (AUXK). Tectogrammatical tree: PRED mít (should) with ACT to (it), PAT změnit(conj) (change(conj)), and a Generic Actor node #45.]
Asynchronous Factored Translation
Hieu Hoang, University of Edinburgh
August 17, 2006
Current System
[Slide figure: translating "Je vous achète un chat" with two factors. Surface factor: Je vous achète un chat; POS factor: PRO PRO VB ART NN. Phrase Table 1 (surface) translates "Je vous achète" → "I am buying you"; Phrase Table 2 (POS) translates "PRO PRO VB" → "PRO VB VB PRO". Both tables must cover the same source span.]
Limitations of Synchronous Translation
[Slide figure: same input, but Phrase Table 1 only has the short entries Je → I, vous → you, achète → am buying, while Phrase Table 2 has the long POS entry PRO PRO VB → PRO VB VB PRO. Because all factors must be translated synchronously over identical spans, the long POS template cannot be combined with the short surface phrases.]
Asynchronous Translation
[Slide figure: the same example, now allowing the surface factor to be covered by three one-word phrases (Je → I, vous → you, achète → am buying) while the POS factor is covered by a single three-word template (PRO PRO VB → PRO VB VB PRO): the factors no longer need identical segmentations.]
Tiling
[Slide figures (animation builds): coverage of "Je vous achète un chat" (PRO PRO VB ART NN). Current system: the surface and POS factors are tiled with identical spans. Future: the two factors may be tiled independently, with differently segmented, overlapping spans.]
Long Templates
[Slide figure: Phrase Table 1 translates word by word (Je → I, vous → you, achète → am buying, un chat → a cat); Phrase Table 2 contributes one long POS template PRO PRO VB ART NN → PRO VB VB PRO ART NN that spans the whole sentence and drives the reordering.]
Templates
[Slide figure: the same decomposition, with the POS template PRO PRO VB ART NN → PRO VB VB PRO ART NN acting as a template into which the word-level translations are slotted.]
Combining Information from Different Factors
[Slide figure: Chinese→English example. Surface: ni suo ta da mingzi le ma ? (gloss: you say his name already question); a tense factor derives "past" from the particle "le", so the output verb is generated in the past tense: "You said his name, right ?"]
Challenges
• Computational complexity
• Pruning strategies
• Recombination
• Scoring
Translation of Morphologically Rich Languages with Additional Linguistic Information
Chris Dyer, Philipp Koehn, Chris Callison-Burch, Hieu Hoang
17 August 2006
Morphologically Rich Languages
• Languages differ in their morphological markup
• Examples with increasing complexity:
  – Chinese: no marking for number, gender, tense, or aspect
  – English: number (2) for nouns, four verb forms
  – Spanish: number (2) and gender (2) for adjectives, ...
  – German: number (2), gender (3), case (4), definiteness for adjectives, ...
  – Arabic: number (3), gender (2), case (3), definiteness, possessors for nouns
  – Finnish: prepositions often expressed morphologically

  Language   Vocabulary size in Europarl
  English     65,887 word forms
  Spanish    102,886 word forms
  German     195,290 word forms
  Finnish    358,345 word forms
Impact of Morphological Complexity
• How much information do we have if we discount inflectional morphology?
• Experiment (systems trained on the full 700,000-sentence Europarl corpus):

  Method                           devtest       test
  surface → surface                18.22 BLEU    18.04 BLEU
  surface → surface (lemmatize)    22.27 BLEU    22.15 BLEU
  surface → lemma                  22.70 BLEU    22.45 BLEU

• A gain of 4 BLEU points is possible if we can solve morphology.
Problem: Unknown Word Forms
• Unknown surface word forms (German):

  test set       unigrams   bigrams   trigrams
  devtest-2006   0.71%      12.00%    40.46%
  test-2006      0.69%      12.20%    41.08%

• Unknown lemmas (German):

  test set       unigrams   bigrams   trigrams
  devtest-2006   0.64%      9.05%     33.93%
  test-2006      0.64%      9.14%     34.36%
Factored Models
• Factored models allow us to address these problems
• Sparse data:
  – back off to translation of lemmas
  – back off to language models with richer statistics
• Agreement and grammatical coherence:
  – factors that enforce agreement within noun phrases
  – factors that enforce agreement at the clause level
Addressing Data Sparseness with Lemmas
[Slide figure: the input word maps to an output word and an output lemma.]
• Translate surface into lemma
• Generate surface from lemma
• Translate surface into surface
• Language models over surface and lemma
Addressing Data Sparseness with Lemmas, Model 2
[Slide figure: the input word maps to an output word; the output lemma is generated from the output word.]
• Translate surface into surface
• Generate lemma from surface
• Language models over surface and lemma
Experimental Results

  Method                         devtest   test
  baseline                       18.22     18.04
  hidden lemma (gen only)        18.82     18.69
  hidden lemma (gen and trans)   18.41     18.52
  best published results         -         18.15

• Better performance than the baseline model
• The simpler model performs better – fewer search errors
Addressing Data Sparseness with Factored Models
[Slide figure: input word, lemma, and part-of-speech map to output word, lemma, part-of-speech, and morphology.]
• Morphological analysis and generation model
• Pitfalls of this approach:
  – the tag set does not necessarily carry sufficient information
  – explosive search space on large models
Overall Grammatical Coherence
[Slide figure: the input word maps to an output word and part-of-speech.]
• High-order language models over POS
• Motivation: syntactic tags should enforce syntactic sentence structure
• Results: no major impact with a 7-gram POS model (BLEU 18.25 vs. 18.22)
• Analysis: local grammatical coherence is already fairly good, and a POS sequence LM is not strong enough to support major restructuring
Local Agreement (esp. within Noun Phrases)
[Slide figure: the input word maps to an output word, part-of-speech, and morphology.]
• High-order language models over POS and morphology
• Motivation:
  – DET-sgl NOUN-sgl: good sequence
  – DET-sgl NOUN-plural: bad sequence
Agreement within Noun Phrases
• Experiment: 7-gram POS and morphology LMs in addition to a 3-gram word LM
• Results:

  Method           Agreement errors        devtest       test
  baseline         15% in NPs ≥ 3 words    18.22 BLEU    18.04 BLEU
  factored model    4% in NPs ≥ 3 words    18.25 BLEU    18.22 BLEU

• Example: baseline ... zur zwischenstaatlichen methoden ... vs. factored model ... zu zwischenstaatlichen methoden ...
• Example: baseline ... das zweite wichtige änderung ... vs. factored model ... die zweite wichtige änderung ...
Subject-Verb Agreement
• A lexical n-gram language model would prefer:
    the paintings of the old man is beautiful
  since "old man is" is a better trigram than "old man are".
• Correct translation with agreement factors:
    the   paintings    of   the   old   man   are        beautiful
    -     SBJ-plural   -    -     -     -     V-plural   -
• A special tag tracks the count of subject and verb:
    p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)
Experiment on English–German
• Add special features for subject and verb
• Verb:
  – our morphological analyzer does not provide verb morphology → use surface forms
• Subject:
  – subject identified with a German parser (Amit Dubey's parser trained on the TIGER treebank)
  – if a pronoun: surface form of the pronoun
  – if a noun phrase: POS and morphological tags of determiner, adjective, and noun
Skip Language Models
• A full language model is confused by the many uninformative "-" items:
    p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)
• Skip language models ignore the irrelevant tags:
    p(SBJ-plural,V-plural) > p(SBJ-plural,V-singular)
• Results: experiments have not finished yet; preliminary results are inconclusive
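The skip-LM idea can be emulated by deleting the uninformative "-" tokens before ordinary n-gram scoring. The toy sketch below (with made-up bigram probabilities, not model scores from these experiments) illustrates that step:

    import math

    BIGRAM_LOGPROB = {                         # hypothetical numbers
        ("SBJ-plural", "V-plural"):   math.log(0.6),
        ("SBJ-plural", "V-singular"): math.log(0.1),
    }

    def skip_lm_score(tags, floor=math.log(1e-4)):
        kept = [t for t in tags if t != "-"]   # the "skip" step
        return sum(BIGRAM_LOGPROB.get(b, floor) for b in zip(kept, kept[1:]))

    good = ["-", "SBJ-plural", "-", "-", "-", "-", "V-plural", "-"]
    bad  = ["-", "SBJ-plural", "-", "-", "-", "-", "V-singular", "-"]
    assert skip_lm_score(good) > skip_lm_score(bad)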
Reflection on the Data
• Clause elements are translated reasonably well
  – agreement within noun phrases is now high (4% errors with the factored model)
• Overall sentence structure is muddled
  – subject-verb agreement is hard to enforce, since it is hard to establish which noun phrase is the subject
  – the role (and hence case) of noun phrases is often wrong, since their relation to the verb is unclear
• Similar problems arise when translating Arabic–English and Chinese–English
  – this motivates work on syntax-based machine translation
  – one solution: syntactic restructuring models (Brooke's presentation)
  – another solution: clause-level sequence models
Clause-Level Sequence Models
• Correct sentence with verb:
    the   paintings   of    the   old   man   are   beautiful
    SBJ   SBJ         OBJ   OBJ   OBJ   OBJ   V     ADJ
• Incorrect sentence without verb:
    the   paintings   of    the   old   man   beautiful
    SBJ   SBJ         OBJ   OBJ   OBJ   OBJ   ADJ
• The syntactic role label sequence model is at the steering wheel:
    p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,V,ADJ) > p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,ADJ)
• This may be simplified using skip language models to:
    p(SBJ,OBJ,V,ADJ) > p(SBJ,OBJ,ADJ)
Another Reality Check
• One typical error of the current system:
    wir    haben   daher   nicht   für      diesen   bericht   stimmen
    we     have    hence   not     for      this     report    voting
    SUBJ   AUX     PART    PART    PP-OBJ   PP-OBJ   PP-OBJ    VINF
• Typical sentences have many particles floating around
  – if interested in the core sentence structure: ignore them
  – if interested in all parts of the clause: include them
• Key lesson: feature engineering
  – know your tag sets and morphological features
  – be aware of what problem you want to address
  – create a factor for exactly that purpose
Future Research
Back-off Models: Improving MT through Smarter Searching and Better Use of Data
Chris Dyer, University of Maryland
8/22/2006
Two Goals
• Smarter search
  – Mitigate sparse-data effects in multi-factored models
  – Recover from search errors
  – Enable well-motivated models for translating into morphologically complex languages
• Back-off models
  – Take advantage of single-factored models when it makes sense to do so
Smarter Search: Motivation
• Morphological complexity poses problems for "whitespace tokenized" statistical MT
• Beyond data sparseness: conventional models run into search problems for rare surface forms
• Lemmatizing yields considerable performance gains for German:

  Method                           devtest-2006   test-2006
  surface → surface                18.22          18.04
  surface → surface, lemmatize     22.27          22.15
  surface → lemma                  22.70          22.45
Smarter Search: Motivation
• Single-factor models do not generalize: they cannot produce a target form unless it was seen in the training data.
• Basic generation models allow us to improve translation coverage with (inexpensive) monolingual resources.
• Translating English→German:

  Generation training data         Size                # distinct words producible
  Surface only                     n/a                 105,000 distinct words
  Lemmas only                      n/a                  85,000 distinct lemmas
  Lemmas + bitext Europarl          15 million words   117,000 distinct words
  Lemmas + full Europarl            27 million words   122,000 distinct words
  Lemmas + 1.2M EP + Wikipedia     113 million words   137,000 distinct words

• Net result: a 30% increase in producible forms over a single-factor model.
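Such a generation table can be harvested from monolingual text alone. The sketch below is our simplification of the idea, with `analyze` standing in for an external morphological analyzer (an assumed dependency):

    from collections import Counter, defaultdict

    def build_generation_table(tokens, analyze):
        # count which surface forms realize each lemma in monolingual text
        counts = defaultdict(Counter)
        for surface in tokens:
            counts[analyze(surface)][surface] += 1
        # relative frequencies as generation scores p(surface | lemma)
        return {lemma: {s: c / sum(cs.values()) for s, c in cs.items()}
                for lemma, cs in counts.items()}

Because only target-side text is needed, each added corpus (bitext target side, full Europarl, Wikipedia) can only grow the set of producible surface forms, which is where the 30% coverage gain above comes from.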
Morphological Analysis and Generation Model
• n-gram LMs over surface forms, morphology, and lemmata
• 4-step model:
  1. Translate surface to lemma
  2. Generate morphology from lemma
  3. Translate POS to morphology
  4. Generate surface from lemma + morphology
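Read as a log-linear decomposition, the four steps score a factored target word roughly as sketched below. This is schematic only: t1-t4 stand in for the trained translation/generation tables, and the field names are our own illustration.

    import math
    from collections import namedtuple

    Src = namedtuple("Src", "surface lemma pos")
    Hyp = namedtuple("Hyp", "surface lemma morph")

    def four_step_logscore(src, hyp, t1, t2, t3, t4):
        return (math.log(t1[src.surface][hyp.lemma])       # 1. surface -> lemma (translate)
              + math.log(t2[hyp.lemma][hyp.morph])         # 2. lemma -> morphology (generate)
              + math.log(t3[src.pos][hyp.morph])           # 3. POS -> morphology (translate)
              + math.log(t4[(hyp.lemma, hyp.morph)][hyp.surface]))  # 4. generate surface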
Initial Results Were Disappointing...
• BLEU scores well below the baseline (~11)
• Tuning took an entire weekend on a very small tuning set
The Problem: Search Errors
• Aggressive pruning
• Each step multiplies the number of states in the search space over a single-factored model
• Spans must overlap exactly
The Problem: An Illustration
• Translation options for 'the right approach':
    der richtige Ansatz
    dem richtigen Ansatz
    den richtigen Ansatz
The Solution
• Back off to shorter spans: when a dead end is reached, break the source span into smaller spans and translate those.
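A greedy version of this back-off fits in a few lines. This is an illustration only, not the planned implementation: a real decoder would score all split points rather than always splitting in the middle, and `options` stands in for the factored phrase-table lookup.

    def translate_span(words, options):
        opts = options(tuple(words))
        if opts:                       # full span covered in all factors: use it
            return opts[0]
        if len(words) == 1:            # unknown single word: pass it through
            return words[0]
        mid = len(words) // 2          # dead end: split the span and recurse
        return (translate_span(words[:mid], options) + " " +
                translate_span(words[mid:], options))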
The Solution: An Illustration
• Translation options for 'the': der, die, das, dem, den, des
• Translation options for 'right approach': richtiger Ansatz, richtigen Ansatz, richtigen Ansatzes
Back-off Models
• Lexicalized surface forms are common
• Because of lexicalization, obscure morphology or root forms are often retained, e.g. "be that as it may"
• Translations are often approximate, and unusual when analyzed in more abstract layers
• If common stock phrases are mistranslated because of a rigid analysis and generation process, fluency suffers
Back-off Models
• Solution:
  – Try to let a single translation step cover all factors
  – Back off to the multi-factored model
Back-off Models: Implementation
• "Primary" phrase table
  – Standard form; contains all factors on the target side (necessary for secondary-factor LMs)
  – May be trained on single-factor data with "best guesses" for the secondary factors
  – May be aggressively filtered, e.g. to phrases with > n occurrences
Back-off Models: Implementation
• Key idea: back-off weight
  – A feature associated with choosing a single-factored path
  – Tuned along with the other feature weights
  – Perhaps a function of source phrase length?
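Schematically, this is just one additional feature in the log-linear model:

    score(e, f) = Σ_i λ_i h_i(e, f) + λ_backoff · h_backoff(e, f)

where h_backoff fires for phrases translated via the single-factored back-off path (possibly weighted by source phrase length), and λ_backoff is tuned like any other weight.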
Summary
• Increase the performance of multi-factored models
  – Recover from search errors
  – Recover from data sparseness (make more efficient use of longer underlying phrases)
• Extend the benefits of multi-factor models to target languages (e.g. English) where sparse data and search errors are not generally an issue
Translation with Syntax and Factors: Handling Global and Local Dependencies in SMT
Brooke Cowan, MIT CSAIL
August 17, 2006
Goals of Statistical Machine Translation
• Linguistically correct output
  – learn correct syntax and morphology in the target language
  – e.g., noun-phrase agreement, subject-verb agreement, verbs and their arguments
• Meaning-preserving output
  – learn the mapping between source and target sentence elements
  – e.g., identify the subject in the source and ensure it plays the proper role in the target
  – can involve a significant amount of reordering
Linguistically Correct Output
• E.g., in Spanish noun phrases, nouns, determiners, and adjectives are constrained to agree in gender and number:

    las    políticas   pesqueras   comunitarias
    the    policies    fisheries   common
    det    noun        adj         adj
    (all FEMININE PLURAL)

• Phrasal agreement phenomena are generally local in nature.
Meaning-Preserving Output: Free Word Order
• E.g., when translating from German to English, we want to identify and place the subject, object, and phrasal modifiers in the output:

    ich möchte dem berichterstatter für seinen bericht danken
    dem berichterstatter möchte ich für seinen bericht danken
    für seinen bericht möchte ich dem berichterstatter danken
      → i would like to thank the rapporteur for his report

• Translation involving free-word-order languages, or language pairs with very different basic word order, can be quite challenging because these phenomena are generally global in nature.
A Hybrid System
• A syntax-based system handles global phenomena in translation:
  – inter-phrasal reordering
  – verb/argument structure
  – some long-distance agreement phenomena (e.g., subject-verb agreement)
• A factored phrase-based system handles local phenomena in translation:
  – agreement and reorderings
Combining the Two Systems
• Use the syntax-based system to reorder the source-language input
• Feed the output of the syntax-based system into the phrase-based system

    German input:          für seinen bericht möchte ich dem berichterstatter danken
    Modified German input: ich WOULD LIKE TO THANK dem berichterstatter für seinen bericht
    English output:        i would like to thank the rapporteur for his report
The Syntax-Based System
• Discriminatively trained, tree-to-tree translation system (Cowan, Collins, and Kučerová, EMNLP '06)
• Fully implemented and tested on the German-to-English Europarl task
• The model predicts an aligned extended projection (AEP) on the target side:
  – a syntactic structure encapsulating the argument structure of the main target-side verb, and
  – alignment information between the modifiers on the source and target sides
What is an AEP?
[Slide figure: a German clause (s: pp-mo "zwischen beiden gesetzen", verb "bestehen", adv-mo "also", np-sb "erhebliche rechtliche, praktische und wirtschaftliche unterschiede") paired with its English AEP: the extended projection (EP) of the main verb (Frank 2002), S → NP-A VP, V = "are", plus alignment information: SUBJECT: there; OBJECT: 3; MOD(1): post-object; MOD(2): pre-subject.]
Integration with Moses
• Factor-based systems handle local phenomena well
• Extensions to Moses:
  – externally provided translation options
      Modified German input: [ ich ] [ WOULD LIKE TO THANK ] [ dem berichterstatter ] [ für seinen bericht ]
  – constraints on reordering
  – n-best lists of AEPs
Research Questions
• Factor the translation problem into two parts:
  – a syntax-based system to handle global reorderings and agreement
  – a factor-based system to handle local reorderings and agreement
• Can this approach improve overall translation quality?
  – cf. past work on rule-based clause restructuring (e.g., Collins, Koehn, Kučerová, ACL '05)
• What is the best way to combine these systems?
  – hard constraints vs. soft constraints
  – a voting/back-off framework
Part of Speech Information for Alignment
Alexandra Constantin
2006 CLSP Summer Workshop
Bilingual Dictionary
Haus – house, building, home, household
Lexical Translation Probability Distribution
Implicit Alignment
[Slide figure: "Das Haus ist klein." (positions 1-4) aligned word for word to "The house is small." (positions 1-4).]
Alignment Function a
[Slide figure: "Klein ist das Haus" aligned to "The house is small"; the alignment function a maps each English position 1-4 back to a German position, here with reordering.]
POS Motivation
• POS information for infrequent words
IBM Model 1 – Notation
• e = target word; f = source word
• t(e|f) = probability of translating foreign word f into English word e
• f = (f_1, ..., f_n) = foreign sentence
• e = (e_1, ..., e_m) = English sentence
• p(e|f) = translation probability
• a = alignment function
IBM Model 1
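For reference, the standard Model 1 translation probability, with f_0 the NULL word, n = |f|, and m = |e|:

    p(e|f) = ε / (n+1)^m · ∏_{j=1..m} Σ_{i=0..n} t(e_j | f_i)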
EM Algorithm
1. Initialize the model (typically with a uniform distribution)
2. Apply the model to the data (expectation step)
3. Learn the model from the data (maximization step)
4. Iterate steps 2-3 until convergence
Expectation Step / Maximization Step
[Formula slides; see the sketch below.]
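As a compact reference for the two steps, here is a minimal Model 1 EM implementation (the standard algorithm, simplified by omitting the NULL word):

    from collections import defaultdict

    def model1_em(bitext, iterations=5):
        """bitext: list of (foreign_words, english_words) pairs.
        Returns t[e][f] = p(e | f). Simplified: no NULL word."""
        e_vocab = {e for _, es in bitext for e in es}
        t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(e_vocab)))  # uniform init
        for _ in range(iterations):
            count = defaultdict(lambda: defaultdict(float))   # expected counts c(e, f)
            total = defaultdict(float)                        # c(f)
            for fs, es in bitext:                             # E-step
                for e in es:
                    z = sum(t[e][f] for f in fs)              # normalization
                    for f in fs:
                        c = t[e][f] / z                       # alignment posterior
                        count[e][f] += c
                        total[f] += c
            for e in count:                                   # M-step
                for f in count[e]:
                    t[e][f] = count[e][f] / total[f]
        return t

    bitext = [("das haus".split(), "the house".split()),
              ("das buch".split(), "the book".split()),
              ("ein buch".split(), "a book".split())]
    t = model1_em(bitext)
    # after a few iterations t["the"]["das"] dominates t["house"]["das"], etc.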
Adding POS Information
Experiments – AER
• Compare generated alignments against manual alignments
• Manual alignments: probable (P) and sure (S) links
• Automated alignments: A
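The metric itself, Och and Ney's alignment error rate (sure links S are a subset of the probable links P; lower is better):

    AER(S, P; A) = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|)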
Results (AER)

  AER        10k    20k    40k    60k    80k    100k
  Baseline   53.7   51.8   49.3   48.6   47.5   47.1
  Only POS   76.0   75.4   75.5   75.1   75.3   75.1
  + POS      53.6   51.5   49.6   48.4   47.7   47.3
Future Work
• Use the alignments to train an MT system and compare BLEU scores
• Use POS information in more complicated alignment methods
• Use other factors
JHU CLSP Summer Workshop 2006 Team Presentation
Experimental Results for Confusion Network Decoding
Richard Zens, Nicola Bertoldi, Marcello Federico, Wade Shen
IWSLT Task
• Chinese–English; domain: phrase-book entries
• Corpus statistics:

                   Chinese   English
  sentences        40 K
  running words    351 K     365 K
  vocabulary       11 K      10 K

• Confusion network statistics (489 sentences):

                          read speech   spontaneous speech
  avg. length             17.2          17.4
  avg. / max. depth       2.2 / 92      2.9 / 82
  avg. number of paths    10^21         10^32

• No development data for confusion networks
Results for IWSLT
• Phrase table provided by MIT/LL; competitive baseline results
• Results:

                        read speech   spontaneous speech
                        BLEU [%]      BLEU [%]
  verbatim              21.4
  1-best from lattice   19.0          17.2
  1-best from CN        19.0          17.2
  full CN               19.3          17.8

• Improvements are statistically significant (89% confidence)
Other Ambiguous Input: Punctuation
• Chinese input does not contain punctuation
• Illustration:
  [Slide figure: the input "hello world" expanded into a confusion network with arcs inserting punctuation marks (",", "!", ".") at probabilities such as 0.9/0.1 and 0.7/0.2/0.1.]
• Results for verbatim input:

  punctuation input type   BLEU [%]
  1-best                   20.8
  confusion network        21.0

• Competitive performance without tuning → room for improvement
Truecasing
• Truecasing = restoring case information in lowercased text
• Common approach: the core MT system produces lowercase output; truecasing is done as a postprocessing step
• Application of factored translation models:
  1. translate lowercase
  2. generate truecased output (using a truecase LM)
• Results:

  Method       BLEU [%]
  two-step     18.9
  integrated   17.8

→ somewhat worse performance than a dedicated tool
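The two-step (postprocessing) variant can be approximated with a very small statistical truecaser. The sketch below is a toy version based on unigram frequencies only, nothing like a full truecase LM or the dedicated tool used for comparison above:

    from collections import Counter, defaultdict

    def train_truecaser(truecased_sentences):
        forms = defaultdict(Counter)
        for sent in truecased_sentences:
            for i, tok in enumerate(sent):
                if i > 0:                    # skip sentence-initial tokens: casing is forced there
                    forms[tok.lower()][tok] += 1
        return forms

    def truecase(lowercased_tokens, forms):
        out = []
        for i, tok in enumerate(lowercased_tokens):
            best = forms[tok].most_common(1)
            cased = best[0][0] if best else tok
            if i == 0 and cased.islower():   # capitalize sentence-initial fallback
                cased = cased[:1].upper() + cased[1:]
            out.append(cased)
        return out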
EPPS Task
• EPPS: European Parliament Plenary Sessions
• Spanish–English speech-to-speech translation task
• Corpus statistics:

                   Spanish   English
  sentences        1.2 M
  running words    31 M      30 M
  vocabulary       140 K     94 K

• Confusion network statistics:

                          dev         test
  sentences               2,633       1,071
  avg. length             10.6        23.6
  avg. / max. depth       2.8 / 165   2.7 / 136
  avg. number of paths    10^38       10^75
Results for EPPS Task

                   dev                   test
                   ASR-WER   BLEU       ASR-WER   BLEU
  1-best lattice   19.3      42.2       22.4      37.6
  1-best CN        21.7      40.3       23.3      36.7
  full CN           7.0      42.4        8.5      38.9

• Best result for test in previous work: 37.2 BLEU
• Compared with previous work on this task, we have:
  1. a stronger baseline,
  2. larger improvements, and
  3. much more efficient decoding (4x vs. 25x)
Note: all figures in percent.
Exploration of Confusion Networks
[Slide figure: average number of paths per sentence (log scale, 0.1 up to 1x10^10) plotted against path length (0-14), with three curves: CN total, CN explored, and 1-best explored.]
JHU CLSP Summer Workshop 2006 – Proposal for Follow-up Research
Exploiting Ambiguous Input in Statistical Machine Translation
Richard Zens
Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik 6, Computer Science Department, RWTH Aachen University, Germany
Motivation
• MT is often used in a pipeline, i.e. the input to the MT system is the output of another, imperfect NLP system, e.g.:
  – spoken language translation: ASR
  – segmentation: Chinese words, Arabic tokens
  – named entity recognition / translation
• Traditional approach: ignore the problem, i.e. translate the 1-best input
• Result of previous work: improvements if the ambiguity is taken into account
Previous Approaches
1. Confusion network decoding
   • advantages: efficiency; reordering is straightforward
   • problem: representing alternative segmentations
2. Lattice decoding
   • advantage: representing alternative segmentations
   • problem: reordering
Goal: ⇒ exploit the advantages of both approaches, but avoid their weaknesses
Generalized Confusion Networks
• Confusion networks: a linear sequence of nodes 0-1-2-3-4, with each edge covering exactly one position.
• Generalization:
  – add edges that cover multiple positions → representation of alternative segmentations
  – do not add nodes → retain efficiency and straightforward reordering
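One way to realize this is to keep the CN's column indexing but let an edge span several columns. The data-structure sketch below is our own illustration of the idea, not the actual implementation:

    from dataclasses import dataclass

    @dataclass
    class Edge:
        start: int    # first column covered
        end: int      # exclusive; end - start = number of positions covered
        word: str
        prob: float

    # ordinary CN: every edge covers exactly one column
    cn = [Edge(0, 1, "hello", 1.0), Edge(1, 2, "world", 0.9), Edge(1, 2, "word", 0.1)]

    # generalized CN: an extra multi-position edge encodes an alternative
    # segmentation without adding nodes, so coverage is still a bit-vector
    # over columns and the usual CN reordering applies unchanged
    gcn = cn + [Edge(0, 2, "helloworld", 0.05)]

    def edges_from(column, edges):
        # translation options that can extend a hypothesis at this column
        return [e for e in edges if e.start == column]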
Improved Reordering for Lattice Input
• A confusion network is an approximation of the lattice → valuable information might be lost → potential improvement when using lattices directly
• So far, only very local reordering on lattices:
  – skip 1 phrase [Zens & Bender+ 05]
  – switch positions of 2 or 3 phrases [Kumar & Byrne 05]
• Idea: generalize the reordering scheme used for CNs to lattice input → long-range reordering
Goals
• Improve robustness to imperfect input
• Investigate novel approaches:
  – generalized confusion networks
  – reordering strategies for lattice input
• Perform a systematic comparison in terms of MT quality and computational requirements
• Scalability → apply to tasks of different sizes: small: IWSLT; medium: EPPS/TC-Star; large: NIST/GALE
Targeted Applications
• Spoken language translation:
  – output of an ASR system
  – punctuation insertion / sentence boundary detection
  – disfluency detection
• Named entity recognition / translation
• Chinese word segmentation
• Arabic tokenization
References
[Kumar & Byrne 05] S. Kumar, W. Byrne: Local phrase reordering models for statistical machine translation. Proc. HLT/EMNLP, pp. 161–168, Vancouver, Canada, October 2005.
[Sadat & Habash 06] F. Sadat, N. Habash: Combination of Preprocessing Schemes for Statistical MT. Proc. COLING/ACL, pp. 1–8, Sydney, Australia, July 2006.
[Xu & Matusov+ 05] J. Xu, E. Matusov, R. Zens, H. Ney: Integrated Chinese Word Segmentation in Statistical Machine Translation. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 141–147, Pittsburgh, PA, October 2005.
[Zens & Bender+ 05] R. Zens, O. Bender, S. Hasan, S. Khadivi, E. Matusov, J. Xu, Y. Zhang, H. Ney: The RWTH Phrase-based Statistical Machine Translation System. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 155–162, Pittsburgh, PA, October 2005.
[Zens & Och+ 02] R. Zens, F.J. Och, H. Ney: Phrase-Based Statistical Machine Translation. In M. Jarke, J. Koehler, G. Lakemeyer, editors, 25th German Conf. on Artificial Intelligence (KI2002), Vol. 2479 of Lecture Notes in Artificial Intelligence (LNAI), pp. 18–32, Aachen, Germany, September 2002. Springer Verlag.