Open Source Toolkit for Statistical Machine Translation: Factored Translation Models and Lattice Decoding Final Presentation
Philipp Koehn, Marcello Federico, Wade Shen, Nicola Bertoldi, Chris Callison-Burch, Ondrej Bojar, Brooke Cowan, Chris Dyer, Hieu Hoang, Richard Zens, Alexandra Constantin, Evan Herbst, Christine Moran
17 August 2006
Schedule
• First session: Overview and toolkit development
– Factored models and confusion network decoding (Koehn, Federico)
– Moses toolkit (Hoang, Dyer, Herbst, Callison-Burch, Bertoldi)
• Second session: Experiments
– Experiments in small data settings (Shen, Bojar, Moran, Cowan)
– Factored models for morphologically rich languages (Dyer, Koehn, Cowan, Constantin)
– Confusion network experiments (Zens)
Accomplishments
• Open source toolkit
– advances the state of the art of statistical machine translation models
– best performance on the European Parliament task
– competitive on IWSLT and TC-STAR
• Factored models
– outperform traditional phrase-based models
– framework for a wide range of models
– integrated approach to morphology and syntax
• Confusion networks
– exploit ambiguous input and outperform 1-best
– enable an integrated approach to speech translation
Phrase-Based Translation
[Figure: phrase segmentation and alignment of "er geht ja nicht nach hause" → "he does not go home"]
• Foreign input is segmented into phrases
– any sequence of words, not necessarily linguistically motivated
• Each phrase is translated into English; phrases may be reordered
• Log-linear model: many feature functions h_i(e, f) with weights λ_i are combined into an overall score Σ_i λ_i h_i(e, f) → easy to extend
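As a minimal sketch (an illustration, not Moses code), the log-linear combination is just a weighted sum of feature function values:

// Minimal sketch: combine feature function values h_i(e, f) with
// weights lambda_i into an overall score.
#include <cstddef>
#include <iostream>
#include <vector>

double LogLinearScore(const std::vector<double>& h,        // h_i(e, f)
                      const std::vector<double>& lambda) { // weights
    double score = 0.0;
    for (std::size_t i = 0; i < h.size(); ++i)
        score += lambda[i] * h[i];   // sum_i lambda_i * h_i(e, f)
    return score;
}

int main() {
    std::vector<double> h = {0.5, -1.2, 2.0};        // hypothetical feature values
    std::vector<double> lambda = {1.0, 0.5, 0.2};    // hypothetical weights
    std::cout << LogLinearScore(h, lambda) << "\n";  // prints 0.3
}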
Translation
• Task: translate this sentence from German into English:
er geht ja nicht nach hause
Translation step 1
• Task: translate this sentence from German into English
• Pick phrase in input, translate: er → he
Translation step 2
• Pick phrase in input, translate: ja nicht → does not
– it is allowed to pick words out of sequence (reordering)
– phrases may have multiple words: many-to-many translation
Translation step 3
• Pick phrase in input, translate: geht → go
Translation step 4
• Pick phrase in input, translate: nach hause → home
• Result: er geht ja nicht nach hause → he does not go home
Translation options
[Table: phrase translation options for each input word and phrase, e.g. er → he / it; geht → is / are / goes / go; ja → yes / , of course; nicht → not / do not / does not / is not; nach → after / to / according to / in; hause → house / home / chamber / at home; nach hause → home / return home / ...]
• Phrase translation tables provide many translation options
• Learned from automatically word-aligned corpora
Translation options
[Table: the same translation options as above]
• The machine translation decoder does not know the right answer
→ search problem, solved by heuristic beam search
Decoding process: precompute translation options
[Figure: input sentence "er geht ja nicht nach hause" with its translation options]
Decoding process: start with initial hypothesis
[Figure: empty initial hypothesis covering no input words]
Decoding process: hypothesis expansion
[Figures: hypotheses expanded step by step, e.g. he / are / it, then goes, does not, go, home, ...]
Decoding process: find best path
[Figure: best-scoring path through the expanded hypotheses]
Statistical machine translation today
• Best performing methods are based on surface word phrases
– use mappings of short chunks of text (mostly 1–3 words)
– sophisticated methods for phrase extraction and modeling (EM algorithm, generative models, discriminative training)
• Translation is based solely on surface forms of words
– no use of explicit syntactic information
– no use of morphological information
• How can we build richer models?
One motivation: morphology
• Current models treat house and houses as completely different words
– training occurrences of house have no effect on learning the translation of houses
– if we only see house, we do not know how to translate houses
– rich morphology (German, Arabic, Finnish, Czech, ...) → many word forms
• A better approach combines evidence for house and houses
– analyze surface word forms into lemma and morphology, e.g.: Haus +plural
– translate lemma and morphology separately, e.g.: Haus → house; +pl → +pl
– generate target surface form, e.g.: house +pl → houses
Factored translation models
• Factored representation of words:
Input: word, lemma, part-of-speech, morphology, word class, ...
Output: word, lemma, part-of-speech, morphology, word class, ...
• Benefits
– generalization, e.g. by translating lemmas, not surface forms
– richer models, e.g. using syntax for reordering and language modeling
Example factored model
• Our example as a factored model: input and output are factored into word, lemma, and morphology
• Translation process broken up into mapping steps
– translation of lemma
– translation of morphology
– generation of word from lemma and morphology
Expansion of input phrase
• Probabilistic mapping steps
– translation step: lemma → lemma, e.g. haus → house, home, chamber, ...
– translation step: morphology → morphology, e.g. single-noun → single-noun, single-pronoun, plural-noun, ...
– generation step: lemma, morphology → word, e.g. house, single-noun → house; house, plural-noun → houses
• Still a phrase model
– translation steps may map phrases: nach hause → home, return home
– generation steps operate on single words
– traditional phrase models are a special case: single-factor models
Computational complexity of mapping steps
• Number of factored expansions may grow exponentially
• Key insights to reduce complexity for a given input sentence:
– expansions can be pre-computed and stored as translation options
– pruning translation options early
• Future work: problems with more complex models need to be addressed
– we had problems using some models with three steps or more
– see student proposals (Hoang, Dyer) for solutions
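A minimal sketch of why the expansions grow (the per-factor option lists below are hypothetical):

// Minimal sketch, not Moses internals: the factored expansions of one
// input word are the cartesian product of its per-factor options.
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> lemmas = {"house", "home", "chamber"};
    std::vector<std::string> morphs = {"single-noun", "plural-noun"};

    // 3 x 2 = 6 expansions; with more factors and longer option lists
    // this grows exponentially, hence pruning translation options early.
    for (const std::string& l : lemmas)
        for (const std::string& m : morphs)
            std::cout << l << "|" << m << "\n";   // e.g. house|plural-noun
}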
Spoken Language Translation with Confusion Networks
Marcello Federico, Nicola Bertoldi, Wade Shen, Richard Zens
August 17, 2006
Outline
• Spoken language translation
• Approaches to SLT
• Confusion network decoding
• Computational issues
• Implementation in Moses
• Language model interface
• Other applications of confusion networks
Spoken Language Translation
Translation from speech input is likely more difficult than translation from text input:
• many styles and genres: formal read speech, unplanned speeches, interviews, spontaneous conversations, ...
• less controlled language: relaxed syntax, spontaneous speech phenomena
• automatic speech recognition is prone to errors: possible corruption of syntax and meaning
This work addresses methods to improve the performance of spoken language translation by better integrating speech recognition and machine translation models.
Integrating Speech Recognition and Translation
• Correlation between transcription word error rate and translation quality:
[Plot: BLEU score (38.5–42.5) vs. WER of transcriptions (14–21); BLEU degrades as WER grows]
• Better transcriptions may have been analyzed during ASR decoding but discarded due to lower scores
• Potential for improving translation quality by exploiting more transcription hypotheses generated during ASR.
Statistical Spoken Language Translation
• Let o be the spoken input in the foreign language
• Let F(o) be a set of possible transcriptions of o
Goal: find the best English translation through the approximate criterion:
e* = arg max_e Pr(e | o) ≈ arg max_e max_{f∈F(o)} Pr(e, f | o)
Pr(e, f | o) is computed with a log-linear model incorporating:
• acoustic features, i.e. probabilities that some foreign words are in the input
• linguistic features, i.e. probabilities of foreign and English sentences
• translation features, i.e. probabilities of translating foreign phrases into English
• alignment features, i.e. probabilities for word reordering
[Figure: example ASR word graph]
ASR Word Graph
A very general set of transcriptions F(o) can be represented by a word graph:
• directly computed from the ASR word lattice (e.g. HTK format, lattice-tool)
• provides a good representation of all hypotheses analyzed by the ASR system
• arcs are labeled with words, acoustic and language model probabilities
• paths correspond to transcription hypotheses, for which probabilities can be computed
Approaches to Spoken Language Translation
The previous statistical framework admits several alternative implementations:
• 1-best translation: translate only the most probable hypothesis in the word graph
– pros: very efficient
– cons: no potential to recover from recognition errors in the 1-best transcription
• N-best translation: translate only the N most probable hypotheses in the word graph
– pros: can exploit more accurate transcriptions in the word graph
– cons: N must be large in order to include good transcriptions, and decoding time increases linearly with N
Approaches to Spoken Language Translation
• Transducer: compose the word graph with a translation FSN and apply a transducer algorithm
– pros: straightforward method that permits working on the full word graph
– cons: computationally prohibitive with large-vocabulary tasks and long-range word reordering
• Confusion network: translate a suitable approximation of the word graph
– pros: effectively explores all paths in the word graph, with no problems with reordering
– cons: can only exploit limited information in the word graph
Confusion Network (Mangu 1999)
A confusion network approximates a word graph with a linear network, such that:
• arcs are labeled with words or with the empty word (ε-word)
• arcs are weighted with word posterior probabilities
[Figure: example confusion network, a linear sequence of columns whose arcs are words or ε, weighted by posterior probabilities]
• paths are a superset of those in the word graph
CNs can be conveniently represented as sequences of columns of different depth.
Confusion Network Decoding
Extension of the basic phrase-based decoding step:
• cover some not-yet-covered consecutive columns (span)
• retrieve phrase translations for all paths inside the columns
• compute translation, distortion and target language model scores
Example. Coverage set: 01110...  Path: cancello d'
[Figure: confusion network columns, e.g. era 0.997 / è 0.002 / ε 0.001 | cancello 0.995 / vacanza 0.004 / ε 0.002 | ε 0.999 / la 0.001 | di 0.615 / d' 0.376 / all' 0.005 / l' 0.002 / ε 0.001 | imbarco 0.999 / bar 0.001 | ...]
Confusion Network Decoding
Computational issues:
• the number of paths grows exponentially with the span length
• implies look-up of translations for a huge number of source phrases
• factored models require considering joint translations over all factors (tuples):
– cartesian product of all translations of each single factor
Solutions implemented in Moses:
• source entries of the phrase table are stored with prefix trees
• translations of all possible coverage sets are pre-fetched from disk
• efficiency achieved by incrementally pre-fetching over the span length
• phrase translations over all factors are extracted independently, then translation tuples are generated and pruned by adding a factor each time
Once translation tuples are generated, the usual decoding applies.
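A minimal sketch of the prefix-tree idea (an assumption, not the Moses phrase table): looking up a span of length n+1 extends the node already found for its length-n prefix, which is what makes incremental pre-fetching over the span length cheap.

// Minimal sketch: a prefix tree over source phrases.
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    std::map<std::string, std::unique_ptr<TrieNode> > children;
    std::vector<std::string> translations;  // payload for complete phrases
};

// Extend a previously fetched node by one more source word; returns
// nullptr if no stored phrase continues this way.
TrieNode* Extend(TrieNode* node, const std::string& word) {
    auto it = node->children.find(word);
    return it == node->children.end() ? nullptr : it->second.get();
}

int main() {
    TrieNode root;
    root.children["nach"].reset(new TrieNode);
    TrieNode* nach = root.children["nach"].get();
    nach->children["hause"].reset(new TrieNode);
    nach->children["hause"]->translations.push_back("home");

    TrieNode* n = Extend(&root, "nach");   // fetch span "nach"
    if (n) n = Extend(n, "hause");         // incrementally extend the span
    return (n && !n->translations.empty()) ? 0 : 1;
}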
Implementation into Moses
• Input format: CN input can be rather large, so it is better to put one word position per line:
Haus 0.1 aus 0.4 Aus 0.4 eps 0.1
der 0.9 eps 0.1
Zeitung 1.0
Each line represents the alternatives with their probabilities.
• Factored confusion networks: alternatives are over the full factor space:
Haus|N 0.1 aus|PREP 0.4 Aus|N 0.4 eps|eps 0.1
der|DET 0.1 der|PREP 0.8 eps|eps 0.1
Zeitung|N 1.0
Notice: a confusion network can be projected over single factors.
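A minimal sketch (an assumption, not the actual Moses reader) of parsing this one-word-position-per-line format:

// Minimal sketch: read one confusion network column per input line,
// with alternating word/probability tokens.
#include <iostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<std::pair<std::string, float> >;

int main() {
    std::vector<Column> cn;
    std::string line;
    while (std::getline(std::cin, line)) {
        std::istringstream in(line);
        Column col;
        std::string word;
        float prob;
        while (in >> word >> prob)       // e.g. "Haus 0.1 aus 0.4 ..."
            col.push_back(std::make_pair(word, prob));
        if (!col.empty()) cn.push_back(col);
    }
    std::cout << "read " << cn.size() << " columns\n";
}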
Implementation into Moses
Decoding CN with factored models:
• at each step of the search process, a portion of the CN is explored, and translations are looked up for each factor
Features:
• efficiency by pre-filtering the possible translations for each factor
• decoding of confusion networks is completely hidden from the decoder
Other Applications of Confusion Networks
Translation tasks with ambiguous input:
• linguistic annotation for factored models
– avoid hard decisions by linguistic tools; rather, provide alternative annotations with respective scores
– e.g. particularly ambiguous part-of-speech tags
• insertion of punctuation marks missing in the input
– model all possible insertions of punctuation marks in the input
• translation of input similar to that produced by speech recognition
– e.g. OCR output for optical text translation
• ...
Language Model Interface
• Features
– compact binary format for very large language models
– quantization of probabilities (8 bits)
– fast upload of language model from disk
– upload of n-grams on demand
• Comparison with SRI LM Toolkit
– memory: 50% less with large quantized models
– speed: 10% slower in decoding with a 3-gram LM
• Recent work and improvements
– speed-up by directly storing log-probs
– addition of cache memory on the n-gram internal data structure
– analysis of LM score computations by the search algorithm
– caching of probabilities and LM states: the search algorithm requests the same probabilities many times
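A minimal sketch of the probability-caching idea (an assumption, not the IRST LM code): memoize n-gram log-probabilities so repeated requests skip the expensive lookup.

// Minimal sketch: cache n-gram log-probabilities; LookupOnDemand()
// stands in for the real binary-format, on-demand lookup.
#include <string>
#include <unordered_map>

class CachedLM {
public:
    double LogProb(const std::string& ngram) {
        auto it = cache_.find(ngram);
        if (it != cache_.end()) return it->second;  // cache hit
        double lp = LookupOnDemand(ngram);          // expensive path
        cache_[ngram] = lp;
        return lp;
    }
private:
    double LookupOnDemand(const std::string& ngram) {
        return -1.0;  // hypothetical stand-in for the real lookup
    }
    std::unordered_map<std::string, double> cache_;
};

int main() {
    CachedLM lm;
    double a = lm.LogProb("he goes home");  // miss: looked up and cached
    double b = lm.LogProb("he goes home");  // hit: served from the cache
    return a == b ? 0 : 1;
}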
Requests of N-grams by Decoder
[Plot: requests of 3-gram probabilities during the decoding of a single sentence; about 1.6M requests involving about 120K distinct 3-grams]
Conclusions
Implementation work
• efficient on-demand pre-fetching of phrase translations
• tuning of parameters for confusion network decoding
• language model interface and pre-fetching of n-grams
Development of state-of-the-art baselines for SLT
• IWSLT BTEC Chinese-English SLT: submissions to the IWSLT 2006 evaluation
• EPPS Spanish-English SLT: performance comparable with the best TC-STAR systems
Achievements
• SLT decoder more efficient with respect to the current implementations by IRST and MIT/LL
• works with large-data tasks and large confusion networks
• works with factored confusion networks
Engineering Results JHUSWS 2006
Open software, so what?
State of the world, June 2006
• "Black box" decoder (Pharaoh) widely used
– 20+ citations in this year's ACL Proceedings alone
– ubiquitous baseline system
• But... it is difficult to extend
– new features are limited to what can be expressed in the existing phrase-table format (source, target, feature vector)
– many interesting projects require reinventing the wheel just to change one spoke
Software Goals
• Accessibility
• Easy to maintain
• Flexibility
• Easy for distributed team development
• Portability
Accessibility
• Easy to read: "Nothing should be a black box"
– descriptive names
– uniform coding style
void Load(const std::string &fileName,
          FactorCollection &factorCollection,
          FactorType factorType,
          float weight,
          size_t nGramOrder);
• Available immediately
– source code on Sourceforge.net
• Cross-platform compatibility
– Windows, Linux, MacOS X, 64-bit OS
Easy to Maintain
• Modular code
– team development
– object-oriented framework
• Integrated documentation framework
– using Doxygen
– easy-to-maintain Wiki documentation on the Web
Documentation
[Screenshots of the documentation]
Extensibility
• Open architecture designed for extensibility
– architecture matches theoretical descriptions of phrase-based MT models
– feature function evaluation decoupled from search algorithms
• Short ramp-up time for researchers familiar with SMT but not with any particular decoder
– facilitates experimentation with new classes of feature functions
• Modular design
– framework to allow different replacements of all parts of the decoder
– multiple implementations of translation tables, language models, and other model types
Case Study: Lexicalized Reordering
• Very successful model, but implementation not possible with a "black box" decoder
• With Moses, anyone with an idea can try it
• Adding support for LR models to Moses required code changes in four (relatively logical) locations:
– extend the feature-function base class (ScoreProducer) and implement the logic for feature value computation
– enable the model based on configuration
– call to evaluate the feature function when extending a hypothesis
– add the feature values to the n-best list output for tuning algorithms
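A minimal sketch of the first change, with hypothetical class names standing in for the actual ScoreProducer interface:

// Minimal sketch: the shape of a feature function the decoder calls
// when extending a hypothesis. Names are hypothetical stand-ins.
struct Hypothesis;  // opaque decoder state

class FeatureFunction {                 // stand-in for ScoreProducer
public:
    virtual ~FeatureFunction() {}
    virtual float Evaluate(const Hypothesis& prev,
                           const Hypothesis& next) const = 0;
};

class LexicalizedReordering : public FeatureFunction {
public:
    float Evaluate(const Hypothesis& prev,
                   const Hypothesis& next) const {
        // Score the orientation (monotone/swap/discontinuous) of the
        // phrase pair being applied; computation omitted in this sketch.
        return 0.0f;
    }
};

int main() {
    LexicalizedReordering lr;  // registered with the decoder in real code
    (void)lr;
    return 0;
}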
Regression Testing
• Pharaoh scores used as a baseline, updated as models changed (for example, hypothesis recombination based on LM state rather than n-gram order)
• Detailed logging enables strict test coverage for all model types
• Regression test suite was run approximately 3000 times during the workshop
Accomplishments
• Code contributions from every member of the team
• Performance improvements
– Day 1: 5.01 sec/sentence average decoding time
– Today: 1.43 sec/sentence average decoding time
Summary
State of the world, August 2006
• "White box" multi-factored decoder (Moses) available
– drop-in replacement for Pharaoh
• Further experimentation and development anticipated at:
– Aachen, Charles University, Cornell, Edinburgh, IRST, MIT, Lincoln Labs, UMD... and many more.
Code size
• 10,000 lines at the beginning of the workshop
• 16,000 lines now
System tuning
• Log-linear model:
e* = arg max_e Pr(e | f) = arg max_e p_λ(e | f) = arg max_e Σ_i λ_i h_i(e, f)    (1)
• Real-valued feature functions:
– model specific components of the translation process: fluency, adequacy, reordering, ...
– statistical models are estimated on specific training data
• Feature weights:
– balance the ranges of feature scores
– weight the importance of features
– tuned through Minimum Error Training (MET)
Minimum Error Training
• automatic procedure to optimize feature weights
• minimization of translation errors
• development set (f, ref)
• automatic error function Err(e; ref): (100 − BLEU) score
e*(λ) = arg max_e p_λ(e | f)    (2)
λ* = arg min_λ Err(e*(λ); ref)    (3)
• Err(e) is not mathematically well-behaved ⇒ no exact solution
• approximate iterative algorithms: gradient descent, downhill simplex
CLSP-WS solution for MET
[Diagram: the outer loop runs Moses on the input with the current weights and extracts n-best lists; the inner loop runs the Scorer and Optimizer over the n-best lists against the reference to produce optimal weights]
• outer loop:
– decoding with the actual lambdas
– generation of n-best translations
– addition to the previously generated translations
• inner loop:
– optimization over n-best lists
– decoder weights and "random" weights as initial points
• optimizer:
– iterative optimization of single weights
– discretization of the r-dimensional space of weights
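A minimal sketch of the optimizer step (an assumption, not the workshop code): one weight at a time is searched over a discretized axis; Error() is a toy stand-in for re-ranking the n-best list and computing 100 − BLEU.

// Minimal sketch: coordinate-wise search over a discretized weight space.
#include <cstddef>
#include <vector>

// Toy stand-in: real MET would re-rank the n-best lists with these
// weights and return 100 - BLEU of the resulting 1-best translations.
double Error(const std::vector<double>& lambda) {
    double e = 0.0;
    for (std::size_t i = 0; i < lambda.size(); ++i)
        e += (lambda[i] - 0.3) * (lambda[i] - 0.3);
    return e;
}

std::vector<double> Optimize(std::vector<double> lambda, int iterations) {
    for (int it = 0; it < iterations; ++it)
        for (std::size_t i = 0; i < lambda.size(); ++i) {
            double bestErr = Error(lambda), bestW = lambda[i];
            for (double w = -1.0; w <= 1.0; w += 0.05) {  // discretized axis
                lambda[i] = w;
                double err = Error(lambda);
                if (err < bestErr) { bestErr = err; bestW = w; }
            }
            lambda[i] = bestW;  // keep the best value for this weight
        }
    return lambda;
}

int main() {
    std::vector<double> lambda(4, 0.0);
    lambda = Optimize(lambda, 5);
    return 0;
}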
MET vs. size of n-best list
• German-English EuroParl task
• tuning on a dev set of 2000 sentences
• evaluation on a test set of 2000 sentences
[Plot: BLEU (18–26) vs. iteration (0–14) for 100/200/400/800-best lists]
• convergence in 5–6 iterations: good, faster outer loop
• no impact of the size of the n-best list: good, faster inner loop
MET vs. size of development set
• extraction of 4 subsets: 100, 200, 400, 800 sentences
[Plot: BLEU vs. iteration for dev sets of 100/200/400/800/2000 sentences]
• larger dev set:
– more stable results
– fewer iterations
– better results
• bad:
– overfitting on small dev sets
– large dev set → slower outer loop (decoding)

sentences   iterations   BLEU
100         18           24.3
200         15           25.1
400         16           24.6
800         14           24.9
2000         9           25.3
MET vs. optimization algorithm
• task: Spanish-English EPPS, speech input
• dev set of 2643 confusion networks, test set of 1073 CNs
• CLSP-WS algorithm vs. downhill simplex (RWTH)

algorithm           iterations   Δ BLEU dev   Δ BLEU test
CLSP-WS             4            +1.0         +0.4
downhill simplex    7            +2.9         +3.4

• mismatch between the internal score of the CLSP-WS algorithm and the official score
• better performance of the downhill simplex algorithm
• post-workshop investigation
Moses in parallel
• effective R&D cycle: fast experiments
• computing facilities: 6 clusters, 200 machines
• parallelization of translation on a (remote) cluster of machines
• 'split and merge' technique: a Splitter divides the source input into parts 1..N, one Moses instance translates each part, and a Merger reassembles the translations
• translation time:
– splitting/merging ≈ constant, negligible
– access to the cluster related to cluster load
– loading data ≈ constant
– decoding ∝ input length
Moses in parallel
• Spanish-English EuroParl task
• CLSP cluster, 18 machines
• no control of cluster load

Average time (seconds):
                 standard   1 job   5 jobs   10 jobs   20 jobs
10 sentences     6.3        13.1    9.0      9.0       –
100 sentences    5.2        5.6     3.0      1.7       1.7
1000 sentences   6.3        6.5     2.0      1.6       1.1
Decoder Output Analysis
Evan Herbst
8/17/06
Measurables
• Difficulty
– perplexity
• Error
– WER
– PWER
– BLEU
– confidence intervals
• Significance
– t-test
– sign test
Definition: Perplexity
Measures the likelihood of a corpus given a model (e.g. a language model):
PPX = 2^(−(1/N) Σ_i log2 p_LM(w_i)),  w_i words
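A minimal sketch of the computation, with made-up per-word probabilities:

// Minimal sketch: corpus perplexity from per-word LM probabilities,
// following the definition above.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical probabilities p_LM(w_i) for a 4-word corpus.
    std::vector<double> p = {0.10, 0.25, 0.05, 0.20};
    double logSum = 0.0;
    for (double pi : p) logSum += std::log2(pi);
    double ppx = std::pow(2.0, -logSum / p.size());
    std::cout << "perplexity = " << ppx << "\n";
}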
Definition: WER
Word Error Rate: modified edit distance
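A minimal sketch of one common formulation (the slide does not spell out the exact variant): word-level edit distance normalized by the reference length.

// Minimal sketch: WER as edit distance between hypothesis and reference
// word sequences, divided by the reference length.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

double WER(const std::vector<std::string>& hyp,
           const std::vector<std::string>& ref) {
    const std::size_t H = hyp.size(), R = ref.size();
    std::vector<std::vector<std::size_t> > d(H + 1, std::vector<std::size_t>(R + 1));
    for (std::size_t i = 0; i <= H; ++i) d[i][0] = i;   // deletions
    for (std::size_t j = 0; j <= R; ++j) d[0][j] = j;   // insertions
    for (std::size_t i = 1; i <= H; ++i)
        for (std::size_t j = 1; j <= R; ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,
                                d[i][j - 1] + 1,
                                d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])});
    return static_cast<double>(d[H][R]) / R;
}

int main() {
    std::cout << WER({"he", "goes", "home"},
                     {"he", "does", "not", "go", "home"}) << "\n";  // 0.6
}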
Definition: PWER
Position-independent Word Error Rate: match bags of words
Definition: BLEU
BiLingual Evaluation Understudy: n-gram precision and length comparison
Numbers
Dataset: 2000-sentence Europarl subset

                pharaoh                                    moses baseline
Linguae →       de-en                en-de                 de-en                en-de
BLEU            .2557                .1775                 .2554                .1776
WER             .5432                .6144                 .5428                .6145
PWER/WER        .865                 .940                  .865                 .947
Lemma BLEU      .2625                .2170                 .2622                .2180
N-gram Prec.    .609/.315/.188/.119  .519/.223/.122/.070   .609/.314/.188/.119  .519/.223/.122/.070
Perplexity      40.97                62.01                 40.94                61.77
Ref Perplex.    68.81                125.29                68.81                125.29

Inferences
• lemmas vs. surface: morphology
• output vs. reference perplexity: fluency
• PWER/WER ratio: reordering; phrase tables
Tool: Comparison
Tool: Alignment
Suffix Arrays for More Statistics (and Less Disk Space!)
Chris Callison-Burch
August 17, 2006
Phrase Tables in Statistical Machine Translation
• Using longer phrases leads to better translation quality
• Phrase tables can get unwieldy with long phrases
• The problem of large tables is compounded for factored translation models
Phrase Tables in Factored Translation Models
• Translation tables between source and target phrases, POS tags, stems, morphological markers, etc.
• Plus generation tables
• Want longer sequences for factors with smaller tag sets
• The number of tables depends on the number of conditioning variables and on back-off strategies
• Potentially more tables than all pairwise combinations of factors
Ad Hoc Solutions
• Limit the length of phrases
• Only extract phrases for the test data
• Make unnecessary independence assumptions
Proposed Solution: Intelligent Data Structure
• Uses less memory than table-based data structures
• Allows us to condition on whatever factors we want and easily back off
• Retrieve translation/generation probabilities for arbitrarily long sequences
• Suffix arrays to index the parallel corpus
How Suffix Arrays Work
Corpus (word indices 0–9):
Spain declined to confirm that Spain declined to aid Morocco

Initialized, unsorted suffix array (suffixes denoted by s[i]):
s[0] 0   Spain declined to confirm that Spain declined to aid Morocco
s[1] 1   declined to confirm that Spain declined to aid Morocco
s[2] 2   to confirm that Spain declined to aid Morocco
s[3] 3   confirm that Spain declined to aid Morocco
s[4] 4   that Spain declined to aid Morocco
s[5] 5   Spain declined to aid Morocco
s[6] 6   declined to aid Morocco
s[7] 7   to aid Morocco
s[8] 8   aid Morocco
s[9] 9   Morocco
Alphabetically Sorted
Sorted suffix array (suffixes denoted by s[i]):
s[0] 8   aid Morocco
s[1] 3   confirm that Spain declined to aid Morocco
s[2] 6   declined to aid Morocco
s[3] 1   declined to confirm that Spain declined to aid Morocco
s[4] 9   Morocco
s[5] 5   Spain declined to aid Morocco
s[6] 0   Spain declined to confirm that Spain declined to aid Morocco
s[7] 4   that Spain declined to aid Morocco
s[8] 7   to aid Morocco
s[9] 2   to confirm that Spain declined to aid Morocco
(Reasonably) Fast Find
[Same sorted suffix array as above: all occurrences of a phrase can be located quickly because the suffixes beginning with it form a contiguous block in the sorted order]
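A minimal sketch of the idea (an assumption, not the workshop implementation): build the sorted suffix array over a tokenized corpus, then count phrase occurrences by binary search.

// Minimal sketch: suffix array construction and phrase counting.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> corpus = {
        "Spain", "declined", "to", "confirm", "that",
        "Spain", "declined", "to", "aid", "Morocco"};

    // Initialize s[i] = i, then sort the indices by the suffix they start.
    std::vector<int> sa(corpus.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = static_cast<int>(i);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return std::lexicographical_compare(corpus.begin() + a, corpus.end(),
                                            corpus.begin() + b, corpus.end());
    });

    // Count a phrase: its occurrences form a contiguous block of suffixes.
    std::vector<std::string> phrase = {"Spain", "declined"};
    auto lo = std::lower_bound(
        sa.begin(), sa.end(), phrase,
        [&](int s, const std::vector<std::string>& p) {
            return std::lexicographical_compare(corpus.begin() + s, corpus.end(),
                                                p.begin(), p.end());
        });
    std::size_t count = 0;
    for (auto it = lo; it != sa.end(); ++it) {
        if (static_cast<std::size_t>(*it) + phrase.size() > corpus.size() ||
            !std::equal(phrase.begin(), phrase.end(), corpus.begin() + *it))
            break;
        ++count;
    }
    std::cout << "occurrences of \"Spain declined\": " << count << "\n";  // 2
}

Per-factor arrays (words, POS, stems) would each be indexed this way, with probabilities computed from the counts at query time.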
Applied to Factored Translation Models
Factored corpus (word indices 0–9):
words:  Spain  declined  to  confirm  that  Spain  declined  to  aid  Morocco
POS:    NNP    VBD       TO  VB       IN    NNP    VBN       TO  VB   NNP
stems:  spain  declin    to  confirm  that  spain  declin    to  aid  morocco
• Index each factor
• Store word-level alignments
• Calculate probabilities on the fly
Generation Probabilities
[The factored corpus and word-level sorted suffix array from above]
p(NNP VBN | Spain declined) = 0.5
p(NNP VBD | Spain declined) = 0.5
Generation Probabilities
Sorted suffix array over the POS factor (suffixes denoted by s[i]):
s[0] 4   IN NNP VBN TO VB NNP
s[1] 9   NNP
s[2] 0   NNP VBD TO VB IN NNP VBN TO VB NNP
s[3] 5   NNP VBN TO VB NNP
s[4] 2   TO VB IN NNP VBN TO VB NNP
s[5] 7   TO VB NNP
s[6] 3   VB IN NNP VBN TO VB NNP
s[7] 8   VB NNP
s[8] 1   VBD TO VB IN NNP VBN TO VB NNP
s[9] 6   VBN TO VB NNP

p(Spain | NNP) = 0.66666
p(Morocco | NNP) = 0.33333
Translation Probabilities
[The factored corpus and word-level sorted suffix array from above, aligned with the French sentence:]
Spain declined to confirm that Spain declined to aid Morocco
L' Espagne a refusé de confirmer que l' Espagne avait refusé d' aider le Maroc
p(L'Espagne a refusé de | Spain declined) = 0.5
p(l'Espagne avait refusé d' | Spain declined) = 0.5
Translation Probabilities
[The factored corpus and POS-factor sorted suffix array from above, aligned with the French sentence:]
Spain declined to confirm that Spain declined to aid Morocco
L' Espagne a refusé de confirmer que l' Espagne avait refusé d' aider le Maroc
p(l'Espagne avait refusé d' | Spain declined, NNP VBN) = 1
Advantages
• Memory reduction
– memory = 2 × num factors × corpus + word alignments
– significantly less than phrase tables!
• Greater range of statistics
– arbitrary number of conditioning variables
– allows a range of back-off strategies
• Can extract statistics for arbitrarily long sequences
Research to be Undertaken
• Integrate into the Moses decoder
• Deal with the increased computational complexity
• Change search strategies to incorporate longer factor sequences of different levels of granularity
• Experiment to test whether longer sequences improve translation quality
• Experiment with which variables to condition upon and how to back off
Factored Translation Models for Small Data Problems
Experiments with Spanish, Czech and Chinese
Wade Shen, Brooke Cowan, Ondřej Bojar and Christine Moran
MIT Lincoln + Computer Science AI Labs, Charles University
8/14/2006
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
General Motivations: Challenges with Small Data
• Phrase-based MT relies on large data
– learn "phrase" co-occurrence within a language
– learn translation templates/phrases across languages
• Problems for phrase-based MT with small data
– word alignment
– hard to see enough phrases (coverage) → especially in morphologically rich languages
– tendency to rely on shorter phrases → increased local agreement problems → increased long-distance coherence problems
Possible Advantages of Factored Models: Generalization over Morphology
• We can model morphological variation and phrase translation separately for better statistics: translation + generation
– Spanish gender:
English "he is a red player" → Spanish "Él es un jugador rojo" (morphology: m 3p+sing m m m)
English "she is a red player" → Spanish "Ella es una jugadora roja" (morphology: f 3p+sing f f f)
shared lemmas: el ser un jugador roj
– Czech case:
English "black cats" (nominative + plural) → Czech "černé kočky" (morphology: nom+pl nom+pl)
English "black cats" (dative + plural) → Czech "černým kočkám" (morphology: dat+pl dat+pl)
shared lemmas: černá kočka
Factors as Type Checking: Long Range Phenomena and Divergence
• Long range dependencies can be modeled with latent factors
– Spanish: verb-subject number agreement (AGR)
Spanish: Mi hija de dos años tiene catarro (subject: 3p+sing, verb: 3p+sing)
Gloss: My daughter of two years has cold
Czech: Nachlazena je moje dvouletá dcera.
• Verb-argument dependencies (the verb selects the case of its argument)
Czech: Napsal zprávu o matčině domu na papír (verb: 3p+sing selects noun: accusative)
Gloss: He wrote a message about mother's house on a paper
Czech: Našel zprávu o matčině domu na papíře (verb selects noun: locative)
Gloss: He found a message about mother's house on a paper
Phrase-Level Generalization
• Class-based divergences
– Chinese-English resultative constructions (verb-specific, but a similar pattern holds for a large class of verbs)
Gloss: you made it hit broken done → English: you broke it
• Longer-distance movement dependencies
– Chinese-English questions
Chinese: 你 要 回 答 [clause...] 吗
Gloss: you want reply [clause...] y/n-marker
English: would you like to reply to [clause...] ?
Tags: VModal Pn ... Part (the y/n particle causes reordering)
Large vs. Small Data: How Generalizations May Affect SMT Performance
• With large data sets these phenomena can be learned
– language models should capture local agreement phenomena with enough data
– long-range agreement/coherence remains problematic
– generalization may still be better, but errors in analysis can limit it
• Generalization may be advantageous for small data
– for example (Spanish/Czech agreement): can't learn every noun/adjective/determiner triple
– this is the situation for many real-world problems
Outline
• Motivations
• Experimental Design and Baselines
– Approaches
– Data Sets
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Data Sets and Baselines

Data set        Direction         Size                         Baseline w/ diff. LMs (BLEU, surface)
Full Europarl   English→Spanish   950k LM train, 700k bitext   3g: 29.35; 4g: 29.57; 5g: 29.54
Euromini        English→Spanish   60k LM train, 40k bitext     3g: 23.41; 3g (950k): 25.10
Czech WSJ       English→Czech     20k LM train, 20k bitext     3g: 25.82 (four references)
IWSLT Chinese   Chinese→English   40k LM train, 40k bitext     4g: 19.54 (seven references)
Using Factored Models: Approaches for Small-Data Tasks
• Factored models we tried
– different levels of linguistic information modeled separately (example: morphology vs. phrasal content)
– feature "checking" of existing phrasal models with LMs on factors:
Good: "I would like some donuts" (POS: pn mod vb det np) → high likelihood
Bad: "I would like some big jump" (POS: pn mod vb det adj vb) → low likelihood
– generalized factor-based distortion: phrases are likely to move distance X if the preceding word is tag Y
• Hypothesis: these models allow better utilization of limited training data
Different Factored Approaches: Overview of Models Tried
• High-order language models

Analysis       Problems addressed        Model types
Supervised     Explicit agreement        LMs over verbs/subjects; LMs over noun/determiner/adjectives
Supervised     Long-distance coherence   LMs over POS
Unsupervised   Agreement/coherence       LMs over word classes

• Parallel translation models

Analysis       Problem types         Model types
Supervised     Explicit agreement    Parallel translation models over lemmas and morphology
Unsupervised   Agreement/coherence   Parallel translation models over word classes and surface
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
– Morphology and Agreement Features (Brooke)
– Parallel Lemma and Morphology Translation (Wade)
– Scaling to Larger Corpora (Wade)
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
Spanish Experiments: Language Models over Morphological Features
• NDA: noun/determiner/adjective agreement
– generate only on N, D and A tags (don't-cares elsewhere)
– N/D/A features: gender (masc, fem, common, none); number (sing, plural, invariable, none)
• VNP: verb/noun/preposition selection agreement
– generate on V, N or P
– V/N/P features: number (sing, plural, invariable, none); person (1p, 2p, 3p, none); prep-ID (preposition, none)
[Diagram: a surface model over words, plus generate-and-check models over latent nda and vpn factors]
Spanish Experiments: Skipped LMs for Agreement
Example. Source phrase "dio a la mujer" → target phrase "...gave the woman"
word: dio | a | la | mujer
nda:  X | X | s+f | s+f
vpn:  3+s | "a" | X | s
• Allow NULL factors to be generated
• Increase the effective context length to model longer-range dependencies
Spanish Agreement LMs: Experimental Results
• With skipping (EuroMini, baseline 23.41): NDA+Skip 24.03, VPN+Skip 24.16
• No skipping, LM counts don't-care positions (EuroMini, baseline 23.41): NDA 24.47, VPN 24.33, Both 24.54
• No skipping with all morphological features, with and without POS (EuroMini, baseline 23.41): Morph 24.66, Morph+POS 24.25
• All models beat the baseline
– skipping doesn't seem to help
– full morphology is best
Spanish Experiments: Parallel Lemma/Morphology Translation
[Diagram: Me → Mi; analysis splits the surface into lemma (I → Yo) and person+number+gender+case (1ps+Acc), generation recombines them]
• Factor surface forms into lemma and morphology features
• Translate both simultaneously
• Re-generate the target surface form
• Apply LMs on both surface and morphology features
• Results (EuroMini + 950k LM): baseline 25.10, lemma 25.71
Scaling Up to Large Training: POS Language Models
[Plot: POS-LM vs. baseline; BLEU (28–31, note the scale) vs. POS n-gram order (3g–9g) for the baseline, POS-LM, and full-tags models]
• Full training data → less/no gain from richer features
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
– Factored Word Alignment for Limited Data
– Rich Morphology and Tagged LMs
– Putting it Together: Parallel Translation
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Factors for Coping with Limited Data: Better Word Alignment for Czech
• Word alignment is difficult when data is limited and morphology is rich
– data: 20k bitext sentences, large vocabulary
– contrast set: 20k + 840k (out-of-domain) sentences
– task: English→Czech
• Two methods to deal with limited data: stem alignment and lemma alignment
• Contrastive behavior for small and large data:

Data set         Word-Word   Stem-Lemma   Stem-Stem
20k Czech        25.17       25.23        25.82
Large contrast   –           25.40        24.99
Czeching Rich Morphology with Tags: Tagged Czech Language Models
[Diagram: generation of the surface form kočky 'cat' is checked by an LM over morphological tags such as N+acc]
• Idea: use morphologically rich POS tag sequences to "czech" target output generation
• POS information configurations (baseline: 25.82):
– Full tags: feature 1, feature 2, ... (15 total); size 1098 tags; result 27.04
– CNG tags: case, number+gender on V, P, PP, N, A; size 707 tags; result 27.45
– CNG+VP: CNG features plus person+tense+aspect (verbs) and lemma+case (prepositions); size 899 tags; result 27.62
Comparing with Larger Data Models: Tagged Czech Language Models
• Large vs. small data:

Data set                          Baseline   CNG+VP   Relative improvement
20k Czech                         25.82      27.62    6.97%
Large contrast (20k + 840k OOD)   27.47      28.12    2.37%

• Tagged language models improve performance for small data significantly, approaching large-data performance
• The large task also improves (but much less: 2.37% vs. 6.97%)
Parallel Translation Models for Czech
• Motivation: factored LM models seem to lose number information
[Diagram: him → ho; surface, lemma (on) and POS tag + CNG features (3p+acc) are translated in parallel]
• Results:
– Surface → Surface + POS → POS+CNG: 25.94
– Surface → Lemma + POS → POS+CNG: 26.43
• Better than the baseline, but worse than both CNG and CNG+VP
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models (Christine)
– Lexical Distortion Models
– Factor-based Distortion
– Results
• Models for Sparse Statistics in Chinese
• Analysis and Conclusions
• Follow-on Research
Generalized Distortion Modeling: Introduction to Distortion
• For each phrase pair we learn its likely placement relative to the previous phrase
• Orientations:
– monotone: word alignment point on the top left
– swap: word alignment point on the top right
– discontinuous: not monotone or swap
[Diagram: source/target word-alignment grid illustrating the monotone, swap and discontinuous orientations]
• Examples
– la casa roja → the red house
– D NN ADJ → D ADJ NN
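A minimal sketch of classifying orientations from word-alignment points (the corner convention is an assumption based on the top-left/top-right description above):

// Minimal sketch, not the Moses implementation: orientation of a phrase
// pair from the alignment points adjacent to its top corners.
#include <iostream>
#include <set>
#include <utility>

enum Orientation { MONOTONE, SWAP, DISCONTINUOUS };

// alignment: (source index, target index) word-alignment points.
Orientation Classify(const std::set<std::pair<int, int> >& alignment,
                     int srcStart, int srcEnd, int tgtStart) {
    // Monotone: a point just above-left of the phrase block.
    if (alignment.count(std::make_pair(srcStart - 1, tgtStart - 1)))
        return MONOTONE;
    // Swap: a point just above-right of the phrase block.
    if (alignment.count(std::make_pair(srcEnd + 1, tgtStart - 1)))
        return SWAP;
    return DISCONTINUOUS;
}

int main() {
    // la casa roja -> the red house: la-the (0,0), casa-house (1,2), roja-red (2,1)
    std::set<std::pair<int, int> > a;
    a.insert(std::make_pair(0, 0));
    a.insert(std::make_pair(1, 2));
    a.insert(std::make_pair(2, 1));
    // Phrase pair casa -> house (source span [1,1], target starts at 2):
    std::cout << Classify(a, 1, 1, 2) << "\n";  // prints 1 (SWAP)
}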
Factor-based Distortion Models
• A factor-based extension of lexicalized distortion
– uses more general factors, e.g. POSf-POSe, lemma-lemma
• Can model longer-range dependencies
– more conditioning variables
• Motivating results
– hard-coding a few factor-based rules (e.g. swap nouns and adjectives when translating from English to Spanish) led to improvements (Gispert et al. 2006)
Factor-based Distortion: Spanish Experiments
• Lexicalized distortion only (Europarl):

Lang     Pharaoh   Moses
En→De    18.15     18.85
Es→En    31.46     32.37
En→Es    31.06     31.85

• Factor-based distortion on small data
– baseline (no lexicalized distortion)
– baseline lexicalized
– factored: POS-POS
– system combination: lexicalized + POS-POS
• Further experiments
– other factors
– minimizing model parameters
– combining different models
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Coping with Rich Morphological Constraints in Czech
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Conclusions and Follow-on Research
IWSLT Chinese: Experiments with Unsupervised Annotation
• Data: travel-domain sentences, limited vocabulary, short sentences
• Task: text and ASR translation, Chinese→English
• Can we use automatic word classes to learn general sequence constraints?
• First experiment: 2-gram word-class LMs of varying orders
Example. Source phrase: 总共 多少 钱 ? (word classes: c1 c22 c3 c55) → target phrase: How much is it?
[Diagram: separate models over the surface and class factors]
IWSLT Chinese: Alignment Templates for Translation
[Diagram: Me → Mi, I → Yo translated over surface and word-class factors, with generation]
• Second experiment: extend the class-based LM to the translation model
• Bigram word classes for source and target
• Translate alignment templates similar to (Och 98), plus surface forms
• Apply LMs to surface and class factors
IWSLT Chinese: Autoclass Results
[Plot: BLEU (18–22.5, note the scale) vs. class n-gram order (3g–9g) for the baseline, class-LM, and class translation+LM models]
• Class-LM significantly better (p = 0.05, ~1.0 BLEU)
• Class-Trans may be limited by the synchronous phrase-table constraint
– started to address this here, but not in time for the evaluation
Outline
• Motivations
• Experimental Design and Baselines
• Models for Agreement in Spanish
• Generalizing Lexical Distortion Models
• Models for Sparse Statistics in Chinese
• Coping with Rich Morphological Constraints in Czech
• Conclusions and Follow-on Research
Conclusions and Future Work
• The factored approach can help with small data
– large-data tasks may need different factored approaches
• MIT/LL + CSAIL
– continue experiments with morphology and coherence
– fully asynchronous factor translation
– apply the techniques to other languages; extend existing LCTL experiments
– syntax-driven reordering models (Brooke)
• Asynchronous factor translation (Hieu)
• Making use of verb subcategorization information (Ondřej)
Valency-Aware Machine Translation: Project Proposal
Ondřej Bojar
[email protected]
August 17, 2006
Overview
• JHU Workshop motivation and one of the results
• State-of-the-art MT errors
• Project goal
• Motivation: why Czech
• Proposed strategy and information sources
• Summary
Appendices: references, illustrations and further details on Czech and English
Workshop Motivation
• Statistical machine translation (SMT) into morphologically rich languages is more difficult than from them. See e.g. Koehn (2005).
• One of the workshop goals: examine the utility of factored translation models for translating into morphologically rich languages.
• There was room for improvement:
– regular BLEU, English→Czech: 25%
– BLEU of lemmatized MT against lemmatized references: 32%
⇒ Errors in morphology cause a large BLEU loss.
One of the Workshop Results
• Significant improvements gained on small data sets:
– English→Czech: 20k sentences, BLEU 25.82% to 27.62%, or up to 28.12% with additional out-of-domain parallel data
• Still far below the margin of lemmatized BLEU (35%)
• However, local agreement is already very good:
– microstudy of adjective-noun agreement: 74% correct, 2% mismatch, other: missing noun etc.
⇒ So where are the morphological errors?
Current English→Czech MT Errors
Microstudy of the current best MT output (BLEU 28.12%), intuitive metric:
• 15 sentences, 77 verb-modifier pairs in the source text examined:

Translation ...          Verb   Modifier
... preserves meaning    43%    79%
... is disrupted         14%    12%
... is missing           21%    6%

• But: when verb & modifier are correct, 44% of cases are non-grammatical or meaning-disturbing relations.
Sample Errors
Input: Keep on investing.
MT output: Pokračovalo investování. (grammar correct here!)
Gloss: Continued investing. (Meaning: The investing continued.)
Correct: Pokračujte v investování.
⇒ the language model misled us ⇒ need to include source valency information

Input: brokerage firms rushed out ads ...
MT output: brokerské firmy vyběhl reklamy
Gloss: brokerage firms(pl.fem) ran(sg.masc) ads(pl.nom, pl.acc, pl.voc, sg.gen)
Correct option 1: brokerské firmy vyběhly s reklamami(pl.instr)
Correct option 2: brokerské firmy vydaly reklamy(pl.acc)

Target-side data may be rich enough to learn: vyběhnout–s–instr
Not rich enough to learn all morphological and lexical variants: vyběhl–s–reklamou, vyběhla–s–reklamami, vyběhl–s–prohlášením, vyběhli–s–oznámením, ...
Valency-Aware Machine Translation
August 17, 2006
6
Project Goal
Improve MT output quality by using valency information.
Motivation: Why Czech
• Relevant properties: very rich morphological system and relatively free word order
• Well-established theory on syntax, and on valency in particular: Sgall, Hajičová, and Panevová (1986), Panevová (1994)
• Data available: monolingual and parallel corpora, manual surface and deep treebanks (parallel forthcoming!), manual valency lexicons

Language   Corpus                                          Annotation up to                         Tokens
Cs         PDT 2.0 (Hajič, 2005)                           manual surface and deep syntax           1.5M surf.
Cs         CNC (Kocek, Kopřivová, and Kučera, 2000)        automatic lemmatization and morphology   114M
Cs         Web corpus                                      automatic surface syntax                 100M
Cs↔En      PCEDT 1.0 (Čmejrek, Cuřín, and Havelka, 2003)   automatic surface and deep syntax        500k
Cs↔En      CzEng 0.5                                       automatic surface syntax                 15M
Proposed Strategy
Preliminary experiments at the workshop:
• Factored models touching valency explored during the workshop performed badly: no gain or a slight loss.

Future:
• Evaluate the causes. Was it just sparse data?
• Check subcategorization using partially lexicalized language models: a morphological LM with verbs kept lexicalized should capture subcategorization (see the sketch below).
• Experiment with syntax-based language models (Chelba and Jelinek, 1998; Charniak, 2001).
• Map explicit subcategorization information from source to target: translate lemma+subcat to lemma+subcat and POS to POS, then generate the surface form from these.
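A partially lexicalized LM of this kind can be prototyped with a one-pass preprocessing step. The sketch below is only an illustration of the idea, not the project's implementation; the (surface, lemma, tag) triples and the shortened Prague-style tags in the example are assumed inputs from a morphological tagger.

    # Build training text for a partially lexicalized LM: verbs keep their
    # lemma, all other tokens are reduced to their morphological tag, so a
    # plain n-gram LM can learn verb patterns such as vybehnout-s-instr.

    def partially_lexicalize(sentence):
        stream = []
        for surface, lemma, tag in sentence:
            if tag.startswith("V"):      # verbs stay lexicalized
                stream.append(lemma)
            else:                        # everything else: tag only
                stream.append(tag)
        return stream

    # hypothetical tagged sentence "firmy vyběhly s reklamami"
    sent = [("firmy", "firma", "NNFP1"), ("vyběhly", "vyběhnout", "VpTP"),
            ("s", "s", "RR--7"), ("reklamami", "reklama", "NNFP7")]
    print(" ".join(partially_lexicalize(sent)))
    # -> NNFP1 vyběhnout RR--7 NNFP7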
Project Will Use These Sources of Information
• Available valency/subcategorization dictionaries: VALLEX for Czech (∼PropBank for English).
• Automatically collected subcategorization data: (Korhonen, 2002) and earlier work; my dissertation in preparation.
• Word-sense-disambiguation-like algorithms to label verb occurrences with frames: (Bojar, Semecký, and Benešová, 2005), and WSD community results in general.

Compare with simple approaches:
• More monolingual data for plain n-gram language models may help enough.
• Are valency-based generalizations useful in general / on small data / on out-of-domain data?
Summary
• Factored models help fix morphology → local dependencies are already correct.
• There is a significant margin for improving verb-modifier agreement.
• The English→Czech pair is a good fit for these experiments.
• Improved valency models should improve translation quality: valency theory, data, and methods are available.
References
Bojar, Ondřej. 2003. Towards Automatic Extraction of Verb Frames. Prague Bulletin of Mathematical Linguistics, 79–80:101–120.
Bojar, Ondřej, Jiří Semecký, and Václava Benešová. 2005. VALEVAL: Testing VALLEX Consistency and Experimenting with Word-Frame Disambiguation. Prague Bulletin of Mathematical Linguistics, 83:5–17.
Charniak, Eugene. 2001. Immediate-head parsing for language models. In Meeting of the Association for Computational Linguistics, pages 116–123.
Chelba, Ciprian and Frederick Jelinek. 1998. Exploiting syntactic structure for language modeling. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 225–231, San Francisco, California. Morgan Kaufmann Publishers.
Čmejrek, Martin, Jan Cuřín, and Jiří Havelka. 2003. Czech-English Dependency-based Machine Translation. In EACL 2003 Proceedings of the Conference, pages 83–90. Association for Computational Linguistics, April.
Collins, Michael. 1996. A New Statistical Parser Based on Bigram Lexical Dependencies. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, pages 184–191.
Collins, Michael, Jan Hajič, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A Statistical Parser of Czech. In Proceedings of the 37th ACL Conference, pages 505–512, University of Maryland, College Park, USA.
Hajič, Jan. 2005. Complex Corpus Annotation: The Prague Dependency Treebank. In Mária Šimková, editor, Insight into Slovak and Czech Corpus Linguistics, pages 54–73, Bratislava, Slovakia. Veda, vydavateľstvo SAV.
Holan, Tomáš. 2003. K syntaktické analýze českých(!) vět [On the syntactic analysis of Czech(!) sentences]. In MIS 2003. MATFYZPRESS, January 18–25, 2003.
Kocek, Jan, Marie Kopřivová, and Karel Kučera, editors. 2000. Český národní korpus – úvod a příručka uživatele [The Czech National Corpus – Introduction and User's Handbook]. FF UK – ÚČNK, Praha.
Koehn, Philipp. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of MT Summit X, September.
Korhonen, Anna. 2002. Subcategorization Acquisition. Technical Report UCAM-CL-TR-530, University of Cambridge, Computer Laboratory, Cambridge, UK, February.
Kruijff, Geert-Jan M. 2003. 3-Phase Grammar Learning. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development.
Panevová, Jarmila. 1994. Valency Frames and the Meaning of the Sentence. In Ph. L. Luelsdorff, editor, The Prague School of Structural and Functional Linguistics, pages 223–243, Amsterdam-Philadelphia. John Benjamins.
Sgall, Petr, Eva Hajičová, and Jarmila Panevová. 1986. The Meaning of the Sentence and Its Semantic and Pragmatic Aspects. Academia/Reidel Publishing Company, Prague, Czech Republic/Dordrecht, Netherlands.
Analysis of Czech
[Slide figure: three layers of analysis for "Zákony udělejte pro lidi" ("Laws make for people"). Morphological layer: each form with its lemma and positional tag (e.g. zákony/zákon with NNIP1/NNIP4/NNIP5/NNIP7 readings; udělejte/udělat; pro/pro-1; lidi/člověk with NNMP1/NNMP4/NNMP5 readings). Analytic (surface-syntactic) layer: dependency tree with labels such as PRED, OBJ, ADV, AUX. Tectogrammatical (deep-syntactic) layer: udělat(imp) as PRED with participants labeled ACT, PAT, BEN, e.g. zákon(Pl), člověk(Pl,pro), and the unexpressed actor #36 "you".]
Properties of the Czech Language

                     Czech                                   English
  Rich morphology    ≥ 4,000 tags possible, ≥ 2,300 seen    50 used
  Word order         free                                    rigid

• rigid global word order phenomena: clitics
• rigid local word order phenomena: coordination, mutual order of clitics

  Nonprojectivity (Czech)
  Nonprojective sentences    23.3%  (16,920)
  Nonprojective edges         1.9%  (23,691)

  Known parsing results      Czech          English
  Edge accuracy              69.2–82.5%     91%
  Sentence correctness       15.0–30.9%     43%

Data by (Collins et al., 1999), (Holan, 2003), Zeman (http://ckl.mff.cuni.cz/~zeman/projekty/neproj/index.html) and (Bojar, 2003). Consult (Kruijff, 2003) for measuring word order freeness.
Detailed Numbers on Czech

  Edge length (cumulative)   1      ≤ 2    ≤ 5
  English [%]                74.2   86.3   95.6
  Czech [%]                  51.8   72.1   90.2

  Number of gaps             0      1      2
  Sentences [%]              76.9   22.7   0.42

  Climbing steps             1      2      3     4     5
  Nodes [%]                  90.3   8.0    1.3   0.3   0.1

Edge-length data for English by (Collins, 1996) and for Czech by (Holan, 2003); gap and climbing-step data by (Holan, 2003).
Analytic vs. Tectogrammatical (2)
[Slide figure: analytic and tectogrammatical trees for "To by se mělo změnit." ("It should change."). Analytic tree: To (it, SB), by (conjunct particle, AUXV), se (reflexive particle, AUXR), mělo (should, PRED), změnit (change, OBJ), final full stop (AUXK). Tectogrammatical tree: PRED mít (should) with ACT to (it), PAT změnit(conj) (change(conj)), and a Generic Actor node #45.]
Asynchronous Factored Translation
Hieu Hoang, University of Edinburgh
August 17, 2006
Current System
[Slide figure: translating "Je vous achète un chat" with two factors. Surface factor: Je vous achète un chat; POS factor: PRO PRO VB ART NN. Phrase Table 1 (surface) translates "Je vous achète" → "I am buying you"; Phrase Table 2 (POS) translates "PRO PRO VB" → "PRO VB VB PRO". Both tables must cover the same source span.]
Limitations of Synchronous Translation
[Slide figure: same input, but Phrase Table 1 only has the short entries Je → I, vous → you, achète → am buying, while Phrase Table 2 has the long POS entry PRO PRO VB → PRO VB VB PRO. Because all factors must be translated synchronously over identical spans, the long POS template cannot be combined with the short surface phrases.]
Asynchronous Translation
[Slide figure: the same example, now allowing the surface factor to be covered by three one-word phrases (Je → I, vous → you, achète → am buying) while the POS factor is covered by a single three-word template (PRO PRO VB → PRO VB VB PRO): the factors no longer need identical segmentations.]
Tiling
[Slide figures (animation builds): coverage of "Je vous achète un chat" (PRO PRO VB ART NN). Current system: the surface and POS factors are tiled with identical spans. Future: the two factors may be tiled independently, with differently segmented, overlapping spans.]
Long Templates
[Slide figure: Phrase Table 1 translates word by word (Je → I, vous → you, achète → am buying, un chat → a cat); Phrase Table 2 contributes one long POS template PRO PRO VB ART NN → PRO VB VB PRO ART NN that spans the whole sentence and drives the reordering.]
Templates
[Slide figure: the same decomposition, with the POS template PRO PRO VB ART NN → PRO VB VB PRO ART NN acting as a template into which the word-level translations are slotted.]
Combining Information from Different Factors
[Slide figure: Chinese→English example. Surface: ni suo ta da mingzi le ma ? (gloss: you say his name already question); a tense factor derives "past" from the particle "le", so the output verb is generated in the past tense: "You said his name, right ?"]
Challenges
• Computational complexity
• Pruning strategies
• Recombination
• Scoring
Translation of Morphologically Rich Languages with Additional Linguistic Information
Chris Dyer, Philipp Koehn, Chris Callison-Burch, Hieu Hoang
17 August 2006
Morphologically Rich Languages
• Languages differ in their morphological markup
• Examples with increasing complexity:
  – Chinese: no marking for number, gender, tense, or aspect
  – English: number (2) for nouns, four verb forms
  – Spanish: number (2) and gender (2) for adjectives, ...
  – German: number (2), gender (3), case (4), definiteness for adjectives, ...
  – Arabic: number (3), gender (2), case (3), definiteness, possessors for nouns
  – Finnish: prepositions often expressed morphologically

  Language   Vocabulary size in Europarl
  English     65,887 word forms
  Spanish    102,886 word forms
  German     195,290 word forms
  Finnish    358,345 word forms
Impact of Morphological Complexity
• How much information do we have if we discount inflectional morphology?
• Experiment (systems trained on the full 700,000-sentence Europarl corpus):

  Method                           devtest       test
  surface → surface                18.22 BLEU    18.04 BLEU
  surface → surface (lemmatize)    22.27 BLEU    22.15 BLEU
  surface → lemma                  22.70 BLEU    22.45 BLEU

• A gain of 4 BLEU points is possible if we can solve morphology.
Problem: Unknown Word Forms
• Unknown surface word forms (German):

  test set       unigrams   bigrams   trigrams
  devtest-2006   0.71%      12.00%    40.46%
  test-2006      0.69%      12.20%    41.08%

• Unknown lemmas (German):

  test set       unigrams   bigrams   trigrams
  devtest-2006   0.64%      9.05%     33.93%
  test-2006      0.64%      9.14%     34.36%
Factored Models
• Factored models allow us to address these problems
• Sparse data:
  – back off to translation of lemmas
  – back off to language models with richer statistics
• Agreement and grammatical coherence:
  – factors that enforce agreement within noun phrases
  – factors that enforce agreement at the clause level
Addressing Data Sparseness with Lemmas
[Slide figure: the input word maps to an output word and an output lemma.]
• Translate surface into lemma
• Generate surface from lemma
• Translate surface into surface
• Language models over surface and lemma
Addressing Data Sparseness with Lemmas, Model 2
[Slide figure: the input word maps to an output word; the output lemma is generated from the output word.]
• Translate surface into surface
• Generate lemma from surface
• Language models over surface and lemma
Experimental Results

  Method                         devtest   test
  baseline                       18.22     18.04
  hidden lemma (gen only)        18.82     18.69
  hidden lemma (gen and trans)   18.41     18.52
  best published results         -         18.15

• Better performance than the baseline model
• The simpler model performs better – fewer search errors
Addressing Data Sparseness with Factored Models
[Slide figure: input word, lemma, and part-of-speech map to output word, lemma, part-of-speech, and morphology.]
• Morphological analysis and generation model
• Pitfalls of this approach:
  – the tag set does not necessarily carry sufficient information
  – explosive search space on large models
Overall Grammatical Coherence
[Slide figure: the input word maps to an output word and part-of-speech.]
• High-order language models over POS
• Motivation: syntactic tags should enforce syntactic sentence structure
• Results: no major impact with a 7-gram POS model (BLEU 18.25 vs. 18.22)
• Analysis: local grammatical coherence is already fairly good, and a POS sequence LM is not strong enough to support major restructuring
Local Agreement (esp. within Noun Phrases)
[Slide figure: the input word maps to an output word, part-of-speech, and morphology.]
• High-order language models over POS and morphology
• Motivation:
  – DET-sgl NOUN-sgl: good sequence
  – DET-sgl NOUN-plural: bad sequence
Agreement within Noun Phrases
• Experiment: 7-gram POS and morphology LMs in addition to a 3-gram word LM
• Results:

  Method           Agreement errors        devtest       test
  baseline         15% in NPs ≥ 3 words    18.22 BLEU    18.04 BLEU
  factored model    4% in NPs ≥ 3 words    18.25 BLEU    18.22 BLEU

• Example: baseline ... zur zwischenstaatlichen methoden ... vs. factored model ... zu zwischenstaatlichen methoden ...
• Example: baseline ... das zweite wichtige änderung ... vs. factored model ... die zweite wichtige änderung ...
Subject-Verb Agreement
• A lexical n-gram language model would prefer:
    the paintings of the old man is beautiful
  since "old man is" is a better trigram than "old man are".
• Correct translation with agreement factors:
    the   paintings    of   the   old   man   are        beautiful
    -     SBJ-plural   -    -     -     -     V-plural   -
• A special tag tracks the count of subject and verb:
    p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)
Experiment on English–German
• Add special features for subject and verb
• Verb:
  – our morphological analyzer does not provide verb morphology → use surface forms
• Subject:
  – subject identified with a German parser (Amit Dubey's parser trained on the TIGER treebank)
  – if a pronoun: surface form of the pronoun
  – if a noun phrase: POS and morphological tags of determiner, adjective, and noun
Skip Language Models
• A full language model is confused by the many uninformative "-" items:
    p(-,SBJ-plural,-,-,-,-,V-plural,-) > p(-,SBJ-plural,-,-,-,-,V-singular,-)
• Skip language models ignore the irrelevant tags:
    p(SBJ-plural,V-plural) > p(SBJ-plural,V-singular)
• Results: experiments have not finished yet; preliminary results are inconclusive
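The skip-LM idea can be emulated by deleting the uninformative "-" tokens before ordinary n-gram scoring. The toy sketch below (with made-up bigram probabilities, not model scores from these experiments) illustrates that step:

    import math

    BIGRAM_LOGPROB = {                         # hypothetical numbers
        ("SBJ-plural", "V-plural"):   math.log(0.6),
        ("SBJ-plural", "V-singular"): math.log(0.1),
    }

    def skip_lm_score(tags, floor=math.log(1e-4)):
        kept = [t for t in tags if t != "-"]   # the "skip" step
        return sum(BIGRAM_LOGPROB.get(b, floor) for b in zip(kept, kept[1:]))

    good = ["-", "SBJ-plural", "-", "-", "-", "-", "V-plural", "-"]
    bad  = ["-", "SBJ-plural", "-", "-", "-", "-", "V-singular", "-"]
    assert skip_lm_score(good) > skip_lm_score(bad)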
Reflection on the Data
• Clause elements are translated reasonably well
  – agreement within noun phrases is now high (4% errors with the factored model)
• Overall sentence structure is muddled
  – subject-verb agreement is hard to enforce, since it is hard to establish which noun phrase is the subject
  – the role (and hence case) of noun phrases is often wrong, since their relation to the verb is unclear
• Similar problems arise when translating Arabic–English and Chinese–English
  – this motivates work on syntax-based machine translation
  – one solution: syntactic restructuring models (Brooke's presentation)
  – another solution: clause-level sequence models
Clause-Level Sequence Models
• Correct sentence with verb:
    the   paintings   of    the   old   man   are   beautiful
    SBJ   SBJ         OBJ   OBJ   OBJ   OBJ   V     ADJ
• Incorrect sentence without verb:
    the   paintings   of    the   old   man   beautiful
    SBJ   SBJ         OBJ   OBJ   OBJ   OBJ   ADJ
• The syntactic role label sequence model is at the steering wheel:
    p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,V,ADJ) > p(SBJ,SBJ,OBJ,OBJ,OBJ,OBJ,ADJ)
• This may be simplified using skip language models to:
    p(SBJ,OBJ,V,ADJ) > p(SBJ,OBJ,ADJ)
Another Reality Check
• One typical error of the current system:
    wir    haben   daher   nicht   für      diesen   bericht   stimmen
    we     have    hence   not     for      this     report    voting
    SUBJ   AUX     PART    PART    PP-OBJ   PP-OBJ   PP-OBJ    VINF
• Typical sentences have many particles floating around
  – if interested in the core sentence structure: ignore them
  – if interested in all parts of the clause: include them
• Key lesson: feature engineering
  – know your tag sets and morphological features
  – be aware of what problem you want to address
  – create a factor for exactly that purpose
Future Research
Back-off Models: Improving MT through Smarter Searching and Better Use of Data
Chris Dyer, University of Maryland
8/22/2006
Two Goals
• Smarter search
  – Mitigate sparse-data effects in multi-factored models
  – Recover from search errors
  – Enable well-motivated models for translating into morphologically complex languages
• Back-off models
  – Take advantage of single-factored models when it makes sense to do so
Smarter Search: Motivation
• Morphological complexity poses problems for "whitespace tokenized" statistical MT
• Beyond data sparseness: conventional models run into search problems for rare surface forms
• Lemmatizing yields considerable performance gains for German:

  Method                           devtest-2006   test-2006
  surface → surface                18.22          18.04
  surface → surface, lemmatize     22.27          22.15
  surface → lemma                  22.70          22.45
Smarter Search: Motivation
• Single-factor models do not generalize: they cannot produce a target form unless it was seen in the training data.
• Basic generation models allow us to improve translation coverage with (inexpensive) monolingual resources.
• Translating English→German:

  Generation training data         Size                # distinct words producible
  Surface only                     n/a                 105,000 distinct words
  Lemmas only                      n/a                  85,000 distinct lemmas
  Lemmas + bitext Europarl          15 million words   117,000 distinct words
  Lemmas + full Europarl            27 million words   122,000 distinct words
  Lemmas + 1.2M EP + Wikipedia     113 million words   137,000 distinct words

• Net result: a 30% increase in producible forms over a single-factor model.
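Such a generation table can be harvested from monolingual text alone. The sketch below is our simplification of the idea, with `analyze` standing in for an external morphological analyzer (an assumed dependency):

    from collections import Counter, defaultdict

    def build_generation_table(tokens, analyze):
        # count which surface forms realize each lemma in monolingual text
        counts = defaultdict(Counter)
        for surface in tokens:
            counts[analyze(surface)][surface] += 1
        # relative frequencies as generation scores p(surface | lemma)
        return {lemma: {s: c / sum(cs.values()) for s, c in cs.items()}
                for lemma, cs in counts.items()}

Because only target-side text is needed, each added corpus (bitext target side, full Europarl, Wikipedia) can only grow the set of producible surface forms, which is where the 30% coverage gain above comes from.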
Morphological Analysis and Generation Model
• n-gram LMs over surface forms, morphology, and lemmata
• 4-step model:
  1. Translate surface to lemma
  2. Generate morphology from lemma
  3. Translate POS to morphology
  4. Generate surface from lemma + morphology
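Read as a log-linear decomposition, the four steps score a factored target word roughly as sketched below. This is schematic only: t1-t4 stand in for the trained translation/generation tables, and the field names are our own illustration.

    import math
    from collections import namedtuple

    Src = namedtuple("Src", "surface lemma pos")
    Hyp = namedtuple("Hyp", "surface lemma morph")

    def four_step_logscore(src, hyp, t1, t2, t3, t4):
        return (math.log(t1[src.surface][hyp.lemma])       # 1. surface -> lemma (translate)
              + math.log(t2[hyp.lemma][hyp.morph])         # 2. lemma -> morphology (generate)
              + math.log(t3[src.pos][hyp.morph])           # 3. POS -> morphology (translate)
              + math.log(t4[(hyp.lemma, hyp.morph)][hyp.surface]))  # 4. generate surface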
Initial Results Were Disappointing...
• BLEU scores well below the baseline (~11)
• Tuning took an entire weekend on a very small tuning set
The Problem: Search Errors
• Aggressive pruning
• Each step multiplies the number of states in the search space over a single-factored model
• Spans must overlap exactly
The Problem: An Illustration
• Translation options for 'the right approach':
    der richtige Ansatz
    dem richtigen Ansatz
    den richtigen Ansatz
The Solution
• Back off to shorter spans: when a dead end is reached, break the source span into smaller spans and translate those.
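A greedy version of this back-off fits in a few lines. This is an illustration only, not the planned implementation: a real decoder would score all split points rather than always splitting in the middle, and `options` stands in for the factored phrase-table lookup.

    def translate_span(words, options):
        opts = options(tuple(words))
        if opts:                       # full span covered in all factors: use it
            return opts[0]
        if len(words) == 1:            # unknown single word: pass it through
            return words[0]
        mid = len(words) // 2          # dead end: split the span and recurse
        return (translate_span(words[:mid], options) + " " +
                translate_span(words[mid:], options))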
The Solution: An Illustration
• Translation options for 'the': der, die, das, dem, den, des
• Translation options for 'right approach': richtiger Ansatz, richtigen Ansatz, richtigen Ansatzes
Back-off Models
• Lexicalized surface forms are common
• Because of lexicalization, obscure morphology or root forms are often retained, e.g. "be that as it may"
• Translations are often approximate, and unusual when analyzed in more abstract layers
• If common stock phrases are mistranslated because of a rigid analysis and generation process, fluency suffers
Back-off Models
• Solution:
  – Try to let a single translation step cover all factors
  – Back off to the multi-factored model
Back-off Models: Implementation
• "Primary" phrase table
  – Standard form; contains all factors on the target side (necessary for secondary-factor LMs)
  – May be trained on single-factor data with "best guesses" for the secondary factors
  – May be aggressively filtered, e.g. to phrases with > n occurrences
Back-off Models: Implementation
• Key idea: back-off weight
  – A feature associated with choosing a single-factored path
  – Tuned along with the other feature weights
  – Perhaps a function of source phrase length?
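Schematically, this is just one additional feature in the log-linear model:

    score(e, f) = Σ_i λ_i h_i(e, f) + λ_backoff · h_backoff(e, f)

where h_backoff fires for phrases translated via the single-factored back-off path (possibly weighted by source phrase length), and λ_backoff is tuned like any other weight.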
Summary
• Increase the performance of multi-factored models
  – Recover from search errors
  – Recover from data sparseness (make more efficient use of longer underlying phrases)
• Extend the benefits of multi-factor models to target languages (e.g. English) where sparse data and search errors are not generally an issue
Translation with Syntax and Factors: Handling Global and Local Dependencies in SMT
Brooke Cowan, MIT CSAIL
August 17, 2006
Goals of Statistical Machine Translation
• Linguistically correct output
  – learn correct syntax and morphology in the target language
  – e.g., noun-phrase agreement, subject-verb agreement, verbs and their arguments
• Meaning-preserving output
  – learn the mapping between source and target sentence elements
  – e.g., identify the subject in the source and ensure it plays the proper role in the target
  – can involve a significant amount of reordering
Linguistically Correct Output
• E.g., in Spanish noun phrases, nouns, determiners, and adjectives are constrained to agree in gender and number:

    las    políticas   pesqueras   comunitarias
    the    policies    fisheries   common
    det    noun        adj         adj
    (all FEMININE PLURAL)

• Phrasal agreement phenomena are generally local in nature.
Meaning-Preserving Output: Free Word Order
• E.g., when translating from German to English, we want to identify and place the subject, object, and phrasal modifiers in the output:

    ich möchte dem berichterstatter für seinen bericht danken
    dem berichterstatter möchte ich für seinen bericht danken
    für seinen bericht möchte ich dem berichterstatter danken
      → i would like to thank the rapporteur for his report

• Translation involving free-word-order languages, or language pairs with very different basic word order, can be quite challenging because these phenomena are generally global in nature.
A Hybrid System
• A syntax-based system handles global phenomena in translation:
  – inter-phrasal reordering
  – verb/argument structure
  – some long-distance agreement phenomena (e.g., subject-verb agreement)
• A factored phrase-based system handles local phenomena in translation:
  – agreement and reorderings
Combining the Two Systems
• Use the syntax-based system to reorder the source-language input
• Feed the output of the syntax-based system into the phrase-based system

    German input:          für seinen bericht möchte ich dem berichterstatter danken
    Modified German input: ich WOULD LIKE TO THANK dem berichterstatter für seinen bericht
    English output:        i would like to thank the rapporteur for his report
The Syntax-Based System
• Discriminatively trained, tree-to-tree translation system (Cowan, Collins, and Kučerová, EMNLP '06)
• Fully implemented and tested on the German-to-English Europarl task
• The model predicts an aligned extended projection (AEP) on the target side:
  – a syntactic structure encapsulating the argument structure of the main target-side verb, and
  – alignment information between the modifiers on the source and target sides
What is an AEP?
[Slide figure: a German clause (s: pp-mo "zwischen beiden gesetzen", verb "bestehen", adv-mo "also", np-sb "erhebliche rechtliche, praktische und wirtschaftliche unterschiede") paired with its English AEP: the extended projection (EP) of the main verb (Frank 2002), S → NP-A VP, V = "are", plus alignment information: SUBJECT: there; OBJECT: 3; MOD(1): post-object; MOD(2): pre-subject.]
Integration with Moses
• Factor-based systems handle local phenomena well
• Extensions to Moses:
  – externally provided translation options
      Modified German input: [ ich ] [ WOULD LIKE TO THANK ] [ dem berichterstatter ] [ für seinen bericht ]
  – constraints on reordering
  – n-best lists of AEPs
Research Questions
• Factor the translation problem into two parts:
  – a syntax-based system to handle global reorderings and agreement
  – a factor-based system to handle local reorderings and agreement
• Can this approach improve overall translation quality?
  – cf. past work on rule-based clause restructuring (e.g., Collins, Koehn, Kučerová, ACL '05)
• What is the best way to combine these systems?
  – hard constraints vs. soft constraints
  – a voting/back-off framework
Part of Speech Information for Alignment
Alexandra Constantin
2006 CLSP Summer Workshop
Bilingual Dictionary
Haus – house, building, home, household
Lexical Translation Probability Distribution
Implicit Alignment
[Slide figure: "Das Haus ist klein." (positions 1-4) aligned word for word to "The house is small." (positions 1-4).]
Alignment Function a
[Slide figure: "Klein ist das Haus" aligned to "The house is small"; the alignment function a maps each English position 1-4 back to a German position, here with reordering.]
POS Motivation
• POS information for infrequent words
IBM Model 1 – Notation
• e = target word; f = source word
• t(e|f) = probability of translating foreign word f into English word e
• f = (f_1, ..., f_n) = foreign sentence
• e = (e_1, ..., e_m) = English sentence
• p(e|f) = translation probability
• a = alignment function
IBM Model 1
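For reference, the standard Model 1 translation probability, with f_0 the NULL word, n = |f|, and m = |e|:

    p(e|f) = ε / (n+1)^m · ∏_{j=1..m} Σ_{i=0..n} t(e_j | f_i)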
EM Algorithm
1. Initialize the model (typically with a uniform distribution)
2. Apply the model to the data (expectation step)
3. Learn the model from the data (maximization step)
4. Iterate steps 2-3 until convergence
Expectation Step / Maximization Step
[Formula slides; see the sketch below.]
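As a compact reference for the two steps, here is a minimal Model 1 EM implementation (the standard algorithm, simplified by omitting the NULL word):

    from collections import defaultdict

    def model1_em(bitext, iterations=5):
        """bitext: list of (foreign_words, english_words) pairs.
        Returns t[e][f] = p(e | f). Simplified: no NULL word."""
        e_vocab = {e for _, es in bitext for e in es}
        t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(e_vocab)))  # uniform init
        for _ in range(iterations):
            count = defaultdict(lambda: defaultdict(float))   # expected counts c(e, f)
            total = defaultdict(float)                        # c(f)
            for fs, es in bitext:                             # E-step
                for e in es:
                    z = sum(t[e][f] for f in fs)              # normalization
                    for f in fs:
                        c = t[e][f] / z                       # alignment posterior
                        count[e][f] += c
                        total[f] += c
            for e in count:                                   # M-step
                for f in count[e]:
                    t[e][f] = count[e][f] / total[f]
        return t

    bitext = [("das haus".split(), "the house".split()),
              ("das buch".split(), "the book".split()),
              ("ein buch".split(), "a book".split())]
    t = model1_em(bitext)
    # after a few iterations t["the"]["das"] dominates t["house"]["das"], etc.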
Adding POS Information
Experiments – AER
• Compare generated alignments against manual alignments
• Manual alignments: probable (P) and sure (S) links
• Automated alignments: A
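The metric itself, Och and Ney's alignment error rate (sure links S are a subset of the probable links P; lower is better):

    AER(S, P; A) = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|)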
Results (AER)

  AER        10k    20k    40k    60k    80k    100k
  Baseline   53.7   51.8   49.3   48.6   47.5   47.1
  Only POS   76.0   75.4   75.5   75.1   75.3   75.1
  + POS      53.6   51.5   49.6   48.4   47.7   47.3
Future Work
• Use the alignments to train an MT system and compare BLEU scores
• Use POS information in more complicated alignment methods
• Use other factors
JHU CLSP Summer Workshop 2006 Team Presentation
Experimental Results for Confusion Network Decoding
Richard Zens, Nicola Bertoldi, Marcello Federico, Wade Shen
IWSLT Task
• Chinese–English; domain: phrase-book entries
• Corpus statistics:

                   Chinese   English
  sentences        40 K
  running words    351 K     365 K
  vocabulary       11 K      10 K

• Confusion network statistics (489 sentences):

                          read speech   spontaneous speech
  avg. length             17.2          17.4
  avg. / max. depth       2.2 / 92      2.9 / 82
  avg. number of paths    10^21         10^32

• No development data for confusion networks
Results for IWSLT
• Phrase table provided by MIT/LL; competitive baseline results
• Results:

                        read speech   spontaneous speech
                        BLEU [%]      BLEU [%]
  verbatim              21.4
  1-best from lattice   19.0          17.2
  1-best from CN        19.0          17.2
  full CN               19.3          17.8

• Improvements are statistically significant (89% confidence)
Other Ambiguous Input: Punctuation
• Chinese input does not contain punctuation
• Illustration:
  [Slide figure: the input "hello world" expanded into a confusion network with arcs inserting punctuation marks (",", "!", ".") at probabilities such as 0.9/0.1 and 0.7/0.2/0.1.]
• Results for verbatim input:

  punctuation input type   BLEU [%]
  1-best                   20.8
  confusion network        21.0

• Competitive performance without tuning → room for improvement
Truecasing
• Truecasing = restoring case information in lowercased text
• Common approach: the core MT system produces lowercase output; truecasing is done as a postprocessing step
• Application of factored translation models:
  1. translate lowercase
  2. generate truecased output (using a truecase LM)
• Results:

  Method       BLEU [%]
  two-step     18.9
  integrated   17.8

→ somewhat worse performance than a dedicated tool
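The two-step (postprocessing) variant can be approximated with a very small statistical truecaser. The sketch below is a toy version based on unigram frequencies only, nothing like a full truecase LM or the dedicated tool used for comparison above:

    from collections import Counter, defaultdict

    def train_truecaser(truecased_sentences):
        forms = defaultdict(Counter)
        for sent in truecased_sentences:
            for i, tok in enumerate(sent):
                if i > 0:                    # skip sentence-initial tokens: casing is forced there
                    forms[tok.lower()][tok] += 1
        return forms

    def truecase(lowercased_tokens, forms):
        out = []
        for i, tok in enumerate(lowercased_tokens):
            best = forms[tok].most_common(1)
            cased = best[0][0] if best else tok
            if i == 0 and cased.islower():   # capitalize sentence-initial fallback
                cased = cased[:1].upper() + cased[1:]
            out.append(cased)
        return out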
EPPS Task
• EPPS: European Parliament Plenary Sessions
• Spanish–English speech-to-speech translation task
• Corpus statistics:

                   Spanish   English
  sentences        1.2 M
  running words    31 M      30 M
  vocabulary       140 K     94 K

• Confusion network statistics:

                          dev         test
  sentences               2,633       1,071
  avg. length             10.6        23.6
  avg. / max. depth       2.8 / 165   2.7 / 136
  avg. number of paths    10^38       10^75
Results for EPPS Task

                   dev                   test
                   ASR-WER   BLEU       ASR-WER   BLEU
  1-best lattice   19.3      42.2       22.4      37.6
  1-best CN        21.7      40.3       23.3      36.7
  full CN           7.0      42.4        8.5      38.9

• Best result for test in previous work: 37.2 BLEU
• Compared with previous work on this task, we have:
  1. a stronger baseline,
  2. larger improvements, and
  3. much more efficient decoding (4x vs. 25x)
Note: all figures in percent.
Exploration of Confusion Networks
[Slide figure: average number of paths per sentence (log scale, 0.1 up to 1x10^10) plotted against path length (0-14), with three curves: CN total, CN explored, and 1-best explored.]
JHU CLSP Summer Workshop 2006 – Proposal for Follow-up Research
Exploiting Ambiguous Input in Statistical Machine Translation
Richard Zens
Human Language Technology and Pattern Recognition, Lehrstuhl für Informatik 6, Computer Science Department, RWTH Aachen University, Germany
Motivation
• MT is often used in a pipeline, i.e. the input to the MT system is the output of another, imperfect NLP system, e.g.:
  – spoken language translation: ASR
  – segmentation: Chinese words, Arabic tokens
  – named entity recognition / translation
• Traditional approach: ignore the problem, i.e. translate the 1-best input
• Result of previous work: improvements if the ambiguity is taken into account
Previous Approaches
1. Confusion network decoding
   • advantages: efficiency; reordering is straightforward
   • problem: representing alternative segmentations
2. Lattice decoding
   • advantage: representing alternative segmentations
   • problem: reordering
Goal: ⇒ exploit the advantages of both approaches, but avoid their weaknesses
Generalized Confusion Networks
• Confusion networks: a linear sequence of nodes 0-1-2-3-4, with each edge covering exactly one position.
• Generalization:
  – add edges that cover multiple positions → representation of alternative segmentations
  – do not add nodes → retain efficiency and straightforward reordering
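One way to realize this is to keep the CN's column indexing but let an edge span several columns. The data-structure sketch below is our own illustration of the idea, not the actual implementation:

    from dataclasses import dataclass

    @dataclass
    class Edge:
        start: int    # first column covered
        end: int      # exclusive; end - start = number of positions covered
        word: str
        prob: float

    # ordinary CN: every edge covers exactly one column
    cn = [Edge(0, 1, "hello", 1.0), Edge(1, 2, "world", 0.9), Edge(1, 2, "word", 0.1)]

    # generalized CN: an extra multi-position edge encodes an alternative
    # segmentation without adding nodes, so coverage is still a bit-vector
    # over columns and the usual CN reordering applies unchanged
    gcn = cn + [Edge(0, 2, "helloworld", 0.05)]

    def edges_from(column, edges):
        # translation options that can extend a hypothesis at this column
        return [e for e in edges if e.start == column]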
Improved Reordering for Lattice Input
• A confusion network is an approximation of the lattice → valuable information might be lost → potential improvement when using lattices directly
• So far, only very local reordering on lattices:
  – skip 1 phrase [Zens & Bender+ 05]
  – switch positions of 2 or 3 phrases [Kumar & Byrne 05]
• Idea: generalize the reordering scheme used for CNs to lattice input → long-range reordering
Goals
• Improve robustness to imperfect input
• Investigate novel approaches:
  – generalized confusion networks
  – reordering strategies for lattice input
• Perform a systematic comparison in terms of MT quality and computational requirements
• Scalability → apply to tasks of different sizes: small: IWSLT; medium: EPPS/TC-Star; large: NIST/GALE
Targeted Applications
• Spoken language translation:
  – output of an ASR system
  – punctuation insertion / sentence boundary detection
  – disfluency detection
• Named entity recognition / translation
• Chinese word segmentation
• Arabic tokenization
References
[Kumar & Byrne 05] S. Kumar, W. Byrne: Local phrase reordering models for statistical machine translation. Proc. HLT/EMNLP, pp. 161–168, Vancouver, Canada, October 2005.
[Sadat & Habash 06] F. Sadat, N. Habash: Combination of Preprocessing Schemes for Statistical MT. Proc. COLING/ACL, pp. 1–8, Sydney, Australia, July 2006.
[Xu & Matusov+ 05] J. Xu, E. Matusov, R. Zens, H. Ney: Integrated Chinese Word Segmentation in Statistical Machine Translation. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 141–147, Pittsburgh, PA, October 2005.
[Zens & Bender+ 05] R. Zens, O. Bender, S. Hasan, S. Khadivi, E. Matusov, J. Xu, Y. Zhang, H. Ney: The RWTH Phrase-based Statistical Machine Translation System. Proc. Int. Workshop on Spoken Language Translation (IWSLT), pp. 155–162, Pittsburgh, PA, October 2005.
[Zens & Och+ 02] R. Zens, F.J. Och, H. Ney: Phrase-Based Statistical Machine Translation. In M. Jarke, J. Koehler, G. Lakemeyer, editors, 25th German Conf. on Artificial Intelligence (KI2002), Vol. 2479 of Lecture Notes in Artificial Intelligence (LNAI), pp. 18–32, Aachen, Germany, September 2002. Springer Verlag.