Chapter 11: Tree-Based Models
Statistical Machine Translation
Tree-Based Models

• Traditional statistical models operate on sequences of words
• Many translation problems can best be explained by pointing to syntax
  – reordering, e.g., verb movement in German–English translation
  – long-distance agreement (e.g., subject–verb) in output
⇒ Translation models based on tree representations of language
  – significant ongoing research
  – state of the art for some language pairs
Phrase Structure Grammar

• Phrase structure
  – noun phrases: the big man, a house, ...
  – prepositional phrases: at 5 o'clock, in Edinburgh, ...
  – verb phrases: going out of business, eat chicken, ...
  – adjective phrases, ...
• Context-free grammars (CFG)
  – non-terminal symbols: phrase structure labels, part-of-speech tags
  – terminal symbols: words
  – production rules: nt → [nt,t]+
    example: np → det nn
Phrase Structure Grammar

[Figure: phrase structure grammar tree for the English sentence "I shall be passing on to you some comments", as produced by Collins' parser]
Synchronous Phrase Structure Grammar

• English rule
  np → det jj nn
• French rule
  np → det nn jj
• Synchronous rule (indices indicate alignment):
  np → det1 nn2 jj3 | det1 jj3 nn2
Synchronous Grammar Rules

• Non-terminal rules
  np → det1 nn2 jj3 | det1 jj3 nn2
• Terminal rules
  n → maison | house
  np → la maison bleue | the blue house
• Mixed rules
  np → la maison jj1 | the jj1 house
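These rule types are straightforward to represent in code. A minimal Python sketch (class and field names are illustrative, not from any particular toolkit): co-indexed non-terminals are stored as (label, index) pairs, terminals as plain strings.

    from dataclasses import dataclass

    @dataclass
    class SCFGRule:
        lhs: str    # left-hand side non-terminal, e.g. "np"
        src: list   # source side: words and (label, index) pairs
        tgt: list   # target side: words and (label, index) pairs

    # non-terminal rule: np → det1 nn2 jj3 | det1 jj3 nn2
    reorder = SCFGRule("np",
                       [("det", 1), ("nn", 2), ("jj", 3)],
                       [("det", 1), ("jj", 3), ("nn", 2)])

    # mixed rule: np → la maison jj1 | the jj1 house
    mixed = SCFGRule("np",
                     ["la", "maison", ("jj", 1)],
                     ["the", ("jj", 1), "house"])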
Tree-Based Translation Model

• Translation by parsing
  – the synchronous grammar has to parse the entire input sentence
  – the output tree is generated at the same time
  – the process is broken up into a number of rule applications
• Translation probability
  score(tree, e, f) = ∏_i rule_i
• Many ways to assign probabilities to rules
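In code, the derivation score is just this product over the applied rules; a small sketch with made-up probabilities, summing logs to avoid numerical underflow on long derivations:

    import math

    rule_probs = [0.4, 0.25, 0.8, 0.5]   # p(rule_i) for one derivation (illustrative)

    # score(tree, e, f) = prod_i rule_i, computed in log space
    log_score = sum(math.log(p) for p in rule_probs)
    print(math.exp(log_score))           # ≈ 0.04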
Aligned Tree Pair

[Figure: phrase structure grammar trees with word alignment for the German–English sentence pair "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen" / "I shall be passing on to you some comments"]
Reordering Rule

• Subtree alignment
  [Figure: the German vp ending in aushändigen aligned to the English vp starting with passing on]
• Synchronous grammar rule
  vp → pper1 np2 aushändigen | passing on pp1 np2
• Note:
  – one word aushändigen mapped to two words passing on: ok
  – but: a fully non-terminal rule is not possible (one-to-one mapping constraint for non-terminals)
Another Rule

• Subtree alignment
  [Figure: the German pro covering Ihnen aligned to the English pp covering to you]
• Synchronous grammar rule (stripping out English internal structure)
  pro/pp → Ihnen | to you
• Rule with internal structure
  pro/pp → Ihnen | (pp (to to) (prp you))
Another Rule

• Translation of German werde to English shall be
  [Figure: the German vp with (vafin werde) aligned to the English vp with (md shall) and an embedded vp starting with (vb be)]
• Translation rule needs to include the mapping of the embedded vp
⇒ Complex rule
  vp → (vafin werde) vp1 | (md shall) (vp (vb be) vp1)
Internal Structure

• Stripping out internal structure
  vp → werde vp1 | shall be vp1
  ⇒ synchronous context-free grammar
• Maintaining internal structure
  vp → (vafin werde) vp1 | (md shall) (vp (vb be) vp1)
  ⇒ synchronous tree substitution grammar
Learning Synchronous Grammars

• Extracting rules from a word-aligned parallel corpus
• First: hierarchical phrase-based model
  – only one non-terminal symbol x
  – no linguistic syntax, just a formally syntactic model
• Then: synchronous phrase structure model
  – non-terminals for words and phrases: np, vp, pp, adj, ...
  – the corpus must also be parsed with a syntactic parser
Extracting Phrase Translation Rules

[Figure: word alignment matrix for "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen" / "I shall be passing on to you some comments"; a consistent phrase pair is highlighted]

  shall be = werde
Extracting Phrase Translation Rules

[Figure: the same word alignment matrix, with another consistent phrase pair highlighted]

  some comments = die entsprechenden Anmerkungen
Extracting Phrase Translation Rules

[Figure: the same word alignment matrix, with the largest phrase pair highlighted]

  werde Ihnen die entsprechenden Anmerkungen aushändigen
  = shall be passing on to you some comments
Extracting Hierarchical Phrase Translation Rules

[Figure: the same word alignment matrix; the sub-phrase pair "Ihnen die entsprechenden Anmerkungen" / "to you some comments" is subtracted from the larger phrase pair]

  werde X aushändigen = shall be passing on X
Formal Definition

• Recall: consistent phrase pairs

  (ē, f̄) consistent with A ⇔
    ∀ e_i ∈ ē: (e_i, f_j) ∈ A → f_j ∈ f̄
    and ∀ f_j ∈ f̄: (e_i, f_j) ∈ A → e_i ∈ ē
    and ∃ e_i ∈ ē, f_j ∈ f̄: (e_i, f_j) ∈ A

• Let P be the set of all extracted phrase pairs (ē, f̄)
Formal Definition

• Extend recursively: if

  (ē, f̄) ∈ P and (ē_sub, f̄_sub) ∈ P
  and ē = ē_pre + ē_sub + ē_post
  and f̄ = f̄_pre + f̄_sub + f̄_post
  and ē ≠ ē_sub and f̄ ≠ f̄_sub

  then add (ē_pre + x + ē_post, f̄_pre + x + f̄_post) to P
  (note: any of ē_pre, ē_post, f̄_pre, f̄_post may be empty)

• The set of hierarchical phrase pairs is the closure under this extension mechanism
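A direct (if inefficient) rendering of this extension step as a Python sketch: phrase pairs are stored as (English tuple, foreign tuple), and sub-phrases are located by token match rather than by alignment positions, which is a simplification of real extraction:

    def extend(P):
        """One round of the extension mechanism: subtract every extracted
        sub-phrase pair from every larger pair, replacing it with a gap X.
        Iterate until no new pairs appear to obtain the closure."""
        new_pairs = set(P)
        for (e, f) in P:
            for (e_sub, f_sub) in P:
                if e == e_sub or f == f_sub:
                    continue
                for i in range(len(e) - len(e_sub) + 1):
                    if e[i:i + len(e_sub)] != e_sub:
                        continue
                    for j in range(len(f) - len(f_sub) + 1):
                        if f[j:j + len(f_sub)] != f_sub:
                            continue
                        new_pairs.add((e[:i] + ("X",) + e[i + len(e_sub):],
                                       f[:j] + ("X",) + f[j + len(f_sub):]))
        return new_pairs

    P = {(("shall", "be"), ("werde",)),
         (("to", "you", "some", "comments"),
          ("Ihnen", "die", "entsprechenden", "Anmerkungen")),
         (("shall", "be", "passing", "on", "to", "you", "some", "comments"),
          ("werde", "Ihnen", "die", "entsprechenden", "Anmerkungen", "aushändigen"))}
    # one round already yields ("shall","be","passing","on","X") / ("werde","X","aushändigen")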
Comments

• Removal of multiple sub-phrases leads to rules with multiple non-terminals, such as:
  y → x1 x2 | x2 of x1
• Typical restrictions to limit complexity [Chiang, 2005]
  – at most 2 non-terminal symbols
  – at least 1 but at most 5 words per language
  – span at most 15 words (counting gaps)
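Such restrictions are easy to enforce as a filter during extraction; a sketch where the representation (non-terminals as ("X", index) tuples, words as strings) is my own illustrative choice:

    def keep_rule(src, tgt, span_len):
        """Chiang-style restrictions from the slide: at most 2 non-terminals,
        1-5 words per language, source span of at most 15 words (incl. gaps)."""
        def nts(side):
            return [s for s in side if isinstance(s, tuple)]
        def words(side):
            return [s for s in side if isinstance(s, str)]
        return (len(nts(src)) <= 2
                and 1 <= len(words(src)) <= 5
                and 1 <= len(words(tgt)) <= 5
                and span_len <= 15)

    # vp → gibt x1 x2 | gives np2 to np1 passes the filter:
    print(keep_rule(["gibt", ("X", 1), ("X", 2)],
                    ["gives", ("X", 2), "to", ("X", 1)], 10))  # True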
Learning Syntactic Translation Rules

[Figure: aligned German–English parse trees for "Ich werde Ihnen die entsprechenden Anmerkungen aushändigen" / "I shall be passing on to you some comments"; one highlighted subtree pair yields the rule below]

  (pro Ihnen) = (pp (to to) (prp you))
Constraints on Syntactic Rules

• Same word alignment constraints as hierarchical models
• Hierarchical: a rule can cover any span ⇔ syntactic rules must cover constituents in the tree
• Hierarchical: gaps may cover any span ⇔ gaps must cover constituents in the tree
• Far fewer rules are extracted (all other things being equal)
Impossible Rules

[Figure: aligned trees as before; the highlighted English span is not a constituent]

English span not a constituent → no rule extracted
Rules with Context

[Figure: aligned trees as before; the highlighted phrase pair werde / shall be]

A rule with this phrase pair requires syntactic context:

  vp → (vafin werde) vp | (md shall) (vp (vb be) vp)
Too Many Rules Extractable

• A huge number of rules can be extracted
  (every alignable node may or may not be part of a rule → exponential number of rules)
• Need to limit which rules to extract
• Option 1: similar restrictions as for the hierarchical model
  (maximum span size, maximum number of terminals and non-terminals, etc.)
• Option 2: only extract minimal rules ("GHKM" rules)
Minimal Rules

[Figure: aligned trees for the sentence pair, segmented into minimal rules]

Extract: the set of smallest rules required to explain the sentence pair
Lexical Rule

[Figure: aligned trees as before; highlighted: Ich ↔ I]

Extracted rule: prp → Ich | I
Lexical Rule

[Figure: aligned trees as before; highlighted: Ihnen ↔ you]

Extracted rule: prp → Ihnen | you
Lexical Rule

[Figure: aligned trees as before; highlighted: die ↔ some]

Extracted rule: dt → die | some
Lexical Rule

[Figure: aligned trees as before; highlighted: Anmerkungen ↔ comments]

Extracted rule: nns → Anmerkungen | comments
Insertion Rule

[Figure: aligned trees as before; the English word to has no German counterpart and is inserted]

Extracted rule: pp → x | to prp
Non-Lexical Rule

[Figure: aligned trees as before; the np is built from its dt and nns children]

Extracted rule: np → x1 x2 | dt1 nns2
Lexical Rule with Syntactic Context

[Figure: aligned trees as before; aushändigen is translated as passing on within the vp]

Extracted rule: vp → x1 x2 aushändigen | passing on pp1 np2
Lexical Rule with Syntactic Context

[Figure: aligned trees as before; werde is translated as shall be]

Extracted rule: vp → werde x | shall be vp (ignoring internal structure)
Non-Lexical Rule

[Figure: aligned trees as before; the s node combines the prp and the vp]

Extracted rule: s → x1 x2 | prp1 vp2

Done — note: one rule per alignable constituent
Unaligned Source Words

[Figure: aligned trees as before, with unaligned source words]

Attach to neighboring words or higher nodes → additional rules
Too Few Phrasal Rules?

• Lexical rules will be 1-to-1 mappings (unless the word alignment requires otherwise)
• But: phrasal rules are very beneficial in phrase-based models
• Solutions
  – combine rules that contain a maximum number of symbols (as in hierarchical models, recall: "Option 1")
  – compose minimal rules to cover a maximum number of non-leaf nodes
Composed Rules

• Current rules
  np → x1 x2 | dt1 nns2
  dt → die | some
  nns → entsprechenden Anmerkungen | comments
• Composed rule
  np → die entsprechenden Anmerkungen | (np (dt some) (nns comments))
  (1 non-leaf node: np)
Composed Rules

• Minimal rule
  vp → x1 x2 aushändigen | passing on pp1 np2
  (3 non-leaf nodes: vp, pp, np)
• Composed rule
  vp → Ihnen x1 aushändigen | passing on (pp (to to) (prp you)) np1
  (3 non-leaf nodes: vp, pp, np)
Relaxing Tree Constraints

• Impossible rule
  x → werde | (md shall) (vb be)
• Create a new non-terminal label: md+vb
⇒ New rule
  md+vb → werde | (md shall) (vb be)
Zollmann–Venugopal Relaxation

• If a span consists of two constituents, join them: x+y
• If a span consists of three constituents, join them: x+y+z
• If a span covers constituents with the same parent x and includes
  – every child but the first child y: label as x\y
  – every child but the last child y: label as x/y
• For all other cases, label as fail
⇒ More rules can be extracted, but the number of non-terminals blows up
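A sketch of this relabeling scheme as a single function; the calling convention (a list of covered constituent labels, plus parent/missing/side when the span covers all children of a parent except one at the edge) is my own simplification:

    def zv_label(constituents, parent=None, missing=None, side=None):
        """Zollmann-Venugopal-style span relabeling (simplified sketch)."""
        if parent is not None:                  # all children of parent but one
            return parent + ("\\" if side == "first" else "/") + missing
        if len(constituents) == 1:
            return constituents[0]              # a plain constituent
        if len(constituents) in (2, 3):
            return "+".join(constituents)       # x+y or x+y+z
        return "fail"

    print(zv_label(["md", "vb"]))               # md+vb
    print(zv_label([], parent="np", missing="dt", side="first"))  # np\dt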
Special Problem: Flat Structures

• Flat structures severely limit rule extraction

  (np (dt the) (nnp Israeli) (nnp Prime) (nnp Minister) (nnp Sharon))

• Can only extract rules for individual words or the entire phrase
Relaxation by Tree Binarization

  (np (dt the)
      (np (nnp Israeli)
          (np (nnp Prime)
              (np (nnp Minister) (nnp Sharon)))))

More rules can be extracted
Left-binarization or right-binarization?
Scoring Translation Rules

• Extract all rules from the corpus
• Score based on counts
  – joint rule probability: p(lhs, rhs_f, rhs_e)
  – rule application probability: p(rhs_f, rhs_e | lhs)
  – direct translation probability: p(rhs_e | rhs_f, lhs)
  – noisy-channel translation probability: p(rhs_f | rhs_e, lhs)
  – lexical translation probability: ∏_{e_i ∈ rhs_e} p(e_i | rhs_f, a)
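The count-based probabilities are simple relative frequencies over the extracted rules; a sketch with made-up counts:

    from collections import Counter

    # Rules observed during extraction, as (lhs, rhs_f, rhs_e) tuples (illustrative).
    extracted = [
        ("np", "das Haus", "the house"),
        ("np", "das Haus", "the house"),
        ("np", "das Haus", "the building"),
    ]

    joint = Counter(extracted)
    by_lhs = Counter(lhs for lhs, _, _ in extracted)
    by_lhs_f = Counter((lhs, f) for lhs, f, _ in extracted)
    total = sum(joint.values())

    def p_joint(lhs, f, e):     # p(lhs, rhs_f, rhs_e)
        return joint[(lhs, f, e)] / total

    def p_apply(lhs, f, e):     # p(rhs_f, rhs_e | lhs)
        return joint[(lhs, f, e)] / by_lhs[lhs]

    def p_direct(lhs, f, e):    # p(rhs_e | rhs_f, lhs)
        return joint[(lhs, f, e)] / by_lhs_f[(lhs, f)]

    print(p_direct("np", "das Haus", "the house"))  # ≈ 0.667 (2 of 3 extractions)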
Syntactic Decoding

Inspired by monolingual syntactic chart parsing: during decoding of the source sentence, a chart with translations for the O(n²) spans has to be filled

[Figure: German input sentence "Sie will eine Tasse Kaffee trinken" with POS tags PPER VAFIN ART NN NN VVINF and syntax tree (NP, VP, S)]
Syntax Decoding

[Figure: chart over "Sie will eine Tasse Kaffee trinken" with its syntax tree; no chart entries yet]

German input sentence with tree
Syntax Decoding

[Figure: step ➊ fills the span over Sie with the hypothesis PRO: she]

Purely lexical rule: filling a span with a translation (a constituent in the chart)
Syntax Decoding

[Figure: step ➋ fills the span over Kaffee with the hypothesis NN: coffee]

Purely lexical rule: filling a span with a translation (a constituent in the chart)
Syntax Decoding

[Figure: step ➌ fills the span over trinken with the hypothesis VB: drink]

Purely lexical rule: filling a span with a translation (a constituent in the chart)
Syntax Decoding

[Figure: step ➍ builds NP: a cup of coffee over "eine Tasse Kaffee", applying a rule with target side (np (np (det a) (nn cup)) (pp (in of) nn1)) to the underlying NN hypothesis]

Complex rule: matching underlying constituent spans, and covering words
Syntax Decoding

[Figure: step ➎ builds a VP over "will ... trinken", applying a rule with target side (vp (vbz wants) (to to) vb np) that moves the German clause-final verb into English position]

Complex rule with reordering
Syntax Decoding

[Figure: step ➏ applies the rule S → PRO VP, covering the whole sentence with the hypothesis S: she wants to drink a cup of coffee]
Bottom-Up Decoding

• For each span, a stack of (partial) translations is maintained
• Bottom-up: a higher stack is filled once the underlying stacks are complete
Naive Algorithm

Input: foreign sentence f = f1 ... f_lf, with syntax tree
Output: English translation e

 1: for all spans [start, end] (bottom up) do
 2:   for all sequences s of hypotheses and words in span [start, end] do
 3:     for all rules r do
 4:       if rule r applies to chart sequence s then
 5:         create new hypothesis c
 6:         add hypothesis c to chart
 7:       end if
 8:     end for
 9:   end for
10: end for
11: return English translation e from best hypothesis in span [0, lf]
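The same loop structure in runnable form, as a sketch: there is no scoring, pruning, or recombination, and the rule encoding (a source-side symbol tuple mapping to target sides, with integer indices into the matched hypotheses) is my own illustrative choice:

    from collections import defaultdict

    def sequences(words, chart, start, end):
        """Enumerate every way to cover [start, end) with input words and
        hypotheses from sub-spans; yields (symbols, fillers)."""
        if start == end:
            yield (), ()
            return
        # consume the next input word ...
        for syms, fills in sequences(words, chart, start + 1, end):
            yield (words[start],) + syms, fills
        # ... or a hypothesis over a sub-span [start, mid)
        for mid in range(start + 1, end + 1):
            for label, translation in chart[(start, mid)]:
                for syms, fills in sequences(words, chart, mid, end):
                    yield (label,) + syms, (translation,) + fills

    def naive_decode(words, rules):
        """Naive bottom-up chart decoding: loop structure only."""
        n = len(words)
        chart = defaultdict(set)
        for length in range(1, n + 1):                  # bottom up
            for start in range(n - length + 1):
                end = start + length
                for seq, fills in list(sequences(words, chart, start, end)):
                    for lhs, target in rules.get(seq, []):
                        out = " ".join(fills[t] if isinstance(t, int) else t
                                       for t in target)
                        chart[(start, end)].add((lhs, out))
        return chart

    # Tiny illustrative grammar over "das Haus":
    rules = {("das",): [("DET", ("the",))],
             ("Haus",): [("NN", ("house",))],
             ("DET", "NN"): [("NP", (0, 1))]}
    print(naive_decode(["das", "Haus"], rules)[(0, 2)])  # {('NP', 'the house')}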
Chart Organization

[Figure: chart cells over "Sie will eine Tasse Kaffee trinken" (PPER VAFIN ART NN NN VVINF) with syntax tree]

• The chart consists of cells that cover contiguous spans over the input sentence
• Each cell contains a set of hypotheses (1)
• Hypothesis = translation of a span with a target-side constituent

(1) In the book, they are called chart entries.
Dynamic Programming

Applying a rule creates a new hypothesis

[Figure: chart over "eine Tasse Kaffee trinken"; applying the rule NP → NP+P1 Kaffee | NP+P1 coffee to the hypothesis NP+P: a cup of creates the new hypothesis NP: a cup of coffee]
Dynamic Programming

Another hypothesis

[Figure: applying the rule NP → eine Tasse NP1 | a cup of NP1 to the hypothesis NP: coffee creates a second hypothesis NP: a cup of coffee]

Both hypotheses are indistinguishable in future search → can be recombined
Recombinable States

Recombinable?

  NP: a cup of coffee
  NP: a cup of coffee
  NP: a mug of coffee
Recombinable States

Recombinable?

  NP: a cup of coffee
  NP: a cup of coffee
  NP: a mug of coffee

Yes, iff at most a 2-gram language model is used
Recombinability

Hypotheses have to match in
• span of input words covered
• output constituent label
• first n–1 output words
  (not properly scored, since they lack context)
• last n–1 output words
  (still affect scoring of subsequently added words, just like in phrase-based decoding)

(n is the order of the n-gram language model)
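In a decoder, these matching criteria become the hypothesis's recombination key; a sketch with illustrative field names:

    from dataclasses import dataclass

    N = 3   # order of the n-gram language model

    @dataclass
    class Hypothesis:
        span: tuple      # (start, end) of input words covered
        label: str       # output constituent label
        words: tuple     # output words produced so far
        score: float

    def recombination_key(h):
        """Hypotheses with equal keys are indistinguishable in future
        search; only the best-scoring one needs to be kept."""
        return (h.span,
                h.label,
                h.words[:N - 1],     # first n-1 words (not properly scored yet)
                h.words[-(N - 1):])  # last n-1 words (context for later words)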
Language Model Contexts

When merging hypotheses, internal language model contexts are absorbed

[Figure: an S hypothesis "the foreign minister of Germany met with Condoleezza Rice in Frankfurt" built from an NP ("the foreign minister of Germany") and a VP ("met with Condoleezza Rice in Frankfurt"); the NP's last words "of Germany" are the relevant history, the VP's first words "met with" are the un-scored words]

  pLM(met | of Germany)
  pLM(with | Germany met)
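A sketch of this absorption step: when two hypotheses are merged, only the first n−1 words of the right part need new language model scores, conditioned on the left part's last n−1 words. Here p_lm is an assumed log-probability function, not a real API:

    N = 3   # trigram language model

    def lm_merge_cost(left_words, right_words, p_lm):
        """Score the boundary when merging two hypotheses: the first N-1
        words of the right part finally receive their full context."""
        history = list(left_words)[-(N - 1):]   # relevant history, e.g. "of Germany"
        cost = 0.0
        for w in list(right_words)[:N - 1]:     # previously un-scored words
            cost += p_lm(w, tuple(history))     # e.g. pLM(met | of Germany)
            history = (history + [w])[-(N - 1):]
        return cost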
Stack Pruning

• The number of hypotheses in each chart cell explodes
⇒ need to discard bad hypotheses, e.g., keep only the 100 best
• Different stacks for different output constituent labels?
• Cost estimates
  – translation model cost known
  – language model cost for internal words known
    → estimates for initial words
  – outside cost estimate?
    (how useful will an NP covering input words 3–5 be later on?)
Naive Algorithm: Blow-ups

• Many subspan sequences
  (for all sequences s of hypotheses and words in span [start, end])
• Many rules
  (for all rules r)
• Checking if a rule applies is not trivial
  (rule r applies to chart sequence s)
⇒ Unworkable
Solution

• Prefix tree data structure for rules
• Dotted rules
• Cube pruning
Storing Rules

• First concern: do they apply to the span?
  → have to match available hypotheses and input words
• Example rule
  np → x1 des x2 | np1 of the nn2
• Check for applicability
  – is there an initial sub-span with a hypothesis with constituent label np?
  – is it followed by a sub-span over the word des?
  – is it followed by a final sub-span with a hypothesis with label nn?
• Sequence of relevant information
  np • des • nn • np1 of the nn2
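A sketch of such a prefix tree over source-side symbols; names and structure are illustrative, not from a specific decoder:

    class RuleTrie:
        """Prefix tree over source-side symbols (words and non-terminal
        labels). A node stores the target sides of all rules whose
        source side ends exactly there."""
        def __init__(self):
            self.children = {}
            self.rules = []

        def add(self, src_symbols, target):
            node = self
            for sym in src_symbols:
                node = node.children.setdefault(sym, RuleTrie())
            node.rules.append(target)

        def lookup(self, sym):
            """Advance by one symbol; None if no stored rule continues this way."""
            return self.children.get(sym)

    trie = RuleTrie()
    # np → x1 des x2 | np1 of the nn2, keyed by the sequence np • des • nn
    trie.add(["np", "des", "nn"], ("np", "np1 of the nn2"))
    node = trie.lookup("np").lookup("des").lookup("nn")
    print(node.rules)   # [('np', 'np1 of the nn2')] → the rule applies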
Rule Applicability Check

Trying to cover a span of six words with the given rule

  NP • des • NN → NP: NP of the NN

[Figure: chart over "das Haus des Architekten Frank Gehry"]
Rule Applicability Check

First: check for hypotheses with output constituent label NP

  NP • des • NN → NP: NP of the NN

[Figure: chart over "das Haus des Architekten Frank Gehry"]
Rule Applicability Check

Found an NP hypothesis in a cell, matched the first symbol of the rule

  NP • des • NN → NP: NP of the NN

[Figure: the NP hypothesis covers "das Haus"]
Rule Applicability Check

Matched the word des, matched the second symbol of the rule

  NP • des • NN → NP: NP of the NN

[Figure: the NP hypothesis over "das Haus" is followed by the word des]
Rule Applicability Check

Found an NN hypothesis in a cell, matched the last symbol of the rule

  NP • des • NN → NP: NP of the NN

[Figure: the NN hypothesis covers "Architekten Frank Gehry"]
Rule Applicability Check

Matched the entire rule → apply it to create an NP hypothesis

  NP • des • NN → NP: NP of the NN

[Figure: a new NP hypothesis spans the full six words]
Rule Applicability Check

Look up output words to create the new hypothesis
(note: there may be many matching underlying NP and NN hypotheses)

  NP • des • NN → NP: NP of the NN

[Figure: NP: the house + NN: architect Frank Gehry → NP: the house of the architect Frank Gehry]
Checking Rules vs. Finding Rules

• What we showed:
  – given a rule
  – check if and how it can be applied
• But there are too many rules (millions) to check them all
• Instead:
  – given the underlying chart cells and input words
  – find which rules apply
Prefix Tree for Rules

[Figure: prefix tree over source-side symbols (NP, des, das, Haus, NN, DET, VP, ...); nodes store the target sides of complete rules]

Highlighted rules:
  np → np1 det2 nn3 | np1 in2 nn3
  np → np1 | np1
  np → np1 des nn2 | np1 of the nn2
  np → np1 des nn2 | np2 np1
  np → det1 nn2 | det1 nn2
  np → das Haus | the house
Dotted Rules: Key Insight

• If we can apply a rule like p → A B C | x to a span
• Then we could also have applied a rule like q → A B | y to a sub-span with the same starting word
⇒ We can re-use rule lookup by storing the dotted rule A B •
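A dotted rule is then just the matching progress so far: a start position plus a node in the rule prefix tree; extending it by one symbol (an input word or a constituent label) is a single child lookup in the trie sketched above:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DottedRule:
        start: int      # input position where the match began
        matched: tuple  # symbols consumed so far, e.g. ("np", "des")
        node: object    # current node in the rule prefix tree

    def advance(dotted, symbol):
        """Extend a dotted rule by one symbol (an input word or the
        constituent label of a hypothesis in an adjacent cell)."""
        nxt = dotted.node.lookup(symbol)
        if nxt is None:
            return None     # no rule in the grammar continues this prefix
        return DottedRule(dotted.start, dotted.matched + (symbol,), nxt)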
Finding Applicable Rules in Prefix Tree

[Figure: empty chart over "das Haus des Architekten Frank Gehry"]
Covering the First Cell

[Figure: processing starts with the span over das]
Looking up Rules in the Prefix Tree

[Figure: from the root ● of the prefix tree, the edge das leads to node ❶]
Taking Note of the Dotted Rule

[Figure: the dotted rule das ❶ is recorded for the span over das]
Checking if Dotted Rule has Translations

[Figure: node ❶ stores the translations DET: the and DET: that]
Applying the Translation Rules

[Figure: the hypotheses DET: the and DET: that are added to the cell over das]
Looking up Constituent Label in Prefix Tree

[Figure: from the root, the edge DET leads to node ❷]
Add to Span's List of Dotted Rules

[Figure: the dotted rules das ❶ and DET ❷ are now recorded for the span over das]
Moving on to the Next Cell

[Figure: processing moves to the span over Haus]
Looking up Rules in the Prefix Tree

[Figure: from the root, the edge Haus leads to node ❸]
Taking Note of the Dotted Rule

[Figure: the dotted rule Haus ❸ is recorded for the span over Haus]
Checking if Dotted Rule has Translations

[Figure: node ❸ stores the translations NN: house and NP: house]
Applying the Translation Rules

[Figure: the hypotheses NN: house and NP: house are added to the cell over Haus]
Looking up Constituent Label in Prefix Tree

[Figure: from the root, the edges NN and NP lead to nodes ❹ and ❺]
Add to Span's List of Dotted Rules

[Figure: the dotted rules Haus ❸, NN ❹, and NP ❺ are recorded for the span over Haus]
More of the Same

[Figure: the remaining one-word spans (des, Architekten, Frank, Gehry) are processed the same way, yielding hypotheses such as IN: of, DET: the, NN: architect, NP: architect, NNP: Frank, NNP: Gehry, and the corresponding dotted rules]
Moving on to the Next Cell

[Figure: processing moves to the two-word span over das Haus]
Covering a Longer Span

• Cannot consume multiple words at once
• All rules are extensions of existing dotted rules
• Here: only extensions of the span over das are possible

[Figure: chart with all one-word cells filled]
Extensions of Span over das

[Figure: the dotted rules das ❶ and DET ❷ over das may each be extended by NN, NP, or the word Haus]
Looking up Rules in the Prefix Tree

[Figure: extensions found in the prefix tree: das ❶ → Haus ❻, das ❶ → NN ❼, DET ❷ → Haus ❽, DET ❷ → NN ❾]
Taking Note of the Dotted Rule

[Figure: the dotted rules das Haus ❻, das NN ❼, DET Haus ❽, and DET NN ❾ are recorded for the span over das Haus]
Checking if Dotted Rules have Translations

[Figure: node ❻ stores NP: the house; node ❼ stores NP: the NN; node ❽ stores NP: DET house; node ❾ stores NP: DET NN]
Applying the Translation Rules

[Figure: the hypotheses NP: the house and NP: that house are added to the cell over das Haus]
Looking up Constituent Label in Prefix Tree

[Figure: from the root, the edge NP leads to node ❺]
Add to Span's List of Dotted Rules

[Figure: the dotted rules das Haus ❻, das NN ❼, DET Haus ❽, DET NN ❾, and NP ❺ are recorded for the span over das Haus]
Even Larger Spans

Extend lists of dotted rules with cell constituent labels:
a span's dotted rule list (with the same start) plus a neighboring span's constituent labels of hypotheses (with the same end)

[Figure: chart over "das Haus des Architekten Frank Gehry"]
Reflections

• Complexity O(rn³) with sentence length n and size of the dotted rule list r
  – may introduce a maximum size for spans that do not start at the beginning
  – may limit the size of the dotted rule list (very arbitrary)
• Does the list of dotted rules explode?
• Yes, if there are many rules with neighboring target-side non-terminals
  – such rules apply in many places
  – rules with words are much more restricted
Difficult Rules

• Some rules may apply in too many ways
• Neighboring input non-terminals
  vp → gibt x1 x2 | gives np2 to np1
  – non-terminals may match many different pairs of spans
  – especially a problem for hierarchical models (no constituent label restrictions)
  – may be okay for syntax models
• Three neighboring input non-terminals
  vp → trifft x1 x2 x3 heute | meets np1 today pp2 pp3
  – will get out of hand even for syntax models
Where are we now?

• We know which rules apply
• We know where they apply (each non-terminal is tied to a span)
• But there are still many choices
  – many possible translations
  – each non-terminal may match multiple hypotheses
  → the number of choices is exponential in the number of non-terminals
Rules with One Non-Terminal

• Found applicable rules pp → des x | ... np ...

  PP → of NP, PP → by NP, PP → in NP, PP → on NP, PP → to NP
  matching NP hypotheses: the architect ..., architect Frank ..., the famous ..., Frank Gehry

• The non-terminal will be filled by any of h underlying matching hypotheses
• Choice of t lexical translations
⇒ Complexity O(ht)

(note: we may not group rules by target constituent label, so a rule np → des x | the np would also be considered here)
Rules with Two Non-Terminals

• Found applicable rule np → x1 des x2 | np1 ... np2

  NP → NP of NP, NP → NP by NP, NP → NP in NP, NP → NP on NP, NP → NP to NP
  first NP hypotheses: a house, a building, the building, a new house
  second NP hypotheses: the architect ..., architect Frank ..., the famous ..., Frank Gehry

• The two non-terminals will each be filled by any of h underlying matching hypotheses
• Choice of t lexical translations
⇒ Complexity O(h²t) — a three-dimensional "cube" of choices

(note: rules may also reorder differently)
Cube Pruning

[Figure: a grid with the NP hypotheses on one axis (a house 1.0, a building 1.3, the building 2.2, a new house 2.6) and the lexical translations on the other (in the ... 1.5, by architect ... 1.7, by the ... 2.6, of the ... 3.2)]

Arrange all the choices in a "cube"
(here: a square; generally an orthotope, also called a hyperrectangle)
Create the First Hypothesis

[Figure: the cube as before; the corner combination (0,0) is created with score 2.1]

• Hypotheses created in cube: (0,0)
Add ("Pop") Hypothesis to Chart Cell

[Figure: the cube as before; the hypothesis with score 2.1 is popped]

• Hypotheses created in cube: (none)
• Hypotheses in chart cell stack: (0,0)
Create Neighboring Hypotheses

[Figure: the cube as before; the neighbors (0,1) and (1,0) are created with scores 2.7 and 2.5]

• Hypotheses created in cube: (0,1), (1,0)
• Hypotheses in chart cell stack: (0,0)
Pop Best Hypothesis to Chart Cell

[Figure: the cube as before; the better neighbor (1,0) with score 2.5 is popped]

• Hypotheses created in cube: (0,1)
• Hypotheses in chart cell stack: (0,0), (1,0)
Create Neighboring Hypotheses

[Figure: the cube as before; the neighbors (1,1) and (2,0) are created with scores 2.4 and 3.1]

• Hypotheses created in cube: (0,1), (1,1), (2,0)
• Hypotheses in chart cell stack: (0,0), (1,0)
More of the Same

[Figure: the cube as before; (1,1) with score 2.4 is popped and its neighbors (1,2) and (2,1) are created with scores 3.0 and 3.8]

• Hypotheses created in cube: (0,1), (1,2), (2,1), (2,0)
• Hypotheses in chart cell stack: (0,0), (1,0), (1,1)
Queue of Cubes

• Several groups of rules will apply to a given span
• Each of them will have a cube
• We can create a queue of cubes
⇒ Always pop off the most promising hypothesis, regardless of cube
• May have separate queues for different target constituent labels
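A sketch of cube pruning over a single two-dimensional cube, using a heap as the agenda; combine() is a stand-in for real scoring (with a language model the combined cost is not simply additive, which is why neighbors must be created lazily):

    import heapq

    def cube_pruning(hyps, trans, combine, k):
        """Pop the k best combinations of two cost-sorted choice lists
        without creating all len(hyps) * len(trans) of them."""
        agenda = [(combine(hyps[0], trans[0]), 0, 0)]
        created, popped = {(0, 0)}, []
        while agenda and len(popped) < k:
            cost, i, j = heapq.heappop(agenda)        # most promising hypothesis
            popped.append((cost, i, j))
            for ni, nj in ((i + 1, j), (i, j + 1)):   # create its neighbors
                if ni < len(hyps) and nj < len(trans) and (ni, nj) not in created:
                    created.add((ni, nj))
                    heapq.heappush(agenda, (combine(hyps[ni], trans[nj]), ni, nj))
        return popped

    # Costs as in the figures; plain addition stands in for real scoring.
    hyps = [1.0, 1.3, 2.2, 2.6]
    trans = [1.5, 1.7, 2.6, 3.2]
    print(cube_pruning(hyps, trans, lambda h, t: h + t, k=4))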
Bottom-Up Chart Decoding Algorithm

 1: for all spans (bottom up) do
 2:   extend dotted rules
 3:   for all dotted rules do
 4:     find group of applicable rules
 5:     create a cube for it
 6:     create first hypothesis in cube
 7:     place cube in queue
 8:   end for
 9:   for specified number of pops do
10:     pop off best hypothesis of any cube in queue
11:     add it to the chart cell
12:     create its neighbors
13:   end for
14:   extend dotted rules over constituent labels
15: end for
Two-Stage Decoding

• First stage: decoding without a language model (-LM decoding)
  – may be done exhaustively
  – eliminate dead ends
  – optionally prune out low-scoring hypotheses
• Second stage: add the language model
  – limited to the packed chart obtained in the first stage
• Note: essentially, we do two-stage decoding for each span at a time
Coarse-to-Fine

• Decode with increasingly complex models
• Examples
  – reduced language model [Zhang and Gildea, 2008]
  – reduced set of non-terminals [DeNero et al., 2009]
  – language model on clustered word classes [Petrov et al., 2008]
Outside Cost Estimation

• Which spans should be more emphasized in search?
• An initial decoding stage can provide outside cost estimates

[Figure: chart over "Sie will eine Tasse Kaffee trinken" (PPER VAFIN ART NN NN VVINF) with an NP hypothesis covering part of the sentence]

• Use min/max language model costs to obtain an admissible heuristic
  (or at least something that will guide search better)
Open Questions

• Where does the best translation fall out of the beam?
• How accurate are the LM estimates?
• Are particular types of rules discarded too quickly?
• Are there systemic problems with cube pruning?
Summary

• Synchronous context-free grammars
• Extracting rules from a syntactically parsed parallel corpus
• Bottom-up decoding
• Chart organization: dynamic programming, stacks, pruning
• Prefix tree for rules
• Dotted rules
• Cube pruning