Discriminative Pruning for Discriminative ITG Alignment
Shujie Liu, Chi-Ho Li and Ming Zhou

Outline

- Introduction
- Discriminative Model for Pruning
  - Training Sample Extraction
  - Training: MERT
  - Features
- Experiments and Analysis


Alignment and ITG

Alignment problem: finding translation pairs in bitext sentences.

[Figure: word-alignment grid between the Chinese tokens 向 / 财政 / 负责 and the English sentence "be accountable to the Financial Secretary".]


Alignment and ITG

ITG (Wu, 1997) does synchronous parsing of the two languages; word alignment is the by-product.

Lexical rules:    C → ei/fi    C → Ɛ/fi    C → ei/Ɛ
Structural rules: X → [X X] (straight)    X → <X X> (inverted)

[Figure: an ITG parse tree of the example sentence pair, with the lexical rules generating the leaves, including empty (Ɛ) links.]

Alignment and ITG

[Figure: two alternative ITG parse trees for the same alignment of the example sentence pair.]

Alignment and ITG

ITG (Wu, 1997) does synchronous parsing of the two languages; word alignment is the by-product.

Lexical rules:
  C → ei/fi    C → Ɛ/fi    C → ei/Ɛ
Structural rules (normal form):
  S → A | B | C
  A → [AB] | [BB] | [CB] | [AC] | [BC] | [CC]
  B → <AA> | <BA> | <CA> | <AC> | <BC> | <CC>

[Figure: an ITG parse tree of the example sentence pair using the categories A, B and C.]


Why Pruning

- ITG has achieved state-of-the-art results against gold-standard alignments (Haghighi et al., 2009).
- Speed is a major obstacle in ITG parsing:
  for each F-span [s, t]          (O(n^2) F-spans)
    for each E-span [u, v]        (O(n^2) E-spans)
      try to find an optimal split point pair [S, U]   (O(n^2) choices)
      that splits the span pair [s/u, t/v] into two smaller span pairs [s/u, S/U] and [S/U, t/v].
- The complexity of ITG parsing without pruning is therefore O(n^6); it takes more than an hour to parse a sentence pair longer than 60 words.
- Pruning is necessary.
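As a rough illustration of where the O(n^6) comes from, here is a minimal, hypothetical chart-parsing loop. It is only a sketch: lex_score and the scoring scheme are placeholders, empty links and word-insertion rules are ignored, and it is not the implementation used in the paper.

```python
from collections import defaultdict

def itg_parse(f_words, e_words, lex_score):
    """Bare-bones ITG chart loop; chart[(s, t, u, v)] holds the best score of
    the span pair [s, t) / [u, v) (0-based, end-exclusive)."""
    n, m = len(f_words), len(e_words)
    chart = defaultdict(float)

    # Initialise 1x1 span pairs with lexical-rule scores (C -> e/f).
    for s in range(n):
        for u in range(m):
            chart[(s, s + 1, u, u + 1)] = lex_score(f_words[s], e_words[u])

    # O(n^2) F-spans x O(n^2) E-spans x O(n^2) split point pairs = O(n^6).
    for f_len in range(2, n + 1):
        for s in range(n - f_len + 1):
            t = s + f_len
            for u in range(m):
                for v in range(u + 1, m + 1):
                    best = chart[(s, t, u, v)]
                    for S in range(s + 1, t):          # split point on the F side
                        for U in range(u + 1, v):      # split point on the E side
                            straight = chart[(s, S, u, U)] + chart[(S, t, U, v)]   # X -> [X X]
                            inverted = chart[(s, S, U, v)] + chart[(S, t, u, U)]   # X -> <X X>
                            best = max(best, straight, inverted)
                    chart[(s, t, u, v)] = best
    return chart[(0, n, 0, m)]
```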


Three Kinds of Pruning

1. Discard F-spans and/or E-spans: this discards too many span pairs and is (empirically) highly harmful to alignment performance.
2. Discard some alignments for a span pair: equivalent to minimizing the beam size of each span pair, i.e. k-best parsing.
3. Discard some unpromising span pairs, i.e. limit the number of E-spans per F-span: this is what our research is about.


Related Work

- Tic-tac-toe pruning (Zhang and Gildea, 2005): uses inside and outside scores to prune candidate E-spans for each F-span.
- Tree constraint pruning (Cherry and Lin, 2006): invalid spans are those that interrupt the phrases of a dependency tree, i.e. [x1, j] and [j, x2].
- High-precision alignment pruning (Haghighi et al., 2009): prune all bitext cells that would invalidate more than 8 of the high-precision alignment links.
- 1-1 alignment posterior pruning (Haghighi et al., 2009): prune all 1-1 bitext cells whose posterior is below 10^-4 in both HMM models.



Linear Model

As all these techniques contribute to making good pruning decisions, we try to incorporate them all as features in ITG pruning.

DPDI: Discriminative Pruning for Discriminative ITG

P(e | f) = exp(λ · Ψ(e, f)) / Σ_e' exp(λ · Ψ(e', f))

λ: feature weights        Ψ: features
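A small sketch of how such a log-linear model scores and ranks the candidate E-spans of one F-span. The feature names, values and weights below are hypothetical, not the actual model of the paper.

```python
import math

def rank_espans(candidates, weights):
    """candidates: list of (espan, feature_dict); weights: feature name -> weight.
    Returns the candidates sorted by P(e | f) under the log-linear model."""
    scores = [sum(weights.get(name, 0.0) * value for name, value in feats.items())
              for _, feats in candidates]
    z = sum(math.exp(s) for s in scores)                  # normalisation over all candidates
    probs = [math.exp(s) / z for s in scores]
    ranked = sorted(zip(candidates, probs), key=lambda pair: -pair[1])
    return [(espan, p) for (espan, _), p in ranked]

# Hypothetical usage: two candidate E-spans for one F-span.
candidates = [((1, 3), {"inside_prob": -2.1, "length_ratio": 0.15}),
              ((1, 4), {"inside_prob": -3.0, "length_ratio": 0.35})]
weights = {"inside_prob": 1.0, "length_ratio": -2.0}
print(rank_espans(candidates, weights))
```

Keeping only the top-N E-spans of this ranked list for each F-span then corresponds to the pruning decision.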



Training Sample Extraction

Training samples:
- consist of various F-spans and their corresponding E-spans;
- are extracted from word-alignment annotation, subject to ITG constraints.

Example (书 就会 来 的 / "the book is to come"), extracted span pairs include:
  书 就会 来 的 / the book is to come
  书 就会 来 / the book is to come
  书 就会 / the book is to
  书 / the book
  就会 / is to
  Ɛ / the    书 / book    Ɛ / is    就会 / to    来 / come    的 / Ɛ
  ...
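For illustration, here is a simplified, consistency-based sketch of extracting (F-span, E-span) training pairs from a gold alignment. The paper actually derives the samples from an ITG parse of the annotated alignment (see the backup slide near the end); this sketch only approximates that, and all names and values are hypothetical.

```python
def extract_span_pairs(n_f, n_e, links, max_len=4):
    """For each F-span, return the minimal E-span consistent with the gold
    alignment.  `links` is a set of (f_idx, e_idx) pairs, 0-based; spans are
    end-exclusive."""
    samples = []
    for s in range(n_f):
        for t in range(s + 1, min(n_f, s + max_len) + 1):       # F-span [s, t)
            e_idxs = [e for f, e in links if s <= f < t]
            if not e_idxs:
                continue
            u, v = min(e_idxs), max(e_idxs) + 1                  # minimal covering E-span [u, v)
            # Consistency: no link inside the E-span may leave the F-span.
            if all(s <= f < t for f, e in links if u <= e < v):
                samples.append(((s, t), (u, v)))
    return samples

# Hypothetical example in the spirit of 书 就会 来 的 / "the book is to come".
links = {(0, 1), (1, 3), (2, 4)}       # illustrative links only
print(extract_span_pairs(4, 5, links))
```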



Loss Function

loss(r_s, ê(f_s; λ_1^M)) =  -rank(r_s)   if r_s ∈ ê(f_s; λ_1^M)
                             penalty      otherwise

f_s: an F-span;  r_s: its correct E-span
ê(f_s; λ_1^M): the N-best list of candidate E-spans given f_s and the feature weights λ_1^M
rank(r_s): the rank of r_s in the N-best list
penalty: if r_s is not in the N-best list at all, the loss is defined to be penalty (-100,000).

Rationale: keep as many correct E-spans as possible in the N-best lists, and push the correct E-spans upward as much as possible.

[Example: if the correct E-span is at the top of the 10-best list, loss = -1; if it is not in the 10-best list at all, loss = -100,000.]
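A tiny sketch of this loss, assuming a 1-based rank and the penalty of -100,000 from the slide:

```python
PENALTY = -100_000        # loss when the correct E-span is absent from the N-best list

def espan_loss(correct_espan, nbest):
    """Loss for one F-span: the negated rank of the correct E-span in the
    N-best list (ordered best-first), or the penalty if it is missing."""
    for rank, espan in enumerate(nbest, start=1):
        if espan == correct_espan:
            return -rank
    return PENALTY

def total_loss(samples):
    """Sum of per-span losses over (correct E-span, N-best list) pairs."""
    return sum(espan_loss(r, nbest) for r, nbest in samples)

# Hypothetical usage
print(espan_loss((1, 3), [(1, 3), (1, 2), (2, 3)]))    # -> -1
print(espan_loss((1, 7), [(1, 3), (1, 2), (2, 3)]))    # -> -100000
```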

MERT: Minimum Error Rate Training

- The training method is very similar to MERT for SMT.
- An important part of MERT for SMT is the line search: a search for the best point along one fixed dimension (one feature weight).
- The BLEU score changes only when the best candidate changes.
- The changing best candidates form the upper envelope (the red curve in the figure).
- The points at which the best candidate changes are the interval boundaries (the green points).
- Finding the interval boundaries is the key step of normal MERT.

[Figure: upper envelope and interval boundaries in normal MERT.]

MERT: Minimum Error Rate Training

- Difference from normal MERT: instead of finding the interval boundaries at which the optimal candidate changes, we find the interval boundaries at which the rank (index) of the correct E-span changes.
- These boundaries are the intersections between the score line of the correct (golden) E-span and the score lines of all other candidate E-spans.
- The performance gain at each boundary is the difference between the loss before the boundary and the loss after it.

[Figure: modified MERT, worked example with N = 10; crossing a boundary changes the loss by ±1 when the rank of the correct E-span changes by one, and by ±99,991 when the correct E-span enters or leaves the 10-best list (penalty = -100,000).]
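A minimal sketch of the modified line search on a single weight dimension, assuming (as in MERT) that each candidate's score is linear in the weight being tuned. Only the boundary computation for one F-span is shown; sweeping the boundaries of all spans and accumulating the loss changes to pick the best interval is omitted.

```python
def rank_change_boundaries(correct, others):
    """correct = (a_r, b_r) and each element of `others` = (a, b), where a
    candidate's score along this dimension is  score(w) = a + b * w.
    Returns the weight values at which the correct E-span's rank changes,
    with the direction of the change (+1: it overtakes a competitor,
    -1: it is overtaken)."""
    a_r, b_r = correct
    boundaries = []
    for a, b in others:
        if b == b_r:                       # parallel score lines: order never changes
            continue
        w_star = (a - a_r) / (b_r - b)     # intersection of the two score lines
        direction = +1 if b_r > b else -1  # just right of w_star, the higher slope wins
        boundaries.append((w_star, direction))
    return sorted(boundaries)

# Hypothetical usage: the correct E-span against two competitors.
print(rank_change_boundaries((0.0, 1.0), [(1.0, 0.5), (-0.5, 2.0)]))
```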



Features

Features for the pruning model (F-span [i, j], E-span [l, m]):

- Inside probability and outside probability (the tic-tac-toe scores)
- Alignment count ratio (similar to Haghighi et al.'s): 2 * Count(links in this span pair) / (j - i + m - l)
- Alignment invalid count ratio: 2 * Count(links linked to outside) / (j - i + m - l)
- Length ratio: |(j - i) / (m - l) - 1.15|, where 1.15 is the average ratio of sentence lengths
- Position ratio: |(j + i) / (2 * length(src sent)) - (l + m) / (2 * length(trg sent))|, based on the monotonic assumption that Position(F-span) ≈ Position(E-span)
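A small sketch of how the count-based and ratio features above might be computed. The index conventions (0-based, end-exclusive) and the link format are assumptions; the inside and outside probabilities come from a separate model and are omitted.

```python
def pruning_features(i, j, l, m, links, src_len, trg_len):
    """Span-pair features for F-span [i, j) and E-span [l, m).
    `links` is a set of (f, e) word links from a simpler alignment model."""
    span_len = (j - i) + (m - l)
    inside = sum(1 for f, e in links if i <= f < j and l <= e < m)
    # Links that connect a word inside the span pair to a word outside it.
    invalid = sum(1 for f, e in links if (i <= f < j) != (l <= e < m))
    return {
        "align_count_ratio":   2.0 * inside / span_len,
        "align_invalid_ratio": 2.0 * invalid / span_len,
        "length_ratio":        abs((j - i) / (m - l) - 1.15),
        "position_ratio":      abs((j + i) / (2.0 * src_len) - (l + m) / (2.0 * trg_len)),
    }

# Hypothetical usage on a 4-word / 4-word sentence pair.
print(pruning_features(1, 3, 2, 4, {(1, 2), (0, 3), (3, 3)}, 4, 4))
```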


Small-scale Alignment Evaluation

- The first set of experiments evaluates the performance of the three pruning methods using the Berkeley annotated data.
- We use the first 250 sentence pairs as training data and the remaining 241 pairs as test data.
- The corresponding numbers of E-spans in the training and test data are 4,590 and 3,951 respectively.
- Two ITG models are used: W-DITG and HP-DITG.
- The F-score upper bound, the actual F-score and the time cost are compared.

Small-scale Alignment Evaluation

ID | Pruning | Beam size | Pruning / total time | F-score upper bound | F-score
 1 | DPDI    | 10        | 72''/3'03''          | 88.5%               | 82.5%
 2 | TTT     | 10        | 58''/2'38''          | 87.5%               | 81.1%
 3 | TTT     | 20        | 53''/6'55''          | 88.6%               | 82.4%
 4 | DP      | --        | 11''/6'01''          | 86.1%               | 80.5%

Table 1: Evaluation of DPDI against TTT (tic-tac-toe) and DP (dynamic programming) for W-DITG

- With the same beam size, DPDI spends a bit more time, but its F-score upper bound is about 1 point higher.
- DPDI achieves an even larger improvement in actual F-score.

Small-scale Alignment Evaluation

ID | Pruning | Beam size | Pruning / total time | F-score upper bound | F-score
 1 | DPDI    | 10        | 72''/5'18''          | 93.9%               | 87.0%
 2 | TTT     | 10        | 58''/4'51''          | 93.0%               | 84.8%
 3 | TTT     | 20        | 53''/12'5''          | 94.0%               | 86.5%
 4 | DP      | --        | 11''/15'39''         | 91.4%               | 83.6%

Table 2: Evaluation of DPDI against TTT (tic-tac-toe) and DP (dynamic programming) for HP-DITG

- Roughly the same observations as for W-DITG hold.
- In addition to the superiority of DPDI, note that HP-DITG achieves a much higher F-score and F-score upper bound (for more details, please see our COLING 2010 paper).

Large-scale End-to-End Experiment

Machine translation evaluation:
- Bilingual training data: the NIST training set, excluding the Hong Kong Law and Hong Kong Hansard portions
- Language model: a 5-gram language model trained on the Xinhua section of the Gigaword corpus
- Development corpus: the NIST'03 test set
- Test corpus: the NIST'05 and NIST'08 test sets

Large-scale End-to-End Experiment

ID | Pruning | Beam size | Time cost | Bleu-05 | Bleu-08
 1 | DPDI    | 10        | 1092h     | 38.57   | 28.31
 2 | TTT     | 10        | 972h      | 37.96   | 27.37
 3 | TTT     | 20        | 2376h     | 38.13   | 27.58
 4 | DP      | --        | 2068h     | 37.43   | 27.12

Table 3: Evaluation of DPDI against TTT and DP for HP-DITG

- HP-DITG using DPDI achieves the best Bleu score with acceptable time cost.
- One explanation for the better performance of HP-DITG is better phrase pair extraction due to DPDI.
- Good ITG pruning like DPDI guides the subsequent ITG alignment process, so that fewer links inconsistent with good phrase pairs are produced.

Large-scale End-to-End Experiment

ID | Method  | F-score | Bleu-05 | Bleu-08
 1 | HMM     | 80.1%   | 36.91   | 26.86
 2 | Giza++  | 84.2%   | 37.70   | 27.33
 3 | BITG    | 85.9%   | 37.92   | 27.85
 4 | W-DITG  | 82.5%   | --      | --
 5 | HP-DITG | 87.0%   | 38.57   | 28.31

Table 4: Evaluation of DPDI against HMM, Giza++ and BITG

- W-DITG is not as good as HMM, Giza++ and BITG, since it suffers from the 1-to-1 alignment constraint.
- HP-DITG (with DPDI) is better than the three baselines in both alignment F-score and Bleu score.

Summary

- A discriminative pruning method (DPDI) is proposed, which uses minimum error rate training and various features.
- DPDI is an effective way to reduce the number of bitext cells in bilingual parsing.
- DPDI improves not only alignment performance but also SMT performance.

Thanks



Training Sample Extraction

The annotated data for training: we use the phrase pairs extracted from gold-alignment sentence pairs as the annotated data for training.

[Figure: ITG decomposition of an example span pair. The span pair [e1,e3]/[f1,f2] is derived by A → [C, C], giving candidate link sets {e1/f1, e3/f2} and {e2/f1, e3/f2}; each C span pair (e.g. [e1,e2]/[f1] and [e2,e3]/[f2]) is in turn derived by C → [Ce, Cw], pairing an empty link (e.g. e1/Ɛ or e2/Ɛ) with a word link (e.g. e1/f1, e2/f1 or e3/f2).]

Evaluation Criteria

The upper bound on alignment F-score: how many links in the annotated alignment can be kept in an ITG parse.

hit(Cw[u, v]) = 1 if <u, v> ∈ R, 0 otherwise
hit(Ce) = 0;  hit(Cf) = 0
hit(X[f, e]) = max over Y, Z, f1, e1, f2, e2 of ( hit(Y[f1, e1]) + hit(Z[f2, e2]) )

where X, Y, Z are variables over the categories of the ITG grammar, and R comprises the gold links in the annotated alignment.

[Example: A: [e1,e3]/[f1,f2], hit = max{1+1, 1+1} = 2;  C: [e1,e2]/[f1], hit = max{0+1} = 1;  C: [e2,e3]/[f2], hit = max{0+1} = 1;  Cw: e1/f1, hit = 1;  Ce: e1/Ɛ, hit = 0.]
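A rough sketch of this recursion. It ignores the A/B/C category distinctions and assumes 0-based, end-exclusive spans, so it is an illustration of the idea rather than the exact computation.

```python
from functools import lru_cache

def fscore_upper_bound_hits(n_f, n_e, gold_links):
    """Maximum number of gold links in R (a set of (f, e) pairs) that any ITG
    parse of the n_f x n_e sentence pair can keep."""

    @lru_cache(maxsize=None)
    def hit(fs, ft, es, et):
        f_len, e_len = ft - fs, et - es
        if f_len <= 1 and e_len <= 1:
            # Lexical level: hit(Cw[u, v]) = 1 iff <u, v> is a gold link;
            # null links (Ce, Cf) contribute 0.
            return 1 if (f_len == 1 and e_len == 1 and (fs, es) in gold_links) else 0
        best = 0
        for S in range(fs, ft + 1):               # split point on the F side
            for U in range(es, et + 1):           # split point on the E side
                if S in (fs, ft) and U in (es, et):
                    continue                      # skip trivial corner "splits"
                straight = hit(fs, S, es, U) + hit(S, ft, U, et)
                inverted = hit(fs, S, U, et) + hit(S, ft, es, U)
                best = max(best, straight, inverted)
        return best

    return hit(0, n_f, 0, n_e)

# Hypothetical usage: a 3 x 2 sentence pair with two gold links.
print(fscore_upper_bound_hits(3, 2, {(0, 0), (2, 1)}))    # -> 2
```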

Features (worked example)

Features for the pruning model (F-span, E-span), computed for an example span pair in a 4 x 4 sentence pair:

- Inside probability, outside probability
- Length ratio: |(j - i) / (m - l) - 1.15| = |(3 - 1)/(4 - 2) - 1.15| = |-0.15| = 0.15
- Position ratio: |(j + i) / (2 * length(src sent)) - (l + m) / (2 * length(trg sent))| = |4/(2 * 4) - 5/(2 * 4)| = 0.125
- Alignment count ratio: 2 * Count(links in this span pair) / (j - i + m - l) = 2 * 1 / 4 = 0.5
- Alignment invalid count ratio: 2 * Count(links linked to outside) / (j - i + m - l) = 2 * 2 / 4 = 1

[Figure: the example alignment grid with the span pair marked.]
