Trimming CFG Parse Trees for Sentence Compression Using Machine Learning Approach Yuya Unno, Takashi Ninomiya, Yusuke Miyao and Jun’ichi Tsujii University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo, Japan
Introduction
Sentence compression is one of the summarization tasks: given an input sentence, we create a new, short, grammatical sentence that preserves its meaning. We can only drop words from the input.
Example: "Yesterday I went to Tokyo by train" → "I went to Tokyo"
Input: a sentence l
Output: argmax_s P(s | l)
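As a small illustration of this drop-only setting (our own sketch, not part of the poster; the helper name is ours), a candidate compression is valid only if it is a subsequence of the input sentence:

def is_valid_compression(original, compressed):
    """True iff `compressed` can be obtained from `original` by deleting words only."""
    remaining = iter(original)
    return all(word in remaining for word in compressed)

original = "Yesterday I went to Tokyo by train".split()
print(is_valid_compression(original, "I went to Tokyo".split()))    # True
print(is_valid_compression(original, "went I to Tokyo".split()))    # False: reordering is not allowed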
Algorithm

Knight and Marcu's Noisy-channel Model
Knight and Marcu's noisy-channel model [1] works as follows:
1. Parse the sentences in the training corpus.
2. Compare the corresponding nodes of the compressed and original parse trees, starting from the root nodes.
3. Estimate the rewriting probabilities from the counts of the applied CFG rules.

P(s | l) ∝ P(l | s) P(s)
P(l | s) = ∏_{(r_l, r_s) ∈ R} P_exp(r_l | r_s) · ∏_{r ∈ R'} P_cfg(r)
where P_exp(r_l | r_s) is the probability of rewriting rule r_s into rule r_l.
(Figure: original and compressed parse trees for the example above, with corresponding nodes compared from the root.)

We revised this model in two points:
1. Maximum-entropy model
2. Bottom-up method
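The following sketch shows how P(l | s) is assembled from these two kinds of probabilities (our own illustration, not the authors' code; the probability tables and rule strings are hypothetical):

import math

# Hypothetical tables estimated from the training treebank:
# p_exp[(r_s, r_l)] : probability of expanding the short-tree rule r_s into the long-tree rule r_l
# p_cfg[r]          : plain CFG probability of a rule that appears only in the long tree
p_exp = {("S -> NP VP", "S -> ADVP NP VP"): 0.2,
         ("S -> NP VP", "S -> NP VP"): 0.8}
p_cfg = {"S -> ADVP NP VP": 0.1, "ADVP -> RB": 0.5}

def log_p_l_given_s(aligned_pairs, extra_rules):
    """log P(l | s): sum of log P_exp over aligned rule pairs (r_l, r_s) in R,
    plus sum of log P_cfg over the remaining long-tree rules in R'."""
    score = sum(math.log(p_exp[(r_s, r_l)]) for (r_l, r_s) in aligned_pairs)
    score += sum(math.log(p_cfg[r]) for r in extra_rules)
    return score

# One aligned rule pair and one rule that appears only in the long tree:
print(log_p_l_given_s([("S -> ADVP NP VP", "S -> NP VP")], ["ADVP -> RB"]))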
Method 1. Maximum-Entropy Model
In Knight and Marcu's model, the rewriting probabilities depend only on the mother and daughter nonterminals of each rule. In our model, the probabilities depend on various features of the parse tree. With a maximum-entropy model we can easily introduce such features, for example the depth from the root node and which words are removed:

P(s | l) = (1/Z) ∏_{(r_s, r_l) ∈ R} exp( ∑_i λ_i f_i(r_s, r_l) )

where Z is a normalization constant.

Features:
- Mother node
- Daughter nodes sequence
- Daughter terminals that are removed (e.g., 'Yesterday' is removed)
- Depth from the root (e.g., 'S' is the root)
- Left-most and right-most daughters (e.g., left-most 'ADVP')
- etc.
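A minimal sketch of the kind of feature function f_i the model can use (our own illustration; the exact feature templates and names are assumptions based on the list above):

def rewrite_features(mother, daughters, kept_daughters, removed_words, depth):
    """Features of one rewriting event (r_s, r_l): mother node, daughter sequence,
    removed daughter terminals, depth from the root, left-most / right-most daughters."""
    feats = {
        "mother=" + mother: 1,
        "daughters=" + "_".join(daughters): 1,
        "kept=" + "_".join(kept_daughters): 1,
        "depth=" + str(depth): 1,
        "leftmost=" + daughters[0]: 1,
        "rightmost=" + daughters[-1]: 1,
    }
    for w in removed_words:
        feats["removed_word=" + w] = 1
    return feats

# The running example: at the root "S", the left-most daughter "ADVP" ("Yesterday")
# is removed and the daughters NP and VP are kept.
print(rewrite_features("S", ["ADVP", "NP", "VP"], ["NP", "VP"], ["Yesterday"], depth=0))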
Method 2. Bottom-Up Method
We cannot learn some compression patterns with the previous method, because the original and compressed parse trees sometimes have different structures. For example, when the daughters {DT, N} of a node in the compressed tree are not a subsequence of the daughters {NP, PP} of the corresponding node in the original tree, the daughter nodes are not corresponding and the pattern cannot be learned.
In the bottom-up method, we only parse the original sentence and extract a tree from the original parse tree: we select the nodes which dominate the compressed sentence. In the extracted tree, the daughter nodes are corresponding to those of the compressed tree.
(Figure: original tree, compressed tree, and extracted tree for "The apple on the table is red" → "The apple is red".)
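A sketch of the node selection in the bottom-up method (our own illustration with a minimal tree type; the poster only states that the nodes dominating the compressed sentence are selected):

class Node:
    def __init__(self, label, children=None, word=None):
        self.label, self.children, self.word = label, children or [], word

def extract(node, kept_words):
    """Return the part of `node` that dominates the kept words, or None."""
    if node.word is not None:                                   # leaf
        return node if node.word in kept_words else None
    kept = [c for c in (extract(c, kept_words) for c in node.children) if c]
    return Node(node.label, kept) if kept else None

# "The apple on the table is red" -> "The apple is red"
tree = Node("S", [
    Node("NP", [Node("NP", [Node("DT", word="The"), Node("N", word="apple")]),
                Node("PP", [Node("IN", word="on"),
                            Node("NP", [Node("DT", word="the"), Node("N", word="table")])])]),
    Node("VP", [Node("V", word="is"), Node("ADJ", word="red")]),
])

def words(t):
    return [t.word] if t.word is not None else [w for c in t.children for w in words(c)]

print(words(extract(tree, {"The", "apple", "is", "red"})))      # ['The', 'apple', 'is', 'red']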
Experimental Results
We used the same corpus as Knight and Marcu:
Avg. length of original sentences: 23.8 words
Avg. length of compressed sentences: 12.5 words
Training set: 527 sentences
Development set: 263 sentences
Test set: 264 sentences

We evaluated the results using word and bigram F-measure, BLEU score [2], and human judgment. Our method exceeds the previous method in all evaluation criteria; the highest scores were obtained with the maximum-entropy model combined with the bottom-up method.

Results of N-gram based evaluation:
                      F-measure   Bigram F-measure   BLEU score
  Noisy-channel       63.3        50.2               47.1
  Maximum Entropy     75.3        64.1               62.1
  ME + Bottom-up      80.9        72.0               69.5

Results of human evaluation (Grammar: whether the output is grammatically correct; Importance: whether the important words remain):
                      Grammar     Importance
  Human               4.94        4.31
  Noisy-channel       3.81        3.38
  Maximum Entropy     3.88        3.38
  ME + Bottom-up      4.22        4.06
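For reference, a small sketch of how the word and bigram F-measures above can be computed against a single reference compression (our own illustration; BLEU [2] additionally combines several n-gram precisions with a brevity penalty):

from collections import Counter

def ngram_f1(candidate, reference, n=1):
    """F-measure between the n-gram multisets of a candidate and a reference compression."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("I went to Tokyo", "Yesterday I went to Tokyo"))        # word F-measure
print(ngram_f1("I went to Tokyo", "Yesterday I went to Tokyo", n=2))   # bigram F-measure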
[1] K. Knight and D. Marcu. 2000. Statistics-Based Summarization - Step One: Sentence Compression. In Proc. of AAAI/IAAI 2000.
[2] K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proc. of ACL 2002.