Multi-document Summarization based on Hierarchical Topic Model

Lei LI
Center for Intelligence Science and Technology, Beijing University of Posts and Telecommunications, Beijing, China
[email protected]
Hongyan LIU
Center for Intelligence Science and Technology, Beijing University of Posts and Telecommunications, Beijing, China
[email protected]
Abstract—In this paper, we introduce an extractive multi-document summarization method based on the hierarchical Latent Dirichlet Allocation (hLDA) topic model and sentence compression. hLDA is a representative generative probabilistic model which can not only mine latent topics from a large amount of discrete data, but also organize these topics into a hierarchy to achieve deeper semantic analysis. We also use sentence compression technology to refine the summaries and make them more concise. We use the TAC 2010 data sets as the experimental test corpus and the ROUGE method to evaluate our summaries. The evaluations confirm that our method performs better than some traditional methods.

Keywords—multi-document summarization; hierarchical Latent Dirichlet Allocation topic model; sentence compression

I. INTRODUCTION

With the increasingly wide use of the Internet and the growth of social information, the amount of information in the public domain continues to accumulate rapidly, and much of it is redundant. Companies, governments and research institutions all face the unprecedented challenge of how to process this information efficiently. Multi-document summarization is an essential technology for overcoming this obstacle. It aims to generate a brief and concise summary that retains the main characteristics of the original set of documents, helping people grasp the content of a document set without reading the entire documents.

In Section 2 we review some of the algorithms that researchers have proposed for multi-document summarization. Section 3 describes the hierarchical topic model hLDA. Section 4 introduces our multi-document summarization algorithm. Section 5 gives the evaluation of our algorithm, and Section 6 discusses future work.

II. RELATED WORKS

In recent years, following the LSI (Latent Semantic Indexing) and pLSI (probabilistic Latent Semantic Indexing) models, complex probabilistic models have become increasingly prevalent. The LDA [1] model is a representative and widespread topic model which tries to reveal deeper underlying topic information. In 2009, Rachit [4] used LDA to capture the events covered by the documents and formed the summary with sentences representing these different events. In 2010, Asli [5] presented a novel approach that formulates MDS as a prediction problem based on a two-step hybrid model: a generative model for hierarchical topic discovery and a regression model for inference. WU Xiaofeng [6] proposed a new supervised method for extraction-based single-document summarization by adding LDA features of the document to a CRF summarization system; they studied the power of LDA and analyzed its effects under different numbers of topics. ZHANG Minghui [7] modeled the documents using LDA and calculated the distance between a sentence and the given multi-document set via their topic probability distributions as the weight of the sentence. YANG Xiao [8] proposed a multi-document summarization method based on the LDA model in which the number of topics was determined by model perplexity; they proposed two sentence scoring methods, one based on sentence distributions and the other on topic distributions. The hybrid hierarchical topic model for multi-document summarization proposed in [5] used ideal summaries and achieved competitive results; however, it has the disadvantage of requiring such ideal summaries. The innovation of our proposed method is a simple approach that, without the guidance of ideal summaries and depending completely on the document corpus, takes full advantage of the hierarchy structure, including the concept of paths and the topic relationships between levels of abstraction, to extract summary sentences, and explores sentence compression (pruning) technology to generate a 100-word summary.

III. HIERARCHICAL LATENT DIRICHLET ALLOCATION (HLDA) TOPIC MODEL

Learning a topic hierarchy from data is an important modeling challenge. We use the hLDA model, an efficient statistical method for constructing such a hierarchy which allows it to grow and change as the data accumulate. In this approach, each node in the hierarchy is associated with a topic, where a topic is a distribution over words. A sentence is generated by choosing a path from the root to a leaf, repeatedly sampling topics along that path, and sampling words from the selected topics. Note that all sentences share the topic distribution associated with the root node.

The structure of the tree is learnt along with the topics using a nested Chinese restaurant process (nCRP), which is used as a
978-1-61284-729-0/11/$26.00 ©2011 IEEE
prior. The nCRP is a stochastic process which assigns probability distributions to infinitely branching and infinitely deep trees. In our model, the nCRP specifies a distribution of words into paths in an L-level tree. The assignments of sentences to paths are sampled sequentially. The first sentence takes the initial L-level path, starting with a single-branch tree. Each subsequent, mth sentence is assigned to a path drawn from the distribution:

p(\mathrm{path}_{old}, c \mid m, m_c) = \frac{m_c}{\gamma + m - 1}, \qquad p(\mathrm{path}_{new}, c \mid m, m_c) = \frac{\gamma}{\gamma + m - 1} \qquad (1)

where path_old and path_new represent an existing and a new path respectively, m_c is the number of previous sentences assigned to path c, m is the total number of sentences seen so far, and γ is a hyper-parameter which controls the probability of creating new paths. Under this prior, each node can branch out into a different number of child nodes in proportion to γ; small values of γ suppress the number of branches.

The following is the generative process for hLDA in our system:
(1) For each topic k ∈ T, sample a distribution β_k ~ Dirichlet(η).
(2) For each sentence s ∈ {S}:
  (a) draw a path c_s ~ nCRP(γ);
  (b) sample an L-vector θ_s of mixing weights from Dirichlet(α);
  (c) for each word n, choose: (i) a level z_{s,n} | θ_s ~ Mult(θ_s) and (ii) a word w_{s,n} | {z_{s,n}, c_s, β} ~ Mult(β_{c_s[z_{s,n}]}).

We use a Gibbs sampling algorithm for sampling from the posterior nested CRP and the corresponding topic distributions in the hLDA model. The Gibbs sampler provides a clean method of simultaneously exploring the parameter space and the model space. The variables needed by the sampling algorithm are: w_{m,n}, the nth word in the mth sentence; c_{m,l}, the restaurant corresponding to the lth topic distribution in sentence m; and z_{m,n}, the assignment of the nth word in the mth sentence to one of the L available topics. All other variables in the model, θ and β, are integrated out. The Gibbs sampler thus assesses the values of z_{m,n} and c_{m,l}.

The conditional distribution for c_m, the L topics associated with sentence m, is:

p(c_m \mid \mathbf{w}, \mathbf{c}_{-m}, \mathbf{z}) \propto p(\mathbf{w}_m \mid \mathbf{c}, \mathbf{w}_{-m}, \mathbf{z}) \, p(c_m \mid \mathbf{c}_{-m}) \qquad (2)

where w_{-m} and c_{-m} denote the w and c variables for all sentences other than m. This expression is an instance of Bayes' rule with p(w_m | c, w_{-m}, z) as the likelihood of the data given a particular choice of c_m and p(c_m | c_{-m}) as the prior on c_m implied by the nested CRP. The likelihood is obtained by integrating over the parameters β, which gives:

p(\mathbf{w}_m \mid \mathbf{c}, \mathbf{w}_{-m}, \mathbf{z}) = \prod_{l=1}^{L} \left( \frac{\Gamma\!\left(n^{(\cdot)}_{c_{m,l},-m} + W\eta\right)}{\prod_{w} \Gamma\!\left(n^{(w)}_{c_{m,l},-m} + \eta\right)} \cdot \frac{\prod_{w} \Gamma\!\left(n^{(w)}_{c_{m,l},-m} + n^{(w)}_{c_{m,l},m} + \eta\right)}{\Gamma\!\left(n^{(\cdot)}_{c_{m,l},-m} + n^{(\cdot)}_{c_{m,l},m} + W\eta\right)} \right) \qquad (3)

where n^{(w)}_{c_{m,l},-m} is the number of instances of word w that have been assigned to the topic indexed by c_{m,l}, not including those in the current sentence, W is the total vocabulary size, and Γ(·) denotes the standard gamma function. When c contains a previously unvisited restaurant, n^{(w)}_{c_{m,l},-m} is zero. Note that c_m must be drawn as a block. For a more detailed description of the hierarchical topic model, please refer to [2][3].

IV. OUR PROPOSED METHOD

A. Overview of our multi-document summarization

Figure 1. Overview of our multi-document summarization

The hierarchical topic model assigns probability distributions to infinitely branching and infinitely deep trees. Our summarization system constructs a hierarchical tree structure of candidate sentences by positioning all sentences on a three-level tree. Each sentence is represented by a path in the tree, and each path can be shared by many sentences. The assumption is that sentences sharing the same path should be more similar to each other because they share the same topics. Moreover, if a path includes a title sentence, then candidate sentences on that path are more likely to be included in the generated summary.

Lastly, we consider the number of sentences assigned to each path, sorting the paths by the number of sentences they contain. A path that contains more sentences is considered to represent a main or hot topic, so we prefer to choose sentences from these paths.

B. Pre-processing of the data set

In our multi-document summarization system, we use the single sentence as the basic processing unit. Firstly, we remove stop words, punctuation and other auxiliary modifiers. At the same time, the documents are filtered by removing sentences that contain quotations or are shorter than 8 words. To generate a summary, we take full advantage of the hierarchical topic model, which includes the path information and the abstraction levels of words. We also add several features for weighting the sentences.
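As an illustrative aside, the path-sampling prior of Eq. (1) can be simulated in a few lines. The sketch below is our own minimal interpretation, not part of the original system: the function names, the three-level depth and the γ value are assumptions made for the example.

```python
import random

def crp_choice(counts, gamma, rng):
    """One Chinese-restaurant draw (Eq. 1): pick an existing branch c with
    probability m_c / (gamma + m - 1), or a new branch with probability
    gamma / (gamma + m - 1), where m - 1 sentences were seen before."""
    m_prev = sum(counts.values())
    r = rng.uniform(0, gamma + m_prev)
    for branch, m_c in counts.items():
        if r < m_c:
            return branch
        r -= m_c
    return None  # None signals "create a new branch"

def sample_path(tree, root, gamma, depth, rng):
    """Sample a root-to-leaf path of length `depth` from the nested CRP.
    `tree[node]` maps each child to the number of sentences routed there."""
    path, node = [root], root
    for _ in range(depth - 1):
        child = crp_choice(tree.setdefault(node, {}), gamma, rng)
        if child is None:                    # branch out: create a new child
            child = (node, len(tree[node]))  # fresh, unique child id
            tree[node][child] = 0
        tree[node][child] += 1               # route this sentence down `child`
        path.append(child)
        node = child
    return path

# Usage: assign 5 sentences to paths in a 3-level tree, as in our system.
tree = {}
rng = random.Random(0)
paths = [sample_path(tree, "root", 1.0, 3, rng) for _ in range(5)]
```

Small values of γ make existing branches proportionally more attractive, so the tree stays narrow, matching the role of γ described above.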
C. Sentence weighting

We evaluate the weight of each sentence according to several features. The first feature is the similarity of title sentences with all candidate sentences: if there is a title sentence in a path, then S_t for all sentences assigned to that path is 1; otherwise it is 0. We give priority to sentences in such paths when selecting sentences for the summary. The second feature is the number of named-entity categories a sentence contains, which we detect with Stanford University's named entity recognition toolkit. A sentence that contains such entity names has a higher priority to be chosen as a candidate summary sentence. S_n is the number of named-entity categories in the sentence: for example, if a sentence has only person names, S_n is 1; if it also has location information, S_n is 2; the maximum value is 3. The third important feature for weighting sentences is the frequency of words in sentences: we consider the sentence coverage of each word in a sentence. This weight is calculated by the following formula:

S_{tf} = \frac{\sum_{i=1}^{|s|} num_s(w_i)}{n \cdot |s|} \qquad (4)

where w_i is the ith word in s, num_s(w_i) is the number of sentences that contain w_i, |s| is the length of the sentence, and n is the total number of sentences.

Then, in order to emphasize the aspects TAC requires, we set a higher weight for sentences containing more keywords from a knowledge base. We build the knowledge base for every required aspect listed in the guided summarization task, extracting keywords from the TAC sample document sets for all five topic categories, which contain tagged information units for the required aspects. The keyword score is:

S_k = \frac{num(W_k)}{|s|} \qquad (5)

where num(W_k) is the number of keywords the sentence contains and |s| is the length of the sentence.

Next, the hierarchical model can mine out abstract and specific topics, which helps identify candidate summary sentences. Since a summary should be more abstractive, we evaluate the abstractiveness of a sentence by the following formula:

S_{abs} = \frac{num(W_{abs})}{|s|} \qquad (6)

where num(W_{abs}) is the number of abstractive words in the sentence (in our experiments we use the words at level zero and level one of the tree) and |s| is the length of the sentence.

At last, we calculate the overall sentence score in a linear additive manner:

S = a \cdot S_t + b \cdot S_n + c \cdot S_{tf} + d \cdot S_k + e \cdot S_{abs} \qquad (8)

where S_t is the title-similarity score, S_n is the named-entity score, S_tf is the word-frequency score, S_k is the keyword score and S_abs is the abstractiveness score. In the following experiments, the parameters {a, b, c, d, e} fixed at {0.3, 0.1, 2, 1, 1} achieved the best result.

D. Similarity Calculation

We remove redundant information with a method based on sentence similarity calculation. If the similarity between a candidate sentence to be added to the summary and a sentence already in the summary exceeds a certain threshold, we give up the candidate and select another sentence for the summary. The similarity of two sentences is calculated as follows:

sim(s_1, s_2) = \frac{num(s_1 \cap s_2)}{\frac{1}{2}(|s_1| + |s_2|)} \qquad (7)

where num(s_1 ∩ s_2) is the number of words that s_1 and s_2 have in common, and |s_1| and |s_2| are the lengths of s_1 and s_2 respectively; the denominator is their average.

E. Summary sentence compression

In the TAC 2010 multi-document summarization evaluation task, the goal is to generate a brief, fluent, coherent and readable summary of 100 words. To compare our method with the TAC 2010 systems, we also use the compression techniques [10] used in TAC 2010, which remove redundant elements in a sentence while retaining the important information, making the summaries coherent, concise and readable.

V. EXPERIMENTS AND DISCUSSIONS

In our experiments, we use the latest TAC 2010 multi-document summarization data sets to test our proposed method empirically. TAC is held by NIST (http://www.nist.gov). The basic objective of TAC 2010 is to let researchers experiment on large-scale public data sets to promote the development of multi-document summarization technology. The data set consists of 46 topics, each containing 10 short articles related to that topic, and it is designed for automatic multi-document summarization evaluation. The TAC 2010 summarization task is to create a summary of no more than 100 words for each topic. We measure our proposed method with the ROUGE [9] toolkit, which has been widely used by DUC (now TAC) for performance evaluation. ROUGE-N is an n-gram recall computed as follows:

\mathrm{ROUGE\text{-}N} = \frac{\sum_{C \in \{Model\_Units\}} \sum_{n\text{-}gram \in C} Count_{match}(n\text{-}gram)}{\sum_{C \in \{Model\_Units\}} \sum_{n\text{-}gram \in C} Count(n\text{-}gram)} \qquad (9)

where n is the length of the n-gram, Count_match(n-gram) is the maximum number of n-grams co-occurring in an automatic machine summary and the human summaries, and Count(n-gram) is the number of n-grams in the human summaries. ROUGE-L uses longest common subsequence (LCS) statistics, ROUGE-W is based on weighted LCS, and ROUGE-SU is based on skip-bigram plus unigram. Each of these evaluation methods in ROUGE can generate three scores (recall, precision and F-measure). As we have similar
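To make the linear scoring scheme concrete, here is a minimal sketch of Eqs. (4)-(8) together with the redundancy check of Eq. (7). The helper names, the set-overlap reading of num(s1∩s2), and the example word lists are our own assumptions for illustration, not the exact original implementation:

```python
def score_sentence(sent, sentences, keywords, abstract_words,
                   s_t, s_n, coef=(0.3, 0.1, 2, 1, 1)):
    """Combine the five features of Eqs. (4)-(6) into the score of Eq. (8).

    sent           -- candidate sentence as a list of words
    sentences      -- all candidate sentences (lists of words); len = n
    keywords       -- keyword knowledge base for the topic (a set)
    abstract_words -- words on levels zero and one of the hLDA tree (a set)
    s_t            -- 1 if the sentence's path contains a title sentence, else 0
    s_n            -- number of named-entity categories present (0..3)
    """
    n, length = len(sentences), len(sent)
    # Eq. (4): sentence coverage of each word, averaged over the sentence.
    s_tf = sum(sum(1 for s in sentences if w in s) for w in sent) / (n * length)
    s_k = sum(1 for w in sent if w in keywords) / length          # Eq. (5)
    s_abs = sum(1 for w in sent if w in abstract_words) / length  # Eq. (6)
    a, b, c, d, e = coef
    return a * s_t + b * s_n + c * s_tf + d * s_k + e * s_abs     # Eq. (8)

def similarity(s1, s2):
    """Eq. (7): shared words normalized by the average sentence length."""
    return len(set(s1) & set(s2)) / (0.5 * (len(s1) + len(s2)))

# Usage: skip a candidate that is too close to an already chosen sentence.
chosen = [["officials", "confirmed", "the", "recall"]]
candidate = ["officials", "confirmed", "a", "recall"]
redundant = any(similarity(candidate, s) > 0.5 for s in chosen)
```

The threshold 0.5 above is a placeholder; the paper only states that a candidate is dropped when the similarity exceeds "a certain threshold".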
conclusions in terms of any of the three scores, for simplicity we report only the average F-measure scores in this paper to compare our proposed method with the other implemented systems. We also take the TAC 2010 baseline 1, baseline 2 and the best result as our baseline systems:
1) Baseline 1 (B1): leading sentences from the most recent document.
2) Baseline 2 (B2): summary generated by the publicly available summarizer MEAD with default settings.
3) Baseline 3 (B3): the best system at TAC 2010 (team id 22).

We use the four standard human summaries TAC provides, rather than a single one, to improve the consistency of the evaluation results. In order to prove the validity of our hLDA model, we also use the strict consensus sentence compression method in [10]. In our hLDA experimental system, we fix the tree depth at 3. Tables I and II give the evaluation results with and without removing stop words in the human and system summaries before computing the various ROUGE statistics.

TABLE I. ROUGE EVALUATION RESULTS WITH REMOVING STOP WORDS

      R-1     R-2     R-3     R-4     R-L     R-W-1.2  R-SU4
B1    0.1950  0.0474  0.0120  0.0025  0.1706  0.0928   0.0549
B2    0.2264  0.0539  0.0145  0.0066  0.2007  0.1081   0.0689
B3    0.3027  0.0846  0.0286  0.0110  0.2513  0.1338   0.0994
H1    0.2535  0.0633  0.0207  0.0073  0.2121  0.1151   0.0791
H2    0.2722  0.0723  0.0237  0.0089  0.2250  0.1214   0.0867
H3    0.2689  0.0710  0.0227  0.0075  0.2240  0.1202   0.0847
H4    0.2910  0.0825  0.0285  0.0112  0.2394  0.1288   0.0948

TABLE II. ROUGE EVALUATION RESULTS WITHOUT REMOVING STOP WORDS

      R-1     R-2     R-3     R-4     R-L     R-W-1.2  R-SU4
B1    0.2953  0.0565  0.0164  0.0057  0.2525  0.1127   0.0902
B2    0.2986  0.0607  0.0206  0.0096  0.2533  0.1153   0.0936
B3    0.3717  0.0957  0.0376  0.0187  0.2901  0.1324   0.1302
H1    0.3381  0.0722  0.0250  0.0110  0.2625  0.1195   0.1115
H2    0.3468  0.0809  0.0300  0.0139  0.2687  0.1227   0.1174
H3    0.3464  0.0763  0.0276  0.0120  0.2682  0.1212   0.1136
H4    0.3616  0.0913  0.0353  0.0167  0.2796  0.1276   0.1221

In the H1 summarization system, we model the source sentences without removing stop words and do not use keywords as auxiliary information. In the H2 system, we also model the source sentences without removing stop words, but we use keywords as auxiliary information. In the H3 system, we model the sentences with stop words removed, but do not use keywords as auxiliary information. In the H4 system, we model the sentences with stop words removed and also use keywords as auxiliary information.

The evaluation results show that the fourth experiment, H4, achieves the best effect among our four systems. Thus the quality of summaries whose source sentences have stop words removed is higher than that of summaries built without such removal. By modeling the cleaned data, we reduce the impact of noisy data, make the model more targeted, and achieve better results. At the same time, the keyword information makes sentences that carry more information more likely to be chosen, which also improves the quality of the summary. One disadvantage is that our sentence compression method sometimes prunes critical information.

VI. CONCLUSION

This paper works on the main techniques for generating a multi-document summary and describes the details of each step. In a word, our proposed multi-document summarization method based on the hierarchical Latent Dirichlet Allocation topic model can improve the quality of the summary to a certain extent and achieves a competitive result.

ACKNOWLEDGEMENTS

This paper is supported by NSFC Project 60873001 and the 111 Project of China under Grant No. B08004.

REFERENCES

[1] D. M. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[2] D. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum, "Hierarchical topic models and the nested Chinese restaurant process," in Neural Information Processing Systems (NIPS), 2003.
[3] D. Blei, T. Griffiths, and M. Jordan, "The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies," Journal of the ACM, 2009.
[4] R. Arora and R. Balaraman, "Latent Dirichlet allocation based multi-document summarization," in Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data (AND), 2008.
[5] A. Celikyilmaz and D. Hakkani-Tur, "A hybrid hierarchical model for multi-document summarization," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 815-824, Uppsala, Sweden, July 2010.
[6] WU Xiaofeng and ZONG Chengqing, "An approach to automatic summarization by integrating latent Dirichlet allocation in conditional random fields," Journal of Chinese Information Processing, vol. 23, no. 6, Nov. 2009.
[7] ZHONG Minghui, WANG Hongling, and ZHOU Guodong, "Chinese multi-document summarization based on an LDA topic-oriented method," in The Fifth National Youth Conference on Computational Linguistics, Wuhan, China, 2010.
[8] YANG Xiao, MA Jun, and YANG Tongfeng, "Automatic multi-document summarization based on the latent Dirichlet topic allocation model," CAAI Transactions on Intelligent Systems, vol. 5, no. 2, Apr. 2010.
[9] C.-Y. Lin and E. Hovy, "Automatic evaluation of summaries using n-gram co-occurrence statistics," in Proceedings of HLT-NAACL 2003.
[10] H. Liu, Q. Zhao, Y. Xiong, L. Li, and C. Yuan, "The CIST summarization system at TAC 2010," http://www.nist.gov/tac/publications/2010/participant.papers/CIST.proceedings.pdf