Exploiting the Role of Named Entities in Query-Oriented Document Summarization

Wenjie Li¹, Furu Wei¹,², Ouyang You¹, Qin Lu¹, and Yanxiang He²

¹ Department of Computing, The Hong Kong Polytechnic University, Hong Kong
{cswjli,csyouyang,csluqin}@comp.polyu.edu.hk
² Department of Computer Science and Technology, Wuhan University, China
{frwei,yxhe}@whu.edu.cn
Abstract. In this paper, we exploit the role of named entities in measuring document/query sentence relevance in query-oriented extractive summarization. Named entity driven associations are defined as informative, semantic-sensitive text bi-grams consisting of at least one named entity or the semantic class of a named entity. They are extracted automatically according to seven pre-defined templates. Question types are also taken into consideration, when available, in dealing with query questions. To alleviate problems with low coverage, the named entity based association and uni-gram models are integrated to complement each other in similarity calculation. Automatic ROUGE evaluations indicate that the proposed idea can produce a very good system that is among the best-performing systems at DUC 2005.

Keywords: Query-Oriented Summarization, Named Entity based Association.
1 Introduction

In recent years, the focus has noticeably shifted from generic summarization to query-oriented summarization, which aims to produce a summary from a set of relevant documents with respect to a given query, i.e. a short description of the user's information need containing one or more narrative and/or question sentences. The machine-generated summaries should concisely describe the information contained in the documents and should also help the user understand the documents according to his/her interests. The advantages of query-oriented summarization in information retrieval have been widely acknowledged: brief summaries allow people to judge the relevance of the returned results without having to look through the whole documents. Currently, most query-oriented summarization approaches extract from the documents the salient sentences that are supposed to be relevant to the given query. The fundamental issue with these approaches is how to measure the relevance of document sentences to the query sentences. In earlier studies, sentences are represented as bags of words. There are at least two drawbacks to this representation. First, a single word (i.e. a word uni-gram) is not informative enough to represent
underlying information in the sentences. For example, the meaning of the residence of the US president would be lost when "White House" was represented by "White" and "House" separately. Named entities should therefore, like ordinary words, be treated as meaningful text units when measuring relevance. Second, ordering information, in particular the underlying semantics and the sentence structure, cannot be captured by uni-gram models. N-gram models, such as bi-gram models, provide a means to take shallow structural information into account by combining two text units. Meanwhile, however, any N-gram model will more or less suffer from the bottleneck of low coverage. That is why uni-gram and bi-gram models are normally combined in use, or constraints on bi-gram models are relaxed. In this study, we highlight the role of named entities (NEs) in a variety of NE-driven models. Named entities are regarded as text uni-grams, and NE centered associations are defined as informative and semantic-sensitive text bi-grams involving at least one named entity in representing sentences. Associations combine named entities, their semantic classes, as well as other representative words (adjacent to the named entities in certain models). Question types, which indicate what kind of information a question is looking for, are also incorporated into associations, when applicable, to deal with the questions in a query. Because of this, NE-driven models can help effectively locate the sentences that contain the information most relevant to the questions, and consequently better summaries can be expected. Automatic ROUGE evaluations show that the summaries produced by the combinatorial models of NE/word uni-grams and NE-driven bi-grams are comparable to the summaries produced by the best systems competing at DUC 2005.
2 Related Work

Query-oriented summarization has been boosted by the DUC evaluations since 2005. Many previous approaches rank the sentences according to their relevance to the query and then select the most relevant ones into the summary. Regardless of the approach taken, query-oriented summarization involves three basic aspects: text content representation, query formulation, and relevance judgment. Among them, how to estimate the relevance between the query and the sentences is the most fundamental issue, and it has been extensively studied in the past. The simplest yet effective way is to calculate the cosine similarity of the two sentences represented as word vectors [7, 13, 14]. Some related work also utilizes WordNet as an external resource to alleviate the word mismatch problem by calculating the semantic similarity between words. An extension to vector space models is dimension reduction performed with latent semantic analysis [5]. In addition to various kinds of word occurrence, frequency and semantic matching techniques, similarity can also be measured by matching other text contents, such as named entities [8, 14], basic elements [6], and grammatical relations [3]. Normally, the relevance is judged based on a set of features, which are linearly combined to decide how likely a sentence is to be included in the summary. An alternative is to construct a single but complicated feature, such as a dependency tree [12] or a document graph [4, 11]. This is, however, limited by the complexity of feature construction and relevance judgment.
Question answering (QA) is closely related to query-oriented summarization in terms of the need for question interpretation. Although question type identification [2, 8], question reformulation [3, 12] and question expansion [1] have been applied in the context of query-oriented summarization, special handling of query questions has not been well addressed in much of the related work.
3 Measuring Relevance with Named Entity Driven Association Models

3.1 NE Driven Bi-gram Association (NeBiA) Model

In the NeBiA model, content associations are defined as the bi-grams involving at least one named entity or the semantic class of a named entity. They are combinations of the named entities and the content representative words (i.e. non-stop-words) immediately adjacent to the named entities. All the associations fall into four categories and appear in one of the following forms:

Table 1. Templates for the Extraction of NE Driven Bi-grams
Category    Form
NE-NE       (NE1, NE2)
NE-WORD     (NE, word), (word, NE)
NE-TAG      (NE1, NE2_tag), (NE1_tag, NE2)
TAG-WORD    (NE_tag, word), (word, NE_tag)
Table 1 provides seven templates to guide the automatic extraction of NE centered bi-grams from both document sentences and query sentences, so that the similarity can be calculated according to the bi-grams they match and the matching extent. Notice that NE-NE represents two successive named entities in a sentence; they are not necessarily adjoining and might be separated by a couple of words in between. In fact, the NeBiA bi-grams defined in Table 1 are a selected subset of the text bi-grams in which the role of named entities is highlighted. It is common for the same entity to be expressed in different ways when it is mentioned in text. For example, "US", "U.S.", "the US" and "the United States" all refer to the United States. Consequently, named entities often fail to find their matches simply because of this. Coreference resolution could provide a solution to this problem, but it is itself an open problem in natural language processing. Our solution is to relax the matching restriction by allowing both named entities and their semantic classes to be included in the bi-gram associations. The semantic classes considered in this paper include <PERSON>, <ORGANIZATION>, <LOCATION>, <DATE> and <NUMBER>, which are called NE tags. Another advantage of using the NE tags is the ability to integrate QA techniques into query-oriented summarization. This will be detailed in Section 3.3.
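To make the templates concrete, the following is a minimal Python sketch of the extraction step. It assumes sentences arrive as token lists with named entity spans and stop-words already marked by an external tagger; the `Token` type and the function name `extract_associations` are ours for illustration, not the authors' implementation.

```python
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    text: str                      # surface form, e.g. "World Bank"
    ne_tag: Optional[str] = None   # e.g. "ORGANIZATION"; None for an ordinary word
    is_stopword: bool = False


def extract_associations(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Extract NE-driven bi-gram associations following the Table 1 templates.

    Each named entity is paired with the adjacent content words (NE-WORD,
    TAG-WORD) and with the next named entity in the sentence (NE-NE, NE-TAG),
    even when ordinary words lie in between.  Hypothetical sketch only.
    """
    content = [t for t in sentence if not t.is_stopword]
    assoc: List[Tuple[str, str]] = []
    for i, tok in enumerate(content):
        if tok.ne_tag is None:
            continue
        # NE-WORD / TAG-WORD with the adjacent content words, order preserved
        if i > 0 and content[i - 1].ne_tag is None:
            assoc.append((content[i - 1].text, tok.text))      # (word, NE)
            assoc.append((content[i - 1].text, tok.ne_tag))    # (word, NE_tag)
        if i + 1 < len(content) and content[i + 1].ne_tag is None:
            assoc.append((tok.text, content[i + 1].text))      # (NE, word)
            assoc.append((tok.ne_tag, content[i + 1].text))    # (NE_tag, word)
        # NE-NE / NE-TAG with the next named entity in the sentence
        for nxt in content[i + 1:]:
            if nxt.ne_tag is not None:
                assoc.append((tok.text, nxt.text))             # (NE1, NE2)
                assoc.append((tok.text, nxt.ne_tag))           # (NE1, NE2_tag)
                assoc.append((tok.ne_tag, nxt.text))           # (NE1_tag, NE2)
                break
    return assoc
```

Pairing each named entity only with its immediately adjacent content words mirrors the rigid NeBiA setting; the soft NeBiA-II variant described below would simply widen this neighbourhood to a window of content words.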
The NeBiA model can be extended to the NeBiA-II model by including all the words within a given window instead of only those immediately adjacent, i.e. by relaxing the rigid NE(TAG)-WORD bi-gram combinations into soft ones.

3.2 NE Driven Event Bi-gram Association (NEvBiA) Model
Named entities play an important role in characterizing events, which can be defined as "[Who] did [What] to [Whom] [When] and [Where]". The design of the NEvBiA model is based on the assumption that if the words in the NeBiA model were restricted to those related to events, the bi-grams might be able to reflect the underlying intra-event associations. In this paper, we choose verbs (such as "elect") and action nouns (such as "supervision") as event words that can characterize, or partially characterize, the actions or incident occurrences in the world. They roughly relate to the "did [What]" mentioned above. Meanwhile, the named entities tagged as <PERSON>, <ORGANIZATION>, <LOCATION> and <DATE> convey the information of [Who], [Whom], [Where] and [When], while <NUMBER> complements other event descriptions, such as the extent. Clearly, the NEvBiA bi-grams are a selected subset of the NeBiA bi-grams in which the words are limited to the event words. Similarly, the NEvBiA model can be extended to the NEvBiA-II model, corresponding to the NeBiA and NeBiA-II models.

3.3 Handling Query Questions
We strongly support the idea of incorporating QA techniques into query-oriented summarization. Thus, the models introduced in Sections 3.1 and 3.2 are also designed to facilitate the formulation of both narrative and question sentences in the query. For a query question, its question type is identified and handled in the same way as the tags of the named entities present in the sentence. The question type indicates what kind of information the question is looking for. It can help locate the sentences containing the information related to a particular question and select the appropriate sentences for the summary. For example, if a sentence contains a named entity tagged as <PERSON> or <ORGANIZATION>, it should be more likely to provide the answer to the question "Who has criticized the World Bank?". Figure 1 illustrates the four NE driven bi-grams, in three categories, extracted from this question. Notice that the ordering information is preserved in them, i.e. (NE, word) ≠ (word, NE). This avoids mistakenly including sentences containing the phrase "World Bank criticized ..." in the summary responding to the previous question; such sentences are obviously not expected. Question types are determined by a set of heuristic rules. For questions beginning with interrogatives like "who", "where", and "when", a straightforward mapping between these interrogatives and the classes of the named entity being questioned is established: "who" ⇔ <PERSON>, "where" ⇔ <LOCATION>, and "when" ⇔ <DATE>. If the question begins with "which", "what" or the word "name", the classes are deduced from the semantics of the nouns in the patterns "which + noun", "what + noun", "what be + noun", and "name + noun". WordNet supplies the semantic information needed. See (Li, 2005) for more details.
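The question-type heuristics described above can be sketched roughly as follows. This is a hedged approximation: NLTK's WordNet interface is used as a stand-in for the paper's WordNet/JWNL set-up, and the function names and the noun-to-class table are hypothetical.

```python
import re
from typing import Optional

# Straightforward interrogative -> NE-class mapping described in the text
INTERROGATIVE_MAP = {"who": "PERSON", "where": "LOCATION", "when": "DATE"}


def question_type(question: str) -> Optional[str]:
    """Guess the NE class a question asks for (hypothetical sketch)."""
    words = question.lower().split()
    if not words:
        return None
    if words[0] in INTERROGATIVE_MAP:
        return INTERROGATIVE_MAP[words[0]]
    # "which/what/name (+ be) + noun": fall back to the head noun's semantics.
    m = re.match(r"(which|what|name)\s+(?:(?:is|are|was|were)\s+)?(\w+)",
                 question.lower())
    if m:
        return _class_from_noun(m.group(2))
    return None


def _class_from_noun(noun: str) -> Optional[str]:
    """Map a noun to an NE class via WordNet hypernyms (NLTK as a stand-in)."""
    from nltk.corpus import wordnet as wn
    targets = {"person": "PERSON", "organization": "ORGANIZATION",
               "location": "LOCATION", "time_period": "DATE"}
    for syn in wn.synsets(noun, pos=wn.NOUN):
        for path in syn.hypernym_paths():
            for node in path:
                name = node.name().split(".")[0]
                if name in targets:
                    return targets[name]
    return None


# e.g. question_type("Who has criticized the World Bank?") -> "PERSON"
```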
[Figure 1: the question "Who has criticized the World Bank?" is decomposed into the question-type tag <PERSON>, the event word "criticize" and the named entity "World Bank" (<ORGANIZATION>), yielding the bi-grams (<PERSON>, World Bank), (<PERSON>, criticize), (criticize, <ORGANIZATION>) and (criticize, World Bank), of categories NE-TAG, TAG-WORD, TAG-WORD and NE-WORD respectively.]

Fig. 1. Example of the 3 Categories of Bi-gram Associations
3.4 Matching-Based Relevance Measure
Sentence/query relevance is measured based on the words and the associations they match. In this study, we attempt three matching strategies: (1) exact matching (EM); (2) semantic matching (SM); and (3) degreed matching (DM). EM and SM are binary decisions. While EM returns 0 or 1 depending on whether a match succeeds or fails, SM considers the hyponyms of the words and returns 1 when the two words (or the two words in the two associations under comparison) belong to the same synset in WordNet. This is motivated by the observation that some words with the same or quite similar meanings appear in different surface forms; such words are commonly synonyms or hyponyms, such as "diminish" and "reduction". The third strategy, DM, backs off EM with SM: it performs EM first, and only when EM fails does it fall back to SM, returning a value smaller than 1 (e.g. 0.7) if SM succeeds. The relevance is then measured by calculating the similarity of the sentences and the query according to the frequencies of the matches. The matching strategies are applied not only to bi-gram association matching but also to uni-gram matching. Once the matching of the extracted bi-gram associations is done, they naturally form a collection of n association groups, denoted by A. An association group contains either a set of matched associations or a single association if no match is found. The similarity of a sentence $s^D$ in a document set $D$ and a query $T = \{s^T_1, s^T_2, \ldots, s^T_m\}$ is then calculated as the cosine similarity based on the frequencies of $a_i$:
$$
\mathrm{Sim}_{bi}(s^D, T) = \frac{\sum_{i=1}^{n} tf(a_i, s^D) \cdot tf(a_i, T)}{\sqrt{\sum_{i=1}^{n} \bigl(tf(a_i, s^D)\bigr)^2} \cdot \sqrt{\sum_{i=1}^{n} \bigl(tf(a_i, T)\bigr)^2}}
$$

where $a_i \in A$, $tf(a_i, *)$ is the frequency of $a_i$ in $s^D$ or $T$, and

$$
tf(a_i, T) = \sum_{j=1}^{m} tf(a_i, s^T_j).
$$
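A minimal sketch of this association-based cosine similarity follows, assuming the associations extracted for the document sentence and for each query sentence are plain hashable tuples; the helper names are ours, not the authors'.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def cosine_over_counts(x: Counter, y: Counter) -> float:
    """Cosine similarity of two sparse frequency vectors."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0


def sim_bi(sentence_assocs: Iterable[Hashable],
           query_assocs_per_sentence: Sequence[Iterable[Hashable]]) -> float:
    """Sim_bi(s^D, T): tf(a_i, T) is summed over the query sentences s^T_1..s^T_m."""
    tf_sentence = Counter(sentence_assocs)
    tf_query: Counter = Counter()
    for assocs in query_assocs_per_sentence:
        tf_query.update(assocs)
    return cosine_over_counts(tf_sentence, tf_query)
```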
Associations provide an important means for relational content matching, but they often suffer from low coverage. If the similarity were calculated solely based on association matching, an actually relevant sentence might be mistakenly judged as non-relevant. To remedy this shortcoming, the overall similarity is calculated by linearly combining the association model and the uni-gram model:
$$
\mathrm{Sim}(s^D, T) = \lambda_1 \, \mathrm{Sim}_{uni}(s^D, T) + \lambda_2 \, \mathrm{Sim}_{bi}(s^D, T)
$$

where

$$
\mathrm{Sim}_{uni}(s^D, T) = \frac{\sum_{i=1}^{n} tf(u_i, s^D) \cdot tf(u_i, T)}{\sqrt{\sum_{i=1}^{n} \bigl(tf(u_i, s^D)\bigr)^2} \cdot \sqrt{\sum_{i=1}^{n} \bigl(tf(u_i, T)\bigr)^2}},
$$

$tf(u_i, *)$ denotes the frequency of $u_i$ in $s^D$ or $T$, and $\lambda_1$ and $\lambda_2$ are the weights for the uni-gram based and association based similarities respectively. Similarly,

$$
tf(u_i, T) = \sum_{j=1}^{m} tf(u_i, s^T_j).
$$
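The combination itself is a one-liner; the sketch below assumes the two component scores have already been computed with the cosine formulas above, and the 2:1 default weighting mirrors the setting reported later in Section 4.3.

```python
def combined_similarity(sim_uni: float, sim_bi: float,
                        lambda_uni: float = 2.0, lambda_bi: float = 1.0) -> float:
    """Sim(s^D, T) = lambda_1 * Sim_uni + lambda_2 * Sim_bi.

    sim_uni and sim_bi are the cosine scores over uni-gram and association
    frequencies respectively; the default 2:1 weighting follows Section 4.3.
    """
    return lambda_uni * sim_uni + lambda_bi * sim_bi
```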
4 Experimental Studies

4.1 Experiment Set-Up
The experiments are conducted on the 50 DUC 2005 document sets. Each set of documents is given a query which simulates the user's information need. All documents and queries are pre-processed by TextPrepEngine, a text pre-processing engine developed upon GATE¹ and the Porter Stemmer². Sentences are then represented by groups of words which are stemmed, part-of-speech (POS) tagged, and stop-word filtered³. Moreover, named entities are tagged for each sentence. According to the task definitions, system-generated summaries are strictly limited to 250 English words in length. Based on the calculated similarities, we pick the highest-scored sentences from the original documents into the summary until the word limit is reached; duplicate sentences are prohibited. Automatic evaluation methods and criteria are still a research topic in the summarization community, and much of the literature has addressed methods for automatic evaluation other than human judgment. Among them, the ROUGE toolkit⁴ [10], though debated by quite a few researchers, is supposed to produce the most reliable and stable scores compared with human evaluation. Moreover, DUC 2005 officially adopted ROUGE as the automatic evaluation method, so we also adopt it in this work. Specifically, the machine-generated summaries are evaluated in terms of the average recalls of ROUGE-1, ROUGE-2, and ROUGE-SU4.
¹ http://www.gate.ac.uk
² http://www.tartarus.org/~martin/PorterStemmer
³ A list of 199 words is used to filter stop words.
⁴ ROUGE 1.5.5 is used, and the ROUGE parameters are "-n 2 -x -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0", according to the DUC task definition.
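As a rough illustration of the selection step described above, the following sketch greedily fills the summary with the highest-scored sentences while skipping duplicates. How the original system handles a candidate that would overshoot the 250-word limit is not specified, so the skip-and-continue policy here is an assumption.

```python
from typing import Callable, List


def build_summary(sentences: List[str],
                  score: Callable[[str], float],
                  word_limit: int = 250) -> List[str]:
    """Greedily pick the highest-scoring sentences until the word limit,
    skipping exact duplicates (a simplified sketch of the selection step)."""
    summary: List[str] = []
    seen = set()
    used = 0
    for sent in sorted(sentences, key=score, reverse=True):
        if sent in seen:
            continue
        n_words = len(sent.split())
        if used + n_words > word_limit:
            continue  # assumption: try shorter candidates further down the ranking
        summary.append(sent)
        seen.add(sent)
        used += n_words
        if used >= word_limit:
            break
    return summary
```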
4.2 Evaluation of Uni-gram Models
The word-based uni-gram model is implemented as the baseline model. When the named entities in the text are recognized and manipulated as integrated text units, we call it the NE-based model. Table 2 below compares the NE-based uni-gram model with the word-based uni-gram model. The advantage is visible but not marked. In later experiments, we further evaluate combinations of the uni-gram model and various bi-gram models.

Table 2. Evaluations of Uni-gram Models
              ROUGE-1   ROUGE-2   ROUGE-SU4
Word-based    0.35952   0.06932   0.12602
NE-based      0.36400   0.06988   0.12743
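For clarity, the NE-based uni-gram model can be pictured as collapsing each recognized entity span into a single token before counting, as in the hypothetical sketch below; span detection itself is assumed to come from the NE tagger, and the function name is ours.

```python
from typing import List, Tuple


def ne_unigrams(tokens: List[str],
                ne_spans: List[Tuple[int, int]]) -> List[str]:
    """Collapse each recognized NE span (start, end; end exclusive) into one
    uni-gram unit, e.g. ["White", "House"] -> ["White House"]."""
    units: List[str] = []
    spans = sorted(ne_spans)
    i = 0
    while i < len(tokens):
        span = next((s for s in spans if s[0] == i), None)
        if span:
            units.append(" ".join(tokens[span[0]:span[1]]))
            i = span[1]
        else:
            units.append(tokens[i])
            i += 1
    return units


# ne_unigrams(["The", "White", "House", "said"], [(1, 3)])
# -> ["The", "White House", "said"]
```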
4.3 Evaluation of Uni-gram and Bi-gram Combinatorial Models
The aims of the following experiments are to examine the performance of various combinatorial models that integrate the NE-based uni-gram model with the bi-gram models and, more importantly, to discover the most informative and representative text units. The rigid NeBiA and NEvBiA models have been described in Section 3. A bi-gram in the NeBi⁵ model is constrained to two adjoining words (non-stop-words) according to their appearance in the sentence; this is the normal use of the bi-gram model. In contrast, in the soft approaches, any two words within a given window size are combined into a bi-gram in NeBi-II, while in the NeBiA-II and NEvBiA-II models the NE-NE and NE-TAG bi-gram associations are still constrained to two successive named entities (or tags), and a named entity (or tag) together with a word within the given window size⁶ is combined into an NE-WORD or TAG-WORD association. The numbers behind the soft models in the following table denote the best window sizes used in our experiments (tuned experimentally). In this set of experiments, the SM strategy is adopted and λ1 : λ2 = 2 : 1 is set experimentally.

Table 3. Results of Combinatorial Models
Rigid             ROUGE-1   ROUGE-2   ROUGE-SU4
+ NeBi            0.36169   0.07010   0.12620
+ NeBiA           0.36563   0.07345   0.12974
+ NEvBiA          0.36588   0.07336   0.12986

Soft              ROUGE-1   ROUGE-2   ROUGE-SU4
+ NeBi-II (7)     0.36201   0.07068   0.12648
+ NeBiA-II (6)    0.36663   0.07354   0.12987
+ NEvBiA-II (8)   0.36670   0.07357   0.13014
⁵ NeBi here denotes the original NE-based bi-gram model.
⁶ Notice that the word within the given window size of a named entity (or tag) cannot cross another named entity (or tag).
Table 3 above presents the ROUGE results of the combinatorial models. We can see that the NeBi model improves the performance of the original word-based uni-gram model slightly. When the representative text units are gradually narrowed down in the NeBiA and NEvBiA models, the improvement becomes visible. Furthermore, better performance can be achieved when the soft models are taken into account; the best performance is obtained by the NEvBiA-II model. These results strongly support the idea of using NE-driven bi-gram associations in query-oriented summarization.

4.4 Coverage Problems with Single-Handed Bi-gram Models
As mentioned previously, the single-handed bi-gram based approaches, i.e. the NeBi, NeBiA and NEvBiA models, suffer from the coverage problem. For some document sets in our experiments, we can hardly find enough sentences considered relevant to the query to produce a summary close to the given 250-word limit by solely using the bi-gram based measures. The proportion "x/50" in each row of Table 4 denotes that "x" out of the total 50 document sets are capable of producing a 250-word summary.

Table 4. Results of Single-Handed Bi-gram Models
                ROUGE-1   ROUGE-2   ROUGE-SU4
NeBi (34/50)    0.35272   0.06430   0.12061
NeBiA (7/50)    0.37521   0.07947   0.13268
NEvBiA (7/50)   0.37711   0.07973   0.13163
Obviously, the results in Table 4 indicate that the NeBiA and NEvBiA models can achieve quite encouraging performance, but they are both limited by low coverage. That is why bi-gram and uni-gram models are normally combined. It can also be observed that the performance of the bi-gram model NeBi is even worse than that of its corresponding uni-gram model; data sparseness is the likely reason. This motivates us to restrict the bi-gram combinations in the proposed models.

4.5 Evaluations on Impacts of Matching Techniques
The following set of experiments aims to examine the three matching strategies, i.e. exact matching (EM), semantic matching (SM) and degreed matching (DM). WordNet 2.0⁷ and JWNL⁸ are used to determine whether two words are semantically matched according to whether they are in the same synset. In our implementation, DM returns 0.7 when the matching of two associations fails in EM but succeeds in SM. Table 5 shows the comparative results of the best-performing model in our former experiments, i.e. NEvBiA-II with a window size of 8.
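A small sketch of the degreed matching back-off follows, using NLTK's WordNet interface as a stand-in for the WordNet 2.0/JWNL set-up above; the 0.7 back-off value follows the text, while the element-wise treatment of associations and the function names are our assumptions.

```python
from nltk.corpus import wordnet as wn


def degreed_match(w1: str, w2: str, sm_score: float = 0.7) -> float:
    """Degreed matching: exact match scores 1.0; otherwise fall back to
    semantic matching, which scores sm_score when the two words share a
    WordNet synset (NLTK used here as a stand-in for WordNet 2.0 / JWNL)."""
    if w1 == w2:
        return 1.0                                   # EM succeeds
    if set(wn.synsets(w1)) & set(wn.synsets(w2)):
        return sm_score                              # SM succeeds
    return 0.0


def match_associations(a1: tuple, a2: tuple, sm_score: float = 0.7) -> float:
    """Match two bi-gram associations element-wise, keeping the order."""
    if len(a1) != len(a2):
        return 0.0
    scores = [degreed_match(x, y, sm_score) for x, y in zip(a1, a2)]
    return min(scores)  # assumption: both elements must match to some degree
```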
⁷ http://wordnet.princeton.edu/
⁸ http://sourceforge.net/projects/jwordnet
Table 5. Comparison of EM/SM/DM Strategies

      ROUGE-1   ROUGE-2   ROUGE-SU4
EM    0.36604   0.07340   0.13009
SM    0.36670   0.07357   0.13014
DM    0.36704   0.07360   0.13029
As seen, there is an improvement when the semantic relation between two words is considered, but the improvement is not obvious. This may be due to the fact that the number of named entity centered bi-gram associations involving a word is still small in our current system, so the contribution of the semantic relation is limited.

4.6 Comparison with DUC 2005 Top 3 Systems
The following table compares our models with the DUC 2005 participating systems, where S15, S17 and S10 are the top three performing systems. As seen, both the NEvBiA-II and NeBiA-II models achieve very competitive performance. Although no further post-processing is carried out, the NEvBiA-II model outperforms the top system at DUC 2005 in the ROUGE-2 evaluation, ranks second in the ROUGE-SU4 evaluation, and is among the top three systems in the ROUGE-1 evaluation.

Table 6. Comparison with the DUC 2005 Top-3 Systems
                 ROUGE-1   ROUGE-2   ROUGE-SU4
NEvBiA-II        0.36704   0.07360   0.13029
NeBiA-II         0.36663   0.07354   0.12987
S15              0.37383   0.07251   0.13163
S17              0.36901   0.07174   0.12972
S10              0.36640   0.07089   0.12649
NIST Baseline    0.30217   0.04947   0.09788
5 Conclusion

In this paper, the role of named entities has been emphasized in query-oriented summarization. The effects of named entities in uni-gram and bi-gram models are investigated. ROUGE evaluation based on the DUC 2005 data set shows that the proposed models can achieve very competitive performance. The combinatorial model of NE-based uni-grams and NE-driven bi-grams can even outperform the best system at DUC 2005. However, we also note that the use of named entity centered bi-gram associations is limited by the coverage problem, which could be alleviated by a more accurate and wide-coverage named entity recognizer. Furthermore, since named entity co-reference would be very useful to our investigation, advances in co-reference resolution from the natural language processing community will be studied in future work.
Acknowledgments

The work described in this paper was supported by grants from the RGC of Hong Kong (Project Nos. PolyU5211/05E and PolyU5217/07E), a grant from the NSF of China (Project No. 60703008), and an internal grant from the Hong Kong Polytechnic University (Project No. A-PA6L).
References

1. Barzilay, R., Lapata, M.: Modeling Local Coherence: An Entity-based Approach. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 141–148 (2005)
2. Conroy, J.M., Schlesinger, J.D.: CLASSY Query-Based Multi-Document Summarization. In: Proceedings of the Document Understanding Conference 2005 (2005)
3. Doran, W., Newman, E., Stokes, N., Dunnion, J., Carthy, J.: IIRG-UCD at DUC 2005. In: Proceedings of the Document Understanding Conference 2005 (2005)
4. Erkan, G.: Using Biased Random Walks for Focused Summarization. In: Proceedings of the Document Understanding Conference 2006 (2006)
5. Hachey, B., Murray, G., Reitter, D.: The Embra System at DUC 2005: Query-oriented Multi-document Summarization with a Very Large Latent Semantic Space. In: Proceedings of the Document Understanding Conference 2005 (2005)
6. Hovy, E., Lin, C.Y., Zhou, L.: A BE-based Multi-document Summarizer with Query Interpretation. In: Proceedings of the Document Understanding Conference 2005 (2005)
7. Jagarlamudi, J., Pingali, P., Varma, V.: Query Independent Sentence Scoring Approach to DUC 2006. In: Proceedings of the Document Understanding Conference 2006 (2006)
8. Li, W., Li, W., Li, B., Chen, Q., Wu, M.: The Hong Kong Polytechnic University at DUC 2005. In: Proceedings of the Document Understanding Conference 2005 (2005)
9. Li, W., Li, B., Wu, M.: Query Focus Guided Sentence Selection Strategy for DUC 2006. In: Proceedings of the Document Understanding Conference 2006 (2006)
10. Lin, C.Y., Hovy, E.: Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of HLT-NAACL, pp. 71–78 (2003)
11. Mohamed, A.A., Rajasekaran, S.: Query-Based Summarization Based on Document Graphs. In: Proceedings of the Document Understanding Conference 2006 (2006)
12. Schilder, F., McCulloh, A., McInnes, B.T., Zhou, A.: TLR at DUC: Tree Similarity. In: Proceedings of the Document Understanding Conference 2005 (2005)
13. Seki, Y., Eguchi, K., Kando, N., Aono, M.: Multi-Document Summarization with Subjectivity Analysis at DUC 2005. In: Proceedings of the Document Understanding Conference 2005 (2005)
14. Zhao, L., Huang, X., Wu, L.: Fudan University at DUC 2005. In: Proceedings of the Document Understanding Conference 2005 (2005)