MAGNT Research Report (ISSN. 1444-8939)
Vol.2 (4). PP: 309-317
A Review of Text Summarization

Fazal Masud Kundi (1), Muhammad Zubair Asghar (1), Syeda Rabail Zahra (1), Shakeel Ahmad (1), Aurangzeb Khan (2)

(1) Institute of Computing and Information Technology, Gomal University, D. I. Khan, Pakistan
(2) Institute of Engineering and Computing Sciences, University of Science and Technology, Bannu, Pakistan

Abstract
The widespread use of the internet and online technologies has caused rapid growth in electronic data. When data is accessed from such a huge repository of e-documents, hundreds or thousands of documents are retrieved. It is impossible for a user to read all the retrieved documents, and these documents also contain redundant information. This problem is termed Information Overload. Text summarization addresses it by producing a summary of the related documents. Text summarization is one of the typical tasks of text mining and is currently among the most attractive research areas. This paper gives a review of text summarization and defines the criteria for summary generation.
I. Introduction

The rapid growth of internet usage has resulted in an exponential increase in information, especially textual information (e.g., news articles, e-books, scientific papers, blogs) [1,2]. Because of the increasing number of electronic documents, search engines return numerous web pages for a single query. This information overload wastes time, since the user must browse all the retrieved information, and it also causes relevant information to be missed. Automatic summarization addresses this twofold problem by providing a summary of each page [3]. Text summarization is the process of producing a reduced version of an original text that highlights its important contents. It is an information retrieval task.

It is hard to imagine everyday life without some form of summarization. A trailer or preview of a movie is a summary. Abstracts of research papers and scientific articles are summaries written by their authors. Minutes of a meeting, a resume, a conference program, and product reviews on e-commerce sites can all be considered summaries in their respective domains. Information provided in the form of summaries is easy to read and coherent, and can therefore be understood with less effort than raw information. However, creating such summaries requires human intervention and an ample amount of time, and in some cases additional resources. With the ever-growing content of the World Wide Web, it is not possible to create summaries manually on such a large scale.

Text summarization has become an important and timely tool for assisting with and interpreting text information in today's fast-growing information age. Technologies that can produce a coherent summary consider variables such as length, writing style and syntax. Search engines such as Google are one example of summarization technology. Microsoft Word's AutoSummarize function is another simple example of text summarization.

The rest of the paper is organized as follows. Section II presents the different types of summaries and the summarization process. Section III describes sentence selection methods for summary generation. Section IV presents commonly used criteria for summary evaluation, and Section V concludes the paper. Figure 1 shows the hierarchy of the paper.
(DOI: dx.doi.org/14.9831/1444-8939.2014/2-4/MAGNT.39)
[Figure 1. Text Summarization. Hierarchy of the paper: Types; Process (Identification, Interpretation, Generation); Methods (Statistical, Linguistic, Rhetorical); Evaluation (Precision & Recall, Compression).]

II. Overview of Text Summarization

A. Concepts and Types

A summary can be defined as a text that is derived from one or more texts, contains the important content of the original text(s), and is no longer than half the length of the original text(s) [4]. The summary should cover the major concepts of the original text(s), be free of redundancy, and be well ordered.
The goal of the text summarization process is to produce an abridged version of a text, in a predefined template, that contains the information that is important or relevant to a user [5]. Various types of summarization (Figure 2) exist in the literature, based on factors such as media, input, output, purpose and language [6,7,8].
[Figure 2. Types of Summary. A text summarizer is classified by Media (text, image, video, audio, speech, hypertext, etc.), Input (single document, multi-document), Output (extractive, abstractive), Purpose (generic, personalized, query-based, sentiment-based), and Language (mono-lingual, multi-lingual, cross-lingual).]
Based on media, there can be text, image, video, audio, speech and hypertext summarization [7,9].

Depending on the input, there are single-document and multi-document summarization systems. A single-document summarizer accepts only one document as input and produces a summary of that document, while a multi-document summarizer summarizes multiple documents [8,10].

On the basis of output, extractive and abstractive summarization systems exist [6,11]. In extractive summarization, the goal is to identify the most important concepts in the input text; the summary is then formed by reusing the main words and sentences of the original text [7,11]. Summaries generated by extractive methods suffer from a lack of coherence and from inconsistencies [6]. In abstractive summarization, the system first understands the text(s) and then creates a summary in its own words [6,12]. Abstractive methods build an internal semantic representation of the input text and then use natural language generation techniques to create a summary that is closer to a human-generated summary [12]. Such a summary might contain words that are not explicitly present in the original text. Building the internal semantic representation is the biggest challenge for abstractive summarizers [6].

Considering the purpose, there are generic, personalized, query-focused, sentiment-based, indicative, informative and critical summarizations. A generic summarizer produces summaries by capturing all the important information from the source text [8,13]. Personalized summarizers provide specific summaries to users based on their field of expertise and personal interests. Personalized summarization aims to adapt the summary of a specified document to the user's interests, which are inferred from social context [14]. Query-focused summarization attempts to summarize the information in a document that pertains to specific search terms.
In other words, a query-focused summary presents the information that is salient to the given queries [13]. In sentiment-based summaries, importance is given to emotions, opinions, reviews, feedback, recommendations, etc. [15,16,17]. Indicative summaries present the main idea of the text [6,7]. This type of summary encourages the user to read the original document. Informative summaries cover all the topics in the source text and represent its important factual terms [8,18]. These summaries are 20 to 30% of the length of the original text [6].

On the basis of language, summaries can be categorized as mono-lingual, multi-lingual and cross-lingual [7,13]. In mono-lingual summarization, the input and output languages are the same [19]. Multi-lingual systems can deal with several languages as input and produce the summary in the language desired by the user [20]. Cross-lingual systems accept a document in one language and produce a summary in another language [21].

B. Text Summarization Process

Text summarization is the process of condensing a text to its most essential contents. When the summarization is done by machines, it is termed Automatic Text Summarization. To generate a summary, the document is analyzed to find the important information it contains. Different approaches are used for analyzing the text(s) and generating the summary. Researchers have described summarization as a three-phase process [4,6,7], which can be written as:

\[ \text{Summarization} = \text{Topic Identification} + \text{Interpretation} + \text{Generation} \]
Several independent modules are used for each of the above-mentioned stages of the summarization process.

1. Topic Identification: This is an initial exploration of the text to identify its genre and topic. The goal of this stage is to filter the input texts and retain only the most important, central topics [4]. Topic identification can be achieved by using several complementary techniques, including cue words, position, high-frequency indicator phrases and discourse structure [6].

2. Interpretation: In this step, the topics identified as important and relevant are fused to represent the general content [6]. Because a document has many sub-topics, fusing topics into one or more characterizing concepts is the most difficult step of automated text summarization [4]. The interpretation represents the actual concepts, which may not be explicitly present in the text. Interpretation is used in abstractive summary generation.

3. Generation: In the generation step, the system produces natural language from the information processed in the previous steps. This phase reformulates the extracted and fused contents into a coherent, densely phrased new text [4]. It covers a range of generation methods, from very simple word or phrase printing to more sophisticated phrase merging and sentence generation [6].

III. Methods for Text Summarization

To generate a high-quality summary, different approaches are used; they differ in the way they select important sentences from the original document.

1. Surface-level / Statistical Methods

In statistical approaches, sentences are selected on the basis of word frequency, indicator phrases and other features, regardless of the meaning of the words [6]. Several methods are currently used to determine the key sentences [22]. The Title Method selects sentences that resemble the title [23]. The Location Method treats the first sentence of each paragraph as a strong candidate for the summary [24]. In the Cue-Phrase Method, sentences that contain phrases such as 'in summary' or 'in conclusion' have the highest priority for inclusion in the summary [23,25]. The Frequency Method gives a higher score to sentences that contain the most frequent words [6].
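The four statistical methods described in this subsection (title, location, cue phrase and frequency) can be combined into a single sentence score. The following Python sketch is only an illustration under stated assumptions: the function names, the small stop-word list, the equal weighting of the title and location features, the weight of 2.0 for cue phrases, and the use of mean word frequency as the frequency feature are all hypothetical choices and are not taken from the cited works.

```python
import re
from collections import Counter

CUE_PHRASES = ("in summary", "in conclusion")        # cue phrases named in the text
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to", "it"}  # illustrative only


def tokenize(text):
    """Lower-case the text and return its words, without stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def score_sentences(title, paragraphs):
    """Score every sentence; `paragraphs` is a list of lists of sentences.

    Returns a list of (score, sentence) pairs in document order.
    """
    freq = Counter(w for para in paragraphs for s in para for w in tokenize(s))
    title_words = set(tokenize(title))
    scored = []
    for para in paragraphs:
        for position, sentence in enumerate(para):
            words = set(tokenize(sentence))
            score = float(len(words & title_words))            # Title Method
            if position == 0:                                  # Location Method
                score += 1.0
            if any(cue in sentence.lower() for cue in CUE_PHRASES):
                score += 2.0                                   # Cue-Phrase Method (assumed weight)
            if words:                                          # Frequency Method
                score += sum(freq[w] for w in words) / len(words)
            scored.append((score, sentence))
    return scored


if __name__ == "__main__":
    doc = [["Summarization shortens text.", "Cats sleep a lot."],
           ["In conclusion, summarization helps readers."]]
    for score, sentence in sorted(score_sentences("Text summarization", doc), reverse=True):
        print(f"{score:.2f}  {sentence}")
```

An extractive summary would then keep the highest-scoring sentences up to the desired length. In practice the feature weights would need to be tuned for the document collection at hand.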
2. Mid-level / Linguistic Methods

In linguistic approaches, connections between words are assumed, and the main idea is analyzed on that basis. The techniques used in this approach are:

2.1. Lexical Chain

Lexical chains were introduced into summarization to capture cohesion [26]. Cohesion means the way different parts of the text stick together. Cohesion occurs not only between individual words but also between word sequences, which results in the formation of lexical chains. Lexical chains link sequences of related words, over short or long distances, to identify important information. After the lexical chains have been identified, the sentences with strong lexical chains are selected for extraction into the summary. WordNet distance is used as a relatedness measure for finding lexical chains [6,27]. WordNet is a thesaurus that is used for determining the relationships between words [28].

2.2. Graph Theory

Graphs are used to represent the structure of the text as well as the relationships between the sentences of the document [6,29]. A graph consists of nodes and edges. The nodes represent sentences, and the edges denote the connections between sentences. When a sentence refers to another sentence, a link with an associated weight is created between them. The weights are used to compute the scores of the sentences [1]. Once the graph has been processed, the sentences are ranked by their scores, and the highest-ranked sentences are chosen for the final summary [11].

2.3. Cluster-Based Method

Clustering-based methods have become essential in text summarization. These methods reduce the information by grouping similar data together. Documents generally contain multiple topics or themes. Clustering techniques are applied so that the summary addresses the different "themes" of the document. Clustering methods find the similarity between sentences on the basis of these themes. Hierarchical and partition-based clustering techniques are the ones mainly used. In hierarchical clustering, smaller clusters are merged to form bigger clusters, while partitional clustering forms distinct clusters by splitting the data.

Linguistic approaches need a great amount of memory for storing supplementary linguistic information such as WordNet. Moreover, complex linguistic processing needs powerful processors [18].

3. Deep-level / Rhetorical Methods

To produce a coherent and fluent summary, it is essential to determine the flow of information in the document and its overall discourse structure. Rhetorical Structure Theory (RST) is based on the analysis of text. According to this theory, it is possible to analyze the majority of text types in terms of a hierarchical tree of rhetorical relations. The analysis is based on the assumption that some text units are more central (salient) to the text than others. The central units are called nuclei, and the supporting units are called satellites [30]. Rhetorical relations are described in terms of schemas, i.e., the way in which one or more satellites (or nuclei) are related to the current nucleus. It is also assumed that a relation that holds between two text spans also holds between the nuclei of those spans. Salient sentences are retrieved by traversing the tree. According to the target compression rate, the top n sentences can be extracted and presented as a summary.
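The idea of retrieving salient sentences by traversing a tree of nuclei and satellites can be illustrated with a small Python sketch. It assumes that the rhetorical tree has already been built by a discourse parser, which is outside the scope of the sketch. The `Node` class, the `summarize` function and the scoring rule (a sentence is less salient the more satellite links lie on the path from the root to it) are hypothetical simplifications chosen for illustration; they are not the procedure of any cited system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a (hypothetical) rhetorical tree; leaves carry sentence text."""
    role: str = "nucleus"          # "nucleus" or "satellite", relative to the parent
    text: str = ""                 # set only on leaves
    children: list = field(default_factory=list)


def summarize(root, n):
    """Return the n most salient leaf sentences, in document order."""
    units = []  # (satellite links on the path from the root, position, text)

    def walk(node, satellites):
        if not node.children:      # leaf: record its penalty and its position
            units.append((satellites, len(units), node.text))
            return
        for child in node.children:
            walk(child, satellites + (1 if child.role == "satellite" else 0))

    walk(root, 0)
    top = sorted(units)[:n]        # fewest satellite links first, ties by position
    return [text for _, _, text in sorted(top, key=lambda u: u[1])]


# A three-sentence example tree (invented for illustration).
tree = Node(children=[
    Node(role="satellite", text="Background."),
    Node(role="nucleus", children=[
        Node(role="nucleus", text="Main claim."),
        Node(role="satellite", text="Evidence."),
    ]),
])

if __name__ == "__main__":
    print(summarize(tree, 1))
    print(summarize(tree, 2))
```

Here n plays the role of the target compression rate: a smaller n keeps only the sentences that are reached through nuclei.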
IV. Comparison and Evaluation

Summary evaluation is an important aspect of text summarization [11]. Evaluation methods assess the usefulness and truthfulness of a summary [6]. Evaluating qualities of a summary such as comprehensibility, coherence and readability is a difficult task. Intrinsic and extrinsic measures are generally used for summary evaluation [11,31,32]. In intrinsic methods, humans evaluate the quality of the summary, while extrinsic methods measure quality through a task-based performance measure [11]. Intrinsic measures are known as glass-box testing, and extrinsic measures are known as black-box testing [31]. Intrinsic evaluations have mainly assessed the coherence and informativeness of summaries. Extrinsic evaluations, on the other hand, have tested the impact of summarization on tasks such as relevance assessment and reading comprehension.

Two commonly used criteria for summary evaluation are:
i) Precision and Recall
ii) Compression Ratio and Retention Ratio

i) Precision and Recall

Precision and recall are used to evaluate the similarity between human-generated and system-generated summaries [4]. They are calculated as follows:
\[ \text{Precision} = \frac{\text{Correct}}{\text{Correct} + \text{Wrong}} \]

\[ \text{Recall} = \frac{\text{Correct}}{\text{Correct} + \text{Missed}} \]
where Correct is the number of sentences that appear in both the automated summary and the manual summary, Wrong is the number of sentences present in the automated summary but not in the manual summary, and Missed is the number of sentences found in the manual summary but not in the automated summary. Thus, Precision is the fraction of the sentences extracted by the system that are suitable, and Recall is the fraction of the suitable sentences that the summarization system managed to extract.

ii) Compression Ratio and Retention Ratio
In general, a text is said to be a summary when it obeys two requirements:
• It must be shorter than the original input text;
• It must contain the important information of the original text [4].

Compression Ratio [31,32] measures how much shorter the summary is than the original text.
\[ \text{Compression Ratio} = \frac{\text{Length of Summary}}{\text{Length of Full Text}} \]
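The precision, recall and compression ratio formulas of this section can be computed directly. The short Python sketch below is an illustration under two assumptions that are not stated in the text: sentences are compared by exact string match, and length is measured in whitespace-separated words. The function names are hypothetical.

```python
def precision_recall(system_sentences, manual_sentences):
    """Return (precision, recall) of a system summary against a manual summary."""
    system, manual = set(system_sentences), set(manual_sentences)
    correct = len(system & manual)     # in both summaries
    wrong = len(system - manual)       # only in the system summary
    missed = len(manual - system)      # only in the manual summary
    precision = correct / (correct + wrong) if system else 0.0
    recall = correct / (correct + missed) if manual else 0.0
    return precision, recall


def compression_ratio(summary, full_text):
    """Length of the summary divided by the length of the full text (in words)."""
    return len(summary.split()) / len(full_text.split())


if __name__ == "__main__":
    p, r = precision_recall(["s1", "s2", "s3"], ["s2", "s3", "s4", "s5"])
    print(f"precision={p:.2f} recall={r:.2f}")
    print(compression_ratio("a short summary", "a much longer original text of twelve words in total right here"))
```

In the first example, two of the three extracted sentences are correct and two of the four manual sentences are missed, so precision is 2/3 and recall is 1/2.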
Retention Ratio determines how much information is retained [31]. A good summary is one that has a high retention ratio and a low compression ratio [4].

V. Conclusion
Research on automated text summarization still has a long way to go before we can really claim to understand the nature of summaries. In this paper we have discussed the different types of summaries, the general process of summary generation that most systems follow, and summarization techniques, such as statistical and linguistic approaches, that are useful for identifying the most important points of a text document when generating a summary. Commonly used criteria and techniques for evaluating summaries have also been discussed.
Automatic text evaluation is still an open research topic. There is much room for researchers to produce summarization software that generates an effective summary in less time and with the least redundancy. The ultimate research goal of automatic text summarization is to enable the computer to read and understand texts as human beings do and to produce relevant and satisfactory summaries. However, because of the flexible nature of natural languages and the limited capacity of computer natural language processing, the produced summaries may suffer from problems of cohesion and semantics and may fail to meet users' needs.

References

[1] Rafael Ferreira, Luciano de Souza Cabral, Rafael Dueire Lins, Gabriel Pereira e Silva, Fred Freitas, George D.C. Cavalcanti, Luciano Favaro, "Assessing sentence scoring techniques for extractive text summarization", Expert Systems with Applications, Elsevier, vol. 40, pp. 5755-5764, (2013).
[2] Asghar, M. Z., Khan, A., Ahmad, S., & Kundi, F. M., "Preprocessing in Natural Language Processing", Editorial board, pp. 152, (2013).
[3] Suneetha Manne, Shaik Mohammed Zaheer Pervez, S. Sameen Fatima, "A Novel Automatic Text Summarization System with Feature Terms Identification", India Conference (INDICON), Annual IEEE, (2011).
[4] Eduard Hovy, The Oxford Handbook of Computational Linguistics, Oxford University Press, Oxford, chapter 32, (2003).
[5] Mani, I., Automatic Summarization, John Benjamins Publishing Co., pp. 1-22, (2001).
[6] Saeedeh Gholamrezazadeh, Mohsen Amini Salehi, Bahareh Gholamzadeh, "A Comprehensive Survey on Text Summarization Systems", Computer Science and its Applications, CSA '09, 2nd International Conference on, 10-12 Dec., (2009).
[7] Elena Lloret, "Text summarization: an overview", paper supported by the Spanish government under the project TEXT-MESS (TIN2006-15265-C06-01).
[8] Ani Nenkova and Kathleen McKeown, "Automatic Summarization", Foundations and Trends in Information Retrieval, vol. 5, pp. 103-233, (2011).
[9] Raposo, Francisco, Ricardo Ribeiro, and David Martins de Matos, "On the Application of Generic Summarization Algorithms to Music", Signal Processing Letters, IEEE, vol. 22, no. 1, pp. 26-30, (2015).
[10] Goldstein, Jade, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz, "Multi-document summarization by sentence extraction", In Proceedings of the 2000 NAACL-ANLP Workshop on Automatic Summarization, vol. 4, pp. 40-48, (2000).
[11] Vishal Gupta, Gurpreet Singh Lehal, "A Survey of Text Summarization Extractive Techniques", Journal of Emerging Technologies in Web Intelligence, vol. 2, no. 3, (August 2010).
[12] Vipul Dalaal and Latesh Malik, "A Survey of Abstractive and Extractive Automatic Text Summarization Techniques", Sixth International Conference on Emerging Trends in Engineering and Technology, (2013).
[13] Karel Jezek and Josef Steinberger, "Automatic Text Summarization (the state of the art 2007 and new challenges)", Znalosti, pp. 1-12, (2008).
[14] Kumar, Pingali, and Varma, "Generating Personalized Summaries Using Publicly Available Web Documents", In Web Intelligence and Intelligent Agent Technology, WI-IAT '08, IEEE/WIC/ACM International Conference on, vol. 3, (2008).
[15] M. Zubair Asghar, Aurangzeb Khan, Shakeel Ahmad, Fazal Masud Kundi, "A Review of Feature Extraction in Sentiment Analysis", Journal of Basic and Applied Scientific Research, vol. 4, no. 3, pp. 181-186, (2014).
[16] Asghar, M. Z., Qasim, M., Ahmad, B., Ahmad, S., Khan, A., & Khan, I. A., "Health Miner: Opinion Extraction from User Generated Health Reviews", International Journal of Academic Research, vol. 5, no. 6, (2013).
[17] Asghar, M. Z., RahmanUllah, B. A., Khan, A., Ahmad, S., & Nawaz, I. U., "Political Miner: Opinion Extraction from User Generated Political Reviews", (2014).
[18] Partha Lal, "Text Summarization", (2002).
[19] Martha Mendoza, Susana Bonilla, Clara Noguera, Carlos Cobos, Elizabeth León, "Extractive single-document summarization based on genetic operators and guided local search", Science Direct, vol. 41, no. 9, pp. 4158-4169, (July 2014).
[20] Marina Litvak, Mark Last, Menahem Friedman, "A new approach to improving multilingual summarization using a genetic algorithm", ACL '10 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 927-936, (2010).
[21] Pintu Lohar, Pinaki Bhaskar, Santanu Pal, Sivaji Bandyopadhyay, "Cross Lingual Snippet Generation Using Snippet Translation System", Computational Linguistics and Intelligent Text Processing, Springer, vol. 8404, pp. 331-342, (2014).
[22] Gupta, A., Kaur, M., Singh, A., Goel, A., & Mirkin, S., "Text summarization through entailment-based minimum vertex cover", Lexical and Computational Semantics, pp. 75, (2014).
[23] Kulkarni, U. V., & Prasad, Rajesh S., "Implementation and evaluation of evolutionary connectionist approaches to automated text summarization", In Journal of Computer Science, Science Publications, pp. 1366-1376, (2010).
[24] Abuobieda, A., Salim, N., Albaham, A. T., Osman, A. H., & Kumar, Y. J., "Text summarization features selection method using pseudo genetic-based model", In International conference on information retrieval knowledge management, pp. 193-197, (2012).
[25] Gupta, P., Pendluri, V. S., & Vats, I., "Summarizing text by ranking text units according to shallow linguistic features", In 13th International conference on advanced communication technology, pp. 1620-1625, (2011).
[26] R. Barzilay and M. Elhadad, "Using lexical chains for text summarization", (1997).
[27] Kundi, F. M., Ahmad, S., Khan, A., & Asghar, M. Z., "Detection and Scoring of Internet Slangs for Sentiment Analysis Using SentiWordNet", Life Science Journal, vol. 11, no. 9, (2014).
[28] Pal, Alok Ranjan, and Diganta Saha, "An approach to automatic text summarization using WordNet", Advance Computing Conference (IACC), 2014 IEEE International, IEEE, (2014).
[29] Ramesh, Animesh, K. G. Srinivasa, and N. Pramod, "SentenceRank: A graph based approach to summarize text", Applications of Digital Information and Web Technologies (ICADIWT), 2014 Fifth International Conference on the, IEEE, (2014).
[30] Mann, W. C., and Thompson, S. A., "Rhetorical Structure Theory: Toward a functional theory of text organization", pp. 243-281, (1988).
[31] http://www.isi.edu/naturallanguage/people/{hovy,cyl,marcu}.html
[32] Martin Hassel, "Evaluation of Automatic Text Summarization: A practical implementation", Licentiate Thesis, Stockholm, Sweden, (2004).