Automatic Text Document Summarization Based on Machine Learning

Gabriel Silva, Rafael Ferreira, Rafael Lins, Luciano Cabral, Hilário Oliveira
UFRPE/UFPE, Recife, PE, Brazil
{gfps, rflm}@cin.ufpe.br, {rdl, htao}@cin.ufpe.br

Steven J. Simske
Hewlett-Packard Labs, Fort Collins, CO 80528, USA
[email protected]

Marcelo Riss
Hewlett-Packard Brazil, Porto Alegre, RS, Brazil
[email protected]
ABSTRACT

The need for the automatic generation of summaries gained importance with the unprecedented volume of information available on the Internet. Automatic systems based on extractive summarization techniques select the most significant sentences of one or more texts to generate a summary. This article uses Machine Learning techniques to assess the quality of the twenty most referenced strategies used in extractive summarization, integrating them into a tool. Both quantitative and qualitative aspects were considered in the assessment, demonstrating the validity of the proposed scheme. The experiments were performed on the CNN-corpus, possibly the largest and most suitable test corpus available today for benchmarking extractive summarization.

Categories and Subject Descriptors I.2.7 [Natural Language Processing]: Text analysis.

General Terms Algorithms, Experimentation

Keywords Text Summarization; Extractive Features; Sentence Scoring Methods

1. INTRODUCTION

Automatic document summarization is a research area born in the early 1950's. Recently, with the pervasiveness of the Internet and the fast-growing number of text documents, the search for efficient automated systems for Text Summarization (TS) has gained importance; TS may even be seen as a way to "compress" information [12]. TS platforms may receive one or more documents as input to generate a summary. Such techniques are classified as extractive, when the summary is formed by sentences of the original document, or abstractive, when the chosen sentences are modified to yield a better quality summary [11]. In general, abstractive summarization may be seen as a step beyond extractive summarization, and research in that area is still at a very early stage. Extractive summarization techniques select the sentences with the highest scores from the original document, based on a set of criteria. Extractive summarization methods are better consolidated and may be considered efficient for the automatic generation of summaries [12, 11, 4].

Summaries may also be classified as generic or query-driven. Generic summaries analyze the text as a whole, without prioritizing any aspect. Query-driven summaries, on the other hand, scan the text for sentences that may answer a query from the user.

Text summarization may also be seen as a text compression strategy. The vertical compression rate of a summary may be defined as the ratio between the number of sentences in the original document and the number of sentences in the summary. Another possibility is horizontal sentence compression, in which each sentence is shortened by removing non-essential information; in this case the compression rate is measured by the ratio between the number of words in the original document and the number of words in the summary. Both compression rates are important factors that influence the overall quality and purpose of the summary. This paper focuses exclusively on extractive vertical summarization.

Extractive text summarization techniques are split into three categories [4]: word-based, sentence-based, and graph-based scoring methods. In word-based scoring methods, each word receives a score and the weight of each sentence is the sum of the scores of its constituent words. Sentence-based scoring analyzes the features of a sentence and its relation to the text: cue-phrases (such as "it is important", "in summary", etc.), resemblance to the title, and sentence position are examples of sentence-based scoring techniques. Finally, in graph-based methods, the score of a sentence reflects its relationship to other sentences: when a word or sentence refers to another one, an edge with a weight is created between them, and the sum of the weights incident on a sentence is its score.

This article analyzes 15 sentence scoring methods, and some variations of them, widely used and referenced in the document summarization literature of the last 10 years. The scoring methods comprise the feature vector used to train the classifier and to rank sentences, totaling 20 features. The key point of this paper is to use Machine Learning techniques to analyze such features so as to point out which of them contribute most to good quality summaries. Both quantitative and qualitative strategies are used here to assess the quality of summaries. The quantitative assessment was performed using ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [9], a measure widely accepted for such a purpose. In addition, three people analyzed each original text and selected summary sentences, following the methodology described below; the qualitative assessment counts the number of sentences selected by the system that coincide with the sentences selected by the three human evaluators. The results obtained show the effectiveness of the proposed method: it selects twice as many relevant sentences to compose the summary, and achieves results 71% better under the ROUGE-2 metric.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. DocEng'15, September 8-11, 2015, Lausanne, Switzerland. © 2015 ACM. ISBN 978-1-4503-3307-8/15/09 ...$15.00.
DOI: http://dx.doi.org/10.1145/2682571.2797099
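The graph-based family just described can be illustrated with a minimal sketch: sentences are nodes, edges are weighted by word overlap, and a sentence's score is the sum of the weights of its incident edges. This is an illustration only, with a simple Jaccard similarity as the edge weight; the surveyed graph methods (e.g. TextRank, Bushy Path) are more elaborate.

```python
# Minimal graph-based sentence scoring sketch: build a complete graph
# over sentences, weight each edge by word-overlap similarity, and
# score each sentence by the sum of its incident edge weights.

def tokenize(sentence):
    return set(sentence.lower().split())

def similarity(s1, s2):
    # Jaccard overlap between the word sets of two sentences.
    w1, w2 = tokenize(s1), tokenize(s2)
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def graph_scores(sentences):
    n = len(sentences)
    scores = [0.0] * n
    for i in range(n):
        for j in range(i + 1, n):
            w = similarity(sentences[i], sentences[j])
            scores[i] += w  # each edge contributes to both endpoints
            scores[j] += w
    return scores

doc = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Summarization selects important sentences.",
]
print(graph_scores(doc))
```

Sentences that share vocabulary with many others accumulate weight and rank higher, which is the intuition behind all of the graph-based features analyzed in this paper.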

2. THE CNN CORPUS

The CNN corpus developed by Lins and his colleagues consists of news texts extracted from the CNN website (www.cnn.com). The main advantage of this test corpus rests not only on the high quality of the writing, which uses grammatically correct standard English to report on subjects of general interest, but also on the fact that each news text is provided with its highlights: a summary of 3 to 5 sentences written by the original author(s). The highlights were the basis for the development of the gold standard, which was obtained by the injective mapping of each of the sentences in the highlights onto the original sentences of the text. This mapping was performed by three different people, and the gold standard was formed with the most voted mapped sentences. A very high degree of consistency in sentence selection was observed. The CNN-corpus is possibly the largest existing corpus for benchmarking extractive summarization techniques. The current version has 400 documents, written in English, totaling 13,228 sentences, of which 1,471 were selected for the gold standards, representing an average compression rate of 90%.

3. THE SYSTEM

The steps of the methodology for obtaining the extractive summaries are presented in the following sections.

3.1 Text pre-processing

The news articles obtained from the CNN website must be carefully chosen so as to contain only text; thus news articles with figures, videos, tables and other multimedia elements are discarded. Besides that, each article must be "complete", with text, highlights, title, author(s), subject area, etc. All such data is inserted into an XML file. The text part of the document is then processed for paragraph segmentation, sentence segmentation, stop word removal and stemming. Each paragraph is numbered, as well as each of its sentences. Sentence segmentation is performed by Stanford CoreNLP (http://nlp.stanford.edu/software/corenlp.shtml). Stop words [5] are removed, since they are considered unimportant and may introduce noise; the stop words are predefined and stored in an array that is compared against the words in the document. Word stemming [13] converts each word into its root form by removing its prefix and suffix. After this stage the text is structured in XML and included in the XML file that corresponds to the news article. As the focus here is on the text part of the document, all other XML-file attributes will no longer be addressed in this paper.

3.2 Feature Extraction

After preprocessing, the XML document is represented by the set D = {S1, S2, ..., Sn}, where Si is a sentence in the document D. The preprocessed sentences are subjected to the feature extraction process, so that a feature vector Vi = {F1, F2, ..., F20} is generated for each sentence Si. As already mentioned, extractive summarization uses three scoring strategies [4]: (i) Word: assigns scores to the most important words; (ii) Sentence: accounts for features of the sentence itself, such as its position in the document, similarity to the title, etc.; (iii) Graph: uses the relationships between words and sentences. Table 1 shows the features analyzed in this work and their kind of scoring. They correspond to the most widely acknowledged techniques for extractive summarization reported in the literature.

Table 1: Features analyzed and their type of scoring.

Feature  Name of Extractive Summarization Strategy  Type of Scoring
F01      Aggregate Similarity                       Graph
F02      Bushy Path                                 Graph
F03      Centrality                                 Sentence
F04      Heterogeneous Graph                        Graph
F05      Text Rank                                  Graph
F06      Cue-Phrase                                 Sentence
F07      Numerical Data                             Sentence
F08      Position in Paragraph                      Sentence
F09      Position in Text                           Sentence
F10      Resemblance to Title                       Sentence
F11      Sentence Length                            Sentence
F12      Sentence Position in Paragraph             Sentence
F13      Sentence Position in Text                  Sentence
F14      Proper-Noun                                Word
F15      Co-Occurrence BLEU                         Word
F16      Lexical Similarity                         Word
F17      Co-Occurrence N-gram                       Word
F18      TF/IDF                                     Word
F19      Upper Case                                 Word
F20      Word Frequency                             Word
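As an illustration of one of the word-scoring features, the TF/IDF feature (F18) may be sketched as follows. The exact formulation used in the system is not detailed here; treating each sentence as the "document" unit for the IDF count is an assumption made only for this sketch.

```python
# Sketch of a TF/IDF sentence score (feature F18): each word is weighted
# by its frequency in the whole document, discounted by how many
# sentences contain it; a sentence's score is the sum of its words'
# weights. Illustration only -- the IDF "document" unit here is the
# sentence, which is an assumption of this sketch.
import math
from collections import Counter

def tfidf_scores(sentences):
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    tf = Counter(w for toks in tokenized for w in toks)       # term frequency in the document
    df = Counter(w for toks in tokenized for w in set(toks))  # number of sentences containing each word
    return [sum(tf[w] * math.log(n / df[w]) for w in toks) for toks in tokenized]
```

Words that occur often overall but appear in few sentences dominate the score, so sentences concentrating the document's distinctive vocabulary rank higher.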

3.3 Classification model

The steps for creating the classification model used to select the sentences that compose the summary are detailed here. The first step aims to reduce the problems inherent to the feature extraction of each sentence: feature vectors with missing information and outliers (when all features reach the maximum value) are eliminated. Another problem addressed here is class imbalance: whenever there is a large disparity in the number of training examples of each class, classification models optimized for overall accuracy tend to degenerate into trivial models that almost always predict the majority class. The algorithm chosen to address this problem was SMOTE [3], whose principle is to create artificial data based on the spatial relations between examples of the minority class. For each instance of the minority class, its k nearest neighbors within that class are computed and, depending on the amount of oversampling required, some of them are randomly chosen. A synthetic sample is then generated as follows: take the difference between the feature vector under consideration and one of its chosen neighbors, multiply this difference by a random number between zero and one, and add it to the feature vector under consideration. This selects a random point along the line segment between the two samples, which makes the decision region of the minority class more general [3]. Then, the system performs feature selection, an important tool for reducing the dimensionality of the vectors, as some features contribute to decreasing the efficiency of the classifier.
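The SMOTE interpolation step described above can be sketched in a few lines. This is an illustration of the principle only; the actual experiments used an off-the-shelf SMOTE implementation.

```python
# Sketch of SMOTE's core step [3]: a synthetic minority sample is a
# random point on the line segment between a minority instance and one
# of its k nearest minority-class neighbors.
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote_sample(minority, i, k=2, rng=random):
    """Generate one synthetic sample from minority[i]."""
    x = minority[i]
    # k nearest neighbors of x within the minority class (excluding x itself)
    neighbors = sorted((p for j, p in enumerate(minority) if j != i),
                       key=lambda p: euclidean(x, p))[:k]
    nn = rng.choice(neighbors)
    gap = rng.random()  # random factor in [0, 1)
    # interpolate: x + gap * (nn - x)
    return [xv + gap * (nv - xv) for xv, nv in zip(x, nn)]
```

Because the synthetic point always lies between two real minority examples, oversampling broadens the minority decision region instead of merely duplicating existing points.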
Another contribution of this study is to identify which of the 20 features most used in the last 10 years for extractive summarization effectively contribute to a good performance of the classifiers. The experiment was conducted on the corpus of 400 CNN news texts in English. The experiments were performed with the attribute selection algorithms of WEKA (http://www.cs.waikato.ac.nz/ml/weka/); three of them were chosen and applied to the balanced base to define the best attributes of the vector: (i) CFS Subset Evaluator: evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them; (ii) Information Gain Evaluator: evaluates the worth of an attribute by measuring its information gain with respect to the class; (iii) SVM Attribute Evaluator: evaluates the worth of an attribute by using an SVM classifier. The top five attributes indicated by the selection methods were chosen. Figure 1 shows the profile of the selected features. They demonstrate the prevalence of language-independent features, such as position in the text, TF/IDF and similarity; this allows the summarization of texts in different languages.

Figure 1: Selected Features

Six classifiers were tested using the WEKA platform: Naive Bayes [8], MLP [7], SVM [7], KNN [1], Ada Boost [6], and Random Forest [2]. Their results were compared with seven summarization systems: Open Text Summarizer (OTS), Text Compactor (TC), Free Summarizer (FS), Smmry (SUMM), Web Summarizer (WEB), Intellexer Summarizer (INT) (libots.sourceforge.net, www.textcompactor.com, freesummarizer.com, smmry.com, www.websummarizer.com, summarizer.intellexer.com), and Compendium (COMP) [10]. Figure 2 presents the evaluation of the classifiers, showing the number of correct sentences chosen from the human-selected sentences that form the gold standard. This experiment used 400 texts from CNN news.

Figure 2: Evaluation of the classifiers for summarization

The classifiers were tested with parameter variations, with and without adjustment and balancing of the base. The technique chosen to validate the models was cross-validation. The tests performed with the unbalanced base yielded an accuracy of 52%, while those with the balanced base yielded 70% accuracy. The Naive Bayes classifier achieved the best result in all cases. In the qualitative evaluation it reached 969 and 1082 correctly selected sentences in the unbalanced and balanced cases, respectively. In the unbalanced case Naive Bayes outperformed the second place (Ada Boost) by 7.42%, and in the balanced case it selected the same number of important sentences as KNN.
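The Information Gain Evaluator used in the attribute selection above admits a compact sketch: the gain of a discrete attribute is the class entropy minus the class entropy conditioned on the attribute's values. This is a toy illustration of the measure, not WEKA code.

```python
# Information gain of a discrete attribute with respect to the class:
# IG(F) = H(class) - sum over values v of P(F = v) * H(class | F = v).
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    gain = entropy(labels)
    n = len(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain
```

An attribute that perfectly separates the classes has gain equal to the class entropy, while an attribute independent of the class has gain zero, which is how the evaluator ranks the 20 features.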

Figures 3 and 4 present the comparison of the Naive Bayes classifier results against the seven summarization systems. The superiority of the proposed method was demonstrated in both evaluations. In the qualitative assessment the proposed method reached 1082 correctly selected sentences, an improvement of more than 100% (554 more correct sentences) in relation to Text Compactor, the best tool found in the literature. Using the ROUGE-2 metric, the Naive Bayes classifier achieved a result 61.3% better than Web Summarizer, the second place: the proposed method reached 71% precision, while WEB obtained 44%. These results confirm the hypothesis that using machine learning techniques improves text summarization results.
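At its core, the ROUGE-2 measure reported here is bigram overlap between a candidate summary and a reference. The following is a minimal sketch of that core; the actual evaluation used Lin's ROUGE package [9], which additionally handles multiple references and other options.

```python
# Minimal ROUGE-2 sketch: count the bigrams shared by candidate and
# reference (clipped by their counts), then report precision (overlap /
# candidate bigrams) and recall (overlap / reference bigrams).
from collections import Counter

def bigrams(text):
    words = text.lower().split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

def rouge_2(candidate, reference):
    cand, ref = Counter(bigrams(candidate)), Counter(bigrams(reference))
    overlap = sum(min(c, ref[b]) for b, c in cand.items())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall
```

Figure 4 reports the precision component of this measure for each system.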

Figure 3: Evaluation of the summarization systems

Figure 4: Precision of the Summarization Systems using ROUGE 2

4. CONCLUSIONS AND LINES FOR FURTHER WORK

Automatic summarization opens a wide number of possibilities, such as the efficient classification, retrieval and information-based compression of text documents. This paper presents an assessment of the most widely used sentence scoring methods for text summarization. The results demonstrate that a judicious choice of the set of automatic sentence scoring methods provides better quality summaries and also greater processing efficiency. The proposed system selects 554 more relevant sentences for the summaries, which means an improvement of more than 100% in relation to the best tool found in the literature. It was also evident that balancing the base of examples yields gains in the performance of the sentence selection system. The next step is the validation of the experiments on other summarization test corpora, with texts other than news articles. Although the CNN-corpus may possibly be the largest and best test corpus for assessing news articles today, the authors of this paper are promoting an effort to double its size in the near future, allowing even better testing capabilities.

5. ACKNOWLEDGMENTS

The research results reported in this paper have been partly funded by an R&D project between Hewlett-Packard do Brazil and UFPE originated from tax exemption (IPI Law number 8.248 of 1991, and later updates).

6. REFERENCES

[1] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6(1):37–66, Jan. 1991.
[2] L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
[3] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1):321–357, June 2002.
[4] R. Ferreira, L. de Souza Cabral, R. D. Lins, G. P. e Silva, F. Freitas, G. D. Cavalcanti, R. Lima, S. J. Simske, and L. Favaro. Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications, 40(14):5755–5764, 2013.
[5] W. B. Frakes and R. Baeza-Yates, editors. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Upper Saddle River, NJ, USA, 1992.
[6] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In International Conference on Machine Learning, pages 148–156, 1996.
[7] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.
[8] G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI'95, pages 338–345, San Francisco, CA, USA, 1995. Morgan Kaufmann.
[9] C.-Y. Lin. ROUGE: A package for automatic evaluation of summaries. In M.-F. Moens and S. Szpakowicz, editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics.
[10] E. Lloret and M. Palomar. COMPENDIUM: a text summarisation tool for generating summaries of multiple purposes, domains, and genres. Natural Language Engineering, FirstView:1–40, 2012.
[11] E. Lloret and M. Palomar. Text summarisation in progress: a literature review. Artificial Intelligence Review, 37(1):1–41, Jan. 2012.
[12] A. Patel, T. Siddiqui, and U. S. Tiwary. A language independent approach to multilingual text summarization. In Large Scale Semantic Access to Content (Text, Image, Video, and Sound), RIAO '07, pages 123–132, Paris, France, 2007.
[13] C. Silva and B. Ribeiro. The importance of stop word removal on recall values in text categorization. In Proceedings of the International Joint Conference on Neural Networks (IJCNN 2003), volume 3, 2003.