2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia

Automatic Multi-Document Summarization for Indonesian Documents Using Hybrid Abstractive-Extractive Summarization Technique

Glorian Yapinus1, Alva Erwin1, Maulahikmah Galinium1, Wahyu Muliady2
1 Faculty of Engineering & Information Technology, Swiss German University, BSD, Tangerang, Indonesia
2 Akon Teknologi, BSD, Tangerang, Indonesia
1 {glorian.yapinus[at]student., alva.erwin[at], maulahikmah.galinium[at]}sgu.ac.id
2 wahyu.muliady[at]akonteknologi.com

Abstract—This paper discusses the development of multi-document summarization for Indonesian documents using a hybrid abstractive-extractive summarization approach. Multi-document summarization is a technology that is able to summarize multiple documents and present them in one summary. The method used in this research, the hybrid abstractive-extractive summarization technique, is a combination of WordNet-based text summarization (an abstractive technique) and title-word-based text summarization (an extractive technique). In an experiment with LSA as the comparison method, this method successfully generated a well-compressed and readable summary with a fast processing time.

Keywords—Multi-Document Summarization, Abstractive Technique, Extractive Technique, Indonesian Documents

I. INTRODUCTION

The amount of online information is growing tremendously, so it is important to have a mechanism that is able to present this information effectively [1]. To address this problem, research on automated summarization of unstructured text has increased and received much attention in recent years. Moreover, current research on automatic text summarization tends to focus on multi-document rather than single-document summarization [2].

Generally, the goal of automatic text summarization is to condense the source text, making it shorter without losing its content and meaning [3]. The shorter text is called a summary. A summary can be extracted from a single document or from multiple documents: if the summary is extracted from a single document, the task is called single-document summarization, whereas if it is extracted from several documents that discuss the same topic, it is called multi-document summarization [4].

Text summarization is usually classified into two techniques, abstractive and extractive summarization [5]. An abstractive summarization attempts to gain an understanding of the main concepts in a document [6]. Extractive summarization, on the other hand, attempts to extract parts of the original document, such as important sentences and paragraphs, and concatenate them into a summary. The method used in this research is a combination of those two techniques: the abstractive part is implemented with WordNet-based text summarization and the extractive part with title-word extraction, based on [5].

The purpose of this research is to ascertain whether the two current text summarization techniques, abstractive and extractive, can be combined to generate a quickly generated, well-compressed, and readable summary. The documents used as the research object are news documents only, because news documents typically have a title that is relevant to the whole body of the document, and the multi-document summarization developed in this research uses the document title to generate a summary; from the news title, the meaning of the entire document can be obtained. The news documents were collected from online Indonesian news portals. The scope of this research is limited to the development of the summarization engine, which is marked in blue in Fig. 1.

[Fig. 1 depicts the overall system: a Web Crawler generates raw documents; a Clustering System generates clustered documents; each cluster of documents is passed to a Summarization Engine, which generates an output summary; the output summaries feed a Recommendation System.]

Fig. 1. The Research Scope

978-1-4799-5303-5/14/$31.00 ©2014 IEEE

II. RELATED WORKS

The study of automated summarization started 40 years ago. At that time, simple document features such as word frequency and word/sentence position were harnessed to create a

document summary [7]. Since then, many single-document approaches have been created, such as linguistic unit extraction and the application of machine learning to find patterns in text. Several researchers have extended single-document summarization approaches to summarize multiple documents. For instance, Steinberger et al. applied Latent Semantic Analysis (LSA) to multi-document summarization [8]. Although their summarization technique is sophisticated, it still has some drawbacks, such as the need to sort sentences when they are inserted into the summary from different documents, and anaphoric expressions that cannot be resolved in the context of the summary. Another example is Goldstein et al., who applied a sentence extraction approach to multi-document summarization [1]. Their approach has advantages such as fast processing time and domain independence. However, because they do not apply natural language understanding techniques, their method may generate a summary whose passages are disjoint from one another.

Up to now, there has been little research on automatic text summarization for documents written in Bahasa Indonesia, as shown by the small number of published studies on this topic. There are, however, some successful examples. Tardan et al. conducted research on single-document summarization using a semantic analysis approach for documents in Bahasa Indonesia [9]; their semantic techniques proved effective in summarizing a document. Budhi et al. developed automatic text summarization for documents in Bahasa Indonesia by calculating the weight of each sentence and the weight of sentence relations [10]; they found that the size of the input document's paragraphs influences the size of the output summary.

III. METHODOLOGY

Figure 2 depicts the methodology of this research:

[Fig. 2 depicts the pipeline: Input (Set of Clustered Documents) → Concatenation → Pre-processing → Feature Scoring (consulting WordNet) → Feature Ranking & Extraction → Output (Summary).]

Fig. 2. The Methodology
As shown in Fig. 2, the input documents are documents that have already been clustered; therefore, the input documents should discuss the same topic and share the same document category. In this research, the input documents are clustered manually, because the research scope is limited to the development of the summarization engine. Before the system starts to summarize the input documents, the user inputs the number of paragraphs to be generated in the output summary; this input determines the output summary size.

WordNet is a lexical database that contains word meanings and their semantic relations, such as synonyms and antonyms [11]. The WordNet implemented in this research contains three entities: a synonym entity, a dictionary entity, and a category entity. The synonym entity contains word synonyms, the dictionary entity contains word attributes such as word meanings, POS (part-of-speech) tags, and examples, and the category entity contains the synset category. Fifteen categories are used to categorize documents and synsets (a synset is a set of one or more synonyms) in this research: "Ekonomi" (Economics), "Bola" (Soccer), "Teknologi" (Technology), "Kesehatan" (Health), "Kuliner" (Culinary), Sport (sports other than soccer), "Otomotif" (Automotive), "Properti" (Property), Travel, "Pendidikan" (Education), Lifestyle, Entertainment, Beauty, "Politik" (Politics), and "Bisnis" (Business).

Four phases must be passed in order to obtain a summary: Concatenation, Pre-processing, Feature Scoring, and Feature Ranking & Extraction. This section explains those phases.

A. Concatenation

In this step, all of the input document titles and paragraphs are concatenated, as defined in equation (1):

C = {D1 ∪ D2 ∪ D3 ∪ … ∪ Dn}    (1)

where D is an input document and C is a big document containing the concatenated document titles and paragraphs.

B. Pre-processing

The pre-processing phase in this research is applied mostly to clean noise from the title of the document obtained from the concatenation step, because in a later step the title of this document is used to determine the output summary content. Four pre-processing techniques are implemented in this phase: Duplicate Words Removal, Stop Words Removal, N-Gram Detection and Extraction, and Tokenization. They are implemented as follows:

1) Duplicate Words Removal: The document title obtained from the concatenation phase very likely contains duplicate words, so this process removes duplicate words from the document title.
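The concatenation step of equation (1) and the first title-cleaning steps can be sketched in Python. This is a minimal sketch under assumptions: the dictionary-based document representation, the function names, and the tiny stop-word set are illustrative, not the paper's implementation.

```python
def concatenate(docs):
    """Merge the input documents into one 'big document' C (equation 1):
    titles are joined into one concatenated title, bodies into one body."""
    title = " ".join(d["title"] for d in docs)
    body = "\n".join(d["body"] for d in docs)
    return {"title": title, "body": body}

def remove_duplicate_words(text):
    """Keep only the first occurrence of each word in the concatenated title."""
    seen, kept = set(), []
    for word in text.split():
        key = word.lower()
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return " ".join(kept)

def remove_stop_words(text, stop_words):
    """Drop words with no significance, e.g. from the Tala stop-word list [12]."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)
```

For example, concatenating two titles "Jokowi capres" and "Jokowi resmi capres" and then removing duplicates yields the cleaned title "Jokowi capres resmi".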


2) Stop Words Removal: Stop words removal is the process of removing words that have no significance in a text. Here, the stop words in the concatenated title are removed based on the stop-word list suggested in [12].

3) N-Gram Detection and Extraction: The purpose of this process is to detect and extract n-grams that form phrases and words with a specific meaning. The maximum size of extracted n-gram in this research is the bigram (n = 2), because Indonesian phrases formed by three or more words are very rare and, technically, extracting large n-grams consumes a considerable amount of processing time due to repeated lookups in the database (WordNet).

4) Tokenization: After the concatenated title has passed the previous processes, it is ready to be processed further. In this step, the concatenated title and paragraphs are converted into tokens, which are used as the input for the Feature Scoring phase.

C. Feature Scoring

This step scores the concatenated paragraphs based on word similarity and synonymy with the title. Hence, if a paragraph contains many words that are similar to, or synonymous with, the words in the title, that paragraph receives a high score and is considered an important paragraph [5]. Algorithm 1 shows how this process works.

Algorithm 1: Feature Scoring

    String clean_title = concat_title.preprocessing();
    String[] split_title = clean_title.split(" ");
    String[] synonym = getSynonym(split_title);
    String[] paragraph = concat_body.split("\n");
    int[] score_per_paragraph = new int[paragraph.length];
    int score = 0;
    for (int i = 0; i < paragraph.length; i++) {
        String[] split_paragraph = paragraph[i].split(" ");
        for (String word : split_paragraph) {
            for (String t : split_title) {
                if (word.equals(t)) { score++; break; }
            }
            for (String s : synonym) {
                if (word.equals(s)) { score++; break; }
            }
        }
        score_per_paragraph[i] = score;  // store this paragraph's score
        score = 0;
    }
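A runnable Python sketch of Algorithm 1, together with the ranking-and-extraction step that follows it (Section III.D). This is an illustration under assumptions: the WordNet lookup is stubbed with a plain dictionary of synsets, the category-filtering helper and all names are ours, and a word matching both a title word and a synonym is counted once rather than twice.

```python
def get_synonyms(title_words, wordnet, category):
    """Concept detection: collect synonyms only from synsets whose
    category matches the input documents' category."""
    synonyms = set()
    for word in title_words:
        for synset in wordnet.get(word, []):
            if synset["category"] == category:
                synonyms.update(synset["words"])
    return synonyms

def score_paragraphs(clean_title, body, wordnet, category):
    """Algorithm 1: score each paragraph by counting words that match,
    or are synonymous with, words in the pre-processed title."""
    title_words = set(clean_title.lower().split())
    synonyms = get_synonyms(title_words, wordnet, category)
    scores = []
    for paragraph in body.split("\n"):
        score = 0
        for word in paragraph.lower().split():
            if word in title_words or word in synonyms:
                score += 1
        scores.append(score)
    return scores

def extract_summary(paragraphs, scores, n):
    """Feature ranking & extraction: take the n highest-scoring
    paragraphs, checking for exact duplicates before insertion."""
    ranked = sorted(range(len(paragraphs)), key=lambda i: scores[i], reverse=True)
    summary, seen = [], set()
    for i in ranked:
        if paragraphs[i] in seen:
            continue
        seen.add(paragraphs[i])
        summary.append(paragraphs[i])
        if len(summary) == n:
            break
    return summary
```

Using the "uang" example from the text, with category "Ekonomi" only the synset {"harta", "kapital"} is consulted, so a paragraph containing "kapital" scores against the title word "uang".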

WordNet is utilized to obtain word synonyms. Furthermore, the system selects only synsets that have the same category as the document category. For example, the word "uang" has two synsets: synset 1, consisting of "dana" and "duit", and synset 2, consisting of "harta" and "kapital". In WordNet, synset 1 is categorized as "Properti" (Property) and synset 2 as "Ekonomi" (Economics). Therefore, if the input documents are categorized as "Ekonomi" (Economics), the system selects only synsets categorized as "Ekonomi" (Economics); in this example, synset 2 is selected. Similarly, if the input documents are categorized as "Properti" (Property), only synsets categorized as "Properti" (Property) are selected; in this example, synset 1. This technique is called concept detection, and it is intended to increase the summarization accuracy by selecting only synsets that have the same category as the input documents. In this phase, the abstractive technique can be clearly seen when the system attempts to gain an understanding of word meaning by retrieving word synonyms from WordNet.

D. Feature Ranking & Extraction

The scored paragraphs are then sorted in descending order of score, and the highest-scoring paragraphs become the output summary. To avoid duplicate paragraphs, the system checks each highest-scoring paragraph for duplicates before it is inserted into the output summary. The number of paragraphs generated in the output summary equals the number input by the user before the summarization starts; for example, if the user inputs 3 paragraphs for the output summary, the top 3 highest-scoring paragraphs become the output summary. Whole high-scoring paragraphs are extracted for the summary because, if the system extracted only some sentences of the highest-scoring paragraphs, the summary might confuse readers, since an extracted sentence may not represent the essential idea of its paragraph. In this phase, the extractive technique is applied when the system extracts parts of the document, in this case the highest-scoring paragraphs, to become the summary.

IV. RESULTS AND DISCUSSION

An experiment was conducted to evaluate the performance of the hybrid abstractive-extractive summarization technique.
In this experiment, LSA (Latent Semantic Analysis) summarization is implemented as the comparison method, using the Solr LSA feature. Table I describes the news datasets used in this experiment.

TABLE I. THE EXPERIMENT DATASETS

Cluster ID | Cluster | Cluster (English Translation) | Category | Total Documents
1 | Jokowi Capres | Jokowi Is The President Candidate | Politik (Politics) | 10
2 | Manchester City Juara Liga Inggris 2014 | Manchester City Won The English Premier League 2014 | Bola (Soccer) | 10
3 | Virus MERS-CoV di Indonesia | MERS-CoV Virus in Indonesia | Kesehatan (Health) | 10
4 | Nikita Willy Putus | Nikita Willy Break Up | Entertainment | 10
5 | Persaingan Capres Lemahkan Rupiah | The Rivalry of President Candidates Weakens the Rupiah | Ekonomi (Economics) | 10


Table II shows the experimental results of the hybrid abstractive-extractive summarization technique on the different clusters of articles. In this experiment, the number of output summary paragraphs is 5. In Table II, Cluster ID refers to the Cluster in Table I; for instance, Cluster ID 1 refers to the cluster "Jokowi Capres" (Jokowi Is The President Candidate) and Cluster ID 4 refers to the cluster "Nikita Willy Putus" (Nikita Willy Break Up). CR stands for Compression Ratio and PT stands for Processing Time.
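The paper does not state how CR is computed; a common definition, assumed here, is the summary's word count as a percentage of the source documents' total word count (lower means more compression):

```python
def compression_ratio(summary, source_documents):
    """CR = words in summary / words in all source documents, as a percentage.
    This formula is an assumption; the paper does not define CR explicitly."""
    summary_words = len(summary.split())
    source_words = sum(len(d.split()) for d in source_documents)
    return 100.0 * summary_words / source_words
```

On this definition, the reported overall CR of 8.5 % would mean the summary is roughly one twelfth the length of the ten source documents.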

TABLE II. THE EXPERIMENT RESULT

Cluster ID | CR | PT (seconds) | LSA CR | LSA PT (seconds)
1 | 8.5 % | 11 | 13 % | 2
2 | 8.7 % | 9 | 11 % | 2
3 | 8.4 % | 24 | 12 % | 2
4 | 8.3 % | 11 | 15 % | 2
5 | 8.7 % | 12 | 12 % | 2
Overall | 8.5 % | 13 | 13 % | 2

Based on the experimental results in Table II, the hybrid abstractive-extractive summarization technique shows promising results. It is able to generate a summary of 10 documents with an overall processing time of 13 seconds. However, its processing time is beaten by LSA summarization with Solr, presumably because LSA is a purely statistical technique, which typically has a fast processing time [8], [9]. The hybrid abstractive-extractive summarization technique also generates a well-compressed summary, with an overall compression ratio of 8.5 %. This even beats the overall compression ratio of LSA summarization with Solr, meaning the hybrid technique generates a shorter output summary.

Although the hybrid abstractive-extractive summarization technique can generate an output summary quickly and with a good compression ratio, this means nothing if the output summary cannot be understood; therefore, the readability of the summary needs to be measured. To do so, a human-based evaluation was conducted in this research: a few articles and their summaries are given to evaluators, who score the readability of each output summary on a range from one (1), the lowest score, to ten (10), the highest. There are five evaluators, consisting of Indonesian language teachers and media experts. Each evaluator was given three articles on the same topic, retrieved from different online news portals, and two summaries of those articles generated by the two summarization methods: the hybrid abstractive-extractive summarization technique and LSA summarization with Solr. Here are short profiles of the evaluators:

- Two high school Indonesian language teachers.
- One Indonesian language lecturer at a public university, who is also the author of several books published in Indonesia.
- One journalist from a renowned Indonesian newspaper, who is also a media journalism lecturer at a public university.
- One news anchor from a popular Indonesian TV station.

Table III shows the result of this evaluation.

TABLE III. THE HUMAN-BASED EVALUATION RESULT

Category | Hybrid Abstractive-Extractive Score | LSA Score
Ekonomi (Economics) | 5.5 | 6
Politik (Politics) | 7 | 7.5
Entertainment | 8.5 | 8
Properti (Property) | 9 | 7
Pendidikan (Education) | 8 | 7
Overall | 8 | 7.5

Based on the evaluation results in Table III, the output summary generated by the hybrid abstractive-extractive summarization technique shows a good readability level, with an overall score of 8. This beats the score given to LSA summarization with Solr, meaning the hybrid technique generates a more readable output summary. However, the evaluators found some issues in the output of the hybrid technique: in some cases there are paragraphs that state the same idea, and paragraphs that are in the wrong order.

V. CONCLUSION AND FUTURE WORKS

The hybrid abstractive-extractive summarization technique proved effective in summarizing multiple documents into a quickly generated, well-compressed, and readable summary, as shown by the experiments conducted in this research. In the future, the readability of the summary could be increased by applying a natural language processing technique, namely discourse analysis, to the hybrid abstractive-extractive summarization technique in order to identify the relations between summary paragraphs and eliminate those not related to the context. In addition, similarity techniques could be used to identify and eliminate redundant summary paragraphs. Hypothetically, this should fix the issues found by the evaluators, although it may increase the processing time [9]. The datasets in this research are news documents that were clustered manually and, as a result, discuss the same topic and fall under the same category; further investigation is needed for sets of documents that fall under the same category but report different news. As another direction for future work, the summarization accuracy could be increased if the categories and keywords generated manually in this research are replaced with expanded categories


with keywords generated statistically. In this way, the synset category can be weighted more accurately, leading to better summarization accuracy.

ACKNOWLEDGMENTS

The authors would like to thank the summary evaluators, Swiss German University, and Akon Teknologi for their support of this research.

REFERENCES

[1] J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz. Multi-Document Summarization By Sentence Extraction. In Proceedings of the ANLP/NAACL-2000 Workshop on Automatic Summarization, 2000.
[2] T. Hirao, T. Fukusima, M. Okumura, C. Nobata, and H. Nanba. Corpus and Evaluation Measures for Multiple Document Summarization with Multiple Sources. In Proceedings of the Twentieth International Conference on Computational Linguistics (COLING), pp. 535–541, 2004.
[3] Y.J. Kumar and N. Salim. Automatic Multi Document Summarization Approaches. Journal of Computer Science, Vol. 8, No. 1, pp. 133–140, 2012.
[4] M.G. Ozsoy, I. Cicekli, and F.N. Alpaslan. Text Summarization of Turkish Texts using Latent Semantic Analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), pp. 869–876, Stroudsburg, PA, USA. Association for Computational Linguistics, 2010.
[5] V. Gupta and G.S. Lehal. A Survey of Text Summarization Extractive Techniques. Journal of Emerging Technologies in Web Intelligence, Vol. 2, No. 3, August 2010.
[6] G. Erkan and D.R. Radev. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research, Vol. 22, pp. 457–479, 2004.
[7] H.P. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, Vol. 2, No. 2, pp. 159–165, 1958.
[8] J. Steinberger and M. Křišťan. LSA-Based Multi-Document Summarization. In 8th International PhD Workshop on Systems and Control, a Young Generation Viewpoint, pp. 87–91, Balatonfüred, Hungary, 2007.
[9] P.P.D. Tardan, A. Erwin, K.I. Eng, and W. Muliady. Automatic Text Summarization Based on Semantic Analysis Approach for Documents in Indonesian Language. In 2013 International Conference on Information Technology and Electrical Engineering (ICITEE), pp. 47–52, 2013.
[10] G.S. Budhi, R. Intan, Silvia R., and Stevanus R.R. Indonesian Automated Text Summarization. In Proceedings of ICSIIT, pp. 26–27, 2007.
[11] G.A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K.J. Miller. Introduction to WordNet: An On-line Lexical Database. International Journal of Lexicography, 3(4): 235–244, 1990.
[12] F.Z. Tala. A Study of Stemming Effects on Information Retrieval in Bahasa Indonesia, 2003.