A Comparative Study for Arabic Multi-Document Summarization Systems

2017 8th International Conference on Information Technology (ICIT)

Mossab N. Ibrahim, Khulood Abu Maria and Khalid Mohammad Jaber
Faculty of Science and Information Technology, Al-Zaytoonah University of Jordan, Amman, Jordan

Abstract—This paper presents a comparative study of Arabic Multi-Document Summarization Systems (AMD-SS). The existing methods are compared and analyzed to determine which method generates a genuine summary and achieves the best results in comparison with human summarization. The study shows that the area of Arabic automatic text summarization is still underdeveloped and that recent research has not yet produced a fully satisfactory Arabic summary. We therefore propose an Arabic text summarization model built on linear algorithms and parallel computing techniques. The proposed model aims to generate an Arabic Document Summary (ADS) made of fully coherent, grammatical and meaningful Arabic sentences, close to a human summary.

Keywords—Multi-Document Summarization (MDS); Clustering; Graph; Cross Document Structure Theory (CST); Features Selection (FS); Human Summarization.

I. INTRODUCTION

A text summary is a text generated from a single document, or from multiple documents that discuss the same topic, which preserves the meaning of the original while being shorter than it. Text summarization systems can be categorized by input (single document or multi-document, as shown in Figure 1), by aim (generic, user- or topic-focused, or query-based), by number of languages (mono-lingual or multi-lingual) and by output form (extractive or abstractive), as shown in Figure 2. Automatic text summarization is increasingly used in many areas, such as summarizing news articles, emails, business reports and biomedical documents [1].

Fig. 1. Multi-Document Text Summarization [3]

978-1-5090-6332-1/17/$31.00 ©2017 IEEE

Fig. 2. Text Summarization System

Feeding several documents into a summarization system raises major problems, such as distinguishing the differences between the collected documents, guaranteeing coherence and overcoming redundancy. A good summary must contain authentic parts of the source text, have a fair length and consist of unique textual units. Single- and multi-document summarization differ in how they handle merging, speed, compression, redundancy and multilingualism across the documents [2].

The rest of the paper is organized as follows: Section 2 presents the research assumptions, Section 3 the strategy design, Section 4 an overview of multi-document summarization, Section 5 the evaluation of Arabic Summarization Systems (ASS), Section 6 the analysis and findings, Section 7 the proposed model, and Section 8 the conclusion and future work.

II. RESEARCH ASSUMPTIONS

This research addresses the following questions:
• What are the differences between the multi-document text summarization approaches?
• What are the differences in perceptions towards Arabic text documents?
• Which types of Arabic text summarization systems are most widely used by researchers?
• How can we improve the efficiency and speed of the text summarization process?
• Can we generate a final summary in real time?

III. STRATEGY DESIGN

This work proposes a model that generates a structured Arabic summary. The model is intended to be accurate, coherent and complete, based on Arabic grammar and the special features of the language. It is designed to improve the performance of the summarization process by using parallel computing and to address the drawbacks of previous methods.

IV. MULTI-DOCUMENT SUMMARIZATION: OVERVIEW

Summarization techniques fall into two general categories, abstractive and extractive. The existing multi-document summarization techniques are judged by the quality and fluency of their summaries using recall, precision and F-measure. Precision is the ratio of retrieved relevant sentences to the total number of retrieved sentences. Recall is the ratio of retrieved relevant sentences to the total number of relevant sentences in the collection. The F-measure combines the two into a single accuracy score that reaches its best value at 1 and its worst at 0 [3].

A. Clustering Based Method

Clustering is the process of grouping similar sentences together [3]. The similarity between the sentences in a document is measured, and sentences that are highly similar to each other are placed in the same cluster, so that every cluster contains sentences on the same subject. The cosine similarity measure is typically used to measure the similarity between two sentences. After clustering, sentences are selected from each cluster based on their closeness to the top-ranking TF-IDF terms in that cluster. The selected sentences are then put together to form the final summary [4]. Table 1 presents the main clustering studies.

TABLE 1. MAIN CLUSTERING TECHNIQUES

Author & Year: Kaur & Chopra (2016)
Methods & Result: Calculated TF*IDF weights of the words using the WordNet dictionary and grouped them into clusters with the k-means algorithm. Some tokens were chosen randomly as initial centroids and compared by Euclidean distance. Once the clusters were stable, the tokens were grouped according to their frequencies in the respective clusters, the sentences containing those words were selected from the documents, and cluster summaries were generated on the basis of word rank.
Language & Data Set: English; newspapers
Limitation: The time taken for clustering increased as the number of clusters increased.

Author & Year: Fejer & Omar (2015)
Methods & Result: Text was pre-processed; two clustering techniques were applied to the documents, hierarchical clustering (single-linkage and complete-linkage) and k-means; key phrases were extracted and ranked; important split sentences were extracted and scored; the similarity between sentences was then measured using the Jaccard coefficient and cosine similarity. Accuracy of 43.4%.
Language & Data Set: Arabic; DUC 2002
Limitation: It takes a long time.

Author & Year: Waheeb & Husni (2014)
Methods & Result: Text was pre-processed; k-means randomly chose sentences as the initial centroid of each cluster; sentences were represented by weight vectors and compared with a sentence-to-sentence metric based on cosine similarity; the identified information was combined and merged; all sentences from the biggest cluster were selected based on sentence rank. Recall = 0.6 and Precision = 0.6.
Language & Data Set: Arabic; EASC
Limitation: Implementation is missing enhanced clustering for improved results.

Author & Year: Froud et al. (2013)
Methods & Result: A Latent Semantic Analysis model was used for Arabic document clustering in order to address noise; five similarity/distance measures were used (Euclidean distance, cosine similarity, Jaccard coefficient, Pearson correlation coefficient and averaged Kullback-Leibler divergence), each without and with stemming. Clustering performance improved by addressing the problems of noisy information and document length.
Language & Data Set: Arabic; CCA
Limitation: Concentrated on the sentences, which increased time complexity.

Author & Year: Schlesinger et al. (2008)
Methods & Result: Documents were prepared, sentences were trimmed and ranked, redundancy was reduced and sentences were sorted. p-value = 0.2435.
Language & Data Set: English / Arabic; DUC 2004
Limitation: Non-English language tasks, such as sentence splitting and lemmatization, needed improvement.
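The clustering-based selection described in Section IV-A can be illustrated with a minimal sketch. It is not the implementation of any system in Table 1: the whitespace tokenizer, the smoothed IDF formula, the deterministic seeding with the first k sentences, and the function names (`tfidf_vectors`, `kmeans`, `summarize`) are all assumptions made for illustration, and the representative sentence is chosen by closeness to the cluster centroid as a simplification of "closeness to the top-ranking TF-IDF terms".

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (term -> weight) per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        # Smoothed IDF (an assumption) so that very common terms keep a small weight.
        vectors.append({t: f * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
                        for t, f in term_freq.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, iterations=10):
    """Plain k-means with cosine similarity; seeded with the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return labels, centroids


def summarize(sentences, k):
    """Pick, from each cluster, the sentence closest to the cluster centroid."""
    vectors = tfidf_vectors(sentences)
    labels, centroids = kmeans(vectors, k)
    summary = []
    for c in range(k):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            best = max(members, key=lambda i: cosine(vectors[i], centroids[c]))
            summary.append(sentences[best])
    return labels, summary
```

Calling `summarize(sentences, k)` returns the cluster label of every sentence together with one representative sentence per non-empty cluster. Because the centroid reflects only frequently occurring terms, the sketch also exhibits the limitation discussed for this family of methods.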

Cluster-based methods have been successful in representing diversity and eliminating redundancy within each topic, which is their main advantage, especially when handling multiple documents. In terms of summary quality, however, the results have been unpromising. A summary cannot be sufficiently meaningful and close to a human summary if the importance of a sentence is judged only on the basis of the clusters, because in clustering-based methods the sentences are ultimately ranked by their similarity to the cluster centroid, which merely represents frequently occurring terms [5].

B. Graph Based Method

The graph approach is used to model the links that exist between text units. For text documents, a vertex denotes a sentence and an edge carries the weight between two sentences. A set of documents can therefore be represented as a graph in which each sentence is a vertex and the weight of each edge corresponds to the similarity between the two sentences it connects. The most widely used similarity measure is cosine similarity [6]. An edge exists if the similarity weight is above a predefined threshold. Once the graph has been built for a set of documents, the important sentences are identified, following the idea that a sentence is significant if it is strongly connected to many other sentences. This differs from the cluster-based method, where sentences are ranked by their closeness to the cluster centroid. Two well-known graph-based ranking algorithms are HITS [7] and Google's PageRank [8]. Table 2 compares the main graph methods.

TABLE 2. MAIN GRAPH TECHNIQUES

Author & Year: Alwan & Onsi (2016)
Methods & Result: Represented the multiple documents as a directed weighted graph. Applied structural rules to generate the summary sentences. Refined the sentences that contained unwanted parts and added them to the final summary. Reached an 88% reduction ratio.
Language & Data Set: Arabic; 1651 documents collected
Limitation: Lacks Arabic semantic processing; adding dictionaries would maximize the reduction ratio.

Author & Year: Alami (2015)
Methods & Result: Text was pre-processed; an undirected weighted graph was constructed for each document, with sentences as nodes and similarities as edges; the weighted PageRank algorithm was run on the graph to generate a salience score for each sentence. The sentences were ranked according to their salience scores, and the top-ranking sentences were selected to form the summary of the input document, with a filter to remove redundant information. F1-measure = 0.75.
Language & Data Set: Arabic; 25 documents collected
Limitation: Needs semantic analysis; few documents.

Author & Year: Hariharan et al. (2013)
Methods & Result: A modified LexRank method, developed from a modification of the most popular page ranking algorithms. A link between two sentences is treated as a vote cast from one sentence to the other; the score of a sentence is determined by the votes cast for it and by the scores of the sentences casting those votes. Cosine similarity was used. With the discounting technique, once a sentence is selected the corresponding row and column of the matrix are set to zero, so that the chance of repeating information in the succeeding sentences is minimized and the summary stays cohesive and meaningful. Sentence position can be taken into account by preferring the sentence that occurs earlier in the two documents considered. Obtained higher results than the best DUC results.
Language & Data Set: English; DUC 2002
Limitation: The summaries generated by the approach lack meaningfulness.

Author & Year: Xiaojun Wan (2008)
Methods & Result: Incorporated document importance and sentence-to-document correlation into the sentence ranking process. Sentences that belonged to an important document and were highly correlated with it were chosen for the summary. ROUGE-1 = 0.39013.
Language & Data Set: English; DUC 2002
Limitation: Tested for the English language only. The generated summaries lack meaningfulness.

The graph-based approach has received encouraging feedback from the multi-document summarization research community, as it is able to recognize ‘prestigious’ sentences across the documents [9]. The resulting graph can also capture diverse topics through its independent sub-graphs. However, the approach relies heavily on sentence similarity to build the graph, without “understanding” the relations between the sentences.
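As a rough illustration of the graph-based ranking described in this section, the sketch below builds a sentence graph from thresholded cosine similarities and ranks the vertices with a weighted PageRank-style iteration. The bag-of-words tokenizer, the threshold of 0.1, the damping factor of 0.85, the uniform redistribution of score from isolated sentences, and the function names are assumptions chosen for the example, not values taken from the surveyed systems.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sentences treated as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[term] for term, count in va.items())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def build_graph(sentences, threshold=0.1):
    """Symmetric weight matrix; an edge exists only above the threshold."""
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(sentences[i], sentences[j])
            if sim > threshold:
                weights[i][j] = weights[j][i] = sim
    return weights


def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; scores always sum to 1."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        # Score held by isolated sentences is spread uniformly over all vertices.
        dangling = sum(scores[j] for j in range(n) if out_sum[j] == 0)
        new_scores = []
        for i in range(n):
            received = sum(scores[j] * weights[j][i] / out_sum[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) / n
                              + damping * (received + dangling / n))
        scores = new_scores
    return scores
```

Sorting sentence indices by descending score gives the extraction order. A sentence with no edges receives only the baseline score, which matches the idea that a sentence is significant when it is strongly connected to many others.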

C. Cross Document Structure Theory (CST)

Documents that discuss the same topic usually contain semantic relations between their sentences; these are called CST relations. Examples of such relations are “Identity”, “Overlap”, “Description” and “Historical background”. Many studies have used CST to extract the most significant sentences for multi-document summarization [10]. Table 3 presents the main CST techniques.

TABLE 3. MAIN CST TECHNIQUES

Author & Year: ALMAHY et al. (2014)
Methods & Result: Investigated the utility of cross-document relations (CST relations) for identifying the most important sentences in the thread to be included in the summary. Developed sentence scoring based on a model selection technique. F-measure = 0.62.
Language & Data Set: English; BC3
Limitation: Tested for the English language only. The generated summaries lack meaningfulness.

Author & Year: Kumar et al. (2013)
Methods & Result: Developed a new sentence-scoring model based on a voting technique over the identified cross-document relations. Used a Genetic-CBR classifier. F-measure = 84.47%.
Language & Data Set: English; DUC 2002
Limitation: Tested for the English language only.

Author & Year: Jorge & Pardo (2010)
Methods & Result: An initial ranking of the sentences was built; the sentences were then reordered according to the number of CST relations they present, giving a refined ranking in which they received some privilege to be in the summary. F-measure = 0.4994.
Language & Data Set: Portuguese; CST News corpus
Limitation: Needs more sentence ordering methods, as well as more techniques to jointly apply more than one content selection operator.

Author & Year: Miyabe et al. (2008)
Methods & Result: Used a dataset of sentence pairs split into clusters according to their similarity, then built a classifier for each cluster to identify equivalence relations. Adopted a “coarse-to-fine” approach that used the identified equivalence relations to address the task of identifying transition relations.
Language & Data Set: English; 115 sets of related news articles
Limitation: Classification performance needs to be improved.

All of the above works, and others, were applied only to English, Brazilian Portuguese and Japanese texts; no work deals with the Arabic language. An Arabic CST dataset was created by having a human translator translate an existing English dataset annotated with CST relations. It still needs further modification with respect to basic Arabic relations.

D. Extraction Features Based Methods

Extractive summarization selects and extracts the highest-scoring sentences based on individual or combined statistical features, so-called scoring features, e.g., word occurrence. The process typically needs pre-processing steps before the weights of these features are computed, for instance sentence boundary identification (e.g., “.”, “;”, “?”) or word stemming, which avoids counting the morphological variants of a word as different words [11]. Table 4 presents the main FS techniques.

TABLE 4. MAIN FS TECHNIQUES

Author & Year: Al-Zahrani et al. (2015)
Methods & Result: Text was segmented, roots were extracted, and the text was tokenized and decomposed into a set of paragraphs; each paragraph was parsed into sentences; stop words were removed based on their PoS tags; eight state-of-the-art sentence scoring features were used; a PSO-based learning process was applied to the sentence scores. F-measure = 0.73.
Language & Data Set: Arabic; EASC
Limitation: The efficiency needs to be improved.

Author & Year: Al-Thwaib (2014)
Methods & Result: The documents of the data set were tokenized and two copies of the data set were made; Sakhr summarization techniques were applied to the second copy; the first data set was classified using SVM. Accuracy = 94%.
Language & Data Set: Arabic; 800 Arabic text documents
Limitation: Longer execution time.

Author & Year: Abu Kwaik (2011)
Methods & Result: The documents of the data set were tokenized; entity recognition was performed; six features were applied; sentences were scored and the summary was generated. F-measure = 86.5%.
Language & Data Set: Arabic; 200 news documents
Limitation: Needs an improved Arabic entity recognition system and more semantic cohesion.
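The feature-based scoring described in Section IV-D can be sketched as follows. The two features (normalized word occurrence and sentence position), their weights, the tiny English stop-word list and the function names are illustrative assumptions; the systems in Table 4 use their own, larger feature sets and Arabic-specific pre-processing.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not an Arabic resource).
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to"}


def split_sentences(text):
    """Sentence boundary identification on '.', ';' and '?'."""
    return [part.strip() for part in re.split(r"[.;?]", text) if part.strip()]


def score_sentences(sentences, weights=(1.0, 0.5)):
    """Score = w0 * word-occurrence feature + w1 * position feature."""
    tokens = [[w for w in s.lower().split() if w not in STOP_WORDS]
              for s in sentences]
    freq = Counter(w for sent in tokens for w in sent)
    max_freq = max(freq.values(), default=1)
    scores = []
    for pos, sent in enumerate(tokens):
        # Average corpus frequency of the sentence's words, scaled to [0, 1].
        occurrence = (sum(freq[w] for w in sent) / (len(sent) * max_freq)
                      if sent else 0.0)
        # Earlier sentences get a higher position score.
        position = 1.0 - pos / len(sentences)
        scores.append(weights[0] * occurrence + weights[1] * position)
    return scores
```

Because the final ranking depends directly on which features are included and how they are weighted, changing `weights` changes the summary, which is why feature selection is treated as an optimization problem.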

Choosing such features is a complex process, yet it plays a vital role in many areas of natural language processing, such as information retrieval, text classification and text summarization. Since sentence scoring depends on these features, the quality of the output summary is sensitive to the scoring features selected. Consequently, picking effective scoring features can be considered a complex optimization problem [12].

Each of the above methods has its own benefits for multi-document summarization, but each also has issues and limitations. Feature-based methods are knowledge-poor in terms of capturing the contextual information in the sentences and across multiple documents. This is because sentence scoring depends only on a flat feature representation of a sentence and neglects cross-document relationships between text units in different documents. Clustering-based methods raise concerns as well. Sentences are ranked according to their similarity to the cluster centroid, which represents frequently occurring terms, so these methods are also knowledge-poor in that they cannot capture the contextual information in the sentences. Finally, graph-based methods have received positive feedback from the multi-document summarization research community, and the resulting graph can capture diverse topics through its unconnected sub-graphs. However, since this approach depends heavily on sentence similarity to build the graph, it treats a sentence as a bag of words without “understanding” the text. As a result, the final summary is not comprehensive enough, particularly for informative summary generation.

V. EVALUATION OF ARABIC SUMMARIZATION SYSTEMS

Research on multi-document summarization of the Arabic language is in its initial stages compared to the literature on English [13], and most studies produce extractive, single-language, generic summaries.

Research in this field is limited and relatively new compared to the existing literature on other languages, such as English, which leaves great scope for further investigation in Arabic text summarization. One of the main difficulties in Arabic summarization has been the lack of Arabic gold-standard summaries. This situation is beginning to change, especially with the inclusion of Arabic in the corpora and tasks of the TAC 2011 MultiLing Pilot and the ACL 2013 MultiLing Workshop [14]. Building the required corpora and adopting them in Arabic summarization studies remains a major need. To the best of our knowledge, there are no Arabic text summarization systems that can generate abstractive summaries, although some research is still trying to construct an Arabic text abstractor. The researchers in [15] categorize their second method as abstractive, but the paper does not make clear whether anything new was added to the summary or any modifications were made to the final text. Generating abstractive summaries is expected to be one of the main challenges in the field of automatic text summarization [16]. Arabic text summarization studies need to get ahead of the curve by exploring semantic-based techniques and building more abstractive summarizers. Table 5 presents the main research on Arabic summarization systems.

TABLE 5. ARABIC SUMMARIZATION SYSTEMS
(A comparison of the available literature on Arabic systems)

Author & Year: Alwan and Onsi (2016)
Corpus: Authors’ corpus
Evaluation Measures: Manual evaluation
Summary Types: Single-document, multi-document, generic, and mono-lingual summaries
Approaches: Textual graph

Author & Year: Alezzani et al. (2016)
Corpus: Arabic Gigaword
Evaluation Measures: ROUGE-1, ROUGE-2, ROUGE-S4, ROUGE-SU4 and manual evaluation
Summary Types: Single-document, multi-document, generic, and mono-lingual summaries
Approaches: Combination of two methods, LSA and an Arabic word morphological model

Author & Year: Belkebir and Guessoum (2015)
Corpus: Authors’ corpus
Evaluation Measures: F-measure
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Machine learning techniques (AdaBoost boosts the SVM classifier)

Author & Year: AL-Khawaldeh and Samawi (2015)
Corpus: Authors’ corpus and EASC corpus
Evaluation Measures: ROUGE-2, ROUGE-L, ROUGE-W, ROUGE-S, AutoSummENG and manual evaluation
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: LCEAS (Lexical Cohesion and Entailment based segmentation for Arabic text Summarization): lexical cohesion and text entailment based segmentation

Author & Year: Fejer and Omar (2014)
Corpus: EASC corpus
Evaluation Measures: ROUGE-1, ROUGE-2, ROUGE-3, and ROUGE-4
Summary Types: Single-document, multi-document, generic, and mono-lingual summaries
Approaches: Clustering methods and keyphrase

Author & Year: Oufaida et al. (2014)
Corpus: EASC corpus and TAC 2011 MultiLing Pilot corpus
Evaluation Measures: ROUGE-1 and ROUGE-2
Summary Types: Single-document, multi-document, generic, mono-lingual, and cross-lingual summaries
Approaches: Clustering algorithm and a discriminant analysis method

Author & Year: Belguith et al. (2014)
Corpus: Authors’ corpus
Evaluation Measures: Recall, precision, and F-measure
Summary Types: Single-document, mono-lingual, opinion, informative, indicative, and personalized summaries
Approaches: Hybrid approach that combines symbolic (rhetorical analysis) and numerical (machine learning) techniques

Author & Year: El-Fishawy et al. (2014)
Corpus: Authors’ corpus
Evaluation Measures: F-measure and NDCG
Summary Types: Multi-post, generic, and mono-lingual summaries
Approaches: Machine learning to summarize Arabic-language Twitter posts, specifically posts in the Egyptian dialect, by selecting a subset of posts related to a specific topic

Author & Year: El-Ghannam and El-Shishtawy (2014)
Corpus: TAC 2011 MultiLing Pilot corpus and authors’ corpus
Evaluation Measures: ROUGE-2, ROUGE-S and manual evaluation
Summary Types: Multi-document, generic, and mono-lingual summaries
Approaches: Keyphrase-based techniques Sen-Rich and Doc-Rich

Author & Year: El-Haj and Rayson (2013)
Corpus: MultiLing 2013 corpus
Evaluation Measures: ROUGE-1, ROUGE-2, ROUGE-SU4 (for multi-document summaries), MeMoG-AutoSummENG, NPowER and manual evaluation
Summary Types: Single-document, multi-document, generic, and mono-lingual summaries
Approaches: Selecting sentences that have the highest sum of their words’ log-likelihood scores

Author & Year: Ibrahim and Elghazaly (2013a)
Corpus: Authors’ corpus
Evaluation Measures: Precision
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: A rhetorical representation of the text using RST and a vector representation based on the VSM

Author & Year: Imam et al. (2013)
Corpus: Authors’ corpus and EASC corpus
Evaluation Measures: ROUGE-L
Summary Types: Single-document, query-driven, and mono-lingual summaries
Approaches: OSSAD, an Ontology-based Summarization System for Arabic Documents that uses a machine learning approach

Author & Year: Froud et al. (2013)
Corpus: Corpus of Contemporary Arabic (Al-Sulaiti and Atwell 2006)
Evaluation Measures: Purity and entropy
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Clustering approach

Author & Year: Haboush et al. (2012)
Corpus: Authors’ corpus
Evaluation Measures: Recall and precision
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Used the weight of the word root, instead of the weight of the word itself, to rank each sentence

Author & Year: Alotaiby et al. (2012)
Corpus: Arabic Gigaword
Evaluation Measures: ROUGE-1, ROUGE-L, ROUGE-W, ROUGE-SU and manual evaluation
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Exact-word matching and character cross-correlation, hidden Markov model (HMM), and exploring different bigram language models

Author & Year: Azmi and Al-Thanyyan (2012)
Corpus: Authors’ corpus
Evaluation Measures: ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-W, ROUGE-S4, ROUGE-SU4, recall, precision, and F-measure
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: A rhetorical representation of the text using RST

Author & Year: El-Haj et al. (2011a)
Corpus: DUC 2002 corpus and corpus of El-Haj et al. (2011b)
Evaluation Measures: ROUGE-1, recall, and precision
Summary Types: Multi-document, generic, and mono-lingual summaries
Approaches: Clustering approach

Author & Year: El-Haj et al. (2011b)
Corpus: DUC 2002 corpus and authors’ parallel corpus by MT of DUC 2002
Evaluation Measures: ROUGE-1, recall, and precision
Summary Types: Multi-document, generic, and mono-lingual summaries
Approaches: Clustering approach

Author & Year: El-Haj et al. (2011c)
Corpus: TAC 2011 MultiLing Pilot corpus
Evaluation Measures: ROUGE-1, ROUGE-2, ROUGE-SU4, MeMoG-AutoSummENG and manual evaluation
Summary Types: Multi-document, generic, and mono-lingual summaries
Approaches: Clustering approach

Author & Year: Boudabous et al. (2010)
Corpus: Authors’ corpus
Evaluation Measures: Recall, precision, and F-measure
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Machine learning techniques (SVM algorithm to classify each sentence)

Author & Year: Al-Radaideh and Afif (2009)
Corpus: Authors’ corpus
Evaluation Measures: Recall, precision, F-measure, compression rate, omission rate, and retention ratio
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Used a noun-based aggregate similarity method originally proposed for the Korean language

Author & Year: Fattah and Ren (2009)
Corpus: DUC 2001 corpus and authors’ corpus
Evaluation Measures: ROUGE-1, recall, and precision
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Genetic algorithms (GA), mathematical regression (MR), feed forward neural networks (FFNN), probabilistic neural networks (PNN), and Gaussian mixture models (GMM)

Author & Year: El-Haj and Hammo (2008)
Corpus: Authors’ corpus
Evaluation Measures: Manual evaluation
Summary Types: Single-document, query-driven, and mono-lingual summaries
Approaches: Ranking each paragraph by the cosine similarity measure to extract the relevant paragraphs

Author & Year: Schlesinger et al. (2008)
Corpus: MSE corpora
Evaluation Measures: ROUGE-2 and manual evaluation
Summary Types: Single-document, multi-document, generic, query-driven, mono-lingual, and multi-lingual summaries
Approaches: CLASSY (Clustering, Linguistics, And Statistics for Summarization Yield)

Author & Year: Sobh et al. (2007)
Corpus: Authors’ corpus
Evaluation Measures: Recall, precision, and F-measure
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Machine learning techniques (Naïve Bayesian classifiers and genetic programming (GP))

Author & Year: Al-Sanie (2005)
Corpus: Authors’ corpus
Evaluation Measures: Precision
Summary Types: Single-document, generic, and mono-lingual summaries
Approaches: Based on RST, where 11 relations were identified and used in the summarizing process

Author & Year: Douzidia and Lapalme (2004)
Corpus: DUC 2004 corpus
Evaluation Measures: ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, and ROUGE-W
Summary Types: Single-document, generic, and cross-lingual summaries
Approaches: Linear combination of four features
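Most of the systems in Table 5 report recall, precision, F-measure or a ROUGE-n score. The sketch below shows how these measures can be computed for an extractive summary, assuming sentences are identified by ids and text is tokenized on whitespace. The function names are hypothetical, and `rouge_n_recall` is a simplified single-reference n-gram recall rather than the official ROUGE toolkit.

```python
from collections import Counter


def precision_recall_f(selected, reference):
    """Precision, recall and F-measure over sets of sentence ids."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def rouge_n_recall(candidate, reference, n=1):
    """Share of the reference n-grams that also occur in the candidate."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    # Clipped counts: a reference n-gram is matched at most as often as it
    # appears in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total if total else 0.0
```

For example, a system that selects sentences 1 to 4 when the human summary contains sentences 2, 4 and 5 has precision 1/2, recall 2/3 and F-measure 4/7.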

VI. ANALYSIS AND FINDINGS

After reviewing the best-known Arabic Summarization Systems (ASS), we conclude that they have focused on the summarization process itself without giving any importance to real-time multi-document processing. It is therefore necessary to work on quality as well as high performance when summarizing a set of documents. In addition, existing research still lacks a final gold-standard summary: no system yet produces fully coherent, grammatical and meaningful Arabic sentences, or a summary that is close to a human summary. It is important to close these two gaps in the field of Arabic Automatic Summarization Systems (ASS), as research has not provided a satisfactory solution for them so far.

VII. PROPOSED MODEL

Building a structured Arabic summary that is accurate, coherent and complete with respect to Arabic grammar, and that avoids the drawbacks of previous methods, is a big challenge. We therefore build a modified model for Arabic multi-document summarization that aims to increase the efficiency and performance of the summarization process through parallel computing. The model addresses the gap in the field of Arabic Automatic Summarization Systems (ASS) by working on the final summary. In addition, the model is intended to run on a personal computer, make optimized use of its capabilities and deliver results in real time with high performance and accuracy.

The proposed framework for Arabic multi-document text summarization starts by pre-processing the text. A linear clustering algorithm is then used to group the documents into clusters. Key-phrase extraction identifies the important key-phrases in each cluster, which helps to distinguish the most important sentences. A redundancy elimination technique is applied during merging to keep one sentence from each group of similar sentences and ignore the others. Finally, a summary builder constructs the structured Arabic summary. Figure 3 shows the phases and overall architecture of the proposed model, which are outlined below.
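The clustering step of this framework uses bisecting k-means. Since the model has not been implemented yet, the following is only a minimal sequential sketch of that algorithm under stated assumptions: documents are already represented as equal-length numeric vectors, the largest cluster is the one split at each step, and each split is seeded with the first point and the point farthest from it. The function names are hypothetical, and the parallel execution the model calls for is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def two_means(points, iterations=10):
    """Split one cluster in two with plain 2-means."""
    c0 = points[0]
    c1 = max(points, key=lambda p: squared_distance(p, c0))
    left, right = list(points), []
    for _ in range(iterations):
        left = [p for p in points
                if squared_distance(p, c0) <= squared_distance(p, c1)]
        right = [p for p in points
                 if squared_distance(p, c0) > squared_distance(p, c1)]
        if not left or not right:
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break  # nothing left to split
        clusters.remove(largest)
        left, right = two_means(largest)
        if not left or not right:
            clusters.append(largest)  # identical points cannot be separated
            break
        clusters.extend([left, right])
    return clusters
```

Each bisection touches only the points of one cluster, which keeps every step simple; other choices of which cluster to split (for example, the one with the largest internal spread) would fit the same structure.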

Text Pre-Processing

Bisecting K-means Clustering

Noun and Verb phrase Extraction Build a Ranking Matrix

2017 8th International Conference on Information Technology (ICIT) Featured Sentences Selection Sentence Splitting High Ranked Sentence Merging Techniques

developing a complete Text summarization for Arabic language, we have suggested a new model that will cover the lacks in the Arabic summarization field. By working on the final summary and giving the results in real time with high performance based on parallel programming techniques. Our future work will focus on: (1) Implement the model; (2) Perform many experiments to validate out the model; (3) improved the foundations of the selections sentences based on the new rules which are built to produce correct (acceptable) Arabic language grammar. REFERENCES

Summary Builder [1]

Final Summary Fig. 3. Overall architecture of the Proposed Model

[2]

[3]

Model Main Phases: First Stage: input single or multi-documents to summarize from the required articles from different domains (subjects), such as education, sports and politics. These documents will contain texts of various sizes. Second Stage: will include four steps, namely, tokenizing, eliminating Arabic stop words, light stemming and text representation and term weighting. Third Stage: is clustering which decreases the amount of information by categorizing and grouping similar data. The Bisecting k-means algorithm is easy to apply, and it has a linear complexity [17] and has proven speed in clustering stage. Fourth Stage: extracts noun and verb keyphrases that reflects the subject (domain) of the text. Then matches the frequently occurring noun / verb phrases in a text. Finally, we will use multi features to rank them. Fifth Stage: extracts the most remarkable sentences from each cluster after splitting the cluster into several paragraphs and sentences using delimiters (eg. full stop and question mark). Then eliminates the redundancy by using similarity measures. Sixth Stage: all of the remarkable sentences will entered in a summary builder wich will regenerate them on fully coherent, grammatical and meaningful Arabic sentences.


CONCLUSION AND FUTURE WORK

To conclude, the main objective of this paper is to provide an understanding of the current issues in this area, in support of better future academic research and industrial practice in text summarization. We have presented a comparative study of the state of the art in text summarization approaches, especially for the Arabic language. We compared and discussed various text summarization techniques. We also discussed the challenges as well as future research directions in


[1] Yogan Jaya Kumar, O.S.G., Halizah Basiron, Ngo Hea Choon and Puspalata C Suppiah: ‘A Review on Automatic Text Summarization Approaches’, Journal of Computer Science, 2016, pp. 178-190
[2] ALTMANN, Gabriel; KÖHLER, Reinhard. Forms and degrees of repetition in texts: detection and analysis. Walter de Gruyter GmbH & Co KG, 2015.
[3] S.Sathya Bama, M.S.I.A., A.Saravanan: ‘A Survey on Performance Evaluation Measures for Information Retrieval System’, International Research Journal of Engineering and Technology (IRJET), May 2015, Volume 02, (Issue 02), pp. 1015-1020
[4] V.Abinaya, M.V., N.Padmanabhan: ‘Sentence Level Text Clustering using a Hierarchical Fuzzy Relational Clustering Algorithm’, International Journal of Communication and Computer Technologies, 02 March 2014, Volume 02, (Issue 02), pp. 50-55
[5] Deshmukh, P.S.T.K.a.P.Y.S.: ‘Analysis of Query Dependent Multi-Document Summarization using Feature based and Cluster based Methods’, International Journal of Latest Trends in Engineering and Technology, 2016, Vol 7, (Issue 1)
[6] ERKAN, Günes; RADEV, Dragomir R. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 2004, 22: 457-479.
[7] KLEINBERG, Jon M. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 1999, 46.5: 604-632.
[8] BRIN, Sergey; PAGE, Lawrence. The anatomy of a large-scale hypertextual web search engine. Computer networks and ISDN systems, 1998, 30.1: 107-117.
[9] Salim, Y.J.K.a.N.: ‘Automatic Multi Document Summarization Approaches’, Journal of Computer Science, 2012, 8 (1), pp. 133-140
[10] Harazin, K.S.A.: ‘Multi-document Arabic Text Summarization’, Islamic University, Gaza, Palestine, April 2015
[11] Al-Thwaib, E.: ‘Text Summarization as Feature Selection for Arabic Text Classification’, World of Computer Science and Information Technology Journal (WCSIT), 2014, Vol. 4, pp. 101-104
[12] Linli Xu, Q.Z., Aiqing Huang, Wenjun Ouyang, Enhong Chen: ‘Feature Selection with Integrated Relevance and Redundancy Optimization’. Proc. IEEE International Conference on Data Mining, 2015
[13] Mahmoud El-Haj, U.K., and Chris Fox: ‘Exploring Clustering for Multi-document Arabic Summarisation’, Springer, 2011, pp. 550–561
[14] Al-Saleh, Asma Bader, and Mohamed El Bachir Menai. "Automatic Arabic text summarization: a survey", Artificial Intelligence Review, 2015.
[15] Alotaiby F, Foda S, Alkharashi I: ‘New approaches to automatic headline generation for Arabic documents’. J Eng Comput Innov, 2012, 3(1):11–25
[16] LLORET, Elena; PALOMAR, Manuel. Text summarisation in progress: a literature review. Artificial Intelligence Review, 2012, 37.1: 1-41.
[17] CIMIANO, Philipp; HOTHO, Andreas; STAAB, Steffen. Learning concept hierarchies from text corpora using formal concept analysis. J. Artif. Intell. Res. (JAIR), 2005, 24.1: 305-339.
[18] Awoyelu I.O., A.R.O., Olaniran A.T., Amoo A.O, Mabude C.N.: ‘Performance Evaluation of an Improved Model for Keyphrase Extraction in Documents’, Computer Science and Information Technology, 2016, 4(1), pp. 33-43
[19] Binwahlan, M.S.: ‘Extractive Summarization Method for Arabic Text – ESMAT’, International Journal of Computer Trends and Technology (IJCTT), Mar 2015, Volume 21
[20] D.Y. Sakhare, D.R.K.: ‘Syntactic and Sentence Feature Based Hybrid Approach for Text Summarization’, I.J. Information Technology and Computer Science, 2014, 03, pp. 38-46
[21] Deshmukh, P.S.T.K.a.P.Y.S.: ‘Analysis of Query Dependent Multi-Document Summarization using Feature based and Cluster based Methods’, International Journal of Latest Trends in Engineering and Technology, 2016, Vol 7, (Issue 1)
[22] Hassan M. Najadat, M.N.A.-K., Ismail I. Hmeidi, Maysa Mahmoud Bany Issa: ‘Automatic Keyphrase Extractor from Arabic Documents’, International Journal of Advanced Computer Science and Applications, 2016, Vol. 7, (No. 2)
[23] Hesham Ahmed Hassan, M.Y., Khaled Bahnassy, Amira M. Idrees, Fatma Gamal: ‘Arabic Documents Classification Method a Step towards Efficient Documents Summarization’, International Journal on Recent and Innovation Trends in Computing and Communication, 2015, Volume 3, (Issue 1)
[24] Omar, H.N.F.a.N.: ‘Automatic Multi-Document Arabic Text Summarization Using Clustering and Keyphrase Extraction’, Journal of Artificial Intelligence, 2015, 8 (1), pp. 1-9
[25] Yu Kou, Q.W., Xue Li, and Sanliang Hong: ‘Fast Clustering Based on State Learning Machine’, World Congress on Intelligent Control and Automation (WCICA), 2016
[26] A., Muneer, and Hoda M. "A Proposed Textual Graph Based Model for Arabic Multi-document Summarization", International Journal of Advanced Computer Science and Applications, 2016.
[27] Karwa, Shweta, and Niladri Chatterjee. "Discrete Differential Evolution for Text Summarization", 2014 International Conference on Information Technology, 2014.
[28] Nabil Alami, Mohammed Meknassi, Said Alaoui Ouatik, NourEddine Ennahnahi. "Impact of stemming on Arabic text summarization", 2016 4th IEEE International Colloquium on Information Science and Technology (CiSt), 2016
[29] Hariharan, Shanmugasundaram; Ramkumar, Thirunavukarasu and Srinivasa, Rengaramanujam. "Enhanced Graph Based Approach for Multi Document Summarization", International Arab Journal of Information Technology (IAJIT), 2013.
[30] AL-Khawaldeh F, S.: ‘Lexical cohesion and entailment based segmentation for Arabic text summarization (lceas)’, World Comput Sci Inf Technol J, 2015, 5(3), pp. 51–60
[31] Al-Radaideh Q, Afif M: ‘Arabic text summarization using aggregate similarity’. In: International Arab conference on information technology (ACIT2009), Yemen, 2009.
[32] Hanane Froud, I.S.a.A.L.: ‘An Efficient Approach to Improve Arabic Documents Clustering Based on a New Keyphrases Extraction Algorithm’, Computer Science & Information Technology (CS & IT), 2013, pp. 243–256
[33] Judith D. Schlesinger, D.P.O.L., and John M. Conroy: ‘Arabic/English Multi-document Summarization with CLASSY—The Past and the Future’, Springer-Verlag Berlin Heidelberg, 2008, pp. 568–581
[34] Ibrahim Almahy, N.S., Yogan Jaya Kumar, Ameer Tawfik: ‘Discussion Summarization Based On Cross-Document Relation Using Model Selection Technique’, Advances in Neural Networks, Fuzzy Systems and Artificial Intelligence, 2014, pp. 218-225
[35] Maria Lucía del Rosario Castro Jorge, T.A.S.P.: ‘Experiments with CST-based Multidocument Summarization’, 16 July 2010, pp. 74–82
[36] Yasunari Miyabe, H.T., Manabu Okumura: ‘Identifying Cross-Document Relations between Sentences’, 2008, pp. 141-148
[37] Yogan Jaya Kumar, N.S., Albaraa Abuobieda, Ameer Tawfik: ‘Multi Document Summarization Based On Cross-Document Relation Using Voting Technique’. Proc. International Conference on Computing, Electrical and Electronic Engineering (ICCEEE), 2013
[38] Borhan Samei, S.H., Fazel Keshtkar, Marzieh Eshtiagh: ‘Multi-Document Summarization Using Graph-Based Iterative Ranking Algorithms and Information Theoretical Distortion Measures’, Association for the Advancement of Artificial Intelligence (www.aaai.org), 2014
[39] Khushboo S. Thakkar, D.R.V.D., M. B. Chandak: ‘Graph-Based Algorithms for Text Summarization’. Proc. Third International Conference on Emerging Trends in Engineering and Technology, 2010
[40] Wan, X.: ‘An Exploration of Document Impact on Graph-Based Multi-Document Summarization’, 2008, pp. 755–762
[41] Ahmed M. Al-Zahrani, H.M., Hassan Abdalla: ‘PSO-Based Feature Selection for Arabic Text Summarization’, Journal of Universal Computer Science, 2015, vol. 21, pp. 1454-1469
[42] Al-Thwaib, E.: ‘Text Summarization as Feature Selection for Arabic Text Classification’, World of Computer Science and Information Technology Journal (WCSIT), 2014, Vol. 4, pp. 101-104
[43] Shaheen M, Ezzeldin A: ‘Arabic question answering: systems, resources, tools, and future trends’. Arabian J Sci Eng, 2014, 39(6):4541–4564
[44] Sobh I, Darwish N, Fayek M: ‘An optimized dual classification system for Arabic extractive generic text summarization’. In: Proceeding of the 7th conference on language engineering, 2007
[45] Schlesinger J, O'Leary D, Conroy J: ‘Arabic/English multi-document summarization with CLASSY: the past and the future’. In: Gelbukh A (ed) Computational linguistics and intelligent text processing, lecture notes in computer science, vol 4919. Springer, Berlin, 2008, pp 568–581
[46] Ryding K: ‘A reference grammar of modern standard Arabic’. Cambridge University Press, Cambridge, 2005.
[47] Al-Radaideh Q, Afif M: ‘Arabic text summarization using aggregate similarity’. In: International Arab conference on information technology (ACIT2009), Yemen, 2009.
[48] Al-Saeedan W, Menai M: ‘Swarm intelligence for natural language processing’. Int J Artif Intell Soft Comput, 2015, 5(2):117–150
[49] Al-Sanie W: ‘Towards an infrastructure for Arabic text summarization using rhetorical structure theory’. Master’s thesis, King Saud University, Riyadh, 2005.
[50] Al-Sulaiti L, Atwell ES: ‘The design of a corpus of contemporary Arabic’. Int J Corpus Linguist, 2006, 11(2):135–171
[51] Azmi AM, Al-Thanyyan S: ‘A text summarizer for Arabic’. Comput Speech Lang, 2012, 26(4):260–273
[52] Bassiouney R, Katz EG: ‘Arabic language and linguistics’. Georgetown University Press, Washington, DC, 2012.
[53] Belguith L, Ellouze M, Maaloul M, Jaoua M, Jaoua F, Blache P: ‘Automatic summarization’. In: Zitouni I (ed) Natural language processing of Semitic languages, theory and applications of natural language processing. Springer, Berlin, 2014, pp 371–408
[54] FEJER, Hamzah Noori; OMAR, Nazlia. Automatic Multi-Document Arabic Text Summarization Using Clustering and Keyphrase Extraction. Journal of Artificial Intelligence, 2015, 8.1: 1-9.
[55] WAHEEB, Samer Abdulateef; HUSNI, Husniza. Multi-Document Arabic Summarization Using Text Clustering to Reduce Redundancy. International Journal of Advances in Science and Technology (IJAST), 2014, 2.1: 194-199.
[56] FROUD, Hanane; LACHKAR, Abdelmonaime; OUATIK, Said Alaoui. Arabic text summarization based on latent semantic analysis to enhance Arabic documents clustering. arXiv preprint arXiv:1302.1612, 2013.
[57] SCHLESINGER, Judith D.; O’LEARY, Dianne P.; CONROY, John M. Arabic/English multi-document summarization with CLASSY—the past and the future. In: International Conference on Intelligent Text Processing and Computational Linguistics. Springer Berlin Heidelberg, 2008. p. 568-581.

[58] ALWAN, Muneer A.; ONSI, Hoda M. A Proposed Textual Graph Based Model for Arabic Multi-document Summarization. International Journal of Advanced Computer Science & Applications, 2016, 1.7: 435-439.
[59] ALAMI, Nabil, et al. Arabic text summarization based on graph theory. In: Computer Systems and Applications (AICCSA), 2015 IEEE/ACS 12th International Conference of. IEEE, 2015. p. 1-8.
[60] HARIHARAN, Shanmugasundaram; RAMKUMAR, Thirunavukkarasu; SRINIVASAN, Rengaramanujam. Enhanced graph based approach for multi document summarization. Int. Arab J. Inf. Technol., 2013, 10.4: 334-341.
[61] WAN, Xiaojun. An exploration of document impact on graph-based multi-document summarization. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008. p. 755-762.
[62] ALMAHY, Ibrahim, et al. Discussion summarization based on Cross-document relation using model selection technique. Advances in Neural Networks, Fuzzy Systems and Artificial Intelligence, 2014, 218-229.
[63] KUMAR, Yogan Jaya, et al. Multi document summarization based on cross-document relation using voting technique. In: Computing, Electrical and Electronics Engineering (ICCEEE), 2013 International Conference on. IEEE, 2013. p. 609-614.
[64] CASTRO JORGE, Maria Lucía del Rosario; PARDO, Thiago Alexandre Salgueiro. Experiments with CST-based multidocument summarization. In: Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language Processing. Association for Computational Linguistics, 2010. p. 74-82.
[65] MIYABE, Yasunari; TAKAMURA, Hiroya; OKUMURA, Manabu. Identifying Cross-Document Relations between Sentences. In: IJCNLP. 2008. p. 141-148.
[66] AL-ZAHRANI, Ahmed M.; MATHKOUR, Hassan; ABDALLA, Hassan Ismail. PSO-Based Feature Selection for Arabic Text Summarization. J. UCS, 2015, 21.11: 1454-1469.
[67] AL-THWAIB, Eman. Text summarization as feature selection for Arabic text classification. World of Computer Science and Information Technology Journal (WCSIT), 2014, 4.7: 101-104.
[68] KWAIK, Kathrein Abu. Automatic Arabic Text Summarization System (AATSS) Based on Semantic Feature Extraction. 2011. PhD Thesis. Islamic University of Gaza.
[69] BELKEBIR, Riadh; GUESSOUM, Ahmed. A supervised approach to Arabic text summarization using adaboost. In: New Contributions in Information Systems and Technologies. Springer International Publishing, 2015. p. 227-236.
[70] AL-KHAWALDEH, F.; SAMAWI, V. Lexical cohesion and entailment based segmentation for Arabic text summarization (lceas). The World of Computer Science and Information Technology Journal (WSCIT), 2015, 5.3: 51-60.
[71] OUFAIDA, Houda; NOUALI, Omar; BLACHE, Philippe. Minimum redundancy and maximum relevance for single and multi-document Arabic text summarization. Journal of King Saud University-Computer and Information Sciences, 2014, 26.4: 450-461.
[72] IMAM, Ibrahim, et al. An ontology-based summarization system for Arabic documents (ossad). International Journal of Computer Applications, 2013, 74.17.
[73] IBRAHIM, Ahmed; ELGHAZALY, Tarek. Improve the automatic summarization of Arabic text depending on Rhetorical Structure Theory. In: Artificial Intelligence (MICAI), 2013 12th Mexican International Conference on. IEEE, 2013. p. 223-227.
[74] El-Haj, Mahmoud, and Paul Rayson. "Using a keyness metric for single and multi document summarisation." Association for Computational Linguistics, 2013.
[75] ALY, Walid Mohamed; SHARABY, Wafaa Hanna; KELLENY, Hany Atef. A New Machine Learning Approach for Arabic/English Documents Classification. 2013.
[76] EL-GHANNAM, Fatma; EL-SHISHTAWY, Tarek. Multi-topic multi-document summarizer. arXiv preprint arXiv:1401.0640, 2014.
[77] EL-FISHAWY, Nawal, et al. Arabic summarization in twitter social network. Ain Shams Engineering Journal, 2014, 5.2: 411-420.
[78] BELGUITH, Lamia Hadrich, et al. Automatic summarization. In: Natural Language Processing of Semitic Languages. Springer Berlin Heidelberg, 2014. p. 371-408.
[79] FROUD, Hanane; LACHKAR, Abdelmonaime; OUATIK, Said Alaoui. Arabic text summarization based on latent semantic analysis to enhance Arabic documents clustering. arXiv preprint arXiv:1302.1612, 2013.
[80] HABOUSH, Ahmad, et al. Arabic text summarization model using clustering techniques. World of Computer Science and Information Technology Journal (WCSIT) ISSN, 2012, 2221-0741.
[81] BOUDABOUS, Mohamed Mahdi; MAALOUL, Mohamed Hédi; BELGUITH, Lamia Hadrich. Digital learning for summarizing Arabic documents. In: International Conference on Natural Language Processing. Springer Berlin Heidelberg, 2010. p. 79-84.
[82] AL-RADAIDEH, Q.; AFIF, Mohammad. Arabic text summarization using aggregate similarity. In: the international Arab conference on information Technology. 2011.
[83] AZMI, Aqil M.; AL-THANYYAN, Suha. A text summarizer for Arabic. Computer Speech & Language, 2012, 26.4: 260-273.
[84] ELMARHUMY, Mahmoud; FATTAH, Mohamed Abdel; REN, Fuji. Automatic text classification using modified centroid classifier. In: Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on. IEEE, 2009. p. 1-4.
[85] EL-HAJ, Mahmoud O.; HAMMO, Bassam H. Evaluation of query-based Arabic text summarization system. In: Natural Language Processing and Knowledge Engineering, 2008. NLP-KE'08. International Conference on. IEEE, 2008. p. 1-7.
[86] Giannakopoulos, G., El-Haj, M., Favre, B., Litvak, M., Steinberger, J., & Varma, V. (2011). TAC 2011 MultiLing pilot overview.
[87] DOUZIDIA, Fouad Soufiane; LAPALME, Guy. Lakhas, an Arabic summarization system. Proceedings of DUC2004, 2004.