The 2009 International Arab Conference on Information Technology (ACIT'2009), Dec. 2009, Yemen.
ARABIC TEXT SUMMARIZATION USING AGGREGATE SIMILARITY

Qasem A. Al-Radaideh and Mohammad Afif
Computer Information Systems Department, Faculty of IT & Computer Science, Yarmouk University, Irbid 21163, Jordan
[email protected],
[email protected]
ABSTRACT

Summarization is the technique of distilling information from an original source to describe its content, or to provide an indication of that content, producing a short version of the source for the user for a particular task. Summarization is an important technique for managing the large amount of text that people need to read. It also reduces the amount of text people must read to decide whether a document is relevant to their information need.
This paper proposes an Arabic text summarization approach based on an aggregate similarity method originally proposed for Korean text. The approach depends mainly on nouns as indicators of the importance of sentences; hence, noun extraction is its main process. To summarize a given document, the document is segmented into sentences and the sentences are tokenized into words. Noun extraction is performed using fourteen rules that distinguish nouns from other, non-noun words. Next, the frequency of each noun in each sentence and in the whole document is computed, and the similarity between the noun frequencies in the sentence and in the document is calculated using the Inner Product measure. The summation of all similarities of every sentence represents the aggregate similarity; the sentences with the highest similarity values are selected as the summary, where the number of selected sentences is determined by a user-defined threshold value.
People need tools and techniques that give them the ability to access anything they want quickly. When dealing with information, such as searching the internet or reading news, articles, and books, the reader needs to get a brief description of the subject of interest quickly.
The idea of automatic summarization depends on selecting the most important sentences in the document, as in many commercial tools such as AutoSummarize, available in MS Word; these selected sentences represent the summary of the document. The selection is done using extraction techniques based on some features or on the similarity of the sentences. Abstraction is another method used for text summarization; it relies on generating new sentences that represent the same idea but in phrasing not found in the original text. In extraction methods, the sentences in the summary are selected from the original text [2].
To evaluate the proposed approach, a dataset of fifty documents is used and the performance of the approach is evaluated using the Recall and Precision measures. The results obtained were 62% for Precision, 70% for Recall, and 14% for the compression rate. In conclusion, the results are acceptable given the nature of the Arabic language, with its rich vocabulary and complex grammar rules.
There are many benefits to summarization, especially with the increased use of the Internet: summarizing news into SMS or WAP format for mobile phones and PDAs; allowing a computer to read the summarized text aloud, since full written texts can be too long and tedious to listen to; presenting compressed descriptions of search results in search engines; and delivering news subscriptions in which items are summarized before being sent to the user [3].
Keywords: Arabic Language Processing, Information Retrieval, Text Summarization, Aggregate Similarity.

1. INTRODUCTION

The increasing amount of data on the internet, which grows at a rate of 20 terabytes per month [1], makes it difficult to filter and manage the information that people need. Search engines help users find and access the information they desire, but what should one do when there is too much information?
Many different methods have been proposed for automatic summarization. It is still a very important research topic, especially for the Arabic language, which is a
widely spoken language; it is the native language of almost 300 million people and is used by 1.2 billion Muslims in religious ceremonies. Arabic is read and written from right to left and is well known for its rich vocabulary [4]. These features attract researchers to investigate new summarization techniques for Arabic or to evaluate, for Arabic, the techniques available for other languages.
preprocessing and extracting the nouns of the Korean text. The similarity is then computed for each sentence using the Inner Product equation, all similarities in the document are aggregated, and the sentences are ranked by their degree of similarity. The authors claimed that the technique proved efficient and gave good results when compared with other summarization systems and the auto-summary tool available in MS Word. The evaluation produced a recall of 46.6% and a precision of 76.9%.
This paper proposes an Arabic text summarization approach based on an aggregate similarity method which was originally proposed for Korean text. The proposed approach depends mainly on nouns as indicators of the importance of sentences. Hence, the noun extraction process is the main process in the proposed approach.
Leskovec et al. [7] presented a method for summarizing documents by creating a semantic graph of the original document and identifying the substructure of that graph that can be used to extract sentences for a document summary. The method starts with a deep syntactic analysis of the text and, for each sentence, extracts logical form triples. It then applies cross-sentence pronoun resolution, co-reference resolution, and semantic normalization to refine the set of triples and merge them into a semantic graph. This procedure is applied to both documents and corresponding summary extracts. In the evaluation phase, the method achieved an average recall of 75% and precision of 30% when compared with human summarization.
2. RELATED WORK

Automatic text summarization is an appealing research topic, and researchers from multiple perspectives have proposed techniques for automatic text summarization for various languages [5][7][8][16][17]. This section summarizes and evaluates some of these proposed approaches.
Ryu et al. [8] used a hybrid approach that extracts topic phrases using machine learning and selects a summary for a Korean document using locality-based similarity. The topic phrases are treated as queries when computing the similarity; naïve Bayes, decision trees, and support vector machines are used as the machine learning algorithms. The system extracts topic phrases automatically from a new document based on these models and outputs the document summary using a query-based method that treats the extracted topic phrases as queries and calculates the locality-based similarity of each topic phrase. The authors empirically showed that their hybrid method applies well to document summarization; although the average overlap between two manual extracts was only 47%, the method outperformed the MS Word method.
2.1 Summarization for the Arabic Language

There are not many automatic summarization techniques or commercial tools for Arabic text; an exception is the Lakhas system, developed using the extraction techniques suggested by [5] and considered the first Arabic summarization tool. The architecture of the tool consists of the following modules: sentence segmentation, word segmentation, normalization, stop-word removal, lemmatization, frequency computation, indicative expressions, and weight computation. The results of the tool are evaluated based on the number of words in the original text and in the summarized text. An experiment conducted with the tool used a document of 187 words and produced a summary of 29 words, a compression rate of 16%. In addition, the authors applied four methods of sentence reduction to shorten the produced summary: name substitution, removal of some types of words, removal of the part of a sentence following certain boundaries, and removal of indirect discourse. With these methods, Lakhas reduced the summaries by approximately 50%, to 15 words.
Dalianis [9] describes SweSum, the first automatic text summarization system for Swedish. The system is built on statistical, linguistic, and heuristic methods. It uses a dictionary of 700,000 word entries that tells whether a word belongs to the open word class group and also gives the stem of the word. It performs well and is estimated to be as good as the state of the art for English; i.e., an average 30% compression of 2-3 pages of news text gives a good summary.
2.2 Summarization for Other Languages

Kimt et al. [6] proposed a summarization approach using aggregate similarity for Korean text. This method depends on segmenting the document into sentences, after
Mazdak [3] proposed a summarization system for Persian, called FarsiSum, designed on the methods and algorithms implemented in the SweSum system. The author claimed that the system improved both the coherence and the preservation of important information in the final summary when applied to Persian text, and showed that the procedures used in SweSum are suitable for the Persian language.
In addition, segmenting the document into sentences instead of paragraphs improves the summarization process and gives a more precise summary.
Pachantouris [10] described the construction and evaluation of GreekSum, the first automatic text summarizer for Greek news text, which follows the methodology of the SweSum summarization system; it was shown to produce good results.
Another summarization system, EstSum, constructs short summaries by selecting the key sentences that characterize a document. Sentences are ranked for potential inclusion in the summary using a weighted combination of statistical, linguistic, and typographic features, such as the position, format, and type of the sentence and the word frequency. The system was built for Estonian, especially Estonian newspapers, by [11].
3. THE PROPOSED APPROACH FOR ARABIC LANGUAGE
Figure 1: The Steps of Methodology. (Flowchart: the input document passes through sentence segmentation, word segmentation, text preprocessing of each sentence (using a stop-word DB), noun extraction and frequency computation (using a rules KB), similarity and aggregate similarity computation, and important sentence selection with an input threshold (X), producing the final summary.)
The methodology of the proposed approach for Arabic language summarization consists of six major modules: Sentence Extraction, Word Segmentation, Text Preprocessing, Noun Extraction, Computing Similarity and Aggregate Similarity, and Selecting the Important Sentences. These modules are illustrated in Figure 1.
Word Segmentation

When sentence segmentation is complete, tokenization segments each sentence into the words that form it. This task is done based on the spaces that determine the boundary of each word and begin a new word.
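As an illustrative sketch (not the authors' code), the sentence segmentation and word segmentation modules described here could be written as follows, splitting sentences at the dot and words at whitespace:

```python
# Hypothetical sketch of the Sentence Extraction and Word Segmentation
# modules: the dot (.) is the only end-of-sentence mark used, and words
# are delimited by spaces, as described in the text.

def segment_sentences(document: str) -> list[str]:
    """Split the document into sentences at the dot."""
    return [s.strip() for s in document.split(".") if s.strip()]

def tokenize(sentence: str) -> list[str]:
    """Split a sentence into words on whitespace boundaries."""
    return sentence.split()

doc = "اللغة العربية غنية بالمفردات. النص يقسم الى جمل."
sentences = segment_sentences(doc)
words = [tokenize(s) for s in sentences]
print(len(sentences))  # 2
```

A full implementation would also need to handle abbreviations and decimal points, which this sketch ignores.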
Sentence Extraction
In this module, the document is segmented into sentences. The structure of any Arabic text consists of a set of paragraphs that make up the document; each paragraph represents a part of the whole idea of the text and includes a set of sentences, which contain words formed of characters, symbols, and numbers [12].

The sentences that form a paragraph represent the backbone of the text. Each sentence is a collection of words, characters, symbols, or marks organized to convey a useful meaning.

Many punctuation marks can denote the end of one sentence and the beginning of the next, such as (, ! ? - : . … " ؛ ،) [12]. In this paper, the end of a sentence is the dot (.), while all other punctuation marks found between words are not used.

Text Preprocessing

This phase includes stop-word removal. Stop-words can be classified into three types:

1. Frequent words: words or characters that occur frequently in the text, such as pronouns (هم، هن، هي، هؤلاء) and some particles such as (ماذا، لماذا).

2. Words with no particular meaning: words that appear in the text without indicating particular information about it, such as (بالنسبة، بالإضافة، الجدير، بالرغم، بالذكر، بغض النظر).

3. General words and numerals: general words (month, month names, days, day names, weeks, etc.) and numeral words such as (الأول، الأولى، الثاني).
Such grammar includes the affixes of the word, such as the prefixes "ال" and "لل", etc.
In the preprocessing phase, the words extracted from the sentences are checked against a predefined set of stop-words. The set of predefined stop-words used in this paper contains about 700 stop-words.
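The stop-word check can be sketched as a simple set lookup; the list below is a tiny illustrative subset of the roughly 700-entry list, not the authors' actual list:

```python
# Hypothetical sketch of the stop-word removal step: each word is checked
# against a predefined stop-word set (about 700 entries in the paper;
# only a few examples are shown here).

STOP_WORDS = {"هم", "هن", "هي", "هؤلاء", "ماذا", "لماذا", "بالنسبة", "بالرغم"}

def remove_stop_words(words: list[str]) -> list[str]:
    """Keep only the words that are not in the stop-word set."""
    return [w for w in words if w not in STOP_WORDS]

print(remove_stop_words(["هي", "طالبة", "ماذا", "تدرس"]))  # ['طالبة', 'تدرس']
```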
Some rules depend on the position of the word in the sentence, which is a good indicator for identifying nouns: some words, such as "كان وأخواتها" and "ظنّ وأخواتها", are usually followed by nouns. These are examples of the rules used to extract nouns from sentences.
Noun Extraction Process

The sentence is the major component of an Arabic text and is built from words, where the word is the smallest semantic unit that conveys meaning [12].
In addition, there is a set of particles that includes adverbs, prepositions, exception tools, conjunctions, interrogatives, etc. Some of these particles are good indicators for identifying nouns, as in rules 5 through 12 in Table 1; all of these rules are based on particles that are followed by a noun [4].
Words are divided into three main categories according to Arabic grammar, with subcategories that collectively cover the whole of the Arabic language: nouns, verbs, and particles [13]. This paper deals only with nouns.
Sentences in Arabic are divided, according to the semantics that express their meaning, into two types: nominal and non-nominal sentences [12]. To extract nouns from sentences, fourteen different rules are used, collected from different resources; they are listed in Table 1 along with their sources. This rule list is neither complete nor ideal, and it has some drawbacks, especially when the task of noun extraction is automated. In addition to noun extraction, the frequency of every extracted noun is computed in each sentence and in the whole document.
Noun Extraction Rules

Extracting nouns is not an easy task and requires rules that decide whether a word is a noun, where a noun is a name or a word that describes a person, idea, or thing [13]. The rules for extracting nouns from sentences are built according to the specific grammar of the Arabic language.
Table 1: Noun Extraction Rules.

Rule 1: Any word starting with "ال", "أل", or "إل" is a noun (with some exceptions). [5]
Rule 2: Any word not starting with one of the characters (أ، ت، ي، ن) and ending with (ون، ين) is a noun. [5]
Rule 3: Any word ending with "ـة" is a noun. [5]
Rule 4: Any word starting with "لل" is a noun. [5]
Rule 5: Any word appearing after a vocative particle (أدوات النداء: يا، هيا، أيا، أي، وا، آ) is a noun. [5]
Rule 6: Any word appearing after a preposition (حروف الجر: من، عن، على، في، إلى، ل، بـ) is a noun. [5,14]
Rule 7: Any word appearing after an exception tool (أدوات الاستثناء: إلاّ، غير، سوى، سواء) is a noun. [21,14]
Rule 8: Any word appearing after (أنّ وأخواتها: أنّ، إنّ، لكنّ، كأنّ، ليت، لعل) is a noun. [5]
Rule 9: Any word appearing after (كان وأخواتها: كان، أصبح، أضحى، ظل، أمسى، بات، صار، ليس، الزال، ما دام، ما برح، ما زال، ما انفك، ما فتئ، and their feminine forms) is a noun. [14,15]
Rule 10: Any word appearing after (كاد وأخواتها: كاد، عسى، قرب، أوشك، شرع، أخذ، بدأ) is a noun. [14,15]
Rule 11: Any word appearing after (ظنّ وأخواتها: ظنّ، حسب، خال، زعم، حوّل، صيّر، علّم، تعلّم) is a noun. [14,15]
Rule 12: Any word appearing after certain adverbs (بعض الظروف: قبل، بعد، نحو، عند، أمام، خلف، جانب، فوق، تحت، وسط، يسار، يمين، شمال، جنوب، شرق، غرب، بين، عبر، حول) is a noun. [14,15]
Rule 13: Any word appearing after one of the six nouns (الأسماء الستة: أب، أخ، فو، ذو، حمو، هنو) is a noun. [5]
Rule 14: Any word ending with "ات" is a noun. [5]
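To make the rule mechanism concrete, here is a hypothetical sketch (not the authors' implementation) of how a few of the fourteen rules from Table 1 might be coded; the exceptions mentioned for rule 1 are not handled:

```python
# Illustrative sketch of a subset of the noun-extraction rules from
# Table 1: rule 1 (definite article prefix), rule 4 (prefix "لل"),
# rule 14 (suffix "ات"), and rule 6 (word following a preposition).

PREPOSITIONS = {"من", "عن", "على", "في", "إلى"}  # rule 6 trigger words

def is_noun(word: str, previous: str = "") -> bool:
    """Return True if any of the implemented rules marks the word a noun."""
    if word.startswith("ال"):        # rule 1: definite article
        return True
    if word.startswith("لل"):        # rule 4
        return True
    if word.endswith("ات"):          # rule 14
        return True
    if previous in PREPOSITIONS:     # rule 6: preceded by a preposition
        return True
    return False

print(is_noun("الكتاب"))        # True (rule 1)
print(is_noun("بيت", "في"))     # True (rule 6)
```

A complete implementation would cover all fourteen rules and their documented exceptions, applying them to each word together with its preceding word.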
Compute Similarity and the Aggregate Similarity

All the information collected in the previous step is used at this stage to calculate the similarity for each sentence, using the Inner Product measure presented in formula (1) [6]:

Sim(i, j) = Σ_{k=1}^{n} S_{i,k} × S_{j,k}    (1)

where:
S_{i,k} is the frequency of noun k in sentence i,
S_{j,k} is the frequency of noun k in the document, and
n is the number of nouns in the document.

The aggregate similarity assembles the similarity values for each sentence into one value. Figure 2 illustrates this process. To compute the aggregate similarity value, formula (2) is used.
In addition, a precise criterion that would make a summary useful and efficient is not simple to define [2].
asim(i) = Σ_{j=1, j≠i}^{n} sim(i, j)    (2)

where n is the number of sentences in the document and i is the sentence number.
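The similarity computation of formulas (1) and (2) can be sketched as follows. This is an illustrative reading, not the authors' code: each sentence is represented by its noun-frequency vector, the Inner Product compares two such vectors, and the aggregate similarity sums a sentence's similarity to all other sentences:

```python
from collections import Counter

# Sketch of formulas (1) and (2): S[i][k] is the frequency of noun k in
# sentence i; Sim(i, j) is the inner product of two noun-frequency
# vectors; asim(i) aggregates Sim(i, j) over all other sentences j.

def noun_frequencies(sentence_nouns: list[list[str]]) -> list[Counter]:
    """One frequency vector (Counter) per sentence's extracted nouns."""
    return [Counter(nouns) for nouns in sentence_nouns]

def inner_product(si: Counter, sj: Counter) -> int:
    """Formula (1): sum over shared nouns of the frequency products."""
    return sum(si[k] * sj[k] for k in si)   # missing keys count as 0

def aggregate_similarity(freqs: list[Counter], i: int) -> int:
    """Formula (2): sum of Sim(i, j) for every other sentence j."""
    return sum(inner_product(freqs[i], freqs[j])
               for j in range(len(freqs)) if j != i)

freqs = noun_frequencies([["الكتاب", "المدرسة"], ["الكتاب"], ["المدرسة", "المدرسة"]])
print(aggregate_similarity(freqs, 0))  # 1*1 + 1*2 = 3
```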
In this paper, the measures used for evaluation were precision and recall.
To illustrate how these two measures are used to evaluate text summarization, consider an example document and let X be the set of sentences in its summary (generated manually by a domain expert), Y the set of sentences extracted from the text by the system, and Z the set of sentences in the intersection of X and Y, as illustrated in Figure 3 [6].
Figure 2: Graphical representation of the aggregate similarity [6].

Figure 3: Sentences intersection.
Selecting Important Sentences
The recall and precision can be computed as:
In this phase, sentences are selected based on their similarity value and the aggregate similarity value. Any sentence with a high similarity is selected; if two sentences have the same similarity value, the sentence with the lower index is taken and the other is ignored. All selected sentences together represent the summary of the document.
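The selection phase, together with the summary-size formula SS = (X × number of sentences in source) / 100, can be sketched as follows (a hypothetical reading of the described procedure, not the authors' code):

```python
# Sketch of the selection phase: compute the summary size SS from the
# user threshold X (a percentage), rank sentences by similarity score,
# and keep the top SS sentences. Python's stable sort keeps the
# earlier-indexed sentence first on ties, matching the paper's rule.

def select_summary(sentences: list[str], scores: list[float], x: int) -> list[str]:
    ss = max(1, int(x * len(sentences) / 100))        # summary size (formula 3)
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen = sorted(ranked[:ss])                      # restore document order
    return [sentences[i] for i in chosen]

print(select_summary(["s1", "s2", "s3", "s4", "s5"],
                     [3.0, 9.0, 9.0, 1.0, 7.0], 40))  # ['s2', 's3']
```

With X = 40 and five sentences, SS = 2, so the two highest-scoring sentences are returned in their original document order.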
Recall R is the percentage of the target sentences that the system extracted [6].
R = |Z| / |X|    (4)

Precision P is the percentage of the extracted sentences that the system got right [6].
The size of the summary is not always constant; many factors affect it, such as the length of the source and the purpose of the summary, which influence the number of sentences selected. In this paper, the number of sentences in the summary is a percentage of the size of the source. Formula (3) determines the produced summary size (SS):

SS = (X × number of sentences in source) / 100    (3)
P = |Z| / |Y|    (5)
The F-measure F combines precision and recall into a single measure of overall performance [6]:

F = 2PR / (P + R)    (6)
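Formulas (4) through (6) translate directly into set operations; the following sketch (with hypothetical example sets) shows the computation:

```python
# Sketch of formulas (4)-(6): X is the human summary, Y the system
# summary, Z their intersection; recall, precision, and F-measure follow.

def evaluate(x: set[int], y: set[int]) -> tuple[float, float, float]:
    z = x & y
    r = len(z) / len(x)                           # formula (4): recall
    p = len(z) / len(y)                           # formula (5): precision
    f = 2 * p * r / (p + r) if p + r else 0.0     # formula (6): F-measure
    return p, r, f

# e.g. human summary = sentences {1,2,3,4}, system summary = {2,3,5}
p, r, f = evaluate({1, 2, 3, 4}, {2, 3, 5})
print(round(p, 2), round(r, 2))  # 0.67 0.5
```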
where X is a threshold value determined by the user that represents the percentage of sentences kept in the summary. For evaluation purposes, the values of X used in this paper were 30, 20, and 10.

4. EXPERIMENTS AND EVALUATION

The evaluation of automatic summarization is usually done manually, and it is a very complex task: the summary varies from person to person for the same text, and the purpose of a summary differs from reader to reader, so it is impossible to obtain a perfect summary.

Beside these measures, three more measures are used:

1) Compression rate (CR), which reflects the size of the summary relative to the source:

CR = number of words in summary / number of words in source

2) Retention ratio (RR), which denotes how much information from the source the summary retains:

RR = number of nouns in summary / number of nouns in source

3) Omission rate (OR), which measures the ratio of information missed from the source:

OR = (number of nouns in source − number of nouns in summary) / number of nouns in source
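These three auxiliary measures are simple ratios; the sketch below computes them, reusing the 187-word/29-word Lakhas example quoted earlier as illustrative input:

```python
# Illustrative computation of compression rate (CR), retention ratio (RR),
# and omission rate (OR). Note that RR + OR = 1 by construction.

def cr(words_summary: int, words_source: int) -> float:
    return words_summary / words_source

def rr(nouns_summary: int, nouns_source: int) -> float:
    return nouns_summary / nouns_source

def omission(nouns_summary: int, nouns_source: int) -> float:
    return (nouns_source - nouns_summary) / nouns_source

# e.g. a 29-word summary of a 187-word document (the Lakhas example above)
print(round(cr(29, 187), 2))  # 0.16
```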
4.1 Preparing the Data for Evaluation

The dataset consists of one collection of 50 documents for evaluation. Each document is paired with a human-made summary produced by selecting the most important sentences. The documents were sampled from the internet in different areas (e-learning, distance learning, computer science, and information technology). The content was prepared by removing figures, tables, captions, references, and cross-references from the original documents, and was checked manually. Table 2 shows statistics about the collection.
Figure 4: Precision and Recall for Inner Product and size of summary.
Table 2: Statistics about the dataset.

Number of documents: 50
Total number of sentences in collection: 1168
Total number of words in collection: 40892
Average number of sentences per document: 23.36
Average number of sentences per summary (by human): 4.14
Figure 5: Precision for Inner Product vs. Cosine measure.
4.2 Results Evaluation

The results obtained from the experiments are presented in Tables 3 and 4, which display the performance of the system. Table 3 shows statistics including the average number of words in the source and in the summary, and the average number of nouns found in each. Table 4 presents the performance of the system, including Precision and the other measures used to evaluate it; the first column gives the summary size and the second the similarity measure, covering the two measures used to calculate similarity.
Figure 6: Recall for Inner Product vs. Cosine.
Table 5 lists the information extracted by the approach, such as the number of extracted terms. Table 6 presents the performance of the system using the second method, stem extraction instead of noun extraction; all measures are calculated as in the first method except RR and OR, which are computed from the number of terms in the summary and in the source using the same formulas as in the first method.
The experiments showed that the Inner Product measure at a summary size of 20% gives the highest Precision, 62%, while the highest Recall, 70%, is obtained at a summary size of 30%. The lowest compression rate, 14%, occurs at a summary size of 10% (Table 4). The results for the other measures are illustrated in Figures 4, 5, and 6.
The experiments showed that the noun-extraction method gives better results than the second method. By definition, a noun is any word that denotes an idea, a thing, or a person, and this is reflected in the summary, whereas the quality of the stem-based method depends on additional factors such as the accuracy of the stemming algorithm.
To further evaluate the proposed approach, extra experiments were performed in which the noun extraction process was replaced with finding the stems of the terms, using the Arabic stem extraction approach proposed by [16]. Tables 5 and 6 display the results produced using this approach.
The Inner Product produced better results than the cosine measure in both methods. Figures 7 and 8 illustrate the performance of the two methods.
Table 3: Information about the summary and dataset.

| Summary size | Similarity measure | Words in source | Words in summary | Nouns in source | Nouns in summary | Overlap between sentences (human vs. system) | Sentences in summary |
|---|---|---|---|---|---|---|---|
| 10% | Inner Product | 817.84 | 122.06 | 237.44 | 51.96 | 0.98 | 1.84 |
| 10% | Cosine measure | 817.84 | 123.92 | 237.44 | 53.96 | 0.90 | 1.84 |
| 20% | Inner Product | 817.84 | 258.36 | 237.44 | 97.58 | 2.12 | 4.24 |
| 20% | Cosine measure | 817.84 | 259.22 | 237.44 | 100.60 | 2.10 | 4.24 |
| 30% | Inner Product | 817.84 | 365.00 | 237.44 | 128.30 | 2.94 | 6.48 |
| 30% | Cosine measure | 817.84 | 364.44 | 237.44 | 130.18 | 2.74 | 6.48 |
Table 4: Performance of the system.

| Summary size | Similarity measure | CR (%) | Precision (%) | Recall (%) | F-measure (%) | RR (%) | OR (%) | Time (s) |
|---|---|---|---|---|---|---|---|---|
| 10% | Inner Product | 14 | 53 | 22 | 30 | 20 | 80 | 107.60 |
| 10% | Cosine measure | 14 | 51 | 20 | 28 | 21 | 79 | 107.10 |
| 20% | Inner Product | 31 | 62 | 51 | 54 | 40 | 60 | 109.68 |
| 20% | Cosine measure | 31 | 60 | 51 | 53 | 41 | 59 | 109.68 |
| 30% | Inner Product | 43 | 55 | 70 | 60 | 53 | 47 | 111.10 |
| 30% | Cosine measure | 43 | 52 | 67 | 57 | 54 | 46 | 109.30 |
Table 5: Information about the summary and dataset (Stem Extraction method).

| Summary size | Similarity measure | Words in source | Words in summary | Terms in source | Terms in summary | Sentences in summary | Overlap between sentences (human vs. system) |
|---|---|---|---|---|---|---|---|
| 10% | Inner Product | 817.84 | 120.38 | 278.98 | 56.50 | 1.84 | 1.06 |
| 10% | Cosine measure | 817.84 | 120.90 | 278.98 | 56.76 | 1.84 | 1.00 |
| 20% | Inner Product | 817.84 | 257.26 | 278.98 | 109.82 | 4.24 | 2.04 |
| 20% | Cosine measure | 817.84 | 256.92 | 278.98 | 110.22 | 4.24 | 1.92 |
| 30% | Inner Product | 817.84 | 365.46 | 278.98 | 146.38 | 6.48 | 2.74 |
| 30% | Cosine measure | 817.84 | 364.56 | 278.98 | 147.27 | 6.48 | 2.64 |
Table 6: Performance of the system using the Stem Extraction method.

| Summary size | Similarity measure | CR (%) | Precision (%) | Recall (%) | F-measure (%) | RR (%) | OR (%) | Time (s) |
|---|---|---|---|---|---|---|---|---|
| 10% | Inner Product | 14 | 59 | 23 | 32 | 19 | 81 | 52.70 |
| 10% | Cosine measure | 14 | 57 | 22 | 31 | 19 | 81 | 52.44 |
| 20% | Inner Product | 31 | 58 | 48 | 51 | 39 | 61 | 59.98 |
| 20% | Cosine measure | 31 | 55 | 46 | 48 | 39 | 61 | 60.18 |
| 30% | Inner Product | 44 | 50 | 65 | 55 | 52 | 48 | 65.76 |
| 30% | Cosine measure | 44 | 48 | 63 | 53 | 52 | 48 | 65.56 |

Figure 7: Precision for Inner Product by Noun Ex. method vs. Stem Ex. method.

Figure 8: Recall for Inner Product by Noun Ex. method vs. Stem Ex. method.
5. CONCLUSION

This paper proposed a text summarization approach for the Arabic language based mainly on the ideas of noun extraction and aggregate similarity. The performance of the approach is measured using the Precision and Recall measures; the highest Precision obtained was 62%, with 70% for Recall and a 14% compression rate. The experimental results indicate that the proposed technique is acceptable and can be applied to Arabic text. Nevertheless, some factors may affect the quality of the produced summary:
- The source language of the text, the sequence of its ideas, and the coherence of the text.
- The length of the sentences in the document and the number of words in each sentence.
- The adherence to correct grammatical rules when the text was written.

Generally, the experiments showed that the proposed approach adapts well to Arabic text. Nevertheless, further investigation is needed to enhance it, including adding more noun extraction rules and investigating other similarity measures.

REFERENCES
[1] Evans D. (2005). Identifying Similarity in Text: Multi-Lingual Analysis for Summarization. PhD Thesis. Columbia University.
[2] de Smedt K., Liseth A., Hassel M. and Dalianis H. (2005). How short is good? An evaluation of automatic summarization. In Holmboe, H. (ed.) Nordisk Sprogteknologi.
[3] Mazdak N. (2004). FarsiSum - A Persian text summarizer. Master Thesis. Stockholm University.
[4] Al Shamsi F. and Guessoum A. (2006). A Hidden Markov Model-Based POS Tagger for Arabic. JADT 2006.
[5] Douzidia F. and Lapalme G. (2004). Lakhas, an Arabic summarization system. Proceedings of DUC.
[6] Kimt J., Kimt J. and Hwang D. (2000). Korean Text Summarization Using an Aggregate Similarity. Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (ACM). pp 111-118.
[7] Leskovec J., Grobelnik M. and Frayling N. (2002). Learning Semantic Sub-graphs for Document Summarization.
[8] Ryu J., Han K. and Rim K. (2003). Korean document summarization using topic phrases extraction and locality-based similarity. N. Zhong et al. (Eds): ISMIS 2003. pp 320-325.
[9] Dalianis H. (2000). SweSum - A Text Summarizer for Swedish. Technical report TRITA-NA-P0015, IPLab-174. NADA, KTH.
[10] Pachantouris G. (2005). GreekSum - A Greek Text Summarizer. Master Thesis. Department of Computer and Systems Sciences, KTH-Stockholm University.
[11] Müürisep K. and Mutso P. (2005). EstSum - Estonian Newspaper Texts Summarizer. Proceedings of the Second Baltic Conference on Human Language Technologies. pp 311-316.
[13] Al-Shalabi R. and Kanaan G. (2004). Constructing an Automatic Lexicon for Arabic Language. International Journal of Computing & Information Sciences, Vol. 2, No. 2. pp 114-128.
[16] Al-Shalabi R., Kanaan G. and Muaidi H. (2003). New Approach for Extracting Arabic Roots. Proceedings of the International Arab Conference on Information Technology. Alexandria, Egypt. pp 42-59.
[17] Aliguliyev R. M. (2008). New sentence similarity measure and sentence based extractive technique for automatic text summarization. Expert Systems with Applications 36 (2009), Elsevier. pp 7764-7772.
[18] Chan W., Lixia L. and Lei L. (2008). HowNet based Evaluation for Chinese Text Summarization. Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE '08). pp 1-6.
ARABIC REFERENCES
[12] Abdul Ajeeli. (1996). The Computer and the Arabic Language (الحاسوب واللغة العربية). Yarmouk University Publications, Jordan.
[14] Ibn Hisham al-Ansari. (1999). Mughni al-Labib 'an Kutub al-A'arib (مغني اللبيب عن كتب الأعاريب). Edited by Muhyi al-Din Abd al-Hamid. Al-Maktaba Al-Asriyya, Beirut.
[15] Jalal al-Din al-Suyuti. (1996). Ham' al-Hawami' fi Sharh Jam' al-Jawami' (همع الهوامع في شرح جمع الجوامع). Edited by Abd al-Aal Makram. Dar Al-Buhuth Al-Ilmiyya, Kuwait.