A New Hybrid Farsi Text Summarization Technique Based on Term Co-Occurrence and Conceptual Property of the Text

Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing

Azadeh Zamanifar, Behrouz Minaei-Bidgoli, Mohsen Sharifi
Faculty of Computer Engineering, Iran University of Science and Technology
[email protected], [email protected], [email protected]

Abstract

The importance of text summarization grows rapidly as the amount of information increases exponentially. This paper presents a new hybrid summarization technique that combines statistical properties of documents with Farsi linguistic features. The originality of the technique lies in its use of the term co-occurrence property of the text, which lets it detect the number of subjects and summarize a document in proportion to the subjects treated in it. The technique also exploits the conceptual properties of the text: based on word synonymy, it prevents similar sentences from being included in the summary, and it preserves the cohesion of the summarized text. Our results show better performance than FarsiSum, the well-known Farsi summarizer, which relies only on heuristic properties of the text and does not address the challenges of Farsi.

1. Introduction

Summarization is a brief and accurate representation of an input text such that the output covers the most important concepts of the source in a condensed manner. The summarization process can be extractive or abstractive. Extractive summaries contain sentences copied exactly from the source document. Abstractive approaches aim to derive the main concepts of the source text without necessarily copying its exact sentences. It is generally agreed that automating the summarization procedure should be based on text understanding that mimics the cognitive processes of humans. However, this is a subproblem of Natural Language Processing (NLP) and is very difficult to solve at present; it may take some time to reach a level where machines can fully understand documents [1].

In Farsi, due to the lack of an integrated lexical database like WordNet [2], applying the full range of Farsi linguistic properties to summarization is impossible. Farsi also has no capital letters, allows different written forms of the same word, contains compound words, and lacks well-defined stemming rules. These features increase ambiguity and complicate the information extraction process. We propose a technique based on the co-occurrence property of the text that resolves some of these challenges: it detects the different subjects in the text and eliminates similar sentences in order to increase the accuracy of Farsi text summarization.

The rest of the paper is organized as follows. Section 2 describes related work. Section 3 presents our proposed technique. Empirical results are presented in Section 4. Finally, conclusions and future work are described in Section 5.

2. Related Work

Previous work on text summarization can be categorized into two classes: statistical and linguistic methods. Luhn [3] focuses on word distribution in sentences, where word frequencies specify the topic of the source document. Several methods [3, 4, 5] concentrate on the positions of terms in the document; some positions carry extra weight and can be viewed as special cues (e.g., "introduction", "method", "conclusion", and "result"). Hovy [4] learns an optimum policy for the positions of relevant summarization information. Edmundson [5] introduces a linear ranking equation that considers title, cue, and keyword position, and shows that term position works well for news.


The main problem with purely statistical methods is that they do not consider cohesion; in other words, they ignore related terms. Text summarization can also be treated as a classification problem [6, 7]. In the training phase, training data (sentences) are fed to a classification algorithm, and unseen sentences (the test data) are then classified into two categories: included in the summary or not. Some features of the text are selected for the training phase. [8] uses a genetic algorithm to extract proper features for classification, but the accuracy is limited to only 52%. [9] applies Bayesian learning to text summarization, where the probability that a sentence is included in the summary is related to the number of votes the sentence collects. Supervised learning may require labeling sentences as relevant or not, which is a tedious task.

Besides purely statistical solutions, there are solutions that mainly consider the linguistic properties of a particular language and the semantic relations between words [9, 10, 11]. WordNet [2] is a lexical database of more than 118,000 English words that records relations between words such as synonymy, hyponymy, hypernymy, antonymy, and meronymy. In these solutions, lexical chains are built from the words in the text. A lexical chain is a word sequence in which the words are connected by one of the noted relations; chains help resolve ambiguity and identify discourse structure. The main tasks are assigning each word to its best chain and ranking the resulting chains. Barzilay [10] first demonstrated the feasibility of computing lexical chains; a more recent implementation focuses on improving word sense disambiguation. Other solutions, such as [12], are based on the HowNet [13] lexical database, which consists of Chinese words, the relations among them, and their English equivalents. Still other solutions consider the morphological properties of the text: [14] creates a graph based on the relations between words, which can be conceptual or statistical. Geng [15] proposes a summarization algorithm based on the term co-occurrence property of the text. Our technique is similar, but we modify it and integrate it with more concept-based properties of the text. Yu [16] proposes an integrated algorithm that combines lexical chains with structural features for Chinese text summarization; our method differs in that theirs does not consider the term co-occurrence parameter.

There is just one Farsi text summarizer, FarsiSum [17], a Farsi version of the Swedish summarizer SweSum [18]. It relies only on statistical features of the text and does not consider Farsi linguistic properties in preprocessing, apart from a stop list.

3. Proposed Technique

Text summarization has three main steps: topic identification, topic interpretation, and summary generation. Topic identification includes text segmentation, removal of trivial and redundant information, and stemming; in other words, text preprocessing is done in this stage, after which the most important units of the text remain. The main part of summarization is topic interpretation, which consists of finding and ranking important words. In the final stage, the summarized text is generated from the sentences and their rankings. The proposed technique is presented in this section following these steps.

3.1 Farsi Challenges and Preprocessing Stage

Farsi differs from English both morphologically and semantically. Some intrinsic problems of Farsi texts are: 1) a lack of well-defined verbal grammatical rules such as those in English; examples are the conjunction of parts of a verb to non-verbal words, the conjunction of objects to verbs, and compound verbs with preverbal elements, all of which affect verb stemming; 2) different written forms of the same word; 3) word/phrase ambiguity; 4) morphological ambiguity, where the same written word may have different meanings because short vowels are not written; 5) compound words that may appear as two separate words.

Because of these difficulties, the preprocessing step is very important for accurate text processing. Unlike English preprocessing, we first apply a Farsi stemming algorithm that resolves most of the Farsi stemming challenges; if stop-word detection were applied first, a preverbal element could be mistaken for a stop word. Then compound words are detected using a dataset of more than 1,400 Farsi compound words. As the final preprocessing stage, stop words are deleted using our stop list of more than 400 common Farsi words. This phase helps detect important words more accurately.
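To make the ordering concrete, here is a minimal sketch of the preprocessing pipeline under stated assumptions: `stemmer`, `compound_words`, and `stop_words` are placeholders for the paper's actual resources (the Farsi stemmer, the 1,400-entry compound-word dataset, and the 400-entry stop list), which are not reproduced here.

```python
# Hypothetical sketch of the preprocessing order described above.
# Stemming runs first so preverbal elements are not mistaken for stop words.

def preprocess(tokens, stemmer, compound_words, stop_words):
    """tokens: list of Farsi word tokens. Returns the remaining content words."""
    # 1) Stem every token first (resolves most Farsi stemming challenges).
    stems = [stemmer(t) for t in tokens]

    # 2) Detect compound words: merge adjacent stems that form a known compound.
    merged, i = [], 0
    while i < len(stems):
        if i + 1 < len(stems) and (stems[i], stems[i + 1]) in compound_words:
            merged.append(stems[i] + " " + stems[i + 1])  # keep compound as one unit
            i += 2
        else:
            merged.append(stems[i])
            i += 1

    # 3) Only now delete stop words.
    return [w for w in merged if w not in stop_words]
```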

3.2 Text Processing

As mentioned before, the lack of a comprehensive Farsi lexical database makes full conceptual analysis impossible. We therefore use a hybrid method that considers both the statistical and the conceptual properties of the text. Our method is based on the term co-occurrence property: if two terms occur in the same window, they are conceptually related, and the larger their co-occurrence degree, the more strongly they are related. We use a modified version of the method described in [15] and combine it with linguistic properties. The changes are: 1) the way important words and link terms are chosen; 2) the way a sentence is selected from each cluster; 3) the combination of the result with a concept-based method. Like [15], we define the relative term co-occurrence degree as

$$R(w_i \mid w_j) = \frac{f(w_i, w_j)}{f(w_j)} \quad (1)$$

where $f(w_j)$ is the frequency of $w_j$ and $f(w_i, w_j)$ is the number of times the two terms occur in the same window. Choosing the best window size for measuring co-occurrence is very important and directly affects summarization accuracy; in our experience, the best co-occurrence window is the average sentence length. The co-occurrence degree is then defined as follows [15]:

$$C(w_i, w_j) = \frac{R(w_i \mid w_j) + R(w_j \mid w_i)}{2} \quad (2)$$

The co-occurrence degree is computed for every pair of words. Lexical chains are then created from a lexical synonym database. Because of possible word ambiguity, each word may fit more than one lexical chain. To choose the correct chain, the sum of a new word's co-occurrence degrees with every member of each existing chain is computed, and the chain with the maximum value is selected as the candidate for inserting the word; if there is no related chain, a new one is created:

$$w_i \in Chain_d \quad \text{where} \quad d = \operatorname*{arg\,max}_{j = 1, \dots, LexicalChainCount} \sum_{w_k \in Chain_j} C(w_i, w_k) \quad (3)$$

We rank each word by multiplying its frequency by the size of its chain:

$$Important\_ce(w_i) = Frequency(w_i) \times ChainMemberCount(d_{w_i}) \quad (4)$$
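The following sketch shows Equations (1)-(4), assuming each sentence is used as the co-occurrence window (the average-sentence-length window suggested above) and counting a word once per window. The synonym-database seeding of lexical chains is omitted; chain assignment here uses only Equation (3).

```python
from collections import Counter
from itertools import combinations

def cooccurrence_degrees(sentences):
    """sentences: list of token lists; one sentence = one window.
    Returns window frequencies f(w) and symmetric degrees C(wi, wj)."""
    freq, pair_freq = Counter(), Counter()
    for window in sentences:
        words = set(window)                    # count each word once per window
        freq.update(words)
        for wi, wj in combinations(sorted(words), 2):
            pair_freq[(wi, wj)] += 1
    C = {}
    for (wi, wj), f_ij in pair_freq.items():
        C[(wi, wj)] = (f_ij / freq[wj]         # Eq. (1): R(wi|wj)
                       + f_ij / freq[wi]) / 2  # Eq. (2): averaged with R(wj|wi)
    return freq, C

def degree(C, w1, w2):
    """Symmetric lookup of C(w1, w2); unrelated pairs score zero."""
    return C.get((min(w1, w2), max(w1, w2)), 0.0)

def assign_to_chain(word, chains, C):
    """Eq. (3): insert the word into the chain that maximizes the summed
    co-occurrence degree; open a new chain if no chain is related."""
    scores = [sum(degree(C, word, wk) for wk in ch) for ch in chains]
    if not scores or max(scores) == 0:
        chains.append({word})
    else:
        chains[scores.index(max(scores))].add(word)

def importance(word, freq, chains):
    """Eq. (4): word frequency times the size of its lexical chain."""
    chain = next((ch for ch in chains if word in ch), {word})
    return freq[word] * len(chain)
```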

The n top-ranked words are then selected, where n is the number of document words divided by 20. A graph is created with these important words as its nodes; an edge connects two words if their co-occurrence degree exceeds a predefined threshold (we found 0.7 to be the best threshold). The graph is disjoint if the input text describes more than one topic. To preserve the coherence of the text, we identify the words that occur in more than one cluster by computing the probability that a word belongs to more than one cluster. The probability that a word belongs to a cluster $C_i$ is defined as follows [15]:

$$p(w \in C_i) = \frac{\sum_{w_c \in C_i,\, w \neq w_c} C(w, w_c)}{\sum_{w_c \in C,\, w \in D,\, w \neq w_c} C(w_c, w)} \quad (5)$$

That is, the sum of the word's co-occurrence degrees with each word of the cluster (its dependency on that cluster) divided by the sum of its dependencies on all clusters. The probability that a word belongs to more than one cluster is then

$$LinkScore(w) = 1 - \sum_{i=1}^{Count} \bigl( p(w \in C_i) \times p(w \notin C_j) \bigr), \quad j \neq i \quad (6)$$

where Count is the number of clusters.
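A sketch of Equations (5) and (6) as reconstructed above; `C` is the co-occurrence table and `degree` the lookup from the previous sketch, and `clusters` are the connected components of the word graph. Reading the p(w ∉ Cj) factor as a product over all other clusters is our interpretation of the garbled original.

```python
def membership_prob(word, cluster, clusters, C):
    """Eq. (5): the word's dependency on one cluster over all clusters."""
    num = sum(degree(C, word, wc) for wc in cluster if wc != word)
    den = sum(degree(C, word, wc)
              for cl in clusters for wc in cl if wc != word)
    return num / den if den else 0.0

def link_score(word, clusters, C):
    """Eq. (6): one minus the probability that the word belongs to
    exactly one cluster, i.e. the chance it links several clusters."""
    probs = [membership_prob(word, cl, clusters, C) for cl in clusters]
    exactly_one = 0.0
    for i, p_i in enumerate(probs):
        p_not_rest = 1.0
        for j, p_j in enumerate(probs):
            if j != i:
                p_not_rest *= (1.0 - p_j)   # p(w not in Cj), j != i
        exactly_one += p_i * p_not_rest
    return 1.0 - exactly_one
```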

The m top words with the highest LinkScore are selected and added to the graph. The gain of each word is the sum of the weights of the edges connected to it, and the score of a sentence is the sum of the gains of its words, normalized by sentence length:

$$w(s_i) = \frac{\sum_{k} w_{ki}}{|s_i|} \quad (7)$$

where $w_{ki}$ is the gain of the k-th word of sentence $s_i$ and $|s_i|$ is the sentence length.
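A compact sketch of Equation (7); `gain` maps each important word to the summed weight of its graph edges, and words outside the graph contribute zero.

```python
def sentence_score(sentence, gain):
    """Eq. (7): sum of the words' gains, normalized by sentence length.
    sentence: list of tokens; gain: dict word -> summed edge weight."""
    if not sentence:
        return 0.0
    return sum(gain.get(w, 0.0) for w in sentence) / len(sentence)
```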

In addition, the first sentence of the text is always placed in the summary, since heuristics show that the first sentence is important in most texts.


3.3 Summary Generation

After ranking the sentences, the n top-ranked sentences are selected. However, to prevent an imbalanced distribution of sentences between topics when a word occurs mostly within one topic, we proceed as follows. We first determine whether the graph is disconnected, and only then define, for each cluster, a coefficient proportional to the cluster's size; this coefficient is later used to pick the sentences from each cluster. We define the summary coefficient of each cluster as

$$SumCoefficient(C_i) = \frac{NoOfWord(C_i)}{TotalNoOfWord} \quad (8)$$
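One plausible way to apply the coefficient of Equation (8) is as a per-cluster sentence quota; the proportional rounding below is our assumption, not spelled out in the paper.

```python
def cluster_quotas(clusters, n_sentences):
    """Eq. (8): each cluster's share of the summary, proportional to
    the number of words it contains (rounding is a sketch choice)."""
    total_words = sum(len(c) for c in clusters)
    return [round(n_sentences * len(c) / total_words) for c in clusters]
```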

The summary percentage is variable and can adapt to the user's request; results show that summarizing a text to 30 percent of its original length yields around 70-80 percent accuracy on 3-4 page news articles [18]. As the last step, sentence similarity is calculated with the cosine measure. A word vector is constructed for each sentence, and the similarity between two sentences increases if their words are identical or synonymous, i.e., in the same lexical chain. Sentences whose similarity exceeds a predefined threshold are eliminated, keeping the longer sentence. This avoids redundancy in the summarized text.
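A minimal sketch of this redundancy-elimination step, assuming a `canonical` function that maps words in the same lexical chain to a shared identifier so that synonyms count as matches; the 0.8 threshold is illustrative, not taken from the paper.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def drop_redundant(sentences, canonical, threshold=0.8):
    """Eliminate near-duplicate sentences, keeping the longer one.
    sentences: list of token lists; canonical: word -> lexical-chain id."""
    vectors = [Counter(canonical(w) for w in s) for s in sentences]
    keep = [True] * len(sentences)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if keep[i] and keep[j] and cosine(vectors[i], vectors[j]) > threshold:
                # the shorter of the two similar sentences is dropped
                keep[j if len(sentences[i]) >= len(sentences[j]) else i] = False
    return [s for s, k in zip(sentences, keep) if k]
```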

4. Evaluation

Determining the quality of automatic summarization is a complicated task, especially for the extractive type. Typically, performance is measured either by comparison with manual summaries or by a task-based method, in which performance on a downstream task serves as the evaluation criterion. In manual analysis, the ideal summary is defined by majority voting or by the union or intersection of several human summaries; the number of human summarizers varies, from three persons upward [16]. In [15], different types of documents (computing, art, sport, ...) are selected and the manual summaries are compared with the output of the proposed algorithm. The parameters most often used to evaluate summarization performance are precision and recall [20]:

$$Precision = \frac{\#Sentences(SystemExtractedSummary \cap IdealSummary)}{\#Sentences(SystemExtractedSummary)} \quad (9)$$

$$Recall = \frac{\#Sentences(SystemExtractedSummary \cap IdealSummary)}{\#Sentences(IdealSummary)} \quad (10)$$

Obviously, as the length of the summary increases, the differences between human-summarized texts also increase. For English, the DUC human-generated summaries can be used as a benchmark: [19] compares its results with the well-known Copernic summarizer on DUC datasets, while [20, 21] use the Reuters dataset [22] of 1000 news-wire documents with associated extracted-sentence summaries, as well as the dataset of [23], composed of 183 scientific articles. Since there is no approved benchmark for Farsi, we compare our results with FarsiSum, the only existing Farsi summarizer. We used sixty Farsi news texts to construct the testing corpus. For each text, five students summarized the document manually at three different proportions of the original length: 30%, 40%, and 50%. The ideal summary was constructed by majority voting over the manual results, and precision and recall were then calculated for our technique and for FarsiSum. The results, presented in Table 1, show that our technique works better, especially at a compression rate of 30%.

Table 1. Results of our evaluation

Compression Rate   Parameter   Proposed Approach   FarsiSum
30%                Precision   0.72                0.63
                   Recall      0.71                0.65
40%                Precision   0.66                0.59
                   Recall      0.69                0.61
50%                Precision   0.62                0.53
                   Recall      0.66                0.56
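The evaluation reduces to set overlap between the system summary and the majority-voted ideal summary; a sketch of Equations (9) and (10) over sentence indices:

```python
def precision_recall(system_sents, ideal_sents):
    """system_sents, ideal_sents: sets of selected sentence indices.
    Returns (precision, recall) per Equations (9) and (10)."""
    overlap = len(system_sents & ideal_sents)
    precision = overlap / len(system_sents) if system_sents else 0.0
    recall = overlap / len(ideal_sents) if ideal_sents else 0.0
    return precision, recall
```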

5. Conclusion and Future Work

In this paper, we proposed an integrated method for Farsi text summarization that combines the term co-occurrence property with the conceptual features of the Farsi language. The text preprocessing steps are improved to cover the Farsi linguistic challenges. Unlike FarsiSum, which considers only the statistical features of the text, we consider the relationships between words and use a synonym dataset to eliminate similar sentences. The proposed method also keeps the text cohesive by including the linked sentences in the summary. The results show that it can detect the different topics of a text and generate a balanced summary with respect to the number of subjects.

Future work includes developing a comprehensive Farsi lexical database, like WordNet, that covers relations between words such as hyponymy, antonymy, and so forth; this would help produce more accurate summaries. Some remaining challenges, such as abbreviations in Farsi, are not yet addressed and can also be considered as future work.

6. Acknowledgments

This work is partially supported by the Farsi Text Mining Research group at the Computer Research Center of Islamic Sciences (CRCIS), NOOR Co., P.O. Box 37185-3857, Qom, Iran.

7. References

[1] W. Doran, N. Stokes, J. Carthy, J. Dunnion, "Comparing Lexical Chain-based Summarization Approaches using an Extrinsic Evaluation", Global WordNet Conference (GWC), 2004, pp. 112-117.

[2] G. A. Miller, "WordNet: An On-line Lexical Database", International Journal of Lexicography, Vol. 3, 1990.

[3] H. P. Luhn, "The Automatic Creation of Literature Abstracts", IBM Journal of Research and Development, 1958, pp. 159-165.

[4] E. Hovy, "Parsimonious and Profligate Approaches to the Question of Discourse Structure Relations", 5th Workshop on NLG, Dawson, Pennsylvania, 1990.

[5] H. P. Edmundson, "New Methods in Automatic Abstracting", Journal of the ACM, 1969, pp. 264-285.

[6] J. Kupiec, J. Pedersen, F. Chen, "A Trainable Document Summarizer", Proceedings of the 18th ACM SIGIR, 1995, pp. 68-73.

[7] I. Mani, E. Bloedorn, "Machine Learning of Generic and User-Focused Summarization", Proceedings of the Fifteenth National Conference on AI, 1998, pp. 821-826.

[8] C. N. Silla, G. L. Pappa, A. A. Freitas, C. A. Kaestner, "Automatic Text Summarization with Genetic Algorithm-Based Attribute Selection", LNAI 3315, 2004, pp. 305-314.

[9] T. Nomoto, "Bayesian Learning in Text Summarization", Proceedings of the Human Language Technology Conference on Empirical Methods in Natural Language Processing, 2005, pp. 249-256.

[10] R. Barzilay, M. Elhadad, "Using Lexical Chains for Text Summarization", The MIT Press, Cambridge, Massachusetts, 1999, pp. 111-121.

[11] Y. Chen, X. Wang, Y. Guan, "Automatic Text Summarization Based on Lexical Chains", LNCS Vol. 3610, 2005, pp. 947-951.

[12] H. Silber, K. F. McCoy, "Efficient Text Summarization Using Lexical Chains", Proceedings of the ACM Conference on Intelligent User Interfaces, 2000.

[13] Z. Dong, Q. Dong, "HowNet - A Hybrid Language and Knowledge Resource", Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, 2003, pp. 820-824.

[14] I. Mani, M. Maybury, "Advances in Automatic Text Summarization", The MIT Press, Vol. 26, No. 10, Cambridge, 1999, pp. 280-281.

[15] H. Geng, P. Zhao, E. Chen, Q. Cai, "A Novel Automatic Text Summarization Study Based on Term Co-Occurrence", Proceedings of the 5th International IEEE Conference on Cognitive Informatics, 2006, pp. 601-606.

[16] L. Yu, J. Ma, F. Ren, S. Kuroiwa, "Automatic Text Summarization Based on Lexical Chains and Structural Features", Proceedings of the Eighth International IEEE ACIS Conference, 2007, pp. 574-578.

[17] M. Hassel, N. Mazdak, "FarsiSum - A Persian Text Summarizer", 20th International Conference on Computational Linguistics, Geneva, Switzerland, 2004.

[18] H. Dalianis, "SweSum - A Text Summarizer for Swedish", Technical Report TRITA-NA-P0015, IPLab-174, NADA, KTH, 2000.

[19] K. Bellare, A. D. Sarma, N. Loiwal, "Generic Text Summarization Using WordNet", International Conference on Language Resources and Evaluation, 2004.

[20] M. Amini, P. Gallinari, "Automatic Text Summarization Using Unsupervised and Semi-supervised Learning", Lecture Notes in Computer Science, Vol. 2168, 2001, pp. 16-28.

[21] M. Amini, P. Gallinari, "The Use of Unlabeled Data to Improve Supervised Learning for Text Summarization", Proceedings of the 25th ACM SIGIR, 2002, pp. 105-112.

[22] Reuters (World News, Business News), available at http://www.reuters.com/

[23] I. Mani, G. Klein, D. House, L. Hirschman, T. Firmin, "SUMMAC: A Text Summarization Evaluation", Natural Language Engineering, Vol. 8, 2002, pp. 43-68.
