AZOM: A Persian Structured Text Summarizer

Azadeh Zamanifar and Omid Kashefi
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
[email protected], kashefi@{ieee.org, iust.ac.ir}

Abstract. In this paper we propose a summarization approach, nicknamed AZOM, that combines statistical and conceptual properties of the text and, taking the document structure into account, extracts a summary of the text. AZOM is also capable of summarizing unstructured documents. The proposed approach is tailored to the Persian language but can easily be applied to other languages. Empirical results show that AZOM outperforms common structured text summarizers as well as existing Persian text summarizers.

Keywords: Summarization, Persian, Fractal Theory, Statistics, Structure, Conceptual.

1 Introduction

A summary is a brief and accurate representation of an input text such that the output covers the most important concepts of the source in a condensed manner [1]. The summarization process can be extractive or abstractive. Extractive summaries contain sentences that are copied exactly from the source document [2]. Abstractive approaches aim to convey the main concepts of the source text without necessarily copying its exact sentences [2]. Traditional text summarization methods use statistical properties of the text such as term frequency, sentence position, and cue terms. Researchers have enhanced these traditional statistical methods with techniques such as extracting relations between words from a lexical database in order to generate more coherent summaries [3], and applying rhetorical structure analysis [4]. In this paper we propose a summarization approach for Persian text that exploits statistical, conceptual, and structural features of the document.

2 Proposed Summarization Approach

To the best of our knowledge, no notable Persian summarizer considers the structure of the document along with its conceptual properties. In this paper we present an automatic text summarizer that takes statistical, conceptual, and structural features of the text into account. The proposed approach uses a lexical database to determine the relationships between words as the conceptual feature of the text. It consists of three steps: (1) preprocessing of the text, (2) text interpretation, and (3) summary generation, described below.
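The structural feature is captured during interpretation (Section 2.2) as a fractal tree of nested document blocks. As a forward-looking illustration, the following Python sketch shows one possible representation of such a tree node; the class name, fields, and helper method are our own assumptions, not the authors' implementation, and the later sketches in this section build on it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """One node of the document's fractal tree (chapter, section, paragraph, or sentence).

    Illustrative only; the paper does not prescribe a concrete data structure.
    """
    level: str                                      # e.g. "chapter", "section", "paragraph", "sentence"
    text: str = ""                                  # leaf nodes (sentences) hold the raw text
    children: List["Block"] = field(default_factory=list)
    score: float = 0.0                              # filled in during structural weighting (Section 2.3.2)

    def sentences(self) -> List["Block"]:
        """Return all sentence-level descendants, in document order."""
        if self.level == "sentence":
            return [self]
        found: List["Block"] = []
        for child in self.children:
            found.extend(child.sentences())
        return found
```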


2.1 Preprocessing

In order to reduce dimensionality, the original text must be preprocessed [5]. First, the text is segmented in order to detect sentence boundaries [6]. Then the stop words are eliminated. After that, each word is lemmatized using the method proposed in [7], and inflected word forms are unified. The whole text is thereby converted into a uniform representation on which text processing can be applied.

2.2 Interpretation

In this step, the whole document is scanned and the structure of the text, such as chapters, sections, paragraphs, and sentences, is extracted by constructing the corresponding fractal tree of the document. If the document does not have any structure, nothing is done in this phase. Then each word is looked up in a Persian lexical database in order to extract the relations (e.g., synonymy, hypernymy, and hyponymy) between words.

2.3 Statistical Weighting

The statistical score of each term Ti is calculated by a modified version of entropy, as in Equation 1:

W_i = 1 + \frac{1}{\log M} \sum_{r=1}^{M} \frac{tf_{ir}}{f_i} \log \frac{tf_{ir}}{f_i}    (1)

where tf_ir is the frequency of Ti in block r, f_i is the total frequency of Ti in the whole document, and M is the number of blocks in the document. If the document does not have any structure, the term frequency of each word in the whole document is used instead.
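As a concrete reading of Equation 1, the following sketch computes the entropy-based weight of every term from per-block token lists; the function name and the fallback for unstructured documents are our assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def entropy_weights(blocks: List[List[str]]) -> Dict[str, float]:
    """Compute an entropy-based statistical weight for every term (Equation 1).

    blocks: one list of lemmatized, stop-word-free tokens per document block.
    Returns a mapping term -> W_i = 1 + (1/log M) * sum_r (tf_ir/f_i) * log(tf_ir/f_i).
    """
    M = len(blocks)
    per_block = [Counter(block) for block in blocks]
    totals: Counter = Counter()
    for counts in per_block:
        totals.update(counts)

    weights: Dict[str, float] = {}
    for term, f_i in totals.items():
        if M <= 1:
            # Unstructured document: fall back to plain term frequency.
            weights[term] = float(f_i)
            continue
        entropy = 0.0
        for counts in per_block:
            tf_ir = counts.get(term, 0)
            if tf_ir > 0:
                p = tf_ir / f_i
                entropy += p * math.log(p)
        weights[term] = 1.0 + entropy / math.log(M)
    return weights
```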

2.3.1 Conceptual Weighting

After calculating the statistical weight of each term, we update the weight of each term in each block with the weights of the terms in its lexical chain, according to the type of lexical relation. If two terms are synonyms, the weight of each is increased by the full weight of the other. If two related terms are not synonyms but appear in the same block, the weight of each term is increased by 0.7 of the weight of the other; if they appear in different blocks, the weight is increased by half of the weight of the other word. Equation 2 summarizes the conceptual weighting of terms:

W'_{i,r} = W_{i,r} + \sum_{T_j \in Chain(T_i)} \alpha_{ij} W_j, \quad
\alpha_{ij} = \begin{cases} 1 & \text{if } T_i, T_j \text{ are synonyms} \\ 0.7 & \text{if related and in the same block} \\ 0.5 & \text{if related and in different blocks} \end{cases}    (2)
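A minimal sketch of this update rule follows; the representation of lexical relations as (term, term, relation-type) triples and the single-block-per-term simplification are our own assumptions, while the 1.0 / 0.7 / 0.5 factors come from the description above.

```python
from typing import Dict, List, Tuple


def conceptual_update(
    block_weights: List[Dict[str, float]],
    relations: List[Tuple[str, str, str]],
    term_block: Dict[str, int],
) -> List[Dict[str, float]]:
    """Update per-block term weights using lexical relations (Equation 2).

    block_weights: statistical weight of each term in each block.
    relations: (term_a, term_b, relation_type) triples from the lexical database,
               e.g. ("car", "automobile", "synonym").
    term_block: block index of each term (simplification: one block per term).
    """
    updated = [dict(w) for w in block_weights]
    for term_a, term_b, rel in relations:
        r_a, r_b = term_block[term_a], term_block[term_b]
        # Read the original (pre-update) weights so the result is order-independent.
        w_a = block_weights[r_a].get(term_a, 0.0)
        w_b = block_weights[r_b].get(term_b, 0.0)
        if rel == "synonym":
            factor = 1.0          # synonyms exchange their full weight
        elif r_a == r_b:
            factor = 0.7          # related terms in the same block
        else:
            factor = 0.5          # related terms in different blocks
        updated[r_a][term_a] = updated[r_a].get(term_a, 0.0) + factor * w_b
        updated[r_b][term_b] = updated[r_b].get(term_b, 0.0) + factor * w_a
    return updated
```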

2.3.2 Structural Weighting

The next step is scoring each sentence of a block. The weight of a sentence is the sum of its words' scores divided by the total number of words in the sentence. The score of sentence k is computed as in Equation 3:

Score(S_k) = \frac{1}{|S_k|} \sum_{T_i \in S_k} W'_i    (3)

where |S_k| is the number of words in sentence S_k.
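Equation 3 is simply the mean of the conceptually updated term weights over the words of a sentence; a short sketch, with an assumed function name:

```python
from typing import Dict, List


def sentence_score(tokens: List[str], term_weights: Dict[str, float]) -> float:
    """Score a sentence as the average weight of its words (Equation 3).

    tokens: the lemmatized, stop-word-free words of one sentence.
    term_weights: conceptually updated term weights for the sentence's block.
    """
    if not tokens:
        return 0.0
    return sum(term_weights.get(t, 0.0) for t in tokens) / len(tokens)
```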


The raw score of each block, RawScore(B_j), is calculated as the sum of the scores of its sentences divided by the number of its sentences. To normalize the score, the raw score of each block is divided by the sum of the raw scores of its sibling blocks in the corresponding fractal tree of the text (Equation 4):

Score(B_j) = \frac{RawScore(B_j)}{\sum_{B_l \in Siblings(B_j)} RawScore(B_l)}    (4)
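Building on the Block sketch above, the following illustrates Equation 4: a block's raw score is the mean of its sentences' scores, normalized over its sibling blocks. The `sentence_scores` mapping, keyed by the id of each sentence object, is an assumption of this sketch.

```python
from typing import Dict


def raw_block_score(block: "Block", sentence_scores: Dict[int, float]) -> float:
    """Mean Equation-3 score of the sentences contained in a block."""
    sentences = block.sentences()
    if not sentences:
        return 0.0
    return sum(sentence_scores[id(s)] for s in sentences) / len(sentences)


def score_children(parent: "Block", sentence_scores: Dict[int, float]) -> None:
    """Normalize each child block's raw score over its siblings (Equation 4)."""
    raw = {id(child): raw_block_score(child, sentence_scores) for child in parent.children}
    total = sum(raw.values()) or 1.0   # guard against division by zero for empty parents
    for child in parent.children:
        child.score = raw[id(child)] / total
```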

Therefore, in the text interpretation step, terms are statistically weighted by the entropy metric, term weights are updated through the conceptual potential of the text, sentences are weighted based on term weights, and document blocks are weighted according to their sentences' weights and the structure of the text.

2.4 Summary Generation

We generate the summary based on fractal theory. Sentences are extracted from each block according to the importance of that block, so a more important block contributes more sentences to the summarized text. To this end, the normalized score of each block is calculated according to Equation 6. The compression ratio is variable and can be adapted to the user's request.

NormScore(B_j) = \frac{Score(B_j)}{\sum_{l} Score(B_l)}    (6)
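A sketch of the score-proportional, fractal-style selection described above: the summary length is fixed by the compression ratio, and each top-level block receives a share of that budget proportional to its normalized score. The rounding and ordering choices are our assumptions, not the authors' exact procedure.

```python
from typing import Dict


def generate_summary(root: "Block", sentence_scores: Dict[int, float],
                     compression_ratio: float = 0.3) -> str:
    """Select sentences block by block, proportionally to block importance."""
    all_sentences = root.sentences()
    budget = max(1, round(compression_ratio * len(all_sentences)))

    score_children(root, sentence_scores)            # normalized block scores (Equations 4 and 6)
    total = sum(child.score for child in root.children) or 1.0

    picked = []
    for child in root.children:
        quota = max(1, round(child.score / total * budget))
        sentences = child.sentences()
        # Keep the highest-scoring sentences of the block, then restore document order.
        best = sorted(sentences, key=lambda s: sentence_scores[id(s)], reverse=True)[:quota]
        best_ids = {id(s) for s in best}
        picked.extend(s for s in sentences if id(s) in best_ids)
    return " ".join(s.text for s in picked)
```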

3 Evaluation

The performance of a summarization technique is usually evaluated by comparing its results with manually extracted (intrinsic) summaries; however, to our knowledge there is no manually extracted summary corpus for Persian, so we employ an alternative strategy to evaluate the effectiveness of automatic summarization. We consider the abstracts of scientific and scholarly papers as ideal manual summaries. Such abstracts are written by educated authors who try to encapsulate the content of every section of the document, so they are good candidates for manual summaries. We used 100 different Persian scientific papers to construct the benchmark, taking the abstract of each paper as the ideal summary and the body of the paper (i.e., everything except the abstract, keywords, acknowledgment, and references sections) as the original text.

We compare our results with the fractal-based method proposed by Yang and Wang [8], one of the few structured summarization approaches with good results, and with our previous method [9], one of the few text summarization methods with good results that has been applied to Persian. We also compare the results with a flat summary, which is the same as our proposed method except that the whole document is considered as one block.

As shown in Table 1, if we compare the exact sentences of the abstract with the output of our approach, the precision and recall are not high. This is because a human-written abstract does not necessarily contain the exact sentences of the text. Therefore we also calculated the similarity of the extracted summaries to the abstract using the method proposed in [10]; the results are shown in Table 2. Precision and recall increase significantly in Table 2 compared to Table 1. Tables 1 and 2 also show that the precision and recall of the score-based summary, in which the importance of a block determines the number of sentences extracted from it, are better than those of the distributed summary, where an equal number of sentences is extracted from each block. This is because, in most cases, humans extract more sentences from the important blocks of a text.

Table 1. Comparative evaluation results with matching sentences

| Compression Rate                | Parameter | Our Approach | Fractal Yang | Flat Summary | Co-occurrence |
|---------------------------------|-----------|--------------|--------------|--------------|---------------|
| Distributed Structured Summary  | Precision | 0.64         | 0.61         | 0.55         | 0.45          |
| Distributed Structured Summary  | Recall    | 0.65         | 0.57         | 0.53         | 0.48          |
| Score-based Structured Summary  | Precision | 0.73         | 0.61         | 0.57         | 0.42          |
| Score-based Structured Summary  | Recall    | 0.71         | 0.60         | 0.55         | 0.40          |

Table 2. Comparative evaluation results with matching similarity

| Compression Rate                | Parameter | Our Approach | Fractal Yang | Flat Summary | Co-occurrence |
|---------------------------------|-----------|--------------|--------------|--------------|---------------|
| Distributed Structured Summary  | Precision | 0.73         | 0.67         | 0.62         | 0.55          |
| Distributed Structured Summary  | Recall    | 0.71         | 0.68         | 0.59         | 0.58          |
| Score-based Structured Summary  | Precision | 0.81         | 0.71         | 0.65         | 0.54          |
| Score-based Structured Summary  | Recall    | 0.76         | 0.70         | 0.66         | 0.50          |

References

1. Luhn, H.P.: The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development 2, 159–165 (1958)
2. Yatsko, V.A.: Special Features of the Communication Syntactical Structure of Summary Utterances. NTI 2, 1–5 (1993)
3. Barzilay, R., Elhadad, M.: Using Lexical Chains for Text Summarization. In: Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, Spain, pp. 10–17 (1997)
4. Mann, W.C., Thompson, S.A.: Rhetorical Structure Theory: A Theory of Text Organization (1987)
5. McCarty, L.T.: Deep Semantic Interpretations of Legal Texts. In: Proceedings of the 11th International Conference on Artificial Intelligence and Law, USA, pp. 217–224 (2007)
6. Reynar, J.C., Ratnaparkhi, A.: A Maximum Entropy Approach to Identifying Sentence Boundaries. In: Proceedings of the 5th Conference on Applied Natural Language Processing, pp. 16–19 (1997)
7. Kashefi, O., Mohseni, N., Minaei, B.: Optimizing Document Similarity Detection in Persian Information Retrieval. Journal of Convergence Information Technology 5, 101–106 (2010)
8. Yang, C.C., Wang, F.L.: Hierarchical Summarization of Large Documents. Journal of the American Society for Information Science and Technology 10, 888–902 (2008)
9. Zamanifar, A., Minaei-Bidgoli, B., Sharifi, M.: A New Hybrid Farsi Text Summarization Technique Based on Term Co-Occurrence and Conceptual Property of the Text. In: Proceedings of the 9th SNPD Conference, pp. 635–639. IEEE Computer Society, Thailand (2008)
10. Zamanifar, A., Minaei, B., Kashefi, O.: A New Technique for Detecting Similar Documents Based on Term Co-occurrence and Conceptual Property of the Text. In: International Conference on Digital Information Management, England, pp. 526–531 (2008)
