A language independent approach to multilingual text summarization

Alkesh Patel, Indian Institute of Information Technology, Allahabad, [email protected]

Tanveer Siddiqui, Indian Institute of Information Technology, Allahabad, [email protected]

U. S. Tiwary, Indian Institute of Information Technology, Allahabad, [email protected]

Abstract: This paper describes an efficient algorithm for language-independent, generic, extractive summarization of single documents. The algorithm is based on structural and statistical (rather than semantic) factors. Through evaluations performed on single-document summarization of English, Hindi, Gujarati and Urdu documents, we show that the method performs equally well regardless of the language. The algorithm has been applied to DUC data for English documents and to various newspaper articles for the other languages, with a corresponding stop-words list and a modified stemmer. The summarization results have been compared with the DUC 2002 data using the degree of representativeness. For the other languages as well, the degree of representativeness we obtain is highly encouraging.

Key Words: Content richness, Theme features, Sentence reference index, Partitioning, Fuzzy summary, Coherence, Degree of representativeness.

1. Introduction

Automatic summarization has been studied for over 40 years [11]. With the explosion in the quantity of on-line information in recent years, there has been a resurgence of research in this area. Information overload either leads to a significant amount of time being wasted in scanning all the information, or else useful information is missed. Automatic text summarization offers an effective solution to this problem by significantly compressing the information content.

Another advantage of text summarization is that already-read archived information can be revisited in summarized form rather than in full. Most of the work so far has focused on English and other European languages. Languages of the Indian subcontinent have received little, if any, attention, primarily because the amount of information available in non-English languages used to be small. However, the scenario is now changing, and a large amount of information has become available in various languages. The need for text summarization methods that can handle multiple languages appears to be growing. Two approaches are generally followed in automatic text summarization research: shallow sentence extraction (knowledge-poor) and the deep understand-and-generate approach [13]. Text understanding methods may produce high-quality summaries but suffer from the knowledge bottleneck problem. Further, existing natural language generation techniques are still immature and often lead to incoherence even within a sentence, making it difficult to produce high-quality abstractive summaries. Sentence extraction methods, on the other hand, are quite robust [10]. Extractive summaries, in which snippets from the source document are selected and presented to the user, are good indicators of the text, because they present the information as it exists in the source document and thus allow the user to read the between-the-lines information. Various methods of scoring the relevance of sentences or passages and of combining the scores are described in [1,8,9,17]. Generally, different languages involve different complexities in their semantics, making it harder to apply deep natural language processing. A statistical approach, on the other hand, is quite


robust and can easily be adapted to different languages. This paper presents a statistical approach to generating a generic extractive summary. Our algorithm is highly flexible and requires only a stop-words list (provided externally) and a stemmer for the language in which the documents are to be summarized. The rest of the paper is organized as follows: Section 2 describes the preprocessing step, some issues in handling different languages and our approach to tackling them. Section 3 describes how the content richness of sentences is determined, by calculating the weight of each sentence in the document. Section 4 gives the details of how the summary is generated by extracting information-rich sentences. Section 5 elaborates the post-processing that smooths the summary using a similarity measure among sentences. Section 6 gives details of the material and evaluation strategies used. Finally, Section 7 presents the results of our approach.

2. Pre-Processing

The starting step in summary generation is to scan and analyze the document for the following:

2.1 Removal of stop words: For English, a stop-words list is directly available, but for the other languages we had to build one. For that, we find the most frequent words occurring throughout the test sets. These words are then analyzed by a language expert to produce the final list of stop words. We initially prepared lists of 275 stop words for Hindi, 258 for Gujarati and 398 for Urdu, and these lists are updated as and when required.

2.2 Stemming: Stemming is applied to convert words into their corresponding roots so as to make the frequency analysis accurate. We have used Porter's algorithm for English. For the other languages, i.e. Hindi, Gujarati and Urdu, we have developed our own stemmer, which removes the dependent vowels from the ends of words.

2.3 Noun/Proper Noun Feature Vector: Nouns are considered to carry content richness, and proper nouns even more prominently so. In English, it is easy to find proper nouns simply from the capitalization of the first letter of a word. In other languages, which have no case distinction, finding proper nouns is difficult. Two obvious approaches exist: first, maintaining a list of names and checking every word against this list, using a match to increase the weight of the word; second, checking each word against the dictionary (vocabulary) of the respective language and treating words not found in the dictionary as names. Both of these approaches involve a lot of overhead and slow down the algorithm. We have used a novel way of handling this issue, with satisfactory results: we consider as names only those words which appear within single/double quotes, and we enhance the weight of the containing sentence as described in Section 3.

2.4 Document Feature Vector: Using a threshold of 30% of the highest term frequency, the resulting vector of frequent terms is called the Document Feature Vector.
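The preprocessing steps above (stop-word removal, suffix-stripping stemming, the 30% document-feature threshold and the quoted-name heuristic) can be sketched roughly as follows. This is only an illustrative Python sketch; the function names, the toy suffix-stripping stemmer and the quote-matching pattern are our own assumptions, not the exact implementation used in the paper.

```python
import re
from collections import Counter

def preprocess(text, stop_words, vowel_suffixes=()):
    """Tokenize, drop stop words and strip dependent-vowel suffixes (toy stemmer)."""
    tokens = re.findall(r"\w+", text.lower())
    stems = []
    for tok in tokens:
        if tok in stop_words:
            continue
        for suffix in vowel_suffixes:                       # crude suffix stripping
            if tok.endswith(suffix) and len(tok) > len(suffix) + 1:
                tok = tok[:-len(suffix)]
                break
        stems.append(tok)
    return stems

def document_feature_vector(stems, threshold_ratio=0.30):
    """Keep terms whose frequency is at least 30% of the highest term frequency."""
    freq = Counter(stems)
    if not freq:
        return {}
    cutoff = threshold_ratio * max(freq.values())
    return {term: count for term, count in freq.items() if count >= cutoff}

def quoted_names(text):
    """Treat words appearing inside single/double quotes as (proper) names."""
    spans = re.findall(r"[\"'\u2018\u201c](.+?)[\"'\u2019\u201d]", text)
    return {w.lower() for span in spans for w in re.findall(r"\w+", span)}
```

For Hindi, for instance, one would pass the expert-built stop-word list and the set of dependent-vowel characters as vowel_suffixes; the same functions apply unchanged to the other languages.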

2.5 Forming the Enhanced (Theme) Feature Vector:

For English: We propose an enhanced feature vector, the Theme Feature Vector, which says more about the central theme of the document. To form it, we extract frequent terms from the title and merge them with the frequent proper-nouns vector. The resulting vector is termed the Theme Feature Vector. This is a significant improvement over the title feature vector used by most researchers [3,5,18], because in many cases the title feature vector is not richly suggestive of the core idea(s) of the document, whereas the proposed Theme Feature Vector is rich in conveying the central idea(s) of the document. Examples substantiating this hypothesis are presented in Table-1 below.


Table-1: Title Feature Vector vs. Theme Feature Vector (For English Documents)

Doc. Ref.: AP8809120095
Title: Gilbert Reaches Jamaican Capital With 110 Mph Winds
Title Feature Vector: Gilbert, Jamaican, Mph, Winds
Theme Feature Vector: Gilbert, mph, wind, Jamaica, Cuba, Kingston, Cayman, Hurricane, Republic, Dominican, Haiti, Caribbean, Forecasters, Miami, Rico, Puerto, Latina, Prensa, Sunday, Grand, Guantanamo, Havana, Islands, Center, National, Sheets

Doc. Ref.: FTG44-16774
Title: Judges split as Kelman wins Booker
Title Feature Vector: Judge, split, wins, Booker
Theme Feature Vector: Judge, Kelman, Booker, James, Wood, Bayley, Scottish, Late

Doc. Ref.: LA1112890035
Title: A STRATEGY THAT BEARS REPEATING
Title Feature Vector: STRATEGY, BEARS, REPEATING (title is not suggestive of theme)
Theme Feature Vector: Quayle, Bush

Doc. Ref.: WSJ8810210008
Title: Slimy Portrait of an Ex-Beatle
Title Feature Vector: Slimy, Portrait, Ex-Beatle
Theme Feature Vector: Lennon, Goldman, John

For Other Languages: Forming the theme feature vector for documents that are not in English is somewhat challenging, as we have no direct access to the proper nouns in the document. However, as said earlier, we consider names appearing within quotes as special nouns and include them in the theme feature vector. Taking only such special nouns as theme words, however, may lead to the following two problems:

1. Documents that contain no quoted names may have an empty theme feature vector.
2. Even the theme feature vector of documents containing quoted names may not be sufficient or reliable.

We have handled these problems by applying a threshold λ1 on the document feature vector and including its most frequent terms in the theme feature vector. Title words are included in the theme feature vector only if the title word appears in the document feature vector with a frequency above a threshold λ2. Initially, we have considered the following criteria:

λ1 = 50% of the highest term frequency
λ2 = max of [25% of the highest term frequency, 2]

Some examples supporting this approach are shown in the following tables.

Table-2: Title Feature Vector vs. Theme Feature Vector (For Hindi Documents)

Doc. Ref.: Dainikjag1_29_9
Title: दंगे रोकने के उपाय
Title Feature Vector: दंगे, रोकने, उपाय
Theme Feature Vector: दंगे, स्थल, धार्मिक, लाउडस्पीकर, प्रयोग, समिति, समुदाय, वोट, उपाय

Doc. Ref.: Dainikjag1_26_9
Title: मुशर्रफ की लफ्फाजी
Title Feature Vector: मुशर्रफ, लफ्फाजी
Theme Feature Vector: मुशर्रफ, परवेज

Doc. Ref.: Loktej1_26_9
Title: बदलेगा शिक्षा परिसर का माहौल
Title Feature Vector: बदलेगा, माहौल, शिक्षा
Theme Feature Vector: छात्र, चुनाव, परिसर

Doc. Ref.: Dainikjag5_29_9
Title: मानसिकता बदले
Title Feature Vector: मानसिकता, बदले
Theme Feature Vector: मानसिकता, बदले, विवाह, बेटी, पिता, अंतरजातीय, प्रोत्साहन, गांव, युगल


Table-3: Title Feature Vector vs. Theme Feature Vector (For Gujarati Documents)

Doc. Ref.: Article10
Title: મને પ્રભાવિત કરનારાં પુસ્તકો
Title Feature Vector: પ્રભાવિત, પુસ્તક
Theme Feature Vector: પુસ્તક, અ ુ િધસ લા ટ, ઑન કૉ ોમાઈઝ, સ યા હની મયાદા, 'સાધના', હદ ુ રાનેશર ફ

Doc. Ref.: Article13
Title: હિમાલયના શેરપાઓ
Title Feature Vector: હિમાલય, શેરપા
Theme Feature Vector: શેરપા, નામચે બઝાર, ૂ ુ ં ુ ગ, મ, યેતી, નો મેન, એવરેસ્ટ

Doc. Ref.: Bhashaindia2
Title: ગુજરાતી - ુ ર દેશની ભાષા
Title Feature Vector: ગુજરાતી, ુ ર, દેશ, ભાષા
Theme Feature Vector: ગુજરાતી, ભાષા, મહાકવિ, ભરતે ર બા બ ુ લ રાસ, આ યાન, વરા ય, થાઈ, ગરબો, ફો ્ સ ગુજરાતી સભા, વન

Doc. Ref.: Bhashaindia3
Title: કો ટુ રમાં સંસ્કૃત
Title Feature Vector: કો ટુ ર, સંસ્કૃત
Theme Feature Vector: કો ટુ ર, સંસ્કૃત, એ કોડ, કોડ, ટાઈપ, ફોનેટ ક ા સ શન, ભાષા

Table-4: Title Feature Vector vs. Theme Feature Vector (For Urdu Documents)

Doc. Ref.: Bbc_10
Title: ‫اﻟﻔﺎ‬: ‫ﺟﻨﮓ ﺑﻨﺪﯼ ﮐﯽ ﻣﻌﻴﺎد ﺧﺘﻢ‬ (‫ﺣﮑﻮﻣﺖ ﻧﮯ ﺁﺳﺎم ﻣﻴﮟ‬ 13 ‫اﮔﺴﺖ ﮐﻮ ﺟﻨﮓ ﺑﻨﺪﯼ ﮐﺎ اﻋﻼن ﮐﻴﺎ ﺗﻬﺎ‬)
Title Feature Vector: ‫ﺟﻨﮓ ﺑﻨﺪﯼ‬, ‫ﻣﻌﻴﺎد‬, ‫ﺧﺘﻢ‬, ‫ﺣﮑﻮﻣﺖ‬, ‫ﺁﺳﺎم‬, ‫اﻋﻼن‬, ‫اﻟﻔﺎ‬
Theme Feature Vector: ‫ﺟﻨﮓ‬, ‫ﺑﻨﺪﯼ ﻣﻌﻴﺎد‬, ‫ﺧﺘﻢ‬, ‫ﺣﮑﻮﻣﺖ‬, ‫ﺑ ﺎت‬

Doc. Ref.: Bbc_15
Title: ‫ﮨﺮﻳﺎﻧہ‬: ‫ﺟﻨﺴﯽ ﮨﺮاس ﮐﮯ ﺧﻼف ﮐﻤﻴﭩﻴﺎں‬
Title Feature Vector: ‫ﮨﺮﻳﺎﻧہ‬, ‫ﺟﻨﺴﯽ ﮨﺮاس‬, ‫ﺧﻼف‬, ‫ﮐﻤﻴﭩﻴﺎں‬
Theme Feature Vector: ‫ﺟﻨﺴﯽ‬, ‫ﮨﺮاس‬, ‫ﺧﻼف‬, ‫ﮐﻤﻴﭩﻴﺎں‬, ‫ﮨﺮاﺳ ﺎں‬, ‫ﺗﻌﻠﻴﻤﯽ‬, ‫اداروں‬, ‫اراﮐﻴﻦ‬

Doc. Ref.: Bbc_19
Title: ‫اﻣﻴﺎں‬ 2 ‫ﻻﮐﻪ‬: ‫درﺧﻮاﺳﺘﻴﮟ‬ 1 ‫ﮐﺮوڑ‬
Title Feature Vector: ‫اﻣﻴﺎں‬, ‫ﻻﮐﻪ‬, ‫درﺧﻮاﺳﺘﻴﮟ‬, ‫ﮐﺮوڑ‬
Theme Feature Vector: ‫ﮐ ﺮوڑ‬, ‫ﺳﻴﻨﭧ‬, ‫ﮐﺎﻧﻮﻧﭧ‬, ‫ﻻﮐﻪ‬, ‫ﺗﻌ ﺪاد‬, ‫درﺧﻮاﺳﺘﯽ‬, ‫ﺗﻌ ﻞ‬, ‫ﺣﮑﻮﻣ ﺖ‬, ‫ﻣﻮﺻ ﻮل‬, ‫ﻧﻮﮐ ﺮﯼ‬, ‫ﺗﻨﺨ ﻮاﮦ‬, ‫ﺗﻘ ﺮر‬, ‫ﺳ ﻨﺪ‬

Doc. Ref.: Bbc_20
Title: ‫ﺑﮩﺎر ﻣﻴﮟ ﺑﺎرش ﮐﯽ ﺗﺒﺎﮦ ﮐﺎرﻳﺎں‬
Title Feature Vector: ‫ﺑﮩﺎر‬, ‫ﺑﺎرش‬, ‫ﺗﺒﺎﮦ ﮐﺎرﻳﺎں‬
Theme Feature Vector: ‫ﺑﮩﺎر‬, ‫اﻓﺮاد‬, ‫ﺗﺒﺎﮦ ﮐﺎرﻳﺎں‬, ‫ﺑﺎرش‬, ‫رﻳﺎﺳﺖ‬

Thus, after preprocessing, the text has been made suitable for extracting important sentences. The next step is to carry out sentence analysis and determine sentence weights.
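As a rough illustration of the theme-vector construction described in Section 2.5 for non-English documents, the following sketch combines the quoted names, the document-feature terms above λ1 and the title words above λ2. The function signature and data structures are our own illustrative assumptions, not the paper's implementation.

```python
def theme_feature_vector(doc_vector, title_terms, quoted_name_terms,
                         lambda1_ratio=0.50, lambda2_min=(0.25, 2)):
    """Illustrative Theme Feature Vector construction for non-English documents.

    doc_vector:        dict term -> frequency (the Document Feature Vector)
    title_terms:       preprocessed terms of the title
    quoted_name_terms: terms found inside single/double quotes
    """
    if not doc_vector:
        return set(quoted_name_terms)
    top = max(doc_vector.values())
    lambda1 = lambda1_ratio * top                          # 50% of highest term frequency
    lambda2 = max(lambda2_min[0] * top, lambda2_min[1])    # max(25% of highest frequency, 2)

    theme = set(quoted_name_terms)                                         # special (quoted) nouns
    theme |= {t for t, f in doc_vector.items() if f >= lambda1}            # most frequent terms
    theme |= {t for t in title_terms if doc_vector.get(t, 0) >= lambda2}   # qualifying title words
    return theme
```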

3. Sentence Analysis

3.1 Information Content of a Sentence

The degree of information content of a sentence is represented by its sentence weight, which is computed as follows:

W'(s) = \alpha \sum_{D} W_D(s) + \beta \sum_{T} W_T(s)    (3.1)

where W'(s) is an intermediate sentence weight, and W_D and W_T are the term weights of those sentence terms which belong to the document feature vector and the theme feature vector, respectively. \alpha and \beta are constants whose values depend on the language being considered for summarization. If a term is repeated more than once in a sentence, only one instance is considered when computing the intermediate sentence weight.

3.2 Sentence Reference Index

One of the problems of summary generation is the risk of extracting a sentence which is not complete by itself because it makes reference to previous sentence(s). This problem has been handled by analyzing the sentence for the presence of a certain set of terms or phrases within positional constraints [19]. Our algorithm tags the sentences that make references to previous sentences. This process is repeated recursively, updating the tag values appropriately. It thus generates what we call the Sentence Reference Index. This index is used to enhance the information content of a sentence: the intermediate sentence weight W'(s) is enhanced for those sentences whose following sentences make reference to them. This is based on the following hypothesis: if more is said (written) in the following sentence(s) about the contents of the current sentence, the content of the current sentence has a higher importance (richness). In order to calculate the Sentence Reference Index, we maintain an external table of the words that indicate a reference to a previous sentence. We have collected a list of such words for the English, Hindi, Gujarati and Urdu languages; this table can be provided for other languages as well without any change to the algorithm. The Sentence Reference Index, as mentioned above, helps in identifying the number of sentences which make (cascaded) reference to the current sentence. Thus, we can easily determine the maximum number of subsequent contiguous sentences making reference to the current sentence. This figure is used to enhance the sentence weight as follows:

W(s) = W'(s) \cdot [1 + N_r \cdot \gamma]    (3.2)

where W(s) is the net sentence weight, N_r is the maximum number of subsequent sentences making reference to the current sentence, and \gamma is the weight given to the Sentence Reference Index.
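A minimal sketch of equations (3.1) and (3.2) is given below, assuming term weights are raw frequencies and that the reference count N_r has already been computed from the Sentence Reference Index; the values of alpha, beta and gamma are placeholders, since the paper leaves them language dependent.

```python
def sentence_weight(sentence_terms, doc_vector, theme_vector,
                    alpha=1.0, beta=1.0, gamma=0.2, n_ref=0):
    """Sketch of equations (3.1) and (3.2).

    sentence_terms: preprocessed terms of the sentence
    doc_vector / theme_vector: dict term -> term weight (e.g. frequency)
    n_ref: max number of subsequent sentences referring back to this one
    """
    unique_terms = set(sentence_terms)                     # repeated terms counted once
    w_doc = sum(doc_vector[t] for t in unique_terms if t in doc_vector)
    w_theme = sum(theme_vector[t] for t in unique_terms if t in theme_vector)
    w_intermediate = alpha * w_doc + beta * w_theme        # equation (3.1)
    return w_intermediate * (1 + n_ref * gamma)            # equation (3.2)
```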

3.3 Location Feature (Partitioning Scheme)

The location feature depends on the genre of the text document and on the author's style. Therefore, it is difficult to determine the location feature of a document in general. This fact has been acknowledged by previous researchers [3, 5, 18], who have noted the importance of an automatic way of finding out which location values are useful for a specific genre of documents and how they should be combined with other features. We have tackled this complex issue using a partitioning scheme [19]. Our approach is based on the following hypotheses:

a) Important and content-rich information (from the summarization point of view) can, in general, be present anywhere in the document, irrespective of its genre. Thus, important sentences may exist at various locations.
b) The number of sentences to be drawn from a location is determined by the relative sentence weights within that location.
c) In case a document has two or more sections, each section is considered as a sub-document.

Based on the above hypotheses, we have implemented the following methodology. The whole text is partitioned into a number of parts as per the following scheme:

Level of Summary - No. of Partitions
Very Brief - 5% of [Total No. of Sentences in the Document], subject to the limits of [2, 4]. Let this value be N.
Brief - (N + 1) partitions
Short - (N + 2) partitions
Normal - (N + 3) partitions

Each partition of the document is adjusted to align with the nearest paragraph boundary wherever possible. The procedure for extracting potential sentences for the summary from these partitions is given in the next section.
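The partitioning scheme can be sketched as follows. This version splits purely by sentence count and ignores the paragraph-boundary adjustment and the sub-document handling described above, so it is an illustrative approximation rather than the authors' implementation.

```python
import math

def num_partitions(total_sentences, level="Normal"):
    """Illustrative partition counts per fuzzy summary level (Section 3.3)."""
    n = min(4, max(2, round(0.05 * total_sentences)))   # 5% of sentences, clamped to [2, 4]
    offsets = {"Very Brief": 0, "Brief": 1, "Short": 2, "Normal": 3}
    return n + offsets[level]

def partition(sentences, level="Normal"):
    """Split a sentence list into roughly equal contiguous partitions."""
    k = num_partitions(len(sentences), level)
    size = max(1, math.ceil(len(sentences) / k))
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]
```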

4. Summary Generation

A summary is a fuzzy concept rather than an exact one. We therefore provide four fuzzy levels of summary, i.e. Normal, Short, Brief, and Very Brief. The algorithm for summary generation is as follows:

Step-1: Partition the text as suggested in the previous subsection (i.e. Section 3.3).


Step-2: For the 1st partition, always select the first sentence.

Step-3: For each partition (including the first one), select the sentence(s) with the highest sentence weight in that partition. Also select sentences whose sentence weight is at least p% of that highest weight, where p is relatively low for a Normal summary and progressively higher for Short, Brief, and Very Brief summaries.

Step-4: If an extracted sentence makes reference to previous sentence(s), those previous sentences are also extracted. If the fuzzy level of the summary is Very Brief or Brief, such sentences, including the sentence extracted in Step-3, are dropped from the summary.

Step-5: Compile the extracted sentences in the same sequence as their order (position) in the original text (document).
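Putting the steps together, a rough sketch of the extraction loop is shown below. The p values per fuzzy level and the exact handling of referring sentences in Step-4 are illustrative assumptions; the paper specifies only their qualitative behaviour.

```python
import math

def generate_summary(sentences, weights, refers_back, level="Normal", p_by_level=None):
    """Sketch of the extraction steps in Section 4.

    sentences:   sentence strings in document order
    weights:     net sentence weights W(s), same order
    refers_back: bools; True if sentence i refers back to earlier sentence(s)
    p_by_level:  fraction of a partition's top weight needed for selection
                 (illustrative values; the paper only says p grows with brevity)
    """
    p_by_level = p_by_level or {"Normal": 0.5, "Short": 0.6, "Brief": 0.7, "Very Brief": 0.8}
    p = p_by_level[level]

    # Partition sentence indices into contiguous chunks (see the Section 3.3 sketch).
    n = min(4, max(2, round(0.05 * len(sentences))))
    k = n + {"Very Brief": 0, "Brief": 1, "Short": 2, "Normal": 3}[level]
    size = max(1, math.ceil(len(sentences) / k))
    parts = [list(range(i, min(i + size, len(sentences)))) for i in range(0, len(sentences), size)]

    selected = set()
    for part_no, part in enumerate(parts):
        if part_no == 0:
            selected.add(part[0])                                    # Step-2: first sentence of 1st partition
        top = max(weights[i] for i in part)
        selected |= {i for i in part if weights[i] >= p * top}       # Step-3: near-top sentences

    if level in ("Normal", "Short"):                                 # Step-4: pull in referenced context
        for i in sorted(selected):
            j = i
            while j > 0 and refers_back[j]:
                j -= 1
                selected.add(j)
    else:                                                            # Very Brief / Brief: drop referring chains
        selected = {i for i in selected if not refers_back[i]}

    return [sentences[i] for i in sorted(selected)]                  # Step-5: original document order
```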

5. Post Processing

The summary so generated may have the following problems:

a) Discontinuity: consecutive sentences in the summary have low coherence; and
b) Redundancy: consecutive sentences repeat the same thing, or a sentence contains very few words and is therefore redundant.

5.1 Handling of Low Coherence: Addition of a Sentence

Coherence between adjoining sentences, say S1 and S2, is computed as follows:

Coherence(S_1, S_2) = \frac{\sum_{(D,T)} [w_t(S_1) \cdot w_t(S_2)]}{\sqrt{\sum_{(D_1,T_1)} [w_t(S_1)]^2} \cdot \sqrt{\sum_{(D_2,T_2)} [w_t(S_2)]^2}}    (5.1)

where (Di, Ti) denotes the union of the terms of the Document Feature Vector (Di) and the Theme Feature Vector (Ti) of sentence Si. Let D = D1 ∪ D2 and T = T1 ∪ T2, where D1, D2 are the Document Feature Vectors of sentences S1 and S2, respectively, and T1, T2 are their Theme Feature Vectors. w_t(S1) and w_t(S2) are the weights of the terms of the Document Feature Vector (D) and the Theme Feature Vector (T) for sentence S1 and sentence S2, respectively.

If Coherence(S1, S2) is low (i.e. less than the threshold value), the intermediate sentence between S1 and S2 with relatively higher weight is considered. Let that sentence be Si. Next, Coherence(S1, Si) and Coherence(Si, S2) are computed. If these values are higher than the threshold value, the sentence Si is included.

In case both these coherence values are lower than the threshold value, the coherence between S1 and S3 (i.e. the sentence next to S2) as well as coherence between S0 (i.e. the sentence before S1) and S2 are computed. If Coherence (S1, S3) is above the threshold value, sentence S2 is dropped. Similarly, if Coherence (S0, S2) is above the threshold value, sentence S1 is dropped. Following the above procedure recursively, high coherence of summary sentences is achieved.
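A small sketch of the coherence computation, following equation (5.1) as reconstructed above with cosine-style normalization, is shown below; the term-weight dictionaries are assumed to contain only the sentence terms that occur in the document/theme feature vectors.

```python
import math

def coherence(s1_weights, s2_weights):
    """Cosine-style coherence between two sentences (equation 5.1 sketch).

    s1_weights / s2_weights: dict term -> weight for each sentence.
    """
    shared = set(s1_weights) & set(s2_weights)
    numerator = sum(s1_weights[t] * s2_weights[t] for t in shared)
    norm1 = math.sqrt(sum(w * w for w in s1_weights.values()))
    norm2 = math.sqrt(sum(w * w for w in s2_weights.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return numerator / (norm1 * norm2)
```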

5.2 Handling of Redundancy: Dropping a Sentence

The degree of 'new' information content of S1 with respect to S2 will be low only if the similarity between S1 and S2 with respect to the Document Feature and Theme Feature terms is high, and the similarity between S1 and S2 with respect to the other terms (excluding stop words) is also high.


Sim(S_1, S_2 : D, T) = \frac{\sum_{(D,T)} [w_t(S_1) \cdot w_t(S_2)]}{\sqrt{\sum_{(D_1,T_1)} [w_t(S_1)]^2} \cdot \sqrt{\sum_{(D_2,T_2)} [w_t(S_2)]^2}}    (5.2)

Sim(S_1, S_2 : \neg D, \neg T) = \frac{\sum_{(\neg D, \neg T)} [w_t(S_1) \cdot w_t(S_2)]}{\sqrt{\sum_{(\neg D_1 \wedge \neg T_1)} [w_t(S_1)]^2} \cdot \sqrt{\sum_{(\neg D_2 \wedge \neg T_2)} [w_t(S_2)]^2}}    (5.3)

If the two similarity values computed through the above equations are both high, it implies that sentence S2 does not contain significantly new information with respect to S1. In such cases, one of the two sentences, S1 or S2, is dropped based on the following criteria:

a) The sentence with the lower sentence weight has a higher potential to be dropped.
b) The sentence with the lower coherence value with its new neighbour (after dropping the other sentence) has a higher potential to be dropped.

In addition, a sentence in the summary can be dropped if its length in words is less than some threshold. In our algorithm we do the following:

c) If a sentence contains fewer than 5 words, it is dropped from the summary.
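The redundancy test of equations (5.2) and (5.3), together with criterion (c), can be sketched as follows. The similarity threshold and the uniform weighting of non-feature terms are our own assumptions; the paper fixes only the 5-word minimum.

```python
import math

def cosine(weights1, weights2):
    """Cosine similarity between two term-weight dictionaries."""
    shared = set(weights1) & set(weights2)
    num = sum(weights1[t] * weights2[t] for t in shared)
    n1 = math.sqrt(sum(w * w for w in weights1.values()))
    n2 = math.sqrt(sum(w * w for w in weights2.values()))
    return num / (n1 * n2) if n1 and n2 else 0.0

def is_redundant(s1_terms, s2_terms, feature_weights, min_words=5, sim_threshold=0.8):
    """Sketch of the redundancy test (equations 5.2 and 5.3 plus criterion c).

    feature_weights: term -> weight for terms in the document/theme vectors.
    Stop words are assumed to have been removed from the term lists already.
    """
    def split(terms):
        in_feat = {t: feature_weights[t] for t in terms if t in feature_weights}
        out_feat = {t: 1.0 for t in terms if t not in feature_weights}  # uniform weight for other terms
        return in_feat, out_feat

    if len(s2_terms) < min_words:                 # criterion (c): very short sentence
        return True
    f1, o1 = split(s1_terms)
    f2, o2 = split(s2_terms)
    return (cosine(f1, f2) >= sim_threshold and   # Sim over feature-vector terms (5.2)
            cosine(o1, o2) >= sim_threshold)      # Sim over the remaining terms (5.3)
```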

6. Evaluation: Degree of Representativeness

Evaluation has been a debatable issue for a long time. Two main methods are widely used. Intrinsic evaluations test the system in and of itself; extrinsic evaluations test the system in relation to some other task. Two types of intrinsic evaluation are generally carried out in summarization: quality evaluation and informativeness evaluation. Quality is usually not a sufficient condition for determining whether a summary is a good summary of the source; it is possible to have beautifully written but incorrect or useless output [13]. So, evaluation in terms of informativeness is usually preferred. For a generic summary, how much information from


the source it preserves at different levels of compression is of real interest. One of the measures in the informativeness methodology for extractive summary evaluation is content-based evaluation, i.e. testing whether the contents of the summary are the most representative ones or not. This test treats the summary as a query used to extract sentences from the corresponding document. If a very high percentage of the document is extracted, it implies that the summary is highly representative.

For English documents: We have compared our results with the DUC 2002 data [4]. This data set contains documents of different categories and an extractive summary per document. We applied our algorithm to 567 different documents and created summaries at the Very Brief fuzzy level. The summaries provided by the DUC 2002 data are 100 words long, and the lengths of our Very Brief summaries range from 90 to 120 words; thus, our summaries have a length comparable to the corresponding manual summaries, ensuring a fair evaluation. We used this metric by taking the Very Brief summary as the query to extract sentences of the corresponding document; similarly, the DUC summary for that document was passed as the query to extract sentences of the original document. The results show that in 82% of the cases we obtain a better or equal degree of representativeness compared to the DUC summaries.

For other-language documents: To test the language independence of the summaries generated by our algorithm, we have tested it on 70 news articles from leading Hindi dailies, 50 articles of Gujarati literature and 75 news articles in Urdu from the BBC web site. In


almost every case, we obtain a degree of representativeness of more than 80%. These results also indicate that our summary generator is very effective.
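A sketch of the degree-of-representativeness evaluation, in which the summary is used as a query over the document's sentences, is given below. The overlap criterion and its threshold are assumptions for illustration, since the paper does not spell out the exact retrieval step.

```python
def degree_of_representativeness(summary_terms, document_sentences, match_threshold=0.1):
    """Sketch of the content-based evaluation described in Section 6.

    summary_terms:      preprocessed terms of the summary (used as the query)
    document_sentences: list of preprocessed term lists, one per document sentence
    match_threshold:    illustrative fraction of a sentence's terms that must
                        overlap with the summary for the sentence to count as extracted
    """
    query = set(summary_terms)
    extracted = 0
    for sentence_terms in document_sentences:
        terms = set(sentence_terms)
        if terms and len(terms & query) / len(terms) >= match_threshold:
            extracted += 1
    return 100.0 * extracted / max(1, len(document_sentences))
```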

7. Results

Table 5 shows the results obtained for the automatically generated summaries (100 words in length) for English, using the DUC 2002 data set of 567 documents, together with the results obtained using our summarization algorithm.

Table 5: Degree of Representativeness for English

                        No. of cases with D.R. > 60%    No. of cases with D.R. > 70%
DUC 2002 summaries      48%                             30%
Our summaries           80%                             64%
% Improvement           67%                             113%

* D.R. = Degree of Representativeness

Table 6: Comparison of results with the DUC-2002 data set (total 567 documents)

No. of documents for which our summary is better than or equal to the DUC summary: 465
No. of documents for which our summary is poorer than the DUC summary (only within 1.5% in terms of D.R.): 102
Efficiency of our algorithm: 82%

Figure 1: Comparison of the Degree of Representativeness for DUC summaries and the summaries generated by our algorithm (graph truncated to 50 of the 567 documents from the DUC-2002 data). The y-axis shows the % Degree of Representativeness, the x-axis the document number; the two plotted series are "DUC Summaries" and "Our Algorithm's Summaries".


Table 7: Degree of Representativeness (%) for other languages at different fuzzy levels

Language    Normal       Short       Brief       Very Brief
Hindi       80 to 95     75 to 90    75 to 85    70 to 85
Gujarati    75 to 95     75 to 90    75 to 90    70 to 85
Urdu        90 to 100    80 to 95    80 to 90    75 to 85

The above results show that even for languages other than English, our approach performs quite satisfactorily. These results confirm the language independence of our summarizer.

8. Conclusion

We have described an algorithm for automatically generating a generic summary of a single document. An improved method is suggested for deriving a vector representing the central idea (theme) of the document. The complexity of the location feature has been handled by partitioning the text and extracting the 'best' sentences from each partition. Sentences which are not complete by themselves lead to the inclusion of their corresponding preceding sentences, to resolve gaps in context and meaning. Summaries are generated at four fuzzy levels, viz. Normal, Short, Brief, and Very Brief. Experiments performed on standard data sets have shown that the results obtained with this method are comparable with those of state-of-the-art systems for automatic summarization, while at the same time providing the benefits of a robust, language-independent algorithm. The quality of the summaries has been tested with respect to their degree of representativeness for languages other than English, and the results are encouraging. All the summaries we tested included the important sentences. However, in some cases we found that the flow of the summarized text was not very smooth.

9. References

1. Aone, C., Okurowski, M. E., Gorlinsky, J., and Larsen, B. A scalable summarization system using robust NLP. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pages 66-73, Madrid, Spain, 1997.


2. Baxendale, P. (1958). Machine-made index for technical literature: an experiment. IBM Journal of Research and Development, 2(4).
3. Brandow, R., Mitze, K., and Rau, L. "Automatic condensation of electronic publications by sentence selection". Information Processing and Management, 31(5), 675-685, 1994.
4. DUC 2002. Document Understanding Conference 2002. http://www-nlpir.nist.gov/projects/duc/.
5. Edmundson, H. P., "New methods in automatic abstracting", Journal of the Association for Computing Machinery, 16(2), pp. 264-285, 1969.
6. Erkan, G. and Radev, D., 2004. LexPageRank: Prestige in multi-document text summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, July 2004.
7. Halliday, M. and Hasan, R., "Cohesion in Text", London, Longmans, 1996.
8. Hovy, E. H. and Lin, C.-Y. 1998. Automating Text Summarization in SUMMARIST. In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization. MIT Press.
9. Kupiec, J., Pedersen, J., and Chen, F., "A Trainable Document Summarizer". In Proceedings of ACM SIGIR'95, Seattle, WA, 1995.
10. Kupper, Saggion, et al. Intelligent Multimedia Indexing and Retrieval through Multi-source Information Extraction and Merging. IJCAI 2003.
11. Luhn, H. P. "The Automatic Creation of Literature Abstracts". IBM Journal of Research and Development, Vol. 2, No. 2, pp. 159-165, April 1958.
12. Mani, I., & Bloedorn, E., "Machine Learning of Generic and User-Focused Summarization", American Association for Artificial Intelligence, Nov. 1998.
13. Mani, I. (2001). Automatic Text Summarization. John Benjamins Publishing Company.


14. Mihalcea, R. and Tarau, P., An Algorithm for Language Independent Single and Multiple Document Summarization. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.
15. Salton, G., Allan, J., Buckley, C., and Singhal, A., "Automatic Analysis, Theme Generation and Summarization of Machine-Readable Texts", Science, 264, June 1994, pp. 1421-1426.
16. Sekine, S. & Nobata, C. (1998), "A Survey for Multi-Document Summarization".
17. Strzalkowski, T., Wang, J., and Wise, B. A Robust Practical Text Summarization. In Proceedings of the AAAI Symposium on Intelligent Text Summarization, pages 26-33, Stanford University, Stanford, California, March 1998. American Association for Artificial Intelligence.
18. Teufel, S., & Moens, M., "Sentence Extraction as a Classification Task", In ACL/EACL Workshop on "Intelligent and Scalable Text Summarization", Madrid, Spain, pp. 58-65, 1997.
19. Vij, S. K., Patel, A., et al., "Automatic Generation of Text Summary", In Proceedings of AIMS-2005, 6th International Conference, IIIM, Indore, India.
20. Zechner, K. (1995), Automatic text abstracting by selecting relevant passages. Master's thesis, Centre for Cognitive Science, University of Edinburgh.

