
International Journal on Artificial Intelligence Tools Vol. 19, No. 4 (2010) 465-486
© World Scientific Publishing Company

DOI: 10.1142/S0218213010000285

A NAÏVE BAYES CLASSIFIER FOR WEB DOCUMENT SUMMARIES CREATED BY USING WORD SIMILARITY AND SIGNIFICANT FACTORS

MARIA SOLEDAD PERA
Computer Science Department, Brigham Young University, Provo, Utah, USA
[email protected]

YIU-KAI NG*
Computer Science Department, Brigham Young University, Provo, Utah, USA
[email protected]

*Corresponding author

Text classification categorizes web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming, and users are still required to spend a considerable amount of time scanning through the classified web documents to identify the ones with contents that satisfy their information needs. In solving this problem, we first introduce CorSum, an extractive single-document summarization approach, which is simple and effective in performing the summarization task, since it relies only on word similarity to generate high-quality summaries. We further enhance CorSum by considering the significance factor of sentences in documents, in addition to using word-correlation factors, for document summarization. We denote the enhanced approach CorSum-SF and use the summaries generated by CorSum-SF to train a Multinomial Naïve Bayes classifier for categorizing web document summaries into predefined classes. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum-SF outperforms other extractive summarization methods, and that classification time is significantly reduced, while accuracy remains comparable, when using CorSum-SF generated summaries instead of the entire documents. More importantly, browsing summaries, instead of entire documents, which are assigned to predefined categories, facilitates the information search process on the Web.

Keywords: Multinomial Naïve Bayes classifier; sentence-based summaries; word correlation; significant factors.

1. Introduction

The rapid growth of the Web has dramatically increased the complexity of web query processing, and locating relevant documents from a huge text repository has always been a challenge for web search engine designers, as well as web users. Even though popular web search engines are efficient and effective in retrieving


information from the Web, users must scan through the retrieved documents to determine which ones are comparatively more relevant than the others, which can be a time-consuming task. Hence, an advanced, scalable method that can identify the content of web documents promptly is needed. This task can be accomplished by creating a summary that captures the main content of a web document and reduces the time required for users to scan through the retrieved documents prior to exploring their full contents. Text classification can take advantage of summaries to further assist web users in locating desired information among categorized web documents with minimized effort, such as RSS news feeds, since classified web document summaries on various topics provide a quick reference guide. Moreover, classifying summaries requires only a fraction of the processing time compared with classifying the entire documents, due to the reduced size of the summaries.

In this paper, we first introduce a simple and yet effective summarization technique, called CorSum, which is based on word similarity to extract sentences from a web document D that are representative of the content of D. Thereafter, we enhance CorSum by considering the significance factor (Ref. 20) of sentences in documents. The enhanced approach, called CorSum-SF, combines (i) the ranking score of each sentence S in a document D computed by using word-correlation (i.e., similarity) factors, which establishes how similar S is with respect to the remaining sentences in D, and (ii) the significance factor of S, which determines the proportion of significant, i.e., representative, words in S. The sentences in D with the highest combined ranking score and significance factor yield the summary of D. As opposed to other summarization approaches, CorSum-SF does not require any training, is computationally inexpensive, and relies solely on word/sentence similarity and the significance factor to generate an accurate summary. In addition, CorSum-SF is easy to implement, since the word-correlation factors are precomputed and only addition, multiplication, and division are involved in computing sentence significance factors. CorSum-SF is not domain specific and thus can be applied to generate summaries of documents with diverse structures and contents. Moreover, CorSum-SF can summarize multi-lingual document collections, if the proper word-correlation factors are available. We train a Naïve Bayes classifier using CorSum-SF generated summaries to facilitate the classification of their corresponding web documents, which speeds up the classification process with high accuracy.

The remainder of this paper is organized as follows. In Section 2, we discuss existing summarization and text classification approaches. In Section 3, we present CorSum, introduce its enhanced version, CorSum-SF, and detail the classifier we adopt for text categorization. In Section 4, we evaluate the performance of CorSum-SF, in addition to CorSum, and compare their performance with other summarization approaches. We also validate the efficiency and effectiveness of the chosen classifier using CorSum-SF generated summaries for classification. In Section 5, we give a conclusion and directions for future work.


2. Related Work

A significant number of text summarization methods have been proposed in the past that apply different methodologies to perform the summarization task. As shown in Ref. 28, extraction, fusion, abstraction, and compression are four well-established summarization techniques. Extraction identifies representative sections of a text T which yield the summary of T, whereas fusion extracts and combines sentences in T to create revised sentences as the summary of T. Abstraction summarizes the content of T with new concise sentences, and compression removes sections in T that are considered relatively unimportant and retains the remaining sentences as the summary of T. In addition, existing summarization approaches can be grouped into two categories: single-document and multi-document. Single-document summarization captures the main content of a document as its summary, whereas multi-document summarization creates a single summary for a document collection which describes the overall content of the collection coherently.6 Radev et al.28 claim that the extraction method for summarizing single documents is the most promising one, which is the strategy we adopt in our proposed summarization approach and is the focus of the subsequent discussion.

Zhang and Li35 introduce a summarization approach based on sentence clustering. Given a document D, Zhang and Li cluster similar sentences in D using the k-means clustering algorithm. The similarity between any two sentences, s1 and s2, is computed according to (i) the occurrence of the same words in s1 and s2, (ii) the similarity between sequences of words in s1 and s2, and (iii) the semantic similarity between s1 and s2, which is determined using HowNet, an online knowledge base for Chinese and English words. Thereafter, the sentence in each cluster with the highest overall similarity with respect to the remaining sentences in the same cluster is selected to create the summary of D. Shen et al.,31 on the other hand, treat summarization as a sequence labeling problem and rely on the Conditional Random Fields algorithm to label sentences in a document D with a "1" or a "0" to indicate whether a sentence should or should not be included in the summary of D, respectively. Even though CorSum-SF is an extractive, sentence-based summarization approach, CorSum-SF neither clusters sentences nor trains labeling algorithms prior to selecting the sentences that should be included in a document summary, which reduces the summarization processing time.

Fattah and Ren8 develop a machine learning approach for text summarization that considers various features in a document, which include the (i) sentence position, (ii) positive/negative keywords in a sentence, (iii) similarity between a sentence and the document title, (iv) inclusion of numerical data in a sentence, (v) inclusion of named entities in a sentence, and (vi) sentence length. Each feature is used for training different machine learning algorithms, such as the genetic algorithm, neural networks, probabilistic neural networks, and the Gaussian Mixture Model, to generate text summaries. Based on the conducted empirical studies, Fattah and Ren claim that the Gaussian Mixture Model-based approach is the most effective algorithm for creating extractive summaries.


Antiqueira et al.1 adopt metrics and concepts from complex networks to select the sentences in a document D that compose the summary of D. The authors represent D as a graph in which sentences in D are denoted as nodes and sentences that share common nouns are connected by edges. Thereafter, nodes in the graph are ranked according to various network measurements, such as the (i) number of nodes a particular node is connected to, (ii) length of the shortest path between any two nodes, (iii) locality index, which identifies central nodes in partition groups of nodes, and (iv) network modularity, which measures the proportion of edges connecting intra-community nodes, where communities are defined in Ref. 5 as sets of nodes that are highly interconnected, whereas other sets are scarcely connected to each other. The highest ranked nodes are chosen to create the corresponding summary. Unlike the approach in Ref. 8, which depends on sentence-based features to train the proposed summarizer, or the approach in Ref. 1, which relies on network-based features to create the summary of a document, CorSum-SF relies solely on the word-correlation factors among (the words in) sentences within a document D and the sentence significance factor to determine the sentences that should be included in the summary of D.

The authors of Ref. 15 claim that many existing methodologies treat summarization as a binary classification problem (i.e., sentences are either included in or excluded from a summary), which generates redundant, unbalanced, and low-recall summaries. In solving this problem, Li et al.15 propose a Support Vector Machine summarization method in which summaries (i) are diverse, i.e., they include as few redundant sentences as possible, (ii) contain (most of) the important aspects of the corresponding documents, and (iii) are balanced, i.e., they emphasize different aspects of the corresponding documents. By selecting the most representative sentences, CorSum-SF, similar to the approach in Ref. 15, creates summaries that are balanced and diverse, but it does not require previous training in generating summaries, and hence is less computationally expensive than the summarization approach in Ref. 15.

Besides using text summarization for capturing the main content of web documents, constructed summaries can be further classified. Yang and Pedersen34 present several feature-selection approaches for text classification and compare the performance of two classifiers, K Nearest Neighbor (KNN) and Linear Least Squares Fit mapping (LLSF). The classifiers compute the confidence score CS of a document D in each category. CS in KNN is determined by the degrees of similarity of D with respect to the K nearest training documents in each category, whereas LLSF calculates the CS of D in each category using a regression model based on the words in D. McCallum and Nigam22 discuss the differences between the Multi-variate Bernoulli and Multinomial Naïve Bayes classifiers. The Multi-variate Bernoulli classifier represents a document D using binary attributes, indicating the absence and occurrence



of words in D, whereas the Multinomial classifier captures the content of D by the frequency of occurrence of each word in D. Regardless of the classifier, the classification is performed by computing the posterior probability of each class given an unlabeled document D and assigning D to the class with the highest probability. Nigam et al.25 rely on Maximum Entropy to perform text classification. Maximum Entropy, which estimates probability distributions of data on a class-by-class basis, represents a document D by its word count features. Maximum Entropy assigns D to the unique class whose word occurrence distribution is most similar to the frequency of occurrence of words in D. Using the Dempster-Shafer Theory (DST), the authors of Ref. 29 combine the outputs of several sub-classifiers (trained on different feature sets extracted from the same document collection C) and determine to which class a document in C should be assigned. As claimed by the authors, sub-classifiers reduce computational time without sacrificing classification performance, and DST fusion outperforms traditional fusion methods, such as plain voting and majority weighted voting. Unlike the methodologies adopted in Refs. 25, 29 and 34 in assigning (summarized) documents to their corresponding class, we depend on the Multinomial Naïve Bayes classifier, which is one of the most widely-used and effective text classifiers.22

3. Summarization and Classification

In this section, we first discuss the overall design of CorSum, which uses the precomputed word-correlation factors to identify representative sentences in a document D to create the summary of D. Thereafter, we introduce CorSum-SF, which relies on sentence significance factors, in addition to word similarity, to improve the quality of CorSum generated summaries. Furthermore, we present a Multinomial Naïve Bayes classifier, which we adopt for classifying CorSum-SF generated summaries of web documents in large collections.

3.1. CorSum, a summarization approach

Mihalcea and Tarau24 propose a sentence-extraction summarization method that applies two graph-based ranking algorithms, PageRank3 and HITS,11 to determine the rank value of a vertex (i.e., a sentence in a document D) in a graph based on the global information computed using the entire graph, i.e., the similarity of sentence pairs in D, which is calculated as a function of content overlap. Thereafter, sentences are sorted in descending order of their rank values, and the top-ranked sentences are included in the summary of D. CorSum also depends on ranked sentences, but the rank values are computed according to (i) the word-correlation factors introduced in Ref. 12, and (ii) the degrees of similarity of sentences. The highly ranked sentences are the most representative sentences of D, which form the summary of D.


3.1.1. Word-correlation factors and sentence similarity

The word-correlation matrix M introduced in Ref. 12, which includes the correlation factors of non-stop, stemmed words, is a 54,625 × 54,625 symmetric matrix. (Stopwords are commonly occurring words, such as articles, prepositions, and conjunctions, which are poor discriminators in representing the content of a sentence or document, whereas stemmed words are words reduced to their grammatical roots. From now on, unless stated otherwise, whenever we refer to words, we mean non-stop, stemmed words.) The correlation factor of any two words wi and wj, which indicates how closely related wi and wj are semantically, is computed based on the (i) frequency of co-occurrence and (ii) relative distance of wi and wj in each document D in a collection, and is defined as follows:

    wcf(w_i, w_j) = \sum_{x \in V(w_i)} \sum_{y \in V(w_j)} \frac{1}{d(x, y)}    (1)

where d(x, y) denotes the distance (i.e., the number of words) between x and y in D plus 1, and V(w_i) (V(w_j), respectively) is the set of words that includes w_i (w_j, respectively) and its stem variations in D. The word-correlation factors in M were computed using the Wikipedia Database Dump (http://en.wikipedia.org/wiki/Wikipedia:Database_download), which consists of 930,000 documents written by more than 89,000 authors on various topics, and hence is diverse in content and writing styles and is an ideal choice for measuring word similarity.

Using the word-correlation factors in M, we compute the degree of similarity of any two sentences S1 and S2 by adding the word-correlation factors of each word in S1 with every word in S2 as follows:

    Sim(S_1, S_2) = \sum_{i=1}^{n} \sum_{j=1}^{m} wcf(w_i, w_j)    (2)

where w_i (w_j, respectively) is a word in S_1 (S_2, respectively), n (m, respectively) is the number of words in S_1 (S_2, respectively), and wcf(w_i, w_j) is defined in Equation 1.

3.1.2. Most representative sentences

CorSum identifies sentences in a document D that most accurately represent the content of D during the summarization process. To determine which sentences should be included in the summary of D, CorSum computes the overall similarity of each sentence S_i in D, denoted OS(S_i), with respect to the other sentences in D as

    OS(S_i) = \sum_{j=1, j \neq i}^{n} Sim(S_i, S_j)    (3)

where n is the number of sentences in D, S_j is a sentence in D, and Sim(S_i, S_j) is defined in Equation 2.
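To make Equations 2 and 3 concrete, the following Python sketch (our illustration, not code from the paper) computes the pairwise and overall sentence similarities. It assumes the precomputed word-correlation factors of Equation 1 are available as a dictionary `wcf` keyed by word pairs, and that sentences are already given as lists of non-stop, stemmed words.

```python
# Illustrative sketch of Equations 2 and 3. `wcf` is assumed to be a
# dictionary {(word_i, word_j): factor} holding the precomputed
# word-correlation factors of Equation 1.

def sim(s1, s2, wcf):
    """Equation 2: sum the correlation factors of every word pair."""
    return sum(wcf.get((wi, wj), 0.0) for wi in s1 for wj in s2)

def overall_similarity(sentences, wcf):
    """Equation 3: OS(S_i) is the sum of Sim(S_i, S_j) over all j != i."""
    return [sum(sim(si, sj, wcf)
                for j, sj in enumerate(sentences) if j != i)
            for i, si in enumerate(sentences)]
```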


Fig. 1. A document D from the DUC-2002 dataset; the most representative sentences, extracted by CorSum to form the summary of D, are highlighted.

We rely on the Odds ratio,10 which is defined as the ratio of the probability (p) that an event occurs to the probability (1 - p) that it does not, i.e., Odds = p / (1 - p), to compute the Ranking value of S_i in D. We treat OS(S_i) as the positive evidence of S_i in representing the content of D. The Ranking value of S_i determines the content significance of S_i in D, such that the higher the Ranking value of S_i is, the more significant (content-wise) S_i is in D, and is defined as

    Ranking(S_i) = \frac{OS(S_i)}{1 - OS(S_i)}    (4)

Having computed the Ranking value of each sentence in D, CorSum chooses the top N (≥ 1) ranked sentences in D as the most representative sentences of D to create the summary of D. In establishing the proper value of N, i.e., the number of representative sentences to be included in a summary, we follow the results of a study conducted in Ref. 23 on two different popular datasets, i.e., Reuters-21578 (http://archive.ics.uci.edu/ml/databases/reuters21578/) and WebKB (http://www.cs.umd.edu/~sen/lbc-proj/data/WebKB.tgz), using different classifiers, which concludes that, in general, a summary with six sentences can accurately represent the overall content of a document. More importantly, Mihalcea and Hassan23 show in their study that using summaries with six sentences for clustering/classification achieves the highest accuracy, an assumption we adopt in designing CorSum. If a document D contains fewer than six sentences, CorSum includes the entire content of D as the summary of D.

Example 1. Figure 1 shows a document D from the DUC-2002 dataset (to be introduced in Section 4.1) in which the six most representative sentences (i.e., the sentences whose Ranking values are higher than those of the remaining ones) are highlighted, whereas Table 1 shows the Ranking values of the sentences in D, and Table 2 includes the degrees of similarity of the highest ranked (i.e., Sentence 10) and lowest ranked (i.e., Sentence 11) sentences with the remaining sentences in D.


Table 1. Ranking values of the sentences in the document shown in Figure 1.

    Sentence    Ranking value        Sentence    Ranking value
    1           -1.100               7           -1.047
    2           -1.050               8           -1.090
    3           -1.050               9           -1.055
    4           -1.055               10          -1.045
    5           -1.070               11          -1.142
    6           -1.083

Table 2. The degrees of similarity of Sentences 10 (the highest ranked) and 11 (the lowest ranked) with respect to the others in the document shown in Figure 1.

            Si = 10                          Si = 11
    Sj      Sim(Si, Sj)              Sj      Sim(Si, Sj)
    1       0.000007                 1       1.000001
    2       2.000008                 2       1.000003
    3       6.000005                 3       0.000001
    4       3.000000                 4       2.000000
    5       2.000003                 5       0.000001
    6       0.000007                 6       0.000003
    7       3.000004                 7       1.000001
    8       2.000002                 8       0.000001
    9       3.000001                 9       1.000001
    11      1.000000                 10      1.000000
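Putting Equations 3 and 4 together, the extraction step of CorSum can be sketched as follows. This is our own illustration, not the authors' code: `preprocess` is a hypothetical helper standing in for stopword removal and stemming, and `overall_similarity` is the helper sketched in Section 3.1.2.

```python
def corsum_summary(sentences, wcf, n=6):
    """Rank sentences by Equation 4 and extract the top n as the summary.

    `sentences` are the original sentence strings of a document D;
    `preprocess` (hypothetical) turns a sentence into its list of
    non-stop, stemmed words.
    """
    if len(sentences) <= n:              # short documents are kept whole
        return list(sentences)
    words = [preprocess(s) for s in sentences]
    os_scores = overall_similarity(words, wcf)        # Equation 3
    # Equation 4 (odds ratio); assumes OS(S_i) != 1 to avoid division by zero
    ranking = [os / (1.0 - os) for os in os_scores]
    top = sorted(range(len(sentences)),
                 key=lambda i: ranking[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]        # keep document order
```

Note that when OS(S_i) > 1 the odds ratio is negative but still increases monotonically with OS(S_i), which is consistent with the Ranking values in Table 1 all lying slightly below -1.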

3.2. Enhancing CorSum using significance factors

To further enhance the performance of CorSum, in terms of generating summaries that accurately reflect the content of the corresponding documents, we rely on the sentence significance factor, which was first introduced in Ref. 20 and employed as a summarization strategy in Refs. 4 and 30. We have considered the significance factor, since we embrace its assumption that sentences in a document D that contain frequently occurring words in D tend to describe the content of D. The significance factor of a sentence S is calculated based on the occurrence of significant words in S, i.e., words that are not stopwords and whose frequency of occurrence in a document falls between predefined high-frequency and low-frequency cutoff values.20 As defined in Ref. 20, the significance factor of a sentence S in D, denoted SF(S), is computed as

    SF(S) = \frac{|Significant\;Words|^2}{|S|}    (5)

where |S| is the number of non-stop words in S and |Significant Words| is the number of significant words in S. A word w in D is a significant word if

    f_{D,w} \geq \begin{cases} 7 - 0.1 \times (25 - S_D), & \text{if } S_D < 25 \\ 7, & \text{if } 25 \leq S_D \leq 40 \\ 7 + 0.1 \times (S_D - 40), & \text{otherwise} \end{cases}    (6)


where f_{D,w} is the frequency of occurrence of w in D, S_D is the number of sentences in D, 25 is the low-frequency cutoff value, and 40 is the high-frequency cutoff value.

Initial empirical studies conducted using the DUC-2002 dataset (to be introduced in Section 4.1) show that the significance factor by itself is not an effective summarization strategy. For this reason, we have developed CorSum-SF, an enhanced summarization strategy that considers both the significance factor and the ranking score of a sentence (as defined in Equation 4) in creating the summary of a document. CorSum-SF combines the significance factor, denoted SF(S), and the ranking value, denoted Ranking(S), of a sentence S in a document D to measure the relative degree of representativeness of S with respect to the content of D using the Stanford Certainty Factor,19 denoted SCF. The formal definition of SCF is given as follows:

    SCF(C) = \frac{SCF(R_1) + SCF(R_2)}{1 - Min(SCF(R_1), SCF(R_2))}    (7)

where R_1 and R_2 are two hypotheses that reach the same conclusion C, and SCF(C) is the Stanford Certainty Factor (i.e., confidence measure) of C, which is a monotonically increasing (decreasing) function of the combined assumptions for computing the confidence measure of C. Based on SCF, combining SF(S) and Ranking(S) yields our content reflectivity score of S, denoted Con_Ref(S), as defined below:

    Con\_Ref(S) = \frac{Ranking(S) + SF(S)}{1 - Min(Ranking(S), SF(S))}    (8)
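The following sketch (ours, under the same assumptions as the earlier ones) computes the significance factor of Equations 5 and 6 and the combined score of Equation 8; `doc_freq` is a hypothetical word-frequency table for the document.

```python
def significance_factor(sentence, doc_freq, num_sentences):
    """Equations 5 and 6. `sentence` is a list of non-stop words and
    `doc_freq` maps each word to its frequency of occurrence in D."""
    s_d = num_sentences
    if s_d < 25:                        # low-frequency region
        cutoff = 7 - 0.1 * (25 - s_d)
    elif s_d <= 40:
        cutoff = 7
    else:                               # high-frequency region
        cutoff = 7 + 0.1 * (s_d - 40)
    significant = [w for w in sentence if doc_freq.get(w, 0) >= cutoff]
    return (len(significant) ** 2) / len(sentence) if sentence else 0.0

def con_ref(ranking, sf):
    """Equation 8: Stanford-Certainty-Factor combination of the two scores;
    assumes min(ranking, sf) != 1 to avoid division by zero."""
    return (ranking + sf) / (1.0 - min(ranking, sf))
```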

Using the Con_Ref value of each sentence in D, CorSum-SF selects the top six sentences with the highest Con_Ref values (as the most representative sentences in D) to create the corresponding summary of D.

3.3. The Naïve Bayes classifier

To verify that classifying text documents using their summaries, instead of the entire documents, is cost-effective, we apply a Naïve Bayes classifier, which is one of the most popular text classification tools, since Naïve Bayes is simple, easy to implement, robust, highly scalable, and domain independent,13 and thus is an ideal choice for classifying summaries generated by CorSum-SF. Moreover, even though Naïve Bayes assumes mutual independence of attributes, it achieves high accuracy in text classification.22 Among the variants of the Naïve Bayes classifier, we choose the multinomial implementation,22 denoted MNB, since, as previously stated, MNB is one of the most widely-used text classifiers.13 MNB uses the frequency of word occurrence to compute the probability that a document is assigned to a particular class. During the training process, MNB first estimates the probability of a word w_t in a natural class c_j, which is based on the frequency of occurrence of w_t in each pre-classified, labeled document d_i, i.e., a summary generated by


CorSum-SF in our case, in a collection of documents CD, and is formally defined as

    P(w_t | c_j) = \frac{1 + \sum_{i=1}^{|CD|} N_{it} P(c_j | d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|CD|} N_{is} P(c_j | d_i)}    (9)

where N_{it} (N_{is}, respectively) denotes the frequency of occurrence of w_t (word w_s, respectively) in d_i, |CD| is the number of documents in CD, |V| is the number of distinct words in CD, and P(c_j | d_i) is 1 if c_j is the pre-assigned class of d_i, i.e., d_i is pre-assigned to c_j, and 0 otherwise. Having determined the required probabilities during the training step of MNB, the probability that a given (unlabeled) document d_k belongs to the class c_j, denoted P(d_k | c_j), is computed at the classification phase as

    P(d_k | c_j) = P(|d_k|) \, |d_k|! \prod_{t=1}^{|V|} \frac{P(w_t | c_j)^{N_{kt}}}{N_{kt}!}    (10)

where |d_k| denotes the number of words in d_k; |V|, N_{kt}, and P(w_t | c_j) are as defined in Equation 9; and the probability of a document d_k in CD is defined as

    P(d_k) = \sum_{j=1}^{|C|} P(c_j) P(d_k | c_j)    (11)

where |C| is the number of predefined natural classes, P(c_j) is the fraction of documents in CD that belong to c_j, which is determined at the training phase, and P(d_k | c_j) is as computed in Equation 10. In classifying d_k, i.e., a summary in our case, MNB assigns to d_k the class label c_j if the computed probability P(c_j | d_k) is the highest among the probabilities P(c_i | d_k) for all predefined natural classes c_i. P(c_j | d_k) is computed by the well-known Bayes' Theorem:

    P(c_j | d_k) = \frac{P(c_j) P(d_k | c_j)}{P(d_k)}    (12)

where P(d_k | c_j), P(d_k), and P(c_j) are as defined earlier.
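A minimal sketch of MNB training (Equation 9, with the Laplace smoothing discussed below) and classification is given next; it is our illustration, not the authors' implementation. It works in log space, where the multinomial coefficient of Equation 10 and the normalizer P(d_k) of Equation 12 are constant across classes and therefore drop out of the argmax.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels):
    """Equation 9 with Laplace smoothing; each doc is a list of words."""
    vocab = {w for d in docs for w in d}
    by_class = defaultdict(list)
    for d, c in zip(docs, labels):
        by_class[c].append(d)
    priors = {c: len(ds) / len(docs) for c, ds in by_class.items()}
    cond = {}
    for c, ds in by_class.items():
        counts = Counter(w for d in ds for w in d)
        total = sum(counts.values())
        cond[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, cond, vocab

def classify(doc, priors, cond, vocab):
    """Assign the class maximizing P(c_j)P(d_k|c_j), per Equation 12;
    words unseen during training are skipped."""
    scores = {c: math.log(priors[c]) +
                 sum(math.log(cond[c][w]) for w in doc if w in vocab)
              for c in priors}
    return max(scores, key=scores.get)
```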

During the implementation process of MNB, the probability of a word in a given class is smoothed by using the Laplace approach, also known as add-one smoothing,22 which adds the values 1 and |V| as shown in Equation 9. Probability smoothing is often applied to solve the zero-probability problem that occurs when a word not seen during the training process appears in a document to be classified.27,33

4. Experimental Results

In this section, we describe the datasets used for experimentation and present the evaluation measures which are adopted for (i) quantifying the quality of summaries generated by CorSum(-SF), (ii) comparing the performance of CorSum(-SF) with other well-known extractive summarization approaches, and (iii) demonstrating the


effectiveness and efficiency of using an MNB for classifying summaries generated by CorSum(-SF), instead of the entire documents.

4.1. Datasets

We assess and compare the performance of CorSum(-SF) utilizing the widely-used Document Understanding Conference (DUC) 2002 dataset (http://www-nlpir.nist.gov/projects/duc/past_duc/duc2002/data.html). DUC-2002 includes 533 news articles divided into 60 clusters, each containing approximately 10 articles retrieved from popular news collections such as the Wall Street Journal, AP Newswire, Financial Times, and LA Times. The dataset also includes two summaries, called reference summaries, created by human experts for each news article, based on which the performance of any single-document or multi-document summarization approach can be evaluated. We compare the CorSum(-SF) generated summaries of the DUC-2002 news articles with the corresponding reference summaries in terms of their overlapping content.

To determine the suitability of using summaries, instead of the entire documents, in text classification, we applied the MNB classifier on the 20 Newsgroups (http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html) dataset, denoted 20NG, from which CorSum(-SF) generated summaries are created. We rely on 20NG for evaluating the performance of the MNB classifier using summaries (generated by CorSum or CorSum-SF), since 20NG is a popular news document collection used for verifying the accuracy of text classification and text clustering tools. The 20NG dataset14 consists of 19,997 articles retrieved from the Usenet newsgroup collection that are clustered into 20 different categories. For evaluation purposes, 80% of the documents in 20NG were used for training MNB, and the remaining 20% for classification evaluation.

4.2. Evaluation measures

To evaluate the performance of CorSum and CorSum-SF, we have implemented a widely used summarization measure, the Recall-Oriented Understudy for Gisting Evaluation (ROUGE).16 ROUGE includes measures that quantify the quality of a summary S created using a summarization approach by comparing the number of overlapping units, such as n-grams, word sequences, or word pairs, in S with the ones in the expert-created reference summaries of the same document. Four different ROUGE measures are known for summarization evaluation: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, which are based on n-gram overlap, least-common substrings, weighted least-common substrings, and skip-bigram co-occurrence, respectively. We have considered ROUGE-N, as opposed to the other ROUGE variations, since, as shown in Ref. 16, the unigram-based ROUGE-N score, i.e., ROUGE-1, is the most accurate measure for establishing the closeness between automatically-generated summaries and their corresponding reference summaries. In addition, ROUGE-N (for N = 1 and N = 2) is the ideal choice


for evaluating short summaries and single-document summaries.16 ROUGE-N of an automatically-generated summary S is defined as

    ROUGE\text{-}N = \frac{\sum_{HS \in \text{Reference Summaries}} \sum_{gram_n \in HS} Count_{match}(gram_n)}{\sum_{HS \in \text{Reference Summaries}} \sum_{gram_n \in HS} Count(gram_n)}    (13)

where gram_n is an n-gram, Count_{match}(gram_n) is the number of common n-grams in S and in one of the reference summaries HS of S, and Count(gram_n) is the number of n-grams in HS. We have used ROUGE-1 and ROUGE-2, i.e., compared the unigram and bigram overlap between S and HS, respectively, to assess the performance of CorSum(-SF) and other summarization techniques.

To evaluate the performance of MNB on summaries, instead of the corresponding entire documents, we use the classification accuracy measure as defined below:

    Accuracy = \frac{\text{Number of Correctly Classified Documents}}{\text{Total Number of Documents in a Collection}}    (14)
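Both measures are easy to compute directly. The sketch below is our illustration of Equations 13 and 14, where, as in the standard ROUGE implementation, the number of matched n-grams is clipped by the candidate summary's own n-gram counts.

```python
from collections import Counter

def ngrams(words, n):
    """All n-grams of a tokenized summary, with multiplicity."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def rouge_n(candidate, references, n=1):
    """Equation 13: n-gram recall of a candidate against the references."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:              # each reference is a list of words
        ref_grams = ngrams(ref, n)
        matched += sum(min(c, cand[g]) for g, c in ref_grams.items())
        total += sum(ref_grams.values())
    return matched / total if total else 0.0

def accuracy(predicted, true_labels):
    """Equation 14: fraction of correctly classified documents."""
    return sum(p == t for p, t in zip(predicted, true_labels)) / len(true_labels)
```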

4.3. Performance evaluation of CorSum(-SF)

We have verified the effectiveness of CorSum(-SF) based on the ROUGE-1 and ROUGE-2 evaluation measures, which determine the number of overlapping n-grams (1 ≤ n ≤ 2) in each summary generated by CorSum(-SF) and one of the corresponding reference summaries. Table 3 shows the ROUGE-1 and ROUGE-2 values computed using summaries generated by CorSum(-SF) on a number of documents in the DUC-2002 dataset.

On average, CorSum obtained a 0.54 (0.28, respectively) ROUGE-1 (ROUGE-2, respectively) value, which implies that 54% (28%, respectively) of the unigrams (bigrams, respectively) in a reference summary are included in its corresponding CorSum generated summary. The ROUGE-1 (ROUGE-2, respectively) value of CorSum generated summaries is increased by another 5% (4%, respectively) by its CorSum-SF counterpart. ROUGE-1 is almost twice as high as ROUGE-2, which is anticipated, since ROUGE-1 considers overlapping unigrams, which include overlapping stopwords in the summaries, whereas bigrams limit the extent of overlap between two summaries as a result of matching two adjacent words, which is lower in probability than matching two single words. More importantly, as previously stated, we compare summaries generated by CorSum(-SF) with reference summaries, which are not extractive and were created by human experts using new sentences that capture the gist of the test documents. For this reason, achieving high ROUGE-1 and ROUGE-2 values is not a simple task. A high ROUGE-N value provides further evidence of the high quality of CorSum(-SF) generated summaries compared with other summarization methods (as shown in Section 4.3.1), even though CorSum(-SF) generated summaries are extractive and their creation process is relatively simple and straightforward compared with the rewriting approach.


Table 3. The overall ROUGE-1 and ROUGE-2 averages and the ROUGE-1 and ROUGE-2 values of a number of summaries generated by CorSum and CorSum-SF on the documents in DUC-2002, respectively.

                              CorSum                  CorSum-SF
    Document Name       ROUGE-1   ROUGE-2        ROUGE-1   ROUGE-2
    AP900625-0036        0.51      0.18           0.53      0.19
    AP900625-0153        0.21      0.16           0.34      0.16
    AP000626-0010        0.48      0.28           0.54      0.32
    AP900705-0149        0.94      0.45           0.86      0.39
    AP900730-0116        0.71      0.48           0.71      0.48
    AP901123-0002        0.38      0.13           0.52      0.18
    ...                  ...       ...            ...       ...
    Overall Average      0.54      0.28           0.59      0.32

Fig. 2. A reference summary of the sample document in Figure 1, with overlapping unigrams highlighted and overlapping bigram sequences underlined with respect to the CorSum generated summary.

Example 2. Figures 1 and 2 show the CorSum generated summary CS and one of its corresponding reference summaries RS in the DUC-2002 dataset, respectively. Out of the 99 words in RS, 80 of its unigrams and 41 of its bigrams are in CS. Some sentences in RS have been rephrased and include synonyms of words used in the original document d, which results in mismatched unigrams and bigrams between CS and RS. However, CorSum extracts sentences in d that are highly similar to the ones in RS and achieves higher ROUGE-N values compared with other existing summarization approaches, as verified by the experimental results presented in the next subsection.

4.3.1. Comparing the performance of CorSum(-SF)

To further assess the effectiveness of CorSum and CorSum-SF on summarization, we compared their performance, in terms of the ROUGE-1 and ROUGE-2 measures, with other well-established summarization approaches that adopt different methodologies for text summarization using the DUC-2002 dataset. The various summarization strategies to be compared include the ones implemented in CollabSum,32 i.e., Uniform-Link, Union-Link, Inter-Link, and Intra-Link, which rely on inter- and intra-document relationships in a cluster to generate a summary, the


Latent Semantic Analysis (LSA) summarization method in Ref. 9, and the Top-N method in Ref. 26. We also compare the performance of CorSum(-SF) with the approaches in Refs. 2, 7, 17, 18 and 21.

The Top-N summarizer26 is considered a naive summarization approach, which extracts the first N (≥ 1) sentences in a document D as its summary and assumes that introductory sentences contain the overall gist of D. Gong9 applies Latent Semantic Analysis (LSA) for summarization. LSA first establishes (i) inter-relationships among words in a document D by clustering semantically-related words and sentences in D and (ii) word-combination patterns that recur in D which describe a topic or concept. LSA selects the highest-scored sentences that contain recurring word patterns in D to form the summary of D.

CollabSum in Ref. 32 creates the summary of a document D using sentences in D and documents in the cluster to which D belongs. To create the summary of D, CollabSum relies on (i) Inter-Link, which captures the cross-document relationships of a sentence with respect to the others in the same cluster, (ii) Intra-Link, which reflects the relationships among sentences in a document, (iii) Union-Link, which is based on the inter- and intra-document relationships R, and (iv) Uniform-Link, which uses a global affinity graph to represent R.

Martins and Smith21 create sparse summaries, each of which contains a small number of sentences that reflect representative information in a document. In Ref. 21, summarization is treated as an integer linear programming problem, and an objective function is optimized which incorporates two different scores, one reflecting the likelihood of selecting a sentence S to be part of a summary and the other the compression ratio of S. Given a document D, the subset of sentences in D that maximizes the combined objective function score yields the summary of D. The authors of Ref. 21 consider both unigrams and bigrams in compressing sentences, and the conducted empirical study shows that unigrams achieve the highest performance in text summarization.

Dunlavy et al.7 rely on a Hidden Markov Model (HMM) to create the summary of a document, which consists of the top-N sentences with the highest probability values of features computed using the HMM. The features used in the HMM include (i) the number of signature terms in a sentence, i.e., terms that are more likely to occur in a given document rather than in the collection to which the document belongs, (ii) the number of subject terms, i.e., signature terms that occur in headline or subject leading sentences, and (iii) the position of the sentence in the document. Since the HMM tends to select longer sentences to be included in a summary,7 sentences are trimmed by removing lead adverbs and conjunctions, gerund phrases, and restricted relative-clause noun phrases.

Binwahlan et al.2 introduce a text summarization approach based on particle swarm optimization, which determines the weights to be assigned to each of the following text features: (i) the common number of n-grams between a given sentence and the remaining sentences in the same document, (ii) the word sentence score, which is computed using the TF-IDF weights of words in sentences, and (iii) the sentence


similarity score, which establishes the similarity between a given sentence and the first sentence of a document. Based on the extracted features and their corresponding weights, each sentence in a document D is given an overall score, and the highest-scored sentences are included in the summary of D.

As noted in Ref. 17, words carrying a high quantity of information often carry more content and are more important. In summarizing documents, Liu et al.17 consider the information captured in a word by measuring the importance, i.e., significance, of each word in a sentence using word information and entropy models. Given a document D, each sentence S in D is ranked using a score that is computed according to the position of S in D, as well as the significance of the words in S, and the highest ranked sentences are included in the summary of D.

Lloret and Palomar18 consider (i) word frequency, (ii) textual entailment, which is a natural language processing task that determines whether a text segment (in a document) can be inferred from others, and (iii) the presence of noun phrases to compute a score for each sentence in a document D. Lloret and Palomar extract sentences in D with the highest scores, which are treated as the most effective representation of the content of D, to generate the summary of D.

As shown in Figure 3, CorSum-SF outperforms all of these summarization approaches by 9-26% (10-14%, respectively) based on unigrams (bigrams, respectively), which verifies that CorSum-SF generated summaries are more reliable in capturing the content of documents than their counterparts.

4.3.2. Observations on the summarization results

The summarization approach based on LSA9 chooses the most informative sentence in a document D for each salient topic T, which is the sentence with the largest index value on T. The drawback of this approach is that sentences with the largest index values of a topic, which may not be the overall largest among all topics, are chosen even if they are less descriptive than others in capturing the content of D.

The Top-N (= 6, in our study) summarization approach is applicable to documents that contain an outline in the beginning paragraph. As shown in Figure 3, the Top-N summarizer achieves relatively high performance, even though its accuracy is still lower than CorSum(-SF)'s, since the news articles in the DUC-2002 dataset are structured such that the first few lines of each article often contain an overall gist of the article. However, the Top-N approach is not suitable for summarizing general text collections with various document structures, as opposed to CorSum(-SF), which is domain-independent.

Since the summarization approaches in CollabSum, i.e., Inter-Link, Intra-Link, Uniform-Link, and Union-Link,32 must capture the inter- and intra-document relationships of documents in a cluster to generate the summary of a document, this process increases the overall summarization time. More importantly, the inter- and intra-document information used by CollabSum is not as sophisticated as the word-correlation factors used by CorSum(-SF) in capturing document contents,


since the CollabSum approaches yield lower ROUGE-N values than CorSum(-SF) (as shown in Figure 3).

Fig. 3. ROUGE-N values achieved by different summarization approaches on DUC-2002.

Unlike the summarization methods in (i) Ref. 21, which requires training the compression and selection models as a pre-processing step, (ii) Ref. 7, which uses a supervised learning approach, and (iii) Ref. 2, which learns from a particle swarm optimization model, neither CorSum nor CorSum-SF requires any training step for document summarization. The summarization methods in Refs. 17 and 18 depend solely on the word significance value of a word w in a sentence S and the word frequency of w, respectively. In contrast, besides the significance factor of w in S, CorSum-SF uses word-correlation factors to determine the ranking score of S.

4.4. Classification performance evaluation

We have evaluated the effectiveness and efficiency of classifying summaries, as opposed to entire documents, using MNB on the 20NG dataset. Figure 4 shows the classification accuracy achieved by MNB using automatically-generated summaries, as well as the entire content, of the documents in 20NG for comparison purposes. Using CorSum generated summaries, MNB achieves a fairly high accuracy, i.e.,


74%, even though using the entire documents MNB achieves a higher classification accuracy of 82%, a difference of less than 10%.

Fig. 4. Accuracy ratios and processing time achieved by MNB using automatically-generated summaries, as well as the entire content, of articles in the 20NG dataset.

However, the training and classification processing time of MNB is significantly reduced when using CorSum generated summaries as opposed to the entire documents, as shown in Figure 4: the processing time required for training the MNB classifier and performing classification on entire documents is reduced by more than 60% when using CorSum generated summaries.

By using CorSum-SF, instead of CorSum, to summarize documents in the 20NG dataset, the classification accuracy on training and testing (CorSum-SF) summaries is increased by 4%, which means that MNB achieves 78% accuracy when classifying CorSum-SF generated summaries. More importantly, the classification accuracy using CorSum-SF is only 4% lower than the one achieved using the entire documents in the 20NG dataset, while the training and testing time is reduced by almost two-thirds compared with using the entire documents.

In comparing the classification accuracy with that of the Top-N and LSA summaries, CorSum(-SF) outperforms both of them. This is because, using summaries generated by CorSum(-SF), MNB can extract more accurate information based on the probability of words belonging to different classes (as computed in Equation 9) in a labeled document collection, which translates into fewer mistakes during the


classification process. Furthermore, on average, MNB performs classification at least as fast using the summaries generated by CorSum(-SF) as using the LSA or Top-N summaries, as shown in Figure 4.

Fig. 5. Accuracy ratios for CorSum and CorSum-SF generated summaries.

In Figure 5, we show the classification accuracy using MNB on CorSum (CorSum-SF, respectively) generated summaries for each natural class in 20NG; the corresponding labeled classes are (1) sci.electronics, (2) comp.sys.mac.hardware, (3) soc.religion.christian, (4) comp.windows.x, (5) comp.sys.ibm.pc.hardware, (6) comp.graphics, (7) misc.forsale, (8) rec.motorcycles, (9) comp.os.ms-windows.misc, (10) rec.sport.hockey, (11) talk.politics.misc, (12) alt.atheism, (13) sci.crypt, (14) talk.politics.guns, (15) rec.sport.baseball, (16) sci.space, (17) talk.politics.mideast, (18) sci.med, (19) rec.autos, and (20) talk.religion.misc. Regardless of the natural class considered, the accuracy achieved by MNB on CorSum-SF generated summaries exceeds the accuracy achieved on CorSum generated summaries.

Figure 6 shows the number of false positives, i.e., the number of documents assigned to a class when they should not be, and false negatives, i.e., the number of documents that were not assigned to a class to which they belong. We observe that, except for classes 4, 5, 6, and 8, the average number of false positives for each of the remaining classes in 20NG is 30, which constitutes approximately 12% of the classified news articles. The same applies to the number of false negatives: except for classes 1, 11, 14, 16, and 18, the average number of mislabeled articles is 33, which constitutes approximately 13% of the articles used for the classification purpose. The overall average number of false positives and false negatives is 41 (an average of 23%) per class.

Figure 7 shows the number of false positives and false negatives for classifying the summarized documents generated by CorSum-SF in the 20NG dataset into the corresponding pre-defined classes. We observe that, compared with


CorSum, even though CorSum-SF yields the same average number of false positives, its average number of false negatives is reduced from 41 to 31, a decrease of approximately 25%.

Fig. 6. False positives and false negatives of classifying CorSum generated summaries of each class in the 20NG dataset.

Fig. 7. False positives and false negatives of classifying CorSum-SF generated summaries of each class in the 20NG dataset.

4.5. Implementation

CorSum(-SF) was implemented on an Intel Dual Core workstation with dual 2.66 GHz processors, 3 GB of RAM, and a 300 GB hard disk running under Windows XP. The empirical studies conducted to assess the performance of CorSum(-SF),


as well as the performance of MNB, were performed on an HP workstation running under the Windows 7 operating system, with two Intel Core Duo 3.166 GHz processors, 8 GB of RAM, and a 460 GB hard disk.

5. Conclusions

Locating relevant information on the Web in a timely manner is often a challenging task, even using well-known web search engines, due to the vast amount of data available for users to process. Although retrieved documents can be pre-categorized based on their contents using a text classifier, web users are still required to analyze the entire documents in each category (or class) to determine their relevance with respect to their information needs. To assist web users in speeding up the process of identifying relevant web information, we have introduced CorSum, an extractive summarization approach which requires only precomputed word similarity to select the most representative sentences of a document D (that capture its main content) as the summary of D. We further enhance CorSum by considering the significance factor of a sentence S, besides the correlation factors of the words in S, to measure the relative degree of representativeness of S with respect to the content of the document D to which S belongs. We denote the enhanced summarization approach CorSum-SF. CorSum-SF selects the most representative sentences in D, i.e., the sentences whose combined significance factor and ranking score are the highest among all the sentences in D, to create the summary of D.

We have also used summaries generated by CorSum-SF to train a Multinomial Naïve Bayes (MNB) classifier and verified its effectiveness and efficiency in performing the classification task. Empirical studies conducted using the DUC-2002 dataset have shown that CorSum-SF creates high-quality summaries compared with other well-known extractive summarization approaches. Furthermore, by applying the MNB classifier on CorSum-SF generated summaries of the news articles in the 20NG dataset, we have validated that, in classifying a large document collection C, the classification task using CorSum-SF generated summaries is an order of magnitude faster than using the entire documents in C, with comparable accuracy.

For future work, we will consider applying feature extractors and selectors, such as sentence length, topical words, mutual information, or log-likelihood ratio, on a classifier to (i) further enhance the classification accuracy when using CorSum-SF generated summaries and (ii) minimize the classifier's training and classification time.

References

1. L. Antiqueira, O. Oliveira, L. Costa, and M. Nunes. A Complex Network Approach to Text Summarization. Information Sciences: An International Journal, 179(5):584-599, 2009.


2. M. Binwahlan, N. Salim, and L. Suanmali. Swarm Based Text Summarization. In Proceedings of the International Association of Computer Science and Information Technology - Spring Conference, pages 145-150, 2009.
3. S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30:1-7, 1998.
4. O. Buyukkokten, O. Kaljuvee, H. Garcia-Molina, A. Paepcke, and T. Winograd. Efficient Web Browsing on Handheld Devices Using Page and Form Summarization. ACM Transactions on Information Systems (TOIS), 20(1):82-115, 2002.
5. L. da F. Costa, F. Rodriguez, G. Travieso, and P. Villas Boas. Characterization of Complex Networks: A Survey of Measurements. Advances in Physics, 56(1):167-242, 2007.
6. D. Das and A. Martins. A Survey on Automatic Text Summarization. Literature Survey for the Language and Statistics II Course at CMU, 2007.
7. D. Dunlavy, D. O'Leary, J. Conroy, and J. Schlesinger. QCS: A System for Querying, Clustering and Summarizing Documents. Information Processing and Management, 43(6):1588-1605, 2007.
8. M. Fattah and F. Ren. GA, MR, FFNN, PNN and GMM Based Models for Automatic Text Summarization. Computer Speech and Language, 23(1):126-144, 2009.
9. Y. Gong. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proceedings of the International Conference on Research and Development in Information Retrieval (ACM SIGIR), pages 19-25, 2001.
10. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
11. J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. JACM, 46(5):604-632, 1999.
12. J. Koberstein and Y.-K. Ng. Using Word Clusters to Detect Similar Web Documents. In Proceedings of the International Conference on Knowledge Science, Engineering and Management (KSEM), pages 215-228, 2006.
13. A. Kolcz. Local Sparsity Control for Naive Bayes with Extreme Misclassification Costs. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD), pages 128-137, 2005.
14. K. Lang. NewsWeeder: Learning to Filter Netnews. In Proceedings of the International Conference on Machine Learning (ICML), pages 331-339, 1995.
15. L. Li, K. Zhou, G. Xue, H. Zha, and Y. Yu. Enhancing Diversity, Coverage and Balance for Summarization through Structure Learning. In Proceedings of the International Conference on World Wide Web (WWW), pages 71-80, 2009.
16. C. Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the ACL Workshop on Text Summarization Branches Out, pages 74-81, 2004.
17. X. Liu, J. Webster, and C. Kit. An Extractive Text Summarizer Based on Significant Words. Lecture Notes in Artificial Intelligence, 5459:168-178, 2009.
18. E. Lloret and M. Palomar. A Gradual Combination of Features for Building Automatic Summarization Systems. Lecture Notes in Artificial Intelligence, 5729:16-23, 2009.
19. G. Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving, 6th Ed. Addison Wesley, 2009.
20. H. Luhn. The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development, 2(2):159-165, 1958.
21. A. Martins and N. Smith. Summarization with a Joint Model for Sentence Extraction and Compression. In Proceedings of the Association for Computational Linguistics Workshop on Integer Linear Programming for Natural Language Processing, pages 1-9, 2009.


22. A. McCallum and K. Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In Proceedings of the Workshop on Learning for Text Categorization, pages 41-48, 1998.
23. R. Mihalcea and S. Hassan. Recent Advances in Natural Language Processing IV, chapter: Text Summarization for Improved Text Classification. John Benjamins, 2006.
24. R. Mihalcea and P. Tarau. A Language Independent Algorithm for Single and Multiple Document Summarization. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), pages 19-24, 2005.
25. K. Nigam, J. Lafferty, and A. McCallum. Using Maximum Entropy for Text Classification. In Proceedings of the Workshop on Machine Learning for Information Filtering, pages 61-67, 1999.
26. L. Qiu, B. Pang, S. Lin, and P. Chen. A Novel Approach to Multi-document Summarization. In Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), pages 187-191, 2007.
27. D. Radev. Text Summarization. International Conference on Research and Development in Information Retrieval (ACM SIGIR) Tutorial, 2004.
28. D. Radev, E. Hovy, and K. McKeown. Introduction to the Special Issue on Summarization. Computational Linguistics, 28(4):399-408, 2002.
29. K. Sarinnapakorn and M. Kubat. Combining Subclassifiers in Text Categorization: A DST-Based Solution and a Case Study. IEEE TKDE, 19(12):1638-1651, 2007.
30. D. Shen, Z. Chen, Q. Yang, H. Zeng, B. Zhang, Y. Lu, and W. Ma. Web-page Classification Through Summarization. In Proceedings of the International Conference on Research and Development in Information Retrieval (ACM SIGIR), pages 242-249, 2004.
31. D. Shen, J. Sun, H. Li, Q. Yang, and Z. Chen. Document Summarization Using Conditional Random Fields. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 2862-2867, 2007.
32. X. Wang and J. Yang. CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations. In Proceedings of the International Conference on Research and Development in Information Retrieval (ACM SIGIR), pages 143-150, 2007.
33. Y. Yang. An Example-Based Mapping Method for Text Categorization and Retrieval. ACM Transactions on Information Systems (TOIS), 12(3):253-277, 1994.
34. Y. Yang and J. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the International Conference on Machine Learning (ICML), pages 412-420, 1997.
35. P. Zhang and C. Li. Automatic Text Summarization Based on Sentences Clustering and Extraction. In Proceedings of the IEEE International Conference on Computer Science and Information Technology (ICCSIT), pages 167-170, 2009.
