Genre Identification and Goal-Focused Summarization

Jade Goldstein (U.S. Department of Defense)
Gary M. Ciany (Dragon Development Corp.)
Jaime G. Carbonell (Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213)

[email protected]   [email protected]   [email protected]

ABSTRACT
In this paper, we present a novel technique of first performing document genre identification and then using the genre to produce tailored summaries based on a user's information-seeking needs – genre-oriented goal-focused summarization – such as a plot or opinion summary of a movie review. We create a test corpus to determine genre classification accuracy for 16 genres, and examine performance with various amounts of training data for three machine learning algorithms: Random Forests, SVM-light and Naïve Bayes. Results show that Random Forests outperforms SVM-light and Naïve Bayes. The genre tag is used to inform a downstream summarization engine. We define summary types for 7 genres, create a ground-truth corpus, and analyze the results of genre-oriented goal-focused summarization, showing that this type of user-based summarization requires different algorithms than the leading-sentence baseline, which is known to perform well for news articles.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.3.3 [Information Search and Retrieval]: Information filtering, Selection process

General Terms
Algorithms, Design, Experimentation, Performance

Keywords
Genre Identification, Text Classification, Text Categorization, Machine Learning, Summarization, Metadata Extraction, Data Mining, Task-based Information Retrieval, Evaluation

1. INTRODUCTION
With the continuing rapid growth of the Web, it is becoming essential to find ways of categorizing and filtering information so that users can quickly locate and access particular items of interest. An information-access system provides a mechanism to connect users and their information-seeking needs with a store of information. For the internet, this mechanism has focused on search algorithms and short summaries intended to help the user choose relevant web pages to view. Some search engines include clustering processes (www.clusty.com), which are usually organized around topically based keywords or concepts. Topic alone may be insufficient for categorizing documents and web pages: for a given product, one user may be interested in a product press release, another may be interested in reviews, and a third might be interested in stores offering the product at the cheapest price. A user may decide which pages to view based on the genre.

Thus, genre may be a preferable way to organize and label documents. Genre information can be determined from the web page and used as categorical metadata about the document as a means of indexing and retrieving documents. Roussinov and colleagues, in preliminary studies of people searching the Web, found that the genre of the document was one of the clues used in assessing relevance, value, quality and usefulness [6]. There are many different definitions of genre; most include two principal characterizations – the intended communicative purpose and the form. Some contain a third – the content of the document [7]. We utilize this triple since content will inform downstream processes, such as summarization. For example, a movie review is often organized differently from a product review. A reader of both reviews may want to know about the author's opinions. However, a reader of the product review may also want to learn about the functionality of the product as well as its price, whereas the reader of the movie review may want some information about the plot as well as the running time of the film. We therefore introduce a new form of summarization – genre-oriented summarization – summary creation based on the characteristics of a genre. For the movie review genre, one user might want a summary of the plot, whereas another may want a summary of the reviewer's opinion. This motivates the concept of genre-oriented goal-focused summarization.

In this paper, we first examine the performance of machine learning techniques in classifying certain types of web pages by genre, based on a corpus collected from the web. The accuracy with which a classification system can identify genre will affect the results of the downstream summarization system. For example, if a movie review is misclassified as a biography, information such as rating and running time will not be extracted. After examining classification performance, we discuss the performance of goal-focused summarization algorithms on a gold-standard sentence-extract summarization corpus (created by 3 human summarizers).

2. CORPORA
We collected a genre corpus from the web with the particular goal of examining the effects of genre classification output on summarization. We selected 9 categories (Table 1) for genre identification, 7 of which are targeted toward summarization experiments (Table 2). Data was collected, resulting in approximately 1000 documents per category (empty frame files were deleted). A maximum of 10 data items were collected from any one web site.




We added 1000 randomly selected documents from each of 7 additional categories that were collected by CMU in previous studies [2], to form a total of 16 genres.

Table 1: Genre Document Collection – 16 genres

  Genre (new)              # of docs    Genre (old)       # of docs
  Biographies                 959       Advertisements       1000
  Interviews                 1025       Bulletin Board        998
  Movie Reviews               968       FAQs                 1000
  Editorials-politics        1006       Message Board        1000
  Articles-politics          1092       Radio News           1000
  Product Press Releases      992       Reuters              1000
  Product Reviews             930       TV News              1000
  Store Products              954
  Search Results              956
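The paper does not describe its crawler, but the stated cap of at most 10 data items per web site is simple to illustrate. Below is a minimal sketch, under our own assumptions (hypothetical function name and URLs, not the authors' collection code), of enforcing that cap by host name.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cap_per_site(urls, max_per_site=10):
    """Keep at most max_per_site URLs from any single web site (keyed by host name)."""
    per_site = defaultdict(int)
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if per_site[host] < max_per_site:
            per_site[host] += 1
            kept.append(url)
    return kept

# 12 candidate pages from one site plus 1 from another: only 10 + 1 are kept.
candidates = [f"http://example.com/review/{i}" for i in range(12)]
candidates.append("http://other-site.org/review/1")
print(len(cap_per_site(candidates)))  # -> 11
```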

Expanding on Dewdney's research [2], our feature extractor contains 119 features: a baseline of 66 items consisting of a combination of layout, character and structural features (e.g., number of white-space lines, number of quotes, counts of words appearing in headings) plus derivative cues (e.g., readability measures); 26 grammatical features (such as the number of verbs); and 27 genre-oriented topic content-word lists, which were derived from an independent sample of similarly collected genres.
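As a rough illustration of the kinds of cues described above, the sketch below computes a handful of layout, derivative, grammatical-proxy and content-word features in plain Python. The feature names, the tiny keyword list and the regular expressions are our own assumptions; the actual 119-feature extractor is not reproduced here.

```python
import re

PLOT_WORDS = {"plot", "story", "character", "scene"}   # hypothetical content-word list

def extract_features(text):
    """Toy illustration of layout/character, derivative, grammatical-proxy and
    content-word features; the paper's extractor has 119 such features."""
    lines = text.splitlines()
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return {
        # layout / character / structural cues
        "blank_lines": sum(1 for l in lines if not l.strip()),
        "quote_chars": text.count('"'),
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
        # derivative cue: a crude readability proxy (average sentence length in words)
        "avg_sentence_len": n_words / n_sents,
        # grammatical proxy (a real extractor would use a part-of-speech tagger)
        "past_tense_ratio": sum(w.lower().endswith("ed") for w in words) / n_words,
        # genre-oriented content-word cue
        "plot_word_ratio": sum(w.lower() in PLOT_WORDS for w in words) / n_words,
    }

print(extract_features('He reviewed the plot. "Great story," she said.'))
```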

The goal of genre-oriented summarization is to tailor the summary to the user's needs. For example, one might want a plot or opinion summary of a movie review, in contrast to the most salient points produced by a newswire summary. Accordingly, for the 7 genres in Table 2, three senior English students collectively determined the types of summaries a user might want to view – the genre-oriented goal-focused summaries. We selected the sentence as the basic summarization unit, and decided on a summary length in sentences and guidelines for selecting particular sentences for each summary type. Interviews, which tend to be long, were allowed more summary sentences.

Three different classifiers were used for our experiments: WEKA's implementation of Naïve Bayes with the default settings [5], Joachims' implementation of Support Vector Machines, SVM-light, with a radial basis function and the default settings [8], and Random Forests (RF) [1]. Random Forests grows many classification trees; we use 100 trees, and the number of variables selected at random at each node is set to the square root of the number of features (as suggested by Leo Breiman). The predicted value for classification is the class with the majority of the forest votes. Since SVM-light builds binary models, the training time for Random Forests is significantly less than for SVM-light, especially for large numbers of genres.
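The experiments used WEKA's Naïve Bayes, SVM-light and Breiman's Random Forests code. As a hedged approximation rather than the authors' actual setup, the sketch below reproduces the same comparison with scikit-learn equivalents (RandomForestClassifier with 100 trees and sqrt(#features) per split, an RBF-kernel SVC, and GaussianNB), reporting micro-averaged F1 under 10-fold cross-validation on placeholder data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Placeholder data standing in for the extracted feature vectors:
# 20 documents per genre x 16 genres, 119 features each.
rng = np.random.default_rng(0)
X = rng.random((320, 119))
y = np.repeat(np.arange(16), 20)

classifiers = {
    # 100 trees, sqrt(#features) candidates per split, majority vote (as in the paper)
    "RF": RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0),
    # RBF-kernel SVM standing in for SVM-light with a radial basis function
    "SVM": SVC(kernel="rbf"),
    "NB": GaussianNB(),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="f1_micro")
    print(f"{name}: micro-averaged F1 = {scores.mean():.2f}")
```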

Table 2: Corpus description, including # of documents collected, # of summary sentences (NG), summary types for that genre, minimum/maximum/mean document length in sentences, and % of human summary sentences beyond the lead. O=Overview, Pl=Plot, Op=Opinion, Pe=Personal, Pr=Professional, T=Thematic.

  Genre                    # Docs   NG (# Summ Sent)   Types        Min Sent   Max Sent   Mean Sent   % Summ Sent > Lead
  Articles                   30     5                  O               27         82          47            53%
  Editorials                 30     5                  O                8         37          37            78%
  Movie Reviews              30     5                  O, Pl, Op       17        163          41            88%
  Product Reviews            30     5                  O               15        216          60            87%
  Product Press Releases     30     3                  O                6        116          34            48%
  Biographies                45     5                  O, Pe, Pr       13        351          89            79%
  Interviews                 45     5-9                T, Op           37       1040         176            97%

For example, for Movie Reviews there are three types of summaries. Plot: like a movie trailer. Opinion: the reviewer's or others' opinions. Overview: 1 plot sentence, 1 opinion sentence, 1 cast sentence, 1 movie-genre sentence, and 1 other supporting sentence.

4. RESULTS
Evaluation results for the three classifiers using 10-fold cross-validation are shown in Table 3. Results are reported as precision, recall and F1 using the micro-average, in which each relevant document is a point in the average. RF performs better than SVM, a result that occurred consistently across the data. All algorithms perform better for 16 genres, perhaps due to the addition of more training data. RF performs extremely well using just our baseline of 66 general features, with results that match SVM on the full 119 features. The content words did not provide a large increase in scores; we plan to investigate this result. Due to the much lower performance of Naïve Bayes compared to RF and SVM, the rest of the paper does not report Naïve Bayes results.

Table 3: Classification results, micro-average F1 (10-fold cross-validation), on 9 and 16 genres for Random Forests (RF), Support Vector Machines (SVM) and Naïve Bayes (NB).

                                        9 Genres               16 Genres
  Features \ Classifier                 RF    SVM   NB         RF    SVM   NB
  Baseline (basic) (66)                 0.81  0.67  0.48       0.86  0.76  0.63
  Baseline + grammatical (92)           0.85  0.78  0.58       0.88  0.83  0.69
  Baseline + gramm. + content (119)     0.87  0.81  0.63       0.90  0.85  0.71

3. GENRE CLASSIFICATION

Genre classification has been studied since the early 1990s, when Karlgren and Cutting [3] used discriminant analysis to classify four categories using features derived from part-of-speech analysis, structural cues, lexical cues, character-level cues and derivative cues. Recent research has used these types of cues for classification of press genres, positive or negative reviews, and email speech acts. Dewdney and colleagues [2] used these types of cues (89 features) to categorize seven genres using three classifiers: Naïve Bayes, C4.5, and SVM-light.

Table 4 shows the individual scores per genre for all 16 genres. From the confusion matrices (not presented), Editorials are confused with Articles on the same topics, and Product Reviews are often confused with Product Press Releases. RF does better than SVM at distinguishing these confusable categories.
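A confusion-matrix analysis like the one referred to above can be reproduced with cross-validated predictions. The sketch below is illustrative only (placeholder data and scikit-learn rather than the authors' tools); it reports the single most frequent off-diagonal confusion.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Placeholder features/labels as in the earlier sketch (16 genres, 119 features).
rng = np.random.default_rng(0)
X = rng.random((320, 119))
y = np.repeat(np.arange(16), 20)

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
pred = cross_val_predict(rf, X, y, cv=10)   # cross-validated predictions
cm = confusion_matrix(y, pred)              # rows = true genre, columns = predicted genre

# Find the most frequent off-diagonal confusion (e.g. editorials vs. articles).
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_g, pred_g = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(f"genre {true_g} is most often mistaken for genre {pred_g} "
      f"({off_diag[true_g, pred_g]} documents)")
```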

Our research differs in that we combine genre identification with summarization and use a greater number of categories than prior researchers. Some categories have a high degree of topical overlap, to allow us to investigate the effects of confusable conditions, e.g., editorials and articles with the topic of Bush, as well as product press releases, product reviews and product store pages. Furthermore, we examine the effect of the number of training documents on classification results, and compare Random Forests (RF) to Support Vector Machines (SVM). To our knowledge, there has been no published research comparing RF to SVM for genre classification or text categorization.



Table 4: Overall within-genre statistics for Random Forests and SVM on 16 genres, 119 features.

  Genre                        RF P    RF R    RF F1     SVM P   SVM R   SVM F1
  advertisements               0.86    0.92    0.89      0.83    0.86    0.85
  biographies                  0.82    0.89    0.85      0.82    0.77    0.79
  bulletin-board               0.90    0.80    0.85      0.86    0.80    0.83
  FAQ                          0.91    0.99    0.95      0.95    0.99    0.97
  interviews                   0.87    0.94    0.90      0.79    0.89    0.84
  message board                0.99    0.99    0.99      0.98    0.99    0.99
  movie reviews                0.88    0.85    0.86      0.81    0.73    0.76
  editorials (topic Bush)      0.92    0.78    0.84      0.76    0.73    0.75
  articles (topic Bush)        0.82    0.94    0.88      0.78    0.89    0.83
  product press releases       0.82    0.89    0.85      0.79    0.85    0.82
  product reviews              0.84    0.65    0.73      0.76    0.56    0.64
  radio news                   0.98    0.86    0.91      0.94    0.73    0.82
  Reuters                      0.99    0.99    0.99      0.98    0.98    0.98
  store products               0.94    0.89    0.91      0.87    0.92    0.90
  search results               0.93    0.96    0.95      0.92    0.90    0.91
  TV news                      0.90    0.99    0.94      0.80    1.00    0.89

Our summarizer creates summaries based on a particular genre and goal. Sentences receive a score using document features such as position, matches to word lists created for a genre, and content-derived features (e.g., number of people). Sentences are ordered in the output summary according to the order in which they appear in the document. Each summary consists of specific features based on the genre and focus; e.g., newswire summaries often include the first sentence. These features are given weights, which were tuned by experimentation; e.g., the first-sentence feature is given a lower weight for the movie review genre, where the first sentence may be a catchy lead-in, than for newswire. Due to paper page constraints, the algorithms shown in Figure 2 are only for movie summaries. People, organizations and locations are extracted using Alias-i's LingPipe (www.alias-i.com/lingpipe) and then counted. The opinion word list is Wilson's list of subjectivity and sentiment clues [10]. All other word lists were created based on the specific genre. A sentence's match score with a list is based on cosine similarity. All sentence feature scores are normalized within the document before being combined into the final score.
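The description above outlines the general scoring scheme: per-sentence feature scores, normalized within the document, combined with tuned weights, with the selected sentences emitted in document order. A minimal sketch of that pipeline follows; the feature functions, weights and opinion word list are illustrative assumptions, not the system's actual values.

```python
import math
import re
from collections import Counter

def cosine_match(sentence, word_list):
    """Cosine similarity between a sentence's term counts and a keyword list."""
    toks = Counter(re.findall(r"[a-z']+", sentence.lower()))
    keys = Counter(w.lower() for w in word_list)
    dot = sum(toks[w] * keys[w] for w in toks)
    norm = (math.sqrt(sum(v * v for v in toks.values()))
            * math.sqrt(sum(v * v for v in keys.values())))
    return dot / norm if norm else 0.0

def normalize(scores):
    """Scale one feature's scores to [0, 1] within the document before combination."""
    hi = max(scores) or 1.0
    return [s / hi for s in scores]

def summarize(sentences, feature_fns, weights, n_sentences):
    """Score each sentence with weighted, document-normalized features and
    return the top-scoring sentences in their original document order."""
    per_feature = [normalize([fn(s, i) for i, s in enumerate(sentences)])
                   for fn in feature_fns]
    totals = [sum(w * col[i] for w, col in zip(weights, per_feature))
              for i in range(len(sentences))]
    top = sorted(range(len(sentences)), key=lambda i: totals[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]

# Hypothetical goal: an opinion summary, scored by opinion-word match and position.
OPINION_WORDS = ["great", "boring", "brilliant", "disappointing"]
features = [lambda s, i: cosine_match(s, OPINION_WORDS),
            lambda s, i: 1.0 if i == 0 else 0.0]          # first-sentence feature
doc = ["Lucky Numbers opened on Friday.",
       "The plot never finds its footing.",
       "Travolta is brilliant but the film is boring."]
print(summarize(doc, features, weights=[0.8, 0.2], n_sentences=1))
```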

Finally, we examine the overall effect of the size of the training/test corpus on results, for both our baseline 66 features and our full set of 119 features (Figure 1). Note that for Random Forests, after 250 documents, the increase in collection effort for the extra documents might not be worth the performance gain.

Figure 1: Performance effects based on the number of documents used for training/test, for Random Forests and SVM, 16 genres, 66 features (baseline) and 119 features (10-fold cross-validation).

Movie Plot: Only consider sentences from the first 2/3 of the document. Select sentences using Score = (0.3 * Number of Capital Letters) + (0.3 * Number of Persons, Locations and Organizations (NPLO)) + (0.1 * Plot Keyword List Matches) + (0.3 * Consecutive Sentence Match).

Movie Opinion: Match against the Opinion word list.

Movie Overview: Take the best sentence from the cast sentence score list. The cast summary sentence is the highest-scoring sentence based on: Score = (0.2 * 1st Sentence) + (0.3 * Cast Keywords) + (0.2 * Number of Persons) + (0.3 * Number of Capital Letters in Parentheses). Take the best sentence from the genre + director (combined keyword) sentence score list. If there is a repeated sentence, no genre + director sentence is used. For the remaining sentences, select by (0.7 * HighFreqTerms) + (0.3 * NPLO).

Figure 2: Movie Review Summary Algorithms.
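The Movie Plot rule in Figure 2 can be read almost directly as code. The sketch below keeps the weights from the figure but substitutes simple approximations of our own for the NPLO count (the paper uses LingPipe), the plot keyword list and the consecutive-sentence match, and omits the within-document normalization described earlier.

```python
import re

PLOT_KEYWORDS = {"plot", "story", "plays", "stars", "discovers", "concocts"}  # hypothetical list

def count_entities(sentence):
    """Stand-in for LingPipe's person/location/organization count (NPLO):
    here we simply count capitalized tokens that are not sentence-initial."""
    toks = sentence.split()
    return sum(1 for t in toks[1:] if t[:1].isupper())

def consecutive_match(sentences, i):
    """1.0 if the next sentence shares a content word with this one, else 0.0."""
    if i + 1 >= len(sentences):
        return 0.0
    a = set(re.findall(r"[a-z']{4,}", sentences[i].lower()))
    b = set(re.findall(r"[a-z']{4,}", sentences[i + 1].lower()))
    return 1.0 if a & b else 0.0

def movie_plot_scores(sentences):
    """Score candidate plot sentences with the Figure 2 weights; the individual
    feature implementations here are our own approximations."""
    cutoff = int(len(sentences) * 2 / 3)          # only the first 2/3 of the document
    scores = {}
    for i, s in enumerate(sentences[:cutoff]):
        caps = sum(1 for c in s if c.isupper())
        keyword_hits = sum(1 for w in re.findall(r"[a-z']+", s.lower()) if w in PLOT_KEYWORDS)
        scores[i] = (0.3 * caps + 0.3 * count_entities(s)
                     + 0.1 * keyword_hits + 0.3 * consecutive_match(sentences, i))
    return scores

doc = ["Richards (John Travolta) is a TV weatherman in Harrisburg, Pa.",
       "He concocts a plan to rig the state lottery.",
       "The lottery plan quickly goes wrong.",
       "As dark comedies go, it is watchable."]
print(movie_plot_scores(doc))
```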

In this section, we have shown that we can obtain very good genre identification performance of approximately 0.9 F1 (Figure 1) with RF. An increased number of genres appears to assist scores. 100-250 documents with some carefully chosen lexical and topical features seem to result in performance scores close to 0.8 F1, possibly adequate for a genre-oriented summarization system.

5. SUMMARIZATION
Summarization systems have focused on two types of summaries: the generic or overview summary, which gives an overall sense of the document's content, and the query-based summary, which presents the content most closely related to the initial search query. Most summarization research has focused on the news-event genre and the scientific-article genre [4, 9], for which this type of methodology works well. Recent work has focused on creating opinion summaries for web reviews and news editorials. Consider, however, the genre of movie reviews. A user may want an overview of a review (generic), a specific answer to a question (query-based), plot details, or reviewers' opinions. Accordingly, the movie review genre, like others, requires a new class of summary: the goal-focused summary. Summaries that reflect a user's information-seeking needs require genre-oriented goal-focused summarization.


Table 5 shows the results of all algorithms. The highest-ranking sentences are selected for each algorithm, up to the number of summary sentences for that genre, NG (Table 2). The normalized score is computed by counting the system summary sentences that match any human summary sentence (a score of 1 for a match) and dividing by NG. For news articles, which tend to be summaries, lead (document-initial) sentences often create good overview summaries. Accordingly, we use lead sentences as our baseline for all genres. From Table 5, note that many algorithms still outperform the lead even if they are not the best-performing algorithm, motivating the need for summaries beyond the lead – genre-oriented, goal-focused summaries. Additionally, such summaries provide supplementary information, which is lacking without the genre tag. For movie reviews, the genre information allows the extraction of the running time and movie rating, and the goal-focus allows for summaries of a different composition (Figure 3). Thus, if the genre classifier can pass an accurate tag (for movie reviews, F1 = 0.86, Table 4), the system can present a tailored summary.

Table 5: Results of various summarization algorithms on the summary data. Algorithms: Articles (Newswire) (ARTIC), Bio-Overview (BIO-O), Bio-Personal (BIO-PE), Bio-Professional (BIO-PR), Editorials (EDIT), Interviews-Opinion (INT-OP), Interviews-Thematic (INT-TH), Movie Reviews-Opinion (MOV-OP), Movie Reviews-Plot (MOV-PL), Movie Reviews-Overview (MOV-OV), Product-Press-Release (PROD-PR), Product-Review (PROD-RE), Leading Sentence (LEAD, baseline), Number of Initial Capital Letters (CAPS), Number of Persons using Alias-i LingPipe (PERS), Number of Named Entities – Person, Location and Organization using LingPipe (NE), and High-Frequency Terms of the Document (HF). Scores use lenient scoring – a system summary sentence receives full score if it matches any summary sentence produced by a human. The maximum score is 1.0. The highest score for each genre was shown in bold in the original table.

  Summaries                               ARTIC  BIO-O  BIO-PE BIO-PR EDIT   INT-OP INT-TH MOV-OP MOV-PL
  Articles (News)                         0.45   0.30   0.35   0.48   0.46   0.42   0.41   0.21   0.37
  Bios - Overview                         0.34   0.44   0.32   0.36   0.35   0.30   0.31   0.24   0.35
  Bios - Personal                         0.25   0.39   0.39   0.31   0.25   0.21   0.21   0.18   0.26
  Bios - Professional                     0.30   0.29   0.23   0.40   0.32   0.28   0.29   0.23   0.34
  Editorials                              0.43   0.39   0.41   0.48   0.44   0.41   0.45   0.32   0.44
  Interviews - Author - Opinion           0.11   0.15   0.15   0.07   0.11   0.12   0.17   0.15   0.04
  Interviews - Author - Thematic          0.16   0.11   0.17   0.15   0.19   0.19   0.19   0.09   0.13
  Interviews - Entertainers - Opinion     0.07   0.15   0.05   0.05   0.08   0.09   0.07   0.12   0.04
  Interviews - Entertainers - Thematic    0.19   0.15   0.11   0.17   0.12   0.12   0.17   0.08   0.13
  Interviews - Politicians - Opinion      0.04   0.08   0.12   0.07   0.04   0.04   0.04   0.09   0.08
  Interviews - Politicians - Thematic     0.07   0.15   0.11   0.09   0.07   0.08   0.09   0.01   0.09
  Movie Reviews - Opinion                 0.30   0.35   0.37   0.21   0.34   0.35   0.31   0.40   0.21
  Movie Reviews - Plot                    0.41   0.35   0.36   0.43   0.36   0.41   0.49   0.28   0.51
  Movie Reviews - Overview                0.49   0.44   0.43   0.37   0.48   0.51   0.53   0.28   0.42
  Product Press Release                   0.36   0.27   0.50   0.56   0.34   0.46   0.43   0.32   0.40
  Product Reviews                         0.32   0.27   0.27   0.31   0.34   0.36   0.32   0.29   0.34

  Summaries                               MOV-OV PROD-PR PROD-RE LEAD   CAPS   PERS   NE     HF
  Articles (News)                         0.39   0.75    0.44    0.76   0.24   0.16   0.24   0.44
  Bios - Overview                         0.31   0.40    0.40    0.40   0.26   0.21   0.28   0.32
  Bios - Personal                         0.20   0.33    0.27    0.34   0.22   0.23   0.21   0.18
  Bios - Professional                     0.31   0.36    0.32    0.37   0.26   0.21   0.25   0.32
  Editorials                              0.41   0.47    0.40    0.47   0.38   0.38   0.37   0.44
  Interviews - Author - Opinion           0.13   0.07    0.13    0.07   0.09   0.13   0.09   0.11
  Interviews - Author - Thematic          0.16   0.19    0.19    0.19   0.19   0.11   0.13   0.19
  Interviews - Entertainers - Opinion     0.08   0.04    0.08    0.04   0.09   0.11   0.05   0.08
  Interviews - Entertainers - Thematic    0.17   0.08    0.15    0.08   0.20   0.16   0.16   0.13
  Interviews - Politicians - Opinion      0.08   0.05    0.05    0.05   0.07   0.04   0.04   0.07
  Interviews - Politicians - Thematic     0.09   0.09    0.15    0.09   0.08   0.09   0.09   0.08
  Movie Reviews - Opinion                 0.32   0.29    0.37    0.29   0.25   0.26   0.25   0.33
  Movie Reviews - Plot                    0.44   0.25    0.38    0.23   0.39   0.36   0.41   0.43
  Movie Reviews - Overview                0.51   0.37    0.49    0.36   0.39   0.37   0.39   0.50
  Product Press Release                   0.29   0.79    0.51    0.73   0.31   0.22   0.27   0.37
  Product Reviews                         0.30   0.31    0.38    0.31   0.25   0.17   0.22   0.34
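The lenient, normalized scoring used for Table 5 reduces to a simple computation: count the system summary sentences that match any human summary sentence and divide by NG. A small sketch with made-up sentence identifiers:

```python
def lenient_score(system_summary, human_summaries, n_genre):
    """Lenient normalized score used for Table 5: each system sentence that matches
    any sentence of any human summary scores 1; the total is divided by NG."""
    human_sentences = {s for summary in human_summaries for s in summary}
    matches = sum(1 for s in system_summary if s in human_sentences)
    return matches / n_genre

# Toy example with NG = 3 summary sentences for the genre and two human summaries.
system = ["S1", "S4", "S7"]
humans = [["S1", "S2", "S3"], ["S4", "S5", "S6"]]
print(lenient_score(system, humans, n_genre=3))  # 2 of 3 system sentences match -> 0.666...
```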

[GENRE: MOVIE]: TITLE: REVIEW - FILM - COMEDY LUCKY NUMBERS AUTHOR: By Ellen Futterman RATING: * * 1/2 (out of four) Rating: R, language, violence, adult themes RUNNING TIME: Running time: 1:50 DATE: Friday, October 27, 2000 2:15 a.m. [3] Richards (John Travolta) could easily be the poster boy for Carol House Furniture because he really does like nice things. [8] So with some help, Richards concocts a plan to rig the state lottery and win the $6.4 million jackpot. [14] Loosely based on the 1980 scheme to fix the Pennsylvania lottery, "Lucky Numbers" never quite finds the right footing. [17] After the critical and box-office bomb "Battlefield Earth," Travolta returns from the twilight zone to his comfort zone in comedy. [24] As dark comedies go, "Lucky Numbers" is no one-in-a-million, but Travolta and Kudrow do their part to keep it spinning.

[GENRE: ARTICLES]: TITLE: REVIEW - FILM - COMEDY LUCKY NUMBERS AUTHOR: By Ellen Futterman DATE: Friday, October 27, 2000 2:15 a.m. [1] You're definitely living large when you've got your own booth at the local Denny's. [2] Or so thinks Russ Richards, a television weatherman in Harrisburg, Pa., and the town's biggest celebrity. [3] Richards (John Travolta) could easily be the poster boy for Carol House Furniture because he really does like nice things. [4] He drives a Jag and lives in a well-appointed mansion. [5] But Richards is experiencing some unluckiness; he's on the verge of bankruptcy.

Figure 3: Movie Review – Movie Genre Overview Summary compared to Newswire Genre Lead Sentence Summary ([Doc. Sent. Number] followed by the sentence). Extra items extracted by use of the movie genre tag are in bold.

6. CONCLUSIONS AND FUTURE WORK
We examined classifier performance, comparing Support Vector Machines, known to perform well for text categorization, with Random Forests, as well as with Naïve Bayes, a commonly used machine learning algorithm. Random Forests performed better than SVM, which outperformed Naïve Bayes. The training time for RF was much faster than for SVM. Additionally, we showed that identifying the genre can assist a summarization system in producing more informative summaries by including information in the summary based on the genre tag. For example, in the movie review genre, the rating and running time could be extracted and included in the summary. Thus, genre-oriented summaries have more utility than straight summarization algorithms. We also motivated the need for various summary types according to the genre and a user's information-seeking goals, and presented a case showing that genre-oriented goal-focused summaries require varying summarization algorithms.

7. REFERENCES
[1] Breiman, L. Consistency for a Simple Model of Random Forests. U.C. Berkeley Technical Report 670, Berkeley, CA, 2004.
[2] Dewdney, N., VanEss-Dykema, C., and McMillan, R. The form is the substance: Classification of genres in text. In ACL Workshop on Human Language Technology and Knowledge Management, 2001.
[3] Karlgren, J. and Cutting, D. Recognizing text genres with simple metrics using discriminant analysis. In COLING 1994, Kyoto, Japan.
[4] Mani, I., House, D., Klein, G., Hirschman, L., Obrst, L., Firmin, T., Chrzanowski, M., and Sundheim, B. The TIPSTER SUMMAC Text Summarization Evaluation. Technical Report MTR 98W0000138, MITRE, October 1998.
[5] Naïve Bayes WEKA implementation: www.cs.waikato.ac.nz/ml/weka
[6] Roussinov, D., Crowston, K., Nilan, M., Kwasnik, B., Liu, X., and Cai, J. Genre-based navigation on the Web. In HICSS-34, Maui, HI, 2001.
[7] Shepherd, M. and Watters, C. The functionality attribute of cybergenres. In HICSS-32, Hawaii, 1999.
[8] SVM-light web pages: http://svmlight.joachims.org
[9] Teufel, S. and Moens, M. Sentence extraction as a classification task. In ACL/EACL-97 Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, July 1997, 58-65.
[10] Wilson, T., Wiebe, J., and Hoffmann, P. Recognizing contextual polarity in phrase-level sentiment analysis. In HLT/EMNLP 2005, Vancouver, Canada.