Statistical Approaches to Concept-Level Sentiment Analysis

Summarizing Online Reviews Using Aspect Rating Distributions and Language Modeling

Giuseppe Di Fabbrizio, Ahmet Aker, and Robert Gaizauskas, University of Sheffield

Product and service reviews are abundantly available online, but selecting relevant information from them is time consuming. Starlet solves this problem by extracting multidocument summaries that consider aspect rating distributions and language modeling.

With the broad availability of always-connected portable devices such as mobile devices, tablets, and e-readers, condensing information for display on a relatively small screen has become a necessity for the exceedingly demanding population of users on the go. The retail industry and service providers are recognizing that this growing crowd of potential customers relies on their devices to learn about products and services and to discover other users' experiences, ultimately changing the way consumers make decisions about their purchases.1,2

Although product and service reviews are abundantly available online, selecting relevant information from them requires a potential buyer to spend a significant amount of time reading the reviews and weeding out comments unrelated to the important aspects of the reviewed entity. For this task, opinion mining and sentiment analysis methods can help extract the target of the opinions expressed in the reviews and their relative polarity (positive, negative, or neutral).3-6 However, the user must engage in a complex task to process all the facts, opinions, and ratings read in the previous step and subsequently interpret, compare, contrast, and, ultimately, summarize the relevant information.

Although researchers have widely explored opinion mining and sentiment analysis, summarization of evaluative text—the documents containing opinion or sentiment-laden material—is fairly new and significantly different from traditional summarization tasks. Most summarization techniques focus on distilling factual information by identifying a document's main topics, removing redundancies, and coherently ordering the extracted phrases or sentences. These approaches have largely been developed using corpora consisting of well-formed documents from domains such as news articles,7,8 medical literature,9 biographies,10 technical articles,11 blogs,12 and Web documents related to geolocated entities.13 Essentially, traditional summarization tends to identify and discard redundancies, but in sentiment-laden text, similar opinions mentioned multiple times across documents are crucial indicators of the overall strength of the sentiments expressed by the writers.14

More specifically, sentiment-laden documents are usually about a single entity, which can be either a product, such as digital cameras, DVD players, or books, or related to a user's experiences of a service, such as staying in a hotel or dining in a restaurant. Typically, an entity has several ratable features or aspects that might be the subject of reviewers' positive or negative opinions—such as food quality, service, or price. In this sense, we can look at each review as a set of aspects with associated opinions and ratings that define the strength and polarity of the opinions expressed in the reviews. These can also have integer values, often visualized with a number of star symbols.

Starlet, an approach we developed for extracting multidocument summaries for evaluative text, considers aspect rating distributions and language modeling as summarization features. These features encourage the inclusion of sentences in the summary that preserve the overall opinion distribution expressed across the original reviews and whose language best reflects those reviews' language. In this article, we describe Starlet and how it improves on traditional summarization techniques and other approaches to multidocument summarization (see the "Related Work in Multidocument Summarization" sidebar for additional information).

Overview

Textual summarization of multiple reviews could be accomplished by using abstractive techniques that directly express, for each aspect, the rating distribution across the whole review set and, in addition, select content or extract snippets from the reviews to illustrate this opinion distribution. However, we're more interested in exploring how far we can go in using extractive techniques to gather helpful sentences that reflect the modal view across the review set. Moreover, extractive techniques are less complex, have proved quite successful in other areas of automatic summarization, and require less manual domain adaptation than abstractive methods.

Summarization by sentence selection requires identifying the relevant sentence features that the summarization model can use to assess individual sentence quality. With this goal in mind, we can divide our overall approach into three major steps: training rating-prediction and n-gram language models; using these models to extract features from each input sentence; and using A* search15 to find an optimal subset of sentences from the input documents to create a summary. A* search is a method to efficiently explore a large space of possible options (in our case, the candidate sentences for the target summary) and select the optimal solution based on the least-cost path (the best combination of sentences for the target summary).

In the first step, we create two models, the first of which is a rating-prediction model. Given an evaluative text document as input, it outputs an estimate for each aspect of the likelihood that the text expresses each of the five ratings on a five-point scale. So, for example, (aspect = food, ratings = {0.2, 0.1, 0.4, 0.2, 0.1}) says that the likelihood the given input text expresses a rating of three for food is 0.4. We use this model to estimate the rating distribution of each sentence used in the training and testing sets. The second model is actually a pair of n-gram language models that we use to enhance the sentence feature set, with additional features to measure how likely the sentence is to be used by humans to describe an aspect.

In the second step, we use these models to create feature values for the sentences in our input documents. After this step, each sentence is associated with a set of features.

The final step uses the features from the second step to perform the summarization task for training and testing purposes. In both training and testing, we use a summarization model to compute an overall summary score based on summing sentence feature values. The summarization model helps create a summary that is, according to the model, the best among all possible extractive summaries. For this purpose, we use A* search to find the best summary. During training, we run A* search with randomly selected feature weights to generate the 10 best summaries with a maximum number of words defined by a parameter L (we considered both 100- and 50-word summaries). On these 10-best summaries, we run an optimization method to update or retrain the prediction model. We iterate this process until the predicted error is stable. In testing mode, we only use A* search once with the optimized weights learned during training.
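The rating-distribution output described above can be pictured as a mapping from each aspect to five probabilities, one per star value. The short sketch below is only an illustration of that representation: the food distribution is the example given in the text, the service distribution is invented, and the helper function is hypothetical rather than part of the authors' system.

```python
# Illustration of the rating-distribution representation discussed above:
# for each aspect, the rating predictor outputs one probability per star value
# (1-5). The food numbers come from the example in the text; the service
# numbers are invented for illustration.
food = {1: 0.2, 2: 0.1, 3: 0.4, 4: 0.2, 5: 0.1}
service = {1: 0.05, 2: 0.05, 3: 0.1, 4: 0.3, 5: 0.5}

def most_likely_rating(distribution):
    # The star value with the highest estimated probability.
    return max(distribution, key=distribution.get)

print(most_likely_rating(food), most_likely_rating(service))  # 3 5
```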

Sidebar: Related Work in Multidocument Summarization

From a high-level viewpoint, we can divide approaches to multidocument summarization into extractive and abstractive. Extractive summarization assumes that we can use fragments (phrases or sentences) extracted from the original documents to assemble a coherent shorter version of the original text without substantially changing the information conveyed by the source. Abstractive methods summarize documents by generating a shorter natural language text document composed of new sentences based on information extracted from the input text. Although researchers have studied both types of summarization for factual and edited text documents, few contributions extend these approaches to evaluative text summarization. Most studies focus on sentiment analysis and information extraction, neglecting the issue of how to adapt content selection when sentiment-laden sentences are present.

Most evaluative text summarization methods try to organize the sentiment-laden sentences according to aspect and polarity. Sasha Blair-Goldensohn and his colleagues see sentences as qualitatively aggregated by aspect and "star ratings," based on a manually defined strategy1; Minqing Hu and Bing Liu2 simply list the sentences by aspect and polarity. Other work views aspects and polarities as graphically organized and visualized.3,4 However, star ratings alone aren't sufficient to understand the reasons why reviewers have assigned a specific value to their evaluation. Koji Yatani and his colleagues show that textual support is important to distill information from reviews,5 but none of the existing approaches consider text summaries in terms of rating distributions or introduce metrics to quantitatively evaluate the summary's quality.

Giuseppe Carenini and his colleagues propose a summarization method for evaluative text in the consumer product review domain, focusing on digital cameras and DVD players.6 Specifically, they introduce aspect distributions across documents and sentiment polarity at the sentence level as features for the summarization models. This approach, although an improvement over traditional techniques, has several drawbacks. First, the sentence selection mechanism only considers the most frequently discussed aspects, leaving the decision about where to stop the selection process to the maximum summary length parameter. This could omit interesting opinions that don't appear with sufficient frequency in the source documents. Second, they use the absolute value of the sum of positive and negative contributions to determine a sentence's relevance in terms of opinion content. This flattens the aspect distributions because sentences with very negative or very positive polarity, or with numerous opinions but moderate polarity strengths, will get the same score, regardless. Ultimately, how to use the summarization features is established a priori based on expert knowledge and prior work in this area7-9 rather than by weighting these features from data using automatic quality metrics.

References
1. S. Blair-Goldensohn et al., "Building a Sentiment Summarizer for Local Service Reviews," NLP in the Information Explosion Era, 2008; www.ryanmcd.com/papers/local_service_summ.pdf.
2. M. Hu and B. Liu, "Mining and Summarizing Customer Reviews," Proc. ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD), ACM, 2004, pp. 168–177.
3. B. Liu, M. Hu, and J. Cheng, "Opinion Observer: Analyzing and Comparing Opinions on the Web," Proc. 14th Int'l Conf. World Wide Web, ACM, 2005, pp. 342–351.
4. G. Carenini and L. Rizoli, "A Multimedia Interface for Facilitating Comparisons of Opinions," Proc. 13th Int'l Conf. Intelligent User Interfaces, ACM, 2009, pp. 325–334.
5. K. Yatani et al., "Review Spotlight: A User Interface for Summarizing User-Generated Reviews Using Adjective-Noun Word Pairs," Proc. SIGCHI Conf. Human Factors in Computing Systems, ACM, 2011, pp. 1541–1550.
6. G. Carenini, J. Cheung, and A. Pauls, "Multi-Document Summarization of Evaluative Text," Computational Intelligence J., 2012; doi:10.1111/j.1467-8640.2012.00417.x.
7. M. Osborne, "Using Maximum Entropy for Sentence Extraction," Proc. ACL-02 Workshop on Automatic Summarization, Assoc. Computational Linguistics (ACL), 2002, pp. 1–8; http://dx.doi.org/10.3115/1118162.1118163.
8. M. Galley, "A Skip-Chain Conditional Random Field for Ranking Meeting Utterances by Importance," Proc. Conf. Empirical Methods in Natural Language Processing, ACL, 2006, pp. 364–372.
9. S. Xie and Y. Liu, "Improving Supervised Learning for Meeting Summarization Using Sampling and Regression," Computational Speech Language, vol. 24, no. 3, 2010, pp. 495–514.

Rating Prediction and Language Modeling

In the next sections, we describe the basic steps involved in the automatic summarization methods implemented in Starlet. To train our models, we assume that an initial set of reviews is already available with associated reviewers' ratings across a finite number of aspects. In the first part, using our trained model, we predict sentences containing opinions about specific aspects and relative star-rating values. This helps to promote candidate sentences with high opinion content. In the second part, to better capture the language used in sentiment-laden sentences, Starlet creates language models with positive and negative sentences and ranks them accordingly. This stage helps to boost sentences that are more representative of the language used to express opinions, demoting, at the same time, sentences that are syntactically ill-formed or far from the typical language used in reviews. Once each sentence is characterized by the features described in these two steps, Starlet proceeds with the final step—selecting the optimal combination of sentences that satisfies the required summary length and rating distributions of the input reviews.

Rating-Prediction Models

As previously mentioned, reviews refer to specific aspects of a product or service. For instance, restaurant reviews will express opinions about food quality, wait-staff courtesy, or ambience. These aspects are typically rated with a certain number of stars ranging from one (poor) to five (excellent). Other research has found16 that it's possible to train a rating-prediction model to, for each aspect a_i ∈ A (where A = {food, service, ambience, value, overall}), estimate the ratings r_i ∈ R (where R = {1, ..., 5}) for any review document d_j in the review corpus D as

\hat{r}_i = \arg\max_{r_i \in R} P(r_i \mid d_j)    (1)
          = \arg\max_{r_i \in R} P(r_i \mid s_{1,j}, s_{2,j}, \ldots, s_{n,j}),    (2)

where each document d_j is composed of n sentences or phrases s_{1,j}, s_{2,j}, ..., s_{n,j}. We used a maximum entropy (MaxEnt)17 model to estimate the conditional probability of the ratings (Equation 2), given the features extracted from the text documents. In this approach, each review document has associated with it a set of predefined aspects that reviewers have assessed with star-rating evaluations. During training, text features such as n-grams, parts of speech, and shallow parsing chunks are used together with the reviewer-assigned ratings to create a discriminative model classifier. There are five classifiers, one for each aspect.16

For each document, the MaxEnt models produce an estimate of the rating probability distribution, which describes how ratings for a specific aspect are likely to be associated with the document text. For instance, a review such as, "The service was flawless, timely, and nonintrusive; everything was great," would have a rating distribution for the aspects service and overall skewed toward the high ratings (four or five) and an almost uniform rating distribution for the remaining aspects.

The MaxEnt rating-prediction models have been automatically learned from 6,823 restaurant reviews, with each review containing a text description and the relative star rating for each evaluated aspect. Performance, measured in terms of rank loss, averages 0.63 across all the aspects—that is, on average, system ratings differ by 0.63, on a five-point scale, from those assigned by humans.
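The paper trains MaxEnt classifiers over n-grams, part-of-speech tags, and chunk features; a MaxEnt classifier is equivalent to multinomial logistic regression, so the sketch below uses scikit-learn's LogisticRegression over word n-grams only as a simplified stand-in. The training texts and star ratings are invented, and only the service aspect is shown (the real system trains one classifier per aspect).

```python
# A minimal stand-in for the per-aspect MaxEnt rating predictor: multinomial
# logistic regression over word n-grams. The toy data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "The service was flawless, timely, and nonintrusive; everything was great.",
    "Waited an hour for cold food and the waiter never came back.",
    "Decent meal, nothing special, service was acceptable.",
]
service_ratings = [5, 1, 3]  # reviewer-assigned stars for the 'service' aspect

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, service_ratings)

# For a new sentence, predict_proba gives a rating distribution over the seen
# rating classes; this is the kind of distribution later used as a feature.
sentence = "Our waiter was friendly and attentive all evening."
dist = dict(zip(model.classes_, model.predict_proba([sentence])[0]))
print(dist)
```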

Modeling Review Language

We can evaluate a summary's overall quality by using automatic metrics that depend on the summarization task's goal; they're typically highly correlated with the evaluation method used by human judges.18 When summarizing reviews, in addition to traditional evaluation criteria such as grammaticality, nonredundancy, clarity, and coherence, we might also want to consider how well the summary reproduces the opinion content expressed in the original review set and how well the selected sentences represent the language used in the specific review domain. Does it contain the pieces of information commonly used in aspect reviews? Does it conform to the way typical reviews are written? Although the former is captured by the rating-prediction models described in the previous section, the latter, which can be referred to as review n-gram language modeling, relates to specific language usage in the domain.

Because the language style and words used for negative reviews differ substantially from those found in positive reviews, we use a generative approach to create statistical language models19 based on both the positive and negative reviews, as determined by the overall star ratings. We can formulate a language model as a probability distribution P(W) over the word sequence W = (w_1, ..., w_m). We can then decompose P(W) as

P(W) = P(w_1, \ldots, w_m) = P(w_1) P(w_2 \mid w_1) \cdots P(w_m \mid w_1, \ldots, w_{m-1}) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}),

where P(w_i | w_1, ..., w_{i-1}) is the probability that the word w_i follows the sequence of words (w_1, ..., w_{i-1}), and P(w_i | w_{i-(n-1)}, ..., w_{i-1}) is the approximation of the same word-sequence probability obtained by assuming that the dependency on the previous words only extends to the preceding n − 1 words (n-grams). In the case of a trigram model, for instance, the dependency on previous words is limited to the previous two. In our generative model, we use perplexity20 to determine how well a test sentence fits the training model. Lower values of perplexity for a test sentence indicate that the word sequence is more likely to have been generated by the language model.

To model review language, we created two language models (LMs) from restaurant review data. We trained the first model, LM1, on negative reviews (rated with one star) and the second model, LM5, on positive reviews (five-star ratings). We then normalized, tokenized, and split the text data into sentences. We trained the models with trigrams, the same vocabulary, and Kneser-Ney smoothing.21
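To make the perplexity feature concrete, here is a small self-contained trigram model. The paper's LM1/LM5 use Kneser-Ney smoothing over large review corpora; this sketch substitutes add-one smoothing and two invented miniature corpora so it runs on its own, and is not the authors' implementation.

```python
# Simplified trigram language model with add-one smoothing, showing how a
# sentence's perplexity under LM1 (negative) and LM5 (positive) could become a
# feature. The paper uses Kneser-Ney smoothing; the data here is invented.
import math
from collections import Counter

def train_trigram_lm(sentences):
    tri, bi, vocab = Counter(), Counter(), set()
    for s in sentences:
        toks = ["<s>", "<s>"] + s.lower().split() + ["</s>"]
        vocab.update(toks)
        for i in range(2, len(toks)):
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
            bi[(toks[i - 2], toks[i - 1])] += 1
    return tri, bi, len(vocab)

def perplexity(sentence, lm):
    tri, bi, v = lm
    toks = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
    logprob, n = 0.0, 0
    for i in range(2, len(toks)):
        # Add-one smoothed conditional probability P(w_i | w_{i-2}, w_{i-1}).
        p = (tri[(toks[i - 2], toks[i - 1], toks[i])] + 1) / (bi[(toks[i - 2], toks[i - 1])] + v)
        logprob += math.log(p)
        n += 1
    return math.exp(-logprob / n)

lm5 = train_trigram_lm(["the food was amazing", "great food and great service"])
lm1 = train_trigram_lm(["the food was cold", "terrible service and rude staff"])
s = "the food was great"
print(perplexity(s, lm5), perplexity(s, lm1))  # lower perplexity = better fit
```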

Feature Extraction for Sentence Scoring

For the review summarization task, we hypothesized that the sentences to be preferred were those most likely to match the modal opinions for each aspect expressed in the reviews and, at the same time, to be in agreement with the syntactic and lexical style used by the reviewers. To achieve this dual goal, we used the previously described rating-prediction and language models to assign features to the sentences in our training and testing sets. Although we trained the rating-prediction model with labels associated with the full reviews, we predicted that the model would generalize to single sentences, providing accurate probability distributions over the ratings. We validated this on a small sample of annotated sentences and found performance comparable to the document-based test set.

Given a set of review documents for a single reviewed entity, each of which consists of a textual review plus ratings for a set of aspects, feature extraction proceeds as follows. For each sentence and aspect, the feature extractor calculates the Kullback–Leibler (KL) divergence22 between the predicted sentence rating distribution (as provided by MaxEnt) and a target rating distribution. The KL-divergence quantifies how far each sentence aspect rating distribution is from the target; target rating distributions are calculated by numerically aggregating the aspect ratings from the input reviews for the reviewed entity, as manually assigned by reviewers. This method promotes sentences whose rating distribution is similar to that of the review set as a whole and, hence, prefers sentences that are likely to express each aspect's modal opinion. To capture each sentence's proximity to in-domain language usage, we use the positive and negative language models described earlier to compute each sentence's perplexity.
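The sketch below shows one plausible way to compute the KL-divergence feature just described: build the target distribution from the reviewer-assigned stars, then measure how far a sentence's predicted distribution is from it. All numbers are invented, the light smoothing and the direction of the divergence are assumptions for illustration, and the code is not the authors' implementation.

```python
# Sketch of the KL-divergence feature: distance between a sentence's predicted
# rating distribution and the target distribution aggregated from the reviewers'
# star ratings for the entity. Numbers are invented.
import math
from collections import Counter

def target_distribution(star_ratings, smoothing=1e-3):
    # Normalized histogram of reviewer-assigned stars (1-5), lightly smoothed
    # so the KL-divergence stays finite for unseen ratings.
    counts = Counter(star_ratings)
    raw = [counts.get(r, 0) + smoothing for r in range(1, 6)]
    total = sum(raw)
    return [c / total for c in raw]

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Stars the reviewers gave the 'food' aspect across the input reviews.
food_target = target_distribution([5, 4, 5, 3, 4, 5, 4])

# Predicted rating distribution for one candidate sentence (from the predictor).
sentence_food_dist = [0.02, 0.03, 0.10, 0.35, 0.50]

print(kl_divergence(sentence_food_dist, food_target))  # lower = closer to target
```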

Summarization Modeling

We use a summarization model to score candidate summaries. While exploring the search space for the best candidate summaries, a score function (defined later within the context of our feature sets) ranks each set of sentences considered in the current search state. Such a function assigns higher scores to combinations of sentences that better represent our summaries. The formulation of the summarization model (described in detail elsewhere)23 is reported below as a linear combination of feature functions:

s(y \mid x) = \sum_{i \in y} \phi(x_i) \cdot \lambda,    (3)

where x is the document set, composed of k sentences, y ⊆ {1, ..., k} is the set of sentence indexes selected for a summary, φ(·) is a feature function that returns a vector of feature values for each candidate summary sentence, and λ is the weight vector associated with the feature vector. We use the term prediction model to refer to the weight vector λ; it must be trained to distinguish between good and bad summaries. The search process uses the summarization model to find an optimal summary:

\hat{y} = \arg\max_{y} s(y \mid x).    (4)
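A tiny runnable rendering of Equation 3 follows: the score of a candidate summary is the sum, over its selected sentences, of each sentence's feature vector dotted with the weight vector. The feature values and weights are invented placeholders, not values from the paper.

```python
# Minimal version of Equation 3: a candidate summary's score is the sum over
# selected sentences of (feature vector . weight vector). Values are invented.

def summary_score(selected_indexes, sentence_features, weights):
    return sum(
        sum(f * w for f, w in zip(sentence_features[i], weights))
        for i in selected_indexes
    )

# Two features per sentence, e.g. negated KL-divergence and negated perplexity.
sentence_features = [(-0.2, -1.3), (-0.9, -0.7), (-0.1, -0.4)]
weights = (1.0, 0.5)

print(summary_score({0, 2}, sentence_features, weights))
```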

In previous research,23 summary creation was formulated as a search problem in which the aim is to find a subset of sentences from the entire set of documents that is optimal according to an evaluation metric such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE).18 ROUGE is a well-known evaluation method for summarization that's based on the common number of n-grams between a peer and one or several model summaries. Its results correlate strongly with human judgments of content overlap between peer and model summaries.18 The search is also constrained so that the subset of sentences doesn't exceed the summary length threshold.

While searching, a graph is constructed whose nodes are search states and whose edges represent sentences that get added to the summary if the edge is traversed (see Figure 1). Each node has associated information about the summary length, the summary score, and a heuristic function score. The search starts with an empty summary (the start state has length 0 and summary and heuristic scores of 0) and follows one of the outgoing arcs to expand it. A new state is created when a new sentence is added to the summary. The new state's length is updated with the number of words in the new sentence, and the summarization model computes the state's summary score. A goal state is any state where it's not possible to add another sentence without exceeding the summary length threshold. The summarization problem is then equivalent to finding the best scoring path (the sum of the sentence scores on this path) between the start state and a goal state.

To find this path, we use the A* search algorithm15 to efficiently traverse the search graph. A* search applies a best-first strategy to traverse the graph from a starting state to a goal state. The search procedure requires a scoring function or summarization model for each state and a heuristic function that estimates the additional score to get from a given state to a goal state. The search algorithm is guaranteed to converge to the optimal solution if the heuristic function is admissible—that is, if the function used to estimate the cost from the current node to the goal never overestimates the actual cost. As a cost-function estimator, we use the final aggregated heuristic.23 This function provides an upper bound on the additional score achievable in reaching a goal state from the current summary state. It adds a sentence's score to the heuristic score when it doesn't violate the allowed summary length. When the next sentence is too long to fit the required length, it skips sentences until it finds the best scoring sentence that fits.


Figure 1. A* search graph for extractive summarization. The search space is represented by all the possible combinations of sentences extracted from the input reviews to summarize (the number in each circle represents a sentence index). At the first step, only summaries composed of one sentence are ranked. At the second step, combinations of two sentences are considered, and so on for the remaining steps until the goal state is reached. At each step, summary lengths, summary scores, and heuristic scores are computed. The goal state is reached when the desired summary length is achieved.
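The following sketch shows an A*-style best-first search over sentence subsets under a word-length limit. It is a simplified stand-in for the search just described: sentences and scores are invented, sentence scores are assumed nonnegative, and the optimistic estimate below only loosely mirrors the aggregated heuristic of the original work.

```python
# A*-style sentence selection: maximize the summary score under a word limit.
# Sentences and scores are invented; the real system scores sentences with the
# feature/weight model described in the text.
import heapq

def astar_summary(sentences, scores, max_words):
    """sentences: list of strings; scores: per-sentence scores (assumed >= 0)."""
    lengths = [len(s.split()) for s in sentences]
    n = len(sentences)

    def heuristic(next_i, remaining):
        # Optimistic estimate: every remaining sentence that individually fits
        # could still be added, so its score is counted in full (an upper bound).
        return sum(scores[j] for j in range(next_i, n) if lengths[j] <= remaining)

    # Priority queue ordered by -(g + h); states add sentences in index order.
    start = (0.0, 0, (), max_words)  # (score so far, next index, chosen, words left)
    heap = [(-heuristic(0, max_words), start)]
    while heap:
        _, (g, i, chosen, remaining) = heapq.heappop(heap)
        successors = [j for j in range(i, n) if lengths[j] <= remaining]
        if not successors:
            # Goal state: no further sentence fits. The first goal popped is
            # optimal because the heuristic never underestimates what is
            # still achievable from any frontier state.
            return [sentences[j] for j in chosen], g
        for j in successors:
            g2 = g + scores[j]
            rem2 = remaining - lengths[j]
            state = (g2, j + 1, chosen + (j,), rem2)
            heapq.heappush(heap, (-(g2 + heuristic(j + 1, rem2)), state))
    return [], 0.0

sentences = [
    "Great food and friendly service.",
    "The duck was terrific, crisp and flavorful.",
    "We waited a long time for our drinks.",
    "We will definitely return.",
]
scores = [2.1, 1.8, 0.4, 1.2]
summary, score = astar_summary(sentences, scores, max_words=12)
print(score, summary)
```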

In previous work,23 the researchers posed the training problem as one of finding model parameters λ such that the predicted output ŷ closely matches the gold standard r. They measured match quality by using ROUGE.18 In training, this approach tries to minimize a loss function's value, the degree of error in the prediction Δ(ŷ, r). This loss is formulated as 1 − R, where R is the ROUGE score. The training problem is to solve

\hat{\lambda} = \arg\min_{\lambda} \Delta(\hat{y}, r),    (5)

where ŷ and r are taken to range over the corpus of many document sets and summaries. We train the prediction model using the minimum error rate training (MERT) technique.24 MERT is a first-order optimization method that uses Powell25 search to find the parameters that minimize loss in the training data. A* search produces the n-best lists necessary for MERT to optimize an objective metric for summary quality such as ROUGE.
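The sketch below only illustrates the shape of this weight-tuning step: minimize 1 minus a quality score over the weight vector with Powell's derivative-free method (here via scipy). The quality function is a toy placeholder standing in for "run A* with these weights and score the resulting summaries with ROUGE"; it is not MERT itself and not the authors' training code.

```python
# Toy illustration of the weight-tuning objective: find feature weights that
# minimize 1 - quality(weights), using Powell's derivative-free method. In the
# real pipeline, quality() would run A* over n-best lists and compute ROUGE.
import numpy as np
from scipy.optimize import minimize

def summary_quality(weights):
    # Placeholder quality function peaking at weights (1.0, 0.5).
    target = np.array([1.0, 0.5])
    return float(np.exp(-np.sum((np.asarray(weights) - target) ** 2)))

def loss(weights):
    return 1.0 - summary_quality(weights)  # plays the role of 1 - ROUGE

result = minimize(loss, x0=np.zeros(2), method="Powell")
print(result.x)  # should land near [1.0, 0.5]
```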

Evaluation

In the next sections, we evaluate the Starlet approach to extractive review summarization in the restaurant review domain by comparing its performance with an existing approach to multidocument summarization, using objective and subjective metrics.

Data

We selected the review documents used in our experiments from a corpus of previously mined online restaurant reviews. In addition to textual data, this corpus provides numerical ratings for five predefined aspects: atmosphere, food, value, service, and overall. From the set of 3,866 available restaurants, we selected 131 with more than five reviews. Then, we manually searched for extra reviews on other websites and selected 60 of the 131 restaurants that had reviews highly voted by Web readers as useful. For each of the 60 restaurants, we selected the reviews with the highest number of "helpful votes" that were dated in the same time frame as the original reviews and with a word length similar to the target summary length.


Random summary: So A for great service, too. Wonderful attention to detail and extremely understanding of food allergies that can sometimes be a problem. One of my coworkers recommeded Twigs for a true fine dining mountain experience. It took us 15 minutes just to get our drinks and another 45 for the food.

MEAD summary: My wife and another couple traveled to the mountains from Charlotte for a weekend. Service was horrible-very smokey bar, too loud to be able to enjoy a romantic dinner poor drinks. I ate the crab cakes and they were so tasty-by far my favorite.

Starlet summary (KL-divergence): Great food. The duck was terrific-crisp and flavorful. Go and enjoy. Very nice setup. What a place. We will return. My husband and 2 other couples ate at Twigs during leaf season. We all got different things from lobster to filet.

Starlet summary (KL-divergence + LM): We had a great time. Great food. Service was good and the food was better than expected. My wife and another couple ate there recently. We ended up staying for 4 hours. We will return. The food was amazing.

Figure 2. Example summaries extracted from a set of 10 restaurant reviews, together with the target modal distributions for the aspect ratings (x-axis: star ratings; y-axis: normalized frequency). The rating distributions are used only by the Starlet summaries. A qualitative analysis of the summary contents shows that the sentences selected by Starlet are richer in opinions related to the evaluated restaurant aspects.

Among the matching reviews, we chose the review that most reflected the target modal rating distributions and used it as a reference summary. We randomly split the 60 restaurants into 40 for training and 20 for testing. Each restaurant had between 6 and 10 reviews, with an average of 7.72. For language-modeling purposes, we mined 32,728 restaurant reviews from the Web and divided them by overall star rating. We selected the two most extreme datasets in terms of overall polarity and processed the data for language-modeling training: LM1 was trained on 31,388 sentences with 540,895 words from negative reviews, and LM5 on 103,481 sentences and 1,580,854 words from positive reviews.

Results

To validate our Starlet approach, we used both automatic and manual evaluation. In the automatic evaluation, we compared our summarizer's output against model summaries using ROUGE. The metrics we took into consideration were ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-SU4 (R-SU4). R-1 and R-2 compute the number of unigrams and bigrams, respectively, that coincide between the automatic and model summaries; R-SU4 measures the overlap of skip-bigrams between them, allowing a skip distance of four. We also performed a slightly modified version of the five-scale manual evaluation used in the Document Understanding Conference (DUC; http://duc.nist.gov) and the Text Analysis Conference (TAC; www.nist.gov/tac).26 We did this to assess summary quality along dimensions other than those captured by ROUGE. As baselines, we used two different summarizers: a baseline summarization system that randomly selects sentences with no repetition until it reaches the desired length limit measured in number of words (random),27 and the open source MEAD system with the same output constraints.28 (We selected the word compression method and default values for the other parameters.) We created prediction models for both 100- and 50-word summary lengths and evaluated separately the contributions of the KL-divergence-based features alone (Starlet) and the combination of the KL-divergence and LM features (Starlet-LM). Figure 2 shows examples of 50-word summaries for one restaurant review set consisting of 10 reviews and the related graph of target distributions for each aspect rating.

ROUGE evaluation. Table 1 shows the three ROUGE scores and the 95 percent confidence intervals for R-1 used to evaluate the four summarization models. Starlet-LM outperforms the two baseline systems in all ROUGE metrics for the 100-word summaries and outperforms all the models, including the KL-divergence-based summary, in the 50-word summaries. This means that, according to ROUGE, Starlet-LM generates summaries whose lexical content is closer to human summaries. For the 100-word summaries, Starlet-LM was significantly better than both the random summaries (p < 0.04) and the MEAD system (p < 0.05). When considering the 50-word summaries, Starlet-LM is significantly better than all other systems: random with p < 0.0001, MEAD with p < 0.02, and Starlet with p < 0.03.
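As a bare-bones illustration of what the R-1 metric measures, the snippet below computes unigram recall of a system summary against a single reference. Real evaluations use the ROUGE toolkit, which adds options such as stemming and confidence estimation; the example strings here are invented.

```python
# Bare-bones illustration of ROUGE-1 recall: the fraction of the reference
# summary's unigrams that also appear in the system summary.
from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    sys_counts = Counter(system_summary.lower().split())
    ref_counts = Counter(reference_summary.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge_1_recall("great food and good service",
                     "the food was great and the service was good"))
```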

Table 1. Test set results for 100- and 50-word summaries.*

100-word summaries:

| Model | R-1 (95% CI) | R-2 | R-SU4 |
| Random • | 0.289 ± 0.031 | 0.030 | 0.081 |
| MEAD ° | 0.309 ± 0.037 | 0.051 | 0.089 |
| Starlet | 0.336 ± 0.038 | 0.063 | 0.105 |
| Starlet-LM •° | 0.346 ± 0.035 | 0.060 | 0.106 |

50-word summaries:

| Model | R-1 (95% CI) | R-2 | R-SU4 |
| Random • | 0.307 ± 0.038 | 0.032 | 0.085 |
| MEAD ° | 0.346 ± 0.049 | 0.047 | 0.098 |
| Starlet * | 0.392 ± 0.035 | 0.067 | 0.122 |
| Starlet-LM •°* | 0.442 ± 0.051 | 0.110 | 0.165 |

* Statistically significant results are paired by the same symbol (•°*) and based on a t-test. CI = confidence interval.

Manual evaluation. In the manual evaluation, we considered only three summarization models at the maximum output length of 100 words. Manual evaluation is an expensive process, and we wanted to focus our evaluation on longer summaries that are more likely to contain summarization and opinion selection issues. We asked three people (two of whom are native English speakers) to evaluate the quality of the generated summaries according to the evaluation criteria described elsewhere.26 Without showing the reference summary, we asked each participant to rate the following linguistic qualities on a rating scale ranging from a maximum of five (very good) to a minimum of one (very poor): grammaticality (grammatically correct and without artifacts); redundancy (absence of unnecessary repetitions); clarity (easy to read); and coherence (well-structured and organized). Because the focus property26 applies mostly to the DUC summarization tasks, we replaced it with coverage to indicate the level of coverage for the aspects and views expressed in the summary. In other words, this metric is higher if the summary is mostly composed of opinion-laden sentences relating to the predefined restaurant aspects, and if these opinions reflect those expressed in the original review documents.


Table 2. Manual evaluation for three summarization systems (100-word summaries).*

| System | Grammaticality | Redundancy | Clarity | Coverage | Coherence |
| Random | 3.53 ± 0.29 | 2.82 ± 0.17 | 2.78 ± 0.19 • | 2.67 ± 0.22 • | 2.05 ± 0.18 •° |
| MEAD | 3.68 ± 0.25 | 2.92 ± 0.24 | 2.97 ± 0.26 | 2.33 ± 0.22 ° | 2.57 ± 0.27 • |
| Starlet | 3.67 ± 0.25 | 3.00 ± 0.21 | 3.10 ± 0.24 • | 3.23 ± 0.25 •° | 2.62 ± 0.26 ° |

* Statistically significant results are paired by the same symbol (•°) and based on a t-test.

Table 2 shows the average scores and relative confidence interval for each criterion.

Discussion

The Starlet summarizer uses as its primary feature for sentence selection the KL-divergence between the predicted rating distribution for a candidate sentence and the distribution of user-supplied ratings for the set of reviews from which the sentence is drawn. This approach performed better on all ROUGE measures, and at both summary lengths, than both the random sentence baseline and the MEAD approach, which uses a set of standard features widely employed in conventional multidocument summarization, such as the cosine similarity of a candidate sentence with the centroid of the document cluster, similarity with the first sentence or title, position in the document, and so on. However, this increase in performance isn't significant, at least not for the R-1 measure. This might be because we performed the ROUGE evaluation against a single reference summary, rather than several reference summaries reflecting a spread of opinion; we're currently working on this broader-based evaluation. Note, however, that for the coverage measure forming part of the manual evaluation (see Table 2), Starlet performed substantially better than the other two systems in the evaluation. This suggests that the rating distribution features—one for each aspect, whose function is to encourage the summarizer to prefer sentences with rating distributions similar to the review set's aggregated rating distribution—might be leading the summarizer to select sentences that address different aspects, rather than those that reflect the central tendencies of the review cluster, as MEAD does.

The Starlet-LM summarizer uses the KL-divergence rating distribution similarity feature found in Starlet as well as two language model features charged with helping the system select sentences whose language is characteristic of that found in review articles. Starlet-LM's raw ROUGE scores are better than Starlet's for all ROUGE measures and both summary lengths, except for the R-2 measure on the 100-word summaries, where it's just slightly lower. Starlet-LM's R-1 scores are significantly better than Starlet's and the other two systems' at both summary lengths. R-1 measures lexical overlap with the reference summary, so Starlet-LM's performance suggests that using the two language models as features significantly improves the summarizer's ability to select sentences that use language typically found in the review genre.

Looking at the manual evaluation from the judges, the grammaticality scores are consistent across the three methods and depend only on the source sentences' quality. The redundancy score is slightly better for Starlet, but in the current version, there's no mechanism for avoiding similar sentences, although selecting sentences according to the rating distribution should help reduce redundancy. Also, the clarity and coherence scores are better in our approach than in the random system, but not significantly different from MEAD. Low scores are related to controversial reviews in which opinions are mixed and distributed across the ratings. In these cases, more investigation is necessary, perhaps ordering positive and negative sentences according to some rhetorical structure or language models learned from data. Finally, the coverage score for Starlet is decidedly better than for the other approaches, suggesting that Starlet selects content with a high level of opinions and aspects that could be relevant to users.

Extractive summaries are linguistically interesting and can be both informative and concise. They also require less engineering effort. On the other hand, abstractive summaries tend to have better coverage for a particular level of conciseness, be less redundant, and seem more coherent, but they require more computational resources and semantic interpretation. In future work, we plan to expand the weight-learning process to include other important summarization features, such as redundancy and coherence. Another future topic concerns the limitation on the fixed number of aspects considered for ratings. Typically, other aspects discussed in the reviews, such as drinks, location, or family-friendly management, aren't captured by the five given aspects; these topics are often evaluated in the text but not explicitly expressed by star ratings. Removing such a restriction might involve techniques for extracting the lexical forms identifying the aspects and recovering the concept associated with such surface forms. However, recent advances in semantic- and concept-parsing techniques29 offer promising approaches to reliably identifying those high-level concepts targeted by reviewers' opinions. One final interesting area is to experiment with extrinsic summarization metrics, in which judges evaluate the completion of a specific task executed by the system's users to find out whether an automatically generated summary correlates with the summarization task that a user would conduct to select a service or a product based on reviews.30

References
1. W. Duan, B. Gu, and A.B. Whinston, "Do Online Reviews Matter? An Empirical Investigation of Panel Data," J. Decision Support Systems, vol. 45, no. 4, 2008, pp. 1007–1016.
2. D. Park, J. Lee, and I. Han, "The Effect of On-line Consumer Reviews on Consumer Purchasing Intention: The Moderating Role of Involvement," Int'l J. Electronic Commerce, vol. 11, July 2007, pp. 125–148.
3. B. Pang and L. Lee, "Opinion Mining and Sentiment Analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1–2, 2008, pp. 1–135.
4. R. McDonald et al., "Structured Models for Fine-to-Coarse Sentiment Analysis," Proc. Assoc. Computational Linguistics, Assoc. Computational Linguistics (ACL), 2007, pp. 432–439.
5. M. Hu and B. Liu, "Mining and Summarizing Customer Reviews," Proc. ACM SIGKDD Conf. Knowledge Discovery and Data Mining, ACM, 2004, pp. 168–177.
6. E. Cambria and A. Hussain, Sentic Computing: Techniques, Tools, and Applications, Springer, 2012.
7. J.M. Conroy, J.G. Stewart, and J.D. Schlesinger, "CLASSY Query-Based Multi-Document Summarization," Proc. Document Understanding Conf. Workshop at the Conf. Empirical Methods in Natural Language Processing, Nat'l Inst. Standards and Technology, 2005; http://duc.nist.gov/pubs/2005papers/ida.conroy.pdf.
8. K.R. McKeown et al., "Tracking and Summarizing News on a Daily Basis with Columbia's Newsblaster," Proc. 2nd Int'l Conf. Human Language Technology Research, Morgan Kaufmann, 2002, pp. 280–285.
9. N. Elhadad et al., "Customization in a Unified Framework for Summarizing Medical Literature," Artificial Intelligence in Medicine, vol. 33, Feb. 2005, pp. 179–198.
10. T. Copeck, N. Japkowicz, and S. Szpakowicz, "Text Summarization as Controlled Search," Proc. 15th Conf. Canadian Soc. for Computational Studies of Intelligence on Advances in Artificial Intelligence, Springer-Verlag, 2002, pp. 268–280.
11. H. Saggion and G. Lapalme, "Generating Indicative-Informative Summaries with SumUM," Computational Linguistics, vol. 28, Dec. 2002, pp. 497–526.
12. S. Mithun and L. Kosseim, "Summarizing Blog Entries versus News Texts," Proc. Workshop on Events in Emerging Text Types, ACL, 2009, pp. 1–8.
13. A. Aker and R. Gaizauskas, "Generating Image Descriptions Using Dependency Relational Patterns," Proc. ACL 2010, ACL, 2010, pp. 1250–1258.
14. L.W. Ku, Y.T. Liang, and H.H. Chen, "Opinion Extraction, Summarization and Tracking in News and Blog Corpora," Proc. AAAI-2006 Spring Symp. Computational Approaches to Analyzing Weblogs, Am. Assoc. for Artificial Intelligence (AAAI), 2006; www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-020.pdf.
15. S. Russell et al., Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.
16. N. Gupta, G. Di Fabbrizio, and P. Haffner, "Capturing the Stars: Predicting Ratings for Service and Product Reviews," Proc. NAACL HLT 2010 Workshop on Semantic Search, ACL, 2010, pp. 36–43.
17. A.L. Berger, V.J.D. Pietra, and S.A.D. Pietra, "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, vol. 22, Mar. 1996, pp. 39–71.
18. C.-Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," Proc. ACL Workshop Text Summarization Branches Out, ACL, 2004, pp. 74–81.
19. F. Jelinek, L. Bahl, and R. Mercer, "Design of a Linguistic Statistical Decoder for the Recognition of Continuous Speech," IEEE Trans. Information Theory, vol. 21, no. 3, 1975, pp. 250–256.
20. A. Mutton et al., "GLEU: Automatic Evaluation of Sentence-Level Fluency," Proc. 45th Annual Meeting Assoc. Computational Linguistics, ACL, 2007, pp. 344–351.
21. R. Kneser and H. Ney, "Improved Backing-off for m-Gram Language Modeling," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, IEEE, 1995, pp. 181–184.
22. S. Kullback and R.A. Leibler, "On Information and Sufficiency," The Annals of Mathematical Statistics, vol. 22, no. 1, 1951, pp. 79–86.
23. A. Aker, T. Cohn, and R. Gaizauskas, "Multi-Document Summarization Using A* Search and Discriminative Training," Proc. Conf. Empirical Methods in Natural Language Processing, ACL, 2010, pp. 482–491.
24. F. Och, "Minimum Error Rate Training in Statistical Machine Translation," Proc. 41st Annual Meeting Assoc. Computational Linguistics, ACL, 2003, pp. 160–167.
25. M.J.D. Powell, "An Efficient Method for Finding the Minimum of a Function of Several Variables without Calculating Derivatives," The Computer J., vol. 7, no. 2, 1964, pp. 155–162.
26. H. Dang, "Overview of DUC 2005," Proc. Document Understanding Conf. Workshop at the Human Language Technology Conf./Conf. Empirical Methods in Natural Language Processing, Nat'l Inst. Standards and Technology, 2005; www-nlpir.nist.gov/projects/duc/pubs/2005papers/OVERVIEW05.pdf.
27. J. Goldstein et al., "Summarizing Text Documents: Sentence Selection and Evaluation Metrics," Proc. 22nd Annual Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, ACM, 1999, pp. 121–128.
28. D. Radev et al., "MEAD—A Platform for Multidocument Multilingual Text Summarization," Proc. Conf. Language Resources and Evaluation (LREC), European Language Resources Assoc., 2004; www.summarization.com/~radev/papers/lrec-mead04.pdf.
29. E. Cambria et al., Big Social Data Analysis, Taylor and Francis, 2013, ch. 13.
30. I. Mani et al., "The TIPSTER SUMMAC Text Summarization Evaluation," Proc. 9th Conf. European Chapter of the Assoc. for Computational Linguistics, ACL, 1999, pp. 77–85.

The Authors

Giuseppe Di Fabbrizio is a senior research scientist at Amazon.com and a research collaborator with the University of Sheffield and the Intelligent Systems Laboratory at AT&T Labs Research. His research interests and publication topics include conversational agents, text summarization, natural language generation, opinion mining, sentiment analysis, multimodal and speech system architectures, platforms, and services. Di Fabbrizio has a PhD in computer science from the University of Sheffield. Contact him at [email protected].

Ahmet Aker is a PhD student at the University of Sheffield in the natural-language processing group. His research interests are in automatic text summarization, machine learning, statistical machine translation, comparable data acquisition from the Web, and multilingual term alignment. Aker has a German Diploma in computer science and an MS in advanced software engineering. Contact him at [email protected].

Robert Gaizauskas is a professor of computer science and head of the natural language processing group in the Department of Computer Science at the University of Sheffield. His research interests lie in applied NLP, especially in its potential to improve information access to large text collections. Gaizauskas has a DPhil from the School of Cognitive and Computing Sciences at the University of Sussex. He has published more than 140 papers in peer-reviewed journals and conference proceedings and has been an investigator on 25 funded research projects. Contact him at [email protected].