
Comparison of Feature-Level Learning Methods for Mining Online Consumer Reviews

Li Chen∗, Luole Qi, Feng Wang
Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
{lichen, llqi, fwang}@comp.hkbu.edu.hk

Abstract
The tasks of feature-level opinion mining usually include the extraction of product entities from consumer reviews, the identification of opinion words associated with the entities, and the determination of these opinions' polarities (e.g., positive, negative, or neutral). In recent years, two major approaches have been proposed to determine opinions at the feature level: model-based methods, such as the one based on the lexicalized hidden Markov model (L-HMMs), and statistical methods, such as the association rule mining based technique. However, little work has compared these algorithms regarding their practical abilities in identifying various types of review elements, such as features, opinions, intensifiers, entity phrases and infrequent entities. Moreover, little attention has been paid to applying more discriminative learning models to accomplish these opinion mining tasks. In this paper, we not only experimentally compared these methods on a real-world review dataset, but also in particular adopted the Conditional Random Fields (CRFs) model and evaluated its performance against the related algorithms. Moreover, for the CRFs-based mining algorithm, we tested the role of a self-tagging process in two automatic training conditions, and further identified the ideal combination of learning functions to optimize its learning performance. The comparative experiment eventually revealed that the CRFs-based method outperforms the other methods in mining multiple review elements.

∗Corresponding author. Tel: +852 34117090; Fax: +852 34117892; Postal address: 224 Waterloo Road, Kowloon Tong, Kowloon, Hong Kong

Preprint submitted to Expert Systems with Applications

February 20, 2012

Keywords: Consumer Reviews, e-Commerce, Feature-Level Opinion Mining, Conditional Random Fields (CRFs), Lexicalized Hidden Markov Model (L-HMMs), Association Rule Mining

1. Introduction
User-generated reviews have been increasingly regarded as useful in business, education, and e-commerce, since they contain valuable opinions originating from users' experiences. For instance, on e-commerce sites, customers can assess a product's quality by reading other consumers' reviews of it, which helps them decide whether or not to purchase the product. Nowadays, many e-commerce websites, such as Amazon.com, Yahoo Shopping and Epinions.com, allow users to post their reviews freely. On these large websites the number of reviews has in fact grown into the thousands, which poses a challenge for a customer trying to go through them all. To solve this problem, researchers have in recent years proposed various Web opinion mining approaches that aim to discover the essential information in reviews and present it to users. Most approaches can be classified into two major branches: document-level mining, which focuses on producing an overall opinion from one document; and feature-level mining, which aims to discover feature entities in sentences and then identify the opinion words associated with each entity (e.g., a consumer's opinion on a camera's image quality or weight). For the latter branch (the focus of our work), previous studies have mainly adopted rule-based techniques (Turney, 2002) or association rule mining based methods such as the one proposed in (Hu & Liu, 2004). More recently, some researchers have attempted to adopt more precise machine learning models in order to increase the opinion mining efficacy. One typical work is OpinionMiner (Jin et al., 2009), built on the lexicalized Hidden Markov Model (L-HMMs) to integrate multiple important linguistic properties into an automatic learning process. Compared to non-model-based algorithms, a model-based opinion mining method requires a training process to determine the model's parameters; once the parameters are learnt, decoding new data is fast. In fact, Jin et al. (2009) demonstrated that their method is more effective and accurate in mining entities and opinions than the rule-based method, which has no training phase. However, few works have compared the above methods in depth by means of experiments, regarding their precise accuracy in mining different review elements.

To be specific, feature-level opinion mining involves multiple tasks: 1) identifying features, which can be a component (e.g., the camera's battery), a function (e.g., the optical zoom), or a property (e.g., the color, size, weight); 2) discovering the opinions that reviewers expressed on each feature, and determining the opinion orientation (e.g., positive, negative, or neutral); 3) identifying feature/opinion phrases, if any; a phrase is commonly a set of two or more consecutive words (e.g., "image quality", "ease of use", "easy to navigate"); 4) revealing intensifiers that reviewers used to strengthen or weaken an opinion (e.g., very, highly, absolutely, extremely, greatly); 5) clarifying infrequent entities, which, though commented on by few reviewers, can be used to differentiate products and answer users' specific questions (Jin et al., 2009). Moreover, after extracting these elements, the system should decide whether a sentence is an opinion sentence (i.e., one containing both feature(s) and opinion(s)) and conclude the sentence's opinion orientation. As a matter of fact, related methods have mainly emphasized evaluating their algorithms on mining pure opinions, single words, and/or frequent features, while ignoring the other review elements, which are also important to assess. Therefore, in this paper, we conducted an experiment on a real review dataset to compare these feature-level opinion mining algorithms in terms of their actual performance on the above-mentioned tasks. More notably, considering the inherent limitations of the L-HMMs based learning method, we studied the particular impact of the linear-chain Conditional Random Fields (CRFs) model. As we will discuss later, L-HMMs cannot represent distributed hidden states and cannot model arbitrary, non-independent entities of the input word sequences; nor can they involve rich, overlapping model functions. Relative to L-HMMs, the CRFs model is a discriminative, undirected graphical model (Lafferty et al., 2001), which naturally handles arbitrary, non-independent entities without a conditional independence assumption, and can involve richer, overlapping functions. Given these properties, we believe it is more qualified and flexible for extracting review elements. It is also worth noting that although some efforts have lately been made to apply the CRFs model to opinion mining (Miao et al., 2010; Li et al., 2010; Zhao et al., 2008), they primarily used it to perform document-level sentiment classification or to extract single feature-opinion pairs. Little emphasis has been placed on exploring its specific effect on identifying other review elements like intensifiers, phrases and infrequent entities, and little attention has been paid to experimentally comparing its accuracy with other typical methods. Thus, our contributions to the area of feature-level opinion mining can be summarized in the following two major points.


1. To apply the CRFs model to mining consumer reviews, we not only specified a comprehensive set of word tags, but also integrated word expansion and self-tagging techniques so as to minimize the manual labeling effort. In addition, we involved different types of learning functions and identified their optimal combination through the experiment.
2. In the experiment, we also compared the CRFs-based method with four other representative feature-level opinion mining algorithms: L-HMMs based (Jin et al., 2009), association rule mining (ASM) based (Hu & Liu, 2004), ASM plus linguistic rules (Zhang et al., 2010; Ding & Liu, 2007), and rule-based. Notably, the comparison was conducted over multiple opinion mining units: basic product entities (including components, functions, and features), opinions, intensifiers, phrases, infrequent entities, and opinion sentences. The results reveal that the differences between the CRFs-based method and the related approaches, especially the L-HMMs based method, reach statistically significant levels in most comparisons.

The rest of this paper is organized as follows: we first discuss related work in Section 2, and then describe in detail the CRFs based opinion mining method and its algorithm steps in Section 3. In Section 4, we introduce the other four typical methods compared in the experiment and present the experiment setup. In Section 5, we discuss the experimental results. Finally, we conclude our work and indicate its future directions in Section 6.

2. Related Work
As mentioned before, opinion mining has been mainly conducted at the document level or at the feature level. In this section, we first give a brief review of related work on document-level opinion mining, and then discuss in detail two representative types of methods that have been proposed for feature-level opinion mining.

2.1. Document Level Opinion Mining
At the document level (where the goal is to obtain an overall opinion value for the whole document), Turney (2002) used point-wise mutual information (PMI) to calculate an average semantic orientation score of extracted phrases for determining a document's polarity. Pang et al. (2002) examined the effectiveness of applying machine learning techniques to the sentiment classification problem for movie review data. They aimed at addressing the rating-inference

problem, i.e., how to summarize a user's review as a virtual rating from 0 to 5, or as a binary value (thumbs up/down). Goldberg & Zhu (2006) presented a graph-based semi-supervised learning algorithm to resolve the rating-inference problem. It inferred numerical ratings from unlabeled documents based on the sentiment expressed in the text. Concretely, they solved an optimization problem to obtain a smooth rating function over the whole graph created from the review data. They showed that when limited labeled data is available, their method achieves significantly better predictive accuracy than other methods. In other work, Hatzivassiloglou & Wiebe (2000) studied the effect of dynamic adjectives, semantically oriented adjectives, and gradable adjectives on a simple subjectivity classifier, and proposed a trainable method that statistically combines two indicators of gradability. Wilson et al. (2005) proposed a system called OpinionFinder that automatically identifies the appearance of opinions, sentiments, speculations, and other private states in text via subjectivity analysis. Das & Chen (2007) studied sentiment classification for financial documents. However, although the above works are all related to opinion mining (also called sentiment analysis), they mainly aimed to discover the sentiment representing a reviewer's overall opinion, and did not identify which features the reviewer actually liked and disliked. For example, an overall negative sentiment on an object does not mean that the reviewer dislikes every aspect of the object. Thus, in our work, we put the major focus on feature-level opinion mining.

2.2. Feature Level Opinion Mining
To discover in depth a reviewer's opinions on every aspect s/he mentioned in the text, some researchers have tried to mine and extract opinions at the feature level. The task of opinion mining at the feature level requires identifying the individual elements that form an opinion; that is, individual words or phrases must be extracted as elements from reviews. In Cowie and Lehnert's work, this task was regarded as an information extraction (IE) problem, which inherently handles the extraction of needed information from unstructured text: relation extraction, terminology extraction and named entity recognition (Cowie & Lehnert, 1996). The purpose of relation extraction is to find the relationships between elements. Terminology extraction is to extract the relevant terms from a given corpus (Bourigault & Jacquemin, 1999; Wermter & Hahn, 2005), while named entity recognition aims to classify the extracted elements into specific entity categories such as location, person, etc. These three tasks are naturally related to feature-level opinion mining for product reviews. For example, the step of discovering the specific entities (e.g., features and functions from a digital

camera's reviews) can be related to named entity recognition. The mining of opinions associated with these entities can then be taken as a variation of the relation extraction problem. In recent years, the approaches for feature-level Web opinion mining have evolved into two principal branches: rule-based and statistical methods, and model-based methods.

2.2.1. Statistical Approaches
Statistical approaches do not rely on data labeling and model training, so most of them are domain-independent (or even language-independent). One popular work was proposed by Hu & Liu (2004). They extracted product features through association rule mining, and identified opinion words adjacent to frequently appearing features. In their evaluation, they reported that their approach achieved a precision of 0.56. Popescu & Etzioni (2005) developed a system called OPINE based on the work of (Hu & Liu, 2004). In their system, statistical analysis was applied to calculate the probability that a candidate term is relevant in a specific domain (e.g., digital cameras). They concretely took point-wise mutual information as a measure of relevance to produce the relevance score, and performed an analysis of term co-occurrence statistics on a large corpus. Their experiment showed that the result outperforms Hu and Liu's work with respect to precision (+0.22), with a small drop in recall (-0.03). However, as they used a Web search engine to do the proximity search, it is difficult to verify and deploy their system in a real e-commerce environment. Scaffidi et al. (2007) presented a search system called Red Opal that examined prior consumer reviews, identified product features, and then scored each product on every feature. Red Opal used these scores to determine which products to return when a user specifies a desired product feature. For the situation where a review corpus consists of a mixture of an arbitrary number of topics (e.g., containing both digital cameras and cars), Titov & McDonald (2008) proposed to adopt Latent Dirichlet Allocation (LDA). In the traditional LDA algorithm (Blei et al., 2003), each word is assumed to be attributable to only one topic; the number of topics is specified by the user, and the topics are determined by clustering the words around them. The improvement made in (Titov & McDonald, 2008) was to model two distinct types of topics, global and local: the method first identified different domains in the corpus, and then the individual aspects of the entities in a given domain. These entity aspects serve as candidates in the opinion mining task. However, the limitation of their work is that they could only perform an extrinsic evaluation of the algorithms, because the

data sets were not annotated with individual opinion targets. Since the above works are based on either rules or statistical mining mechanisms, their accuracy may be limited, as many other types of entities (such as components, functions, non-independent features, etc.) cannot be automatically identified. Thus, lately, some researchers have attempted to adopt supervised learning models in order to increase the opinion mining efficacy.

2.2.2. Model-based Approaches
Zhuang et al. (2006) presented a supervised approach for extracting feature-opinion pairs. Their method learns the opinion words and a combination of dependency and part-of-speech paths connecting such pairs from an annotated dataset. They evaluated the algorithm on a set of movie reviews and compared it to the results of (Hu & Liu, 2004): it yielded an F-measure of 0.529, higher than the 0.488 achieved by (Hu & Liu, 2004). Feiguina & Lapalme (2007) presented an approach using an information extraction system that is independent of both the search engine and the Web corpus. Their system learns a language model over part-of-speech patterns by training on a dataset labeled with entities and aspects; it then learns the part-of-speech pattern connecting an entity and an aspect. They evaluated the approach in a cross-domain setting; however, only the precision of the terminology extraction was reported, and the performance of the opinion mining results was unknown. Kessler & Nicolov (2009) emphasized finding feature-opinion pairs within a sentence. They manually annotated features and opinions in datasets of car and camera reviews, provided detailed guidelines for annotation, and trained a Support Vector Machine (SVM) classifier for extracting features and associated opinions. The SVM-based approach achieved an F-measure of 0.698. Lately, Jin et al. (2009) proposed OpinionMiner, which was built on the lexicalized Hidden Markov Model (L-HMMs) and integrates multiple important linguistic properties into an automatic learning process. In theory, the Hidden Markov Model is a common approach to sequence labeling and segmentation (Rabiner & Juang, 1986). In the generic model, a joint probability distribution p(X, Y) is defined, where X and Y are variables (respectively representing the word sequence of a sentence and the corresponding part-of-speech tags). Normally, defining the joint probability distribution requires iterating over all possible observable word combinations in a sentence. In reality, however, this enumeration is intractable due to the huge number of possible combinations. In order to

obtain a computable and trainable model, an assumption is often made that the elements are independent of each other. Because of this, the L-HMMs based opinion mining method is inherently limited: it cannot represent distributed hidden states and complex interactions among labels, nor can it involve rich, overlapping function sets. Relative to L-HMMs, the Conditional Random Fields (CRFs) model has a potential advantage, given its conditional nature for modeling state transitions (Lafferty et al., 2001). Specifically, it defines a probabilistic model p(Y|X) which is employed to label an unknown observation sequence X by selecting the label sequence Y that maximizes the conditional probability. The model naturally handles arbitrary, non-independent entities without a conditional independence assumption. This property has enabled it to be broadly applied to traditional information extraction tasks, such as part-of-speech tagging and parsing (Sha & Pereira, 2003) and named entity recognition (McCallum & Li, 2003). Recently, a few researchers have attempted to adopt it for processing consumer reviews. For example, Zhao et al. (2008) utilized it to perform sentiment classification at the sentence and document levels. Choi & Cardie (2009) employed CRFs to identify opinion holders in review data; in their error analysis, they reported that the inaccurate identification of opinions had a considerable negative impact on the results. Miao et al. (2010) used CRFs to perform feature extraction and obtained reasonable results: they extracted features with a precision of 0.86 on a movie dataset. Nearly 15% of the product features were merged into their sentiment lexicon, and on electronic product datasets the percentage went up to 25%. Jakob & Gurevych (2010) further exploited CRFs' ability to address the cross-domain applicability problem, i.e., whether a model trained on domain A can perform well on another product domain B; they specifically evaluated the model's accuracy in extracting opinion targets in such a setting. However, to the best of our knowledge, little work has investigated CRFs' impact on improving the accuracy of feature-level opinion mining from the aspect of extracting various review elements. Though Li et al. (2010) showed that it is feasible to perform feature-level review extraction based on CRFs, they only applied it to identifying single object features and opinions, without covering other types of elements such as intensifiers, phrases and infrequent entities. Thus, in our work, we have not only extended their work to address these issues, but also integrated several improvements such as the optimization of learning functions, word expansion and self-tagging. In the experiment, we compared the CRFs-based learning algorithm in depth with the above-mentioned typical statistical and model-based methods.

3. CRFs based Approach for Feature-Level Opinion Mining
In this section, we first introduce the CRFs model and our problem statement. We then describe in detail how we apply CRFs to conduct the various opinion mining tasks.

3.1. Problem Statement
CRFs are conditional probability distributions that factorize according to an undirected graphical model (Lafferty et al., 2001). Formally, consider a graph G = (V, E) and let Y = (Y_v)_{v \in V}; then (X, Y) is a CRF, where X is the set of variables over the observation sequence to be labeled (for example, a sequence of natural-language words forming a sentence), and Y is the set of random variables over the corresponding label sequence, which obeys the Markov property with respect to the graph. In our case, Y denotes the tags assigned to the words. p(y|x) is globally conditioned on the observation X:

p(y|x) = \frac{1}{Z(x)} \prod_{i \in N} \phi_i(y_i, x_i)    (1)

where Z(x) = \sum_{y} \prod_{i \in N} \phi_i(y_i, x_i) is a normalization factor over all state sequences for the observation X. The potentials are normally factorized according to a set of functions f_k:

\phi_i(y_i, x_i) = \exp\Big( \sum_k \lambda_k f_k(y_i, x_i) \Big)    (2)

Given the model defined in Equation (1), the most probable labeling sequence for an input X is hence

\hat{Y} = \arg\max_{y} p(y|x)    (3)

Problem statement. When we build the model, we take all the nodes V of the graph as states, including observed states and hidden states. We use W and S to respectively represent the observed words and their part-of-speech (POS) tags, and T to represent the hidden states. The edges E of the graph are the relationships among these states, formally defined by the learning functions (see Section 3.5). Our objective is to extract different types of entities from product reviews, and to identify the opinions and intensifiers associated with those entities. A review document is thus denoted as W; if we can give each word a pre-defined tag from T, this objective can be achieved.


[Figure 1: CRFs-based opinion mining process. The flowchart comprises: pre-processing (crawling reviews from websites; cleaning, e.g., removing meaningless characters such as @-+); tagging data (manually labeling data; self-tagging); building the model (defining learning functions; training the CRFs model by maximizing the conditional likelihood); and testing with new data (labeling/decoding), which yields the mining result.]

Figure 1: CRFs-based opinion mining process.

Therefore, the task of opinion mining can be regarded as an automatic labeling process, and the challenge is how to define appropriate tags and optimize the learning functions so as to maximize the conditional likelihood. Here we first give an overview of our opinion mining process (see Figure 1). It is divided into four major steps: (1) pre-processing, which includes crawling raw review data and cleaning it; (2) preparing the training set to learn the model; (3) defining learning functions for the CRFs model and training it to maximize the conditional likelihood; (4) applying the model to label new review data with respect to specific elements such as product entities, opinions, intensifiers, and phrases. The model also identifies opinion sentences (i.e., sentences that include feature-opinion pairs) and each sentence's opinion orientation.

3.2. Data Pre-Processing
During the pre-processing stage, we first crawled raw product reviews from popular e-commerce sites (e.g., Yahoo Shopping and Amazon.com).

Table 1: Review entities and their descriptions

Entity      | Description and examples
Component   | Physical objects of a product, such as the camera itself, the camera's battery, LCD
Function    | Capabilities provided by a product, e.g., optical zoom, auto focus, movie playback
Feature     | Properties of components or functions, e.g., color, speed, size, weight
Opinion     | Ideas and thoughts expressed by reviewers on the product's components, functions, or features, e.g., satisfying, good, easy to use
Intensifier | Words expressed by reviewers to indicate the strength of opinions, e.g., extremely, highly, slightly

The raw reviews usually contain meaningless characters such as '&', '+' and HTML tags, so we first removed these characters. We also processed word variants and misspelled words, and converted nouns to their stems.

3.3. Defining Tags and Preparing the Training Set
At the second stage, our focus is on defining the product entities that the algorithm targets for extraction, and then labeling the training data with the pre-defined tags. According to (Jin et al., 2009), the entities that appear in a product review can be classified into four categories: Components, Functions, Features and Opinions. Based on this classification, we add a new category of entities called Intensifiers (e.g., "very", "extremely"), which reviewers often use to strengthen or weaken the polarities of the corresponding opinions. Table 1 lists these five categories with their descriptions and examples.
To tag these entities for the model's training set, we designed three types of tags: entity tags, opinion tags and intensifier tags. For basic product entities, we simply use the category names as tags (e.g., "Component" for a component entity, "Function" for a function entity, "Feature" for a feature entity). For an opinion entity, besides the "Opinion" tag, we use the characters 'P' and 'N' to respectively represent positive and negative opinion polarity, and 'Exp' and 'Imp' to respectively indicate explicit and implicit opinions. Here, an explicit opinion is one the user expresses explicitly in the review, while an implicit opinion needs to be inferred from the review.


For an intensifier entity, besides the "Intensifier" tag, we use the characters 'S' and 'W' to denote whether it strengthens the corresponding opinion (e.g., "extremely", "highly") or weakens it (e.g., "slightly", "somewhat", "sort of"). For a phrase entity, we assign a position to each word of the phrase, similar to the method used by (Jin et al., 2009). Any word of a phrase has three possible positions: the beginning, the middle and the end of the phrase; we use the characters 'B', 'M' and 'E' as position tags to indicate these. For a word that does not belong to any of the above categories, we use "BG" (background). Thus, we have 15 distinct tags in total. With these tags, we can label any word and its role in a sentence. For example, the following sentence from a real camera review is labeled as:

The(BG) battery(Feature-B) life(Feature-E) is(BG) really(Intensifier-S) great(Opinion-P-Exp) compared(BG) to(BG) most(BG) cameras(Component) .(BG)

For a new, untrained review sentence, if we obtain its words' tag sequence, we know which product entities it contains, and can discover the entities' related opinions and intensifiers, if any. We can also identify whether the entities are expressed as phrases. In this way, the task of opinion mining is transformed into an automatic labeling task. The problem is then formalized as follows: given a sequence of words W = w_1 w_2 w_3 \ldots w_N and its corresponding part-of-speech tags S = s_1 s_2 s_3 \ldots s_N, the objective is to find the sequence of tags that maximizes the conditional likelihood of Equation (3). The resulting equation is:

\hat{T} = \arg\max_T p(T|W,S) = \arg\max_T \prod_{i=1}^{N} p(t_i \mid W, S, T^{(-i)})    (4)

In Equation (4), T^{(-i)} = \{t_1 t_2 \ldots t_{i-1} t_{i+1} \ldots t_N\} (i.e., all tags except the current word's tag t_i). From this equation, we can see that the tag of the word at position i depends on all the words W = w_{1:N} in the sentence, the part-of-speech tags S = s_{1:N}, and the tags of all the other words. Unfortunately, this equation is very hard to compute as it involves too many variables. To reduce the complexity, we employ linear-chain CRFs as an approximation to restrict the relationships among tags. In linear-chain CRFs, all the hidden states T in the graph form a linear chain and only consecutive states are connected. If we further assume that the current state depends only on the previous one, Equation (4) can be rewritten as:


\hat{T} = \arg\max_T p(T|W,S) = \arg\max_T \prod_{i=1}^{N} p(t_i \mid W, S, t_{i-1})    (5)
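To make the formalization concrete, the following sketch (ours, not the paper's code; the POS column is our own illustrative annotation, not given in the paper) represents the labeled example sentence from Section 3.3 as the parallel sequences W, S and T used in Equations (4) and (5).

```python
# A review sentence as parallel sequences: W (words), S (POS tags),
# T (tags drawn from the 15-tag set of Section 3.3).
sentence = [
    # (w_i,      s_i,   t_i)
    ("The",      "DT",  "BG"),
    ("battery",  "NN",  "Feature-B"),
    ("life",     "NN",  "Feature-E"),
    ("is",       "VBZ", "BG"),
    ("really",   "RB",  "Intensifier-S"),
    ("great",    "JJ",  "Opinion-P-Exp"),
    ("compared", "VBN", "BG"),
    ("to",       "TO",  "BG"),
    ("most",     "JJS", "BG"),
    ("cameras",  "NNS", "Component"),
    (".",        ".",   "BG"),
]
W, S, T = (list(seq) for seq in zip(*sentence))
# Under the linear-chain assumption of Equation (5), decoding chooses,
# at each position i, the tag t_i maximizing p(t_i | W, S, t_{i-1}).
```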

3.4. Word Expansion and Self-Tagging Process
In order to minimize the human effort in labeling and preparing the training data, we conducted two processes to automatically enlarge the training set. First, for a single entity word or opinion word that appears in a sentence, we substitute its synonyms while keeping the tag unchanged. The synonyms come from the Microsoft Word Thesaurus. For example, for the sentence "it shoots good image", we first obtain the synonym set of "good", which is {"great", "nice", "cracking"}, and the synonym set of "image", which is {"picture", "photo"}. We then list the possible combinations of these words, such as "nice photo" and "great picture", and add them to the training set.
In addition, we propose a self-tagging (ST) technique: we manually label only part of the training data, and the rest is automatically tagged by the system. This process works as follows (a sketch of the loop is given after the list):
1. Evenly divide the manually labeled data into three groups, numbered 1, 2, 3;
2. Run the training process on the three groups separately, obtaining three different models; then use the three models to self-tag sentences from the unlabeled training data, generating three tagging results per sentence;
3. Compare the three tagging results and accept the tags according to this criterion: at least two of the three tagging results agree on the same sentence (that is, they produce the same tags for the concerned words of the sentence);
4. Randomly assign the newly tagged sentences to the three groups (as denoted in step 1) and go back to step 2. Repeat the process until all the remaining data are self-tagged.
Two properties of this self-tagging process are worth noting: 1) we divided the manually labeled data into three groups and required agreement between at least two of the three; 2) to compute the agreement, we only considered the concerned entity tags (e.g., component, function, feature, opinion, intensifier), rather than requiring all tags (such as background tags) to match. With these refinements (relative to (Jin et al., 2009)), we believe the training-data augmentation can be performed more effectively.
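The four steps above can be summarized in the following sketch. It is an illustration under our own assumptions, not the paper's code: the callables train, tag and agree are hypothetical placeholders, and the no-progress guard at the end is our addition.

```python
import random

def self_tagging(labeled_groups, unlabeled, train, tag, agree):
    """Sketch of the self-tagging loop of Section 3.4.

    labeled_groups: three lists of (sentence, tags) pairs (step 1).
    unlabeled:      sentences still to be tagged.
    train(group) -> model; tag(model, sentence) -> tag sequence;
    agree(t1, t2) -> True if two taggings match on the concerned
    entity tags (background tags are ignored, as in the paper).
    """
    while unlabeled:
        models = [train(g) for g in labeled_groups]           # step 2
        still_untagged = []
        for sent in unlabeled:
            tags = [tag(m, sent) for m in models]
            # Step 3: accept if at least two of three results agree.
            accepted = next(
                (a for a, b in [(tags[0], tags[1]), (tags[0], tags[2]),
                                (tags[1], tags[2])] if agree(a, b)),
                None)
            if accepted is not None:                          # step 4
                random.choice(labeled_groups).append((sent, accepted))
            else:
                still_untagged.append(sent)
        if len(still_untagged) == len(unlabeled):
            break  # no sentence gained agreement; avoid an endless loop
        unlabeled = still_untagged
```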

3.5. Defining Learning Functions for the CRFs Model
The next step is to define learning functions for the CRFs model. As indicated before, the functions define the types of relations between the observation states W = w_{1:N}, S = s_{1:N} (the sentence's word sequence and the corresponding part-of-speech tags) and the hidden states T = t_{1:N} (the tag sequence). These functions are therefore crucial in determining the model's final accuracy. In our case, with linear-chain CRFs (see Equation (5)), the general form of a function is f_i(t_{j-1}, t_j, w_{1:N}, s_{1:N}, j), which looks at a pair of adjacent states t_{j-1}, t_j, the input sequences w_{1:N} and s_{1:N}, and the current word's position j. For example, we can define a binary function that returns 1 if the current word w_j is "image", the corresponding part-of-speech s_j is "NN" (single noun), the previous state t_{j-1} is "Opinion", and the current state t_j is "Feature", and 0 otherwise:

f_i(t_{j-1}, t_j, w_{1:N}, s_{1:N}, j) = \begin{cases} 1 & \text{if } w_j = \text{"image"},\ s_j = \text{NN},\ t_j = \text{Feature, and } t_{j-1} = \text{Opinion} \\ 0 & \text{otherwise} \end{cases}    (6)

Combining this function with Equation (1) and Equation (2), we have:

p(t_{1:N} \mid w_{1:N}, s_{1:N}) = \frac{1}{Z} \exp\Big( \sum_{j=1}^{N} \sum_{i=1}^{F} \lambda_i f_i(t_{j-1}, t_j, w_{1:N}, s_{1:N}, j) \Big)    (7)

According to Equation (7), each function f_i is associated with a weight λ_i. If λ_i > 0 and f_i is active (i.e., f_i = 1), it increases the probability of assigning the tag to that word; conversely, if λ_i < 0 and f_i is active, it decreases that probability (an inactive function, f_i = 0, has no effect). Another function example is:

f_i(t_{j-1}, t_j, w_{1:N}, s_{1:N}, j) = \begin{cases} 1 & \text{if } w_{j-1} = \text{"good"},\ s_{j-1} = \text{JJ, and } t_j = \text{Feature} \\ 0 & \text{otherwise} \end{cases}    (8)

With the above two functions (6) and (8), if the current word is "image", as in "good image", both will be active, so the probability of assigning the tag "Feature" to the word "image" is increased. This is an example of the overlapping functions that L-HMMs cannot address.
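To illustrate, functions (6) and (8) can be written directly as code. This is a minimal sketch of ours (the weight values are made up; in reality the λs are learnt as described in Section 3.6): both functions fire on "good image", so their weights add up in the exponent of Equation (7), which is exactly the overlapping behaviour described above.

```python
# Binary learning functions in the spirit of Equations (6) and (8).
# Each inspects a pair of adjacent tags, the word/POS sequences,
# and the current position j, and returns 0 or 1.

def f_image_after_opinion(t_prev, t, w, s, j):
    """Equation (6): current word "image" tagged Feature after an Opinion."""
    return int(w[j] == "image" and s[j] == "NN"
               and t == "Feature" and t_prev == "Opinion")

def f_after_good(t_prev, t, w, s, j):
    """Equation (8): previous word "good" (JJ), current tag Feature."""
    return int(j > 0 and w[j - 1] == "good" and s[j - 1] == "JJ"
               and t == "Feature")

functions = [f_image_after_opinion, f_after_good]
weights = [0.8, 1.3]  # the lambdas of Equation (7); illustrative values

def score(t_prev, t, w, s, j):
    """Unnormalized log-potential for one position, per Equation (7)."""
    return sum(lam * f(t_prev, t, w, s, j)
               for lam, f in zip(weights, functions))

# On "good image", both functions fire for t = "Feature", so the two
# weights add up -- the overlapping behaviour L-HMMs cannot model.
w, s = ["good", "image"], ["JJ", "NN"]
assert score("Opinion", "Feature", w, s, 1) == 0.8 + 1.3
```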

Table 2: CRFs learning function types and expressions

Type                        | Function Expression
1. First-Order              | f(t_i, w_i), f(t_i, s_i), f(t_i, w_i, s_i)
2. First-Order + Transition | f(t_i, w_i) f(t_i, t_{i-1}), f(t_i, s_i) f(t_i, t_{i-1}), f(t_i, w_i, s_i) f(t_i, t_{i-1})
3. First-Order + Transition | f(t_i, w_i) f(t_i, t_{i-2}), f(t_i, s_i) f(t_i, t_{i-2}), f(t_i, w_i, s_i) f(t_i, t_{i-2})
4. First-Order + Transition | f(t_i, w_{i-1}) f(t_i, t_{i-1}), f(t_i, s_{i-1}) f(t_i, t_{i-1}), f(t_i, w_{i-1}, s_{i-1}) f(t_i, t_{i-1})
5. First-Order + Transition | f(t_i, w_{i-1}) f(t_i, t_{i-2}), f(t_i, s_{i-1}) f(t_i, t_{i-2}), f(t_i, w_{i-1}, s_{i-1}) f(t_i, t_{i-2})
6. Second-Order             | f(t_i, t_{i-1}, w_i), f(t_i, t_{i-1}, s_i), f(t_i, t_{i-1}, w_i, s_i)
7. Second-Order             | f(t_i, t_{i-2}, w_i), f(t_i, t_{i-2}, s_i), f(t_i, t_{i-2}, w_i, s_i)
8. Third-Order              | f(t_i, t_{i-1}, t_{i-2}, w_i), f(t_i, t_{i-1}, t_{i-2}, s_i), f(t_i, t_{i-1}, t_{i-2}, w_i, s_i)

More specifically, we defined several types of functions to specify the state-transition structures between W/S and T, according to their Markov order. For instance, the first-order functions are defined as: 1) the assignment of the current tag t_j depends only on the current word, represented as f(t_j, w_j); 2) it depends only on the current part-of-speech tag, represented as f(t_j, s_j); 3) it depends on both the current word and the current part-of-speech tag, represented as f(t_j, w_j, s_j). These three functions are first-order: the inputs are examined in the context of the current state only. In addition, we defined first-order plus transition, second-order and third-order functions (see Table 2), in which the inputs are examined in the context of the current state and the previous one or two states. To evaluate the impact of these functions on opinion mining accuracy, we tested different combinations in the experiment and identified the optimal function set (see the results in Section 5.1).

3.6. Training the CRFs Model
After the graph and functions are defined, we can start training the CRFs model. The purpose of training is essentially to determine the parameter values λ_{1:F} (the weights of the learning functions defined in Equation (7)). Formally, the fully labeled review data is denoted as \{(w^{(1)}, s^{(1)}, t^{(1)}), \ldots, (w^{(M)}, s^{(M)}, t^{(M)})\}, where w^{(j)} = w^{(j)}_{1:N_j} (the j-th sentence), s^{(j)} = s^{(j)}_{1:N_j} (the j-th part-of-speech sequence), and t^{(j)} = t^{(j)}_{1:N_j} (the j-th tag sequence). Given that the CRFs define the conditional probability p(t|w, s), the aim of parameter learning is to maximize the conditional log-likelihood of Equation (9):


\sum_{j=1}^{M} \log p(t^{(j)} \mid w^{(j)}, s^{(j)})    (9)

To avoid over-fitting, the likelihood can be penalized by a prior distribution on the parameters. A commonly used choice is a zero-mean Gaussian: if λ ∼ N(0, σ²), Equation (9) becomes

\sum_{j=1}^{M} \log p(t^{(j)} \mid w^{(j)}, s^{(j)}) - \sum_{i=1}^{F} \frac{\lambda_i^2}{2\sigma^2}    (10)

This equation is concave, so λ has a unique set of globally optimal values. We learnt the parameters by computing the gradient of the objective function and applying it in the Limited-memory BFGS (L-BFGS) optimization algorithm (Liu & Nocedal, 1989). Specifically, the gradient of the objective function is computed as:

\frac{\partial}{\partial\lambda_k}\Big[ \sum_{j=1}^{M} \log p(t^{(j)} \mid w^{(j)}, s^{(j)}) - \sum_{i=1}^{F} \frac{\lambda_i^2}{2\sigma^2} \Big]
= \frac{\partial}{\partial\lambda_k}\Big[ \sum_{j=1}^{M} \Big( \sum_{n} \sum_{i} \lambda_i f_i(t_{n-1}, t_n, w_{1:N}, s_{1:N}, n) - \log Z^{(j)} \Big) - \sum_{i=1}^{F} \frac{\lambda_i^2}{2\sigma^2} \Big]
= \sum_{j=1}^{M} \sum_{n} f_k(t_{n-1}, t_n, w_{1:N}, s_{1:N}, n) - \sum_{j=1}^{M} \sum_{n} E_{t'_{n-1}, t'_n}\big[ f_k(t'_{n-1}, t'_n, w_{1:N}, s_{1:N}, n) \big] - \frac{\lambda_k}{\sigma^2}    (11)

In Equation (11), the first term is the number of times a function f_k is actually activated (i.e., f_k = 1) in the training data. The second term is the expected number of activations of this function under the currently trained model. The third term comes from the prior distribution. Hence, this derivative measures the distance between the actual frequency and the predicted frequency: if in the training data a function f_k is activated A times, while under the current model the predicted number of activations is B, then when A = B the corresponding derivative (apart from the prior term) is zero. The training process therefore finds the λs that drive this derivative to zero.

3.7. Decoding and Determining Orientations for Opinions and Intensifiers
After the parameters are learnt, the model is fixed. We then apply it to process new review data. As discussed before, our objective is to apply the model

to automatically label the words with the most suitable tags. This requires maximizing the conditional likelihood (as defined in Equation (4)) at each step. More specifically, suppose the current word's position is j and there are M candidate tags for this word; we compute the Viterbi variables α_j(m) = p(W, S, t_j = m) (m ∈ M). The Viterbi recursion is:

\alpha_j(m) = \max_{m' \in M} \varphi_j(W, S, m', m)\, \alpha_{j-1}(m')    (12)

where φ_j(W, S, m′, m) is the transition function from state m′ to state m given the observations W, S. In our work, the transition function is formally defined as:

\varphi_j(W, S, m', m) = \exp\Big( \sum_k \lambda_k f_k(W, S, t_{j-1} = m', t_j = m, j) \Big)    (13)
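A compact sketch of this decoding step follows (our illustration, not the authors' implementation). The callable log_phi is a hypothetical placeholder returning the exponent of Equation (13); a real implementation would add a dedicated START state for position 0.

```python
import math

def viterbi(words, pos, tags, log_phi):
    """Sketch of the Viterbi decoding of Equations (12)-(13).

    log_phi(w, s, m_prev, m, j) should return
    sum_k lambda_k * f_k(w, s, t_{j-1}=m_prev, t_j=m, j),
    i.e. the logarithm of the transition function phi_j.
    """
    n = len(words)
    # alpha[j][m]: best log-score of a tag path ending in tag m at position j
    alpha = [{m: (0.0 if j == 0 else -math.inf) for m in tags}
             for j in range(n)]
    back = [dict() for _ in range(n)]
    for j in range(1, n):
        for m in tags:
            best_prev = max(
                tags,
                key=lambda mp: alpha[j - 1][mp] + log_phi(words, pos, mp, m, j))
            alpha[j][m] = (alpha[j - 1][best_prev]
                           + log_phi(words, pos, best_prev, m, j))
            back[j][m] = best_prev
    # Recover the best tag sequence by following the back-pointers.
    last = max(tags, key=lambda m: alpha[n - 1][m])
    path = [last]
    for j in range(n - 1, 0, -1):
        path.append(back[j][path[-1]])
    return list(reversed(path))
```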

After the Viterbi recursion, we can identify the proper tag assignment for a review sentence. Here are some examples resulting from the above decoding process, in which the orientations of opinions and intensifiers are given by their associated tags: 'P' indicates that the opinion (e.g., "easy to use", "love") is positive, and 'S' denotes that the intensifier word (e.g., "absolutely") strengthens the tone.

I(BG) love(Opinion-P-Exp) the(BG) image(Feature-B) quality(Feature-E) of(BG) this(BG) camera(Component) .(BG)
I(BG) am(BG) absolutely(Intensifier-S) in(Opinion-B-P-Exp) love(Opinion-M-P-Exp) of(Opinion-E-P-Exp) this(BG) camera(Component) .(BG)
Buttons(Component) on(BG) the(BG) menu(Function) are(BG) easy(Opinion-B-P-Exp) to(Opinion-M-P-Exp) use(Opinion-E-P-Exp) .(BG)

These examples show that once the labeling tags and learning functions are defined, it is feasible to extract the various types of product entities and the entity-related opinions and intensifiers. In addition, we checked for the appearance of negations (e.g., "not", "don't", "no", "didn't") in the review sentences: if they occur within a five-word distance in front of an opinion/intensifier and their total number is odd, the orientation of that opinion/intensifier is reversed.

3.8. Determining Opinion Sentences and Sentence Orientation
If a sentence contains one or more entities (e.g., component, function, or feature) and one or more opinion words associated with those entities, it is an opinion sentence. To determine the sentence's opinion orientation, we consider

whether it contains a single ⟨entity, opinion⟩ pair or more than one pair. Specifically, if an opinion sentence contains only one pair, the sentence has the same orientation as the corresponding opinion word/phrase. If more than one pair is found, or if one entity is associated with multiple opinions, the sentence's overall orientation is decided by the score \sum_{i=1}^{N} (F_{orie})_i, where (F_{orie})_i = \sum_{j=1}^{M_i} O_{ij} (assuming the sentence has N entities and entity i has M_i opinions). O_{ij} is the j-th opinion of entity i (+1 if positive and -1 if negative). Thus, if this overall score is greater than 0, the sentence's orientation is positive; if it is below 0, its orientation is negative.
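The negation rule of Section 3.7 and the orientation score above can be combined in a short sketch (ours, under stated assumptions: the negation list extends the paper's four examples, and since the tie case of a zero score is not specified in the paper, we return "neutral" for it).

```python
NEGATIONS = {"not", "no", "don't", "didn't", "never", "isn't", "doesn't"}

def flip_if_negated(words, idx, polarity):
    """Section 3.7: reverse an opinion/intensifier orientation when an
    odd number of negation words occurs within the five words before it."""
    window = words[max(0, idx - 5):idx]
    n_neg = sum(1 for w in window if w.lower() in NEGATIONS)
    return -polarity if n_neg % 2 == 1 else polarity

def sentence_orientation(entity_opinions):
    """Section 3.8: entity_opinions maps each of the N entities to the
    list of its (already negation-adjusted) opinion polarities O_ij,
    each +1 (positive) or -1 (negative).  Returns the sentence-level
    orientation as the sign of sum_i sum_j O_ij."""
    score = sum(sum(ops) for ops in entity_opinions.values())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # tie case; not specified in the paper

# Example: "battery" has a positive opinion flipped by a preceding "not";
# "image quality" keeps its positive opinion; the overall score is 0.
words = "the battery is not good but image quality is great".split()
battery = flip_if_negated(words, words.index("good"), +1)   # -> -1
print(sentence_orientation({"battery": [battery], "image quality": [+1]}))
```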

4. Experiment Setup
We conducted two experiments. In the first, we aimed at identifying the optimal combination of learning functions for the CRFs-based opinion mining algorithm (henceforth CRFs). We also evaluated the role of the self-tagging (ST) process in two automatic training conditions: 1) with ST versus without it on the same training dataset; 2) integrating new data (as trained by ST) into the training dataset versus not integrating it. In the second experiment, we systematically compared the CRFs-based opinion mining method to four typical related methods: rule-based (as baseline), two variations of association rule mining based methods (Hu & Liu, 2004; Ding et al., 2008; Zhang et al., 2010), and the lexicalized Hidden Markov Model (L-HMMs) based method (Jin et al., 2009).

4.1. Compared Algorithms in the Experiment
As mentioned in the related work, most recent approaches to feature-level opinion mining fall into two types: statistical methods, with the association rule mining based method as the representative, and model-based methods, with the L-HMMs based method as the typical one. Thus, in the experiment, we primarily compared the CRFs-based opinion mining algorithm with these approaches.

4.1.1. Rule-based Method as Baseline
The first step of the rule-based method was Part-of-Speech (POS) tagging: each word receives a POS tag such as NN (noun) or JJ (adjective). We then applied several classic rules to extract product entities, according to (Hu & Liu, 2004; Jin et al., 2009). One rule is that a single noun following an adjective or consecutive adjectives is regarded as a product entity, i.e., the NN in JJ + NN or JJ + JJ + NN. Any single noun connecting an adjective to a verb is also taken as a product entity, i.e., the NN in NN + VBZ + JJ. Any consecutive nouns appearing at the above positions are taken as an entity phrase. As for opinion words, the adjectives appearing in the above rules become opinion entities, and their sentiment orientation was determined by looking up a lexicon that contains polarities for over 8000 adjective words (http://www.cs.cornell.edu/People/pabo/movie-review-data/). Moreover, we added a rule to extract intensifiers: an adverb within a three-word distance of the opinion entity is regarded as an intensifier.

4.1.2. Association Rule Mining based Approach (ASM)
As described in (Hu & Liu, 2004), the core idea of this approach is a statistical method based on association rule mining. Association rule mining was originally developed for the market basket analysis problem; for example, it was used to discover frequently co-occurring items often bought together by customers in a supermarket (Agrawal & Srikant, 1994). Hu & Liu (2004) applied it to identify candidate features frequently mentioned in product reviews. In this case, the words of a sentence represent the items, and the algorithm mines the occurrence patterns among them. They assumed that product features are nouns or noun phrases, and that the opinions describing these features are adjectives. They divided product features into two groups, frequent and infrequent features, depending on their number of appearances; association rule mining was mainly used to identify the frequent features. Specifically, the first step was again POS parsing, assigning each word a POS tag such as NN or JJ. The nouns of each sentence were formed into a transaction, and all transactions were input to the association rule mining algorithm (e.g., Apriori) to obtain the frequent item sets, with the minimum support set to 1%. The returned frequent item sets were then used to identify product features. More concretely, the procedure started from the sets of larger size: if a set contains consecutive nouns and these words are not a subset of any previously identified feature phrase, they are mapped to a feature phrase. The same procedure was applied to single nouns: if a word is not redundant to any previously identified feature, it is considered a feature. This procedure ends when all frequent

item sets were processed. Based on the detected features, opinions were then found. According to (Hu & Liu, 2004), adjacent adjectives (within a 3-word distance of the feature or feature phrase) were regarded as opinions; to determine the polarity of each opinion, a WordNet lexicon was used. In addition, opinion words were used to find infrequent features (those not extracted by the association rule mining process): the noun or noun phrase nearest to an opinion word, again within a 3-word distance, was taken as a feature or feature phrase. Though intensifiers were not extracted in that work, to facilitate comparison in our experiment we added a function following the approach's own mechanism: considering that opinions are usually modified by adverbs, adverbs within a 3-word distance of an opinion were taken as intensifiers.

4.1.3. Association Rule Mining plus Linguistic Rules (ASM+Rules)
Given that ASM is unable to find special linguistic patterns, a set of empirically proven rules was further incorporated (Ding et al., 2008; Zhang et al., 2010). For instance, one feature can be part of another feature, e.g., "the engine of the car", which indicates a part-whole relation ("engine" is part of "car"). The "no" pattern is another extraction pattern, e.g., "no noise", where "no" is followed by a feature. According to (Zhang et al., 2010), five patterns were added to extract features: 1) NP + Prep + CP, where NP (noun phrase) is the part word, CP is the class concept/phrase, and Prep is a preposition like "of", "in" and "on"; 2) CP + with + NP; 3) NP CP or CP NP (where NP and CP form a compound word); 4) CP + Verb + NP (where the verb is like "have/has", "include" and "contain"); 5) "no" + NP. To determine an opinion's orientation, three contextual conditions were additionally considered (Ding et al., 2008):
1. Intra-sentence conjunction rule: a sentence expresses only one opinion orientation unless a "but" changes the direction. For example, in the sentence "this camera takes great pictures and has a long battery life", "long" can be decided to be positive because it is conjoined with the positive opinion word "great".
2. Pseudo intra-sentence conjunction rule: it uses other reviews to infer the current opinion word's orientation.
3. Inter-sentence conjunction rule: it uses the context of the previous or next sentence (or clause) to decide the current sentence's opinion orientation.

4. Besides, negation rules were used to handle situations where there are negation words, e.g., "no", "not", "never". If a sentence has such words, the sentence's opinion orientation was reversed.

4.1.4. L-HMMs based Opinion Mining Method
Similar to the CRFs-based method, the L-HMMs based opinion mining method also relies on a model training and optimization process. As noted in Section 2, Jin et al. (2009) integrated linguistic properties, including part-of-speech results and lexical patterns, with the Hidden Markov Model (HMM). Their aim was to maximize the conditional probability defined in the following equation:

\hat{T} = \arg\max_T \frac{p(W, S \mid T)\, p(T)}{P(W, S)}

Because the value of P(W, S) remains constant for all candidate tag sequences, they produced a general statistical model:

\hat{T} = \arg\max_T p(W, S \mid T)\, p(T) = \arg\max_T p(S \mid T)\, p(W \mid T, S)\, p(T)
        = \arg\max_T \prod_{i=1}^{N} \big[ p(s_i \mid w_1 \ldots w_{i-1}, s_1 \ldots s_{i-1}, t_1 \ldots t_{i-1} t_i) \times p(w_i \mid w_1 \ldots w_{i-1}, s_1 \ldots s_{i-1} s_i, t_1 \ldots t_{i-1} t_i) \times p(t_i \mid w_1 \ldots w_{i-1}, s_1 \ldots s_{i-1}, t_1 \ldots t_{i-1}) \big]

In order to simplify the computation, they made several assumptions. For example, the assignment of the current tag depends only on its previous tag and the previous word; the appearance of the current word depends on the current tag, the current POS, and the previous word. Their final objective was to maximize:

\arg\max_T \prod_{i=1}^{N} \big[ p(s_i \mid w_{i-1}, t_i) \times p(w_i \mid w_{i-1}, s_i, t_i) \times p(t_i \mid w_{i-1}, t_{i-1}) \big]

4.2. Evaluation Metrics and Dataset To test the algorithm’s accuracy, we used three standard metrics: recall, precision and F-measure. Recall is |C∩P| and Precision is |C∩P| , where C and P are the |C| |P| sets of true and predicted tags respectively. F-measure is the harmonic mean of 2RP precision and recall, i.e., R+P (where R is the recall value and P is the precision value). The reason that we did not use label accuracy to test each type of tag is because for some tags (such as “BG” for background words), it is not so meaningful to evaluate them. Our dataset was from two sources: one part was crawled from Amazon, and another was the publicly available corpus provided in (Hu & Liu, 2004). In total, it contains 821 reviews (1775 sentences) from 9 digital cameras. We also collected 187 reviews (426 sentences) from four extra digital cameras in order to assess the self-tagging process’s performance (see Section 5.1). After pre-processing and cleaning the raw data, two researchers were involved in preparing the training set. They first independently labeled the data, and then met together to aggregate their results. We applied the LBJPOS tool 2 to produce the part-of-speech tag for each word. We performed the 10-fold cross-validation in the experiments. That is, the whole dataset was evenly divided them into 10 sets: one set was randomly selected as the validation data for the testing during each round, and the other nine sets acted as training data. The cross-validation process was repeated for ten times. Afterwards, the results were averaged to calculate the precision, recall and Fmeasure scores. 5. Experimental Results 5.1. Testing Learning Functions and Self-Tagging Process in CRFs-based Method As defined in Table 2, we have defined eight types of CRFs learning functions: one first-order, four first-order plus transition, two second-order, and one third-order. We assigned each type with a number (from 1 to 8; see Table 2). We varied their combinations and tested the performance when building the model. Specifically, in the first round, we only used the first-order functions. Then, in the second round, different combinations of first-order plus transition functions were respectively added with the first-order functions (i.e., 1+2, 1+3, 1+4, 1+5, 1+2+3, 1+2+4, 1+2+5, 1+3+4, ..., 1+2+3+4+5). The comparison among these 2

http://cogcomp.cs.illinois.edu/page/software_view/3

22

Table 3: Testing on learning functions (P: Precision, R: Recall, F: F-measure)

Round The 1st The 2nd The 3rd The 4th

Function combination (the number refers to Table 2) 1 1+3+5 1+3+5+6+7 1+3+5+6+7+8

Basic Entities P R F 0.895 0.8 0.845 0.905 0.804 0.852 0.902 0.812 0.855 0.899 0.791 0.842

P 0.893 0.9 0.901 0.899

Opinions R 0.782 0.804 0.787 0.789

F 0.834 0.85 0.841 0.841

combinations was conducted mainly regarding their accuracy in mining basic entities such as components, functions and features (with their overall accuracy) and entity-associated opinions (see Table 3). 1+3+5 was found with averagely higher precision, recall and F-measure scores in terms of both entities and opinions, relative to most of other combinations. In fact, though another combination 1+2+3+4+5 reached at similar accuracy level, considering that it demands much higher time complexity, we prefer the less complex one, i.e., 1+3+5. The combination 1+3+5 was then further unified with second-order functions (i.e., 1+3+5+6, 1+3+5+7, 1+3+5+6+7). In this round, the winner is 1+3+5+6+7, which was finally combined with the third-order function (i.e., 1+3+5+6+7+8). Table 3 lists the best combination resulted from each round. Comparison of their accuracy shows that the involvement of second-order and third-order functions did not help much on increasing the accuracy (but with the cost of higher time complexity). We have thus eventually fixed the functions’ set to be 1+3+5, which contains the first-order functions and two types of first-order plus transition. This set of functions was used in the implementation of CRFs-based opinion mining algorithm when comparing it to related approaches. Moreover, in order to test the role of self-tagging process in saving human labeling effort when preparing the model’s training data, we tested it in two automatic training conditions. As mentioned before, the whole dataset (with 821 reviews) was divided into 10 sets, where one set was used for the validation purpose, and the other nine sets were training data. The first testing condition was hence to compare the quality of self-tagged data with the one of manually labeled ones. More concretely, we compared the algorithm outcomes from the 9 training data that were all manually labeled, to the ones resulted from 6 sets (manually labeled) plus 3 sets (self-tagged by the system). For the latter, as described in Section 3.4, during each round of validation, 6 labeled sets (divided into three groups) were

23

Table 4: Testing on Self Tagging (ST) process (P: Precision, R: Recall, F: F-measure)

CRFs(6+3ST) CRFs(9) CRFs(9+4ST)

Basic Entities P R F 0.887 0.777 0.826 0.896 0.786 0.837 0.902 0.787 0.841

Opinions P R F 0.895 0.822 0.859 0.902 0.834 0.867 0.904 0.837 0.869

applied as inputs to the self-tagging to train the remaining unlabeled 3 sets. The results from the self-tagging were then integrated with the original 6 sets to form the final training set. In this testing condition, our hypothesis was that the replacement of manually labeled ones by the self-tagged data would not significantly impair the model’s accuracy. Table 4 shows the results (where CRFs(6+3ST) denotes the training set’s formation with 3 self-tagged sets, and CRFs(9) denotes the condition with all sets manually labeled). It can be seen that though the accuracy values by CRFs(6+3ST) are slightly lower than CRFs(9), the differences are not significant (i.e., all above 0.1 by ANOVA test). In fact, the distances are all within the range from 0.7% to 1.2% (for example, the precision of CRFs(6+3ST) for identifying opinions is 89.5%, which is only 0.7% lower than the precision of CRFs(9)). In the second testing condition, we crawled 187 new reviews (with 426 sentences from four new digital cameras) and activated self-tagging process to label them. These reviews were then combined with the nine manually labeled sets to form the training data. The goal of this testing was to validate the impact of self-tagging on automatically augmenting the training data quantity. The denotation for this training formation is hence CRFs(9+4ST) (meaning that nine sets were manually labeled and four new sets were self-tagged). Table 4 exposes the effectiveness of self-tagging in this condition. It shows that the results from CRFs(9+4ST) are better than that without the four self-tagged sets. Specifically, the improvement on precision is from 89.6% of CRFs(9) to 90.2% of CRFs(9+4ST) in mining basic entities, and from 90.2% to 90.4% in mining opinions. It thus implies that the self-tagging process can likely take positive effect on automatically growing the training data, so as to increase the CRFs model’s accuracy to a higher level. It also indicates the big potential of applying our algorithm to process the large amount of online reviews in the real e-commerce environment.


Table 5: Comparison results regarding components, functions, features and overall accuracy (P: Precision, R: Recall, F: F-measure)

             Component               Function                Feature                 Overall
Method       P      R      F         P      R      F         P      R      F         P      R      F
CRFs         0.911  0.827  0.867     0.843  0.675  0.749     0.907  0.828  0.862     0.887  0.777  0.826
L-HMMs       0.786  0.647  0.709     0.661  0.475  0.551     0.878  0.762  0.815     0.775  0.628  0.692
ASM+Rules    -      -      -         -      -      -         -      -      -         0.433  0.672  0.526
ASM          -      -      -         -      -      -         -      -      -         0.412  0.616  0.494
Rule-Based   -      -      -         -      -      -         -      -      -         0.285  0.258  0.256

5.2. Comparing CRFs-based Method with Other Feature-Level Opinion Mining Methods

After clarifying the roles of the learning functions and the self-tagging process in the CRFs-based method, we quantitatively compared it with the above-mentioned typical approaches proposed by other researchers (see Section 4.1). In this comparative experiment, we emphasized the precision, recall and F-measure values at both the feature level (e.g., component, function, feature, opinion, intensifier, phrase, infrequent entity) and the sentence level (i.e., opinion sentence extraction and orientation). Note that since CRFs and L-HMMs both require a training process, self-tagging was applied in both algorithms to form the training set.

5.2.1. Basic Product Entities

Table 5 lists the comparison results after 10-fold cross validation, from which we can see that CRFs achieve better accuracy for all the basic product entities (i.e., component, function and feature) than L-HMMs. In fact, most of its accuracy values are above the 80% level. The ANOVA analysis further shows that the differences are significant (p < 0.05) for all measured precision, recall and F-measure scores.

Because the ASM, ASM+Rules and rule-based methods cannot handle these basic entities separately, we calculated their overall accuracy and compared the results to the corresponding average values of CRFs and L-HMMs. The comparison indicates that ASM is better than the rule-based method, and ASM+Rules is slightly more accurate than ASM. Another interesting observation is that the recall of ASM+Rules is slightly higher than that of L-HMMs; however, its precision and F-measure scores are lower.


Table 6: Comparison results regarding opinions, intensifiers, phrases, and infrequent entities (P: Precision, R: Recall, F: F-measure)

             Opinion                 Intensifier             Phrase                  Infrequent entity
Method       P      R      F         P      R      F         P      R      F         Hit    Ave
CRFs         0.895  0.822  0.859     0.912  0.832  0.870     0.825  0.700  0.757     0.908  0.781
L-HMMs       0.895  0.790  0.839     0.897  0.792  0.841     0.800  0.667  0.726     0.902  0.766
ASM+Rules    0.200  0.321  0.246     0.334  0.301  0.317     0.119  0.098  0.108     0.804  0.685
ASM          0.173  0.289  0.217     0.327  0.296  0.311     0.064  0.069  0.067     0.801  0.678
Rule-Based   0.296  0.214  0.248     0.200  0.141  0.166     0.052  0.044  0.048     0.713  0.549

Indeed, these methods' overall accuracy values (including L-HMMs') are all lower than CRFs' in terms of all metrics.

5.2.2. Opinions and Intensifiers

As for opinions and intensifiers, CRFs still outperform the L-HMMs, ASM, ASM+Rules and rule-based methods. Table 6 lists the comparison results. The precision, recall and F-measure of CRFs all reach above the 80% level for both opinions and intensifiers (note that the results include the assessment of phrases). In comparison, the accuracy values of ASM and ASM+Rules are far lower, close to the accuracy level of the baseline rule-based method. To investigate the reason, we compared their predicted labels with the actual ones (which we manually tagged) in the testing data. We found that although ASM can sometimes accurately identify a single opinion word, and ASM+Rules can further improve the opinion's orientation accuracy, both were less accurate at finding opinion phrases. For example, "easy" in the phrase "easy to use" was discovered, but the whole phrase was not identified. This may additionally explain why the phrase accuracy of ASM and ASM+Rules was also low (see the next section). In comparison, the model-based learning methods are more accurate in mining opinions regardless of whether they are single words or phrases.

The in-depth comparison between CRFs and L-HMMs further shows that although their accuracy values are close, the differences significantly favor CRFs (by ANOVA test), particularly with respect to the opinion recall (increasing from 79.0% to 82.2%, p < 0.05), the intensifier recall (from 79.2% to 83.2%, p < 0.05) and the intensifier F-measure (from 84.1% to 87.0%, p < 0.1). Combining these results with those of the previous section, we can see that the CRFs-based method is more competent and flexible in extracting these different review elements.

It even reaches a significantly better accuracy level than the L-HMMs based method for all the basic entities, opinions and intensifiers. The findings hence support our hypothesis that the CRFs model is more effective in accomplishing these opinion mining tasks.

5.2.3. Phrases and Infrequent Entities

We further compared these approaches regarding their abilities in determining phrases and infrequent entities. Specifically, a phrase was defined as a set of two or more consecutive words that includes the beginning, (middle) and end position tags. It can refer not only to a product entity (e.g., "image quality") but also to an opinion (e.g., "easy to use", "easy to navigate"). Table 6 reports the results from the 10-fold cross validation. It shows that ASM and ASM+Rules obtained low accuracy in discovering phrases (below 12% in terms of all metrics). For example, they could not accurately identify the opinion phrases "fond of", "in love of", "user friendly", "easy to navigate", etc. In terms of feature phrases (e.g., "image quality", "battery life"), ASM and ASM+Rules are better qualified than the rule-based method (which simply considers frequent noun sets as feature phrases). By contrast, CRFs and L-HMMs are more capable of finding both types of phrases. The CRFs-based method even achieves higher accuracy levels (82.5% precision, 70.0% recall and 75.7% F-measure) than L-HMMs, and the difference in F-measure between them is moderately significant (p < 0.1).

As for infrequent entities, we regarded entities whose appearance frequency is below or equal to 1% within the whole dataset as infrequent. We used the hit ratio $R_{hit}$ and the average ratio $R_{ave}$ to measure an algorithm's accuracy in discovering them. That is, assuming there are $N_{all}$ infrequent entities and $N_{find}$ of them were correctly extracted by the evaluated algorithm, $R_{hit} = N_{find}/N_{all}$. The average ratio measures, over the infrequent entities, the average fraction of their occurrences that the algorithm correctly identified:

$$R_{ave} = \frac{1}{N_{all}} \sum_{i=1}^{N_{all}} \frac{F_i}{C_i},$$

where $C_i$ indicates the actual number of appearances of the $i$-th entity, and $F_i$ is the number of times the algorithm found it. The results indicate that although the accuracy values of ASM and ASM+Rules are not very low (e.g., ASM+Rules obtains an 80.4% hit ratio and a 68.5% average ratio), CRFs still outperform them (see Table 6). CRFs also perform slightly better than L-HMMs in terms of both hit ratio and average ratio.
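Both ratios can be computed directly from the entity counts; a small illustration (the helper and the toy counts are ours, and we read entities that were never found as contributing zero to the sum):

def infrequent_entity_ratios(counts):
    # counts maps each infrequent entity to (C_i, F_i): its actual number
    # of appearances and the number of times the algorithm found it.
    n_all = len(counts)
    n_find = sum(1 for (_, f) in counts.values() if f > 0)
    r_hit = n_find / n_all
    r_ave = sum(f / c for (c, f) in counts.values()) / n_all
    return r_hit, r_ave

# Toy example: "ISO" appears once and is found once; "HDR" is never found.
print(infrequent_entity_ratios({"ISO": (1, 1), "HDR": (2, 0)}))  # (0.5, 0.5)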


Table 7: Comparison results regarding opinion sentence and sentence orientation (P: Precision, R: Recall, F: F-measure)

             Opinion sentence        Sentence orientation
Method       P      R      F         P      R      F
CRFs         0.871  0.799  0.834     0.871  0.797  0.852
L-HMMs       0.848  0.774  0.810     0.809  0.751  0.779
ASM+Rules    0.600  0.519  0.557     0.541  0.467  0.501
ASM          0.557  0.511  0.533     0.524  0.457  0.488
Rule-Based   0.436  0.397  0.416     0.408  0.383  0.395

5.2.4. Opinion Sentences and Sentence Orientation

As introduced before, we also measured the methods' performance in identifying opinion sentences and the sentences' orientation (see Table 7). CRFs and L-HMMs again achieve higher accuracy values, reaching above 75% for both measures, and CRFs perform better than L-HMMs. The reason can mainly be attributed to CRFs' higher accuracy in identifying the different types of product entities and opinions. In this regard, it is also interesting that ASM can correctly identify opinion sentences at 55.7% precision, 51.1% recall and 53.3% F-measure, and ASM+Rules improves the accuracy to 60.0% precision, 51.9% recall and 55.7% F-measure. Their performance in determining the opinion sentences' orientation averages around 50% (w.r.t. precision and F-measure). These results suggest that, with the discovered single opinion words and reasonable accuracy in feature extraction, these methods can be somewhat qualified to identify sentence-level opinion orientation, though they are less accurate than the model-based methods.

5.3. Discussion

Thus, the experimental findings suggest that the CRFs-based learning method is more competent in mining product entities, opinions and intensifiers (including phrases) than the L-HMMs based and statistical methods. In addition to the comparison via the 10-fold cross validation, we also assessed these approaches' accuracy with respect to each product (see the Appendix with Tables A.1 and A.2). In this case, one product's reviews were tested at a time, based on training over the other eight products' reviews.

The two tables show that the differences remain valid in this comparative condition: for each product's reviews, CRFs always obtain higher accuracy values than the other methods in terms of mining basic entities, opinions, intensifiers, phrases, infrequent entities and opinion sentences. Furthermore, L-HMMs' accuracy is still the closest to CRFs', relative to the other compared algorithms. This again verifies that the model-based learning methods are more flexible in accomplishing the various tasks related to feature-level opinion mining, as we hypothesized at the beginning.

To explain the superior accuracy achieved by CRFs, we see two major reasons. The first is that the CRFs model can naturally incorporate arbitrary, non-independent entities of the input reviews, without making a conditional independence assumption among them. The second is that it can involve rich and overlapping learning functions, which helps to discover phrases and infrequent entities (an illustrative sketch follows at the end of this section). For example, although the entity "ISO" appears only once in our data, more than one function was activated for finding it, so its exposure probability was increased. These two reasons may also explain why the CRFs-based method is better than the other model-based mining method, L-HMMs, which lacks these properties.

On the other hand, the experiment revealed the respective merits of model-based and statistical methods. Considering that training effort is requisite for the model-based methods (like CRFs), the statistical approaches might serve as a substitute when one only needs to discover opinion sentences from reviews, given their reasonable accuracy in this regard. However, when it is meaningful to extract detailed review elements (such as individual features, opinions and intensifiers), as we emphasized in this paper, CRFs are much better qualified, and the training effort to build the model is worthwhile. In essence, the CRFs-based method is flexible in extracting various types of review elements, provided it is equipped with a well-defined, comprehensive set of word tags and the optimal learning functions. For instance, it can elaborate whether a "feature" is a component, function or property, whereas the statistical approaches can only identify whether it is a "feature" or not.
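To make the second reason concrete, consider a token-level feature extractor in the dictionary style accepted by common CRF toolkits (e.g., sklearn-crfsuite); the features below are generic illustrations, not the exact learning functions selected in Section 5.1:

def token_features(tokens, pos_tags, i):
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),      # lexical identity
        "word.isupper": word.isupper(),  # helps with entities such as "ISO"
        "pos": pos_tags[i],              # part-of-speech tag
        "suffix3": word[-3:],            # overlaps with the word identity
    }
    if i > 0:
        # First-order context: the previous word and its POS tag.
        feats["-1:word.lower"] = tokens[i - 1].lower()
        feats["-1:pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True              # beginning-of-sentence marker
    return feats

The features deliberately overlap (the lowercased word, its suffix and its neighbors are clearly not independent); a generative model such as an HMM would have to assume away exactly these dependencies, whereas a CRF can weight them jointly.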

6. Conclusions

In conclusion, this article described how the CRFs model was adopted to accomplish feature-level Web opinion mining tasks (to analyze online consumer reviews), and how we experimentally compared it with other typical approaches from different angles. Specifically, to practically apply CRFs, we not only integrated a self-tagging process to reduce the manual labeling effort, but also identified the optimal set of learning functions to achieve the desired balance between algorithm complexity and accuracy. In the experiments, we compared five opinion mining methods: the rule-based method as the baseline, two variations of association rule mining based methods (i.e., ASM and ASM plus linguistic rules), the L-HMMs based method, and the CRFs-based method. The comparison was performed in depth regarding their actual abilities in mining different types of review elements: basic entities, opinions, intensifiers, phrases, infrequent entities, opinion sentences, and sentence orientation.

Overall, the experimental results demonstrated the superior effectiveness of CRFs relative to the other algorithms. More specifically, in contrast to the other model-based learning method, L-HMMs, which holds a conditional independence assumption, CRFs can handle non-independent input entities and integrate rich learning functions to discover relationships among tags. This property enables it to achieve relatively higher accuracy in mining these review elements (including infrequent entities). In comparison with the association rule mining based approaches, CRFs' advantage in extracting fine-grained features (such as component and function), opinions and phrases is significant.

Encouraged by these findings, in the future we will design user interfaces to visually display the opinion mining results (e.g., in tag clouds or bar charts) to end users. The extracted features and opinions will then be judged by users in user evaluations, so as to empirically understand the impact of feature-level opinion mining on saving users' effort in browsing the original textual reviews and on effectively supporting their product purchase decisions.

Acknowledgements

This research has been supported by Hong Kong Baptist University Faculty Research Grant (HKBU/FRG2/10-11/041).


Appendix A.

Table A.1: Comparison respecting individual product at the feature level (P: Precision, R: Recall, F: F-measure)

                      Basic Entity            Opinion                 Phrase                  Infrequent Entity       Intensifier
Product   Method      P     R     F           P     R     F           P     R     F           P     R     F           P     R     F
Product1  CRFs        0.874 0.764 0.815       0.884 0.822 0.851       0.805 0.673 0.733       0.701 0.778 0.737       0.883 0.794 0.837
          L-HMMs      0.754 0.605 0.671       0.848 0.732 0.786       0.725 0.556 0.629       0.710 0.807 0.755       0.891 0.779 0.831
          ASM+Rules   0.460 0.681 0.549       0.221 0.338 0.267       0.095 0.095 0.095       0.558 0.700 0.633       0.322 0.311 0.316
          ASM         0.422 0.626 0.504       0.204 0.311 0.247       0.059 0.062 0.060       0.542 0.697 0.603       0.313 0.270 0.290
          Rule-Based  0.286 0.263 0.274       0.282 0.177 0.217       0.041 0.040 0.040       0.383 0.428 0.404       0.227 0.136 0.170
Product2  CRFs        0.884 0.765 0.818       0.882 0.799 0.838       0.857 0.625 0.723       0.711 0.828 0.765       0.886 0.796 0.839
          L-HMMs      0.845 0.719 0.776       0.884 0.767 0.822       0.869 0.773 0.818       0.709 0.780 0.743       0.896 0.777 0.832
          ASM+Rules   0.457 0.683 0.541       0.189 0.309 0.235       0.105 0.092 0.098       0.651 0.755 0.716       0.328 0.295 0.311
          ASM         0.437 0.666 0.527       0.150 0.259 0.190       0.032 0.059 0.041       0.639 0.754 0.692       0.324 0.284 0.303
          Rule-Based  0.299 0.270 0.284       0.278 0.173 0.214       0.052 0.047 0.049       0.413 0.410 0.411       0.196 0.144 0.166
Product3  CRFs        0.873 0.763 0.814       0.945 0.901 0.923       0.805 0.717 0.759       0.745 0.831 0.786       0.904 0.822 0.861
          L-HMMs      0.714 0.562 0.628       0.926 0.824 0.872       0.810 0.713 0.758       0.718 0.785 0.750       0.854 0.778 0.814
          ASM+Rules   0.460 0.694 0.560       0.193 0.302 0.236       0.100 0.090 0.095       0.541 0.715 0.616       0.315 0.316 0.315
          ASM         0.452 0.673 0.541       0.181 0.295 0.224       0.088 0.087 0.087       0.535 0.684 0.574       0.305 0.275 0.289
          Rule-Based  0.291 0.231 0.258       0.287 0.213 0.244       0.077 0.068 0.072       0.420 0.411 0.415       0.216 0.128 0.160
Product4  CRFs        0.856 0.727 0.785       0.917 0.807 0.858       0.784 0.617 0.690       0.700 0.768 0.733       0.895 0.792 0.840
          L-HMMs      0.807 0.674 0.733       0.883 0.801 0.840       0.727 0.533 0.615       0.675 0.775 0.722       0.853 0.794 0.822
          ASM+Rules   0.426 0.621 0.506       0.201 0.322 0.247       0.150 0.105 0.124       0.590 0.723 0.673       0.327 0.335 0.331
          ASM         0.419 0.595 0.492       0.180 0.284 0.220       0.075 0.081 0.078       0.583 0.695 0.634       0.301 0.270 0.285
          Rule-Based  0.269 0.232 0.249       0.277 0.180 0.218       0.041 0.037 0.039       0.400 0.422 0.411       0.184 0.140 0.159
Product5  CRFs        0.890 0.803 0.844       0.925 0.832 0.876       0.884 0.792 0.835       0.665 0.753 0.706       0.888 0.793 0.838
          L-HMMs      0.743 0.610 0.668       0.926 0.812 0.865       0.860 0.707 0.776       0.671 0.780 0.721       0.886 0.767 0.822
          ASM+Rules   0.428 0.680 0.525       0.190 0.308 0.235       0.105 0.102 0.103       0.550 0.705 0.648       0.325 0.309 0.317
          ASM         0.427 0.625 0.507       0.155 0.265 0.195       0.064 0.068 0.066       0.511 0.663 0.573       0.316 0.285 0.300
          Rule-Based  0.267 0.238 0.252       0.313 0.216 0.256       0.053 0.035 0.042       0.406 0.439 0.422       0.208 0.139 0.166
Product6  CRFs        0.901 0.818 0.857       0.890 0.849 0.869       0.833 0.700 0.761       0.739 0.856 0.793       0.895 0.792 0.841
          L-HMMs      0.754 0.635 0.689       0.879 0.744 0.806       0.820 0.602 0.694       0.698 0.762 0.729       0.864 0.801 0.831
          ASM+Rules   0.432 0.656 0.521       0.247 0.369 0.296       0.159 0.101 0.124       0.570 0.779 0.659       0.332 0.304 0.317
          ASM         0.427 0.633 0.510       0.214 0.324 0.258       0.075 0.078 0.076       0.550 0.778 0.630       0.298 0.293 0.295
          Rule-Based  0.291 0.263 0.276       0.302 0.210 0.248       0.074 0.059 0.066       0.396 0.410 0.403       0.229 0.146 0.178
Product7  CRFs        0.884 0.761 0.818       0.919 0.821 0.867       0.875 0.700 0.778       0.686 0.805 0.741       0.916 0.796 0.852
          L-HMMs      0.815 0.705 0.755       0.907 0.785 0.842       0.723 0.593 0.651       0.711 0.809 0.757       0.878 0.801 0.838
          ASM+Rules   0.387 0.620 0.498       0.198 0.320 0.245       0.144 0.107 0.123       0.545 0.672 0.626       0.345 0.299 0.321
          ASM         0.377 0.566 0.453       0.173 0.281 0.214       0.084 0.068 0.075       0.539 0.663 0.588       0.323 0.262 0.289
          Rule-Based  0.296 0.275 0.285       0.270 0.179 0.215       0.063 0.044 0.051       0.416 0.450 0.432       0.183 0.123 0.147
Product8  CRFs        0.865 0.725 0.787       0.901 0.830 0.864       0.800 0.596 0.683       0.693 0.753 0.722       0.901 0.837 0.868
          L-HMMs      0.825 0.679 0.744       0.872 0.772 0.819       0.805 0.698 0.748       0.706 0.799 0.750       0.853 0.778 0.814
          ASM+Rules   0.484 0.694 0.570       0.247 0.367 0.296       0.153 0.106 0.125       0.583 0.795 0.672       0.338 0.309 0.323
          ASM         0.392 0.579 0.468       0.238 0.339 0.280       0.086 0.080 0.083       0.574 0.787 0.660       0.305 0.296 0.300
          Rule-Based  0.267 0.247 0.257       0.273 0.192 0.225       0.054 0.048 0.050       0.388 0.408 0.398       0.220 0.165 0.188
Product9  CRFs        0.865 0.721 0.786       0.881 0.792 0.834       0.842 0.640 0.727       0.659 0.735 0.695       0.910 0.795 0.848
          L-HMMs      0.823 0.639 0.714       0.878 0.758 0.814       0.795 0.690 0.739       0.679 0.799 0.734       0.893 0.806 0.847
          ASM+Rules   0.484 0.705 0.574       0.174 0.301 0.221       0.103 0.106 0.104       0.537 0.664 0.594       0.333 0.313 0.323
          ASM         0.430 0.654 0.519       0.161 0.269 0.201       0.074 0.073 0.074       0.518 0.632 0.569       0.309 0.297 0.303
          Rule-Based  0.294 0.262 0.277       0.280 0.184 0.222       0.042 0.031 0.036       0.381 0.411 0.395       0.211 0.140 0.168
Average   CRFs        0.877 0.761 0.814       0.905 0.828 0.864       0.832 0.673 0.743       0.700 0.790 0.742       0.898 0.802 0.847
          L-HMMs      0.787 0.648 0.709       0.889 0.777 0.829       0.793 0.652 0.714       0.697 0.788 0.740       0.874 0.787 0.828
          ASM+Rules   0.447 0.670 0.536       0.207 0.326 0.254       0.124 0.101 0.110       0.569 0.737 0.635       0.329 0.310 0.320
          ASM         0.420 0.624 0.502       0.184 0.292 0.225       0.070 0.073 0.066       0.554 0.697 0.623       0.310 0.281 0.295
          Rule-Based  0.285 0.254 0.268       0.285 0.192 0.229       0.055 0.045 0.050       0.400 0.421 0.410       0.208 0.140 0.167

Table A.2: Comparison respecting individual product at the sentence level (P: Precision, R: Recall, F: F-measure)

                      Opinion sentence        Sentence orientation
Product   Method      P     R     F           P     R     F
Product1  CRFs        0.894 0.818 0.854       0.849 0.791 0.819
          L-HMMs      0.833 0.769 0.800       0.781 0.725 0.752
          ASM+Rules   0.579 0.532 0.554       0.538 0.475 0.504
          ASM         0.568 0.521 0.543       0.522 0.471 0.495
          Rule-Based  0.362 0.291 0.322       0.329 0.278 0.301
Product2  CRFs        0.896 0.834 0.864       0.878 0.812 0.844
          L-HMMs      0.842 0.803 0.822       0.802 0.754 0.777
          ASM+Rules   0.601 0.543 0.571       0.559 0.495 0.525
          ASM         0.578 0.536 0.556       0.540 0.493 0.515
          Rule-Based  0.341 0.311 0.325       0.346 0.291 0.316
Product3  CRFs        0.913 0.877 0.895       0.866 0.842 0.854
          L-HMMs      0.870 0.760 0.811       0.824 0.747 0.784
          ASM+Rules   0.589 0.531 0.558       0.556 0.511 0.533
          ASM         0.565 0.491 0.525       0.532 0.509 0.520
          Rule-Based  0.346 0.301 0.322       0.312 0.296 0.304
Product4  CRFs        0.903 0.828 0.864       0.884 0.805 0.843
          L-HMMs      0.869 0.793 0.830       0.807 0.746 0.776
          ASM+Rules   0.558 0.471 0.511       0.512 0.420 0.461
          ASM         0.546 0.461 0.500       0.500 0.415 0.454
          Rule-Based  0.345 0.308 0.325       0.316 0.272 0.292
Product5  CRFs        0.850 0.793 0.820       0.841 0.762 0.800
          L-HMMs      0.862 0.780 0.819       0.797 0.766 0.781
          ASM+Rules   0.561 0.476 0.515       0.558 0.435 0.489
          ASM         0.552 0.473 0.510       0.512 0.432 0.469
          Rule-Based  0.366 0.312 0.337       0.330 0.284 0.305
Product6  CRFs        0.871 0.758 0.811       0.870 0.751 0.806
          L-HMMs      0.877 0.788 0.830       0.823 0.723 0.770
          ASM+Rules   0.575 0.523 0.548       0.576 0.461 0.512
          ASM         0.574 0.498 0.533       0.566 0.455 0.505
          Rule-Based  0.371 0.324 0.346       0.306 0.286 0.296
Product7  CRFs        0.882 0.753 0.812       0.867 0.742 0.799
          L-HMMs      0.870 0.772 0.818       0.806 0.769 0.787
          ASM+Rules   0.616 0.591 0.603       0.560 0.507 0.532
          ASM         0.594 0.559 0.576       0.540 0.500 0.519
          Rule-Based  0.366 0.295 0.327       0.340 0.294 0.315
Product8  CRFs        0.849 0.756 0.800       0.871 0.732 0.795
          L-HMMs      0.840 0.770 0.804       0.789 0.721 0.753
          ASM+Rules   0.599 0.550 0.573       0.557 0.492 0.522
          ASM         0.606 0.553 0.578       0.546 0.490 0.517
          Rule-Based  0.340 0.310 0.324       0.349 0.304 0.325
Product9  CRFs        0.880 0.839 0.859       0.858 0.815 0.836
          L-HMMs      0.871 0.789 0.828       0.785 0.741 0.763
          ASM+Rules   0.583 0.515 0.547       0.519 0.447 0.480
          ASM         0.552 0.491 0.520       0.510 0.446 0.476
          Rule-Based  0.345 0.294 0.317       0.306 0.274 0.289
Average   CRFs        0.882 0.806 0.842       0.865 0.784 0.822
          L-HMMs      0.859 0.781 0.818       0.802 0.744 0.771
          ASM+Rules   0.582 0.521 0.550       0.548 0.471 0.507
          ASM         0.573 0.514 0.542       0.530 0.468 0.497
          Rule-Based  0.354 0.305 0.327       0.326 0.287 0.305

