A User-based Approach to Multi-way Polarity Classification

Yanir Seroussi, Ingrid Zukerman, Fabian Bohnert
Clayton School of Information Technology, Faculty of Information Technology, Monash University, Clayton, Victoria 3800, Australia
[email protected]
Abstract

Sentiment analysis deals with inferring people's sentiments and opinions from texts. An important aspect of sentiment analysis is polarity classification, which consists of inferring a document's polarity – the overall sentiment conveyed by the text – in the form of a numerical rating. In contrast to existing approaches to polarity classification, we propose to take the authors of the documents into account. Specifically, we present a nearest-neighbour collaborative approach that utilises novel models of user similarity. Our evaluation shows that our approach improves on state-of-the-art performance in terms of classification error and runtime, and yields insights regarding datasets for which such an improvement is achievable.
1 Introduction
Polarity classification is one of the earliest tasks attempted in the sentiment analysis field [13]. The binary case consists of classifying texts as either positive or negative. Less attention has been paid to Multi-way Polarity Classification (MPC), i.e., inferring the "star rating" of texts on a scale of more than two values.

One of the challenges in polarity classification is that relying only on keywords may result in poor performance [13]. The following snippet, from a one-star movie review in the IMDb62 dataset (Section 3), illustrates this challenge:

    . . . If you like Paris Hilton and enjoy watching her on the silver screen, don't miss this film, it will hold your interest and keep you entertained. Enjoy.
Methods that rely only on keywords may conclude that this snippet is positive. However, this is the case only for people who like Paris Hilton. Therefore, more sophisticated methods are required. An additional challenge, which is relevant to MPC, is that ratings on a non-binary scale are more open to interpretation than binary ratings (e.g., the difference between a rating of 6 and 7 on a 10-point scale is less clear cut than the difference between "good" and "bad"), and thus every user may have a different "feel" for the rating scale. This challenge was noted in [12], but not dealt with directly.

Our approach to MPC takes the users into account, rather than relying solely on standalone texts. We do this by introducing a nearest-neighbour collaborative framework: we train user-specific classifiers, and consider user similarity to combine the outputs of the classifiers. This approach decreases the error in cross-user MPC, while requiring less computational resources than user-blind methods (Section 6.2).

In addition to introducing the basic collaborative MPC framework, we address two main issues: (1) sparsity of the item/rating matrix, which makes item-based similarity measures unreliable; and (2) modeling user similarity when both ratings and reviews are unavailable (but other texts are available). These problems are addressed mainly by modeling similarity between users based on their texts. When the item/rating matrix is sparse, we show that basing similarity on all the reviews by the users reduces classification error (Section 6.4). When no ratings or reviews are available, we show that message board posts can be used to successfully model user similarity (Section 6.5).

This report is organised as follows. Related work is surveyed in Section 2. Our dataset is described in Section 3. Polarity classification methods and user similarity models are presented in Sections 4 and 5 respectively. Section 6 presents the results of our evaluation, and Section 7 discusses our results and our plans for future work. The report is concluded in Section 8.
2 Related Work
Binary polarity classification has been an active research area since the early days of sentiment analysis [14, 21]. Recent work is by Dasgupta and Ng [2] on movie and product reviews in a semi-supervised setup, Wan [22] on Chinese product reviews, and Qiu et al. [16] on domain adaptation of word-level polarity. Multi-way Polarity Classification (MPC) was attempted in several domains, e.g., movie reviews [4, 12], restaurant reviews [20], and customer feedback [3]. Results vary depending on the domain and size of the texts. However, not surprisingly, the results for MPC are inferior to those for binary classification.

Several researchers found that authorship affects performance in opinion and sentiment analysis [5, 10, 12]. Specifically, Pang and Lee [12] found that a classifier trained on film reviews by one author and tested on reviews by a different author is likely to perform poorly. Lin et al. [10] and Greene and Resnik [5] obtained similar results with respect to the Bitter Lemons dataset, which contains pro-Palestinian and pro-Israeli articles, half of them written by two editors and the other half by various guest writers. Specifically, they found that training and testing on articles by the editors results in near-perfect accuracy, and that training on articles by the guest writers and testing on the editors' articles results in higher accuracy than the reverse case (the accuracy in both cases was still lower than training and testing on the editors). Our work takes these insights one step further, in that we harness cross-author similarity to reduce classification error.

The inspiration for considering the users for polarity classification comes from recommender systems [17], and specifically from Collaborative Filtering (CF), which employs a target user's previous ratings and ratings submitted by similar users to predict the ratings that the target user will give to unrated items [1, 7]. MPC resembles CF in that both output ratings. However, there are two fundamental differences: (1) in MPC, the ratings are obtained from classifiers that take as input a user's textual review of an unrated item, whereas CF relies on the user's ratings of other items; and (2) CF systems generally require some ratings by the target user, while polarity classification systems infer the polarity (numeric rating) of a target user's text even when no ratings by the target user are available.

Despite the differences between MPC and CF, they have two problems in common: (1) sparsity of the item/rating matrix, where rating prediction is based on a relatively small number of known ratings, compared to the number of possible item/rating pairs; and (2) the new user problem – predicting ratings for users who supplied few or no ratings [1]. Our text-based measures address these problems by reducing the dependency on rated items to calculate similarity between users (Section 6.4), and by decreasing the need for ratings and opinion-bearing texts (Section 6.5).
3 Dataset
The Sentiment Scale Dataset (v1.0) [12] includes movie reviews by four different users and mappings to 3-star and 4-star rating scales. We could not use this dataset, as this small number of users is inadequate to support experiments regarding the impact of authorship on sentiment. To address this problem, we created the Prolific IMDb Users dataset by collecting data from the Internet Movie Database (IMDb) at www.imdb.com in May 2009. This dataset contains 184 users with at least 500 movie reviews per user. Not all users have a large number of labeled reviews, as IMDb users may choose not to assign a rating to their reviews. However, all ratings are on the same 1–10 star scale. In addition to movie reviews, users may write message board posts. IMDb message boards are mostly movie-related, but some are about television, music and other topics.

In the experiments presented in this paper, we use a subset of the Prolific IMDb Users dataset, called IMDb62. This subset has the following properties:

• The total number of reviews is 62,000 (1000 reviews per user). Each user's reviews were obtained using proportional sampling without replacement (i.e., for each user, the 1000 reviews have the same rating frequencies as their complete set of reviews); a sketch of this sampling procedure appears at the end of this section. Reviews without ratings were excluded from the dataset. It is worth noting that in our evaluation we do not assume that every user has submitted 1000 reviews. In fact, we show that our methods yield improved performance even when the number of prolific users is small (Section 6.6).

• Each item is reviewed only once by each user. Reviews for items with multiple reviews by the same user were discarded to reduce ambiguity.

• Explicit ratings were automatically filtered out from the review texts, e.g., "5/10" was removed from texts such as "this movie deserves 5/10".

• For each user, all the message board posts are included. Some users have not submitted any posts, while others wrote hundreds to thousands of posts. No sub-sampling of message board posts was performed, because we want to emulate the real-world scenario where users submit a variable number of posts.

Table 1: IMDb62 Dataset Properties
Users: 62
Labeled reviews: 62,000
Reviewed items: 29,116
Items with only one review: 18,322
Item/rating matrix sparsity: 96.57%
Words per review mean: 300
Words per review stddev: 198
Message board posts: 17,560
Number of users with no posts: 11
Posts per user mean (for users with posts): 344
Posts per user stddev (for users with posts): 743

Table 1 displays some statistics for the IMDb62 Dataset. Notable properties of the dataset are the large percentage of items that were reviewed by only one user (around 63%), and the high standard deviation of the review word count. These properties make cross-user polarity classification a challenging task. Another difficulty in cross-user polarity classification is that different users may have different interpretations of the rating scale, e.g., two users may express a similar opinion of an item, but assign it a different rating [12]. Further, users select the items they review, and therefore they might choose to submit only reviews with extreme ratings. These challenges are visible in IMDb62, which displays a large variability of rating distributions. For example, it contains users with more than 40% 10-star ratings and almost no 1–4 star ratings, while others have most of their ratings in the 1–5 star range.
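The per-user proportional sampling mentioned above can be sketched as follows. This is an illustrative reconstruction, not the tooling we used to build the dataset; the function and variable names are ours, and reviews are assumed to be (rating, text) pairs with ratings in 1–10.

```python
import random
from collections import defaultdict

def proportional_sample(reviews, sample_size, seed=0):
    """Draw a fixed-size sample without replacement whose rating frequencies
    approximate those of the user's full review set (illustrative sketch)."""
    random.seed(seed)
    by_rating = defaultdict(list)
    for review in reviews:            # each review is a (rating, text) pair
        by_rating[review[0]].append(review)
    sample = []
    for rating, bucket in sorted(by_rating.items()):
        # Quota proportional to this rating's share of the user's reviews.
        quota = round(sample_size * len(bucket) / len(reviews))
        sample.extend(random.sample(bucket, min(quota, len(bucket))))
    # Rounding may leave the sample slightly off the target size; trim or top up.
    if len(sample) > sample_size:
        sample = random.sample(sample, sample_size)
    elif len(sample) < sample_size:
        remaining = [r for r in reviews if r not in sample]
        sample.extend(random.sample(remaining, sample_size - len(sample)))
    return sample

reviews = [(10, "loved it")] * 6 + [(1, "awful")] * 4
print([r for r, _ in proportional_sample(reviews, 5)])  # three 10-star and two 1-star reviews
```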
4 Polarity Classification Methods
In this section, we describe several methods for multi-way polarity classification. In the following descriptions, the target user is the author of the reviews to be classified (i.e., reviews for which the polarity is unknown), and the training users are the authors of reviews for which the polarity labels are known. The target user may be a new user, for whom few or no labeled reviews are available. All methods rely on classifiers that are trained on labeled reviews and output the classification of unlabeled reviews.

Pang and Lee [12] focused on training a single classifier on labeled reviews. They used training data from a single user (SCSU – Single Classifier, Single User) or from multiple users (SCMU – Single Classifier, Multiple Users). SCSU is similar to content-based recommender systems [1], as it is based only on the target user's past ratings and reviews. SCSU is expected to perform best on reviews by the user on whom it was trained, but it may require many reviews to achieve acceptable performance (Section 6.1). In addition, SCSU is unsuitable when there are few or no labeled reviews by the target user. SCMU addresses SCSU's problem of target users with few reviews, since it does not rely solely on labeled reviews from the target user. However, SCMU's classification performance may be worse than that of SCSU, as differences between the training users may make it hard for the classifier to generalise. Moreover, training an SCMU classifier on all the available data is infeasible in a system with many users and reviews. To address this problem, one could randomly sample a subset of the available reviews and use them for training the classifier, but this is unlikely to result in satisfactory performance (Section 6.2).

Our method, Multiple Classifiers, Multiple Users (MCMU), addresses these problems by training a separate classifier for each training user and outputting the normalised rating inferred by the classifiers. The simple way of normalising the outputs of the training user classifiers is by using a weighted average:

    \hat{r}_{q_a} = \frac{\sum_{u \in U} w_u \hat{r}_{u,q_a}}{\sum_{u \in U} w_u}    (1)

where U is the set of training users; \hat{r}_{u,q_a} is the rating inferred by user u's classifier for a review q_a by the target user a; and w_u, the weight of each training user classifier, is calculated using the methods introduced in Section 5.

Herlocker et al. [7] showed that normalising the training user ratings \hat{r}_{u,q_a} using Equation 2 reduces the Mean Absolute Error (MAE – defined in Section 6) for Collaborative Filters (CFs):

    \hat{r}_{q_a} = \mu_a + \sigma_a \frac{\sum_{u \in U} w_u \frac{\hat{r}_{u,q_a} - \mu_u}{\sigma_u}}{\sum_{u \in U} w_u}    (2)

where \mu_x and \sigma_x are user x's rating mean and standard deviation. Like Equation 1, this equation can also be used when there are no available target user ratings (Sections 6.3 and 6.5). In this case, the rating mean and standard deviation over all training users are used to estimate the target user's rating mean and standard deviation. We compare the performance yielded by using Equations 1 and 2 in Section 6.3.
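A minimal sketch of the two combination rules follows. It assumes the per-user classifier outputs, weights and rating statistics have already been computed; the function names and example values are ours, for illustration only.

```python
def combine_weighted_average(preds, weights):
    """Equation 1: weighted average of the training users' predictions.
    preds[u] is user u's classifier output for the target review; weights[u] >= 0."""
    total = sum(weights[u] for u in preds)
    return sum(weights[u] * preds[u] for u in preds) / total

def combine_zscore(preds, weights, means, stds, target_mean, target_std):
    """Equation 2: z-score normalise each training user's prediction,
    then map the weighted average back onto the target user's rating scale."""
    total = sum(weights[u] for u in preds)
    z = sum(weights[u] * (preds[u] - means[u]) / stds[u] for u in preds) / total
    return target_mean + target_std * z

# Example: three training users predicting a rating for one target review.
preds = {"u1": 7.0, "u2": 9.0, "u3": 4.0}
weights = {"u1": 1.0, "u2": 0.5, "u3": 0.2}
means = {"u1": 6.5, "u2": 8.0, "u3": 5.0}
stds = {"u1": 1.5, "u2": 1.0, "u3": 2.0}
print(round(combine_weighted_average(preds, weights)))
print(round(combine_zscore(preds, weights, means, stds, target_mean=6.0, target_std=2.0)))
```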
5 User Similarity Models
In this section, we describe several methods for modeling user similarity. These methods yield a similarity score sim(a, u) for users a and u, which is then incorporated into the MCMU classifier ensemble described in Section 4. That is, given a target user a, the weights used in Equations 1 and 2 are:

    w_u = f(\mathrm{sim}(a, u))    (3)

where f(x) is a transformation function that ensures non-negative weights. We chose f(x) = e^x in order to give more weight to similar users (the results of our experiments with two different f(x) definitions are given in Appendix A.1). In this setup, all available users are given weights, but it can easily be modified to consider only users above a similarity threshold s (or some other selection criterion):

    w_u = \begin{cases} f(\mathrm{sim}(a, u)) & \text{if } \mathrm{sim}(a, u) > s \\ 0 & \text{otherwise} \end{cases}    (4)

Table 2: Similarity Measures Taxonomy
                      Rating-based (Section 5.1)    Text-based (Section 5.2)
All items             AIR                           AIT, AIP
Co-reviewed items     CRR                           CRT, CRP
Table 2 groups the similarity models based on the type of information they use and the sources for this information. Rating-based methods rely only on ratings in order to measure similarity between users, while text-based methods employ only the users’ documents. Measures based on all items calculate general statistics on the entire set of user reviews or documents, while measures based on co-reviewed items perform a pairwise comparison of the reviews for items reviewed by two users. We expect measures based on co-reviewed items to be more informative than measures based on all items. This is because the former take into account the actual items reviewed, while the latter may be computed on the basis of users who have only a few or no items in common. However, measures based on co-reviewed items may require more labeled reviews, so that the size of the set of co-reviewed items is sufficiently large to give meaningful similarity values. These measures may also underperform when the item/rating matrix is sparse (i.e., many items are reviewed by only a few users).
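Before turning to the individual measures, here is a minimal sketch of how Equations 3 and 4 turn a similarity score into a classifier weight. The function name and example scores are ours; the threshold s would be learned per target user as described in Section 6.4.

```python
import math

def user_weight(similarity, threshold=None):
    """Equations 3-4: map a similarity score to a classifier weight using
    f(x) = e^x, optionally zeroing out users at or below a similarity threshold."""
    if threshold is not None and similarity <= threshold:
        return 0.0
    return math.exp(similarity)

similarities = {"u1": 0.35, "u2": 0.05, "u3": -0.2}
weights = {u: user_weight(s, threshold=0.1) for u, s in similarities.items()}
print(weights)  # u2 and u3 are filtered out; u1 receives weight e^0.35
```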
5.1 Rating-based Models

5.1.1 All Item Ratings (AIR)
Let Q_x denote the reviews written by user x, and Q_{x,r} denote x's reviews with rating r (r ∈ {1, 2, ..., 10}). The similarity between users a and u is one minus the Hellinger distance between their rating distributions:

    \mathrm{sim}(a, u) = 1 - \sqrt{\frac{1}{2} \sum_{r=1}^{10} \left( \sqrt{\frac{|Q_{a,r}|}{|Q_a|}} - \sqrt{\frac{|Q_{u,r}|}{|Q_u|}} \right)^2} \in [0, 1]    (5)

This similarity measure accounts for the relative positivity or negativity of the users. For instance, if one user mostly gives low ratings and another mostly high ratings, they are considered dissimilar. For the similarity measure to be meaningful, we may need a sufficiently large sample of ratings for the two users, so that it accurately represents the overall rating distribution. However, no textual analysis is required to calculate this measure, and thus its computation is faster than that of text-based models (Section 5.2).
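A sketch of the AIR measure, assuming each user's ratings are available as a list of integers in 1–10 (the helper name is ours):

```python
import math
from collections import Counter

def air_similarity(ratings_a, ratings_u, scale=range(1, 11)):
    """Equation 5: one minus the Hellinger distance between the two users'
    rating distributions; returns a value in [0, 1]."""
    counts_a, counts_u = Counter(ratings_a), Counter(ratings_u)
    total = 0.0
    for r in scale:
        pa = counts_a[r] / len(ratings_a)
        pu = counts_u[r] / len(ratings_u)
        total += (math.sqrt(pa) - math.sqrt(pu)) ** 2
    return 1.0 - math.sqrt(0.5 * total)

print(air_similarity([8, 9, 10, 9], [1, 2, 3, 2]))  # disjoint rating distributions -> 0.0
print(air_similarity([8, 9, 10, 9], [9, 8, 9, 10]))  # identical distributions -> 1.0
```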
5.1.2 Co-reviewed Ratings (CRR)
Basing user similarity on co-reviewed item ratings is common in CF, and many variations have been explored [1]. We experimented with two functions for pairwise comparison of rating vectors: the Pearson correlation coefficient (Equation 6) and cosine similarity (Equation 7), with the former giving superior results (Appendix A.4):

    \mathrm{sim}(a, u) = \frac{\sum_{i \in I_{a,u}} (r_{a,i} - \bar{r}_a)(r_{u,i} - \bar{r}_u)}{\sqrt{\sum_{i \in I_{a,u}} (r_{a,i} - \bar{r}_a)^2} \sqrt{\sum_{i \in I_{a,u}} (r_{u,i} - \bar{r}_u)^2}} \in [-1, 1]    (6)

    \mathrm{sim}(a, u) = \frac{\sum_{i \in I_{a,u}} r_{a,i} r_{u,i}}{\sqrt{\sum_{i \in I_{a,u}} r_{a,i}^2} \sqrt{\sum_{i \in I_{a,u}} r_{u,i}^2}} \in [0, 1]    (7)

where I_x is the set of items reviewed by user x, and I_{a,u} = I_a ∩ I_u is the set of items co-reviewed by users a and u. User x's rating for item i is r_{x,i}, and \bar{r}_x denotes the mean of user x's ratings for items in I_{a,u}.
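The two CRR variants can be sketched as below, assuming the co-reviewed ratings have already been aligned into two equal-length lists (the helper names are ours):

```python
import math

def crr_pearson(ra, ru):
    """Equation 6: Pearson correlation of the two users' ratings over their
    co-reviewed items (ra[i] and ru[i] refer to the same item)."""
    mean_a, mean_u = sum(ra) / len(ra), sum(ru) / len(ru)
    num = sum((x - mean_a) * (y - mean_u) for x, y in zip(ra, ru))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in ra)) * \
          math.sqrt(sum((y - mean_u) ** 2 for y in ru))
    return num / den if den else 0.0

def crr_cosine(ra, ru):
    """Equation 7: cosine similarity of the same rating vectors."""
    num = sum(x * y for x, y in zip(ra, ru))
    den = math.sqrt(sum(x * x for x in ra)) * math.sqrt(sum(y * y for y in ru))
    return num / den if den else 0.0

print(crr_pearson([8, 6, 9, 3], [7, 5, 9, 4]))  # strongly correlated co-reviewed ratings
print(crr_cosine([8, 6, 9, 3], [7, 5, 9, 4]))
```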
5.2 Text-based Models

5.2.1 All Item Terms (AIT)
We employ the Jaccard coefficient of the sets of terms used in two documents to measure the similarity between them:

    J(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|} \in [0, 1]    (8)

where T(d) is the set of terms that appear in document d. We chose the Jaccard coefficient, rather than cosine similarity of tf-idf vectors, because our experiments showed that the former gives more informative scores than the latter (Appendix A.5). This is in line with the results reported by Pang et al. [14], who found that unigram presence performs better than frequency when classifying textual polarity. Equation 8 can be modified to consider only certain types of terms. For example, instead of all the terms, T(d) may include only adjectives (following Pang et al. [14], who found that adjectives are related to textual polarity). We experimented with several definitions for T(d). Our results show that the performance of our methods depends on the number of documents used for calculating this similarity measure (Appendix A.6).

We define the AIT similarity between two users a and u as the Jaccard coefficient of the documents (not necessarily reviews) written by these users:

    \mathrm{sim}(a, u) = J(d_a, d_u) \in [0, 1]    (9)

where d_x is the concatenation of the documents written by user x.
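A sketch of the Jaccard-based AIT measure follows. The crude tokeniser is a stand-in for the part-of-speech-filtered term extraction described above, and the function names are ours.

```python
import re

def terms(document, pattern=r"[a-z]+"):
    """Stand-in for T(d): the set of lowercased word tokens in a document.
    In the paper, T(d) is restricted to particular word classes (e.g. nouns)."""
    return set(re.findall(pattern, document.lower()))

def jaccard(d1, d2):
    """Equation 8: Jaccard coefficient of the two documents' term sets."""
    t1, t2 = terms(d1), terms(d2)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def ait_similarity(docs_a, docs_u):
    """Equation 9: Jaccard coefficient of the users' concatenated documents."""
    return jaccard(" ".join(docs_a), " ".join(docs_u))

print(ait_similarity(["great acting, weak plot"], ["the acting was great"]))  # 2 shared / 6 total terms
```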
5.2.2 Co-reviewed Terms (CRT)
We define similarity between users using the Jaccard coefficient (Equation 8) to measure the similarity between the users' reviews of co-reviewed items:

    \mathrm{sim}(a, u) = \frac{1}{|I_{a,u}|} \sum_{i \in I_{a,u}} J(q_{u,i}, q_{a,i}) \in [0, 1]    (10)
where q_{x,i} is user x's review of item i.

5.2.3 All Item PSPs (AIP)
Positive Sentence Percentage (PSP) was defined by Pang and Lee [12] as the percentage of positive sentences out of the subjective sentences in a review. To detect the subjective sentences, they used the method described in [11], and to find the positive sentences, they trained a Naive Bayes (NB) classifier on the Sentence Polarity Dataset (v1.0). When used to model review similarity, PSP outperformed term-based methods for multi-way polarity classification [12].

Here we introduce a user similarity measure based on PSP. This measure replaces ratings with PSPs, thereby obviating the need for explicit ratings. In contrast with Pang and Lee, we define PSP as the percentage of positive sentences among all the sentences in a document (rather than just subjective sentences). This generalises the PSP definition to include any type of text, such as message board posts. Additionally, we use a Support Vector Machine (SVM) trained on the Sentence Polarity Dataset, because the classification accuracy of the SVM implementation we use was found to be higher than that of the NB classifier (running a 10-fold cross validation on this dataset).

AIP is defined in a similar way to AIR (Section 5.1.1), as one minus the Hellinger distance between the PSP distributions for users a and u:

    \mathrm{sim}(a, u) = 1 - \sqrt{\frac{1}{2} \sum_{l=1}^{L} \left( \sqrt{p_{a,l}} - \sqrt{p_{u,l}} \right)^2} \in [0, 1]    (11)

where L is a discretisation factor (determined experimentally – see Appendix A.7), and p_{x,l} is defined as follows:

    p_{x,l} = \frac{|\{q \in Q_x : l - 1 \le L \times \mathrm{psp}(q) < l\}|}{|Q_x|}, \quad l \in \{1, 2, ..., L\}    (12)

where Q_x is the set of user x's reviews, and psp(q) is the PSP of review q. The last component p_{x,L} is calculated using L × psp(q) ≤ L (instead of < L) to include reviews with psp(q) = 1.
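The AIP measure can be sketched as follows, assuming psp values in [0, 1] have already been computed for each user's documents by a sentence-level polarity classifier (the function names are ours):

```python
import math

def psp_histogram(psps, L=100):
    """Equation 12: discretise a user's PSP values into L equal-width bins.
    Values with psp = 1 fall into the last bin."""
    counts = [0] * L
    for p in psps:
        counts[min(int(L * p), L - 1)] += 1
    return [c / len(psps) for c in counts]

def aip_similarity(psps_a, psps_u, L=100):
    """Equation 11: one minus the Hellinger distance between the PSP histograms."""
    pa, pu = psp_histogram(psps_a, L), psp_histogram(psps_u, L)
    total = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(pa, pu))
    return 1.0 - math.sqrt(0.5 * total)

print(aip_similarity([0.8, 0.9, 0.75], [0.2, 0.1, 0.3]))   # non-overlapping histograms -> 0.0
print(aip_similarity([0.8, 0.9, 0.75], [0.8, 0.9, 0.75]))  # identical histograms -> 1.0
```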
5.2.4 Co-reviewed PSPs (CRP)
Like CRR (Section 5.1.2), this model is based on co-reviewed items, but it does not require explicit ratings. Instead of ratings it uses the Pearson correlation coefficient of PSPs, yielding the following similarity measure:

    \mathrm{sim}(a, u) = \frac{\sum_{i \in I_{a,u}} (\mathrm{psp}(q_{a,i}) - \overline{\mathrm{psp}}_a)(\mathrm{psp}(q_{u,i}) - \overline{\mathrm{psp}}_u)}{\sqrt{\sum_{i \in I_{a,u}} (\mathrm{psp}(q_{a,i}) - \overline{\mathrm{psp}}_a)^2} \sqrt{\sum_{i \in I_{a,u}} (\mathrm{psp}(q_{u,i}) - \overline{\mathrm{psp}}_u)^2}} \in [-1, 1]    (13)

where \overline{\mathrm{psp}}_x denotes the mean of user x's PSPs for items in I_{a,u}. In a similar way to CRR, we can utilise the cosine similarity of PSPs (however, the experiments in Appendix A.4 show that the Pearson correlation coefficient yields better results):

    \mathrm{sim}(a, u) = \frac{\sum_{i \in I_{a,u}} \mathrm{psp}(q_{a,i}) \mathrm{psp}(q_{u,i})}{\sqrt{\sum_{i \in I_{a,u}} \mathrm{psp}(q_{a,i})^2} \sqrt{\sum_{i \in I_{a,u}} \mathrm{psp}(q_{u,i})^2}} \in [0, 1]    (14)

6 Evaluation
In this section, we evaluate the methods and models introduced in Sections 4 and 5 by running experiments on the IMDb62 dataset (Section 3). In all the experiments, we perform leave-one-out cross validation, training on at most 61 users and testing on the remaining user. We report the Mean Absolute Error (MAE) across all 62 users:

    \mathrm{MAE} = \frac{1}{|Q|} \sum_{q \in Q} |r_q - \hat{r}_q|    (15)

where Q is the set of reviews to classify, r_q is the actual rating of review q, and \hat{r}_q is the rating inferred by the classifier. Note that our methods return integer ratings, not star fractions. That is, on a 10-star scale the only possible values are 1, 2, ..., 10, as it was found that returning integers reduces the MAE (Section 6.1). We chose the MAE measure because the ratings for the reviews in the dataset are given on an ordinal 10-point scale, and using MAE (rather than classification accuracy) gives different weights to different classification errors: misclassifying a 10-star review as a 1-star review is different from misclassifying it as a 9-star review. Statistically significant differences in MAE are reported when p < 0.05 according to a paired two-tailed t-test.

In our experiments we used machine learning algorithms as implemented in Weka 3.6.0 (www.cs.waikato.ac.nz/ml/weka); see [8] for Naive Bayes, [6, 9, 15] for Support Vector Machines, and [18, 19] for Support Vector Regression. Default settings were used for every algorithm (using different settings did not improve performance in our experiments, which is consistent with the results reported in [12, 14]). The features used by the learning algorithms are unigrams extracted from the review texts. Even though we used different implementations from those used by Pang and Lee [12], we obtained comparable results on their dataset. Two of the text-based methods, AIT (Section 5.2.1) and CRT (Section 5.2.2), require part-of-speech tagging, which was done using OpenNLP 1.4.3 (opennlp.sourceforge.net) with the default English language models.
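For reference, a minimal sketch of the evaluation metric (Equation 15), with the classifier output rounded to the nearest integer star rating as in our experiments (the function name is ours):

```python
def mean_absolute_error(true_ratings, predicted_ratings):
    """Equation 15: mean absolute difference between actual and inferred ratings."""
    assert len(true_ratings) == len(predicted_ratings)
    return sum(abs(r - round(p))
               for r, p in zip(true_ratings, predicted_ratings)) / len(true_ratings)

print(mean_absolute_error([10, 1, 7], [8.6, 2.2, 7.4]))  # |10-9| + |1-2| + |7-7| = 2 -> MAE 0.67
```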
6.1 SCSU Experiments
In Section 4, we hypothesised that the performance of SCSU will be suboptimal when the number of training reviews is small (recall that in SCSU only the target user's reviews are used for training). To test this hypothesis, we ran SCSU separately on each user in the IMDb62 dataset with different numbers of training reviews. For each training set size, the set was sampled uniformly without replacement from the user's reviews. The classifier was trained on this set and tested on the remaining reviews by the user. This was repeated 50 times for each set size. We experimented with Support Vector Regression (SVR), Naive Bayes (NB), and Support Vector Machines (SVMs) in a one-versus-one setup (OVO).

Figure 1 shows the results of this experiment (all the differences are statistically significant). As expected, the overall mean MAE (across all 62 users) decreases as the number of reviews used for training increases, independently of the learning algorithm. The best performing algorithm out of those we tested was SVR, and thus it was used for subsequent experiments. Note that NB and OVO are classifiers, while SVR is a regressor. We therefore compared the results obtained when SVR's output is rounded with those obtained when it is left unrounded. As Figure 1 shows, we found that it is preferable to discretise the output by rounding, and thus we report rounded results in subsequent experiments.

In Section 4, we also hypothesised that SCSU will perform best on reviews by the user on whom it was trained (i.e., when the training user and the target user are the same person). We verified this hypothesis by training an SCSU classifier on all 1000 reviews by each user, and then testing it on the remaining 61000 reviews by the other users. As expected, this resulted in poor performance with an overall MAE of 1.86 (for comparison, the MAE achieved when training and testing on the same user, with only 5 reviews available, is 1.65).
6.2 SCMU versus MCMU
In this section, we compare SCMU to MCMU (Section 4), using Equation 1 with equal weights for all the training user classifiers (denoted EQW), i.e., sim(a, u) = 1 for all users a and u. This is done for different numbers of training reviews. An equal number of reviews is sampled uniformly for each training user and is used to train the classifiers. For example, if the number of reviews for each of the 61 training users is 50: for SCMU, the classifier is trained on 3050 (= 61×50) reviews and tested on the target user's 1000 reviews; for MCMU, each training user classifier is trained on 50 reviews and then the ensemble is tested on the target user's reviews.

[Figure 1: SCSU and the Number of Training Reviews Experiment Results – overall MAE versus the number of training reviews for SVR (rounded), SVR (unrounded), OVO and NB]

Figure 2 shows the overall MAE across all target users. The results are incomplete for SCMU due to its exponential increase in runtime as the number of reviews per training user increases. As shown in Figure 3, the runtime (including both training and testing) of SCMU with 450 training reviews per user (27,450 training reviews in total) is about 144 hours. In contrast, the runtime for MCMU remains under an hour even for 1000 reviews per user. Even though we could not run all the experiments, Figure 2 shows that SCMU's error grows when reviews are added. This verifies our hypothesis from Section 4 that SCMU loses its ability to generalise if trained on many reviews. In comparison to SCMU's poor performance, splitting the reviews by user proves to be beneficial when there is a sufficient number of reviews per training user. The error decreases gradually as reviews are added.

This experiment shows that using MCMU enables learning from many examples when there are enough reviews per training user, and significantly reduces classification error as well as training time. It is worth noting that MCMU does not require a large number of prolific users to achieve acceptable performance. In fact, we found that the reduction in overall MAE when increasing the number of training users from 20 to 61 is very gradual and amounts only to 0.01 (Section 6.6).
6.3 MCMU Normalisation Equation
In this section, we compare the two equations suggested in Section 4 for normalising the outputs of the user classifiers in MCMU: Equation 1 (weighted average) and Equation 2 (z-score normalisation).
[Figure 2: SCMU vs. MCMU – MAE (overall MAE versus the number of reviews per training user for SCMU and MCMU-EQW)]

[Figure 3: SCMU vs. MCMU – Runtime (time per user in hours versus the number of reviews per training user for SCMU and MCMU-EQW)]

[Figure 4: MCMU Normalisation Equation Comparison – Test User Review Numbers (overall MAE versus the number of labeled target user reviews for EQW-Average and EQW-Z Score)]

[Figure 5: MCMU Normalisation Equation Comparison – Training Users' Review Numbers (overall MAE versus the number of reviews per training user for EQW-Average and EQW-Z Score)]
We assign equal weights to all the user classifiers, so Equation 1 becomes an unweighted average of the classifiers' outputs. Equation 2 is affected by the availability of labeled reviews by the target user, while Equation 1 does not take the target user into account. We therefore consider two separate cases: (1) no labeled reviews by the target user are available, and thus the mean and standard deviation of the target user are estimated as the mean and standard deviation of the training users (so both equations do not use any information about the target user); and (2) a certain number of labeled reviews by the target user is available (Equation 2 can then utilise the extra information about the target user).

We split the target user's reviews into two sets by sampling uniformly without replacement, to achieve the same split as for SCSU (Section 6.1): the first set, for which the ratings are known, is used for calculating the mean and standard deviation of the target user's ratings; and the second set, with unknown ratings, is used to test the classifier. We consider different set sizes of labeled reviews by the target user (5, 10, 20, 50, 100, 150, 200, 400, 600, 800, 950), and repeat this process 50 times for each set size. The results of this experiment are presented in Figure 4 (all the differences are statistically significant).

As the figure shows, using Equation 2 yields better results than using Equation 1, even when only a few or no labeled target user reviews are available. In addition, the MAE obtained using Equation 2 reaches a minimum when 400 reviews are available and does not continue to improve beyond this point. This may be because the mean and standard deviation of the target user's ratings are virtually the same for sample sizes of 400 or more, and thus methods that utilise the extra data from the review texts (such as SCSU) should be used beyond this point.

The point where the MAE yielded by Equation 2 is closest to the MAE yielded by Equation 1 is where no labeled reviews by the target user are available. Therefore, we ran an experiment to compare the performance yielded by both equations when a variable number of training user reviews is available (but no labeled reviews by the target user are available). This experiment is the same as the MCMU experiment described in Section 6.2, and its results are presented in Figure 5. As the figure shows, Equation 2 outperforms Equation 1 in most cases (for 30, 40, 100, 150, and 300 or more reviews – all the differences are statistically significant). Therefore, we used Equation 2 in all subsequent experiments.
6.4 Similarity Modeling Experiment
In this section, we describe the evaluation of the similarity measures introduced in Section 5. We use these measures to give different weights to the training users, and compare the resulting MAEs to those obtained using MCMU with equal weights (EQW) and SCSU. To calculate the similarity between a target user and the training users, we need a certain amount of reviews by the target user. Therefore, we use uniform sampling without replacement to obtain the same sets of labeled target user reviews as in Sections 6.1 and 6.3. We use the labeled reviews for similarity calculation and classify the unlabeled reviews.

We experiment with training user selection by setting a threshold for user similarity as specified in Equation 4 (if thresholding filters out all the users, then all classifiers are given the same weight). The optimal threshold is expected to be user-specific, and thus it is learned separately for each target user from their labeled reviews. This is done as follows. We vary the threshold over the interval [0, 1] (this removes negatively correlated users from the set of neighbours for CRR (Section 5.1.2) and CRP (Section 5.2.4); the other similarity models only produce values within this range), and classify the labeled reviews of the target user. The threshold yielding the lowest MAE on the labeled reviews is used for classifying the unlabeled reviews of the target user. Using the similarity threshold yields a lower MAE than when no threshold is applied (Appendix A.2). In addition to the similarity threshold, we set a threshold on the size of the set of co-reviewed items for the similarity measures that depend on co-reviewed items (i.e., both thresholds are applied). This threshold is set dynamically for each target user in a similar manner to the similarity threshold. Applying the set size threshold also reduces the MAE (Appendix A.3).

[Figure 6: Similarity Modeling Experiment Results – overall MAE versus the number of labeled target user reviews for AIR, CRR, AIP, CRP, AIT, CRT, EQW and SCSU]

Figure 6 displays the results of this experiment. All the differences are statistically significant except for: AIR vs. CRR for 150 labeled reviews by the target user; CRT vs. CRR for 200 reviews; AIT vs. CRT, AIP vs. AIR and CRP vs. CRR for 800 reviews; and AIR vs. AIP and CRP vs. CRR for 950 reviews. The discretisation factor L for AIP (Equation 12) is set to 100, and T(d) consists of all nouns for AIT, and all unigrams for CRT (Equation 8). These options yielded the lowest MAE from those we tested, for up to 100 labeled target user reviews (see Appendix A.6 for T(d) and Appendix A.7 for L).

As seen in Figure 6, EQW – obtained by assigning equal weights to the training models in MCMU – outperforms SCSU when up to 100 reviews are available. Modeling user similarity enables us to decrease the error even further. Moreover, the best similarity measure (AIT) outperforms EQW for every number of labeled target user reviews, and performs better than SCSU for up to 200 reviews. This is an encouraging result, since in general users submit a relatively small number of reviews (e.g., as seen in www.imdb.com/title/tt0068646/usercomments in January 2010, 90% of the 1420 reviews for the movie "The Godfather" were submitted by users with less than 200 reviews). Our methods reach optimal performance for 200–600 labeled reviews, at which point the MPC system should switch to SCSU; in the future, we will train the system to decide when to switch from MCMU to SCSU.

Another important result is that similarity measures based on co-reviewed items generally require many labeled reviews by the target user to achieve comparable performance to that of the measures which are based on all items (150 for CRR vs. AIR, and 950 for CRP vs. AIP and CRT vs. AIT). This may be attributed to the sparsity of our dataset, which results in small sets of co-reviewed items. Note that the inferior performance of the co-reviewed items measures is not due to a fall back to equal weights, as the results for these measures differ from the results obtained by EQW for almost all target users when more than 5 labeled reviews are available. In this case, equal weights are used only for about 20% of the target users.

Table 3: Message Board Posts Similarity Experiment Results
Similarity Measure      Optimal Threshold    Optimal Threshold MAE
EQW                     —                    1.51
AIT (all unigrams)      0.02                 1.50
AIP                     0.34                 1.49
6.5 Similarity Modeling without Reviews or Ratings
One advantage of modeling user similarity based on texts is that no explicit ratings are required. Texts are generally easier to obtain than ratings, since users commonly communicate textually, e.g., using emails, instant messaging or message boards. Our dataset of IMDb movie reviews also includes message board posts (Section 3). In this section, we consider the case where no target user ratings or reviews are available, and thus we model user similarity based only on message board posts. In this case, the only relevant similarity models are AIT and AIP (Section 5). Since we have no labeled reviews by the target user, we cannot set the similarity threshold dynamically for each user. Thus, we set a global threshold for all users.

The results of this experiment are displayed in Table 3 (all the differences are statistically significant). Modeling user similarity based on message board posts yields a lower MAE than EQW, but the margin is smaller than when labeled reviews by the target user are available (as seen in Figure 6, AIT's lowest MAE is 1.28 for 200 reviews, while for EQW it is 1.36). One reason for the smaller improvement in MAE is that there are 11 users with no message board posts, in which case their similarity to other users is 0. Another possible reason is that we use a global similarity threshold. If we take the mean of the MAEs obtained using the optimal threshold for each individual target user, we get an MAE of 1.45 for AIT (all unigrams) and 1.32 for AIP. This shows that modeling user similarity based on message board posts can be beneficial when no other information is available (e.g., for a completely new user).

AIP outperforms AIT in this experiment, unlike most cases in the experiment from Section 6.4. However, as seen in Figure 6, AIP yielded the lowest MAE for 5 labeled reviews by the target user. The combined results of both experiments indicate that AIP is preferable when little information is available about the target user, while AIT performs better when more information is available. The reason for this might be that AIT compares the vocabulary of users and thus requires a relatively large number of documents to produce reliable results, while AIP compares users' positivity (in the form of PSP distribution), which may be accurately represented by fewer documents.
[Figure 7: Training Users Experiment Results – overall MAE versus the number of labeled target user reviews for AIT and EQW with 5, 10, 20 and 61 training users, and for SCSU]
6.6 Learning from a Few Training Users
In some scenarios, we may not have many prolific users to train on. Thus, we evaluated the performance of our methods for different numbers of training users. This was done by uniformly sampling without replacement a subset of the training users, and then running the experiment from Section 6.4. This procedure was repeated 10 times for each set size (we experimented with 5, 10, 20, 30, 40 and 50 training users). We experimented with the best performing methods from Section 6.4, one for each cell in Table 2: AIR, CRR, AIT and CRT. The resulting MAE is compared to the MAE yielded by SCSU and EQW.

Figure 7 shows the results of this experiment for AIT and EQW with 5, 10, 20 and 61 training users (labeled AIT 5 and EQW 5, etc.). All the differences are statistically significant except for EQW vs. AIT for 5 training users and 10 labeled reviews by the target user. The trend for AIR, CRR and CRT resembles that of AIT, and the MAEs for 30, 40 and 50 training users fall between the MAEs for 20 and 61 training users. As the figure shows, AIT outperforms EQW even when a few training users are considered. The difference between EQW and AIT increases as more training users are added, up to the point where all 61 users are considered. This is not surprising, as our thresholding mechanism selects the best training users for each target user, hence it is expected to perform better when a larger selection of users is available.

Another trend that is demonstrated in Figure 7 is that the performance of AIT improves compared to SCSU as the number of training users increases. Further, when up to 50 labeled reviews are available for the target user, AIT outperforms SCSU for 5 or more training users; for 100 labeled reviews, 10 or more training users are required for AIT to outperform SCSU; and for 200 labeled reviews, all 61 prolific users are needed to match SCSU's performance. These results show that our collaborative approach outperforms the content-based approach in common situations where target users have not submitted many reviews, and only a few prolific training users are available.
7 Discussion and Future Work
The results presented in Section 6 clearly demonstrate the merits of our collaborative approach to polarity classification. As shown in Section 6.1, the content-based approach of training on a single user (SCSU) yields a high MAE in the common scenario where this user has not submitted enough reviews. Additionally, the experiments in Section 6.2 have shown that extending the single classifier approach to train on reviews from multiple users is infeasible. By contrast, even the simple switch to a user-based ensemble of classifiers results in a reduced MAE and runtime (Section 6.2), and modeling user similarity decreases the MAE even further (Section 6.4).

While earlier work showed that polarity classification is possible [12, 14, 21], the question of scaling the suggested methods to datasets consisting of millions of texts has not been tackled to date. This question becomes more important and urgent as the amount of available data increases. Our user-based approach, which addresses this problem, can be employed in many situations, as text authors are usually known. This includes other domains, such as product reviews, and other sentiment analysis tasks, such as subjectivity detection and perspective identification [13].

The similarity modeling experiment (Section 6.4) also demonstrated that modeling user similarity based on all the users' texts or ratings is beneficial. Like many real-life datasets, our dataset is very sparse [1]. Thus, our collaborative approach and similarity models are likely to apply to other datasets as well, so long as some prolific users are available. This result is reinforced in Section 6.6, where our approach improved on the SCSU baseline even when only a few prolific training users were considered. In the future we also plan to test our similarity models when different numbers of reviews per training user are available (i.e., for non-prolific training users) – preliminary results show that our approach yields good results in this case.

As shown in Section 6.5, our text-based models of user similarity can be applied to more general forms of texts than rated reviews. This allows us to measure similarity based on activities that are more commonly performed than reviewing movies. We envision a system where the users can benefit from personalisation without having to go through the tedious process of rating or reviewing movies. This requires further testing on various types of texts, but our results show that this is a promising direction.

As mentioned in Section 2, our approach to polarity classification has much in common with CF. Thus, we conjecture that text-based user similarity can also be applied to CF systems – an area where similarity is traditionally calculated based solely on ratings. The fact that our text-based measures outperformed the rating-based measures in most cases (Section 6.4) indicates that using text-based similarity in CF can result in improved performance, especially for sparse datasets. We therefore plan to investigate additional similarity measures, and apply our similarity models to CF systems.
8 Conclusion
In this report, we presented a novel approach to multi-way polarity classification that considers the users who wrote the texts. This approach is more scalable than existing approaches, yields better classification performance, and is applicable in a variety of real-life situations. Our main motivation for considering users in sentiment analysis was to further our understanding of the relation between users and the sentiments expressed in the texts they write. Our results show that this relation indeed exists, and thus we intend to continue exploring it in the future and to apply it to other fields, such as recommender systems.
References

[1] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] Sajib Dasgupta and Vincent Ng. Mine the easy, classify the hard: A semi-supervised approach to automatic sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL), pages 701–709, Singapore, 2009.
[3] Michael Gamon. Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 841–847, Geneva, Switzerland, 2004.
[4] Andrew B. Goldberg and Jerry Zhu. Seeing stars when there aren't many stars: Graph-based semi-supervised learning for sentiment categorization. In TextGraphs: HLT/NAACL Workshop on Graph-based Algorithms for Natural Language Processing, pages 45–52, New York, NY, 2006.
[5] Stephan Greene and Philip Resnik. More than words: Syntactic packaging and implicit sentiment. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT/NAACL), pages 503–511, Boulder, CO, 2009.
[6] Trevor Hastie and Robert Tibshirani. Classification by pairwise coupling. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS), pages 507–513, Denver, CO, 1997.
[7] Jonathan L. Herlocker, Joseph A. Konstan, Al Borchers, and John Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 230–237, Berkeley, CA, 1999.
[8] George H. John and Pat Langley. Estimating continuous distributions in Bayesian classifiers. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 338–345, Montreal, Quebec, 1995.
[9] S. Sathiya Keerthi, Shirish Shevade, Chiranjib Bhattacharyya, and K. R. Krishna Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
[10] Wei-Hao Lin, Theresa Wilson, Janyce Wiebe, and Alexander Hauptmann. Which side are you on? Identifying perspectives at the document and sentence levels. In Proceedings of the Conference on Natural Language Learning (CoNLL), pages 109–116, New York, NY, 2006.
[11] Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Association for Computational Linguistics (ACL), pages 271–278, Barcelona, Spain, 2004.
[12] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the Association for Computational Linguistics (ACL), pages 115–124, Ann Arbor, MI, 2005.
[13] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2):1–135, 2008.
[14] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, Philadelphia, PA, 2002.
[15] John C. Platt. Fast training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 185–208. MIT Press, 1998.
[16] Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. Expanding domain sentiment lexicon through double propagation. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1199–1204, Pasadena, CA, 2009.
[17] Paul Resnick and Hal R. Varian. Recommender systems. Communications of the ACM, 40(3):56–58, 1997.
[18] Shirish Shevade, S. Sathiya Keerthi, Chiranjib Bhattacharyya, and K. R. Krishna Murthy. Improvements to SMO algorithm for SVM regression. Technical report, National University of Singapore, 1999.
[19] Alex J. Smola and Bernhard Schoelkopf. A tutorial on support vector regression. Technical report, NeuroCOLT Technical Report Series, 1998.
[20] Benjamin Snyder and Regina Barzilay. Multiple aspect ranking using the Good Grief algorithm. In Proceedings of the Joint Human Language Technology/North American Chapter of the ACL Conference (HLT/NAACL), pages 300–307, Rochester, NY, 2007.
[21] Peter Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the Association for Computational Linguistics (ACL), pages 417–424, Philadelphia, PA, 2002.
[22] Xiaojun Wan. Co-training for cross-lingual sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL), pages 235–243, Singapore, 2009.
Appendix A Similarity Parameters
In this section we present the results of the experiments to determine the parameters of the similarity models and the MCMU classifier. All the experiments are based on the user similarity experiment from Section 6.4. In some cases, only a part of the experiment was performed due to the infeasibility of performing the full experiment for every possible set of parameters. Throughout this section, the lowest MAEs appear in boldface when the differences are statistically significant.
Appendix A.1 Similarity to Weight Function
As mentioned in Section 5, we used f(x) = e^x to ensure the weights of the training users' classifiers are non-negative. Another alternative is using f(x) = x + 1, as all the similarity models produce scores in the range [−1, 1]. We compared both functions by running a part of the experiment described in Section 6.4, using 100 labeled target user reviews and no thresholds. The results are given in Table 4, and they show that using f(x) = e^x is preferable in most cases, probably because it gives more weight to similar users.
Appendix A.2 Similarity Threshold
In this section we compare using no threshold to using only a similarity threshold that is learned dynamically for every target user, as described in Section 6.4. As in Appendix A.1, we tested for 100 labeled target user reviews. The results of this experiment are presented in Table 5 and show that using this threshold consistently improves performance, as expected.
Appendix A.3 Set Size Threshold
In a similar manner to the similarity threshold experiment (Appendix A.2), we tested the effect of applying the set size threshold to similarity measures that are based on co-reviewed items. The results of this experiment are presented in Table 6 and show that using this threshold improves performance.

Table 4: Similarity to Weight Function Experiment Results
Similarity Measure    MAE with f(x) = e^x    MAE with f(x) = x + 1
AIR                   1.3622                 1.3640
CRR                   1.3649                 1.3650
AIT                   1.3658                 1.3660
CRT                   1.3668                 1.3668
AIP                   1.3649                 1.3658
CRP                   1.3683                 1.3673

Table 5: Similarity Threshold Experiment Results
Similarity Measure    MAE without Threshold    MAE with Threshold
AIR                   1.3622                   1.3266
CRR                   1.3649                   1.3503
AIT                   1.3658                   1.2908
CRT                   1.3668                   1.3462
AIP                   1.3649                   1.3150
CRP                   1.3683                   1.3597

Table 6: Set Size Threshold Experiment Results
Similarity Measure    MAE without Threshold    MAE with Threshold
CRR                   1.3649                   1.3322
CRT                   1.3668                   1.3317
CRP                   1.3683                   1.3317
Appendix A.4 Pearson Correlation Coefficient versus Cosine Similarity
We compared using the Pearson correlation coefficient to using cosine similarity for CRR (Section 5.1.2) and CRP (Section 5.2.4) by running the experiment from Section 6.4. The results are presented in Table 7. We found that the differences are usually very small, and therefore used the Pearson correlation coefficient, as it is preferable in most cases.
Appendix A.5 Jaccard versus Tf-idf
We compared using the Jaccard coefficient to using cosine similarity of tf-idf vectors for AIT (Section 5.2.1), with T(d) defined as all unigrams, by running the experiment from Section 6.4. The results are presented in Table 8 and show that Jaccard consistently outperforms tf-idf. As mentioned in Section 5.2.1, this is in line with the results reported by Pang et al. [14], who found that unigram presence performs better than frequency when classifying textual polarity.
Appendix A.6 Term Similarity Definition
We compared different definitions of T(d) for AIT (Section 5.2.1), by running the experiment from Section 6.4. We experimented with: all unigrams; all unigrams without stopwords; the open word classes – adverbs, adjectives, nouns and verbs; and a combination of the two leading open word classes, adjectives and nouns. The results are presented in Table 9 (when more than one number per row is in boldface, the differences between the lowest MAEs were not statistically significant).
Table 7: Pearson Correlation Coefficient versus Cosine Similarity Experiment Results
Target User Reviews    CRR – Cosine    CRR – Pearson    CRP – Cosine    CRP – Pearson
5                      1.5142          1.5146           1.5139          1.5129
10                     1.4368          1.4403           1.4381          1.4414
20                     1.3952          1.3992           1.3979          1.4030
50                     1.3562          1.3550           1.3567          1.3564
100                    1.3318          1.3300           1.3319          1.3325
150                    1.3247          1.3199           1.3238          1.3240
200                    1.3185          1.3128           1.3182          1.3152
400                    1.3059          1.2995           1.3071          1.3026
600                    1.3057          1.2982           1.3070          1.3006
800                    1.3107          1.3040           1.3127          1.3048
950                    1.3099          1.3032           1.3090          1.3011
Table 8: Jaccard versus Tf-idf Experiment Results
Target User Reviews    AIT – Tf-idf    AIT – Jaccard
5                      1.5007          1.4967
10                     1.4161          1.4095
20                     1.3827          1.3652
50                     1.3480          1.3171
100                    1.3264          1.2914
150                    1.3195          1.2844
200                    1.3146          1.2787
400                    1.3051          1.2792
600                    1.3064          1.2867
800                    1.3148          1.2999
950                    1.3130          1.3032
We found that using nouns yields the best performance when 5–100 labeled reviews by the target user are available. This may seem surprising, as adjectives are traditionally thought of as carrying information about sentiments [14, 21]. However, adjectives perform best for 400–950 labeled reviews, which may indicate that the number of adjectives when fewer reviews are available is insufficient to accurately model user similarity. For this reason we used nouns for AIT in Section 6.4, as our focus is on target users with a small number of reviews.
Appendix A.7 PSP Discretisation Factor
To determine the discretisation factor L for AIP (Equation 12), we experimented with L values of 10, 20, 50, 100, 150, 200, 500 and 1000. The results were similar for all values of L, but slightly better for L = 100, and therefore this value was chosen.
Table 9: Term Similarity Definition Experiment Results
Target User Reviews    Adverbs    Adjectives    Nouns      Verbs      Adjectives + Nouns    Unigrams    No stopwords
5                      1.5123     1.5069        1.4886     1.5080     1.4931                1.4967      1.4977
10                     1.4259     1.4165        1.3931     1.4206     1.3979                1.4095      1.4020
20                     1.3883     1.3775        1.3535     1.3842     1.3581                1.3652      1.3611
50                     1.3434     1.3300        1.3123     1.3428     1.3143                1.3171      1.3147
100                    1.3103     1.2991        1.2908     1.3146     1.2910                1.2914      1.2911
150                    1.3015     1.2866        1.2844     1.3015     1.2838                1.2844      1.2837
200                    1.2963     1.2802        1.2807     1.2923     1.2792                1.2787      1.2793
400                    1.2893     1.2740        1.2803     1.2828     1.2793                1.2792      1.2796
600                    1.2941     1.2822        1.2895     1.2898     1.2880                1.2867      1.2878
800                    1.3043     1.2953        1.3014     1.3050     1.2995                1.2999      1.3006
950                    1.3069     1.2987        1.3039     1.3076     1.3041                1.3032      1.3049