Fast and Accurate - Improving Lexicon-Based Sentiment Classification with an Ensemble Methods

Łukasz Augustyniak, Piotr Szymański, Tomasz Kajdanowicz, and Przemysław Kazienko

Department of Computational Intelligence, Wroclaw University of Technology, Wroclaw, Poland
{lukasz.augustyniak,piotr.szymanski,tomasz.kajdanowicz,przemyslaw.kazienko}@pwr.edu.pl
Abstract. A lexicon-based ensemble approach to sentiment analysis that outperforms purely lexicon-based methods is presented in this article. The method consists of two steps. First, we employ our own method (called frequentiment) for automatic generation of sentiment lexicons, together with several publicly available lexicons. Second, ensemble classification is used to improve the overall accuracy of predictions. Our approach outperforms publicly available sentiment lexicons and automatically generated domain lexicons. We conduct a comprehensive analysis based on 10 Amazon review data sets consisting of 4,200,000 reviews.

Keywords: Sentiment analysis · Ensemble classification · Lexicon-based sentiment analysis

1 Introduction
Nowadays a lot of business takes place on the Internet, and everybody wants to know how recognizable their brand is. Sentiment analysis is a set of techniques that help detect emotions and opinions in social media data, and it may thus help in finding out how a brand is perceived on the Internet. Social media monitoring shows steady growth, hence a fast and accurate analysis of texts written on the web (i.e., opinions, reviews) is needed. In the past it was possible to read and annotate such opinions manually: having enough money, one would hire a group of human annotators and employ them to read the texts and use their intelligence and knowledge to complete the task. The growth of Internet usage implies many more opinions, and it is no longer possible to process them by hand. A need appeared for automatic processing and annotation of textual data. For this reason we are witnessing the growing popularity of sentiment analysis. This kind of analysis is part of the Digital Universe of data. IDC¹ projects that the Digital Universe will reach 40 zettabytes (ZB) by 2020, an amount that exceeds previous
¹ http://www.idc.com
forecasts by 5 ZB, a 50-fold growth from the beginning of 2010. To grasp the sheer volume, consider two comparisons. First, there are 700,500,000,000,000,000,000 (seven hundred quintillion five hundred quadrillion) grains of sand on all the beaches on Earth; 40 ZB is equal to 57 times that number. Second, in 2020, 40 ZB will amount to 5,247 GB per person worldwide. Hence, even the small part of this volume that is subject to natural language analysis is outstandingly huge. We need fast, accurate, and memory- and processor-efficient computation for sentiment analysis and similar processing. Lexicons [6,14] provide a memory-friendly, processor-friendly, and easy way to compute sentiment, but their accuracy is not satisfying. We propose an ensemble-based extension of lexicons that achieves better accuracy while the time and memory complexity stays at the level of plain lexicons. Sentiment analysis is used in many areas, e.g., predicting election outcomes [15], supplying organizations with information on their brands [6], summarizing products in reviews [3], building better recommendation systems [5], and even predicting the stock market [2]. In this paper, we compared several different lexicons and fusion classifiers and found, without much surprise, that a combination of various lexicons performs better than any individual sentiment lexicon. An ensemble benefits most from model variability and complementarity, thus having a diverse set of techniques is desirable. We used several different learners such as Decision Tree, Extra Tree Classifier, and AdaBoost. The usage of ensembles in such an approach has not appeared in the literature, hence we wanted to verify our method on big review data (the data are further described in Sect. 4.1).
2 Related Work
In this section we provide some examples of lexicon-based and ensemble-based approaches to opinion mining tasks.

2.1 Lexicon-Based Approach
Lexicon-based methods assume that the sentiment orientation of a document is related to the presence of certain words or phrases in it. A sentiment lexicon is a set of n-grams (one or more consecutive words) with a sentiment orientation assigned to each of them. The overall sentiment of a document is annotated using those features from the lexicon that are (or are not) present in the document. Sentiment lexicons are used in many sentiment classification tasks. Sentiment words are always divided into at least two classes according to their orientation: positive and negative. For instance, "good" or "great" are positive words, and "bad" or "catastrophic" are negative words. Sentiment words and their weights form the sentiment lexicon [8,11].
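To make this concrete, the following minimal sketch (our illustration, not the paper's implementation; the word lists are toy examples) shows how a lexicon-based classifier can annotate a document by counting matched positive and negative unigrams:

```python
# Minimal lexicon-based sentiment classifier (illustrative sketch;
# the word lists and the zero threshold are our assumptions).

POSITIVE = {"good", "great", "awesome", "wonderful"}
NEGATIVE = {"bad", "terrible", "awful", "catastrophic"}

def lexicon_sentiment(document: str) -> str:
    """Classify a document as positive/negative/neutral by counting lexicon hits."""
    tokens = document.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("a good and great product"))  # -> positive
print(lexicon_sentiment("simply catastrophic"))       # -> negative
```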
2.2 Ensemble Classification Approach
Ensemble techniques are widely used in the classification literature. An ensemble uses a variety of models whose predictions are taken as input to a new model that learns how to combine them into an overall prediction. Whitehead [16] describes ensemble learning as a technique that increases machine learning accuracy at the cost of increased computation time, so ensembles are best suited to domains where computational complexity is relatively unimportant compared to the best possible accuracy. Lin and Kolcz [9] used Logistic Regression classifiers learned from hashed byte 4-grams as features, where a 4-gram refers to four characters (not four words). They did not perform any preprocessing, not even word tokenization. Their ensembles were formed from different models obtained from different training sets, but with the same learning algorithm, the aforementioned Logistic Regression. Their results show that ensembles lead to more accurate classifiers. Another ensemble approach was presented by Rodriguez et al. [13], who used classifier ensembles for expressions rather than character n-grams. In that setting a sentiment orientation label (positive, negative, or neutral) is applied to a phrase or word within a tweet; importantly, this label does not necessarily match the sentiment of the entire tweet. Class imbalance and feature space sparsity are big issues in text classification problems, and Hassan et al. [7] addressed them by enriching the corpus with multiple additional datasets related to the sentiment classification task. The authors used a combination of the standard approach with unigrams and bigrams of simple words, part-of-speech tags, and other semantic features. None of the previous works used AdaBoost [4]. Moreover, using lexicon predictions as a feature space for a fusion classifier has not been addressed widely in the literature.
3 Ensembles of Lexicons
Our method consists of two steps:

1. Lexicon-based sentiment classification. We used publicly available lexicons and automatically generated lexicons based on the method presented in [1].
2. Ensemble classification (fusion classifier).

The main part of our proposed method is the lexicon ensemble approach. It consists of two stages: building the relevant input space for ensemble classification, and learning a fusion classifier based on that input space. This part of our method uses a variety of models (lexicons in this experiment) whose predictions are taken as input to a new model that learns how to combine them into an overall prediction. We built a sentiment polarity matrix S(L, D) using the predictions of the sentiment lexicons. The sentiment orientation was obtained for every document d ∈ D = {d_1, ..., d_n} and every lexicon l ∈ L = {l_1, ..., l_m}. We denote the sentiment polarity of a document d under lexicon l as s_l(d). The sentiment polarity matrix is defined as follows:
Fast and Accurate - Improving Lexicon-Based Sentiment Classification
111
$$
S(L, D) =
\begin{pmatrix}
s_{l_1}(d_1) & s_{l_1}(d_2) & \cdots & s_{l_1}(d_n) \\
s_{l_2}(d_1) & s_{l_2}(d_2) & \cdots & s_{l_2}(d_n) \\
\vdots & \vdots & \ddots & \vdots \\
s_{l_m}(d_1) & s_{l_m}(d_2) & \cdots & s_{l_m}(d_n)
\end{pmatrix}
\qquad (1)
$$
Afterwards, we used this feature space as input for the fusion classifier. We tried several classifiers, such as Decision Tree, Extra Tree Classifier, and AdaBoost. The experimental scenario is presented in Fig. 1.
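As an illustration of this fusion step, the sketch below builds the polarity matrix of Eq. (1) from toy lexicon callables and trains an AdaBoost fusion classifier with scikit-learn. The two lexicon functions, the toy documents, and all hyperparameters are our assumptions for the sketch, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

# Two toy "lexicon classifiers": callables mapping a document to -1/0/+1.
# They stand in for the real lexicons of Table 2 (our simplification).
def simplest(doc):
    return (1 if "good" in doc else 0) - (1 if "bad" in doc else 0)

def past_future(doc):
    padded = f" {doc} "
    return (1 if " is " in padded else 0) - (1 if " was " in padded else 0)

def build_polarity_matrix(lexicons, documents):
    """S[i, j] = polarity of document j under lexicon i, as in Eq. (1)."""
    return np.array([[lex(d) for d in documents] for lex in lexicons])

docs = ["this is good", "it was bad", "it is fine"]
labels = ["positive", "negative", "neutral"]

S = build_polarity_matrix([simplest, past_future], docs)
# The fusion classifier sees one row per document, i.e. S transposed.
fusion = AdaBoostClassifier().fit(S.T, labels)
print(fusion.predict(S.T))  # toy predictions for the three documents
```

Note that the fusion classifier only ever sees one feature per lexicon, which is why the time and memory cost of the whole pipeline stays close to that of the underlying lexicons.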
4 Experimental Scenario

In this section we describe the experimental scenario: the dataset, text preprocessing, and the cross-validation division.
Fig. 1. The concept of the proposed ensemble classification.
4.1 Dataset
We used the Amazon Reviews Dataset published by SNAP [10]. The domains presented in Table 1 were chosen for the experimental scenario.

Table 1. Dataset's domains used in the experiment

Domain | Number of reviews
Automotive product | 188,728
Book | 12,886,488
Clothing | 581,933
Electronics product | 1,241,778
Health product | 428,781
Movie TV | 7,850,072
Music | 6,396,350
Sports Outdoor product | 510,991
Toy Game | 435,996
Video Game | 463,669

4.2 Text Pre-processing
Each review consists of an Amazon user's opinion (text) and a star score on a 1-5 scale, where 1 is the worst score and 5 the best. The review data set was cleaned up from its raw form. All HTML tags and entities were removed or converted to textual representations using the HTML parser from the Python library BeautifulSoup4². Next, the Unicode review texts were transliterated to ASCII using the unidecode³ Python library. In addition, all punctuation and numbers were removed. Each of the data sets was divided into a training and a test set in 10 cross-validation runs. For tractability, especially with supervised learners, the training data set consisted of 12,000 randomly drawn reviews. Reviews were selected evenly per sentiment: the training set included 2,000 reviews for each of the 1, 2, 4, and 5 star ratings and 4,000 reviews labeled with 3 stars. We thus obtained a balanced set of 4,000 positive, 4,000 negative, and 4,000 neutral reviews. The test data set consisted of 30,000 reviews evenly distributed across sentiment labels (distributed analogously to the training set). In order to check the accuracy of the proposed methods, the ground truth sentiment was extracted from the star ratings. Ratings were mapped to the text classes "negative", "neutral", and "positive" using 1 and 2 stars, 3 stars, and 4 and 5 stars, respectively.
² https://pypi.python.org/pypi/beautifulsoup4
³ https://pypi.python.org/pypi/Unidecode
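The clean-up steps described above could be sketched as follows; the function names, the regular expression, and the sample input are ours, and the paper's exact pipeline may differ in details:

```python
# Sketch of the described preprocessing (our illustration).
import re
from bs4 import BeautifulSoup      # pip install beautifulsoup4
from unidecode import unidecode    # pip install unidecode

def preprocess(raw_review: str) -> str:
    text = BeautifulSoup(raw_review, "html.parser").get_text()  # strip HTML tags/entities
    text = unidecode(text)                                      # transliterate Unicode to ASCII
    text = re.sub(r"[^a-zA-Z\s]", " ", text)                    # drop punctuation and digits
    return " ".join(text.split())                               # normalize whitespace

def stars_to_label(stars: int) -> str:
    """Ground-truth mapping: 1-2 stars negative, 3 neutral, 4-5 positive."""
    return {1: "negative", 2: "negative", 3: "neutral",
            4: "positive", 5: "positive"}[stars]

print(preprocess("Gr&eacute;at product!!! 5/5 <br/>"))  # -> Great product
print(stars_to_label(4))                                # -> positive
```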
4.3 Lexicons
In this paper, we used human-generated lexicons and automatically generated lexicons based on the method provided in [1]. In addition, we used Bing Liu's Opinion Lexicon [8], the AFINN lexicons [12], and the list of positive/negative words from www.enchantedlearning.com. These lexicons (Table 2) were needed for the first step of our method, described further in Sect. 3. The polarity of a document is calculated by detecting occurrences of sentiment words from the lexicon in the document (some of the lexicons contain weights for the sentiment words). We did not perform any negation handling.
Table 2. Examples of sentiment lexicons

Lexicon | Positive words | Negative words
Simplest (SM) | good | bad
Simple List (SL) | good, awesome, great, fantastic, wonderful | bad, terrible, worst, sucks, awful, dumb
Simple List Plus (SL+) | good, awesome, great, fantastic, wonderful, best, love, excellent | bad, terrible, worst, sucks, awful, dumb, waist, boring, worse
Past and Future (PF) | will, has, must, is | was, would, had, were
Past and Future Plus (PF+) | will, has, must, is, good, awesome, great, fantastic, wonderful, best, love, excellent | was, would, had, were, bad, terrible, worst, sucks, awful, dumb, waist, boring, worse
Bing Liu | 2006 words | 4783 words
AFINN-96 | 516 words | 965 words
AFINN-111 | 878 words | 1599 words
enchantedlearning | 266 words | 225 words

5 Results
Table 4 reports the results of each individual model (lexicons and the lexicon-based ensemble). Performance was evaluated with the F-measure for each method and domain.

5.1 F-Measure Explanation
As an introduction to analyzing the results, we describe the F-measure. First, the confusion matrix and two additional measures, precision and recall, must be presented. Let us define the most basic terms of the confusion matrix, which are whole numbers, not rates (Table 3):
Table 3. Confusion matrix for a classification problem with two classes

 | Class A | Class B
Classifier says A | true positive (TP) | false positive (FP)
Classifier says B | false negative (FN) | true negative (TN)
- true positives (TP): cases when the classifier predicted class A and the classified object does have class A.
- true negatives (TN): the classifier predicted class B and the object was classified correctly.
- false positives (FP): the classifier predicted class A, but this was a mistake; the object should be classified as class B. This is also known as a "Type I error".
- false negatives (FN): the classifier predicted class B, but again this was a mistake, because it should be class A. It is known as a "Type II error".

$$\text{recall} = \frac{TP}{TP + FN} \qquad (2)$$

Recall measures the completeness, or sensitivity, of a classifier. Higher recall means fewer false negatives, while lower recall means more false negatives. Improving recall can often decrease precision, because it gets increasingly harder to be precise as the sample space increases.
$$\text{precision} = \frac{TP}{TP + FP} \qquad (3)$$

Precision measures the exactness of a classifier. Higher precision means fewer false positives, while lower precision means more false positives. This is often at odds with recall, as an easy way to improve precision is to decrease recall.
$$F\text{-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (4)$$
The F-measure is widely used for text classification evaluation because it combines precision and recall into a single metric: their harmonic mean.
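As a worked example of Eqs. (2)-(4), consider an invented binary confusion matrix; the counts below are illustrative only:

```python
# Worked example of Eqs. (2)-(4) on invented confusion-matrix counts.
TP, FP, FN, TN = 40, 10, 20, 30

recall = TP / (TP + FN)                                     # Eq. (2): 40/60 ~ 0.667
precision = TP / (TP + FP)                                  # Eq. (3): 40/50 = 0.800
f_measure = 2 * precision * recall / (precision + recall)   # Eq. (4): ~ 0.727

print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```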
5.2 Results Evaluation
We observe immediately that the lexicon-based ensemble with the AdaBoost classifier (AB in Table 4) outperformed all lexicons, even those automatically generated for each domain (unigram, bigram, and trigram), as well as the other fusion learners. AdaBoost achieved the highest accuracy for Toys & Games and the lowest for Music. The Music domain is an outstanding example, because its overall accuracy is the lowest across all domains. Some of the tested classifiers performed better than AdaBoost in specific domains, but AdaBoost was the most consistent method. The Random Forest classifier performed really well
in Clothing & Accessories (56.9 %); it was the highest F-measure across all results. However, its performance in the other domains was worse than the unigram lexicons. Interestingly, Clothing & Accessories also shows higher accuracy for the other fusion classifiers: Decision Tree, 55.4 %, and Extra Tree Classifier, 55.5 %.
Table 4. Results for all methods - F-measure

Method | Auto | Books | C&A | Elect | Health | M&TV | Mus | SP | T&G | VG
Lexicons:
SM | 0.244 | 0.227 | 0.249 | 0.244 | 0.245 | 0.230 | 0.228 | 0.245 | 0.230 | 0.237
SL | 0.335 | 0.333 | 0.341 | 0.349 | 0.342 | 0.382 | 0.357 | 0.343 | 0.344 | 0.367
PF | 0.352 | 0.365 | 0.361 | 0.348 | 0.362 | 0.368 | 0.340 | 0.362 | 0.379 | 0.357
SL+ | 0.351 | 0.364 | 0.385 | 0.364 | 0.366 | 0.398 | 0.362 | 0.366 | 0.376 | 0.395
AF-111 | 0.368 | 0.346 | 0.364 | 0.376 | 0.358 | 0.370 | 0.350 | 0.368 | 0.359 | 0.368
PF+ | 0.366 | 0.375 | 0.411 | 0.360 | 0.376 | 0.389 | 0.335 | 0.381 | 0.387 | 0.370
trigr. | 0.348 | 0.395 | 0.361 | 0.386 | 0.380 | 0.390 | 0.353 | 0.366 | 0.392 | 0.388
AF-96 | 0.390 | 0.364 | 0.391 | 0.401 | 0.387 | 0.385 | 0.371 | 0.398 | 0.395 | 0.388
EN | 0.419 | 0.389 | 0.400 | 0.406 | 0.411 | 0.394 | 0.391 | 0.410 | 0.411 | 0.394
BL | 0.411 | 0.387 | 0.421 | 0.414 | 0.410 | 0.406 | 0.407 | 0.429 | 0.439 | 0.404
bigr. | 0.440 | 0.461 | 0.496 | 0.498 | 0.503 | 0.457 | 0.370 | 0.472 | 0.500 | 0.495
unigr. | 0.500 | 0.505 | 0.530 | 0.508 | 0.505 | 0.512 | 0.435 | 0.514 | 0.511 | 0.499
Ensemble:
DT | 0.493 | 0.472 | 0.554 | 0.475 | 0.474 | 0.476 | 0.464 | 0.488 | 0.504 | 0.471
ET | 0.495 | 0.473 | 0.555 | 0.474 | 0.476 | 0.478 | 0.465 | 0.489 | 0.505 | 0.474
RF | 0.503 | 0.482 | 0.569 | 0.483 | 0.485 | 0.488 | 0.475 | 0.499 | 0.512 | 0.484
AB | 0.522 | 0.524 | 0.538 | 0.537 | 0.529 | 0.534 | 0.510 | 0.527 | 0.552 | 0.529

Domains: Auto - Automotive, Books, C&A - Clothing & Accessories, Elect - Electronics, Health, M&TV - Movies & TV, Mus - Music, SP - Sports & Outdoors, T&G - Toys & Games, VG - Video Games. Methods: lexicons as described in Table 2, extended with unigram, bigram, and trigram lexicons. Fusion classifiers: DT - Decision Tree, ET - Extra Tree Classifier, RF - Random Forest, AB - AdaBoost.
6 Conclusions and Future Work
We have proposed a very simple yet powerful ensemble system for sentiment analysis. We combine lexicon predictions to build a more complex and more accurate sentiment predictor. Each such lexicon contributes to the success of the overall system, which outperforms single-lexicon approaches on the Amazon Reviews dataset. We conclude that the AdaBoost learner performed best among all fusion classifiers. However, we feel that further investigation of ensembles and of the feature space for the fusion classifier is necessary. Adding new lexicons could also influence the performance of the overall method.

Acknowledgment. This work is partially funded by the European Commission under the 7th Framework Programme, Coordination and Support Action, Grant Agreement Number 316097, European research centre of Network intelliGence for INnovation Enhancement (ENGINE).
References

1. Augustyniak, L., Kajdanowicz, T., Szymanski, P., Tuliglowicz, W., Kazienko, P., Alhajj, R., Szymanski, B.K.: Simpler is better? Lexicon-based ensemble sentiment classification beats supervised methods. In: 2014 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2014, Beijing, China, 17-20 August 2014, pp. 924-929 (2014)
2. Bollen, J., Mao, H., Zeng, X.: Twitter mood predicts the stock market. J. Comput. Sci. 2(1), 1-8 (2011)
3. Brody, S., Elhadad, N.: An unsupervised aspect-sentiment model for online reviews. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT 2010, Stroudsburg, PA, USA, pp. 804-812. Association for Computational Linguistics (2010)
4. Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm (1996)
5. Galitsky, B., McKenna, E.W.: Sentiment extraction from consumer reviews for providing product recommendations. US Patent App. 12/119,465, 12 November 2009
6. Ghiassi, M., Skinner, J., Zimbra, D.: Twitter brand sentiment analysis: a hybrid system using n-gram analysis and dynamic artificial neural network. Expert Syst. Appl. 40(16), 6266-6282 (2013)
7. Hassan, A., Abbasi, A., Zeng, D.: Twitter sentiment analysis: a bootstrap ensemble framework. In: 2013 International Conference on Social Computing (SocialCom), pp. 357-364, September 2013
8. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2004, pp. 168-177. ACM, New York (2004)
9. Lin, J., Kolcz, A.: Large-scale machine learning at Twitter. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, pp. 793-804. ACM, New York (2012)
10. McAuley, J., Leskovec, J.: Hidden factors and hidden topics: understanding rating dimensions with review text. In: The 7th ACM Conference on Recommender Systems, pp. 165-172. ACM (2013)
11. Mohammad, S., Dunne, C., Dorr, B.: Generating high-coverage semantic orientation lexicons from overtly marked words and a thesaurus. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, Stroudsburg, PA, USA, vol. 2, pp. 599-608. Association for Computational Linguistics (2009)
12. Nielsen, F.Å.: AFINN, March 2011
13. Rodriguez-Penagos, C., Atserias Batalla, J., Codina-Filbà, J., García-Narbona, D., Grivolla, J., Lambert, P., Saurí, R.: FBM: combining lexicon-based ML and heuristics for social media polarities. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, pp. 483-489. Association for Computational Linguistics, Atlanta (2013). http://www.aclweb.org/anthology/S13-2080
14. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for sentiment analysis. Comput. Linguist. 37(2), 267-307 (2011)
15. Tumasjan, A., Sprenger, T.O., Sandner, P.G., Welpe, I.M.: Predicting elections with Twitter: what 140 characters reveal about political sentiment. In: ICWSM, vol. 10, pp. 178-185 (2010)
16. Whitehead, M., Yaeger, L.: Sentiment mining using ensemble classification models. In: SCSS (1), pp. 509-514. Springer (2008)