Effective Method for Sentiment Lexical Dictionary Enrichment based on Word2Vec for Sentiment Analysis
Eissa M. Alshari, Ibb University, Yemen
Azreen Azman, UPM, Serdang, Malaysia
Shyamala A/p C Doraisamy, UPM, Serdang, Malaysia
Norwati Mustapha, UPM, Serdang, Malaysia
Mostafa Alksher, UPM, Serdang, Malaysia
Abstract—Recently, many researchers have shown interest in using lexical dictionaries for sentiment analysis. The SentiWordNet (SWN) is the most widely used sentiment lexical resource for determining the polarity of texts. However, a huge number of terms in the corpus vocabulary are not in the SWN due to the curse of dimensionality, which limits the performance of sentiment analysis. This paper proposes a method to enlarge the set of opinion words by learning the polarity of the non-opinion words in the vocabulary based on the SWN. The effectiveness of the method is evaluated on the Internet Movie Review Dataset. The result is promising, showing that the proposed Senti2Vec method can be more effective than the SWN as the sentiment lexical resource.
Keywords—Sentiment analysis, Word2Vec, Word embeddings, SWN
I. INTRODUCTION
Recently, there has been an explosion in the number of user reviews and comments on products and services available on the Web and social media. These reviews have become a source of information for users in making everyday decisions, especially when choosing a product to buy [1]. Due to the huge number of different opinions on a given product, a user may find it difficult to summarize the overall sentiment of those reviews or comments. Over the years, researchers have developed different classification techniques for opinion mining to determine whether the sentiment polarity of a text is positive, negative or neutral [2]. Several machine-learning techniques such as logistic regression (LR), support vector machine (SVM) and Naive Bayes have been shown to be effective in this text classification problem [3]. The effectiveness of such a technique relies on the features used in the classification task. Several types of features have been investigated for this task, such as bag-of-words (BoW), lexical and syntactic features [4], [5].
SWN is highly regarded as an effective lexical resource for sentiment analysis [6]. Each term in the SWN is associated with a set of scores representing its positivity, negativity and objectivity. The score may also depend on the part-of-speech (POS) tag of the term, such that a term can have a positive score as a NOUN but a negative score as a VERB, which can introduce noise into the sentiment classification [7]. In addition, the SWN does not include all terms in the corpus vocabulary, as depicted in Fig. 1. The terms that are included as features for sentiment classification reside within the intersection of the two sets. This limits the performance of any sentiment analysis approach based on the SWN. It is assumed that enlarging the size of the intersection will lead to more effective sentiment analysis.
Fig. 1. Intersection of the SWN and the corpus vocabulary
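The intersection depicted in Fig. 1 can be illustrated with a minimal sketch. The function name `lexicon_coverage` and the toy word lists are hypothetical and are not taken from the paper; the sketch only assumes that both the corpus vocabulary and the lexicon can be treated as sets of terms.

```python
def lexicon_coverage(vocabulary, lexicon_terms):
    """Split a corpus vocabulary into terms covered by a lexicon and terms that are not."""
    vocab = set(vocabulary)
    shared = vocab & set(lexicon_terms)   # the intersection of the two sets
    missing = vocab - shared              # vocabulary terms without polarity scores
    ratio = len(shared) / len(vocab) if vocab else 0.0
    return shared, missing, ratio

# Toy data (illustrative only, not from the SWN or the evaluation dataset).
toy_vocabulary = ["great", "boring", "plot", "cgi", "rewatchable"]
toy_lexicon = ["great", "boring", "plot", "awful"]

shared, missing, ratio = lexicon_coverage(toy_vocabulary, toy_lexicon)
```

Here `missing` holds the terms that would contribute no sentiment feature unless their polarity is learnt by some other means.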
Since the introduction of the Word2Vec by Mikolov et al. [8] to discover word embeddings, these embeddings have been used as features for several text classification tasks [9]. The high dimensionality of the Word2Vec features increases the complexity of the classifier. Several feature extraction methods have been applied in order to reduce the dimension of the Word2Vec features [10].
In this paper, a model to enlarge the intersection of the SWN and the corpus vocabulary is proposed to improve the effectiveness of sentiment analysis. The model, named Senti2Vec, is based on the assumption that the polarity of any term in the vocabulary can be learnt from the terms in the SWN.
This paper is organized as follows. A review of related work on sentiment analysis and the SWN is presented in Section II. The proposed model is discussed in Section III. In Section IV, the analysis of the experimental results is elaborated. Finally, the conclusion and future work are discussed in Section V.
II. RELATED WORK
SWN is a sentiment lexical dictionary in which each synset of the WordNet is associated with objective, positive and negative scores. It is commonly used in sentiment analysis, which is a collection of methods to determine the sentiment orientation of a text (either positive, negative or neutral) [11]. Many techniques and types of features have been investigated for sentiment analysis, including the use of the bag-of-words (BoW) model as the feature for the classification. The bag-of-words is an approach to model texts numerically in many text mining and information retrieval tasks. Several weighting schemes have been successfully used in the BoW, such as the n-gram, Boolean, co-occurrence, tf and tf.idf [12].
SWN was developed by [7] as a lexical resource that associates each synset of the WordNet with sentiment scores. In [13], the authors proposed a sentence-level sentiment analyzer for Telugu news. It exploited the available Telugu SWN to perform sentiment analysis on sentences from Telugu e-Newspapers. In [6], Tomar and Sharma proposed an SWN-based algorithm to efficiently determine the polarity of a sentence. They used a part-of-speech tagger to tag words and searched only those words with polarity (such as adjectives and adverbs), which increased the performance without removing stop words. Their algorithm performed well on randomly chosen input sentences and was evaluated on several types of customer review datasets. The authors in [7] compared the SWN 3.0-semi and the SWN 3.0 and observed a small improvement in ranking when using the SWN 3.0-semi. Meanwhile, the SenticNet was developed by [14] to exploit common sense reasoning techniques, such as blending and spectral activation, together with an ontology and an emotion categorization model to describe human emotions.
In [15], the authors combined generic sentiment-trained word embeddings with manually crafted features. The combination is applied to a model for aspect-based sentiment analysis in order to improve the classification performance. A model to capture both semantic and sentiment similarities among words is presented by Maas et al. [29]. The semantic component learns word vectors through an unsupervised probabilistic model of documents. However, linguistic and cognitive researchers have argued that expressive content and descriptive semantic content are distinct, and it was observed that the model missed crucial sentiment information. On the other hand, [16] learned vector representations of words by using the Word2Vec with both the continuous bag-of-words (CBOW) and the skip-gram (SG) models to discover the semantics of words for various Natural Language Processing tasks.
In the context of modeling distributional semantics within text, several models have been proposed for estimating continuous representations of words, such as the Latent Semantic Analysis (LSA) [17], the Latent Dirichlet Allocation (LDA) [18], the Second Order Attributes (SOA) [19], the Document Occurrence Representation (DOR) [20], the Word2Vec [21] and the GloVe [22]. Villegas et al. [23] compared these word embedding approaches for sentiment analysis by using several weighting schemes, including tf.idf and Boolean, on a subset of the IMDB Review Dataset. They found that the LSA as the feature set with a Naive Bayes classifier outperforms the other techniques. In [24], Giatsoglou et al. observed that LDA is computationally very expensive compared to LSA on large data sets. An extension of the Word2Vec model was proposed by Le and Mikolov that treats each document separately in a document-to-vector approach (Doc2Vec). Each vector represents a piece of text of variable length, such as a sentence, paragraph or document [25]. In order to evaluate the effectiveness of the Doc2Vec, Lau and Baldwin used the Word2Vec with an n-gram model to construct both the Distributed bag-of-words version of the Paragraph Vector (DBoW) and the Distributed Memory version of the Paragraph Vector (DMPV) for the Doc2Vec [26]. They observed that the DBoW is better than the DMPV model [21].
III. METHOD FOR SENTIMENT LEXICAL DICTIONARY ENRICHMENT
As mentioned in Section I, the effectiveness of any sentiment analysis method that uses a sentiment lexical resource such as the SWN is limited by the number of its terms that intersect with the corpus vocabulary. As such, a model to enlarge the intersection is proposed to improve the effectiveness of sentiment analysis. The method is based on the assumption that the polarity of any term in the corpus vocabulary can be learnt from the SWN, which will enlarge the intersection of the two sets depicted in Fig. 1. The polarity of those terms in the corpus vocabulary is estimated by calculating the distance between each term and its nearest term in the SWN.
A. Learning term vectors based on the Word2Vec
The first step of the method deals with learning the term representation based on the Word2Vec model. Let a corpus $D$ consist of a set of texts, $D = \{d_1, d_2, d_3, \ldots, d_n\}$, and let the vocabulary $T = \{t_1, t_2, t_3, \ldots, t_m\}$ consist of the unique terms extracted from $D$. The representation of each term $t_i$ is discovered by using the Skip-gram model of the Word2Vec [27] to calculate the probability distribution of the other terms in context given $t_i$. In particular, $t_i$ is represented by a vector $\vec{v}_i$ that comprises the probability values of all other terms in the vocabulary. This word embedding technique discovers semantic relations among the terms in the corpus. However, the resulting set of vectors for all terms in the corpus is high-dimensional and is inefficient for the classifier in the sentiment analysis task. As a result, this first step discovers a set of vectors $V_T = \{\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_m\}$ representing the set of terms in the vocabulary $T$.
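The idea of representing a term by the distribution of the terms in its context can be illustrated with a minimal sketch. This is a count-based stand-in and not the neural Skip-gram training of the Word2Vec: it only generates (center, context) pairs within a window and normalises the co-occurrence counts into an empirical P(context | term). The function names, the window size and the two-sentence corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for tokens within `window` positions of each other."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def context_distributions(corpus, window=2):
    """Estimate P(context | term) from raw co-occurrence counts (not a trained model)."""
    counts = defaultdict(Counter)
    for text in corpus:
        for center, context in skipgram_pairs(text.lower().split(), window):
            counts[center][context] += 1
    vocab = sorted(counts)
    vectors = {}
    for term in vocab:
        total = sum(counts[term].values())
        # One entry per vocabulary term, so each vector has length len(vocab).
        vectors[term] = [counts[term][other] / total for other in vocab]
    return vocab, vectors

# Toy corpus (illustrative only).
corpus = ["the movie was great", "the movie was boring"]
vocab, vectors = context_distributions(corpus, window=1)
```

Each vector has one component per vocabulary term, which also shows why such a representation becomes high-dimensional as the vocabulary grows.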
B. Synset unification
The aim of the synset unification is to create a sentiment dictionary from the SWN in which each common term is represented by a single entry with one positive and one negative score.
• First, all terms are extracted from the synsets in the SWN together with their positivity and negativity values, as well as the associated part-of-speech (POS) tags, as shown in (1):
$$SWNet = [t_{1:w}, +, -, POS] \tag{1}$$
where $t$ is a term in the SWN, $w$ is the total number of terms, $+$ is the positive value, $-$ is the negative value and $POS$ is the part-of-speech tag.
• Then, identical terms in $SWNet$ are grouped together and their POS tags are removed, leaving only the positive and negative scores. For instance, the term $t_{10}$ may have four entries in the SWN with different positive and negative scores, as shown in (2):
$$t_{10} = \begin{bmatrix} t_{10} & 0.5 & 0.3 \\ t_{10} & 0.2 & 0.8 \\ t_{10} & 0.9 & 0.2 \\ t_{10} & 0.3 & 0.4 \end{bmatrix} \tag{2}$$
• Next, as the proposed method requires only one positive and one negative score for each term, the average of those scores is calculated. The unification thus generates another set, $SWNet_{unified}$, containing only the terms and their average positive and negative scores, as shown in (3) for the term $t_{10}$:
$$SWNet_{unified} = \begin{bmatrix} t_1 & + & - \\ t_2 & + & - \\ \vdots & \vdots & \vdots \\ t_{10} & 0.47 & 0.42 \\ \vdots & \vdots & \vdots \\ t_w & + & - \end{bmatrix} \tag{3}$$
C. Labeling of non-opinion words for sentiment polarity
The aim of the proposed method is to enlarge the size of the intersection between the SWN and the corpus vocabulary shown in Fig. 1 in order to improve the performance of sentiment analysis. This is achieved by learning the sentiment polarity of each non-opinion word in the vocabulary based on the polarity of its closest term in the SWN.
• For each term that is in the vocabulary $T$ but not in the SWN (a non-opinion word), the distance between the term and all terms in the SWN is calculated by using the Frobenius norm [28], as shown in (4):
$$\|A\|_F = \left[ \sum_{i,j} |a_{i,j}|^2 \right]^{1/2} \tag{4}$$
where $a_{i,j}$ are the entries of $A$.
• Then, each non-opinion word in the vocabulary that previously had no polarity scores is assigned the polarity scores of the closest opinion word from the SWN, as calculated based on (4). As such, more non-opinion words will have polarity scores and will become part of the intersection between the SWN and the vocabulary, as depicted in Fig. 2. The remaining terms that are not part of the intersection are due to the problem of the curse of dimensionality in word embeddings.
[Figure: the Corpus Vocabulary and SentiWordNet sets, with the terms that carry polarity scores marked (+,-)]
Fig. 2. The enlargement of the intersection
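The unification of Section III-B and the labeling of Section III-C can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the matrix $A$ in (4) is assumed to be the difference between two term vectors (so the norm reduces to the Euclidean distance), ties are broken by taking the first nearest term, and the function names, POS tags and toy 2-d vectors are hypothetical. The four score pairs for t10 are those of (2).

```python
import math
from collections import defaultdict

def unify(entries):
    """Average the positive and negative scores of each term, dropping the POS tag."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for term, pos, neg, _pos_tag in entries:
        acc = sums[term]
        acc[0] += pos
        acc[1] += neg
        acc[2] += 1
    return {t: (p / n, q / n) for t, (p, q, n) in sums.items()}

def distance(u, v):
    """Frobenius norm of u - v (assumption: A in Eq. (4) is the difference vector)."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(u, v)))

def label_non_opinion(vectors, lexicon):
    """Assign each term missing from the lexicon the scores of its nearest lexicon term."""
    known = [t for t in vectors if t in lexicon]
    enriched = dict(lexicon)
    for term, vec in vectors.items():
        if term in lexicon or not known:
            continue
        nearest = min(known, key=lambda k: distance(vec, vectors[k]))
        enriched[term] = lexicon[nearest]
    return enriched

# Entries for t10 use the scores of Eq. (2); the POS tags and t1 are illustrative.
swn_entries = [
    ("t10", 0.5, 0.3, "n"),
    ("t10", 0.2, 0.8, "v"),
    ("t10", 0.9, 0.2, "a"),
    ("t10", 0.3, 0.4, "r"),
    ("t1", 0.1, 0.7, "a"),
]
unified = unify(swn_entries)

# Toy 2-d vectors standing in for Word2Vec embeddings; t7 is a non-opinion word.
toy_vectors = {"t10": (1.0, 0.0), "t1": (0.0, 1.0), "t7": (0.9, 0.1)}
enriched = label_non_opinion(toy_vectors, unified)
```

In this toy setting t7 lies closest to t10 and therefore inherits its averaged scores (0.475 and 0.425 before rounding), while the entries already in the unified dictionary are left unchanged.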
IV. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed method for sentiment analysis is evaluated by using the Large Movie Review Dataset (ACLIMDB), which is available online¹. The dataset consists of 100,000 movie reviews, of which 50,000 are labeled [29]. The baseline for the comparison is the sentiment analysis method in [6], which uses the standard SWN.
The proposed Senti2Vec method assigns polarity scores to the non-opinion words in the corpus vocabulary, which enlarges the intersection between the SWN and the corpus vocabulary as illustrated in Fig. 2. This increase in the number of sentiment words enriches the feature set for classification, which is expected to improve the performance of sentiment analysis. The performance is measured by the accuracy on the reviews labeled as positive and on those labeled as negative, in order to show the difference in performance for each polarity.
Table I shows that the proposed Senti2Vec method outperforms the standard SWN on both the positive and the negative dataset. The accuracy of the proposed method on the positive dataset is 85.4% as compared to 73.2% for the standard SWN, which is a relative increase of 16.7%. Meanwhile, the accuracy on the negative dataset is 83.9% for the Senti2Vec as compared to 72.1% for the standard SWN, which is a relative increase of 16.3%.
¹ http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
TABLE I
THE ACCURACY (%) OF THE SENTI2VEC AND THE STANDARD SWN
Dataset     SWN [6]    Senti2Vec
Positive    73.2       85.4
Negative    72.1       83.9
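The increases quoted for Table I are relative to the baseline accuracy, which can be checked with a short calculation. The function name `relative_increase` is hypothetical; the accuracy values are those of Table I.

```python
def relative_increase(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0

# (SWN baseline, Senti2Vec) accuracies in percent, taken from Table I.
accuracy = {"positive": (73.2, 85.4), "negative": (72.1, 83.9)}
gains = {label: relative_increase(base, senti) for label, (base, senti) in accuracy.items()}
```

The positive dataset gives roughly 16.67%, matching the reported 16.7%. The negative dataset gives roughly 16.37%, which corresponds to the reported 16.3% when truncated rather than rounded. In absolute terms the gains are 12.2 and 11.8 percentage points.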
V. CONCLUSION
In this paper, a method to enrich the feature set for sentiment analysis by enlarging the intersection between the SWN and the corpus vocabulary is proposed. It assigns polarity scores to the non-opinion words in the corpus vocabulary by learning from the SWN. The method is evaluated by using a labeled dataset of movie reviews. The performance of the proposed method is encouraging, showing that it can be more effective than the standard SWN. In the future, the impact of the different distance measures used in the Senti2Vec on the performance of sentiment analysis will be investigated.
ACKNOWLEDGMENT
This work is partly supported by the Ministry of Higher Education Malaysia under the FRGS Grant (FRGS/1/2015/ICT04/UPM/02/5).
REFERENCES
[1] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts et al., “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the conference on empirical methods in natural language processing (EMNLP), vol. 1631. Citeseer, 2013, p. 1642.
[2] X. Lui and W. B. Croft, “Statistical Language Modeling For Information Retrieval,” Annual Review of Information Science and Technology 2005 Volume 39, vol. 39, p. 1, 2003. [Online]. Available: http://ciir.cs.umass.edu/pubfiles/ir-318.pdf
[3] Z. Yu, H. Wang, X. Lin, and M. Wang, “Learning term embeddings for hypernymy identification.” in IJCAI, 2015, pp. 1390–1397.
[4] X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annual Review of Information Science and Technology, vol. 39, no. 1, pp. 1–31, 2005.
[5] N. M. Sharef, H. M. Zin, and S. Nadali, “Overview and Future Opportunities of Sentiment Analysis Approaches for Big Data,” in Journal of Computer Sciences, 2016.
[6] D. S. Tomar and P. Sharma, “A text polarity analysis using sentiwordnet based an algorithm,” (IJCSIT) International Journal of Computer Science and Information Technologies, 2016.
[7] S. Baccianella, A. Esuli, and F. Sebastiani, “Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining.” in LREC, vol. 10, 2010, pp. 2200–2204.
[8] T. Mikolov, G. Corrado, K. Chen, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” Proceedings of the International Conference on Learning Representations (ICLR 2013), pp. 1–12, 2013. [Online]. Available: http://arxiv.org/pdf/1301.3781v3.pdf
[9] C. C. Aggarwal and C. Zhai, “A survey of text clustering algorithms,” Sentiment,Sentiment/rate, pp. 77–128, 2012.
[10] M. Hu and B. Liu, “Mining and summarizing customer reviews,” Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data mining KDD 04, vol. 04, p. 168, 2004. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1014052.1014073
[11] J. Kim, J.-B. Yoo, H. Lim, H. Qiu, Z. Kozareva, and A. Galstyan, “Sentiment prediction using collaborative filtering.” in ICWSM, 2013.
[12] E. Emad, E. M. Alshari, and H. Abdulkader, “Arabic vector space model based on semantic,” International journal of computer science (IJISI), vol. 8, no. 6, pp. 94–101, 2013.
[13] R. Naidu, S. K. Bharti, K. S. Babu, and R. K. Mohapatra, “Sentiment analysis using telugu sentiwordnet,” 2017.
[14] E. Cambria, R. Speer, C. Havasi, and A. Hussain, “Senticnet: A publicly available semantic resource for opinion mining.” in AAAI fall symposium: commonsense knowledge, vol. 10, 2010.
[15] O. Araque, I. Corcuera-Platas, J. F. Sánchez-Rada, and C. A. Iglesias, “Enhancing deep learning sentiment analysis with ensemble techniques in social applications,” Expert Systems with Applications, vol. 77, pp. 236–246, 2017.
[16] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato, “Learning longer memory in recurrent neural networks,” arXiv preprint arXiv:1412.7753, 2014.
[17] T. K. Landauer and S. T. Dumais, “A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge.” Psychological review, vol. 104, no. 2, p. 211, 1997.
[18] M. Hoffman, F. R. Bach, and D. M. Blei, “Online learning for latent dirichlet allocation,” in advances in neural information processing systems, 2010, pp. 856–864.
[19] A. P. López-Monroy, M. Montes-Y-Gomez, H. J. Escalante, L. V. Pineda, and E. Villatoro-Tello, “Inaoe’s participation at pan’13: Author profiling task notebook for pan at clef 2013.” in CLEF (Working Notes), 2013.
[20] A. Lavelli, F. Sebastiani, and R. Zanoli, “Distributional term representations: an experimental comparison,” in Proceedings of the thirteenth ACM international conference on Information and knowledge management. ACM, 2004, pp. 615–624.
[21] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
[22] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.” in EMNLP, vol. 14, 2014, pp. 1532–1543.
[23] M. P. Villegas, M. J. Garciarena Ucelay, J. P. Fernández, M. A. Álvarez Carmona, M. L. Errecalde, and L. Cagnina, “Vector-based word representations for sentiment analysis: a comparative study,” in XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016), 2016.
[24] M. Giatsoglou, M. G. Vozalis, K. Diamantaras, A. Vakali, G. Sarigiannidis, and K. C. Chatzisavvas, “Sentiment analysis leveraging emotions and word embeddings,” Expert Systems with Applications, vol. 69, pp. 214–224, 2017. [Online]. Available: http://dx.doi.org/10.1016/j.eswa.2016.10.043
[25] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” International Conference on Machine Learning - ICML 2014, vol. 32, pp. 1188–1196, 2014. [Online]. Available: http://arxiv.org/abs/1405.4053
[26] J. H. Lau and T. Baldwin, “An empirical evaluation of doc2vec with practical insights into document embedding generation,” arXiv preprint arXiv:1607.05368, 2016.
[27] T. Mikolov, A. Joulin, S. Chopra, M. Mathieu, and M. Ranzato, “Learning longer memory in recurrent neural networks,” arXiv preprint arXiv:1412.7753, pp. 1–9, 2014.
[28] G. H. Golub and C. F. Van Loan, Matrix computations. JHU Press, 2012, vol. 3.
[29] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, “Learning word vectors for sentiment analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Portland, Oregon, USA: Association for Computational Linguistics, June 2011, pp. 142–150. [Online]. Available: http://www.aclweb.org/anthology/P11-1015