SAAN: A Sentiment-Aware Attention Network for Sentiment Analysis

Zeyang Lei
Graduate School at Shenzhen, Tsinghua University
[email protected]

Yujiu Yang
Graduate School at Shenzhen, Tsinghua University
[email protected]

Min Yang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
[email protected]
ABSTRACT
Analyzing public opinions towards products, services and social events is an important but challenging task. Despite the remarkable successes of deep neural networks in sentiment analysis, these approaches do not make full use of prior sentiment knowledge (e.g., sentiment lexicons, negation words, intensity words). In this paper, we propose a Sentiment-Aware Attention Network (SAAN) to boost the performance of sentiment analysis, which adopts a three-step strategy to learn the sentiment-specific sentence representation. First, we employ a word-level mutual attention mechanism to model word-level correlation. Next, a phrase-level convolutional attention is designed to obtain phrase-level correlation. Finally, a sentence-level multi-head attention mechanism is proposed to capture various sentiment information from different subspaces. Experiments on Movie Review (MR) and Stanford Sentiment Treebank (SST) show that our model consistently outperforms previous methods for sentiment analysis.

KEYWORDS
Sentiment knowledge; mutual attention; multi-source attention

ACM Reference Format:
Zeyang Lei, Yujiu Yang, and Min Yang. 2018. SAAN: A Sentiment-Aware Attention Network for Sentiment Analysis. In SIGIR '18: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, July 8-12, 2018, Ann Arbor, MI, USA. ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3209978.3210128

1 INTRODUCTION
Sentiment classification of natural language texts is an active research field that attempts to automatically determine the subjective attitude of a given text. It has recently become a widely used tool in the repertoire of social media analysis carried out by companies, marketers, and political analysts. Sentiment classification is typically treated as a text classification problem. Support Vector Machines (SVM) [7], Convolutional Neural Networks (CNN) [4], Long Short-Term Memory (LSTM) [2], and Tree-LSTM [11] are representative supervised machine learning methods for sentiment classification. Although significant progress has been achieved with deep neural networks, most approaches (e.g., CNN, LSTM) do not make full use of domain-specific sentiment linguistic knowledge (i.e., sentiment lexicons, negation words, intensity words). A few recent studies have attempted to alleviate this problem [5, 8, 9], but their network designs remain insufficient and their classification performance unsatisfactory.

In this study, the aforementioned limitations are considered and alleviated to some extent. Our method, SAAN (Sentiment-Aware Attention Network), incorporates several kinds of sentiment resources as a whole into convolutional neural networks to improve sentiment classification performance. Specifically, our model uses sentiment resources (sentiment words, negation words, and intensity words) as a single attention source to attend to the context, which enhances the performance of convolutional neural networks. Meanwhile, to learn better sentiment-specific sentence semantics, we adopt a three-step strategy, consisting of word-level correlation modeling, phrase-level correlation modeling and sentence-level semantic modeling, to exploit the semantic compositionality of the sentence and capture more comprehensive sentence information. Concretely, inspired by [5], we first employ a mutual attention mechanism to capture the word-level correlation between context words and sentiment words within the sentence. The mutual attention mechanism helps highlight the sentiment-relevant context words and guides the learning of sentiment resource word representations by leveraging the contextual global information interactively. Second, a dynamic convolutional attention network is adopted to extract the important and discriminative n-gram features that form the phrase-level correlation between context words and sentiment words. Finally, motivated by [13], we propose a sentence-level multi-head attention mechanism to capture different sentiment information from different subspaces, which helps learn a more comprehensive sentiment-specific sentence representation. The main contributions of this study can be summarized as follows.
∙ We incorporate many kinds of sentiment linguistic knowledge as a whole into convolutional neural networks to learn the sentiment-specific sentence representation.
∙ Motivated by multi-head attention, we design a multi-head attentional convolutional neural network to model sentence semantics from different semantic subspaces and capture more comprehensive sentiment-specific information.
2 THE PROPOSED MODEL
SAAN is composed of three key components: word-level correlation modeling (§2.1), phrase-level correlation modeling (§2.2), and sentence-level semantic modeling (§2.3). The overall framework of SAAN is shown in Figure 1.

2.1 Word-Level Correlation Modeling
Mathematically, we first assume that the embedding of a sentence $x$ containing $n$ context words is $x^c = [x^c_1, x^c_2, \dots, x^c_i, \dots, x^c_n]$ and that the embedding of the corresponding sentiment resource words is $x^s = [x^s_1, x^s_2, \dots, x^s_j, \dots, x^s_m]$, where each $x^c_i$ and $x^s_j$ denotes a specific word embedding. To model the word-level correlation between the context words and the sentiment resource words, we use their dot product to define the correlation matrix $R$:

$$R = (x^c)^T \cdot x^s \in \mathbb{R}^{n \times m} \qquad (1)$$

where each element $R_{i,j}$ refers to the correlation between the $i$-th context word $x^c_i$ and the $j$-th sentiment resource word $x^s_j$. Next, we average the values of each row of $R$, resulting in a column vector of size $n$, and feed it into a softmax layer to produce an attention vector for the context: $\varphi = \mathrm{softmax}(\frac{1}{m}\sum_{h=1}^{m} R_{[:,h]})$. Similarly, we average the values of each column of $R$ to get a row vector and feed it into a softmax layer to produce an attention vector for the sentiment resource words: $\phi = \mathrm{softmax}(\frac{1}{n}\sum_{k=1}^{n} R_{[k,:]})$. Then, taking the context representation matrix $x^c$ and the attention weight vector $\varphi$ as inputs, we obtain the overall correlation vector for the context: $v^\varphi = \sum_{i=1}^{n} \varphi_i x^c_i$. Similarly, the overall correlation vector for the sentiment resource words is $v^\phi = \sum_{i=1}^{m} \phi_i x^s_i$. Here $v^\varphi$ and $v^\phi$ denote the overall relevancy measures for the context words and the sentiment resource words, respectively. Finally, given the context embedding matrix $x^c$ and the correlation vector $v^\varphi$, the sentiment-enhanced context representation matrix $E^c \in \mathbb{R}^{d \times n}$ is computed as follows:

$$E^c = \tilde{f}_{semantics}(x^c + e_n \otimes v^\varphi) \qquad (2)$$

$$\tilde{f}_{semantics}(x) = \begin{cases} x & x > 0 \\ \xi(\exp(x) - 1) & x \le 0,\ \xi > 0 \end{cases} \qquad (3)$$

where $e_n = [1, 1, \dots, 1]^T$ denotes an $n$-dimensional all-ones vector, $e_n \otimes v^\varphi = [v^\varphi; v^\varphi; \dots; v^\varphi]$ denotes the Kronecker product between $e_n$ and $v^\varphi$, and $\tilde{f}_{semantics}$ is the semantic mapping function. Here we choose the exponential linear unit (ELU) [1] as $\tilde{f}_{semantics}$, since it reduces the bias shift effect. In the same way, we obtain the contextually enhanced sentiment resource representation matrix $E^s \in \mathbb{R}^{d \times m}$:

$$E^s = \tilde{f}_{semantics}(x^s + e_m \otimes v^\phi) \qquad (4)$$

With the above transformation, we model the correlation between the context and the sentiment resources, which is helpful for the following phrase-level modeling.
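To make this step concrete, a minimal NumPy sketch of Eqs. (1)-(4) is given below. This is our own illustration rather than the authors' released code: the function names are ours, the setting is a single unbatched sentence, and trainable parameters are omitted.

```python
import numpy as np

def elu(x, xi=1.0):
    # f_semantics in Eq. (3): identity for x > 0, xi * (exp(x) - 1) otherwise
    return np.where(x > 0, x, xi * (np.exp(x) - 1.0))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def word_level_correlation(xc, xs, xi=1.0):
    """xc: (d, n) context embeddings; xs: (d, m) sentiment resource embeddings."""
    R = xc.T @ xs                     # Eq. (1): (n, m) correlation matrix
    varphi = softmax(R.mean(axis=1))  # row-wise average -> context attention, (n,)
    phi = softmax(R.mean(axis=0))     # column-wise average -> resource attention, (m,)
    v_varphi = xc @ varphi            # overall correlation vector for the context, (d,)
    v_phi = xs @ phi                  # overall correlation vector for the resources, (d,)
    Ec = elu(xc + v_varphi[:, None], xi)  # Eq. (2): broadcasting stands in for e_n (x) v
    Es = elu(xs + v_phi[:, None], xi)     # Eq. (4)
    return Ec, Es
```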
2.2 Phrase-Level Correlation Modeling
Given $E^c$ and $E^s$, multi-gram convolution is performed to extract different local semantic features. The convolution operation involves a filter $W^\ell \in \mathbb{R}^{\ell \times 2d}$, which is applied to $\ell$ continuous words to extract an $\ell$-gram phrase. With different window sizes $\ell_i$ (i.e., $\ell_i$-grams), we obtain the combined multi-gram feature maps $P^c$ and $P^s$ for the context words and the sentiment words, respectively, computed as:

$$P^c = P^c_{\ell_1} \oplus P^c_{\ell_2} \oplus \cdots \oplus P^c_{\ell_q} \qquad (5)$$

$$P^s = P^s_{\ell_1} \oplus P^s_{\ell_2} \oplus \cdots \oplus P^s_{\ell_q} \qquad (6)$$

$$P^c_{\ell_i} = \gamma(E^c * W^{\ell_i} + b) \in \mathbb{R}^{(n-\ell_i+1) \times k} \qquad (7)$$

$$P^s_{\ell_i} = \gamma(E^s * W^{\ell_i} + b) \in \mathbb{R}^{(m-\ell_i+1) \times k}, \quad i = 1, 2, \dots, q \qquad (8)$$

where $\gamma(\cdot)$ is a nonlinear activation function, $b$ is a bias matrix, $*$ represents the convolution operator, $k$ is the number of filters, $\oplus$ is the concatenation operator, and $P^c_{\ell_i}$ and $P^s_{\ell_i}$ are the convolution results for the context and the sentiment resource words with window size $\ell_i$, respectively. We then design a novel dynamic attention mechanism to choose the important and distinguishable phrases from the n-gram chunks learned by the convolution operation. Mathematically, we formulate the chunk-based attention mechanism as:

$$S = P^s \odot \{(P^c)^T \, \mathrm{softmax}(P^c (P^s)^T)\} \qquad (9)$$

$$\bar{s} = \frac{\sum_{j=1}^{m} S_{[:,j]}}{m} \qquad (10)$$

where $\odot$ is element-wise multiplication.
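A NumPy sketch of the multi-gram convolution and chunk-based attention follows, again as our own illustration under stated assumptions: the paper does not specify $\gamma$, so we assume ReLU; we use flattened filters of shape $(k, \ell d)$ over the enhanced representations; and we adopt one dimensionally consistent reading of Eqs. (9)-(10), normalizing the chunk-correlation scores over the context chunks.

```python
import numpy as np

def conv_ngrams(E, W, b):
    """Valid 1-D convolution. E: (d, n) word matrix; W: (k, l*d) filters; b: (k,)."""
    d, n = E.shape
    l = W.shape[1] // d
    windows = np.stack([E[:, i:i + l].reshape(-1) for i in range(n - l + 1)])
    return np.maximum(windows @ W.T + b, 0.0)   # gamma assumed to be ReLU; (n-l+1, k)

def phrase_level(Ec, Es, filters, biases):
    """filters/biases: one (k, l*d) weight and (k,) bias per window size l (here 1-4)."""
    Pc = np.concatenate([conv_ngrams(Ec, W, b) for W, b in zip(filters, biases)])
    Ps = np.concatenate([conv_ngrams(Es, W, b) for W, b in zip(filters, biases)])
    A = np.exp(Pc @ Ps.T)                       # chunk-correlation scores
    A /= A.sum(axis=0, keepdims=True)           # softmax over context chunks
    S = Ps * (A.T @ Pc)                         # Eq. (9): attended context gates the resource chunks
    s_bar = S.mean(axis=0)                      # Eq. (10): phrase-level summary, (k,)
    return Pc, Ps, s_bar
```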
2.3 Sentence-Level Semantic Modeling
Inspired by [13], we design a sentence-level multi-head attention mechanism to capture various sentiment information from different sentiment resources, which helps form a more complete sentiment-specific sentence representation. Specifically, the multi-head attention mechanism uses multiple sentiment resources as attention sources to attend to the context in the form of an attention weight matrix. The detailed formulation is as follows:

$$\rho = Q^T \tanh(\tilde{U}(P^c + e_n \otimes \bar{s})) \qquad (11)$$

$$B = \frac{\exp(\rho)}{\sum_{i=1}^{n} \exp(\rho)} \qquad (12)$$

$$\tilde{o} = \mathrm{flatten}(B P^s) \qquad (13)$$

where $B$ is the attention matrix with each row denoting one hop of sentiment resource attention, $\mathrm{flatten}$ is an operation that flattens a matrix into vector form, $Q$ and $\tilde{U}$ are projection parameters to be learned, and $\tilde{o}$ refers to the final enhanced sentiment-specific sentence representation.
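A sketch of the sentence-level multi-head attention is given below, under the same caveats: the shapes of $Q$ and $\tilde{U}$ and the argument of Eq. (13) are not fully pinned down in the text, so this version lets each head attend over the sentiment-enhanced context chunks and flattens the per-head summaries. Treat the pooling target (Pc rather than Ps) as our assumption.

```python
import numpy as np

def sentence_level_multihead(Pc, s_bar, Q, U):
    """Pc: (n, k) context chunk features; s_bar: (k,) summary from Eq. (10);
    U: (k, k) projection; Q: (k, h) with one score column per attention head."""
    H = np.tanh((Pc + s_bar[None, :]) @ U.T)    # Eq. (11): inject the phrase summary
    rho = H @ Q                                 # (n, h) unnormalized head scores
    B = np.exp(rho) / np.exp(rho).sum(axis=0)   # Eq. (12): softmax per head over chunks
    o = (B.T @ Pc).reshape(-1)                  # Eq. (13): h attended views, flattened
    return o                                    # (h * k,) sentence representation
```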
Figure 1: The Overall Framework of SAAN Model (context and sentiment resources feed a correlation matrix with row-/column-wise averages, multi-gram convolution with chunk-based attention, sentence-level semantics modeling, and a softmax classifier).
2.4 Sentence Classifier
We feed the final sentence representation $\tilde{o}$ to a softmax layer to predict the category label distribution:

$$\hat{y} = \frac{\exp(U_o^T \tilde{o} + b_o)}{\sum_{k=1}^{C} \exp(U_o^T \tilde{o} + b_o)} \qquad (14)$$

where $\hat{y}$ is the vector of prediction probabilities, $C$ is the number of classes, and $U_o$ and $b_o$ are the weight matrix and bias to be learned, respectively. Given a corpus with $N$ training samples $(x_i, y_i)$, the parameters of the network are trained to minimize the following loss function:

$$L(\hat{y}, y) = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_{ij} \log(\hat{y}_{ij}) + \lambda \sum_{\theta \in \Theta} \theta^2 \qquad (15)$$

where $y_{ij}$ is the ground-truth label, $\hat{y}_{ij}$ is the predicted probability, $\theta$ denotes each parameter to be regularized, $\Theta$ is the parameter set, and $\lambda$ is the coefficient for $L_2$ regularization. The first term of the loss is the cross-entropy between the predicted and true distributions, and the second term is the $L_2$ regularization.
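The classifier and loss are standard; a short NumPy sketch of Eqs. (14)-(15) follows (our own scaffolding, with function names ours and a small epsilon added for numerical stability):

```python
import numpy as np

def predict_proba(o, Uo, bo):
    """Eq. (14): softmax over class scores. o: (p,) sentence vector; Uo: (p, C); bo: (C,)."""
    z = o @ Uo + bo
    e = np.exp(z - z.max())
    return e / e.sum()

def total_loss(Y, Y_hat, params, lam=1e-5):
    """Eq. (15): summed cross-entropy plus L2 penalty (lambda = 1e-5 as in Sec. 3.2).
    Y, Y_hat: (N, C) one-hot labels and predicted probabilities."""
    ce = -np.sum(Y * np.log(Y_hat + 1e-12))
    l2 = lam * sum(np.sum(th ** 2) for th in params)
    return ce + l2
```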
3 EXPERIMENTS

3.1 Datasets and Sentiment Resources
We evaluate our model on the Movie Review (MR) dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and the Stanford Sentiment Treebank (SST, https://nlp.stanford.edu/sentiment/). MR consists of 5,331 positive and 5,331 negative samples, each annotated as positive or negative. SST contains 8,545 training samples, 1,101 validation samples, and 2,210 test samples, each annotated as very negative, negative, neutral, positive, or very positive. The sentiment lexicon used in this paper is the combination of two popular sentiment lexicons from [8] and [3]. For negation and intensity words, we collect them manually, since their number is small.

3.2 Experiment Setups
300-dimensional GloVe vectors (http://nlp.stanford.edu/projects/glove) are used to initialize the word embeddings in SAAN. Tokens that do not appear in the pre-trained word embeddings are initialized from the uniform distribution [-0.25, 0.25]. Bias vectors are initialized to zero, and all weight matrices are randomly sampled from orthogonal matrices. The window sizes of the multi-gram CNN are set to 1, 2, 3, and 4. SAAN is trained with mini-batch RMSprop. Other hyperparameters are: $L_2$ regularization coefficient $10^{-5}$ and batch size 60.
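As a small illustration of this setup, the sketch below builds the embedding table as described in §3.2: pre-trained GloVe vectors where available, and Uniform[-0.25, 0.25] for out-of-vocabulary tokens. The `glove` lookup dict and the function name are our own scaffolding.

```python
import numpy as np

def init_embeddings(vocab, glove, dim=300, seed=0):
    """Embedding table per Sec. 3.2. `glove` is assumed to be a dict
    mapping word -> 300-d vector loaded from the GloVe release."""
    rng = np.random.default_rng(seed)
    E = np.empty((len(vocab), dim))
    for i, w in enumerate(vocab):
        E[i] = glove[w] if w in glove else rng.uniform(-0.25, 0.25, dim)
    return E
```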
3.3 Experiment Results
In the experiments, we adopt classification accuracy as the evaluation metric, which is widely used in recent sentiment classification studies. Table 1 shows the performance comparison of SAAN with the baseline methods. From Table 1, we observe that SAAN substantially outperforms the compared methods by a noticeable margin on both the MR and SST datasets, demonstrating that our proposed model effectively improves sentence-level sentiment classification. Concretely, LR-LSTM and LR-Bi-LSTM consistently outperform RNTN, LSTM, BiLSTM, and CNN on both datasets; their advantage is that they incorporate the linguistic roles of sentiment, negation, and intensity words into deep neural networks. SAAN further improves sentiment analysis performance by designing attention mechanisms to integrate this knowledge into the deep neural network.
Methods                  MR       SST (sent.-level)
RNTN [10]                75.9%    45.7%
LSTM [2]                 77.4%    45.6%
BiLSTM                   79.3%    46.5%
Tree-LSTM [11]           80.7%    48.1%
CNN [4]                  81.5%    48.0%
NSCL [12]                82.9%    47.1%
LR-LSTM [8]              81.5%    48.3%
LR-Bi-LSTM [8]           82.1%    48.6%
Self-attention [6]       82.5%*   48.7%*
SAAN (our model)         84.3%    49.7%

Table 1: Evaluation results. The best result on each dataset is in bold. Results with * are obtained by our own implementation.
4 MODEL ANALYSIS
Effect of Sentiment Resources Attention. To further reveal the effect of sentiment resources attention, we compare the SAAN model with a CNN model without sentiment resources attention. The experimental results are shown in Figure 2. From Figure 2, we observe that our proposed SAAN model significantly outperforms the CNN model without sentiment resources attention, which verifies the effectiveness of the sentiment resources attention.

Figure 2: Ablation test results
Effect of Each Component of SAAN. To investigate the effect of each component of the SAAN model, we also perform an ablation test of SAAN, discarding in turn the word-level mutual attention module (denoted as SAAN-I), the phrase-level dynamic semantic attention module (SAAN-II), and the sentence-level multi-source attention module (SAAN-III). The results are reported in Figure 2. Generally, all three components contribute, with the word-level mutual attention module contributing most. This is within our expectation, since the word-level mutual attention module, which establishes the correlation between the context and the sentiment resources, is the basis of the subsequent sentence semantic modeling. The phrase-level dynamic semantic attention module also contributes greatly to sentence-level sentiment classification. We believe this is because the dynamic semantic attention module can select the sentiment-relevant phrase chunks that compose the sentiment-specific sentence representation.
5 CONCLUSION AND FUTURE WORK
In this paper, we propose a Sentiment-Aware Attention Network (SAAN) to boost the performance of convolutional neural networks, which incorporates many kinds of sentiment linguistic knowledge as a whole into the deep neural network. Experiments on the MR and SST datasets show the effectiveness of SAAN. As future work, instead of using the sentiment resource words as a whole attention source to model the sentiment-specific sentence semantics, we will consider using sentiment words, negation words and intensity words separately as attention sources to model the sentiment-specific sentence representation.

ACKNOWLEDGMENTS
We would like to thank all anonymous reviewers for their valuable comments. The corresponding author is Yujiu Yang. This work was supported in part by the Research Fund for the development of strategic emerging industries by Shenzhen city (No. JCYJ20160301151844537 and No. JCYJ20170412170118573).

REFERENCES
[1] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2016. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). In ICLR 2016.
[2] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to Forget: Continual Prediction with LSTM. (1999).
[3] Minqing Hu and Bing Liu. 2004. Mining and Summarizing Customer Reviews. In SIGKDD 2004.
[4] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP 2014.
[5] Zeyang Lei, Yujiu Yang, and Yi Liu. 2018. LAAN: A Linguistic-Aware Attention Network for Sentiment Analysis. In The 2018 Web Conference Companion.
[6] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-Attentive Sentence Embedding. In ICLR 2017.
[7] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification Using Machine Learning Techniques. In ACL 2002.
[8] Qiao Qian, Minlie Huang, Jinhao Lei, and Xiaoyan Zhu. 2017. Linguistically Regularized LSTM for Sentiment Classification. In ACL 2017.
[9] Bonggun Shin, Timothy Lee, and Jinho D. Choi. 2017. Lexicon Integrated CNN Models with Attention for Sentiment Analysis. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
[10] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank. In EMNLP 2013.
[11] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved Semantic Representations from Tree-Structured Long Short-Term Memory Networks. In ACL 2015.
[12] Zhiyang Teng, Duy-Tin Vo, and Yue Zhang. 2016. Context-Sensitive Lexicon Features for Neural Sentiment Analysis. In EMNLP 2016.
[13] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In NIPS 2017.