Bayesian Model Averaging and Model Selection for Polarity Classification

Federico Alberto Pozzi, Elisabetta Fersini, and Enza Messina
University of Milano-Bicocca, Viale Sarca 336, 20126 Milan, Italy
{federico.pozzi,fersini,messina}@disco.unimib.it

Abstract. One of the most relevant tasks in Sentiment Analysis is Polarity Classification. In this paper, we discuss how to explore the potential of ensembles of classifiers and propose a voting mechanism based on Bayesian Model Averaging (BMA). An important issue to be addressed when using ensemble classification is the model selection strategy. In order to help in selecting the best ensemble composition, we propose a heuristic aimed at evaluating the a priori contribution of each model to the classification task. Experimental results on different datasets show that Bayesian Model Averaging, together with the proposed heuristic, outperforms traditional classification methods and the well-known Majority Voting mechanism.

1 Introduction

According to the definition reported in [1], sentiment “suggests a settled opinion reflective of one’s feelings”. The aim of Sentiment Analysis (SA) is therefore to define automatic tools able to extract subjective information, such as opinions and sentiments, from texts in natural language, in order to create structured and actionable knowledge to be used by either a Decision Support System or a Decision Maker. The polarity classification task can be addressed at different granularity levels, such as the word, sentence and document level. The most widely studied problem is SA at document level [2], in which the naive assumption is that each document expresses an overall sentiment. When this does not hold, a lower granularity level of SA can be more useful and informative. In this work, polarity classification has been investigated at sentence level.

The main polarity classification approaches focus on identifying the most powerful model for classifying the polarity of a text source. However, an ensemble of different models can be less sensitive to noise and can provide a more accurate prediction [3]. Regarding SA, the study of ensembles is still in its infancy. This is mainly due to the difficulty of finding a reasonable trade-off between classification accuracy and increased computational time, which is particularly challenging when dealing with online and real-time big data. To the best of our knowledge, the existing voting approaches for SA are based on traditional methods such as Bagging [4] and Boosting [5], disregarding how to select the best ensemble composition. In this paper we propose a novel BMA approach that combines different models, selected using a dedicated model selection heuristic.

2 Bayesian Model Averaging

The idea behind a voting mechanism is to exploit the characteristics of several independent classifiers by combining them in order to achieve better performance than the best single classifier. The most popular ensemble model is Majority Voting (MV), which is characterized by an ensemble of “experts” that classifies the sentence polarity by considering the vote of each classifier as “equally important” and by determining the final polarity as the most popular label prediction [3]. Let C be a set of n independent classifiers and $l_i(s)$ the label assigned to a sentence s by classifier $i \in C$. Then, the optimal label $l_{MV}(s)$ is assigned as follows:

$$l_{MV}(s) = \begin{cases} \text{positive} & \text{if } \sum_{i \in C} l_i(s)^+ > \sum_{i \in C} l_i(s)^- \\ \text{negative} & \text{if } \sum_{i \in C} l_i(s)^+ < \sum_{i \in C} l_i(s)^- \\ \hat{l}(s) & \text{otherwise} \end{cases} \quad (1)$$

where $l_i(s)^+ = 1$ if the label assigned by i to s is positive (0 otherwise), $l_i(s)^- = 1$ if the label assigned by i to s is negative (0 otherwise), and $\hat{l}(s)$ is the label assigned to s by the “most expert” classifier, i.e. the classifier that ensures the highest accuracy.

A voting mechanism can be improved by explicitly taking into account the marginal distribution of each classifier prediction, and its overall reliability, when determining the optimal label. To this purpose, we propose a voting mechanism based on Bayesian Model Averaging (BMA) [6], where the weighted contribution of each classifier is used to make the final label prediction. This approach assigns to s the label $l_{BMA}(s)$ that maximizes:

$$P(l(s) \mid C, D) = \sum_{i \in C} P(l(s) \mid i)\,P(i \mid D) = \sum_{i \in C} P(l(s) \mid i)\,P(i)\,P(D \mid i) \quad (2)$$

where $P(l(s) \mid i)$ is the marginal distribution of the label predicted by classifier i, while $P(D \mid i)$ represents the likelihood of the training data D given i. The prior $P(i)$ of each classifier is assumed to be constant and can therefore be omitted. The distribution $P(D \mid i)$ can be approximated by the F1-measure obtained during a preliminary evaluation of classifier i:

$$P(D \mid i) \propto \frac{2 \times P_i(D) \times R_i(D)}{P_i(D) + R_i(D)} \quad (3)$$

where $P_i(D)$ and $R_i(D)$ denote the precision and recall obtained by classifier i. According to (2), we take into account the vote of each classifier by exploiting the prediction marginal instead of a 0/1 vote, and we tune this “probabilistic claim” according to the ability of the classifier to fit the training data. This approach allows the uncertainty of each classifier to be taken into account, avoiding over-confident inferences.
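To make the two voting rules concrete, the following minimal sketch contrasts them on a toy three-classifier ensemble. It illustrates Eqs. (1)-(3) rather than the system used in the experiments: the classifier names, marginals, F1 scores and accuracies below are hypothetical placeholders.

```python
# Minimal sketch of the two voting rules. All numbers are hypothetical:
# 'marginals' holds P(l(s)|i) for one sentence s, 'f1' approximates P(D|i)
# as in Eq. (3), and 'acc' is the held-out accuracy used for tie-breaking.
marginals = {
    "NB":  {"pos": 0.55, "neg": 0.45},
    "ME":  {"pos": 0.40, "neg": 0.60},
    "SVM": {"pos": 0.65, "neg": 0.35},
}
f1  = {"NB": 0.71, "ME": 0.74, "SVM": 0.75}
acc = {"NB": 0.70, "ME": 0.73, "SVM": 0.74}

def majority_voting(marginals, acc):
    """Eq. (1): each classifier casts one 0/1 vote; ties are broken by the
    'most expert' classifier, i.e. the one with the highest accuracy."""
    votes = {"pos": 0, "neg": 0}
    for i, m in marginals.items():
        votes[max(m, key=m.get)] += 1
    if votes["pos"] != votes["neg"]:
        return max(votes, key=votes.get)
    expert = max(acc, key=acc.get)
    return max(marginals[expert], key=marginals[expert].get)

def bma_voting(marginals, f1):
    """Eq. (2): marginals weighted by P(D|i); the prior P(i) is constant
    and therefore omitted from the sum."""
    score = {"pos": 0.0, "neg": 0.0}
    for i, m in marginals.items():
        for label, p in m.items():
            score[label] += p * f1[i]
    return max(score, key=score.get)

print(majority_voting(marginals, acc))  # -> 'pos' (two votes against one)
print(bma_voting(marginals, f1))        # -> 'pos' (weighted evidence 1.174 vs 1.026)
```

Because BMA pools the full marginals weighted by reliability, its decision can differ from MV when the hard votes are evenly split or when a dissenting classifier is both confident and reliable.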

3 Model Selection Strategy

An important issue related to voting mechanisms is the selection of the models to be included in an ensemble. Finding the best composition is a combinatorial optimization problem over a search space of dimension

$$\sum_{k=1}^{n} \frac{n!}{k!(n-k)!} = 2^n - 1$$

where n is the number of classifiers and k is the dimension of each potential ensemble. For example, to find the best ensemble given 10 classifiers, we would have to test more than 1000 potential ensembles ($2^{10} - 1 = 1023$) before determining the optimal one. In order to reduce the search space, we propose a heuristic that computes the discriminative contribution each classifier is able to provide with regard to the other classifiers. Given two classifiers i and j, i could help j to tag a sentence with the correct label l when:

1. j incorrectly labels the sentence s, but i correctly tags it. This is the most important contribution of i to the voting mechanism and represents how much i is able to positively correct j;
2. both i and j correctly label s. In this case, i enhances the weight used to choose the correct label.

On the other hand, i could also damage the ensemble in the following cases:

3. j correctly labels sentence s, but i incorrectly tags it. This is the most harmful contribution in a voting mechanism and represents how much i is able to negatively change the (correct) label tagged by j;
4. j incorrectly labels sentence s, which has also been misclassified by i. In this case, i cooperates to further decrease the weight that the voting mechanism uses to choose the correct label.

To formally represent the cases above, we consider $P(i = 1 \mid j = 0)$ as the number of instances correctly classified by i over the number of instances incorrectly classified by j (case 1), and $P(i = 1 \mid j = 1)$ as the number of instances correctly classified by i over the number of instances correctly classified by j (case 2). Analogously, $P(i = 0 \mid j = 1)$ is the number of instances misclassified by i over the number of instances correctly classified by j (case 3), and $P(i = 0 \mid j = 0)$ is the number of instances misclassified by i over the number of instances misclassified also by j (case 4). The contribution $r_i^C$ of each classifier $i \in C$ can then be estimated as:

$$r_i^C = \frac{\sum_{j \in C \setminus i} \sum_{k \in \{0,1\}} P(i = 1 \mid j = k)\,P(j = k)}{\sum_{j \in C \setminus i} \sum_{k \in \{0,1\}} P(i = 0 \mid j = k)\,P(j = k)} \quad (4)$$

where $P(j = k)$ is the prior probability that classifier j either correctly or incorrectly predicts labels. In particular, $P(j = 1)$ denotes the percentage of correctly classified instances (i.e. the accuracy), while $P(j = 0)$ represents the rate of misclassified instances (i.e. the error rate). Note that $r_i^C$ depends on the ensemble C of classifiers: starting from an initial set C, $r_i^C$ is iteratively re-computed, excluding at each iteration the classifier that achieves the lowest $r_i^C$. In order to define the initial ensemble, the baseline classifiers in C have to show some level of dissimilarity. This can be achieved using models that belong to different families (i.e. generative, discriminative and large-margin models). The proposed strategy allows us to reduce the search space from $\sum_{k=1}^{n} \frac{n!}{k!(n-k)!}$ to n − 1 potential candidates for determining the optimal ensemble. In fact, at each iteration the classifier with the lowest $r_i^C$ is disregarded until the smallest combination is reached; a sketch of this backward elimination is given at the end of this section.

The baseline classifiers considered in this paper are the following:

Dictionary-Based. A dictionary-based classifier (DIC) is the simplest, naive method for the polarity classification task. Given two dictionaries, one for negative and one for positive terms, the sentence polarity is determined by checking whether each sentence term belongs to the positive or to the negative dictionary, and finally applying the following aggregation function:

$$l(s) = \begin{cases} \text{positive} & \text{if } \#tokens^+ > \#tokens^- \\ \text{negative} & \text{otherwise} \end{cases} \quad (5)$$

Naïve Bayes. NB [7] is the simplest generative model that can be applied to the polarity classification task. It predicts the polarity label l given a vector representation of textual cues by exploiting Bayes’ Theorem.

Maximum Entropy. ME [8] is a discriminative model that has been largely adopted in the state of the art for polarity classification. It makes no assumption about the relationships between textual cues, which are modeled through several feature functions that may be overlapping and non-independent. In this study, the ME model is trained with feature functions that represent the unigrams within a sentence.

Support Vector Machines. SVMs [9] are linear learning machines that look for the optimal hyperplane discriminating samples of different classes, i.e. the one ensuring the widest margin.

Conditional Random Fields. CRFs [10] are a type of discriminative probabilistic graphical model. In this work, a linear-chain CRF has been applied at sentence level in order to model the sentiment flow within a paragraph, seen as a sequence of dependent sentences. Each sentence, assumed to be composed of a sequence of unigrams, is evaluated according to a set of binary feature functions able to capture local properties.
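The sketch below is one possible reading of the selection heuristic of Eq. (4), assuming access to a boolean correctness matrix built on validation data (one row per instance, one column per classifier); the helper names and the synthetic matrix are ours, not part of the paper.

```python
import numpy as np

def contribution(correct, i, members):
    """Eq. (4): ratio between how often classifier i is right and how often
    it is wrong, conditioned on each other member j of the ensemble."""
    num, den = 0.0, 0.0
    for j in members:
        if j == i:
            continue
        for k in (0, 1):
            mask = correct[:, j] == k
            p_jk = mask.mean()                            # P(j = k), prior of j
            if mask.any():
                num += (correct[mask, i] == 1).mean() * p_jk  # P(i=1|j=k) P(j=k)
                den += (correct[mask, i] == 0).mean() * p_jk  # P(i=0|j=k) P(j=k)
    return num / den if den > 0 else float("inf")        # never-wrong classifier

def backward_elimination(correct, names):
    """Iteratively drop the classifier with the lowest r_i^C, yielding the
    n - 1 candidate ensembles whose accuracy is then evaluated."""
    members = list(range(correct.shape[1]))
    candidates = []
    while len(members) >= 2:
        r = {i: contribution(correct, i, members) for i in members}
        candidates.append([names[i] for i in members])
        members.remove(min(r, key=r.get))                 # discard worst classifier
    return candidates

# Hypothetical correctness matrix: rows = instances, columns = classifiers
rng = np.random.default_rng(0)
correct = (rng.random((100, 5)) < [0.71, 0.66, 0.67, 0.68, 0.69]).astype(int)
for ens in backward_elimination(correct, ["DIC", "NB", "ME", "SVM", "CRF"]):
    print(ens)
```

Each pass evaluates Eq. (4) only for the current members, so exactly n − 1 candidate ensembles are produced, as stated above.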

4 Experimental Investigation

4.1 Experimental Setup

In this study, three benchmark datasets are considered.


The first one is the “Fine-grained Sentiment Dataset, Release 1”¹ (ProductData) [11], which contains product reviews about books, DVDs, electronics, music and video games. Although the original dataset comprises 5 polarity levels (POS, NEG, NEU, MIX and NR), the instances have been reduced in order to deal only with positive and negative opinions. The resulting dataset is unbalanced, composed of 1320 (58.84%) negative and 923 (41.16%) positive reviews.

The second dataset is the “Multi-Domain Sentiment Dataset”² (ProductDataMD) [12], which contains product reviews taken from Amazon.com about many product types (domains). Reviews carry star ratings (1 to 5 stars) that have been converted into nominal labels (‘neg’ for ratings lower than 3, ‘neu’ for ratings equal to 3 and ‘pos’ for ratings greater than 3). In this study, reviews from the categories ‘Music’ and ‘Books’ are studied separately. ProductDataMD is balanced, composed of 2000 reviews for each of the two categories.

The third dataset, known as the “Sentence polarity dataset v1.0”³ (MovieData) [13], is composed of 10662 snippets of movie reviews extracted from Rotten Tomatoes⁴. The main characteristics of this dataset, which comprises only positive and negative sentences, are the informal language adopted (slang and short forms) and the presence of noisy polarity labeling. 10-fold cross validation has been adopted as the evaluation criterion.

The rest of the experimental setup relates to classifier settings. The dictionary-based classifier exploits the polarity dictionary originally created by Hu and Liu⁵ (DictHuLiu) [14]. This lexicon is composed of 4783 negative and 2006 positive words; terms included in neither the positive nor the negative dictionary are not considered during the aggregation process. DictHuLiu was defined starting from a small seed set of opinion words concerning the domain of product reviews, then iteratively expanded by exploiting WordNet’s synsets and relationships to acquire other opinion words (including morphological variants and slang words). For NB and SVM, although different sentence representations (boolean, tf, tf-idf) have been considered, we report only the results based on the “best” weighting schema; to this purpose, only the investigations based on the boolean representation are shown. For training SVM, a linear kernel has been assumed, while the training of NB assumes a multinomial distribution; the NB and SVM experiments are based on LingPipe⁶ and LIBSVM⁷ respectively. Concerning ME and CRF, the models have been trained by maximizing the likelihood until convergence: in particular, ME is trained using multiple conditional likelihood [8] and CRF is induced exploiting regularized likelihood [10]. The ME and linear-chain CRF classifiers have been applied using the MALLET package⁸.

¹ http://www.sics.se/people/oscar/datasets/
² http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
³ www.cs.cornell.edu/people/pabo/movie-review-data/
⁴ http://www.rottentomatoes.com/
⁵ www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
⁶ http://alias-i.com/lingpipe/
⁷ http://www.csie.ntu.edu.tw/~cjlin/libsvm/
⁸ http://mallet.cs.umass.edu/

4.2 Computational Results

In this section, the performance achieved on the considered datasets by both the baseline classifiers and the ensemble methods (MV and BMA, described in Sect. 2) is presented. To this purpose, we measured Precision (P), Recall (R) and the F1-measure, defined as

$$P = \frac{TP}{TP + FP} \qquad R = \frac{TP}{TP + FN} \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R} \quad (6)$$

both for the positive and negative labels (in the sequel denoted by P+, R+, F1+ and P−, R−, F1− respectively). We also measured Accuracy, defined as

$$Acc = \frac{TP + TN}{TP + FP + FN + TN} \quad (7)$$
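As a quick reference, Eqs. (6)-(7) can be computed directly from confusion counts; the snippet below is a minimal sketch with made-up counts, where the negative-class metrics are obtained by swapping the roles of the two classes.

```python
def prf1(tp, fp, fn):
    """Eq. (6): precision, recall and F1 for one target class."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical confusion counts for a binary polarity classifier
tp, fp, fn, tn = 620, 180, 150, 1050

p_pos, r_pos, f1_pos = prf1(tp, fp, fn)   # positive class as target
p_neg, r_neg, f1_neg = prf1(tn, fn, fp)   # negative class (roles swapped)
acc = (tp + tn) / (tp + fp + fn + tn)     # Eq. (7)
print(f"P+={p_pos:.4f} R+={r_pos:.4f} F1+={f1_pos:.4f} Acc={acc:.4f}")
```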

Table 1. Performance of the baseline classifiers on ProductData

      DIC     NB      ME      SVM     CRF
P−    0.7561  0.7024  0.7033  0.6880  0.7292
R−    0.7545  0.7250  0.7758  0.8326  0.7621
F1−   0.7548  0.7121  0.7360  0.7530  0.7443
P+    0.6469  0.5872  0.6203  0.6550  0.6375
R+    0.6478  0.5565  0.5239  0.4565  0.5924
F1+   0.6463  0.5689  0.5638  0.5364  0.6120
Acc   0.7107  0.6558  0.6723  0.6781  0.6924

Table 2. Performance of DIC (best classifier), MV and BMA for the ensemble {DIC, ME, CRF} on ProductData

      DIC     MV      BMA
P−    0.7561  0.7603  0.7703
R−    0.7545  0.8265  0.8447
F1−   0.7548  0.7912  0.8050
P+    0.6469  0.7136  0.7401
R+    0.6478  0.6217  0.6348
F1+   0.6463  0.6621  0.6813
Acc   0.7107  0.7424  0.7585

Table 1 reports the performance achieved on ProductData. As expected, performance on the negative cases is higher than on the positive cases, because ProductData is unbalanced (1320 negative and 923 positive instances). This is also confirmed by MV and BMA, which achieve high recall on the negative cases and low recall on the positive ones (Table 2). The best single classifier, DIC, achieves 71.07% global accuracy (Table 1), while the best ensemble (composed of DIC, ME and CRF) achieves an accuracy of 74.24% with MV and 75.85% with BMA.

Table 3. Computation of r_i^C and ensemble accuracy on ProductData

Iteration  DIC    NB     ME     SVM    CRF    Accuracy
1          2.154  1.572  1.648  1.641  1.783  0.747
2          2.111  -      1.678  1.593  1.735  0.757
3          2.131  -      1.676  -      1.870  0.758
4          2.123  -      -      -      1.918  0.745

Fig. 1. Accuracy of baseline classifiers, MV and BMA on ProductData

The contribution of each classifier belonging to a given ensemble can be computed a priori by applying the model selection strategy. Starting from the initial set C = {DIC, NB, ME, SVM, CRF}, the classifiers are sorted with respect to their contribution by computing (4). As shown in Table 3, the classifier with the lowest contribution at the first iteration is NB. Then, (4) is re-computed on the ensemble C \ {NB}, highlighting SVM as the classifier with the lowest contribution. At iterations 3 and 4, the worst classifiers to be removed from the ensemble are ME and CRF respectively. As highlighted by the accuracy measure, the model selection heuristic is able to determine the optimal composition by evaluating only four ensemble candidates. In this case, the optimal solution is found at iteration 3, where the best ensemble is composed of {DIC, ME, CRF}. For the sake of completeness, the performance of all the ensembles is depicted in Figure 1, and the cumulative chart of accuracy is reported in Figure 2.

Fig. 2. Cumulative chart of accuracy on ProductData


Table 4. Performance of the baseline classifiers on ProductDataMD “books”

      DIC     NB      ME      SVM     CRF
P−    0.7381  0.6564  0.7387  0.7470  0.8041
R−    0.5640  0.7060  0.8100  0.7560  0.7550
F1−   0.6383  0.6795  0.7713  0.7502  0.7781
P+    0.6471  0.6837  0.7943  0.7538  0.7696
R+    0.7980  0.6300  0.7130  0.7410  0.8150
F1+   0.7142  0.6547  0.7495  0.7458  0.7912
Acc   0.6810  0.6680  0.7615  0.7485  0.7850

Table 5. Performance of the baseline classifiers on ProductDataMD “music”

      DIC     NB      ME      SVM     CRF
P−    0.8042  0.6652  0.8107  0.7251  0.7894
R−    0.4610  0.6360  0.7010  0.7230  0.7550
F1−   0.5839  0.6495  0.7508  0.7233  0.7711
P+    0.6229  0.6526  0.7383  0.7249  0.7663
R+    0.8870  0.6800  0.8360  0.7250  0.7980
F1+   0.7314  0.6654  0.7834  0.7241  0.7813
Acc   0.6740  0.6580  0.7685  0.7240  0.7765

Table 4 reports the performance achieved on ProductDataMD “books”. The accuracy improvement of the best BMA ensemble over the best single classifier is about 3.55%, while MV gains about 1.6%.

Fig. 3. Accuracy of baseline classifiers, MV and BMA on ProductDataMD “books”

Fig. 4. Cumulative chart of accuracy on ProductDataMD “books”


As shown in Table 6, also in this case the optimal ensemble is determined within a search space of four ensemble candidates. According to the proposed heuristic, the optimal combination is {DIC, ME, SVM, CRF}, found at iteration 2. This result can also be validated by looking at Figures 3 and 4.

Table 6. Computation of r_i^C and ensemble accuracy on ProductDataMD “books”

Iteration  DIC    NB     ME     SVM    CRF    Accuracy
1          1.859  1.654  2.473  2.117  2.644  0.808
2          1.816  -      2.480  1.959  2.460  0.820
3          -      -      2.315  1.696  2.165  0.809
4          -      -      2.219  -      2.616  0.813

The performance achieved on ProductDataMD “music” is reported in Table 5. The classifier achieving the highest performance is CRF, with 77.65% accuracy. Concerning the voting paradigms, while the accuracy of the best MV ensemble is 79.35%, BMA is able to reach 80.3%.

Fig. 5. Accuracy of baseline classifiers, MV and BMA on ProductDataMD “music”

This result can be read off Table 7, where the best ensemble for BMA is composed of DIC, ME, SVM and CRF (iteration 2).

Table 7. Computation of r_i^C and ensemble accuracy on ProductDataMD “music”

Iteration  DIC    NB     ME     SVM    CRF    Accuracy
1          1.700  1.579  2.565  1.982  2.608  0.797
2          1.638  -      2.558  1.852  2.457  0.803
3          -      -      2.537  1.644  2.211  0.794
4          -      -      2.337  -      2.472  0.786

Fig. 6. Cumulative chart of accuracy on ProductDataMD “music”

Table 8 shows the performance achieved by the baseline classifiers on MovieData. As mentioned in Sect. 4.1, the opinion words belonging to the dictionary concern the product review domain, which explains why DIC obtains low performance on this dataset. Although DIC is the classifier with the worst individual performance, the outperforming ensemble is composed of DIC, ME and CRF (78.55% accuracy with BMA and 78.08% with MV). This highlights again that the best ensemble is not necessarily composed of the classifiers that individually lead to the highest performance.

Table 8. Performance of the baseline classifiers on MovieData

      DIC     NB      ME      SVM     CRF
P−    0.6313  0.7104  0.7638  0.7395  0.7711
R−    0.7325  0.7023  0.7816  0.7559  0.7409
F1−   0.6780  0.7062  0.7713  0.7475  0.7555
P+    0.6810  0.7057  0.7738  0.7504  0.7508
R+    0.5717  0.7135  0.7523  0.7334  0.7795
F1+   0.6214  0.7095  0.7609  0.7417  0.7647
Acc   0.6521  0.7079  0.7670  0.7447  0.7602

In fact, the improvement with respect to the best single classifier, ME (76.70%), is close to 2% for MV and 3.55% for BMA (Figures 7 and 8). The selection of the best BMA ensemble can be easily derived from the application of the proposed heuristic, reported in Table 9. In conclusion, Figure 9 shows that BMA, together with the model selection strategy, ensures a significant performance improvement with regard to the studied baseline classifiers and MV.

Fig. 7. Accuracy of baseline classifiers, MV and BMA on MovieData

Fig. 8. Cumulative chart of accuracy improvement on MovieData

Table 9. Computation of r_i^C and ensemble accuracy on MovieData

Iteration  DIC    NB     ME     SVM    CRF    Accuracy
1          1.619  1.602  2.121  1.706  1.948  0.784
2          1.584  -      2.132  1.580  1.847  0.782
3          1.586  -      2.251  -      2.147  0.785
4          -      -      1.742  -      1.648  0.777

5 Conclusion

In this work we discussed how to explore the potential of ensembles of classifiers for sentence-level polarity classification and proposed an ensemble method based on Bayesian Model Averaging. We further proposed a heuristic, aimed at evaluating the a priori contribution of the single models to the classification task, that can be used to help in selecting the best ensemble composition. The experimental results show that the proposed solution is particularly effective and efficient, thanks to its ability to define a strategic combination of different classifiers a priori. An ongoing research direction is the extension of BMA to a wider range of labels: we are considering a hierarchical voting framework where the discrimination between ‘objective’ and ‘subjective’ is addressed first, before approaching the polarity classification of subjective expressions.

Fig. 9. Summary of accuracy comparison

References

1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval 2, 1–135 (2008)
2. Yessenalina, A., Yue, Y., Cardie, C.: Multi-level structured models for document-level sentiment classification. In: Proc. of the Conference on Empirical Methods in Natural Language Processing (2010)
3. Dietterich, T.G.: Ensemble learning. In: The Handbook of Brain Theory and Neural Networks, pp. 405–508. MIT Press (2002)
4. Whitehead, M., Yaeger, L.: Sentiment mining using ensemble classification models. In: Sobh, T. (ed.) Innovations and Advances in Computer Sciences and Engineering, pp. 509–514. Springer Netherlands (2010)
5. Xiao, M., Guo, Y.: Multi-view AdaBoost for multilingual subjectivity analysis. In: Proc. of the 24th International Conference on Computational Linguistics, COLING 2012, pp. 2851–2866 (2012)
6. Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T.: Bayesian model averaging: A tutorial. Statistical Science 14(4), 382–417 (1999)
7. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization, pp. 41–48 (1998)
8. McCallum, A., Pal, C., Druck, G., Wang, X.: Multi-conditional learning: Generative/discriminative training for clustering and classification. In: AAAI, pp. 433–439 (2006)
9. Cortes, C., Vapnik, V.: Support-vector networks. Machine Learning 20(3), 273–297 (1995)
10. Sutton, C.A., McCallum, A.: An introduction to conditional random fields. Foundations and Trends in Machine Learning 4(4), 267–373 (2012)
11. Täckström, O., McDonald, R.: Semi-supervised latent variable models for sentence-level sentiment analysis. In: Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 569–574 (2011)
12. Blitzer, J., Dredze, M., Pereira, F.: Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In: Proc. of the Annual Meeting of the Association for Computational Linguistics (2007)
13. Pang, B., Lee, L.: Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In: Proc. of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 115–124 (2005)
14. Hu, M., Liu, B.: Mining and summarizing customer reviews. In: Proc. of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168–177 (2004)