Improving Patient Opinion Mining through Multi-step Classification

Lei Xia¹, Anna Lisa Gentile², James Munro³, and José Iria¹

¹ Department of Computer Science, The University of Sheffield, UK
  {l.xia, j.iria}@dcs.shef.ac.uk
² Department of Computer Science, University of Bari, Italy
  [email protected]
³ Patient Opinion, http://www.patientopinion.org.uk/
  [email protected]

Abstract. Automatically tracking attitudes, feelings and reactions in on-line forums, blogs and news is a desirable instrument to support statistical analyses by companies, the government, and even individuals. In this paper, we present a novel approach to polarity classification of short text snippets, which takes into account the way data are naturally distributed into several topics in order to obtain better classification models for polarity. Our approach is multi-step: in the initial step a standard topic classifier is learned from the data and the topic labels, and in the ensuing step several polarity classifiers, one per topic, are learned from the data and the polarity labels. We empirically show that our approach improves classification accuracy on a real-world dataset by over 10% when compared against a standard single-step approach using the same feature sets. The approach is applicable whenever training material is available for building both topic and polarity learning models.

1 Introduction

Opinion mining, or opinion extraction, is a research topic at the crossroads of information retrieval and computational linguistics concerned with enabling automatic systems to determine human opinion from text written in natural language. Recently, opinion mining has gained much attention due to the explosion of user-generated content on the Web. Automatically tracking attitudes and feelings in on-line blogs and forums, and obtaining first-hand reactions to the news online, is a desirable instrument to support statistical analyses by companies, the government, and even individuals.

Patient Opinion (http://www.patientopinion.org.uk) is a social enterprise pioneering an on-line review service for users of the British National Health Service (NHS). The aim of the site is to enable people to share their recent experience of local health services online and, ultimately, help citizens change their NHS. Online feedback is sent to health service managers and clinicians, but, given the sheer volume of comments, it is difficult for a single person to read them all. Dedicated staff are needed to classify the comments received, both according to topic and polarity (i.e., as either positive, negative or neutral), and to provide managers with a summary report. However, the cost of employing such staff quickly becomes prohibitive as the number of comments grows, and this greatly limits the usefulness of the collected data.

Careful analysis of the on-line comments from Patient Opinion indicates that the polarity distribution of the comments is not independent of their topic, and that topic-specific textual cues provide valuable evidence for deciding about polarity. For example, patient comments about "parking" almost always express a negative opinion, presumably because when parking does not pose a problem people tend to forget to mention it; and in comments about "staff", the presence of words like "care" and "good" is highly indicative of positive experiences involving human contact with health care staff, whereas this is not necessarily the case for comments about other topics. Unfortunately, to the best of our knowledge, no previous work has exploited the topic distribution of the data in the design of automatic polarity classification systems.

In this paper, we present a novel machine learning-based approach to polarity classification of short text snippets, which takes into account the way data are naturally distributed into several topics in order to learn better classification models for polarity. Our approach is multi-step: in the initial step, a standard topic classifier is learned from the data and the topic labels; in the ensuing step, several polarity classifiers, one per topic, are learned from the data and the polarity labels. We empirically show that our approach improves classification accuracy on a real-world dataset from Patient Opinion by over 10% when compared against a standard single-step approach using the same feature sets. We also compare against a baseline multi-step approach in which the first step is unsupervised, which turned out to perform the worst of all approaches, further confirming our claim that the distribution according to topic is relevant for polarity classification. Our approach is applicable whenever training material is available for building both topic and polarity learning models.

The rest of the paper is structured as follows. A review of related work is given in the next section, including an examination of application domains and commonly used techniques for the polarity classification task. In Section 3 we describe our approach in detail. A complete description of the experimental setting is given in Section 4, and a discussion of the results obtained is presented in Section 5. Our conclusions and plans for future work close the paper.

2 Related Work

The research field of sentiment analysis deals with the computational treatment of sentiment, subjectivity and opinion in text. Polarity classification is a subtask of sentiment analysis which reduces the problem to identifying whether the text expresses positive or negative sentiment. Approaches to the polarity classification task mainly fall into two categories: symbolic techniques and machine learning techniques. Symbolic techniques are commonly based on rules and manually tuned thresholds, e.g., [8], and tend to rely heavily on external lexicons and other structured resources. Machine learning techniques have more recently been widely studied for this task, mostly with the application of supervised methods such as Support Vector Machines, Naïve Bayes and Maximum Entropy [1, 3, 7, 14, 16], but also unsupervised methods, such as clustering, e.g., [10]. In this paper, we use an off-the-shelf supervised learning algorithm within the context of a meta-learning strategy, which we propose in order to take advantage of the availability of labeled data along two dimensions (topic and polarity) to improve classifier accuracy.

Our meta-learning strategy resembles work on hierarchical document classification [11, 4]. There, the idea is to exploit the existence of a document topic hierarchy to improve classification accuracy, by learning classification models at several levels of the hierarchy instead of one single "flat" classifier. Inference of a document's category is performed in several steps, using the sequence of classifiers from the top to the bottom of the hierarchy. The key insight is that each of the lower-level classifiers solves a simpler problem than the original flat classifier, because it only needs to distinguish between a small number of classes. Our approach is similar in that we perform classification in two steps, building separate polarity classifiers per topic, each of which also solves an easier problem than a single topic-unaware polarity classifier.

Machine learning-based work on polarity classification typically formulates a binary classification task to discriminate between positive and negative sentiment text, e.g., [7]. However, some works consider neutral text as well, hence addressing the additional problem of discriminating between subjective and objective text before classifying polarity, e.g., [15]. Niu et al. [13] even consider a fourth class, namely "no-outcome". In this work, we follow the traditional formulation and consider the problem of classifying subjectivity vs. objectivity to be outside the scope of the paper.

A significant amount of work has also been done on studying which features work well for the polarity classification task. Abbasi et al. [1] provide a categorization of the features typically employed, dividing them into four top-level categories: syntactic, semantic, link-based and stylistic. Syntactic features are the most commonly used and range from the plain word token to features obtained by natural language processing, e.g., lemmas. Semantic features result from additional processing of the text, typically relying on external resources or lexicons designed for this task, such as SentiWordNet [6]. Link-based features [5, 2] are derived from the link structure of the (hyperlinked) documents in the corpus. Examples of stylistic features include vocabulary richness, greetings, sentence length and word usage frequency [1]. In this paper, it suffices to make use of standard syntactic features from the literature, since the focus is on drawing conclusions about the proposed multi-step approach.

Sentiment analysis has been applied to numerous domains, including news articles [9], web forums [2] and product reviews [7, 15]. Our problem domain is most similar to the latter, since we also analyse reviews, but we deal with the feedback provided by patients of the British NHS, spanning topics such as staff, food and parking quality.

3 Multi-step Polarity Classification

We introduce a novel multi-step approach to polarity classification, which exploits the availability of labels along the dimensions of topic and polarity for the same data. Following the intuition that polarity is not statistically independent of topic, our hypothesis is that learning topic-specific polarity classification models can help improve the accuracy of polarity classification systems.

The proposed multi-step method is the following. Given a document collection $D = D_T \cup D_P$, a set of topics $T$, and training labels given by $L_T : D \to T$ and $L_P : D \to \{-1, 1\}$, for topic and polarity respectively, do:

1. Learn a topic classifier from $D_T$ using labels $L_T$, by approximating a classification function of the form $f_T : D \to T$;
2. Apply $f_T$ to $D_P$, thereby splitting $D_P$ by topic, that is, creating sub-datasets $D_{Pt} = \{d \in D_P : f_T(d) = t\}$, for all $t \in T$;
3. For each $t \in T$, learn a polarity classifier from $D_{Pt}$ using labels $L_P$, by approximating a classification function of the form $f_{Pt} : D \to \{-1, 1\}$;
4. Classify any previously unseen input document $d$ by $f_{P f_T(d)}(d)$, that is, topic classification determines which polarity classifier to run.
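To make the training and inference flow concrete, the following is a minimal sketch in Python. It uses scikit-learn's CountVectorizer and MultinomialNB as stand-ins for the feature extraction and Naïve Bayes learner described below (the paper's experiments used LingPipe), and the class and variable names are illustrative, not the authors' implementation.

```python
from collections import defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

class MultiStepPolarityClassifier:
    """Two-step classifier: topic first, then a per-topic polarity model."""

    def fit(self, docs_T, topic_labels, docs_P, polarity_labels):
        # Step 1: learn the topic classifier f_T from D_T and L_T.
        self.topic_clf = make_pipeline(CountVectorizer(), MultinomialNB())
        self.topic_clf.fit(docs_T, topic_labels)

        # Step 2: split D_P by *predicted* topic, creating sub-datasets D_Pt.
        by_topic = defaultdict(lambda: ([], []))
        for doc, pol, t in zip(docs_P, polarity_labels,
                               self.topic_clf.predict(docs_P)):
            by_topic[t][0].append(doc)
            by_topic[t][1].append(pol)

        # Step 3: learn one polarity classifier f_Pt per topic t.
        self.polarity_clfs = {}
        for t, (sub_docs, sub_pols) in by_topic.items():
            clf = make_pipeline(CountVectorizer(), MultinomialNB())
            clf.fit(sub_docs, sub_pols)
            self.polarity_clfs[t] = clf
        return self

    def predict(self, doc):
        # Step 4: the predicted topic selects which polarity classifier to run.
        t = self.topic_clf.predict([doc])[0]
        return self.polarity_clfs[t].predict([doc])[0]
```

A practical refinement, omitted here for brevity, would be to fall back to a single global polarity classifier for topics with too few training comments, and for test documents whose predicted topic never occurred during training.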

We use a standard multinomial Naïve Bayes approach for learning the classifiers. In the Naïve Bayes probabilistic framework, a document is modeled as an ordered sequence of word events drawn from a vocabulary $V$, and the assumption is that the probability of each word event is independent of the word's context and position in the document. Each document $d_i \in D$ is labeled with a class $c_j \in C$ and drawn from a multinomial distribution of words with as many independent trials as its length. This yields the familiar "bag of words" document representation. The probability of a document given its class is the multinomial distribution [12]:

$$P(d_i|c_j;\theta) = P(|d_i|)\,|d_i|!\,\prod_{t=1}^{|V|} \frac{P(w_t|c_j;\theta)^{N_{it}}}{N_{it}!} \quad (1)$$

where $N_{it}$ is the count of the number of times word $w_t$ occurs in document $d_i$, and $\theta$ are the parameters of the generative model. Estimating the parameters of this model from a set of labeled training data consists in estimating the probability of word $w_t$ in class $c_j$, as follows (using Laplacian priors):

$$P(w_t|c_j;\theta) = \frac{1 + \sum_{i=1}^{|D|} N_{it}\,P(c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N_{is}\,P(c_j|d_i)} \quad (2)$$

The class prior parameters are set by the maximum likelihood estimate:

$$P(c_j|\theta) = \frac{\sum_{i=1}^{|D|} P(c_j|d_i)}{|D|} \quad (3)$$

Given these estimates, a new document can be classified by using Bayes' rule to turn the generative model around and calculate the posterior probability that a class would have generated that document:

$$P(c_j|d_i;\theta) = \frac{P(c_j|\theta)\,P(d_i|c_j;\theta)}{\sum_{k=1}^{|C|} P(c_k|\theta)\,P(d_i|c_k;\theta)} \quad (4)$$

Classification becomes a simple matter of selecting the most probable class.
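As a concrete illustration of equations (2)–(4), the following self-contained sketch estimates the Laplace-smoothed word probabilities and class priors from hard labels, so that $P(c_j|d_i)$ reduces to an indicator of the labeled class, and then classifies a new document in log space. It is a didactic re-implementation under those assumptions, not the LingPipe code used in the experiments; function names are illustrative.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Estimate Laplace-smoothed parameters from tokenised docs (eqs. 2 and 3)."""
    vocab = {w for doc in docs for w in doc}
    classes = set(labels)
    word_counts = {c: Counter() for c in classes}   # sum_i N_it, per class
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    totals = {c: sum(word_counts[c].values()) for c in classes}
    # Eq. (3): class priors from labeled document counts.
    class_counts = Counter(labels)
    log_prior = {c: math.log(class_counts[c] / len(docs)) for c in classes}
    # Eq. (2): Laplace-smoothed word probabilities per class.
    log_cond = {c: {w: math.log((1 + word_counts[c][w]) / (len(vocab) + totals[c]))
                    for w in vocab}
                for c in classes}
    return log_prior, log_cond, vocab

def classify(doc, log_prior, log_cond, vocab):
    """Pick the most probable class via Bayes' rule (eq. 4), in log space."""
    scores = {c: log_prior[c] + sum(log_cond[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```

Working in log space sidesteps the underflow that the raw products in equations (1) and (4) would otherwise cause on longer documents; the normalising denominator of equation (4) is constant across classes and can be dropped when only the arg max is needed.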

4 Experiments

A set of experiments was designed and conducted to validate our hypothesis, using 1200 patient comments provided by Patient Opinion. The comments consist of short texts manually annotated by Patient Opinion experts with an indication of polarity (either positive or negative) and one of the following eight topics: Service, Food, Clinical, Staff, Timeliness, Communication, Environment, and Parking.

Each comment is represented as a vector of frequencies of word stem unigrams. The word stems are obtained by running the OpenNLP tokeniser (http://opennlp.sourceforge.net/), removing stop words and applying the Java Porter Stemmer (http://tartarus.org/~martin/PorterStemmer/java.txt).
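A minimal Python sketch of this preprocessing follows; it substitutes a plain regular-expression tokeniser, a small illustrative stop-word list and NLTK's Porter stemmer for the Java tools named above, so the exact token stream will differ slightly from the original pipeline.

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer

# Illustrative stop-word list; the actual list used in the paper is not specified.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "was", "is", "it"}
stemmer = PorterStemmer()

def comment_to_unigram_counts(text):
    """Tokenise, drop stop words, stem, and count word-stem unigrams."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)

# Example on an invented comment in the style of the dataset:
print(comment_to_unigram_counts("The nurses were caring and the parking was terrible"))
# Counter({'nurs': 1, 'were': 1, 'care': 1, 'park': 1, 'terribl': 1})
```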

The accuracy of the proposed method is compared against three baseline methods:

– Single-step. Learns $f_P : D \to \{-1, 1\}$ directly from $D$, without any knowledge about topic.
– Topic as feature. Same as single-step, but introduces knowledge about the topic from the labels $L_T$ as a feature for learning.
– Clustering as first step. Here an approximation of the true $f_T : D \to T$ is created in an unsupervised way by clustering $D$ according to the traditional cosine similarity between the vector representations of the comments, without any knowledge about topic.

In order to understand the influence of the topic classifier, used in the first step of our proposed method, on the overall system accuracy, we simulate a noise process over the topic labels $L_T$. Concretely, given a noise ratio $r \in [0, 1]$, let $S$ be a random subset of $D$ such that $|S| = r \times |D|$. We define noisy topic labels $L'_T : D \to T$ as:

$$L'_T(d) = \begin{cases} L_T(d), & \text{if } d \in D \setminus S;\\ \text{a randomly chosen } t \in T \setminus \{L_T(d)\}, & \text{if } d \in S. \end{cases}$$

By varying $r$, it is possible to study the effect of the quality of the underlying topic classifier, since $L'_T$ mimics the (imperfect) output of a learned $f_T$; a sketch of this noise process is given at the end of this section.

Finally, for the Naïve Bayes algorithm we used the off-the-shelf implementation in LingPipe (http://alias-i.com/lingpipe/), while for clustering we used our own implementation of the well-known k-means algorithm, setting $k = 8$ to match the number of topics.
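The noise process itself is straightforward; a minimal sketch, assuming the topic labels are stored as a list of strings, is given below (the function and variable names are illustrative):

```python
import random

def add_label_noise(topic_labels, topics, r, seed=0):
    """Flip a fraction r of topic labels to a different, randomly chosen topic."""
    rng = random.Random(seed)
    noisy = list(topic_labels)
    flip_idx = rng.sample(range(len(noisy)), int(r * len(noisy)))  # the subset S
    for i in flip_idx:
        wrong_topics = [t for t in topics if t != noisy[i]]  # T \ {L_T(d)}
        noisy[i] = rng.choice(wrong_topics)
    return noisy
```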

5 Results and Discussion

We measure the systems' accuracy using the standard information retrieval measures of precision, recall and F-measure. All the results presented in Figure 1 consist of the micro-averaged F-measure over 10 trials with 60/40 random splits (60% of the comments used for training and 40% used for testing); the evaluation loop is sketched at the end of this section.

The single-step approach achieved an F1 value of 65.6%. The attempt to improve it by introducing the topic from $L_T$ as an additional feature for the learning task yielded a small improvement of roughly 1%. The proposed multi-step approach provides a much more significant improvement over the previous ones: the upper bound, given by the use of a perfect topic classifier, is at 77% F1, but even with a noise ratio of up to 60% the multi-step approach obtains better results than the single-step approaches. Thus, we can conclude that a topic classifier of reasonable accuracy, easily obtainable with the current state of the art, is sufficient to achieve a significant improvement in polarity classification.

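For concreteness, the evaluation protocol (not the original code, whose implementation the paper does not describe) can be sketched with scikit-learn's splitting and micro-averaged F1 utilities; `build_and_fit` is an assumed factory that could wrap the `MultiStepPolarityClassifier` sketched in Section 3.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def evaluate(docs, topic_labels, polarity_labels, build_and_fit, n_trials=10):
    """Micro-averaged F1 over repeated 60/40 random train/test splits."""
    scores = []
    for trial in range(n_trials):
        (docs_tr, docs_te,
         top_tr, _top_te,
         pol_tr, pol_te) = train_test_split(
            docs, topic_labels, polarity_labels,
            train_size=0.6, random_state=trial)
        model = build_and_fit(docs_tr, top_tr, pol_tr)
        preds = [model.predict(d) for d in docs_te]
        scores.append(f1_score(pol_te, preds, average="micro"))
    return sum(scores) / len(scores)
```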

Fig. 1. F-measure, micro-averaged over all polarity classes and trials, obtained by the proposed multi-step approach (MS) by varying the ratio of noise introduced into the topic labels. Indicated for the purposes of comparison are also the results obtained by the “single-step” approach (SS), the “topic as feature” approach (SST) and the “clustering as first step” approach (MSC).

The "clustering as first step" approach, which clusters documents according to cosine similarity, performed worse than the original single-step approach, obtaining roughly 60% F1; the lower bound, at around 40%, is given by a fully random allocation of documents to topics. This further confirms our claim that the distribution according to topic is relevant for polarity classification.

6 Conclusions and Future Work

Exploiting the availability of training material for building both topic and polarity learning models, in this paper we proposed a meta-learning strategy which performs classification in two steps: classifying documents according to topic as a first step, and then building separate polarity classifiers per topic. Our experimental results show that our approach, evaluated against three baseline methods, conspicuously improves polarity classification accuracy, thus supporting our intuition that the topic distribution has a significant influence on polarity classification. As future work, we will work in close cooperation with Patient Opinion to further improve the accuracy of the system, both by adding further training material to refine the learning models and by investigating which learning features could further improve the approach presented here. The end goal is to introduce the polarity classification system into Patient Opinion's work processes, delivering a crucial cost-cutting functionality.

References

1. A. Abbasi, H. Chen, and A. Salem. Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Trans. Inf. Syst., 26(3), 2008.
2. R. Agrawal, S. Rajagopalan, R. Srikant, and Y. Xu. Mining newsgroups using networks arising from social behavior. In WWW '03: Proceedings of the 12th International Conference on World Wide Web, pages 529–535, New York, NY, USA, 2003. ACM.
3. E. Boiy, P. Hens, K. Deschacht, and M.F. Moens. Automatic sentiment analysis in on-line text. In L. Chan and B. Martens, editors, ELPUB, pages 349–360, 2007.
4. S.T. Dumais and H. Chen. Hierarchical classification of web content. In SIGIR, pages 256–263, 2000.
5. M. Efron. Cultural orientation: Classifying subjective documents by cocitation analysis. In Proceedings of the 2004 AAAI Fall Symposium on Style and Meaning in Language, Art, Music, and Design, pages 41–48, 2004.
6. A. Esuli and F. Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation (LREC 2006), pages 417–422, 2006.
7. M. Gamon. Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis. In COLING '04: Proceedings of the 20th International Conference on Computational Linguistics, page 841, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
8. S. Gindl and J. Liegl. Evaluation of different sentiment detection methods for polarity classification on web-based reviews. In Proceedings of the 18th European Conference on Artificial Intelligence (ECAI 2008), ECAI Workshop on Computational Aspects of Affectual and Emotional Interaction, Patras, Greece, 2008.
9. H. Yu and V. Hatzivassiloglou. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pages 129–136, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
10. V. Hatzivassiloglou and K.R. McKeown. Predicting the semantic orientation of adjectives. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 174–181, Morristown, NJ, USA, 1997. Association for Computational Linguistics.
11. D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In D.H. Fisher, editor, ICML, pages 170–178. Morgan Kaufmann, 1997.
12. A. McCallum and K. Nigam. A comparison of event models for Naive Bayes text classification. In AAAI Workshop on Learning for Text Categorization, 1998.
13. Y. Niu, X. Zhu, and G. Hirst. Using outcome polarity in sentence extraction for medical question-answering. AMIA Annu Symp Proc, pages 599–603, 2006.
14. Y. Niu, X. Zhu, J. Li, and G. Hirst. Analysis of polarity information in medical text. AMIA Annu Symp Proc, pages 570–574, 2005.
15. B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the ACL, pages 271–278, 2004.
16. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 79–86, 2002.