Sentiment Classification of Movie Review Comments Using Naive Bayesian Model
University of Computer Studies, Yangon
Wai Nwe Tun 11 January, 2016
Contents

1 Introduction
  1.1 Sentiment Classification
  1.2 Brief Overview of the System
  1.3 Dataset
  1.4 Objectives
  1.5 Preview of the Contents
2 Related Work
  2.1 Recent Work
3 BACKGROUND THEORY
  3.1 Sentiment Classification
    3.1.1 Opinion Definition
    3.1.2 Different Levels of Sentiment Classification
    3.1.3 Sentiment Classification using Supervised Learning
  3.2 POS Tagging
  3.3 Negative Handling
  3.4 Features Selection
    3.4.1 Information Gain
  3.5 Naive Bayesian Classifier
    3.5.1 Bernoulli Naïve Bayesian Model
    3.5.2 Multinomial Naïve Bayesian Model
    3.5.3 Avoiding Zero Probability
  3.6 Evaluation Methods
4 DESIGN AND IMPLEMENTATION
  4.1 Operating Environment
  4.2 Proposed System
  4.3 Preprocessing Steps
    4.3.1 POS Tagging
    4.3.2 Negative Handling
    4.3.3 Extraction of Adjectives, Adverbs, Verbs, and Nouns
    4.3.4 Feature Selection using Information Gain
  4.4 Bernoulli Naive Bayesian Classification
  4.5 Sentiment Classification User Interface
5 EXPERIMENTATION RESULTS
  5.1 Hardware and Software
  5.2 Dataset Nature
  5.3 Evaluation Results
6 CONCLUSION
  6.1 Limitation
  6.2 Future Work
  6.3 Advantages
List of Figures

3.1 General Flow of Negative Handling [1]
4.1 Sentiment Classification System Architecture
4.2 POS Tagging Result
4.3 POS Tagging Result
4.4 POS Tagging Result
4.5 DB: Information Gains of Features
4.6 Information Gains of Features
4.7 DB: Prior Probabilities of Features
4.8 Prior Probabilities of Features
4.9 Results of Testing Dataset
4.10 Sentiment Classification User Interface
5.1 Evaluation Comparisons between Features with Nouns and Features without Nouns
5.2 Comparison of Accuracy on Different Portions of Extracted Features
5.3 Comparison of Precision (pos) on Different Portions of Extracted Features
5.4 Comparison of Precision (neg) on Different Portions of Extracted Features
5.5 Comparison of Recall (pos) on Different Portions of Extracted Features
5.6 Comparison of Recall (neg) on Different Portions of Extracted Features
5.7 Comparison of Accuracy on Different Portions of Extracted Features (Negative Handling)
List of Tables

3.1 Penn Treebank Part-of-Speech (POS) Tags [2]
5.1 Numbers of Words in Each Class in the Entire Dataset [3]
5.2 Numbers of Words in Each Class in the Training Dataset
5.3 Evaluation Results
5.4 Evaluation Results on Different Portions of Extracted Features with Nouns
5.5 Evaluation Results on Different Portions of Extracted Features without Nouns
5.6 Evaluation Results on Different Portions of Extracted Features without Nouns (Without Performing Negative Handling)
ACKNOWLEDGMENTS
I would like to express my sincere gratitude to the following persons who have implicitly or explicitly contributed to this master thesis and encouraged me to complete this dissertation work. Firstly, I would like to extend my respectful thanks to Dr. Mie Mie Thet Thwin, Rector of the University of Computer Studies, Yangon, for her kind permission to submit this thesis. Secondly, my heartfelt gratitude goes to my supervisor, Dr. Khin Nwe Ni Tun, an Associate Professor of the University of Computer Studies, Yangon. Without her support, I would not have accomplished this work. I also personally thank Dr. Nang Saing Moon Kham, a Professor of the University of Computer Studies, for her wonderful data mining lectures. Besides, I would love to express my regardful thanks to all my teachers, from kindergarten to university, as well. Thirdly, I would like to give my thank-and-love-you note to all my friends, especially Chit, Su Aung, Pwint, and Suu Thadar, for encouraging me all the time and showing the greatness of true friendship. Last but not least, I would like to express my love and gratitude to my parents and my elder brothers for their limitless mental and physical support in every aspect of my life.
ABSTRACT
Sentiment classification is the task of determining whether a given text holds a positive, negative, or neutral sentiment. This thesis performs sentiment classification of already-processed movie review comments (the IMDb movie review dataset) using the Bernoulli Naive Bayesian model, which can differentiate positive reviews from negative ones. To select useful features for classification, several preprocessing steps are performed: POS tagging, negative handling, extraction of particular POS categories (adjectives, adverbs, verbs, and nouns), and feature selection by Information Gain. Instead of using frequency counts of features, the system simply uses presence/absence features. To save running time, it never physically generates a set-of-words model containing each feature with its status (presence or absence) for each review; instead, it applies the logic of the set-of-words model when calculating the information gain of each feature. The whole system is implemented by the author, except for the POS tagger (the Stanford POS tagger) and other external, non-coded resources. Performance on the processed reviews is evaluated in terms of accuracy, precision, and recall. Besides, the methods are evaluated with different feature-set sizes to show that performance depends on the size of the training feature set, and the evaluation highlights how the system improves when nouns are excluded from the set of extracted features. However, the system is biased toward negative sentiment because of the non-uniform sets of features contained in the positive and negative comments, and because Bernoulli Naive Bayes is statistically sensitive to the number of features when classifying comments. To conclude, the system performs document-level sentiment classification, and its application domain is movie reviews.
CHAPTER 1
Introduction
It cannot be denied that the Internet plays a vital role in the daily lives of the majority of people. They use the Internet not only for consuming information but also for generating loads of data. A typical example is online shopping. Nobody buys an item simply because he or she likes it. Before buying something online, people make sure that the page owner is reliable in terms of item quality, shipping speed, and other factors, which can easily be done by checking the customer reviews. Some sites let a user write feedback and then manually set how many stars to give. However, it is awkward for the user to not only provide feedback in text but also select a grade for the item; there should be an automatic tool that can analyze whether the feedback expresses a positive or negative opinion. This kind of requirement is fulfilled by the field of sentiment classification. Thus, sentiment classification aims to analyze textual data to determine how an individual feels about a particular topic. This kind of classification can be seen in many domains, such as online shopping via customer feedback, measurement of movie quality via movie reviews, public trends on Twitter via tweets, and so on. The organization of this chapter is as follows: Section 1.1 describes sentiment and its classification; Section 1.2 gives a brief overview of the system; Section 1.3 introduces the dataset; Section 1.4 states the objectives of this thesis; and Section 1.5 previews the contents.
1.1 Sentiment Classification
Sentiment can be described as emotions, or as judgments, opinions, attitudes, or ideas prompted or colored by emotions. Textual information can be divided into two types, objective (factual) and subjective (opinionated), of which only the latter can express sentiment. Given a set of documents D containing opinions, it can be determined whether each document d expresses a positive or negative opinion about an object. [2] Existing research in this area assumes that an opinionated document d (e.g., a movie review) contains opinions about a single object, although forum and blog posts may contain many opinions on multiple products and services. The binary classification task of labeling an opinionated document as expressing either an overall positive or an overall negative opinion is called sentiment polarity classification or polarity classification, which is also termed sentiment classification or opinion mining in the literature. Sentiment classification can be performed at three different levels: document level, sentence level, and aspect level. [2] At the document and sentence levels, the task is to classify the sentiment of the whole document and of each sentence, respectively. The aspect level is based on the idea that an opinion consists of a sentiment and a target, and the task is to classify a sentiment together with its associated target. The main difficulty in sentiment analysis and classification is that an opinion word treated as positive in one situation may be considered negative in another. Next, people's ways of expressing a situation vary. Traditional text processing assumes that a small change between two pieces of text does not change the meaning. In sentiment analysis, however, the statement "the movie is good" is very different from "the movie is not good". Most statements contain both positive and negative words.
[4] Next, sentiment classification has a domain-dependent nature; that is, the same word can be positive in one domain while reflecting a negative meaning in another. For example, in 'the vacuum cleaner sucks' and 'this movie sucks', the former 'sucks' reflects a positive sentiment whereas the latter reflects a negative one. Besides, opinions can be expressed in subtle and complex ways, involving the use of slang, ambiguity, sarcasm, irony, and idiom.
Sentiment classification is performed for several reasons, such as to track the ups and downs of aggregate attitudes to a brand or product, to compare attitudes of online customers between one brand or product and another, to pull out examples of particular types of positive or negative statement on some topic, to help other potential customers make informed choices, and to help organizations enhance customer relationship management. [5]
1.2 Brief Overview of the System
Out of the three different levels, this thesis focuses on document-level sentiment classification of movie review comments, in which the task is to classify whole comments according to their sentiment, positive or negative. The work of sentiment classification can be divided into two main stages: preprocessing steps and a classification step. The preprocessing steps involve extraction of features with particular parts of speech, namely adjectives, adverbs, verbs, and nouns, using a linguistic and statistical technique (the Stanford POS tagger), as well as selection of key features by Information Gain. Before the selection step, each review is updated with an extra symbol '!' by simple negative handling. Then, using Bernoulli Naïve Bayesian classification, the thesis presents a model that distinguishes positive reviews from negative reviews using the key features selected in the preprocessing steps. For the evaluation part, the work shows classification with varied numbers of features as well as different sets of features, highlighting which part of speech is not so important to consider in the classification: the noun.
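The negative-handling step mentioned above can be sketched as follows. This is only an illustration of the idea of marking negated words with '!'; the exact scope rule (negating tokens until the next punctuation mark) is an assumption, not the author's implementation.

```python
import re

# Assumed set of negation cues; the thesis does not list them explicitly.
NEGATION_WORDS = {"not", "no", "never", "n't", "cannot"}

def negative_handling(tokens):
    """Prefix tokens that follow a negation word with '!' until a
    punctuation mark ends the negated scope (a simple heuristic)."""
    out, negate = [], False
    for tok in tokens:
        if tok in NEGATION_WORDS:
            negate = True
            out.append(tok)
        elif re.match(r"[.,!?;:]", tok):
            negate = False
            out.append(tok)
        else:
            out.append("!" + tok if negate else tok)
    return out

tokens = "the movie is not good , but the acting is great".split()
print(negative_handling(tokens))
# ['the', 'movie', 'is', 'not', '!good', ',', 'but', 'the', 'acting', 'is', 'great']
```

With such markers in place, '!good' and 'good' become distinct features, so the later feature selection and classification steps can keep negated and non-negated uses of a word apart.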
1.3 Dataset
The dataset used in this thesis is the movie review polarity dataset v2.0. It consists of 1000 positive and 1000 negative processed reviews and was introduced by Pang and Lee in [6].
For the classification and learning phase, 700 reviews from each class are used, and the remaining ones are kept for validation and evaluation purposes.
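The 700/300 per-class split described above can be sketched as follows; the shuffling, seed, and list-based representation are assumptions for illustration, not details taken from the thesis.

```python
import random

def split_dataset(pos_reviews, neg_reviews, n_train=700, seed=0):
    """Split each class into n_train training reviews and the
    remainder for validation/evaluation, as described in the text."""
    rng = random.Random(seed)
    train, test = [], []
    for label, reviews in (("pos", pos_reviews), ("neg", neg_reviews)):
        shuffled = reviews[:]          # copy so the input is untouched
        rng.shuffle(shuffled)
        train += [(r, label) for r in shuffled[:n_train]]
        test += [(r, label) for r in shuffled[n_train:]]
    return train, test

# Stand-in data: 1000 reviews per class, as in polarity dataset v2.0.
pos = [f"pos review {i}" for i in range(1000)]
neg = [f"neg review {i}" for i in range(1000)]
train, test = split_dataset(pos, neg)
print(len(train), len(test))  # 1400 600
```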
1.4 Objectives
The main objectives of this thesis are:
1. To provide a classification model that can differentiate movie review comments as positive or negative,
2. To track customer feedback, and
3. To enhance customer relationship management for online businesses.
1.5 Preview of the Contents
In this section, each chapter is summarized. Chapter 1 presents the introduction, discussing sentiment classification in terms of sentiment and its challenges, giving a brief overview of the system, and describing the objectives. Chapter 2 reviews related work on sentiment classification and analysis undertaken with machine learning techniques. Chapter 3 describes the background theory relevant to the sentiment classification system. Chapter 4 covers the design and implementation of the system, illustrating certain design and implementation aspects. Chapter 5 presents the experimental results in full. Chapter 6 concludes this thesis.
CHAPTER 2
Related Work
Work on sentiment classification (opinion mining) can be classified into two main research approaches: statistical and semantic. [7] The statistical approach makes use of learning techniques to classify the semantic polarity of an opinion into positive, negative, or neutral classes by approximating the value of its intensity. The approximation techniques vary from supervised to unsupervised learning, typically probabilistic methods (Naive Bayes, Maximum Entropy), linear discrimination (Support Vector Machines), and non-parametric classifiers (K-Nearest Neighbor clustering), as well as similarity-score methods (phrase pattern matching, frequency counts, and statistical weight measures). The latter approach, the semantic one, tends to distinguish positive from negative opinions using common-sense ontologies and sentiment and lexical-semantic resources such as SentiWordNet, WordNet, WordNet Gloss, and WordNet-Affect. Its main limitation is that system performance depends heavily on these resources: if a particular word is not contained in them, the opinion cannot easily be classified. Thus, most researchers do not rely solely on the semantic approach; instead, statistical approaches are coupled with semantic approaches to improve sentiment classification by integrating features from lexical resources with learned features. The previous work on sentiment classification and analysis carried out using machine learning techniques and linguistic methods is discussed in the next section.
2.1 Recent Work
In the sentiment classification work of V. Narayanan et al. [1], the authors mainly focused on exploring different methods to improve the accuracy of a Naive Bayes classifier. Those methods were negation handling, n-gram features (bigrams and trigrams), and feature selection by Mutual Information. From their results, it can be said that Bernoulli Naive Bayes is more efficient than the original Naive Bayes for sentiment classification. They achieved an accuracy of 88.8% on an IMDb movie review dataset (different from the one used in this thesis) by applying all the above methods together with Bernoulli Naive Bayesian classification and Laplacian smoothing. In P. Kalaivani's work [8], three supervised machine learning algorithms, Support Vector Machines, Naïve Bayes, and k-Nearest Neighbor, were compared for sentiment classification of IMDb movie reviews. Before classification, the authors applied Information Gain for feature extraction. The goal of their work was to analyze the performance of the three machine learning algorithms for sentiment classification in terms of accuracy, precision, and recall. In this research, SVM outperformed the Naïve Bayes and kNN methods, and the error rate of SVM was less than 20%. A. Kennedy and D. Inkpen [9] also performed sentiment classification of movie reviews using contextual valence shifters (negations, intensifiers, and diminishers). Two methods were presented in their paper. In the first method, semantic orientation was computed using the General Inquirer, a dictionary that contains information about English words, including tags that label them as positive, negative, negation, overstatement, or understatement, and reviews were classified based on the number of positive and negative terms. They then showed that extending this term-counting method with contextual valence shifters improves classification accuracy. The second method uses a machine learning algorithm, Support Vector Machines, with features starting as unigrams and then extended with bigrams.
Besides, they performed evaluation on ten different approaches. For product review classification, improved GI with CTWR had the same accuracy as improved GI with SO-PMI; these two methods outperformed the others. For movie reviews, however, improved GI with CTWR and Adj had higher accuracy than the others. In the work of F. Peng and D. Schuurmans [10], the authors presented a chain augmented Naïve Bayes classifier (CAN) based on statistical n-gram language modeling for the Reuters-21578 dataset. The CAN Bayes model captures dependence between adjacent attributes as a Markov chain. Witten-Bell smoothing is applied instead of Laplacian smoothing. Their model is able to work at either the character level or the word level, which provides language-independent abilities to handle Eastern languages like Chinese and Japanese just as easily as Western languages like English or Greek. Evaluation was performed on four different languages and three text classification problems. Besides, the authors evaluated the impact of the different factors that can affect the accuracy of their proposed approach, and they observed that most of the results are robust. B. Pang et al. [3] developed sentiment classification of movie reviews. They found that standard machine learning techniques definitely outperform human-produced baselines. Three machine learning methods, namely Naïve Bayes, Support Vector Machines, and maximum entropy classification, were compared for classification. They also pointed out the factors that make the sentiment classification problem more challenging, such as part of speech, position, bigrams, and feature presence vs. frequency. In their results, SVM surpassed the other two methods given the appropriate factors, but the differences were not very large. In the work of D. Allotey [5], the performance of Categorical Proportional Difference (CPD) was investigated as a feature selection method together with other popular feature selection methods, Information Gain and chi-square, under two weighting schemes, feature presence and SentiWordNet scores. Two datasets, IMDb movie reviews and a congressional speech corpus, were used, and SentiWordNet was used to score the terms in each document after tagging with the Stanford POS tagger.
The SVM and Naïve Bayes classifiers in the Weka data mining application were used to classify the datasets into positive and negative sentiments. Evaluation results show that CPD achieved an accuracy of 88.1% with 42% of the features. The relative performance of Information Gain and simple chi-square changed with the dataset: Information Gain outperformed simple chi-square on the movie review dataset, although chi-square was better on the congressional floor debate dataset. The author proposed CPD as a good feature selection method for sentiment classification. K. Yessenov and S. Misailovic [11] presented an empirical study of the efficacy of machine learning techniques in classifying text messages by semantic meaning. As a dataset, movie review comments from the popular social network Digg were used to classify subjectivity/objectivity and negative/positive attitude. They proposed different approaches to extracting text features, such as the bag-of-words model, using a large movie review corpus, restricting to adjectives and adverbs, handling negations using chunking, bounding word frequencies by a threshold, and using WordNet synonym knowledge to control features. They evaluated the effect of these approaches on the accuracy of four machine learning methods: Naïve Bayes, Decision Trees, Maximum Entropy, and K-Means clustering. The evaluation results show that the simple bag-of-words model has satisfactory performance, but it can be improved by selecting features based on syntactic and semantic information from the text. H. Khoshnevis's work [12] compared feature sets and classifiers for sentiment analysis of opinionated free text with societal themes. The author investigated common feature sets used in previous similar studies on sentiment analysis, as well as a newly introduced feature set containing mostly high-PMI (Pointwise Mutual Information) ranked sentiment words based on positive and negative documents, on societal-theme datasets. The author intended to find the best classifier in terms of accuracy and time, on large and small datasets represented by the mentioned feature sets.
The classifiers used in the evaluation were categorized into linear and non-linear ones; they were Support Vector Machines, different Bayes classifiers, a kNN classifier, and a Nearest Mean classifier from PRTools, a MATLAB-based toolbox for pattern recognition. Principal Component Analysis was also used for linear feature extraction. The author found that even though the general performance of the SVC and the Linear Discriminant Classifier (LDC) is similar, the computation time needed for prediction is higher for the SVC than for the LDC. In the experimental results, the author obtained
a mean accuracy of 68.51% on the large dataset and 73.41% on the small dataset using the PMI dictionary. K. S. Doddi [13] worked to provide a platform for serving good news and creating a positive environment. This is achieved by finding the sentiments of news articles and filtering out the articles that carry negative sentiments. The author built his own algorithm for classifying news articles. The system includes a data aggregator tool and a processing engine at the server side acting as a sentiment classifier, and a platform, called GoodBook, where positive news is served for users to read. The approaches used in this system are stop-word removal and stemming, SentiWordNet scoring, POS tagging, a bag-of-words model with binary, term-frequency, and TF-IDF weighting, and Support Vector Machines as the classification model. The author calculated the adverb and verb scaling factors using his heuristic algorithm for document classification. For maximum accuracy, the adverb scaling factor and verb scaling factor were 0.35 and 0.3, respectively, and these factors were used in the further evaluation stages. The experimental results were analyzed by testing two kernels; they show that higher accuracy can be obtained with a Gaussian kernel with a gamma value of 1.8. The author of [14] performed supervised classification with the Naïve Bayesian model and Support Vector Machines from two points of view, English and Spanish, and semi-supervised classification with an unlabeled Portuguese corpus. Some of the most controversial issues in sentiment analysis were also subjects of this study, such as the contribution of n-grams, the choice of classifier, which feature selection method to use (Information Gain, chi-square, Mutual Information, Gini Index, Bi-Normal Separation, and so on), granularity, meaning classification at the document, sentence, or word level, and cross-lingual adaptation.
Besides, he also applied punctuation handling, tokenization, stop-word removal, stemming and lemmatization, and word sense disambiguation. According to the experimental results, the proposed Mutual Information Weighted SVM ((MImax)-WSVM) surpassed the SVM in all cases when the corpus consisted of longer reviews, while Multinomial Naïve Bayes (MNNB) had higher accuracy
with shorter reviews. This indicates that (MImax)-WSVM is suitable for operating at the document level, whereas MNNB is suited to the sentence level. To sum up, this thesis is about sentiment classification by Bernoulli Naive Bayes using the presence/absence of features. As preprocessing steps, features with the desired POS are first extracted with the help of the Stanford POS tagger, after simple negative handling as in [1], and then the final lists of features are selected by Information Gain. For performance evaluation, accuracy, recall, and precision are applied, highlighting that the 'noun' POS is not as necessary as adjectives, adverbs, or verbs to include in the sentiment classification process.
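A minimal Bernoulli Naive Bayes over presence/absence (set-of-words) features with Laplacian smoothing, the combination named above, might look like the following sketch. The function names and toy data are illustrative assumptions, not the author's implementation.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs):
    """docs: list of (set_of_features, label) pairs. Returns class
    priors and per-class feature presence probabilities, with
    Laplacian (add-one) smoothing over the two outcomes present/absent."""
    class_docs = defaultdict(list)
    for feats, label in docs:
        class_docs[label].append(feats)
    vocab = set().union(*(feats for feats, _ in docs))
    priors, cond = {}, {}
    for label, dlist in class_docs.items():
        priors[label] = len(dlist) / len(docs)
        n = len(dlist)
        # P(feature present | class), smoothed: (count + 1) / (n + 2)
        cond[label] = {w: (sum(w in d for d in dlist) + 1) / (n + 2)
                       for w in vocab}
    return priors, cond, vocab

def classify(feats, priors, cond, vocab):
    """Score every class using both present AND absent features,
    as the Bernoulli model requires, then pick the argmax."""
    scores = {}
    for label in priors:
        s = math.log(priors[label])
        for w in vocab:
            p = cond[label][w]
            s += math.log(p) if w in feats else math.log(1 - p)
        scores[label] = s
    return max(scores, key=scores.get)

# Toy training data: each review reduced to its set of selected features.
docs = [({"good", "great"}, "pos"), ({"excellent"}, "pos"),
        ({"bad", "boring"}, "neg"), ({"awful"}, "neg")]
model = train_bernoulli_nb(docs)
print(classify({"good"}, *model))  # pos
```

Note the key Bernoulli property: absent vocabulary words contribute log(1 - p) to each class score, which is what makes the model sensitive to the overall number of features, as discussed in the abstract.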
CHAPTER 3
BACKGROUND THEORY
3.1 Sentiment Classification
With the growing availability and popularity of opinion-rich resources such as online review sites and personal blogs, new opportunities and challenges arise as people now can, and do, actively use information technologies to seek out and understand the opinions of others. The sudden eruption of activity in the area of opinion mining and sentiment analysis, which deals with the computational treatment of opinion, sentiment, and subjectivity in text, has thus occurred at least in part as a direct response to the surge of interest in new systems that deal directly with opinions as a first-class object.
3.1.1 Opinion Definition
An opinion consists of two key components: a target g and a sentiment s on that target, i.e., (g, s), where g can be any entity, or any aspect of an entity, about which an opinion has been expressed, and s is a positive, negative, or neutral sentiment, or a numeric rating score expressing the strength/intensity of the sentiment (e.g., 1 to 5 stars). Positive, negative, and neutral are called sentiment orientations. Moreover, an opinion can be defined as a quadruple (g, s, h, t), where g and s are the same as above, h is the opinion holder, and t is the time when the opinion was expressed. This is more precise than the first definition. To be even more precise, an opinion can be defined as a quintuple (e, a, s, h, t), where e is the name of an entity (e.g., iPhone 6s) and a is an aspect of that entity e (e.g., battery life). It is very challenging to determine which aspect of the entity holds which sentiment.
This is because some expressions can be implicit aspect expressions, which means aspect expressions that are not nouns or noun phrases. For example, the sentence "This camera is expensive." implies the aspect price. There are also explicit aspect expressions, as in "The battery life of the iPhone 6 is very reliable." From the perspective of expression, an opinion can be classified into two types: regular opinion and comparative opinion. A comparative opinion expresses a sentiment relation between two or more entities. For example, the sentences "Coke tastes better than Pepsi" and "Coke tastes the best" express two comparative opinions. A comparative opinion is usually expressed using the comparative or superlative form of an adjective or adverb. The former type, a regular opinion, is simply an opinion where the sentiment is expressed on a single entity or aspect.
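The quintuple definition above can be captured as a simple record. This is purely illustrative; the field values below (including the holder name) are hypothetical examples, not data from the thesis.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    """The (e, a, s, h, t) quintuple from the text: entity, aspect,
    sentiment, holder, and time."""
    entity: str
    aspect: Optional[str]   # None when the opinion targets the entity as a whole
    sentiment: str          # "positive", "negative", or "neutral"
    holder: Optional[str]
    time: Optional[str]

op = Opinion(entity="iPhone 6s", aspect="battery life",
             sentiment="positive", holder="reviewer_42", time="2016-01-11")
print(op.sentiment)  # positive
```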
3.1.2 Different Levels of Sentiment Classification
As described in Chapter 1, sentiment classification can be done at three different levels: document level, sentence level, and aspect-entity level. These three levels differ in how much content is analyzed to obtain a sentiment.
1. Document Level: The classification focuses on estimating the sentiment of the whole document. For example, in this thesis, the main task is to determine whether a comment expresses a positive or negative sentiment by analyzing the whole document. However, this is suitable only for documents that express opinions on a single entity.
2. Sentence Level: The task goes down to sentences and determines the sentiment of each sentence as positive, negative, or neutral, where neutral can be read as no opinion.
3. Aspect and Entity Level: This is also known as the feature level. It analyzes what, in particular, people like and do not like, so this level looks not only at the sentiment but also at the target of that sentiment. For example, the sentence "although the service is not good, I still love this restaurant" can be classified as overall positive, but at the aspect level, a positive tone is determined for the restaurant (target) and a negative one for the service (target). This level is more challenging than the two levels above. [2]
3.1.3 Sentiment Classification using Supervised Learning
Sentiment classification is usually formulated as a two-class classification problem: positive and negative. The training and testing data are normally product reviews. Since online reviews carry rating scores assigned by their reviewers, e.g., 1-5 stars, the positive and negative classes are determined from the ratings. For example, a review with 4 or 5 stars is considered positive, and a review with 1 or 2 stars is considered negative. Most research papers do not use the neutral class, which makes the classification problem considerably easier, but it is possible to use one, e.g., by assigning all 3-star reviews to the neutral class. Sentiment classification is essentially a text classification problem. Traditional text classification mainly separates documents of different topics, e.g., politics, science, and sports, where topic-related words are the key features. In sentiment classification, however, sentiment or opinion words that indicate positive or negative opinions are more important, e.g., great, excellent, amazing, horrible, bad, worst. Since it is a text classification problem, any existing supervised learning method can be applied, such as Naive Bayesian classification, Support Vector Machines, and so on. It is also important to select suitable features in suitable forms for machine learning. The key factors which need to be considered in sentiment classification are:
1. Terms and their frequency: n-gram features with frequency counts. Instead of the frequencies implied by the bag-of-words model, a presence/absence feature can also be used, giving a set-of-words model.
2. Part of speech (POS): words of different parts of speech may be treated differently. For example, adjectives are considered the most important indicators of opinion.
3. Sentiment shifters: expressions that change the sentiment orientation, e.g., from positive to negative or vice versa. These are typically negation words, and resolving negation correctly is a tricky problem in sentiment classification.
Besides supervised methods, sentiment classification can also be done with unsupervised approaches such as lexicon-based methods and pointwise mutual information (PMI) measures.[2]
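The rating-based labelling convention described above can be sketched in a few lines of Java. This is a minimal illustration; the class name and the exact star cut-offs follow the convention stated in the text, not any fixed standard.

```java
// Deriving training labels from star ratings, as described above:
// 4-5 stars -> positive, 1-2 stars -> negative, 3 stars usually dropped
// (or treated as neutral).
public class RatingLabeler {
    /** Returns "positive", "negative", or null for reviews to discard. */
    public static String label(int stars) {
        if (stars >= 4) return "positive";
        if (stars <= 2) return "negative";
        return null; // 3-star reviews: neutral, typically excluded
    }
}
```

For example, `RatingLabeler.label(5)` yields `"positive"` and a 3-star review yields `null`, so it is simply skipped when building the training set.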
3.2
POS tagging
POS tagging is an important task in sentiment classification: it distinguishes words by their part of speech so that particular POS classes can be emphasised or ignored in the classification step. For this task, researchers usually apply an existing POS tagger, which parses the text and tags each word with its corresponding POS. For example, parsing the following text, extracted from the reviews:

films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before.

generates the following output:

films/NNS adapted/VBD from/IN comic/JJ books/NNS have/VBP had/VBN plenty/NN of/IN success/NN ,/, whether/IN they/PRP 're/VBP about/IN superheroes/NNS -LRB-/-LRB- batman/NN ,/, superman/NN ,/, spawn/VBP -RRB-/-RRB- ,/, or/CC geared/VBN toward/IN kids/NNS -LRB-/-LRB- casper/NN -RRB-/-RRB- or/CC the/DT arthouse/NN crowd/NN -LRB-/-LRB- ghost/NN world/NN -RRB-/-RRB- ,/, but/CC there/EX 's/VBZ never/RB really/RB been/VBN a/DT comic/JJ book/NN like/IN from/IN hell/NN before/IN ./.

According to the standard Penn Treebank POS tags shown in Table 3.1, each term has been associated with its relevant POS: NNS means plural noun, VBD means past-tense verb, JJ means adjective, and so on. There are 36 tags altogether in the Penn Treebank tagset, which comes from an annotated corpus and is the most widely used standard in POS tagging.[2]

Table 3.1: Penn Treebank Part-of-Speech (POS) tags[2]
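The tagger output above uses the common "word/TAG" convention. A small sketch of how such output can be split back into (word, tag) pairs is shown below; the class name is illustrative, and the parser assumes the slash-separated format shown above, splitting on the last slash so that tokens such as -LRB-/-LRB- keep their word part intact.

```java
import java.util.ArrayList;
import java.util.List;

// Parses "word/TAG" tokens, as produced by common POS taggers, into
// (word, tag) pairs. Splitting on the LAST slash handles tokens whose
// word part itself contains characters like "-LRB-".
public class TaggedTokenParser {
    public static List<String[]> parse(String tagged) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash > 0) {
                pairs.add(new String[] { token.substring(0, slash),
                                         token.substring(slash + 1) });
            }
        }
        return pairs;
    }
}
```

For instance, parsing "films/NNS comic/JJ" yields the pairs ("films", "NNS") and ("comic", "JJ"), which later preprocessing steps can filter by tag.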
3.3
Negative Handling
Negation is a strongly expressive linguistic phenomenon: it usually reverses the truth value of a statement. For example,

Statement 1: The movie is good. (Positive sentiment)
Statement 2: The movie is not good. (Negative sentiment)

If negation is not taken into account, the two statements would be assigned the same (positive) sentiment. For this obvious reason, it is important to handle negation when performing sentiment classification. There are several techniques for doing so. The simplest method is to mark words that follow a negation word such as no, not, never, or rarely. Using the "!" symbol as the mark, Statement 2 becomes "The movie is not !good.", in which good and !good carry different senses. [1] This is illustrated in Figure 3.1.
Figure 3.1: General Flow of Negative Handling[1]

The other technique for handling negation is to chunk the sentence according to some criterion. Some researchers support only a small number of patterns that handle adjectives and adverbs, although it would be possible to create a more extensive set of rules that match nouns and verbs as well.
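The simple marking scheme described above can be sketched as follows. This is an illustrative implementation under the assumptions stated in the text; the list of negation words is a small example set, not an exhaustive lexicon.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Marks every word following a negation word with a "!" prefix, within
// a single sentence, as in the simple negation-handling method above.
public class NegationMarker {
    private static final Set<String> NEGATIONS = new HashSet<>(
        Arrays.asList("no", "not", "never", "rarely", "n't", "neither", "lack"));

    public static String mark(String sentence) {
        StringBuilder out = new StringBuilder();
        boolean negated = false; // stays on until the end of the sentence
        for (String w : sentence.split("\\s+")) {
            if (out.length() > 0) out.append(' ');
            out.append(negated && !NEGATIONS.contains(w) ? "!" + w : w);
            if (NEGATIONS.contains(w)) negated = true;
        }
        return out.toString();
    }
}
```

Applied to Statement 2, `mark("The movie is not good")` produces "The movie is not !good", so good and !good become distinct features.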
3.4
Features Selection
In order to perform machine learning, it is necessary to extract clues from the text that may lead to correct classification. Clues about the original data are usually stored in the form of a feature vector F = (f_1, f_2, f_3, ..., f_n). Each coordinate of the feature vector represents one clue, also called a feature, of the original text. The value of a coordinate may be binary, indicating the presence or absence of the feature, or an integer or decimal value that further expresses the feature's frequency or intensity in the original text. In most machine learning approaches, the features in a vector are considered statistically independent of each other.[11] The selection of features strongly influences the subsequent learning. The goal of selecting good features is to capture the desired properties of the original text numerically for the sentiment analysis task. Feature selection enhances the efficiency of classification by eliminating less relevant features from the document: each feature is scored with some predefined measure, and the most relevant terms are selected based on that score.
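The two vector encodings mentioned above, binary presence/absence and integer frequency, can be sketched over a fixed vocabulary as follows. The class name and the toy vocabulary are illustrative only.

```java
import java.util.List;

// Builds feature vectors over a fixed vocabulary: binary values record
// presence/absence (set-of-words), integers record frequency
// (bag-of-words), as described in the text.
public class FeatureVector {
    /** Binary presence/absence vector. */
    public static int[] binary(List<String> vocab, List<String> docWords) {
        int[] v = new int[vocab.size()];
        for (int t = 0; t < vocab.size(); t++) {
            v[t] = docWords.contains(vocab.get(t)) ? 1 : 0;
        }
        return v;
    }

    /** Frequency vector. */
    public static int[] counts(List<String> vocab, List<String> docWords) {
        int[] v = new int[vocab.size()];
        for (int t = 0; t < vocab.size(); t++) {
            for (String w : docWords) if (w.equals(vocab.get(t))) v[t]++;
        }
        return v;
    }
}
```

For the vocabulary (good, bad, great) and the document "good good great", the binary vector is (1, 0, 1) while the frequency vector is (2, 0, 1); the Bernoulli model of Section 3.5.1 uses the former, the multinomial model the latter.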
3.4.1
Information Gain
Information gain can be used as an attribute (feature) selection measure. The measure is based on Claude Shannon's pioneering work on information theory, which studied the value or "information content" of messages. IG measures the decrease in information entropy when a selected feature is present versus when it is absent in the document; information entropy measures the uncertainty of the feature and the dataset. The idea behind IG is to find out how well each single feature separates the given dataset. [data mining book ref] The information gain of a feature t_i is computed by equation 3.1:

InfoGain(t_i) = -\sum_{j=1}^{|C|} P(c_j) \log P(c_j) + P(t_i) \sum_{j=1}^{|C|} P(c_j|t_i) \log P(c_j|t_i) + P(\bar{t}_i) \sum_{j=1}^{|C|} P(c_j|\bar{t}_i) \log P(c_j|\bar{t}_i)    (3.1)

The equation has three parts: the first calculates the overall entropy, the second the entropy when the particular feature is present, and the last the entropy when the feature is absent. After computing IG values for all features, the features scoring above a certain threshold are selected for the classification phase.
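Equation 3.1 can be sketched for a single feature as follows. This is an illustrative implementation: documents are reduced to a class label plus a presence flag for the feature, log base 2 is used, and 0·log 0 is treated as 0; the class and method names are assumptions.

```java
// Computes equation 3.1 for one feature: overall entropy minus the
// weighted entropies of the feature-present and feature-absent splits.
public class InfoGain {
    private static double entropyTerm(double p) {
        return p > 0 ? -p * (Math.log(p) / Math.log(2)) : 0.0; // 0*log0 = 0
    }

    /** labels[i]: class of document i (0 or 1); present[i]: feature occurs in doc i. */
    public static double infoGain(int[] labels, boolean[] present) {
        int n = labels.length;
        int pos = 0, withFeat = 0, posWith = 0, posWithout = 0;
        for (int i = 0; i < n; i++) {
            if (labels[i] == 1) pos++;
            if (present[i]) {
                withFeat++;
                if (labels[i] == 1) posWith++;
            } else if (labels[i] == 1) posWithout++;
        }
        int without = n - withFeat;
        double h = entropyTerm((double) pos / n) + entropyTerm((double) (n - pos) / n);
        double hWith = withFeat == 0 ? 0
            : entropyTerm((double) posWith / withFeat)
            + entropyTerm((double) (withFeat - posWith) / withFeat);
        double hWithout = without == 0 ? 0
            : entropyTerm((double) posWithout / without)
            + entropyTerm((double) (without - posWithout) / without);
        return h - ((double) withFeat / n) * hWith - ((double) without / n) * hWithout;
    }
}
```

A feature that perfectly separates a balanced two-class dataset scores 1 bit of information gain, while a feature distributed evenly across both classes scores 0, which is exactly what the thresholding step exploits.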
3.5
Naive Bayesian Classifier
Bayesian classifiers are generative, statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class, using equation 3.2 [1]:

P(c|d) = P(c) P(d|c) / P(d)    (3.2)
Bayesian classification is based on Bayes' theorem, named after Thomas Bayes, shown in equation 3.2. The Naïve Bayes classifier is a simple Bayesian classifier which assumes that the effect of an attribute value on a given class is independent of the values of the other attributes; this assumption is called class conditional independence. Under it, the Naïve Bayes classifier models the distribution of the documents in each class with a probabilistic model. Two classes of models are commonly used for Naïve Bayes classification. Both essentially compute the posterior probability of a class based on the distribution of the words in the document, ignore the actual positions of the words, and work under the bag-of-words (or set-of-words) assumption. The major difference between the two models is whether word frequencies are taken into account, and the corresponding approach for estimating the probabilities.[15]
3.5.1
Bernoulli Naïve Bayesian Model
In the Bernoulli model, a document is represented by a binary vector that records feature presence or absence, rather than feature frequencies. If we have a vocabulary containing |V| words, then the t-th dimension of a document vector corresponds to the word w_t in the vocabulary. Let b_j be the feature vector for the j-th document; the t-th element of b_j, written b_jt, is either 0 or 1, representing the absence or presence of w_t in the j-th document [16]. Let P(w_t|C) be the probability of word w_t occurring in a document of class C; the probability of w_t not occurring in a document of this class is then (1 - P(w_t|C)). If we make the Naïve Bayes assumption, that the probability of each word occurring in the document is independent of the occurrences of the other words, then we can write the class score for a document in terms of the individual word likelihoods P(w_t|C)
as in equation 3.3 [16]:

P(C|D_j) = P(C) \prod_{t=1}^{|V|} [ b_{jt} P(w_t|C) + (1 - b_{jt})(1 - P(w_t|C)) ]    (3.3)
This product runs over all words in the vocabulary. If word w_t is present, then b_jt = 1 and the required probability is P(w_t|C); if w_t is absent, then b_jt = 0 and the required probability is (1 - P(w_t|C)). The likelihood parameters are the probabilities of each word given the class, P(w_t|C), and the model is further parameterised by the prior probabilities P(C). We can estimate these parameters from a training set of documents labelled with class C=k. Let N(w_t, C=k) be the number of documents of class C=k in which w_t is observed, and let N_k be the total number of documents of that class. Then we can estimate the word likelihoods as the relative frequency of documents of class C=k that contain word w_t, using equation 3.4 [16]:
P(w_t|C=k) = N(w_t, C=k) / N_k    (3.4)
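The parameter estimates of equations 3.4 and 3.5 can be sketched on a toy training set as follows. The class and method names are illustrative; each document is reduced to its set of distinct words.

```java
import java.util.List;
import java.util.Set;

// Bernoulli model training: eq. 3.4 estimates P(w_t|C=k) as the
// fraction of class-k documents containing w_t; eq. 3.5 estimates the
// class prior as the fraction of all documents belonging to class k.
public class BernoulliTraining {
    /** Eq. 3.4: fraction of documents of class k that contain word w. */
    public static double wordLikelihood(String w, List<? extends Set<String>> docsOfClass) {
        int count = 0;
        for (Set<String> doc : docsOfClass) if (doc.contains(w)) count++;
        return (double) count / docsOfClass.size();
    }

    /** Eq. 3.5: prior P(C=k) = N_k / N. */
    public static double prior(int nk, int n) {
        return (double) nk / n;
    }
}
```

For instance, if both of two positive training documents contain "good" but only one contains "fun", then P(good|pos) = 1.0 and P(fun|pos) = 0.5; with 700 training reviews per class out of 1400, each prior is 0.5.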
If there are N documents in total in the training set, then the prior probability of class C=k may be estimated as the relative frequency of documents of class C=k, as in equation 3.5 [16]:

P(C=k) = N_k / N    (3.5)

3.5.2

Multinomial Naïve Bayesian Model
In the multinomial model, the document feature vector captures the frequencies of words, not just their presence or absence. Let x_i be the multinomial feature vector for the i-th document D_i; the t-th element of x_i, written x_it, is the number of times word w_t occurs in D_i. Let n_i be the total number of words in D_i, and let P(w_t|C) again be the probability of word w_t occurring in class C, this time estimated using the word frequency information from the document feature vectors. Under the Naïve Bayes assumption that each word occurs in the document independently of the occurrences of the other words, the document likelihood P(D_i|C) can be written with the aid of the multinomial distribution as equation 3.6:

P(D_i|C) = n_i! \prod_{t=1}^{|V|} P(w_t|C)^{x_{it}} / x_{it}!    (3.6)
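Equation 3.6 can be evaluated in log space to keep the factorials and small probabilities numerically manageable; a sketch follows, with illustrative names. Factorials are accumulated as sums of logarithms rather than computed directly.

```java
// Multinomial document log-likelihood (eq. 3.6), computed in log space:
// log P(D|C) = log n! - sum_t log x_t! + sum_t x_t * log P(w_t|C).
public class MultinomialLikelihood {
    private static double logFactorial(int n) {
        double s = 0.0;
        for (int k = 2; k <= n; k++) s += Math.log(k);
        return s;
    }

    /** x: word counts of the document; wordProb[t]: P(w_t|C). */
    public static double logLikelihood(int[] x, double[] wordProb) {
        int n = 0;
        for (int c : x) n += c;
        double ll = logFactorial(n);
        for (int t = 0; t < x.length; t++) {
            ll -= logFactorial(x[t]);
            if (x[t] > 0) ll += x[t] * Math.log(wordProb[t]);
        }
        return ll;
    }
}
```

As a sanity check, a two-word document with counts (1, 1) and word probabilities (0.5, 0.5) has likelihood 2! · 0.5 · 0.5 = 0.5, i.e. a log-likelihood of -log 2.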
To estimate the likelihood parameters from a training set of documents labelled with class C=k, let z_ik be an indicator variable that equals 1 when D_i has class C=k and 0 otherwise. If N is again the total number of documents, then P(w_t|C=k) can be estimated as the relative frequency of w_t in documents of class C=k with respect to the total number of words in documents of that class, using equation 3.7 [16]:

P(w_t|C=k) = \sum_{i=1}^{N} x_{it} z_{ik} / \sum_{s=1}^{|V|} \sum_{i=1}^{N} x_{is} z_{ik}    (3.7)

3.5.3

Avoiding Zero Probability
The zero-probability problem occurs frequently in Naïve Bayesian classification: if the classifier encounters a word that has never been seen in the training set, the estimated probability for both classes (positive and negative) becomes zero, leaving nothing to compare. This problem can be solved by Laplacian smoothing, with the probability calculated by equation 3.8:

P(w_i|c_i) = (N(W=w_i, C=c_i) + 1) / (N(C=c_i) + |W|)    (3.8)

where |W| is the number of values the attribute can take (in this case, |W| is 2).
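Equation 3.8 is a one-line computation; the sketch below makes the add-one adjustment explicit. The class name is illustrative.

```java
// Laplace (add-one) smoothing, eq. 3.8: an unseen word receives a small
// non-zero probability instead of zero. numValues is |W|, the number of
// values the feature can take (2 for presence/absence).
public class LaplaceSmoothing {
    public static double smoothed(int wordClassCount, int classCount, int numValues) {
        return (wordClassCount + 1.0) / (classCount + numValues);
    }
}
```

For example, a word never seen in a class with 8 training documents gets probability (0+1)/(8+2) = 0.1 rather than 0, so a single unseen word can no longer zero out the whole product in equation 3.3.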
3.6
Evaluation Methods
Accuracy, recall, and precision are used to evaluate the performance of the system. These measures are computed on a test set consisting of class-labelled tuples that were not used to train the model. Accuracy is the overall accuracy of the sentiment model; recall(pos) and precision(pos) are the recall and precision for positive reviews, and recall(neg) and precision(neg) are the recall and precision for negative reviews. They are calculated as follows: [8]

Accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)    (3.9)

Recall(pos) = true_pos / (true_pos + false_neg)    (3.10)

Recall(neg) = true_neg / (true_neg + false_pos)    (3.11)

Precision(pos) = true_pos / (true_pos + false_pos)    (3.12)

Precision(neg) = true_neg / (true_neg + false_neg)    (3.13)
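Equations 3.9-3.13 can be sketched directly from the four outcome counts of a binary classifier; the class name is illustrative.

```java
// Evaluation measures of eqs. 3.9-3.13, computed from the counts of
// true/false positives and negatives on the test set.
public class Metrics {
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    public static double recallPos(int tp, int fn)    { return (double) tp / (tp + fn); }
    public static double recallNeg(int tn, int fp)    { return (double) tn / (tn + fp); }
    public static double precisionPos(int tp, int fp) { return (double) tp / (tp + fp); }
    public static double precisionNeg(int tn, int fn) { return (double) tn / (tn + fn); }
}
```

For example, with 40 true positives, 40 true negatives, 10 false positives, and 10 false negatives, accuracy, precision(pos), and recall(pos) all equal 0.8.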
CHAPTER 4
DESIGN AND IMPLEMENTATION
This chapter describes the design and implementation of document-level sentiment classification using the Bernoulli Naive Bayes model. The chapter is organised as follows: Section 4.1 describes the operating environment of the system, and Section 4.2 gives an overview of the system. Section 4.3 discusses the preprocessing steps involved in this sentiment classification, with implementation results. Section 4.4 presents how the Bernoulli Naïve Bayes classifier is made to differentiate the reviews. Finally, Section 4.5 shows the sentiment classification user interface built in this thesis.
4.1
Operating Environment
This sentiment classification system is implemented in Java, with an embedded SQLite database, using the NetBeans IDE (Integrated Development Environment). The IMDb movie review dataset prepared by [3] is used. As the running environment, a MacBook Pro with a 2.7 GHz Intel Core i5 processor, 8 GB of RAM, and OS X Yosemite is used to deploy the system. A Java Virtual Machine (JVM) is required to run the system, and an SQLite browser to inspect the data in the database.
4.2
Proposed System
To perform sentiment classification, the work can be divided into two main stages, preprocessing and classification, as shown in Figure 4.1. Firstly, the comments undergo certain preprocessing steps to become ready for use in classification. After preprocessing the reviews, the features are obtained as a result of these steps. Using the features, classification is performed to determine whether a new movie review has positive or negative sentiment.
Figure 4.1: Sentiment Classification System Architecture
4.3
Preprocessing Steps
Overall, the purpose of the preprocessing steps is to extract features, restricted to the parts of speech adjective, adverb, verb, and noun, that could be useful for the classification step from the sentence-level review comments. There are three preprocessing steps, as shown in Figure 4.2: parsing the review comments with the help of the Stanford POS tagger; performing a simple negative handling method that marks words following negations with an extra symbol in front of the word; and extracting only adjectives, adverbs, verbs, and nouns, then selecting features from those extracted using Information Gain.
4.3.1
POS tagging
The main reason why POS tagging is needed is that not all eight parts of speech are important for classification, so to extract the important POS, words first need to be tagged with their corresponding parts of speech. For this task, the Stanford POS tagger is applied. The Stanford POS tagger is a piece of software that reads text in some language and assigns a part of speech to each word (and other tokens), such as noun, verb, or adjective, although computational applications generally use more fine-grained POS tags like 'noun-plural'. The software is a Java implementation of the log-linear part-of-speech tagger. The tagger was originally written by Kristina Toutanova; since then, Dan Klein, Christopher Manning, William Morgan, Anna Rafferty, Michel Galley, and John Bauer have improved its speed, performance, usability, and support for other languages. It requires Java 1.8+ to be installed. Depending on whether 32-bit or 64-bit Java is used and on the complexity of the tagger model, between 60 and 200 MB of memory is needed to run a trained tagger. Training a tagger requires much more memory: at least 1 GB is usually needed, often more, again depending on the complexity of the model. [17] A sample result from tagging one of the reviews with the tagger is shown in Figure 4.2.
Figure 4.2: POS Tagging Result
4.3.2
Negative Handling
As described in Chapter 3, it is necessary to handle negation occurring in the review comments because negation can reverse the sentiment, from positive to negative and vice versa. In this system, a simple negative handling method is used: within a single sentence, every word that follows certain negative words such as no, not, lack, rarely, neither, never, or n't is prefixed with the ! symbol. After the negative handling stage is carried out, the review becomes as shown in Figure 4.3.
Figure 4.3: Negative Handling Result
4.3.3
Extraction of Adjectives, Adverbs, Verbs, and Nouns
After performing POS tagging and negative handling, the next important task is the extraction of words with the parts of speech most likely to express sentiment, making the classification more efficient and effective. Generally, adjectives are considered the most important POS in sentiment classification, while adverbs, verbs, and nouns can also be sentiment holders. The result of the extraction stage is shown in Figure 4.4.
Figure 4.4: Extraction Result
4.3.4
Feature Selection using Information Gain
Even after extracting adjectives, adverbs, verbs, and nouns, there may still be words that hold no sentiment. For example, man and is are not concerned with any sentiment, and such words can decrease the system's performance. Both text classification and sentiment classification have benefited from feature selection methods that improve classification accuracy and efficiency. In this sentiment classification system, Information Gain (IG) is applied so that only useful features, those that help accurate classification, reach the classifier. Figure 4.6 shows the gain value of each extracted feature, computed using the formula described in Chapter 3. Features are then selected against a threshold: here the threshold is set to 0.0007, just below the average gain of 0.00083, so that about 75% of the features, those with the higher gain values, are selected. The calculated information gains are shown in tabular form in Figure 4.5 and in graph form in Figure 4.6.
4.4
Bernoulli Naive Bayesian Classification
There is a computational problem in calculating Bernoulli Naïve Bayes, which consists of multiplying the probability of each feature over the whole training vocabulary. Due to
Figure 4.5: DB: Information Gains of Features
Figure 4.6: Information Gains of Features

the fact that computers handle numbers with limited decimal precision, calculating the product of many probabilities leads to floating-point underflow. This means the result becomes a number so small that it cannot be represented and is rounded to zero, rendering the analysis useless.[18] To avoid this, instead of maximising the product of probabilities, the classifier maximises the sum of their logarithms, using equation 4.1:

log P(C|D_j) = \log P(C) + \sum_{t=1}^{|V|} [ b_{jt} \log P(w_t|C) + (1 - b_{jt}) \log(1 - P(w_t|C)) ]    (4.1)
In performing classification, to save time, the probabilities of each feature are precomputed for both the positive and the negative class, as shown in Figure 4.8. The actual results on the testing dataset, calculated using these probabilities, are shown in Figure 4.9.
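The log-space scoring of equation 4.1 can be sketched as follows. The class name and the probability values in the usage note are illustrative, not the thesis's trained parameters; the smoothed probabilities are assumed to lie strictly between 0 and 1 so both logarithms are defined.

```java
// Log-space Bernoulli Naive Bayes scoring (eq. 4.1): summing logs
// avoids the floating-point underflow caused by multiplying many small
// probabilities. prior = P(C); wordProb[t] = P(w_t|C); present[t] = b_jt.
public class BernoulliScorer {
    public static double logScore(double prior, double[] wordProb, boolean[] present) {
        double score = Math.log(prior);
        for (int t = 0; t < wordProb.length; t++) {
            score += present[t] ? Math.log(wordProb[t])
                                : Math.log(1.0 - wordProb[t]);
        }
        return score;
    }

    /** Picks the class with the larger log score. */
    public static String classify(double logPos, double logNeg) {
        return logPos >= logNeg ? "positive" : "negative";
    }
}
```

With a two-word vocabulary (good, bad), equal priors, P(good|pos) = 0.8 versus P(good|neg) = 0.1, a document containing only "good" scores higher under the positive class, so it is classified as positive.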
Figure 4.7: DB: Prior Probabilities of Features
Figure 4.8: Prior Probabilities of Features
Figure 4.9: Results of Testing Dataset
4.5
Sentiment Classification User Interface
In this system's user interface (UI), there are two main pages, for training and for classification, as shown in Figure 4.10. On the training page, a new movie review can be trained easily by choosing the correct 'Positive' or 'Negative' radio button and clicking the 'Train!' button. The classification page offers two sub-functions: classifying a new comment, and re-classifying the whole training dataset to gauge system performance, for which the evaluation results are shown using the methods described in the previous chapter.
Figure 4.10: Sentiment Classification User Interface
CHAPTER 5
EXPERIMENTATION RESULTS
A series of experiments was performed on the sentiment classification task. In this chapter, the nature of the IMDb movie review dataset is statistically analysed and the evaluation results are discussed. The chapter is organised as follows: Section 5.1 presents the hardware and software used in this system, Section 5.2 discusses the IMDb dataset used in the experiments, and Section 5.3 presents the experimental results.
5.1
Hardware and Software
The experiments are executed in the same operating environment as in the implementation chapter, and the required software is the same as well.
5.2
Dataset Nature
The Cornell IMDb movie-review polarity dataset v2.0 [3] is used for the experiments. The average review size is 747.3 words. 700 reviews from each class are used for the learning phase, and the remaining 300 for testing and evaluation. Table 5.1 shows the number of words in each class for the entire dataset. Moreover, the figures in Table 5.2, which gives statistics for the training review comments, show that the dataset is sparse: almost half of the distinct words occur only once.
Table 5.1: Numbers of Words in Each Class in the Entire Dataset [3]

Types of Words                                            Positive   Negative   Total
No. of distinct words (all POS)                           37470      34542      51585
Total no. of words (all POS)                              788047     706601     1494648
No. of distinct words (adj, adv, verbs, nouns)            37000      34648      51370
Total no. of words (adj, adv, verbs, nouns)               343147     305086     648233
No. of distinct words (adj, adv, verbs)                   15937      14934      22194
Total no. of words (adj, adv, verbs)                      160739     145204     305943

Table 5.2: Numbers of Words in Each Class in the Training Dataset

Types of Words                                            Positive   Negative   Total
No. of distinct words (adj, adv, verbs, nouns)            30665      28724      42891
No. of distinct words, count 1 (adj, adv, verbs, nouns)   11208      9819       21027
Total no. of words (adj, adv, verbs, nouns)               180431     163967     344398
No. of distinct words (adj, adv, verbs)                   13141      12290      18369
No. of distinct words, count 1 (adj, adv, verbs)          4887       4262       9149
Total no. of words (adj, adv, verbs)                      87855      79660      167515
5.3
Evaluation Results
The performance of the system is measured using the popular metrics of accuracy, precision (positive), precision (negative), recall (positive), and recall (negative), calculated with the equations described in Chapter 3. The performance measures used here are the same as those used by P. Kalaivani [8]. Evaluation results are mainly compared between two types of extracted features: features with POS of adjectives, adverbs, verbs, and nouns, and features with POS of only adjectives, adverbs, and verbs. There are two reasons why nouns are excluded in the second type. Firstly, due to the nature of movie review comments, most nouns are names of actors, actresses, and movies. Secondly, anyone can write a movie review, so spelling errors appear frequently, and the Stanford POS tagger usually labels every unknown word (which could be a spelling error) as a noun. Overall accuracy improves by 5% when nouns are removed from the features, as tabulated in Table 5.3; the improvement can be seen clearly in Figure 5.1. Table 5.4 and Table 5.5 show the results of the sub-experiments performed on the two feature types, with and without nouns respectively.

Table 5.3: Evaluation Results

Methods              Features: adj, adv, verbs, nouns   Features: adj, adv, verbs
Accuracy (%)         74.33                              79.67
Recall (pos) (%)     67.63                              75.72
Recall (neg) (%)     89.25                              85.04
Precision (pos) (%)  93.33                              87.33
Precision (neg) (%)  55.33                              72

Figure 5.1: Evaluation Comparisons between Features with Nouns and Features without Nouns

In the first test case, the testing reviews are classified using only 25% of the extracted features; the second case uses 50% of the features in the classification step; the third and fourth cases use 75% and all (100%) of the features respectively. For these four test cases, accuracies are compared between the features with POS of adjective, adverb, verb, and noun and those with adjective, adverb, and verb in Figure 5.2.
Table 5.4: Evaluation Results on Different Portions of Extracted Features with Nouns

No. of features   Accuracy (%)   Precision (pos) (%)   Precision (neg) (%)   Recall (pos) (%)
7547 (25%)        68.17          52.33                 84                    76.59
15094 (50%)       74.5           67                    82                    78.82
22641 (75%)       76.16          76.33                 76                    76.08
30188 (100%)      74.33          93.33                 55.33                 67.63
Table 5.5: Evaluation Results on Different Portions of Extracted Features without Nouns

No. of features   Accuracy (%)   Precision (pos) (%)   Precision (neg) (%)   Recall (pos) (%)
3248 (25%)        68             82.33                 53.67                 63.99
6496 (50%)        73.17          75.67                 70.67                 72.06
9745 (75%)        77.5           79.33                 75.67                 76.53
12995 (100%)      79.67          87.33                 72                    75.75
Figure 5.2: Comparison of Accuracy on Different Portions of Extracted Features

Figure 5.3 and Figure 5.4 depict the comparison of precision (positive) and precision (negative) values respectively between features with nouns and features without nouns. From these two figures, the overall precision (positive) of features with nouns is greater than that of features without nouns by just 6%, while the overall precision (negative) of features with nouns is lower than that of features without nouns by 17%.
Figure 5.3: Comparison of Precision(pos) on Different Portions of Extracted Features
Figure 5.4: Comparison of Precision(neg) on Different Portions of Extracted Features

In Figure 5.5 and Figure 5.6, recall values are compared between features with nouns and features without nouns, for both the positive and the negative class. With regard to recall (positive), features without nouns perform noticeably better in classification than those with nouns, by 8%, whereas for recall (negative) the reverse holds, by merely 4%.
Figure 5.5: Comparison of Recall(pos) on Different Portions of Extracted Features
Figure 5.6: Comparison of Recall(neg) on Different Portions of Extracted Features Even though the overall accuracy is nearly 80% when features are only adjectives, adverbs, and verbs, this system is biased to negative not because of a negative handling method which is applied in this system. In fact, according to Table 5.6 which shows the results when classification is performed without the reviews handling negative, the negative handling method does not have a great impact on the classifier’s performance. In Table 5.6, classification is performed for four times depending on the different number of extracted features used like the previous two types of features.
Table 5.6: Evaluation Results on Different Portions of Extracted Features without Nouns (Without Performing Negative Handling)

No. of features   Accuracy (%)   Precision (pos) (%)   Precision (neg) (%)   Recall (pos) (%)
2565 (25%)        70.17          70                    70.33                 70.23
5130 (50%)        73.83          86.33                 61.33                 69.0
7695 (75%)        75             89                    61                    69.72
10261 (100%)      78.33          88.33                 68.33                 73.61
Figure 5.7: Comparison of Accuracy on Different Portions of Extracted Features (Negative Handling)

In Figure 5.7, accuracies are compared between features without nouns with negative handling and those with no negative handling. It can be said that the negative handling method could be excluded from the preprocessing steps, since the overall accuracies are not much greater when it is applied. The main reason the system is biased toward the negative class is the nature of the Bernoulli Naïve Bayes model. Since the number of positive words exceeds the number of negative words by 5% (8195 out of 167515 words) in the training dataset, the system leans toward the negative sentiment when classifying with the Bernoulli Naïve Bayes formula, which simply multiplies the probability of each feature for each class: probabilities always lie between 0 and 1, so the more of them are multiplied together, the smaller the result. Thus, when using Bernoulli Naïve Bayes, it is necessary to keep a uniform number of features in each class for classification.
CHAPTER 6
CONCLUSION
The proposed system performs sentiment classification of movie review comments. Before classification, certain preprocessing steps are performed to extract the more suitable features from the training dataset and thereby improve the classifier's accuracy: POS tagging, negative handling, restriction to particular POS, and feature selection. Information Gain is used for feature selection and the Bernoulli Naïve Bayes document model for classification. Accuracy, precision, and recall are applied to evaluate the classifier's performance. The chapter is organised as follows: Section 6.1 discusses certain limitations of the system, and Section 6.2 presents future work that could extend it.
6.1
Limitation
Since the whole system is developed in the Java programming language, a Java Virtual Machine is needed to run it. The main limitation is that the system's performance depends on the processed training dataset. Because of the bias toward the negative sentiment, a newly input review, especially a short one, must contain a certain margin of positive words over negative ones, as reflected in the training feature set, to be classified as positive. This is also a consequence of using Bernoulli Naive Bayesian classification, which is statistically sensitive. In addition, in the training phase, the user interface provided by the system allows at most a single new comment to be added to the existing training dataset at a time.
6.2
Future Work
Movie review mining is a challenging task because movie reviews mix real-life review data with ironic wording. Moreover, sentiment classification can be performed at three different levels, and this system performs only document-level sentiment classification. Next, the system relies only on statistical methods; the accuracy could be improved by combining the statistical approach with semantic resources such as WordNet and SentiWordNet. Finally, this system considers only the sentiment classification of subjective comments, so a subjectivity function could be added to determine whether a comment is objective (merely factual) or subjective (opinionated).
6.3
Advantages
Using this system, a new movie review comment can be classified automatically according to its sentiment within a fraction of a minute, depending on the comment's length. The work also shows that, when dealing with a Bernoulli Naïve Bayes classifier, a uniformly balanced dataset across classes is of significant importance, and that in sentiment classification nouns are not as essential as features as adjectives and adverbs.
Bibliography
[1] V. Narayanan, I. Arora, and A. Bhatia, "Fast and accurate sentiment classification using an enhanced naive bayes model," vol. 8206.
[2] B. Liu, Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
[3] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment classification using machine learning techniques."
[4] G. Vinodhini and R. M. Chandrasekaran, "Sentiment analysis and opinion mining: A survey," vol. 2, no. 6.
[5] D. A. Allotey, "Sentiment analysis and classification of online reviews using categorical proportional difference."
[6] B. Pang and L. Lee, "A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts."
[7] H. Ghorbel and D. Jacot, "Sentiment analysis of French movie reviews."
[8] P. Kalaivani and K. L. Shunmuganathan, "Sentiment classification of movie reviews by supervised machine learning," vol. 4, no. 4, pp. 285-292.
[9] A. Kennedy and D. Inkpen, "Sentiment classification of movie and product reviews using contextual valence shifters," vol. 22, pp. 110-125.
[10] F. Peng, D. Schuurmans, and S. Wang, "Augmenting naive bayes classifiers with statistical language models," vol. 7, pp. 317-345.
[11] K. Yessenov and S. Misailovic, "Sentiment analysis of movie review comments," 2009.
[12] H. Khoshnevis, "Comparing feature sets and classifiers for sentiment analysis of opinionated free text."
[13] K. S. Doddi, "Sentiment classification of news article."
[14] P. do Nascimento Barata Leal Varela, "Sentiment analysis."
[15] O. Kummer and J. Savoy, "Feature selection in sentiment analysis."
[16] H. Shimodaira, "Text classification using naive bayes."
[17] S. N. Group, "The Stanford natural language processing group."
[18] E. Davis, "Naive bayes classification steps."
[12] H. Khoshnevis, “Comparing features sets and classifiers for sentiment analysis of opinionated free text.” [13] K. S. Doddi, “Sentiment classification of news article.” [14] P. do Nascimento Barata Leal Varela, “Sentiment analysis.” [15] O. Kummer and J. Savoy, Feature Selection in Sentiment Analysis. [16] H. Shimodaira, “Text classification using naive bayes.”. [17] S. N. Group, “The stanford natural language processing group.” [18] E. Davis, “Naive bayes classification steps.”