Opinion Mining of Online Product Reviews using Rough Set Reducts and Semantic Web Rules

K.C. Ravi Kumar
Associate Professor, CSE, Sridevi Womens Engg. College, Gandipet, Hyd

D. Teja Santosh
Research Scholar, CSE, UCEK, JNTUK, Kakinada

B. Vishnu Vardhan
Professor, CSE, JNTUHCEJ, Jagityal, Karimnagar

Abstract-Opinions are the perspectives of users about the products they have used. These opinions are written on review sites, and the reviews reflect the user's experience with the product over a considerable period of time. Opinions about the features of a product vary from user to user. Extracting the discriminative product features from uncertain and incomplete reviews is a complex task. Classifying the opinions on these discriminative features, leading to knowledge extraction, is also an important task. To carry out both tasks, a mathematical approach called rough sets is used. Rough set theory provides a way of finding the discriminative features, called reducts, from the set of given product features. These reducts preserve the knowledge expressed in the original information system (containing all product features and reviews). The obtained reducts are then used to engineer the Product Feature Ontology (PFO) and to express the opinion orientation on the PFO using Semantic Web Rule Language (SWRL) rules. The rules learned by the machine automatically classify positive and negative opinions, thereby helping a new user to make wise purchase decisions about the product.

Keywords: Opinions, reviews, rough set, reduct, PFO Ontology, SWRL rules

I. INTRODUCTION

Today, e-commerce websites provide customers with the product information they need through various kinds of services. One such service allows the customer to read online reviews written by end users. Online reviews mention features that are useful for product analysis, and they vary from user to user. The review database has grown over time, leading to a surplus of reviews. Moreover, opinions are written on a diverse number of web sources.

The reviews regularly fed into the system are not useful for every cross section of people, and finding the relevant sources of review information is a formidable task. The product features, as conditional attributes, and the reviews with their class label, as the decision attribute, are organized in a structured form known as an information system. Identifying all the features from all the reviews and using them in automated review classification is a heavy job. Moreover, not all the identified features are suitable for machine learning the review categories. Extracting the discriminative product features from uncertain and incomplete reviews is a complex task. Classifying the opinions on these discriminative features, leading to knowledge extraction, is also an important task. This has led to the concept of Opinion Mining [2]. In this work, Opinion Mining is carried out using a mathematical approach called rough sets [1] together with training-less automatic opinion classification of features using an Ontology [3]. Rough set theory provides a way of finding the discriminative features from the set of given product features. These are called reducts. The reducts preserve the knowledge expressed in the original information system. The obtained reducts are then used to engineer the Product Feature Ontology (PFO) and to express the opinion orientation on the PFO using Semantic Web Rule Language (SWRL) [4] rules. The rules learned by the machine automatically classify positive and negative opinions, thereby helping a new user to make wise purchase decisions about the product.

The organization of the paper is as follows: the contributions in this direction are critically reviewed in section 2, terminology is introduced in section 3, the proposed method is explained in section 4, the evaluation of results is discussed in section 5, and finally the conclusion and future scope are given in section 6.

II. RELATED WORK

Using rough sets to find the discriminative features from a set of given features, often known as feature subset selection, has been a major research topic. Zdzislaw Pawlak [1] introduced rough set theory and explained the concept of a reduct. Ahn et al. [5] integrated rough set theory with an artificial neural network in order to predict business failure. They considered eight important attributes, applied the concept of reduct, and obtained twelve minimal reducts from these attributes. Aboul Ella Hassanien and Jafar M.H. Ali [6] applied rough sets to breast cancer data with nine conditional attributes and one decision attribute in the information system. A reduct with only two attributes was generated, which dictates the label of the cancer stage class. Świniarski [7] applied rough sets and Principal Components Analysis (PCA) [8] to a face recognition application to reduce the number of features from the images. Komorowski et al. [9] used reducts to reduce the number of concept attributes in a candidate hiring example.

Ontology based decision making using rough set reduct data has also been a major research topic. C. Maria Keet [10] developed an Ontology to ascertain the completeness of the knowledge; the author compared a horizontal gene transfer ontology with a genomics ontology to validate the knowledge. Nguyen Sinh Hoa and Nguyen Hung Son [11] applied a concept ontology to the rough classifiers obtained from reducts to improve the accuracy of the classifier learned on the Nursery data set. Grochowalski and Pancerz [12] used an ontology to search a rough set database in an intelligent manner.

Rough set reduct based rule learning has been a major research topic in recent times. Wang et al. [13] investigated customer reviews and offered support for online phone buying decisions. Das et al. [14] learned rules from a rough set information system of product reviews using the LEM2 algorithm. The authors attempted to help business analysts understand the product dimensions and the natural associations among them.

In studying how deductive rules are obtained from Ontologies using rough set reducts, certain shortcomings are identified. Reducts are generated from the information system and are used only in rough set theory based applications; the obtained reduct attributes work well only in a rough set environment. Reducts need to be combined with other attribute reduction techniques such as PCA to work better. Rough set reduct based rule learning employs rough set specific rule learning algorithms, whose rules do not support clear decisions. Reducts can also be used with other rule learning algorithms, which learn better rules than the rough set algorithms. Ontology based decision making has provided support to strengthen the beliefs about the domain information system. Ontology has also been used to learn rules hierarchically in order to improve the accuracy of the classifier learned from the rough set environment. There is scope to define SWRL rules on the ontologies to learn more meaningful rules. SWRL rules improve the expressiveness of the ontology and thereby enable better automatic reasoning.

III. TERMINOLOGY

The terms introduced in this section fall into two categories: terms related to Ontology tools and terms related to rough set theory.

Terms related to Ontology tools

3.1 Protégé Ontology Editor
Protégé [15] is a platform for engineering ontologies. It is open source software developed by Stanford Medical Informatics. Its ontological knowledge engineering is based on the frames representation from Artificial Intelligence.

3.2 Ontorion Fluent Editor
Ontorion Fluent Editor [16] is a comprehensive tool for editing and manipulating complex ontologies using Controlled Natural Language. Fluent Editor provides an alternative to XML-based OWL editors that is more suitable for human users. The editor's main feature is the use of Controlled English as a knowledge modeling language. Supported by the Predictive Editor, it prevents the user from entering any sentence that is grammatically or morphologically incorrect and actively helps the user during sentence writing. Controlled English is a subset of English with restricted grammar and vocabulary, designed to reduce the ambiguity and complexity of the language.

Terms related to Rough set theory topics

3.3 Object, Conditional Attribute, Decision Attribute, Equivalence Class and Information System
An object [1] is information pertaining to a particular domain. A domain contains many objects possessing similar and dissimilar attributes. The conditional attributes [1] are attributes of the domain objects. These are unique. A decision attribute [1] is a special attribute which has a class label assigned to it and is dependent on the conditional attributes. An equivalence class [1] is a set containing all those objects that are similar (indiscernible) to each other. The collection of objects, conditional attributes and decision attribute together forms an information system or Universe [1].

3.4 Lower Approximation, Upper Approximation, Boundary Region and Reduct
The lower approximation [1] consists of all objects which surely belong to the set, and the upper approximation [1] contains all objects which possibly belong to the set. The difference between the upper and the lower approximation constitutes the boundary region [1] of the rough set. Approximations are fundamental concepts of rough set theory.
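For reference, the notions of 3.3 and 3.4 can be written compactly in the standard notation of rough set theory [1]. The symbols U (the set of objects), B (a subset of the conditional attributes), X (a target set of objects) and [x]_B (the equivalence class of object x under B) are notation assumed here for illustration; they are not introduced in the text above.

```latex
\begin{align*}
\mathrm{IND}(B) &= \{(x, y) \in U \times U : a(x) = a(y) \ \text{for all } a \in B\} \\
\underline{B}X &= \{x \in U : [x]_B \subseteq X\} \\
\overline{B}X &= \{x \in U : [x]_B \cap X \neq \emptyset\} \\
\mathrm{BN}_B(X) &= \overline{B}X \setminus \underline{B}X
\end{align*}
```

The first line is the indiscernibility relation whose classes are the equivalence classes, the next two lines are the lower and upper approximations, and the last line is the boundary region.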
A reduct [1] is a reduced subset of the original set of conditional attributes such that the equivalence classes induced by the reduct are the same as the equivalence class structure of the original attribute set, and no attribute can be removed from the reduct without changing the original equivalence class structure of the full attribute set.

IV. OPINION MINING OF ONLINE PRODUCT REVIEWS USING ROUGH SET REDUCTS AND SEMANTIC WEB RULES

The principal objective of having the machine learn new relations from the Product Feature Ontology data is to filter the reviews into positive and negative classes based on the incoming feature request. This is carried out by retrieving the product reviews written on the feature hierarchy, which supports wise decision making for a new customer who wants to purchase a product. To achieve this goal, the proposed framework is presented in Figure 1. The proposed model consists of three modules: obtaining the reduced product feature set from the information system using the rough set approach, engineering the Product Feature Ontology (PFO) using the reduct features, and defining SWRL rules to classify the product reviews based on the reduct features.

[Figure 1 flow: Extraction of product reviews -> Pre-process the reviews -> Generate the information system -> Apply Rough Set Theory -> Obtain reduced attribute set (Reducts) -> Engineer Product Feature Ontology (PFO) -> Define SWRL rules to classify opinions]

Figure 1. Proposed Model

4.1 Generation of reduced product feature set (Reduct) from the information system using rough set approach

Online reviews are written by customers on one, two or many features of a product. Identifying the distinguishing features from a huge database of reviews is a complex job. The information system for this job is developed using the benchmark dataset used by Bing Liu [17] in his work. The unstructured dataset of product reviews is first preprocessed to remove stop words. After preprocessing, the dataset is left with only the product features, sub-features and corresponding opinion words. The information system is then created; it consists of objects, which are the online reviews, condition attributes, which are the product features, and the decision attribute, which is the class label. The information system for online product reviews is tabulated in Table 1.

Table 1. Information system for online product reviews

Object  Radio (A1)  Bluetooth (A2)  Price (A3)  Size (A4)  Camera (A5)  Display (A6)  Flash (A7)  Battery (A8)  OO
O1      1           1               1           0          0            0             0           0             P
O2      1           1               1           0          0            0             0           0             P
O3      1           1               1           0          0            0             0           0             P
O4      1           1               1           0          0            0             0           0             P
O5      1           1               1           0          0            0             0           0             P
O6      1           1               1           0          0            0             0           0             P
O7      1           0               1           0          0            0             0           0             N
O8      1           0               1           0          0            0             0           0             P
O9      1           0               1           0          0            0             0           0             P
O10     1           0               0           1          0            0             0           0             N

A value of '1' represents the presence of the feature in the review and a value of '0' represents its absence. The character 'P' represents the Positive class label and the character 'N' represents the Negative class label for the Opinion Orientation (OO) decision attribute. Reviews characterized by the same information are indiscernible (similar) in view of the available information about them. The indiscernibility relation generated in this way is the mathematical basis of rough set theory. A fundamental principle of a rough set based learning system is to discover redundancies and dependencies between the given features of a problem to be classified. The target set 'X' of the reduct is considered to have only six of the given features.

The reducts are generated by finding the equivalence classes and then approximating the given objects from below and from above, using lower and upper approximations based on the equivalence classes, and finding the decision boundary. After the approximations are done, the reducts are generated by following the principles specified in subsection 3.4. The equivalence classes for the rough set database of 30 reviews are given in Table 2.

Table 2. Equivalence Classes

Equivalence Classes
{O1, O2, O3, O4, O5, O6}
{O7, O8, O9}
{O10}
{O11, O14}
{O12, O13}
{O15, O16, O17, O18, O19}
{O20, O21}
{O22}
{O23}
{O24, O25, O26}
{O27}
{O28, O29, O30}

The lower approximation, or positive region, is the union of all equivalence classes that are contained in the target set. The upper approximation is the union of all equivalence classes that have a non-empty intersection with the target set; the objects outside the upper approximation form the negative region. These approximations are illustrated in Figure 2.
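The grouping and approximation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the ten rows are an excerpt patterned on Table 1, the target set is taken to be the reviews labelled P, and the helper names `equivalence_classes` and `approximations` are hypothetical.

```python
from collections import defaultdict

FEATURES = ["radio", "bluetooth", "price", "size",
            "camera", "display", "flash", "battery"]

# Ten-object excerpt patterned on Table 1: (feature values, opinion orientation).
REVIEWS = {f"O{i}": ((1, 1, 1, 0, 0, 0, 0, 0), "P") for i in range(1, 7)}
REVIEWS.update({
    "O7": ((1, 0, 1, 0, 0, 0, 0, 0), "N"),
    "O8": ((1, 0, 1, 0, 0, 0, 0, 0), "P"),
    "O9": ((1, 0, 1, 0, 0, 0, 0, 0), "P"),
    "O10": ((1, 0, 0, 1, 0, 0, 0, 0), "N"),
})


def equivalence_classes(table, attr_idx):
    """Group objects that agree on every attribute index in attr_idx."""
    groups = defaultdict(set)
    for obj, (values, _label) in table.items():
        groups[tuple(values[i] for i in attr_idx)].add(obj)
    return list(groups.values())


def approximations(classes, target):
    """Return (lower, upper) approximations of target from the classes."""
    lower = set().union(*(c for c in classes if c <= target))  # fully inside
    upper = set().union(*(c for c in classes if c & target))   # overlapping
    return lower, upper


classes = equivalence_classes(REVIEWS, range(len(FEATURES)))
positive = {obj for obj, (_values, label) in REVIEWS.items() if label == "P"}
lower, upper = approximations(classes, positive)
boundary = upper - lower
print(sorted(lower), sorted(boundary))
```

On this excerpt the reviews O7, O8 and O9 share the same feature values but carry different labels, so they end up in the boundary rather than in the lower approximation.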

Figure 2. Rough set notations (Source: [18])

The lower and upper approximations based on the equivalence classes are specified in Table 3.

Table 3. Lower and Upper Approximations of the objects

Type of Approximation   Approximation sets
Lower Approximation     {O1, O2, O3, O4, O5, O6, O12, O13, O15, O16, O17, O18, O19, O20, O21, O22, O23, O24, O25, O26, O28, O29, O30}
Upper Approximation     {O7, O8, O9, O10, O11, O14, O27}

The decision boundary is the difference between the upper approximation and the lower approximation. The objects on the decision boundary are {O7, O8, O9, O11, O14}. These objects can be neither confirmed nor ruled out as members of the target set. The rough set is the pair of lower and upper approximation sets of objects. A reduct can be thought of as a sufficient set of features, that is, sufficient to represent the category structure. To calculate the reducts from the information system, two properties must be clearly understood:

1. The equivalence classes induced by the reduct are the same as the equivalence classes induced by the full attribute set.

2. The reduct is minimal. This means that no attribute can be removed from the reduct without changing the equivalence classes.

Following these two properties of forming a reduct, the reduct set of attributes for the given information system is tabulated in Table 4.

Table 4. Reduced Attribute Set

Reduct
{A1, A2, A3, A4, A5, A6}
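The two reduct properties can be checked mechanically. The sketch below is a brute-force search, exponential in the number of attributes and therefore only suitable for small tables; it is not the procedure used by the authors. It runs on a ten-object excerpt patterned on Table 1, so its reducts are not expected to match the reduct reported for the full 30-review system. The names `partition` and `find_reducts` are hypothetical.

```python
from itertools import combinations

FEATURES = ["radio", "bluetooth", "price", "size",
            "camera", "display", "flash", "battery"]

# Ten-object excerpt patterned on Table 1 (feature values only).
ROWS = {f"O{i}": (1, 1, 1, 0, 0, 0, 0, 0) for i in range(1, 7)}
ROWS.update({
    "O7": (1, 0, 1, 0, 0, 0, 0, 0),
    "O8": (1, 0, 1, 0, 0, 0, 0, 0),
    "O9": (1, 0, 1, 0, 0, 0, 0, 0),
    "O10": (1, 0, 0, 1, 0, 0, 0, 0),
})
TABLE = {obj: dict(zip(FEATURES, vals)) for obj, vals in ROWS.items()}


def partition(table, attrs):
    """Equivalence classes induced by the attributes in attrs."""
    groups = {}
    for obj, row in table.items():
        groups.setdefault(tuple(row[a] for a in attrs), set()).add(obj)
    return {frozenset(g) for g in groups.values()}


def find_reducts(table, attributes):
    """Exhaustively search for minimal subsets that preserve the partition."""
    full = partition(table, attributes)
    found = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if partition(table, subset) == full:  # property 1
                found.append(subset)
    return found


reducts = find_reducts(TABLE, FEATURES)
# The core is the set of attributes common to all reducts.
core = set(FEATURES).intersection(*map(set, reducts))
print(reducts, core)
```

Enumerating subsets by increasing size and skipping supersets of reducts already found is what enforces minimality (property 2) in this sketch.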

The information system contains eight condition attributes, whereas the reduct contains only six. These six attributes preserve the knowledge expressed in the original information system. Only one reduct satisfying the above-mentioned properties is obtained. The core set is calculated as the set of attributes common to all reducts. Since only one reduct is present, the reduct itself becomes the core set. The core set helps in representing the category structure at the time of opinion classification. The rules learned from the rough set environment are listed below in the form:

Rule No. Condition => Decision [support, relative strength (%)]

Rule 1. (Radio=0.000000) => (Class at most P); [10, 43.48%]

Rule 2. (Bluetooth=1.000000) => (Class at most P); [8, 34.78%]

Rule 3. (camera=0.000000) & (price=0.000000) & (display=0.000000) & (Bluetooth=0.000000) & (size=0.000000) => (Class at most P); [5, 21.74%]

Rule 4. (size=1.000000) & (display=0.000000) => (Class at least N); [1, 50.00%]

Rule 5. (camera=1.000000) & (Radio=1.000000) => (Class at least N); [1, 50.00%]

Rule 6. (price=1.000000) & (Bluetooth=0.000000) => (Class = P OR N); [3, 60.00%]

Rule 7. (display=1.000000) & (camera=0.000000) => (Class = P OR N); [2, 40.00%]
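The reported relative strengths appear consistent with dividing each rule's support by the total support of all rules that share its decision. This reading is an inference from the numbers, not something the text states, and the function name `relative_strengths` is hypothetical. A short check:

```python
from collections import defaultdict

# (rule number, decision, support) as listed in the rules above.
RULES = [
    (1, "at most P", 10),
    (2, "at most P", 8),
    (3, "at most P", 5),
    (4, "at least N", 1),
    (5, "at least N", 1),
    (6, "P OR N", 3),
    (7, "P OR N", 2),
]


def relative_strengths(rules):
    """Support of each rule as a percentage of its decision's total support."""
    totals = defaultdict(int)
    for _num, decision, support in rules:
        totals[decision] += support
    return {num: round(100 * support / totals[decision], 2)
            for num, decision, support in rules}


strengths = relative_strengths(RULES)
for num, value in strengths.items():
    print(f"Rule {num}: {value:.2f}%")
```

Under this assumption the computed percentages match the seven values listed with the rules.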

The accuracy of the learned rules is assessed with the help of test data. It is presented in Figure 3.

Figure 3. Rough Classifier Accuracy

4.2 Engineering the Product Feature Ontology (PFO)

In computer science and information science, an ontology [3] formally represents knowledge as a set of concepts within a domain and the relationships between those concepts. It can be used to model a domain and support reasoning about concepts. In theory, an ontology is a "formal, explicit specification of a shared conceptualization". Formally, an ontology is the statement of a logical theory [19]. Some of the reasons to engineer an ontology are: to share a common understanding of the structure of information among people or software agents, to enable reuse of domain knowledge, to make domain assumptions explicit, to separate domain knowledge from operational knowledge, and to analyze domain knowledge.

The Artificial Intelligence literature contains many definitions of ontology, many of which contradict one another. An ontology is a formal explicit description of concepts in a domain of discourse (classes, sometimes called concepts), properties of each concept describing various features and attributes of the concept (slots, sometimes called roles or properties), and individual instances of classes. Together these constitute a knowledge base. In reality, there is a fine line between where the ontology ends and the knowledge base begins. In practical terms, developing the ontology includes determining the domain and scope of the ontology, defining classes in the ontology, arranging the classes in a taxonomic (subclass-superclass) hierarchy, defining slots and describing allowed values for these slots, and filling in the values for slots for instances. The PFO Ontology is engineered following these steps.

4.2.1 Determining the domain and scope of PFO Ontology



• The domain that the PFO Ontology covers is Product Reviews.

• The PFO Ontology is used to relate product features among themselves and to identify the opinion orientation associated with a particular feature.

• The PFO Ontology answers the following questions: what the product features are, which object is present in the review, on which feature the opinion is expressed, and what the opinion orientation of the review is with respect to that feature.

4.2.2 Defining classes in the PFO Ontology
The classes in the PFO Ontology are defined using the top-down approach. First, general classes such as Battery, Display, Camera, Memory, Features and Price are created. Then specialized classes, such as Card Slot and Internal for Memory, are created.

4.2.3 Arranging the classes in a taxonomic (subclass-superclass) hierarchy
From the list of defined classes, the terms that describe objects having independent existence, rather than terms that describe these objects, are selected. These terms become classes in the ontology and anchors in the class hierarchy. The classes are organized hierarchically.

4.2.4 Defining slots and describing allowed values for these slots
The classes defined for the PFO Ontology alone do not provide enough information to answer the competency questions. Once the classes are defined, the internal structure of the concepts must be described through the properties of the classes. The object properties include dependson, decidesupon, shows and fixedon.
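The class hierarchy and property names described in 4.2.2 to 4.2.4 can be mirrored in plain Python to make the structure concrete. This is only an illustrative data-structure sketch, not an OWL model and not how Protégé or Fluent Editor store the ontology; the `OntClass` type and the identifier `CardSlot` are hypothetical, and no domains or ranges are assigned to the properties because the text does not give them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OntClass:
    """A named class with an optional superclass."""
    name: str
    parent: Optional["OntClass"] = None

    def ancestors(self):
        """Names of all superclasses, nearest first."""
        chain, node = [], self.parent
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain


# General classes named in 4.2.2 (top-down approach).
general = {name: OntClass(name) for name in
           ["Battery", "Display", "Camera", "Memory", "Features", "Price"]}

# Specialized classes of Memory named in 4.2.2.
card_slot = OntClass("CardSlot", general["Memory"])
internal = OntClass("Internal", general["Memory"])

# Object property names from 4.2.4 (domains and ranges left unspecified).
object_properties = ["dependson", "decidesupon", "shows", "fixedon"]

print(card_slot.ancestors(), internal.ancestors())
```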

4.2.5 Filling in the values for slots with instances
The final step is creating individual instances of the classes in the hierarchy. Defining an individual instance of a class requires choosing a class, creating an individual instance of that class, and filling in the slot values. The PFO Ontology is engineered with the obtained reduct features as Ontology classes and the corresponding values as instances of the classes. The visualization of the PFO Ontology is presented in Figure 4.

Figure 4. Visualization of Product Feature Ontology

4.3 Defining SWRL rules to classify the product reviews based on product features
Reasoning over the PFO Ontology is carried out by improving the expressiveness of the ontology with SWRL rules. The following SWRL rules deduce the relations among the related classes in the PFO Ontology for learning the opinion orientation of the feature present in the review.

• Price(amount) ∧ Display(display_size) → OpinionOrientation(Negative)

• specifiedon(?x, ?z) ∧ decidesupon(?x, ?y) → OpinionOrientation(?y)

• CommunicationFacilities(bluetooth) → OpinionOrientation(Positive)

The Reasoner of Ontorion Fluent Editor helps to query the Ontology in Controlled Natural Language, with its corresponding grammar, for understanding classes, properties and instances.

Figure 5. Ontorion Reasoner to relate price and opinion-orientation classes

The classification of the features obtained from the reducts that are also specified in the review, using SWRL rules, is presented in Figure 6. Direct feature values are used as atoms in the SWRL entities to test the classification.

Figure 6. SWRL rules after debug in Ontorion Fluent Editor
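To illustrate how the three rules of Section 4.3 fire on a review, the sketch below hand-codes them over a set of ground facts. It is not a SWRL engine and does not use the Ontorion reasoner; the tuple encoding of facts, the function name `infer_orientations` and the review identifier `r1` are assumptions made for this example.

```python
def infer_orientations(facts):
    """Apply the three rules to ground facts and return derived orientations.

    Unary atoms are encoded as (Class, individual) and binary atoms as
    (property, subject, object).
    """
    derived = set()

    # Price(amount) ∧ Display(display_size) → OpinionOrientation(Negative)
    if ("Price", "amount") in facts and ("Display", "display_size") in facts:
        derived.add("Negative")

    # specifiedon(?x, ?z) ∧ decidesupon(?x, ?y) → OpinionOrientation(?y)
    specified = {f[1] for f in facts if f[0] == "specifiedon"}
    for f in facts:
        if f[0] == "decidesupon" and f[1] in specified:
            derived.add(f[2])

    # CommunicationFacilities(bluetooth) → OpinionOrientation(Positive)
    if ("CommunicationFacilities", "bluetooth") in facts:
        derived.add("Positive")

    return derived


bluetooth_review = {("CommunicationFacilities", "bluetooth")}
price_display_review = {("Price", "amount"), ("Display", "display_size")}
linked_review = {("specifiedon", "r1", "camera"),
                 ("decidesupon", "r1", "Positive")}

print(infer_orientations(bluetooth_review))
print(infer_orientations(price_display_review))
print(infer_orientations(linked_review))
```

A review whose facts trigger both the first and the third rule would receive both orientations in this sketch; how such conflicts are resolved is not specified in the text.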

V. RESULTS AND EVALUATION

The benchmark dataset used by Liu et al. [17] is used in this work. The classification accuracy obtained after generating the reduct attributes is 83%. With SWRL rules defined on the PFO Ontology containing the reduct features as instances, the training-less classifier (ontology) achieves 86.7% accuracy in classifying the product reviews based on the feature. Basant Agarwal and Namita Mittal [20] learned an SVM classifier on an electronics reviews dataset using rough set based hybrid feature selection; the accuracy of that SVM classifier was 83.5%. This indicates that using a machine learning classifier with reduct attributes does not suffice in all situations. The comparison of the classifier accuracies is presented in Figure 7.

Figure 7. Comparison of classifier Accuracy

VI. CONCLUSION AND FUTURE SCOPE

The automatic machine learning of opinion orientation from the Product Feature Ontology, by extending the expressiveness of the Ontology with SWRL rules, has been carried out successfully. The rough set reduct is used in engineering the PFO Ontology. It has been shown that machine learning the review categories using the Ontology is better than machine learning the review categories using an SVM classifier with reduct attributes. In the future, in order to verify the obtained classes of positive and negative feature-specific reviews, the reviews will be queried on the feature using SPARQL queries. Reviews related to the feature, and to features combined with the current feature, will be shown with details such as rating, review URI, strength of the feature and positive count of reviews on that feature.

REFERENCES

[1] Z. Pawlak, Rough Sets – Theoretical Aspects of Reasoning about Data. Boston, London, Dordrecht: Kluwer, 1991.
[2] Bo Pang and Lillian Lee, Opinion mining and sentiment analysis, Foundations and Trends in Information Retrieval Vol. 2, No 1-2 (2008) 1–135.
[3] Ian Horrocks, Ontologies and the Semantic Web, ACM 2009.
[4] Ian Horrocks, Peter F. Patel-Schneider, Harold Boley, Said Tabet, Benjamin Grosof, and Mike Dean. SWRL: A Semantic Web Rule Language Combining OWL and RuleML. W3C Member Submission, 21 May 2004. Available at http://www.w3.org/Submission/SWRL/.
[5] Ahn, B. S., S. S. Cho, and C. Y. Kim. "The integrated methodology of rough set theory and artificial neural network for business failure prediction." Expert systems with applications 18.2 (2000): 65-74.
[6] Hassanien, Aboul Ella, and Jafar MH Ali. "Rough Set Approach for Generation of Classification Rules of Breast Cancer Data." Informatica, Lith. Acad. Sci. 15.1 (2004): 23-38.
[7] Świniarski, Roman W. "Rough sets methods in feature reduction and classification." International Journal of Applied Mathematics and Computer Science 11.3 (2001): 565-582.
[8] Smith, Lindsay I. "A tutorial on principal components analysis." Cornell University, USA 51 (2002): 52.
[9] Komorowski, Jan, et al. "Rough sets: A tutorial." Rough fuzzy hybridization: A new trend in decision-making (1999): 3-98.
[10] Keet, C. "Ontology engineering with rough concepts and instances." Knowledge Engineering and Management by the Masses (2010): 503-513.
[11] Hoa, Nguyen Sinh, and Nguyen Hung Son. "Improving rough classifiers using concept ontology." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2005. 312-322.
[12] Grochowalski, Piotr, and Krzysztof Pancerz. "The outline of an ontology for the rough set theory and its applications." This three-volume booklet contains the papers presented at CS&P'08, the 17th (2009): 192.
[13] Wang, Wei, et al. "A rough set approach to online customer's review mining." Advances in Computer Science and Information Engineering. Springer Berlin Heidelberg, 2012. 229-234.
[14] Das, Tushar Kanti, D. P. Acharjya, and M. R. Patra. "Business intelligence from online product review-a rough set based rule induction approach." Contemporary Computing and Informatics (IC3I), 2014 International Conference on. IEEE, 2014.
[15] Alani, Harith, et al. "Using protege for automatic ontology instantiation." (2004).
[16] Pawel Kaplanski. "Controlled English Interface for Knowledge Bases." (2011).
[17] Minqing Hu, Bing Liu, Mining and summarizing customer reviews, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, August 22-25, 2004, Seattle, WA, USA.
[18] Pawlak, Z., Skowron, A.: Rudiments of rough sets. Inform. Sciences 177(1) (2007) 3–27.
[19] Gruber, Thomas. "Toward Principles for the Design of Ontologies Used for Knowledge Sharing." International Journal Human-Computer Studies Vol. 43, Issues 5-6, November 1995, p.907-928.
[20] Agarwal, Basant, and Namita Mittal. "Sentiment classification using rough set based hybrid feature selection." Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA). 2013.