Sci.Int.(Lahore), 26(2), 865-869, 2014
ISSN 1013-5316; CODEN: SINTE 8
OPTIMIZING UNSUPERVISED LEARNING OF OPINION TARGETS FROM UNSTRUCTURED REVIEWS USING WORDNET BASED SEMANTIC ORIENTATION

Khairullah Khan, Institute of Engineering and Computing Sciences, University of Science & Technology Bannu ([email protected]; office: 0928-621440)
Aurangzeb Khan, Institute of Engineering and Computing Sciences, University of Science & Technology Bannu ([email protected])
Baharum Bin Baharudin, Department of Computer and Information Sciences, Universiti Teknologi PETRONAS, Malaysia ([email protected])
Ashrafullah, Institute of Engineering and Computing Sciences, University of Science & Technology Bannu ([email protected])
ABSTRACT: Opinion target identification is an important subtask of the opinion mining problem. Several approaches have been employed for this task, which can be broadly divided into two major categories: supervised and unsupervised. Supervised approaches require training data, which demands manual effort, and are mostly domain dependent. Unsupervised techniques are the most popular because of two main advantages: they are domain independent and need no training data. This paper presents an optimization of an unsupervised approach for opinion target identification from unstructured reviews. Unsupervised approaches basically extract frequently observed features from text; hence infrequently occurring features are ignored. This paper utilizes lexical resources to identify such infrequent features. Empirical results show a direct positive impact on the performance of frequency-based feature extraction.
1. INTRODUCTION
The trend of collecting consumer reviews about products through online forums, blogs, and comment boxes for product analysis, performance analysis, target and achievement analysis, and sales is increasing day by day. Customers are mostly interested in learning the reputation of a product, person, or event in the shortest time, and nowadays the fastest and most feasible medium for such knowledge extraction is the World Wide Web. Therefore, an efficient system is needed to provide fast, summarized responses to the community at every level for quick decision making.
Opinion mining is the process of automatic extraction and summarization of opinion from text. It is most popularly used to mine public and expert opinion from web content about different domains. For example, opinion mining is used to analyze customer reviews about a particular product to predict future trends, interests, and perceptions of consumers, supporting the right decisions about the future. The opinion mining problem has been broadly divided into the following subtasks: opinion source identification, opinion target identification, opinion determination, and opinion summarization [1, 2].
In this paper our focus is on opinion target identification for the opinion mining process through unsupervised classification. The problem of opinion target identification is related to the question: "opinion about what?". Opinion target identification is essential for opinion mining. For example, an in-depth analysis of every aspect of a product based on consumer opinion is equally important for consumers, merchants, and manufacturers. In order to compare
the reviews, it is required to automatically identify and extract those features which are discussed in the reviews. Furthermore, analysis of a product at the feature level is more informative; for example, it tells which feature of the product is liked and which is disliked by consumers. Hence feature mining of products is important for opinion mining and summarization, and it provides a base for opinion summarization [3].
There are various problems related to opinion target extraction. Generally speaking, if a system is capable of identifying the target feature in a sentence or document, then it must also be able to identify opinionated terms or evaluative expressions. Thus, in order to identify opinion targets at the sentence or document level, the system should be capable of identifying evaluative expressions. Also, some features are not explicitly presented but are predicted from term semantics; these are called implicit features. The focus of this paper is on explicit feature identification techniques.
The rest of this paper is organized as follows: Section 2 presents related work and existing approaches, Sections 3 and 4 describe the proposed methodology, and Section 5 presents results and discussion, while Section 6 concludes the paper.
2. Related Work
Several approaches have been employed for this subtask of opinion mining. These approaches can be broadly divided into two major categories: supervised and unsupervised. Some authors have also used semi-supervised approaches. Supervised learning approaches are based on manually labeled text: a machine learning model is trained on manually labeled data to classify and predict
features in the reviews. Although supervised techniques provide good results in feature extraction, they require manual work for the preparation of training sets; this manual process is laborious, skill-oriented, and time consuming. Furthermore, supervised learning techniques are mostly employed for a specific domain, so they cannot be used in general. The most widely used supervised techniques are Decision Trees, K-Nearest Neighbor (KNN), Support Vector Machines (SVM), Neural Networks, and the Naïve Bayesian classifier [4].
On the other hand, unsupervised techniques do not require labeled data and are mostly domain independent. Unsupervised machine learning techniques are more general and depend chiefly on probability or frequency distributions. For opinion target extraction, unsupervised techniques automatically predict product features based on syntactic patterns and semantic relatedness. (In this paper we use the term "features" for product features unless otherwise specified, e.g. as linguistic features.) Unsupervised techniques have been widely employed for this subtask of opinion mining [5-14]. The two most popular techniques for opinion target extraction are association mining and the likelihood ratio test.
Hu and Liu [9] employed association mining [15] for product feature extraction. The association mining algorithm was originally used for market basket analysis, which predicts the dependency of one item's sale on another item's sale. Based on the analogy of market basket analysis, the authors in [9] assume that the words in a sentence can be considered as bought items; hence the association between terms can predict feature and opinion word associations. The implementation of this technique was very successful in feature extraction. Later this approach was extended by [16] for the same task with semantic-based pruning for frequent feature refinement and identification of infrequent features, which improved the results of opinion target identification through the association rule mining algorithm.
Another widely employed unsupervised classification technique is the Likelihood Ratio Test (LRT). The LRT was introduced by [17] and has been reported in different NLP tasks. It was employed by [18] for product feature extraction and sentiment analysis. The LRT technique assumes that features related to the topic are explicitly presented as noun phrases in the document, identified using syntactic patterns associated with subjective adjectives. Yi et al. (2003) described different linguistic patterns for the extraction of product features based on the LRT classification technique. The LRT was further enriched by [19] using different linguistic patterns and off-topic documents. The LRT approach has certain advantages over the association mining approach.
3. Proposed Architecture
This section describes the proposed architecture and algorithms of opinion target identification. The whole process is divided into three main phases: pre-processing, candidate opinion target selection, and feature refinement, as shown in Figure 1.
Figure 1: Block diagram of the proposed technique (input reviews → pre-processing → candidate selection → pattern filtering → target feature identification)

In the first phase, two steps are performed, noise removal and part-of-speech (POS) tagging, as usually done to prepare text for further processing. In the second phase, candidate features are selected through the noun phrase patterns described by [19]. The third phase contains two major steps: the first performs relevance scoring, which classifies the candidates into frequent and infrequent features, and for this the likelihood ratio test [18] is employed in a hybrid fashion with WordNet-based similarity, as explained in the following sections.
4. WordNet Based Semantic Similarity
As mentioned in the related work, the LRT approach employs the frequency of base noun phrases for relevance scoring. Because of the predefined threshold, infrequent features remain unidentified; hence the recall of this method suffers. In order to address this problem, we propose recycling infrequent candidate features through linguistic relatedness in lexical components. This technique is based on some basic linguistic properties. Similar features with different names are represented by synonyms in a lexical dictionary, e.g. use and utilize. Related features are represented by hyponymy or hypernymy relations, e.g. car and vehicle. Relevant features or parts of an object are represented by meronym relations, e.g. a car or vehicle has parts: splashboard, dashboard, etc. For example, if a target feature is classified as irrelevant due to its low frequency, then there is a high chance that an already selected target feature has a relation (e.g. synonymy, hyponymy) to the unidentified feature. Fortunately, very good lexical resources exist that can be employed to extract similar terms based on these relations,
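To make these three relations concrete, the following minimal Python sketch (ours, not part of the proposed system) collects synonyms, IS-A relatives, and parts for a candidate term, assuming NLTK with its WordNet corpus is installed; the function name and the example word "car" are illustrative only.

# A minimal sketch of the three lexical relations described above
# (synonyms, hypernyms/hyponyms, meronyms) via NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

def related_terms(word):
    """Collect synonyms, IS-A relatives, and parts of a noun."""
    related = {"synonyms": set(), "is_a": set(), "part_of": set()}
    for synset in wn.synsets(word, pos=wn.NOUN):
        # Synonyms: other members of the same synset, e.g. car/auto.
        related["synonyms"].update(l.name() for l in synset.lemmas())
        # IS-A relations: more general and more specific concepts.
        for rel in synset.hypernyms() + synset.hyponyms():
            related["is_a"].update(l.name() for l in rel.lemmas())
        # Meronyms: parts of the object, e.g. car -> dashboard.
        for part in synset.part_meronyms():
            related["part_of"].update(l.name() for l in part.lemmas())
    return related

print(related_terms("car"))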
e.g. WordNet [20]. The proposed optimization steps are described below.
Step 1: In this step, the LRT technique is applied to predict the frequent features. The input to this step is the histogram of the candidate features obtained in phase 2. The technique has been formulated by [18] and [19] as follows. Yi, Nasukawa et al. [18] presented an unsupervised technique for relevance scoring of candidate features; they employed two unsupervised techniques, the Mixture Model and the LRT, and their results show that the LRT performed relatively well. The likelihood ratio test is formulated as follows. Let Dc denote the topic-relevant collection of documents and Dn the collection of documents not relevant to the topic. Then the base noun phrases (BNPs) occurring in Dc are candidate features to be classified as topic relevant or topic irrelevant using the likelihood ratio test: if the likelihood score of a BNP satisfies the predefined threshold value, then the BNP is considered a target feature. The LRT value for a BNP x is calculated as follows. Let n1 denote the frequency of x in Dc, n2 the sum of the frequencies of all BNPs in Dc except x, n3 the frequency of x in Dn, and n4 the sum of the frequencies of all BNPs in Dn except x. Then the ratios of relevancy of the BNP x to the topic and non-topic collections, denoted r1 and r2 respectively, can be calculated as below.
$$ r_1 = \frac{n_1}{n_1+n_2} \qquad \text{(Eq. 1)} \qquad\qquad r_2 = \frac{n_3}{n_3+n_4} \qquad \text{(Eq. 2)} $$

Thus the combined ratio is calculated as:

$$ r = \frac{n_1+n_3}{n_1+n_2+n_3+n_4} \qquad \text{(Eq. 3)} $$

Hence, to normalize the ratios with logs, using the binomial likelihood $L(p;k,n) = p^{k}(1-p)^{n-k}$:

$$ \log\lambda = \log\frac{L(r;\,n_1,\,n_1+n_2)\;L(r;\,n_3,\,n_3+n_4)}{L(r_1;\,n_1,\,n_1+n_2)\;L(r_2;\,n_3,\,n_3+n_4)} \qquad \text{(Eq. 4)} $$

Hence the likelihood ratio is calculated as below.

$$ \mathrm{LRT}(x) = -2\log\lambda \qquad \text{(Eq. 5)} $$
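As an illustration only, the following Python sketch (ours) computes the score of Eqs. 1-5 for a single candidate BNP, assuming the counts n1-n4 have already been gathered from Dc and Dn; the guard for degenerate probabilities is our addition.

import math

def log_likelihood(p, k, n):
    """Binomial log-likelihood log[p^k (1-p)^(n-k)], guarding p in {0, 1}."""
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def lrt_score(n1, n2, n3, n4):
    """-2 log(lambda) for a candidate BNP, following Eqs. 1-5."""
    r1 = n1 / (n1 + n2)                    # relevancy to the topic (Eq. 1)
    r2 = n3 / (n3 + n4)                    # relevancy to the non-topic (Eq. 2)
    r = (n1 + n3) / (n1 + n2 + n3 + n4)    # combined ratio (Eq. 3)
    log_lambda = (log_likelihood(r, n1, n1 + n2) + log_likelihood(r, n3, n3 + n4)
                  - log_likelihood(r1, n1, n1 + n2) - log_likelihood(r2, n3, n3 + n4))
    return -2.0 * log_lambda               # (Eq. 5)

# A BNP frequent in topic documents and rare elsewhere scores high:
print(lrt_score(n1=30, n2=970, n3=2, n4=1998))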
The likelihood of relevance is directly proportional to the value of $-2\log\lambda$. As mentioned in the previous subsection, the LRT was employed by [18]; however, due to the non-availability of proper datasets for evaluation, those authors calculated only precision. Ferreira et al. [19] performed an evaluation on state-of-the-art datasets, namely the manually annotated corpora created by [9]. Furthermore, they modified the algorithm using subsequent similarity measures based on the following two rules.
Identification of feature boundaries for patterns. The earlier work [18] used BNPs, dBNPs, and bBNPs for candidate feature identification; noun phrases matching these patterns are considered candidate features. However, no rule was given for multiple matches. For example, the pattern "battery life" can reflect three features: "battery life", "battery", and "life". The more recent work [19] extended the earlier algorithm to select only the longest BNP pattern; in the above expression this rule keeps only "battery life" as a feature.
Classification of patterns with an adjective noun (JJNN). Most of the candidate BNPs are combinations of JJNN patterns. The adjective sometimes represents a feature, e.g. "digital images", and sometimes represents an opinion; hence the subsequent adjectives in the candidate patterns need to be classified. This rule was employed by [19] and improved the results. Another main contribution of that paper is a new annotation scheme for the features in the existing dataset originally employed by [9]; under the revised annotation scheme the number of features increased, since their focus was on all features.
Step 2: In the second step, the optimization technique is employed to predict infrequent features based on a semantic relation using the WordNet lexical dictionary. The input to this step is the list of those features which were classified as irrelevant by the LRT relevance scoring in Step 1. The algorithm finds the semantic relatedness of the irrelevantly classified features to the relevant features using the WordNet-based IS-A relation, which relies on the path length similarity between synsets [21].
WordNet is a large lexical database of English words grouped into about 117,000 sets of cognitive synonyms called synsets, which are linked to each other. Each synset represents a distinct concept, interlinked by means of conceptual-semantic and lexical relations. WordNet groups words together based on their meanings as well as specific senses of words, creating taxonomies or chains of semantically disambiguated words. The groupings of words in other dictionaries,
like a thesaurus, do not follow any explicit pattern other than meaning similarity. Words in WordNet, however, are arranged with additional semantic features such as the super-subordinate relation and a brief definition called a "gloss", which illustrates the use of the synset's members. Hence word forms with several distinct meanings are represented in as many distinct synsets, and each form-meaning pair is unique [22]. WordNet organizes words into taxonomies for four different parts of speech (noun, verb, adjective, and adverb); each synset represents a node of the corresponding taxonomy, and if a word has more than one sense, it appears in more than one node. The words in WordNet are managed with two types of relations, semantic and lexical: the relation between synsets is a semantic relation, while the relation between word senses is a lexical relation. As described above, a synset has different members called synonyms; in other words, the lexical relation holds between members of different synsets, while the semantic relation holds between whole synsets, i.e. nodes of the taxonomy [2, 23].
The most popular relation among synsets is the IS-A relation, also called the super-subordinate or hypernymy-hyponymy relation. It links more general synsets to more specific ones (for example, vehicle links to car and truck) or, conversely, specific to general. The hyponymy relation is transitive: if an armchair is a kind of chair, and a chair is a kind of furniture, then an armchair is a kind of furniture. The IS-A relation has been employed for semantic relatedness in a wide range of projects and research works [21, 24-26]. The taxonomy created through the IS-A relation can be used to calculate the semantic similarity between two synsets simply by treating the taxonomy as an undirected graph and measuring the distance between the nodes: "the shorter the path from one to another, the more similar they are" [21], where the path length is measured in nodes/vertices instead of edges, so the length between two members of the same synset (synonyms) is 1. The path length thus gives a simple way to compute the semantic relatedness between words, as sketched below.
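The following sketch (ours) illustrates this recycling step with NLTK's path_similarity, which scores two synsets as 1/(1 + shortest path length); this stands in for the node-counting measure described above, so the absolute values in Table 4 will not be reproduced exactly. The feature names are illustrative.

# Recycle an infrequent candidate by its best WordNet path similarity
# to any already-accepted frequent feature.
from nltk.corpus import wordnet as wn

def best_similarity(infrequent, frequent_features):
    """Return (best matching frequent feature, similarity score)."""
    best, best_score = None, 0.0
    for feature in frequent_features:
        for s1 in wn.synsets(infrequent, pos=wn.NOUN):
            for s2 in wn.synsets(feature, pos=wn.NOUN):
                score = s1.path_similarity(s2) or 0.0  # None -> 0.0
                if score > best_score:
                    best, best_score = feature, score
    return best, best_score

frequent = ["image", "player", "case"]
# "picture" and "image" share a synset, so the score is 1.0,
# consistent with the Image/Picture row of Table 4.
print(best_similarity("picture", frequent))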
A sample output of infrequent features (IF) extracted on the basis of semantic similarity with frequent features (FF) is given in Table 4.

Table 4: WordNet-based similarity scores
Frequent Features    Infrequent Features    SimScore
Case                 disks_types            0.903
Drawer               drive_bay              0.810
Finish               color_look             0.800
Image                Picture                1.000
Motor                drive_bay              0.800
dvd_menu             Player                 0.867
dvd_movie            Player                 0.879
forward_speed        Player                 0.851
Frame                Picture                0.832
interlace_mode       zoom_mode              0.814
5. EXPERIMENTS AND RESULTS
We performed experiments on benchmark datasets about five different products, collected from the Amazon product review site and manually annotated by [9]; the data are freely available from the authors' website (www.cs.uic.edu/~liub/FBS/sentiment-analysis.htm). These data have been widely used for feature and opinion extraction. The same dataset was re-annotated by [19] because their study focused on feature extraction only. The summary of the dataset is given in Table 2.

Table 2: Summary of the five product datasets with manually tagged features
Dataset     No. of Sentences    Features by [19] (Distinct)    Features by [19] (Total)
APEX        739                 166                            519
Canon       597                 161                            594
Creative    1716                231                            1031
Nikon       346                 120                            340
Nokia       546                 140                            470
For POS tagging we use state-of-the-art software, the Stanford POS tagger [21], freely available online (http://nlp.stanford.edu/software/tagger.shtml). To evaluate the effectiveness of the feature extraction results we use the standard evaluation measures, i.e. precision, recall, and f-score, computed from a confusion matrix. In terms of feature extraction, we take all noun phrases as the total set of features, all manually annotated features as positives, and all other noun phrases as negatives; by positive features we mean the actual product features, while negative features are noun phrases that are not actually features. Thus we count true positive, true negative, false positive, and false negative features, and the precision, recall, and f-score are calculated as below.

$$ \text{Precision} = \frac{TP}{TP+FP} \qquad \text{Recall} = \frac{TP}{TP+FN} \qquad \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
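For completeness, a minimal sketch (ours) of these computations from sets of extracted and annotated features; the example feature names are illustrative.

def evaluate(extracted, annotated):
    """Precision, recall, and f-score of extracted features vs. gold annotations."""
    tp = len(extracted & annotated)   # true positives: real features found
    fp = len(extracted - annotated)   # false positives: noun phrases wrongly kept
    fn = len(annotated - extracted)   # false negatives: real features missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Toy example: one hit, one spurious extraction, one miss -> P = R = F = 0.5
print(evaluate({"battery life", "lens"}, {"battery life", "zoom"}))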
As shown in Table 3, the precision of LRT with dBNP is high but its recall is very low. Through pre-optimization, the recall is improved significantly with a small decrease in precision, and post-optimization improves the recall further. The average f-score at both phases outperforms the existing dBNP-with-LRT approach.
6. CONCLUSION
This paper presents an improved unsupervised learning technique for opinion target identification. A two-way
optimization is proposed to improve the results of the likelihood ratio test for opinion target extraction. In the pre-optimization phase contextual patterns are employed, while in the post-optimization phase a lexical dictionary is used to improve the accuracy. Although precision decreases with the stricter pattern boundaries, the recall improves significantly; both optimization steps provide higher f-scores with balanced precision and recall. Hence the proposed technique outperforms the existing approaches.

REFERENCES
1. Pang, B. and L. Lee, Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2): p. 1-135, 2008.
2. Liu, B., Sentiment Analysis and Subjectivity, in Handbook of Natural Language Processing, N. Indurkhya and F.J. Damerau, Editors. 2010. p. 627-666.
3. Feldman, R., et al., Extracting Product Comparisons from Discussion Boards, in Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE Computer Society. p. 469-474, 2007.
4. Weiss, S.M., et al., Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer, 2005.
5. Lu, B., Identifying Opinion Holders and Targets with Dependency Parser in Chinese News Texts, in NAACL HLT Student Research Workshop. Los Angeles, California, 2010.
6. Xia, Y., B. Hao, and K.-F. Wong, Opinion Target Network and Bootstrapping Method for Chinese Opinion Target Extraction, in Information Retrieval Technology. Springer Berlin/Heidelberg. p. 339-350.
7. Popescu, A.-M. and O. Etzioni, Extracting product features and opinions from reviews, in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics: Vancouver, British Columbia, Canada. p. 339-346, 2005.
8. Liu, B., M. Hu, and J. Cheng, Opinion observer: analyzing and comparing opinions on the Web, in Proceedings of the 14th International Conference on World Wide Web. ACM: Chiba, Japan. p. 342-351, 2005.
9. Hu, M. and B. Liu, Mining and summarizing customer reviews, in Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: Seattle, WA, USA. p. 168-177, 2004.
10. Somasundaran, S. and J. Wiebe, Recognizing stances in online debates, in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, Volume 1. Association for Computational Linguistics: Suntec, Singapore. p. 226-234, 2009.
11. Jijkoun, V., M. de Rijke, and W. Weerkamp, Generating focused topic-specific sentiment lexicons, in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics: Uppsala, Sweden. p. 585-594, 2010.
12. Pak, A. and P. Paroubek, Twitter for Sentiment Analysis: When Language Resources are Not Available, in 22nd International Workshop on Database and Expert Systems Applications (DEXA), 2011.
13. Yu, J., et al., Aspect ranking: identifying important product aspects from online consumer reviews, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Association for Computational Linguistics: Portland, Oregon. p. 1496-1505, 2011.
14. Zhang, L. and B. Liu, Identifying noun product features that imply opinions, in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (short papers), Volume 2. Association for Computational Linguistics: Portland, Oregon. p. 575-580, 2011.
15. Agrawal, R. and R. Srikant, Fast Algorithms for Mining Association Rules in Large Databases, in Proceedings of the 20th International Conference on Very Large Data Bases. Morgan Kaufmann Publishers Inc. p. 487-499, 1994.
16. Wei, C.-P., et al., Understanding what concerns consumers: a semantic approach to product feature extraction from consumer reviews. Information Systems and E-Business Management, 8: p. 149-167, 2010.
17. Dunning, T., Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1): p. 61-74, 1993.
18. Yi, J., et al., Sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques, in Third IEEE International Conference on Data Mining (ICDM), 2003.
19. Ferreira, L., N. Jakob, and I. Gurevych, A Comparative Study of Feature Extraction Algorithms in Customer Reviews, in IEEE International Conference on Semantic Computing, 2008.
20. Stark, M.M. and R.F. Riesenfeld, WordNet: An Electronic Lexical Database, in 11th Eurographics Workshop on Rendering. MIT Press, 1998.
21. Toutanova, K. and C.D. Manning, Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger, in Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), 2000.