AWERProcedia Information Technology & Computer Science 1 (2012) 1025-1032
2nd World Conference on Information Technology (WCIT-2011)
Extracting reliability of the products through user reviews

Vesile Evrim*, Daniyar Matraimov
Information System Engineering, Cyprus International University, Haspolat-Lefkoşa, via Mersin 10 Turkey, North Cyprus
Abstract

As the number of available products increases, consumers often feel compelled to search through online reviews to make an informed product choice. It has therefore become important to develop methods that help customers find the products that best fit their needs. This paper analyzes the two dimensions of the review rating process: feature extraction and sentiment analysis. In feature extraction, we used term frequencies and adopted several algorithms to reduce the dimension of the feature set. For sentiment analysis, the extracted feature set is used with the developed algorithms to determine the polarity of the reviews. Several experiments were conducted to test the success of the system. As a result, we were not only able to provide detailed feature-based ratings to users but also to propose rating scores more reliable than those of the products' websites.

Keywords: Sentiment Analysis, Feature Extraction, User Review, Rating, WordNet, Polarity

Selection and peer review under responsibility of Prof. Dr. Hafize Keser.
©2012 Academic World Education & Research Center. All rights reserved.
1. Introduction

The World Wide Web is expanding enormously, not only in size but also in the services and content it provides for its users. Individual users participate more actively in creating and retrieving information from the Web. They can sell and buy products, make travel plans, and express their opinions on products and services through many channels such as online forums, shopping websites, blogs, and wikis. Similarly, the growth of technology in general offers many options to select among. For example, ten years ago users had few options when selecting a laptop; today, there are hundreds of different models with different features, from different brands, to select among.
* ADDRESS FOR CORRESPONDENCE: Vesile Evrim, Information System Engineering, Cyprus International University, Haspolat-Lefkoşa, via Mersin 10 Turkey, North Cyprus.
E-mail address: [email protected] / Tel.: +0392-671-111-2408
The availability of a variety of options compels people to make comparisons and to search for the option that best fits their needs. Currently, many websites such as Epinions.org, Expedia.com, and Amazon.com provide reviews of products and services. For example, a user who wants to reserve a hotel room can access reviews about that particular hotel, and a person who wants to buy a digital camera can check reviews about the particular product or products. Usually these reviews are accompanied by star ratings to help the user's selection process. However, it is obvious that the provided reviews and their star ratings do not always match [1][2]. For example, it is possible to see a five-star rating (very satisfied) on a review that contains negativity. Moreover, even when the rating provided for a review is correct, most of the time it is not specific enough to provide the required details. Therefore, for some categories, specific features are rated by the websites. For example, Epinions.org provides a rating for the "ease of use" feature besides the overall rating in the Laptop category. However, the provided features are still too general, and the ratings of these features are filled in by the users independently of their written reviews.

The aim of the research described in this paper is, first, to automatically extract the features of a given domain and then to determine the polarity of those features, in addition to the polarity of the overall reviews, using methodologies that provide reliable rating scores for the products. The remainder of this paper is organized as follows: Section 2 gives an overview of related work on feature selection and sentiment analysis. Section 3 explains the overall framework, Section 4 introduces the feature extraction system used and the related experiments, and Section 5 describes the sentiment analysis and its experiments. Section 6 presents the conclusion.

2. Related Research

Extracting the polarity of reviews for various products (e.g., digital cameras, electronics, movies) is not an easy task. It requires reviews to be crawled from websites and cleared of unnecessary information before being presented to the user. We therefore analyze the background of this paper in two main areas. First, we examine related research in the area of feature extraction from user reviews. Next, we discuss relevant work on extracting the negativity and positivity of words in user reviews to determine the polarity of the user feedback.

2.1. Feature Extraction

Identification of product features is in some sense a standard information extraction task, with little to distinguish it from other non-sentiment-related problems. Currently, there are two common review formats on the Web: those that require users to describe Pros and Cons separately in addition to writing a detailed review, and those with no separation of Pros and Cons, where the reviewer can write freely [24]. Different formats may need different techniques to perform the feature extraction task. Several supervised pattern-learning approaches have been introduced to extract product features from the Pros and Cons in reviews [3]. Lafferty et al. use Conditional Random Fields (CRF) to extract features from Pros and Cons [3], whereas Liu et al. use a sequential rule-based method [4]. Pros and Cons mainly consist of short phrases and incomplete sentences, whereas free-style reviews usually use complete sentences.
To extract features from such reviews, the above algorithms can also be applied. However, experiments show that this is not effective, because complete sentences are more complex and contain a large amount of noise. Existing work on identifying product features discussed in reviews often relies on the simple linguistic heuristic that features are usually expressed as nouns or noun phrases. This narrows down the candidate words or phrases to be considered, but obviously not all nouns or noun phrases are product features. Yi et al. [5] consider three increasingly strict heuristics to select from noun phrases based on part-of-speech tag patterns. Hu and Liu [6] follow the intuition that frequent nouns or noun phrases are likely to be features.
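To make this heuristic concrete, the following is a minimal sketch (not the cited authors' implementation) of collecting frequent noun and noun-phrase candidates with off-the-shelf part-of-speech tagging. It assumes NLTK with its tokenizer and tagger models installed; the `min_count` cutoff is an illustrative parameter.

```python
# Sketch of the noun/noun-phrase heuristic: frequent nouns and noun phrases
# are treated as candidate product features. Illustration only.
from collections import Counter

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data


def candidate_features(reviews, min_count=3):
    """Collect noun / noun-phrase candidates appearing at least min_count times."""
    counts = Counter()
    for review in reviews:
        for sentence in nltk.sent_tokenize(review):
            phrase = []
            for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
                if tag.startswith("NN"):      # NN, NNS, NNP, NNPS
                    phrase.append(word.lower())
                elif phrase:                  # a noun run just ended
                    counts[" ".join(phrase)] += 1
                    phrase = []
            if phrase:
                counts[" ".join(phrase)] += 1
    return [p for p, c in counts.most_common() if c >= min_count]


print(candidate_features(["The battery life is great, but the screen is dim."],
                         min_count=1))        # -> ['battery life', 'screen']
```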
2.2. Sentiment Polarity

Much work on sentiment polarity classification has been conducted in the context of reviews (e.g., "thumbs up" or "thumbs down" for movie reviews). While in this context "positive" and "negative" opinions are often evaluative (e.g., "like" vs. "dislike"), there are other problems where the interpretation of "positive" and "negative" is less clear. In most reviews, customers describe a product in terms of a usage scenario from their own perspective, but that scenario is not always clearly stated [23]. Kamps et al. [9] construct a network based on WordNet synonyms and then use the shortest paths between any given word and the words "good" and "bad" to determine word polarity. They report that using shortest paths can be very noisy. Hu and Liu [10] use WordNet synonyms and antonyms to predict the polarity of words: for any word whose polarity is unknown, they search WordNet and a list of labeled seed words to predict its polarity. Some other methods try to build lexicons of polarized words. Kim and Hovy [11] start with two lists of positive and negative seed words, and WordNet is used to expand these lists. Synonyms of positive words and antonyms of negative words are considered positive, while synonyms of negative words and antonyms of positive words are considered negative. Hassan and Radev [12] study the problem of automatically identifying the semantic orientation of any word by analyzing its relations to other words; by automatically classifying words as either positive or negative, they enable the automatic identification of text polarity.
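As an illustration of this style of lexicon expansion, here is a minimal sketch using NLTK's WordNet interface. The seed lists and the number of expansion rounds are illustrative, and unlike a curated lexicon the result is not sense-disambiguated or manually cleaned.

```python
# Seed-list expansion in the spirit of Kim and Hovy [11]: synonyms keep a
# word's polarity, antonyms receive the opposite one. Illustration only.
from nltk.corpus import wordnet as wn  # requires the 'wordnet' corpus


def expand(seeds):
    """Return (synonyms, antonyms) of every word in `seeds`."""
    synonyms, antonyms = set(), set()
    for word in seeds:
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                synonyms.add(lemma.name().replace("_", " "))
                for ant in lemma.antonyms():
                    antonyms.add(ant.name().replace("_", " "))
    return synonyms, antonyms


positive, negative = {"good", "excellent"}, {"bad", "poor"}
for _ in range(2):                      # two illustrative expansion rounds
    pos_syn, pos_ant = expand(positive)
    neg_syn, neg_ant = expand(negative)
    positive |= pos_syn | neg_ant       # synonyms of positive, antonyms of negative
    negative |= neg_syn | pos_ant       # synonyms of negative, antonyms of positive
# note: a real lexicon would also resolve words that land in both lists
```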
3. Framework

In this research, we chose Epinions.org as our source of product reviews. These reviews contain Pros and Cons in addition to free-format review text. This paper analyzes features and sentiment using the free-format reviews, not the Pros and Cons. Once the user determines the product category, all the reviews related to a particular product are collected. We observed that although Epinions.org is a widely used product review website, products with reviews are still a sparse set. It is also the case that some products appear to have empty reviews.

Fig. 1 gives an architectural overview of the system. The system performs rating analysis in two main steps: feature extraction and sentiment analysis. The input of the system is the crawled reviews of the products, while the output is the overall and feature-based ratings of the products assigned by the system.

Product reviews can be written by any user, and most of the time they lack proper formatting and syntax checking. Obviously, not all the words in the reviews are useful for feature extraction or for stressing the polarity of the reviews. Therefore, the words need to be pruned to reduce the dimension of the feature set to be analyzed. For this purpose, stop-word elimination and the Porter stemming algorithm [8] are the common methods for reducing the dimension of the initial set. For stop-word elimination, we use the Onix stop word list [13]. After applying the Porter stemming algorithm, we selected the words with the smallest size and compared them with all the other words in the set to find the number of unique words. Following this, term frequency and inverse document frequency (TF-IDF) is applied, and the resulting set is sent to the feature extraction module. Once the possible feature set is constructed, the algorithms in method 1 or method 2 are applied to the feature set to further reduce its dimension. Once the important features are extracted, the sentiment analysis module uses the extracted feature set to calculate the feature-based and overall polarity of the reviews and present them to the customers.
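The following is a minimal sketch of that dimension-reduction step. NLTK's English stop-word list and PorterStemmer stand in here for the Onix list [13] and the original Porter implementation [8]; the relevant NLTK data packages are assumed to be installed.

```python
# Stop-word removal followed by Porter stemming, returning the unique
# surviving terms of a review. Illustration of the preprocessing stage.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop = set(stopwords.words("english"))   # stand-in for the Onix list [13]


def reduce_terms(review_text):
    tokens = [t.lower() for t in word_tokenize(review_text) if t.isalpha()]
    return {stemmer.stem(t) for t in tokens if t not in stop}


print(reduce_terms("The batteries were draining quickly, but the screen is sharp."))
# e.g. {'batteri', 'drain', 'quickli', 'screen', 'sharp'}
```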
Figure 1. System Framework
Term frequency is one of the measures frequently used in feature selection for text [6]. However, it is clear that not all frequent words are features and, similarly, that some infrequent words are features. In this project, term frequencies are calculated for unigrams and bigrams (terms that co-occur together). In addition to finding the term frequency of a word over all the reviews belonging to the different product categories, we also find its inverse document frequency, i.e., the number of different product categories in which a particular term appears. The major motive behind this is to find the words that are used commonly across the category rather than only for a specific product. This method is particularly useful for eliminating nouns that belong to a specific product. For example, the word "hp" in "hp ProBook 4520s" might be used frequently for that particular product but might not appear in the reviews of other products. That is an indication that it is a specific name for a particular product rather than a feature of the product category. Therefore, even if the term frequency of a word is high, if it is not used in a certain number of product reviews (depending on the threshold) it is eliminated from the set of possible features.

In this project, stop-word elimination and stemming are followed by computing the TF-IDF of the words. After applying TF-IDF, the set of words that appear in at least 60% of the product categories is selected. Similarly, a second elimination is done based on the normalized term frequencies: terms with a normalized frequency of 6% or more are selected as candidate terms. These thresholds were obtained from manual analysis of the test results over different domains. In these experiments, we tested reviews from six domains (blender, book, camera, laptop, music, and movie) from Epinions.org. In each domain, we used 10-15 product categories with, on average, 110 reviews. Although using TF-IDF helps eliminate some of the words from the feature set, it is still not enough for a good estimation of the features for a given product set. In order to improve the accuracy of the feature words, we added a couple of filters on the extracted term set.
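A sketch of the two threshold filters is shown below. The input format and our reading of "normalized term frequency" (a term's share of all term occurrences) are assumptions, since the paper does not pin them down; only the 60% and 6% thresholds come from the text.

```python
# Two-stage candidate filter. `term_counts` maps each stemmed term to a dict
# of {product: count}; this format, and the normalization used, are our
# assumptions -- only the 0.60 and 0.06 thresholds come from the paper.
def filter_candidates(term_counts, category_share=0.60, min_norm_freq=0.06):
    products = set()
    for per_product in term_counts.values():
        products.update(per_product)
    grand_total = sum(sum(c.values()) for c in term_counts.values())

    kept = []
    for term, per_product in term_counts.items():
        coverage = len(per_product) / len(products)        # share of products
        norm_freq = sum(per_product.values()) / grand_total
        if coverage >= category_share and norm_freq >= min_norm_freq:
            kept.append(term)
    return kept


counts = {"batteri": {"product-a": 12, "product-b": 9},
          "hp": {"product-a": 14}}      # brand name: frequent in one product only
print(filter_candidates(counts))        # -> ['batteri']
```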
4. Feature Extraction

In this module, the first task is to determine the category of the product in question (e.g., laptop, camera, movie). Once the product category is known, the reviews of the various products in the same category can be collected. This variety of reviews is particularly important in determining the features for products that do not have a large set of reviews. The process is similar to machine learning methods in the sense that, as the variety of product reviews increases, the possibility of extracting a better feature set increases.

First, all the candidate terms are passed to WordNet to eliminate some of the nouns, such as names. The filtered set contains only the terms that have meanings in the dictionary. This process eliminates quite a number of useless terms from the list. In order to find the terms related to the search domain, we again use WordNet, this time to extract hyponyms and meronyms of the search domain. Although these words are supposed to relate directly to the category, our analysis with WordNet showed that hyponyms do not always retrieve the most related terms of the domain. Hence, to enrich the corpus of related terms, we mined Wikipedia. From our observations of Wikipedia, we identified that the introduction part of its articles, specifically the words with hyperlinks, provides a good number of quality candidate feature words. Therefore, the words with hyperlinks are extracted from the introduction part of the Wikipedia article for the given domain. For example, if the user is searching for "camera" reviews, the algorithm first finds the hyponyms and meronyms of the word "camera" from WordNet and then goes to Wikipedia and gets the words with hyperlinks in the introduction part of the article for the word "camera" (e.g., lens, diaphragm, shutter button). Of course, Wikipedia also provides links for people, place names, and other rich data that are not so useful for feature extraction. In order to eliminate these unwanted words, the set of words extracted from Wikipedia is passed to WordNet to check for their availability before being added to the corpus of domain-related terms.

Following the above-mentioned analysis, two sets of terms are constructed: one that comes from the TF-IDF dimension reduction, i.e., the set of words available in the reviews, and one that comes from the terms extracted via hyponyms, meronyms, and Wikipedia, i.e., the set constructed from the related terms of the domain.
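A minimal sketch of the WordNet side of this step, using NLTK's WordNet interface, is given below; it gathers the hyponym and meronym lemmas of a domain word. The Wikipedia hyperlink mining and the dictionary re-check are not shown.

```python
# Collect hyponym and meronym lemmas of the domain word as domain-related
# candidate terms. Requires NLTK's 'wordnet' corpus; illustration only.
from nltk.corpus import wordnet as wn


def domain_terms(domain_word):
    related = set()
    for synset in wn.synsets(domain_word, pos=wn.NOUN):
        neighbours = synset.hyponyms()
        neighbours += (synset.part_meronyms() + synset.substance_meronyms()
                       + synset.member_meronyms())
        for neighbour in neighbours:
            related.update(l.name().replace("_", " ") for l in neighbour.lemmas())
    return related


print(sorted(domain_terms("camera")))   # e.g. includes 'diaphragm', 'shutter', ...
```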
4.1. Relatedness Calculation

The reason for extracting the domain-related terms in the previous section is to find the similarity of the frequently used terms to the domain. For this, the WordNet::Similarity [22] module is used. Methods in this package allow the user to measure the semantic similarity or relatedness between a pair of concepts (or word senses) and, by extension, between a pair of words. The similarity measures are divided into two groups: path-based and information-content-based. Information-content-based similarity relies on the information content of the least common subsumer (LCS) of concepts A and B; examples of such measures include res [15], lin [16], and jcn [17]. In addition, there are three similarity measures based on path lengths between a pair of concepts: lch [18], wup [19], and path [14]. Other measures that have been developed include the adapted Lesk algorithm of Banerjee and Pedersen [20] and hso [21].

In this project, the success of several of these algorithms and methods is compared. The jcn, path, lesk, and hso algorithms are used with the WordNet::Similarity module to test the similarity between the words and the domain (Table 1). The domain word itself (e.g., laptop), without any further context information about the domain, is compared to the set of words obtained from the reviews. However, since this methodology uses a single word for the similarity calculation, in some cases, such as "Laptop" and "Blender", not enough similarity was found between the other terms and the category, which we believe is due to the lack of detail provided for these categories in WordNet. We observed that as the number of top-ranked elements selected decreases, the precision of the algorithms increases and the recall decreases; similarly, as the number of top-ranked elements selected increases, the recall increases and the precision decreases.

Table 1. Precision and recall of the four algorithms, using the WordNet::Similarity module, against six product categories

         Movie         Music         Laptop        Blender       Book          Camera
         Prec   Rec    Prec   Rec    Prec   Rec    Prec   Rec    Prec   Rec    Prec   Rec
JCN      20%    20%    45%    45%    0%     0%     0%     0%     36%    36%    46%    46%
LESK     50%    50%    49%    49%    50%    50%    44%    44%    36%    36%    57%    57%
PATH     31%    31%    54%    54%    71%    71%    70%    70%    25%    25%    49%    49%
HSO      48%    48%    42%    42%    91%    40%    100%   11%    35%    35%    67%    40%
Among the four algorithms, while there is not a huge difference between the performances of lesk, path, and hso, hso seems to give slightly better precision than the other three algorithms. Surprisingly, jcn, an algorithm widely used by researchers to find the similarity between noun-noun pairs [27], provided the worst performance.
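The experiments used the Perl WordNet::Similarity package [22]. For orientation, a rough Python analogue for two of the four measures is sketched below using NLTK, which provides path and jcn (jcn needs an information-content file such as the Brown one); lesk and hso have no direct NLTK counterpart, so this is an approximation rather than the setup behind Table 1.

```python
# Approximate path and jcn similarity between a candidate term and the domain
# word. Requires NLTK's 'wordnet' and 'wordnet_ic' corpora; not the Perl
# WordNet::Similarity setup actually used in the experiments.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")


def similarity_to_domain(term, domain_word):
    """Best path and jcn scores over all noun-sense pairs of the two words."""
    best_path, best_jcn = 0.0, 0.0
    for s1 in wn.synsets(term, pos=wn.NOUN):
        for s2 in wn.synsets(domain_word, pos=wn.NOUN):
            best_path = max(best_path, s1.path_similarity(s2) or 0.0)
            best_jcn = max(best_jcn, s1.jcn_similarity(s2, brown_ic))
    return best_path, best_jcn


print(similarity_to_domain("lens", "camera"))
```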
5. Sentiment Analysis

As in feature extraction, for the sentiment analysis we used the detailed reviews of the users, not the Pros and Cons sections. The main idea is to use the extracted features from the feature extraction module to provide users with fine-grained reviews based on the specific features of the product category. Most of the review ratings on the Internet are based on the overall feeling and perspective of the reviewer, so most of the time the reader has no clue why the reviewer feels positive or negative about a particular product. Some categories on Epinions.org, such as "blender", have more specific feature ratings, for example "ease of use", "ease of clean", and "style", but again the ratings of these features are entered by the user while filling in the review form, and most of the time they are not based on the content of the written reviews. In the sentiment analysis module, we used several algorithms to extract the ratings of these features.

Adjectives in the sentences are the main determinants of the polarity estimation [7]. For this purpose, we used five sets of adjectives: positive adjectives, negative adjectives, positive superlatives, negative superlatives, and intensifiers. The adjectives falling under these categories were collected from several webpages [13], extended by using WordNet synonyms, and then manually checked by a literature professor. The reason for separating superlatives and intensifiers is that the adjectives in these categories are assigned the value 2/-2, whereas the others are assigned the value 1/-1.

The analysis of polarity is done at the sentence level. First, each sentence is checked for sub-sentences (i.e., sentences that contain "but", "yet", "and", etc.); then, for each sub-sentence/sentence, the following steps are applied (a sketch of the procedure is given after the list):

1. The sentence is checked for negation words such as "no", "not", "don't", "doesn't", etc. This is important since these words reverse the value of the finding. For example, "the laptop is not good" contains the negation word "not", which reverses the value of the positive adjective "good" from positive to negative.
2. The sentence is checked for superlatives. If a superlative is found, the sentence is checked for a feature, and the value of the superlative is assigned to the feature found. If no feature is found, the value is assigned to the overall rating of the product.
3. If the sentence does not include a superlative, it is checked for intensifiers and positive or negative adjectives. If any are found, its features are checked and the calculated value is assigned to the specific feature. Otherwise, the value is added to the overall rating of the product.
4. If there is no adjective in the sentence but the sentence is about a specific feature, then a value is assigned to the sentence to indicate that the feature is mentioned but no opinion is given about it.
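The following is a minimal sketch of steps 1-4 under simplifying assumptions: the tiny word lists stand in for the five curated adjective sets, intensifiers and step 4's "mentioned but not rated" marker are omitted for brevity, and negation is applied as a single flip per sub-sentence.

```python
# Sentence-level polarity scoring following steps 1-4. Word lists are
# illustrative stand-ins for the curated sets; intensifiers and the
# "mentioned but not rated" marker of step 4 are omitted.
import re

POS_ADJ = {"good", "nice", "sharp"}
NEG_ADJ = {"bad", "poor", "noisy"}
POS_SUP = {"best", "greatest"}            # superlatives are worth +/-2
NEG_SUP = {"worst"}
NEGATIONS = {"no", "not", "don't", "doesn't", "never"}
FEATURES = {"power", "price", "sound", "durability"}


def score_review(review, feature_scores, overall):
    for sentence in re.split(r"[.!?]", review.lower()):
        # split into sub-sentences on "but", "yet", "and"
        for clause in re.split(r"\b(?:but|yet|and)\b", sentence):
            words = set(re.findall(r"[\w']+", clause))
            value = 2 * len(words & POS_SUP) - 2 * len(words & NEG_SUP)  # step 2
            value += len(words & POS_ADJ) - len(words & NEG_ADJ)         # step 3
            if words & NEGATIONS:          # step 1: a negation flips the value
                value = -value
            mentioned = words & FEATURES
            for feature in mentioned:      # credit a named feature ...
                feature_scores[feature] = feature_scores.get(feature, 0) + value
            if not mentioned:              # ... or the overall rating
                overall += value
    return feature_scores, overall


print(score_review("The sound is not good, but the price is the best.", {}, 0))
# -> ({'sound': -1, 'price': 2}, 0)
```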
After calculating the ratings of the products, we needed to determine how well the algorithm works. We therefore used "Blender" as a test category. From the "Blender" category, we extracted 104 comments about 11 products. In addition, we chose 9 features (power, price, material, sound, bowl, durability, ease-of-use, ease-of-clean, style) of the "Blender" category to test.

The reviews and the feature sets were given to four people to survey. The surveyors rated each review based on their understanding of it. A Likert scale of 1-5 is used as the evaluation measure, in which 1 is very bad, 2 is bad, 3 is neutral, 4 is good, 5 is very good, and "----" means not mentioned. The survey results are based on the average ratings of the four surveyors over the several reviews of each product. The average surveyors' ratings are used to compare our system against the ratings of the Epinions.org website for the products. Fig. 2 shows a side-by-side comparison of Epinions', the surveyors', and our system's overall rating scores over the 11 products in the "Blender" category. In the figure, we can see that, compared to Epinions' results, the results of our system are closer to the surveyors'. On average for the "Blender" category, our system is within 0.43/5 of the surveyors' results, while Epinions is within 0.71/5.
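The paper does not spell out how this "closeness" is computed; one natural reading, the mean absolute difference from the surveyors' average ratings, is sketched below with hypothetical numbers rather than the paper's data.

```python
# Mean absolute difference from the surveyors' averages -- our reading of the
# "closeness" figures (0.43/5 vs 0.71/5); the numbers below are hypothetical.
def closeness(ratings, surveyor_ratings):
    return sum(abs(r - s) for r, s in zip(ratings, surveyor_ratings)) / len(ratings)


surveyors = [4.2, 3.1, 2.8]               # hypothetical per-product averages
ours = [4.0, 3.4, 3.0]
epinions = [5.0, 4.0, 3.5]
print(closeness(ours, surveyors), closeness(epinions, surveyors))
# -> 0.2333... 0.8
```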
Figure 2: Blender overall rating comparison
Table 2: Blender feature ratings
Of course, our aim is not only to calculate and compare the overall ratings of the products but also the specific features. Not all the product categories we analyzed for feature extraction had feature-level ratings. The main reason for choosing the "Blender" category as a test for polarity extraction is that it has feature ratings such as "durability", "ease of use", "ease of clean", and "style". Table 2 shows a side-by-side comparison of these features over the "Blender" category for Epinions, the surveyors, and our system.

One observation from our results is that most of the values cluster around a rating of 3. The reason is the length of the users' reviews and the amount of jargon in them. Reviewers usually talk about a previously used product and how much they like the new product at the beginning, and then conclude with the problems of the product, which is confusing even for the human surveyors. Therefore, with so many positive and negative feelings expressed about a product, it seems natural for the review ratings to converge to 3 rather than 1 or 5.

6. Conclusion

In this paper, we proposed a feature extraction technique and tested it with several algorithms. Our observations showed that WordNet::Similarity with the path, lesk, and hso algorithms provided competitive results for the top word lists sorted in decreasing order. In determining the polarity of the documents, we assigned different rating values to different classes of adjectives and used the extracted features to determine the feature-level polarity of the reviews. Based on the experiments, our system provided better results than the websites in informing users about product ratings at both the overall level and the detailed feature level.

References
[1] J. O'Donovan, B. Smyth, V. Evrim, D. McLeod, "Extracting and visualizing trust relationships from online auction feedback comments", IJCAI-07.
[2] J. O'Donovan, V. Evrim, B. Smyth, D. McLeod, P. Nixon, "Personalizing trust in online auctions", STAIRS'06, Trentino, Italy, 2006.
[3] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data", Proceedings of ICML, pp. 282-289, 2001.
[4] B. Liu, M. Hu, and J. Cheng, "Opinion observer: Analyzing and comparing opinions on the Web", Proceedings of WWW, 2005.
[5] J. Yi, T. Nasukawa, R. Bunescu, and W. Niblack, "Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques", Proceedings of the IEEE International Conference on Data Mining (ICDM), 2003.
[6] M. Hu and B. Liu, "Mining opinion features in customer reviews", Proceedings of AAAI, pp. 755-760, 2004.
[7] B. Liu, M. Hu, and J. Cheng, "Opinion observer: Analyzing and comparing opinions on the Web", Proceedings of the 14th International World Wide Web Conference, 2005.
[8] M. Porter, "The Porter stemming algorithm", available at http://www.tartarus.org/~martin/PorterStemmer.
[9] J. Kamps, M. Marx, R. Mokken, and M. Rijke, "Using WordNet to measure semantic orientation of adjectives", Proceedings of LREC-04, 4th International Conference on Language Resources and Evaluation, vol. IV, pp. 1115-1118, Lisbon, PT, 2004.
[10] M. Hu and B. Liu, "Mining and summarizing customer reviews", Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 168-177, 2004.
[11] S. Kim and E. Hovy, "Determining the sentiment of opinions", Proceedings of COLING, 2004.
[12] A. Hassan and D. Radev, "Identifying text polarity using random walks", Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, 2010.
[13] Onix stop word list, http://www.lextek.com/manuals/onix/stopwords1.html.
[14] S. Patwardhan, S. Banerjee, and T. Pedersen, "Using measures of semantic relatedness for word sense disambiguation", Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, 2003.
[15] P. Resnik, "Using information content to evaluate semantic similarity in a taxonomy", Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448-453, Montreal, August 1995.
[16] D. Lin, "An information-theoretic definition of similarity", Proceedings of the International Conference on Machine Learning, Madison, August 1998.
[17] J. Jiang and D. Conrath, "Semantic similarity based on corpus statistics and lexical taxonomy", Proceedings of the International Conference on Research in Computational Linguistics, pp. 19-33, Taiwan, 1997.
[18] C. Leacock and M. Chodorow, "Combining local context and WordNet similarity for word sense identification", in C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pp. 265-283, MIT Press, 1998.
[19] Z. Wu and M. Palmer, "Verb semantics and lexical selection", 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138, Las Cruces, New Mexico, 1994.
[20] S. Banerjee and T. Pedersen, "An adapted Lesk algorithm for word sense disambiguation using WordNet", Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pp. 136-145, Mexico City, 2002.
[21] G. Hirst and D. St-Onge, "Lexical chains as representations of context for the detection and correction of malapropisms", in C. Fellbaum, editor, WordNet: An Electronic Lexical Database, pp. 305-332, MIT Press, 1998.
[22] T. Pedersen, WordNet::Similarity, http://search.cpan.org/_tpederse/WordNet-Similarity-0.15/Similarity.pm, 2005.
[23] Y. Chen and J. Xie, "Online consumer review: A strategic analysis of an emerging type of word-of-mouth", University of Arizona, 2004.
[24] Bo Pang and Lillian Lee, "Opinion mining and sentiment analysis", Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, 2008.