Ranking Products through Interpretation of Blogs Based on Users' Query

International Conference on Methods and Models in Computer Science, 2009

Ranking Products through Interpretation of Blogs Based on Users' Query

Niladri Chatterjee¹, Nishant Agarwal²
¹,²Dept. of Mathematics, IIT Delhi, New Delhi 110 016
e-mail: [email protected], [email protected]

Abstract— In this work we look into analyzing blogs to classify products according to users' queries. Blogs can be found over the internet where buyers share their opinions on different products available in the market. Such pages may prove to be good guides for a prospective buyer. However, going through a large number of blogs and converting their opinions into a meaningful decision is often difficult. In the present work we develop a scheme for analyzing blogs for the aforesaid purpose. For this work we restrict ourselves to blogs related to computers only, i.e., laptops, netbooks, and desktops. Our initial experimental results are found to be very encouraging.

Keywords— Text Mining, Natural Language Processing, Blog Analysis.

I. INTRODUCTION


It is our common perception that when it comes to buying sophisticated gadgets (e.g. computers, cars, cameras), average people suffer from indecision regarding which product to buy. The dilemma arises because:
1. The market is full of products to choose from.
2. Most of the products have a large number of characteristic features of which common people are seldom aware.
3. People's choices vary; hence no unique product caters to everybody's needs.
Blogs containing user feedback can be very helpful in making up a prospective buyer's mind. A good amount of research is going on in the field of blog analysis. There are a number of search engines for finding the best available blog according to a user's query. Software giants¹ like Google and Yahoo have also provided search engines totally dedicated to blog searching. Technorati², also a blog-searching site, as of June 2008 indexes 112.8 million blogs and over 250 million pieces of tagged social media. Also, with the advent of Citizen Journalism, many websites now enable active participation of users and blogging services. Hence a huge database of articles and user reviews is created on the web. Our aim is to extract the user reviews from blogs pertaining to a specific product or a set of products according to the user's query, and then return the best products which may meet his requirements. This motivates us to develop a scheme to rate the blogs such that it first extracts the relevant blogs according to the user's query and then rates the blogs with respect to the desired product. The task, however, is not straightforward, as users' comments are often replete with ungrammatical or poorly structured sentences, incoherence of themes, usage of synonyms, and many other usual NLP problems [1], which we have handled appropriately in a semi-automatic manner.

This paper is organized as follows. Section 2 discusses some of the previous works on blog analysis. In Section 3, we look closely at the problem by considering the blogger's perspective, and we propose a scheme which would achieve better results. Section 4 gives the detailed methodology for the proposed scheme. Sections 5 and 6 include experimental results, analysis and concluding remarks.

II. RELATED RESEARCH

Analysis of blogs is very much under study for different purposes, such as Data Mining and Machine Learning. This section briefly surveys some previous works on blog analysis. Ramaswamy in [2] talks about blog analysis and how it can be used for market research. Methodologies used include keyword and phrase extraction, and named entity recognition. The work done by Wilson and Wiebe in [3] focuses on disambiguating potentially subjective expressions based on the density of other clues in the text. They use the term Potential Subjective Element (PSE) for a linguistic expression that is used to express subjectivity. The model presented by Chatterjee et al. in [4] initiates the idea we are also aiming at. However, the function used in [4] to rate a blog, giving a positive word a rating of 1 and a negative word a rating of -1, is very simplistic. Also, there is no proper discussion of how the user's query will be used to determine the relevant blogs. The methodologies used in this work include:
1. Pre-processing, which is the manual process of bringing the blog to a standard level.
2. Part-of-speech tagging.
3. Identification of the brand and features.
4. Ranking the blog using a mapping function.

¹ blogsearch.google.com, ysearchblog.com
² technorati.com

Authorized licensed use limited to: UNIVERSITA PISA S ANNA. Downloaded on May 05,2010 at 08:08:48 UTC from IEEE Xplore. Restrictions apply.

In the present work we take a cue from the above to develop our proposed scheme.

III. A CLOSER LOOK

Upon reviewing a large number of blogs, we infer that a sentence or a subsentence generally describes a feature or a part of the product. There are also some sentences which do not specify any feature but describe the product as a whole. The rest of the sentences are not useful for extracting any information towards the ranking of the blog. We observed that the usage of the underlying feature or product often becomes implicit after the first sentence. For example, consider the following: "The X301 employs an LED-backlit, 13.3-inch (1440 by 900 pixel resolution) display with a matte finish that prevents most glares. That high resolution certainly lets you see more of documents and Web pages, though the default text may be too small for some eyes."³ Here, one can see that in the second sentence there is no mention of any product or feature, and that it derives its subject implicitly from the previous sentence. There are also situations when a sentence neither contains any feature, nor refers to the feature discussed in the previous sentence. For example, consider the following pair of typical sentences: "Mind you, the product is no less than a playstation. Playstation 2 is undoubtedly one of the devices which provide a better gaming environment."

We found that among the sentences which describe some feature, many specify the technical specification for that feature numerically, e.g. "capacity of RAM is 1 GB". We have ignored such sentences in most cases, as the user can refer to the Technical Specification Sheet (TSS)⁴ for that particular product. Instead we have focused on those sentences which enumerate a merit or demerit of a feature or the product. This leaves us with extracting only those sentences which describe a feature or the product, and which contain some adjective or adverb to classify it.

Hence extraction of useful sentences describing some feature is another major task. For this we need to identify the important features a product can have, based on which users are likely to choose one model or brand over another. Most of the features can be enumerated by looking at the TSS for that product, and hence for other products of the same category also. So we need a ready list of synonyms for each such feature. We have created a list of words (LW) of commonly used terms for storing (and subsequently retrieving) a feature and its possible synonyms. Presently, LW contains around 700 terms pertaining to computer-related items, manually extracted after reading different blogs.

Care is also taken to determine whether the author is talking about some feature of interest, or about the whole product. If the writer is specifying details about a feature but not the product as a whole, then we have given it a lesser weight. For illustration, if the author says "Very few find any problems and you too will find the keyboard to be excellent", we tend to give it a lesser rating than the following sentence: "Very few find any problems and you too will find the product to be excellent."

We take the user's query into account by finding the blogs from the blog search engines (see Section I) using the keywords from the query. It is our common experience that the results from the search engines are often wrong, as they fall prey to the ambiguous language used by the authors. So what is needed here is a tool that can properly analyze each sentence of the blog, and can then reveal the plausible idea. In our algorithm, we therefore propose three improvements over the previous models [4]:
1. Finding the relevant blogs for the user's query.
2. Using a non-uniform weight function.
3. Certain heuristics to find the product from the blog and its rating.
We have therefore used a weight function that rates the phrases by measuring the degree of the adjective or adverb describing the sentiment, i.e. a phrase containing "very good" will receive a higher rating than a phrase containing "good"; similarly, "worst" gets a lower rating than "bad".

IV. DETAILED METHODOLOGY

Sections IV(A) and IV(B) describe our schemes for finding the relevant blogs and their ranking respectively. Figure 1 provides an overall model for the proposed scheme.

[Fig. 1: Model for the proposed scheme. The pipeline: split the user's query on every comma to form p queries; extract keywords from each query (by referring to T1); identify a set of 20 blogs by entering one query at a time and finding the intersection among the URIs of the p queries; extract the 20 web pages corresponding to the 20 URIs, parse them to remove the tags, translate them into English if required, and check for spelling mistakes; identify the brand name along with the model name (by referring to T2); identify effective clauses, map them using a non-uniform weight function using LA, check for product or feature using LW, and check for negation phrases; rank the product in each of the blogs, sort them in descending order according to their ranking, and return the user a list of products.]

³ www.laptopmag.com/Review/Laptops/Lenovo-Thinkpad-X300.aspx
⁴ Available on the website of the manufacturer.


In this respect we first define the term Effective Clause, used for giving weights to the sentences. An effective clause is a sequence of words having an adjective or an adverb specifying something about the product or some feature of the product. It consists of a subject and a predicate.
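As an illustration, an effective clause can be recognized from POS tags roughly as follows. The function and the hand-tagged input are our own sketch (using Penn Treebank tags, as emitted by the Stanford tagger used later), not the authors' implementation.

```python
# Illustrative check for an "effective clause" over POS-tagged words:
# it must contain a noun (subject), a verb (predicate) and at least one
# adjective or adverb. The tagged input here is hand-made.

def is_effective_clause(tagged_words):
    tags = [tag for _, tag in tagged_words]
    has_noun = any(t.startswith("NN") for t in tags)
    has_verb = any(t.startswith("VB") for t in tags)
    has_adj_adv = any(t.startswith("JJ") or t.startswith("RB") for t in tags)
    return has_noun and has_verb and has_adj_adv

clause = [("the", "DT"), ("keyboard", "NN"), ("is", "VBZ"), ("excellent", "JJ")]
print(is_effective_clause(clause))  # → True
```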

A. Finding Blogs

1. We ask the user to enter the query as comma-separated items, i.e., all the configuration details which the user wants to see in the product. For example, one can specify queries of the type "2 GHz Intel Core 2 Duo processor, 256 MB dedicated graphics card, HP Pavilion laptop".
2. We split the input string on commas and identify each of the required specifications. Any spelling mistake in the query is removed by using an online spell-check tool⁵. We then identify the set of keywords in each specification by comparing it with a list of keywords, maintained as a file T1.
3. The next step is to find the relevant blogs from the web. This is done by feeding in the processed specifications one at a time. We maintain a list of the search items returned on entering each of the specifications. From these lists we identify a set of k blogs which is common to every list. In our present experiments, we have used k = 20.
4. If we are unable to find 20 intersecting URIs, then the URI which appears in the maximum number of lists is selected iteratively till we find 20 URIs. If we are still unable to find the requisite number of blogs, then we work with a smaller number of blogs for our analysis.
5. This set of blogs serves as the primary source of data for the next level of filtration.

Once we have the 20 URIs, we are prepared to move on to the stage where Natural Language Processing comes into use. Section IV(B) discusses the heuristics used for rating a blog.

B. Rating Blogs
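The blog-finding steps of Section IV(A) above can be sketched as below. This is a minimal illustration with a stubbed search engine: `demo_search`, the toy `T1_KEYWORDS` set and the sample URIs are assumptions, not the authors' data.

```python
# Sketch of Section IV-A with a stubbed search engine. T1_KEYWORDS,
# demo_search and the URI names are illustrative assumptions.
from collections import Counter

T1_KEYWORDS = {"processor", "ram", "graphics", "laptop", "hp"}  # toy file T1

def extract_keywords(specification):
    # step 2: keep only the words that appear in the keyword list T1
    return [w for w in specification.lower().split() if w in T1_KEYWORDS]

def select_blogs(query, search_blogs, k=20):
    specs = [s.strip() for s in query.split(",")]       # step 2: split on commas
    result_lists = [search_blogs(extract_keywords(s)) for s in specs]
    common = set.intersection(*map(set, result_lists))  # step 3: intersecting URIs
    chosen = list(common)
    if len(chosen) < k:                                 # step 4: top up with the
        freq = Counter(uri for lst in result_lists      # most frequent URIs
                       for uri in lst if uri not in common)
        chosen += [uri for uri, _ in freq.most_common(k - len(chosen))]
    return chosen[:k]                                   # step 5: primary data set

def demo_search(keywords):
    # toy search-engine results, keyed on a single keyword
    return ["u1", "u2", "u3"] if "processor" in keywords else ["u2", "u3", "u4"]

print(sorted(select_blogs("2 GHz intel processor, hp laptop", demo_search, k=3)))
# → ['u1', 'u2', 'u3']
```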

1. The rating of a blog lies in the range 0.0 (worst) to 1.0 (best). The idea is to give a rating to every effective clause and then take the average of the weights to find the rating of the corresponding product being described in the blog.
2. Initially, we receive an HTML file which is first preprocessed and parsed by available online tools to bring it to a computable level. Tags are removed and we check for its language in the meta data. If the blog is in some different language⁶, then online translation tools⁷ are used to convert it to English. Although the translation is not perfect, this in a way enables us to use data that was left untouched until now. Spelling mistakes in the blog are also removed by using a spell-check tool as cited before.
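Step 2 above can be approximated with the standard library alone. This sketch strips the HTML tags and reads the declared language from the markup; spell checking and translation (done with online tools in the paper) are left aside, and the sample HTML is an illustrative assumption.

```python
# Minimal preprocessing sketch for step 2: remove tags and read the
# declared language. Uses only Python's standard library.
from html.parser import HTMLParser

class BlogExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = []    # tag-free text fragments
        self.lang = None  # language declared in the markup, if any
    def handle_starttag(self, tag, attrs):
        if tag == "html":  # e.g. <html lang="en">
            self.lang = dict(attrs).get("lang")
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

p = BlogExtractor()
p.feed('<html lang="en"><body><p>A great laptop.</p></body></html>')
print(p.lang, " ".join(p.text))  # → en A great laptop.
```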

3. We maintain, in a file T2, a list of companies offering the product in that category. From the analysis of the blogs we found that the model name typically comes after the company name, e.g. "Sony VAIO T2310", "HP Pavilion dv6314tx". We follow the algorithm given in [4] to identify the brand name. Here, one searches for a match of a company name from our database and picks the next k words as a possible name of the model. We found that the value of k generally lies between 2 and 3 for most of the product types. Using that scheme over the whole blog, we obtain a set of such k-or-fewer-word elements. As the model name we pick the element that has the highest frequency of occurrence, taking the largest number of words present in the set. This identifies the particular model the blog is talking about.
4. The Stanford⁸ POS Tagger has been used for tagging the text file. The effective clauses are then extracted from the blog sentences using the idea that a clause should consist of a noun phrase and a verb phrase. For example, consider the sentence: There is nothing to

get excited about this Notebook, as it has no ground-breaking technology or innovation and it definitely will not make a fashion statement. This sentence splits into three clauses, which can be seen very clearly. Next, we extract those clauses which have some adjective or adverb in them. The POS tagger marks words of higher degree as superlative or comparative according to the word. The tagged file helps us in recognizing the effective clauses.
5. We have defined six sets for the mapping function: VVG, VG and G for words which are "good", and VVB, VB and B for words which are "bad". We collectively call them the list of adjectives (LA). VVG is for the superlative degree of words, or adjectives with the hedge 'most', e.g. "most attractive", "best"; VG is for the comparative degree of words, or adjectives with the hedge 'very', e.g. "very smooth", "cheaper"; and G is for the positive degree of good words. Similar lists exist for the other three sets. The corresponding weights associated with them are:

VVG  0.90        VVB  0.10
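The mapping function and the negation reversal can be sketched as follows. This is an illustrative sketch, not the authors' code: only the VVG and VVB weights (0.90 and 0.10) and the negated ratings for "not good" (0.25) and "not bad" (0.60, stated in step 7 below) come from the paper; the word lists in LA and the four remaining weights are placeholders.

```python
# Sketch of the mapping function (steps 5-8). VVG = 0.90, VVB = 0.10 and
# the negated ratings 0.25/0.60 are from the paper; LA's word lists and
# the VG/G/B/VB weights are assumed placeholders.
LA = {  # list of adjectives, grouped into the six sets
    "VVG": {"best", "most attractive"},
    "VG":  {"very smooth", "cheaper"},
    "G":   {"good", "smooth"},
    "B":   {"bad"},
    "VB":  {"very bad"},
    "VVB": {"worst"},
}
WEIGHTS = {"VVG": 0.90, "VVB": 0.10,                      # given in the paper
           "VG": 0.75, "G": 0.60, "B": 0.40, "VB": 0.25}  # assumed values

def clause_weight(phrase):
    """Weight of an effective clause's key phrase; negation reverses the
    scale, e.g. "not good" -> 0.25 and "not bad" -> 0.60 (step 7)."""
    NEGATED = {"good": 0.25, "bad": 0.60}  # the two values stated in step 7
    if phrase.startswith("not "):
        return NEGATED.get(phrase[4:])
    for group, words in LA.items():
        if phrase in words:
            return WEIGHTS[group]
    return None

def blog_rating(clause_weights):
    # step 8: average the clause weights to rate the product in the blog
    return sum(clause_weights) / len(clause_weights)

print(clause_weight("best"))                   # → 0.9
print(clause_weight("not bad"))                # → 0.6
print(round(blog_rating([0.9, 0.6, 0.4]), 2))  # → 0.63
```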

Justification: our scheme is based on the fact that a bad word has more negative impact than the positive impact of a good word. A buyer may not like to buy a product which has one negative comment, even though it has many positive comments.
6. If we identify the context as feature-oriented, then the rating of that effective clause is multiplied by a factor of 4. The context is identified as feature-oriented if the noun phrases in the sentence match the keywords in the TSS. This results in a controlled but significant deduction of the rating.
7. Handling negation phrases: dealing with phrases like not good, not bad etc. is handled by the following method. If we encounter such a phrase, we

⁵ http://www.spellchecker.net/spellcheck/
⁶ Acc. to blogherald.com, there are more than 15 million Korean, 2 million Chinese and 2 million European blogs.
⁷ http://translate.google.com/
⁸ http://nlp.stanford.edu/software/tagger.shtml


reverse our weightage system for the subsentence. That is, "not good" receives a rating of 0.25 and "not bad" receives a rating of 0.60.
8. The overall rating of the blog is computed from the ratings of the individual clauses and then normalized suitably. The scheme then recommends to the user a list of products sorted in descending order of their ratings.

V. EXPERIMENTS AND RESULTS ANALYSIS

In order to test the performance of our scheme, we have conducted the following types of experiments:
1. Searching for the relevant blogs, finding the Recall and Precision values.
2. Rating the products using our scheme.
3. Measuring the efficiency of our scheme in product identification.

TABLE 1: RESULT ANALYSIS FOR FINDING RELEVANT BLOGS

Query Specified               Relevant Blogs   Doubtful Blogs   Irrelevant Blogs
Intel core 2 duo Processor         15                3                 2
More than 1 GB RAM                 14                3                 3
Less than 1 GB RAM                 12                3                 5
100 GB Hard Disk                   18                0                 2
1.3 MP Web Camera                  20                0                 0
2Hrs Battery Life                  14                4                 2
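The counts of Table 1 can be read as a per-query precision over the 20 retrieved blogs; this reading is ours, since the paper reports only the raw counts, and the treatment of doubtful blogs as non-relevant is an assumption.

```python
# Per-query precision from the Table 1 counts, treating doubtful blogs
# as non-relevant (an assumption; the paper reports only raw counts).
table1 = {
    "Intel core 2 duo Processor": (15, 3, 2),
    "More than 1 GB RAM":         (14, 3, 3),
    "Less than 1 GB RAM":         (12, 3, 5),
    "100 GB Hard Disk":           (18, 0, 2),
    "1.3 MP Web Camera":          (20, 0, 0),
    "2Hrs Battery Life":          (14, 4, 2),
}
for query, (relevant, doubtful, irrelevant) in table1.items():
    precision = relevant / (relevant + doubtful + irrelevant)
    print(f"{query}: {precision:.2f}")  # e.g. Intel core 2 duo -> 0.75
```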

TABLE 2: RESULT ANALYSIS FOR RATING PRODUCTS

Blog About?            Machine Rating   Manual Rating   Difference
HP Pavilion zv6000          0.61             0.70          -0.09
Dell Inspiron 1501          0.53             0.50          +0.03
Dell Latitude C600          0.66             0.70          -0.04
Lenovo N100                 0.71             0.70          +0.01
Apple Macbook               0.77             0.80          -0.03

TABLE 3: MEASURE OF EFFICIENCY OF THE SCHEME IN PRODUCT IDENTIFICATION

Details                   Correct %   Doubtful %   Wrong %
Brand Services               85 %        10 %        5 %
Brand Identification         95 %         0 %        5 %
Features Identification      80 %        10 %       10 %

Tables 1, 2 and 3 give snapshots of our experimental findings. For finding the relevant blogs, we tested the model over a selection of 80 blogs, with a set of 50 queries. The queries were then checked manually as well as using the scheme. Table 1 presents a randomly picked sample of our results. For ranking the products, we selected a set of 20 computers, whose reviews we had already downloaded from the web⁹. Each product's rating by our scheme was checked against the gold standard. We took the manual ratings given by 10 individuals and considered their average to be the gold standard. Table 2 shows the results for five products. We checked the percentage of times our heuristics are able to identify the brand and the product names for the selected 80 blogs. Table 3 presents the results. By 'Doubtful' in Table 3, we indicate that the product name identified is ambiguous.

⁹ Reviews taken from www.notebookreview.com

VI. CONCLUDING REMARKS

Mining of blogs/texts for review or summarization is not new to NLP practitioners. NLP techniques have been used for movie reviews [5], for determining the sentiments of opinions, and in other domains. Although we have applied our scheme to blogs, it can be used for other kinds of products also, such as automobiles and electronic goods, where typically a lot of brands/products are available in the market and it is difficult for a buyer to select the product that best suits his requirements. This can be done by appropriately changing the corresponding files T1, T2, LW and LA, as described in our methodology. We are working towards improving the accuracy of the scheme. We propose here certain improvements to our present scheme:
1. Price-related searching: most users generally decide their budget first and then start searching for a product, so price-related querying will be a major improvement.
2. Verbs and conjunctions: from the examples taken into consideration, we observed that a big role is played by conjunctions and verbs. Conjunctions can easily change the point of view of the author, whereas the verbs used by the author can often be misleading.
3. Processing comments on the blog: a very useful source of information which we have missed out on till now is the comments which users leave on the blog. These play a very important role in judging the prestige of the blog as well as the rating of the product. Harsh comments should lead to a decline in the rating of the product, and vice versa. We are currently working in these directions to develop an improved product-ranking system.

REFERENCES

[1] Peter Jackson and Isabelle Moulinier. Natural Language Processing for Online Applications. John Benjamins Publishing Company, 2002.
[2] Srinivasan Ramaswamy. Blog analysis: trends and predictions. Applied natural language processing project report, courses.ischool.berkeley.edu/i256/f06/projects/.
[3] Theresa Wilson and Janyce Wiebe. Learning to disambiguate potentially subjective expressions. Sixth Conference on Natural Language Learning (CoNLL-2002), ACL SIGNLL, Taipei, Taiwan, August 2002.
[4] Niladri Chatterjee, Prasenjit Chakraborty and Sumit Bisai. Ranking of products through blog analysis. Proc. 1st IHCI-2009, IIIT Allahabad, India, pages 246-253, 2009.
[5] L. Zhuang, F. Jing and X.-Y. Zhu. Movie review mining and summarization. Proc. 15th ACM Int'l Conf. on Information and Knowledge Management (CIKM), pages 43-50, 2006.

