Summarizing Disparate Search Results using Entity Identification, Consolidation, Mashup, and BuzzScore

Jérémie Bordier

Gregory Grefenstette

Exalead 10 place de la Madeleine 75008 Paris, France +33 1 55 35 26 26

Exalead 10 place de la Madeleine 75008 Paris, France +33 1 55 35 26 26


ABSTRACT Classic search engines accept a user query and return a list of ranked results. Two independent phenomena may soon make this response seem archaic. One is that younger users are used to seeing all their information always present, perhaps in different configurations, but present and accessible. The other phenomenon is the increasing sophistication of semantic extraction tools. These two tendencies find their expression in richer and richer mash-ups of information. In this paper, we present our current implementation of summarizing search results in a consolidated interface, rather than as a classic flat result list.

Categories and Subject Descriptors H.3.3 [Information Systems]: Information Storage and Retrieval – Information Search and Retrieval.

General Terms Algorithms, Performance, Design, Human Factors.

Keywords search engine, search, retrieval, mash-up, unstructured data, consolidation.

1. INTRODUCTION In response to a keyword query, classic search engines provide a search engine results page (SERP) containing an ordered list of document URLs, titles, web page textual snippets, and possibly a thumbnail version of each web page’s content. These items present a succinct summary of the query-related content to be found in each web page. But as user sophistication increases, these simple result lists may increasingly fall short of user expectations. Young users of the internet (imagine a FaceBook user) are familiar with complex presentations of information, separating different types of information into intuitively understandable spaces, using different fonts, panes, windows and other visual information to implicitly identify information type and source. Summarization can come to mean more than just providing an overview of one page’s content, as is now the case on current SERPs, but it can be extended to providing an overview of all the relevant information responding to a query, possibly coming from different Web pages.

Imagine a user posing the query “Restaurant in New York” on a classic search engine. One problem is that a wide variety of sites have been tuned to achieve the highest rank using “Search Engine Optimization” (SEO) techniques, winning first place without even having useful content [8]. Another problem is that, given such a query, what the user might expect is a summary of all the information that might be found on the web, and not just one web page. A user might want to see information about each restaurant: its address, reviews from review websites, and user opinions from blogs describing what people liked and disliked about it, with pictures or even videos. Such information appears on specialized sites such as tripadvisor.com, but current semantic and Web 2.0 technology should allow the automatic construction of vertical search engines that can provide a consolidated summary of information in response to a user query.

Copyright is held by the author/owner(s). WWW 2009, April 20-24, 2009, Madrid, Spain.

Even in a vertical search engine, there remain many challenges to consolidating related data in a simple and relevant mashed-up summary. For instance, while humans can easily recognize that different pages are referencing the same restaurant by identifying its name or its address, automatic entity identification on unstructured content without a pre-defined corpus is a real challenge [9]. Even with structured information, consistency of the displayed information is difficult to achieve. For example, presenting normalized opening hours requires implementing complex pattern recognition covering the wide variety of ways of representing even this simple item. Last but not least, the concept of relevance over extracted and consolidated knowledge is totally different from that of well-known ranking algorithms [10][12]. In the following, we present our system for summarizing results returned by a keyword web query. We present our first mockup, which consolidates information about restaurants found on different web pages, but we feel the principles are general enough to be applied to any domain whose contour can be described. In section 2, we describe which subset of the web we index, and the data schema of the domain in question. In section 3, we present our approach to resolving ambiguous entity names. In section 4, we present our vertical ranking strategy based on the buzz associated with an entity.

2. Sources One of the first steps in creating a vertical search engine is defining the slice of the Web to be indexed [2]. In our prototype, we first built a consolidation engine around the theme of restaurants in New York City. For variety, we selected an interesting subset of food-related websites, each of which provides different types of information. In order to capture the main entities for our engine, we included a number of restaurant review sites. These sites usually present more or less structured information that is directly linked to a given restaurant, such as reviews, user comments, various ratings, direct online reservation links, payment and parking options, etc. The selected review websites were OpenTable, MenuPages, NYMag, Fodors, and Qype¹. But there is also much content relevant to the restaurant goer that these review sites do not completely cover: news, interviews with chefs, special offers, events, concerts or even contextual details about the cuisine offered in a restaurant. Most of this information is available in fully unstructured sources such as blog postings or in weakly structured sources such as Wikipedia. To extend the coverage of our vertical search engine, we included the following sites, blogs and event repositories: NYMag Food, Midtown Lunch, Dining Fever, Metromix, and Wikipedia². Once we have decided on the part of the web to cover, all these sources are crawled and scraped [5] using our core web technologies³.

2.1 Data schema A second step in producing a consolidated summary in the context of a vertical search engine involves defining the class schema of entities that can be matched and merged. In our long term vision of a user-manageable vertical search engine, an end-user will be able to interactively construct their own schema, but in our prototype version, we have defined the entities by hand. Once normalized, the different entities of our prototype application have the following static schema:

• Restaurant
  o required: name, address
  o optional: phone, website, notation, cuisine, chef, pictures, nearby places, review, best comment, details (parking, etc.)
• Comment
  o required: restaurant_id, text
  o optional: url, title, notation
• BlogPosts
  o required: title, text, url, date
• Event
  o required: title, date, address
  o optional: url, description
• CuisineDetails
  o required: name, description
  o optional: dishes
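As an illustration only, the static schema above could be written down as plain Python dataclasses. The attribute names follow the schema; the field types, the defaults, and the `buzz_score` field are assumptions made for this sketch and do not describe how the prototype actually stores its entities.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Restaurant:
    # Required attributes
    name: str
    address: str
    # Optional attributes (types are assumptions for this sketch)
    phone: Optional[str] = None
    website: Optional[str] = None
    notation: Optional[float] = None
    cuisine: Optional[str] = None
    chef: Optional[str] = None
    pictures: List[str] = field(default_factory=list)
    nearby_places: List[str] = field(default_factory=list)
    review: Optional[str] = None
    best_comment: Optional[str] = None
    details: Dict[str, str] = field(default_factory=dict)  # parking, etc.
    # Inferred attribute that evolves as related items are found
    buzz_score: float = 0.0


@dataclass
class Comment:
    restaurant_id: str
    text: str
    url: Optional[str] = None
    title: Optional[str] = None
    notation: Optional[float] = None


@dataclass
class BlogPost:
    title: str
    text: str
    url: str
    date: str


@dataclass
class Event:
    title: str
    date: str
    address: str
    url: Optional[str] = None
    description: Optional[str] = None


@dataclass
class CuisineDetails:
    name: str
    description: str
    dishes: List[str] = field(default_factory=list)
```

Required attributes become positional fields without defaults, so constructing an entity without them fails immediately, while optional attributes default to empty values that later consolidation steps can fill in.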

There are also special attributes, evolving over time and used for ranking and relevance, that we infer between entities as the entities are discovered on the Web. Our main example of such an attribute is the Restaurant's BuzzScore, which increases as related pictures, comments and blog posts are found. The final structure of the stored data cloud can be seen as a graph of related entities in which Comments, BlogPosts and Events are linked to Restaurants, and in which Restaurants are linked to each other through CuisineDetails, nearby places, and so on.

3. ENTITY RESOLUTION 3.1 Resolution In treating unstructured and semi-structured data from the Web, entity resolution is one of the hardest challenges in ensuring quality over the whole application: similar data for the same entity must be collated, without duplicates. This problem is well known for person identification, as seen in the recent editions of the Web People Search Evaluation Workshop, in which the task is to distinguish web pages concerning different people possessing the same name⁴. Unfortunately, the same problems of homonymy and polysemy affect our case of restaurant names. While review websites always have at least two important pieces of information permitting the identification of a restaurant (the name and the address), we find that blog postings usually provide only the name of the restaurant, and this is presented in a variety of fashions. In the first case, entity identification through name and address is still very complicated [6] and involves invoking a flexible and accurate address normalization engine, coupled with approximate name matching algorithms. Dealing with blog posts is more complex still. Restaurants sometimes have (too) common names, such as “Butter”. In these cases, simple approximate name matching would generate too much noise in the database. To counter this, we use advanced natural language processing techniques with flexible transducers coupled in a single component named “Identity matcher”. This component, fed with a small set of simple descriptors, assigns a trustworthiness score to a given chunk of text. In our case, the descriptors are words drawn from restaurant vocabulary. As in most Web processing, these operations can be distributed, using mappers and reducers [3]. In a sense, the reduction stage, performed before indexing, can be considered as an integral part of the summarization of the results for a future query.
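The idea of scoring a chunk of text against a small set of descriptors can be illustrated with a deliberately simple stand-in. The sketch below is not the “Identity matcher” component, which relies on NLP transducers; it only counts descriptor words near a mention of the restaurant name. The descriptor list, the window size, the saturation threshold and the function names are all assumptions made for the example.

```python
import re

# Hypothetical descriptor vocabulary; the prototype's actual list is not given.
RESTAURANT_DESCRIPTORS = {
    "restaurant", "menu", "chef", "dinner", "lunch", "brunch",
    "dish", "reservation", "waiter", "cuisine",
}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def trust_score(chunk, name, descriptors=RESTAURANT_DESCRIPTORS,
                window=10, saturation=3):
    """Return a score in [0, 1] for how likely `chunk` mentions the
    restaurant `name`, based on descriptor words found within `window`
    tokens of each mention. `saturation` hits give the maximum score."""
    tokens = tokenize(chunk)
    name_tokens = tokenize(name)
    if not name_tokens:
        return 0.0
    n = len(name_tokens)
    best = 0.0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != name_tokens:
            continue
        # Context on both sides of the mention, excluding the name itself.
        context = tokens[max(0, i - window):i] + tokens[i + n:i + n + window]
        hits = sum(1 for tok in context if tok in descriptors)
        best = max(best, min(1.0, hits / saturation))
    return best


recipe = "Spread butter on the toast and bake for ten minutes."
post = "We had dinner at Butter last night; the chef sent out a tasting menu."
print(trust_score(recipe, "Butter"), trust_score(post, "Butter"))
```

With a common name such as “Butter”, a recipe sentence receives a score of zero while a blog sentence surrounded by restaurant vocabulary scores high, so a threshold on the score can filter noisy matches before they reach the database.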

3.2 Consolidation Once we have identified the entities in text extracted from the Web, items are then sent to reducers where the consolidation process happens. We can have a specific reducing policy for each entity type defined in the data schema. In our implementation, consolidation is achieved in two different steps: Same entity reduction and cross entity reduction.
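A minimal sketch of same entity reduction, assuming per-attribute policies of the kind used in the prototype: an existing address is kept, notations are averaged, a review is replaced only by one from a more trusted source, and unseen pictures are appended. The dictionary layout, the `trust` field and the bookkeeping keys `notation_count` and `review_trust` are hypothetical, and replacement based on language analysis is left out.

```python
from copy import deepcopy


def reduce_same_entity(entity, item):
    """Merge a newly mapped `item` into the consolidated `entity`.

    Both arguments are plain dicts; a new dict is returned and the
    inputs are left untouched.
    """
    merged = deepcopy(entity)

    # Address: an existing address is never replaced.
    if not merged.get("address") and item.get("address"):
        merged["address"] = item["address"]

    # Notation: new notations are averaged into the existing one.
    if item.get("notation") is not None:
        count = merged.get("notation_count", 0)
        total = (merged.get("notation") or 0.0) * count + item["notation"]
        merged["notation_count"] = count + 1
        merged["notation"] = total / (count + 1)

    # Review: replaced only if the new source has a higher trust level.
    if item.get("review") and item.get("trust", 0) > merged.get("review_trust", -1):
        merged["review"] = item["review"]
        merged["review_trust"] = item.get("trust", 0)

    # Pictures: added if not already known.
    pictures = merged.setdefault("pictures", [])
    for url in item.get("pictures", []):
        if url not in pictures:
            pictures.append(url)

    return merged


entity = {}
for mapped in (
    {"address": "A", "notation": 4.0, "review": "r1", "trust": 1, "pictures": ["p1"]},
    {"address": "B", "notation": 2.0, "review": "r2", "trust": 0, "pictures": ["p1", "p2"]},
):
    entity = reduce_same_entity(entity, mapped)
print(entity)
```

Keeping one small rule per attribute makes it straightforward to give each entity type in the data schema its own reducing policy.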

¹ http://www.opentable.com, http://www.menupages.com, http://www.nymag.com, http://www.fodors.com, and http://www.qype.co.uk

² http://www.nymag.com/daily/food, http://www.midtownlunch.com, http://www.diningfever.com/blog, http://newyork.metromix.com, http://en.wikipedia.org

³ http://www.vimeo.com/1117109

⁴ http://nlp.uned.es/weps/weps-1/ and the second edition, http://nlp.uned.es/weps/, being held at WWW 2009

Same entity reduction is performed by a module that receives as parameters the mapped item and the entity in its most recently consolidated form (initially, the attributes of the entity are empty). The goal of this step is to enrich the current item with newly discovered attributes that were not present in the previously processed documents. Attributes are consolidated using custom aggregation policies which depend on the nature of the attributes. Examples of these consolidation policies are:

• An existing address is never replaced
• New notations are averaged into existing notations
• Reviews may be replaced by a review from a source with a higher trust level, or following stronger language analysis results such as sentiment and opinion analysis
• Pictures are added if not already known

Cross entity reduction is principally used to compute a Restaurant's BuzzScore and Best Comment attributes. Each sentiment-bearing Comment increases or decreases the BuzzScore of its related restaurant according to the values returned by our sentiment analysis. If the Comment has better statistics than the current restaurant's best comment, and a sufficient length, the attribute is updated. Each restaurant that has been identified in a blog post also has its BuzzScore increased, and so on.

4. BuzzScore Our BuzzScore algorithm attempts to summarize popularity and hotness, which are often key elements of relevance in vertical search engines. Capturing these sentiments is particularly pertinent in our Restaurant search case because people expect the most popular (and hippest) restaurants to show up first in the ranking. This algorithm can be considered as an extension to current news ranking algorithms [7]. While freshness, authority degree and citation count components are enough to achieve relevant news ranking, in dealing with blogs, we execute more probes to counter potential spamming. First, we measure the number of different websites talking about this restaurant. This value is not the most significant feature of BuzzScore since websites generally try to cover as many restaurants as possible. On the other hand, the number of available pictures of a given restaurant is a good clue to its popularity and we give this feature a medium weight in our BuzzScore formula. Another feature, the number of comments, is included with a minor weight. The most important parameter in our BuzzScore comes from sentiment analysis [11] over user reviews. Depending on their polarity, the BuzzScore will increase or decrease. The same process is applied to blog posts, but with a larger impact on the score as blog mentions are much rarer. The total contribution of a given blog post also decreases over time in a buzz freshness component of BuzzScore. Finally, the number of videos found related to this restaurant significantly boosts the BuzzScore. Here is the formula we use for calculating the BuzzScore of a restaurant:

\[
\mathrm{BuzzScore}(\mathrm{Rest}_i) = W_{Src} \sum \mathrm{SRC}_i + W_{Img} \sum \mathrm{IMG}_i + W_{Vid} \sum \mathrm{VID}_i + W_{COM} \sum S(\mathrm{COM}_i, t) + W_{BP} \sum S(\mathrm{BP}_i, t)
\]

where

$\mathrm{Rest}_i$ is the restaurant $i$ being scored

$\mathrm{SRC}_i$ is the number of different sites that mention this restaurant $i$ in the slice of the Web being summarized

$\mathrm{IMG}_i$ is the number of images found for this restaurant $i$ in review sites

$\mathrm{VID}_i$ is the number of videos found for this restaurant $i$ on http://www.dailymotion.com/us/channel/travel, a site that indexes videos

$S(\mathrm{COM}_i, t)$ is the sentiment score of the restaurant $i$ from user commentary on review sites at time $t$. This score is calculated as follows:

\[
S(\mathrm{COM}_i, t) = \left( |\mathrm{PosAdj}_{COM_i}| - |\mathrm{NegAdj}_{COM_i}| \right) \cdot W_{Adj}
\]

$\mathrm{PosAdj}$ are positive valence adjectives found in user commentary or blog postings; these are adjectives such as good, great, lovely, …

$\mathrm{NegAdj}$ are negative valence adjectives found in user commentary or blog postings; these are adjectives such as awful, expensive, noisy, …

$S(\mathrm{BP}_i, t)$ is the sentiment score of the restaurant $i$ from blog postings at time $t$. This time sensitive score diminishes over time as described in [9]. It is calculated as follows, where $t$ is the present time and $t_{init}$ is the time at which the blog post was posted:

\[
S(\mathrm{BP}_i, t) = e^{-\alpha (t - t_{init})} \, S(\mathrm{BP}_i, t_{init})
\]

(the exponential factor tends to 0.5 over time), and where

\[
S(\mathrm{BP}_i, t_{init}) = \left( |\mathrm{PosAdj}_{BP_i}| - |\mathrm{NegAdj}_{BP_i}| \right) \cdot W_{Adj} \cdot 10
\]

The factor 10 shows that we give greater weight to blog post opinions than to commentary from review sites.

$W_k$ are ad hoc weightings assigned to each element of BuzzScore

This BuzzScore then gathers a wide variety of statistics concerning a restaurant’s impact on the web and attempts to summarize all this data in one number. At query time, hits are sorted by a virtual score, called BuzzRanking, computed by taking a weighted linear combination of a score from traditional search engine text matching algorithms [1] and each restaurant’s BuzzScore. In this way, our summary of all user opinions affects the final ordering in which the responses are displayed.
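The BuzzScore terms and the BuzzRanking combination can be sketched in a few lines, assuming the sentiment scores are the difference between positive and negative adjective counts scaled by the adjective weight. The weight values, the decay rate `alpha`, the time unit (days), the mixing parameter `mix` and all function names are placeholders chosen for illustration; the weightings actually used are not given here.

```python
import math
from dataclasses import dataclass


@dataclass
class Weights:
    # Placeholder values only; the real weightings are ad hoc.
    src: float = 0.5   # number of sites mentioning the restaurant
    img: float = 1.0   # number of images
    vid: float = 2.0   # number of videos
    com: float = 1.0   # sentiment from review-site commentary
    bp: float = 1.0    # sentiment from blog posts
    adj: float = 1.0   # weight of one sentiment-bearing adjective


def comment_sentiment(pos, neg, w_adj):
    """S(COM): (positive minus negative adjective count) times the weight."""
    return (pos - neg) * w_adj


def blog_sentiment(pos, neg, w_adj, t, t_init, alpha):
    """S(BP): initial score boosted by a factor 10, decayed exponentially."""
    initial = (pos - neg) * w_adj * 10
    return math.exp(-alpha * (t - t_init)) * initial


def buzz_score(n_sources, n_images, n_videos, comments, blog_posts,
               t, w, alpha=0.05):
    """comments: list of (pos, neg); blog_posts: list of (pos, neg, t_init)."""
    com = sum(comment_sentiment(p, n, w.adj) for p, n in comments)
    bp = sum(blog_sentiment(p, n, w.adj, t, t0, alpha) for p, n, t0 in blog_posts)
    return (w.src * n_sources + w.img * n_images + w.vid * n_videos
            + w.com * com + w.bp * bp)


def buzz_ranking(text_score, buzz, mix=0.5):
    """Weighted linear combination of a text-matching score and BuzzScore."""
    return mix * text_score + (1.0 - mix) * buzz


w = Weights()
fresh = buzz_score(4, 6, 1, [(5, 2), (1, 3)], [(3, 0, 0.0)], t=0.0, w=w)
stale = buzz_score(4, 6, 1, [(5, 2), (1, 3)], [(3, 0, 0.0)], t=60.0, w=w)
print(fresh, stale, buzz_ranking(0.8, fresh))
```

In this sketch the same restaurant scores lower sixty days after its blog post than on the day of posting, which is the freshness effect the blog term is meant to capture.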

5. SUMMARIZATION AND MASHUP 5.1 Search results By construction, our BuzzRanking algorithm favors results with rich data associated with them. The top items returned by a search have pictures, videos, commentary and blog posts that enable us to build an attractive “Look Zone”-based [4] mashup. Since more than 75% of the time spent by users on result pages is spent on the first three results, it is important to make them easily decodable as well as visually interesting. One of the base concepts of mashups is to use different font sizes as an implicit visual clue to content importance. Following this best practice, we designed a result page (see Figure 1, below) where the “best” restaurant has its own area, with bigger text and a bigger picture, along with the three adjectives that best describe this restaurant according to user reviews. A short query-biased textual summary of the best review is also available in the first result pane. Restaurants are geo-localized on a Google map to provide more information to the user. Results lower in the list have fewer details, but are still reachable through the map, which leads users to refine their choice to closer restaurants.

Figure 1. Restaurant mashup demonstrating font-size based content ranking

5.2 Restaurant details page The automatically constructed restaurant result pages (see an example in Figure 2) are also a summary of different information collected from different sources. Notable elements include reviews and comments from various sources, normalized information such as Opening Hours, a small contextual cuisine frame extracted from relationships between cuisine and dish articles in Wikipedia [13], and of course the map. Sentiment analysis results are displayed as a tag cloud with adjectives colored by tone, which immediately offers a good overview of what people are thinking about this restaurant.

6. CONCLUSION Here we have presented a version of a next generation retrieval system that types and summarizes information in a modern mashup format. This result is achievable nowadays thanks to the great strides made in semantic extraction techniques over the past two decades, as well as the growing number of information sources developed in Web 2.0. We present an example of a vertical search engine covering the domain of restaurants, but we believe the same summarization techniques can be applied in a great number of other domains, from patents to business to medicine.

7. REFERENCES [1] Brin, S and Page, L: “The anatomy of a large-scale hypertextual Web search engine” in Computer networks and ISDN systems, 1998

[2] Chau, M. and Chen, H. 2003. Comparison of Three Vertical Search Spiders. Computer 36, 5 (May, 2003), 56-62

[3] Dean, J., and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of OSDI ’04: 6th Symposium on Operating System Design and Implementation, San Francisco, CA, Dec. 2004.

[4] Granka, LA, Joachims, T and Gay, G: “Eye-tracking analysis of user behavior in WWW search” in Proceedings of the 27th annual international ACM SIGIR, 2004

[5] Hogue, A. and Karger, D. 2005. Thresher: automating the unwrapping of semantic content from the World Wide Web. In Proceedings of the 14th international Conference on World Wide Web (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM, New York, NY, 86-95

[6] Horng-Jyh, PW, Jin-Cheon, N,and Soo-Guan, CK. A hybrid approach to fuzzy name search incorporating language-based and text-based principles. Journal of Information Science, Vol. 33, No. 1, 3-19 (2007).

[7] Xiaofeng Liu, Chuanbo Chen and YunSheng Liu: “Algorithm for Ranking News” in Proceedings of the Third International Conference on Semantics, Knowledge and Grid, Oct. 2007, 314-317

[8] Malaga, R. A. Worst practices in search engine optimization. Commun. ACM 51, 12 (Dec. 2008), 147-150.

[9] Nadeau D., and Sekine, S. A survey of named entity recognition and classification. Journal of Linguisticae Investigationes 30:1 ; 2007.

[10] Nie, Z., Wen, J.-R., and Ma, W.-Y.: Object-level Vertical Search. In Proc. of CIDR, 2007

[11] N Godbole, M Srinivasaiah, S Skiena: “Large-Scale Sentiment Analysis for News and Blogs” in Proc. Int. Conf. Weblogs and Social Media (ICWSM 2007)

Figure 2. A Restaurant result page summarizing restaurant specific information automatically consolidated from many different sources

[12] O'Brien, P. and Abou-Assaleh, T. 2007. Focused ranking in a vertical search engine. In Proc. of SIGIR '07 (Amsterdam, July 23 - 27, 2007), 912-912

[13] T Zesch, I Gurevych in “Analysis of the Wikipedia category graph for NLP applications” Proc of NAACL-HLT, 2007