Detecting Dominant Locations from Search Queries
Lee Wang1, Chuang Wang2, Xing Xie3, Josh Forman4, Yansheng Lu2, Wei-Ying Ma3, Ying Li1
1 Microsoft Corporation, One Microsoft Way, Redmond, WA 98052, USA, {leew, yingli}@microsoft.com
2 Department of Computer Science, Huazhong University of Sci. & Tech., Wuhan 430074, P.R. China, {chwang, ysl}@mail.hust.edu.cn
3 Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, P.R. China, {xingx, wyma}@microsoft.com
4 Harvard University, Boston, MA 02138, USA, [email protected]
ABSTRACT
Accurately and effectively detecting the locations that search queries are truly about has a potentially large impact on search relevance. In this paper, we define a search query’s dominant location (QDL) and propose a solution to detect it. A QDL is the geographical location (or locations) associated with a query in collective human knowledge, i.e., one or a few prominent locations agreed upon by the majority of people who know the answer to the query. QDL is a subjective and collective attribute of search queries, and we are able to detect QDLs both from queries that contain geographical location names and from queries that do not. The key challenges in QDL detection are false positive suppression (not every location name contained in a query denotes a geographical location) and the detection of locations implied by the context of the query. In our solution, a query is recursively broken into atomic tokens according to its most popular web usage in order to reduce false positives. If we do not find a dominant location in this step, we mine the top search results and/or query logs (with different approaches discussed in this paper) to discover implicit query locations. Our large-scale experiments on recent MSN Search queries show that our query location detection solution has consistently high accuracy across all query frequency ranges.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – search process, retrieval models, information filtering
General Terms
Algorithms, Performance, Experimentation.
Keywords
Information retrieval, local search, query’s dominant location, search, search query location, search relevance.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR’05, August 15–19, 2005, Salvador, Brazil. Copyright 2005 ACM 1-59593-034-5/05/0008...$5.00.
1. INTRODUCTION
A number of search queries are associated with geographical locations, either explicitly (i.e., the correct location qualifier is part of the query) or implicitly (i.e., the location is not present in the query string). Accurately and effectively detecting the locations that search queries are truly about has a potentially large impact on search relevance, on delivering better targeted search results, and on search user satisfaction.
In this paper, we define a query’s dominant location (QDL) and present a novel solution that effectively detects it. A QDL is the geographical location (or locations) associated with a query in collective human knowledge. The QDL is important because, if it exists, it should be used as the intended location for the query.
The challenge in detecting a query’s dominant location is that QDL is a subjective and collective measure: it is the location that exists in collective human knowledge. A location name contained in the query string may or may not denote a geographical location. Even if it does, it may not be the dominant location that we are seeking. On the other hand, there are queries that do not contain any location names but do have dominant locations. Running entity extraction algorithms based on geographical dictionary look-up produces many false positives for queries whose location names do not denote locations or are not dominant locations, and many false negatives for queries that do not contain location names.
In our experience, what matters for location-explicit queries is false positive suppression. There are two scenarios in which a false positive can occur. The first is when a query is geographically ambiguous. For example, the query “denzel washington” contains the location name “washington,” but the query is not geographical at all. This scenario has been addressed in the literature [1,6,11,12], using statistical analysis or natural language processing to suppress this type of false positive. The second scenario is also hard to suppress, and we have not yet seen an algorithmic solution in the literature. For example, the word “kentucky” in “kentucky fried chicken” does refer to the state of Kentucky, USA, where KFC originated. But KFC has grown into a worldwide business, and today Kentucky should not be the QDL for the query. This query is locally intended but does not have a QDL.
In this paper, we present a solution that effectively suppresses false positives of both types. We are also able to detect dominant locations for location-implicit queries and minimize false negatives (misses). For example, our solution gives “Seattle, WA, USA” as the QDL for the query “space needle.” For these queries, location name extraction from the query simply does not work. In our solution, we mine top search results or search logs to infer locations.
Knowing that a query is local but does not have a QDL is also important, because only when a local query has no QDL can other (distributed) locations, such as those derived from user IPs, be used. For example, the query “McDonald’s” (another worldwide fast food restaurant chain) does not have a QDL. When a user searches for McDonald’s, the intended restaurant is normally the one closest to the user’s current location. Therefore, the location of McDonald’s for user A will differ from that for user B if the two users are geographically apart. Thus, collectively in human knowledge, “McDonald’s” is a local query without a QDL. This example illustrates the difference between the local intention and the dominant location of a search query. Table 1 gives sample location-explicit and location-implicit queries, shown with their dominant locations and local intention. Local intention detection is not covered in this paper; it is the next step of our research.
Table 1. Example queries for QDL and local intention.
Query                     Dominant location     Local intention
kentucky fried chicken    -                     Local
denzel washington         -                     Global
pizza in redmond          Redmond, WA, USA      Local
new york times            -                     Global
new york time             New York, NY, USA     Local
mcdonald's                -                     Local
space needle              Seattle, WA, USA      Local
information retrieval     -                     Global
Our contributions in this paper include:
● A formal definition of a query’s dominant location (QDL), and a discussion of why it is important to search relevance. We also state the differences and the relationship between QDL and a query’s local search intention.
● A novel solution that detects QDLs from queries both with and without location keywords, using a combination of data sources as necessary. Our solution effectively suppresses false positives and false negatives.
● A classification system that categorizes search queries into four distinct types by the presence of location keywords and of a QDL. We labeled a large number of MSN Search queries covering all query frequencies, and studied how queries distribute over our types in different frequency ranges.
● A large-scale evaluation of our QDL solution using these labeled queries. For performance, we report the precision, recall, Micro-F1, and error rates of our QDL detection across all queries as well as for different query frequency ranges and different query types. We also report the computational time cost of each test we ran. Our results show that our QDL detection performs consistently over all query frequency ranges and outperforms a dictionary look-up method and Google.
In the rest of this paper, we first describe work related to our research. We then formally define the QDL and categorize search queries into distinct types. Next, we describe our QDL detection solution: we first give an overview, and then discuss the query tokenization algorithm and our approaches to detecting locations from search results and query logs. We then give experimental results that show the performance and speed of our solution. Finally, we summarize the paper, discuss how our QDL detection will improve local search, and point out future research directions.
2. RELATED WORK
In the natural language processing (NLP) community, there has been much work on Named Entity Recognition (NER) (e.g., [5,18]). Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. NER systems try to identify all the named entities in a sentence or a paragraph of text, and tag them with appropriate entity types. For location entity names, researchers have recently started to work on determining the actual places they denote, which is usually called “grounding” [11,12]. For instance, one needs to determine whether the location name “Washington” means a state or a city. “Grounding” algorithms usually use a gazetteer to verify geographic names, and use context information in the text to help distill the correct sense of a name. In our case, since queries are usually short and are often not proper sentences, NLP techniques are difficult to apply with high accuracy. In addition, NLP algorithms are too slow to process a large quantity of queries in a short period of time.
Beyond tagging locations for only a few words, as in a query, there is also much work on tagging locations for a web page or a web site [1,6]. The basic idea is to use more information, such as ZIP codes, phone numbers, languages, and HTML links, in addition to the words that appear in the gazetteer, to deduce the location focus. Based on all the locations extracted from one page, an algorithm is applied to identify geographical locations at a proper level in a given location hierarchy. This work is complementary to ours: when the location of web content is known, the search engine can conveniently return pages whose locations are related to the QDL to improve user satisfaction.
The work most related to this paper is [10]. In that paper, the authors classified web queries into two types: local and global. They define a query as local if its best matches on a web search engine are likely to be local pages, as for “houses for sale.” A number of classification algorithms were evaluated using search engine queries. However, their experimental results showed that only rather low precision and recall can be achieved. Their work is closer to the concept of “local intention” described in the first section of our paper, which we consider a related but different problem from QDL.
A number of commercial search services have started to support location-based search. Some of them, like Google’s and Yahoo!’s local search sites [8,17], require users to specify a location qualifier in addition to a search query. More recently, MSN and Google Search [13,9] added a location look-up capability that extracts location qualifiers from search query strings. These algorithms are based on location string matching rules. For example, for a search for “Pizza Seattle”, Google returns “Local results for pizza near Seattle, WA.” However, no commercial search site has yet successfully derived locations from location-implicit queries. For example, for a search for “Pizza near space needle”, Google does not return any local results for pizza businesses around the Space Needle (which is in downtown Seattle).
In summary, there already exist a number of studies on tagging text or web content with geographical information. However, none of them has addressed detecting geographical locations from web queries. The main challenges include location word ambiguity and the lack of context information, since most queries are short. In the rest of this paper we explain how our solution addresses the query location detection problem.
3. QUERY LOCATION AND TYPES
We define a query’s dominant location as follows:
Definition 1: A Query’s Dominant Location (QDL) is one or more geographical locations associated with a query in collective human knowledge, i.e., the prominent location(s) agreed upon by the majority of people who know the answer to the query.
This definition is rather subjective. We are trying to detect the dominant location that the majority of people agree on, so that the detected location can be used for most search users. For example, the majority of people think the query “new york style pizza” searches for a special type of pizza and therefore should not have a dominant location. For another query, “new york pizza”, some may consider it to mean the same as “new york style pizza”, but many other users consider it a search for a pizza place located in New York City. Therefore, this query has a QDL, which is New York, NY, USA.
In this paper, we also categorize search queries into the following four distinct types by the presence of location keywords and of a QDL:
● Queries that contain no location keywords and have no QDL (Type-1). This type contains location-implicit, QDL-negative queries. Examples of this type are “poetry”, “Harry Potter”, and “information retrieval.”
● Queries that contain location keywords and have a QDL (Type-2). This type contains location-explicit, QDL-positive queries. An example is “Chicago weather,” whose QDL is obviously Chicago, IL, USA. Google [9] and MSN Near Me [13] have started to provide local search by extracting location names from this type of query. One needs to be sure that a query is of this type before location extraction can be applied.
● Queries that contain no location keywords but have a QDL (Type-3). This type contains location-implicit, QDL-positive queries. An example is “oyster Olympics.” Although there are no location keywords in the query, it is locally bound to Seattle, WA, USA. Today no commercial search provider derives the implicit locations from this type of query.
● Queries that contain location keywords but have no QDL (Type-4). This type contains location-explicit, QDL-negative queries. In this type, the location keywords are often part of a well-known phrase, such as “Kentucky Fried Chicken” or “New York minute”, but are not dominant locations. No commercial search provider does a good job of suppressing false positives from this type.
4. QDL DETECTION
4.1 Overview
The simplest way to detect a QDL is to look up the query’s words in a location dictionary. This approach would solve our problem if all queries contained location names and none of them were ambiguous. In reality, however, a significant portion of web queries do not have this property. In contrast to the look-up approach, the basic principle of our solution is to use top search results as an approximation of the majority opinion on the answer to the query, since web pages can be regarded as a huge collection of human knowledge and a good search engine returns the most relevant and popular usage of, or answer to, the query in its top results.
Search queries are often short, containing only a few words. One needs to leverage additional, related contextual information to provide more precise results. We use three types of information sources: queries, search results, and query logs. If we detect the QDL from the query, we do not need to look further. Search results contain two parts: text blobs (snippets) and returned web URLs (result pages). A query log contains several pieces of information: the user location, the search query, and the web pages on the result list that users clicked on. These sources also represent different viewpoints: queries represent the intention of the current user; query logs usually represent the interest of previous search users; and search results stand for the interest of web authors. We believe that by combining information from these different points of view, a complete understanding of the QDL can be obtained.
Figure 1. The flowchart of QDL detection.
Figure 1 shows the workflow of our QDL detection algorithm. In our solution, we calculate a QDL for each of the three information sources: queries (QDL-query), search results (QDL-result), and query logs (QDL-log). We then combine the three locations to get the final result. The basic principle for combining QDLs is that users’ interest takes precedence over web authors’ interest, while the current user’s interest takes precedence over previous users’ interest. Our QDL detection combination rules are:
1. If the query contains location keywords, we calculate the QDL-query. If the QDL-query is found and is not ambiguous, we return the QDL-query and stop.
2. If the query log is available, we calculate the QDL-log. If the QDL-log is found and is not ambiguous, we return the QDL-log and stop.
3. We retrieve search results and calculate the QDL-result based on result snippets and/or result pages. If the QDL-result is found and is not ambiguous, we return the QDL-result and stop.
4. If we have not found a QDL by this point, the input query does not have one, and we stop.
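The four combination rules amount to a fall-through cascade over the three sources. The following is a minimal sketch of that control flow only, assuming each source is wrapped in a function that returns a location string, or None when nothing unambiguous is found. The names detect_qdl, has_location_keyword, qdl_from_query, qdl_from_log and qdl_from_results are hypothetical and do not come from the paper.

```python
from typing import Callable, Optional

# A detector maps a query string to a location string, or to None when no
# unambiguous location is found. All detectors here are hypothetical
# stand-ins for the QDL-query, QDL-log and QDL-result computations.
Detector = Callable[[str], Optional[str]]


def detect_qdl(
    query: str,
    has_location_keyword: Callable[[str], bool],
    qdl_from_query: Detector,
    qdl_from_log: Optional[Detector],
    qdl_from_results: Detector,
) -> Optional[str]:
    """Apply combination rules 1-4 in order and return the first QDL found."""
    # Rule 1: the current user's query takes precedence.
    if has_location_keyword(query):
        location = qdl_from_query(query)
        if location is not None:
            return location
    # Rule 2: previous users' interest, only if a query log is available.
    if qdl_from_log is not None:
        location = qdl_from_log(query)
        if location is not None:
            return location
    # Rule 3: web authors' interest, from snippets and/or result pages.
    location = qdl_from_results(query)
    if location is not None:
        return location
    # Rule 4: nothing was found, so the query has no QDL.
    return None
```

Passing None for qdl_from_log models the "log not available" branch of Figure 1. The ordering of the three checks is the only part taken from the text; how each detector decides that a location is "not ambiguous" is left to the detector.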
4.2 Detecting QDL from Search Queries
We observe that search engines do their best to return the most up-to-date, relevant, and popular content and documents in the top portion of the returned results. This tells us that we should be able to use these top results to approximate the current collective human knowledge (sufficiently for our purpose of improving search relevance). In other words, top results from a good search engine should represent the most popular and correct context and usage of the query on the web. This observation has also been used by other researchers; one example is [2], which builds an automatic question answering system using search results.
We developed a query tokenization algorithm that breaks a query into atomic parts (tokens) according to the usage of the query in top search results. In the outcome, if the location name contained in the original query is not an atomic token, then it is part of a well-known phrase and thus is not the QDL. There is some work in NLP on noun phrase extraction [3,4] that is related to our problem. Compared with those approaches, ours is more lightweight: it is based on a much smaller but more relevant corpus (snippets).
We formulate our problem as follows: for a given query Q, split it into the most probable token list TL = {t1, t2, …, tn} so as to maximize the conditional probability Pr(TL|Q). According to Bayes’ law, we have:
\[ \widehat{TL} = \arg\max_{TL} \Pr(TL \mid Q) = \arg\max_{TL} \frac{\Pr(Q \mid TL)\,\Pr(TL)}{\Pr(Q)} \qquad (1) \]
For a particular query Q, Pr(Q) is the same for all possible TLs. Pr(Q|TL) is usually called the typing model, and Pr(TL) the language model. In a real system, typing error processing can be separated from location processing; therefore, Pr(Q|TL) equals one for every TL. The problem thus becomes maximizing Pr(TL), the prior probability of the token list. We estimate Pr(TL) as follows:
\[ \Pr(TL) \approx \frac{\sum_{i=1}^{n} TF(s_i)}{\sum_{j=1}^{m} TF(t_j)} \qquad (2) \]
where TF(tj) and TF(si) stand for the frequency of token tj or si in the result snippets, m is the number of all possible tokens for a given query, and n is the number of tokens in TL. For example, m is 15 for the query “kentucky fried chicken in seattle” and n is 3 if it is split into “kentucky fried chicken | in | seattle” (“|” denotes a token boundary). When calculating TF, we only count the longest match for a token occurrence; e.g., for an occurrence of “kentucky fried chicken”, we do not add a count to either “kentucky fried” or “fried chicken.”
We now walk through the algorithm with the example query “kentucky fried chicken in seattle.”
Query tokenization and explicit location detection algorithm:
Step 1: Submit the query to the search engine and collect a list of tokens (sub-queries) from the top result snippets it returns. For our example, we parsed the top 30 results; Table 2 lists the tokens we obtained in descending order of TF. TF% is the number of occurrences of a token divided by the total number of occurrences of all tokens in the top search results.
Table 2. Top tokens from search results.
Token                     TF    TF%
kentucky fried chicken    16    31.4%
kentucky fried            11    21.6%
…                         …     …
Step 2: Assemble tokens from Step 1 back into the original query, starting from the top one. A token cannot be reused in the assembly process. For our example, we obtained the following token lists.
Table 3. Different assemblies for the example query.
Token list                                Pr(TL)
kentucky fried chicken | in | seattle     31.4% + 0% + 15.7% = 47.1%
kentucky fried | chicken | in seattle     21.6% + 13.7% + 7.8% = 43.1%
…                                         …
Step 3: Pick the top token list from Step 2. For our example, we pick “kentucky fried chicken | in | seattle.”
Step 4: For each token in the Step 3 outcome, repeat Steps 1-3 until it cannot be broken further. For our example, we send “kentucky fried chicken” to the search engine and find that it cannot be broken further, because the first sub-token on the returned list is the input token itself.
Step 5: Output the final token list, which contains only atomic tokens and has the largest Pr(TL). For our example, the final output of the algorithm is “kentucky fried chicken | in | seattle.”
This example shows how the tokenization algorithm suppresses false positives. Because “kentucky” is always used together with “fried chicken,” by itself it cannot be a geographical location. The token “seattle” is atomic and not ambiguous; thus the QDL of this query is “Seattle, WA, USA.” Note that we could further validate the QDL by looking at the other context sources used in our solution.
We further illustrate the power and accuracy of our tokenization algorithm with two queries: “seattle best coffee” and “seattle’s best coffee.” These two queries are very similar but mean different things. The first is a general phrase, with which the user is searching for the best coffee in the Seattle area, whereas the second is used to search for a coffee shop chain named Seattle’s Best Coffee (which originated in Seattle but has since expanded into other cities). Our tokenization algorithm breaks the first query into three atomic tokens, “seattle | best | coffee”, but finds that the second query is itself atomic. As a result, the QDL for “seattle best coffee” is “Seattle, WA, USA”, and “seattle’s best coffee” does not have a QDL.
Another advantage of our tokenization algorithm is that, because it is based entirely on live search results, search queries are always broken according to current popular usage. One can think of it as always using a fresh corpus of the most relevant and current documents for each query. A static corpus cannot be as relevant and complete as what a good search engine provides. We are aware that some false positive suppression implementations use a list of well-known named entities for exclusion. Compared with our approach of using a live web corpus, an exclusion list incurs additional, non-trivial cost to create, maintain, and constantly update in order to keep practical and consistent coverage over time.
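As a check on Equation (2) and the scores in Table 3, the sketch below sums normalized token frequencies for a given split. It is only an illustration: the TF% values are the ones quoted in Tables 2 and 3, tokens absent from those tables are assumed to have frequency zero, and the function names candidate_tokens, pr_token_list and is_valid_split are hypothetical. It scores a given token list and does not reproduce the recursive assembly of Steps 1-5.

```python
from typing import Dict, List

# TF% values quoted in Tables 2 and 3 for the example query, as fractions.
# Tokens that do not appear in those tables are assumed to have share 0.
TF_SHARE: Dict[str, float] = {
    "kentucky fried chicken": 0.314,
    "kentucky fried": 0.216,
    "seattle": 0.157,
    "chicken": 0.137,
    "in seattle": 0.078,
}


def candidate_tokens(query: str) -> List[str]:
    """All contiguous word sequences of the query (the m possible tokens)."""
    words = query.split()
    return [
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    ]


def pr_token_list(token_list: List[str], tf_share: Dict[str, float]) -> float:
    """Equation (2): sum of the normalized frequencies of the tokens in TL."""
    return sum(tf_share.get(token, 0.0) for token in token_list)


def is_valid_split(query: str, token_list: List[str]) -> bool:
    """A token list must reassemble into the original query, in order."""
    return " ".join(token_list) == query
```

For the five-word example query, candidate_tokens yields 5 + 4 + 3 + 2 + 1 = 15 tokens, matching m = 15 in the text, and pr_token_list gives approximately 0.471 and 0.431 for the two splits shown in Table 3.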
4.3 Detecting QDL from Query Logs
If we do not get a QDL from the previous step, we move on to mining query logs to detect the implicit QDL for the query. Our solution uses the user IPs and clicked URLs recorded for the query in the log. For user IPs, we first map them to user locations. If we treat the collection of user locations as a web document, the location detection problem becomes similar to that in [1,6]: an algorithm is needed to find the dominant location in a list of extracted locations. The main issues here are handling the location hierarchy and returning locations at the appropriate level of the hierarchy. It would be erroneous to simply count location frequencies and return the most frequent ones as the result. As illustrated in Figure 2, if we get one “Seattle,” two “California,” three “San Diego,” two “Los Angeles,” and two “San Francisco,” it is clear that the main location focus should be California rather than any of the mentioned cities in California, and it should be neither the state of Washington nor Seattle.
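The California example can be made concrete with a small sketch. This is a simplified stand-in, not the power and spread measures of [6]: it credits every count to a node and all its ancestors, then descends from the root while a single child holds at least a given share of its parent's total. The toy hierarchy, the function names and the 0.6 threshold are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical toy hierarchy as child -> parent; "USA" is the root.
PARENT: Dict[str, Optional[str]] = {
    "USA": None,
    "California": "USA",
    "Washington": "USA",
    "San Diego": "California",
    "Los Angeles": "California",
    "San Francisco": "California",
    "Seattle": "Washington",
}

# Location counts from the example in the text.
COUNTS: Dict[str, int] = {
    "Seattle": 1,
    "California": 2,
    "San Diego": 3,
    "Los Angeles": 2,
    "San Francisco": 2,
}


def subtree_counts(parent: Dict[str, Optional[str]], counts: Dict[str, int]) -> Dict[str, int]:
    """Own frequency of each node plus the frequencies of all its descendants."""
    totals = {node: 0 for node in parent}
    for node, count in counts.items():
        current: Optional[str] = node
        while current is not None:  # credit the node and every ancestor
            totals[current] += count
            current = parent[current]
    return totals


def dominant_location(
    parent: Dict[str, Optional[str]],
    counts: Dict[str, int],
    threshold: float = 0.6,  # assumed cut-off, not a value from the paper
) -> str:
    """Descend from the root while one child holds at least `threshold` of its parent's total."""
    totals = subtree_counts(parent, counts)
    node = next(n for n, p in parent.items() if p is None)
    while True:
        children = [c for c, p in parent.items() if p == node]
        best = max(children, key=lambda c: totals[c], default=None)
        if best is None or totals[node] == 0 or totals[best] / totals[node] < threshold:
            return node
        node = best
```

With the counts above, California's subtree accounts for 9 of the 10 observations, so the walk moves from USA to California; San Diego holds only 3 of those 9, so the walk stops at California instead of picking the most frequent single name.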
Figure 2. Illustration of implicit dominant location detection.
In [1,6], different but similar algorithms have been designed that use the reinforcement relationships between different levels of the location hierarchy to solve this problem. Another advantage of such algorithms is that they can be used to disambiguate location names that correspond to multiple physical locations. In this paper, we use a modified version of their algorithms, as follows:
Implicit dominant location detection algorithm:
1. Map all the extracted location names to a hierarchical view of geographical scope, in which location nodes are distributed over different geographical levels such as country, state and city.
2.
Two measures are borrowed from the CGS/EGS approach [6], namely power and spread. We extend the power concept of a location node by considering the influences brought by its offspring nodes. The power of a location node l is calculated as: (3)
i
where f(l) refers to the frequency of l. We adopted the entropy definition of the spread concept, which can achieve the best performance according to [6]. 3.
When combining the QDLs from user locations (QDL-log-IP) and from clicked URLs (QDL-log-URL), we first combine the two location trees together in Step 1 of the above algorithm, and then apply Step 2 and Step 3 to the combined tree. The new f(l) for a location node l is calculated as:
\[ f(l) = \alpha\, f(l, \text{QDL-log-URL}) + (1 - \alpha)\, f(l, \text{QDL-log-IP}) \qquad (4) \]
where 0