Robust Location Search from Text Queries

Vibhuti Sengar¹, Tanuja Joshi², Joseph Joy¹, Samarth Prakash¹, Kentaro Toyama¹

¹ Microsoft Research India, 196/36 2nd Main, Bangalore 560 080, India
{vibhutis, josephj, t-samarp, kentoy}@microsoft.com

² Independent Consultant, #14A Silver Spring, Survey No 35/2 Panchavati, Pune 411 008, India
[email protected]

ABSTRACT
Robust, global address geocoding is challenging because there is no single address format that applies to all geographies and, in any case, users may not restrict themselves to well-formed addresses. Particularly in online mapping systems, users frequently enter queries with missing or conflicting information, misspellings, address transpositions, and other such variations. We present a novel system that handles these difficulties by using a combination of textual similarity and spatial coherence to guide a depth-first search over the large space of possible interpretations of a text query. The system robustly matches text subsequences of a query with text attributes (i.e., any text labels associated with an entity) in a spatial-entity database. Each matched attribute is associated with the pre-computed spatial union of all entities that have that attribute. Candidate results are formed by incremental spatial intersections of these unions. Experimental results demonstrate that our system is capable of supporting regions with widely differing address formats, without region-specific customization or training. Furthermore, we show that our system significantly outperforms commercial systems on unstructured location queries and queries containing errors.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – abstracting methods, indexing methods; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – query formulation, search process, selection process

General Terms
Algorithms, Design

Keywords
Location search, geocoding, ambiguous spatial queries

1. INTRODUCTION
In location search, the goal is to return a geographic location given a text-based address query. Queries might be well-formed postal addresses, but they could also be less structured, as in the case of two cross streets used to identify an intersection.


Existing address geocoding systems typically involve rule-based parsing of a query into a schematized record, followed by a query to a custom spatial database for a record match [12]. These rules may be manually constructed [17] or learned by training on manually tagged addresses [5, 18].

Such systems have two main limitations. The first is that they require special rules for each geographical region, as address schemes vary widely by country [16]. Even countries such as the United States, with well-structured addressing schemes, may still require customization for different regions within the country. The situation is much worse in countries such as India, which has a diversity of addressing schemes that have evolved regionally over time [14], and where addresses frequently include imprecise landmark-based elements such as "near railway station". Constructing rules in such instances is tedious at best.

The second problem with rule-based geocoding is that it is not robust to unstructured or ill-formed input, because it expects the query to follow a hand-specified schema that might include slots for "Street", "Postcode", and so on. Particularly in online mapping systems, we have found from examining commercial location search query logs that users often enter locations with missing or conflicting information, misspellings, transpositions, and other such quirks. Because rule-based geocoding depends on initial parsing, results can be very brittle for such input. It is also fundamentally unable to handle freeform text queries that do not adhere to any particular address format. Figure 1 shows some examples of user queries.

"2nd Main, 10th Cross, Indira Nagar 1st Stage, Near RTO Office, Bangalore PIN 560038" – a valid postal address in Bangalore, India.
"malleshwaram 13th cross" – an unstructured query referring to a locality and street in Bangalore.
"Middlesex Clifton Gardens Uxbridge, UB10 0EX" – a transposition of a valid address in London, UK ("Middlesex" should come after "Uxbridge").
"15400 se 30th pl suite 100, 98007 Seattle" – a mutation of a well-formed address, with a transposition (zip code before city) and conflicting information (the city should be "Bellevue", not "Seattle").
"places near Snset Dr and 6th Avnue" – an unstructured query specifying two cross streets in the city of Tacoma, USA, with two misspellings and some extra words ("places near" and "and").

Figure 1. Examples of potential user queries.

In this paper, we present an altogether novel technique for generalized location search that can deal with ambiguous, ill-

formed, or freeform text queries. Instead of parsing queries into an intermediate form, we cast the problem as finding a direct mapping from subsequences in the text query to specific entities in a spatial database, and we return the spatial (geometric) intersection of the entities as the result of the search. Our technique uses textual similarity (closeness of match between query terms and entity attributes) and spatial coherence (degree of overlap of the mapped entities) to guide the search for a viable interpretation of the query.

Our approach uses no region-specific rules. In fact, it does not use domain-specific semantic knowledge of any kind – all entities are interpreted simply as geometric shapes with textual attributes. This has enabled rapid incorporation of geographic vector data without having to decipher entity ontologies or deal with variations in semantics across regions. To our knowledge, this system is the first to be able to claim this level of generality.

We have applied our technique to build a prototype location search system that supports several cities in three countries with different addressing schemes: Seattle and surrounding cities, USA; London, UK; and Bangalore, India. The system returns a ranked list of results, as is common with online search providers. Experiments on 1800 test queries show that our system can successfully provide good results on a variety of location queries in any of these cities, and that it significantly outperforms existing geocoders on a wide range of ambiguous, ill-formed, and freeform query types, including all the examples in Figure 1.

In the next section, we define the location search problem. Sections 3 and 4 describe our search algorithm and implementation, respectively. In Section 5 we present experimental results that demonstrate our system's effectiveness. We survey related work in Section 6 and conclude with a discussion of future work in Section 7.

2. PROBLEM STATEMENT
We expect a query in the form of a text string, for which our only assumption is that it can be separated into a list of tokens (exploiting delimiters such as spaces and punctuation). Thus, "202 Shangrila 8 Edwrd Rd Bangalore 560052" would be a legal query composed of the following tokens: "202", "Shangrila", "8", "Edwrd", "Rd", "Bangalore", and "560052". We make no other assumptions about the format of the query; queries may be well-formed addresses or otherwise. A query subsequence is any contiguous series of tokens in the query. Thus, "202 Shangrila" and "Shangrila 8" are both subsequences of the above query.

A spatial database consists of entities that have geometry and one or more textual attributes. For example, in the geocoding domain, the spatial database would include entities such as landmarks, roads, rivers, city boundaries, and lakes, each with geometry in the form of points, polylines, and polygons. Attributes could include feature names (e.g., "Shangrila", "Edward Road", "Bangalore") and classifications based on feature type (e.g., "apartment", "park", "school"). Entities can have multiple attributes (e.g., roads can have multiple names), and the same attribute can occur in multiple entities. An entity, e, is thus represented as a pair, e = (g, {a1, a2, …, an}), where g represents its geometry and the ai are the attributes associated with the entity. Note that in this paper, we use the term "attribute" to refer to the value, not the type, of the attribute.
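For concreteness, here is a minimal C# sketch of this data model and of the tokenization step (the type and member names are ours, chosen for illustration; they are not taken from the actual system):

    using System;
    using System.Collections.Generic;

    // An entity pairs a geometry with one or more textual attributes,
    // mirroring e = (g, {a1, ..., an}); Geometry stands in for points,
    // polylines, and polygons.
    public abstract class Geometry { }
    public record Entity(Geometry Geometry, IReadOnlyList<string> Attributes);

    public static class Tokenizer
    {
        // Separate a query into tokens on space and punctuation delimiters.
        // Tokenize("202 Shangrila 8 Edwrd Rd Bangalore 560052") yields
        // ["202", "Shangrila", "8", "Edwrd", "Rd", "Bangalore", "560052"].
        public static string[] Tokenize(string query) =>
            query.Split(new[] { ' ', ',', ';', '.', '\t' },
                        StringSplitOptions.RemoveEmptyEntries);
    }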

We then define an interpretation, I, of a query as a set of mappings of non-overlapping subsequences to specific entities in the spatial database. Figure 2 illustrates a particular interpretation of the query "202 Shangrila 8 Edwrd Rd Bangalore 560052". This interpretation maps three subsequences of the query to three entities e1, e2 and e3 as follows: I = {("Edwrd Rd", e2), ("Bangalore", e1), ("560052", e3)}. In this interpretation, the tokens "202", "Shangrila", and "8" remain unmapped. Tokens "Edwrd" and "Rd" form a subsequence that maps to the attribute "Edward Road" in the database. The shaded region in Figure 2 is the spatial intersection of the geometries of all entities in the interpretation, and it represents a candidate location for the input query. Thus, the generalized location search problem can be formulated as the problem of returning an interpretation whose spatial intersection of entities represents the location the user intended to specify with the query text string.

[Figure 2: Sample interpretation – the query "202 Shangrila 8 Edwrd Rd Bangalore 560052" mapped onto entities e1 = (g1, {a1: "Bangalore"}), e2 = (g2, {a2: "Edward Road"}) and e3 = (g3, {a3: "560052"}); the shaded intersection of geometries g1, g2 and g3 is the candidate location.]

The primary challenge is in handling the combinatorial explosion of possible matches between subsequences and entity attributes, especially when allowing for misspellings and other mutations of text. For an n-token query, there are 2^(n-1) ways to partition the query into a set of subsequences. Each of these subsequences may approximately match multiple attributes. For example, consider a query "a b c d" and a specific partition p composed of subsequences "a b", "c" and "d". Let ab_i, c_i and d_i be, respectively, their i-th textually closest matching attributes. The possible mappings for partition p form the following lattice, where each column lists the candidate matches for one subsequence and a mapping picks one entry per column:

  ab_1    c_1    d_1
  ab_2    c_2    d_2
   ...    ...    ...

Thus, each partition generates a lattice of possible mappings and there are many such lattices. In fact, if we were to consider the k closest attribute matches for each subsequence in a query string with n tokens, the number of possible mappings between subsequences of the query string to attributes can be shown to be n 1

Cin 1k i

1

i 0

This number grows rapidly with n and k. For example with n=9 and k=4, this value is 8,398,080. Moreover, this number still only represents the number of possible distinct mappings from a query

Because many entities can share the same attribute (for example, there are over 1,000 street entities named "1st Cross" in our Bangalore dataset), a brute-force exploration of all possible query interpretations would be intractable. In the next section, we present our algorithm for finding viable interpretations. The algorithm greatly reduces the number of interpretations that need to be explored by considering the more promising partial interpretations first and by aggressively pruning paths based on both textual and geometric constraints.
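As a sanity check on the growth formula above, a few lines of C# (our own illustrative code, assuming the formula as reconstructed) evaluate the sum directly:

    using System;
    using System.Numerics;

    public static class MappingCount
    {
        // Evaluates the sum over i = 0..n-1 of C(n-1, i) * (k+1)^(i+1):
        // C(n-1, i) partitions with i+1 subsequences, each mapping to one
        // of its k closest attributes or remaining unmapped.
        public static BigInteger Count(int n, int k)
        {
            BigInteger total = 0;
            BigInteger binom = 1; // C(n-1, 0)
            for (int i = 0; i <= n - 1; i++)
            {
                total += binom * BigInteger.Pow(k + 1, i + 1);
                binom = binom * (n - 1 - i) / (i + 1); // advance to C(n-1, i+1)
            }
            return total;
        }

        public static void Main() =>
            Console.WriteLine(Count(9, 4)); // prints 8398080
    }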

3. ALGORITHM
Our system works as follows. Given a spatial database of entities, it pre-computes several indexes and repositories offline. These include representations of the spatial data that enable fast approximate spatial intersection, as well as a fuzzy text index of the entity attributes. Online, when a text query is received, the system first generates a list of multiple fuzzy (approximate) matches between query subsequences and entity attributes. Next, our TEXSPACE algorithm takes this list of candidate subsequence-attribute matches and efficiently explores the space of possible query interpretations, generating a set of candidate interpretations. Finally, the system outputs a ranked list of interpretations; the spatial intersection of the entities in each interpretation gives its location. The relationships between these components are shown in Figure 3, and the components are described below.

Pre-computed data and indexes: For each attribute, we pre-compute what we call an attribute footprint – an approximate representation of the union of the geometries of all the entities that share the attribute (e.g., the union of the geometries of all the roads named "Taylor Street" or all landmarks labeled "school"). The approximate representation enables fast spatial intersection operations, which are a key part of our algorithm. We also maintain an entity lookup index that supports efficient lookup of multiple entities by attribute. Finally, we maintain a fuzzy text index that supports approximate lookup of attributes. While these components are critical to the system, they are independent building blocks whose internals are not the focus of this paper. We provide an overview of these components in the next section.

[Figure 3: System diagram – Text Query → Subsequence Mapping → TEXSPACE → Result Ranking → Ranked Results, supported by the Fuzzy Text Index and the Attribute / Entity Index.]

Subsequence Mapping: From a given text query, we first enumerate all possible query subsequences. (For an n-token query, there are n(n+1)/2 distinct subsequences.) For each query subsequence, we look up the Fuzzy Text Index (which supports queries with misspellings, word reorderings, etc.) for a set of k attributes that are textually similar to that subsequence. Each subsequence-attribute match generates what we call an Approximate Match Record (AMR), which contains the subsequence along with the matched attribute's footprint. Each subsequence can thus generate up to k AMRs, one for each textually similar attribute found. The set of all AMRs identified in this way is then passed to the TEXSPACE algorithm.
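In code, the subsequence-mapping step might look like the following sketch (illustrative C#; IFuzzyIndex and Footprint are our stand-ins for the Fuzzy Text Index and attribute footprints, not the system's actual interfaces):

    using System.Collections.Generic;

    // Stand-in for an attribute footprint; Section 4 sketches a concrete
    // bintree-based implementation of these operations.
    public abstract class Footprint
    {
        public abstract Footprint Intersect(Footprint other);
        public abstract bool Overlaps(Footprint other);
    }

    // An Approximate Match Record: the matched token span and subsequence,
    // the matched attribute, its fuzzy match score, and its footprint.
    public record Amr(int Start, int End, string Subsequence,
                      string Attribute, double Score, Footprint Footprint);

    // Stand-in for the fuzzy text index: returns up to k attributes
    // textually similar to a subsequence.
    public interface IFuzzyIndex
    {
        IEnumerable<(string Attribute, double Score, Footprint Footprint)>
            Lookup(string subsequence, int k);
    }

    public static class SubsequenceMapper
    {
        // Enumerate all n(n+1)/2 contiguous subsequences and collect up to
        // k AMRs for each.
        public static List<Amr> MapSubsequences(
            string[] tokens, IFuzzyIndex index, int k)
        {
            var amrs = new List<Amr>();
            for (int s = 0; s < tokens.Length; s++)
                for (int e = s; e < tokens.Length; e++)
                {
                    string sub = string.Join(" ", tokens[s..(e + 1)]);
                    foreach (var (attr, score, fp) in index.Lookup(sub, k))
                        amrs.Add(new Amr(s, e, sub, attr, score, fp));
                }
            return amrs;
        }
    }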

TEXSPACE(unincorporatedAMRs, focus, partialInterpretation)
  1. Rank the AMRs in unincorporatedAMRs in the preferred order of exploration (see text).
  2. For each AMR amr in unincorporatedAMRs:
     If the termination condition has been reached on solutionSet, return.
     a. Compute newPartialInterpretation = addition of amr to partialInterpretation.
     b. Compute newFocus = spatial intersection of focus and amr's footprint.
     c. Compute newUnincorporatedAMRs = RemoveIncompatibleAMRs(unincorporatedAMRs, amr, newFocus).
        If newUnincorporatedAMRs is empty:
           i. Construct a solution from newPartialInterpretation.
           ii. Add the solution to solutionSet.
        Else:
           i. Run TEXSPACE(newUnincorporatedAMRs, newFocus, newPartialInterpretation).

Figure 4: TEXSPACE algorithm

TEXSPACE: The TEXSPACE algorithm is the core of our system; its objective is to efficiently find viable interpretations within the large space of possible interpretations described in the previous section. The algorithm starts with a comprehensive set of AMRs and implements a heuristic-guided depth-first search over the space of possible interpretations. Partial interpretations of the initial query are constructed by choosing AMRs one by one from this set and adding them to partially constructed interpretations. As it picks additional terms, it computes spatial intersections of the footprints of all the chosen AMRs and filters out remaining AMRs whose footprints do not lie within the intersection or whose subsequences textually overlap with the chosen AMRs. At the end, TEXSPACE returns a set of interpretations and corresponding candidate locations.

The specifics of TEXSPACE are as follows. Each call to TEXSPACE takes the following variables as input:

- A partial interpretation, representing the current partially constructed interpretation.
- Unincorporated AMRs, representing possible AMRs that could be added to the partial interpretation.
- The focus of the current partial interpretation, representing a geometric region that scopes further exploration.

Separately, we maintain a global solution set, which contains complete interpretations of the query. TEXSPACE starts with (1) an empty partial interpretation, (2) a set of unincorporated AMRs formed by the complete output of the subsequence-mapping component (i.e., the set of AMRs that result from identifying possible subsequence-attribute matches), (3) an empty solution set, and (4) the initial area of focus. The initial focus can be the whole world or a smaller region provided by the caller; specifying a smaller initial focus is the way to restrict the overall spatial scope of the query to a particular region.

The first step in TEXSPACE is to order the input set of unincorporated AMRs in order of decreasing "promise". We currently sort AMRs in order of decreasing fuzzy-text match score; in other words, AMRs whose attributes match the input terms most closely appear earlier in the list. Our experiments have shown this to be an effective strategy, though other orderings may be devised.

TEXSPACE then considers adding, in turn, each of the AMRs in the unincorporated AMR set to the current partial interpretation. For each AMR amr in the unincorporated set, we compute the following: (a) a new partial interpretation that is the addition of amr to the current partial interpretation; (b) a new focus that is the spatial intersection of amr's footprint and the current focus; and (c) a new unincorporated AMR list with incompatible AMRs filtered out. Steps (b) and (c) are described further below.

The intersection operation in Step (b) typically results in a narrowing of the focus, as illustrated in Figure 5. Recall that an attribute's footprint approximately represents the geometries of all entities that match the attribute; hence a footprint can represent a large number of discontiguous geometries. Efficient computation of this intersection is therefore important to the overall efficiency of the algorithm. The successive reduction of focus can be considered a series of incremental spatial joins [8]. We explain our implementation of fast, approximate spatial intersection in the next section.

[Figure 5: Computing the new focus – the new focus is the spatial intersection of the current focus with the AMR's footprint.]

Computing the new unincorporated AMR list (Step (c)) is where the critical pruning of the search happens. Function RemoveIncompatibleAMRs takes the input list of unincorporated AMRs and returns a shorter list created by removing all AMRs that are either textually or spatially incompatible with the new partial interpretation and focus. AMRs are considered spatially incompatible if their associated footprints do not overlap with the new focus. Two AMRs are considered textually incompatible if their matched subsequences contain the same token (or tokens) from the input query.

For example, given a query "Street Corner Stand", AMRs derived from "Street Corner" are incompatible with those derived from "Corner Stand".

If the new list of unincorporated AMRs is empty, the partial interpretation cannot be expanded further, implying that a viable interpretation has been discovered. As a final step, we obtain the entities associated with this interpretation by querying the spatial index, specifying the matched attributes and the final focus. We add this new interpretation to the solution set, and TEXSPACE terminates the current branch of the depth-first search. On the other hand, if there are AMRs remaining in the new list of unincorporated AMRs, then TEXSPACE is recursively called with the new partial interpretation, the new (and narrower) focus, and the new unincorporated AMR set. The operation continues until the search ends or until an artificial termination condition is met. Various early-termination heuristics could be used that take into account the number and quality of the solutions found.

In our test results, we find that the time spent exploring states is minimal, thanks to the effective pruning produced by the narrowing of focus, as well as the efficacy of exploring the most promising avenues early on. The overall time spent executing TEXSPACE is in fact less than the time spent during subsequence mapping, when AMRs are computed using fuzzy text lookup. (For all the results presented in this paper, the number of states explored per query never exceeded 200.)

Result Ranking: The results of TEXSPACE are passed to the Result Ranking component, which ranks the results based on parameters such as subsequence match score and the number of entities mapped. Relatively sophisticated domain-specific ranking techniques can be applied cheaply because the results consist of a small number (ten to twenty) of interpretations, and each interpretation already contains a list of specific entity references (so no further entity search is required). To keep our core system generic, we delegate any such domain-specific ranking to an external component. We explain in the next section how we used this mechanism to easily incorporate a plot-number interpolation scheme.
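For illustration, here is a condensed C# rendering of the recursion in Figure 4, reusing the Amr and Footprint stand-ins from the earlier sketch (the termination heuristic and footprint operations are simplified placeholders, not the actual implementation):

    using System.Collections.Generic;
    using System.Linq;

    public class TexSpace
    {
        public List<List<Amr>> Solutions { get; } = new();

        public void Search(List<Amr> unincorporated, Footprint focus,
                           List<Amr> partial)
        {
            // Step 1: consider the most promising AMRs first.
            var ordered = unincorporated
                .OrderByDescending(a => a.Score).ToList();
            foreach (var amr in ordered)
            {
                if (Solutions.Count >= 20) return; // placeholder termination
                var newPartial = new List<Amr>(partial) { amr };  // step (a)
                var newFocus = focus.Intersect(amr.Footprint);    // step (b)
                var remaining = ordered.Where(o =>                // step (c)
                        o != amr
                        && !SharesTokens(o, newPartial)
                        && o.Footprint.Overlaps(newFocus))
                    .ToList();
                if (remaining.Count == 0)
                    Solutions.Add(newPartial); // a viable interpretation
                else
                    Search(remaining, newFocus, newPartial);
            }
        }

        // Two AMRs are textually incompatible if their token spans overlap.
        private static bool SharesTokens(Amr a, IEnumerable<Amr> chosen) =>
            chosen.Any(c => a.Start <= c.End && c.Start <= a.End);
    }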

4. IMPLEMENTATION
Based on the techniques described in the previous section, we have implemented a fully functional location-finding service for several large cities. The system comprises over 20,000 lines of C# code.

A core part of the implementation is fast approximate spatial intersection of AMR footprints, which are pre-computed and stored on disk. We represent footprints by linear bintrees [6]. Each geometric primitive is represented by one or more 64-bit vectors. Each vector represents the path to a bintree cell, and the union of these cells represents an over-approximation of the geometry. Figure 6 shows the city of Redmond, WA (which occupies two discontiguous regions) with its bintree representation overlaid; each rectangle is represented by a single vector. These vectors are stored as contiguous arrays, in an order that supports union and intersection in linear time. We choose the degree of approximation depending on the number of entities that share the attribute, in order to keep a bound on the overall size of the footprint. In practice we keep up to 32 vectors per geometric primitive. Our implementation is extremely fast; consequently, spatial intersection operations typically take up a small fraction of the overall query time. We provide some performance statistics in the next section.

[Figure 6: Bintree representation of the city of Redmond, WA, overlaid on the city's two discontiguous regions.]
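To make the linear-time set operations concrete, the following is a minimal sketch (ours, under the assumption that each 64-bit vector encodes a root-to-cell bit path plus a depth, so that two cells overlap exactly when one path is a prefix of the other):

    using System.Collections.Generic;

    // A bintree cell identified by a root-to-cell bit path (the low
    // `Depth` bits of Path); depths 1..63 keep the shifts well-defined.
    public readonly record struct Cell(ulong Path, int Depth)
    {
        // The cell covers a contiguous interval of the 64-bit key space.
        public ulong First => Path << (64 - Depth);
        public ulong Last => First + ((1UL << (64 - Depth)) - 1);
    }

    public static class Bintree
    {
        // Intersect two cell arrays sorted by First. Bintree cells are
        // either disjoint or nested, so each overlapping pair contributes
        // its deeper (smaller) cell. Runs in linear time, as a merge.
        public static List<Cell> Intersect(IReadOnlyList<Cell> a,
                                           IReadOnlyList<Cell> b)
        {
            var result = new List<Cell>();
            int i = 0, j = 0;
            while (i < a.Count && j < b.Count)
            {
                if (a[i].Last < b[j].First) { i++; continue; } // a before b
                if (b[j].Last < a[i].First) { j++; continue; } // b before a
                result.Add(a[i].Depth >= b[j].Depth ? a[i] : b[j]);
                if (a[i].Last <= b[j].Last) i++; else j++;     // pop earliest end
            }
            return result;
        }
    }

A union over the same sorted representation admits a similar merge-style scan.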

Our Fuzzy Index is based on Microsoft® Fuzzy Lookup (FL) technology [3]. FL builds error-tolerant indexes based on n-grams of the indexed attributes, supports approximate queries with various forms of similarity measures, and can index over a million distinct attributes. We have also built an entity lookup index that supports looking up entities by a set of attributes, scoped to a particular region.

Our system is accessed via a web service, which we have integrated as a web mashup on top of the maps.live.com online mapping service. Our location finder returns and displays a region or set of regions corresponding to the best-ranked candidate location computed by our system. (We believe this is a more informative representation of the result for unstructured queries than the single point that most other location search providers return.) Results are visualized in the mashup as a translucent overlay, and can take the form of multiple point, line, and polygon overlays. A sample from the mashup is shown in Figure 7, illustrating our top-scoring result for the query "I-90 on Mercer Island".

[Figure 7: Search result for "I-90 on Mercer Island".]

For evaluating our system against commercial location search providers, we have built a simple plot-number evaluation scheme into our Result Ranker. This underscores the point made earlier that post-TEXSPACE ranking is the ideal place to perform domain-specific re-ranking and further refinement of the result, since it only has to be done for a small number of entities. The interpolation involves examining the entities in the results, looking for matches between un-parsed numbers in the query and the street-number ranges associated with these entities. If there is a match, the rank of the particular interpretation is raised, and the plot number is used to interpolate a more refined location. Even though we have used very generic interpolation mechanisms, our system performs well when compared with established location search providers, as we explain in the next section.
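The kind of refinement this enables can be sketched as follows (illustrative C#; the segment representation and the linear interpolation between its endpoints are our simplifications of the scheme described above):

    // Hypothetical street segment carrying a house-number range and the
    // coordinates of its two endpoints.
    public record Segment(int LowNumber, int HighNumber,
                          (double Lat, double Lon) A,
                          (double Lat, double Lon) B);

    public static class PlotInterpolation
    {
        // If an un-parsed number from the query falls within the segment's
        // range, estimate a refined location by linear interpolation along
        // the segment; otherwise return null.
        public static (double Lat, double Lon)? Interpolate(
            int plotNumber, Segment seg)
        {
            if (plotNumber < seg.LowNumber || plotNumber > seg.HighNumber)
                return null;
            double t = seg.HighNumber == seg.LowNumber
                ? 0.5
                : (plotNumber - seg.LowNumber) /
                  (double)(seg.HighNumber - seg.LowNumber);
            return (seg.A.Lat + t * (seg.B.Lat - seg.A.Lat),
                    seg.A.Lon + t * (seg.B.Lon - seg.A.Lon));
        }
    }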

5. EVALUATION
To test our system, we built a location search system for several cities: Greater Seattle, WA, USA; London, UK; and Bangalore, India. These cities are from countries with widely differing address formats. We evaluated our system by comparing against three popular online location search services (Google®, Yahoo!® and Microsoft®).

Sample text queries were generated from randomly chosen well-formed addresses as well as synthesized queries designed to emulate unstructured queries. Additional test queries were generated by mutating text queries to introduce various kinds of errors in a controlled manner, as explained in Section 5.2. We present comparative results for well-formed and unstructured queries, with and without errors, in Section 5.4, and some performance statistics of our system in Section 5.5.

5.1 Combined Location Search Service
Altogether, the vector data for the three cities comprised about 200,000 entities. The counts of the various types of entities are presented in Table 1. For the most part, linear features represent roads, polygonal features are postcode or locality boundaries, and points are landmarks.

Table 1: Entity count for the three cities

              Bangalore    London    Greater Seattle
  Points          5,536     3,727              2,694
  Polylines      43,611    78,677             64,031
  Polygons        1,054     1,207              1,432

We have built a test environment that performs a relative comparison of location search services by wrapping these services in a common programmatic interface, firing test queries, and comparing results against the ground truth.

5.2 Test Queries


Test queries consisted of two types: well-formed postal addresses and unstructured location descriptions. Both types were then subjected to systematic mutation to simulate spelling errors, conflicting terms, reordered terms, and extraneous words.

To generate the initial list of well-formed postal addresses, we started with a large corpus of postal addresses, obtained from large geocoder testing banks and programmatic trawling of the Internet, covering each postcode in each city equally to ensure uniformity in picking addresses. We then randomly selected 100 addresses from each city and verified the ground-truth locations of these addresses. The verification was done differently for each city, for reasons we explain below.

For Seattle, commercial geocoding services are of high quality. Hence, for the Greater Seattle area, the ground truth was computed by querying two of the commercial location search services, verifying that their results concurred to within 250 meters (differences of up to 200 meters between the geocoders were not uncommon), and taking the midpoint of the two results.

For London, commercial address geocoders often disagreed. London, however, has very fine-grained postcodes – over 250,000 postcodes defined at the block level. Therefore, for London, we defined the ground-truth location of an address to be the closest point on the named street or feature to the centroid of the six-character postcode.

For Bangalore, existing geocoders performed particularly poorly, even on well-formed postal addresses, so we manually identified ground truth.

For unstructured location queries, we synthesized 100 unstructured location descriptions by selecting combinations of overlapping features from the underlying vector data for each city. Since these descriptions were synthesized from specific spatial entities, the ground-truth locations could be computed exactly.

We also implemented software to introduce errors of various kinds in a controlled way. The errors are designed to simulate variations seen in actual query logs from users. The errors introduced are of the following kinds:
1. Spelling variations, produced by truncation, deleting repeated letters, and swapping vowel pairs in a single word.
2. Reordering of names and elimination of commas.
3. Ambiguity and conflict, introduced by replacing some names with their most similar-sounding name in the same city; for example, replacing "Edward Road" with "Edgerton Road".
4. Extraneous words, introduced into unstructured queries in between names to simulate informal input; for example, transforming "I-90, I-405" to "places near I-90 and I-405".

The degree of error introduced was one of the following:

Clean: the original well-formed postal address or unmodified synthesized unstructured location, with all names comma-separated.
Light: a single spelling error introduced in the entire address (not affecting postcodes), and 50% of the commas removed.
Moderate: reordering of terms and elimination of commas (error type 2 above), plus an additional two to three errors of any of the other types, chosen randomly and applied to distinct terms in the address.

As explained earlier, we selected 100 well-formed addresses and 100 unstructured addresses for each of the three cities, for a total of 600 clean queries. We additionally generated light and moderate cases from each clean query, for a total of 1800 test queries. Sample queries of both well-formed and unstructured types, with and without errors, can be found in Figure 1 in the introduction.

Since our geocoder is prepared for a few cities, while the commercial services are set up for the whole world, we qualify all queries with the proper city, state, and country name, for fairness. These qualifiers are always placed at their conventional position at the end of the query and are not subject to introduced errors. It should be noted that the results returned by our search system are similar even without these qualifiers.
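The error-introduction software can be pictured with a sketch like the following (ours; it implements simplified versions of error types 1 and 2 from the list above):

    using System;
    using System.Linq;

    public static class QueryMutator
    {
        private static readonly Random Rng = new(42); // fixed seed: reproducible tests
        private const string Vowels = "aeiou";

        // Error type 1: one spelling variation within a single word, by
        // truncation, deleting a repeated letter, or swapping a vowel pair.
        public static string MisspellWord(string word)
        {
            if (word.Length < 4) return word;
            switch (Rng.Next(3))
            {
                case 0:
                    return word[..^1]; // truncation
                case 1: // delete a repeated letter, if any
                    for (int i = 1; i < word.Length; i++)
                        if (word[i] == word[i - 1]) return word.Remove(i, 1);
                    return word;
                default: // swap an adjacent vowel pair, if any
                    for (int i = 1; i < word.Length; i++)
                        if (Vowels.Contains(char.ToLower(word[i])) &&
                            Vowels.Contains(char.ToLower(word[i - 1])))
                            return word[..(i - 1)] + word[i] + word[i - 1] +
                                   word[(i + 1)..];
                    return word;
            }
        }

        // Error type 2: reorder comma-separated names and drop the commas.
        public static string ReorderAndStripCommas(string query)
        {
            var parts = query.Split(',')
                             .Select(p => p.Trim())
                             .OrderBy(_ => Rng.Next())
                             .ToArray();
            return string.Join(" ", parts);
        }
    }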

5.3 Metrics for Comparison
We classify results from a location search service into the following categories:
1. "< 250m" – The result lies within 250m of ground truth. We interpret this as a correct return value.
2. "< 2.5km" – The result lies within 2.5km of ground truth. This is a near match, typically of some value to the user.
3. "< 10km" – The result lies within 10km of ground truth. This frequently indicates that some part of the address was correctly identified, but it is of dramatically less value to the user.
4. "> 10km" – The result lies more than 10km from ground truth. This typically indicates either a wrong result or a result that matches a lower-precision entity such as a city, large locality, or suburb. Results in this category are of little value.
5. No Results – no result returned.

In defining our metrics, we give the commercial services the maximum benefit of the doubt: if a service returns a point that appears different from the ground truth, we manually examine the point returned and give it credit if there appears to be an alternative valid interpretation (this can happen when errors are introduced). Also, if multiple options are returned by any service, we count it as a success if any one of the top three results is close to the ground truth.

5.4 Comparison with Commercial Services
We now show the results of the 1800 test queries. We present detailed results for all three commercial services as well as our system, categorized according to the classification introduced in the previous section. Figure 8 shows the results for the well-formed postal address queries for the Greater Seattle area, while Figure 9 shows the same for unstructured queries. Each bar shows the results for a particular location service: our system (ours) and the three commercial location services (LS1, LS2 and LS3). The relative percentages of answers that fall into each class of distance error are indicated by grey levels; the "> 10km" category has the lightest hue and no border.

We can see in Figure 8 that for clean addresses, our system fares reasonably well compared to the commercial services, identifying the location to within 250m 80% of the time and to within 2.5km in 89% of the cases. In all of these cases, however, we correctly identified the street segment containing the address. Note that LS1 and LS3 score 100% because we preselected addresses for which both services returned results within 200m.

As we move to the light and moderate classes of queries, the robustness of our system emerges. With light errors, where only a single error was applied to an address, we are on par with LS1 and perform considerably better than LS2 and LS3. If the query contains moderate errors, we considerably outperform all the commercial services, showing little degradation, while the results of the other services drop rapidly. We believe our use of spatially guided search enables us to explore far more possibilities than commercial systems, resulting in increased robustness.

[Figure 8: Postal address comparison, Greater Seattle – clean, light, and moderate query classes, shaded by distance-error category (< 250m, < 2.5km, < 10km, > 10km).]

[Figure 9: Unstructured address comparison, Greater Seattle – same format as Figure 8.]

When it comes to unstructured location descriptions, our performance is consistently better throughout, as shown in Figure 9. Here, commercial services perform poorly, succeeding primarily in cases where the randomly generated unstructured queries happen to be close to well-formed addresses. In particular, commercial services appear not to support street intersections and fare poorly when extraneous words such as "near" and "off" are added in between tokens. Our system performs very well and continues to perform well as mutations are introduced, producing results within 2.5km 80% of the time for light errors. For moderate errors, we find results within 2.5km 70% of the time, while the other services do so only 15% of the time, at best.

Figure 10 shows our performance on well-formed Bangalore addresses against the one location search provider that supports Bangalore. It is clear that this provider is very restrictive in what it accepts as addresses. While it has some tolerance to errors, it is largely unable to handle unstructured queries, as seen in Figure 11, and its performance drops dramatically with light to moderate errors. Our system, however, performs almost on par with the results from Greater Seattle.

[Figure 10: Postal address comparison, Bangalore – same format as Figure 8.]

[Figure 11: Unstructured address comparison, Bangalore – same format as Figure 8.]

Our results for London (not shown here due to space limitations) are similar, with all entities being found down to the street level for over 95% of both structured and unstructured queries, dropping to about 85% for queries with light errors and to 70% for queries with moderate errors. Our performance in comparison to commercial services is also similar to the Seattle-area results: commercial services' performance on London postal addresses degrades rapidly with the introduction of errors, and their performance on unstructured location queries is poor even in the clean case.

5.5 Performance Statistics


We present some statistics on the performance of our system, which is implemented on an Intel Pentium III 64-bit 2GHz machine with 8GB of RAM. The system indexed a total of 200,000 entities with about 90,000 distinct attribute names. Our Fuzzy Index size was about 4MB. Entity footprints (used for approximate intersection) took up less than 10MB. All indexes and repositories are persisted to disk, and segments are brought in on demand. Our average query lookup time is 170ms. We believe we can reduce this time considerably, because the majority of the time is taken up by the initial Fuzzy Lookup operations, which can be reduced by performing lazy evaluation of AMR entries. It is interesting to note that the time spent in the TEXSPACE algorithm itself, including spatial intersection operations, is less than 50ms. On average, under 200 interpretations are explored per query.

6. RELATED WORK
A recent survey of state-of-the-art geocoding practices [7] notes that there are very few geocoding services that geocode anything other than postal address data. The survey further states that many areas lack a consistent addressing scheme. Even parsing structured addresses is challenging because of the wide variety and variation in address formats [14, 16]. There has been recent work in reducing the need for manually created rules for address parsing. [18] describes information extraction from semi-structured text using machine-learning techniques, learning a discriminative context-free grammar from manually tagged training data. [5] describes a geocoding system with an address parser based on HMMs and a rule-based matching engine, applied to the Australian National Address File system. However, while these systems improve the flexibility of address parsing, the underlying limitation of using a fixed schema remains.

Plot-number-based interpolation schemes are used to estimate precise coordinates for an address and can be readily incorporated into our system. [2, 10, 15] evaluate the accuracy of existing interpolation systems in various contexts. [1] describes techniques to improve the accuracy of interpolation using sources from the web.


Another area of research that is receiving increasing interest is geographic web search, which constrains text-based Internet search by a geographic region specified in the user's query. Algorithms for geographic web search usually associate a set of geographic regions with every web page and with each query. Returned results are web pages that contain keywords from the query and for which the associated regions overlap with that of the query. [4] presents multiple geographic query processing techniques, while [19] presents a hybrid index structure for combined text and spatial data. These systems require geocoding location references embedded inside documents. [9], [11] and [13] discuss the problem of identifying location references in document text. We believe our location search system can be used to geocode the identified (and typically unstructured) geographical references in this context.

7. CONCLUSIONS
We present a new approach to location search from text queries that works well for well-formed addresses as well as unstructured queries, and that is particularly robust to errors in the query. We use this approach to build a location search system that works across multiple countries. We demonstrate that the system dramatically outperforms existing commercial geocoders on unstructured location queries and on queries with errors and conflicting terms. This is remarkable considering that we do not use any region-specific rules or training data.

Our work to date has focused on demonstrating the overall viability of our approach, and our evaluation has emphasized the accuracy and precision of location identification. Future work will address system scale. We believe a multilevel version of our search algorithm, TEXSPACE, can enable a robust location search system for the whole earth; this is a current focus of our research.

Finally, there are several promising avenues for extending this work. For example, we do not consider in this paper relational terms such as "near", "East of", etc. We believe these can be incorporated by considering (possibly multiple) suitably expanded spatial extents of entities, followed by post-search re-ranking. Furthermore, since our technique is quite general, it can potentially be applied to other domains, such as unstructured text queries over hierarchical data.

8. ACKNOWLEDGEMENTS
We would like to thank the Windows Live Local team at Microsoft Corporation for supporting us in this project and for providing detailed vector data for the cities of London and Seattle. We would also like to thank Spatial Data Inc., Bangalore, for providing the vector data for the city of Bangalore. We are very grateful to the Database, Mining and Exploration team at Microsoft Corporation for providing us with the Fuzzy Lookup technology. Finally, we would like to thank Professor Hanan Samet for his detailed comments on earlier drafts of this paper.

9. REFERENCES
[1] Bakshi, R., Knoblock, C.A. and Thakkar, S. Exploiting Online Sources to Accurately Geocode Addresses. In Proceedings of the 12th ACM International Symposium on Advances in Geographic Information Systems, Washington, DC, USA, November 2004, 194–203.
[2] Cayo, M.R. and Talbot, T.O. Positional error in automated geocoding of residential addresses. International Journal of Health Geographics, 2003, 2:10.

[3] Chaudhuri, S., Ganjam, K., Ganti, V. and Motwani, R. Robust and Efficient Fuzzy Match for Online Data Cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 2003.
[4] Chen, Y.Y., Suel, T. and Markowetz, A. Efficient Query Processing in Geographic Web Search Engines. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, 2006.
[5] Christen, P., Churches, T. and Willmore, A. A Probabilistic Geocoding System based on a National Address File. In Proceedings of the 3rd Australasian Data Mining Conference, Cairns, December 2004.
[6] Gargantini, I. An Effective Way to Represent Quadtrees. Communications of the ACM, 1982.
[7] Goldberg, D.W., Wilson, J.P. and Knoblock, C.A. From Text To Geographic Coordinates: The Current State of Geocoding. Urban and Regional Information Systems Association Journal, 2006.
[8] Jacox, E.H. and Samet, H. Spatial Join Techniques. ACM Transactions on Database Systems, Vol. 32, No. 1, Article 7, 2007.
[9] Kimler, M. Geo-Coding: Recognition of geographical references in unstructured text and their visualization. Diplomarbeit, Fachhochschule Hof, 2004.
[10] Krieger, N., Waterman, P., Lemieux, K., Zierler, S. and Hogan, J.W. On the wrong side of the tracts? Evaluating the accuracy of geocoding in public health research. American Journal of Public Health, Vol. 91, Issue 7, 2001.
[11] Leidner, J.L. Toponym Resolution in Text: "Which Sheffield is it?" In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.
[12] Nicoara, G. Exploring the Geocoding Process: A Municipal Case Study using Crime Data. Master's thesis, The University of Texas at Dallas, Dallas, TX, USA, 2005.
[13] Pouliquen, B., Steinberger, R., Ignat, C. and De Groeve, T. Geographical Information Recognition and Visualisation in Texts Written in Various Languages. In Proceedings of the 19th Annual ACM Symposium on Applied Computing, 2004.
[14] Rajagopalan, S. Spatial Data in Telematics: An Indian Experience. Conference cum Exposition on Telematics in Transportation, Chennai, September 2004.
[15] Ratcliffe, J.H. On the accuracy of TIGER-type geocoded address data in relation to cadastral and census areal units. International Journal of Geographic Information Sciences, 15(5), 2001.
[16] Rhind, G.R. Global Sourcebook of Address Data Management: A Guide to Address Formats and Data in 193 Countries. Gower Publishing Ltd, 2005.
[17] Trillium Software System®, Harte-Hanks Trillium Software, Billerica, MA 01821. http://www.trilliumsoftware.com
[18] Viola, P. and Narasimhan, M. Learning to Extract Information from Semistructured Text using a Discriminative Context Free Grammar. In Proceedings of ACM SIGIR, 330–337, 2005.
[19] Zhou, Y., Xie, X., Wang, C., Gong, Y. and Ma, W.Y. Hybrid index structures for location based web search. In CIKM '05: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, 2005.
