Exploiting the Internet As a Geospatial Database

Alexander Markowetz (Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, [email protected])
Thomas Brinkhoff (Institute for Applied Photogrammetry and Geoinformatics (IAPG), FH Oldenburg/Ostfriesland/Wilhelmshaven, [email protected])
Bernhard Seeger (Fachbereich Mathematik und Informatik, Philipps-Universität Marburg, [email protected])

1 Introduction
The World Wide Web is the largest collection of geospatial data. However, this tremendous resource remains almost unexploited. This observation holds for individuals accessing the WWW via their favorite search engine as well as for corporate users performing geospatial analyses. Searching for a particular location by typing its name (in combination with the key words a user is interested in) often retrieves unsatisfactory results. First, names of locations may be homonyms. Second, the name of the location might not appear in a potentially interesting page. Third, and worst of all, an interesting page may refer only to a location just outside the one specified. There is no way to express proximity or topological relationships. For the same reasons, geospatial analyses (e.g., in which regions is Audi more popular than BMW?) also fail. Due to this high potential, there is increasing interest in supporting geospatial information access to the WWW [1, 2, 3, 4, 5].

In this paper, we provide a brief overview of techniques for mapping web resources to locations. We outline an architecture for mapping URLs to geographic locations that applies multiple techniques and integrates and further processes their results. Based on this mapping, we design a geospatial search engine; such search engines differ fundamentally from their traditional counterparts. Finally, we consider geospatial analyses that use the WWW as a tremendous data source as another application of this mapping. The paper concludes with an overview of challenging research questions.
2 Mapping Web Resources to Locations
The mapping of web resources to locations should be performed in two steps:

1. Initial mapping to one or more locations.
2. Verification, integration, and adjustment of the mappings produced by different techniques.

For the first step, a whole range of techniques can be applied to assign initial locations to web resources (see also [5]). One of the most basic, yet powerful approaches simply processes the admin-c section of a URL's whois entry. In most cases, this section points directly to the company or individual who registered the domain, and for most companies that is exactly the place to which their information is relevant. Our evaluations have demonstrated the very high relevance of the admin-c section. Evaluating other parts of the whois entry often fails because they are concerned with the location of the web server; most small companies or individuals do not host their own server, but may co-host at a server farm hundreds of miles away from their shop or home.
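To make this concrete, the following minimal sketch (in Python, which we use for all sketches below) shows how the admin-c lookup might be automated. The field patterns and the use of the whois command-line client are assumptions; real whois output varies widely across registries, and a production mapper would add registry-specific parsers plus a geocoding step for the returned address.

    # Minimal sketch: pull the administrative contact's address out of a
    # whois entry. Output formats differ per registry, so the patterns
    # below are illustrative assumptions, not a complete parser.
    import re
    import subprocess

    def admin_address(domain: str) -> str | None:
        """Return the admin contact's address line from `whois domain`, if any."""
        raw = subprocess.run(["whois", domain], capture_output=True,
                             text=True, timeout=30).stdout
        patterns = [
            r"Admin(?:istrative Contact)?[^\n]*Address:\s*(.+)",
            r"admin-c:\s*(.+)",  # RIPE-style handle; needs a second lookup
        ]
        for pat in patterns:
            match = re.search(pat, raw, re.IGNORECASE)
            if match:
                return match.group(1).strip()
        return None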
Many authors propose adding geospatial meta information to web pages, denoting that the content of a page is relevant to a certain location. The location may be described following the proposals of the Dublin Core Metadata Initiative or the ISO/TC 211 standard 19115. The use of geospatial tags, however, is quite problematic. As long as no search engine relies on geospatial tags, there is no incentive for administrators to provide them, and vice versa. Even worse, webmasters cannot be trusted: they may maliciously include tags for regions to which their site bears no relevance.

Another range of techniques parses URLs as well as entire web pages to extract names of geographic features such as cities and landmarks, which can then be mapped to locations. Widespread names such as "Springfield", however, are impossible to map unambiguously. Similarly, geospatial codes such as zip or dialing codes can be extracted. Such semantic analyses, however, are laborious and often of limited certainty. (A toy sketch of this extraction step is given at the end of this section.)

The discussion demonstrates that any single one of the presented techniques mostly allows only a preliminary mapping to one or more locations. We therefore propose integrating the results of multiple mappings and applying specific verification techniques. As a simple, yet again very powerful approach, we propose using the web's link structure. If, for example, a whole cluster of pages situated in NY points to a web site that we had so far assumed to be in LA, while only few links originate from that area, we may conclude that the site is more relevant to NY than to LA (see the second sketch at the end of this section). Finding such clusters and detecting outliers is a task to which data-mining techniques still need to be adapted.

Additionally, the locations of users accessing a web resource can be used to verify its location. In the near future, the widespread use of mobile web clients can be expected, which know their position from GPS or Galileo or from the current cell of a mobile phone network. It is then reasonable to assume a strong relation between the locations of web resources and those of their users; this relation can be evaluated by analyzing the corresponding clickstreams.

Access to the mapping of web resources to locations can be granted as a geospatial web service. This allows the mapping to be incorporated into more specific and elaborate applications.
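As a first illustration, the extraction techniques can be sketched as a scan of page text against a gazetteer and a zip-code pattern. The two-entry gazetteer and the German five-digit zip pattern are illustrative assumptions only; a real system needs a full gazetteer and must disambiguate homonyms such as "Springfield".

    # Toy sketch: collect location candidates from page text.
    import re

    TOY_GAZETTEER = {            # name -> (latitude, longitude); illustrative only
        "Marburg":   (50.81, 8.77),
        "Oldenburg": (53.14, 8.21),
    }
    ZIP_PATTERN = re.compile(r"\b\d{5}\b")   # German five-digit postal codes

    def extract_candidates(text: str):
        """Return (place-name coordinates, raw zip codes) found in the text."""
        places = {name: coord for name, coord in TOY_GAZETTEER.items()
                  if name in text}
        zips = set(ZIP_PATTERN.findall(text))  # still needs a zip->coordinate table
        return places, zips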
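The link-structure verification can likewise be sketched as a simple majority vote over the assumed locations of in-linking pages. The inputs (in_links, locations) are assumed outputs of the initial mapping step, and the two-thirds threshold is an arbitrary placeholder for a proper cluster- and outlier-detection method.

    # Sketch: verify a page's assumed location against its in-links.
    from collections import Counter

    def verify_location(url: str,
                        assumed: str,
                        in_links: dict[str, list[str]],
                        locations: dict[str, str],
                        majority: float = 2 / 3) -> str:
        """Return the assumed location, or the in-link majority if it disagrees."""
        votes = Counter(locations[src] for src in in_links.get(url, ())
                        if src in locations)
        if not votes:
            return assumed
        top, count = votes.most_common(1)[0]
        if top != assumed and count / sum(votes.values()) >= majority:
            return top      # e.g. an 'LA' site whose in-links cluster in 'NY'
        return assumed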
3 Geospatial Search Engines
One important application of the mapping of web resources to locations is geospatial search engines. In addition to the field for key words, the interface provides a field for the user's current location. In the case of a mobile client, the current position can be passed to the search engine automatically. The search engine then returns first those results that are not only relevant to the key words, but also within close distance of the user's location. Such queries are of highest interest to users, but are only poorly supported by ordinary search engines. For example, ordinary engines support searches for "Marburg AND Cycling", but entirely ignore cycling activities taking place just outside the city limits.

Geospatial search engines differ fundamentally from their traditional counterparts. The final order in which results are presented depends not only on one criterion (relevance) but also on a second (distance). The balance between these two criteria is crucial for delivering useful results. Depending on the key words, one criterion may be more important than the other. For example, when looking for a restaurant, proximity is of much higher importance than when looking for a car dealership. Depending on the first batch of results delivered, the user might even want to re-adjust the balance. We present several prototype interfaces that support a dynamic re-adjustment of the balance between the two criteria. Additionally, we show how a search engine's indices need to be adapted in order to perform such queries efficiently.
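A minimal sketch of such a combined ranking follows, assuming both relevance and proximity have been normalized to [0, 1]; how the engine computes them is outside the sketch. The parameter alpha is the balance the user may re-adjust.

    # Sketch: rank results by a weighted mix of relevance and proximity.
    def geo_score(relevance: float, distance_km: float,
                  alpha: float = 0.5, max_km: float = 100.0) -> float:
        """Higher is better; alpha=1 ignores distance, alpha=0 ignores text."""
        proximity = max(0.0, 1.0 - distance_km / max_km)
        return alpha * relevance + (1 - alpha) * proximity

    # A restaurant query might use a proximity-heavy alpha = 0.3, a car
    # dealership query alpha = 0.7. A slider in the UI can change alpha and
    # re-rank the current batch without re-running the query.
    results = [("cycling-club.example", 0.9, 12.0),   # (url, relevance, km)
               ("bike-shop.example",    0.7,  2.0)]
    ranked = sorted(results, key=lambda r: geo_score(r[1], r[2], alpha=0.4),
                    reverse=True)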
4 Geospatial Analyses
A typical application of Geographic Information Systems is geospatial analysis. An example is a retail chain interested in the best location for opening its next outlet. Traditionally, extensive studies about the acceptance of the company in that area or about the nearest competitors have to be performed. Other analyses may try to find relations between geographic areas and retail prices. Acquiring such geospatial data from surveys is expensive and, for larger regions, often impossible. Considering the WWW as a large geospatial data source, however, most of the required data is available for free. One can compare the number of web sites in a region dedicated to one's competitors with the number dedicated to one's own company, or measure the impact of local sponsoring by monitoring the number of fan sites over a period of time. Such data can additionally be augmented with data from traditional sources. One of our research goals is to build independent data marts from web data (and other sources) that enable people to make location-aware decisions.
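Assuming the Section 2 mapping is available as a URL-to-region table, the Audi-versus-BMW comparison from the introduction can be sketched as a simple aggregation. The example domains and the keyword-based brand_of classifier are purely hypothetical stand-ins for real content analysis.

    # Sketch: compare regional web presence of two brands at site granularity.
    from collections import Counter

    site_locations = {                 # assumed output of the URL->location mapping
        "audi-fans-marburg.example": "Marburg",
        "bmw-club-marburg.example":  "Marburg",
        "audi-treff-kassel.example": "Kassel",
    }

    def brand_of(url: str) -> str | None:
        for brand in ("audi", "bmw"):
            if brand in url:
                return brand
        return None

    presence = Counter((region, brand_of(url))
                       for url, region in site_locations.items() if brand_of(url))
    # presence[("Marburg", "audi")] vs. presence[("Marburg", "bmw")] answers
    # "in which regions is Audi more popular than BMW?" per web-site counts.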
5 Conclusions
We have presented a two-step mapping from web resources to geographic locations that serves as a foundation for two powerful applications. The main contribution of this paper is the broad range of research topics it opens up to the community. Scalability of searches is a major topic, since search engines already operate at the limit of feasibility; indices need to be adapted to perform dynamic re-balancing of priorities at little extra cost, and data-mining techniques need to be adapted in a similar way. Scalability also concerns the second step of the mapping: since we consider the web as a graph, the problems can become prohibitively expensive, so there is a tradeoff between accuracy and computational cost. Due to the characteristics of the WWW, however, some inaccuracy is inevitable and will propagate into our data marts. Accuracy can be improved by developing ontologies for geographic data, a research area of its own.
References

[1] O. Buyukkokten et al.: Exploiting Geographical Location Information of Web Pages. WebDB 1999.
[2] E. Daniel: Geographic Search. Winner of the 2002 Google Programming Contest, http://www.google.com/programming-contest/winner.html.
[3] J. Ding, L. Gravano, and N. Shivakumar: Computing Geographical Scopes of Web Resources. VLDB 2000.
[4] C.B. Jones et al.: Spatial Information Retrieval and Geographic Ontologies: An Overview of the SPIRIT Project. SIGIR 2002.
[5] K.S. McCurley: Geospatial Mapping and Navigation of the Web. WWW10, 2001.