Mapping Geographic Coverage of the Web

Robert Pasley and Paul Clough
University of Sheffield, Dept. of Information Studies, Sheffield, S1 4DP, UK

Ross S. Purves
University of Zurich, Department of Geography, Switzerland

Florian A. Twaroch
Cardiff University, School of Computer Science, Cardiff, UK
ABSTRACT

In this paper, we describe a methodology to estimate the geographic coverage of the web without the need for secondary knowledge or complex geo-tagging. This is achieved by randomly selecting toponyms from the Ordnance Survey 50K gazetteer to create search queries and thus gather document counts from various web sources for Great Britain. The same gazetteer is then used to geo-code the results and enable mapping. To validate our approach, and to demonstrate the effects of geo/non-geo and geo/geo ambiguity, we mapped the selected toponyms to Geograph, a community project that contains user-generated geo-tagged photographs of the UK. Although success varies with resolution, the proposed approach is sufficient to be used reliably by applications exploring the geographic coverage of the web in cases where references to settlements are likely to be common. In our case, we applied the method to produce maps of web coverage for a range of sources at a resolution of 30km.

Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – search process, retrieval models, information filtering.

General Terms Experimentation, Human Factors.

Keywords Web coverage, Geographic Information Retrieval, geographic references, gazetteers.

1. INTRODUCTION

The web is a vast, dynamic repository of information capturing many aspects of human life [6, 20], and it is rapidly increasing in size [11]. This information comes from various sources, including those authored in a more formal way (e.g. news sites) and those written collaboratively and less formally (e.g. blogs, wikis, social networking sites and volunteered geographic information [10]). As a source of information, the web is becoming important in its own right and is being used for applications such as the study of linguistics [17, 24], business intelligence [31], and the creation of translation resources [27]. Often the information found on the web relates to location, and such information is assumed to be fairly accurate and up to date [13]. Markowetz, Brinkhoff and Seeger [21] state that "the World Wide Web is the largest collection of geospatial data; a resource that goes almost unexploited." However, the suggestion that all information is available on the web is questionable. In particular, coverage may not be representative across space for economic, cultural, technical or other reasons. Therefore, to use and exploit the web as a source for geographical purposes, one first needs to explore its geographical coverage.

This paper presents an approach to generate data from web resources which we use to map the geographic coverage of the web for Great Britain. Cimiano and Staab [7] acquire collective knowledge by using a web search engine, and we follow the same approach to gather counts of the number of documents returned for selected toponyms from various web resources (web counts). One of the major limitations of such an approach is that the geo-spatial component of web content is not demarcated in any way: placenames can be ambiguous, referring to multiple locations or having non-geographic senses, and as a result geographic coverage may be over-estimated. However, we postulated that such an approach would offer a repeatable and robust technique that does not require secondary knowledge or complex geo-tagging to map web coverage. To the best of our knowledge, this is the first attempt at mapping large-scale geographical coverage of the web based on content, rather than visualizing the infrastructure of the Internet [8, 34]. Specifically, we address three research questions in this paper:

1. What resources are necessary to map geographical coverage of the web?

2. What are the effects of geo/non-geo and geo/geo ambiguity on mapping web coverage?

3. What patterns do different web resources exhibit?

§2 discusses relevant past work on using geographical information from web pages and calculating statistics concerning coverage of the web, §3 describes our methodology for estimating coverage, §4 presents the results of our study, §5 discusses the findings, and §6 concludes the paper and identifies areas for future work.

2. RELATED WORK

Increasingly, Geographic Information Retrieval (GIR) is concerning itself with using the geographic component of web pages and queries to find information relevant in certain spatial contexts. For example, Backstrom et al. [2] uncover the spatial variation in search engine queries (e.g. the spatial distribution of US baseball teams) by analyzing Yahoo! search query logs. Sanderson and Kohler [28] found that approximately 18% of web queries (from query logs of Excite from 2001) contain some kind of geographic component. Commonly, a GIR system works by simply treating placenames in the page content as text and allowing searches through an index as for any other term [9], but it is also possible to index documents both thematically and spatially. A key tool in building and searching such indexed documents is the geographic ontology, which allows items to be retrieved using places spatially or ontologically close to the query without requiring a specific placename to be mentioned [14]. Such ontologies are typically derived from gazetteers of administrative placenames. These placename lists are useful for defining the scope of the various documents to be indexed, and also for providing query terms to query that index (Himmelstein [13]). Vernacular placenames and regions, which do not tend to exist within the available resources, are an important reason why geographic searches fail [12].

Extracting geospatial references from documents involves two main tasks (collectively referred to as geo-tagging): identifying places and geographic regions (geo-references) and assigning them spatial coordinates. These are commonly referred to as geo-parsing and geo-coding respectively [18, 22]. Geo-tagging is challenging because of two major types of ambiguity exhibited by geo-references: geo/non-geo and geo/geo ambiguity [1, 30]. The first, geo/non-geo or referent class ambiguity, occurs when a name might be used to refer to a non-geographic entity (e.g. the name of a person, organization or object). The second, geo/geo or referent ambiguity, occurs when a geo-reference has multiple locations (e.g. "Sheffield, UK" versus "Sheffield, Alabama"). To resolve geo/non-geo ambiguity, methods from Named-Entity Recognition (NER) are commonly used. Most NER algorithms combine lists of known locations, organisations and people with rules which capture elements of the surrounding context. Although simple list lookup for placenames can perform well [23], the approach fails when list entries are not used in a geographic sense (e.g. names of people, businesses or common language use) or when variants of names are used (e.g. historical or vernacular forms). The simplest (and often most effective) method to resolve geo/geo ambiguity is to assign ambiguous places a default sense (or position). This can be decided by, for example, the most commonly occurring place [30], the population of the place [26] or semi-automatic extraction from the web [19].

In order to address the need for representations of vernacular regions and neighborhoods in geographic ontologies, attempts have been made to extract these by data mining. Typically this is done by automatically constructing queries to a search engine in order to gather data. Returned pages are geo-tagged, so that any geographic references therein have coordinates associated with them. These coordinates can then be used in some way to define a boundary for the region. Purves et al. [25] experimented with techniques for defining vernacular regions, such as "Mid Wales" or "the Highlands", that do not exist in administrative geography. Their findings showed that when defining the Highlands (a large area in the north of Scotland) using Kernel Density Estimation (KDE), Inverness (a relatively large city) returned many more points than the surrounding rural area and thus skewed the results. Other work in this area, by Schockaert and De Cock [29], defines urban city neighborhoods, such as "Belltown, Seattle". Web counts have also been used for applications such as gathering corpus statistics [16, 17] and verifying linguistic constructs for translation [27]. Tezuka and Tanaka [32] use document counts to decide which of various landmarks in a resource is the most conspicuous (effectively disambiguation).

Techniques such as Information Extraction and Question Answering make use of the wealth of information published explicitly and implicitly by various authors, including authorities, companies, and the general public, to provide snippets of relevant information. The web is therefore seen by many as a source of up-to-date and relevant information (e.g. Himmelstein [13]). However, to date most studies appear to treat the web as geographically neutral; that is, they assume that the results of any experiments based on mining the web will not be biased by variations in data density as a function of some other underlying variable. In this paper we wish to develop a technique which allows us to explore how geographic coverage varies in space, and thus to consider the implications of coverage (and its dependencies) for projects which data mine the web.

3. METHODOLOGY

To enable us to estimate the geographical coverage (and density) of the web in Great Britain, we propose a simple methodology to first gather (§3.1) and then map (§3.2) counts of the number of documents retrieved for a set of toponyms. To validate our approach, we compared results against data from Geograph (§3.3).

3.1 Gathering Counts

Web counts were gathered by constructing queries based on toponyms from the 1:50,000 scale gazetteer (OS50K) available from the Ordnance Survey (OS), Great Britain's national mapping agency (Great Britain is made up of Scotland, England and Wales; Northern Ireland has its own national mapping agency). This is a gazetteer of cities, towns, villages and landmarks derived from OS 1:50,000 scale maps. The gazetteer contains over 250,000 entries, of which 42,531 are settlements (identified in the OS50K by the feature type codes 'c', 'o' and 't'), and of these 35,184 are distinct names (indicating that 25% represent geo/geo ambiguity in this gazetteer). The gazetteer does not contain vernacular placenames. The text of the placenames within the resource needed a certain amount of simple normalization: the OS50K contains items such as "Cardiff/Caerdydd", which were split using pattern matching into two entries, "Cardiff" and "Caerdydd". Since web counts can only be collected at a rate governed by the site used (e.g. Yahoo! sets a limit of 5,000 requests per day), to make data collection and analysis feasible within practical timescales we selected a random sample of 6,374 (18%) names from the list of distinct settlement toponyms. This figure was selected based on roughly the amount of data we could collect in 1-2 days (per source), and we find this number of names to be sufficient for our investigation (see §4 and §5). Figure 1 shows the coverage of the selected 6,374 toponyms for Great Britain (corresponding to 8,059 settlement locations, because some names have multiple footprints). A sketch of this normalization and sampling step is given below.
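To make this preprocessing concrete, the sketch below (Python) splits bilingual entries and draws a random sample of distinct settlement names. The file name, column names and CSV layout are illustrative assumptions; only the feature type codes and the sample size come from the description above.

    import csv
    import random

    SETTLEMENT_CODES = {"c", "o", "t"}  # settlement feature type codes in the OS50K

    def load_settlement_names(path):
        """Return the set of distinct, normalized settlement names from the gazetteer."""
        names = set()
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):            # assumed columns: 'name', 'f_code'
                if row["f_code"] not in SETTLEMENT_CODES:
                    continue
                # Split bilingual entries such as "Cardiff/Caerdydd" into separate names.
                for name in row["name"].split("/"):
                    name = name.strip()
                    if name:
                        names.add(name)
        return names

    if __name__ == "__main__":
        distinct = load_settlement_names("os50k_gazetteer.csv")  # hypothetical file name
        random.seed(42)                                          # fixed seed for repeatability
        sample = random.sample(sorted(distinct), k=min(6374, len(distinct)))
        print(f"{len(distinct)} distinct settlement names, sampled {len(sample)}")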

Table 1. Median number of documents per toponym (non-zero) and number of zero counts for five web sources.

Source            Median (non-zero)    Zero Counts
Yahoo             10,300               53
yahoo+wiki        47                   445
yahoo+gumtree     23                   3890
yahoo+bebo.com    17                   2663
yahoo+facebook    9                    4035

Table 1 summarises the document counts collected from the five web sources: the "median (non-zero)" column shows the median number of documents per query (for non-zero counts) and the "zero counts" column gives the number of toponyms returning zero hits. As one might expect from a large general-purpose search engine, Yahoo! returns the largest number of documents per query and has the fewest zero-hit queries.
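For reference, the summary statistics in Table 1 can be reproduced from the raw counts along the following lines; counts_by_toponym stands for the toponym-to-hit-count mapping gathered in §3.1 and is an assumed variable name.

    from statistics import median

    def summarize_counts(counts_by_toponym):
        """Median of the non-zero document counts and the number of zero-hit toponyms."""
        non_zero = [c for c in counts_by_toponym.values() if c > 0]
        zero_hits = sum(1 for c in counts_by_toponym.values() if c == 0)
        return (median(non_zero) if non_zero else 0), zero_hits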

Figure 1. Toponym coverage of Great Britain for 8,059 distinct settlement locations (6,374 toponyms).

Web counts were collected at a rate of 4,500 queries per day (below the 5,000 limit) from the commonly used search engine Yahoo! (http://www.yahoo.com). This provides access via an Application Programming Interface (API), thus avoiding the need to collect web counts through a user interface (web scraping). Using the "site:" restriction syntax of Yahoo!, we were able to collect counts from various web resources, including Wikipedia (http://en.wikipedia.org) and the social networking sites Gumtree (http://www.gumtree.com/), Facebook (http://www.facebook.com) and Bebo (http://www.bebo.com). For example, '"Bredbury+Green" site:wikipedia.com' restricts the query "Bredbury Green" (where the word "Green" must follow the word "Bredbury") to the index held by Yahoo! for the site wikipedia.com. The Yahoo! site restriction functionality was used, rather than querying each site directly, to enable a fairer comparison across sites. We also used query syntax to gather only pages indexed by Yahoo! as being authored in the UK (ignoring any classification error that may exist). It should be noted that UK pages can contain geographic references to other countries and need not be in English.
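A sketch of this count-gathering step is given below. The endpoint, credential and response field names are placeholders (the Yahoo! API used in the paper is not reproduced here), so this only illustrates the pattern of issuing one site-restricted phrase query per toponym, reading the reported total hit count, and throttling requests to stay under a daily quota.

    import time
    import requests

    SEARCH_ENDPOINT = "https://api.example-search.com/websearch"  # placeholder, not the real Yahoo! endpoint
    APP_ID = "YOUR_APP_ID"                                        # placeholder credential
    SECONDS_BETWEEN_QUERIES = 20                                  # ~4,300 queries/day, below a 5,000/day quota

    def build_query(toponym, site=None):
        """Phrase-quote the toponym and optionally restrict it to a single domain via 'site:'."""
        query = f'"{toponym}"'
        if site:
            query += f" site:{site}"
        return query

    def hit_count(toponym, site=None):
        """Return the number of documents the search engine reports for the query."""
        params = {"q": build_query(toponym, site), "appid": APP_ID, "results": 0}
        response = requests.get(SEARCH_ENDPOINT, params=params, timeout=30)
        response.raise_for_status()
        # The field holding the total result count is an assumption; real APIs vary.
        return int(response.json().get("totalresults", 0))

    def gather_counts(toponyms, site=None):
        """Collect a hit count per toponym, throttled to respect the daily request limit."""
        counts = {}
        for name in toponyms:
            counts[name] = hit_count(name, site)
            time.sleep(SECONDS_BETWEEN_QUERIES)
        return counts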

3.2 Visualizing Counts

Once document counts representing coverage had been collected, each toponym from the selected list of distinct names from the OS50K was mapped to a coordinate in order to display coverage. In the case of ambiguous toponyms (670, or 11%), we mapped the name to all possible coordinates, the simplest (and most naïve) approach. Since the counts for these places are analyzed spatially, this ambiguity leads to over-representation of some places: although they share their name with another place, they are unlikely to share the same web count. For example, the small hamlet of Sheffield in Cornwall is assigned the same web count as Sheffield, South Yorkshire, which has a population of about 500,000. There are also places that coincide with words in common English usage, such as "Bath" and "Rugby" (geo/non-geo ambiguity). The effects of both geo/geo and geo/non-geo ambiguity on our approach are shown in §4 and discussed in §5.

To reduce the effects of geo/non-geo ambiguity, we filtered out toponyms which are also commonly used words in British English. To do this, we created a "stopword" list using population counts from Geonames.org (http://geonames.org) and term frequency counts from the British National Corpus (BNC, http://www.natcorp.ox.ac.uk/). Each toponym was then given a score based on its ranking in each resource, computed according to the following formula: [(Geonames rank - BNC rank) / BNC rank] / Geonames rank. Upon manual inspection it was found that a positive score indicated a suitable candidate stopword, and this suggested 342 toponyms to ignore during visualization. The top 10 stop words are: 'Old', 'Began', 'Church', 'Twenty', 'Field', 'Green', 'Letter', 'Sea', 'Mark' and 'Park'.

To compare document counts for Gumtree, Bebo, Facebook and Wikipedia, we used Yahoo! as a baseline. Finally, we compare the results of mapping document counts with population counts derived from the 2001 census (General Register Office for Scotland, Office for National Statistics [4, 5]) to investigate whether online coverage coincides with areas of human habitation (and activity). We calculated Pearson's correlation scores to compare maps of coverage at different resolutions, where a value of r=1.0 implies perfect correlation between two maps, and a value of r=0.0 implies no correlation. Cells with no counts assigned in either map were not considered in the calculation of correlation.
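The stopword scoring described above can be sketched as follows. The two input dictionaries (Geonames population counts and BNC term frequencies) are assumed to have been prepared separately, and the ranking direction (rank 1 = most populous place, rank 1 = most frequent word) is an assumption not stated explicitly in the text; names with a positive score are treated as stopword candidates.

    def rank(values_by_name):
        """Map each name to its 1-based rank when sorted by value, largest first."""
        ordered = sorted(values_by_name, key=values_by_name.get, reverse=True)
        return {name: i + 1 for i, name in enumerate(ordered)}

    def stopword_scores(population_by_name, bnc_frequency_by_name):
        """Score = [(Geonames rank - BNC rank) / BNC rank] / Geonames rank.
        A positive score suggests the name is more prominent as a common word
        than its population would warrant, i.e. a stopword candidate."""
        geo_rank = rank(population_by_name)       # rank 1 = most populous place (assumed)
        bnc_rank = rank(bnc_frequency_by_name)    # rank 1 = most frequent word (assumed)
        scores = {}
        for name in geo_rank.keys() & bnc_rank.keys():
            g, b = geo_rank[name], bnc_rank[name]
            scores[name] = ((g - b) / b) / g
        return scores

    # Toy usage: "Green" is far more frequent as a word than its population suggests.
    scores = stopword_scores(
        {"Sheffield": 500_000, "Bath": 88_000, "Green": 300},
        {"Green": 25_000, "Bath": 1_500, "Sheffield": 900},
    )
    candidates = [name for name, score in scores.items() if score > 0]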

3.3 Validating the Approach

In order to investigate the effects of ambiguity on the proposed methodology, Geograph (http://geograph.co.uk, accessed 18/06/08) was used as a source of pre-processed geographical information. Geograph is an initiative attempting to photograph the whole of the UK by allowing users to upload photographs and descriptions to a central web server. Geograph offers a good baseline because every image is mapped to a resolution of at least 1km and has a textual description which commonly contains toponyms describing the image. All images are therefore effectively independently geo-coded and can be used to assess the performance of our method. Two different resolutions were compared: 5 and 30km. Images were assigned to grid squares using the coordinates from the Geograph data, with each grid square weighted by the number of images it contains (i.e. if 5 images were associated with a grid square, that grid square was given a weight of 5). To validate our approach, we compared the following maps (see §4; a sketch of the gridding and comparison is given after this list):

• Baseline: maps counts of all Geograph images to grid squares, and so provides a baseline for all other methods.

• T1: uses the randomly selected toponym list and, on finding instances of these in image descriptions, maps them to the coordinates associated with the image in Geograph (i.e. shows the visualization possible if georeferencing were very good, since the coordinates in Geograph are used for the toponyms).

• T2: uses the basic approach of identifying toponyms and mapping them to their locations using all coordinates from the OS50K.

• T3: the same as T2 except that toponyms identified as stop words, as described in §3.2, are removed.
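A minimal sketch of the naïve all-senses mapping used for T2 and T3, and of the grid comparison used throughout §4, follows. The grid is indexed by simple integer cell coordinates derived from OS National Grid eastings and northings, and cells with no count in either map are excluded from the correlation, as described in §3.2; the data structures are illustrative assumptions.

    from collections import defaultdict
    from math import sqrt

    def grid_cell(easting, northing, cell_size_m=30_000):
        """Map an OS National Grid coordinate (metres) to a (column, row) cell index."""
        return int(easting // cell_size_m), int(northing // cell_size_m)

    def map_counts(counts_by_name, coords_by_name, stopwords=frozenset(), cell_size_m=30_000):
        """Naive all-senses mapping (T2): add a toponym's count to every cell containing
        one of its gazetteer coordinates; pass the stopword set for T3."""
        grid = defaultdict(float)
        for name, count in counts_by_name.items():
            if name in stopwords:
                continue
            for easting, northing in coords_by_name.get(name, []):
                grid[grid_cell(easting, northing, cell_size_m)] += count
        return grid

    def pearson(grid_a, grid_b):
        """Pearson's r over cells holding a count in at least one of the two grids."""
        cells = set(grid_a) | set(grid_b)
        xs = [grid_a.get(c, 0.0) for c in cells]
        ys = [grid_b.get(c, 0.0) for c in cells]
        n = len(cells)
        if n == 0:
            return float("nan")
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
        sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
        return cov / (sd_x * sd_y) if sd_x and sd_y else float("nan")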

4. RESULTS AND INTERPRETATION

4.1 Validating the Methodology

The methodology applied in this paper was deliberately simple, using a small sample of toponyms from a gazetteer and a naïve all-senses approach to dealing with geographic ambiguity in placenames. A simple stopword list was developed to filter out toponyms which occur mostly in non-geographic senses. Figure 2 illustrates relative document counts associated with the Geograph dataset at resolutions of 5 and 30km for the four cases described in §3.3. The baseline maps effectively show the actual coverage of Geograph, aggregated to 5 and 30km respectively. Several observations can be made with respect to these baseline representations. Firstly, the 30km resolution baseline appears to cover the whole of Great Britain, with considerable loss of detail through, for example, the merging of islands with the mainland, as is visible by comparison with the 5km baseline.


The 5km baseline also shows almost complete coverage, though some areas in the northern Highlands of Scotland have sparse or no coverage.

T1 illustrates the ability of our toponym list to replicate the baseline, given a technique which correctly, if naively, geo-codes individual images. Here, any mention of a toponym in association with a picture is assigned to the grid square in which the picture itself was geo-located within Geograph. In turn, this means that this method is geographically unambiguous, but that geo/non-geo ambiguity may still cause counts to be falsely assigned to grid squares. The correlation coefficients for both the 5 and 30km resolutions are very high (0.94 and 0.97 respectively), demonstrating firstly that the sample of placenames extracted from the gazetteer is sufficient to replicate most features of the web coverage of Geograph, and secondly that geo/non-geo ambiguity is evenly spread across Great Britain. If this were not the case, we would expect lower correlations, as particular locations would be preferentially assigned false counts. A particularly interesting result here relates to the 5km resolution. Coverage is still very dense, with counts being assigned to squares which lie far from toponyms (Figure 1). This suggests that individual toponyms at the settlement level have a zone of influence which increases in size as toponym density decreases, for example in the Highlands of Scotland. Put differently, it appears that images in the Highlands which lie far from a toponym are still mentioned in conjunction with the nearest toponym, suggesting that this subset of toponyms is sufficient to identify documents which represent the overall coverage of Geograph well.

By contrast, T2 shows the considerable degradation in performance for the first, naïve application of our method. Here we assign counts to the grid squares in which the toponyms themselves are found. For both the 5 and 30km resolutions the correlations with the baseline are now weak (0.04 and 0.35 respectively), and in the case of the 5km resolution the implications of the sparse distribution of toponyms in the Highlands of Scotland become clear. Comparing this figure with Figure 1, it is clear (and logical) that locations without toponyms cannot be assigned counts, and thus cannot be mapped. In turn, it is also clear that it is not possible to map web coverage at a resolution of 5km using this subsample of 6,374 toponyms.

T2 is subject to both geo/geo and geo/non-geo ambiguity. In a first attempt to establish the effects of geo/non-geo ambiguity we created a stopword list as detailed in §3.2. T3 shows the effects of applying this stop list at 5 and 30km. In both cases, the correlations improve considerably, to 0.31 and 0.66 respectively. Although there are still gaps in the coverage at 30km, these are relatively isolated and the overall pattern correlates reasonably with that of the baseline. Despite the improvement in correlation for the 5km grid, the fundamental problem of lack of toponym coverage means that this resolution was discarded in the following experiments. However, the correlation for the stopworded all-senses method is sufficiently good at 30km resolution for this approach to be applied in mapping the web coverage of collections where we have no validation data.

Figure 2. Web coverage of Geograph (Baseline) and document counts using toponyms shown in Figure 1 for “perfect” geocoding using image coordinates (T1), naïve all senses technique (T2) and naïve all senses technique with stop words (T3) for 5km and 30km resolutions.

4.2 Mapping Web Coverage

Having identified a suitable set of parameters for the application of our technique, we then set out to map the web coverage of Great Britain based on the toponym counts associated with the Yahoo! search engine. Figure 3 shows the resulting coverage when counts were mapped for the toponym list (excluding stopwords) using the Yahoo! search engine API. There are some holes in the coverage, notably in the Highlands of Scotland, but since these appear to be in similar locations to those identified in T2 and T3 in Figure 2, they are most likely an artifact of toponym density. Overall, coverage is greatest in areas with higher populations, namely the south and Midlands of England, and indeed the Yahoo! counts are correlated with population (r=0.61). However, the map of counts is in general less spatially autocorrelated, since individual toponyms are always assigned to a single grid square in our approach. By contrast, the map of population shows concentrations of population spreading over larger areas, most prominently in the case of London in the southeast of England.

Figure 3. Yahoo! web counts for Great Britain and population mapped at a resolution of 30km. Source: 2001 Census: Standard Area Statistics (England and Wales) and Standard Area Statistics (Scotland).

This effect highlights an important weakness of our approach: since we map toponym counts and not documents, we cannot easily explore relationships between toponyms and, for instance, filter out placenames of coarser granularity from documents where information at a finer granularity is available. Thus, a document containing the terms London and Westminster (a district of London) will be mapped to both toponyms, but since London probably co-occurs with many district names, the cell containing the London toponym will accumulate many more counts.

4.3 Comparing Web Coverage

In our last experiment, we collected web counts for our toponym list for Yahoo! and four other web domains restricted by searches through the Yahoo! API. The top five toponyms and their counts are shown in Table 2. London is the most common toponym in all except Gumtree, a classified advertising site. All of the top five toponyms appear to be large metropolitan settlements in Great Britain, suggesting a strong correlation of web coverage with population, as shown in Figure 3 and discussed in §4.2. However, the seventh and ninth ranked Yahoo! toponyms were Calgary and Echt respectively. Calgary is a small village on the island of Mull and Echt is a village in the northeast of Scotland. The high counts for these toponyms are unlikely to be the result of the popularity of these villages, but rather reflect Calgary's Canadian namesake and the word "echt" in German. Since we did not attempt to address geo/geo ambiguity or to identify stopwords in languages other than English, our method is still susceptible to such errors. However, as discussed in §4.1 and §4.2, we believe these errors are small enough to still allow us to comment on web coverage.

Table 2. Top five toponyms and counts for five web sources.

Yahoo!:    London (1490 x 10^6), Birmingham (281 x 10^6), Liverpool (228 x 10^6), Bristol (200 x 10^6), Derby (161 x 10^6)
Wikipedia: London (860 x 10^3), Bristol (122 x 10^3), Birmingham (118 x 10^3), Glasgow (104 x 10^3), Liverpool (97 x 10^3)
Facebook:  London (116 x 10^3), Portsmouth (48 x 10^3), Liverpool (33 x 10^3), Leeds (19 x 10^3), Derby (14 x 10^3)
Bebo:      London (155 x 10^3), Glasgow (108 x 10^3), Liverpool (101 x 10^3), Aberdeen (53 x 10^3), Leeds (31 x 10^3)
Gumtree:   Bristol (2.29 x 10^6), London (2.15 x 10^6), Birmingham (2.13 x 10^6), Leeds (2.12 x 10^6), Glasgow (2.06 x 10^6)

Finally, we mapped web coverage for four well-known internet domains and calculated the correlation with what we assume to be the web coverage of Great Britain, mapped on the basis of Yahoo! counts (Figure 4). The domain with the most complete coverage, and which correlates most closely with Yahoo! coverage, is Wikipedia. We believe this result may be associated with the high ranking of Wikipedia entries in many commercial search engines, meaning that one would expect good agreement between a map of web coverage based on web searches and a map of web coverage based on Wikipedia searches. This, however, has an important implication: since Wikipedia is very strongly correlated with Yahoo!, it is also clear that Wikipedia is correlated with population (Figure 3), and thus Wikipedia coverage cannot be assumed to be even in geographical space. The patterns for Facebook and Bebo, two popular social networking sites, are similar, with both having strong correlations with overall web coverage. Facebook appears to have a less differentiated coverage, suggesting a few concentrations of users in metropolitan areas, whilst Bebo is more distributed over the whole of Great Britain. Finally, Gumtree, a popular classified advertising site, appears to have relatively complete coverage, with distinct spikes in metropolitan areas and relatively constant coverage in most other regions.

Figure 4. Comparison of web coverage for five prominent internet domains.

5. Discussion

In this paper we have set out to develop a repeatable and robust technique to map web coverage, using a set of relatively simple methods. In developing our method we are aware that other, more robust approaches exist, in particular crawling the web and then applying sophisticated techniques to geo-parse and geo-code individual documents. Our approach is considerably simpler and is based on the notion of using the API of a popular search engine to retrieve the count for a particular toponym, and then assigning this count to a grid square, at some resolution, in which the toponym is found. We set out three research questions in this work, and we now explore these in more detail in the light of the results presented in §4.

The first research question related to the resources required to map the geographical coverage of the web. In our experiments we used a gazetteer provided by the Ordnance Survey which contains a total of over 250,000 toponyms. Searching for the counts associated with all of these toponyms would be very time consuming and, more importantly, the ambiguity of toponyms increases with finer granularities [3]. We therefore restricted our toponym list to settlements, and used a random sample of some 6,374 toponyms corresponding to about 8,000 individual locations. The comparison of Geograph (baseline) with T1 in Figure 2 demonstrated that this list of toponyms was sufficient to replicate the actual coverage of Geograph very well.

Caution must be applied as to the general applicability of this result, as the user interface of Geograph encourages users to use toponyms from the OS50K gazetteer. Nonetheless, it appears that toponyms from this gazetteer are widely used in the captions of images, and furthermore that the zone of influence of these toponyms varies with toponym density. This result can be related to that of [15], who showed how the scope of reformulated web queries varied in different regions. We conclude that a random sample of settlement toponyms is suitable for retrieving a representative set of documents describing the web coverage of a region. However, comparison of Figure 1 with Figure 2 (5km, T2) shows that toponym densities over most of the area of interest must be at least as high as the target resolution of the web coverage maps. Equally, if we were interested in the web coverage of, say, references to mountains, it is unlikely that our selection of settlement placenames would be appropriate.

In our second research question we asked what the likely influences of toponym ambiguity were. We assumed that geo/non-geo ambiguity was likely to be the more significant problem and developed a stopword list based on term frequency in the British National Corpus compared with settlement population. Initial experiments without the use of the stopword list (Figure 2, T2) showed that correlations at both 5 and 30km were low. By applying our stopword list we considerably improved correlations, thus illustrating the importance of geo/non-geo ambiguity for our approach. Furthermore, as discussed in §4.3, highly ranked ambiguous toponyms were still found: in one case a non-British placename and in the other a common German word. Since our method does not retrieve documents, the former problem is difficult to resolve, because we do not have access to toponym context. The latter may be addressable by extending stopword lists, but it should be noted that our aim is to gain an impression of web coverage and that these errors appear not to seriously bias the results.

Our final research question asked what patterns different web resources exhibit. In Figure 3 we showed that Yahoo! coverage was relatively strongly correlated with population, suggesting that care should be taken, if we wish to map vernacular regions based on web counts (e.g. [25]), to normalize counts on the basis of population. Figure 4 illustrated how we can start to explore the geographical coverage of different web resources using our method. Correlations between Yahoo!, Wikipedia, Facebook and Bebo are strong, whilst Gumtree shows a weaker correlation which appears to reflect its more even distribution in space. However, at a resolution of 30km any patterns that are visible are relatively coarse. Since interesting patterns may exist at finer scales, according to the subject of interest, there are several potential approaches to deriving higher resolution maps of coverage. Figure 2 (T1) illustrated how our toponym set was sufficient to develop a detailed picture of coverage at both 5 and 30km where more detailed geocoding was available. By using our toponym list to identify relevant web pages, and then applying geo-parsing and geo-coding techniques to this selection of pages, it may be possible to build a more detailed map of web coverage at higher resolutions using more detailed resources. A second approach, based only on the original toponym list, might attempt to model the zone of influence of individual toponyms and then use pycnophylactic interpolation to model web coverage [33].

6. Conclusions and Future Work

This paper has attempted to map the geographic coverage of the web in Great Britain using a sample of toponyms derived from the OS 1:50,000 gazetteer (OS50K) and simple techniques to calculate the number of hits (i.e. documents containing the toponym) from a web search engine. The selected toponyms were geo-referenced to spatial grids. We introduced Geograph as a baseline dataset, which itself showed almost complete coverage of Great Britain. Our results indicate that the coverage of our randomly selected toponym list correlated well with the baseline representation at two different spatial grid resolutions. The investigation of the effects of geo/non-geo ambiguity on mapping web coverage revealed that, for further experiments where no validation data is available, the representation at 30km resolution has sufficient coverage. We also illustrated that for the web, as captured by Yahoo!, a strong correlation exists with population when the web is mapped on the basis of settlement toponyms. We also identify a number of open research questions:

• Improvement of geo-coding approaches for geo/geo-ambiguous toponyms (e.g. using a default sense).

• Exploring whether the ambiguity of queries can be reduced by adding further context words (e.g. "Sheffield, South Yorkshire" and "Sheffield, Cornwall").

• Improvement of the method by using bigger samples and mining the web in a longitudinal study.

• Investigation of web coverage for other countries, to explore cultural effects that may influence geographical coverage.

• Use of gazetteers with higher granularities, which would allow exploration of web coverage in smaller regions, for example within London.

• Investigation of web page types for which the correlation with population may not be reliable; for example, more photographs may be taken on holiday.

We believe that these results are of interest to any researcher dealing with the geography of the web or planning to exploit data mining techniques to extract data from the web, e.g. for the automatic acquisition of vernacular placenames from web sources. Our method provides a tool to estimate whether the geographic coverage of the web is sufficient and representative without parsing web pages (by relying on a search engine for parsing and crawling). The results and open questions also make us optimistic that this method could be improved to a point where it is more flexible with regard to scale and level of detail.

7. ACKNOWLEDGMENTS

This work was partially supported by the EPSRC and Ordnance Survey (CASE/CAN/06/67). We gratefully acknowledge the contributors to Geograph British Isles (http://www.geograph.org.uk/credits/2007-02-24), whose work is made available under a Creative Commons Attribution-ShareAlike 2.5 Licence (http://creativecommons.org/licenses/by-sa/2.5/). We also acknowledge partial funding of our research by the EC FP6-IST project TRIPOD (contract number 045335). The authors are solely responsible for the content of this paper. It does not represent the opinion of the supporting funding agencies, and the supporters are not responsible for any use that might be made of the data appearing therein.

8. REFERENCES

[1] Amitay, E., N. Har'El, R. Sivan, and A. Soffer, Web-a-Where: Geotagging Web Content, in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2004, ACM: Sheffield, United Kingdom. p. 273-280.
[2] Backstrom, L., J. Kleinberg, R. Kumar, and J. Novak, Spatial Variation in Search Engine Queries, in Proceedings of the 17th International Conference on World Wide Web. 2008, ACM: Beijing, China.
[3] Brunner, T., Geographic Information Retrieval: Identifikation der geographischen Lage von Zeitungsartikeln. Master's thesis, Geographisches Institut, 2008.
[4] General Register Office for Scotland, Census: Standard Area Statistics (Scotland) [Computer File]. 2001, ESRC/JISC Census Programme, Census Dissemination Unit, MIMAS (University of Manchester).
[5] Office for National Statistics, Census: Standard Area Statistics (England and Wales) [Computer File]. 2001, ESRC/JISC Census Programme, Census Dissemination Unit, MIMAS (University of Manchester).
[6] Chakrabarti, S., Mining the Web: Analysis of Hypertext and Semi Structured Data. 2002, Morgan Kaufmann.
[7] Cimiano, P. and S. Staab, Learning by Googling. SIGKDD Explorations (Newsletter), 2004. p. 24-33.
[8] Dodge, M. and R. Kitchin, Mapping Cyberspace. 2001, New York: Routledge.
[9] Egenhofer, M., Toward the Semantic Geospatial Web, in Proceedings of the 10th ACM International Symposium on Advances in Geographic Information Systems. 2002.
[10] Goodchild, M.F., Citizens as Sensors: The World of Volunteered Geography. GeoJournal, 2007. 69(4): p. 211-221.
[11] Gulli, A. and A. Signorini, The Indexable Web Is More Than 11.5 Billion Pages, in WWW '05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web. 2005, ACM.
[12] Hill, L.L., Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints, in Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries. 2000, Springer-Verlag.
[13] Himmelstein, M., Local Search: The Internet Is the Yellow Pages. Computer, 2005. 38(2): p. 26-34.
[14] Jones, C.B., H. Alani, and D. Tudhope, Geographical Information Retrieval with Ontologies of Place, in Proceedings of the International Conference on Spatial Information Theory: Foundations of Geographic Information Science. 2001, Springer-Verlag.
[15] Jones, C.B., R.S. Purves, P.D. Clough, and H. Joho, Modelling Vague Places with Knowledge from the Web. International Journal of Geographical Information Science, 2008. 22(10): p. 1045-1065.
[16] Keller, F. and M. Lapata, Using the Web to Obtain Frequencies for Unseen Bigrams. Computational Linguistics, 2003. 29(3): p. 459-484.
[17] Kilgarriff, A. and G. Grefenstette, Introduction to the Special Issue on the Web as Corpus. Computational Linguistics, 2003. 29(3): p. 333-347.
[18] Larson, R., Geographic Information Retrieval and Spatial Browsing, in Geographic Information Systems and Libraries: Patrons, Maps, and Spatial Information. 1996, Urbana-Champaign: Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign.
[19] Li, H., R.K. Srihari, C. Niu, and W. Li, InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction, in Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1. 2003, Association for Computational Linguistics.
[20] Lin, J. and A. Halavais, Geographical Distribution of Blogs in the United States. Webology, 2006. 3(4).
[21] Markowetz, A., T. Brinkhoff, and B. Seeger, Geographic Information Retrieval, in 3rd International Workshop on Web Dynamics. 2004. [online: http://dbs.mathematik.uni-marburg.de/publications/myPapers/2004/WebDyn2004.pdf]
[22] McCurley, K.S., Geospatial Mapping and Navigation of the Web, in Proceedings of the 10th International Conference on World Wide Web. 2001, Hong Kong: ACM.
[23] Mikheev, A., M. Moens, and C. Grover, Named Entity Recognition without Gazetteers, in Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics. 1999, Association for Computational Linguistics: Bergen, Norway.
[24] Monroe, G., J. French, and A. Powell, Obtaining Language Models of Web Collections Using Query-Based Sampling Techniques, in Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02) - Volume 3. 2002, IEEE Computer Society.
[25] Purves, R., P. Clough, and H. Joho, Identifying Imprecise Regions for Geographic Information Retrieval Using the Web, in GISRUK 2005 - 13th Annual Conference on GIS Research UK. 2005.
[26] Rauch, E., M. Bukatin, and K. Baker, A Confidence-Based Framework for Disambiguating Geographic Terms, in Proceedings of the HLT-NAACL 2003 Workshop on Analysis of Geographic References - Volume 1. 2003, Association for Computational Linguistics.
[27] Resnik, P. and N.A. Smith, The Web as a Parallel Corpus. Computational Linguistics, 2003. 29(3): p. 349-380.
[28] Sanderson, M. and J. Kohler, Analyzing Geographic Queries, in SIGIR 2004 Workshop on Geographic Information Retrieval. 2004.
[29] Schockaert, S. and M. De Cock, Neighborhood Restrictions in Geographic IR, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007, Amsterdam, The Netherlands: ACM.
[30] Smith, D.A. and G.S. Mann, Bootstrapping Toponym Classifiers, in Proceedings of the HLT-NAACL Workshop on Analysis of Geographic References. 2003.
[31] Srivastava, J. and R. Cooley, Web Business Intelligence: Mining the Web for Actionable Knowledge. 2003, INFORMS. p. 191-207.
[32] Tezuka, T. and K. Tanaka, Landmark Extraction: A Web Mining Approach, in COSIT 2005 - Conference on Spatial Information Theory. 2005.
[33] Tobler, W.R., Smooth Pycnophylactic Interpolation for Geographical Regions. Journal of the American Statistical Association, 1979. 74(367): p. 519-530.
[34] Zook, M., The Geographies of the Internet, in Annual Review of Information Science and Technology, B. Cronin, Editor. 2005. p. 53-78.
