Building a Geographical Ontology by Using Wikipedia

Web Services and Applications (Short Papers)

iiWAS2011 Proceedings

Quoc Hung-Ngo

Son Doan

Werner Winiwarter

Faculty of Computer Science, University of Information Technology, Vietnam National University-HCM City
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430
University of Vienna, Research Group Data Analytics and Computing, Universitätsstraße 5, 1010 Vienna, Austria


ABSTRACT

This paper introduces an approach to building a geographical ontology of countries at the global scale. Our approach uses the free online encyclopedia Wikipedia to extract lists of places and then rebuilds the hierarchy and relationships among the extracted places.

Keywords

Geographical ontology, ontology building.

1. INTRODUCTION

In practice as well as in research, determining administrative levels is necessary for information extraction, information retrieval, and report generation. For each place, it is essential to find its administrative level and its sub-levels. Several systems provide administrative and location data, such as Google Maps and the GEOnet Names Server (GNS). Google Maps is an online map system with a geographical hierarchy, which has been implemented for a number of countries such as the United States and the United Kingdom [7]. GNS is provided by the National Geospatial-Intelligence Agency (NGA) and the U.S. Board on Geographic Names (US BGN). It is a database of locations on the earth that includes about 5.5 million entries and is frequently updated [6]. In addition, G. Fu and coauthors and the SPIRIT project plan to build other geographical ontology systems [1,2].

Although GNS and Google Maps are very large, they have some limitations:
- Google Maps is an online map that is mainly exploited visually.
- For many countries, the data are very large but contain no hierarchical relations (e.g. districts and their sub-level places). Additionally, many entries have zero values in the administrative column, which means that no administrative level is recorded.

The geographical ontology in the BioCaster¹ project is used to build a health monitoring map of countries all over the world in six languages: English, Japanese, Vietnamese, Thai, Chinese, and Korean [3,4]. In the following example, Gia Lam and Long Bien are two districts of Ha Noi (Vietnam). Using this relation, we can determine that there are three flu epidemics in Ha Noi:

Besides two new [flu epidemics]OUT that have taken place in [Gia Lam]LOC suburban district, Ha Noi detected a new [flu epidemic]OUT in [Long Bien]LOC.²

(Ngoài 2 [dịch cúm]OUT mới xuất hiện ở huyện [Gia_Lâm]LOC, [Hà_Nội]LOC còn có 1 dịch mới ở [Long_Biên]LOC cũng đã phát_hiện [dịch]OUT.)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. iiWAS2011, 5-7 December, 2011, Ho Chi Minh City, Vietnam. Copyright 2011 ACM 978-1-4503-0784-0/11/12...$10.00

¹ http://born.nii.ac.jp

We use Wikipedia articles that describe the places of each country. In exploiting the structure of these articles, we observed that the introductory information is often followed by a list of the cities, counties, or districts in that place. We therefore use this feature to extract locations together with their sub-levels. We have built a geographical ontology with two administrative levels: 243 names in level 1 and 4,025 names in level 2, with their part-whole relationships and longitudes/latitudes. Moreover, in the BioCaster project, the full set of levels of the geographical ontology is built for six countries: United Kingdom (GB), Japan (JP), China (CN), Vietnam (VN), Thailand (TH), and Korea (KO), which includes South Korea and North Korea (KP), and for six languages: English, Japanese, Chinese, Vietnamese, Thai, and Korean.
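The observation that an article's introduction is followed by lists of sub-level places can be illustrated with a minimal sketch. It assumes a simplified wikitext input in which sections are marked `== Heading ==` and list items are `* [[Link]]` lines; the function name, the heading set, and the sample text are hypothetical and are not taken from the authors' system.

```python
import re

# Headings that, by assumption, introduce a list of sub-level places.
LIST_HEADINGS = {"administrative units", "cities", "districts", "towns", "villages"}

HEADING_RE = re.compile(r"^==+\s*(.*?)\s*==+\s*$")
# A list item whose first element is a wiki link; the link target is captured.
ITEM_RE = re.compile(r"^\*+\s*\[\[([^\]|#]+)(?:[|#][^\]]*)?\]\]")


def extract_sub_places(wikitext):
    """Return {heading: [place names]} for the list sections of one article."""
    result = {}
    current = None  # heading of the list section being read, if any
    for raw_line in wikitext.splitlines():
        line = raw_line.strip()
        heading = HEADING_RE.match(line)
        if heading:
            title = heading.group(1).lower()
            current = title if title in LIST_HEADINGS else None
            continue
        if current is None:
            continue  # text outside the sections of interest
        item = ITEM_RE.match(line)
        if item:
            result.setdefault(current, []).append(item.group(1).strip())
    return result


# Hypothetical sample input, written for this illustration only.
SAMPLE = """
'''Ha Noi''' is the capital of Vietnam.
== History ==
* [[Some event]]
== Districts ==
* [[Gia Lam]]
* [[Long Bien|Long Bien district]]
"""

print(extract_sub_places(SAMPLE))
```

Only links under the recognized headings are kept, so list items in unrelated sections such as History are ignored.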

2. DATA

Our geo-ontology data are extracted from Wikipedia, a free online multilingual encyclopedia that originates from a project of the Wikimedia Foundation. At the time of writing, it contains over 5.3 million articles in more than 100 languages, of which 1.6 million are in English [8]. Wikipedia articles about geographical areas usually give a full introduction to those regions, including:
- Introduction
- History
- Geography
- Administrative units:
  o Cities
  o Districts
  o Towns
  o Villages
- Economy
- Culture
- Sports
- Tourism, Travel

² OUT and LOC are tags of named entities indicating an outbreak and a location, respectively.

In these informative areas, introductions and lists of cities, districts, and towns are used to build the geographical ontology.

3. BUILDING GEO-ONTOLOGY

3.1. Levels and Standards

In this Geography Ontology for BioCaster, we use the following five levels:
- Continents
- Countries
- SubCountries_1
- SubCountries_2
- SubCountries_3

The geographical hierarchy in our ontology is organized as follows:

Level 1: The continents are generally identified by convention rather than by any strict criteria, but seven areas are commonly regarded as continents: Asia, Africa, North America, South America, Antarctica, Europe, and Australia.

Level 2: According to the Maintenance Agency for ISO 3166 country codes, a country or territory must be listed in the United Nations Terminology Bulletin Country Names or in the Country and Region Codes for Statistical Use of the UN Statistics Division [5]. A country or territory must meet any of the following three criteria (ISO 3166-1): it is a United Nations member state, a member of any of the UN specialized agencies, or a party to the Statute of the International Court of Justice. According to these criteria, 243 countries and territories have formal codes.

Level 3: Entities at SubCountries_1 must meet the following criteria:
- The entity is the largest region that belongs directly to a Country/Territory (as in ISO 3166-1).
- Its name is listed among the subdivisions of countries (sub-national entities) and dependent areas in ISO 3166-2.
ISO 3166-2 is a geocode system created for coding the names of subdivisions of countries (sub-national entities) and dependent areas. The purpose of the standard is to establish a worldwide series of short abbreviations for places, for use on package labels, containers, and the like: anywhere a short alphanumeric code can indicate a location more conveniently and less ambiguously than the full place name. There are around 3,700 different codes.

Level 4: Criteria of SubCountries_2: the largest region that belongs directly to a SubCountries_1 entity (as in ISO 3166-2). These name resources are collected from Wikipedia as explained in Section 3.2.

Level 5: Criteria of SubCountries_3: the smallest region that belongs directly to a SubCountries_2 entity. These name resources are collected for six languages: English, Japanese, Vietnamese, Thai, Chinese, and Korean.

Geographical areas in the lower levels (level 4 and level 5) are not covered by the ISO 3166 standard.

3.2. Extracting process

The process of extracting and building data consists of the following steps:

Step 1: Collect names of places (see Figure 1)
- Extract the names and ISO 3166-1 codes of countries.
- For each country, extract a list of places in level 3 that are cities or counties of that country. Extracted places are assigned an ISO 3166-2 code.
- For each place in level 3, extract a list of places in level 4 and level 5.

Step 2: Re-create the relations and the hierarchy of the geographical ontology.

Step 3: Verify the results against local administrative websites such as www.durham.gov.uk (Durham, England, United Kingdom) or www.phuyen.gov.vn (Phu Yen, Viet Nam).
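The five levels of Section 3.1 and the part-whole links re-created in Step 2 can be pictured as a small tree of place records. The sketch below is illustrative only: the `Place` class, its fields, and the rule that a place sits exactly one level below its parent are assumptions made for this example, not a description of the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Level numbers and names as listed in Section 3.1.
LEVEL_NAMES = {
    1: "Continents",
    2: "Countries",
    3: "SubCountries_1",
    4: "SubCountries_2",
    5: "SubCountries_3",
}


@dataclass
class Place:
    name: str
    level: int                        # 1..5
    parent: Optional["Place"] = None  # None only for continents
    code: Optional[str] = None        # ISO 3166-1 / ISO 3166-2 code, if one exists

    def __post_init__(self):
        if self.level not in LEVEL_NAMES:
            raise ValueError(f"unknown level: {self.level}")
        # Assumption of this sketch: a place sits exactly one level below its parent.
        expected = None if self.level == 1 else self.level - 1
        actual = self.parent.level if self.parent is not None else None
        if actual != expected:
            raise ValueError(
                f"{self.name}: parent level {actual} does not fit level {self.level}"
            )

    def path(self):
        """Return the names from the continent down to this place."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


asia = Place("Asia", 1)
vietnam = Place("Vietnam", 2, asia, code="VN")
ha_noi = Place("Ha Noi", 3, vietnam)
gia_lam = Place("Gia Lam", 4, ha_noi)

print(", ".join(gia_lam.path()))
```

Storing only the parent link is enough to recover the full chain of containing places for any entry.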

Figure 1: Process of extracting names and information for countries

The sub-levels (level 4 and level 5) are determined from Wikipedia data in two stages:
- From a list of management regions, names are extracted to create links to Wikipedia pages, and these links are followed to extract the contents about these management regions:
  o If no such link exists on Wikipedia, alter and reduce the name to create new links and retry.
  o If several pages share the same keyword, Wikipedia presents a disambiguation page. From the presented list of alternative meanings, detect and choose the correct entry using higher-level information (the country name). There are two cases:
    - A location with the same keyword: the text that follows covers the higher-level information, such as city names or country names.
    - A person name, organization name, or other type: the brief text that follows usually does not contain a location name, and the entry is ignored.
  o If there is only one informative page for the keyword, check through the first definition segment whether it is the page about the considered place.
- From the extracted page contents, determine a list of administrative sub-regions (see Figure 2). The process includes the following steps:
  o Determine a list of places by the directive words listed in the previous section, such as: administrative units, cities, towns, and villages.
  o Disambiguate the list of management regions against other lists such as airports, famous places, parks, and stations.
  o Extract information from sentences in a description segment that are not explicit lists but contain regional information.
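The disambiguation case in the first stage, where an entry is kept only if its brief text mentions higher-level information, can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the function name, the (title, description) input format, and the sample entries are hypothetical, and plain substring matching stands in for whatever matching the authors actually used.

```python
def location_entries(candidates, higher_level_names):
    """Keep the disambiguation entries whose brief text mentions a
    higher-level place name (for example the country name).

    candidates: list of (title, brief description) pairs.
    higher_level_names: names of the containing places of the target.
    """
    wanted = [name.lower() for name in higher_level_names]
    kept = []
    for title, description in candidates:
        text = description.lower()
        if any(name in text for name in wanted):
            kept.append(title)
        # Entries for persons, organizations, and so on usually mention
        # no containing place and are therefore dropped.
    return kept


# Hypothetical disambiguation entries, written for this illustration only.
CANDIDATES = [
    ("Newton (surname)", "a family name"),
    ("Newton, Bridgend", "a place in Bridgend, Wales, United Kingdom"),
]

print(location_entries(CANDIDATES, ["United Kingdom"]))
```

If several entries survive the filter, more specific higher-level names (such as the level 3 or level 4 place) can be passed in to narrow the result further.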

Figure 2: Area of administrative units in the Busan (Korea) and Osaka (Japan) pages

For each location entity, the system also detects the infobox³ on the Wikipedia page and extracts the properties of this location (see Figure 3). A template is used to extract and map the necessary features, such as Continent, Latitude, Longitude, Area, Population, and Density (see Table 1).

Figure 3: Flowchart of extracting properties for each location

Table 1: Sample of properties for a location
AD
Andorra
Europe
42°30'00"N 1°30'00"E
1.597563 42.541815
468 km²
71,822 (2007)
154/km²
Catalan, Spanish, French
Christianity
81 (men), 87 (women)
1 euro = 100 cents
Andorra la Vella
CET (UTC+1)
http://www.roadjunky.com/images/1518.gif
http://www.roadjunky.com/images/1517.gif
Link: http://en.wikipedia.org/wiki/Andorra

³ An infobox is a brief information box about an entity on a Wikipedia page.

3.3. Relations

The relations we deal with are the management levels of each place name. Relations are divided into two types: attribute relations and hierarchy relations.
- Attribute relation: describes the characteristic attributes of regions, such as label, type, code, and external links.
- Hierarchy relation: describes the relations among geographical areas. This relation can be expressed in two ways: in the forward direction and in the inverse direction. Based on the hierarchy relation, we can determine a string representation of any place.

4. INFORMATION RETRIEVAL

The process of location retrieval is primarily based on string matching. The result of this process is the position of a location name in the hierarchy and its relations. To disambiguate the location names involved in queries, the ontology encodes containment relationships between places. This is useful because a place name may be shared by multiple places. Using the containment relationship, the geographical ontology can derive the broader spatial context of a place. For example, using the following information, a user can tell which place "Newton" belongs to:
(1) United Kingdom, Wales, Bridgend, Newton
(2) United Kingdom, Wales, Swansea, Newton
(3) United Kingdom, Scotland, Na h-Eileanan Siar, Newton
(4) United Kingdom, England, Buckinghamshire, Wycombe, Newton Longville
(5) United Kingdom, England, Sandwell, Newton
If we have the full address of a location, we can determine the correct one among the five cases found when searching for "Newton". The most ambiguous place names are shown in Table 2.

Table 2: The most frequently used names for each country
United Kingdom   Japan          Vietnam            Thailand
Newton (5)       Hidaka (7)     Tân Thành (28)     Nong Bua (18)
Sutton (5)       Misato (7)     Tân Lập (26)       Nai Mueang (14)
Abbey (4)        Asahi (6)      Quang Trung (25)   Ban Mai (13)
Kenton (4)       Miyoshi (5)    Bắc Sơn (20)       Ban Na (10)
Milton (4)       Nakagawa (5)   Tân Bình (20)      Tha Kham (10)
Preston (4)      Yamagata (5)   Hòa Bình (19)      Wiang (10)

The hierarchical information can be derived by string matching on place names or from the containment relationship between places, if it exists. If two places have the same position in the hierarchy, there is a strong possibility that they refer to the same place. If two places have different hierarchies, they can be considered different even if they have the same name or type of place.

5. EVALUATION

Our first experiment is to build the geographical ontology with five levels based on Wikipedia web pages. The detailed results for each level are shown in Table 3.

Table 3: The number of entries in 5 levels
Level   Count    Entries
1       7        continents
2       243      countries
3       4,025    sub-countries (cities, counties)
4       6,792    districts (districts, suburban districts)
5       18,566   sub-districts (wards, towns, villages)


We extract and build the geographical ontology data at level 2 and level 3 for every country in the world. Extracted locations are encoded with ISO 3166-1 at level 2 and with ISO 3166-2 at level 3. However, several countries and territories have sub-regions that are not encoded in ISO 3166-2, such as Armenia, Brunei, Bhutan (Asia), Belize (North America), or Cameroon (Africa).

Table 4: The number of entries in level 2 and 3
Level                                           Count   Total
Level 2 (has level 3 with ISO 3166-2 code)      117     244
Level 2 (has level 3 without ISO 3166-2 code)   76
Level 2 (has no level 3)                        51
Level 3 (with ISO 3166-2 code)                  3,246   4,025
Level 3 (without ISO 3166-2 code)               779
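The split between level 3 entries with and without an ISO 3166-2 code can be illustrated with a short sketch. It relies on the general shape of ISO 3166-2 codes (the two-letter country code, a hyphen, and up to three alphanumeric characters); the function name and the sample records are hypothetical and only show how such a partition might be computed.

```python
import re

# General shape of an ISO 3166-2 code: alpha-2 country code, hyphen,
# then one to three alphanumeric characters (for example "VN-HN").
ISO_3166_2 = re.compile(r"^[A-Z]{2}-[A-Z0-9]{1,3}$")


def split_by_code(places):
    """Partition (name, code) pairs into coded and uncoded level 3 entries."""
    coded, uncoded = [], []
    for name, code in places:
        if code is not None and ISO_3166_2.match(code):
            coded.append((name, code))
        else:
            uncoded.append((name, code))
    return coded, uncoded


# Illustrative records; the last one is a made-up entry without a code.
SAMPLE_PLACES = [
    ("Ha Noi", "VN-HN"),
    ("Osaka", "JP-27"),
    ("Example District", None),
]

coded, uncoded = split_by_code(SAMPLE_PLACES)
print(len(coded), "with code,", len(uncoded), "without code")
```

The pattern checks only the shape of a code, not whether the code is actually assigned in the standard.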

Because the process extracts data from Wikipedia articles, the extracted locations depend completely on the content available on Wikipedia. Moreover, Wikipedia is a free, open online encyclopedia written collaboratively by largely anonymous Internet volunteers. Therefore, several countries and territories that are small or have less-developed technology do not have enough information for data extraction, such as Aruba (ISO 3166: AW, North America), The Bahamas (ISO 3166: BS, North America), or Côte d'Ivoire (ISO 3166: CI, Africa).

The first experiment for level 4 and level 5 covers the six countries in the BioCaster project: United Kingdom, Japan, China, Vietnam, Thailand, and Korea, and six languages: English, Japanese, Chinese, Vietnamese, Thai, and Korean. Table 5 shows the number of entries in levels 3, 4, and 5 by country.

Table 5: The number of entries in each level by country
Level     GB      JP      CN    VN       TH      KO⁴
Level 3   234     47      34    64       78      29
Level 4   2,679   1,215   460   614      932     430
Level 5   466     1,140   738   10,508   5,337   377

An instance of the geographical ontology is a tree built through the parent-children relation, as shown in Figure 4. This tree has five levels. Parent nodes are countries in level 2 and their children are cities or counties in level 3.

Figure 4: Location hierarchies of Vietnam and Japan

6. CONCLUSION

Data built by our system mainly concentrate on administrative hierarchical relations. Attributes of those locations, such as coordinates, area, and population, receive less attention. Moreover, our system has not yet been combined with the GNS data to enrich the geographical ontology. Although the GNS data are very large (for example, Japan has 51,000 location name records and Vietnam has over 48,000 records), these data are mined from text and lack administrative hierarchical relations, which are naturally important in building a geo-ontology.

Although the current system supports only six languages, it can be extended to other countries for which there is sufficient information on Wikipedia. The geographical ontology has also been integrated into the BioCaster ontology and can be freely downloaded at http://born.nii.ac.jp/index.php?page=ontology. Moreover, this ontology is used in the BioCaster project to detect diseases by location all over the world and to monitor global health [4].

7. REFERENCES

[1] Gaihua Fu, Christopher B. Jones, and Alia I. Abdelmoty, 2005. Building a Geographical Ontology for Intelligent Spatial Search on the Web. Databases and Applications (DBA2005), pp. 167-172.
[2] C.B. Jones, R. Purves, A. Ruas, M. Sanderson, M. Sester, M.J. van Kreveld, and R. Weibel, 2002. Spatial information retrieval and geographical ontologies: an overview of the SPIRIT project. SIGIR 2002: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, pp. 387-388.
[3] Son Doan, Quoc Hung-Ngo, and Nigel Collier, 2008. Building and Using Geographical Ontology in the BioCaster Biosurveillance System. Workshop on Bio-Ontologies 2008: Knowledge in Biology, July 2008.
[4] Son Doan, Quoc Hung-Ngo, Ai Kawazoe, and Nigel Collier, 2008. The Use of the BioCaster Ontology for Mapping Infectious Diseases and Locations in the BioCaster Surveillance System. BioLINK 2008, Toronto, Canada, July 2008.
[5] International Organization for Standardization, 1999. Maintenance Agency for ISO 3166 country codes. Available at: http://www.iso.org/iso/en/prods-services/iso3166ma/
[6] NIMA, 2004. GEOnet Names Server (GNS). Available at: http://earth-info.nga.mil/gns/html/index.html
[7] Google Inc., 2006. Available at: http://maps.google.com/maps
[8] Wikimedia Foundation. Available at: http://www.wikipedia.org

⁴ The number of entries for Korea includes South Korea and North Korea.
