16th Int'l Conf. Computer and Information Technology, 8-10 March 2014, Khulna, Bangladesh
Machine Understandable Information Representation of Geographic Related Data to the Administrative Structure of Bangladesh
Md. Hasan Hafizur Rahman
Shima Chakraborty
Md. Hanif Seddiqui
Dept. of Computer Science & Engineering University of Chittagong Chittagong - 4331, Bangladesh Email:
[email protected]
Dept. of Computer Science & Engineering University of Chittagong Chittagong-4331, Bangladesh Email: shima
[email protected]
Dept. of Computer Science & Engineering Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj-8100 and University of Chittagong, Chittagong-4331 Email:
[email protected]
Abstract: Heterogeneous web data about the geographic information of a country is increasingly important for sharing and integrating data from diverse sources and for retrieving suitable information through search engines. These heterogeneous geographic data are mostly available in unstructured or semi-structured formats. Converting and integrating them into a machine understandable representation is a challenging task that is rapidly gaining researchers' attention. In this regard, we describe geographic data for the administrative structure of Bangladesh, integrating all resources and instances together with their concepts and relations into a repository named Geo-Bangladesh, expressed in the semantically rich Resource Description Framework (RDF). We then use Geo-Bangladesh together with a second knowledge repository, Bangladeshi-Citizen, to achieve semantic interoperability in our application. Furthermore, we perform SPARQL queries on the proposed semantic knowledge repository to retrieve and infer specific information; the quantitative results, in terms of both concepts and geographical entities of the administrative structure of Bangladesh, show its usability and effectiveness.
I. INTRODUCTION
The proliferation of heterogeneous sources of geographic information, together with the rapidly growing World Wide Web (WWW), in unstructured formats meant for human readers is insufficient for smarter methods of finding specific information on the web. These heterogeneous repositories, in the form of documents containing unstructured text about particular domains, are easily readable by humans but hardly understandable to machines. Unstructured web content is developed with Hypertext Markup Language (HTML)¹ and Dynamic HTML (DHTML)², where predefined tags around a sequence of characters indicate the appearance of the content in the document rather than its machine understandable semantics. These tags assert the style rather than the interpretation of the data in the document. For example, the following tag displays Hafizur Rahman in a paragraph of a document, but to a machine it is only a sequence of characters with no explicit meaning: <p>Hafizur Rahman</p>
Extensible Markup Language (XML)³ is, by contrast, a semi-structured data representation technology with user defined tags. Because user defined tags are not unique, information represented with them in a document is scarcely understandable to machines. Besides the rapidly growing data that use these technologies, the current WWW also contains a lot of database driven content, usually called the dynamic web, where the database schema is designed first and the content is added afterwards. Although this content is structured and relies on its own relational database system, it usually lacks machine understandable semantics. Because of this lack of semantics, traditional search engines do not retrieve specific information on the web. Consider a user query about the place Dhaka, the capital city of Bangladesh: a traditional search engine retrieves Dhaka related data from scattered and unstructured sources on the web without interpreting whether Dhaka is a place or not, and returns both relevant and irrelevant documents for the search keyword. Moreover, different Geographic Information System (GIS) software packages use incompatible system designs, data models, and database storage structures that are composed in proprietary formats [1], [2], [3]. Integrating geographical information is a formidable task that relies mainly on these proprietary solutions. Furthermore, geographical data semantics, which deal with representing and reasoning about the meaning of geospatial data, are critical for the development of interoperable geographical data and software [4], [5], geographical information retrieval [6], and automated spatial reasoning [7]. However, it is extremely difficult to capture and maintain semantic knowledge of geographical data due to the complexities of geographical categories [8], geospatial languages [9], and heterogeneous representations of spatial data.
A series of specifications such as Geography Markup Language (GML) [10] and data access protocols such as Web Feature Service (WFS) [11] and Web Map Service (WMS) [12] make it possible to access heterogeneous content stored on the web. However, these standards do not have enough constructs to express data semantics on the web. On the other hand, an ontology is used to define a common vocabulary that minimizes the semantic problems in interoperability, metadata modeling, and communicating
1 http://www.w3.org/html
2 http://en.wikipedia.org/wiki/Dynamic_HTML
3 http://www.w3.org/XML/
978-1-4799-3497-3/13/$31.00 ©2013 IEEE
meaning of data across domains, and data integration [13], [14], [6]. It is employed as a method for identifying categories, concepts, relations, and rules that prescribe theories of the geographical domain [15]. A lot of effort has also been made to develop the next generation web, the Semantic Web, a term coined by Berners-Lee et al. [16], whose objective is to achieve semantic interoperability among the metadata associated with web information. Despite the growth of semantic web technology, which addresses data integration and machine understandability of huge data repositories, the geographic information of a country remains a challenging problem with vast application areas, such as finding related, distinguishing information in an interoperable web of data. In this regard, the main focus of our research is to develop a generic machine understandable database, Geo-Bangladesh, from the existing available data using rich semantic web technologies, in order to address resource sharing and interoperability in a faster and more efficient way. We performed a couple of experiments to retrieve and infer specific information using SPARQL, which show the effectiveness of our proposed knowledge repository for the geographic information of Bangladesh. Interoperability is cross-examined with the knowledge repository Bangladeshi-Citizen, which holds citizen information of Bangladesh. The rest of the paper is organized as follows. Section II introduces the general terminology needed to understand the rest of this paper. The present administrative structure of Bangladesh is described in Section III, Section IV details the data preparation procedure, and Section V describes the approach of our system and the process of making the data machine understandable.
Section VI elaborates the process of publishing the geographic data of Bangladesh as Linked Open Data (LOD). Section VII uses the proposed LOD in experiments with SPARQL. Concluding remarks along with some future directions of our work are given in Section VIII.
II. GENERAL TERMINOLOGIES
In this section we introduce some basic definitions of semantic web terminology used throughout this paper: Ontology, Linked Data, Resource Description Framework (RDF)⁴, Uniform Resource Identifiers (URIs) and RDF Triple.
4 www.w3.org/RDF/
A. Ontology
In the field of the semantic web, an ontology is an explicit, formal specification of a shared conceptualization of a domain of interest [17], [18]. It acts as the backbone of the semantic web vision [19], [20], which is considered the next generation web. An ontology contains a core ontology, logical mappings, a knowledge base, and a lexicon. A core ontology $S$ is defined as a five-tuple $S = (C, \le_C, R, \sigma, \le_R)$ consisting of two disjoint sets $C$ and $R$ whose elements are called concepts and relations respectively. The notation $\le_C$ represents a partial order on $C$, called the concept hierarchy or “taxonomy”. $\le_R$ is a partial order on $R$, called the relation hierarchy, where $r_1 \le_R r_2$ iff $\mathrm{dom}(r_1) \le_C \mathrm{dom}(r_2)$ and $\mathrm{ran}(r_1) \le_C \mathrm{ran}(r_2)$. The function $\sigma : R \to C \times C$ is called the signature of a binary relation, where $\sigma(r) = \langle \mathrm{dom}(r), \mathrm{ran}(r) \rangle$ for $r \in R$, with domain $\mathrm{dom}(r)$ and range $\mathrm{ran}(r)$ [21].
B. Linked Data
The ontology knowledge base is a structure $KB = (C, R, I, \iota_C, \iota_R)$ consisting of the two disjoint sets $C$ and $R$ defined above, a set $I$ whose elements are called instances, and two functions $\iota_C$ and $\iota_R$ called concept instantiation and relation instantiation respectively. With the structure of the ontology and the ontology knowledge base, we develop a method of publishing structured data so that it can be interlinked with diverse data sources. Data interlinked in this way, which both machines and humans can explore on the web, is called Linked Data.
C. RDF, RDF Triple and URI
RDF is a standard model for data interchange that uses a triple format to interlink data on the web. In this model, resources are described using HTTP URIs, which create globally unique names without centralized management and make the information about a resource easy to access while avoiding redundancy on the web. An RDF triple is a piece of knowledge consisting of a subject, a predicate and an object. Fig. 1 demonstrates the basic structure of triple based linked data.
Fig. 1. RDF Triple for “Comilla partOf Chittagong”
The subject of a triple is a URI, the address of the resource that we want to describe in a domain, while the predicate describes a fact about the domain and represents the relationship between subject and object. We depict the RDF graph representation of Comilla partOf Chittagong in Fig. 2.
Fig. 2. RDF Graph for the Triple “Comilla partOf Chittagong”
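The triple “Comilla partOf Chittagong” can be written down concretely. The following is a minimal sketch, not the authors' implementation: the base URI `http://example.org/geo-bangladesh/` and the `Triple` class are assumptions introduced for illustration, while the predicate URI is the one quoted in Section V of this paper.

```python
from typing import NamedTuple

# Hypothetical base URI for illustration; the paper does not state the real one.
BASE = "http://example.org/geo-bangladesh/"
# Predicate URI as quoted in Section V of the paper.
PART_OF = "http://www.w3.org/2000/01/rdf-schema#partOf"


class Triple(NamedTuple):
    """One piece of knowledge: subject, predicate and object, all URIs here."""
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        """Serialize a URI-only triple as a single N-Triples line."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


comilla_part_of_chittagong = Triple(BASE + "Comilla", PART_OF, BASE + "Chittagong")
print(comilla_part_of_chittagong.to_ntriples())
```

Read as a graph, the subject and object are the two nodes and the predicate is the labeled, directed edge between them.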
III. GENERAL ADMINISTRATIVE STRUCTURE OF BANGLADESH
Bangladesh, officially the People's Republic of Bangladesh, is a country of Southern Asia lying between 20°34′N and 26°38′N latitude and between 88°01′E and 92°41′E longitude. The logical relationship among the different levels of the administrative structure of Bangladesh is depicted in Fig. 3.
Fig. 3. Logical Relationship between Different Levels of Bangladesh
Each level has its corresponding geographic information consisting of name, latitude, longitude and so on. Each level in this structure has a logical relationship with its upper and lower levels. We identify the specific geographic location of each level of the administrative structure using the partOf relation from Fig. 3 in the following manner:
The country level: This entity, the root of the hierarchy, represents Bangladesh; it is not explicitly defined in our database. It is assigned to the country class.
The division level: We create an entity for each division, the first administrative level, and a partOf relation between each division and the country. Divisions are described in the division class.
The district level: We define each district as an entity in our domain and create a partOf relation between each district and its division. We describe each district in the district class.
The upazila level: There is a partOf relation between each upazila and its district. In metropolitan areas the third level administrative unit is the thana (police station) instead of the upazila. These entities are assigned to the upazila class.
The pourashava level: Each pourashava has a partOf relation with its upazila and is assigned to the pourashava class. The code 99 is used to link with unions; other codes relate to wards.
The ward level: Each ward is an entity and we create a partOf relation between each ward and its pourashava. Each ward is assigned to the ward class.
The moholla level: We create an entity for each moholla and a partOf relation between each moholla and its ward. Mohollas are assigned to the moholla class. This is one of the populated place levels.
The union level: Each union is part of a pourashava identified by pourashava code 99. We create an entity for each union, and there exists a partOf relation between each union and the pourashava. All unions are assigned to the union class.
The mouza level: We define each mouza as an entity and create a partOf relation between each mouza and its union. Mouzas are assigned to the mouza class.
The village level: Villages are populated places of Bangladesh and the lowest level of the administrative hierarchy. We create an entity for each village and a partOf relation between each village and its mouza. Villages are assigned to the village class.
Beyond these levels and their relationships, the data of each level is connected with another level to create a chain of data in our semantic database.
IV. GEOGRAPHIC DATA RELATED TO THE ADMINISTRATIVE STRUCTURE OF BANGLADESH
The geographic data of Bangladesh is not available from a single data provider. It is challenging to collect the data from different sources and validate it using both manual and automatic processes. The geographic designations that name and address each location, originated by the Bangladesh Bureau of Statistics⁵, are available in a Microsoft Excel file, GeocodeBD.xls. This file contains 7 divisions, 64 districts, 500 upazilas and 509 administrative thanas, 265 pourashavas or municipalities, 2407 wards, 4451 union councils, 67100 mouzas or mohollas and 87968 villages. Among these data, 96,600 populated places are obtainable. A fragment of the data in GeocodeBD.xls is shown in Fig. 4.
5 http://bbs.gov.bd/RptGeoCode.aspx
Fig. 4. Fragment of Data in the GeocodeBD.xls file
Another file, published as the features file feature.xls, contains information about geographical classes such as Division, District, Upazila and so on. A data snippet of feature.xls is given in Fig. 5.
Fig. 5. Data snippet of feature.xls file
Moreover, it is a formidable task to aggregate latitude and longitude values for each location. To do this, we carefully gathered latitude and longitude coordinate values from the GeoNames geographical database file BD.xls⁶. A data segment of this file is given in Fig. 6.
Fig. 6. Data segment of BD.xls file
In this data source, 53719 records are usable for geographic locations. We extract latitude and longitude values only for entities related to the administrative structure of Bangladesh. Even with this data source, there are many challenges in matching appropriate values to GeocodeBD for a particular location. In this database the feature code ADM1 is equivalent to our division in GeocodeBD, and the feature code ADM2 corresponds to district or zila. There are 68 districts in this dataset instead of the 64 districts in the dataset provided by the government of Bangladesh. Duplicate records for the Jhalokati, Khagrachhari and Pirojpur districts were identified and removed before aggregating latitude and longitude values. Moreover, another record, for the Parbattya Chattagram district, is not a valid second level administrative entity of Bangladesh. Furthermore, the ADM3 level has 308 data records in the GeoNames geographic server. However, there are only four classified records for the fourth administrative level, and 248 records are missing from the BD.xls file for the ADM3 level. To fill in these missing data and overcome the challenges of automatically extracting the coordinate values of an entity, we completed this task with the Google Geocoding API. Each geographic entity is represented with latitude and longitude coordinates in WGS84 (World Geodetic System 1984) format, a standard coordinate reference system mainly used in cartography, geodesy and navigation to represent geographical coordinates on the Earth⁷.
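The cleaning of the GeoNames records described above (mapping feature codes to administrative levels, dropping duplicates and discarding invalid entries) can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the function `select_admin_records` and the sample rows are hypothetical, and the coordinates are placeholders rather than values from BD.xls. Only the ADM1 and ADM2 correspondences come from the text.

```python
# Correspondence stated in the text: ADM1 ~ division, ADM2 ~ district.
FEATURE_CODE_TO_LEVEL = {"ADM1": "division", "ADM2": "district"}


def select_admin_records(records, invalid_names=()):
    """Return one coordinate record per (level, name), skipping invalid names."""
    invalid = {name.lower() for name in invalid_names}
    selected = {}
    for record in records:
        level = FEATURE_CODE_TO_LEVEL.get(record["feature_code"])
        name = record["name"].strip()
        if level is None or name.lower() in invalid:
            continue  # not an administrative level of interest, or known invalid
        # setdefault keeps the first record seen and ignores later duplicates.
        selected.setdefault(
            (level, name.lower()),
            {"name": name, "level": level, "lat": record["lat"], "long": record["long"]},
        )
    return list(selected.values())


# Placeholder rows: the coordinates are illustrative, not real data.
sample = [
    {"name": "Pirojpur", "feature_code": "ADM2", "lat": 1.0, "long": 2.0},
    {"name": "Pirojpur", "feature_code": "ADM2", "lat": 1.5, "long": 2.5},
    {"name": "Parbattya Chattagram", "feature_code": "ADM2", "lat": 3.0, "long": 4.0},
    {"name": "Chittagong", "feature_code": "ADM1", "lat": 5.0, "long": 6.0},
    {"name": "Some Town", "feature_code": "PPL", "lat": 7.0, "long": 8.0},
]
cleaned = select_admin_records(sample, invalid_names=["Parbattya Chattagram"])
```

Keeping the first record of each duplicate is a simplification; the paper does not say which duplicate was retained.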
V. OUR APPROACH
The first stage of our approach is to classify the resources and instances of the system together with their respective features. The second stage is to identify the concepts and relationship types of the resources in order to build a machine understandable data repository named Geo-Bangladesh, a semantic database of the geographic information of Bangladesh, from the available data related to our domain of interest. In the third stage, we publish this knowledge repository as Linked Open Data (LOD) to link it with related web content and address interoperability issues on the web. In the final stage, several experiments are performed to retrieve and infer specific information from our machine understandable data repository using SPARQL⁸. An overview of our approach is depicted in Fig. 7.
6 http://download.geonames.org/export/dump/
7 https://www1.nga.mil/ProductsServices/
8 http://www.w3.org/TR/rdf-sparql-query
Fig. 7. Overview of Our Information Retrieval System
A. Making Data Machine Understandable
An ontology defines a common vocabulary for researchers [22] who need to share information about a domain on the web of data. It is easier to publish data in RDF using vocabularies already available on the web. When no suitable vocabulary is available, users can propose one to describe domain data on the web. To identify these vocabularies we developed a Geo-inspired ontology that describes our domain knowledge explicitly. The basic steps in developing an ontology are [22]: defining classes and arranging them in a taxonomic (subclass-superclass) hierarchy, defining relationships with other classes, called slots, and defining the allowed values for these slots. Our knowledge repository Geo-Bangladesh contains fundamental geographic data related to the administrative structure, taking into account classes, sub-classes, instances, relations, object properties, data properties and general axioms. It is important to select a vocabulary that maximizes interoperability through wide consensus on the web. We use the Dublin Core (DC) (dublincore.org/documents/dces/) and DCMI-BOX (http://dublincore.org/documents/dcmi-box/) standard vocabularies to encode the geographic metadata of our domain. For example, the type of a resource is defined as a class using the http://www.w3.org/1999/02/22-rdf-syntax-ns#type vocabulary. The features of each instance of a class, such as name and comment, are published using the http://www.w3.org/2000/01/rdf-schema#label and http://www.w3.org/2000/01/rdf-schema#comment vocabularies. The latitude and longitude of each location are defined using the common vocabularies http://www.w3.org/2003/01/geo/wgs84_pos#lat and http://www.w3.org/2003/01/geo/wgs84_pos#long. The geographic bounding box attributes for north, south, east and west are mapped to dcmibox:northlimit, dcmibox:southlimit, dcmibox:eastlimit and dcmibox:westlimit respectively. Each resource of our domain is linked with other resources using the RDF schema definition http://www.w3.org/2000/01/rdf-schema#partOf. Fig. 8 gives a close look at the machine understandable representation of the geographic data in RDF format. In this representation each resource is linked with other resources to form a directed, labeled graph, which is portrayed in Fig. 9.
Fig. 8. Machine Understandable Geographic Data of Bangladesh in N-TRIPLE format
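To make the vocabulary mapping above concrete, the sketch below emits N-Triples lines for one location using the type, label, latitude and longitude properties named in the text. It is an illustration only: the base URI, the `describe_location` helper and the class name `District` are assumptions, and the coordinates passed in are placeholders, not values from the dataset.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
GEO_LAT = "http://www.w3.org/2003/01/geo/wgs84_pos#lat"
GEO_LONG = "http://www.w3.org/2003/01/geo/wgs84_pos#long"
# Hypothetical base URI for illustration; the paper does not state the real one.
BASE = "http://example.org/geo-bangladesh/"


def literal(value):
    """Quote a value as a plain N-Triples literal, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def describe_location(local_name, class_name, label, lat, long):
    """Return N-Triples lines giving the type, label and coordinates of a location."""
    subject = f"<{BASE}{local_name}>"
    return [
        f"{subject} <{RDF_TYPE}> <{BASE}{class_name}> .",
        f"{subject} <{RDFS_LABEL}> {literal(label)} .",
        f"{subject} <{GEO_LAT}> {literal(lat)} .",
        f"{subject} <{GEO_LONG}> {literal(long)} .",
    ]


# Placeholder coordinates (0.0, 0.0), not the real position of Comilla.
lines = describe_location("Comilla", "District", "Comilla", 0.0, 0.0)
print("\n".join(lines))
```

A production pipeline would more likely use an RDF library and typed literals; plain string formatting is used here only to keep the sketch self-contained.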
Fig. 9. Machine Understandable Geographic Data of Bangladesh in a Graph
VI. PUBLISHED GEOGRAPHIC DATA ON BANGLADESH AS LINKED OPEN DATA
Linked data focuses on the ontological level and on inference to create knowledge based, interoperable global data sets. With this technology we publish structured data in RDF using HTTP URIs for public access, which is why it is widely called Linked Open Data (LOD). The LOD of our domain provides reusable data for other applications on the web and allows specific geographic data related to the administrative structure of Bangladesh to be retrieved with a semantic search engine. Tim Berners-Lee, the inventor of the World Wide Web, outlines four principles of Linked Data [23]:
1) Use URIs as names for things.
2) Use dereferenceable URIs (HTTP URIs) so that people can look up those names.
3) When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4) Include links to resources in other datasets in order to enable the discovery of more data.
Adhering to these four linked data principles, we published Geo-Bangladesh so that its data is interconnected with one more data repository, Bangladeshi-Citizen, to achieve interoperability. In this case, the resources that define villages in the Geo-Bangladesh data source are shared with Bangladeshi-Citizen to retrieve the complete address of a citizen.
VII. KNOWLEDGE REPRESENTATIONS AND REASONING
Our proposed application shares resources or instances from Geo-Bangladesh without defining the corresponding data again in other applications. Chaining different web content increases the number of knowledge-intensive tasks that can be carried out automatically [24]. The semantic web supports technology for describing, discovering and accessing web applications that can be formed dynamically into a chain of interoperable web applications [25]. We use Jena, a Java based semantic web application development framework, for rule based inference, and SPARQL, a query language for semantic data. We consider the following knowledge pieces, S-1, S-2, ..., S-7, with their associations to identify each entity of the administrative structure of Bangladesh and derive important information based on the isa and partOf relations.
S-1: Bangladesh isa country
S-2: Bara Horipur isa village and partOf Bara Horipur Mouza
S-3: Bara Horipur isa Mouza and partOf Dakshin Khosbas union
S-4: Dakshin Khosbas isa union and partOf Barura Upazila
S-5: Barura isa Upazila and partOf Comilla zila/district
S-6: Comilla isa district and partOf Chittagong division
S-7: Chittagong isa division and partOf Bangladesh
The inference system can derive both direct and indirect relations from Geo-Bangladesh using the relationship between the subject and the object of the knowledge pieces. The inference engine also generates many more necessary statements by applying various rules to the existing triples in our semantic data source. For example, we ran an experiment to retrieve the names of all districts closest, in terms of distance, to the district in which Bara Horipur village is located; similar experiments were also performed for the other entities of Bangladesh, i.e., division, upazila, union, mouza and moholla, and vice versa. The specific hierarchical data retrieval procedure is as follows:
Step-1: Based on the partOf relation, retrieve the subject or object for an entity.
Step-2: Apply the subject or object from Step-1 to retrieve other data.
Step-3: Repeat Step-1 and Step-2 until the specific goal is achieved.
The semantic query that retrieves the URIs of all divisions from our knowledge base Geo-Bangladesh using the SPARQL semantic search engine is portrayed in Fig. 10 and its result in Fig. 11.
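The three retrieval steps can be illustrated by following partOf links upward from an entity until the root is reached. The sketch below is a hand-written stand-in for the rule based inference the authors run in Jena, not their implementation. The dictionary encodes knowledge pieces S-2 to S-7, with level names appended to the keys (an assumption made here) so that the village and the mouza both named Bara Horipur stay distinct.

```python
# partOf facts from knowledge pieces S-2 .. S-7 (child -> parent).
PART_OF = {
    "Bara Horipur village": "Bara Horipur mouza",
    "Bara Horipur mouza": "Dakshin Khosbas union",
    "Dakshin Khosbas union": "Barura upazila",
    "Barura upazila": "Comilla district",
    "Comilla district": "Chittagong division",
    "Chittagong division": "Bangladesh",
}


def ancestors(entity, part_of=PART_OF):
    """Follow partOf links upward and return every enclosing entity in order."""
    chain = []
    seen = {entity}
    while entity in part_of:          # Step-1: look up the object of partOf
        entity = part_of[entity]      # Step-2: use it as the next subject
        if entity in seen:
            raise ValueError(f"cycle detected at {entity!r}")
        seen.add(entity)
        chain.append(entity)          # Step-3: repeat until no parent remains
    return chain


print(ancestors("Bara Horipur village"))
```

Each element of the returned chain is an indirect partOf statement that is not stored explicitly, for example that Bara Horipur village is part of Comilla district.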
Fig. 10. Semantic query using SPARQL search engine
Fig. 11. Semantic query Result for URIs of all division
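The query shown in Fig. 10 is not reproduced in this text. As a hedged illustration of what retrieving the URIs of all divisions could look like, the sketch below pairs an assumed SPARQL query string with a plain Python filter that performs the same pattern match over an in-memory list of triples. The base URI, the `Division` and `District` class URIs and the sample triples are assumptions for illustration.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Hypothetical base and class URIs; the paper does not state the real ones.
BASE = "http://example.org/geo-bangladesh/"
DIVISION_CLASS = BASE + "Division"

# An assumed SPARQL query of the kind one might send to an endpoint.
DIVISION_QUERY = f"""
SELECT ?division
WHERE {{ ?division <{RDF_TYPE}> <{DIVISION_CLASS}> . }}
"""


def subjects_of_type(triples, class_uri):
    """Return the sorted subjects of all (s, rdf:type, class_uri) triples."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == class_uri)


# Small illustrative graph: two divisions and one district.
triples = [
    (BASE + "Chittagong", RDF_TYPE, DIVISION_CLASS),
    (BASE + "Comilla", RDF_TYPE, BASE + "District"),
    (BASE + "Dhaka", RDF_TYPE, DIVISION_CLASS),
]
division_uris = subjects_of_type(triples, DIVISION_CLASS)
print(division_uris)
```

The filter mirrors the single triple pattern in the query; a real deployment would evaluate the SPARQL text with a query engine such as the one in Jena rather than with a list comprehension.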
VIII. CONCLUSION
Machine understandable geographic data of a country is now essential to reduce data duplication across the applications that use such compatible data on the web. Geo-Bangladesh, a geographic semantic data repository of Bangladesh, reduces the need to create similar data for other applications. This machine understandable web content is interlinked with available semantic data sources to address data interoperability in a faster and more efficient way. RDF queries are performed to retrieve and infer more specific information, which reflects the effectiveness of our application. Using Geo-Bangladesh, we plan to extend this research to map our locality and make search easier for the many people who have access to mobile technology.
REFERENCES
[1] J. Chomicki and P. Revesz, “Constraint-based interoperability of spatiotemporal databases,” Geoinformatica, vol. 3, no. 3, pp. 211–243, 1999.
[2] Z. Peng, “A proposed framework for feature-level geospatial data sharing: a case study for transportation network data,” International Journal of Geographical Information Science, vol. 19, no. 4, pp. 459–481, 2005.
[3] C. Zhang and W. Li, “The roles of web feature and web map services in real-time geospatial data sharing for time-critical applications,” Cartography and Geographic Information Science, vol. 32, no. 4, pp. 269–283, 2005.
[4] Y. Bishr, “Overcoming the semantic and other barriers to gis interoperability,” International Journal of Geographical Information Science, vol. 12, no. 4, pp. 299–314, 1998.
[5] F. Harvey, W. Kuhn, H. Pundt, Y. Bishr, and C. Riedemann, “Semantic interoperability: A central issue for sharing geographic information,” The Annals of Regional Science, vol. 33, no. 2, pp. 213–232, 1999.
[6] C. Jones, H. Alani, and D. Tudhope, “Geographical information retrieval with ontologies of place,” Spatial Information Theory, pp. 322–335, 2001.
[7] A. Cohn, “The challenge of qualitative spatial reasoning,” ACM Computing Surveys, vol. 27, no. 3, pp. 323–325, 1995.
[8] B. Smith and D. Mark, “Geographical categories: an ontological investigation,” International Journal of Geographical Information Science, vol. 15, no. 7, pp. 591–612, 2001.
[9] A. Frank and D. Mark, “Language issues for gis,” Geographical information systems: Principles and applications, vol. 1, pp. 147–163, 1991.
[10] O. Consortium et al., “Geography markup language (gml) 3.0,” Open GIS Implementation Specification, [Online]. Available: http://www.opengis.org/docs/02-023r4.pdf, 2001.
[11] P. Vretanos, “Web feature service implementation specification,” Open Geospatial Consortium Specification, pp. 04–094, 2005.
[12] J. de La Beaujardière, “Ogc web map service interface, version 1.3.0,” Open Geospatial Consortium, 2004.
[13] J. Brodeur, Y. Bedard, G. Edwards, and B. Moulin, “Revisiting the concept of geospatial data interoperability within the scope of human communication processes,” Transactions in GIS, vol. 7, no. 2, pp. 243–265, 2003.
[14] F. Fonseca, M. Egenhofer, P. Agouris, and G. Câmara, “Using ontologies for integrated geographic information systems,” Transactions in GIS, vol. 6, no. 3, pp. 231–257, 2002.
[15] D. Mark, B. Smith, and B. Tversky, “Ontology and geographic objects: An empirical study of cognitive categorization,” Spatial Information Theory. Cognitive and Computational Foundations of Geographic Information Science, pp. 747–747, 1999.
[16] T. Berners-Lee, J. Hendler, O. Lassila et al., “The semantic web,” Scientific american, vol. 284, no. 5, pp. 28–37, 2001.
[17] T. Gruber et al., “Toward principles for the design of ontologies used for knowledge sharing,” International journal of human computer studies, vol. 43, no. 5, pp. 907–928, 1995.
[18] R. Studer, V. Benjamins, and D. Fensel, “Knowledge engineering: principles and methods,” Data & knowledge engineering, vol. 25, no. 1, pp. 161–197, 1998.
[19] T. Berners-Lee and M. Fischetti, Weaving the Web: The original design and ultimate destiny of the World Wide Web by its inventor. DIANE Publishing Company, 2001.
[20] A. Maedche and S. Staab, “Ontology learning for the semantic web,” Intelligent Systems, IEEE, vol. 16, no. 2, pp. 72–79, 2001.
[21] M. Ehrig, Ontology alignment: bridging the semantic gap. Springer, 2006, vol. 4.
[22] N. Noy, D. McGuinness et al., “Ontology development 101: A guide to creating your first ontology,” 2001.
[23] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data-the story so far,” International Journal on Semantic Web and Information Systems (IJSWIS), vol. 5, no. 3, pp. 1–22, 2009.
[24] Y. Ding, D. Fensel, M. Klein, and B. Omelayenko, “The semantic web: yet another hip?” Data & Knowledge Engineering, vol. 41, no. 2, pp. 205–227, 2002.
[25] M. Daconta, L. Obrst, and K. Smith, “The semantic web: a guide to the future of xml, web services, and knowledge management,” 2003.