Geodint: Towards Semantic Web-based Geographic Data Integration

Tamás Matuszka and Attila Kiss
Eötvös Loránd University, Budapest, Hungary
{tomintt, kiss}@inf.elte.hu
Abstract. The main objective of data integration is to unify data from different sources and to provide a unified view to the users. The integration of heterogeneous data has benefits both for companies and for research. However, finding a common schema and filtering out duplicate elements becomes difficult due to this heterogeneity. In this paper, a system is presented that is able to integrate geographic data from different sources using Semantic Web technologies. The problems that appear during the integration are also handled by the system. An ontology has been developed that stores the common attributes obtained after schema matching. To filter the inconsistent and duplicate elements, clustering and string similarity metrics have been used. The integrated data can be used, among others, for touristic purposes, for example to provide data to an augmented reality browser.
Keywords: data integration, Semantic Web, ontology, entity resolution, Augmented Reality
1 Introduction
The purpose of data integration is to combine and merge data from different data sources and to provide a unified query interface for the users. After the integration, the users do not have to be concerned with the fact that the data came from different sources; the total amount of information can be regarded as a single dataset.

There has long been a need to combine heterogeneous data and to query it through a single interface. Data warehouses can be mentioned among the first solutions; they extract, transform and load the heterogeneous data into a unified schema. The problem with this approach is that the data warehouse is not always up-to-date. Later, data integration shifted towards the use of a mediated schema, which can obtain the information directly from the original databases. This solution is enabled by mappings between the original schemas and the mediated schema. Thanks to the mediated schema, a query sent through the unified interface is transformed into a form which complies with the original schemas. There are two approaches to this mediated schema solution. The first is the "Global As View (GAV)" approach, which maps the entities of the mediated schema to the original database schemas. The second is the "Local As
View (LAV)" approach, which maps the entities of the original data sources to the mediated schema [11]. Nowadays the preferred method is ontology-based data integration, which defines the schema and helps avoid semantic problems [14]. Such a semantic problem arises, for example, when the coordinates of a POI (Point of Interest) are stored in decimal degrees (e.g. 47.162494°) in one database and in degrees-minutes-seconds (e.g. 47° 9' 44.9" N) in another.

During data integration a number of difficulties can be expected. Among these the most important is schema matching, which takes two schemas as input and generates semantically correct schema mappings between them. Currently, schema matching is typically done manually, usually with a graphical user interface. This method can be quite tedious, time-consuming and prone to errors [17]. Fortunately, nowadays there are some semi-automatic tools (e.g. COMA++ [2], Microsoft BizTalk) which can facilitate this process. Another similarly important problem is entity resolution (also known as deduplication). This method is responsible for the identification and merging of records that refer to the same real-world entity. A number of approaches have been developed for this problem, for example the Swoosh approach [4] and Karma [10].

The integration of geographic databases can play a particularly important role for location-based augmented reality browsers [13]. Augmented reality combines the real and virtual worlds in real time. The location-based version takes advantage of the user's current geographical location, so location-based information can be superimposed onto the real-life view. A typical example is when the user looks around with a mobile phone and sees, in the real-life view, icons representing nearby restaurants. The current augmented reality browsers (e.g. Junaio¹, Layar²) use only one data source [12].

In this paper a system is presented that can handle the general disadvantages of data integration in the case of geographic data. This system could be the basis of a location-based augmented reality browser. The main advantage of our system compared to previously existing ones is that it can extract more information by integrating data from different sources than any of them can provide individually. During the implementation, different datasets were used for data provision (Facebook Places³, Foursquare⁴, Google Places⁵, DBpedia [7], LinkedGeoData [1]). The advantages of the Semantic Web were exploited for the integration. To avoid semantic conflicts, an ontology storing the common schema has been developed, which can be easily extended with new data sources. Clustering and string similarity metrics were used for filtering duplications.

The rest of the paper is organized as follows. After the introductory Section 1, we outline the preliminary definitions in Section 2. Then, the details of our system are described in Section 3. Section 4 demonstrates the obtained results and the evaluation of the system. In Section 5 we present some applications that are
¹ http://www.junaio.com/
² https://www.layar.com
³ https://developers.facebook.com/docs/reference/api/search/
⁴ https://foursquare.com
⁵ https://developers.google.com/places/
similar to our system. Finally, the conclusion and future plans are described in Section 6.
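As a concrete illustration of the coordinate-format conflict mentioned above (decimal degrees versus degrees-minutes-seconds), the following small Python sketch converts between the two representations; it is our own illustration and not part of the described system.

```python
def decimal_to_dms(decimal_deg: float) -> tuple:
    """Convert decimal degrees (e.g. 47.162494) to degrees, minutes, seconds."""
    degrees = int(decimal_deg)
    minutes_full = abs(decimal_deg - degrees) * 60
    minutes = int(minutes_full)
    seconds = (minutes_full - minutes) * 60
    return degrees, minutes, round(seconds, 1)

def dms_to_decimal(degrees: int, minutes: int, seconds: float) -> float:
    """Convert degrees, minutes, seconds back to decimal degrees."""
    return degrees + minutes / 60 + seconds / 3600

print(decimal_to_dms(47.162494))               # (47, 9, 45.0)
print(round(dms_to_decimal(47, 9, 44.9), 6))   # 47.162472
```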
2 Preliminaries
In this section, the concepts that are necessary for understanding the paper are defined. We provide insight into the basic concepts of data integration, schema matching, entity resolution and the Semantic Web.

A triple ⟨G, S, M⟩ is called a data integration system, where G is the global schema, S is the set of source schemas and M is a mapping between the global schema and the heterogeneous source schemas. When a user sends a query, the request is executed over G by the system, and the mapping M is responsible for translating this query to the schemas in S.

Schema matching can be used for creating the schema G based on S. The core of schema matching is the match operator. Before we can define the match, it is necessary to introduce the mapping between two schemas, S1 and S2. The mapping is a set of mapping elements; it maps certain elements of S1 to certain elements of S2. The match operator is a function f: S × S → M which gets two schemas as input and returns the mapping between the schemas. The result is called the match result [17].

During data integration it is also required to detect duplicates. Entity resolution gives a solution to this problem. It also uses a match function, which in this case is different from the one used for schema matching. Let E be the set of entities; the match is a function f: E × E → Boolean which decides whether two entities are the same real-world entity or not (denoted e1 ≈ e2 if match(e1, e2) = true, where e1, e2 ∈ E). A merge function µ: E × E → E merges two matching entities into one entity. The merge closure of an instance I (denoted Ī) can be obtained by executing all possible matches and merges on I. If the match determines that e1 and e2 are the same entity, then only the one which contains more useful information needs to be kept after the merge. For example, let e1 = J. Smith and e2 = John Smith; the latter carries more information, denoted e1 ≤ e2. This ordering can be extended to instances as well. Given this, we can now define entity resolution. Let I be an instance and Ī the merge closure of I. An entity resolution of I is a minimal set of records I′ such that I′ ⊆ Ī and Ī ≤ I′ [4].

A possible way to manage the data available on the Internet is to use the Semantic Web [6]. The Semantic Web aims at creating a "web of data": a large distributed knowledge base which contains the information of the World Wide Web in a format that is directly interpretable by computers. Ontology is recognized as one of the key technologies of the Semantic Web. An ontology is a structure O := (C, ≤C, P, σ), where C and P are two disjoint sets. The elements of C and P are called classes and properties, respectively. A partial order ≤C on C is called the class hierarchy, and a function σ: P → C × C is the signature of a property [18]. The Semantic Web stores the knowledge base as RDF triples. Let I, B and L (IRIs, Blank Nodes, Literals) be pairwise disjoint sets. An RDF triple is a triple (v1, v2, v3) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where v1 is the subject, v2 is the predicate and v3 is the object [15].

In this paper we present a geographic data integration system ⟨G, S, M⟩, which can determine the global schema G from S in a semi-automatic way with schema matching. The resulting common schema and the semantic relations are stored in an ontology O. The classes of the ontology store the required types and the properties of the ontology describe the relations among them. During the data integration, the system performs the entity resolution as well. In addition, the resulting data are transformed into RDF format by the system.
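To make the match/merge formalism above more tangible, here is a minimal Python sketch of our own (not part of the described system), in which an entity is an attribute dictionary; the toy match, merge and domination functions are simplifying assumptions used only for illustration.

```python
Entity = dict  # an entity is a set of attribute/value pairs, e.g. {"name": "John Smith"}

def match(e1: Entity, e2: Entity) -> bool:
    """Toy match: two person-like entities are the same if their surnames agree."""
    n1, n2 = e1.get("name", ""), e2.get("name", "")
    return bool(n1) and bool(n2) and n1.split()[-1] == n2.split()[-1]

def merge(e1: Entity, e2: Entity) -> Entity:
    """Merge keeps the union of attributes, preferring the longer (more informative) value."""
    merged = dict(e1)
    for key, value in e2.items():
        if len(str(value)) > len(str(merged.get(key, ""))):
            merged[key] = value
    return merged

def dominates(e1: Entity, e2: Entity) -> bool:
    """e1 <= e2: e2 refers to the same entity and carries at least as much information."""
    return match(e1, e2) and all(len(str(e2.get(k, ""))) >= len(str(v)) for k, v in e1.items())

# Example from the text: "J. Smith" vs. "John Smith"
e1, e2 = {"name": "J. Smith"}, {"name": "John Smith"}
if match(e1, e2):
    print(merge(e1, e2))       # {'name': 'John Smith'}
    print(dominates(e1, e2))   # True: e2 carries more information than e1
```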
3 Geodint (Geographic Data Integration System)
In this section, we overview the details of our system, describing the data sources used and the creation of the common schema. After that, we show the ontology obtained after the schema matching and the entity resolution method. Finally, we present the core algorithm of our system.
3.1 Applied Data Sources
Five data sources are used by the system. Three of them provide data through web services and two store the data in semantic databases.

The first is Foursquare, a social network with 40 million users. With this social network, the current position of the users can be shared. The most important element of Foursquare is the venue, a physically existing location where users can check in. A venue has various attributes, but many of them are irrelevant in our case. The second data source is Facebook Places. Facebook, similarly to Foursquare, allows its users to share their current position. Due to this, numerous data can be obtained about touristic sights, restaurants, museums, etc. Google also allows data provisioning about places which belong to different categories. For this purpose, the same database is used as the one behind Google Maps and Google+ Local. The frequently updated database contains about 95 million POIs.

There are several publicly available datasets in semantic form. These data can be queried with the SPARQL query language [16]. SPARQL formulates queries as graph patterns, thus the query results can be calculated by matching the patterns against the data graph. The most well-known dataset is DBpedia [7], which contains the knowledge of Wikipedia in semantic form. DBpedia contains the latitude and longitude coordinates of numerous places, therefore it can be used as a geographical data source. The last data source is LinkedGeoData [1]. The goal of LinkedGeoData is to add a spatial dimension to the Semantic Web. The spatial data is collected by the OpenStreetMap⁶ project and it is available in RDF format.
⁶ http://www.openstreetmap.org
The large spatial knowledge base contains about 20 billion triples and it is interlinked with DBpedia.
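As an illustration of how the semantic sources can be queried with SPARQL, the following sketch asks the public DBpedia endpoint for labelled places with W3C Geo coordinates inside a small bounding box; the use of the SPARQLWrapper library and the Budapest bounding box are our own assumptions for illustration, not details prescribed by the system.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical example: labelled places inside a small bounding box
# around central Budapest (coordinates chosen only for illustration).
query = """
PREFIX geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?place ?name ?lat ?long WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long ;
         rdfs:label ?name .
  FILTER (lang(?name) = "en")
  FILTER (?lat > 47.49 && ?lat < 47.51 && ?long > 19.03 && ?long < 19.06)
}
LIMIT 50
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["lat"]["value"], row["long"]["value"])
```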
3.2 Schema Matching
To create a global schema from the schemas of the different data sources, COMA++ has been chosen out of the existing tools. COMA++ provides schema matching using various matching algorithms. For this, it offers a graphical user interface which allows many interactions for the users. Due to its generic data representation, the tool supports schemas (e.g. W3C XML Schema⁷) and ontologies (e.g. OWL⁸) [2].

Firstly, the schemas of the Foursquare, Facebook and Google data were downloaded and converted to XSD schemas. In the case of the semantic datasets, the corresponding ontologies were used for the schema matching. The attributes useless for an augmented reality browser were filtered out in advance. With COMA++ we determined the common schemas pairwise in a semi-automatic way and then obtained the global schema. COMA++ recommends a matching between two schemas and the user can confirm or discard the suggestions. An example schema matching can be seen in Figure 1.
Fig. 1. Matching two schemas with COMA++
3.3 Ontology for the Global Schema
After the semi-automatic determination of the global schema, the result was stored in an ontology. This ontology builds on the LinkedGeoData ontology. We mapped the schemas of the data sources to the classes of this ontology. This method follows one of the principles of the Semantic Web, namely the reuse of existing ontologies. We wanted to provide filtering by categories, thus these categories were also selected from this ontology. For this purpose, the ontology was extended with a Category class, since LinkedGeoData stores the different place types under the Amenity class. The classes which are used as categories are also derived from this Category class. In addition, we had to create
⁷ http://www.w3.org/XML/Schema
⁸ http://www.w3.org/2004/OWL/
some classes and properties for the data sources. With the help of these classes and properties, the data-source-specific information (e.g. category matching) can be described. It was also required to create a POI class (this will be the type of the resulting elements). The ontology was extended with some classes which describe the attributes of the POIs. These attributes were described with the properties of the DBpedia ontology and the W3C Geo Vocabulary.
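To give an impression of what an integrated POI could look like in the resulting RDF, here is a minimal rdflib sketch; the project namespace, the POI class and property names, and the LinkedGeoData category IRI are hypothetical placeholders chosen for illustration, not the exact vocabulary of the ontology described above.

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

# Hypothetical namespaces: GEODINT stands in for the project's own ontology.
GEODINT = Namespace("http://example.org/geodint#")
GEO = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
LGDO = Namespace("http://linkedgeodata.org/ontology/")

g = Graph()
g.bind("geodint", GEODINT)
g.bind("geo", GEO)
g.bind("lgdo", LGDO)

poi = URIRef("http://example.org/geodint/poi/cafe-monaco")
g.add((poi, RDF.type, GEODINT.POI))
g.add((poi, GEODINT.category, LGDO.Cafe))   # category taken from the LinkedGeoData ontology
g.add((poi, GEODINT.name, Literal("Cafe Monaco")))
g.add((poi, GEO.lat, Literal("47.4979", datatype=XSD.float)))
g.add((poi, GEO.long, Literal("19.0402", datatype=XSD.float)))

print(g.serialize(format="turtle"))
```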
3.4 Entity Resolution and Result Generation
When we have the global schema, the actual data integration can be started. As mentioned before, the filtering of duplicated elements plays an important role. Our approach is based on clustering and string similarity metrics.

Firstly, a density-based clustering of the geographical data by coordinates is executed. For this purpose we used the DBSCAN [9] algorithm. In density-based methods, the density within a cluster is much higher than the density between clusters. The DBSCAN algorithm uses two parameters: the radius (eps) and the threshold on the number of elements (minpts). The details of the algorithm can be found in [9]. The values of the parameters were determined empirically: we set minpts to 2 and eps to 10 meters because of the inaccuracy of GPS. The clustering yields the places that are located near each other.

After that, the names of the places are compared by two string similarity metrics. The first is the Jaro-Winkler distance, which is based on the number of common characters and transpositions. The second one is the Levenshtein distance, which gives the minimal number of deletions, insertions or replacements between two strings. Using these two metrics, the identical elements within a cluster can be determined with high probability. After removing and merging the duplicated elements, the result is converted to an RDF document by the system. This result can then be queried easily using SPARQL.
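A small sketch of how the two metrics could be combined to flag duplicate names within a cluster is shown below; the jellyfish library and the thresholds are assumptions of ours, not the values actually tuned in the system.

```python
import jellyfish

def probably_same(name_a: str, name_b: str,
                  jw_threshold: float = 0.9, lev_threshold: int = 3) -> bool:
    """Flag two POI names as duplicates if either metric indicates a close match.
    The thresholds are illustrative guesses, not the values tuned in the paper."""
    jw = jellyfish.jaro_winkler_similarity(name_a.lower(), name_b.lower())
    lev = jellyfish.levenshtein_distance(name_a.lower(), name_b.lower())
    return jw >= jw_threshold or lev <= lev_threshold

# Hard cases reported in the evaluation section:
print(probably_same("Cafe Illy", "Cafe Vian"))                    # different places, reported as a false positive
print(probably_same("Cafe Monaco", "Cafe Monaco & Coctail Bar"))  # same place, reported as a false negative
```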
3.5 The Core Algorithm
Algorithm 1 describes the core algorithm of the system. The inputs are the coordinates where the user would like to search, the category and the radius. The system first creates the data-source-specific queries, collects the results and converts them to the POI type. After that, the clustering begins. The distance is determined for each pair of POIs, and if this distance is less than the radius, the two POIs become adjacent. Thereafter, the algorithm determines the elements which belong to a common cluster. Finally, it executes the string similarity search on the POIs within a cluster: if the algorithm finds two equivalent elements, it merges them. Lastly, it transforms the resulting, deduplicated data to RDF format.
Algorithm 1: The core algorithm of the system
Input: lat, lon, category, radius
Output: integrated RDF document

download the data from the different sources and parse it to POIs
foreach POI do
    compute the distance to the other POIs
    if distance ≤ radius then
        set the two POIs to adjacent
    end
    determine whether it is core, boundary or outlier
    foreach adjacent POIs do
        check whether the two POIs are the same according to the Jaro-Winkler and Levenshtein distances or not
        if the two POIs are the same then
            merge them and add the resulting POI to the result
        end
    end
end
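A minimal Python sketch of this pipeline is given below. It assumes the POIs have already been downloaded as (name, lat, lon) tuples, uses scikit-learn's DBSCAN with a haversine metric in place of the system's own clustering step, and reuses the Jaro-Winkler check from Section 3.4; all names and thresholds are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
import jellyfish

EARTH_RADIUS_M = 6_371_000.0

def deduplicate(pois, eps_meters=10.0):
    """pois: list of (name, lat, lon). Returns one representative POI per duplicate group."""
    coords = np.radians([[lat, lon] for _, lat, lon in pois])
    # DBSCAN with the haversine metric expects radians; eps is converted from meters.
    labels = DBSCAN(eps=eps_meters / EARTH_RADIUS_M, min_samples=2,
                    metric="haversine", algorithm="ball_tree").fit_predict(coords)

    result, seen = [], set()
    for i, (name_i, lat_i, lon_i) in enumerate(pois):
        if i in seen:
            continue
        merged_name = name_i
        for j in range(i + 1, len(pois)):
            # Only compare POIs that fall into the same density cluster.
            if labels[i] == -1 or labels[i] != labels[j] or j in seen:
                continue
            name_j = pois[j][0]
            jw = jellyfish.jaro_winkler_similarity(name_i.lower(), name_j.lower())
            if jw >= 0.9:  # illustrative threshold
                seen.add(j)
                # Keep the longer, more informative name (cf. the merge in Section 2).
                if len(name_j) > len(merged_name):
                    merged_name = name_j
        result.append((merged_name, lat_i, lon_i))
    return result

# Tiny usage example with made-up coordinates:
print(deduplicate([("Cafe Monaco", 47.4979, 19.0402),
                   ("Cafe Monaco Bar", 47.49791, 19.04021),
                   ("Museum of Fine Arts", 47.5151, 19.0770)]))
```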
4 Results and Evaluation
For the evaluation of our system, various places have been selected in Budapest and Bangkok. The fundamental assumption was that the result obtained after data integration is much wider than the results of the individual sources.

Figure 2 shows the number of hits separately and collectively. The x axis shows the categories of the selected places. The figure shows that the number of unique places obtained after the integration is much higher than for any source separately. It also shows that the semantic datasets do not contain any data in certain cases, while in other cases they exceed the data coming from the other type of sources.

Figure 3 shows the number of aggregated hits, the number of unique hits and the number of hits after the entity resolution. This figure demonstrates that the result obtained after the entity resolution approximates the true result well. It was found that the number of results can be sometimes more and sometimes less than the number of real unique elements. In the case of the category Cafe in Budapest, a false positive hit was found: "Cafe Illy" and "Cafe Vian" are the same according to our system, and this cannot be filtered out by the similarity search. There were also a few false negative hits, for example "Cafe Monaco" and "Cafe Monaco & Coctail Bar". These two places are different according to the system, while in reality they are the same place. The number of false hits can be seen in Table 1.

Based on the obtained results, we can say that our assumption was right. A system which is able to filter the duplicated elements has been developed. This system can be the basis of an augmented reality browser which can provide more data than the currently existing ones.
Fig. 2. Number of hits separately and collectively.
Fig. 3. Number of aggregated hits.

Table 1. Number of false positive and false negative hits.

Category                Sum   Sum (unique)   After ER   False positive   False negative
Cafe (Budapest)          95             83         86                1                4
Museum (Budapest)        42             40         41                0                1
Restaurant (Budapest)   106             96         99                0                3
University (Budapest)    78             74         73                2                1
Cafe (Bangkok)           38             36         37                0                1
Museum (Bangkok)         29             28         28                0                0
Restaurant (Bangkok)     77             75         74                1                0
University (Bangkok)     75             75         75                0                0

5 Related Work
A mobile application called csxPOI (collaborative, semantic, and context-aware points-of-interest) was presented in [8]. This application allows its users to collaboratively create, share and modify Points of Interest. These POIs represent real physical places. The properties of such places are stored in a collaboratively created ontology, similarly to our solution. However, whereas our approach uses multiple data sources, their solution is based on POIs created by the users.

In [3], the author shows a generic tool that provides automatic retrieval of the updates in geographic databases. The geographic data integration can be made easier with this method, which is based on data matching tools similar to our entity resolution method.

Bennett presented a geoprocessing framework in [5]. This framework includes the basic principles of geographic information systems, modelbase management and computer simulation. All of these modules are integrated into an environment which supports the development of geographical models. The geographical data models include the spatial relations as well.
6 Conclusion
In this paper we presented a geographical data integration system which can be the basis of an augmented reality browser. The global schema was created semi-automatically from the different source schemas using COMA++. The resulting schema was stored in an ontology. The system can be extended with other data sources easily. The deduplication is carried out in two steps. In the first step, a density-based clustering, namely DBSCAN, is executed based on the spatial dimensions. After that, a string similarity search is executed among the elements within a cluster and the identical elements are merged. The obtained results prove that we can get much wider information after integration than from any source separately. In the future, we will improve our system so that the ontology is extended automatically after the schema matching. We would also like to examine whether storing the results in a triple store could speed up the system. In addition, we will use linguistic approaches (e.g. synonyms) apart from the string similarity search.
Acknowledgments This work was partially supported by the European Union and the European Social Fund through project FuturICT.hu (grant no.: TAMOP-4.2.2.C-11/1/KONV2012-0013).
References

1. Auer, S., Lehmann, J., Hellmann, S.: LinkedGeoData: Adding a spatial dimension to the web of data. In: The Semantic Web - ISWC 2009 (pp. 731-746). Springer Berlin Heidelberg (2009)
2. Aumueller, D., Do, H. H., Massmann, S., Rahm, E.: Schema and ontology matching with COMA++. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (pp. 906-908). ACM (2005)
3. Badard, T.: On the automatic retrieval of updates in geographic databases based on geographic data matching tools. Bulletin du Comité français de cartographie, (162), 34-40 (1999)
4. Benjelloun, O., Garcia-Molina, H., Menestrina, D., Su, Q., Whang, S. E., Widom, J.: Swoosh: a generic approach to entity resolution. The VLDB Journal, 18(1), 255-276 (2009)
5. Bennett, D. A.: A framework for the integration of geographical information systems and modelbase management. International Journal of Geographical Information Science, 11(4), 337-357 (1997)
6. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American, 284(5), 28-37 (2001)
7. Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., Hellmann, S.: DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3), 154-165 (2009)
8. Braun, M., Scherp, A., Staab, S.: Collaborative creation of semantic points of interest as linked data on the mobile phone. (2007)
9. Ester, M., Kriegel, H. P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD (Vol. 96, pp. 226-231) (1996)
10. Knoblock, C. A., Szekely, P., Ambite, J. L., Goel, A., Gupta, S., Lerman, K., Muslea, M., Taheriyan, M., Mallick, P.: Semi-automatically mapping structured sources into the semantic web. In: The Semantic Web: Research and Applications (pp. 375-390). Springer Berlin Heidelberg (2012)
11. Lenzerini, M.: Data integration: A theoretical perspective. In: Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (pp. 233-246). ACM (2002)
12. Matuszka, T.: Augmented Reality Supported by Semantic Web Technologies. In: The Semantic Web: Semantics and Big Data (pp. 682-686). Springer Berlin Heidelberg (2013)
13. Matuszka, T., Gombos, G., Kiss, A.: A New Approach for Indoor Navigation Using Semantic Web Technologies and Augmented Reality. In: Virtual, Augmented and Mixed Reality. Designing and Developing Augmented and Virtual Environments (pp. 202-210). Springer Berlin Heidelberg (2013)
14. Noy, N. F.: Semantic integration: a survey of ontology-based approaches. ACM SIGMOD Record, 33(4), 65-70 (2004)
15. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and Complexity of SPARQL. In: The Semantic Web - ISWC 2006 (pp. 30-43). Springer Berlin Heidelberg (2006)
16. Prud'hommeaux, E., Seaborne, A.: SPARQL Query Language for RDF, http://www.w3.org/TR/rdf-sparql-query/
17. Rahm, E., Bernstein, P. A.: A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334-350 (2001)
18. Volz, R., Kleb, J., Mueller, W.: Towards Ontology-based Disambiguation of Geographical Identifiers. In: I3 (2007)