An RDF-based Semantic Schema Mapping Transformation System for Localized Data Integration

Chi Po Cheong, Chris Chatwin, Rupert Young
School of Science and Technology, University of Sussex, Brighton, United Kingdom
[email protected], [email protected], [email protected]
Abstract—This paper proposes a new transformation system for data integration based on RDF-based semantic schema mapping, together with an alternative approach to data integration. It provides an efficient way to keep legacy systems running while presenting an integrated view to users during enterprise mergers and acquisitions. Existing data integration solutions require substantial resources, especially in the initial stages, and the preparation for data integration is a time-consuming and complex process. The proposed system is therefore suitable for data integration in a lean economic environment such as the current downturn.

Keywords: Data Integration; Semantic; RDF
I. INTRODUCTION

Dealing with the data in legacy systems or heterogeneous data sources presents many challenges. The purpose of data integration is to merge and match related information coming from different sources. Data integration is an expensive and time-consuming process, but it provides several advantages: it can combine two or more data sources for sharing and analysis, and it can merge all legacy databases into a single schema so that, for example, a user can obtain a variety of information from a single query. A data integration system is defined by the triple {G, S, M}, where G is a global schema, S is a set of source schemas and M is the mapping between G and S. Data integration can be used in many domains, for instance the discovery and integration of themed data from the Internet, or the integration of medical data from different healthcare information systems.

Three approaches have been proposed for data integration: materialization, virtualization [1] and ontology-based integration. Data warehousing is an example of the materialization approach. It is the traditional approach to data integration and materializes the data at the global level, typically through an Extract, Transform and Load (ETL) process. However, it does not easily support real-time data access or the addition of new data sources. The virtualization approach defines a virtual schema that includes attributes from all data sources; Global-as-View (GAV) [2] and Local-as-View (LAV) [3] are examples of this approach. Both use query-reformulation procedures to break a user query down into sub-queries for the different sources. The advantage of this approach over materialization is that it returns up-to-date data to the user; the challenge is how to define a mapping between the virtual schema and the data source schemas. The classification of schema matching approaches is discussed in [4].

The ontology-based data integration approach is a new direction for researchers seeking to expose Relational Database Management System (RDBMS) data to the Web. There are three ways to create an ontology from a relational database: fully automatic generation, semi-automatic generation, and mapping an existing database to an ontology. The first is discussed in [5], the second in [6] and [7], and [8] and [9] provide examples of the last. Although using an ontology is a new trend in data integration, it has many issues and challenges. For instance, it consumes significant amounts of time to define all the classes and their relationships, and it may require data from the relational database to be dumped to the ontology regularly. Moreover, the Web Ontology Language (OWL) makes the open-world assumption, whereas relational databases and SQL make the closed-world assumption. System performance is another major issue with the ontology-based approach.

A. Motivation

Due to the credit crunch, two or more individual companies may want to join together in order to increase their capital resources. During mergers and acquisitions, IT personnel face the problem of how to rapidly and efficiently integrate the existing IT environments, including the infrastructure and the core business systems. An empirical survey [10] identifies four system integration alternatives: Take-Over, Best of Breed, Disconnection and New System.
Developing an entirely new system is unattractive because it wastes resources, and take-over and best of breed are not suitable in an economic depression. Disconnection combined with data integration is therefore a feasible approach, while the three data integration approaches described above are not suitable in the current economic climate.

B. Contribution of the paper

This paper proposes an alternative approach to data integration, called Localized Data Integration. It does not require applications coming from a different company to understand all of the data source schemas. The paper also proposes a transformation system to facilitate and support Localized Data Integration. It provides an efficient way to keep all of the existing systems running and provides an integrated view to users during enterprise mergers and acquisitions. The system architecture and the details of each component are discussed in the paper, focusing on the role of the RDF-based mapping file, how the system cooperates with the WordNet database, and the process of the translator module. Other components such as the Query Engine and the Staging Area are not discussed in detail here but will be covered in a following paper.

II. LOCALIZED DATA INTEGRATION

The materialization, virtualization and ontology-based data integration approaches can all provide the user with every attribute stored in the different data sources. They therefore require a clear understanding of all the data source schemas at the beginning of the integration, which is a time-consuming and difficult process. Localized data integration, in contrast, only provides the user with integrated data that is already defined in the local database source schema. For instance, suppose a local schema has two attributes, an employee's first name and last name, while a second local schema has one further attribute, the employee's initial. Under localized data integration, the first schema's user cannot obtain any information about the employee's initial after data integration.
This is because the system assumes that the attribute is not needed by the first schema's user; otherwise it would have been defined in the first schema. The scope of the localized approach thus differs from approaches such as materialization and virtualization, which use a global schema to include all attributes from all data sources. Localized Data Integration can be defined as a triple {L, M, W}, where L is a local schema, W is the WordNet database and M is the mapping between L and W. Each local application can only obtain the attributes defined in its corresponding local data source. In the example above, the result set of local schema 1 (LC1) can be defined as R = R_LC1 ∪ (R_LC1 ∩ R_LC2), where R_LC1 ∩ R_LC2 is the data from the second source restricted to the attributes that LC1 also defines.

III. RDF-BASED SEMANTIC SCHEMA MAPPING TRANSFORMATION SYSTEM (RSSMTS)

The objective of the proposed system, the Resource Description Framework (RDF)-based Semantic Schema Mapping Transformation System (RSSMTS), is to minimize the time and resources needed to integrate data or information coming from different data sources or enterprises. Unlike other approaches, the data from the different enterprises is not transferred to a new central repository: the data remains at the original sources and the transformation system returns a unified view to the user by consolidating the results. The RSSMTS uses a noun stored in the WordNet database to describe the meaning of each table column or entity attribute, and it provides a tool that generates a mapping file in RDF/XML format by screening the database schema. To illustrate the details of the RSSMTS, two sample HR schemas are used: one from the Oracle 10g sample database [11] and the other from IBM DB2 [12].

One of the major functions of the proposed system is to determine which data sources contain related information. During mergers and acquisitions, two groups of users coming from different companies would normally need a good understanding of both databases' schema metadata, which in a large and complex database system is a very time-consuming and difficult process. In the RSSMTS, the integration process begins with each database owner defining the semantics of each entity attribute or table column. The system can then produce an unmatched report using all of the mapping files, which speeds up the integration process.

A. System Architecture

The RSSMTS is composed of five components: the RDF mapping files, the WordNet database, the Translator Module, the Query Engine and the Staging Area. The architecture of the RSSMTS is shown in Figure 1. The RDF mapping file stores the database schema metadata and maps the metadata to a distinct concept defined in WordNet. The WordNet database [13] is used to obtain the lexical relationships. The translator module finds the semantically related information in the different databases by using the mapping files and the WordNet database. Query reformulation and execution run in the Query Engine, which collects the result sets from the different data sources and stores them in the Staging Area.
Figure 1. The System Architecture of RSSMTS
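As a minimal illustration of the localized result-set rule R = R_LC1 ∪ (R_LC1 ∩ R_LC2) from Section II, the sketch below projects rows obtained from a second source onto the attributes the local schema defines. The class, method and attribute names are illustrative stand-ins, not part of the RSSMTS implementation.

```java
import java.util.*;

// A minimal sketch (not the authors' code) of localized integration:
// rows from a second source are projected onto the attributes that the
// local schema defines, so a local user never sees attributes absent
// from the local schema.
public class LocalizedView {

    // Keep only the attributes the local schema defines in each row.
    static List<Map<String, String>> restrict(Set<String> localAttributes,
                                              List<Map<String, String>> remoteRows) {
        List<Map<String, String>> projected = new ArrayList<>();
        for (Map<String, String> row : remoteRows) {
            Map<String, String> kept = new LinkedHashMap<>();
            for (String attr : localAttributes) {
                if (row.containsKey(attr)) {
                    kept.put(attr, row.get(attr));
                }
            }
            projected.add(kept);
        }
        return projected;
    }

    public static void main(String[] args) {
        // Local schema 1 (LC1) defines first and last name only.
        Set<String> lc1 = new LinkedHashSet<>(Arrays.asList("first_name", "last_name"));

        // A row from the second source, whose schema also has an initial.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("first_name", "Ada");
        row.put("last_name", "Lovelace");
        row.put("initial", "A");

        // The LC1 user receives the row without the "initial" attribute.
        System.out.println(restrict(lc1, Collections.singletonList(row)));
        // prints [{first_name=Ada, last_name=Lovelace}]
    }
}
```

The projection mirrors the first-name/last-name/initial example: the "initial" attribute is silently dropped for the LC1 user.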
A web application can retrieve data from the different data sources by executing a standard SQL statement. For example, a user in company A can write a SQL statement using company A's database schema metadata; the user is not required to have any knowledge of the other company's database schema metadata. The transformation system analyzes the SQL statement and determines which information the application requires, for example the column names mentioned in the SQL statement. The translator obtains the corresponding semantic data from an RDF mapping file and acquires semantic relations such as synonyms and hypernyms from the WordNet database through its Java APIs. The transformation system then uses these semantic relations to search another RDF mapping file for the related columns. The Query Engine reformulates the SQL statement and issues and executes it on the different database platforms. The corresponding databases return their results and send them back to the transformation system, where they are stored in the Staging Area. Finally, the Query Engine consolidates all the results and returns them to the user.

B. RDF-based Mapping File

The Resource Description Framework (RDF) is a language for representing information about resources on the World Wide Web [14]. The RSSMTS uses this framework to represent the database schema metadata and the semantic schema coming from the different data sources. Two steps are used to produce the mapping file. First, the RSSMTS provides a mapping generator which creates a mapping file in RDF/XML format by querying the data dictionary stored in the database; in an Oracle database, for example, the mapping generator retrieves the data dictionary through the dba_tables and dba_tab_cols views. Figure 2 shows a mapping file created by the mapping generator.
[Figure 2 content lost in extraction; the surviving values are "salary", "employee", "worker" and "regular payment".]
Figure 2. Mapping File in RDF/XML Format
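The RDF/XML of Figure 2 did not survive extraction intact. A plausible reconstruction is sketched below, reusing the property names that appear in the SPARQL query of Figure 3 (terms:table-name, terms:column-name, terms:wordnet-sense, terms:wordnet-hypernym); the namespace URI and nesting are assumptions rather than the authors' exact format.

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:terms="http://example.org/rssmts/terms#">
  <!-- One table of the Oracle HR schema; the terms: namespace URI is illustrative -->
  <rdf:Description rdf:about="#employees">
    <terms:table-name>employees</terms:table-name>
    <terms:column rdf:parseType="Resource">
      <terms:column-name>salary</terms:column-name>
      <terms:wordnet rdf:parseType="Resource">
        <!-- Filled in by the integrator from the WordNet Browser -->
        <terms:wordnet-sense>salary</terms:wordnet-sense>
        <terms:wordnet-hypernym>regular payment</terms:wordnet-hypernym>
      </terms:wordnet>
    </terms:column>
  </rdf:Description>
</rdf:RDF>
```

The generator would emit everything except the two WordNet values, which the integrator supplies, as described next.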
All of the contents of the mapping file are produced by the mapping generator except the "terms:wordnet" predicate. Second, an integrator completes the mapping file by filling in the values of the "terms:wordnet-sense" and "terms:wordnet-hypernym" objects. The wordnet-sense defines the meaning of a particular column, and the wordnet-hypernym defines a word that is more generic than the word given as the wordnet-sense. The rule is that all of these words must come from the WordNet database and must be nouns. For example, for the column "salary" of the "employees" table, the integrator can find the sense and hypernym by searching the WordNet database through the WordNet Browser; in this case the sense of salary is "salary" and the hypernym is "regular payment".

C. The Translator of RSSMTS

Three tasks are performed by the translator. First, it obtains the column sense from a local mapping file. Second, it interfaces with the JWI library and acquires the semantic relations from the WordNet database; the RSSMTS uses the sense and the hypernym to identify the meaning of each column. Finally, it uses the semantic relations to find the matching columns defined in another mapping file. The RSSMTS uses SPARQL [15] to query the RDF-based mapping file. A simple SPARQL query for searching for the employee salary column is shown in Figure 3; it returns the bindings of "columnSense" and "columnHypernym".

select ?columnSense ?columnHypernym
where {
  ?x terms:table-name 'employees' .
  ?x terms:column ?y .
  ?y terms:column-name 'salary' .
  ?y terms:wordnet ?z .
  ?z terms:wordnet-sense ?columnSense .
  ?z terms:wordnet-hypernym ?columnHypernym
}

Figure 3. A Simple SPARQL Query
The translator can find further semantic relations based on the query results, columnSense and columnHypernym. For example, the synonyms of salary include wage, pay, earnings and remuneration. The translator then checks whether any of these synonyms exist in another mapping file, and from the matched synonyms it can obtain the corresponding column names. A fragment of the transformation system's source code is shown in Figure 4. It creates a translator instance for an RDF mapping file and uses the table name and column name as parameters to obtain the sense by calling the getColumnSenses method. The transformation system then creates another translator instance for the other data source's mapping file, and finds the related columns in the second data source by calling the getColumnName method.

String fileNameA = "oracle-hr-rdf.xml";
Translator translatorA = new Translator(fileNameA);
String tableName = "employees";
String columnName = "salary";
Map senses = translatorA.getColumnSenses(tableName, columnName);

String fileNameB = "ibm-db2-rdf.xml";
Translator translatorB = new Translator(fileNameB);
List columns = translatorB.getColumnName(
    translatorB.getMatchedSynonyms(senses));

Figure 4. A piece of transformation system source code written in Java
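To make the matching step concrete, the self-contained sketch below imitates the translator's synonym matching. The WordNet lookup and the two RDF mapping files are replaced by hard-coded maps, and the method bodies are assumptions for illustration only; the method names follow Figure 4 but this is not the authors' implementation.

```java
import java.util.*;

// Illustrative stand-in for the RSSMTS translator's matching step.
// WordNet and the RDF mapping files are replaced by hard-coded maps.
public class TranslatorSketch {

    // sense -> synonyms, as WordNet would return them for a noun.
    static final Map<String, Set<String>> SYNONYMS = Map.of(
        "salary", new LinkedHashSet<>(Arrays.asList("wage", "pay", "earnings", "remuneration"))
    );

    // Second data source's mapping file, reduced to sense -> column name
    // (e.g. a DB2 column tagged with the wordnet-sense "wage").
    static final Map<String, String> MAPPING_B = Map.of(
        "wage", "SALARY",
        "bonus", "BONUS"
    );

    // Expand a column's sense into the set of itself plus its synonyms.
    static Set<String> getMatchedSynonyms(String sense) {
        Set<String> result = new LinkedHashSet<>();
        result.add(sense);
        result.addAll(SYNONYMS.getOrDefault(sense, Collections.emptySet()));
        return result;
    }

    // Return the second source's columns whose sense matches any synonym.
    static List<String> getColumnName(Set<String> synonyms) {
        List<String> columns = new ArrayList<>();
        for (Map.Entry<String, String> e : MAPPING_B.entrySet()) {
            if (synonyms.contains(e.getKey())) {
                columns.add(e.getValue());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        // "salary" in source A matches DB2's SALARY column via the synonym "wage".
        System.out.println(getColumnName(getMatchedSynonyms("salary")));
    }
}
```

In the real system the SYNONYMS map would be served by the WordNet database through JWI, and MAPPING_B by a SPARQL query over the second RDF mapping file.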
IV. ADVANTAGES OF RSSMTS

The RSSMTS has several advantages. Unlike other approaches, it provides a simple method for data integration: the integrator can create the RDF-based mapping files using the mapping generator, and the RSSMTS can produce an unmatched report telling the integrator which columns cannot be mapped. Compared to the ontology and materialization approaches, the RSSMTS is a real-time data integration solution, and it is not necessary to understand the contents of each relation. Compared to the virtualization approach, the RSSMTS does not require the integrator to understand all of the participating data sources' or databases' schema metadata at the beginning of the integration.

V. CONCLUSION AND FUTURE WORKS

Different data integration approaches and applications have been discussed, and we have proposed an alternative to data integration: Localized Data Integration. We have also proposed an integration system, the RDF-based Semantic Schema Mapping Transformation System, for Localized Data Integration. This paper has presented the details of the system and the usage of the WordNet database within it, and has discussed the architecture and the functions of the system. The system provides an efficient way to keep all of the existing systems running and to provide an integrated view for applications. Future work will concentrate on refinement of the RSSMTS and focus on the Query Engine and the Staging Area.

REFERENCES
[1] M. Mohania and M. Bhide, "New trends in information integration", in Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication, pp. 74-81, 2008.
[2] S. Chawathe, H. Garcia-Molina, J. Hammer, K. Ireland, Y. Papakonstantinou, J. Ullman, and J. Widom, "The TSIMMIS project: integration of heterogeneous information sources", in Proceedings of the IPSJ Conference, pp. 7-18, Oct. 1994.
[3] A. Y. Levy, A. Rajaraman and J. J. Ordille, "Querying heterogeneous information sources using source descriptions", in Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96), pp. 251-262, Sep. 1996.
[4] E. Rahm and P. A. Bernstein, "A survey of approaches to automatic schema matching", The VLDB Journal, Vol. 10, Issue 4, pp. 334-350, 2001.
[5] G. Dogan and R. Islamaj, "Importing relational databases into the semantic web", Maryland Information and Network Dynamics Lab Semantic Web Agents Project, http://www.mindswap.org/webai/2002/fall/Importing_20Relational_20Databases_20into_20the_20Semantic_20Web.html.
[6] I. Astrova and B. Stantic, "Reverse engineering of relational databases to ontologies: an approach based on an analysis of HTML forms", in Proceedings of the 1st European Semantic Web Symposium (ESWS), Heraklion, Crete, Greece, LNCS, pp. 327-341, 2004.
[7] S. M. Benslimane, D. Benslimane and M. Malki, "Acquiring OWL ontologies from data-intensive web sites", in Proceedings of ICWE'06, Palo Alto, California, USA, pp. 361-368, 2006.
[8] J. Barrasa, O. Corcho and A. Gomez-Perez, "FundFinder – a case study of database-to-ontology mapping", Semantic Integration Workshop, ISWC 2003, Sanibel Island, Florida, Sept. 2003.
[9] J. Barrasa, O. Corcho and A. Gomez-Perez, "R2O, an extensible and semantically based database-to-ontology mapping language", Ontology Engineering Group, Departamento de Inteligencia Artificial, Facultad de Informatica, Universidad Politecnica de Madrid, Spain.
[10] P. Wirz and M. Lusti, "Information technology strategies in mergers and acquisitions – an empirical survey", in Proceedings of the Winter International Symposium on Information and Communication Technologies, ACM International Conference Proceeding Series, Vol. 58, pp. 1-6, 2004.
[11] Oracle Database Sample Schemas 10g Release 1 (10.1), Sample Schema Scripts and Object Description, HR schema, pp. 4-5 to 4-8, Oracle Corporation.
[12] DB2 Universal Database, The Sample Database, http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/r0001094.htm, IBM.
[13] WordNet Database 2.1 for Windows, Cognitive Science Laboratory, Princeton University, downloaded Mar. 2009 from http://wordnet.princeton.edu/obtainty.
[14] F. Manola and E. Miller, "RDF Primer", W3C Recommendation, 10 February 2004, http://www.w3.org/TR/rdf-primer/.
[15] E. Prud'hommeaux and A. Seaborne, "SPARQL Query Language for RDF", W3C Recommendation, http://www.w3.org/TR/rdf-sparql-query.