
Mapping of heterogeneous schemata, business structures, and terminologies

Domenico Beneventano, University of Modena and Reggio Emilia
Sabina El Haoum, University of Oldenburg
Daniele Montanari, University of Modena and Reggio Emilia and Eni SpA

Abstract

The current effort to extend the power of information systems by making use of the semantics associated with terms and structures has resulted in a need to establish correspondences between different systems to allow a rich exchange of information. This paper describes the early efforts taking place in the STASIS project to identify the issues underlying support for mapping of corresponding entities between such heterogeneous systems. The STASIS system is meant to help a user establish such mappings by exploiting a semantic environment where he/she can contribute his/her own entities and relate them with other pre-existing entities. This process needs support at the entity representation level, to encapsulate each item into an appropriately rich representation structure, and at the logical level, where the resulting model is verified for consistency towards its future use. Examples are offered and discussed to highlight the issues and propose solutions.

1. Introduction

STASIS is a research and development project(1) which aims to enable SMEs and enterprises to fully participate in electronic business by offering semantic services and applications based on the open SEEM registry and repository network(2). These will allow easy access to analyse, view, compare and distil semantics in an efficient environment to more effectively relate the business concepts of one organisation with those of another. From the point of view of implementing the STASIS semantic engine, the goals outlined above translate into requirements concerning the representation of a variety of schemata and other structured environments, giving the user the opportunity to work at different knowledge levels. These include (i) structural knowledge (knowledge deriving from the structure of the schema or business structure); (ii) lexical knowledge (the meaning of terms used to label elements in the schema or business structure); and (iii) domain knowledge (real world knowledge about meanings and their relations). These levels are often different enough to require different languages and modelling.

(1) The work reported in this paper has been partially funded by the European Community 6th Framework programme under contract FP6-2005-IST-5-034980.
(2) www.seemseed.net

In the STASIS framework a preliminary decision has been adopted to use the Web Ontology Language OWL [Bechhofer et al. 2004]. More precisely, (a) an OWL-based representation format will be used to represent structural knowledge, (b) an OWL-based representation will be used to include lexical resources like WordNet, and (c) OWL will be used to include and relate to relevant standards such as UNSPSC and the EDI standard message formats, towards the creation of a local STASIS upper level ontology. Many of these problems have already been discussed in the literature [Bouquet et al. 2006] [Jarrar 2006], and our goal is to take and develop the most promising alternatives to integrate them in a common framework.

This paper is organized as follows. Section 2 describes how to acquire a schema representation in OWL, and provides some examples of typical representations and encodings. Section 3 discusses the acquisition of external standards and their encoding in OWL; some problems and possible lines of attack are also outlined. Finally, section 4 uses examples of annotations and mappings to represent a number of interesting cases where different types of expressiveness are required, including relations among schemata, relations with background ontologies or fragments thereof, and lastly the definition and identification of logical inconsistencies of the model that could lead the system to work incorrectly or not at all.

2. A schema representation format based on OWL

In this section we introduce a representation format for schemata based on OWL. For each schema format used in STASIS (such as RDB, EDIFACT, FlatFile and XML-schema) we will define an OWL ontology which describes the schema format in an abstract way; for example, for the relational case, this ontology will include the classes Table, Column and PrimaryKey and the properties hasColumn and hasPrimaryKey. Then a schema in a given format will be represented as an instance of the related schema format OWL ontology. We illustrate this approach with respect to relational schemata, and we claim that the extension of this approach to other schema formats, such as XML schema, is straightforward; moreover, in section 3.3 we will discuss how the structure of EDI/EDIFACT documents can similarly be represented.

2.1 A Relational schema representation based on OWL

Each relational database schema with its tables and columns is mapped to instances in an OWL ontology, called RDB_OWL. Our approach is very similar to the one proposed in [Kupfer et al. 2006], where the ontology contains concepts for the terms Database, Relation and Attribute and an object property consistsOf to create a hierarchy involving them. Another related approach is presented in [Pérez de Laborda and Conrad 2005], where, using the metamodelling capabilities of OWL-Full (a class can be an instance of another class), a table is represented as a class that is itself an instance of the Table class; this approach makes it possible to represent the data and not only the schema, but on the other hand OWL-Full prevents the use of decidable inference on the resulting ontology.

The definition of our ontology RDB_OWL is a research activity of the STASIS project. Our idea is to define this ontology starting from related approaches and by considering the structure of SQL Catalogues. A preliminary version of some classes and properties of RDB_OWL is shown in Appendix A. To give an intuition of the encoding with RDB_OWL, we consider the SQL table definitions of Fig. 1. The encoding of this schema is in Appendix A.

For our discussion, in the following we will use an encoding with the abstract syntax for OWL [Antoniou et al. 2005]. We will use symbols c for classes, e for objects, p for properties between objects, and o for ontologies. Whenever useful, we will prefix classes and instances with pseudo-namespaces to indicate the ontology in which these symbols occur, e.g. o1:c1 and o2:c2 are two different classes, the first occurring in the ontology o1, the second in ontology o2. Moreover, we will use

• individual(e type(c)): e belongs to a class c.
• individual(e1 value(p e2)): e1 is related to e2 through the property p.

CREATE TABLE T2 (
  ItemID int NOT NULL,
  Name char(20),
  CONSTRAINT PK PRIMARY KEY(ItemID));

CREATE TABLE T1 (
  ItemID int NOT NULL,
  MaxTemperature int,
  CONSTRAINT PK PRIMARY KEY(ItemID),
  CONSTRAINT FK FOREIGN KEY(ItemID) REFERENCES T2);

Figure 1: A relational schema

Using the above notation the (partial) encoding of T2 is

individual(T2 type(rdb:Table))
individual(T2.PK type(rdb:PrimaryKey))
individual(T2.ItemID type(rdb:Column))
individual(T2 value(rdb:hasColumn T2.ItemID))
individual(T2 value(rdb:hasColumn T2.Name))
individual(T2 value(rdb:hasPrimaryKey T2.PK))
individual(T2.PK value(rdb:hasColumn T2.ItemID))
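An encoding of this kind can be generated mechanically from a table description. The following Python fragment is only a sketch: the Table record and the encode_table function are hypothetical names introduced here for illustration, and the naming of individuals (table name, dot, element name) is assumed from the example rather than prescribed by RDB_OWL.

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical in-memory description of one relational table."""
    name: str
    columns: list          # column names, in declaration order
    primary_key: list      # names of the columns forming the primary key
    pk_name: str = "PK"    # constraint name, as in Fig. 1


def encode_table(t: Table) -> list:
    """Return abstract-syntax assertions describing the table."""
    pk = f"{t.name}.{t.pk_name}"
    facts = [
        f"individual({t.name} type(rdb:Table))",
        f"individual({pk} type(rdb:PrimaryKey))",
    ]
    for col in t.columns:
        facts.append(f"individual({t.name}.{col} type(rdb:Column))")
        facts.append(f"individual({t.name} value(rdb:hasColumn {t.name}.{col}))")
    facts.append(f"individual({t.name} value(rdb:hasPrimaryKey {pk}))")
    for col in t.primary_key:
        facts.append(f"individual({pk} value(rdb:hasColumn {t.name}.{col}))")
    return facts


t2 = Table(name="T2", columns=["ItemID", "Name"], primary_key=["ItemID"])
for fact in encode_table(t2):
    print(fact)
```

Unlike the partial encoding above, this sketch also types every column (including T2.Name) as rdb:Column; foreign keys are left out.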

3. Reference standards acquisition in STASIS

A knowledge engineer or any other kind of user of a STASIS system may develop a number of semantic entities based on the qualifying elements in the source and target domains of the future mapping. Very often it is useful to recognize these entities among the members of some well known classification system or as elements of an EDI coding system, which may therefore act (to some extent to be qualified) as a common background for the modelling development. This requires the preliminary ontologizing of the classification or EDI coding system, i.e. loading into the system environment all these entities, their attributes, and their relations, constraints, and axioms. This material needs to be properly represented so that its semantics is explicitly associated with it and any reasoning that may be foreseen as necessary afterwards is possible.

In most products and services categorization standards (PSCS) product classes identify a number of categories organized in a hierarchy. More sophisticated PSCS include a dictionary of standardized properties used to describe product instances in more detail and allow parametric searches. These properties may be either captured by arbitrary strings or other basic types or, in a usually limited number of cases, they may be captured by an enumerated set of values. Finally, most PSCS with a dictionary of properties include a mapping between classes and properties, listing the properties recommended for each class. This is a huge task and most PSCS have only a limited subset of their categories actually qualified in this way.

Several modelling choices are required to obtain a workable semantic environment from a classification or EDI coding system, namely (a) entities may be represented as classes, or instances, or a mix of the two; (b) attributes have to be transferred accordingly; (c) relations and hierarchies have to be represented; (d) when dealing with more than one such model, coherence among the resulting representations must be ensured to have a consistent global environment; (e) the semantics may need to be enhanced or refocused for the task at hand.

Concerning this last point, it is well known that any ontology or classifier system or other modelling of a domain is defined and built to satisfy the needs of some specific future analysis. For example, UNSPSC is built following the model of a procurement system [Hepp 2005]; this results in an ad hoc hierarchy of entities, which is driven by the classification of goods as seen by a buyer, whereby items may be in the same category due to being purchased together, or by the same kind of supplier, or for the same group of end consumers. As a consequence, ice (identified by the UNSPSC code 50.20.23.02) and water (50.20.23.01) are classified as two non-alcoholic beverages (50.20.23.00), so that some later reasoning activities may yield incorrect or surprising and occasionally humorous results. Moreover, the UNSPSC classification is only five levels deep throughout (four levels are most commonly used, skipping the business function, see below). This depth may be totally inadequate for e.g. a model describing chemical molecules and their properties. This may be the most difficult hurdle to overcome when acquiring such a system for use as a reference.

In the following subsections we offer an overview of three commonly used classification schemes, namely UNSPSC, WordNet, and general EDI encodings, and we briefly discuss their use and the specific consequent issues.

3.1 UNSPSC and other Products and Services Categorization Standards

The UNSPSC (United Nations Standard Products and Services Code) is a Products and Services Categorization Standard (PSCS) currently maintained by a non-profit organization under the control of the United Nations Development Program (UNDP). The UNSPSC was developed to offer a hierarchical convention to be used to classify all products and services, specifically for the uses typical of procurement services. It consists of five levels (segment, family, class, commodity, business function) and new releases are issued semi-annually or more often.

[Hepp 2005] offers an extensive discussion about transforming a PSCS, and UNSPSC specifically, into an OWL coded ontology. It illustrates by example that the semantics are defined for spend analysis categories, hence yielding very narrow concepts when straightforwardly translated. In order to capture the semantics of the PSCS an escalation of acquisition policies is presented:

(i) Create one class for each taxonomy category and assume that the meaning of the taxonomic relationship is equivalent to rdfs:subClassOf.
(ii) Create one class for each taxonomy category and represent the taxonomic relationship using an annotation property taxonomySubClassOf.
(iii) Treat the category concepts as instances instead of classes and connect them using a transitive object property taxonomySubClassOf.
(iv) Create two concepts for each taxonomy category, one reflecting the generic product or service type and another reflecting the taxonomy concept.

The approach using rdfs:subClassOf is chosen by most available transformations of UNSPSC into products and services ontologies [Klein 2002] [McGuinness 2001]. For example, we can use the following representation for “ice” [Klein 2002].

014067 50.20.23.02

Approach (iv) does capture the original hierarchy while leaving the generic categories available to acquire the specific semantics. However, the generic categories are simply a collection of entities with no structural hierarchy defined, so that their semantic value must be redefined. Since this classification is a simple categorization, there are no further properties to be represented.
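The difference between policies (i) and (iv) can be made concrete with a small Python sketch. It assumes that the parent of a dotted four-level UNSPSC code is obtained by zeroing its last non-zero pair of digits (as with 50.20.23.02 and 50.20.23.00 above) and that taxonomy concepts under policy (iv) are linked by subClassOf; the class identifiers with the c_, gen_ and tax_ prefixes are hypothetical names chosen for this illustration, not those used in [Hepp 2005] or [Klein 2002].

```python
from typing import Optional


def parent_code(code: str) -> Optional[str]:
    """Parent category of a dotted UNSPSC code, or None for a segment.

    Assumption: the parent is obtained by zeroing the last non-zero pair.
    """
    parts = code.split(".")
    for i in range(len(parts) - 1, 0, -1):
        if parts[i] != "00":
            return ".".join(parts[:i] + ["00"] * (len(parts) - i))
    return None


def class_id(kind: str, code: str) -> str:
    """Hypothetical class identifier, e.g. unspsc:tax_50202302."""
    return f"unspsc:{kind}_{code.replace('.', '')}"


def policy_i(codes):
    """One class per category; the taxonomy is read as subClassOf."""
    axioms = []
    for code in codes:
        parent = parent_code(code)
        if parent is not None:
            axioms.append(
                f"subClassOf({class_id('c', code)} {class_id('c', parent)})")
    return axioms


def policy_iv(codes):
    """Two concepts per category: a generic one, kept outside any
    hierarchy, and a taxonomy one that carries the original hierarchy."""
    generic = [class_id("gen", code) for code in codes]
    taxonomy = []
    for code in codes:
        parent = parent_code(code)
        if parent is not None:
            taxonomy.append(
                f"subClassOf({class_id('tax', code)} {class_id('tax', parent)})")
    return generic, taxonomy


codes = ["50.20.23.00", "50.20.23.01", "50.20.23.02"]
print(policy_i(codes))
print(policy_iv(codes))
```

Under policy (i) the generated axioms state directly that ice is a kind of non-alcoholic beverage; under policy (iv) only the taxonomy concepts are related, and the generic concepts remain free to receive their own semantics.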

3.2 WordNet and other lexical ontologies

Lexical ontologies were among the first to be recognized and developed as such. There are several such ontologies, both for extensive and general use, and for specialized sectors. The case of lexical ontologies and their use for annotation of local semantic entities is simpler than the others, since the built-in hierarchies are usually defined to represent common senses and therefore can adapt better to general and diverse domains. WordNet [Fellbaum 1998] is one of the most popular lexical ontologies in use. While originally used by the natural language processing and information retrieval communities, it has also been adopted by the Semantic Web research community for use in annotation, reasoning, and ontology mapping tools. The WordNet Task Force of the Semantic Web Best Practices and Deployment Working Group aims at providing a standard conversion of WordNet into RDF/OWL as a reference point for developers [van Assem et al. 2006].

3.3 EDI messages

In [Beneventano and Montanari 2007] details are given to explain how the approach to ontologizing EDI discussed in [Foxvog and Bussler 2006] actually adopts a method similar to the one outlined in section 2 to represent the typical structures found in EDI/EDIFACT messages. In short, we can define an OWL ontology which describes the structure and the format of transaction sets, data segment groups and so on, and then we can represent the structure of each element of an EDI/EDIFACT message as an instance of this ontology.

3.4 Discussion

Using background ontologies and domain information while working on semantically intensive domain specific tasks like building mappings may be quite positive in terms of homogeneity of results, accuracy, and alignment with previous knowledge of the domain. However, this is a very expensive operation to set up and maintain, since the sources may be very large and designed to be used within different environments, and their respective hierarchies may even be (locally) logically inconsistent with each other (none of them being actually wrong, but due to the different design goals implemented by each reference). Therefore a careful approach must be chosen to allow for a reasonable use of these components.

One approach could be to use “(minimally) enclosing subsets” of each reference. Namely, we can choose to upload into the working space just the relevant elements (e.g. in most cases there is no need to include armadillos, 10.10.15.15 in UNSPSC, if we are dealing with non-alcoholic beverages). The minimality of the chosen subset may be a matter of opportunity, as sometimes it is easier to define a slightly larger subset and this may also offer a wider, more expressive angle on the domain.

Even the selection of a suitable subset leaves open the problem of associating the proper semantics with the background fragment. Obviously, if the actors on a given domain use (possibly different) reduced subsets for their work environments, then the resulting mappings may be less homogeneous than one could hope for. However, it is expected that the core of the domain will be captured and used by all, so that association differences may be minor.

Another issue concerns the acquisition of reduced subsets of each background classification or ontology, and the resulting inconsistencies hinted at before. With an approach like approach (iv) suggested by Hepp there will be no risk of inconsistencies as discussed in the previous section, since there are no predefined hierarchies that might introduce contradictory relations.

Finally, it is our expectation that the reduced size and natural choice of subset background classifications may lead to mappings which will still facilitate a subsequent common use of entities coming from different environments. This outcome will be observed and studied in subsequent work within and after the STASIS project, as well as within other initiatives.
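One possible reading of an "enclosing subset" is the set of categories of interest together with all of their ancestors in the reference hierarchy. The Python sketch below follows that reading for dotted UNSPSC codes; the function names are hypothetical, and the rule that an ancestor is obtained by zeroing the last non-zero pair of digits is an assumption based on the codes quoted in this paper.

```python
def ancestors(code: str) -> list:
    """Ancestors of a dotted UNSPSC code, nearest first.

    Assumption: each ancestor is obtained by zeroing the last
    non-zero pair of the code below it.
    """
    parts = code.split(".")
    result = []
    for i in range(len(parts) - 1, 0, -1):
        if parts[i] != "00":
            parts = parts[:i] + ["00"] * (len(parts) - i)
            result.append(".".join(parts))
    return result


def enclosing_subset(reference: set, relevant: set) -> set:
    """Relevant codes plus their ancestors, restricted to the reference."""
    subset = set()
    for code in relevant:
        subset.add(code)
        subset.update(ancestors(code))
    return subset & reference


# A tiny fragment of the reference, using only codes quoted in the text
# plus the ancestors implied by the zeroing rule.
reference = {
    "50.00.00.00", "50.20.00.00", "50.20.23.00",
    "50.20.23.01", "50.20.23.02",
    "10.00.00.00", "10.10.00.00", "10.10.15.00", "10.10.15.15",
}
working_space = enclosing_subset(reference, {"50.20.23.01", "50.20.23.02"})
print(sorted(working_space))
```

A slightly larger subset, in the sense discussed above, could be obtained by passing a class or family code instead of individual commodities and adding its descendants.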

4. Annotation and Mapping

The problem of mapping among schemata and ontologies has been widely discussed in the literature; see for example the “Mapping and Alignment” page in the wiki of the STASIS project: wiki.stasis-project.net. In this section we discuss the problem of annotation and mapping in our framework; we discuss how to define mappings between a schema and an ontology, and mappings between two schemata.

4.1 Mapping between a schema and an ontology

Given a schema and an ontology, the simplest type of mapping is a one-to-one (1-1) mapping between an element of the schema and an element of the ontology. In our approach an element of a schema is represented as an individual and the simplest (1-1) mapping with a class of an ontology is the relation “instance_of”. Under these assumptions we can annotate the element T1 with the class Ice of UNSPSC by saying that T1 is an instance of the class Ice:

individual(T1 type(unspsc:Ice))

Other types of 1-1 mappings need to be defined, e.g. between the individual T1 and an object property of a generic OWL ontology. The definition of these mappings and more generally the definition of a mapping language will be a specific STASIS project activity. This paper will therefore only consider mappings expressed as “instance_of” relations.

4.2 Mapping between schemata

In this section we consider two schemata and discuss again the simplest kind of mapping, namely a 1-1 mapping between elements. To simplify further, and without loss of generality, we actually consider two elements from the same schema. Concerning the representation of mappings between schemata, several approaches can be found in the literature. In particular, mappings can be represented as instances in an ontology of mappings, as proposed in the MAFRA framework [23] and in RDFT [Omelayenko 2002]. For the purpose of STASIS, we can argue that a specific mapping ontology seems to provide a suitable context for mapping storage, sharing and reuse, as well as reasoning. Reasoning aims at drawing conclusions, e.g. to perform semantic integration tasks. In STASIS, reasoning over the mapping ontology can be used to highlight the existence of inconsistencies.

As a first example of such a mapping ontology we consider the definition of the mapping is-a w.r.t. UNSPSC, holding between two individuals e1 and e2 where e1 is an instance of a class C1, e2 is an instance of a class C2, and C1 is a subclass of C2 in the UNSPSC ontology. This mapping can be represented by the object property unspsc:is_a having as domain and range the class unspsc:Thing; for this mapping it is easy to formally define a derivation rule from the annotation w.r.t. UNSPSC. The rule may look like this

individual(e1 type(C1)),
individual(e2 type(C2)),
subClassOf(C1 C2),
subClassOf(C1 unspsc:Thing),
subClassOf(C2 unspsc:Thing)
-> individual(e1 value(unspsc:is_a e2))

On the basis of this derivation rule and the annotations

individual(T1 type(unspsc:Ice))
individual(T2 type(unspsc:non_alcohol_beve))

we obtain

individual(T1 value(unspsc:is_a T2))
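The derivation can be mimicked with a few lines of Python. This is only a sketch of the rule as read here: annotations are given as (individual, class) pairs, the subclass relation as a set of explicitly asserted (subclass, superclass) pairs with no transitive closure computed, and derive_unspsc_is_a is a hypothetical function name.

```python
THING = "unspsc:Thing"


def derive_unspsc_is_a(annotations, subclass_of):
    """Apply the is-a derivation rule to a set of annotations.

    annotations: iterable of (individual, class) pairs
    subclass_of: set of asserted (subclass, superclass) pairs
    """
    derived = set()
    for e1, c1 in annotations:
        for e2, c2 in annotations:
            if ((c1, c2) in subclass_of
                    and (c1, THING) in subclass_of
                    and (c2, THING) in subclass_of):
                derived.add(f"individual({e1} value(unspsc:is_a {e2}))")
    return derived


annotations = [
    ("T1", "unspsc:Ice"),
    ("T2", "unspsc:non_alcohol_beve"),
]
subclass_of = {
    ("unspsc:Ice", "unspsc:non_alcohol_beve"),
    ("unspsc:Ice", THING),
    ("unspsc:non_alcohol_beve", THING),
}
print(derive_unspsc_is_a(annotations, subclass_of))
```

In a full system this step would presumably be delegated to a reasoner or rule engine over the mapping ontology; the loop above only illustrates what the rule concludes from the two annotations.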

As another simple example, let us define an is-a relation among relational tables by introducing the object property rdb:is_a having as domain and range the class rdb:Table. For example:

individual(T1 value(rdb:is_a T2))

This mapping may be instantiated by the user, or it may be derived from the structure of T1 and T2 (see Fig. 1) through a suitable derivation rule, since in T1 there is a foreign key, defined on the primary key of T1, that references the key of the table T2.
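The paper does not spell out this derivation rule. One plausible formulation, assumed here purely for illustration, is that a table is-a another table when it has a foreign key whose columns coincide with its own primary key and which references that other table. The Python sketch below applies this assumed rule to the schema of Fig. 1; all type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    columns: tuple      # referencing columns in the owning table
    references: str     # name of the referenced table


@dataclass(frozen=True)
class Table:
    name: str
    primary_key: tuple
    foreign_keys: tuple = ()


def derive_rdb_is_a(tables):
    """Assumed rule: T is-a U if a foreign key of T covers exactly
    the primary key of T and references U."""
    facts = []
    for t in tables:
        for fk in t.foreign_keys:
            if set(fk.columns) == set(t.primary_key):
                facts.append(
                    f"individual({t.name} value(rdb:is_a {fk.references}))")
    return facts


t2 = Table(name="T2", primary_key=("ItemID",))
t1 = Table(name="T1", primary_key=("ItemID",),
           foreign_keys=(ForeignKey(columns=("ItemID",), references="T2"),))
print(derive_rdb_is_a([t1, t2]))
```

A foreign key on a non-key column would not trigger the rule under this reading, since it expresses a reference between rows rather than a specialization of one table by another.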

4.3 Semantic inconsistencies

Generally speaking a semantic inconsistency (or clash) is an inconsistency arising from some schema mapping definitions. A precise definition and characterization of these inconsistencies requires the definition of the mapping language. It is a goal of the STASIS project to describe a mapping ontology in order to define and classify rigorously the concept of semantic clash, and devise ways to detect and handle semantic clashes. The following describes a preliminary scenario of the description of a semantic clash in this soon-to-be-defined mapping ontology. Assuming that an object property stasis:semantic_clash is defined, we expect to define rules like

individual(e1 value(unspsc:is_a e2)),
individual(e2 value(rdb:is_a e1))
-> individual(e1 value(stasis:semantic_clash e2))

In the following example we focus on a clash internal to some ontologizing of UNSPSC or parts thereof.

individual(e1 value(unspsc:is_a e2)),
individual(e2 value(unspsc:is_a e1)),
differentIndividuals(e1 e2)
-> individual(e1 value(stasis:semantic_clash e2))

A minimal change yields yet another example, where annotations with respect to some ontologizing of UNSPSC (or parts thereof) and with respect to WordNet generate a global semantic inconsistency.

individual(e1 value(unspsc:is_a e2)),
individual(e2 value(wordnet:is_a e1))
-> individual(e1 value(stasis:semantic_clash e2))

Note that the clash arises here due to the fact that we have declared in the mapping ontology that this condition is a clash condition. Finally, if we annotate a schema with respect to UNSPSC and have

individual(T1 value(unspsc:is_a T2))

and we annotate the schema with respect to UNSPSC like

individual(T2 value(unspsc:is_a T1))

reasoning on the mapping ontology then derives a semantic inconsistency like the following

individual(T2 value(stasis:semantic_clash T1))
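The three clash rules of this section can be checked over a set of mapping facts with a short Python sketch. The representation of facts as (subject, property, object) triples and the function name find_clashes are assumptions made for this illustration; differentIndividuals is approximated by comparing names, which presumes that distinct names denote distinct individuals.

```python
# Properties whose reverse direction clashes with unspsc:is_a,
# one per rule given in this section.
REVERSE_PROPERTIES = ("rdb:is_a", "unspsc:is_a", "wordnet:is_a")


def find_clashes(facts):
    """Return the stasis:semantic_clash triples derivable from facts.

    facts: set of (subject, property, object) triples
    """
    clashes = set()
    for e1, prop, e2 in facts:
        if prop != "unspsc:is_a":
            continue
        for reverse in REVERSE_PROPERTIES:
            if (e2, reverse, e1) not in facts:
                continue
            # The UNSPSC-internal rule also requires different individuals.
            if reverse == "unspsc:is_a" and e1 == e2:
                continue
            clashes.add((e1, "stasis:semantic_clash", e2))
    return clashes


facts = {
    ("T1", "unspsc:is_a", "T2"),
    ("T2", "unspsc:is_a", "T1"),
}
for clash in sorted(find_clashes(facts)):
    print(clash)
```

For the two annotations of the final example the UNSPSC-internal rule fires in both directions, so the sketch reports the clash between T2 and T1 as well as its mirror image.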

5. Conclusions

This paper outlines the early results obtained in the context of the STASIS project to represent several different semantic expressions that may occur in a system aiming to support the building of mappings among heterogeneous systems, so as to enable interoperability without changing the internals of the source or target system. The choice of language for the representation of these constructs needs to be carefully validated to ensure that the resulting system will be sufficiently expressive to allow full representation of all the interesting entities and the ability to reason about them and the content of the system. A preliminary exploration has been made, and a wide spectrum of issues and potential solutions has emerged. The discussion presented in this paper is a preliminary assessment of these features, defining the work to come for the representation of fragments from external ontologies, schemata and structured business messaging systems, and the mappings that may be established among them. While several issues appear hard to overcome if stated in the most general terms, it appears that many such issues may be limited by a careful choice of the size of the included external sources, while the resulting expressivity of constructs and overall use still needs to be defined and assessed.

6. References

[Antoniou et al. 2005] G. Antoniou, E. Franconi, F. van Harmelen: Introduction to Semantic Web Ontology Languages. Reasoning Web 2005: pp. 1-21.

[Bechhofer et al. 2004] S. Bechhofer, F. v. Harmelen, J. Hendler, I. Horrocks, D. McGuinness, P. Patel-Schneider, L. Stein: OWL web ontology language reference, W3C Recommendation 10 February 2004. http://www.w3.org/TR/owl-ref/

[Beneventano and Montanari 2007] D. Beneventano, D. Montanari: An OWL representation of EDI/EDIFACT documents. STASIS Technical Report available at: http://155.185.48.139/publication/StasisTR2007_1.pdf.

[Bouquet et al. 2006] P. Bouquet, L. Serafini, S. Zanobini: Bootstrapping Semantics on the Web: Meaning Elicitation from Schemas. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 505-512.

[Fellbaum 1998] C. Fellbaum: WordNet: An Electronic Lexical Database. MIT Press, 1998. See also http://wordnet.princeton.edu/. See also http://www.cs.vu.nl/~guus/public/wn-conversion.html.

[Foxvog and Bussler 2006] D. Foxvog, C. Bussler: Ontologizing EDI Semantics. ER (Workshops), Lecture Notes in Computer Science, Vol. 4231, pp. 301-311, Springer, 2006.

[Hepp 2005] M. Hepp: Representing the Hierarchy of Industrial Taxonomies in OWL: The gen/tax Approach. In: Proceedings of the ISWC Workshop on Semantic Web Case Studies and Best Practices for eBusiness (SWCASE'05), November 7, Galway, 2005, Ireland, pp. 49-56.

[Jarrar 2006] M. Jarrar: Towards the notion of gloss, and the adoption of linguistic resources in formal ontology engineering. In Proceedings of the 15th International Conference on World Wide Web (Edinburgh, Scotland, May 23 - 26, 2006). WWW '06. ACM Press, New York, NY, 497-503.

[Klein 2002] M. Klein: DAML+OIL and RDF schema representation of UNSPSC, available at www.cs.vu.nl/~mcaklein/unspsc/.

[Kupfer et al. 2006] A. Kupfer, S. Eckstein, K. Neumann, B. Mathiak: Handling Changes of Database Schemas and Corresponding Ontologies. ER (Workshops) 2006: 227-236.

[McGuinness 2001] D. L. McGuinness (2001): UNSPSC Ontology in DAML+OIL. Available at: www.ksl.stanford.edu/projects/DAML/UNSPSC.daml (Retrieved 05.11.2004).

[Omelayenko 2002] B. Omelayenko: Ontology-Mediated Business Integration. In: Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW-2002), 1-4 October, 2002, pp. 264-269.

[Pérez de Laborda and Conrad 2005] C. Pérez de Laborda, S. Conrad: Relational.OWL - A Data and Schema Representation Format Based on OWL. In: Conceptual Modelling 2005 (APCCM05), Australian Computer Society (2005) 89–96.

[van Assem et al. 2006] M. van Assem, A. Gangemi, G. Schreiber (eds.): RDF/OWL Representation of WordNet, W3C Working Draft 19 June 2006. http://www.w3.org/TR/2006/WD-wordnet-rdf-20060619/

Appendix A: RDB_OWL

In this appendix we show a preliminary version of some classes and properties of RDB_OWL.

Classes:

rdf:ID            rdfs:subClassOf   rdfs:comment
rdb:Database      rdf:Bag           The class of databases.
rdb:Table         rdf:Seq           The class of database tables.
rdb:Column        rdfs:Resource     The class of database columns.
rdb:PrimaryKey    rdf:Seq           The Primary Key of a Table.
rdb:ForeignKey    rdf:Seq           The Foreign Key of a Table.

Properties:

rdf:ID              rdfs:domain     rdfs:range       rdfs:comment
rdb:hasTable        rdb:Database    rdb:Table        A Database has a set of Tables.
rdb:hasColumn       rdb:Table       rdb:Column       A Table has a set of Columns.
rdb:hasPrimaryKey   rdb:Table       rdb:PrimaryKey   A Table has a Primary Key.
rdb:hasForeignKey   rdb:Table       rdb:PrimaryKey   A Table has a set of Foreign Keys.

The encoding of the tables in Fig. 1 is the following: