Practical Context Transformation for Information System Interoperability

Holger Wache, Heiner Stuckenschmidt
Center for Computing Technologies, University of Bremen, Germany
{wache, heiner}@tzi.de

Abstract. This paper discusses the use of contextual reasoning, i.e. context transformation, for achieving semantic interoperability in heterogeneous information systems. We introduce terminological contexts and their explication in terms of formal ontologies. Using a real-world example, we compare two practical approaches to context transformation: one based on transformation rules, the other on re-classification of information entities in a different terminological context. We argue that both approaches supplement each other and develop a unifying theory of context transformation. A sound and complete context transformation calculus is presented that covers both transformation approaches.

1 Introduction

Mediators [5] are middleware components that provide a flexible integration of several information systems such as database management systems, geographical information systems, or the World Wide Web. A mediator combines, integrates, and abstracts the information provided by the sources [24], tackling the same problems that are discussed in the federated database research area, i.e. structural heterogeneity (schematic heterogeneity) and semantic heterogeneity (data heterogeneity) [15]. Structural heterogeneity means that different information systems store their data in different structures. Semantic heterogeneity concerns the content of an information item and its semantics. In rule-based mediators [6], rules are mainly designed to reconcile structural heterogeneity; discovering and reconciling semantic heterogeneity problems plays a subordinate role. To reconcile semantic heterogeneity problems, however, the semantic level also has to be considered [11, 4, 14]. Contexts are one way to capture this semantic level. A context [13] contains “meta data relating to its meaning, properties (such as its source, quality, and precision), and organization” [19]. A value has to be considered in its context and may be transformed into another context (so-called context transformation).

In this paper, we review two approaches to the implementation of context transformation in mediators, namely functional context transformation and context transformation by re-classification. We discuss their use for providing semantic interoperability among heterogeneous information systems. We propose a unifying theory of practical context transformation that covers both approaches and present a sound and complete context transformation calculus.

The paper is structured as follows: Section 2 introduces the problem of semantic heterogeneity and motivates the use of contextual knowledge. In Section 3 we illustrate an integration process using an example from a real application. The use of the different transformation approaches is discussed in Section 4. In Section 5 we present the unifying theory of context transformation and the transformation calculus, including sketches of the soundness and completeness proofs.

2 Context, Ontologies and Information Systems

In principle, there are two possible solutions for achieving semantic interoperability between heterogeneous information systems [7]: the tight coupling strategy, which creates a new information system with a unified semantics, and the loose coupling approach, which does not touch the individual semantics and instead provides transformations on a semantic level. There are strong arguments in favor of the loose coupling approach [12]. First, the use of individual semantics allows small representations and efficient reasoning within the individual system. Second, the semantics in a multi-context system is much more flexible and can be used to handle inconsistencies that would become threatening when trying to create a single context with a global semantics.

2.1 Contexts and Semantic Heterogeneity

In order to achieve semantic interoperability in a heterogeneous information system, the meaning of the information that is interchanged has to be understood across the systems. Semantic conflicts occur whenever two contexts do not use the same interpretation of the information. Goh identifies three main causes for semantic heterogeneity [7]:

– Confounding conflicts occur when information items seem to have the same meaning, but differ in reality, e.g. due to different temporal contexts.
– Scaling conflicts occur when different reference systems are used to measure a value. Examples are different currencies or marks.
– Naming conflicts occur when naming schemes of information differ significantly. A frequent phenomenon is the presence of homonyms and synonyms.

It has been argued that semantic heterogeneity can be resolved by transforming information from one context into another. In [18] and [7] context transformation methods are developed. The scope of these approaches is mainly on the conversion between different scales, i.e. on scaling conflicts. In our work we address the problem of providing practical solutions for the context transformation problem that are not only capable of converting between different scales, but also cover the transformation of application-specific vocabularies. We therefore argue for a semantic interoperability approach that is based on transformations between individual terminological contexts.

2.2 Ontologies as Contextual Information

Ontologies have set out to overcome the problem of implicit and hidden knowledge by making the conceptualization of a domain explicit. This corresponds to one of the definitions of the term ontology most popular in computer science [8]: “An ontology is an explicit specification of a conceptualization.” An ontology is used to make assumptions about the meaning of a term available. It can also be seen as an explication of the context a term is normally used in.

Fig. 1. The Role of Context in Information Systems Interoperability. (adapted from [12])

Kashyap and Sheth [12] discuss the role of contexts and ontologies for semantic interoperability (compare Figure 1). According to their view, contexts are used to abstract from the content of an information repository. So-called metadata contexts describe the information content of a repository and therefore make it possible to decide whether a repository contains relevant information. Additionally, conceptual contexts are introduced. A conceptual context is an ontology that defines the meaning of terms used in the metadata context and the repository. While Kashyap and Sheth define relationships between the ontologies, our approach relies on the use of a shared basic vocabulary that is used to derive inter-ontology relationships. We propose to use formal ontologies in order to capture and explicate the assumptions made by each context, because they can be used as a basis for automatic translations between vocabularies that preserve the intended meaning of the translated vocabulary.

3 Context-Based Semantic Integration

We illustrate the need for context modeling and transformation with a real-world example, which also serves to illustrate our approach. Two sources, CORINE and ATKIS, provide geological information.

The first source, CORINE [3], stores its data in two tables¹. The first table is called clc ns2. Every entry represents one geological item. clc ns2 contains the attributes CLC NS2 ID (identifier), AREA (size in ha), and NS (classification). The last attribute, NS, refers to a catalog in which all items are classified. In CORINE, the catalog contains more than 64 concepts. The second table, clc ns2 pol, stores polygons describing the area of an item. Its attributes are CLC NS ID (reference to clc ns2), VERT ID (identifier of a vertex), and NEXT V ID (identifier of the following vertex).

In the second source, ATKIS [1], a geological item is stored in one table, atkisf, with the attributes id, fl (size in m²), and folie (classification). Analogously to CORINE, the last attribute, folie, refers to a classification catalog containing more than 250 terms. However, the catalogs of CORINE and ATKIS are different, and they are based on different conceptualizations.

The task in this example is to convert the data of the CORINE database into the ATKIS database. This transformation can be viewed as a special case of an integration task that demonstrates all the problems which can occur. Besides the obvious structural heterogeneity problems, the main problem lies in the reconciliation of the semantic heterogeneity: both geological information sources classify the common areas in different catalogs. A mediator system that tries to query information from one system in terms of the other will fail or return wrong results, because it will not be able to unify the land-use classifications and will not recognize that the returned size of an area refers to a different scale. Consequently, the classification of CORINE has to be converted into the ATKIS catalog. Moreover, the size has to be converted between the different units of measure (hectares and square meters). Both conversions are the challenge for the semantic integration and are handled by the two kinds of context transformation.

3.1 A Minimal Modeling Language

In order to capture the semantics of the different land-use classifications used in the systems we want to integrate, we have to describe ontologies of land-use classes. We use a description logic in order to build these ontologies. The features of this language are described below. Description logics are a family of logic-based representation formalisms that cover a decidable subset of first-order logic. Description logics are mostly used to describe terminological knowledge in terms of concepts and binary relations (slots) between concepts. These can be used to define a concept term by necessary and sufficient conditions that have to be fulfilled by all instances of the concept. We use a minimal description logic that consists of conjunction, disjunction, and negation as well as existentially and universally qualified range restrictions on slots. These language elements can be used to describe concept expressions with the following syntax and semantics:

    syntax                  semantics
    concept-name            I(C) ⊆ D
    top                     I[top] = D
    bottom                  I[bottom] = ∅
    (and concept+)          I[(and C1 C2)] = I[C1] ∩ I[C2]
    (or concept+)           I[(or C1 C2)] = I[C1] ∪ I[C2]
    (not concept)           I[(not C)] = D − I[C]
    (all role concept)      I[(all r C)] = {x ∈ D | ∀y ∈ D : ⟨x, y⟩ ∈ I[r] ⇒ y ∈ I[C]}
    (some role concept)     I[(some r C)] = {x ∈ D | ∃y ∈ I[C] : ⟨x, y⟩ ∈ I[r]}

¹ For readability reasons, the tables of both sources are simplified. Some attributes are omitted.
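The semantics in the table above can be made concrete for a finite domain. The following is a minimal sketch in Python, not part of the original approach: the tuple encoding of expressions, the function name `extension`, and the toy domain with its individuals are all assumptions made for illustration.

```python
def extension(expr, domain, concepts, roles):
    """Return I[expr]: the set of domain elements belonging to the concept.

    expr is either a concept name (str) or a tuple such as ("and", C1, C2).
    concepts maps concept names to subsets of the domain (hypothetical data).
    roles maps role names to sets of (x, y) pairs over the domain.
    """
    if expr == "top":
        return set(domain)
    if expr == "bottom":
        return set()
    if isinstance(expr, str):  # concept name: I(C) is a subset of D
        return set(concepts[expr])
    op = expr[0]
    if op == "and":
        result = set(domain)
        for sub in expr[1:]:
            result &= extension(sub, domain, concepts, roles)
        return result
    if op == "or":
        result = set()
        for sub in expr[1:]:
            result |= extension(sub, domain, concepts, roles)
        return result
    if op == "not":
        return set(domain) - extension(expr[1], domain, concepts, roles)
    if op in ("all", "some"):
        _, role, sub = expr
        targets = extension(sub, domain, concepts, roles)
        pairs = roles.get(role, set())
        result = set()
        for x in domain:
            fillers = {y for (a, y) in pairs if a == x}
            # "all" holds vacuously for elements without any role filler.
            if op == "all" and fillers <= targets:
                result.add(x)
            if op == "some" and fillers & targets:
                result.add(x)
        return result
    raise ValueError(f"unknown constructor: {op!r}")


# Toy interpretation (assumed data): two parcels and two plants.
domain = {"p1", "p2", "oak", "reed"}
concepts = {"trees": {"oak"}, "shrubs": set(), "broad-leaved-trees": {"oak"}}
roles = {"vegetation": {("p1", "oak"), ("p2", "reed")}}

forest = ("and",
          ("all", "vegetation", ("or", "trees", "shrubs")),
          ("some", "vegetation", "broad-leaved-trees"))
members = extension(forest, domain, concepts, roles)
```

In this toy interpretation only `p1` satisfies both restrictions, so `members` contains exactly that parcel. The sketch enumerates a fixed finite domain; a description logic reasoner instead decides subsumption between concept definitions for all possible interpretations.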

Concept expressions are used to define concepts. A concept is defined using the keyword concept followed by the name of the concept and a concept expression that restricts the set of entities belonging to the concept to a subset of the whole domain of discourse. The meaning of a concept definition is defined by an interpretation. The tuple ⟨D, I⟩ is an interpretation if D is a domain and I is an extension function that maps concept names to subsets of D and role names to subsets of D × D. Using this interpretation, the semantics of the language constructs is given by the equations in the table above. This Tarskian-style semantics offers a formal framework for the comparison of different terminologies.

3.2 The Integration Process

Step 1: Authoring of Shared Terminology

Our approach relies on the use of a shared terminology in terms of properties used to define different concepts. This shared terminology has to be general enough to be used across all information sources to be integrated but specific enough to make meaningful definitions possible. The top-level concept parcel is defined below:

(concept parcel
  (and (all ground ground-type)
       (all coverage structure)
       (all cultivation plant)
       (all vegetation plant)
       (all use use-type)))

For the given integration task the shared terminology mainly consists of ontologies that define concepts a parcel can be related to, namely ground types, artificial structures built on a parcel, different kinds of plants that may grow on a parcel, and general types of land use.

Step 2: Annotation of Information Sources

Once a common vocabulary exists, it can be used to annotate different information sources. In this case annotation means that the inherent concept hierarchy of an information source is extracted and each concept is described by necessary and sufficient conditions using terms from the vocabulary defined in the shared terminology. The result of this annotation process is an ontology that contains a definition of the terminological context. The meaning of land-use classes from both classifications is formally defined by further restricting the range of the slots attached to the general parcel concept. Here is an example:

(concept broad-leaved-forest
  (and parcel
       (all coverage no-stuctures)
       (all ground land)
       (all vegetation (or trees shrubs))
       (some vegetation broad-leaved-trees)))

The above example is taken from one of the entries in our CORINE data set used in the case study. This entry is classified as 'broad-leaved-forest', which is a subclass of parcel that can be identified by the absence of water, a lack of artificial structures, and a vegetation that may consist of trees and shrubs where some of the vegetation consists of broad-leaved trees.

We use so-called templates to assign a contextual description to data structures in a repository. A template is a five-ary predicate:

T = ⟨name, context, type, value⟩@source

A template has a name and a context addressing the semantics of the concept. The name of the context refers to the corresponding ontology that explicates the terminological context. Further elements of a template are the type, determining the data type; the value, referencing the information item itself; and the last identifier, source, denoting which source the template belongs to. The value can be a simple value, e.g. a number or a string, or a list of attributes. An attribute consists of a name and a template the attribute refers to. In the latter case the type is complex. In the case of simple values the type slot contains the basic data type. Templates with attributes can represent tables (relations) in a relational data structure model. The attributes of the template are the attributes of the relation. The values of the template attributes are templates encapsulating the basic data types of the relation attributes².
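The template structure described above can be sketched as a small data type. This is an illustrative Python sketch under stated assumptions, not the representation used in the original system: the class name `Template`, the method `is_complex`, and all context names and values in the example row are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Template:
    """One template <name, context, type, value>@source."""
    name: str
    context: str   # name of the ontology concept that explicates the meaning
    type: str      # basic data type, or "complex" for attribute lists
    value: Union[int, float, str, Dict[str, "Template"]]
    source: str = ""  # left empty in nested templates

    def is_complex(self) -> bool:
        # A complex template holds attributes, each referring to a template.
        return isinstance(self.value, dict)


# Hypothetical instance for one row of the atkisf table; the context names
# and the values are invented for illustration only.
atkis_row = Template(
    name="atkisf",
    context="parcel",
    type="complex",
    value={
        "id": Template("id", "identifier", "integer", 1),
        "fl": Template("fl", "area-in-square-meters", "float", 25000.0),
        "folie": Template("folie", "atkis-catalog", "string", "some-atkis-class"),
    },
    source="ATKIS",
)
```

Nesting templates inside the attribute dictionary mirrors the way a relation is represented: the outer template stands for the table row, and each attribute carries its own context and basic data type.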
An example template for the ATKIS table is given below. The template contains variables and therefore describes a set of instances found in the database.

, fl -> , folie -> }>@ATKIS

² For readability reasons, the source is omitted in the nested templates.

Step 3: Semantic Translation of Information Entities

The purpose of the steps described above was to lay a base for the actual translation step. The existence of a terminological context model for all information sources to be integrated enables a translation method to work on the contextual knowledge. Two different types of translation by context transformation have been investigated:

– Rule-based functional transformation [23]
– Classification-based transformation [20]

We argued that these two kinds of context transformation supplement each other in the sense that functional transformation is well suited to resolve scaling conflicts while classification-based transformation can be used to resolve non-trivial naming conflicts [21]. The transformation step is discussed in more detail in the next section.
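As a rough illustration of how a functional transformation resolves a scaling conflict, the following Python sketch converts an area value between two contexts. It is an assumption-laden sketch rather than the rule formalism of this paper: the context names `area-in-ha` and `area-in-m2`, the registry `CONVERSIONS`, and the function `transform` are hypothetical.

```python
SQUARE_METERS_PER_HECTARE = 10_000  # 1 ha = 10 000 m²


def hectares_to_square_meters(area_ha):
    return area_ha * SQUARE_METERS_PER_HECTARE


def square_meters_to_hectares(area_m2):
    return area_m2 / SQUARE_METERS_PER_HECTARE


# Hypothetical registry keyed by (source context, target context).
CONVERSIONS = {
    ("area-in-ha", "area-in-m2"): hectares_to_square_meters,
    ("area-in-m2", "area-in-ha"): square_meters_to_hectares,
}


def transform(value, source_context, target_context):
    """Re-express value from the source context in the target context."""
    if source_context == target_context:
        return value
    try:
        convert = CONVERSIONS[(source_context, target_context)]
    except KeyError:
        raise LookupError(
            f"no transformation from {source_context} to {target_context}"
        ) from None
    return convert(value)


# A CORINE-style area given in hectares, re-expressed in square meters.
area_m2 = transform(2.5, "area-in-ha", "area-in-m2")
```

A naming conflict between two classification catalogs cannot be handled by such a numeric function, which is why the classification-based transformation is needed in addition.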

4 Context Transformation

A conceptual model of the context of each information source builds a basis for an integration on the semantic level. We call this process context transformation, because we take the information about the context of the source (in our case CORINE) and re-interpret this information in the terms of a target (ATKIS), providing a new context description for that entity within the new information source. We compare two different approaches to context transformation, namely rule-based context transformation and context transformation based on classification, and show how these two approaches can be used to integrate the example data.

4.1 Context Transformation with Rules

Context Transformation Rules (CTR’s) define a context transformation between two templates. Operationally, they have to exchange information; more precisely, CTR’s replace one template by another. An important aspect of CTR’s is that they can be applied to templates which are nested in the structure of a top-level template, i.e. to a template in an attribute. This aspect simplifies the formulation of CTR’s and improves the scalability and the flexibility of context transformation. A CTR is represented as follows:

a  b ← b1, ..., bn

The head of the rule defines the relation, the so-called context transformation relation. The relation describes which template a can be transformed into which template b. The other terms in the body, b1, ..., bn, are required to support or to restrict the context transformation. Normally the body terms are expressions, but they can also be further templates, e.g. if further information for the context transformation is needed.

We illustrate the use of CTR’s in our example. The surfaces in ATKIS and CORINE are stored with different measures of size, namely square meters and hectares. Therefore the surface value of CORINE cannot be copied but has to be converted with the factor 10000 (one hectare equals 10000 square meters). The conversion is done during the context transformation. The appropriate CTR looks like: