Information Integration through Reasoning on Meta-data

Tamás Benkő, Gergely Lukácsy, Attila Fokt, Péter Szeredi
IQSYS Information Systems Ltd., H-1135 Budapest, Csata u. 8.
{Benko.Tamas,Lukacsy.Gergely,Fokt.Attila,Szeredi.Peter}@iqsys.hu

Imre Kilián
Gyűrűfű Research Informatics Ltd., H-7683 Gyűrűfű
[email protected]

Péter Krauth
KFKI Computer Systems Corp., H-1135 Budapest, Tüzér u. 39-41.
[email protected]

Abstract SILK is an Enterprise Information Integration (EII) system using knowledge representation and reasoning techniques to support both mediation and integration of heterogeneous information sources. The SILK system is built around a Model Warehouse which stores models of single information sources, as well as unified models and their customised abstractions. Models include constraints describing non-structural properties of the data sources. In the process of integration, the Model Warehouse is extended with mappings, which link the models, and spell out their relationship using appropriate constraints. The paper describes several components of the SILK tool-set. The Rule Compiler is responsible for transforming the constraints to a format suitable for mediation. The Comparator tool uses structural and textual comparison techniques to pinpoint related elements of the models and to help in building the mappings. The Verifier component contains multiple reasoning modules coordinated by a central scheduler, which help in the process of integration by detecting inconsistencies in the models and the mappings. The RDF-wrapper, a recent extension of SILK, not only gives access to information available in RDF format, but also includes its own reasoning and consistency checking capabilities. Finally, we discuss a planned extension of SILK, focusing on information integration in an agent-based mobile computing environment.

1 Introduction

The paper describes work carried out within the SILK international project (System Integration via Logic and Knowledge, IST-1999-11135, 2000-2002) supported by the IST 5th Framework programme of the European Union.

The need for software systems supporting information integration is on the increase. An enterprise uses several types of information sources, including various databases and semi-structured sources (e.g. XML), often complemented by Web-based information as well. The SILK approach aims at providing support for

• bringing together heterogeneous information sources,
• building high-level common views of these,
• accessing the sources through high-level queries,
• re-structuring the sources to make them more uniform and better integrated.

SILK uses methods and techniques based on constraint programming and logic programming technology. The SILK system comprises:

• the Mediator, which supports queries spanning multiple sources and multiple abstraction levels;
• the Integrator, which helps in building integrated models, as well as customised user views of these;
• the Wrappers, which provide a common interface for accessing various information source types, such as relational and object-oriented databases, semi-structured sources, e.g. XML or RDF, as well as Web-services.

The SILK approach puts meta-information in the focus. Meta-information is represented in the form of UML-compatible models and mappings between them. In addition to the structural information, such as the class hierarchy, attributes, associations, etc., SILK also handles OCL constraints. The models, mappings and constraints are stored in the Model Warehouse repository.

The structure of the paper is as follows. We start with a brief overview of the SILK system, and introduce

Figure 1: Context of use of the SILK System.

the Model Warehouse. The core of the paper describes four components of SILK, focusing on the reasoning techniques used within them: the Rule Compiler of the Mediator, the Comparator and Verifier tools of the Integrator, and the RDF Wrapper. We then discuss future development plans, specifically the issues related to mobile and agent-based computing. The last two sections of the paper give a brief comparison of our approach with other ongoing research work, and summarise the conclusions.

2 The SILK tool-set

Figure 1 shows the subsystems of SILK and presents their interfaces with the users and external data. The SILK system is comprised of a data integration engine, dubbed the SILK Server, and the SILK Integrator, a meta-data management system and repository. The system is highly modular and scalable. All its subsystems (Integrator, Mediator, Wrapper) can be utilised separately for various levels of information integration, and can be made into stand-alone products. Several user roles are shown in Figure 1. At the bottom of the figure, the users and administrators of the information sources are shown. Casual Users (top-left) communicate with SILK only via application programs, with custom-designed interfaces, while the Business/Expert Users can directly use the query interface. The most critical person is the Integration Engineer, who is in charge of the whole integration process: importing/building models, establishing mappings, integrating the models, designing queries, etc.

2.1 SILK as an information aggregation system

Enterprise Information Integration systems like SILK are data integrators which aim to simplify the way companies access data scattered across the enterprise. SILK eliminates the need to physically upload and centralise data, unlike ETL (extraction, transformation, and loading) tools for data warehousing or content management databases. Instead, SILK leaves data where it is, leveraging meta-data in its repository across multiple backend sources to pull information transparently into new applications or portals. In this way, SILK provides a single "database veneer" for what is actually a system of virtual, federated databases. To the outside user it looks like a single database supporting the usual access operations. But behind this front-end view there are separate systems with their own code and engines; SILK, placed in front of them, makes them look like a single information source. SILK allows users to query and search data tucked away in systems across the enterprise, while also easing integration and development. SILK promotes the idea of transparency of data, so that no matter where a given piece of data lives, SILK will be able to access it. SILK utilises the best of the SQL and OQL query languages, and enables access to data, regardless of format or location, based on a single query. SILK is also a modelling solution that creates an abstraction layer sitting atop application-data silos and manages meta-data in the Model Warehouse. SILK enables views of data (models, queries, mappings) to be customised and reused; SILK makes it easy, quick and inexpensive to modify these data views when new data sources need to be added. SILK provides graphical design tools to allow developers to quickly examine source models, establish mappings between them, identify needed columns or elements, compose a query and define input/output forms. SILK also has much in common with the Semantic Web initiative, due to its conceptual modelling capabilities (ontologies) and to the wide variety of web-based information sources which can be accessed by SILK (XML, XMI, WSDL/SOAP, RDF, HTML).

2.2 SILK as a meta-data management system

SILK is, however, more than a tool for Enterprise Information Integration. It is now getting widely recognised that the key to managing data is managing meta-data, since meta-data becomes more valuable than the data itself when automating information access. To put it another way, the way you manage meta-information determines the efficiency of your actual information access. Unlike the approach used by other software vendors, SILK is based on Prolog and Constraint Logic Programming technology. This allows for implementing more complex and intelligent functionality (such as model comparison, constraint verification, and model unification) than what is achievable with more conventional technology. In this way, SILK also provides an environment for experimenting with advanced, intelligent meta-data management functionality. The current functions of this category in SILK give only a first taste of such possibilities. The SILK system is implemented in SICStus Prolog [SIC, 2003], and makes use of several SICStus constraint programming libraries. The graphical user interfaces and the majority of the Wrappers (with the sole exception of the RDF Wrapper) are implemented in Java, through the Jasper interface of SICStus. More details on the implementation of SILK can be found in [Benkő et al., 2002].

3 The Model Warehouse

We demonstrate the knowledge representation of SILK by a simple example showing the relevant features of the meta-data repository. The example is given using SILan, the SILK knowledge representation language. Note that models can be entered and displayed not only in SILan, but also via an appropriate graphical user interface. The example describes a factory producing cars and wheels (only). The model of the factory contains three classes, Product, Car, and Wheel, the first being the common base of the other two. It also contains a composition (an association) between a car and its wheels.

model Factory {
  class Product {
    attribute String serial;
    primary key serial;
  };
  class Car: Product {
    attribute String make;
    attribute Integer price;
    constraint self.wheel.size > 2;
  };
  class Wheel: Product {
    attribute Integer size;
  };
  association 'Car-Wheel' {
    connection Car;
    connection Wheel composite;
  };
};

3.1 Semantics of Object-Oriented Models

From the information integration point of view, the central elements of object-oriented models are classes and associations, since these are the carriers of information. A class denotes a set of entities called the instances of the class. Similarly, an n-ary association denotes a set of n-ary tuples of class instances called links. As we will see later, there are several ways to name instances of classes and links of associations. Classes can have attributes, which are defined as functions mapping the class to a subset of values allowed by the type of the attribute. (This is a simplification: the UML specification allows attributes with multiple values.) Besides attributes, a class can also have operations, which are similar to attributes but their domain is the product of the class and the types of its arguments. (In fact, there is no semantic difference between a 0-ary operation and an attribute.) Classes can inherit from other classes. The instances of the descendant class are all instances of the ancestor class. (This implies that attributes, associations, etc. are inherited.) In the example the classes Car and Wheel are both descendants of the class Product. Associations have connections; an n-ary association has n connections. In a binary association one of the connections can be declared composite, which means that the instance at the composite end is part of the instance at the other end, and is not part of any other instance. Specifying a multiplicity of maximum 1 has similar semantics. As an extension to the UML specification, classes are allowed to have a primary key, composed of one or more attributes. This specifies that the given subset of the attributes uniquely identifies an instance of the class. Finally, invariants can be specified for classes and associations. Invariants are statements about instances of classes (and links of associations) that hold for each of them. The constraint in the declaration of Car is an invariant stating that the size of each wheel of a car is greater than 2.
The identifier self refers to an arbitrary instance of the context, in this case the class Car.

Then two navigation steps follow. In the first step we navigate through the association 'Car-Wheel' to an arbitrary wheel of the car, and in the second step from the wheel to its size, and state that this is always greater than 2. Note that the semantics of the navigation through the association is different from that of the OCL specification. In OCL, navigation through an association with multiplicity greater than 1 results in the set of all associated values, whereas in our interpretation we navigate to an arbitrary associated value. This is why we can simply navigate further through an attribute and compare the final result of the navigation with 2. Note also that our definition maps nicely to standard Prolog execution. In addition to the object-oriented modelling paradigm, the SILan language also supports constructs from the Description Logic (DL) world, such as slots, corresponding to attributes, concepts, corresponding to classes, etc. This way conceptual models and ontologies can also be developed within SILK. Note, however, that the current version of SILK does not support DL-based reasoning.
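The "arbitrary associated value" semantics of navigation can be sketched in Python with generators, where each navigation step yields one candidate at a time, much as Prolog backtracking does. This is a toy model for illustration only; the data and helper names are invented, not part of SILK:

```python
# Toy instance data: one car, its wheels, and the 'Car-Wheel' association.
cars = [{"serial": "c1", "make": "Peugeot", "price": 9000}]
wheels = [{"serial": "w1", "size": 3}, {"serial": "w2", "size": 4}]
car_wheel = [("c1", "w1"), ("c1", "w2")]

def navigate_wheels(car):
    """Navigate 'Car-Wheel' to an *arbitrary* wheel: yield each candidate
    in turn, mirroring Prolog backtracking rather than OCL's set-valued
    navigation result."""
    for c_serial, w_serial in car_wheel:
        if c_serial == car["serial"]:
            yield next(w for w in wheels if w["serial"] == w_serial)

def invariant_holds(car):
    # The invariant self.wheel.size > 2 must hold for every choice of wheel.
    return all(w["size"] > 2 for w in navigate_wheels(car))

print(invariant_holds(cars[0]))
```

Because each navigation step produces one value per choice point, the comparison with 2 applies directly to that value, which is exactly what makes the translation to Prolog goals straightforward.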

3.2 Abstractions

For mediation, we need mappings between the separate sources and the integrated model. These mappings are called abstractions, because they often provide a more abstract view of the notions present in the lower level models. A very simple abstraction can be seen below.

abstraction (s: Peugeot::Vehicle -> c: Factory::Car) {
  constraint s.type = "Car" implies
    s.serial_number = c.serial and
    "Peugeot" = c.make and
    s.price*1000 = c.price;
};

This example links our "abstract" notion of car with some data to be found in a concrete data source about, say, Peugeot cars. This model Peugeot (not spelled out in detail here) contains a class Vehicle with attributes serial_number and price. This abstraction gives the necessary rules to derive a Car in model Factory from a Vehicle in model Peugeot, provided its type is car. The rule determining the price of c contains a simple transformation (multiplication by 1000, because this source stores the price in units of 1000). The transformation can be arbitrarily complex; the only restriction is that it has to compute the value of the derived attribute. According to UML terminology, the declaration in the head of the abstraction on the left of the arrow is called supplier, the one on the right is called client. SILK supports an arbitrary number of suppliers and clients in an abstraction.
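The effect of such an abstraction can be illustrated with a small Python sketch. The data rows and the function name are hypothetical; the real SILK Mediator works on compiled rules, not on Python functions:

```python
# Hypothetical rows from the Peugeot source (class Vehicle);
# prices are stored in units of 1000, as in the paper's example.
vehicles = [
    {"type": "Car",   "serial_number": "p1", "price": 9},
    {"type": "Truck", "serial_number": "p2", "price": 40},
]

def abstract_to_factory_car(v):
    """Apply the Peugeot::Vehicle -> Factory::Car abstraction: only
    vehicles of type "Car" qualify, and the price is rescaled by 1000."""
    if v["type"] == "Car":
        return {"serial": v["serial_number"], "make": "Peugeot",
                "price": v["price"] * 1000}
    return None

derived_cars = [c for c in map(abstract_to_factory_car, vehicles)
                if c is not None]
print(derived_cars)  # [{'serial': 'p1', 'make': 'Peugeot', 'price': 9000}]
```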

3.3 Queries

Queries can be formulated both on the information source level (such as model Peugeot) and on higher levels (e.g. model Factory). It is more convenient to use higher level models for formulating a query, as these normally unify several information sources, and/or describe a specific customised view. Here is an example of a query:

query CheapBigWheelCar
  select w.serial, w.car.serial
  from w: Factory::Wheel
  where w.size > 4 and w.car.price < 5000;

The query above selects pairs of wheel and car serial numbers of big wheels of cheap cars. The from clause declares the "starting point" of the search; the where clause defines the condition that has to hold for each solution. The number of items in the from clause is not limited. The where clause uses the same expression syntax as the invariant constraints. Notice that by querying the "abstract" Factory model, the peculiarities of the data sources, such as different price units, become hidden.

4 Compiling constraints for Mediation

To produce integrated information, SILK uses a process called mediation. Mediation decomposes complex integrated queries into simple queries answerable by individual information sources and, having obtained data from these, composes the results into an integrated form. To support mediation, SILK stores integrated models, the models of information sources, as well as the mappings between them, and uses these to drive the Query Planning submodule. The Query Planning submodule [Badea and Tilivea, 2001] understands the Mediator Logical Language (MLL), a simple language consisting of so-called Description Rules, while the Integrator component uses UML class models and OCL constraints. Therefore the models and constraints have to be translated into MLL. The next subsections first discuss the properties of MLL and our object-oriented models. Next, we describe the Rule Compiler, i.e. the component which transforms the object-oriented representation to rules and queries in MLL.

4.1 The MLL

An MLL description consists of a set of Description Rules [Badea and Tilivea, 2001]. A Description Rule is an implication, with a conjunction of atoms (atomic formulae) on both sides. The variables occurring only in the consequent of the implication are existentially quantified; all other variables are universally quantified. The general form of Description Rules is thus the following:

∀X. (p₀(X₀) ∧ … → ∃Z. q₀(Y₀) ∧ …)

where X = ⋃ᵢ Xᵢ, Y = ⋃ᵢ Yᵢ, and Z = Y \ X. The quantification is implicit, i.e. no quantifiers appear in the syntax of Description Rules. The current implementation places some other restrictions on the form of Description Rules, but these have no impact on the present discussion.
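The implicit quantification rule (Z = Y \ X) can be computed mechanically from the variable sets of the two sides. A small Python sketch, assuming variables are given as one set per atom (the representation is invented for illustration):

```python
def implicit_quantifiers(antecedent_vars, consequent_vars):
    """Given the variable sets of the atoms on each side of a Description
    Rule, classify the variables: those occurring only in the consequent
    are existential (Z = Y \\ X); all others are universal."""
    X = set().union(*antecedent_vars) if antecedent_vars else set()
    Y = set().union(*consequent_vars) if consequent_vars else set()
    return {"universal": X, "existential": Y - X}

# 'Vehicle'(SNo,'Car',Price) ---> 'Car'(SNo,'Peugeot',CPrice), CPrice = 1000*Price
q = implicit_quantifiers([{"SNo", "Price"}],
                         [{"SNo", "CPrice"}, {"CPrice", "Price"}])
print(q)  # CPrice is existential; SNo and Price are universal
```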

The main advantages of MLL are its simplicity, and the ability to represent all required properties in a uniform way. The uniform representation is very important for doing complex analysis and optimisations. The following subsections introduce the properties relevant to object-oriented models and information integration.

4.2 Representation in MLL

Instances of classes and links of associations are represented by atoms in MLL. A class with n attributes is an atom of n arguments; an n-ary association is an atom of n arguments.

Navigation. When navigating through an attribute, we simply select the argument of the atom corresponding to the attribute of the instance. Similarly, when navigating from a link of an association to one of its connections, we select the corresponding argument. For example, the expression

w.car.price < 5000

gets translated as

'Car-Wheel'(C, W),
C = 'Car'(Serial,Make,Price),
Price < 5000

where W corresponds to the wheel w in the original form and the variable Price is bound to the result of the navigation.

Compilation of Expressions. The previous example depicts the compilation of a boolean expression, and we can see that relational operators are translated into Prolog calls. The same is true for other operators and for the operations of classes. A call to an n-ary operation is translated into a call to an (n + 2)-ary predicate, where the first argument is the instance on which the operation is invoked, the last argument is the return value of the operation, while the other arguments correspond to the arguments of the operation. In most cases the compilation of expressions is straightforward, and we can take advantage of the way Prolog executes goals. Since both invariants and queries are essentially boolean expressions, they can be compiled as conjunctions of Prolog calls, and unification can be used instead of boolean equality, just as in the previous example. When the values of boolean expressions are manipulated in a more complex way, we cannot do this optimisation. For example, the expression

(c.price < 5000) = (c.serial = "s123")

is translated as

C = 'Car'(Serial,Make,Price),
number_less(Price, 5000, B),
string_equal(Serial, 's123', B)

Note that the top-level equality is optimised to a unification. When the value of a boolean expression is explicitly manipulated, we speak of reification.

4.3 Compilation of Queries

The compilation of a query is nothing more than the compilation of the classes and associations in the from clause, in conjunction with the compiled form of the boolean expression in the where clause.

Avoiding reification is one way of optimising an expression. Another way is the elimination of identical subexpressions. For example, in the query above, the navigation w.car occurs twice (once in the select and once in the where part), but there is no need to include both 'Car-Wheel'(C1,W) and 'Car-Wheel'(C2,W), since 'Car-Wheel' is a composition and therefore C1 and C2 denote identical values.

4.4 Compilation of Rules

All knowledge about the models, as described earlier, can be formulated as a set of Description Rules.

• inheritance (e.g., a car is a product)

'Car'(Serial,Make,Price) ---> 'Product'(Serial)

• functional dependency

– composition or max 1 multiplicity (a wheel is a part of a car)

'Car-Wheel'(C1,W), 'Car-Wheel'(C2,W) ---> C1 = C2

This example demonstrates that the optimisation of composition, shown in the previous subsection, is not vital: the Mediator could deduce that the two cars are the same.

– primary (unique) key (there are no two wheels with the same serial number and different sizes)

'Wheel'(Serial,Size1), 'Wheel'(Serial,Size2) ---> Size1 = Size2

• invariants: the condition part is the existence of an instance or a link and any navigation through associations from them; the consequent is the constraint of the invariant

'Car'(S,M,P), 'Car-Wheel'('Car'(S,M,P), 'Wheel'(WS,Size)) ---> Size > 2

• abstractions: the condition part consists of the existence of the suppliers in conjunction with the condition part of the top-level implies (if any); the consequent part is the right-hand side of the top-level implies

'Vehicle'(SNo,'Car',Price) ---> 'Car'(SNo,'Peugeot',CPrice), CPrice = 1000*Price
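The left-to-right execution of a compiled conjunction can be mimicked in Python. The sketch below evaluates the three goals compiled from w.car.price < 5000 over invented tuple-encoded facts (illustrative only, not the Mediator's actual data structures):

```python
# Tuple-encoded facts: 'Car-Wheel'(C, W) pairs, and for each car
# identifier the 'Car'(Serial, Make, Price) term it unifies with.
car_wheel = [("c1", "w1"), ("c2", "w2")]
car = {"c1": ("s1", "Peugeot", 9000),
       "c2": ("s2", "Ford", 4500)}

def solve():
    """Execute the compiled goals left to right, like a Prolog engine:
    enumerate 'Car-Wheel'(C, W), unify C with 'Car'(Serial,Make,Price),
    then test Price < 5000."""
    for c, w in car_wheel:                 # 'Car-Wheel'(C, W)
        serial, make, price = car[c]       # C = 'Car'(Serial,Make,Price)
        if price < 5000:                   # Price < 5000
            yield w

print(list(solve()))  # ['w2']
```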

4.5 An Example

This section gives an example of how the Mediator uses the compiled rules to answer a simple query. The example demonstrates how one can query all relevant sources by issuing a query formulated in terms of a high-level model. In the following, we give both the SILan and the compiled form of the relevant part of the Model Warehouse. Let us assume that we have two sources containing cars: one is the source of Peugeot vehicles, the other is a source of Ford cars. The high-level model is the Factory. Regarding the Peugeot source, we can use the same abstraction as shown at the end of the previous section, except for the use of model qualifiers in predicate names:

'Peugeot::Vehicle'(SNo,'Car',Price) --->
  'Factory::Car'(SNo,'Peugeot',CPrice), CPrice = 1000*Price

Let us assume furthermore that we have additional knowledge about Peugeot cars, namely that each Peugeot car is more expensive than 7000. This knowledge is expressed as a global invariant:

constraint context c: Factory::Car inv
  c.make = "Peugeot" implies c.price > 7000;

'Factory::Car'(S,'Peugeot',Price) ---> Price > 7000

The abstraction from the source of Ford cars to Factory cars is a simple one-to-one mapping:

abstraction (s: Ford::Car -> c: Factory::Car) {
  constraint s.serial_number = c.serial and
    "Ford" = c.make and
    s.price = c.price;
};

'Ford::Car'(SNo,Price) ---> 'Car'(SNo,'Ford',Price)

Consider now a simplified version of the query from Section 3.3, in which we are looking for cars cheaper than 5000:

query CheapCar
  select c.serial
  from c: Factory::Car
  where c.price < 5000;

'Factory::Car'(S,M,P), P < 5000

When answering the above query, both the Peugeot and the Ford sources are candidates for delivering solutions, since for each of them there is a rule that can generate Factory::Cars. By applying the rule generating cars from the Peugeot source, we get a contradiction. The contradiction is detected in the second step when the rule compiled from

the invariant stating that all Peugeot cars are more expensive than 7000 fires. Note that this contradiction is detected before the source is queried. The other branch of execution succeeds, since there is no constraint on the price of cars in the Ford data source.
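The pruning described above rests on bounds reasoning over the price variable. A minimal Python sketch of that idea, assuming all constraints are strict upper/lower bounds on a single variable (the real system uses CLP-style propagation over full constraints):

```python
def consistent(constraints):
    """constraints: list of (op, bound) pairs over one price variable,
    with op in {'<', '>'}. Narrow the interval and check non-emptiness."""
    lo, hi = float("-inf"), float("inf")
    for op, bound in constraints:
        if op == "<":
            hi = min(hi, bound)
        elif op == ">":
            lo = max(lo, bound)
    return lo < hi

# Peugeot branch: the query demands price < 5000, the compiled invariant
# demands price > 7000 -> contradiction, detected before querying the source.
peugeot_branch = [("<", 5000), (">", 7000)]
# Ford branch: only the query constraint applies -> satisfiable.
ford_branch = [("<", 5000)]

print(consistent(peugeot_branch), consistent(ford_branch))  # False True
```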

5 Comparison of UML models

The task of the Model Comparator is to find and connect similar elements in two sets of object-oriented model elements. This may be very useful in the case of system and data integration, ontology acquisition, or model-based application development. Mathematically, the comparison process involves the comparison of two graphs with many different kinds of nodes and leaves.

5.1 The functions of the Model Comparator

The Model Comparator provides several functions. It is able to compare models (or parts of models), and it also supports their "intelligent" exploration. The models compared may belong to the same or different abstraction levels. This means that it is possible to compare a conceptual level model with a unified or information source model, etc. The models may be composed of elements of the same or different modelling languages. For example, from the Model Comparator's point of view the UML-like and DL-like modelling constructs are not distinguished. The result of the comparison is a set of links between corresponding model elements. The Comparator is also capable of labelling the links with some generic mapping schemes, which will have to be refined by the Integration Engineer. As an example, let us consider models Factory and Peugeot, as outlined in the earlier section on the Model Warehouse. Assume also that these models contain several other classes, in addition to the ones described here. When these two models are given to the Comparator, it will compare all possible class-pairs from the two models, and will come back with a list of a few pairs with a high similarity measure. Heading this list will be the pair Factory::Car - Peugeot::Vehicle, because both contain fields describing the price and a serial number, and both are related to a class called Wheel. When the Integration Engineer confirms the relevance of this pair, the Comparator will build a skeleton mapping (abstraction):

abstraction (s: Peugeot::Vehicle -> c: Factory::Car) {
  constraint s.serial_number = c.serial and
    s.price = c.price;
};

Note that the Comparator will not be able to notice that the prices in the two models are expressed in different units.

5.2 Principles of operation

The comparison process takes into account both the structure and the content of the models in question. The Model Comparator performs comparative analysis of structured models or model parts using an extensible set of comparison methods. For elementary model elements, the way to calculate the similarity factor can be specified. For composite elements, the comparison method specifies which sub-elements have to be compared and how the total similarity factor should be calculated from those of the sub-elements. Taking two models or model elements as its input, the Model Comparator synchronously and systematically traverses the given model graphs and finds the likely related pairs of elements based on their similarity.
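A feature-weighted similarity of the kind computed by the comparison methods can be sketched as follows. The weights, the string measure, and the greedy attribute matching are all invented for illustration; the actual methods and weights in SILK are configured declaratively:

```python
from difflib import SequenceMatcher

def name_sim(a, b):
    """Elementary comparison method for identifiers: a crude string
    similarity in [0, 1] (stand-in for the Comparator's textual methods)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def class_sim(c1, c2, w_name=0.4, w_attrs=0.6):
    """Composite comparison method for two classes: weighted combination
    of the name similarity and a greedy best-match over attribute names."""
    attr_scores = [max((name_sim(a, b) for b in c2["attrs"]), default=0.0)
                   for a in c1["attrs"]]
    attr_sim = sum(attr_scores) / max(len(attr_scores), 1)
    return w_name * name_sim(c1["name"], c2["name"]) + w_attrs * attr_sim

car = {"name": "Car", "attrs": ["serial", "make", "price"]}
vehicle = {"name": "Vehicle", "attrs": ["serial_number", "price"]}
print(round(class_sim(car, vehicle), 2))
```

The shared "price" attribute and the partial "serial"/"serial_number" match pull the score up, which is why such a pair would head the Comparator's candidate list.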

5.3 Complexity of the comparison process

Because of the freedom of modelling given by the object-oriented notation, the comparison process is very complicated and has a tendency to result in combinatorial explosion. To cope with these problems we designed the Model Comparator to be configurable, modular, and interactive.

Configurability. The comparison process is controlled by a specification loaded into the interpreter from a property file written in simple Prolog syntax. This configures both the elementary and the structured comparison modules of the comparator. In practice this means that almost all aspects of the Model Comparator can be configured by these declarative descriptions.

Modularity. Modularity means that the Model Comparator is implemented as a set of so-called comparison methods. A given method is responsible for the comparison of a given type of element. There is a method for comparing the nodes of the graphs, as well as several methods for comparing different kinds of leaves, e.g. informal texts (comments), identifiers, etc. In UML, model elements are said to have features; e.g. the features of a class are its name, attributes, operations, etc. We can choose to what extent different features of a model element influence its similarity when compared to some other element (not necessarily of the same type). Such aspects of the comparison are weighted. For any combination of model element type pairs, their comparison method can be specified in the form of so-called declaration blocks. This means that model elements of the same type or of different types can be compared with each other; e.g. the similarity of two classes, or of an association with a slot, can both be specified. Each declaration block of the specification describes how the names and other features of the model elements (of these types) should be compared. They specify

• which features (sub-elements) are to be compared;

• the order and the comparison methods of these sub-element pairs;
• the way to calculate the similarity factor.

The complexity of the operation can be tuned by turning the preferred comparison methods on or off. The scale between left and right recursivity can be tuned by the ordering of the elementary and structured comparison method calls within a declaration block.

Interactivity. The model comparison is an interactive process, which means that the user can invoke the comparator several times. The result of each pass gives the starting point of subsequent passes, thereby refining the scope of links (e.g. from classes to attributes, etc.). From a technical point of view, the Model Comparator actually does an iterative deepening search. The maximal recursion depth can be specified at each invocation, ensuring that an answer is found in a short time. Given a model element, we have to find an appropriate match for it. This search is informed: the goal function is the similarity degree of the two sets of model elements. We allow a selection that does not always increase the similarity degree, so we implement a special kind of hill-climbing algorithm. Having obtained the result, it can be inspected using different similarity thresholds. This means that similarities with a weight less than the specified threshold will not be listed in the output. When an acceptable minimal similarity level is found, the Integration Engineer can confirm the good matches and discard the wrong ones. In the next step, she can request a new comparison, either focusing on elements not yet compared or enabling the Model Comparator to go deeper in the recursion. As the last step, the Model Comparator can introduce default mappings between the elements found similar. These have to be confirmed, completed, or corrected by the Integration Engineer.

6 Verification of UML models

The Verifier component of SILK is used to discover inconsistencies in models or the mappings between them. The basic idea is to transform both the structural (e.g. inheritance) information and the constraints to first order logic formulae, and use theorem proving techniques to uncover any inconsistencies in these. As opposed to most tools for handling OCL constraints, the SILK Verifier does not use any data instances. The process of verification relies solely on meta-information, and is independent of the actual instances. In a simple model, developed by a single author, it is not very likely that many contradictions are introduced. In a huge model, however, which is semi-automatically assembled from distinct parts, the Verifier might be a good means for discovering modelling or mapping mistakes. The process of verification consists of several phases. First, the structural information present in the class diagrams is translated to constraints. Second, these, together with the invariant and mapping constraints, are transformed to first order logic (FOL) formulae. Third, automated deduction algorithms are used to uncover inconsistencies in the set of formulae. The verification algorithm is sound, but it cannot be expected to be complete. Because of the soundness, the Verifier will never raise a false alarm. Because of incompleteness, however, there might be inconsistent models which are not detected as such by the Verifier.

Figure 2: The two-layer structure of the Verifier.

6.1 Verifying OCL constraints as FOL formulae

As explained earlier for the case of the rule compiler, the structure and constraints of UML models and mappings can be viewed as the following logical formula:

class(x1, ..., xn) → inv(x1, ..., xn)

Here class(x1, ..., xn), on the left hand side of the implication, represents the logical form of a class instance or an association link, with all of its attributes as arguments. The right hand side of the implication, inv(x1, ..., xn), holds the invariant or mapping constraint, as a function of the attributes of the object(s) in question.

As an example, let us assume that the class Peugeot::Vehicle has an invariant stating: v.price > 5 and v.price < 30 (where v is an arbitrary instance of Vehicle). Then this invariant is interpreted as:

'Vehicle'(SNo, Type, P) → P > 5 ∧ P < 30

The task of the Verifier is to discover contradictions and inconsistencies in the constraints themselves, i.e. to prove that the invariant constraints allow no instances of the model:

{⟨x1, ..., xn⟩ | inv(x1, ..., xn)} = ∅

This is equivalent to:

∀x1, ..., xn ¬inv(x1, ..., xn) ≡ true

In the Verifier we actually start with the formula inv(x1, ..., xn) and apply rewrite transformations until

we reach false. The rewrite transformations are based on publicly available constraint reasoners, mostly from the CLP(X) family [Jaffar and Michaylov, 1987].

In more complex cases, the above inv(...) formula is built from several invariant or mapping constraints. Starting from a set of model elements supplied by the user, the Topology control submodule of the Verifier collects all the constraints that are associated with these model elements, directly or indirectly. For example, when verifying a mapping, the Topology submodule first puts together the constraint stated in the mapping and the invariants stated in the classes linked by the mapping. It then extends this with the constraints of the classes/mappings referred to in the constraints collected so far, etc., calculating a transitive closure of this operation.

Continuing our simple example, assume that the class Factory::Car also has an invariant which states: c.price > 2000 and c.price < 200000. (Recall that the Peugeot source stores prices in units of 1000, while in Factory single units are used.) As explained in the previous section, the Comparator, when given the task of establishing a mapping between Factory and Peugeot, will find the corresponding classes/attributes, and will create, as a default, a one-to-one mapping between the corresponding attributes. Now if this mapping is submitted to the Verifier, it will build the formula:

PV > 5 ∧ PV < 30 ∧ PV = PC ∧ PC > 2000 ∧ PC < 200000

Using the CLP(Q) constraint library, this formula can easily be proven to be false. If we correct the mapping to include the unit conversion, then PV = PC above is replaced by PV * 1000 = PC, and the formula ceases to be inconsistent.
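For this particular conjunction, the inconsistency check amounts to simple interval reasoning: the bounds on PV and PC each define an open interval, the equality links the two intervals (possibly through a scale factor), and the formula is satisfiable exactly when the intersection is non-empty. A minimal sketch of that idea in Python; this is an illustration only, not SILK's CLP(Q)-based solver, and the function name is hypothetical:

```python
def consistent(bounds_v, bounds_c, scale=1):
    """Check whether PV in (lo_v, hi_v), PC in (lo_c, hi_c) and
    PV * scale = PC can hold simultaneously (strict bounds)."""
    lo_v, hi_v = bounds_v[0] * scale, bounds_v[1] * scale
    lo_c, hi_c = bounds_c
    lo, hi = max(lo_v, lo_c), min(hi_v, hi_c)
    return lo < hi  # a non-empty open interval remains

# PV in (5, 30), PC in (2000, 200000), default mapping PV = PC:
print(consistent((5, 30), (2000, 200000)))              # → False (inconsistent)
# Corrected mapping PV * 1000 = PC:
print(consistent((5, 30), (2000, 200000), scale=1000))  # → True
```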

6.2 Mixed strategy reasoning

A two-layer algorithm has been chosen for proving the validity of logical formulae. The lower level contains a set of Solvers, built around publicly available constraint reasoners. The upper level comprises a package called Scheduler, which controls the operation of the lower-layer Solvers. This mechanism is called mixed strategy reasoning. The structure of the Verifier and its data flow are shown in Figure 2.

The Scheduler module sends the formula in question to the Solver modules in turn, until it finds one which "recognises" it, i.e. declares itself capable of doing some rewriting on the formula. The Solver then

• either detects a contradiction, in which case it returns a smallest subset of conjuncts which it finds inconsistent;

• or returns the result of the equivalence transformations, in the hope that another Solver will be able to prove the contradiction from there.

The Strategy control module decides the way the Scheduler works, based on some settings. These settings can also be altered by the end user, and they influence the decision of the Scheduler on the strategy of selecting the clauses and/or conjuncts to be applied, and the next Solver to be run. The Topology control module, as mentioned earlier, works on the UML class diagram network. Its main task is to collect the classes and associations to be involved in verification according to the user specifications.

The current implementation has three Solvers: a linear numeric equation solver based on CLP(Q) [Jaffar and Michaylov, 1987], a solver for finite sets building on both Setlog [Dovier et al., 1996] and CHR (Constraint Handling Rules) [Fruehwirth, 1998], and a solver for string operations, also based on CHR.
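The Scheduler loop described above can be sketched as follows. All names here are hypothetical, and the toy BoundsSolver (which only spots an unsatisfiable pair of strict bounds on the same variable) merely stands in for the real CLP(Q)/Setlog/CHR solvers; the sketch shows the control flow, not SILK's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Result:
    contradiction: bool
    inconsistent_subset: list = None
    rewritten: list = None

class BoundsSolver:
    """Toy solver: detects x > a together with x < b where a >= b.
    Conjuncts are (variable, operator, number) triples."""
    def recognises(self, conjuncts):
        return any(op in (">", "<") for _, op, _ in conjuncts)
    def solve(self, conjuncts):
        lows = [(v, n) for v, op, n in conjuncts if op == ">"]
        highs = [(v, n) for v, op, n in conjuncts if op == "<"]
        for v, lo in lows:
            for w, hi in highs:
                if v == w and lo >= hi:
                    # Return the smallest inconsistent subset of conjuncts.
                    return Result(True, [(v, ">", lo), (v, "<", hi)])
        return Result(False, rewritten=conjuncts)

def schedule(formula, solvers, max_rounds=10):
    """Offer the formula to each solver in turn; a solver either proves
    a contradiction or rewrites the formula for the next round."""
    for _ in range(max_rounds):
        for solver in solvers:
            if solver.recognises(formula):
                result = solver.solve(formula)
                if result.contradiction:
                    return result.inconsistent_subset
                if result.rewritten == formula:
                    return None  # no progress was made; give up
                formula = result.rewritten
                break
        else:
            return None  # no solver recognises the formula
    return None

print(schedule([("P", ">", 2000), ("P", "<", 30)], [BoundsSolver()]))
# → [('P', '>', 2000), ('P', '<', 30)]
```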

7 Reasoning on RDF sources

This section deals with a recently developed SILK wrapper for accessing RDF sources. It differs from the "traditional" wrappers developed earlier in that it provides built-in reasoning capabilities in the wrapper itself. Also, by giving access to ontologies stored in RDF format, it provides additional help in the development of conceptual-level models, representing the customised view of a certain domain or user group.

The RDF-wrapper has full support for RDF data and schemas. We chose RDF for ontology representation because RDF schemas have well-defined formal semantics and are expressive enough for our present setup. The RDF-wrapper provides tools for querying, reasoning and consistency checking of RDF-based ontologies and data. While being part of the SILK system, the RDF-wrapper is also available as a stand-alone application. The main sub-modules of the RDF-wrapper are: parser, query processor, query optimiser, answer generator, inference engine, consistency checker, knowledge base access and external interfaces. The stand-alone version provides both a console-based and a GUI-based interface.

When used as part of the SILK system, the RDF-wrapper maps RDF schemas to UML-style meta-information, and the actual RDF data to object instances. Thus RDF data can be used in the same way as the "traditional" information sources, and it is possible to link RDF schemas with the meta-information extracted from other kinds of data sources, such as relational or object-oriented ones. Based on this, we can create unified models which describe both RDF and relational databases, and thus provide access to the underlying data in a homogeneous way.

The RDF-wrapper has two main functions. On the one hand, it provides meta-information about the RDF source: the wrapper builds an appropriate model in the SILK Model Warehouse, which contains the classes, slots, property restrictions, etc. representing the RDF schemas in question. On the other hand, it supports the process of mediation by implementing the query facility for RDF sources. Both functions involve reasoning: the meta-information and the answers to the queries may both contain information which is inferred by the RDF-wrapper and is not explicitly represented in the RDF data or schema. This mechanism is completely hidden from the SILK tool-set.

In the following we describe the most important parts of the RDF-wrapper: the query processor, the inference engine and the consistency checker.
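The essence of the schema-mapping idea is that rdf:type and rdfs:subClassOf triples become UML-style instances and classes. A minimal sketch, assuming triples as (subject, predicate, object) tuples; the dictionary output is a hypothetical stand-in for entries in the SILK Model Warehouse:

```python
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

def triples_to_model(triples):
    """Split RDF triples into a class hierarchy and instance lists."""
    classes, instances = {}, {}
    for s, p, o in triples:
        if p == SUBCLASS:
            classes.setdefault(s, {}).update(super=o)
        elif p == TYPE:
            instances.setdefault(o, []).append(s)
            classes.setdefault(o, {"super": None})
    return classes, instances

triples = [("Car", SUBCLASS, "Vehicle"), ("myCar", TYPE, "Car")]
classes, instances = triples_to_model(triples)
print(instances)  # → {'Car': ['myCar']}
```

A real mapping would also carry over property restrictions as slot constraints, and would apply RDFS inference first, so that myCar is also seen as an instance of Vehicle.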

7.1 The query processor

The query processor of the RDF-wrapper handles both instance and property queries. The former expresses the fact that an object is an instance of a class; the latter, that a given property of one object is another object. Furthermore, as an extension, the RDF-wrapper allows the user to add so-called rules to an RDF source. Rules are similar to SQL views, but can be recursive. Therefore, among other things, rules make it possible to construct transitive RDF properties. The following is an example of such a rule: offspring(A,B)
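A recursive rule defining a transitive property amounts to a transitive-closure computation over the underlying facts. A minimal sketch using a naive fixpoint iteration; this is an illustration of the semantics only, not the RDF-wrapper's evaluation strategy, and the parent/offspring names are taken from the example:

```python
def transitive_closure(facts):
    """facts: a set of (a, b) pairs; returns their transitive closure."""
    closure = set(facts)
    while True:
        # Join the closure with itself: a -> b and b -> c give a -> c.
        derived = {(a, c) for a, b in closure for b2, c in closure if b == b2}
        if derived <= closure:
            return closure  # fixpoint reached: nothing new can be derived
        closure |= derived

parent = {("Anne", "Bob"), ("Bob", "Carol")}
offspring = transitive_closure(parent)
print(("Anne", "Carol") in offspring)  # → True
```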