Workshop on Cooperating KBS - Keele Sept 1993
Using an Intelligent Agent to Mediate Multibase Information Access
W. Behrendt, E. Hutchinson, K. G. Jeffery, J. Kalmus, C.A. Macnee, M.D. Wilson
SERC Rutherford Appleton Laboratory, Chilton, Didcot, Oxon, OX11 0QX, UK.

ABSTRACT
This paper addresses the representation of knowledge about information sources required by an intelligent agent mediating a cooperative work environment. Loosely federated heterogeneous distributed data sources are considered, where the maintainer of the mediating agent cannot control the availability, integrity, etc. of the remote sites. The semantic conflicts arising from heterogeneous data representations are described, and the consequences for an ontology and representation to integrate them are outlined. A secondary issue, the selection of a remote site to provide information to a user and the integration of the information required to support that task within the knowledge representation, is also considered.
INTRODUCTION
We regard "a Cooperating Knowledge Based System (CKBS) as a collection of autonomous, potentially heterogeneous and possibly pre-existing, objects (or units) which cooperate in problem solving in a decentralised environment" (Deen, 1991). The problem considered in this paper is that of providing a cooperative work environment expressed as a hyperdocument, presented through an architecture incorporating a single knowledge based system with which users interact to solve problems, drawing on data from many distributed, heterogeneous, pre-existing data sources connected to it. Queries to the data sources can be issued by the user or from within the hyperdocument; answers to queries are incorporated into the hyperdocument.

Where there are existing databases in the world which users wish to access, they must either learn the interfaces of each independent database, or use a common interface tool to access all of these heterogeneous databases. We are assuming a weakly federated multibase (following the terminology of Bell & Grimson, 1992) in which local operators have update rights to their own databases and must obey local data integrity constraints; multibase data retrieval users do not have update rights, and although global integrity constraints may exist, they cannot be enforced since the autonomy of local databases cannot be violated. Homogeneous distributed databases that have been designed top-down, so that global integrity constraints can be enforced and the distribution of data has been optimised, will not give rise to the phenomena considered here. In practice the situation considered arises when publishers, universities or other information providers allow their databases to be accessed remotely as parts of a multibase system, whilst retaining autonomy over the information that they own.

Where multiple data sources are combined, the integration can be through top-down design, through the intelligence of the user, or through a mediating intelligent agent (Wiederhold, 1992) acting on behalf of the user to reason over the semantics of the databases. The set of data sources and the "intelligent agent" must cooperate to respond to the user's query for information from the set of databases. The individual data sources do not have knowledge of each other or of the KBS, since the architecture consists of one agent and many servers rather than multiple agents. We assume a multibase rather than a conventional distributed database system, and we do not assume that there is a unique answer to a query for a given state of the entire system, since we do not assume that the system itself is reliable or robust. These relaxations of the standard assumptions for a distributed database require the mediating KBS agent to contain more knowledge than conventional distributed databases, and permit more reasoning for query optimisation than is usual. The construction of intelligent mediators for heterogeneous co-operating knowledge sources would reflect the structure proposed here for a multibase server (Adler et al, 1992) and not follow those proposed for homogeneous distributed co-operating knowledge bases by the ARPA knowledge sharing project (e.g. Neches et al, 1991)
amongst others. In heterogeneous distributed databases a user's query can be interpreted against different databases which may each contribute to the answer (for ease of exemplification we will assume that the databases are all relational, although the argument made here applies equally if they are not). The issue most frequently considered in this situation (e.g. Roussopoulos & Mylopoulos, 1975; Bertino & Musto, 1993) is how to divide the query between the heterogeneous data sources on the basis of an overall conceptual schema built from the individual database semantics of attribute/key names, table/relation names, data models, etc., in order to select the appropriate data sources. This issue is addressed in the first part of the paper. In the second part of the paper a secondary issue involved in the selection of data sources in query optimisation is addressed, assuming the solution to semantic integration proposed in the first half. The approach to data source selection presented could equally be applied to distributed knowledge sources if the semantic problems for these were resolved. We are not presenting the design of an existing system here, nor a complete set of heuristics to resolve the problem; this paper only contains an outline of an approach which is currently being developed, in order to obtain feedback on that approach.

EXAMPLE SYSTEM
We aim to achieve co-operation by providing an information presentation environment whose database management functions can be tailored to suit application specific purposes (the German word "Branchenloesung", i.e. "business sector solution", is a good description of what we mean here). The example system we are considering as a cooperative work environment is MIPS (Multimedia Information Presentation System). The architecture of MIPS consists of the following modules:
• General Query Tool (GQT) - to construct queries in a particular application domain.
• Selection and Retrieval Tool (SRT) - to break down a query into components which can be despatched to identifiable databases.
• Web Builder - a tool which constructs a temporary hyperdocument representing the user's information space.
• Presentation Manager - a tool which manages all the display and rendering functions required for multimedia.
• Embedded Knowledge Based System (EKBS) - to support all of the above.
Each of the modules has some local intelligence to do with its particular task within the system. The EKBS functions as the mediator between the modules, communicating via an internal representation language (IRL; after Doe et al, 1992). Overall control lies with the Presentation Manager - this is informed by inferences requested from the server EKBS. This sets our task in the following ways:
• we want to access heterogeneous databases but not (yet) update them;
• we need to understand the semantic heterogeneity of those databases;
• we need to map the heterogeneous semantics onto a global schema;
• the global schema should provide all functions expected for database access (this implies defining a high level query language for the user, as well as translation mechanisms into local database access languages);
• the global schema must be tailorable so that the virtual workspace can be optimised for a specific business sector, business, business unit/site, user;
• we accept that there is a need for an application builder who will use tools to integrate a heterogeneous database into our global, KBS assisted DBMS;
• we need to take into account the trade-off between effort of integration and subsequent usability/productivity of the Information Presentation System created this way.
PROCESS DESCRIPTION
Interaction with the system is via two routes. Firstly, the user can browse a hyperdocument which includes query nodes. These query nodes can be instantiated by a user (similar to query forms), and the instantiated high-level query is then broken down by the Selection and Retrieval Tool and its component queries sent to the relevant databases. The EKBS assists in that process by providing a meta-schema of what kind of information can be retrieved from what database. This is done by making inferences as to what the query is about and, therefore, which databases may in principle provide partial answers. Secondly, at any point, users can bring up the General Query Tool which allows them to pose queries in a guided free form. Such queries are, on the whole, dealt with in the same way as queries from hyperdocuments, except that here the Web Builder, together with the KBS, has to amend the hyperdocument 'on the fly' to include the information space covered by the new, unanticipated query. Issues of report generation and intelligent information design are described elsewhere (e.g. Chappel & Wilson, 1993).

Both the GQT and the SRT use "copies" of the KBS ontology, which acts as a common dictionary between the modules. This common language allows the modules to ask meaningful questions of the KBS, e.g. (anthropomorphised): "Here is a request for a video about Madrid: do we know about any databases which contain videos and whose subject is of type CITY, instantiated to the label MADRID?". The information retrieved is used to fully instantiate the hyperdocument, which is then rendered by the tools invoked by the Presentation Manager. The retrieval and selection process may involve altering the user's potentially ambiguous or over/under-specified query, and it may require pruning or extending the information that has come back from the databases. All this requires balancing various criteria coming from the user model (Chappel et al, 1992) or from application specific constraints held in the domain model (e.g. not to query expensive databases too often, or for too long). The criteria and the heuristics for resolving any conflicts are located in the KBS.
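To make the anthropomorphised exchange above concrete, the following minimal sketch shows how such a meta-schema might be consulted to identify candidate databases for a request; the structure of META_SCHEMA and the function candidate_databases are hypothetical illustrations rather than the MIPS EKBS.

    # Hypothetical meta-schema: for each database, the media types it holds and
    # the ontology types to which its subject field is linked.
    META_SCHEMA = {
        "db_travel":  {"media": {"video", "image"}, "subject_types": {"CITY", "REGION"}},
        "db_finance": {"media": {"table"},          "subject_types": {"COMPANY"}},
        "db_culture": {"media": {"video", "text"},  "subject_types": {"CITY", "MUSEUM"}},
    }

    def candidate_databases(media, subject_type):
        """Return the databases that could, in principle, answer the request."""
        return [name for name, info in META_SCHEMA.items()
                if media in info["media"] and subject_type in info["subject_types"]]

    # "A video about Madrid": MADRID is an instance of the ontology type CITY.
    print(candidate_databases("video", "CITY"))   # ['db_travel', 'db_culture']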
INTEGRATING HETEROGENEOUS SEMANTICS
As has been pointed out by many (e.g. Gray et al, 1988; Engels et al, 1992), the structural simplicity of conventional data models forces database designers into 'flattening' the structural complexities of the 'world of interest', which in turn leads to loss of semantic information. While the structure of information is intuitively perceivable by humans, and usually expressible in natural language, it has been difficult to couch the process of semantic reconstruction in formal models. The information available for each server database is therefore usually the semantically impoverished schema (or export schema). The KBS must include a knowledge representation for these database schemas over which it can reason to resolve the semantic heterogeneity of different data sources. The Semantic Network approach taken by many was an attempt to map intuitive relationships between (data) objects onto a structure of labelled, directed graphs. As was shown later (Brachman, 1979), there was a fundamental mismatch between the impoverished denotational semantics of semantic nets as implemented and the rich (as yet unformalised) semantics of natural language.

Since the database designer is largely free to choose the primitives by which entities and relationships are defined, the resulting data models can be scrutinised in the way suggested by Brachman (1979) when he commented that in semantic networks there is often a danger of confusing five types of primitives. It would be convenient to employ a standard knowledge representation for classification tasks derived from these principled distinctions, such as KL-ONE or its most recent incarnation, CLASSIC (Brachman et al, 1992). However, although these are restricted representations, they are not tuned to the task of schema integration considered here, and a specific representation
must be derived balancing the trade-off between representational expressiveness and computational tractability (Levesque & Brachman, 1985) for the task, and the main effort becomes one of ontology engineering (see Lenat & Guha, 1990). When we create a knowledge representation of a database schema we need to ensure that the underlying ontologies are cleanly defined at the different representational levels. In terms of the process of an application builder linking a new database schema to the KBS representation, we must resolve not only the problems of homonyms and synonyms of labels, but also those of the semantic structure of the KBS and schema ontologies. The more detailed the ontology becomes, the more divergence can be expected in terms of the sub-entities and attributes defined and, subsequently, in terms of the relations and constraints defined over the entities.

Sheth & Larson (1990) show a hierarchy of schemas that need to be transformed in order to arrive at the federated schema. At the bottom of the hierarchy is the local schema of a component database. The local schemas have to be translated so that each component database has a component schema which is consistent with the schemas of the other databases. From the component schema, a subset is chosen which will be part of the federated schema; this is the export schema, i.e. the view that the federated database will receive from the component database. The authors class schema translation as the process of making local schemata consistent with each other, schema definition as the process of specifying the component export schema to which the federated database will have access, and schema integration as making the export schemas consistent. In MIPS, we simplify the view somewhat by assuming that local databases can remain inconsistent with each other - reconciliation is done by specifying transformation rules which are part of the federated database schema. This approach is only possible using rule-based techniques. It reduces the task to defining the global or federated schema, and it defines the integration process as the specification of transformation rules between each remote schema and the global schema. This also means that we are indifferent as to whether we are dealing with the full local schema or just an export schema: the remote data source is what we see of it. The only strong requirements we have are that the remote schema MUST be accessible at the time of linking, at least for manual inspection but preferably on-line, and that it SHOULD be accessible on-line at intervals so that we can retrieve the schema again and analyse it for structural differences which would point at necessary updates of the federal schema.

Agreeing then with Sheth & Larson (1990) that integration and translation are closely interrelated, we set ourselves the following schema reconciliation tasks: the federated schema must be checked for completeness - all entities which get linked must have a knowledge based interpretation at the federal level. Furthermore, we need to analyse the schemas of the external data sources for the following properties: how do global entity types match with remote ones; how do global attributes match; how do global attribute values match; how do global relationships match; how do global constraints match.
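A minimal sketch of what such transformation rules between a remote schema and the global schema might look like follows; the rule format, the ROOM_PRICE concept and the conversion rate are assumptions made for illustration, not the MIPS rule language.

    # Hypothetical transformation rules, each mapping a (database, table, attribute)
    # triple onto a concept of the global schema, optionally converting the value.
    RULES = [
        {"db": "db1", "table": "HOTEL", "attr": "PRICE_FF",
         "concept": "ROOM_PRICE", "convert": lambda v: round(v * 0.12, 2)},  # assumed FF->GBP rate
        {"db": "db2", "table": "ACCOMMODATION", "attr": "Cost",
         "concept": "ROOM_PRICE", "convert": None},
    ]

    def to_global(db, table, attr, value):
        """Interpret a remote value as a value of the global (federated) schema."""
        for rule in RULES:
            if (rule["db"], rule["table"], rule["attr"]) == (db, table, attr):
                converted = rule["convert"](value) if rule["convert"] else value
                return rule["concept"], converted
        raise KeyError("attribute not linked to the global schema")

    print(to_global("db1", "HOTEL", "PRICE_FF", 500.0))   # ('ROOM_PRICE', 60.0)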
The following semantic mismatches, in any combination, need to be catered for:
[1] Ontologies for entities use different entity names
[2] Ontologies for entities have different granularity
[3] Ontologies for entities use different grouping criteria
[4] Entities have differently named attributes
[5] Entities have only partially overlapping attribute sets
[6] Entities have disjoint attribute sets
[7] Values of attributes use different scales (e.g. ordinal vs interval)
[8] Values of attributes have different value-types (e.g. GBP vs FF)
[9] Values of attributes have overlapping ranges (e.g. "fast" = > 100 mph vs "fast" = > 70 mph; i.e. is 90 "fast" or not?)

There is a dependency between [2], [3] and [4], [5], [6]. If [2] is the case then sub-entities of entities may end up at the level of attributes (indeed, an "attribute" is just the final sub-entity that is defined in a database); or, conversely, attributes of another database may be sub-entities which are broken down further still. Even if an ontology is created using different clustering criteria [3], there may still be overlaps with other ontologies at the entity or attribute level. So the cause for observing symptoms [4], [5], [6] can be [2] or [3]. We will address each of these mismatches in the following sections.

[1] Dealing with different but compatible semantics
When we link databases into the KBS we use KBS Ontology concepts (e.g. "SOURCE" and "DESTINATION") and we can then link entities in individual database schemata to these concepts. This addresses simple problems of homonyms and synonyms, e.g.:
concept(SOURCE, db1, "From").
concept(SOURCE, db2, "Dep").
concept(DESTINATION, db1, "To").
concept(DESTINATION, db2, "Arr").
What we have done is create a simple meta-schema which has the necessary information to ask about travel from a to b irrespective of the underlying database. We will of course now require both a higher level query language and a translation scheme from that query language to the individual query languages of the external databases in order to make use of this. The simple mechanism above allows us to relate single concepts which have 1-to-1 mappings between logical forms. We can say that the two databases above have "compatible semantics", i.e. translation is a simple mapping process. The scheme does not solve any issues arising from concepts which have only partially overlapping semantics, and it does not resolve inconsistent or incompatible semantics.
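A minimal sketch of how a mediator might use these concept facts to rewrite a global concept into the local field name of each database is given below; the Python encoding and the local_field helper are hypothetical, only the concept/3 facts themselves come from the example above.

    # The concept(Concept, Database, LocalField) facts above, encoded as tuples.
    CONCEPTS = [
        ("SOURCE", "db1", "From"), ("SOURCE", "db2", "Dep"),
        ("DESTINATION", "db1", "To"), ("DESTINATION", "db2", "Arr"),
    ]

    def local_field(concept, db):
        """Return the field name that realises a global concept in a given database."""
        for c, d, f in CONCEPTS:
            if c == concept and d == db:
                return f
        return None

    # Ask about travel from a to b irrespective of the underlying database:
    for db in ("db1", "db2"):
        print(db, local_field("SOURCE", db), local_field("DESTINATION", db))
    # db1 From To
    # db2 Dep Arr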
[2] Ontologies for entities have different granularity
Sheth and Larson (1990, p. 187) give the following example of semantic heterogeneity, a difference in the definition (i.e. meaning) of related attributes:
DB1: attribute MEAL_COST of relation RESTAURANT is defined as the average cost of a meal per person NOT including service charge and tax.
DB2: attribute MEAL_COST of relation BOARDING is defined as the average cost including service charge and tax.
Whether or not we can deal with the problem depends on an assumption: a) we know that the difference exists between the two databases; b) we do not know that the difference exists. A knowledge based approach can deal with a) and can contribute to dealing with b). The principal source of this type of problem is that the databases use ontologies which are essentially the same but encode information which would really require a higher level of granularity. The system we envisage would know about service-providers and that a service comes in units, has an overall unit cost, and that UNIT_COST_INCL is composed of at least the price of the service (UNIT_COST_EXCL) plus additional charges plus tax. The federated schema would be of this general form, so that in case a) the service-provider of type restaurant would have a unit cost of X, and since we KNOW that service and tax are not included we can either express this fact explicitly or use default knowledge to add this information. In other words, the semantic decomposition of entities is more detailed
and better organised in the federated schema, so that it makes visible the knowledge which is only implicit in the local database. Case b) is when we do not know and have no way of finding out whether service and tax are included: again, the federated schema should look like case a). At linking time the FDBA can see that there are no attributes specifying service and tax in the remote database, so the FDBA does not know whether field MEAL_COST should be linked to type UNIT_COST_INCL or UNIT_COST_EXCL in our representation. Our KBS assisted DBS linking tool could support an XOR-link from the remote field to the virtual fields created for UNIT_COST_INCL and UNIT_COST_EXCL. This link could be interpreted as: the RESTAURANT provides a service of type meal whose unit cost is at least X and possibly no greater than X. The example illustrates how adding very little world knowledge can expand the cooperative behaviour of our federal database beyond the capabilities of the individual, contributing databases.

[3] Ontologies for entities use different grouping criteria
This relates to our claim that in order to integrate schemata, we have to analyse the epistemological assumptions (Brachman's third level) that were made in creating the schemata which we wish to integrate. Consider a database schema created using "subset + ranking" criteria, as in Table 1A. Consider the same information space organised using a functional criterion, which could be translated as "X consists functionally of Y" or, inversely, "Y contributes functionally to X", as in Table 1B.

Table 1A (subset + ranking criteria):
Firm
  Employee
    Manager: Managing Director, Senior Manager
    Analyst: Senior Analyst, Junior Analyst
    Programmer: Senior Programmer, Junior Programmer
    Clerical: Sales, Helpdesk, Administration

Table 1B (functional criteria; individual employees are listed under each group):
Production: Analyst/Designer, Programmer, Tester
Sales: National, International
Support: Product 1, Product 2, ...
Administration: Management, Personal Secretaries, Finance/Salaries
Since we have committed ourselves to KBS assisted schema reconciliation using intensional information about our data space we are now obliged to analyse component schemata in this way because for efficiency as well as consistency reasons we certainly do not
want to end up with different intensional schemata in our knowledge base! This means that a reference ontology has to be used in which the epistemological criteria of one ontology can be mapped onto those of another. Consider the query: "I need some help for installing software package A". The Query Tool would have helped to construct an intensional form of this query: <REQUESTER> wants <ASSISTANCE> with <Product A>. Database 2 can support this query easily because it is organised (by luck or design) in a fashion which greatly assists the user goals or the purpose expressed in the query. Its concept of "Support" maps closely onto the assumed federal concept of <ASSISTANCE>, which requires five roles to be filled.

The expression "<REQUESTER> wants ..." would be interpreted as ASSISTANCE.REQUESTER = ASSISTANCE.<BENEFICIARY> and would be quantified by the value for <REQUESTER>, e.g. their name or e-mail address. "with" would be an operator whose denotation in the context of <ASSISTANCE> would be <...>. This would be linked to <...> in DB2. In the interpretation of our query we are now left with two null-quantified roles: <PROVIDER> and <...>. <PROVIDER> can be satisfied by DB2 by linking it to the employees for Product 1. <...> cannot be satisfied in one shot, but we can tell the requester from whom the information can be obtained. Database 1 is more difficult to integrate because of its different ontology: how can we express the information space of ASSISTANCE in this scheme? As before, the expression "<REQUESTER> wants ..." would be interpreted as ASSISTANCE.REQUESTER = ASSISTANCE.<BENEFICIARY> and would be quantified by the value for <REQUESTER>, e.g. their name or e-mail address. <OBJECT> is now null-quantified because DB1 does not know what products this firm supports. <PROVIDER>, however, is still quantifiable because we have an ontological overlap between this role and the concept "HelpDesk" (linked/linkable to ASSISTANCE.PROVIDER). So, in essence, we can still provide some useful information to the user who, in this case, does not get the expert for product A, but at least gets the helpdesk.

The method we have applied here is one of defining sets of purposes to which database entities can be put. By doing this at some level of generality we create a set of primitives which can be used to interpret any query as a semantic composition of such "purposes". At present, entities/attributes are only linked to single concepts in the ontology. Inferences are mainly made by applying query-domain frames such as ASSISTANCE to the concepts identified in the query, and by either specialising or generalising those concepts until a match with a database field can be made. Since each concept in any database has exactly one entry in the virtual database, we can say that any query that can be formulated in any subset of the databases can also be formulated in the virtual database. Since queries can be interpreted as to their intensional semantics as well as their extensional semantics, some answers are inferable solely through intensional analysis.
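The role-filling style of interpretation just described can be sketched as follows; the frame layout, the role names and the database links are illustrative reconstructions under our reading of Table 1, not the MIPS knowledge base.

    # A hypothetical query-domain frame with five roles, and per-database links
    # from roles to the concepts each database happens to supply.
    ASSISTANCE_ROLES = ["REQUESTER", "BENEFICIARY", "PROVIDER", "OBJECT", "INSTRUMENT"]

    DB_LINKS = {
        "db1": {"PROVIDER": "Clerical.Helpdesk"},                        # Table 1A organisation
        "db2": {"PROVIDER": "Support.Product_1", "OBJECT": "Support"},   # Table 1B organisation
    }

    def interpret(bindings):
        """Split a query into quantified roles (constraints) and null-quantified roles,
        then look up which databases can fill each null-quantified role."""
        quantified = {r: v for r, v in bindings.items() if v is not None}
        null_roles = [r for r in ASSISTANCE_ROLES if r not in quantified]
        fillers = {r: {db: links[r] for db, links in DB_LINKS.items() if r in links}
                   for r in null_roles}
        return quantified, fillers

    # "I need some help for installing software package A"
    constraints, fillers = interpret(
        {"REQUESTER": "user@site", "BENEFICIARY": "user@site", "OBJECT": "Product A"})
    print(fillers["PROVIDER"])    # {'db1': 'Clerical.Helpdesk', 'db2': 'Support.Product_1'}
    print(fillers["INSTRUMENT"])  # {}  (no database field satisfies this role directly)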
[4] Entities have differently named attributes
The procedure is the same as for [1]: simple transformation rules will map db1.entity_n.attribute_a to db2.entity_m.attribute_b.

[5] Entities have only partially overlapping attribute sets
Consider a database DB1 which has the relation EMPLOYEE_INFO defined by:
Employee.ID, Employee.Name, Employee.Group, Employee.Office-number, Employee.Telephone-extension, Employee.HomeNumber
and a DB2 which has PERSONNEL_INFO defined as:
OurEmployee.Name, OurEmployee.Office-number, OurEmployee.Telephone-extension, OurEmployee.Project-Assignment
Discrete semantics mean that if two entities overlap, the attributes which they share unambiguously denote the same things in the world; e.g. Employee.Name in DB1 and OurEmployee.Name in DB2 both denote the family names of people. The fact that the fields have different names is catered for by [1]. In principle, the tailoring mechanism of a MIPS application would allow us to define our federal-level entities as we like. For the above case, we can use any subset or union of the attributes to denote FED_EMPLOYEE_INFO. The intersection of attributes is trivial, as they map by mechanism [1]. The disjoint set (i.e. empty for DB1 or DB2) can have two interpretations: the information is computable via the database, or no such information is available. Consider DB2's OurEmployee.Project-Assignment, which in DB1 could be in a different table, e.g. table STAFF_TO_PROJECT with primary key Employee-ID. At the federal level the virtual attribute fed-employee.project-assignment should be represented either as a pointer to an explicit query on table STAFF_TO_PROJECT, or it could trigger an inference process over the knowledge base, which could be assumed to know that employees work on projects and which could infer, through the typing information on its representation of tables and attributes, that table STAFF_TO_PROJECT is indeed about items of type EMPLOYEE and their relation to items of type WORK, with PROJECT as a "kind_of" WORK. The latter approach is more complex but relieves the application builder from explicitly defining procedural attachments which could get out of date rather easily. Both approaches will be investigated in our prototype system.

[6] Entities have disjoint attribute sets
If we have no information in DB2 about the disjoint DB1 attribute Employee.HomeNumber then again an inference chain in the KBS using its ontology can be triggered: remember that every concept expressed in a linked database has been linked at the intensional level, that is, into the ontology of the KBS. Therefore, home.telephone.number is part of a concept or frame, perhaps called ACCOMMODATION, which is specified to express that all items of type PERSON can have a relation LIVES_AT with items of type GEO-LOC, and that the supertype of ACCOMMODATION is GEO-LOC. The KBS can now infer that TELEPHONE_NO is linked either to WORKS_AT or to LIVES_AT. Employee.HomeNumber in DB1 is marked as referring to LIVES_AT. The KBS can therefore infer that the missing information in DB2 would have to come from a table whose type is somehow related to ACCOMMODATION. Since DB2 has no such table, the KBS can provide this intensional information and the system can generate an answer that the home number of the employee in DB2 is unknown. The important difference from standard database technology is that we did not have to issue a query to a remote database but could infer this from our intensional model.
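A minimal sketch of the two options discussed, a pointer to an explicit remote query versus an inference over the ontology, and of answering 'unknown' from the intensional model alone, is given below; the table typings and ontology fragments are assumptions made for illustration.

    # Option 1: the virtual attribute is a pointer to an explicit query on a remote table.
    VIRTUAL_ATTRS = {
        ("fed_employee", "project_assignment"): {
            "db1": ("STAFF_TO_PROJECT", "Project-ID"),                  # joined on Employee-ID
            "db2": ("PERSONNEL_INFO", "OurEmployee.Project-Assignment"),
        },
    }

    # Option 2: a tiny ontology fragment over which the KBS can reason instead.
    ONTOLOGY = {
        "PROJECT": {"kind_of": "WORK"},
        "HomeNumber": {"part_of": "ACCOMMODATION"},   # home numbers belong with LIVES_AT
    }

    def resolve(db, table_types, attr):
        """Answer a virtual attribute from table typing, or report it as unknown
        without issuing any remote query."""
        needed = ONTOLOGY.get(attr, {}).get("part_of")
        if needed and not any(needed in types for types in table_types.values()):
            return f"{attr} is unknown in {db}: no table related to {needed}"
        return f"{attr} is answerable in {db}"

    print(VIRTUAL_ATTRS[("fed_employee", "project_assignment")]["db1"])
    # DB2 has no table typed as related to ACCOMMODATION, so 'unknown' is inferred:
    print(resolve("db2", {"PERSONNEL_INFO": {"EMPLOYEE", "WORK"}}, "HomeNumber"))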
[7] Values of attributes use different scales
A difference in the precision of data taken by related attributes. Example:
DB1: COURSE.GRADE = A | B | C | D | E
DB2: CLASS.SCORE = 0-10, based on weighted averages of exams on scales 0-100, rounded to the nearest half-point.
This example corresponds to the sort of heterogeneity where values of attributes use different scales or, at least, different granularity on scales. The knowledge based approach can again help us here, but we would argue that the underlying problems of this particular example have more to do with measurement theory than database theory: grades A to E denote, strictly speaking, no more than an ordinal scale, i.e. A > B > ...; it is not even clear whether A > B or A < B, or whether < should be interpreted as "better" or "worse". Similarly, CLASS.SCORE 0-10 could be interpreted as ordinal or ratio. If the interpretation is on a ratio scale then somebody getting 10 is said to be exactly 5 times better than somebody getting a 2 - we suggest that the figures only denote a monotonic ordering, i.e. "1" < "2" < ... < "10"; they are labels and could equally well be the characters "A" to "J".

[8] Values of attributes have different value-types (e.g. GBP vs FF)
We need to satisfy ourselves that the underlying scale types are compatible. Thus GBP and FF are value-type different but scale-compatible if a relation conversion_rate (X norm_units A = Y norm_units B) is known. The KBS Ontology will include AMOUNTS consisting of VALUES and UNITS. Subtypes of UNITS include LENGTH_UNITS, COST_UNITS, etc. Simple conversions apply between amounts of the same unit subtype. A more complex dimensional calculus is used to determine compound units (e.g. metres/second).

[9] Values of attributes have overlapping ranges
Again, knowledge is required of whether, firstly, a conversion is meaningful and, secondly, whether it is possible given the available information about the measurement process underlying the data. Only scale-compatible and unit-compatible data (in either case directly or through transformation) can be considered. Once 'normalised', these value constraints can be inclusive, intersecting, or disjoint. Value range issues are only considered at the level of whether mapping is possible in principle. Since we are not concerned with updates, the issue is of less importance to us. However, by explicitly specifying knowledge about measurements we have, at least in principle, introduced the necessary conditions for making multidatabase updates without violating integrity constraints on value ranges. Any measure conversion requires knowledge of the underlying measurement procedure and precision. A KBS can assist by capturing such knowledge and by putting dynamic interpretation rules on remote schema items, but it cannot solve the problem in general when there is insufficient knowledge about the interpretation of the data stored. What we can do is provide meta-information as above: we can, for example, make explicit that there are five scales on which measures can be taken (nominal, ordinal, interval, ratio, absolute). We can augment our attribute definitions at the federated level to warn users of possible incompatibilities when they attempt any queries which rely on a comparison of potentially incompatible values. The obvious checks are:
• compatible measurement scale, e.g. ratio / ratio
• compatible measurement unit, e.g. GBP / D-Mark
• compatible value range, e.g. 0-2000 / 0-6000
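These checks might be encoded along the following lines; the attribute descriptions, scale assignments and conversion rate are assumptions made purely for illustration.

    # Hypothetical attribute descriptions held at the federated level.
    ATTRS = {
        "db1.PRICE": {"scale": "ratio",   "unit": "GBP",    "range": (0, 2000)},
        "db2.PREIS": {"scale": "ratio",   "unit": "D-Mark", "range": (0, 6000)},
        "db3.GRADE": {"scale": "ordinal", "unit": None,     "range": ("A", "E")},
    }
    CONVERSION_RATES = {("D-Mark", "GBP"): 0.40}   # assumed rate, for illustration only

    def comparable(a, b):
        """Check measurement scale and unit compatibility before comparing values;
        value ranges would additionally be checked for overlap."""
        x, y = ATTRS[a], ATTRS[b]
        if x["scale"] != y["scale"]:
            return False, "incompatible measurement scales"
        units = (x["unit"], y["unit"])
        if x["unit"] != y["unit"] and units not in CONVERSION_RATES \
                and tuple(reversed(units)) not in CONVERSION_RATES:
            return False, "incompatible measurement units"
        return True, "comparable (value ranges may still only partially overlap)"

    print(comparable("db1.PRICE", "db2.PREIS"))   # convertible units, comparable
    print(comparable("db1.PRICE", "db3.GRADE"))   # incompatible measurement scales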
As mentioned above, the underlying problem is outside database theory but can be
dealt with to a varying degree using KBS support.

THE REQUIRED KNOWLEDGE REPRESENTATION
In MIPS, for a database to be linked we require the following information for each database (assuming a relational data model for simplicity). Below are sketches of the data structures required, in BNF notation. The BNF is by no means complete; definitions of Relation Constructors are given only by example, referring to some of the cases discussed in the text. The set of primitive Relation Constructors may be different for each MIPS application, but their formal semantics are not.

DatabaseDefinition ::= <...>, <...>, <...>, <...>, <...>, ...

<...> is not strictly necessary but may improve efficiency in top-down search. It would be created as an aggregation (inheritance) via its list of <...> and the lists of <...> used in those tables.

<...> ::= <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>, <...>.
AttributeDefinition ::= <...>, <...>, <...>, <...>, <...>, <...>, <...>.
<...> ::= <...>.<...>.
<...> ::= 'ASSISTANCE' | 'WANTING' | 'COST' | ...
<...> ::= <...> | <...> | <...> | <...> | <...> | ...
<...> ::= 'PROVIDER' | 'SELLER' | 'BUYER' | ...
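Before the examples, the intent of such definitions can be illustrated with hypothetical Python structures; the field names chosen here are ours and stand in for the nonterminals of the BNF sketch.

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class AttributeDefinition:                       # one remote column and its intensional link
        name: str
        value_type: str                              # e.g. "GBP", "ordinal A-E"
        concept: Optional[str] = None                # ontology concept it is linked to
        role: Optional[str] = None                   # e.g. "ASSISTANCE.PROVIDER"

    @dataclass
    class TableDefinition:
        name: str
        attributes: List[AttributeDefinition] = field(default_factory=list)

    @dataclass
    class DatabaseDefinition:
        name: str
        access_info: str                             # location, driver, cost and timing data, etc.
        tables: List[TableDefinition] = field(default_factory=list)

    db2 = DatabaseDefinition(
        name="db2", access_info="remote, on-line",
        tables=[TableDefinition("Support", [
            AttributeDefinition("Product_1_staff", "text",
                                concept="EMPLOYEE", role="ASSISTANCE.PROVIDER")])])
    print(db2.tables[0].attributes[0].role)          # ASSISTANCE.PROVIDER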
Example: a 3-place relation 'WANTING' defined for the application builder.
WANTING.<AGENT> = <Person>
WANTING.<RECIPIENT> = WANTING.<AGENT>
WANTING.<OBJECT> = <Entity>

This means that our denotational semantics are such that a person can only WANT an entity for themselves.

Example: a 5-place relation ASSISTANCE defined by the application builder.
ASSISTANCE.<PROVIDER> = <Provider>
ASSISTANCE.<OBJECT> = <GoodsOrService>
ASSISTANCE.<INSTRUMENT> = <ProductSupport>
ASSISTANCE.<BENEFICIARY> = WANTING.<RECIPIENT>
ASSISTANCE.REQUESTER = WANTING.<AGENT> = <Person>

This means that ASSISTANCE.REQUESTER is a Person WANTING an entity of type ASSISTANCE, where the Agent of ASSISTANCE is some sort of provider, the referenced object is some Goods or Service, and the instrument by which ASSISTANCE is satisfied is of type ProductSupport.

Example: a 2-place relationship COST defined for the application builder.
COST.<...> = <Money>
COST.<...> = <GoodsOrService>
Example of a link into the schema:
<...> ::= (<...>.<...>) MIPSvsl (<...>, <...>).
Example: (ASSISTANCE.PROVIDER) MIPSvsl (<...>, <...>).

The structures above should allow us to interpret, at the intensional level, a query to our virtual database of the form: "How much does it cost to obtain support for product P from firm F?" The resulting intensional structure of this query could be something like:

WANTING.<AGENT> = <Person> AND Person = USER
WANTING.<RECIPIENT> = WANTING.<AGENT> AND <RECIPIENT> = USER
WANTING.<OBJECT> = <Entity> AND Entity = COST
COST.<...> = <Money> AND Money = <...>
COST.<...> = <Service> AND Service = ASSISTANCE
ASSISTANCE.<PROVIDER> = <Provider> AND Provider = <Firm F>
ASSISTANCE.<OBJECT> = <Goods> AND Goods = Product P
ASSISTANCE.<INSTRUMENT> = <ProductSupport> AND ProductSupport = <...>
ASSISTANCE.<BENEFICIARY> = WANTING.<RECIPIENT> AND RecipientRole = USER
ASSISTANCE.REQUESTER = WANTING.<AGENT> AND AgentRole = USER.
We can now follow up our intensional links to find those attribute-types and table-types that can, in principle, provide partial answers to our query. Clearly, the null-quantified items are queried for, with the quantified items acting as constraints. As indicated above, the query is not satisfiable because neither of the two assumed databases has cost information on its product support. Nonetheless, since we are able to construct a partially answerable intensional query, we can inform the user that we have identified an ASSISTANCE.<PROVIDER> for which an extension (i.e. their employee information) exists. From this, we can bootstrap further queries via other defined intensional Relation Constructors, such as "WORKING_ON", which would rely on a similar inference chain to that sketched above.

The above is only a sketch of a prototype which is currently being implemented by the MIPS consortium. As can be seen, our notion of a federated database schema is far more dynamic than would be possible with conventional database technology. In fact, the application of rules is done largely at run time, so that in effect a temporary database plus appropriate schema is created at session time, with the KBS interpreting the meta-information provided about the databases which may need accessing. Answering a query is thus an operation firstly at the intensional level, with the aid of the KBS, and secondly the generation of real database queries once the user's goals have been sanctioned as potentially satisfiable.
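A minimal sketch of that final step, turning a sanctioned, partially answerable intensional query into real queries against the linked databases, is given below; the link table, the COST.AMOUNT role and the SQL shapes are hypothetical.

    # Hypothetical MIPSvsl-style links: (relation constructor, role) -> (db, table, column).
    LINKS = {
        ("ASSISTANCE", "PROVIDER"): ("db2", "Support", "Product_1_staff"),
        ("ASSISTANCE", "OBJECT"):   ("db2", "Support", "Product"),
    }

    def generate_subqueries(null_roles, constraints):
        """Emit one real query per database field that can fill a null-quantified role;
        roles with no link are intensionally unanswerable and are skipped."""
        queries = []
        for constructor_role in null_roles:
            if constructor_role not in LINKS:
                continue
            db, table, column = LINKS[constructor_role]
            where = " AND ".join(f"{c} = '{v}'" for c, v in constraints.items())
            queries.append((db, f"SELECT {column} FROM {table} WHERE {where}"))
        return queries

    subqueries = generate_subqueries(
        null_roles=[("ASSISTANCE", "PROVIDER"), ("COST", "AMOUNT")],
        constraints={"Product": "Product P"})
    for db, sql in subqueries:
        print(db + ":", sql)
    # db2: SELECT Product_1_staff FROM Support WHERE Product = 'Product P'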
DATA SOURCE SELECTION
A major aim in the design of single centralised databases is to hide the structural details of the data from the user as much as possible. In distributed databases this principle is both applied and widened to encompass hiding the locations of, and access mechanisms to, that data. The multibase language approach to distributed databases allows individual users to select the appropriate target databases for their queries, and forces them to be very aware of the physical data sources used (Litwin & Abdellatif, 1986), given the inability to automatically optimise data location or to enforce global integrity constraints at update. Here we consider the benefits gained by allowing the mediating agent to make the choice of target data sources, while allowing users an intermediate amount of explicit knowledge of the access mechanisms.

Query optimisation is usually considered as the process of ensuring that either the total cost or the total response time for a query is minimised. Obviously, there can be a difference between the 'fastest' and 'cheapest' execution strategies in centralised or distributed databases. In the class of multibase considered here, a fifth optimisation choice can be added to the four conventional ones:
1) the order of executing operations
2) the access methods for relations
3) the algorithms for performing operations
4) the order of data movements between sites
5) the data sources chosen

The eight factors considered for selecting data sources conventionally occur in different parts of the life cycle. Contemporary standards for requirements specification (e.g. Yeh et al, 1984) consider that specific information about non-functional requirements should be gathered for database applications. These include the performance time and cost for transactions, but also the accuracy and recency of data, and the reliability of data through availability and integrity. Along with these five variables, the content of the returned data is always stated in the query itself at run time, and sometimes the quantity of the required data. If not just these two but all seven of these factors are included in the query, they can all be used to select the data sources to be queried in heterogeneous distributed databases. Examples of the role of each of these factors individually in the choice of data sources are:

Response time: If there are two databases (DB1 and DB2) which contain subsets of answers X and Y to the query, and DB1 takes 3 minutes to respond whereas DB2 takes 40 minutes to respond, then a user may only want a quick answer from DB1 (X) and not want to wait for the answer from DB2 (Y). In this case the user will state a time limit on the retrieval process which will be used to select data from DB1 only, and the set X will be returned rather than the larger set X&Y.

Cost of Data: Data may be available on vacancies at Hotel D from the individual hotel (DB1) or from a booking agency (DB2). Because the booking agency provides many services on the same machine, the cost of using DB2 is higher than that of DB1, where the overheads are less. A travel agency trying to book a holiday for a customer does not want to spend more money than it needs to, and therefore wants the cheapest way to retrieve the information; it therefore uses DB1 only, and not DB2, to provide the data.

Content of Data: Issues of semantic heterogeneity and the approach taken to resolve them have been described above, and this section builds on that solution.
Domain constraints theoretically exist at three levels (Omolulu et al, 1989), all of which must be considered in order to enforce global data integrity constraints on retrieval, since only local constraints can be enforced on update (note that domain is used here in the database sense, not the knowledge
engineering one):
i) the Platonic set of possible values for an attribute in a domain;
ii) the values which can exist in a DB (which may be a subset of the domain);
iii) the values which do exist in the DB.
Most commonly, if the KBS contains values for these three sets of domain constraints, queries would only be modified to reflect the semantics of the individual database as a specialised interpretation of the conceptual schema. However, a database which appears relevant on the basis of the conceptual schema alone may be excluded because the actual values in the DB are outside the scope of a query.

Quantity of Data: In a similar circumstance, with two databases where the amounts of data in them differ, DB1 may contain the set X, which contains 100 elements, whereas DB2 may contain a further 1000 elements. The user may only wish to receive 100 elements, and therefore access will only be made to DB1 and not to DB2.

Recency of Data: A user may wish to access the IBM share price. This may be available in DB1, the local Madrid finance house database; in DB2, the London Financial Times database; and in DB3, the New York Stock Exchange database. DB1 may be reliable at close of business on the previous day; DB2 may be reliably updated every hour on the hour; DB3 may be reliable to the previous second. The user may only want an approximate figure for yesterday's close of business, and therefore only DB1 will be queried and only that data returned.

Accuracy of Data: Data on the exact location of stars may be available from several astronomical databases. Each database is associated with an error in the accuracy of the instruments used to collect the data. A large database (DB1) may use less accurate measuring devices (R) to collect large amounts of data; a second, smaller database (DB2) may contain less data, but collected with more accurate devices (P). A user may say that they wish to access the locations of stars in a portion of the heavens, but that these must be more accurate than some measure Q. The accuracy calculation shows that P > Q > R, therefore only DB2 is queried for the data, and only that subset is returned.

Availability of Data: We have not assumed a robust distributed heterogeneous database, so we cannot assume that all data sources which could be available to the user are available at any time. Therefore the KBS must include timeouts on the time to wait for returned data, to cover communications, server or other failures which may be indistinguishable to the client KBS. Since this variable is trivial and much addressed elsewhere, it will not be considered further here.

The information required in the KBS representation of each of these factors is clearly compatible with the representation of information about data sources proposed in the previous section. Obviously, the evaluation of these factors independently is unrealistic, since they will nearly always interact. The approach advocated here is to use the intelligent agent to represent the information required about each of the databases, and to perform the selection of the databases using exclusion rules applied to the set selected by the conceptual schema interpretation process.
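A sketch of this exclusion-rule style of selection over the factors listed above follows; the source descriptions, thresholds and rule set are illustrative assumptions rather than the MIPS heuristics.

    # Hypothetical descriptions of candidate sources already selected on content grounds.
    SOURCES = {
        "db1": {"response_min": 3,  "cost": 1.0, "recency_hours": 24, "available": True},
        "db2": {"response_min": 40, "cost": 5.0, "recency_hours": 1,  "available": True},
    }

    # Each exclusion rule removes a source that violates one requirement of the query.
    EXCLUSION_RULES = [
        ("too slow",      lambda s, q: s["response_min"] > q.get("max_minutes", float("inf"))),
        ("too expensive", lambda s, q: s["cost"] > q.get("max_cost", float("inf"))),
        ("too stale",     lambda s, q: s["recency_hours"] > q.get("max_age_hours", float("inf"))),
        ("unavailable",   lambda s, q: not s["available"]),
    ]

    def select_sources(requirements):
        """Apply the exclusion rules to the content-selected set of sources."""
        selected = []
        for name, source in SOURCES.items():
            if not any(violated(source, requirements) for _, violated in EXCLUSION_RULES):
                selected.append(name)
        return selected

    # The user only wants a quick, cheap answer:
    print(select_sources({"max_minutes": 10, "max_cost": 2.0}))   # ['db1']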
The complete set of heuristics so far developed to account for the interaction of these factors is too complex for presentation here; however, they are being refined in the development of the MIPS prototype system to isolate domain dependent components and leave domain independent parts which can be used in the selection of data or knowledge sources in distributed systems. This information must be acquired from various sources in order to allow the
optimisation to take place. The query must be expanded to include values for each attribute over the query formula. This can be performed by the user explicitly, or a system including a user model would contain overridable defaults for these values. These values can be negotiated between the system and the user through a dialogue when conflicts require resolution. The information stored in the KBS can be stated as part of the export schema of the individual data sources. Obviously, conventional export schemata do not contain this information, so it would have to be added to them. This places a considerable burden on the application builder who adds data sources to the intelligent agent. The benefit of this cost is being experimentally determined through the MIPS prototype system.

CONCLUSION
ADVANTAGES OF THE APPROACH
With a set of primitive constructors pre-defined, we provide a tool to define more application specific semantics in a MIPS virtual database. The manual linking process supports one-by-one integration of remote databases. No effort is required from the remote database to fit the virtual schema, and no adherence to this 'global schema' is required. The onus is on the MIPS customiser, who has appropriate tools available for this task. The approach supports the partial integration of a database: this, again, provides a more natural migration path because integration can be evolutionary, according to priorities. We also believe that MIPS can support convergence through standardisation because the system allows merging at different speeds rather than relying on a 'big bang' introduction. This means that useful core MIPS applications can be ported and extended in a transparent way. The system allows modularisation and is specified with various customisation requirements in mind at an early design stage.

REMAINING PROBLEMS AND ISSUES
Several issues remain open for discussion. The manual linking is effort intensive at the best of times and impossible at any commercial scale without well-equipped support tools. Despite the definition of primitives as well as generic checking devices at the intensional level, practice with many complex IT systems has shown that maintenance and/or customisation tends to require more skills than originally anticipated. It would be wise to expect similar tendencies here. The approach assumes that either the benefits are so great that maintenance is affordable, or the underlying domains/databases are so stable that maintenance will happen rarely. However, the problem remains that when remote databases change in structure we have no easy way of detecting this and taking immediate action. One suggestion put forward is that we can do simple structural checks on whether the cardinality of remote schemata has changed. However, if the changes have severe implications for the overall semantics then manual relinking is required. In MIPS, we do not address possibilities of machine learning for automatic schema reconciliation, nor do we investigate radically different approaches to database access, e.g. content addressable memories, associative memories, etc. Finally, an interesting issue could arise if we assume several MIPS applications: the question is then whether we can get the KBSs in these applications to act as co-operative agents. Our presence at this workshop is partially motivated by the question of what additional functionality would be needed to make them co-operate at the knowledge level.
ACKNOWLEDGEMENTS
The work reported in this paper was partly funded by the CEC through Esprit project MIPS (No 6542). The participating organisations in MIPS are Longman Cartermill (UK), SERC RAL (UK), Heriot Watt University (UK), Trinity College Dublin (Ireland), Sema Benelux (Belgium), STI (Spain), DTI (Denmark).
REFERENCES
Adler, M., Durfee, E., Huhns, M., Punch, W. & Simoudis, E. (1992) AAAI Workshop on Cooperation Among Heterogeneous Intelligent Agents. AI Magazine, 13(2), Summer, 39-42.
Bell, D. & Grimson, J. (1992) Distributed Database Systems. Addison-Wesley: Wokingham, England.
Bertino, E. & Musto, D. (1993) Query optimization by using knowledge about data semantics. Data & Knowledge Engineering, Vol 9, No 2, December 1992, pp 121-155.
Brachman, R.J. (1979) On the Epistemological Status of Semantic Networks. Reprinted in: Readings in Knowledge Representation, Morgan Kaufmann, 1985, pp 192-215.
Brachman, R.J., Borgida, A., McGuinness, D., Patel-Schneider, P.F. & Resnick, L.A. (1992) The CLASSIC Knowledge Representation System, or, KL-ONE: The Next Generation. In Proceedings of the International Conference on Fifth Generation Computer Systems. ICOT: Japan.
Bright, M.W., Hurson, A.R. & Pakzad, S.H. (1992) A Taxonomy and Current Issues in Multidatabase Systems. IEEE Computer, March, 50-59.
Chappel, H., Wilson, M. & Cahour, B. (1992) Engineering User Models to Enhance Multi-Modal Dialogue. In J.A. Larson and C.A. Unger (Eds.) Engineering For Human-Computer Interaction. Elsevier Science Publishers B.V. (North-Holland): Amsterdam, pp 297-315.
Chappel, H. & Wilson, M.D. (1993) Knowledge-Based Design of Graphical Responses. In Proceedings of the ACM International Workshop on Intelligent User Interfaces, pp 29-36. ACM Press: New York.
Deen, S.M. (1991) Cooperating Agents - A Database Perspective. In S.M. Deen (Ed.) Cooperating Knowledge Based Systems 1990. Springer-Verlag: London.
Doe, G.J., Ringland, G.A. & Wilson, M.D. (1992) A Meaning Representation Language for Co-operative Dialogue. Proceedings of the ERCIM Workshop on Experimental and Theoretical Studies in Knowledge Representation, pp 33-40, Pisa, Italy, May.
Engels, G., Gogolla, M., Hohenstein, U., Huelsmann, K., Loehr-Richter, P., Saake, G. & Ehrich, H.-D. (1992) Conceptual modelling of database applications using an extended E-R Model. Data & Knowledge Engineering, Vol 9, No 2, December 1992, pp 157-204.
Gray, P.M.D., Storrs, G.E. & du Boulay, J.B.H. (1988) Knowledge Representation for Database Metadata. Artificial Intelligence Review, Vol 2, No 1, pp 3-30.
Lenat, D.B. & Guha, R.V. (1990) Building Large Knowledge Based Systems. Addison-Wesley: Reading, MA.
Levesque, H.J. & Brachman, R.J. (1985) A Fundamental Tradeoff in Knowledge Representation and Reasoning. Reprinted in: Readings in Knowledge Representation, Morgan Kaufmann, 1985, pp 41-70.
Litwin, W. & Abdellatif, A. (1986) Multidatabase Interoperability. IEEE Computer, December, 10-18.
Neches, R., Fikes, R., Finin, T., Gruber, T., Patil, R., Senator, T. & Swartout, W. (1991) Enabling Technology for Knowledge Sharing. AI Magazine, Fall, 36-56.
Omolulu, A.O., Fiddian, N.J. & Gray, W.A. (1989) Confederated Database Management Systems. In M.H. Williams (Ed.) Proc. of BNCOD. CUP: Cambridge.
Roussopoulos, N. & Mylopoulos, J. (1975) Using Semantic Networks for Database Management. Proceedings of the International Conference on VLDB, Framingham, MA, 144-157.
Sheth, A.P. & Larson, J.A. (1990) Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, Vol 22, No 3, September 1990, pp 183-236.
Wiederhold, G. (1992) Mediators in the Architecture of Future Information Systems. IEEE Computer, March, 38-49.
Yeh, R.T., Zave, P., Conn, A.P. & Cole, G.E. Jr. (1984) Software Requirements: New Directions and Perspectives. In C.R. Vick and C.V. Ramamoorthy (Eds.) Handbook of Software Engineering, 519-543. Van Nostrand Reinhold.