Knowledge Based Architecture to Integrate Heterogeneous Distributed Information Systems

P. Bernus¹
Key Centre for Software Technology
The University of Queensland
St. Lucia, Queensland QLD 4072, Australia
[email protected]

M.P. Papazoglou²
Department of Computer Science
Australian National University
GPO Box 4, Canberra ACT 2601, Australia
[email protected]

¹Research supported by the Centre of Expertise in Distributed Information Systems and funded by Telecom Australia Research Laboratories.
²Research supported by FRF, grant no. R66474.
Abstract

A knowledge-based architecture to connect and correlate autonomous, disparate information sources is presented. The information sources being integrated come equipped with logical front-ends that build up only those parts of a virtual global schema which are needed to process local or global requests. The schema building and translation processes are driven by respective knowledge bases. The core of the architecture is a "connecting tissue" in the form of a distributed knowledge modeling platform, also referred to as Distributed Cooperative Object Management. The proposed architecture is intended for a broad spectrum of uses, ranging from heterogeneous business applications to sophisticated design systems. We expect that the architecture can be implemented on top of open distributed processing environments.

Introduction

With the widespread use of computer technology throughout large organisations or enterprises, there are requirements to manage in a collective manner the data which are dispersed between individual groups of people. The past few years have seen a significant increase in awareness of the potential benefits which may derive from using distributed database technology in the context of corporate enterprises. However, due to heterogeneity problems, retrieving information from a collection of independently developed database systems represents a formidable task. Owing to the success of relational database technology, the first attempts to integrate distributed database systems (DDBS) have concentrated on the utilisation of the very same principles. For this reason the current situation in the distributed database arena is mainly reflected in a handful of predominantly homogeneous commercial systems. Problems of heterogeneity are addressed by experimental projects and research prototypes.
The reality of present-day enterprises is that enterprise data are considered a corporate-wide resource and are managed by qualitatively different database management systems. Corporations have made major investments in database applications which they are keen to preserve and modernise to meet changing organisational requirements. Moreover, corporations have a keen interest in seeing already available data management systems and application programs realise corporate, or distributed, database applications which can span more than one data source. This could serve as a competitive advantage by providing collective management of and coordinated access to distributed data. Unfortunately, to achieve this form of integration corporations normally rely on an undisciplined and awkward methodology. It is most likely that their distributed database systems comprise a number of completely autonomous database repositories patched together to provide an ad hoc functionality which suits the purposes of the particular organisation which utilises it. The penalty that these organisations normally have to pay is quite dear: interconnection and cooperation between the individual database systems require frequent human intervention, as the software extensions which are developed to guarantee integrated functionality are fragile and of a temporary nature. Although such systems may provide various solutions to several basic data management problems, they are normally short-lived and certainly do not meet the advanced requirements for cooperation which are prerequisites for developing integrated applications. The general requirement is to be able to identify and code the kind of human knowledge and expertise that is required when using disparate information systems to carry out corporate business functions. Resulting corporate applications may thus range from very specific to very general. Therefore, it is desirable to design and develop an advanced knowledge-based environment which integrates related individual data-intensive applications into a single corporate application that may require complex operations on data and knowledge, possibly located in remote systems.
To this end, a knowledge-based system which integrates heterogeneous, distributed information systems will provide the following functionality:

- it will connect to external traditional heterogeneous database systems and knowledge bases, providing a complex-object view of their data content;
- it will also allow for advanced forms of cooperation between the pre-existing systems, thus supporting an interoperable society of distributed knowledge-driven systems.

In the following section we outline the likely requirements for a knowledge-based integration of distributed information systems; these requirements will be used to evaluate the capabilities of the proposed system. Other architectures designed for the integration of distributed database systems are also briefly discussed. Sections 3 and 4 contain an overview of the most salient features and the functionality of a knowledge-driven architecture for the integration of heterogeneous distributed information systems. Finally, Section 5 discusses various ways of prototyping the representation and reasoning substrate of the proposed architecture.

1 Knowledge Based Integration

A rather straightforward approach to the data integration problem is to use a mere physical interconnection of the application programs through standard communication facilities. Still, the solution of this task is a major achievement when considered in a distributed network of multi-vendor equipment. Figure 1 shows the approach to integration using open systems interconnection (OSI). The interconnected system is shown to consist of application programs connected through OSI protocols (COM). Application programs utilise their respective component information systems (IS) for data management purposes. This solution is adopted in most cases where a quick approach to the integration problem is required and clearly subsumes the cases considered in the previous section. In other words, if we can rely only on OSI, then integration and distribution require that the application programs be aware of many operational factors, such as the intricacies of data and transaction distribution, the maintenance of consistency between the information systems at remote sites, and the provision of a uniform distributed interface for the users. These factors are obviously not relevant to the application itself but to the computing equipment utilised. In order to use automated tools for truly integrating disparate heterogeneous information sources, we need to implement a substantial portion of the human knowledge required to put otherwise unconnected information systems into shape so that they can be viewed and used as an integrated whole.
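One ingredient of such an integrated view is the complex-object front-end mentioned earlier. As a minimal illustration, the following Python sketch (the schema and all names are invented for the example and are not taken from the proposed system) shows how a logical front-end could present the flat relational content of one component information system as complex objects:

import sqlite3

class ComplexObjectFrontEnd:
    """Wraps one component information system and offers a
    complex-object (nested) view of its flat relational content.
    The 'customer'/'orders' schema is a hypothetical example."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)

    def customer_with_orders(self, customer_id):
        # Join two base tables and fold the result into one nested
        # object, hiding the relational structure from the global level.
        rows = self.conn.execute(
            "SELECT c.name, o.order_no, o.total "
            "FROM customer c JOIN orders o ON o.cust_id = c.id "
            "WHERE c.id = ?", (customer_id,)).fetchall()
        if not rows:
            return None
        return {"customer": rows[0][0],
                "orders": [{"order_no": r[1], "total": r[2]} for r in rows]}

With OSI alone, each application would have to perform such joins and conversions itself for every remote source; a front-end of this kind localises that knowledge at the component system.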
[Figure 1: Applications interconnected through communication protocols. The figure shows a user interface on top of application programs linked by communication protocols; each application uses its own component information system. The applications themselves know about distribution, concurrency, consistency, and processing, or avoid these problems by administrative means.]
When doing so, humans coordinate access to these systems in a rational manner and use their inferential abilities to interpret the functions of the pre-existing application programs and the contents of the underlying component information systems. This process of interpretation draws on a variety of types of knowledge.³ In order to achieve the above functionality the integrating environment has to offer a number of facilities:

- A universal conceptual schema (UCS) used to integrate the individual local, or component, schemata.
- Schema translation knowledge which can be used for the purpose of structural and semantic translation (between the universal and the local schemata).
- Request processing knowledge (including query and update processing⁴, and support of interoperability).
- Models or encapsulations of knowledge regarding:
  - syntactic knowledge to homogenise the query language interfaces of the component systems;
  - capability models regarding the functionality of the pre-existing application programs;
  - models which underlie the interaction pattern of the component information systems with the integrated environment.
- Application programs, or tools, which are implemented in the active knowledge representation environment and capitalise on the advanced data modeling facilities of the integration environment. Such tools include decision support tools, engineering analysis tools, or design and planning tools [4].

³It is interesting to note that the situation is somewhat similar to natural language interpretation.
⁴Which further involves constraint satisfaction, planning and execution.
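To give a feel for how schema translation knowledge of this kind might be encoded, here is a minimal Python sketch; the rule format and all schema names (EMP, Employee, and so on) are assumptions of this illustration, not the paper's actual representation:

# Schema translation knowledge kept as declarative, data-driven rules
# rather than hard-wired code. Each rule maps a local construct to a
# universal-schema construct, optionally with a value converter.
TRANSLATION_RULES = [
    # (local table, local column) -> (UCS class, UCS attribute, converter)
    (("EMP", "ENAME"), ("Employee", "name", str.title)),
    (("EMP", "SAL"), ("Employee", "salary", lambda cents: cents / 100.0)),
]

def translate_tuple(table, row):
    """Apply every applicable rule to one local tuple, producing a
    universal-schema object as a plain dictionary."""
    obj = {}
    for (src_table, src_col), (ucs_class, ucs_attr, convert) in TRANSLATION_RULES:
        if src_table == table and src_col in row:
            obj.setdefault("_class", ucs_class)
            obj[ucs_attr] = convert(row[src_col])
    return obj

# Example: a local EMP tuple becomes an Employee object at the UCS level.
print(translate_tuple("EMP", {"ENAME": "ada lovelace", "SAL": 912345}))

Because the rules are data, they can themselves be stored, queried, and revised by the integration platform, which is precisely the "active" quality argued for below.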
From the above-mentioned properties we can derive the first thesis of knowledge-based integration: the modeling power of the integration platform must be what is best described as an active knowledge representation environment [1], [2], [3].

Recently, there is a growing belief that the next generation of distributed information systems will be based on the paradigm of interoperable, intelligent, heterogeneous agents which share a common pool in a well-disciplined and systematic way [7]. Therefore, rather than basing integration on a mere physical interconnection, we propose to use a network of "Distributed Cooperative Object Management Systems" (DCOM-s) as the substratum for integration (see Figure 2). A DCOM is a substrate which implements the integration platform described above, provides the foundations for an environment which integrates already developed application programs, and is used to develop further corporate applications.

[Figure 2: Corporate application relying on a distributed IS (or network of DCOM-s). DIS: integration platform; DCOM: components of DIS; A1-A6: existing application programs.]

The modeling needs of the DCOM, or of similar integration environments (e.g. for CAD databases), have been extensively dealt with in the literature [3], [5]. The conclusion that can be drawn from these investigations is that the practical needs of integration are best met by an elaborate object-oriented data model. The structural and behavioural modeling power of object-oriented data models is analogous to that of knowledge representation languages developed for AI applications. The set of requirements in this case is even closer:

1. Since in large-scale integration projects the collection of all the local schemata is constantly changing, schemata are not to be separated from data. In other words, schemata should behave and be viewed as data in conventional database management systems.

2. The aforementioned types of knowledge are best represented as a mixture of functions, procedures and rules.⁵

2 The Knowledge-Driven Architecture

Following [6], we may categorise a multi-model heterogeneous information system topology as 1-1, 1-n, n-1, and n-n architectures, depending on the particular model-to-model (and language-to-language) translation methodology. This dimension decides the direction of translation processes between the individual information systems.

The other dimension of integration is the tightness of coupling [8]. The fundamental question here is whether we require:

(i) a common query-only facility using a common query language;

(ii) a common query and update facility using a common language;

(iii) controlled collaboration, where every component information system is able to view a partition of the outside world through the "eyes" of its own private schema and to access information in other component information systems as if they were of the same type;

(iv) interoperability of the component information systems, defined as some advanced form of cooperation.

The first two requirements are satisfied by logically centralised, or tightly coupled, architectures [8], [9], while the third pertains to federated, or loosely coupled, architectures [10]. None of the first three architectures can satisfy efficiently the advanced forms of interoperability required for a smooth and effective logical integration of autonomous information systems. Interoperability may be defined as a form of integration based on cooperation, on the premises of knowledge and meta-knowledge, and may range from the cooperation requested when staging data between a data/knowledge base and an application, to intelligent collaboration to achieve a common goal [7]. Table I presents the necessary functions involved when we select any combination along the dimensions examined.

⁵Or sets of rules, called scenarios [3], so the model to be used for implementing the present proposal needs to support rules and rule processing.
1-1 (hard-wired interconnection):
  (i) direct ST + SQP; (ii) direct ST + SQP + CS; (iii) N.A.; (iv) N.A.

1-n (multi-databases):
  (i) direct ST + SQP; (ii) direct ST + SQP + CS; (iii) N.A.; (iv) same as n-n.

n-1 (multiple front-ends):
  (i) n separate direct ST processes; (ii) n separate direct ST processes + CS; (iii) same as n-n; (iv) same as n-n.

n-n (arbitrary interconnection):
  (i) ST processes to canonical form + SQP; (ii) ST processes to canonical form + SQP + CS; (iii) n(n-1)/2 ST processes; query translation from local to remote schema + SQP; (iv) ST processes + limited SQP.

ST: schema translation; SQP: semantic query processing; CS: constraint satisfaction.
Table I: Classification of functional requirements (i)-(iv) for the integration platform.

As a result of the above discussion, the second thesis of knowledge-based integration is that all components of a distributed integration platform are interoperable and that no single component needs to store or possess an entire universal conceptual schema (UCS). Although the knowledge of a global "corporate problem domain" is implicitly present in the distributed interoperable integration platform, we do not insist that application programs using the integrated system have direct access to this information. The UCS is a distributed resource of which only a part is directly available to the component applications at any given time. Indeed, it is only required that any appropriate part of the UCS is generated whenever and wherever it is necessary. The universal schema is nowhere completely visible; however, all individual sites know how to generate those parts of the UCS which deal with the information they can contribute. This may happen after interaction with, and aid from, other sites, hence the name cooperative object management. Different sites may generate portions of the UCS depending on the information they can contribute to a global query. This relaxation of the centralisation criteria means that this particular architecture can enjoy the advantages of both logically centralised and federated systems without having to pay the heavy penalty of physical centralisation. This architecture is called semi-decentralised, because it maintains only a virtual universal conceptual schema. The UCS is described in a canonical form [5]. The number of necessary translation functions is lower than would be necessary in federated systems (where pairwise schema matching is maintained) [6]. This solution is preferable to federated systems in cases where close integration is required, while federated systems may be an efficient option when export-import schemata are very small relative to the size of the local schemata.
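To make the on-demand character of the virtual UCS concrete, here is a minimal Python sketch; the class names, the fragment format, and the 'Customer' concept are assumptions of this illustration. Each site holds only generators for the UCS fragments it can contribute, and a fragment is materialised only when a request touches it:

# Sketch of a virtual universal conceptual schema (UCS): no site stores
# the whole UCS; each site generates the fragments it is competent for,
# on demand, when a global request touches them.
class Site:
    def __init__(self, name, fragment_generators):
        self.name = name
        # concept name -> function producing that UCS fragment
        self.generators = fragment_generators
        self.cache = {}

    def ucs_fragment(self, concept):
        """Materialise (and cache) a UCS fragment only when needed."""
        if concept not in self.cache and concept in self.generators:
            self.cache[concept] = self.generators[concept]()
        return self.cache.get(concept)

def resolve(concept, sites):
    """Cooperative lookup: ask every site and collect whoever can
    contribute a fragment for the requested concept."""
    return {s.name: frag
            for s in sites
            if (frag := s.ucs_fragment(concept)) is not None}

# Two autonomous sites each contribute their view of 'Customer' on demand.
s1 = Site("sales", {"Customer": lambda: {"attrs": ["name", "orders"]}})
s2 = Site("billing", {"Customer": lambda: {"attrs": ["name", "balance"]}})
print(resolve("Customer", [s1, s2]))

Caching the generated fragment reflects the idea that a part of the UCS, once built for a request, can be reused at that site; the complete UCS never exists in any single place.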
An important feature of the semi-decentralised architecture is that no site assumes authority over others; hence there is a complete lack of explicit hierarchy between the sites of the integration platform. No integration function is centralised, so the architecture is inherently more fault tolerant than centralised systems. For the execution of particular transactions, sites may assume responsibility and, as long as they play according to the "rules of the game", they become the top of the hierarchy for that particular transaction (e.g., in the case of a global transaction). This avoids the bottleneck of having to turn to a central authority when planning the execution of a request and when processing a complex transaction. (Some systems avoid this bottleneck by introducing multiple levels of hierarchy, or domains [11], so that only inter-domain transactions need central intervention.)
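The per-transaction hierarchy can be sketched as follows (a Python toy; the classes and transaction steps are invented for illustration): whichever site receives a global transaction coordinates just that transaction, and no site has standing authority.

# Sketch: no fixed hierarchy between sites; the site that accepts a
# global transaction coordinates only that transaction.
class PeerSite:
    def __init__(self, name):
        self.name = name

    def execute_local(self, step):
        # Stand-in for running the local part of a global transaction.
        return f"{self.name} did {step}"

    def coordinate(self, transaction, peers):
        """Act as top of the hierarchy for this one transaction only."""
        results = [self.execute_local(transaction["local_step"])]
        for peer, step in zip(peers, transaction["remote_steps"]):
            results.append(peer.execute_local(step))
        return results

a, b, c = PeerSite("A"), PeerSite("B"), PeerSite("C")
# Site A happens to receive this request, so A coordinates it; a later
# transaction received by B would be coordinated by B instead.
print(a.coordinate({"local_step": "update stock",
                    "remote_steps": ["update ledger", "update orders"]},
                   [b, c]))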
3 Knowledge Organisation in the Integration Platform

The DCOM network maintains a shared workspace, i.e. a knowledge and object base, which raises a number of data/knowledge management issues not addressed by current database technology. Each DCOM integrates two essential interrelated components: a data representation and a reasoning substrate. The data representation substrate embraces both descriptive and prescriptive computations, thus faithfully representing domain entities and recording domain-specific knowledge in manageable pieces. The reasoning substrate is capable of exploiting accumulated domain knowledge and meta-knowledge, thus enhancing the scope and sophistication of the inferential services. The representation and reasoning substrates integrate distinct problem-solving components involving objects, clusters of objects, rules, and constraints, and have knowledge of the use of the individual components and basic knowledge of the modeled problem domains. DCOM-s are responsible for the following functions:

- Accepting requests from local applications.
- Translating requests to the universal conceptual schema level. This translation task is further divided into two subtasks (a minimal sketch follows this list):
  - Syntactic translation of requests issued in the data language of the local information system/application into the homogeneous (canonical) pivot language.
  - Semantic translation of the homogenised requests into requests referring to universal-schema-level objects.
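The two-step translation might look as follows (a Python sketch; the toy local data language, the pivot form, and the mapping table are all assumptions of the example):

# Sketch of the two translation subtasks a DCOM performs on a request:
# (1) syntactic translation from the local data language into a
#     homogeneous pivot form, and (2) semantic translation of that
#     pivot form up to universal-schema objects.

def syntactic_translation(local_request):
    """Parse a toy local query 'GET <table>.<column>' into pivot form."""
    _, qualified = local_request.split()
    table, column = qualified.split(".")
    return {"op": "select", "table": table, "column": column}

# Semantic mapping from local constructs to universal-schema objects.
UCS_MAP = {("EMP", "SAL"): ("Employee", "salary")}

def semantic_translation(pivot):
    """Rewrite a pivot request against universal-schema objects."""
    ucs_class, ucs_attr = UCS_MAP[(pivot["table"], pivot["column"])]
    return {"op": "select", "class": ucs_class, "attribute": ucs_attr}

# A local request is first homogenised, then lifted to the UCS level.
print(semantic_translation(syntactic_translation("GET EMP.SAL")))

Separating the two steps means a new component system only needs a new syntactic parser plus mapping entries; the semantic stage is shared.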
[Figure: request processing in a DCOM. An application program issues a local request, expressed in the local DL against the local part of the HCHS; using the local HCHS and additional semantics, the request is translated to canonical form.]