A Global Object Model for Accommodating Instance Heterogeneities

Ee-Peng Lim
Centre for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Singapore 639798
Email: [email protected]

Roger H.L. Chiang
Information Management Research Centre, Nanyang Business School, Nanyang Technological University, Singapore 639798
Email: [email protected]

Abstract

To completely address database integration problems in the context of multidatabase and data warehousing systems, one has to examine various integration and query requirements. In this paper, we identify three phases of database integration, i.e. analysis, derivation, and evolution. Due to poor data quality in local databases and local database updates, some unresolved instance conflicts have to be accommodated by the integrated databases. We have therefore proposed a new global object model to accommodate instance conflicts and to provide query support on the data containing instance conflicts throughout the derivation and evolution phases of database integration. The new global object model, known as OORA, has been designed to allow database integrators and end users to query both the local and resolved instance values using the same query language.

1 Introduction

To integrate two or more pre-existing databases without violating their local autonomy, either a multidatabase approach or a warehousing approach can be adopted. In both approaches, the database integration tasks are similar. Before pre-existing databases can be integrated, the differences among them must first be identified and then resolved if possible. All inter-database heterogeneities can be generally classified into schema and instance ones. Schema heterogeneities refer to differences among schema elements. Instance heterogeneities refer to conflicts that arise when data from different databases have to be integrated into multidatabases or data warehouses. The resolutions of schema and instance heterogeneities are also known as Schema Integration and Instance Integration respectively. In general, schema integration must be done before instance integration. Past database integration research has focused on schema integration [BLN86, LC94, LNE86, SPD92, VA96]. Instance conflicts, in contrast, have not been fully addressed.

To fully address the schema and instance integration issues in both multidatabase and data warehousing systems, one has to examine the database integration process at the macro level. We believe that inter-database heterogeneities should be handled throughout the entire database integration process. While there is no well-accepted database integration methodology, we propose to divide the entire integration process into three phases, namely Analysis, Derivation, and Evolution, as shown in Figure 1.





• Analysis: Analysis is essentially a knowledge acquisition phase. In this phase, database integrators are expected to understand pre-existing databases at both the conceptual and implementation levels. These local database semantics can be acquired from local database owners. Database integrators are also required to find out from the integrated database users their global application requirements in order to derive the global schema and instances.
• Derivation: The actual derivation of the global schema and integrated instances is done in this phase. Once the derivation is done, queries on the integrated database can be evaluated. It is in this phase that a complete mapping from local schemas to the global schema, as well as a mapping from local instances to global instances, is specified.
• Evolution: Due to the autonomy of local database systems, updates to the local databases may violate the mapping from local instances to global instances. Evolution therefore refers to the ongoing refinement of integrated databases as the local database schemas and instances evolve. It is the most important phase in maintaining a multidatabase or data warehousing system.

Analysis
  People: Global DB users, DB integrators, Local DB owners
  Tasks: Understand the local DB semantics; understand the global application requirement

Derivation
  People: Global DB users, DB integrators
  Tasks: Defining the integrated schema; defining the derivation of global instances

Evolution
  People: Global DB users, DB integrators
  Tasks: Querying the integrated DB; identifying post-derivation conflicts and their resolution

Figure 1: Database Integration Phases

Among the above three phases, evolution has been largely ignored in database integration research, primarily for two reasons. Firstly, most researchers focus on schema integration issues. While many schema integration issues have to be investigated for different databases during the derivation phase, it is uncommon to investigate schema integration issues during the evolution phase because pre-existing local schemas are rarely modified. Secondly, research on the integration of instances has been preoccupied with query processing issues instead of local database updates during the evolution phase. In this paper, we argue that instance integration may not be complete in the derivation phase. During the evolution phase, one also has to consider local database updates which lead to new instance conflicts that cannot be handled by pre-defined integration methods. Hence, new global data models that can accommodate instance heterogeneities become necessary.

Literature Review

Most previous database integration research focused on resolving schema conflicts. Lately, as researchers begin to address instance integration problems, several solutions for instance conflict resolution have been proposed [DeM89, WM90, LSPR93, TCY93, LSS94]. Most of these solutions resolve instance conflicts using some pre-determined approach [DeM89, LSPR93, TCY93, LSS94], e.g. probabilistic reasoning or uncertainty theories. Hence, the data models adopted by these solutions neither facilitate different ways to resolve conflicts, nor accommodate much information about instance heterogeneities. There is some early work on extending data models to accommodate instance conflicts:

• The Polygen model [WM90] was proposed to capture source information of attribute values that come from different local relations. A source value is associated with every attribute value of the tuples of polygen relations. The source information captured includes the sites the attributes originated from and the intermediate sites at which they are processed. The model, however, does not provide a mechanism to accommodate or resolve instance heterogeneities.
• The TS Relational model [LC96] was proposed to accommodate entity and attribute conflicts in a relational integrated database. A special source attribute is assigned to every relation. An extended relational algebra has been proposed to manipulate the TS relations. Like the Polygen model, the TS Relational model is not designed to represent resolved instance values.
• The Role-based Multidatabase model [SC94] extended Litwin's multidatabase model [LA94] by considering the different roles (or relations) assumed by real-world objects. Queries on a role-based multidatabase are decomposed into queries on different combinations of roles. Apart from not handling resolved instance values, the role-based multidatabase model does not distinguish between tolerable and intolerable relationship and attribute value conflicts.

The notions of tolerable and intolerable conflicts will be elaborated in Section 2.

Research Objectives

Our research addresses the problem of accommodating instance heterogeneities (conflicts) in the global data model adopted for integrated databases. There are a number of reasons for accommodating instance heterogeneities in integrated databases:

• Resolving all instance differences may not be desirable because some global applications may want to retain and view these differences in the integrated database. For example, the different prices for the same product sold in different stores may be required to be retained and queried in the integrated database.
• Preserving the instance heterogeneities allows database integrators to apply different resolution techniques on the same integrated database for different global application requirements.
• Resolving all instance conflicts may not be possible because the information and knowledge required for complete conflict resolution is not available at the moment of instance integration.
• Resolving all instance conflicts may not be feasible because the processing time needed to resolve conflicts for a large number of instances may be so long that integrated information is not available on time.

In this paper, we present an object-oriented global data model that can accommodate instance heterogeneities for attributes and relationships in the integrated database. The new global data model supports different integration and query requirements from the database integrators and database users during the derivation and evolution phases of database integration. To the best of our knowledge, this is the first attempt at developing a global data model with the purpose of supporting queries on integrated databases containing resolved and unresolved instances.

Research Contributions

On the whole, our research contributes to database integration in a number of ways. Firstly, it presents the different types of instance heterogeneities one may encounter during database integration. Secondly, it clearly points out the different integration and query requirements from the database integrators and database users during the derivation and evolution phases of database integration. We introduce the concepts of threshold predicates and resolution functions to manage and accommodate instance heterogeneities. Thirdly, an extended object-oriented data model has been introduced to accommodate attribute value and relationship conflicts in the integrated databases. The proposed model is also equipped with the necessary query and integration primitives.

Outline of Paper

This paper is organized as follows. In Section 2, we describe the various types of instance conflicts and the overall approaches to handle them in database integration. Section 3 describes requirements for accommodating instance heterogeneities during the derivation and evolution phases. This study motivates the development of an extended object-oriented data model called OORA, which is presented in Section 4. Section 5 presents examples of OORA queries. The conclusions and future research directions are given in Section 6.


2 Instance Heterogeneities

Instance heterogeneities can be classified into entity conflicts, attribute conflicts, and relationship conflicts [LSS94, LC97]. Entity conflicts arise when it is not known which entity instances from matching entity types¹ correspond to the same real-world entities. Relationship conflicts occur when it is not known which relationship instances from matching relationship types correspond to the same real-world relationships. Attribute conflicts arise when the matching entity (or relationship) instances (determined by resolving entity or relationship conflicts) do not have the same attribute values.

Consider the following integration scenario. Let DBA and DBB be two databases containing employee information. The former is owned by the headquarters, while the latter is maintained by the regional office.

  DBA: Staff(ename, position, salary)
  DBB: Emp(ename, title, salary, qual)

Assume that during the derivation phase, the database integrator defines an integrated relation Employee directly from the above two relations as shown below.

  Employee(ename, position, salary, qual)

The integration of the two local relations Staff and Emp into Employee at the schema level can be performed prior to integrating their instances. To integrate employee tuples from the two local relations at the instance level, one has to address the issues of matching tuples that represent the same real-world entities, and resolving their attribute value and relationship conflicts.

As pointed out by a number of researchers [GSSC96, LC96], instance integration may be difficult due to imperfect data quality in the legacy databases. For example, a typographical error in an employee's ename in Staff may lead to a failure to match the Staff data instance with the corresponding instance in Emp. The method used for resolving discrepancies in employees' positions may also be different from that used for salary. In this paper, we will further point out that the tolerance of instance conflicts varies among different attributes. In fact, it is common that not all instances from legacy databases can be properly integrated during the derivation phase.

Even assuming that all instances of Staff and Emp have been properly integrated during the derivation phase, one still has to handle integration issues arising from updates to the local database(s) during the evolution phase. For example, new employee data instances could be added to Emp, making it necessary to perform instance integration on the new instances. Similarly, instance integration is required for changes to attributes of some pre-existing data instances.

There are essentially two approaches to handle instance integration problems during the derivation and evolution phases:

• The first approach requires the database integrator to anticipate all possible integration scenarios during the derivation phase and define the instance integration methods accordingly, hoping that all integration problems in the evolution phase can be predicted in advance.
• When the integration scenarios cannot be predicted in advance (which is often the case), one has to resort to accommodating instance conflicts in the integrated database. These conflicts may be finally resolved some time in the future, or may not be resolved at all.

¹ Matching entity types are determined by schema integration.

2.1 Entity Conflicts

To resolve entity conflicts, the knowledge for identifying instances representing the same real-world entities is required. In some simple cases, common keys among entity instances could be used to match instances. For example, the employee name attribute can be used to match data from DBA and DBB. Complicated entity conflicts arise when there is no common attribute that can be used to match instances from different databases. Some special techniques designed for resolving entity conflicts have been proposed [WM89, CS92, LSPR93, LS93]. Although it may not be possible to resolve all entity conflicts, instances that could not be determined to represent the same real-world entities can still be retained as separate instances in the global database.

2.2 Attribute Conflicts

Given two instances that represent the same real-world entity, the differences in their equivalent attributes are known as attribute conflicts². We distinguish two main types of attribute conflicts, namely tolerable and intolerable attribute conflicts, that should be handled in database integration. Tolerable attribute conflicts are those expected by a database integrator at the time an integrated database is derived. Intolerable attribute conflicts, on the other hand, refer to those unexpected attribute conflicts that cannot be dealt with using the predefined resolution methods.

To distinguish between the above two types of attribute conflicts, we introduce the concept of a threshold predicate. Very often, when the difference between two or more conflicting attribute values is smaller than a threshold value, or when the conflicting attribute values differ in some pattern, there is a straightforward pre-defined approach to handle the conflicts. The exact conflict handling approach can be readily specified during the derivation phase of database integration. The primary purpose of a threshold predicate is therefore to explicitly capture the criteria to be satisfied by tolerable attribute conflicts. In other words, we define tolerable attribute conflicts to be those satisfying the threshold predicates defined for the attributes involved.

To handle tolerable attribute conflicts, it is necessary to resolve tolerable conflicting attribute values derived from different local databases. Resolution functions can be defined to resolve the corresponding tolerable attribute conflicts. However, it is not always possible to apply resolution functions to resolve all possible tolerable attribute conflicts. Sometimes, one may not know the correct resolution function to be used. On other occasions, the conflicting local values of an attribute may all be considered valid by the integrated database users. Thus, no resolution function is required for the attribute.

When the conflicting local attribute values cannot satisfy the threshold predicate defined for the corresponding attribute, database integrators should be alerted. Attribute conflicts which fail the specified thresholds are defined to be intolerable. Database integrators can be alerted to intolerable attribute conflicts by having them recorded in a log file.

² If the attribute values are identical, the attribute conflict does not exist.

Attribute Conflict Example:

When the price of a product has different values in different databases, it is an attribute conflict. However, a difference of less than $1 may not be significant enough to warrant the database integrators' attention (the company may practise geographic price discrimination). In this case, the multidatabase system or data warehousing system can simply apply a pre-defined resolution function (e.g. choosing the larger value, or averaging the two values) to derive an integrated product price. With this threshold, a product price difference of $0.75 is considered tolerable while a difference of $1.50 is considered intolerable.
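The price example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `price_threshold`, `price_resolution` and `integrate_price` are hypothetical, and choosing the larger value is just one of the two resolution options mentioned in the text. The $1 limit is taken from the example.

```python
TOLERANCE = 1.00  # dollars; differences strictly below this are tolerable


def price_threshold(p1: float, p2: float) -> bool:
    """Threshold predicate: True when the price conflict is tolerable."""
    return abs(p1 - p2) < TOLERANCE


def price_resolution(p1: float, p2: float) -> float:
    """Resolution function (assumed choice): keep the larger price."""
    return max(p1, p2)


def integrate_price(p1: float, p2: float):
    """Return the integrated price, or None for an intolerable conflict."""
    if p1 == p2:
        return p1  # identical values: no conflict exists
    if price_threshold(p1, p2):
        return price_resolution(p1, p2)
    return None  # intolerable: left for the database integrator


print(integrate_price(10.00, 10.75))  # difference of 0.75: tolerable
print(integrate_price(10.00, 11.50))  # difference of 1.50: intolerable
```

Returning `None` for the intolerable case mirrors the idea that such conflicts are recorded and left unresolved rather than silently merged.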

2.3 Relationship Conflicts

Relationship conflicts, first discussed in [LC97], arise when the relationship between two real-world entities is not represented consistently in different databases. In [LC97], different types of relationship conflicts have been derived; they can be caused by incorrect schema integration, incorrect entity conflict resolution, and inaccurate database content.

Relationship Conflict Example

For example, employee e1 is known to work for department d1 in database A but d2 in database B, where d1 and d2 are determined to represent different departments in the real world after entity conflict resolution. This relationship conflict could be caused by incorrect relationship cardinalities in the integrated schema. It could also be caused by d1 and d2 being wrongly determined to represent the same department. Or, it could be the case that the employee data in either database A or B is not accurate.

Like other instance-level conflicts, a complete resolution of relationship conflicts may not always be possible. When relationship conflicts cannot be resolved by the multidatabase system or data warehousing system, they should be retained and accommodated.

3 Requirements for Accommodating Instance Heterogeneities

If inter-database conflicts at both the schema and instance levels can be resolved completely, the integrated databases can be represented by using some standard data model. However, when only the schema conflicts have been resolved, we have to extend the standard data models and their query languages to accommodate and manipulate instance heterogeneities. Before we propose a global object model for this purpose, we first investigate the potential query and integration requirements imposed by two types of users of the integrated databases, the database integrators (system users) and end users.

Database Integrators

A database integrator's involvement in database integration encompasses both the derivation and evolution phases as shown in Figure 1. During the derivation phase, the database integrator first performs schema integration on the local databases. He or she later defines some transformations on the instances from each local database so that they conform to the integrated schema before instances from different local databases are integrated. The database integrator's tasks during the evolution phase are similar except that schema integration is usually not required and the instances to be dealt with are fewer. A detailed discussion about the transformation of local instances is beyond the scope of this paper. We will instead focus on the database integrator's activities in resolving the conflicts between transformed local instances. Henceforth, without loss of generality, we assume that the database integrator is given local databases that conform to the integrated schema.

To efficiently integrate local instances, database integrators require some mechanisms to easily compare local instances and discover conflicts between them. Subsequently, appropriate resolution methods can be applied to the discovered conflicts that can be resolved. Throughout this integration process, a global data model that accommodates instance conflicts becomes necessary. With such a global data model, the database integrator can query the integrated database containing instance conflicts and view these conflicts in the query results. The global data model should also allow database integrators to define different resolution functions to resolve conflicts or enforce consistency of data stored in local databases.

End Users

End users utilize the integrated database for report generation and decision making. Depending on their needs, they may choose to query the resolved or original instance values. For example, one may want to query the resolved salary values of employees who hold manager positions in a particular local database. Hence, a global data model should support flexible queries on both original and resolved instance values.

As both multidatabase and data warehousing systems preserve the autonomy of local database systems, updates to local databases can often introduce new instance conflicts to integrated databases. Some of these new instance conflicts could be handled automatically by the resolution functions predefined by database integrators, and hence no further actions would be required by the end users. For other new instance conflicts that cannot be resolved automatically, database integrators have to be called upon to handle them. Nevertheless, before any actions can be taken by the database integrators, these new conflicts have to be accommodated by the global data model and the end users should be allowed to continue using the integrated database.

An Example of Integration Scenario

We employ an integration scenario to demonstrate the attribute and relationship conflicts. Figure 2 depicts the object-oriented schemas of the local databases (DBA and DBB) containing company information. Here, we assume that schema integration has been performed and the schemas of existing databases have been made compatible to facilitate instance comparisons. The instances of DBA and DBB are shown in Figures 3 and 4 respectively. Suppose all entity conflicts are resolved by matching ename and dname of employee and department instances respectively. We notice that John in DBA works in the Marketing department but John in DBB works in the Research department. This is a relationship conflict. On the other hand, the difference in salary for Chen in DBA and DBB is an attribute conflict. Figure 5 depicts the integrated schema derived from DBA and DBB.

(a) Schema of DB A: Employee(ename, position, salary, qual) and Department(dname, floor, budget), related by work_in and managed_by
(b) Schema of DB B: Employee(ename, position, salary) and Department(dname, floor, budget), related by work_in and managed_by

Figure 2: Schemas of Local Databases

Employee instances
  ename    position   salary  qual
  john     trainee    1000    diploma
  kim      secretary  1500    NULL
  chen     engineer   2500    MEng
  mark     manager    3000    BBus
  daniel   engineer   2400    BEng

Department instances
  dname      floor  budget
  marketing  3      2M
  planning   2      4M
  library    1      1M

[work_in and managed_by links between the instances are not shown]

Figure 3: Instances of DBA

Employee instances
  ename     position   salary
  kim       assistant  1500
  chen      leader     2600
  stacy     sales rep  3500
  john      trainee    1200
  sugimoto  fellow     10000
  kain      engineer   5000

Department instances
  dname      floor  budget
  marketing  3      2.1M
  research   6      1M

[work_in and managed_by links between the instances are not shown]

Figure 4: Instances of DBB

Employee(ename, position, salary, qual) and Department(dname, floor, budget), related by work_in and managed_by

Figure 5: Integrated Schema
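The conflicts named in this scenario can be discovered mechanically by comparing local instances that share the same ename. The following Python sketch illustrates the idea on a subset of the instances in Figures 3 and 4. The dictionary layout and the function name `find_conflicts` are assumptions made for illustration, and only John's work_in values (stated in the text) are included.

```python
# Local Employee instances keyed by ename (subset of Figures 3 and 4).
DB_A = {
    "john": {"position": "trainee", "salary": 1000, "work_in": "marketing"},
    "kim": {"position": "secretary", "salary": 1500},
    "chen": {"position": "engineer", "salary": 2500},
}
DB_B = {
    "john": {"position": "trainee", "salary": 1200, "work_in": "research"},
    "kim": {"position": "assistant", "salary": 1500},
    "chen": {"position": "leader", "salary": 2600},
}


def find_conflicts(db_a, db_b):
    """List (ename, attribute, value_in_A, value_in_B) for differing values."""
    conflicts = []
    # Only instances present in both databases can conflict.
    for ename in sorted(db_a.keys() & db_b.keys()):
        a, b = db_a[ename], db_b[ename]
        for attr in sorted(a.keys() & b.keys()):
            if a[attr] != b[attr]:
                conflicts.append((ename, attr, a[attr], b[attr]))
    return conflicts


for row in find_conflicts(DB_A, DB_B):
    print(row)
```

The result includes Chen's salary difference (an attribute conflict) and John's differing work_in values (a relationship conflict), in line with the discussion above.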

4 The OORA Object-Oriented Data Model

In this section, we propose the OORA³ extended object-oriented data model for representing "partially" integrated databases. These integrated databases have a fully integrated schema but partially integrated instances. Specifically, the integrated database has undergone entity conflict resolution. However, attribute and relationship conflict resolution have not been performed on the local instances. Other than accommodating attribute and relationship conflicts, the OORA data model is designed such that queries on the integrated databases can be properly formulated and evaluated. Furthermore, the OORA data model ensures that the source of instance conflicts can be identified in order to support subsequent integration work on the partially integrated database. OORA differs from the traditional OO data model in a number of ways:

• Identification of matching criteria for deriving global objects;
• Specification of threshold predicates and resolution functions;
• Representation of original and resolved attribute values; and
• Uniform treatment of attribute and relationship conflicts.

In the following, we describe the unique features of the OORA model in detail.

4.1 Global Objects

A global object in the integrated database is derived from one or more local objects that represent the same real-world entity. As in the traditional OO data model, each global object is assigned a unique global object id (oid). In OORA, we assume that local objects corresponding to the same global objects (or real-world entities) can be matched by examining some common attribute values. These common attribute(s) can be specified as matching criteria by the database integrator. For example, a database integrator may use ename to match Employee objects, and dname to match Department objects from DBA and DBB. The following two data definition statements have been used to identify matching local objects:

  DERIVE EMP from Employee@DB_A, Employee@DB_B
    USING Employee@DB_A(ename), Employee@DB_B(ename);

  DERIVE DEPT from Department@DB_A, Department@DB_B
    USING Department@DB_A(dname), Department@DB_B(dname);
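To make the effect of a DERIVE statement concrete, the following Python sketch groups local objects by a matching attribute and assigns each group a global oid. It is an illustration only, not the OORA implementation: the function `derive`, the oid format, and the assumption that the matching attribute is unique within each local database are all hypothetical.

```python
from itertools import count


def derive(local_sets, key):
    """Group local objects sharing the same key value into global objects.

    local_sets maps a database id to a list of local objects (dicts).
    Returns a dict: global oid -> {database id: local object}.
    Assumes the key value is unique within each local database.
    """
    oids = count(1)
    by_key = {}
    for db_id, objects in local_sets.items():
        for obj in objects:
            by_key.setdefault(obj[key], {})[db_id] = obj
    # Assign oids in key order so the result is deterministic.
    return {f"g{next(oids)}": members for _, members in sorted(by_key.items())}


emp = derive(
    {
        "A": [{"ename": "chen", "salary": 2500}, {"ename": "mark", "salary": 3000}],
        "B": [{"ename": "chen", "salary": 2600}, {"ename": "stacy", "salary": 3500}],
    },
    key="ename",
)
for oid, members in emp.items():
    print(oid, sorted(members))
```

Local objects without a match in the other database (mark and stacy here) still yield a global object, derived from a single local object.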

4.2 Threshold Predicates and Resolution Functions

A threshold predicate and a resolution function can be optionally defined for each attribute in the global schema. Given an attribute in a class of global objects, the threshold predicate determines for each global object if a difference between local values of the attribute is tolerable. The resolution function is then specified to resolve tolerable attribute conflicts automatically. Depending on the characteristics of attributes, different threshold predicates and resolution functions should be defined by the database integrators and be implemented using system-defined functions/operators or general programs. Examples of thresholds and resolution functions for tolerable instance conflicts are:

• A price difference of less than 5 cents for an apparel product in different databases of a department store may be tolerable. The highest price can be assigned to the product in the integrated database.
• A floor area difference of less than 1 square foot for the same apartment in different property databases may be considered insignificant. Such a difference could be resolved by choosing the smallest floor area for the integrated database.

Given an attribute in the global schema, three combinations of threshold predicates and resolution functions can be constructed⁴:

• Both the threshold predicate and resolution function are undefined: This implies that any difference between the corresponding attribute values is considered an intolerable attribute conflict. Unless all corresponding attribute values given are identical, the resolved attribute value is always NULL.
• The threshold predicate is defined, but not the resolution function: This implies that tolerable attribute conflicts can exist among distinct instances. These conflicts are also acceptable. However, unless the acceptable attribute conflict involves identical values, the resolved attribute value is always NULL.
• Both the threshold predicate and resolution function are defined: This implies that tolerable attribute conflicts can exist among distinct instances and the resolution function will return the resolved attribute values.

³ R represents the relationship conflicts. A represents the attribute conflicts. The name OORA indicates that both relationship and attribute conflicts can be accommodated.

⁴ Note that when the threshold predicate is not defined for an attribute, it is meaningless to define the resolution function for the attribute, since any difference between corresponding attribute values is considered intolerable, and such a conflict cannot be resolved by a resolution function at all.

4.3 Elements of Attribute Values

In the OORA model, every non-oid attribute has a domain consisting of three elements, namely the original values (denoted by ovalue), resolved values (denoted by rvalue) and conflict type (denoted by conflictType). The resolved value, original value, and conflict type of an attribute A are represented by A.rvalue, A.ovalue and A.conflictType respectively.

The A.ovalue of a global object is defined to be a set of (value, database id) pairs, where value denotes the attribute value contributed by the corresponding object from the existing database identified by database id.

The A.rvalue of a global object is defined to be any A value contributed by local objects if there is no attribute conflict. If a difference is found among the local A values, the tolerance of the conflict is first determined using A.threshold(). If the conflict is tolerable, A.rvalue is obtained by applying A.resolution() on the local attribute values. In the event that the conflict is intolerable or A.resolution() is undefined, NULL is assigned to A.rvalue.

Depending on the original attribute values and the threshold predicate defined for the attribute, different conflict types can be derived and captured in A.conflictType. A.conflictType is assigned NULL if there is no conflict, Resolvable if there is a tolerable conflict that can be resolved by the pre-defined resolution function, Acceptable if there is a tolerable conflict and there is no pre-defined resolution function for resolving the conflict, and Intolerable if there is an intolerable conflict.

In our integrated database example, we can define the threshold predicates and resolution function for the salary and position attributes as follows:

  DEFINE salary.threshold@EMP(s1,s2) = (abs(s1-s2) ≤ 100) or (abs(s1-s2) / max(s1,s2) ≤ 0.1)

  DEFINE salary.resolution@EMP(s1,s2) = max(s1,s2)

  DEFINE position.threshold@EMP(p1,p2) = (p1=p2) or (p1=secretary and p2=assistant)

With the above definitions, the salary values of 2500 and 2600 for the employee Chen constitute a resolvable attribute conflict. The global object for the employee Chen will have a resolved salary value of 2600 computed by the resolution function. On the other hand, the salary values of 1000 and 1200 for the employee John constitute an intolerable attribute conflict. In this situation, database integrators should be alerted and the conflict should be resolved manually. Since only a threshold predicate is defined for the position attribute, the position values of secretary and assistant constitute an acceptable conflict. For example, the attribute elements of the salary attribute of Chen and John, and the attribute elements of the position attribute of Kim, are shown below:

  Chen:  Salary.ovalue = { (2500, A), (2600, B) }
         Salary.rvalue = 2600
         Salary.conflictType = Resolvable

  John:  Salary.ovalue = { (1000, A), (1200, B) }
         Salary.rvalue = NULL
         Salary.conflictType = Intolerable

  Kim:   Position.ovalue = { (secretary, A), (assistant, B) }
         Position.rvalue = NULL
         Position.conflictType = Acceptable

To keep the discussion simple, we have so far only mentioned attributes with simple domains. OORA can handle multivalued attributes and attributes with complex data types in a similar manner.
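The derivation of A.rvalue and A.conflictType described in this section can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `derive_elements`, the use of `None` for NULL, and the reading of the salary threshold as "difference at most 100, or relative difference at most 0.1" are all choices made here for illustration.

```python
from functools import reduce
from itertools import combinations


def derive_elements(ovalue, threshold=None, resolution=None):
    """Return (rvalue, conflictType) for a non-empty list of (value, database_id) pairs.

    None stands for NULL. The function name and signature are hypothetical.
    """
    values = [value for value, _db in ovalue]
    if all(v == values[0] for v in values):
        return values[0], None                    # no conflict
    # Without a threshold predicate, every difference is intolerable.
    tolerable = threshold is not None and all(
        threshold(a, b) for a, b in combinations(values, 2)
    )
    if not tolerable:
        return None, "Intolerable"
    if resolution is None:
        return None, "Acceptable"                 # tolerable, but nothing to resolve with
    return reduce(resolution, values), "Resolvable"


def salary_threshold(s1, s2):
    # Assumes positive salaries, so max(s1, s2) is never zero.
    return abs(s1 - s2) <= 100 or abs(s1 - s2) / max(s1, s2) <= 0.1


def position_threshold(p1, p2):
    return p1 == p2 or (p1 == "secretary" and p2 == "assistant")


chen = derive_elements([(2500, "A"), (2600, "B")], salary_threshold, max)
john = derive_elements([(1000, "A"), (1200, "B")], salary_threshold, max)
kim = derive_elements([("secretary", "A"), ("assistant", "B")], position_threshold)
print(chen, john, kim)
```

Run on the Chen, John and Kim values, the sketch yields the Resolvable, Intolerable and Acceptable cases respectively. Note that the position predicate is order-sensitive as written, so the sketch passes values in the order they appear in ovalue.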

4.4 Relationship Conflicts

In the object-oriented data model, relationships can be treated as attributes that provide references to objects in other classes. These attributes are known as reference attributes. The domain of a reference attribute consists of object ids. A one-to-one or many-to-one relationship from class C1 to class C2 can be represented as a single-valued reference attribute in C1, while a one-to-many or many-to-many relationship can be represented as a multi-valued reference attribute in C1. For multi-valued reference attributes, the reference attribute values are represented as sets of object ids.

In the OORA model, global relationships are derived from relationships between objects of existing databases. The global relationships, represented as reference attributes in the global schemas, relate global objects from different classes in the integrated database. As with attribute conflicts, we represent the original and resolved values of a reference attribute R in the global schema by R.ovalue and R.rvalue respectively. Threshold predicates and resolution functions can also be defined on reference attributes. For illustration, let us say the research department is in fact part of the marketing department. The following threshold predicate and resolution function can then be defined.

DEFINE work_in.threshold@EMP(o1,o2) = (o1 = o2) ∨ (∀ o, o.oid ∈ {o1,o2}, o.dname ∈ {research, marketing})
DEFINE work_in.resolution@EMP(o1,o2) = o1 if (o1=o2), marketing otherwise

In practice, this threshold predicate can be implemented as a general program. In the above statements, o1 and o2 denote global object ids of DEPT objects. Using the threshold predicate and resolution function for work_in, the elements of the work_in relationships for John and Chen are shown below. Note that d1 and d4 are the global object ids assigned to the Marketing and Research departments respectively.

John:
  work_in.ovalue = {(d1, A), (d4, B)}
  work_in.rvalue = d1
  work_in.conflictType = Resolvable

Chen:
  work_in.ovalue = {(d1, A), (d1, B)}
  work_in.rvalue = d1
  work_in.conflictType = NULL
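The threshold predicate and resolution function for the work in relationship (written `work_in` in the code) can be sketched as follows. This is an illustrative sketch only: the `DEPT` dictionary is a hypothetical in-memory stand-in for the DEPT class, and the resolution function is assumed to return the oid of the marketing department where the DEFINE statement says "marketing".

```python
# Hypothetical in-memory DEPT extent: global oid -> department name.
DEPT = {"d1": "marketing", "d2": "planning", "d3": "library", "d4": "research"}


def work_in_threshold(o1, o2):
    """Tolerable if the oids agree, or both departments are research or marketing."""
    return o1 == o2 or all(DEPT[o] in {"research", "marketing"} for o in (o1, o2))


def work_in_resolution(o1, o2):
    """Return o1 when the oids agree, otherwise the oid of the marketing department."""
    if o1 == o2:
        return o1
    return next(oid for oid, dname in DEPT.items() if dname == "marketing")


def work_in_elements(ovalue):
    """Derive the three elements of work_in from two (oid, database_id) pairs."""
    (o1, _), (o2, _) = ovalue
    if o1 == o2:
        rvalue, conflict = o1, None
    elif work_in_threshold(o1, o2):
        rvalue, conflict = work_in_resolution(o1, o2), "Resolvable"
    else:
        rvalue, conflict = None, "Intolerable"
    return {"ovalue": ovalue, "rvalue": rvalue, "conflictType": conflict}


john = work_in_elements([("d1", "A"), ("d4", "B")])
chen = work_in_elements([("d1", "A"), ("d1", "B")])
print(john["rvalue"], john["conflictType"])
print(chen["rvalue"], chen["conflictType"])
```

For John the sketch produces a resolved value of d1 with conflict type Resolvable, which agrees with the work in entry for e1 in Table 4.1. For Chen it produces d1 with no conflict.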

4.5 Integrated Database Instances

The OORA objects of the integrated database are shown in Tables 4.1 and 4.2. Note that the Attribute-element columns in these tables are included simply to illustrate the three elements of attribute values. As shown in Tables 4.1 and 4.2, the OORA data model retains both attribute and relationship conflicts while holding the matching objects from different databases together by assigning global object ids to them. Respective resolution functions are defined to resolve instance conflicts when they are tolerable. For example, the following threshold predicate and resolution function are defined for the budget attribute.

DEFINE budget.threshold@DEPT(b1,b2) = (abs(b1-b2) ≤ 100000)
DEFINE budget.resolution@DEPT(b1,b2) = min(b1,b2)

Table 4.1: EMP's Global Objects

| oid | Attribute-element | ename | position | salary | qual | work_in |
|---|---|---|---|---|---|---|
| e1 | ovalue | (john,A)(john,B) | (trainee,A)(trainee,B) | (1000,A)(1200,B) | (diploma,A) | (d1,A)(d4,B) |
| | rvalue | john | trainee | NULL | diploma | d1 |
| | conflictType | NULL | NULL | Intolerable | NULL | Resolvable |
| e2 | ovalue | (kim,A)(kim,B) | (secretary,A)(assistant,B) | (1500,A)(1500,B) | (NULL,A) | (d1,A)(d1,B) |
| | rvalue | kim | NULL | 1500 | NULL | d1 |
| | conflictType | NULL | Acceptable | NULL | NULL | NULL |
| e3 | ovalue | (chen,A)(chen,B) | (engineer,A)(leader,B) | (2500,A)(2600,B) | (MEng,A) | (d1,A)(d1,B) |
| | rvalue | chen | NULL | 2600 | MEng | d1 |
| | conflictType | NULL | Intolerable | Resolvable | NULL | NULL |
| e4 | ovalue | (mark,A) | (manager,A) | (3000,A) | (BBus,A) | (d2,A) |
| | rvalue | mark | manager | 3000 | BBus | d2 |
| | conflictType | NULL | NULL | NULL | NULL | NULL |
| e5 | ovalue | (daniel,A) | (engineer,A) | (2400,A) | (BEng,A) | (d3,A) |
| | rvalue | daniel | engineer | 2400 | BEng | d3 |
| | conflictType | NULL | NULL | NULL | NULL | NULL |
| e6 | ovalue | (stacy,B) | (sales rep,B) | (3500,B) | (NULL,B) | (d1,B) |
| | rvalue | stacy | sales rep | 3500 | NULL | d1 |
| | conflictType | NULL | NULL | NULL | NULL | NULL |
| e7 | ovalue | (sugimoto,B) | (fellow,B) | (10000,B) | (NULL,B) | (d4,B) |
| | rvalue | sugimoto | fellow | 10000 | NULL | d4 |
| | conflictType | NULL | NULL | NULL | NULL | NULL |
| e8 | ovalue | (kain,B) | (engineer,B) | (5000,B) | (NULL,B) | (d4,B) |
| | rvalue | kain | engineer | 5000 | NULL | d4 |
| | conflictType | NULL | NULL | NULL | NULL | NULL |

Table 4.2: DEPT's Global Objects

| oid | Attribute-element | dname | floor | budget | managed_by |
|---|---|---|---|---|---|
| d1 | ovalue | (marketing,A)(marketing,B) | (3,A)(3,B) | (2M,A)(2.1M,B) | (e3,A)(e3,B) |
| | rvalue | marketing | 3 | 2M | e3 |
| | conflictType | NULL | NULL | Resolvable | NULL |
| d2 | ovalue | (planning,A) | (2,A) | (4M,A) | (e4,A) |
| | rvalue | planning | 2 | 4M | e4 |
| | conflictType | NULL | NULL | NULL | NULL |
| d3 | ovalue | (library,A) | (1,A) | (1M,A) | (e5,A) |
| | rvalue | library | 1 | 1M | e5 |
| | conflictType | NULL | NULL | NULL | NULL |
| d4 | ovalue | (research,B) | (6,B) | (1M,B) | (e7,B) |
| | rvalue | research | 6 | 1M | e7 |
| | conflictType | NULL | NULL | NULL | NULL |

5 OORA Query Language and Examples

To query the global objects represented in the OORA data model, one has to formulate queries in a language we refer to as OOQLRA. OOQLRA adapts the existing SQL syntax for object-oriented queries [KKS92]. In addition, it is specially designed to support the query requirements for an integrated database containing attribute and relationship conflicts in the derivation and evolution phases of database integration. An OOQLRA SELECT query statement can be expressed as:

SELECT <target attribute 1>, ..., <target attribute m>
FROM <table 1>, ..., <table n>
WHERE <predicate expression>;

Unlike the usual SQL statements, every non-oid attribute (say A) found in an OOQLRA query statement must be in one of the forms A, A.ovalue, A.ovalue(D), A.rvalue and A.conflictType, where D is some local database id. Only attributes of the forms A.ovalue, A.ovalue(D), A.rvalue and A.conflictType can be used in the WHERE clause. In other words, an attribute in the bare form A (the attribute name alone) can only appear in the SELECT clause. In example Q1 below, we retrieve all information about the salary attribute of employees.

Example (Q1):

SELECT E.oid,E.ename,E.salary FROM EMP E;

Table 5.3: Query Result of Q1

| oid | ename | salary |
|---|---|---|
| e1 | (john,A)(john,B) | (1000,A)(1200,B) |
| | john | NULL |
| | NULL | Intolerable |
| e2 | (kim,A)(kim,B) | (1500,A)(1500,B) |
| | kim | 1500 |
| | NULL | NULL |
| e3 | (chen,A)(chen,B) | (2500,A)(2600,B) |
| | chen | 2600 |
| | NULL | Resolvable |
| e4 | (mark,A) | (3000,A) |
| | mark | 3000 |
| | NULL | NULL |
| ... | | |
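The five attribute forms, and the way a bare attribute name in the SELECT clause expands into all three elements as in Q1, can be illustrated with a small Python sketch. The dictionary layout, the `project` and `allowed_in_where` helpers, and the regular expression are assumptions made here for illustration; they are not part of OOQLRA.

```python
import re

# Matches oid, A, A.ovalue, A.ovalue(D), A.rvalue and A.conflictType.
TARGET = re.compile(r"^(\w+)(?:\.(ovalue|rvalue|conflictType)(?:\((\w+)\))?)?$")

# Two EMP objects from Table 4.1, restricted to salary; None stands for NULL.
E1 = {"oid": "e1", "salary": {"ovalue": [(1000, "A"), (1200, "B")],
                              "rvalue": None, "conflictType": "Intolerable"}}
E3 = {"oid": "e3", "salary": {"ovalue": [(2500, "A"), (2600, "B")],
                              "rvalue": 2600, "conflictType": "Resolvable"}}


def project(obj, target):
    """Evaluate one SELECT target against a global object."""
    m = TARGET.match(target)
    if m is None:
        raise ValueError(f"unsupported target: {target}")
    name, element, db = m.groups()
    if name == "oid":
        return obj["oid"]
    attr = obj[name]
    if element is None:
        # Bare form A: return all three elements, as in the result of Q1.
        return attr["ovalue"], attr["rvalue"], attr["conflictType"]
    if db is not None:
        if element != "ovalue":
            raise ValueError("only ovalue can be qualified by a database id")
        # A.ovalue(D): the value contributed by database D, if any.
        return next((v for v, d in attr["ovalue"] if d == db), None)
    return attr[element]


def allowed_in_where(target):
    """The bare form A may appear only in the SELECT clause."""
    m = TARGET.match(target)
    return m is not None and (m.group(1) == "oid" or m.group(2) is not None)


print(project(E1, "salary"))
print(project(E3, "salary.ovalue(B)"), project(E3, "salary.rvalue"))
print(allowed_in_where("salary"), allowed_in_where("salary.conflictType"))
```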

In the following subsections, we will use several query examples to illustrate other essential features of OOQLRA.

Queries on Original Attribute/Relationship Values

The original attribute and relationship values in the existing databases have to be examined by the database integrators while deriving objects in the integrated databases, in both the derivation and evolution phases. For example, the following OOQLRA statement (Q2) could be used to identify unresolved intolerable attribute conflicts in the EMP class.

Example (Q2):

SELECT E.oid, E.ename.ovalue, E.position.ovalue, E.salary.ovalue, E.qual.ovalue
FROM EMP E
WHERE E.ename.conflictType=Intolerable OR E.position.conflictType=Intolerable
   OR E.salary.conflictType=Intolerable OR E.qual.conflictType=Intolerable;

Table 5.4: Query Result of Q2

| oid | ename.ovalue | position.ovalue | salary.ovalue | qual.ovalue |
|---|---|---|---|---|
| e3 | (chen,A)(chen,B) | (engineer,A)(leader,B) | (2500,A)(2600,B) | (MEng,A) |

Since the OORA model accommodates all the original attribute and relationship values in the integrated database, users can query local databases via the global schema using OOQLRA statements. An example of such a query is given in Q3.

Example (Q3):

SELECT E.oid, E.ename.ovalue(A), E.position.ovalue(A), E.salary.ovalue(A), E.qual.ovalue(A), E.work_in.ovalue(A).dname.ovalue(A)
FROM EMP E;

Table 5.5: Query Result of Q3

| oid | ename.ovalue(A) | position.ovalue(A) | salary.ovalue(A) | qual.ovalue(A) | work_in.ovalue(A).dname.ovalue(A) |
|---|---|---|---|---|---|
| e1 | john | trainee | 1000 | diploma | marketing |
| e2 | kim | secretary | 1500 | NULL | marketing |
| e3 | chen | engineer | 2500 | MEng | marketing |
| e4 | mark | manager | 3000 | BBus | planning |
| e5 | daniel | engineer | 2400 | BEng | library |

In the Q3 statement, E.work_in.ovalue(A).dname.ovalue(A) represents a path expression that involves the relationship values provided by DB_A. Equivalently, Q3 can be written as:

SELECT E.oid, E.ename, E.position, E.salary, E.qual, E.work_in.dname
FROM EMP(A) E;

The qualified class name EMP(A) indicates that only attribute values and relationship values from DB_A are required.

Queries on Resolved Attribute/Relationship Values

Once an integrated database is derived, OOQLRA allows end users to query only the resolved attribute and relationship values in the integrated database while hiding the conflicts from the users.

Example (Q4):

SELECT E.ename.rvalue, E.work_in.rvalue.dname.rvalue, E.work_in.rvalue.budget.rvalue
FROM EMP E;

Table 5.6: Query Result of Q4

| ename.rvalue | E.work_in.rvalue.dname.rvalue | E.work_in.rvalue.budget.rvalue |
|---|---|---|
| john | marketing | 2M |
| kim | marketing | 2M |
| chen | marketing | 2M |
| mark | planning | 4M |
| daniel | library | 1M |
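The path expressions used in Q3 and Q4 differ only in which element is read at each step: the value contributed by one local database, or the resolved value. A minimal Python sketch of this idea follows, using fragments of Tables 4.1 and 4.2 for John. The helper names `ovalue_from`, `rvalue` and `via_work_in` are hypothetical, and `work_in` stands for the work in relationship.

```python
# Fragments of Tables 4.1 and 4.2 as nested dictionaries (a hypothetical layout).
DEPT = {
    "d1": {"dname": {"ovalue": [("marketing", "A"), ("marketing", "B")], "rvalue": "marketing"},
           "budget": {"ovalue": [("2M", "A"), ("2.1M", "B")], "rvalue": "2M"}},
    "d4": {"dname": {"ovalue": [("research", "B")], "rvalue": "research"},
           "budget": {"ovalue": [("1M", "B")], "rvalue": "1M"}},
}
JOHN = {"ename": {"ovalue": [("john", "A"), ("john", "B")], "rvalue": "john"},
        "work_in": {"ovalue": [("d1", "A"), ("d4", "B")], "rvalue": "d1"}}


def ovalue_from(db):
    """Build a selector that reads the value contributed by database db (None if absent)."""
    def pick(attr):
        return next((v for v, d in attr["ovalue"] if d == db), None)
    return pick


def rvalue(attr):
    """Selector that reads the resolved value."""
    return attr["rvalue"]


def via_work_in(emp, dept_attr, pick):
    """Evaluate E.work_in.<element>.<dept_attr>.<element> using one selector throughout."""
    dept_oid = pick(emp["work_in"])
    if dept_oid is None:
        return None
    return pick(DEPT[dept_oid][dept_attr])


# Q3 style: follow only the values contributed by database A.
print(via_work_in(JOHN, "dname", ovalue_from("A")))
# Q4 style: follow only resolved values.
print(via_work_in(JOHN, "dname", rvalue), via_work_in(JOHN, "budget", rvalue))
```

Passing the selector as a parameter keeps the traversal identical for both query styles, which mirrors the claim that local and resolved values are reachable through the same query language.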

Evolution of Local Databases

When an integrated database is first derived during the derivation phase, all conflicts between local instances may be fully resolved. As the local databases evolve, new records are added, some old ones are removed, and others are updated. These local changes may lead to unanticipated conflicts in the integrated database. In this case, queries similar to Q4 can be used to identify unresolved attribute and relationship conflicts.

Example (Q5):

SELECT E.oid, E.ename, E.position, E.salary, E.qual, E.work_in
FROM EMP E
WHERE E.ename.conflictType=Intolerable OR E.position.conflictType=Intolerable
   OR E.salary.conflictType=Intolerable OR E.qual.conflictType=Intolerable
   OR E.work_in.conflictType=Intolerable;

Table 5.7: Query Result of Q5

| oid | ename | position | salary | qual | work_in |
|---|---|---|---|---|---|
| e3 | (chen,A)(chen,B) | (engineer,A)(leader,B) | (2500,A)(2600,B) | (MEng,A) | (d1,A)(d1,B) |
| | chen | NULL | 2600 | MEng | d1 |
| | NULL | Intolerable | Resolvable | NULL | NULL |

To resolve the identified conflicts, one has to examine their cause. If the conflicts are due to flaws in the derivation of the integrated database, we can define attribute threshold predicates and resolution functions using the DEFINE statements. Otherwise, the conflicts may be caused by erroneous information introduced into some local database, e.g. typographical errors made during data entry. In this case, the appropriate local database administrator can be advised to correct the error. We can further define triggers using OOQLRA to detect intolerable attribute or relationship conflicts in the integrated objects.
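The effect of a local update on the integrated objects, and the kind of check a trigger would perform, can be sketched in Python. The text does not show OOQLRA trigger syntax, so this is only a hypothetical hook: the function names, the `alerts` list, and the update itself (database B changing Kim's salary from 1500 to 1800) are invented for illustration. The salary threshold is read as "difference at most 100, or relative difference at most 0.1".

```python
def salary_threshold(s1, s2):
    # Assumes positive salaries, so max(s1, s2) is never zero.
    return abs(s1 - s2) <= 100 or abs(s1 - s2) / max(s1, s2) <= 0.1


def salary_conflict(ovalue):
    """Return (rvalue, conflictType) for one or two (value, database_id) pairs."""
    values = [v for v, _db in ovalue]
    if len(set(values)) == 1:
        return values[0], None
    s1, s2 = values
    if salary_threshold(s1, s2):
        return max(s1, s2), "Resolvable"
    return None, "Intolerable"


def apply_local_update(emp, db, new_salary, alerts):
    """Replace db's salary value, re-derive the elements, and record intolerable conflicts."""
    sal = emp["salary"]
    sal["ovalue"] = [(new_salary if d == db else v, d) for v, d in sal["ovalue"]]
    sal["rvalue"], sal["conflictType"] = salary_conflict(sal["ovalue"])
    if sal["conflictType"] == "Intolerable":
        alerts.append((emp["oid"], "salary", list(sal["ovalue"])))


kim = {"oid": "e2",
       "salary": {"ovalue": [(1500, "A"), (1500, "B")], "rvalue": 1500, "conflictType": None}}
alerts = []
apply_local_update(kim, "B", 1800, alerts)   # hypothetical update in database B
print(kim["salary"]["conflictType"], alerts)
```

After the hypothetical update the two salary values differ by 300, which fails both parts of the threshold, so the object is flagged for the database integrator.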

Aggregate Queries

Like other query languages, OOQLRA supports aggregate queries that are often used in decision making.


Example (Q6): To calculate the number of employees and the average salary for those employees whose salary values contributed by database A exceed 2000, Q6 is required.

SELECT COUNT(*), AVG(E.salary.ovalue)
FROM EMP E
WHERE E.salary.ovalue(A) > 2000;

As shown in Table 5.8, Q6 first identifies the objects satisfying the predicate E.salary.ovalue(A) > 2000. The qualified objects are summarized by the COUNT and AVG functions. Note that COUNT(*) is performed by counting the number of unique global object ids, while AVG is evaluated by dividing the separate sums of salary values contributed by databases A and B by the number returned by COUNT(*).

Table 5.8: Query Result of Q6

| COUNT(*) | AVG(E.salary.ovalue) |
|---|---|
| 3 | (2633,A) (3000,B) |

5.1 Discussions

From the above query examples, we show that OOQLRA allows users to query either conflicting or resolved information in an integrated database. It also supports direct queries on a selected local database. This flexibility is not available in traditional object-oriented data models. The source information assigned to the attribute values not only helps to distinguish the origin of the attribute values, but also provides the essential meta-information required for further conflict resolution. For example, if the salary information from database A is more reliable than that from database B, then by examining the source information in the attribute values, one can derive the appropriate salary information for the global objects. If necessary, special mapping functions that relate the source information to other meta-information of the existing databases can be developed. This meta-information further captures additional semantics about the existing databases that may be useful during database integration or when the attribute values are interpreted.
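As an illustration of how the source information in ovalue could drive conflict resolution, the following Python sketch picks the value from the most reliable database that contributed one. The helper `prefer_source` and the reliability ranking are hypothetical. The resolution functions defined earlier in the paper take only attribute values as arguments, so this is an extension sketched here under that assumption rather than a feature of the model as described.

```python
def prefer_source(ovalue, ranking):
    """Return the value contributed by the highest-ranked database present in ovalue.

    ovalue is a list of (value, database_id) pairs; ranking lists database ids
    from most to least reliable. None (NULL) is returned if no ranked database
    contributed a value.
    """
    by_db = {db: value for value, db in ovalue}
    for db in ranking:
        if db in by_db:
            return by_db[db]
    return None


# Database A is assumed to be more reliable than database B for salaries.
print(prefer_source([(2500, "A"), (2600, "B")], ["A", "B"]))
print(prefer_source([(3500, "B")], ["A", "B"]))
```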

6 Conclusions and Future Research

This research introduces a fresh approach to examining the database integration process. To support the query and integration activities in all phases of database integration, we believe that some instance-level conflicts have to be accommodated by the integrated databases. Furthermore, not all instance conflicts can be resolved during database integration. This paper examines the impact of instance conflicts on the global data model. The concepts of threshold predicate and resolution function have been adopted to handle both attribute and relationship conflicts. An extended object-oriented data model called OORA has been proposed to accommodate attribute and relationship conflicts. Its query language OOQLRA and some query examples were given.

This research can be seen as an initial effort to systematically devise different solutions to resolve as well as to accommodate instance heterogeneity in integrated databases. This is in contrast to past database integration research, which often emphasized conflict resolution only. The following are some future research directions:

- OOQLRA query algebra: The query algebra of OOQLRA will be developed so that the formal theory of the OOQLRA language can be defined. The query algebra will also be useful for designing query evaluation strategies for OOQLRA queries.
- Design and implementation of a database engine based on the OORA model: The ultimate goal of our research is to provide a thorough solution to the construction and maintenance of multidatabase or data warehousing systems. As part of our effort, a database engine based on the OORA model will be developed to support both the integration and query requirements of multidatabase and data warehouse users.
- Comprehensive classification of schema and instance conflicts: Based on the classification of instance conflicts given in this paper and further research on schema conflicts, a comprehensive classification of schema and instance conflicts can be derived. The classification will provide a better understanding of inter-database conflicts and their solutions.
- Multidatabase views: In [SL90], a 5-level schema architecture similar to the CODASYL schema architecture has been proposed for multidatabase systems. With the different query and integration requirements imposed by global applications, we believe that multidatabase users should be given the flexibility to decide how conflicts are viewed and resolved. In this case, a flexible multidatabase view definition mechanism based on the OORA model can be extremely useful. For example, users can choose different threshold predicates and resolution functions for different multidatabase views defined over the same set of local databases.

References

[BLN86] C. Batini, M. Lenzerini, and S.B. Navathe. A Comparative Analysis of Methodologies for Database Schema Integration. ACM Computing Surveys, 18(4), December 1986.
[CS92] A. Chatterjee and A. Segev. Resolving data heterogeneity in scientific and statistical databases. In International Working Conference on Scientific and Statistical Database Management, Switzerland, June 1992.
[DeM89] L.G. DeMichiel. Resolving Database Incompatibility: An Approach to Performing Relational Operations over Mismatched Domains. IEEE Transactions on Knowledge and Data Engineering, 1(4), December 1989.
[GSSC96] M. Garcia-Solaco, F. Saltor, and M. Castellanos. Semantic Heterogeneity in Multidatabase Systems, chapter 5, pages 129-202. Prentice Hall, 1996.
[KKS92] M. Kifer, W. Kim, and Y. Sagiv. Querying Object-Oriented Databases. In International Conference on Management of Data, pages 59-88, San Diego, California, June 1992.
[KS91] W. Kim and J. Seo. Classifying Schematic and Data Heterogeneity in Multidatabase Systems. IEEE Computer, December 1991.
[LA94] W. Litwin and A. Abdellatif. Multidatabase Interoperability. IEEE Computer, Vol. 19, 1994.
[LC94] W-S. Li and C. Clifton. Semantic integration in heterogeneous databases using neural networks. In International Conference on Very Large Data Bases, pages 1-12, Chile, 1994.
[LC96] E.-P. Lim and R.H.L. Chiang. TS Relational Model: A Data Model for Multidatabases with Integrated Global Schemas and Non-integrated Instances. Information Management Research Centre, Nanyang Technological University, 1996. Working paper.
[LC97] E.-P. Lim and R.H.L. Chiang. Resolving instance-level relationship conflicts in database integration. In Workshop on Information Technologies and Systems (WITS'97), Atlanta, Georgia, December 1997.
[LNE86] J.A. Larson, S.B. Navathe, and R. Elmasri. A Theory of Attribute Equivalence in Databases with Application to Schema Integration. IEEE Transactions on Software Engineering, 15(4), April 1989.
[LS93] E.-P. Lim and J. Srivastava. Entity identification in database integration. In International Symposium on Next Generation Database Applications, Fukuoka, Japan, 1993.
[LSPR93] E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity Identification Problem in Database Integration. In Proceedings of IEEE Data Engineering Conference, Vienna, Austria, 1993.
[LSS94] E.-P. Lim, J. Srivastava, and S. Shekhar. Resolving Attribute Incompatibility in Database Integration: An Evidential Reasoning Approach. In IEEE International Conference on Data Engineering, Houston, TX, February 1994.
[SC94] P. Scheuermann and E.I. Chong. Role-based Query Processing in Multidatabase Systems. In International Conference on Extending Database Technology, Cambridge, U.K., March 1994.
[SL90] A.P. Sheth and J.A. Larson. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3), September 1990.
[SPD92] S. Spaccapietra, C. Parent, and Y. Dupont. Model Independent Assertions for Integration of Heterogeneous Schemas. Very Large Database Journal, 1(1), 1992.
[TCY93] F.S.C. Tseng, A.L.P. Chen, and W.-P. Yang. Answering Heterogeneous Database Queries with Degrees of Uncertainty. Distributed and Parallel Databases, 1:281-302, 1993.
[WM89] Y.R. Wang and S.E. Madnick. The Inter-database Instance Identification Problem in Integrating Autonomous Systems. In IEEE International Conference on Data Engineering, 1989.
[WM90] R. Wang and S. Madnick. A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. In International Conference on Very Large Data Bases, pages 519-538, Brisbane, Australia, 1990.
[VA96] M.W.W. Vermeer and P.M.G. Apers. On the applicability of schema integration techniques to database interoperation. In Entity-Relationship Conference, Cottbus, Germany, 1996.
