PROVIDING OBJECT-ORIENTED ACCESS TO EXISTING RELATIONAL DATABASES

By Chandrashekar Ramanathan

A Dissertation Submitted to the Faculty of Mississippi State University in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Computer Science in the Department of Computer Science Mississippi State, Mississippi May 1997

PROVIDING OBJECT-ORIENTED ACCESS TO EXISTING RELATIONAL DATABASES

By Chandrashekar Ramanathan

Approved:

Julia E. Hodges
Professor of Computer Science
Director of Dissertation

Susan M. Bridges Associate Professor and Graduate Coordinator in the Department of Computer Science

Nancy E. Miller Associate Professor of Computer Science

Thomas Philip Professor of Computer Science

Michael Pearson Associate Professor of Mathematics and Statistics

Clayborne D. Taylor Associate Dean of the College of Engineering

Richard D. Koshel Dean of the Graduate School

Name: Chandrashekar Ramanathan
Date of Degree: May 10, 1997
Institution: Mississippi State University
Major Field: Computer Science
Major Professor: Dr. Julia E. Hodges
Title of Study: PROVIDING OBJECT-ORIENTED ACCESS TO EXISTING RELATIONAL DATABASES
Pages in Study: 118
Candidate for Degree of Doctor of Philosophy

The ability to use relational database technology and object-oriented technology together in an environment that preserves past investments in relational databases and applications is very important. Many information systems use relational database systems for efficient sharing, storage, and retrieval of large quantities of data. On the other hand, object-oriented programming has been gaining wide acceptance in the programming community as a paradigm for developing complex applications that are easy to extend and maintain. The main power of object-oriented programming stems from its semantically rich modelling constructs. By definition, there are many incompatibilities between the relational and object-oriented data models. Software developers need special techniques to convert the data residing in a relational database to a format that object-oriented applications can access and manipulate. This dissertation presents algorithms for extracting the structural portion of the object-oriented schema that corresponds to a given existing relational schema. Algorithms for extracting object-oriented constructs such as object classes and relationships between object classes (viz., association, aggregation, and inheritance) from the definition of an existing relational database are specified. The algorithms take into consideration many modern relational database design alternatives, such as the use of binary data types to store multi-valued or complex-valued attributes in a relational column, and denormalization of the schema to avoid join operations. It is assumed that the relational schema includes the definition of the primary key and foreign keys for the relations. The relations need to be at least in second normal form (2NF). User interaction is required for providing the functional dependencies for all 2NF relations and for specifying the composition of complex objects that the binary objects may represent. The quality of the extracted object-oriented schema is evaluated using the concepts of consistency and completeness of the object-oriented schema with respect to the source relational schema. A significant result of this research is the development of an integrated environment for mapping a relational schema to an object-oriented schema without the need to modify the existing relational schema.

ACKNOWLEDGMENTS

A task of this magnitude is never accomplished without the help and support of many people. Although it may sound like a cliché, a lot of my gratitude goes to my family for being there for me as I set off from halfway around the world in pursuit of academic goals. Dr. Julia Hodges has been my guru ever since I started my graduate studies here. I am indebted to her for letting me work in a research area that is dear to me and for keeping my work on course as my major advisor right from the formative period through completion. I am thankful to my committee member Dr. Susan Bridges for adding breadth to my knowledge with her insightful input and all her encouraging words from across the hall. While the presence of Dr. Thomas Philip on my committee added credibility to the object-oriented nature of my work, Dr. Nancy Miller's expertise in the area of databases was an invaluable asset in the evaluation of my work. I would like to thank both of them for serving on my committee. I am grateful to my minor professor Dr. Michael Pearson, who was never short of words of encouragement both during my involvement in the Mathematics minor and after. The Center for Air Sea Technology provided funding for a major portion of my graduate study, and I am grateful to them for this. The experience I gained in working on many projects for them is invaluable. I would like to take this opportunity to acknowledge the efforts of the Department of Computer Science and its administrative staff (technical and non-technical) in helping out the students to the best of their abilities.

Living in the same place for this long is bound to build many friendships and acquaintances. A special note of appreciation goes to all those who shared the Intelligent Systems lab with me over the years. My heartfelt thanks to everyone who made my stay at MSU a very memorable and rewarding experience.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

I. INTRODUCTION
   Thesis
   Research Objectives

II. LITERATURE REVIEW
   ER-based Schema Mapping
   Complete Systems
      The ARCUS Project
      Penguin
      COMan
   Other Related Work
   Summary

III. CONTEXT, ALTERNATIVES, AND OVERVIEW
   Definitions
   The Object-Relational Dilemma
      Data in RDB
      Data in ODB
      Data in RDB and ODB
      Object Wrapper
      Relational Wrapper
   Summary of Alternatives
   Need for User Interaction
   Overview of the Approach

IV. MAPPING COMPONENTS
   Schema Mapping
      Assumptions
      Notation
      Phase One: Adjusting the Relational Schema
      Phase Two: Generation of the Object Schema
         Identifying Object Classes
         Identifying Associations
         Identifying Inheritance
         Identifying Aggregation
      Computational Complexity
   Data Mapping
   Summary of Mapping

V. EVALUATION
   Completeness
      Accessibility of Attributes
      Accessibility of Data
      Summary of Completeness
   Consistency
   Effectiveness of the Approach

VI. RESULTS
   The "Personnel" Database
      Relational Schema #1
      Relational Schema #2
      Relational Schema #3
   Grid Data Relational Schema

VII. CONCLUSIONS
   Contributions of this Research
   Review of Research Objectives
   Limitations and Future Work

REFERENCES

APPENDIX
   A. EVALUATION EXAMPLES
   B. FORMAT OF SchemaBase
   C. SOURCE CODE GENERATION

LIST OF TABLES

3.1 Summary of Various Alternatives
4.1 Computational Complexity of Schema Mapping
6.1 "Personnel" Relational Schema #1
6.2 "Personnel" Relational Schema #2
6.3 "Personnel" Relational Schema #3
6.4 Grid Data Relational Schema
B.1 Description of Terminal Symbols

LIST OF FIGURES

1.1 Generic Mapping Steps
3.1 Alternatives for Storing Persistent Data
3.2 High-level System Architecture of SOAR
3.3 Layered Architecture of SOAR Applications
4.1 Two Phases of Static Schema Mapping
4.2 Elimination of 2NF Relations
4.3 Elimination of Widow Relations
4.4 Elimination of Orphan Relations
4.5 Elimination of BLOB Attributes
4.6 Identifying Classes
4.7 Identifying Associations
4.8 Identifying Inheritance
4.9 Example for Case 1 Inheritance
4.10 Example for Case 2 Inheritance
4.11 Example for Case 3 Inheritance
4.12 Identifying Aggregation
5.1 Proving Consistency
5.2 Extracting Implied Functional Dependencies
6.1 Object Schema for "Personnel" Relational Schema #1
6.2 Object Schema for "Personnel" Relational Schema #2
6.3 Object Schema for "Personnel" Relational Schema #3
6.4 Object Schema for Grid Data Relational Schema
C.1 Processing Steps for Data Mapping

CHAPTER I

INTRODUCTION

Many information systems use relational database systems for efficient sharing, storage, and retrieval of large quantities of data. Relational database management systems (RDBMSs) provide a variety of tools and services for data management. Due to the relational model's simple data modelling constructs and facilities for ad hoc querying using SQL, there are many tools that interface with RDBMSs to enable end-users to carry out reporting, querying, and other data analysis activities easily. On the other hand, object-oriented programming has been gaining wide acceptance in the programming community as a paradigm for developing complex applications that are easy to extend and maintain. The main power of object-oriented programming stems from its semantically rich modelling constructs. Concepts such as association, aggregation, and generalization/specialization allow software developers to map the problem domain to the solution domain very naturally. By definition, there are many incompatibilities between the relational and object-oriented data models (Loomis 1994; Simon 1994; Robie and Bartels 1994). Yet, there is much interest in using these two technologies together in an environment that preserves past investments in relational databases and applications while enabling the development of new object-oriented applications on the same data.

A data item is said to be transient if the data is accessible only within the same application that created it. A persistent data item is a data item that can be accessed even after the application that created it terminates. Persistent data can be handled either by using flat files or by using database management systems (DBMSs). DBMSs in particular allow easy storage, retrieval, and sharing of persistent data. Two of the most widely used kinds of DBMSs are relational DBMSs (RDBMSs) and object-oriented DBMSs (ODBMSs). They are based on the relational data model and the object-oriented data model, respectively.

What does it mean to have object-oriented access to relational data? Developers typically implement object-oriented applications using object-oriented programming languages such as C++ and Smalltalk. These applications use a schema made up of object classes and relationships between those object classes. Each object has a set of attributes. The value of an attribute could be another object itself, thus giving rise to complex objects. The main problem arises when the data corresponding to such objects are persistent in a relational database. The problem is due to the incompatibility between relations and objects.

At first, a solution for this problem may seem to be to use an ODBMS instead of an RDBMS. If an ODBMS is managing the persistent data, the objects do not lose their structure after the application stores them in the object database. However, ODBMS technology is fairly new, and a lot of investment has already been made in developing large relational databases and applications. Moving to an ODBMS might mean throwing away all of the old ("legacy") data and applications. Most users of databases will not accept such a solution. They wish to be able to run their existing applications on existing databases and have access to the same data from object-oriented programs, too. Therefore, we need special techniques to convert the data that is residing in a relational database to a format that is suitable for access and manipulation by object-oriented applications. One task that is common to the various alternatives for accessing relational databases from object-oriented programs is mapping the schema of an existing relational database to an object-oriented schema. [1] Schema mapping is required for the system component that does the data mapping from relational data to objects. If the data resides in a relational database, a schema mapping must exist in order to have object-oriented access to the data. This is true irrespective of the specific approach taken to provide that access (e.g., one-time porting, dynamic mapping, etc.). Database and application developers would like this mapping to be automatic. This research will establish that it is not possible to fully automate the schema mapping from the relational to the object-oriented model. In fact, in the few commercial systems that try to provide object-oriented access to existing relational databases, the developer must specify the entire schema mapping explicitly. On the other hand, the author believes that this mapping process can be at least partially automated, and that a complete mapping can be produced with the help of an interactive tool. In fact, the user intervention could be made optional if the relational database could provide information about primary keys and foreign keys.

Figure 1.1 contains an overview of the activities involved in the mapping procedure described above. In general, every mapping approach pre-processes the source schema to a form that is more suitable for that approach. Once the schema is reduced to a desirable intermediate form, the major task is to identify various object-oriented constructs, which include object classes, associations, aggregations, inheritance, and methods. The result of identifying these constructs is a mapping database that contains the information regarding the mapping between the source and target schemas. Finally, the dynamic data mapper provides run-time support for the application in instantiating objects using the data stored in the relational database.

[1] The next chapter contains a formal definition of schema mapping.
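To make the relational/object incompatibility concrete, the sketch below shows the same information held both ways. The Employee/Department names and the `materialize` helper are purely illustrative (they are not part of any system described in this dissertation); the point is that the object form embeds the related object directly, while the relational form links flat tuples through a foreign key that some layer must resolve.

```cpp
#include <string>
#include <vector>

// Relational form: flat tuples linked by a foreign key value.
struct DeptRow { int dept_id; std::string name; };
struct EmpRow  { int emp_id; std::string name; int dept_id; };  // dept_id is a FK

// Object-oriented form: the attribute value is itself an object,
// so the "join" is implicit in the object structure.
struct Department { int id; std::string name; };
struct Employee   { int id; std::string name; Department dept; };

// Turning a tuple into a complex object means resolving the FK by hand.
Employee materialize(const EmpRow& e, const std::vector<DeptRow>& depts) {
    for (const DeptRow& d : depts)
        if (d.dept_id == e.dept_id)
            return Employee{e.emp_id, e.name, Department{d.dept_id, d.name}};
    return Employee{e.emp_id, e.name, Department{-1, ""}};  // dangling FK
}
```

Automating exactly this kind of foreign-key resolution, for every class and relationship in a schema, is what the special techniques discussed in this chapter must provide.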

[Figure 1.1. Generic Mapping Steps — a flowchart: the source schema is pre-processed into a schema in some desirable form; classes, associations, aggregation, inheritance, and methods are then identified, producing the schema mapping database used by the dynamic data mapper.]
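As a minimal sketch of the pipeline in Figure 1.1, the fragment below shows two of the identification steps that populate a schema mapping database: one class per relation and one association per foreign key. The types here are illustrative only, not SOAR's actual data structures, and real identification is far more involved (as later chapters show).

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative input: a relation with its columns and foreign keys.
struct ForeignKey { std::string column;      // FK column in this relation
                    std::string references;  // relation it points to
};
struct Relation   { std::string name;
                    std::vector<std::string> columns;
                    std::vector<ForeignKey> fks; };

// Illustrative "schema mapping database": which class each relation maps to,
// and which class-to-class associations were identified.
struct MappingDb {
    std::map<std::string, std::string> classOf;
    std::vector<std::pair<std::string, std::string>> associations;
};

MappingDb mapSchema(const std::vector<Relation>& rels) {
    MappingDb db;
    for (const Relation& r : rels)           // "identify classes" (naive 1:1)
        db.classOf[r.name] = r.name;
    for (const Relation& r : rels)           // "identify associations" via FKs
        for (const ForeignKey& fk : r.fks)
            db.associations.push_back({db.classOf[r.name], db.classOf[fk.references]});
    return db;
}
```

A dynamic data mapper would then consult such a mapping database at run time to decide which tables to read when instantiating an object.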

With respect to the mapping steps outlined in Figure 1.1, the scope of this dissertation includes identification of the structural portion of the object-oriented schema. Hence, this research does not deal with the identification of methods. A simple implementation of dynamic data mapping is included as part of this research in order to demonstrate that the static schema mapping is effective in providing dynamic support in the overall process.

Thesis

The thesis of this research is that there is no need for the object-oriented application developer to specify either schema mapping or data mapping [2] for accessing relational databases from object-oriented programs. In order to validate this thesis, a user-assisted schema mapping approach was developed as part of this research. This mapping approach handles various types of relational database design optimizations, such as relations that are not in third normal form, usage of binary data types, and so on. The evaluation of the mapping itself was done in terms of completeness and consistency. A target schema is complete if every data item included in the source schema can be accessed using the target schema. A mapped object-oriented schema is consistent with the relational schema from which it was derived if all the functional dependencies of the relational schema are preserved in the object-oriented schema.

[2] Schema mapping is concerned with the mapping between relations and object classes, while data mapping is concerned with the mapping between relational tuples and actual objects.

Research Objectives

Though several researchers concur that mapping a schema from a relational model to any other conceptual model such as Entity-Relationship (ER), Extended Entity-Relationship (EER), Object-Oriented (OO), and others cannot be fully automated [3], no formal study had been reported on that aspect before now. As part of this research, a study was conducted to lay out the specific reasons why human interaction is required to carry out the mapping. This study aided in some cases in identifying the ways in which the schema mapping process could be improved to provide greater opportunity for automation. For example, user input may be required for identifying discriminant attributes in relations. However, this process could be improved by automatically identifying potential discriminant attributes. Although data mapping between the relational and the object-oriented model itself is not new, combining data mapping with an interactively generated schema mapping offers greater utility to the overall goal of providing object-oriented access to existing relational databases. The primary research objectives of this research are as follows:

- Study the automatability of the relational-to-OO schema mapping process.
- Define an interactive process for mapping an existing relational schema to an object-oriented schema.
- Show that the above schema mapping process is both complete and consistent.
- Develop a proof-of-concept system to validate the research thesis.

A System for Object-Oriented Access to a Relational database (SOAR) is a research prototype that implements an interactive static schema mapping process. It produces a machine-readable schema mapping database and generates source code for the class declarations and for automatic creation of object instances based on the mapping database.

[3] Please see the next chapter.
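The completeness criterion just defined lends itself to a mechanical check. The sketch below is illustrative only (not the dissertation's formal proof apparatus) and simplifies by assuming globally unique attribute names; a full check would qualify each attribute by its relation and class.

```cpp
#include <set>
#include <string>
#include <vector>

struct Relation { std::string name; std::vector<std::string> columns; };
struct ObjClass { std::string name; std::vector<std::string> attributes; };

// Complete: every column of the source relational schema is accessible
// as some attribute somewhere in the target object schema.
bool isComplete(const std::vector<Relation>& source,
                const std::vector<ObjClass>& target) {
    std::set<std::string> reachable;
    for (const ObjClass& c : target)
        for (const std::string& a : c.attributes)
            reachable.insert(a);
    for (const Relation& r : source)
        for (const std::string& col : r.columns)
            if (reachable.count(col) == 0)
                return false;  // a source data item would be inaccessible
    return true;
}
```

Consistency is the harder property to mechanize, since it concerns functional dependencies rather than mere attribute coverage; Chapter V addresses it.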

Although the schema mapping process is not fully automatable, it is possible to develop a heuristics-based mapping system that applies some rules of thumb to guess the corresponding object-oriented structures. The user can then intervene and accept, modify, or reject each suggestion. Very few commercial systems try to provide object-oriented access to existing relational data. In all such systems, the developer has to explicitly specify all the object-relational mappings.

Completeness and consistency are the two factors that were used to evaluate the mapped schema produced by the mapping procedure developed in this research. A rigorous procedure for testing both these properties of the mapping process was specified without the need for subjective human interpretation of the mapped schema. A generalized proof for completeness involved showing that every attribute of the relational schema is accessible from the object-oriented schema, too. Similarly, consistency was proven by showing that the semantics of the object-oriented constructs implicitly satisfy various types of functional dependencies.

Comparing the results produced by the machine with those produced by humans is a frequently used technique in many areas for the evaluation of machine-generated output. Two factors must be considered when using humans for evaluation. First, the humans should be able to generate the results without a lot of effort. Second, the comparison of the results should be quantifiable. It may be possible to quantify the differences between the schemas produced by the system and by the human based on the number of similar object classes found, the number of relationships found, and so on. Of these two aspects, however, the first is the larger obstacle for evaluating the mapped object-oriented schema. To generate reasonable conclusions, more than a trivial number of schemas have to be mapped by both the humans and the machine. Finding the databases and the human resources to do this is difficult.

Due to these reasons, the notions of consistency and completeness provide a better way of evaluating the mapped schema.

Most of the database reverse engineering research has concentrated only on the schema mapping from the relational model to some other conceptual model such as the ER model, the EER model, and so on. While the focus of reverse engineering is only the extraction of a higher level model (e.g., the OO model) from a lower level model (e.g., the relational model), re-engineering includes reverse engineering as well as forward engineering the systems based on the higher level model (Jacobson and Lindstrom 1991). The SOAR research aims not just at reverse engineering but at the entire re-engineering process.

The major goal of this research was to develop an integrated environment for allowing object-oriented programs to access existing relational data. At one end of the spectrum is the schema mapping operation, while at the other end there is an actual object-oriented program, say in C++, that successfully accesses the relational data. In order to achieve the objective of an integrated environment, the schema mapping process should generate a machine-readable mapping database and also C++ header files that contain the class declarations corresponding to the re-engineered relational schema. Application programs can then use the header files and the run-time mapping system to access the relational database. The run-time mapping system is a class library that interfaces with the relational database using the mapping database.

The rest of this document is organized as follows. A literature review covering the area of reverse engineering of relational schemas to other conceptual schemas is given in Chapter II. Chapter III establishes the context for this dissertation research. It includes the definitions of frequently used terms, a discussion of the various alternatives that were considered to address the research problem, and an overview of the entire approach. Chapter IV describes in detail the schema mapping and data mapping components of SOAR. The chapter includes all the main algorithms required to carry out the schema mapping process. A description of a data mapping procedure is also presented in the chapter. The evaluation of the mapping algorithms based on consistency and completeness is given in Chapter V. In order to illustrate the capabilities of the mapping procedures described in this dissertation, the results of mapping four relational schemas are given in Chapter VI. Finally, the conclusions and potential for future work are discussed in Chapter VII. The appendices include the file format of the mapping database (the "meta-database") and a sampling of the source files generated by SOAR.
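The interactive, heuristics-based style of mapping described in this chapter — the tool guesses a construct, and the user accepts, modifies, or rejects the suggestion — can be sketched as below. The types and callbacks are illustrative only; an actual tool such as SOAR would prompt the user interactively rather than take callbacks.

```cpp
#include <functional>
#include <string>
#include <vector>

enum class Verdict { Accept, Modify, Reject };

// A heuristic guess, e.g. "inheritance: Employee -> Person".
struct Suggestion { std::string construct; };

// Walk the tool's guesses past the user; keep what survives review.
std::vector<std::string> reviewSuggestions(
        const std::vector<Suggestion>& guesses,
        const std::function<Verdict(const Suggestion&)>& askUser,
        const std::function<std::string(const Suggestion&)>& edit) {
    std::vector<std::string> accepted;
    for (const Suggestion& g : guesses) {
        switch (askUser(g)) {
            case Verdict::Accept: accepted.push_back(g.construct); break;
            case Verdict::Modify: accepted.push_back(edit(g));     break;
            case Verdict::Reject: break;  // suggestion dropped
        }
    }
    return accepted;
}
```

The value of the heuristics lies in how often the user can simply accept: the more accurate the rules of thumb, the closer the process comes to full automation.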

CHAPTER II LITERATURE REVIEW This chapter presents a comprehensive literature review in areas related to research in providing object-oriented access to existing relational databases. The chapter is divided into three sections. The rst section looks at work in the area of schema mapping from the relational model to another semantic model such as the entity-relationship (ER) model, the extended ER (EER) model, or the objectoriented model. The next section looks at some research projects whose objective was the development of an integrated approach. The nal section contains references to work related to this research that does not fall into either of the other two categories. ER-based Schema Mapping Schema mapping forms a part of a wider scheme of activities of database reverse engineering. The idea behind database reverse engineering is to extract the conceptual design of a database starting from the existing database structure (Fahrner and Vossen 1995a). The conceptual design representation can be any of the semantic data models like ER, EER, object-oriented, and so on. On the other hand, the existing database structure can be represented in the relational model, hierarchical model, and so on. Hainaut et al. (1993) propose a two-step process for database reverse engineering. The rst step is \data structure extraction." During this step, the existing database structure is analyzed (e.g., schema, database contents, queries) to produce a complete description of the existing source schema. During the second step, called \data 10

11 structure conceptualization," the actual extraction of the conceptual model takes place. The whole idea is to carry out the inverse of the forward engineering database design process. Chiang, Barron, and Storey (1994) provide a very comprehensive methodology for extracting an EER model from a relational database. Starting from relations that are in third normal form, the methodology rst classi es the relations into strong entity relations, weak entity relations, and so on. Then they use a heuristicsbased approach to nd inclusion dependencies. Finally, using all the information gathered in the earlier phases, they specify rules for identifying EER structures. They acknowledge the need to use human interaction to successfully complete the mapping process. Johannesson and Kalman (1989) provide an exhaustive analysis and interpretation of inclusion dependencies in relational schemas for identifying conceptual schemas. Their approach also assumes that the relations are in third normal form and that the user provides all the inclusion dependencies. Andersson (1994) proposes a technique based on the analysis of the data manipulation statements. He rst creates a connection diagram that represents potential relationships between the relations based on the data manipulation statements. The connection diagram is then translated into an Entity-Relationship-Category(ERC+) model (Parent and Spaccapietra 1992). This method relies on the join conditions that generally appear in the application programs for establishing connections between relations. An approach based on meta-models has been proposed by Jeusfeld and Johnen (1994). In their approach, the mapping takes place at the data model level as opposed to the schema level. The source and the target data models are mapped to the meta-

model using a deductive query language. Then, for all the concepts found in the source schema, the "similar" concepts are found in the target data model by navigating the meta-model hierarchy. Once such concepts are found, the components in the target data model are instantiated.

Some work has been reported for mapping ER and EER structures into an object-oriented model (Herzig and Gogolla 1992; Narasimhan, Navathe, and Jayaraman 1993; Fong 1995). This mapping process is straightforward by definition since all three are conceptual models. The object-oriented model actually subsumes the ER and EER models.

When compared to all these schema mapping approaches, Premerlani and Blaha (1994) try to impose the least number of restrictions to address the problems caused in trying to reverse engineer relational databases. They propose a toolkit-based approach that takes into consideration various types of database optimizations and non-standard designs. On the other hand, since the method tries to cope with as many situations as possible, it provides less opportunity for automation.

Complete Systems

This section presents a description of a few research projects whose goal is to provide a complete environment for accessing relational databases from object-oriented programs. The environment includes not only schema mapping from the relational to the object-oriented model but also a run-time system that maps the actual data from tuples to objects.

The ARCUS Project

The ARCUS project is a venture of Technical University of Munich and Software Design and Management, Munich. The ARCUS project carries out research in "architecture for business information systems with a focus on object-oriented architecture of client/server applications" (ARCUS 1996). Providing higher level access to relational databases is on-going work in the ARCUS project. Results have been reported in developing a relational data access layer that hides an optimized relational schema design from the logical relational schema (Keller and Coldewey 1996). The data mapping component of SOAR can particularly gain from this work.

Penguin

The Penguin system (Keller and Hamon 1993; Keller 1994) is designed to allow the coexistence of the relational and object technology. The system provides an object-oriented interface for programming while using a relational database for handling data persistence. The overall goal of the project is to support multiple object schemas on the same underlying data using object-views of the underlying relational database. The system uses a semantic model called the Structural Model as an intermediate model for allowing integration of multiple heterogeneous views on the underlying data. The Structural Model is a network of nodes where each node represents a relation and each edge represents a link (relationship). Given a Structural Model as input, the system then generates object views from them. For the object-oriented interface, the system generates C++ classes (Keller and Hamon 1993). Penguin uses two layers of mapping for the code generation. The first layer corresponds to the integrated structural model and the second layer corresponds to the application-specific view-

object level. This allows different object applications to have different object views based on the same Structural Model.

Although the Structural Model provides a middle ground for mapping from existing relational schemas to object schemas, the current work on Penguin has concentrated on using relational databases to add persistence to existing object-oriented applications.

COMan

COMan (Complex Object Manager) (Kappel et al. 1994) is a system that allows existing relational databases to be re-engineered to object-oriented databases. The system can be used to add persistence to existing object-oriented applications in relational databases. Following is a brief description of COMan as described in (Kappel et al. 1994).

There are primarily two components in COMan. The first one is concerned with the mapping between the schemas and the second one is concerned with data mapping at run-time. The user carries out the mapping phase with the help of a tool. The mapping can be from the object-oriented schema to a relational schema, or vice versa. The result of the mapping process is the creation of a meta-database of mapping knowledge. That is, the mapping knowledge itself is stored as part of the relational database. The run-time phase makes use of the database interface to the C++ applications. It carries out all the mapping operations based on the definitions stored in the meta-database.

The schema mapping process is primarily carried out by the user. The interface only checks to see if the mapping is consistent with the constraints. For example, if the user specifies an inheritance relationship between two relations, the system checks

to make sure that the names and domains of the primary keys of the relations are identical.

Other Related Work

Many of the research issues addressed in the Penguin project (discussed above) have been incorporated into a commercial product called Persistence (Keller, Jensen, and Agarwal 1993). Persistence aims to provide transparent object-oriented access to data that is actually stored in the relational database. The system generates the relational database schema automatically based on the object class definitions. It also generates the database interface code (i.e., code that reads objects, writes objects, etc.) automatically. The product as implemented currently cannot be used for accessing an existing relational database. That is, it can be used only for forward engineering. Nevertheless, old applications on the RDBMS can continue to run independently of the new database.

ODMG-93 (Cattell 1994) is an emerging standard for object databases. Among other things, the standard specifies a common object model for all object databases irrespective of the particular ODBMS. Fahrner and Vossen (1995b) have proposed a transformational approach for migrating from a relational database to an object database. They use ODMG-93 as the target model so that it may be implemented on any ODBMS that conforms to the standard. They specify a three-step process. First, the relational schema is "completed." That is, the schema is augmented with semantic information such as functional dependencies and inclusion dependencies, reduced to third normal form, and so on. In the second step, a straightforward mapping of the completed relational schema to an ODMG-93 schema is carried out. In the third step, the resultant object schema is optimized using general OO principles (e.g.,

elimination of artificial keys). Just as for any of the reverse engineering mappings, this approach, too, requires user interaction.

There are at least three commercial systems that serve as middleware products between an object-oriented programming language and the relational database. The ONTOS Object Integration Server (OIS™) allows the developer to select from various mapping possibilities and carry out the mapping process interactively (Ontos Inc. 1996). Persistence™ is an "object-to-relational mapping system which automates the mapping of object applications to relational databases" (Persistence Software Inc. 1996). The system also provides the ability to read from existing relational data dictionaries to generate the object schema. More details on the reverse mapping are not publicly available at the time of this writing. A third product called CGEN™ does a straight table-to-class mapping without providing any interpretation of foreign key information (Subtle Software Inc. 1996) and hence cannot be considered relational-to-OO mapping.

Ananthanarayanan et al. (1993) describe a method to extend RDBMS functionality for storing C++ objects. Their technique does not require run-time mapping operations between objects and tuples while accessing the database. Their goal is to gain the benefits of the mature RDBMS technology without changing the data model itself with object concepts.

Finally, machine learning techniques have also been applied for extracting object classes from a relational database (McKearney, Bell, and Hickey 1992).

Summary

This chapter presented a survey of the research in the area of the re-engineering of relational databases to semantic models. Very few systems have attempted to

address the whole re-engineering process, which as explained in the previous chapter includes static schema mapping and dynamic data mapping. A majority of the work has been done in the area of mapping the relational schema to a conceptual model such as ER, EER, and so on. Two major systems whose goals are similar to those of the SOAR research were presented. In conclusion, none of the systems described attempts to automate the entire schema mapping process.

CHAPTER III
CONTEXT, ALTERNATIVES, AND OVERVIEW

The earlier two chapters presented the need for sophisticated schema mapping approaches that take into consideration relational schemas that may not be in third normal form (3NF), optimized relational schemas, and so on. This chapter establishes a context for the rest of the dissertation. The next section contains definitions of some of the terms that are used frequently in this document. Following that is a detailed look at the various alternatives for tackling what we call the object-relational dilemma. Finally, the various components of SOAR, the implementation prototype, are laid out.

Definitions

A data model is a collection of concepts that can be used to describe data without enumerating the actual data (Elmasri and Navathe 1992). Examples of database models are the hierarchical model, the network model, the relational model, the entity-relationship model, and the object-oriented model (in approximate chronological order of development).

A schema (denoted S^M) is a structural description of the data specified using a particular data model M. A schema represents tautological properties of the data, that is, properties that are always true (Date 1986, 361). A database schema is often referred to as the "intension" of the database, as opposed to the "extension" (i.e., the actual data itself) (Elmasri and Navathe 1992, 26). Different data models give rise to different schema specifications for the same data. For example, a relational schema

specification will include the definitions of the various relations, attributes, keys, and so on. On the other hand, an object-oriented schema is defined in terms of object classes, attributes, associations, aggregations, generalizations/specializations, and so on.

A schema mapping of a source schema S^M1 defined on a data model M1 to a target schema S^M2 defined on a data model M2 may be defined as a transformation T such that T(S^M1) = S^M2. The transformations are usually a collection of rules that map concepts from the source data model to the target data model. One schema mapping that relational database designers often use is the mapping from an entity-relationship (ER) schema to a relational schema (Elmasri and Navathe 1992). Another example of a schema mapping technique that is gaining a lot of interest is from the object-oriented model to the relational model (Blaha, Premerlani, and Rumbaugh 1988; Premerlani et al. 1990).

Data mapping of a data item data_1 defined on a schema S^M1 to another data item data_2 defined on a schema S^M2 is a function DM such that DM(data_1, T) = data_2, where T is the schema mapping as defined above. Note that data mapping is a function of schema mapping.

A target schema S^M2 is complete if every data item included in the source schema S^M1 can be accessed using the target schema. A mapped object-oriented schema S^o-o is consistent with the relational schema S^rel from which it was derived if all the functional dependencies of S^rel are preserved in S^o-o.
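The transformation T and the completeness property can be made concrete with a small sketch. The schema types and the one-class-per-relation rule below are purely illustrative assumptions, not part of SOAR; the point is only that a T mapping each relation to a class trivially keeps every data item accessible in the target schema.

```python
from dataclasses import dataclass

@dataclass
class RelationalSchema:          # a schema S^M1 on the relational model M1
    relations: dict              # relation name -> list of attribute names

@dataclass
class ObjectSchema:              # a schema S^M2 on the object-oriented model M2
    classes: dict                # class name -> list of attribute names

def T(s_rel: RelationalSchema) -> ObjectSchema:
    """A trivial transformation T with T(S^M1) = S^M2: one class per relation.
    Every attribute of every relation appears in some class, so the target
    schema is complete in the sense defined above."""
    return ObjectSchema(classes={name: list(attrs)
                                 for name, attrs in s_rel.relations.items()})

s_rel = RelationalSchema(relations={"EMPLOYEE": ["name", "age", "dept_id"]})
print(T(s_rel).classes)          # {'EMPLOYEE': ['name', 'age', 'dept_id']}
```

A realistic T would instead apply the extraction rules of Chapter IV (associations, inheritance, aggregation); this sketch captures only the shape of the transformation.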

The Object-Relational Dilemma

This section elaborates on some of the alternatives available to a database developer for storing persistent data and the impact of the choice on being able to

use various relational and object-oriented applications on the data. The primary motivation for looking at these alternatives is to overcome the object-relational dilemma. This dilemma is caused by the strong software engineering motivation for using the object-oriented approach for developing new software combined with the existence of legacy data stored in a relational database. The need for object technology to co-exist with relational technology is known as the "coexistence requirement" (Loomis 1995, 36) in the literature.

Figure 3.1. Alternatives for Storing Persistent Data
(The figure shows a Persistent Data node with five alternatives: Data in RDB, Data in ODB, Data in RDB & ODB, Object Wrapper, and Relational Wrapper.)

Figure 3.1 graphically illustrates the various options available for the database developer for addressing the object-relational dilemma. An arrow from A to B indicates that B "inherits" all the advantages of A.

Data in RDB

One of the traditional choices for managing persistent data is to use an RDBMS. Relational technology has been around for about 20 years now. It has gained wide acceptance with database developers mainly due to the simple model with a strong mathematical foundation and a non-procedural query language, SQL. Due to this

wide use, many tools for RDBMSs have been developed that aid users in easy sharing and analysis of stored data.

Relational databases have become so ubiquitous that they are being used even in applications that are ill-suited for the technology. For example, the OO1 benchmark showed that relational databases perform poorly when the structure of the data is very complex (e.g., a lot of relationships) (Dick and Chaudhri 1995). One of the causes of this problem is the flat nature of the relational model: everything must be flattened to a two-dimensional structure of rows and columns (Robie and Bartels 1994). As a result, the data has to be decomposed for storing (normalization) and reassembled for retrieving ('join'), resulting in additional processing overhead.

Even though object-oriented technology offers better solutions to such problems, most of the developments in the area have taken place over the last few years. So, there are already many applications that use relational databases for storing complex data. If new object-oriented applications want to access the data, they have to carry out all the decomposing (and assembling) of objects to (from) relational structures. The applications will not have transparent access to the data. The advantage of such an approach, however, is that the investments in legacy relational data, applications, and tools are preserved.

Data in ODB

A modern solution to the problems caused by the difficulty of storing complex objects in an RDBMS is to use an ODBMS (Cattell 1991). ODBMSs are becoming the data management option of choice for many applications. Object-oriented databases provide support for storing objects directly without the need to carry out any translation between the stored version and the in-memory version of the same

data. Object-oriented databases also provide support for sophisticated notions such as object identity, version management, schema evolution, and so on (Bertino and Martino 1991; Loomis 1995; Cattell 1994).

Choosing an ODBMS for new object-oriented applications and discarding relational technology may mean that all of the legacy data, applications, and tools are no longer usable. Even though "starting fresh" may be a choice, this choice is attractive only if there are no legacy data and applications to continue to support. One minor enhancement that may mitigate at least the problem of losing the existing data is to do a one-time port of the data from the relational database to the new object-oriented database (Hodges, Ramanathan, and Bridges 1995). This way, object-oriented applications have transparent access even to existing data.

Data in RDB and ODB

Sometimes, the cost of investment in relational technology is very high and the difficulty of accessing the data from object-oriented programs is overwhelming. In such a situation, both the relational database and the object-oriented database can be used. One way of using both an ODBMS and an RDBMS is by having the same data in both. Old applications can continue to run against the relational database while new object-oriented applications can run against the object-oriented database. This way, relational data, applications, and tools are preserved while object applications can have native access to the same data. One obvious trade-off with this approach is that storage requirements are doubled. Second, the two databases have to be kept consistent (i.e., by maintaining two databases). Finally, all database mutating operations (write, delete, and update) have to be duplicated, too. The last problem can be mitigated with periodic batch processing.
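The duplication of mutating operations described above can be pictured as a thin dispatch layer that applies every write and delete to both stores. The DualStore class and the dict-backed "databases" here are hypothetical stand-ins for illustration only, not part of any system described in this chapter.

```python
class DualStore:
    """Apply every mutating operation to both databases so the relational
    and object-oriented copies of the data stay consistent."""
    def __init__(self, rdb, odb):
        self.rdb, self.odb = rdb, odb   # any mapping-like stores

    def write(self, key, value):        # covers both insert and update
        self.rdb[key] = value
        self.odb[key] = value

    def delete(self, key):
        del self.rdb[key]
        del self.odb[key]

rdb, odb = {}, {}
store = DualStore(rdb, odb)
store.write("emp:1", {"name": "Smith", "dept_id": 7})
store.write("emp:1", {"name": "Smith", "dept_id": 8})   # an update
assert rdb == odb   # both copies saw the same sequence of mutations
```

Batch reconciliation, as the text notes, would replace the synchronous double write with a periodic copy from one store to the other.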

On the other hand, the database requirements of object-oriented applications may be very different from those that can be satisfied by an RDBMS. In such a situation, it is better to use an ODBMS instead of the existing RDBMS, and implement an independent database that satisfies the requirements of new applications. It is again obvious that the two systems that use the RDBMS and ODBMS, respectively, cannot share data. The existence of such diverse requirements for a single application environment is very unlikely.

Object Wrapper

An emerging trend in satisfying the coexistence requirement of relational and object-oriented technology is the development of object-oriented wrappers around existing relational databases (Kappel et al. 1994; Keller and Hamon 1993; Keller 1994). Instead of requiring the object applications to do the conversion between relational tuples and objects each time, the wrappers encapsulate the relational database operations. The applications communicate only with the wrappers without concern for data mapping operations. A big advantage of the object wrapper approach is that the object-oriented applications can have transparent object access to the data without using an ODBMS (Ramanathan 1994). However, the run-time overhead of mapping the data is still a factor. Object wrappers can exist just like any other application on top of the relational database. All the existing data will be accessible to all the applications. Relational tools continue to be used on the data as well. It should be noted that object wrappers are distinct from a category of ODBMS called ORDBMS (Object-Relational Database Management Systems). ORDBMSs extend the relational model itself with object-oriented extensions (Kim 1995; Stonebraker 1995).
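The wrapper idea can be illustrated in a few lines. The Employee class, the EMPLOYEE table, and the use of Python with sqlite3 below are all illustrative assumptions; the dissertation's own wrapper is the SOAR architecture described later. The application calls the wrapper and receives objects; the tuple-to-object conversion happens inside.

```python
import sqlite3

class Employee:
    def __init__(self, name, age, dept_id):
        self.name, self.age, self.dept_id = name, age, dept_id

class EmployeeWrapper:
    """The wrapper encapsulates the relational operations; applications
    receive objects and never see tuples or SQL."""
    def __init__(self, conn):
        self.conn = conn

    def find_by_dept(self, dept_id):
        rows = self.conn.execute(
            "SELECT name, age, dept_id FROM EMPLOYEE WHERE dept_id = ?",
            (dept_id,))
        return [Employee(*row) for row in rows]   # tuple -> object mapping

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (name TEXT, age INTEGER, dept_id INTEGER)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('Smith', 40, 7)")
print(EmployeeWrapper(conn).find_by_dept(7)[0].name)   # Smith
```

The run-time mapping cost the text mentions is visible here: every query pays for constructing objects from the retrieved tuples.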

Relational Wrapper

Sometimes there is a need for relational access to object-oriented data, too. That is, the object database should provide an interface through which SQL queries can be executed. This scenario occurs, for example, when end-users want to use their relational tools on the object-oriented database. Issues involved in developing relational wrappers are not relevant to this research.

Summary of Alternatives

The following feature matrix highlights important issues that were discussed in the previous sections and whether or not a particular approach addresses each issue. The contents of the matrix can be easily translated into advantages and disadvantages from the perspective of providing object-oriented access to an existing relational database. Looking at the summary matrix given in Table 3.1, it is evident that the coexistence requirement as defined for this research can be best met by using both a relational database and an object-oriented database. However, in the likely case that maintaining two databases is not acceptable to the developers, maximum benefits can be obtained by using an object wrapper around the relational database. The rest of this dissertation focuses on addressing the issues involved in developing an object wrapper. This research defines a unique object wrapper architecture where the benefits of having data in both RDB and ODB can be gained without the overhead of maintaining two databases.


Table 3.1. Summary of Various Alternatives

Feature | Data in RDB | Data in ODB | Data in RDB & ODB | Object Wrapper
Can continue to use existing relational data and applications? | yes | no | yes | yes
Can continue to use relational tools? (e.g., report writers) | yes | no | yes | yes
Need any data mapping code? | yes | no | no | yes
Transparent OO access? (i.e., automatic mapping from the application's standpoint?) | no | yes | yes | yes
Additional run-time overhead to access objects? | yes | no | no | yes
Only one database to maintain? | yes | yes | no | yes
Increased storage requirements for data? | no | no | yes | no

Need for User Interaction

As observed in the literature survey, every approach involving the extraction of conceptual data models from the relational model involves user interaction at some point in the process. This section shows the need for user interaction during the mapping process.

There are several mapping strategies from conceptual models such as ER, EER, and OO to the relational model (Fahrner and Vossen 1995a). Relational database design for any real-world problem results in "semantic degradation" (Chiang, Barron, and Storey 1994, 109). That is, the designers cannot fully express the real semantic intent of the database using the relational model.

A simple example illustrates a semantic degradation caused by the inability to represent relationships between data items explicitly in the relational model. Consider the modelling of the relationship between employees and departments. The relationship "an employee works for a department" may be defined in the relational schema using relations called EMPLOYEE(name, age, dept_id) and DEPARTMENT(dept_id, name, location). It can be seen that there is no syntax in the EMPLOYEE relation that indicates the relationship with the DEPARTMENT relation. Hence, this semantic intent is lost. Only the application programmer knows that the dept_id attribute (a foreign key) in the EMPLOYEE relation refers to the dept_id attribute of the DEPARTMENT relation. It could be argued that this problem is not inherent to the relational data model but a limitation of the various commercial implementations in RDBMSs. To that end, this problem is alleviated in many modern relational databases that support the specification of explicit referential integrity constraints in the schema itself.
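When the schema does declare referential integrity, the association can be recovered mechanically from the catalog with no user interaction. The sketch below uses SQLite's catalog (`PRAGMA foreign_key_list`) purely as an illustration of this point; the table definitions mirror the EMPLOYEE/DEPARTMENT example, and nothing here is specific to SOAR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DEPARTMENT (dept_id INTEGER PRIMARY KEY,
                             name TEXT, location TEXT);
    CREATE TABLE EMPLOYEE (name TEXT, age INTEGER,
                           dept_id INTEGER REFERENCES DEPARTMENT(dept_id));
""")

# With the constraint declared, the catalog itself reveals the
# EMPLOYEE -> DEPARTMENT association.
for fk in conn.execute("PRAGMA foreign_key_list(EMPLOYEE)"):
    # row layout: (id, seq, table, from, to, on_update, on_delete, match)
    print(f"EMPLOYEE.{fk[3]} references {fk[2]}.{fk[4]}")
```

Without the REFERENCES clause the same query returns nothing, which is exactly the information loss the text describes: the semantic intent then lives only in the programmer's head.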

While limitations in RDBMS implementations are one cause for loss of semantics, performance optimization can be another. The join operation in relational databases is one of the most fundamental ways of retrieving data from multiple relations (Elmasri and Navathe 1992). In its most primitive form, this operator computes the cartesian product of two relations and returns the subset for which the values match for one common attribute between the two relations. Although the RDBMS may use several optimization techniques to avoid computing the entire cartesian product, the join operation is a very expensive database operation that can impose severe performance overhead. Therefore, database designers take great effort to avoid the need for join operations by choosing an appropriate design.

There are several relational database design optimizations that result in the loss of semantics. One of the fundamental activities during "good" relational database design is normalization. Normalization causes the data to be decomposed into several relations based on functional dependencies (if the universal relation approach is used). This type of decomposition implies that join operations have to be used in order to combine the data that was decomposed. In general, the higher the level of normalization, the greater the extent of decomposition. Therefore, in order to reduce the need for join operations, designers opt for a lower level of normalization, which is typically the second normal form (2NF). In the second normal form, a one-many association between two entity types is realized by embedding both the entity types in a single relation. For example, consider the following two relations:

R1(A, B*, E)
R2(B, C, D)

(In this and other table definitions in this document, underlined attribute names indicate primary keys and asterisks on attribute names indicate foreign keys.)

In order to obtain the value of attribute C corresponding to some row in R1, an equi-join operation must be performed between the attributes R1.B and R2.B. Such a join operation can be avoided by using a 2NF relation defined as follows:

R(A, B, C, D, E)

For such a relation R, there is no explicit indication that an association is embedded inside the relation. The user has to interact with the mapping procedure to provide additional information such as functional dependencies. For this example, the two functional dependencies A → B and B → CD would enable the mapping system to identify such an association.

There are at least three different ways of representing inheritance structures in the relational schema. The "good" way to represent an inheritance hierarchy is to have one relation each for every class in the hierarchy and establish the superclass/subclass relationship using foreign keys. Extracting the inheritance hierarchy from such a design is straightforward using the foreign key relationships. However, since this design would involve join operations for extracting a single object, two alternatives are employed for optimization. One alternative is to have relations corresponding only to the leaf classes in the hierarchy. The relations would include all the inherited attributes. It is difficult to extract the inheritance hierarchy from such relations without the user informing the mapping process about which relations are part of the hierarchy. A second alternative for representing inheritance structures that does not involve join operations is to have only one relation that has attributes corresponding to all the subclasses. A discriminant attribute helps the applications in determining which attributes are relevant for a particular subclass. Since there is no construct in the relational model to specify a discriminant attribute, the subclasses cannot be

extracted directly from the relational constructs. This research provides a solution that does not immediately require user assistance for both of these cases.

Finally, user assistance is needed to cope with binary large objects (BLOBs) in databases. Many RDBMSs these days provide support for the BLOB data type. Their primary purpose is to facilitate the storage of multimedia objects such as images, audio data, and so on. Lack of support for arrays of values as part of a single tuple (i.e., multi-valued attributes) is a well-known limitation of the relational model. Many implementations of relational databases utilize the BLOB data type for storing multi-valued attributes. The applications take the responsibility of converting multi-valued attributes (arrays) to binary streams and vice versa. Since such arrays in scientific applications can contain thousands of elements, using BLOB data types is the most efficient way to store and retrieve the values. It is impossible for the schema mapping procedure to determine whether a particular attribute of BLOB data type represents a single value such as an image or multiple values such as an array. Therefore, whenever the mapping procedure comes across BLOB data types, it must consult the user.

Loss of semantic information may not be a major problem for humans in domains where they can understand the exact relationship by looking at the names of the foreign keys of relations. On the other hand, extracting such semantic intents and representing them explicitly in the object-oriented model is not automatable without human intervention.

Overview of the Approach

The solution provided by this research has two major components. The first one is concerned with the mapping of the existing relational schema to an object-oriented schema. Schema mapping is a one-time static operation. The result of carrying out

this schema mapping is a mapping database that contains information about the new object classes that have been generated and the manner in which these could be associated with existing relational tables. The information includes definitions of object classes, associations, inheritance relationships, and aggregation relationships. Moreover, it contains additional information in order to allow creation of objects from a combination of one or more tuples from the relational database. Most of the information required for the schema mapping comes from the catalog of the relational database. As demonstrated in an earlier section, some user interaction is needed in the process, too. An important constraint satisfied by the schema mapping presented in this dissertation is that the original relational schema is not modified in any manner. This ensures that existing relational database applications are not affected by the mapping process, thus satisfying the goals of using an object wrapper, as described in the previous section.

While the first component is concerned with the schema mapping, the second component is concerned with the data mapping. Unlike schema mapping, data mapping is a dynamic process in that it takes place each time an object-oriented application needs to access the data in the database in the form of an object. The primary activity during data mapping includes establishing object-oriented semantics of the instances. For example, data access using navigation is one of the primitive operations using objects. Associations are typically used for such navigational access. The data mapping component should ensure that the applications are able to traverse the associations in a transparent manner although the underlying operation could be a relational join operation.

Figure 3.2 illustrates the system architecture of SOAR. The Static Schema Mapper (SSM) interacts with the user and the RDBMS to produce the SchemaBase.
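The one-time static step can be pictured as a walk over the relational catalog that records, for each relation, a generated class and the table it maps back to. The dict-shaped "SchemaBase", the naming rule, and the use of sqlite3 below are illustrative assumptions only; note that the catalog is read but the relational schema is never modified, matching the constraint stated above.

```python
import sqlite3

def build_schemabase(conn):
    """Read the relational catalog and record mapping information:
    for each table, a generated class name, its attributes, and the
    source table needed later for tuple-to-object creation."""
    schemabase = {}
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        attrs = [col[1] for col in conn.execute(f"PRAGMA table_info({table})")]
        schemabase[table.capitalize()] = {"attributes": attrs,
                                          "source_table": table}
    return schemabase

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, age INTEGER, dept_id INTEGER)")
print(build_schemabase(conn))
```

A full SSM pass would also record associations, inheritance, and aggregation relationships, with user interaction where the catalog alone is ambiguous.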

The SchemaBase is a repository that contains the schema mapping knowledge. Object-oriented applications communicate with the Dynamic Object-Relational Mapper (DORM) to retrieve data from the relational database. The DORM uses the SchemaBase to map the relational data to the objects that the application wants. Legacy relational applications can continue to run against the existing RDBMS and existing relational schema without any change.

The next chapter contains details of the schema mapping process. The SSM carries out a one-time process to generate the mapping information between the existing relational schema and the corresponding object-oriented schema. The SSM stores the extracted object-oriented structures in a machine-readable format (the SchemaBase). All object-oriented applications communicate with the relational database through the Dynamic Object-Relational Mapper component. The DORM receives requests for reading objects from the application. The DORM then reads the SchemaBase for information about which relations and attributes to access, what operations to perform (e.g., join), and so on. It then retrieves the corresponding tuples from the relational database, makes objects of the appropriate type, and returns them to the application. All these steps take place dynamically (i.e., at run-time). The applications can carry out all these operations by making calls to the DORM class library.

The overall run-time layered architecture is summarized in Figure 3.3. In a layered architecture, each layer in the system is dependent only on the layer immediately below it. Hierarchical layering is a common architecture in the integration of multidatabase systems (Shaw and Garlan 1996). A layered architecture decreases the dependence among the various components of the system and allows different components to grow or shrink with minimum impact on the overall system.
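The run-time sequence just described, consulting the SchemaBase, retrieving the tuples, and materializing objects, can be sketched as follows. The DORM class, the dict-shaped SchemaBase entries, and the use of plain dicts as stand-ins for typed objects are all illustrative assumptions rather than the SOAR implementation.

```python
import sqlite3

class DORM:
    """Dynamic mapper sketch: consult the SchemaBase for which relation and
    attributes back a class, retrieve the tuples, and build objects."""
    def __init__(self, conn, schemabase):
        self.conn, self.schemabase = conn, schemabase

    def read_objects(self, class_name):
        entry = self.schemabase[class_name]          # mapping information
        cols = ", ".join(entry["attributes"])
        rows = self.conn.execute(f"SELECT {cols} FROM {entry['source_table']}")
        # Make "objects" of the appropriate shape and return them.
        return [dict(zip(entry["attributes"], row)) for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEE (name TEXT, age INTEGER)")
conn.execute("INSERT INTO EMPLOYEE VALUES ('Jones', 35)")
schemabase = {"Employee": {"source_table": "EMPLOYEE",
                           "attributes": ["name", "age"]}}
print(DORM(conn, schemabase).read_objects("Employee"))
# [{'name': 'Jones', 'age': 35}]
```

A class mapped from several relations would additionally trigger the join operations recorded in its SchemaBase entry; the single-table case above shows only the basic read path.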


[Figure 3.2 is a data-flow diagram. The user interacts with the Static Schema Mapper (SSM), which reads catalog information from the Relational Database System and stores mapping information in the SchemaBase. The Dynamic Object-Relational Mapper (DORM) reads the mapping information, issues queries against the RDBMS for tuples, and answers object queries from object-oriented applications with objects. Legacy relational applications continue to query the RDBMS directly.]

Figure 3.2. High-level System Architecture of SOAR

[Figure 3.3 shows the runtime layers: Object-oriented Applications sit on the Dynamic Object-Relational Mapper, which sits on the RDBMS; Relational Applications sit directly on the RDBMS.]

Figure 3.3. Layered Architecture of SOAR Applications

In SOAR's architecture, for example, object-oriented applications rely only on the DORM for accessing the data and are totally shielded from the RDBMS. This allows the underlying RDBMS to be changed without affecting the applications, as long as the interface between the application layer and the DORM layer is kept consistent.
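The benefit claimed for this layering can be made concrete with a small sketch (hypothetical class names, not SOAR's interfaces): because the application depends only on the mapper layer's interface, two different lower layers are interchangeable.

```python
# Illustrative sketch of the layering rule: the application sees only the
# mapper interface, so the backing store can change without touching it.
# All names here are hypothetical.

class MapperInterface:
    def get(self, key):
        raise NotImplementedError

class RdbmsBackedMapper(MapperInterface):
    def __init__(self, rows):
        self._rows = rows          # stands in for an RDBMS connection
    def get(self, key):
        return self._rows.get(key)

class OtherBackendMapper(MapperInterface):
    def __init__(self, rows):
        self._rows = rows          # a different lower layer, same interface
    def get(self, key):
        return self._rows.get(key)

def application(mapper):
    # Application code is written against MapperInterface only.
    return mapper.get("p1")

rows = {"p1": "Project One"}
same_result = application(RdbmsBackedMapper(rows)) == application(OtherBackendMapper(rows))
```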

CHAPTER IV

MAPPING COMPONENTS

This chapter describes in detail the two major components needed for providing object-oriented access to existing relational databases. The first component deals with mapping the relational schema to an object-oriented schema. The second component deals with the mapping between the relational data and objects. As mentioned elsewhere in this document, data mapping is included in this research only for the sake of completeness of the mapping process. Data mapping approaches are fairly standard in the industry, and there are several commercial tools that can carry out this process in an efficient manner; for example, see Keller and Basu (1996) and Keller and Coldewey (1996).

Schema Mapping

The static schema mapping process is a two-phase process. In the first phase, the relational schema is adjusted and transformed into another virtual relational schema that has some specific properties. In the second phase, object-oriented structures are extracted from the virtual relational schema. The motivation for this two-phase approach is explained below. As has been reported in the literature, there are many object-oriented constructs that cannot be represented directly in the relational model (Robie and Bartels 1994; Simon 1994; Loomis 1994). Nevertheless, there are several mapping strategies for generating a relational schema that corresponds to a given object schema, which

is typically referred to as forward engineering (Elmasri and Navathe 1992). Such mapping strategies have also been successfully used in reverse engineering relational schemas to object-oriented schemas (Chiang, Barron, and Storey 1994; Fahrner and Vossen 1995b). The basic idea is to "guess" what object-oriented structures would have given rise to the given relational schema if forward engineering practices had been used on the object-oriented schema. However, such a procedure does not yield desirable results if the relational schema contains peculiar constructs such as second normal form (2NF) relations, binary large objects (BLOBs) for multi-valued attributes, and so on. Therefore, there is a need to adjust the original relational schema so that these unconventional constructs are replaced by conventional constructs such as 3NF relations, atomic attributes, and so on, without loss of information. The two-phase approach is illustrated in Figure 4.1.

[Figure 4.1 is a two-step diagram: in Phase 1, the source relational schema from the Relational Database System is adjusted into a virtual relational schema; in Phase 2, object-oriented structures are extracted from the virtual relational schema, and the resulting mapping information is stored in the SchemaBase.]

Figure 4.1. Two Phases of Static Schema Mapping

Assumptions

The procedure presented here takes into consideration relations that are not in third normal form (3NF) but are at least in second normal form (2NF). That is, the relations in the input schema for the mapping may contain transitive functional dependencies (but not partial dependencies). The availability of functional dependencies for the 2NF relations is required for the mapping process. The functional dependencies may be provided by the user or could be derived automatically from the database extension (i.e., the actual data). In addition, any schema mapping procedure involving the relational model and a conceptual model like the ER model or OO model requires knowledge of primary keys and foreign keys or their equivalent (e.g., inclusion dependencies) (Johannesson and Kalman 1989). Two attributes are considered the "same" (synonymous) if they have the same domain. For example, "SSN" and "Leader-SSN" are considered the same since they both have the same domain, namely, the set of all social security numbers. Synonyms are typically used for the names of foreign keys (as in this example). Such similarity of the attributes can be determined based on the primary keys to which the foreign keys refer. It is also assumed that there are no homonyms in the schema. Homonyms are attributes that have the same name but different domains.

Notation

R = {R1, R2, ...} is a relational schema where each Ri(A, PK, FK, FD) represents a relation. A_R = {A1, A2, ...} are the attributes of some relation R. PK_R represents the primary key of the relation R. FK_R = {FK1, FK2, ...} are the foreign keys of the relation, where each FKi ⊆ A_R. FD_R = {FD1, FD2, ...} are the functional dependencies that hold true in the relation, where each functional dependency FDi is of the form X → Y and X, Y ⊆ A_R. In the example table definitions given in this document, underlined attribute names indicate primary keys and an asterisk on an attribute name indicates a foreign key. Relational algebra is a collection of operations that can be used to manipulate relations and attributes (Elmasri and Navathe 1992). Three primary operations are the select operation (which is distinct from the SELECT SQL command), the project operation, and the join operation. The general form of the select operation is σ_<condition>(<relation>), which denotes the set of tuples from the given relation that satisfy the given condition. The general form of the project operation is π_<attributes>(<relation>), which denotes the selection of the specified attributes from the given relation. Finally, the join operation is denoted by R1 ⋈ R2, which selects tuples from the Cartesian product R1 × R2 based on the specified condition. The assignment operation is denoted by →.
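As an executable reading of this notation (illustrative Python, not part of the dissertation's system), a relation can be modelled as a list of dictionaries, one dictionary per tuple:

```python
# Minimal sketch of the three relational-algebra operations over
# list-of-dict relations. Table contents are hypothetical.

def select(relation, condition):
    # sigma_condition(relation): tuples satisfying the condition
    return [t for t in relation if condition(t)]

def project(relation, attributes):
    # pi_attributes(relation); duplicates are kept for simplicity
    return [{a: t[a] for a in attributes} for t in relation]

def join(r1, r2, condition):
    # r1 JOIN r2: condition-filtered subset of the Cartesian product r1 x r2
    return [{**t1, **t2} for t1 in r1 for t2 in r2 if condition(t1, t2)]

EMP = [{"ssn": "1", "dept": 10}, {"ssn": "2", "dept": 20}]
DEPT = [{"dept_id": 10, "name": "R&D"}]

joined = join(EMP, DEPT, lambda e, d: e["dept"] == d["dept_id"])
```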

Phase One: Adjusting the Relational Schema

There are four specific aspects that are addressed during phase one. They are as follows:

- Step 1.1 Eliminate 2NF relations and replace them with new 3NF virtual relations.

- Step 1.2 Create virtual subclass relations for widow superclass relations (to be defined later).


- Step 1.3 Create virtual superclass relations for orphan subclass relations (to be defined later).

- Step 1.4 Eliminate multi-valued attributes and replace them with new 3NF virtual relations.

Note that in the process of adjusting the relational schema, the original schema is not modified in any manner. Not requiring modification of the original schema is very important in order to satisfy the "co-existence requirement," because changing the source schema would render existing data and applications unusable to the users. Therefore, before initiating the mapping process, a new temporary virtual schema is created and initialized with the contents of the original source schema. All the adjustment operations are then carried out on the virtual relational schema.

The algorithm for the elimination of 2NF relations is given in Figure 4.2. The procedure basically loops through all the transitive dependencies in each relation and replaces them with the corresponding 3NF virtual relations. For example, consider a 2NF relation:

R(A, B, C, D, E); A → B, B → CD

This relation R can be split into two temporary relations:

R1(A, B*, E)
R2(B, C, D)

In relation R1, B is a foreign key that refers to the relation R2.

Algorithm 1: Step 1.1
INPUT/OUTPUT: Virtual relational schema V
1.  for each relation R in the relational schema V
2.    for each pair of functional dependencies FDi and FDj of the form X → Y and Y → Z, where Y is not a candidate key of R
3.      Create a new virtual relation V1 having attributes A_V1 = A_R − Z, PK_V1 = PK_R
4.      Create a new virtual relation V2 having attributes A_V2 = Y ∪ Z, PK_V2 = Y
5.      Make V1.Y a foreign key that refers to the virtual relation V2
6.      Make V2.Y the primary key of V2
7.      Replace relation R in V with virtual relations V1 and V2
8.    end for
9.  end for

Figure 4.2. Elimination of 2NF Relations

Step 1.2 is concerned with the elimination of widow relations. Analogous to the term "widow lines" used in document publishing, widow relations are those relations that have the attributes corresponding to potential subclasses embedded in a single relation. The goal of step 1.2 is to extract such embedded structures from relations and represent them explicitly in the form of virtual relations and foreign key relationships. These types of constructs typically represent inheritance relationships where the attributes corresponding to each subclass are embedded in a single relation. NULL values are stored for attributes that are not applicable to a particular subclass. The application programs use one of the attributes of the relation as a discriminant attribute in order to determine which attributes of a given tuple have to be used for processing. Relational database designers use this type of embedded inheritance in order to avoid the join operations that would be required if each subclass were represented by a distinct relation. For example, consider the PROJECT relation shown below. That relation actually contains records corresponding to two types of projects, namely, hardware projects and software projects. The discriminant attribute is ProjType, which uses the values 'H' and 'S' to discriminate between the two sets of tuples. The attribute Supplier# is relevant only for hardware projects (i.e., ProjType = 'H'). Similarly, the attributes Language and LOC are relevant only for software projects (i.e., ProjType

= 'S'). This type of classification is evident from the NULL values for attributes that are not relevant.

PROJECT Relation

Proj#  ProjName       ProjType  Supplier#  Language   LOC
0001   Project One    'H'       2102       NULL       NULL
0002   Project Two    'H'       2102       NULL       NULL
0003   Project Three  'H'       2107       NULL       NULL
0004   Project Four   'H'       3102       NULL       NULL
0005   Project Five   'H'       2102       NULL       NULL
0006   Project Six    'S'       NULL       C++        12,000
0007   Project Seven  'S'       NULL       Java       3,000
0008   Project Eight  'S'       NULL       Java       NULL
0009   Project Nine   'S'       NULL       Smalltalk  10,100
0010   Project Ten    'S'       NULL       Lisp       8,000

Knowledge discovery from databases is concerned with extracting non-trivial information from existing databases (Piatetsky-Shapiro and Frawley 1991). There are several knowledge discovery algorithms that can identify rules from relational databases (Piatetsky-Shapiro and Frawley 1991). By using one of those algorithms on the PROJECT relation, for instance, we can identify rules such as

ProjType = 'S' ⇒ Supplier# = NULL
ProjType = 'H' ⇒ (Language = NULL ∧ LOC = NULL)

Such rules are typically called "strong rules" since they hold for all the instances of the relation. In effect, knowledge discovery algorithms can be used to identify the discriminant attribute of the generalization by finding rules whose right-hand side is of the form A = NULL, where A is some attribute name. The algorithm specified in Figure 4.3 contains a procedure that is optimized to specifically identify rules of the form

Rule_i : (D = V_j) ⇒ X_i = NULL

where D is the discriminant attribute and X_i is an attribute or a combination of attributes. The main algorithm contains a loop to check if each attribute is a discriminant attribute. The MakeSubclasses sub-algorithm does the actual work of computing the rules and generating new virtual relations corresponding to the subclasses. Since this part of the algorithm examines the actual contents of the database, it is important to minimize the database access as much as possible. Therefore, the algorithm uses three conditions to enable the "short-circuit" of the computation (line 1 in Algorithm 3). First, if the data type of the attribute is BLOB, then the attribute cannot be a discriminant attribute. Similarly, if the attribute has NULL values in the database, it cannot be a discriminant attribute. Except in the worst case (when the last tuple in the relation contains a NULL value for the attribute), these two conditions can usually be checked without accessing all the tuples of the relation. The third condition uses a resource variable called UNIQUE_THRESHOLD, which is an indicator of the selectivity of the attribute. Discriminant attributes typically do not contain many unique values (relative to the total number of tuples). The variable UNIQUE_THRESHOLD specifies the maximum number of unique values that can be permitted for a discriminant attribute. The use of this variable helps in speeding up the table-scanning process by ruling out attributes such as primary keys without reading in all the values. A typical value for UNIQUE_THRESHOLD is 5. If an attribute satisfies all three conditions, then a boolean matrix (as defined in line 6) is computed. For the example PROJECT table, the matrix corresponding to the ProjType attribute is as follows ('F' stands for FALSE and 'T' for TRUE).

Algorithm 2: Step 1.2
INPUT: UNIQUE_THRESHOLD
INPUT/OUTPUT: Virtual relational schema V
1.  for each relation R in the relational schema V
2.    for each attribute Ai ∈ A_R
3.      call MakeSubclasses(R, Ai, V, UNIQUE_THRESHOLD)
4.    end for
5.  end for

Algorithm 3: MakeSubclasses
INPUT: Relation R, attribute A, UNIQUE_THRESHOLD
OUTPUT: Virtual relational schema V
1.  if data_type(A) = BLOB or |σ_A=NULL(R)| > 0 or unique_values(A) > UNIQUE_THRESHOLD
2.  then
3.    return
4.  endif
5.  Compute unique_values[], an array of the unique values of attribute A
6.  Compute a boolean matrix M where the element m_ij = TRUE if π_Aj(σ_A=unique_values[i](R)) = NULL, and FALSE otherwise; i = 1, 2, ..., |unique_values|, j = 1, 2, ..., |A_R|
7.  if matrix M does not contain at least one column having exactly one FALSE value
8.  then
9.    return
10. endif
11. Let A_super = ∪_p A_p, where p is such that (m_ip = FALSE ∀ i) or (m_ip = TRUE ∀ i)
12. Create a virtual relation S corresponding to the superclass, having attributes A_super
13. Replace relation R in V with S
14. for each unique value v_i of the attribute A
15.   Let A_new = ∪_p A_p, where p is such that (m_ip = FALSE ∧ A_p ∉ A_super)
16.   Create a virtual relation V having attributes A_new ∪ PK_S
17.   Make the primary key of V a foreign key that refers to S
18.   Add V to V
19. end for

Figure 4.3. Elimination of Widow Relations

                 Proj#  ProjName  ProjType  Supplier#  Lang  LOC
ProjType = 'H'     F       F         F         F        T     T
ProjType = 'S'     F       F         F         T        F     F

The matrix contains three columns having exactly one FALSE value, which indicates that there are three attributes in the relation that have NULL values for some unique value of the discriminant. Therefore, the attribute ProjType qualifies as a discriminant attribute (line 7). In lines 11-13, a relation corresponding to the superclass is created. The attributes corresponding to the matrix columns that are entirely composed of FALSE values comprise this relation's attributes. Lines 14-19 create one subclass for each unique value of the discriminant attribute and add them to the virtual relation schema. Note that this algorithm carries out an inference process based on the extent (instances) of the database. Consequently, the "discovery" made by the algorithm (the discriminant attribute in this case) may get invalidated if an instance is added or updated. For example, a tuple might be inserted or updated such that it violates the criteria for a discriminant attribute. Violation of the criteria actually indicates that a false inheritance relationship was extracted initially. In order to guard against extraction of such relationships that are true only by coincidence, the user must be asked to confirm such relationships. The definition of the new relations corresponding to the example PROJECT is as follows:

PROJECT(Proj#, ProjName, ProjType)
VIRTUAL_REL_1(Proj#, Supplier#)
VIRTUAL_REL_2(Proj#, Language, LOC)
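The matrix computation just shown can be sketched as follows (illustrative Python over a three-row sample of the PROJECT data; the real MakeSubclasses also short-circuits on BLOB attributes, NULLs in the candidate attribute, and UNIQUE_THRESHOLD):

```python
# Sketch of the NULL-pattern test behind MakeSubclasses. Illustrative only.

ROWS = [
    {"Proj#": "0001", "ProjType": "H", "Supplier#": "2102", "Language": None, "LOC": None},
    {"Proj#": "0006", "ProjType": "S", "Supplier#": None, "Language": "C++", "LOC": 12000},
    {"Proj#": "0007", "ProjType": "S", "Supplier#": None, "Language": "Java", "LOC": 3000},
]
ATTRS = ["Proj#", "ProjType", "Supplier#", "Language", "LOC"]

def null_matrix(rows, attrs, disc):
    values = sorted({r[disc] for r in rows})
    # m[i][j] is True when attribute j is NULL in every tuple with disc = values[i]
    return values, [
        [all(r[a] is None for r in rows if r[disc] == v) for a in attrs]
        for v in values
    ]

values, m = null_matrix(ROWS, ATTRS, "ProjType")

# Columns with exactly one False indicate subclass-specific attributes.
subclass_cols = [a for j, a in enumerate(ATTRS)
                 if sum(1 for row in m if not row[j]) == 1]
```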

Obviously, the algorithm cannot assign appropriate names for the new virtual relations. The user has to assign those. The primary keys of both VIRTUAL_REL_1 and VIRTUAL_REL_2 refer to the PROJECT relation.

Step 1.3 is concerned with the elimination of orphan relations. Orphan relations are analogous to "orphan lines." Orphan relations are relations that are part of an inheritance hierarchy but do not have relations corresponding to their respective superclasses. The goal of step 1.3 is to create virtual relations corresponding to superclasses by factoring out common attributes among a group of relations. Just as in the case of widow relations, the rationale behind including orphan relations in relational database design is, once again, to avoid expensive join operations. For example, the PROJECT relation shown earlier could be implemented as two different relations, HARDWARE_PROJ and SOFTWARE_PROJ, as follows.

HARDWARE_PROJ
Proj#  ProjName       Supplier#
0001   Project One    2102
0002   Project Two    2102
0003   Project Three  2107
0004   Project Four   3102
0005   Project Five   2102

SOFTWARE_PROJ
Proj#  ProjName       Language   LOC
0006   Project Six    C++        12,000
0007   Project Seven  Java       3,000
0008   Project Eight  Java       NULL
0009   Project Nine   Smalltalk  10,100
0010   Project Ten    Lisp       8,000

The result of applying step 1.3 to the above two relations is the creation of a new virtual relation having attributes

VIRTUAL_REL(Proj#, ProjName)

Figure 4.4 provides the algorithm for adding parents to orphan relations. The algorithm relies on the assumption that there are no synonyms or homonyms in the relational schema. Two relations are considered to have the same superclass if they have the same primary key and they have one additional attribute in common other than the primary key (line 4 in Figure 4.4). The process is repeated for all pairs of relations. At the end of line 14 in the algorithm, a set G containing potential subclass-superclass pairs is obtained. Since G will contain entries for every possible combination of subclass-superclass pairs, lines 15-18 eliminate all the transitive dependencies present in G. For every pair in G, a foreign key referring to the relation corresponding to the superclass is added to the relation corresponding to the subclass (line 20), and common attributes between the two relations are removed from the relation corresponding to the subclass. The algorithm is better illustrated with the next example, which demonstrates the extraction of multi-level generalization hierarchies. Consider the following three relations:

R1(A, B, C, D, E)
R2(A, B, C, F, G)
R3(A, B, H, I)

At the end of line 14, two new virtual relations are created as follows:

2

1

1

2

1

2

2

3

2

1

2

1

2

Finally, redundant generalization pairs are eliminated (lines 15{18) by removing < R ; V >, < R ; V > from G. Recalling that in a pair < T ; T >, T is the relation corresponding to a subclass of the class represented by relation T , the nal generalization hierarchy after applying this process is: G = f< R ; V >; < R ; V >; < R ; V >; < V ; V >g 1

2

2

2

1

2

1

2

1

1

2

1

3

2

1

2


Algorithm 4: Step 1.3
INPUT/OUTPUT: Virtual relational schema V
1.  Let G be a set of ordered pairs <T1, T2> where the primary key of T1 is a foreign key that refers to T2
2.  for each relation Ri in V
3.    for each relation Rj in V, i ≠ j
4.      if PK_Ri = PK_Rj and PK_Ri ⊂ (A_Ri ∩ A_Rj)
5.      then
6.        Let A_new = A_Ri ∩ A_Rj
7.        Create a virtual relation V having attributes A_new
8.        Remove A_new − PK_Ri from A_Ri
9.        Remove A_new − PK_Rj from A_Rj
10.       Add V to V
11.       Add the ordered pairs <Ri, V> and <Rj, V> to G
12.     endif
13.   end for
14. end for
15. if ∃ T such that <T1, T> ∈ G ∧ <T, T2> ∈ G
16. then
17.   remove <T1, T2> from G
18. endif
19. for each <T1, T2> ∈ G
20.   Make the primary key of T1 a foreign key that refers to T2
21. end for

Figure 4.4. Elimination of Orphan Relations

Step 1.4 of this first phase in the schema mapping relies almost entirely on user input. As discussed in earlier chapters, relational database designers apply several optimization techniques in order to improve the performance of data retrieval. The amount of overhead involved in retrieving one large tuple is typically much lower than the overhead in retrieving multiple tuples, even if their size is relatively small. Therefore, BLOB data types are often used to store multi-valued attributes if the number of values is very large. Scientific databases employ this technique for quickly retrieving large quantities of data (Jurkevics 1992). Consider the following relation:

BUOYDATA(dataset_id, lat, lon, num_values, data_blob)

This hypothetical relation represents data collected by a buoy placed at various locations in the ocean. In this relation, the attributes lat and lon represent the location of the buoy. data_blob is a multi-valued attribute where each value is actually made up of three other values (possibly more), viz., time of measurement, temperature, and wind speed. In fact, data_blob is an example of a complex multi-valued attribute since each of the multiple values is non-primitive. The mapping procedure cannot guess the composition of each unit of the BLOB data attribute. Therefore, the mapping algorithm for this step, given in Figure 4.5, requests the user to specify the composition of the attribute for all attributes whose data type is BLOB. Line 6 creates a virtual relation corresponding to the attributes given by the user. Applying step 1.4 to the example relation BUOYDATA results in the creation of a virtual relation corresponding to the sub-objects. The new relational definitions will be as follows:


Algorithm 5: Step 1.4
INPUT/OUTPUT: Virtual relational schema V
1.  for each relation R in V
2.    A_aggr = A_R
3.    for each attribute Ai ∈ A_R
4.      if Ai is a BLOB data type and Ai represents complex data
5.      then
6.        A_new = attributes of the complex data corresponding to Ai
7.        Create a virtual relation V having attributes A_new ∪ PK_R
8.        Add to V a foreign key that refers to R
9.        Add V to V
10.       Remove the attribute R.Ai from A_aggr
11.     endif
12.   end for
13.   Create a virtual relation S having attributes A_aggr
14.   Replace relation R in V with S
15. end for

Figure 4.5. Elimination of BLOB Attributes


BUOYDATA(dataset_id, lat, lon, num_values)
VIRTUAL_REL(dataset_id, time, temperature, wind_speed)

Phase Two: Generation of the Object Schema

At the end of phase one of the schema mapping process, the relational schema has been adjusted to a form in which schema mapping rules can be applied uniformly, irrespective of the type of optimization structures the original relational schema may contain. The rest of this subsection describes the procedure to extract object classes and relationships from the virtual relational schema. The first task of mapping from the virtual relational schema to an object-oriented schema is to identify what object-oriented constructs we are interested in extracting from the relational schema. Once they are specified, the actual procedure for carrying out the mapping can be specified to target only the constructs of interest. In keeping with the standard practice in the database reverse engineering literature, extraction of the object-oriented schema refers to the extraction of the structural portion of the object-oriented model. Thus, the main goals of phase two are:

1. Identifying Object Classes. Those relations that correspond to object classes must be identified.

2. Identifying Relationships. There are mainly three types of relationships that can be represented in an object model. They are associations, generalizations/specializations, and aggregations (Rumbaugh et al. 1991). Identifying each of these constructs constitutes a step in the mapping process.

(a) Identifying Associations. Since the object model allows associations to be modelled as classes, we must either establish a simple association between two object classes or identify relationships where the associations are modelled as classes.

(b) Identifying Inheritance. Inheritance structures capture the generalization and specialization relationships between object classes that have been identified so far.

(c) Identifying Aggregation. The aggregation relationship models the composition of one object with other objects. Such complex objects must be identified. The difference between aggregation and association is that the former involves existence dependence of the sub-object on the whole object. For example, a door object, which is a "part-of" a car object, cannot exist if the car object does not exist. On the other hand, the enrollment of a student in a course is an association rather than an aggregation because the student and course objects can exist independently of each other.

3. Establishing Cardinalities. Establishing the cardinalities of associations is important in order to facilitate the implementation of the object schema in a given programming language (e.g., C++). The different possible cardinalities are one-one, one-many, and many-many.

Identifying Object Classes

The term class-relation is used here to denote a relation in the relational schema that qualifies to be mapped to an object class. In the adjusted virtual relational schema, all the 3NF relations qualify to be class-relations, and hence each of them maps to an object class. The classes whose relations have the primary key entirely composed of foreign keys represent associations that are themselves classes. Such

classes will be termed association-classes. The algorithm for identifying object classes is given in Figure 4.6.

Algorithm 6: Identifying Classes
INPUT: Virtual relational schema V
OUTPUT: Object classes C and association classes S
1.  for each relation V ∈ V
2.    Create a class C having the same attributes as V
3.    Add C to C
4.    if |PK_V| > 1 ∧ PK_V ⊆ FK_V
5.      Add C to S
6.    endif
7.  end for

Figure 4.6. Identifying Classes

Identifying Associations

Associations between object classes can be identified based on two cases. First, every relation whose primary key is entirely composed of foreign keys is a class that represents an association. The association relationship is between all the classes corresponding to the class-relations that the foreign keys refer to. Second, for every class-relation that is not an association class, establish an association relationship between the corresponding class and the class corresponding to each non-key foreign key. These two cases for identifying associations are specified in the algorithm given in Figure 4.7.

Identifying Inheritance

Three forms of relational structures may indicate an inheritance relationship. The first one is the most straightforward case, which involves one relation for the


Algorithm 7: Identifying Associations
INPUT: Virtual relational schema V
OUTPUT: Object classes C and association classes S
1.  for each relation V ∈ V
2.    Let C be the class corresponding to the relation V
3.    for each foreign key FK ∈ FK_V
4.      Let T be the class corresponding to the relation that FK refers to
5.      Create a many:one association between C and T
6.    end for
7.  end for
8.  for each association class C ∈ C
9.    Let R be the relation in V that corresponds to the class C
10.   Let F be the set of foreign keys that the primary key of R refers to
11.   for each Fi ∈ F
12.     for each Fj ∈ F, i ≠ j
13.       Let C1 be the class corresponding to the relation that Fi refers to
14.       Let C2 be the class corresponding to the relation that Fj refers to
15.       Create a many:many association between C1 and C2, with C as the association class
16.     end for
17.   end for
18. end for

Figure 4.7. Identifying Associations
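Both cases of Algorithm 7 can be sketched over a toy catalog recording each relation's primary key and foreign keys (illustrative Python; the STUDENT/COURSE/ENROLLMENT and EMPLOYEE/DEPARTMENT relations are hypothetical examples, not from the text):

```python
# Sketch of the two association cases. ENROLLMENT, whose key is entirely
# foreign keys, becomes an association class; EMPLOYEE's non-key foreign key
# yields a plain many:one association. Illustrative only.

CATALOG = {
    "STUDENT":    {"pk": ["sid"], "fks": {}},
    "COURSE":     {"pk": ["cid"], "fks": {}},
    "ENROLLMENT": {"pk": ["sid", "cid"], "fks": {"sid": "STUDENT", "cid": "COURSE"}},
    "EMPLOYEE":   {"pk": ["ssn"], "fks": {"dept_id": "DEPARTMENT"}},
    "DEPARTMENT": {"pk": ["did"], "fks": {}},
}

def association_classes(catalog):
    # a relation is an association class when |PK| > 1 and PK is all foreign keys
    return [name for name, r in catalog.items()
            if len(r["pk"]) > 1 and all(k in r["fks"] for k in r["pk"])]

def many_one_associations(catalog):
    assoc = association_classes(catalog)
    return [(name, target)
            for name, r in catalog.items() if name not in assoc
            for fk, target in r["fks"].items()]
```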

superclass and one relation each for all the subclasses. Widow relations and orphan relations, as explained earlier, are the other two forms of relational structures that indicate an inheritance relationship. However, the algorithms for eliminating these two types of relational structures reduce the relational schema to contain only the form of the first case, i.e., separate relations corresponding to the superclasses and subclasses, respectively. When the relational schema involves one relation for the superclass and one relation each for all the subclasses, every pair of class-relations (CR1, CR2) that have the same primary key may be involved in an inheritance relationship. The inheritance relationship CR1 "is-a" CR2 exists if the primary key of CR1 is also a foreign key and it refers to the class-relation CR2. The algorithm for extracting inheritance relationships from the adjusted relational schema is given in Figure 4.8.

Algorithm 8: Identifying Inheritance
INPUT: Virtual relational schema V
INPUT/OUTPUT: Object classes C
1.  for each relation V ∈ V
2.    if PK_V ∈ FK_V
3.      Let FK ∈ FK_V be the foreign key corresponding to the primary key PK_V
4.      Let C1 be the class corresponding to the relation V
5.      Let C2 be the class corresponding to the relation that FK refers to
6.      Make the class C1 a subclass of class C2
7.    endif
8.  end for

Figure 4.8. Identifying Inheritance

Figures 4.9, 4.10, and 4.11 illustrate the three cases, respectively, using the examples defined earlier for phase one. The notation of the Object Modelling Technique (OMT) is used in the figures (Rumbaugh et al. 1991). Rectangular boxes

represent object classes. Lines connecting the object classes represent associations. A filled circle at the end of an association indicates the "zero or more" cardinality of the association. Aggregation is represented using a little diamond-shaped box near the end of the aggregate object. Inheritance is represented by a triangle connecting the superclass and all the subclasses.

Identifying Aggregation

Recall that an aggregation relationship exhibits existence dependency. That is, if A "is-part-of" B, then the existence of A depends on the existence of B. Consider a class-relation CR1 whose primary key has more than one attribute and at least one of them is not a foreign key. Let CR2 be the class-relation corresponding to the largest subset of foreign keys in the primary key of CR1. If such a class-relation exists, then CR1 "is-part-of" CR2. Notice that this definition recognizes aggregations of association classes, too. The algorithm for identifying aggregations is given in Figure 4.12. The following two example relations illustrate this case:

EMPLOYEE(SSN, Name, Age)
DEPENDENT(SSN*, Dependent#, DependentName, DependentAge)

In the DEPENDENT relation, SSN is a foreign key that refers to the key of the EMPLOYEE relation. Therefore, the class-relation corresponding to the largest subset of foreign keys in the primary key of DEPENDENT is EMPLOYEE. Moreover, the key of DEPENDENT has the attribute Dependent#, which is not a foreign key. Hence, the DEPENDENT object "is-part-of" an EMPLOYEE object. That is, a DEPENDENT object cannot exist without a corresponding EMPLOYEE object.
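The aggregation test can be sketched directly on this example (illustrative Python, not SOAR's implementation):

```python
# Sketch of the aggregation test: DEPENDENT's primary key has more than one
# attribute, is not all foreign keys, and its foreign-key part refers to
# EMPLOYEE, so DEPENDENT "is-part-of" EMPLOYEE. Illustrative only.

def aggregation_parent(pk, fks):
    """fks maps each foreign-key attribute to the relation it refers to."""
    fk_part = [a for a in pk if a in fks]
    if len(pk) > 1 and fk_part and len(fk_part) < len(pk):
        # the class-relation referred to by the key's foreign-key part is the whole
        targets = {fks[a] for a in fk_part}
        if len(targets) == 1:
            return targets.pop()
    return None

parent = aggregation_parent(["SSN", "Dependent#"], {"SSN": "EMPLOYEE"})
```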

[Figure 4.9 shows the relations PROJECT(Proj#, ProjName, ProjType), HARDWARE_PROJ(Proj#, Supplier), and SOFTWARE_PROJ(Proj#, Language, LOC) mapped to a PROJECT class with subclasses HARDWARE_PROJ and SOFTWARE_PROJ.]

Figure 4.9. Example for Case 1 Inheritance

[Figure 4.10 shows the single widow relation PROJECT(Proj#, ProjName, ProjType, Supplier, Language, LOC) mapped to a PROJECT class with subclasses HARDWARE_PROJ(Proj#, Supplier) and SOFTWARE_PROJ(Proj#, Language, LOC).]

Figure 4.10. Example for Case 2 Inheritance

[Figure 4.11 shows the orphan relations R1(A, B, C, D, E), R2(A, B, C, F, G), and R3(A, B, H, I) mapped to a hierarchy with T2(A, B) at the root, T1(A, C) and R3(A, H, I) as subclasses of T2, and R1(A, D, E) and R2(A, F, G) as subclasses of T1.]

Figure 4.11. Example for Case 3 Inheritance
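The first (straightforward) inheritance case tested by Algorithm 8 can be sketched as follows (illustrative Python; the catalog format is hypothetical):

```python
# Sketch of Algorithm 8's test: a relation whose primary key is itself a
# foreign key maps to a subclass of the class the key refers to.
# pk_fk_target records which relation, if any, the primary key refers to.

CATALOG = {
    "PROJECT":       {"pk": ["Proj#"], "pk_fk_target": None},
    "HARDWARE_PROJ": {"pk": ["Proj#"], "pk_fk_target": "PROJECT"},
    "SOFTWARE_PROJ": {"pk": ["Proj#"], "pk_fk_target": "PROJECT"},
}

def subclass_pairs(catalog):
    # (subclass, superclass) for every relation whose PK refers to another relation
    return [(name, r["pk_fk_target"])
            for name, r in catalog.items() if r["pk_fk_target"]]

pairs = subclass_pairs(CATALOG)
```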

Algorithm 9: Identifying Aggregation
INPUT: Virtual relational schema V
INPUT/OUTPUT: Object classes C
1.  for each relation V ∈ V
2.    if |PK_V| > 1 ∧ PK_V ⊄ FK_V
3.      Let C1 be the class corresponding to the relation V
4.      if ∃ FK ∈ FK_V such that (PK_V ∩ FK_V) = FK
5.        Let C2 be the class corresponding to the relation that FK refers to
6.        Make C1 "a-part-of" C2
7.      endif
8.    endif
9.  end for

Figure 4.12. Identifying Aggregation Computational Complexity The computational complexity of the algorithms de ned in this chapter is derived in the section. The static schema mapping involves CPU activity and also disk input/output activity. Since the amount of time spent in both these activities is not comparable, it is standard for those quantities to be treated separately for the sake of computing the run-time complexity (Chiang, Barron, and Storey 1993). The symbols and variables to be used in the analyses are as follows:

- n is the number of relations in the existing relational database
- n' is the number of relations after carrying out the first phase of the schema mapping
- m is the average number of attributes in each relation
- p is the average number of records in a relation
- f is the average number of functional dependencies per relation

In general, n << p, n' << p, and m << p.

The first step in the schema mapping process is the initialization operation. During the initialization, the existing relational schema definition is read from secondary storage. This operation involves reading n records for each schema and m records for each relation, giving a total secondary access time of O(nm).

Algorithm 1 (Figure 4.2) eliminates 2NF relations. The loop in lines 2-8 executes f times for determining transitive dependencies. Taking the main loop into consideration, the total CPU time for this algorithm is O(nf). In general, there are few relations in 2NF and hence few transitive dependencies. This algorithm does not access secondary storage.

Algorithm 2 (Figure 4.3) eliminates widow relations. First, the time complexity of the MakeSubclasses sub-algorithm (Algorithm 3) will be considered. Line 1 computes the unique values for the given attribute. This operation requires O(p) secondary storage accesses. The select operation σ_{A≠NULL}(R) in line 2 also takes O(p) secondary storage access time. Recall that the boolean matrix M has one row for each unique value of the attribute and one column for each attribute of the relation. The resource variable UNIQUE_THRESHOLD (denoted by the constant U) controls the number of rows in the matrix, and the value of this variable is very small compared to p. Computing the boolean matrix M requires O(Um) = O(m) CPU operations (line 6). Line 7 checks whether there is a column that contains exactly one FALSE value in the matrix. This operation takes O(m) CPU operations. In line 11, the attributes corresponding to the superclass are computed by scanning the matrix. This operation takes O(m) CPU operations. Line 12 takes O(m) time to create the relation corresponding to the superclass. Lines 14-19 create relations corresponding to each subclass. Line 15 takes O(m) CPU operations as the columns of the matrix are searched for appropriate attributes. Line 16 takes O(m) time to create the relations corresponding to the subclasses. Therefore, the entire loop takes O(2Um) = O(m) CPU operations. The overall CPU time complexity of Algorithm 3 is O(m) + O(m) + O(m) + O(m) + O(m) = O(m). Since Algorithm 2 invokes Algorithm 3 for every attribute of every relation, the total CPU time complexity is O(nm²). The total secondary storage access time complexity is O(np).

Algorithm 4 (Figure 4.4) eliminates orphan relations. Lines 2-12 contain a nested loop over all pairs of relations. Hence, the if statement is always executed n² times. Since in the worst case G contains n² elements, the statements in lines 13-16 will execute n² times in the worst case. Finally, lines 17-20 also execute n² times in the worst case. Therefore, the overall time complexity of the algorithm is O(n²). Algorithm 5 checks every attribute in the relational schema for BLOB attributes and consequently has a time complexity of O(nm). Algorithms 4 and 5 do not have any secondary storage accesses; neither do any of the remaining algorithms.

Algorithm 6 (Figure 4.6) creates object classes based on the relations in the adjusted schema. Each object class can be created in O(m) time. Therefore, the total time to create all the classes is O(nm).

Algorithm 7 (Figure 4.7) identifies associations. Lines 1-7 examine each foreign key of every relation. Since the number of foreign keys in a relation is bounded by the number of attributes in the relation, the time complexity of these lines is O(nm). Line 8 is a loop over the association classes in the schema. Since the number of association classes in the schema is bounded by the number of relations in the schema, the loop starting in line 8 executes at most n times. By a similar argument based on the number of attributes in the primary key, the loop starting in line 11 executes at most m times. Hence, the total time complexity of lines 8-18 is O(nm). The total time complexity of Algorithm 7 is O(nm) + O(nm) = O(nm).

Algorithm 8 (Figure 4.8) merely examines the primary key of each relation to identify inheritance relationships. This algorithm contains only one loop that goes through all the relations and hence has a run-time complexity of O(n). By a similar argument, Algorithm 9 (Figure 4.12) also has a run-time complexity of O(n). A summary of the computational complexities of all the algorithms is given in Table 4.1.

Data Mapping

In this section, the mapping between the relational tuples and the objects corresponding to the generated object schema is described. The Dynamic Object-Relational Mapper (DORM) component of the SOAR prototype uses the data mapping strategies given here in order to satisfy the queries of object-oriented applications on the object schema. Since there are two phases in the schema mapping process, two data mapping strategies have been specified. One mapping procedure specifies the data mapping between the original relational schema and the adjusted relational schema. The second mapping procedure specifies the data mapping between the adjusted relational schema and the object schema.

The data mapping procedures have been specified using relational algebra for each new virtual relation that is created during phase one of the static schema mapping. In Algorithm 1, relation R(A_R) is replaced by V1(A_V1) and V2(A_V2) (line 7). The following relational algebra expression specifies the data mapping for the two relations V1 and V2:

V1 = π_{A_V1}(R)    (4.1)
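The phase-one data mapping of expressions 4.1 and 4.2, where a relation split by Algorithm 1 is recovered as two projections, can be sketched as follows. This is an illustration only; the relation and attribute names are invented, and a real implementation would issue the projections as database queries rather than operate on in-memory rows.

```python
# Sketch of the projection-based data mapping of expressions 4.1/4.2:
# when R(A_R) is split into V1(A_V1) and V2(A_V2), each virtual
# relation is a projection of R. Names here are illustrative.

def project(rows, attrs):
    """Relational projection with duplicate elimination."""
    seen, out = set(), []
    for row in rows:
        key = tuple((a, row[a]) for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append({a: row[a] for a in attrs})
    return out

# R with a transitive dependency Dept -> DeptLoc (a 2NF-style relation)
R = [
    {"Emp": "e1", "Dept": "d1", "DeptLoc": "North"},
    {"Emp": "e2", "Dept": "d1", "DeptLoc": "North"},
]
V1 = project(R, ["Emp", "Dept"])      # expression 4.1
V2 = project(R, ["Dept", "DeptLoc"])  # expression 4.2
print(V2)  # [{'Dept': 'd1', 'DeptLoc': 'North'}]
```

Duplicate elimination in the projection is what collapses the repeated Dept/DeptLoc pairs into a single tuple of the new virtual relation.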

Table 4.1. Computational Complexity of Schema Mapping

Algorithm                          CPU       Secondary storage
Initialization                     -         O(nm)
Elimination of 2NF relations       O(nm)     -
Elimination of widow relations     O(nm²)    O(np)
Elimination of orphan relations    O(n²)     -
Elimination of BLOB attributes     O(nm)     -
Identifying classes                O(nm)     -
Identifying associations           O(nm)     -
Identifying inheritance            O(n)      -
Identifying aggregation            O(n)      -

V2 = π_{A_V2}(R)    (4.2)

Since A_V1 and A_V2 are subsets of A_R, the relational algebra expressions 4.1 and 4.2 are well-defined.

Algorithm 2 is concerned with the elimination of widow relations. The primary operation takes place in the sub-algorithm, Algorithm 3. If a particular attribute qualifies to be a discriminant attribute, Algorithm 3 replaces that relation by one relation corresponding to the superclass and one relation each corresponding to the subclasses. The relations corresponding to the subclasses are created in line 16. The relational algebra expression corresponding to each new relation V created there is:

V = π_{A_new}(σ_{A = vi}(R))    (4.3)

Similarly, the data corresponding to the superclass S that is created in line 12 is given by the following expression:

S = π_{A_super}(R)    (4.4)

Algorithm 4 eliminates orphan relations and adds relations corresponding to an inheritance hierarchy. The relations corresponding to superclasses are created in line 7. The data mapping expression for each of the new virtual relations is:

V = π_{A_new}(Ri) ∪ π_{A_new}(Rj)    (4.5)

The data mapping for relations that have BLOB elements is not straightforward. Algorithm 5 utilizes user interaction to determine the composition of the BLOB data type and creates a corresponding virtual relation. Since the composition of the attribute of BLOB data type could be arbitrary, it is not possible to specify, using relational algebra expressions, the data mapping between that attribute and the corresponding new virtual relation in the adjusted schema. For the attributes whose data type is not BLOB, the relational algebra expression for the adjusted relational schema is as follows:

S = π_{A_aggr}(R)    (4.6)

The second aspect of data mapping is the one between the instances of the object schema and the virtual relations of the adjusted relational schema. Since the objects are complex by definition, it is not possible to define each object's complete composition using simple relational algebra expressions. Therefore, relational algebra expressions have been given for each major type of object-oriented construct that was extracted during the schema mapping process.

For each class C having primitive attributes A, there is a corresponding virtual relation V containing those attributes (a proof of this is given in the next chapter). Therefore, the primitive composition of the class is given by:

C = π_A(V)    (4.7)

Let C1 and C2 be classes and let V1 and V2 be their corresponding virtual relations having primary keys X and Y, respectively. Consider a one-one or one-many association between classes C1 and C2. In general, binary associations can be traversed in either direction of the relationship. Therefore, the part of the C1 object that is associated with instances of C2 (denoted as C1::C2) and the part of the C2 object that is associated with instances of C1 (denoted as C2::C1) can be mapped as follows:

TEMP = V1 ⋈_{V1.X = V2.X} V2    (4.8)

C1::C2 = π_{A_V2}(TEMP)    (4.9)

C2::C1 = π_{A_V1}(TEMP)    (4.10)

Depending upon the cardinality of the association, C1::C2 and C2::C1 yield either a single row or multiple rows.

The case of a many-many association needs to be handled differently. If there is a many-many association between the two classes C1 and C2, then there exists an association class C3 having a corresponding virtual relation V3 (as specified in the static schema mapping procedure). Using the same notation as in the previous example, C1::C2 and C2::C1 are denoted as follows:

TEMP1 = V1 ⋈_{V1.X = V3.X} V3    (4.11)

TEMP2 = TEMP1 ⋈_{TEMP1.Y = V2.Y} V2    (4.12)

C1::C2 = π_{A_V2}(TEMP2)    (4.13)

C2::C1 = π_{A_V1}(TEMP2)    (4.14)

Data mapping for an inheritance relationship can be specified along similar lines based on the join operation. If C1 is a superclass of C2, then the inherited attributes from C1 for instances of C2, denoted below as C2::C1, are given by the following relational algebra expression:

C2::C1 = V2 ⋈_{V2.Y = V1.Y} V1    (4.15)

Aggregation is handled exactly like an association except that this relationship is usually only uni-directional from the parent object to the child object. If C2 "is-part-of" C1, then for given instances of C1, the component objects are mapped as follows:

TEMP = V1 ⋈_{V1.X = V2.X} V2    (4.16)

C1::C2 = π_{A_V2}(TEMP)    (4.17)

64 so on. In the second phase, the adjusted relational schema is mapped to objectoriented constructs. Nine algorithms were presented for generating an object-oriented schema corresponding to the relational schema and their computational complexity was discussed. Data mapping corresponds to the mapping between the instances of the generated object schema (objects) and the instances of the relational schema (tuples). Data mapping was speci ed in the form of relational algebra expressions. The DORM component of SOAR utilizes the relational algebra expressions in order to satisfy the queries on the object schema by object-oriented applications. The next chapter contains an evaluation of the mapping procedure described in this chapter. Evaluation was carried out by demonstrating that the algorithms presented in this chapter are complete and consistent.

CHAPTER V EVALUATION Completeness and consistency are the two factors that will be used to evaluate the mapped schema. A rigorous procedure for testing both these properties of the mapping process can be speci ed without the need for using subjective human interpretation of the mapped schema. A generalized proof for completeness will involve showing that every attribute of the relational schema is also accessible from the object-oriented schema. Similarly, consistency will be proven by showing that the semantics of the object-oriented constructs implicitly satisfy various types of functional dependencies. Completeness A target-schema S M2 is complete if every data item included in the source schema S M1 can be accessed using the target schema. The target-schema here is the object-oriented schema S OO and the source schema is the relational schema S Rel. The focus of completeness is to ensure that there is no loss of information in the process of carrying out the schema mapping. Therefore, there are two aspects to showing completeness { attribute completeness and data completeness. Proving attribute completeness is done by showing that all the attributes of the source relational schema are present in the target object schema as well. Proving data completeness is done by showing that all the values stored in the tuples are accessible through the instances of the classes from the object schema. 2

1

2

1

65

66 Accessibility of Attributes Accessibility of attributes is demonstrated by examining each algorithm for operations that alter the schema. Those operations are then shown to maintain attribute completeness. The static schema mapping process begins by initializing the virtual relational schema using the entire source relational schema. Therefore, there is no loss of attributes in that step. Next, Algorithm 1 eliminates 2NF relations and replaces them with 3NF relations. In line 7, relation R(AR) is replaced by V (AV1 ) and V (AV2 ). The attribute completeness of Algorithm 1 can be proven if it can be shown that AV1 [ AV2 = AR From lines 3 and 4, AV1 [ AV2 = (AR ? Z ) [ (Y [ Z ) 1

2

= AR [ Y = AR, since Y  AR Therefore, the attribute completeness of Algorithm 1 is proven. Algorithm 2 is concerned with the elimination of widow relations. The primary operation takes place in the sub-algorithm Algorithm 3. If a particular attribute quali es to be a discriminant attribute, Algorithm 3 replaces that relation by one relation corresponding to the superclass and one relation corresponding to the subclasses. The attribute completeness of this algorithm can be proven if it is shown that the union of all the attributes of the superclass and the subclasses together equals the set of attributes of the original relation. By de nition of the boolean matrix, every attribute of the relation is represented by one and only one column of the boolean matrix M (line 6). The matrix M can be partitioned as follows into three sets of columns: MT , MF , and MTF . MT represents

67 the set of columns where each column is entirely composed of TRUE values. Similarly, MF represents the set of columns where each column is entirely composed of FALSE values. Finally, MTF represents the set of columns where each column has both TRUE and FALSE values. A particular column in the matrix can be considered to have been touched if the attribute corresponding to that column of the matrix has already been assigned to some virtual relation. Therefore, attribute completeness of this algorithm is achieved if every column of the matrix is touched. In line 12, the set of attributes for the relation corresponding to the superclass is created using the partitions MT and MF . Hence, the columns belonging to MT and MF have been touched. It remains to be shown that all the columns of MTF are touched in the loop de ned in lines 14{19. By de nition, every column of MTF has at least one FALSE value in each column. Line 15 guarantees that every column of MTF is touched by virtue of the line being a part of a loop over the rows of the matrix. That is, there exists some row i for which mip = FALSE for all mip 2 MTF . Therefore, every column of MTF is touched, too. Since each of the three partitions of M is touched, it has been shown that Algorithm 3 (and in turn Algorithm 2) is complete. Algorithm 4 is concerned with eliminating orphan relations. The only portion of the algorithm that adds and removes attributes from the schema is lines 6{11. While lines 8 and 9 remove the non-key attributes that are common to both Ri and Rj , those same attributes are being added into a new virtual relation V in lines 6 and 7. Therefore, no attribute is \lost" in this algorithm and hence attribute completeness of the algorithm is preserved.
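The set-algebra identity behind Algorithm 1's attribute completeness can be spot-checked mechanically. The sketch below is illustrative only; the attribute names are arbitrary placeholders standing in for A_R, Y, and Z.

```python
# Mechanical spot-check of the attribute-completeness identity for
# Algorithm 1: with A_V1 = A_R - Z and A_V2 = Y ∪ Z (Y, Z ⊆ A_R),
# the union A_V1 ∪ A_V2 must equal A_R. Names are placeholders.

A_R = {"Emp", "Dept", "DeptLoc"}
Y, Z = {"Dept"}, {"DeptLoc"}      # transitive dependency Y -> Z

A_V1 = A_R - Z
A_V2 = Y | Z
assert A_V1 | A_V2 == A_R         # no attribute is lost in the split
print(sorted(A_V1 | A_V2))        # ['Dept', 'DeptLoc', 'Emp']
```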

As mentioned in the previous chapter, Algorithm 5 for the elimination of BLOB data types is primarily carried out by user interaction. Line 6 creates a virtual relation based on the user's specification. The assumption of the mapping procedure is that this virtual relation is logically equivalent to the BLOB attribute Ai, and hence this attribute is replaced by a reference to the virtual relation. Attribute completeness is thus maintained in this process, as one attribute is being replaced by another logically equivalent attribute.

Algorithms 1-5 comprise phase one of the two-phase static schema mapping. The remaining algorithms comprise phase two of the mapping process. In Algorithm 6, line 2 creates a new object class that has the same attributes as the corresponding virtual relation from the adjusted relational schema. Since all the attributes from the virtual relational schema are accommodated in the object classes, attribute completeness is retained in this algorithm. Algorithms 7-9 neither add nor remove attributes from the object schema. Thus, the entire second phase also preserves attribute completeness. Since both phases of static schema mapping exhibit attribute completeness, the static schema mapping process preserves attribute completeness.

Accessibility of Data

While adjusting the relational schema during phase one of the static schema mapping process, the primary schema-change operation is a relation or attribute being replaced by one or more relations or attributes, respectively. Accessibility of data, i.e., data completeness, can be proven by showing that in all such operations, the tuples (or attribute values) of the relation (or attribute) being replaced are accessible through their replacement constructs.

In Algorithm 1, relation R is replaced by two virtual relations V1 and V2 in line 7. Relational algebra expressions 4.1 and 4.2 define the data mapping. Since neither of the two expressions involves a select operator with a boolean condition, there is no loss of tuples in the process. Moreover, since attribute completeness is also ensured, there is no loss of attribute values. Therefore, data completeness is preserved in Algorithm 1.

Algorithms 2 and 3 replace relations containing embedded relations with virtual relations corresponding to a superclass and one or more subclasses. Data mapping expression 4.3 specifies the contents of each new subclass relation that is created. The constraint A = vi partitions the rows of relation R, since vi takes on all the unique values of the attribute A. Therefore, no rows are lost in the relations corresponding to the subclasses. The data mapping for the relation corresponding to the superclass is given by expression 4.4. Since this expression does not involve a constraining select operation, there is no loss of rows in this virtual relation. Finally, since it has been shown that all of these operations preserve attribute completeness, there is no loss of attribute values. Hence, Algorithms 2 and 3 preserve data completeness.

Algorithm 4 groups relations into inheritance hierarchies by factoring out common attributes into virtual relations corresponding to superclasses and subclasses. Expression 4.5 is the primary data mapping expression for this algorithm. Since it involves only the projection operation for each new virtual relation created by the algorithm, there is no loss of rows from the original relations. Attribute completeness has also been shown for this algorithm. Thus, data completeness is assured since neither rows nor column values are lost.

As mentioned earlier, data mapping for relations that have BLOB elements is not simple. Nevertheless, data completeness of the non-BLOB attributes of relations is easily shown based on data mapping expression 4.6: the expression does not involve the select operator, and attribute completeness of non-BLOB attributes has already been shown. However, the data completeness of the virtual relations that were created corresponding to each BLOB attribute cannot be proven, because this process is entirely carried out by the user and hence is beyond the mapping procedure's control.

Summary of Completeness

The mapping procedure involves mapping of the attributes and mapping of the data (rows of values). Therefore, completeness of the mapping procedure was shown by proving that all the attributes from the source schema are accessible from the target object schema and all data values of the source schema are accessible through the instances of the object schema. Completeness ensures that there is no loss of information in the mapping process. Appendix A contains some additional explanation of the evaluation procedure along with examples.

Consistency

Recall that a mapped object-oriented schema S_OO is consistent with the relational schema S_Rel from which it was derived if all the functional dependencies (FDs) of S_Rel are preserved in S_OO. Using FDs to establish consistency is justified by the fact that consistency corresponds to the integrity of the transformed schema (and hence the transformed data). In turn, the integrity can be violated if, for example, the key constraints are violated in the transformed schema.

FDs specify relationships between the attribute values of a single relation only. Since a schema is typically comprised of more than one relation, some mechanism is needed to check for functional dependency consistency across multiple relations. This problem can be solved by first forming the universal relations UR1 and UR2 corresponding to the original relational schema R and the adjusted virtual relational schema V, respectively. These universal relations are not actually used to represent the database, but will be used for establishing consistency. Consistency can be shown if both the universal relations have the same set of FDs. Figure 5.1 illustrates this idea.

[Figure 5.1. Proving Consistency: the Relational Schema yields Universal Relation UR1 and the Virtual Relational Schema yields Universal Relation UR2; consistency is shown by comparing the FDs of UR1 and UR2.]

In general, functional dependencies (FDs) are a property of the attribute semantics and hence are usually specified by the user (Elmasri and Navathe 1992). However, when universal relations are formed from existing relations, some or all of which may be in 3NF, FDs can also be derived based on certain properties of those relations. Figure 5.2 provides the algorithm for extracting the FDs. The algorithm

basically computes the FDs based on the primary key of the relations. Since those FDs are of the form X → Y where X is a superkey of some relation R, they do not violate the requirements of 3NF (Elmasri and Navathe 1992). On the other hand, the source relational schema R may contain some relations that are in 2NF. Since it is assumed that the user provides additional FDs in such cases, additional FDs need not be derived by the algorithm.

Algorithm 10: Extracting FDs for universal relation
INPUT: Relational schema R
OUTPUT: Functional dependencies FD_U implied in the universal relation U corresponding to the relational schema R

1.  for each relation R in R
2.    Add the functional dependency PK_R → A_R to FD_U
3.  end for

Figure 5.2. Extracting Implied Functional Dependencies

The FDs for universal relations UR1 and UR2 can be obtained by applying Algorithm 10. Note that UR1 may include FDs provided by the user. In order to show that the FDs are preserved in the schema mapping, it must be shown that all operations in the transformation algorithms of phase 1 preserve the FDs of the corresponding universal relation.

In Step 1.1, given in Figure 4.2, lines 2-8 normalize the relation. In that transformation, two relations with attributes A_V1 = A_R − Z and A_V2 = Y ∪ Z are added to the schema V. According to Algorithm 10, there is an implied FD PK_V2 → A_V2, which can be shown as follows to be equivalent to Y → Z:

PK_V2 → A_V2  =  Y → A_V2
              =  Y → Y ∪ Z
              =  Y → Z

The FD X → Y is preserved in V1 since X, Y ⊆ A_V1. Therefore, the schema-changing operations of this algorithm preserve the functional dependencies of their respective universal relations.

In Step 1.2, a widow relation R is replaced by virtual relations V1, ..., Vn such that

A_R = A_V1 ∪ ... ∪ A_Vn

This is true based on attribute completeness of this algorithm. Since all the attributes are preserved, all the keys and foreign keys are also preserved. All the relations have the same primary key, say PK. The following derivation shows that the set of FDs derived from the application of Algorithm 10 on the widow relations of R is equivalent to the set of FDs derived from the virtual relations of V:

PK → A_V1
...
PK → A_Vn
⟹ PK → A_V1 ∪ ... ∪ A_Vn
⟹ PK → A_R

Hence, all the implied functional dependencies, which are derived based on primary keys, are preserved in Step 1.2.

The major operation in the algorithm for the elimination of orphan relations (Figure 4.4) is the application of the projection operation on orphan relations in order to form parent relations. Based on the attribute completeness of Algorithm 4, the group of virtual relations corresponding to an inheritance hierarchy preserves all the attributes of the original group of orphan relations. In addition, those virtual relations have the same primary key. Therefore, using reasoning similar to the previous case, all the implied FDs derived using Algorithm 10 on the original schema can also be derived from the new virtual relations.

Algorithm 5 (Figure 4.5) replaces BLOB attributes with an additional virtual relation if the attribute represents complex data. If PK is the primary key of some relation R containing such complex attributes, then

PK → A_R ∈ FD_UR1

The BLOB attribute is removed from the original relation and a new virtual relation corresponding to the complex attribute is created. If Ai is the BLOB attribute being removed, then the FD PK → Ai does not hold any more. However, an additional FD corresponding to the new virtual relation holds in the adjusted schema:

PK → A_new ∈ FD_UR2

If the new attributes specified by the user are representative of the BLOB attribute being replaced, then these new FDs are equivalent to the missing FD corresponding to the BLOB attribute.

Algorithms 6-9 do not change the structure of the adjusted relational schema and hence do not change the consistency of the attributes. Therefore, all the transformation operations in the static schema mapping preserve the consistency of the original relational schema.

Effectiveness of the Approach

The effectiveness of the approach presented in this dissertation can be evaluated based on two criteria. The effectiveness first relies on the correctness of the algorithms. The previous sections have established the correctness of the steps using completeness and consistency. The second factor is the extent to which the schema mapping operations can be automated using the specified algorithms. Given the assumptions stated in Chapter IV, the mapping procedure described in this dissertation fully automates the following activities:

- Identification of object classes
- Identification of association relationships between object classes
- Extraction of inheritance relationships when they are represented using
  - one relation for each class in the hierarchy
  - one relation for each leaf class in the hierarchy
  - one relation to represent the entire class hierarchy
- Extraction of aggregation relationships when the aggregates are represented using separate relations

The mapping procedure requires mandatory user interaction for the following activities:

- Extraction of embedded association relationships. The user has to supply all the transitive functional dependencies.
- Extraction of aggregation relationships when the aggregates are represented using BLOB data types in the relational database. The user has to supply the composition of the individual components of each aggregate object.

Finally, the user can optionally interact with the mapping procedure to carry out the following operations:

- Reject some constructs extracted by the system. It is possible that some relationship that the system discovered was merely incidental.
- Change the automatically generated names.

The SOAR implementation of the static schema mapping provides a graphical user interface that allows the user to load relational schema definitions and carry out the above-mentioned steps in an interactive manner. The tool allows the user to save the generated mapping for further use by the DORM component.
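The consistency machinery of this chapter (Algorithm 10's implied FDs and the universal-relation comparison of Figure 5.1) can be sketched mechanically. The dict-based schema encoding and the relation names below are assumptions of this sketch, which mirrors the Step 1.2 derivation: relations sharing a primary key contribute FDs whose right-hand sides union to the original relation's attributes.

```python
# Sketch of Algorithm 10 plus the Figure 5.1 comparison: derive the
# implied FDs PK -> attributes for each schema and check that the two
# universal relations determine the same attributes per primary key.
# The schema encoding and names are assumptions of this sketch.

def implied_fds(schema):
    """Map each primary key to the union of attributes it determines."""
    fds = {}
    for rel in schema.values():
        pk = tuple(sorted(rel["pk"]))
        fds.setdefault(pk, set()).update(rel["attrs"])
    return fds

# Original schema R: one widow relation.
R = {"PROJ": {"pk": ["Proj#"],
              "attrs": ["Proj#", "Name", "Supplier", "Language"]}}

# Adjusted schema V: superclass plus subclass relations, same PK.
V = {"PROJ_SUPER": {"pk": ["Proj#"], "attrs": ["Proj#", "Name"]},
     "HW":         {"pk": ["Proj#"], "attrs": ["Proj#", "Supplier"]},
     "SW":         {"pk": ["Proj#"], "attrs": ["Proj#", "Language"]}}

print(implied_fds(R) == implied_fds(V))  # True: implied FDs preserved
```

This is exactly the PK → A_V1, ..., PK → A_Vn ⟹ PK → A_R argument of Step 1.2 in executable form.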

CHAPTER VI

RESULTS

This chapter contains the consolidated results of applying the schema and data mapping algorithms to four different relational schemas. Since all the special features that the schema mapping process can handle may not be present in a single relational schema, three different perspectives of a single logical domain model were considered to illustrate the overall capabilities of the mapping. One real-world geophysical relational database was also used for schema mapping, and comments on the extracted object schema are provided for discussion.

The "Personnel" Database

The first database example is a hypothetical database corresponding to personnel information of a computer firm. The firm consists of many employees, each attached to a single department in the organization. The firm undertakes both hardware and software projects. Each project has one leader assigned to it, and each project has a specific budget and target date allotted. All hardware projects get their hardware supplies from a given set of suppliers. For software projects, the programming language and the lines of code (LOC) are recorded for future reference.

Relational Schema #1

The first relational schema for the problem statement is given below. The database design for this schema is fairly intuitive and contains no special optimization constructs. The relational schema is in 3NF.

Table 6.1. "Personnel" Relational Schema #1

Employee: SSN, Name, Age, Sex, Dept-Num
    PK = {SSN}; FK = {Dept-Num (Department)}
Department: Dept-Num, Name, Emp-Count, Location
    PK = {Dept-Num}; FK = {Head-SSN (Employee)}
Project: Proj-Num, Proj-Name, Leader-SSN, Budget, Proj-Type, Target-Date
    PK = {Proj-Num}; FK = {Leader-SSN (Employee)}
Works-On: SSN, Proj-Num, Start-Date
    PK = {SSN, Proj-Num}; FK = {SSN (Employee)}, {Proj-Num (Project)}
Hardware-Project: Proj-Num, Supplier-Num
    PK = {Proj-Num}; FK = {Proj-Num (Project)}, {Supplier-Num (Supplier)}
Software-Project: Proj-Num, Language, Lines-of-code
    PK = {Proj-Num}; FK = {Proj-Num (Project)}
Supplier: Supplier-Num, Name, Phone, Address
    PK = {Supplier-Num}; FK = None
Dependent: Emp-SSN, Dependent-Num, Dependent-Name, Dependent-Age
    PK = {Emp-SSN, Dependent-Name}; FK = {Emp-SSN (Employee)}

Given the relation definitions of Table 6.1, there is no activity during phase 1 of the static schema mapping because the schema does not contain any constructs that are handled in phase 1. That is, the schema does not contain any 2NF relations, widow relations, or orphan relations. Therefore, phase 2 (Algorithms 6–9) can be applied directly to this schema. The object schema that results from this straightforward mapping process is given in Figure 6.1. Algorithm 6 identifies the object classes Employee, Department, Project, Works-On, Hardware-Project, Software-Project, Supplier, and Dependent. Algorithm 7 extracts many-one relationships between Employee and Department, between Hardware-Project and Supplier, and between Project and Employee. The many-many association between Employee and Project is also established by the algorithm.
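The phase 2 extraction just described can be sketched from key metadata alone. The sketch below is illustrative only (it is not the dissertation's Algorithms 6 and 7): every relation yields an object class, a relation whose primary key consists entirely of foreign keys is treated as a many-many association, and each remaining foreign key yields a many-one relationship. The dictionary layout and helper names are assumptions made for this example.

```python
# Illustrative sketch (not the dissertation's Algorithms 6-9): derive one
# object class per relation and classify relationships from PK/FK metadata.

def classify(schema):
    many_one = []   # (child, parent) pairs, one per plain foreign key
    many_many = []  # relations whose PK is made up entirely of FK attributes
    for rel, meta in schema.items():
        fk_attrs = {attr for attr, _ in meta["fk"]}
        if meta["pk"] and set(meta["pk"]) <= fk_attrs and len(meta["fk"]) > 1:
            many_many.append((rel, tuple(parent for _, parent in meta["fk"])))
        else:
            for _, parent in meta["fk"]:
                many_one.append((rel, parent))
    return many_one, many_many

# Key metadata for a slice of "Personnel" Relational Schema #1.
schema = {
    "Employee":   {"pk": ["SSN"], "fk": [("Dept-Num", "Department")]},
    "Department": {"pk": ["Dept-Num"], "fk": []},
    "Project":    {"pk": ["Proj-Num"], "fk": [("Leader-SSN", "Employee")]},
    "Works-On":   {"pk": ["SSN", "Proj-Num"],
                   "fk": [("SSN", "Employee"), ("Proj-Num", "Project")]},
}
many_one, many_many = classify(schema)
```

Works-On, whose key {SSN, Proj-Num} is covered by its two foreign keys, comes out as the many-many association between Employee and Project, matching the mapping result above.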

[Figure 6.1. Object Schema for "Personnel" Relational Schema #1 — a class diagram showing the object classes Department, Employee, Project, Dependent, Works-On, Hardware-Project, Software-Project, and Supplier, each listing the attributes of its corresponding relation.]

Table 6.2. "Personnel" Relational Schema #2

Employee(SSN, Name, Age, Sex, Dept-Num)
    PK = {SSN}; FK = {Dept-Num (Department)}
Department(Dept-Num, Name, Emp-Count, Location)
    PK = {Dept-Num}; FK = {Head-SSN (Employee)}
Project(Proj-Num, Proj-Name, Leader-SSN, Budget, Proj-Type, Lines-of-Code, Language, Supplier-Num, Supplier-Name, Supplier-Address, Supplier-Phone)
    PK = {Proj-Num}; FK = {Leader-SSN (Employee)}
    FD = {Supplier-Num → Supplier-Name, Supplier-Address, Supplier-Phone}
Works-On(SSN, Proj-Num, Start-Date)
    PK = {SSN, Proj-Num}; FK = {SSN (Employee)}, {Proj-Num (Project)}
Dependent(SSN, Dependent-Num, Dependent-Name, Dependent-Age)
    PK = {SSN, Dependent-Name}; FK = {SSN (Employee)}

Relational Schema #2

The second relational design to be considered is given in Table 6.2. This design is more highly optimized than the one presented in the previous section. In the new schema, the notion of a widow relation is used to represent the various types of projects. The motivation for employing widow relations in relational schema design was discussed in Chapter IV. This schema also includes a relation (Project) that is in 2NF in order to avoid the need to join tables to retrieve, for example, the supplier name for a given hardware project. When the schema mapping algorithms are applied to this relational schema, the object schema shown in Figure 6.2 is extracted. The overall structure of the schema

remains the same, except that the mapping algorithms produce new object classes that have no corresponding relations in the relational schema. The first new object class extracted by the mapping process is named Project_Sub1Embedded_1. This class is extracted by Algorithm 1 and corresponds to the embedded association in the Project relation. Two other new object classes are Project_Sub0 and Project_Sub1, corresponding to software and hardware projects, respectively. They are extracted by Algorithm 2. Note that all the newly extracted object classes have been assigned automatically-generated names; the user can change them to more meaningful names. The extraction of the remaining object classes and relationships proceeds as in the previous case.
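The subtype extraction that Algorithm 2 performs on a widow relation can be pictured with a small sketch. The partitioning rule used here (group rows by which type-specific columns are non-NULL, then project away the columns that do not apply) is an illustrative simplification rather than the dissertation's exact procedure, and the function and variable names are invented for this example.

```python
# Illustrative sketch of splitting a widow relation into a supertype plus one
# virtual relation per subtype (in the spirit of Algorithm 2, not its exact
# text). Rows are dicts; None stands for the relational NULL.

def split_widow(rows, key, common, subtype_cols):
    """key: PK attribute; common: supertype attributes; subtype_cols maps a
    subtype name to the attributes populated only for that subtype."""
    supertype = [{a: r[a] for a in [key] + common} for r in rows]
    subtypes = {}
    for name, cols in subtype_cols.items():
        subtypes[name] = [
            {a: r[a] for a in [key] + cols}
            for r in rows
            if any(r[c] is not None for c in cols)  # row belongs to this subtype
        ]
    return supertype, subtypes

rows = [
    {"Proj#": 1, "ProjName": "Project One", "ProjType": "H",
     "Supplier#": 2102, "Language": None, "LOC": None},
    {"Proj#": 6, "ProjName": "Project Six", "ProjType": "S",
     "Supplier#": None, "Language": "C++", "LOC": 12000},
]
sup, subs = split_widow(
    rows, "Proj#", ["ProjName", "ProjType"],
    {"PROJ_SUB0": ["Supplier#"], "PROJ_SUB1": ["Language", "LOC"]},
)
```

Each subtype's virtual relation keeps the shared key, so the inheritance hierarchy extracted from it can still be joined back to the supertype.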

[Figure 6.2. Object Schema for "Personnel" Relational Schema #2 — a class diagram showing Department, Employee, Project, Dependent, and Works-On together with the newly generated classes Project_Sub1 (Proj-Num, Supplier-Num), Project_Sub0 (Proj-Num, Language, Lines-of-Code), and Project_Sub1Embedded_1 (Supplier-Num, Name, Phone, Address).]

Relational Schema #3

The third design for the sample relational database includes orphan relations; they correspond to hardware and software projects. One problem introduced by using such orphan relations in the design is that foreign key relationships involving generic projects cannot be expressed, because there is no single relation that includes all the project numbers. Consequently, there is no single relation in the design that can act as part of a referential integrity constraint involving projects. That is why the definition of the Works-On relation does not include any foreign key corresponding to the Proj-Num attribute ("semantic degradation"). As will be seen shortly, this affects the relationships the schema mapper identifies. The entire relational schema for the third design is given in Table 6.3.

The result of applying the schema mapping algorithms to this relational schema is the object schema shown in Figure 6.3. Algorithm 4 eliminates the orphan relations by creating an inheritance hierarchy involving Hardware-Project, Software-Project, and a newly-created object class Temp-Sup-0. The class Temp-Sup-0 is analogous to the class Project included in the previous two object schemas. One major difference between this object schema and the earlier two is that there is no many-many relationship between the classes corresponding to employees and projects. The reason, as indicated earlier, is that the definition of the Works-On relation does not include any foreign key corresponding to the Proj-Num attribute, which in turn was caused by the absence of any relation corresponding to the object class Temp-Sup-0. So, instead of a many-many relationship, an aggregation relationship is extracted. In order to correct this shortcoming in the source schema, the user can interact at this point by first creating a temporary foreign key between the relation


Table 6.3. "Personnel" Relational Schema #3

Employee(SSN, Name, Age, Sex, Dept-Num)
    PK = {SSN}; FK = {Dept-Num (Department)}
Department(Dept-Num, Name, Emp-Count, Location)
    PK = {Dept-Num}; FK = {Head-SSN (Employee)}
Works-On(SSN, Proj-Num, Start-Date)
    PK = {SSN, Proj-Num}; FK = {SSN (Employee)}
Hardware-Project(Proj-Num, Proj-Name, Leader-SSN, Budget, Proj-Type, Target-Date, Supplier-Num)
    PK = {Proj-Num}; FK = {Proj-Num (Project)}, {Supplier-Num (Supplier)}
Software-Project(Proj-Num, Proj-Name, Leader-SSN, Budget, Proj-Type, Target-Date, Language, Lines-of-code)
    PK = {Proj-Num}; FK = {Leader-SSN (Employee)}
Supplier(Supplier-Num, Name, Phone, Address)
    PK = {Supplier-Num}; FK = none
Dependent(SSN, Dependent-Num, Dependent-Name, Dependent-Age)
    PK = {SSN, Dependent-Name}; FK = {SSN (Employee)}

Employee and the virtual relation corresponding to Temp-Sup-0 and then re-running the mapping procedure. Doing this will result in the generation of a many-many relationship similar to that of the other two schemas. This schema is, in fact, yet another demonstration of the need for user interaction in the mapping process. The rest of the object schema is extracted in a manner similar to that of the other two.
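Algorithm 4's creation of a superclass from orphan relations can be illustrated with a short sketch: the attributes common to all the orphan relations are lifted into a new virtual supertype relation, and each orphan keeps only its key plus its type-specific attributes. This is a simplification under invented names, not the dissertation's exact algorithm.

```python
# Illustrative sketch (not the dissertation's exact Algorithm 4): lift the
# attributes shared by orphan relations into a virtual superclass relation.

def lift_superclass(orphans, key):
    """orphans: {relation name: list of attribute names}; key: shared PK."""
    shared = set.intersection(*(set(attrs) for attrs in orphans.values()))
    super_attrs = sorted(shared)  # the Temp-Sup-0-style superclass
    trimmed = {
        name: [key] + [a for a in attrs if a not in shared]
        for name, attrs in orphans.items()
    }
    return super_attrs, trimmed

orphans = {
    "Hardware-Project": ["Proj-Num", "Proj-Name", "Leader-SSN", "Budget",
                         "Proj-Type", "Target-Date", "Supplier-Num"],
    "Software-Project": ["Proj-Num", "Proj-Name", "Leader-SSN", "Budget",
                         "Proj-Type", "Target-Date", "Language", "Lines-of-code"],
}
temp_sup_0, subclasses = lift_superclass(orphans, "Proj-Num")
```

Run on the Schema #3 orphans, the shared attributes (Proj-Num, Proj-Name, Leader-SSN, Budget, Proj-Type, Target-Date) become the superclass, exactly the attribute set Figure 6.3 shows for Temp-Sup-0.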

[Figure 6.3. Object Schema for "Personnel" Relational Schema #3 — a class diagram showing Department, Employee, Dependent, Works-On, and Supplier together with the newly created superclass Temp-Sup-0 (Proj-Num, Proj-Name, Leader-SSN, Budget, Proj-Type, Target-Date) and its subclasses Hardware-Project (Proj-Num, Supplier-Num) and Software-Project (Proj-Num, Language, Lines-of-Code).]

Grid Data Relational Schema

The Naval Environmental Oceanographic Nowcasting System (NEONS) is a geophysical relational database designed by the Naval Oceanographic and Atmospheric Research Laboratory (Jurkevics 1992). NEONS is a large database containing a variety of data such as grid model output, satellite images, and so on. The portion of the database to be considered here corresponds to grid data.

Grid data contains atmospheric and oceanographic numerical model outputs, user-defined parameters, and gridded climatology data. The collection of data values corresponding to the grid points in a rectangular grid comprises a grid field. Table 6.4 contains the relational schema from the NEONS database corresponding to grid data. A few attributes have been left out of the schema for the sake of brevity. The relational schema does not contain any widow or orphan relations. One noticeable relation in the schema is otisP1, which includes an attribute named bitstream of BLOB data type. The bitstream attribute contains all the measurements of a given parameter at all the specified grid points. Since the number of grid points usually exceeds 10,000, using a bitstream is an efficient way of improving the throughput of database queries.

The object schema that results from the application of the mapping algorithms is given in Figure 6.4. The major adjustment the source relational schema goes through is the creation of an aggregation relationship corresponding to the BLOB attribute of the otisP1 relation. A new virtual relation and a corresponding object class named bitstream are created based on the description given by the user (the author, in this case). Based on the available description of the data, the object schema is intuitive with respect to the domain semantics.
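The BLOB handling can be pictured as follows: given a user-supplied description of the bitstream's composition, tuples of a virtual relation are decoded out of the BLOB. The sketch below is only an illustration under assumed names; the val_num/parm_val fields mirror the bitstream class of Figure 6.4, but the 4-byte little-endian float layout is an assumption, not NEONS's actual encoding.

```python
import struct

# Illustrative sketch: expand a BLOB column into a virtual "bitstream"
# relation of (val_num, parm_val) tuples, driven by a user-supplied record
# description. The "<f" (little-endian float32) layout is assumed for this
# example and is not the actual NEONS encoding.

def decode_bitstream(blob, fmt="<f"):
    size = struct.calcsize(fmt)
    return [
        {"val_num": i, "parm_val": struct.unpack_from(fmt, blob, i * size)[0]}
        for i in range(len(blob) // size)
    ]

# A tiny fake grid field: three parameter values packed into one BLOB.
blob = struct.pack("<3f", 1.5, 2.25, 3.0)
virtual_rows = decode_bitstream(blob)
```

Each decoded tuple becomes one row of the virtual relation aggregated by otisP1, which is what allows an aggregation relationship to stand in for the opaque BLOB column.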


Table 6.4. Grid Data Relational Schema

as_grid(grid_id, model_type, vrsn_name, geom_id, min_base_etm, max_base_etm)
    PK = {grid_id}; FK = {geom_id (grid_geom)}, {model_type (grid_model)}
grid_model(model_id, model_type, fcst_dsc, remark, pack_null)
    PK = {model_id}, {model_type}; FK = none
grid_geom(geom_id, geom_name, geom_dsc, bgn_etm, stor_dsc)
    PK = {geom_id}, {geom_name}; FK = none
grid_reg_geom(geom_id, prjn_name, longitude, latitude, geom_parm_1, geom_parm_2, geom_parm_3)
    PK = {geom_id}; FK = {geom_id (grid_geom)}, {prjn_name (projection)}
grid_spct_geom(geom_id, max_lat_wav_num, max_lon_wav_num_1, max_lon_wav_num_2, coef_cnt, trnc_type)
    PK = {geom_id}; FK = {geom_id (grid_geom)}
projection(prjn_name, row_int_dsc, col_int_dsc, geom_parm_dsc_1, geom_parm_dsc_2, geom_parm_dsc_3, remark)
    PK = {prjn_name}; FK = none
grid_parm(parm_id, parm_name, parm_dsc, bit_cnt, unit_name)
    PK = {parm_id}, {parm_name}; FK = none
grid_lvl(lvl_id, lvl_type, lvl_name_1, lvl_name_2, unit_name, lvl_dsc)
    PK = {lvl_id}, {lvl_type}; FK = none
otisP1(grid_id, parm_id, lvl_id, lvl_1, lvl_2, base_etm, fcst_per, bitstream)
    PK = {grid_id, parm_id, base_etm, lvl_id, lvl_1, lvl_2}; FK = {lvl_id (grid_lvl)}, {parm_id (grid_parm)}, {grid_id (as_grid)}


[Figure 6.4. Object Schema for Grid Data Relational Schema — a class diagram showing otisP1, as_grid, grid_model, grid_parm, grid_lvl, grid_geom, grid_reg_geom, grid_spct_geom, and projection, together with the newly created class bitstream (val_num, parm_val) aggregated by otisP1.]

CHAPTER VII

CONCLUSIONS

This dissertation presents the results of research relating to a dilemma faced in real-world software development: supporting and maintaining existing relational databases while adopting new and potentially incompatible technology such as object orientation. Specifically, algorithms were presented for extracting an object schema corresponding to an existing relational schema without modifying the relational schema. The state of the art was described in a literature review. Finally, the computational complexity of the mapping algorithms and an evaluation in terms of consistency and completeness were discussed.

The next section describes the unique contributions of this research. Following that section is a review of the goals achieved by the research in terms of the initial research objectives. Finally, the limitations of this work, along with the scope for future work, are presented.

Contributions of this Research

This section details the major unique contributions of this research to the area of reverse engineering relational databases to object-oriented databases. The contributions range from eliminating shortcomings of approaches that have already been published to providing fresh approaches to additional issues in the area.

As was pointed out in the literature review, a majority of the published work in reverse engineering of relational databases has focused on extracting the entity-relationship (ER) model or its derivatives. Although the object-oriented model subsumes the ER model, only the extended-ER (EER) model and its derivatives have concepts similar to the structural portion of the object-oriented model. The schema mapping described by Chiang, Barron, and Storey (1994) is fairly extensive, except that it does not take into consideration some of the database optimization structures (e.g., widow and orphan relations) for which algorithms have been provided in this dissertation.

There has been very little published work on extracting object-oriented models as opposed to ER-based models. Fahrner and Vossen (1995b) have provided the most comprehensive transformational approach for migrating from a relational database to an object database. Since the focus of their approach is migration, unlike this research's focus on coexistence, they make little effort to avoid modifying the source schema. For example, they suggest normalizing the relational schema to 3NF if it is not already so. The conscious avoidance of changing the source schema in SOAR ensures that the coexistence requirement is not compromised in the process of schema mapping.

Among the commercial products that provide tools for reverse engineering existing relational schemas, based on publicly available information, the ONTOS Object Integration Server (OIS™) provides the most capability to cope with various kinds of relational constructs. However, the extent of static schema mapping automation is minimal, as it requires the developer to select from various mapping possibilities and carry out the mapping process interactively (Ontos Inc. 1996). SOAR's ability to automatically handle widow and orphan relations for extracting inheritance relationships is a unique contribution among published results.

One more unique contribution made by this research is the special consideration given to attributes of BLOB data type. No other approach has considered the possibility of BLOB columns in relations giving rise to potential aggregation relationships.

Finally, the adaptation of the notions of consistency and completeness for evaluating schema mapping is unique to this research. It is a rigorously verifiable technique that does not involve the subjective nature of other types of evaluation. The technique is also "economical" in that it does not require external resources such as human evaluators or databases of sizable number and volume.

Review of Research Objectives

This section reviews the research objectives stated in the first chapter. A commentary on the importance of each objective and how it was achieved is included.

One of the initial research objectives was to examine the difficulties involved in fully automating the schema mapping process. Chapter III included a discussion of the various types of semantic degradation that result in the design of the relational schema. The conclusion was that, apart from the limitations of the relational model, the desire to optimize database performance by avoiding expensive join operations leads to loss of semantics in the resulting database design. This conclusion is consistent with the view expressed in the literature that mapping a relational schema to an object-oriented schema constitutes reverse engineering and cannot be fully automated.

The second major objective was to develop a user-assisted process for extracting the object-oriented schema. Chapter IV contains all the algorithms needed for carrying out the process. The algorithms were divided into two phases: one phase

for adjusting the relational schema to a convenient intermediate form and one phase for converting from that intermediate form to the final object-oriented structures. The advantage of this two-phase approach is that the initial phase can be easily expanded to handle newer forms of database optimization. Apart from the schema mapping procedures, data mapping procedures were also specified using relational algebra operations.

The third major objective was to show that the schema mapping process is both complete and consistent. A generalized proof of completeness was specified to show attribute and data accessibility from the object-oriented schema. Consistency was proven by showing that functional dependencies are preserved across the schema mapping.

The final objective was to develop a proof-of-concept system to validate the research thesis. To satisfy this objective, the SOAR system was developed. The system includes an implementation of the schema and data mapping algorithms. The mapping information is stored in a machine-readable form, which allows it to be used in a dynamic component for mapping the actual data. Moreover, storing this information in a separately structured database will increase the portability of the system across many implementations of the same database. The database serves as the "meta database" for the runtime component (Kappel et al. 1994). For each generated object class, the saved schema mapping information includes details about all the attributes, associations, subclasses, superclasses, names of the tables corresponding to those classes, and so on. See Appendix B for the grammar corresponding to the mapping information file, and Appendix C for sample header files generated by SOAR.
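The completeness and consistency properties reviewed above can be made concrete with a small check: splitting a relation along a functional dependency is a lossless-join decomposition, so joining the virtual relations back together must reproduce the original tuples exactly. The sketch below uses illustrative data in the style of the "Personnel" example; it is a demonstration of the property, not SOAR's proof or implementation.

```python
# Illustrative data-completeness check for a split along the FD
# Supplier# -> {SupplierName}: the natural join of the two virtual
# relations must reproduce the original relation exactly.

project = [
    ("0001", "Project One", 2102, "Supplier2"),
    ("0002", "Project Two", 2102, "Supplier2"),
    ("0003", "Project Three", 2107, "Supplier7"),
]

# Decompose, keeping the determinant (Supplier#) in both virtual relations.
proj_main = sorted({(p, n, s) for p, n, s, _ in project})
proj_sub1 = sorted({(s, sn) for _, _, s, sn in project})

# Natural join on Supplier# recovers the original relation.
rejoined = sorted(
    (p, n, s, sn)
    for p, n, s in proj_main
    for s2, sn in proj_sub1
    if s == s2
)
```

Because the determinant is retained in both fragments, no spurious tuples appear in the join, which is the essence of the completeness argument.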

Limitations and Future Work

Recall that there are two major user interaction operations in the schema mapping process described here. The first is the specification of functional dependencies for non-3NF relations. There is an algorithm for automatically extracting functional dependencies, but it is of exponential complexity in the number of attributes in the relation (Bitton, Millman, and Torgersen 1989). Some work needs to be done to take this burden of specifying functional dependencies off the user.

The second mandatory user interaction involves the specification of the composition of BLOB attributes when they represent complex data. Applications that use BLOB attributes to represent complex data typically include routines for converting from the bitstream to programming language data structures (e.g., C `struct's) and vice versa. Techniques involving source code analysis could possibly uncover the composition of BLOB attributes from such routines.

The state of the art in mapping existing relational schemas to object-oriented schemas is advancing at a vigorous pace, with many vendors sensing the commercial potential of a robust tool that can automate much of the database reengineering process. While relational databases will continue to be used for applications best suited to the relational model, mapping strategies will be needed to alleviate the practical and/or economic problems of replacing relational databases with object-oriented databases for applications that are better suited to the object-oriented model.

REFERENCES

Ananthanarayanan, R., V. Gottemukkala, T. J. L. W. Kaefer, and H. Pirahesh. 1993. Using the co-existence approach to achieve combined functionality of object-oriented and relational systems. In Proceedings of the 1993 ACM SIGMOD international conference on management of data held in Washington, D.C., May, 1993, edited by P. Buneman and S. Jajodia, 109–18, Volume 22, ACM Press.

Andersson, M. 1994. Extracting an entity-relationship schema from a relational database through reverse engineering. In Proceedings of the 13th international conference on entity relationship approach held in Manchester, UK, December, 1994, edited by P. Loucopoulos, 403–19, Springer.

ARCUS. 1996. About ARCUS. World Wide Web Homepage. http://www.sdm.de/g/arcus/.

Bertino, E., and L. Martino. 1991. Object-oriented database management systems: Concepts and issues. IEEE Computer, 24 (April): 33–47.

Bitton, D., J. Millman, and S. Torgersen. 1989. A feasibility and performance study of dependency inference. In Proceedings of the 5th international conference on data engineering held in Los Angeles, CA, February, 1989, IEEE Computer Society, 635–41.

Blaha, M. R., W. J. Premerlani, and J. E. Rumbaugh. 1988. Relational database design using an object-oriented methodology. Communications of the ACM 31 (April): 414–27.

Catell, R. 1991. What are next generation database systems? Communications of the ACM 34 (October): 31–3.

Cattell, R. 1994. Object data management: Object-oriented and extended relational database systems. Menlo Park, California: Addison-Wesley.

Chiang, R. H., T. M. Barron, and V. C. Storey. 1994. Reverse engineering of relational databases: Extraction of an EER model from a relational database. Data & Knowledge Engineering (12): 107–42.

Chiang, R. H. L., T. M. Barron, and V. C. Storey. 1993. Performance evaluation of reverse engineering relational databases into extended entity-relationship models. In Proceedings of the 12th international conference on entity relationship approach held in Texas, USA, December, 1993, edited by R. A. Elmasri, V. Kouramajian, and B. Thalheim, 352–63, Springer.

Date, C. J. 1986. An introduction to database systems (Fourth ed.), Volume 1. Reading, Massachusetts: Addison-Wesley Publishing Company.

Dick, K., and A. Chaudhri. 1995. Tutorial: Selecting an ODBMS – Practical advice for evaluators. ACM Press. OOPSLA '95.

Elmasri, R., and S. Navathe. 1992. Fundamentals of database systems. Redwood City, California: The Benjamin/Cummings Publishing Company, Inc.

Fahrner, C., and G. Vossen. 1995a. A survey of database design transformations based on the entity-relationship model. Data & Knowledge Engineering (15): 213–50.

Fahrner, C., and G. Vossen. 1995b. Transforming relational database schemas into object oriented schemas according to ODMG-93. In Proceedings of the 4th international conference on deductive and object-oriented databases held in Singapore, 1995, by the National University of Singapore. To appear.

Fong, J. 1995. Mapping extended entity relationship model to object modelling technique. SIGMOD Record 24 (September): 18–22.

Hainaut, J., C. Tonneau, M. Joris, and M. Chandelon. 1993. Transformation-based database reverse engineering. In Proceedings of the 12th international conference on entity relationship approach held in Texas, USA, December, 1993, edited by R. Elmasri, V. Kouramajian, and B. Thalheim, 364–75, Springer.

Herzig, R., and M. Gogolla. 1992. Transforming conceptual data models into an object model. In Proceedings of the 11th international conference on entity relationship approach held in Karlsruhe, Germany, October, 1992, edited by G. Pernul and A. Tjoa, 280–98, Springer.

Hodges, J. E., S. Ramanathan, and S. Bridges. 1995. A prototype object-oriented geophysical database system developed by re-engineering a relational database. Technical Report 950612, Department of Computer Science, Mississippi State University.

Jacobson, I., and F. Lindstrom. 1991. Re-engineering of old systems to an object-oriented architecture. In Proceedings of the sixth international conference on object-oriented programming systems, languages, and applications (OOPSLA) held in Phoenix, Arizona, 1991, ACM Press, 340–50.

Jeusfeld, M. A., and U. A. Johnen. 1994. An executable meta model for re-engineering of database schemas. Technical Report 94-19, Technical University of Aachen.

Johannesson, P., and K. Kalman. 1989. A method for translating relational schemas into conceptual schemas. In Proceedings of the eighth international conference on entity relationship approach held in Toronto, Canada, October, 1989, edited by F. H. Lochovsky, 271–85, North-Holland.

Jurkevics, A. 1992. Database design document for the naval environmental operational nowcasting system (Version 3.5 ed.). Monterey, California: Naval Oceanographic and Atmospheric Research Laboratory.

Kappel, G., S. Preishuber, E. Proll, S. Rausch-Schott, W. Retschitzegger, R. Wagner, and C. Gierlinger. 1994. COMan – Coexistence of object-oriented and relational technology. In Proceedings of the 13th international conference on entity relationship approach held in Manchester, UK, December, 1994, edited by P. Loucopoulos, 259–77, Springer.

Keller, A. M. 1994. Penguin: Objects for programs, relations for persistence. World Wide Web Homepage. http://www-db.stanford.edu/pub/keller/.

Keller, A. M., and J. Basu. 1996. A predicate-based caching scheme for client-server database architectures. VLDB Journal 5 (January): 35–47.

Keller, A. M., and C. Hamon. 1993. A C++ binding for Penguin: a system for data sharing among heterogeneous object models. In Proceedings of the 4th international conference on foundations on data organization and algorithms held in Chicago, Illinois, October, 1993, edited by D. B. Lomet, 215–30, Springer.

Keller, A. M., R. Jensen, and S. Agarwal. 1993. Persistence Software: Bridging object-oriented programming and relational databases. In Proceedings of the 1993 ACM SIGMOD international conference on management of data held in Washington, D.C., May, 1993, edited by P. Buneman and S. Jajodia, 523–8, Volume 22, ACM Press.

Keller, W., and J. Coldewey. 1996. Relational database access layers – A pattern language. In Pattern languages of program design 3: Addison-Wesley. To be published.

Kim, W. 1995. Object-relational database technology. White paper, Informix Software Inc.

Loomis, M. E. 1994. Hitting the relational wall. Journal of Object-Oriented Programming 6 (January).

Loomis, M. E. S. 1995. Object databases: The essentials. Reading, Massachusetts: Addison-Wesley Publishing Company.

McKearney, S., D. A. Bell, and R. Hickey. 1992. Inferring abstract objects in a database. In Proceedings of the first international conference on information and knowledge management CIKM'92 held in Baltimore, Maryland, November, 1992, edited by T. W. Finin, C. K. Nicholas, and Y. Yesha, 318–25, Springer.

Narasimhan, B., S. B. Navathe, and S. Jayaraman. 1993. On mapping ER and relational model into OO schemas. In Proceedings of the 12th international conference on entity relationship approach held in Texas, USA, December, 1993, edited by R. A. Elmasri, V. Kouramajian, and B. Thalheim, 402–12, Springer.

Ontos Inc. 1996. Ontos object integration server. World Wide Web Homepage. http://www.ontos.com/.

Parent, C., and S. Spaccapietra. 1992. ERC+: An object-based entity-relationship approach. In Conceptual modelling, databases and CASE: An integrated view of information systems development, edited by P. Loucopoulos and R. Zicari: John Wiley.

Persistence Software Inc. 1996. Persistence software: Technical overview. World Wide Web Homepage. http://www.persistence.com.

Piatetsky-Shapiro, G., and W. Frawley. (Eds.) 1991. Knowledge discovery in databases. California: AAAI Press/The MIT Press.

Premerlani, W. J., and M. R. Blaha. 1994. An approach for reverse engineering of relational databases. Communications of the ACM 37 (May): 42–9.

Premerlani, W. J., M. R. Blaha, J. E. Rumbaugh, and T. A. Varwig. 1990. An object-oriented relational database. Communications of the ACM 33 (November): 99–108.

Ramanathan, C. 1994. Providing object-oriented access to a relational database. In Proceedings of the 32nd ACM annual southeast conference held in Tuscaloosa, Alabama, March, 1994, edited by D. Cordes and S. Vrbsky, 162–5, ACM Press.

Robie, J., and D. Bartels. 1994. A comparison between relational and object-oriented databases for object-oriented application development. White paper, POET Software Corporation.

Rumbaugh, J., M. Blaha, W. Premerlani, F. Eddy, and W. Lorensen. 1991. Object-oriented modelling and design. Englewood Cliffs, New Jersey: Prentice-Hall, Inc.

Shaw, M., and D. Garlan. 1996. Software architecture – Perspectives on an emerging discipline. Upper Saddle River, New Jersey: Prentice Hall.

Simon, R. 1994. Object or relational? A guide for selecting database technology. White paper, Servio Corporation.

Stonebraker, M. 1995. Object-relational DBMS – The next wave. White paper, Informix Software Inc.

Subtle Software Inc. 1996. Subtleware C++ class generator. World Wide Web Homepage. http://world.std.com/~subtle/.

APPENDIX A

EVALUATION EXAMPLES


Chapter V contains a formal evaluation of the mapping algorithms. This appendix contains additional explanation of the evaluation of the algorithms involved in phase 1 of static schema mapping, along with some examples. The sample relations used here are drawn from those used in the explanation of the algorithms in Chapter IV.

Example for Evaluation of Algorithm 1

The algorithm for the elimination of 2NF relations loops through all the transitive dependencies in each relation and replaces them with the corresponding 3NF virtual relations. For example, consider the 2NF relation PROJECT given below.

PROJECT Relation
Proj#  ProjName       ProjType  Supplier#  SupplierName  SupplierPhone
0001   Project One    'H'       2102       Supplier2     555-1212
0002   Project Two    'H'       2102       Supplier2     555-1212
0003   Project Three  'H'       2107       Supplier7     555-3213
0004   Project Four   'H'       3102       Supplier4     555-4431
0005   Project Five   'H'       2102       Supplier2     555-1212

The PROJECT relation also has the following functional dependencies:

Proj# → {ProjName, ProjType, Supplier#, SupplierName, SupplierPhone}
Supplier# → {SupplierName, SupplierPhone}

Algorithm 1 splits that relation into two relations. The new PROJECT virtual relation is as follows:

PROJECT Relation
Proj#  ProjName       ProjType  Supplier#
0001   Project One    'H'       2102
0002   Project Two    'H'       2102
0003   Project Three  'H'       2107
0004   Project Four   'H'       3102
0005   Project Five   'H'       2102

This virtual relation has the following functional dependency:

Proj# → {ProjName, ProjType, Supplier#}

The second virtual relation, PROJECT_SUB1, is as follows:

PROJECT_SUB1 Relation
Supplier#  SupplierName  SupplierPhone
2102       Supplier2     555-1212
2107       Supplier7     555-3213
3102       Supplier4     555-4431

This virtual relation has the functional dependency:

Supplier# → {SupplierName, SupplierPhone}

This example demonstrates the attribute and data completeness of Algorithm 1. The union of the attributes of the newly created PROJECT and PROJECT_SUB1 relations equals the attributes of the original PROJECT relation, and all the attribute values in the original relation can be found in one of the two newly created relations. The consistency of the algorithm is also evident: the functional dependencies of the original relation can be logically derived from the union of the functional dependencies of the two virtual relations.
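The split above can be reproduced mechanically with two duplicate-eliminating projections, one per 3NF attribute group. The short sketch below uses the example data; it is only an illustration (SOAR defines virtual relations over the live database rather than materializing them like this).

```python
# Illustrative data-level reproduction of Algorithm 1's split (a sketch, not
# SOAR's implementation): project the 3NF attribute groups out of the 2NF
# PROJECT relation, eliminating duplicates as relational projection does.

project = [
    ("0001", "Project One", "H", 2102, "Supplier2", "555-1212"),
    ("0002", "Project Two", "H", 2102, "Supplier2", "555-1212"),
    ("0003", "Project Three", "H", 2107, "Supplier7", "555-3213"),
    ("0004", "Project Four", "H", 3102, "Supplier4", "555-4431"),
    ("0005", "Project Five", "H", 2102, "Supplier2", "555-1212"),
]

# PROJECT(Proj#, ProjName, ProjType, Supplier#)
project_virtual = sorted({(r[0], r[1], r[2], r[3]) for r in project})
# PROJECT_SUB1(Supplier#, SupplierName, SupplierPhone)
project_sub1 = sorted({(r[3], r[4], r[5]) for r in project})
```

Note how the five-row source collapses to a three-row PROJECT_SUB1: duplicate elimination in the second projection is exactly what removes the redundancy the 2NF design introduced.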

Example for Evaluation of Algorithms 2 and 3

Following is an example of a widow relation:

PROJECT Relation
Proj#  ProjName       ProjType  Supplier#  Language   LOC
0001   Project One    'H'       2102       NULL       NULL
0002   Project Two    'H'       2102       NULL       NULL
0003   Project Three  'H'       2107       NULL       NULL
0004   Project Four   'H'       3102       NULL       NULL
0005   Project Five   'H'       2102       NULL       NULL
0006   Project Six    'S'       NULL       C++        12,000
0007   Project Seven  'S'       NULL       Java       3,000
0008   Project Eight  'S'       NULL       Java       NULL
0009   Project Nine   'S'       NULL       Smalltalk  10,100
0010   Project Ten    'S'       NULL       Lisp       8,000

The functional dependency for this relation is:

Proj# → {ProjName, ProjType, Supplier#, Language, LOC}

Algorithm 2 splits the above relation into three new virtual relations, as follows.

PROJECT Relation
Proj#  ProjName       ProjType
0001   Project One    'H'
0002   Project Two    'H'
0003   Project Three  'H'
0004   Project Four   'H'
0005   Project Five   'H'
0006   Project Six    'S'
0007   Project Seven  'S'
0008   Project Eight  'S'
0009   Project Nine   'S'
0010   Project Ten    'S'

102 PROJ SUB0 Proj# 0001 0002 0003 0004 0005

Supplier# 2102 2102 2107 3102 2102

PROJ SUB1 Proj# 0006 0007 0008 0009 0010

Language C++ Java Java Smalltalk Lisp

LOC 12,000 3,000 NULL 10,100 8,000

These three relations have the following combined functional dependencies:

Proj# → {ProjName, ProjType}
Proj# → {Supplier#}
Proj# → {Language, LOC}

It is clear from this example that both the attributes and the attribute values of the original relation are preserved in the virtual relations created by Algorithm 2. Moreover, the functional dependencies of the original relation can be trivially derived from the combined functional dependencies listed above.

Example for Evaluation of Algorithm 4

The following two relations are examples of orphan relations.

HARDWARE_PROJ
Proj#   ProjName        Supplier#
0001    Project One     2102
0002    Project Two     2102
0003    Project Three   2107
0004    Project Four    3102
0005    Project Five    2102

SOFTWARE_PROJ
Proj#   ProjName        Language    LOC
0006    Project Six     C++         12,000
0007    Project Seven   Java        3,000
0008    Project Eight   Java        NULL
0009    Project Nine    Smalltalk   10,100
0010    Project Ten     Lisp        8,000

The functional dependencies of the above two relations are:

Proj# → {ProjName, ProjType, Supplier#}
Proj# → {ProjName, ProjType, Language, LOC}

Algorithm 4 creates a relation corresponding to the superclass of these relations. The modified virtual relations are as follows:

TMP_SUP0 Relation
Proj#   ProjName        ProjType
0001    Project One     'H'
0002    Project Two     'H'
0003    Project Three   'H'
0004    Project Four    'H'
0005    Project Five    'H'
0006    Project Six     'S'
0007    Project Seven   'S'
0008    Project Eight   'S'
0009    Project Nine    'S'
0010    Project Ten     'S'

HARDWARE_PROJ
Proj#   Supplier#
0001    2102
0002    2102
0003    2107
0004    3102
0005    2102

SOFTWARE_PROJ
Proj#   Language    LOC
0006    C++         12,000
0007    Java        3,000
0008    Java        NULL
0009    Smalltalk   10,100
0010    Lisp        8,000

These three relations have the following combined functional dependencies:

Proj# → {ProjName, ProjType}
Proj# → {Supplier#}
Proj# → {Language, LOC}

Just as in the case of the previous two algorithms, it can be seen by visual inspection that Algorithm 4 preserves the attributes and data of the original two relations after mapping them to the newly created three virtual relations. In addition, the functional dependencies of the virtual relations are trivially equivalent to those of the original relations by concatenation of the right-hand sides, since all of them have the same left-hand side.

Algorithm 5 replaces BLOB attributes with an additional virtual relation if the attribute represents complex data. Since the composition of the complex data is provided by the user, the evaluation simply assumes that the BLOB attribute and the attributes provided by the user are equivalent. Similarly, the BLOB value is assumed to be composed of the values corresponding to the attributes given by the user. Therefore, attribute and data completeness for Algorithm 5 are satisfied implicitly.

Algorithms 6-9 do not change the structure of the adjusted relational schema and hence do not change the consistency of the attributes. Therefore, all the transformation operations in the static schema mapping preserve the consistency of the original relational schema.
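The consistency argument, that dependencies sharing a common key can be combined by concatenating their right-hand sides, can be sketched in a few lines of C++. This is an illustrative check rather than code from SOAR, and the FD representation assumes a single-attribute left-hand side, which holds for every example in this appendix:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// A functional dependency X -> Y with a single-attribute left-hand side,
// e.g. Proj# -> {ProjName, ProjType}.
using FD = std::pair<std::string, std::set<std::string>>;

// Union the right-hand sides of all dependencies that share a left-hand
// side; with a common key this is exactly the "concatenation of the
// right-hand sides" used in the text.
std::map<std::string, std::set<std::string>>
combine(const std::vector<FD>& fds) {
    std::map<std::string, std::set<std::string>> merged;
    for (const FD& fd : fds)
        merged[fd.first].insert(fd.second.begin(), fd.second.end());
    return merged;
}
```

Applying combine to the three virtual relations' dependencies recovers the single dependency of the original widow relation, which is the equivalence the evaluation relies on.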

APPENDIX B

FORMAT OF SchemaBase

Following is the format of the file that is created by SOAR for storing the schema mapping information. It is given in the form of a BNF grammar. Following the grammar is a table that contains a description of all the terminal symbols in the grammar.

::=



::= MANY_CLASSES_TAG classCount



::= |



::= MANY_ASSOCS_TAG assocCount



::= |



::= SINGLE_ASSOC_TAG assocName



::=



::= joinAttrsCount



::= attrBaseName | attrBaseName



::= "one" | "zero_one" | "many" | "undefined"



::= |



::= RELATIONSHIPS_TAG className



::= SUBCLASSES_TAG subClassCount



::= |



::=



::= AGGR_TAG aggrClassCount


::= |



::=



::= SINGLE_CLASS_TAG className



::= MANY_ATTRS_TAG attrCount



::= |



::= SINGLE_ATTR_TAG attrName isHidden



::=



::= baseTableName | baseTableName



::=



::=



::= baseColumnName



::= value | NULL_TAG


Table B.1. Description of Terminal Symbols

Terminal Symbol   Description
databaseName      Name of the relational database that is the source of the mapping
classCount        The number of classes in the object schema
assocCount        The number of binary associations in the object schema
assocName         The name of the association
classOneName      The name of the first class in the binary association
joinAttrOne       The name of the join attribute of classOne for this association
classTwoName      The name of the second class in the binary association
joinAttrTwo       The name of the join attribute of classTwo for this association
className         The name of an object class
subClassCount     The number of subclasses of the class corresponding to className
subClassName      The name of a subclass of the class corresponding to className
aggrClassCount    The number of component classes of the class corresponding to className
aggrClassName     The name of a component class of the class corresponding to className
attrCount         The number of attributes of the class corresponding to className
attrName          The name of an attribute of the class corresponding to className
isHidden          1 if the user has hidden the attribute and 0 if not


Table B.1 (continued)

Terminal Symbol     Description
baseTableName       The name of a table from the relational schema that acts as the source of values for some/all attributes of the object
baseColumnName      The name of a column in the relational schema that corresponds to the attribute in the object schema whose name is attrName
value               Some numeric or string value depending upon the data type of the column corresponding to baseColumnName
NULL_TAG            The string "# NULL #"
AGGR_TAG            The string "# AggrClasses #"
SINGLE_ASSOC_TAG    The string "# Assoc #"
MANY_ASSOCS_TAG     The string "# Assocs #"
SINGLE_ATTR_TAG     The string "# Attr #"
MANY_ATTRS_TAG      The string "# Attrs #"
SINGLE_CLASS_TAG    The string "# Class #"
MANY_CLASSES_TAG    The string "# Classes #"
EMPRESS_TYPE_TAG    The string "# EmpressType #"
RELATIONSHIPS_TAG   The string "# Relationships #"
SUBCLASSES_TAG      The string "# SubClasses #"
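The grammar above is dominated by one pattern: a tag, a count terminal such as classCount or attrCount, and then that many entries. A reader for one such count-prefixed list, positioned just after the tag, might look like the following C++ sketch (the function name and calling convention are illustrative assumptions, not taken from the SOAR sources):

```cpp
#include <cassert>
#include <istream>
#include <sstream>  // std::istringstream is handy for exercising the reader
#include <string>
#include <vector>

// Reads one count-prefixed list from a SchemaBase-style stream: an integer
// count (e.g. the classCount that follows the "# Classes #" tag) followed
// by that many entries. The caller supplies the entry parser, since each
// list in the grammar has its own entry production.
template <typename Entry, typename ParseEntry>
std::vector<Entry> readCountedList(std::istream& in, ParseEntry parse) {
    int count = 0;
    in >> count;
    std::vector<Entry> entries;
    for (int i = 0; i < count && in; ++i)
        entries.push_back(parse(in));
    return entries;
}
```

Because the count precedes the entries, the reader never needs lookahead; each nested list (associations, attributes, subclasses, component classes) can reuse the same helper with a different entry parser.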

APPENDIX C

SOURCE CODE GENERATION


The DORM component of the SOAR system uses a combination of source code generation and a class library in order to provide run-time mapping between relational data and objects for object-oriented applications written in C++. This appendix discusses the techniques that were adopted for generating the source files.

DORM uses a tag-based approach to source file generation. In a tag-based approach, the general template of a source file is described by a text file. That file includes the text that should appear in all the source files and also some tags to enable customization of the source file for a particular object class. The customization tags pertain to, for example, the class name, the attributes of the class, and so on. The tag interpreter uses the information from the SchemaBase generated by the schema mapping procedure to replace the tags with appropriate values and programming statements. The original source file template contains general coding strategies for computing join operations, invoking the database access routines on behalf of the application, and so on. The tags help the tag interpreter generate code for different classes using the same programming logic. For example, all occurrences of the tag #__class_name__# in the template file are replaced with the name of a class read from SchemaBase. Thus, the template file can be changed without the need to recompile the code generation modules of the system. The source files thus generated save the application programmer the burden of writing data mapping code for the classes in the object schema. The general processing steps are shown in Figure C.1.

There are two file templates for generating data mapping code. The first one corresponds to the generation of header (.h) files for the classes. This template includes code prototypes for the definition of a C++ class. The template file for the generation of header files is given below.

Figure C.1. Processing Steps for Data Mapping. (The figure shows the File Templates and SchemaBase feeding the Tag Interpreter, which produces the Generated Source Code; that code, the Application Class Library, and the DORM Class Library are fed to the Compiler to produce the Executable.)

/////////////////////////////////////////
// Generated by SOAR on #__curr_time__#
/////////////////////////////////////////

#ifndef #__class_name__#_aux_H
#define #__class_name__#_aux_H

#include
#include
#include
#include

#include "db_structs.h"

#__for_each_super_class_begin__#
#include "#__curr_super_class__#.h"
#__for_each_super_class_end__#

// Forward declarations
class SDatabase;

#__for_each_assoc_begin__#
class #__other_assoc_class__#;
#__for_each_assoc_end__#

#__for_each_aggr_class_begin__#
class #__curr_aggr_class__#;
#__for_each_aggr_class_end__#

class #__class_name__#
#__case_super_class_exists_begin__#
    : public
    #__for_each_super_class_begin__#
    #__curr_super_class__#
    #__for_each_super_class_end__#
#__case_super_class_exists_end__#
{
public:

    #__class_name__#();
    virtual ~#__class_name__#();

    static void load( SDatabase& database,
                      const RWTValOrderedVector< SQualElement >& userQual,
                      RWTPtrOrderedVector< #__class_name__# >& result );

    // The body for the following method should be supplied by
    // the application programmer.
    int operator == ( const #__class_name__#& other ) const;

    friend ostream& operator << ( ostream& stream,
                                  const #__class_name__#& obj );

    #__for_each_assoc_begin__#
    #__case_many_other_begin__#
    RWTPtrOrderedVector< #__other_assoc_class__# >
        obj_#__other_assoc_class__#@#__class_name__#;
    #__case_many_other_end__#
    #__case_one_other_begin__#
    #__other_assoc_class__#* obj_#__class_name__#@#__other_assoc_class__#;
    #__case_one_other_end__#
    #__for_each_assoc_end__#

    #__for_each_aggr_class_begin__#
    RWTPtrOrderedVector< #__curr_aggr_class__# > aggr_#__curr_aggr_class__#;
    #__for_each_aggr_class_end__#

    // The following are additional data members declared by the
    // application programmer. They should be written in a separate
    // file since this file will be overwritten each time and hence
    // user changes will be lost. This application-specific file
    // should have the name _app.h and it is 'include'd in
    // the class declaration below.
    #include "#__class_name__#_app.h"
};

inline #__class_name__# :: #__class_name__#()
{
}

inline #__class_name__# :: ~#__class_name__#()
{
}

#endif

All tags begin with the string "#__" and end with the string "__#". The tag interpreter writes out all non-tag characters present in the template file without any translation. The information pertaining to the tags comes from the SchemaBase mapping information. A header file for one example class is given below. The attributes and schema mapping for this class were discussed in Chapter VI.

//////////////////////////////////////////
// Generated by SOAR on 01/16/97 04:56:56
//////////////////////////////////////////

#ifndef employee_aux_H
#define employee_aux_H

#include
#include
#include
#include

#include "db_structs.h"

// Forward declarations
class SDatabase;
class department;
class department;

class employee
{
public:

    employee();
    virtual ~employee();

    static void load( SDatabase& database,
                      const RWTValOrderedVector< SQualElement >& userQual,
                      RWTPtrOrderedVector< employee >& result );

    // The body for the following method should be supplied by
    // the application programmer.
    int operator == ( const employee& other ) const;

    friend ostream& operator << ( ostream& stream,
                                  const employee& obj );

    RWTPtrOrderedVector< department > obj_department_employee;
    department* obj_employee_department;

    // The following are additional data members declared by the
    // application programmer. They should be written in a separate
    // file since this file will be overwritten each time and hence
    // user changes will be lost. This application-specific file
    // should have the name _app.h and it is 'include'd in
    // the class declaration below.
    #include "employee_app.h"
};

inline employee :: employee()
{
}

inline employee :: ~employee()
{
}

#endif

While the first template generates header files, the second file template exists for generating compilable source code for instantiating objects from the database. That file includes additional statements for communicating with the underlying database, retrieving the attribute values, and instantiating objects using those values. Just as in the case of the generation of header files, the pertinent information for inserting programming statements based on tags comes from the schema mapping file SchemaBase.
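The core of a tag interpreter such as the one described above is a substitution pass over the template text. The following sketch is our own illustration, not the DORM implementation: it expands simple value tags of the #__tag__# form from a table of values, while loop and case tags such as #__for_each_assoc_begin__# would require additional bookkeeping that is omitted here:

```cpp
#include <cassert>
#include <map>
#include <string>

// Expands customization tags of the form #__tag_name__# in a template
// string, using a table of tag values (in SOAR this information would
// come from SchemaBase). Unknown tags are copied through unchanged, and
// all non-tag characters are written out without translation.
std::string expandTags(const std::string& tmpl,
                       const std::map<std::string, std::string>& values) {
    std::string out;
    std::string::size_type pos = 0;
    while (pos < tmpl.size()) {
        std::string::size_type open = tmpl.find("#__", pos);
        if (open == std::string::npos) { out += tmpl.substr(pos); break; }
        std::string::size_type close = tmpl.find("__#", open + 3);
        if (close == std::string::npos) { out += tmpl.substr(pos); break; }
        out += tmpl.substr(pos, open - pos);          // literal text
        std::string tag = tmpl.substr(open + 3, close - (open + 3));
        auto it = values.find(tag);
        out += (it != values.end())
                   ? it->second                        // known tag: substitute
                   : tmpl.substr(open, close + 3 - open);  // unknown: copy
        pos = close + 3;
    }
    return out;
}
```

Because the template is ordinary text, changing the generated code's shape only requires editing the template file; the interpreter itself, like the code-generation modules described above, never needs to be recompiled.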